Last edited on 1998-07-15 23:48:56 by stolfi

# Where are the bits? Local entropy distribution of various languages

## Abstract

This site contains samples texts in various languages, whose letters have been colorized to show whick of them are ``responsible'' for the kth-order entropy. The samples include one page each from the Herbal-A and Biological section.

## The general idea

In the pages listed below, each letter is painted with a color whose brightness and hue increase in proportion to the letter's ``unexpectedness'' in view of its immediate context.

More precisely, for the ith character xi of the text, let Li be the string of l characters immediately to its left xi-lxi-l+1ÀÀÀxi-1, and Ri the r characters to its right, xi+1xi+2ÀÀÀxi+r. We define the local information contents of that letter occurrence as

 vi   =   - log2 Prob(LixiRi | Li*Ri)   =   -log2 Prob(LixiRi) ----------------- SUMc Prob(LicRi)
(1)

where Prob(w) is the probability of pattern w occurring in the text, * means ``any character'', and the sum is to be taken over all letters of the alphabet.

In other words, vi is the information (in bits) that is provided by the letter occurrence xi, if we know the previous l and the next r characters, and the frequency distribution of all substrings of length n = l+1+r. In particular, if r = 0, the average of the vi is the familiar nth-order conditional entropy of the text, namely the average number of bits of information provided by each character, if we know the n-1 preceding ones.

Letters that are highly predictable from their surroundings (such as an u after a q in English, or h between t and or e) will have low (but non-negative) vi, whereas letters in unusual contexts (such as a w after th) will have high (in fact, unbounded) vi.

It can be seen in the colorized pages that the entropy of a character x for the contex (l,r) = (1,1) is lower than that for (2,0). The intuitive explanation is that the two letters of the (1,1) context are farther apart from each other, and closer to the letter x. So, in typical languages (where the correlation beyween symbols decreases with their separation), the (1,1) context carries more information than the (2,0) context, and imposes stronger constraints on the third letter.

## Probability estimation

To estimate the relative frequency of the third letter in a given context, we use a Bayesian model where ``a priori'' all histograms on the M letters of the input alphabet are equally likely. In this model, if a context w occurred Nw times, and a particular letter x occurred Nx times in that context, the estimated probability of x in that context is (Nx + 1)/(Nw + M).

In our case, the input alphabet was always a subset of all lower-case ISO-Latin-1 plain letters, accented letters, and ligatures; plus hyphen and apostrophe, the underscore "_" (representing a word break), and the decimal digits. Thus M was at most 26 + 32 + 2 + 1 + 10 = 71.

Because of this correction, the nth-order entropy computed for a fixed-size sample of natural-language text will eventually start increasing with n, and tend to log2(M) (which is less than 7.39) as n goes to infinity. (In contrast, if we had used the simple estimate Nx/Nw, the conditional entropy would fall zero as Mn-1 became comparable to the sample size.

## The samples

In the sample pages below, each letter xi is painted with a color whose brightness increases monotonically with its local information content vi. The color key is given at the bottom of each page.

For each sample, the n-tuple probabilities are estimated from a [ full ] text with about 20,000-60,000 characters, and then used to colorize one [ page ] of that text. (Colorizing the whole text would be a bit expensive, since it takes 27 bytes of HTML formatting to change the color!). The [ tuple ] button in each entry shows the list of n-tuples in the full text and the corresponding vi.

Note that the ``full'' texts are still relatively short, so we should not try to draw too many conclusions from experiments with high values of n = l+1+r. The comments below each entry are largely based on the l=2, r=0 case (the classical h_3), which seems to be the most informative of the four combinations that I have prepared.

All texts were mapped to lowercase to reduce non-linguistic noise and make them easier to compare to the VMs texts. For statistical purposes, strings of one or more consecutive blanks, punctuation characters, and line breaks were treated as a single ``word break'' character (printed as a punctuation or underscore). Paragraph breaks, however, were treated as n-1 separate consecutive word breaks. '#'-comments and line locators were also ignored. (All ignored text is shown in dark gray.)