Last edited on 1998-07-15 23:48:56 by stolfi

Where are the bits?
Local entropy distribution of various languages

Abstract

This site contains samples texts in various languages, whose letters have been colorized to show whick of them are ``responsible'' for the kth-order entropy. The samples include one page each from the Herbal-A and Biological section.

The general idea
Probability estimation
The samples:
The Unix recipe and associated files I used.

The general idea

In the pages listed below, each letter is painted with a color whose brightness and hue increase in proportion to the letter's ``unexpectedness'' in view of its immediate context.

More precisely, for the ith character x_i of the text, let L_i be the string of l characters immediately to its left x_i-lx_i-l+1ЗЗЗx_i-1, and R_i the r characters to its right, x_i+1x_i+2ЗЗЗx_i+r. We define the local information contents of that letter occurrence as

v_i = - log₂ Prob(L_ix_iR_i \| L_iR_i*) = -log₂	Prob(L_ix_iR_i)
	-----------------
	SUM_c Prob(L_icR_i)

(1)

where Prob(w) is the probability of pattern w occurring in the text, * means ``any character'', and the sum is to be taken over all letters of the alphabet.

In other words, v_i is the information (in bits) that is provided by the letter occurrence x_i, if we know the previous l and the next r characters, and the frequency distribution of all substrings of length n = l+1+r. In particular, if r = 0, the average of the v_i is the familiar nth-order conditional entropy of the text, namely the average number of bits of information provided by each character, if we know the n-1 preceding ones.

Letters that are highly predictable from their surroundings (such as an u after a q in English, or h between t and or e) will have low (but non-negative) v_i, whereas letters in unusual contexts (such as a w after th) will have high (in fact, unbounded) v_i.

It can be seen in the colorized pages that the entropy of a character x for the contex (l,r) = (1,1) is lower than that for (2,0). The intuitive explanation is that the two letters of the (1,1) context are farther apart from each other, and closer to the letter x. So, in typical languages (where the correlation beyween symbols decreases with their separation), the (1,1) context carries more information than the (2,0) context, and imposes stronger constraints on the third letter.

Probability estimation

To estimate the relative frequency of the third letter in a given context, we use a Bayesian model where ``a priori'' all histograms on the M letters of the input alphabet are equally likely. In this model, if a context w occurred N_w times, and a particular letter x occurred N_x times in that context, the estimated probability of x in that context is (N_x + 1)/(N_w + M).

In our case, the input alphabet was always a subset of all lower-case ISO-Latin-1 plain letters, accented letters, and ligatures; plus hyphen and apostrophe, the underscore "_" (representing a word break), and the decimal digits. Thus M was at most 26 + 32 + 2 + 1 + 10 = 71.

Because of this correction, the nth-order entropy computed for a fixed-size sample of natural-language text will eventually start increasing with n, and tend to log₂(M) (which is less than 7.39) as n goes to infinity. (In contrast, if we had used the simple estimate N_x/N_w, the conditional entropy would fall zero as M^n-1 became comparable to the sample size.

The samples

In the sample pages below, each letter x_i is painted with a color whose brightness increases monotonically with its local information content v_i. The color key is given at the bottom of each page.

For each sample, the n-tuple probabilities are estimated from a [ full ] text with about 20,000-60,000 characters, and then used to colorize one [ page ] of that text. (Colorizing the whole text would be a bit expensive, since it takes 27 bytes of HTML formatting to change the color!). The [ tuple ] button in each entry shows the list of n-tuples in the full text and the corresponding v_i.

Note that the ``full'' texts are still relatively short, so we should not try to draw too many conclusions from experiments with high values of n = l+1+r. The comments below each entry are largely based on the l=2, r=0 case (the classical h_3), which seems to be the most informative of the four combinations that I have prepared.

All texts were mapped to lowercase to reduce non-linguistic noise and make them easier to compare to the VMs texts. For statistical purposes, strings of one or more consecutive blanks, punctuation characters, and line breaks were treated as a single ``word break'' character (printed as a punctuation or underscore). Paragraph breaks, however, were treated as n-1 separate consecutive word breaks. '#'-comments and line locators were also ignored. (All ignored text is shown in dark gray.)

Where are the bits? Local entropy distribution of various languages

Abstract

Contents

The general idea

Probability estimation

The samples

Where are the bits?
Local entropy distribution of various languages