Last edited on 1998-07-15 23:48:56 by stolfi
This site contains sample texts in various languages, whose letters have been colorized to show which of them are ``responsible'' for the k-th order entropy. The samples include one page each from the Herbal-A and Biological sections.
In the pages listed below, each letter is painted with a color whose brightness and hue increase in proportion to the letter's ``unexpectedness'' in view of its immediate context.
More precisely, for the i-th character x_{i} of the text,
let L_{i} be the string of l characters immediately to its left,
x_{i-l}x_{i-l+1}···x_{i-1}, and
R_{i} the r characters to its right,
x_{i+1}x_{i+2}···x_{i+r}.
We define the local information content of that letter occurrence as

v_{i} = -log_{2} Pr(x_{i} | L_{i}, R_{i})     (1)
In other words, v_{i} is the information (in bits) that is provided by the letter occurrence x_{i}, if we know the previous l and the next r characters, and the frequency distribution of all substrings of length n = l+1+r. In particular, if r = 0, the average of the v_{i} is the familiar n-th order conditional entropy of the text, namely the average number of bits of information provided by each character, if we know the n-1 preceding ones.
Letters that are highly predictable from their surroundings (such as a u after a q in English, or an h between t and e) will have low (but non-negative) v_{i}, whereas letters in unusual contexts (such as a w after th) will have high (in fact, unbounded) v_{i}.
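As a rough illustration of definition (1), the following Python sketch computes v_{i} for every letter of a string that has a full (l,r) context. It is not the program used to build these pages: the function name `local_information` and the toy sample are hypothetical, and the probability is taken here as the plain ratio of tuple count to context count, without any correction for small samples.

```python
import math
from collections import Counter

def local_information(text, l, r):
    """Return (index, letter, v_i) for each position that has l letters
    to its left and r letters to its right (assumed boundary handling)."""
    tuple_counts = Counter()    # occurrences of (left, letter, right)
    context_counts = Counter()  # occurrences of (left, right)
    positions = range(l, len(text) - r)
    for i in positions:
        left, right = text[i - l:i], text[i + 1:i + 1 + r]
        tuple_counts[(left, text[i], right)] += 1
        context_counts[(left, right)] += 1
    result = []
    for i in positions:
        left, right = text[i - l:i], text[i + 1:i + 1 + r]
        n_x = tuple_counts[(left, text[i], right)]
        n_w = context_counts[(left, right)]
        # v_i = -log2(n_x / n_w), written so that a certain letter gives 0.0
        result.append((i, text[i], math.log2(n_w / n_x)))
    return result

# Toy sample: every q is followed by u, so those u's carry zero bits.
sample = "the_quick_queen_quit_the_quiet_quest_"
for i, letter, v in local_information(sample, l=1, r=0):
    print(i, letter, round(v, 2))
```

With r = 0, averaging the third column gives the conditional entropy of the sample under this simple estimate.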
It can be seen in the colorized pages that the entropy of a character x for the context (l,r) = (1,1) is lower than that for (2,0). The intuitive explanation is that the two letters of the (1,1) context are farther apart from each other, and closer to the letter x. So, in typical languages (where the correlation between symbols decreases with their separation), the (1,1) context carries more information than the (2,0) context, and imposes stronger constraints on the third letter.
To estimate the relative frequency of the third letter in a given context, we use a Bayesian model where ``a priori'' all histograms on the M letters of the input alphabet are equally likely. In this model, if a context w occurred N_{w} times, and a particular letter x occurred N_{x} times in that context, the estimated probability of x in that context is (N_{x} + 1)/(N_{w} + M).
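The estimate above is easy to check by hand. The sketch below evaluates (N_{x} + 1)/(N_{w} + M) with exact fractions; the function name `estimated_probability` is hypothetical, and M = 71 is simply the upper bound on the alphabet size for these samples.

```python
from fractions import Fraction

def estimated_probability(n_x, n_w, m):
    """Estimated probability of a letter seen n_x times in a context
    seen n_w times, for an alphabet of m letters."""
    return Fraction(n_x + 1, n_w + m)

M = 71  # assumed alphabet size (the maximum for these samples)

print(estimated_probability(0, 0, M))    # 1/71: unseen context, uniform guess
print(estimated_probability(9, 10, M))   # 10/81: seen 9 times out of 10

# The estimates over all M letters of a context always add up to 1.
counts = [9, 1] + [0] * (M - 2)
print(sum(estimated_probability(c, 10, M) for c in counts))  # 1
```

Note how strongly the estimate is pulled toward the uniform value 1/M when the context is rare: a letter seen in 9 of 10 occurrences still gets a probability of only about 0.12.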
In our case, the input alphabet was always a subset of all lowercase ISO Latin-1 plain letters, accented letters, and ligatures; plus hyphen and apostrophe, the underscore "_" (representing a word break), and the decimal digits. Thus M was at most 26 + 32 + 2 + 1 + 10 = 71.
Because of this correction, the n-th order entropy computed for a fixed-size sample of natural-language text will eventually start increasing with n, and tend to log_{2}(M) (which is at most log_{2}(71) ≈ 6.15) as n goes to infinity. (In contrast, if we had used the simple estimate N_{x}/N_{w}, the conditional entropy would fall to zero as M^{n-1} became comparable to the sample size.)
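The contrast between the two estimates can be seen even on a very short string. The sketch below is only illustrative: the function `average_information` and the sample are hypothetical, and the entropy is taken as the average of v_{i} over the letter occurrences of the sample itself (with r = 0), which is an assumption about how the average is evaluated.

```python
import math
from collections import Counter

def average_information(text, n, m, corrected):
    """Average v_i over the text for l = n-1, r = 0, using either the
    corrected estimate (N_x + 1)/(N_w + M) or the simple ratio N_x/N_w."""
    l = n - 1
    positions = range(l, len(text))
    n_x = Counter(text[i - l:i + 1] for i in positions)  # n-tuple counts
    n_w = Counter(text[i - l:i] for i in positions)      # context counts
    total = 0.0
    for i in positions:
        x, w = n_x[text[i - l:i + 1]], n_w[text[i - l:i]]
        p = (x + 1) / (w + m) if corrected else x / w
        total += math.log2(1 / p)
    return total / len(positions)

sample = "the_quick_brown_fox_jumps_over_the_lazy_dog_"
for n in (1, 2, 4, 8):
    simple = average_information(sample, n, 71, corrected=False)
    fixed = average_information(sample, n, 71, corrected=True)
    print(n, round(simple, 2), round(fixed, 2))
```

Once n is large enough that every context in the sample is unique, the simple estimate reports zero bits per letter, whereas the corrected estimate stays several bits above zero.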
In the sample pages below, each letter x_{i} is painted with a color whose brightness increases monotonically with its local information content v_{i}. The color key is given at the bottom of each page.
For each sample, the n-tuple probabilities are estimated from a [ full ] text with about 20,000-60,000 characters, and then used to colorize one [ page ] of that text. (Colorizing the whole text would be a bit expensive, since it takes 27 bytes of HTML formatting to change the color!). The [ tuple ] button in each entry shows the list of n-tuples in the full text and the corresponding v_{i}.
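To give an idea of the cost of colorizing, here is a minimal Python sketch that wraps each letter in a color tag. Everything in it is an assumption: the `<font color=#rrggbb>` markup is merely one form that costs exactly 27 bytes per color change, and the color ramp in `color_for` is a made-up one, not the color key used in the sample pages.

```python
import html

def color_for(v, v_max=8.0):
    """Hypothetical ramp: dark blue at v = 0, bright yellow at v >= v_max."""
    t = max(0.0, min(1.0, v / v_max))
    red = green = round(255 * t)
    blue = round(128 * (1 - t))
    return f"#{red:02x}{green:02x}{blue:02x}"

def colorize(letters_with_v):
    """Turn a sequence of (letter, v_i) pairs into colored HTML."""
    pieces = []
    for letter, v in letters_with_v:
        # 20 bytes of opening tag plus 7 bytes of closing tag per letter
        pieces.append(f"<font color={color_for(v)}>{html.escape(letter)}</font>")
    return "".join(pieces)

print(colorize([("q", 6.5), ("u", 0.0)]))
```

At this rate a 40,000-character text would need over a megabyte of markup, which is why only one page per sample is colorized.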
Note that the ``full'' texts are still relatively short, so we should not try to draw too many conclusions from experiments with high values of n = l+1+r. The comments below each entry are largely based on the l=2, r=0 case (the classical h_3), which seems to be the most informative of the four combinations that I have prepared.
All texts were mapped to lowercase to reduce non-linguistic noise and make them easier to compare to the VMs texts. For statistical purposes, strings of one or more consecutive blanks, punctuation characters, and line breaks were treated as a single ``word break'' character (printed as a punctuation character or underscore). Paragraph breaks, however, were treated as n-1 separate consecutive word breaks. '#' comments and line locators were also ignored. (All ignored text is shown in dark gray.)
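The preprocessing just described can be sketched in a few lines of Python. This is only an approximation under stated assumptions: the function name `normalize` is hypothetical, a paragraph break is assumed to be a blank line, '#' comments are assumed to occupy whole lines, line locators are not handled at all, and word breaks at the very start and end of the text are dropped.

```python
import re

# Anything that is not a letter, digit, apostrophe or hyphen is a separator.
SEPARATORS = re.compile(r"(?:[^\w'\-]|_)+")

def normalize(text, n):
    """Lowercase the text, drop '#' comment lines, replace each run of
    separators by one '_', and each paragraph break by n-1 '_' characters."""
    kept = [line for line in text.lower().splitlines()
            if not line.lstrip().startswith("#")]
    paragraphs = re.split(r"\n\s*\n", "\n".join(kept).strip())
    cleaned = [SEPARATORS.sub("_", p).strip("_") for p in paragraphs]
    return ("_" * (n - 1)).join(p for p in cleaned if p)

raw = "The quick, brown fox.\n# a comment\nIt's well-known!\n\nSecond paragraph here."
print(normalize(raw, 3))
# the_quick_brown_fox_it's_well-known__second_paragraph_here
```

Padding paragraph breaks to n-1 word breaks means that no n-tuple straddles two paragraphs with letters on both sides.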