Hacking at the Voynich manuscript - Side notes
026 Entropy-colorized version of the text

Last edited on 1998-07-16 21:55:08 by stolfi

I THE GENERAL IDEA

The idea is to produce a version of the text where the entropy is
localized to each character or character boundary. Let

  Info(C => E) = -log_2 Prob(C => E)
               = information added by event E, given that C was
                 already known.

  Prob(A!B => E!F) = Prob(A!B & E!F) / Prob(A!B)
                   = probability of a pattern E!F occurring around an
                     inter-character position (the `!') given that
                     pattern A!B occurs at that same spot.

  c[i:k] = c[i]c[i+1]...c[i+k-1]
         = the k-character substring of string c[0..n-1] starting
           at c[i].

If i is negative, or i >= n, c[i] is the "filler" letter `<' or `>',
respectively. Thus, for example, Prob( zy!z => y!zyz ) is the
probability of finding the substring `yzyz' in a text sample,
starting at some position i, given that there is a string `zyz'
starting at position i-1.

The kth-order entropy of a piece of text c[0..n-1] is the sum of the
information contributed by each character, given knowledge of the
previous k-1 characters:

  h_k(c) = SUM { Info( c[i-k+1:k-1]! => !c[i:1] ) : i IN Z }     (1)

where it is understood that Info( w! => !">" ) is 0 if w ends with
the `>' filler. Algebraic manipulation gives

  h_k(c) = SUM { Info( c[i-k+1:k-1]! & !c[i:1] ) : i IN Z }
         - SUM { Info( c[i-k+1:k-1] ) : i IN Z }

         = SUM { Info( c[i:k] ) : i IN Z }
         - SUM { Info( c[i:k-1] ) : i IN Z }

We can imagine each term of (1) as being attached to the
corresponding character c[i]. That is, if we define

  h_k(c)[i] = Info( c[i-k+1:k-1]! => !c[i:1] )

then

  h_k(c) = SUM { h_k(c)[i] : i IN Z }

More generally, we can define

  h_{r:1:s}(c)[i] = Info( c[i-r:r]!?c[i+1:s] => !c[i:1] )

where "?" is the pattern that matches any single letter. Then we can
interpret h_{r:1:s}(c)[i] as the information carried specifically by
the character occurrence c[i].
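The per-character quantities above are easy to compute from plain
frequency counts. The sketch below is not the actual colorizing
script, just a minimal illustration of h_k(c)[i] with the `<'/`>'
fillers, using "classical" (frequentistic) probability estimates:

```python
import math
from collections import Counter

def char_info(text, k):
    """Sketch of h_k(c)[i] = -log2 Prob(previous k-1 chars => c[i]),
    for k >= 2, with '<'/'>' fillers beyond the string ends and plain
    frequency estimates of the probabilities."""
    pad = '<' * (k - 1) + text + '>' * (k - 1)
    # Counts of (k-1)-tuples (contexts) and k-tuples in the padded text.
    ctx = Counter(pad[i:i + k - 1] for i in range(len(pad) - k + 2))
    tup = Counter(pad[i:i + k] for i in range(len(pad) - k + 1))
    info = []
    for i in range(len(text)):
        j = i + k - 1                      # position of c[i] in pad
        w, c = pad[j - k + 1:j], pad[j]    # context and character
        info.append(-math.log2(tup[w + c] / ctx[w]))
    return info
```

Summing the list gives h_k(c); a colorizer would then map each value
onto a color scale.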
II GATHERING THE SAMPLES

I gathered a sample "full.txt" of about 20,000-60,000 bytes for each
of several languages, and extracted from it a short sub-sample
"page.txt".

The Voynichese texts were extracted from my version of Landini's
interlinear file, much amended (see Note 019). They were cleaned with
a special script:

  foreach s ( voyn-bio voyn-hea voyn-heb voyn-pha )
    cat $s/full.evt | prettify-voynich-text > $s/full.txt
  end

The "dain daiin latin" sample (latn-dai) was provided by Gabriel
Landini. The word-scrambled Latin (latn-zan) was provided by Rene
Zandbergen.

The Chinese sample with diacritic-coded tones was created from the
one with numeric tones by the following script:

  cat chin-tao/full.txt \
    | pinyin-num-tone-to-accent \
    | sed \
        -e 's/ng/ñ/g' \
        -e 's/zh/ð/g' \
        -e 's/ch/þ/g' \
        -e 's/sh/ç/g' \
    > chin-acc/full.txt

The Reeds-compressed Voynichese samples (vmrc-bio-{w,p}{10,20},
vmrc-{hea,heb}-w10) were created as described in Note 027, then
copied to this directory and slightly edited:

  cp ../027/bio-10.txt    vmrc-bio-w10/full.txt
  cp ../027/bio-20.txt    vmrc-bio-w20/full.txt
  cp ../027/bio-sp-10.txt vmrc-bio-p10/full.txt
  cp ../027/bio-sp-20.txt vmrc-bio-p20/full.txt
  cp ../027/hea-10.txt    vmrc-hea-w10/full.txt
  cp ../027/hea-20.txt    vmrc-hea-w20/full.txt
  cp ../027/heb-10.txt    vmrc-heb-w10/full.txt
  cp ../027/heb-20.txt    vmrc-heb-w20/full.txt

III CREATING THE COLORIZED PAGES

Samples that require mapping to lower case, ignoring digits:

  set lcsamples = ( \
    engl-wow \
    port-ate span-qui ital-mnz fran-mic \
    latn-bel latn-ben latn-ock \
    latn-gen latn-dai latn-zan \
    viet-www chin-tao chin-acc \
    voyn-pha voyn-hea voyn-heb voyn-bio \
  )

  foreach sam ( $lcsamples )
    echo " "; echo ${sam}
    make-colorized-page-new -lc ${sam} 1 0 0.500 8 8.500
    echo " "; make-colorized-page-new -lc ${sam} 2 0 0.500 8 8.500
    echo " "; make-colorized-page-new -lc ${sam} 3 0 0.500 8 8.500
    echo " "; make-colorized-page-new -lc ${sam} 1 1 0.500 8 8.500
  end

Samples that require distinguishing case, and ignoring digits:
  set ucsamples = ( \
    vmrc-bio-w10 vmrc-bio-w20 \
    vmrc-hea-w10 vmrc-hea-w20 \
    vmrc-heb-w10 vmrc-heb-w20 \
    vmrc-bio-p10 vmrc-bio-p20 \
  )

  foreach sam ( $ucsamples )
    echo " "; echo ${sam}
    make-colorized-page-new ${sam} 1 0 0.500 8 8.500
    echo " "; make-colorized-page-new ${sam} 2 0 0.500 8 8.500
    echo " "; make-colorized-page-new ${sam} 3 0 0.500 8 8.500
    echo " "; make-colorized-page-new ${sam} 1 1 0.500 8 8.500
  end

Samples that require mapping to lower case, and handling digits as
letters:

  set dgsamples = ( \
    chin-tao \
  )

  foreach sam ( $dgsamples )
    echo " "; echo ${sam}
    make-colorized-page-new -lc -dg ${sam} 1 0 0.500 8 8.500
    echo " "; make-colorized-page-new -lc -dg ${sam} 2 0 0.500 8 8.500
    echo " "; make-colorized-page-new -lc -dg ${sam} 3 0 0.500 8 8.500
    echo " "; make-colorized-page-new -lc -dg ${sam} 1 1 0.500 8 8.500
  end

All samples:

  set samples = ( ${dgsamples} ${lcsamples} ${ucsamples} )

1998-07-13 stolfi
=================

I created three alternative samples of Don Quijote to test the effect
of sample size on entropy:

  span-qub : Don Quijote, Part III, Chapters XXVI-XXVII
  span-quc : Don Quijote, Part I, Chapters I-IV
  span-qud : Don Quijote, Parts I-III

  foreach sam ( span-{qub,quc,qud} )
    echo " "; echo ${sam}
    make-colorized-page-new -lc ${sam} 1 1 0.500 8 8.500
    make-colorized-page-new -lc ${sam} 2 0 0.500 8 8.500
    make-colorized-page-new -lc ${sam} 3 0 0.500 8 8.500
  end

Modified the script to use the "classical" ("frequentistic") formula
for probability estimation, and re-ran these tests:

  foreach sam ( span-{qui,qub,quc,qud} )
    echo " "; echo ${sam}
    make-colorized-page-test ${sam} 1 0 0.500 8 8.500
    make-colorized-page-test ${sam} 2 0 0.500 8 8.500
    make-colorized-page-test ${sam} 3 0 0.500 8 8.500
    make-colorized-page-test ${sam} 1 1 0.500 8 8.500
  end

Here are the numbers. "Bays" uses the Bayesian estimate with fixed
M=71.
                      Clss  Bays  Clss  Bays  Clss  Bays
  sample      chars   h[2]  h[2]  h[3]  h[3]  h[4]  h[4]
  --------  -------   ----  ----  ----  ----  ----  ----
  span-qui   32,600   3.06  3.12  2.44  2.93  1.76  3.22
  span-qub   49,000   3.09  3.13  2.51  2.89  1.88  3.13
  span-quc   48,000   3.10  3.15  2.51  2.90  1.87  3.14
  span-qud  450,000   3.11  3.11  2.59  2.66  2.08  2.43

Same test for the Voynichese samples:

  foreach sam ( voyn-{pha,hea,heb,bio} )
    echo " "; echo ${sam}
    make-colorized-page-test ${sam} 1 0 0.500 8 8.500
    make-colorized-page-test ${sam} 2 0 0.500 8 8.500
    make-colorized-page-test ${sam} 3 0 0.500 8 8.500
  end

                      Clss  Bays  Clss  Bays  Clss  Bays
  sample      chars   h[2]  h[2]  h[3]  h[3]  h[4]  h[4]
  --------  -------   ----  ----  ----  ----  ----  ----
  voyn-pha   13,200   2.10  2.22  1.77  2.29  1.55  2.77
  voyn-hea   44,500   2.09  2.13  1.83  2.04  1.71  2.35
  voyn-heb   20,000   2.09  2.17  1.76  2.16  1.60  2.60
  voyn-bio   37,500   1.79  1.83  1.52  1.74  1.42  1.99

1998-07-14 stolfi
=================

Converting the Chinese ASCII-pinyin to accented pinyin:

  cat chin-tao/full.txt \
    | egrep -v '^#' \
    | tr 'A-ZÜ' 'a-zü' \
    | tr -c 'a-zü0-9' '\012' \
    | egrep '.' \
    > chin-acc/full.wds

  cat chin-acc/full.wds \
    | sort | uniq \
    | sed \
        -e 's/\([aeiouüyw][aeiouüyw]*\)/-\1-/g' \
        -e 's/^-/@-/' \
        -e 's/-\([0-9]*\)$/-@-\1/' \
        -e 's/-\([nr][g]*\)\([0-9]*\)$/-\1-\2/' \
    | sort \
    > chin-acc/words.fac

  cat chin-acc/words.fac \
    | gawk -v FS='-' '/./{print $1}' \
    | sort | uniq \
    | fmt -w 60

    @ b c ch d f g h j k l m n p q r s sh t x z zh

  cat chin-acc/words.fac \
    | gawk -v FS='-' '/./{print $2}' \
    | sort | uniq \
    | fmt -w 60

    a ai ao e ei i ia iao ie io iu o ou u ua uai ue ui uo wa
    wai we wei wo wu ya yao ye yi yo you yu yua yue

  cat chin-acc/words.fac \
    | gawk -v FS='-' '/./{print $3}' \
    | sort | uniq \
    | fmt -w 60

    @ n ng r

  cat chin-acc/words.fac \
    | gawk -v FS='-' '/./{print $4}' \
    | sort | uniq \
    | fmt -w 60

    0 1 2 3 4

Hm, it seems that this sample does not distinguish between "u" and
"ü" (or indicates the latter by some non-standard convention).
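The exact Bayesian formula behind the "Bays" columns is not spelled
out in this note; assuming a Laplace-style rule P = (n+1)/(N+M) as a
stand-in (n = tuple count, N = context count, M = alphabet size), a
small sketch shows the pattern seen in the tables: smoothing toward a
uniform prior lifts the estimate, and more data washes the prior out,
so the "Bays" numbers drop toward the classical ones as the sample
grows.

```python
import math
from collections import Counter

def cond_entropy(text, k, M=None):
    """Average -log2 P(char | previous k-1 chars) over the sample.
    M=None: classical frequency estimate P = n/N.
    M=int:  assumed Laplace-style Bayesian estimate P = (n+1)/(N+M),
            a stand-in for the note's actual "Bays" formula."""
    n_pos = len(text) - k + 1
    ctx = Counter(text[i:i + k - 1] for i in range(n_pos))
    tup = Counter(text[i:i + k] for i in range(n_pos))
    total = 0.0
    for i in range(n_pos):
        w, t = text[i:i + k - 1], text[i:i + k]
        p = tup[t] / ctx[w] if M is None else (tup[t] + 1) / (ctx[w] + M)
        total += -math.log2(p)
    return total / n_pos
```

With M fixed at 71, a small sample inflates the Bayesian estimate
well above the classical one, matching the divergence of the h[4]
columns for the short span-qui sample versus the long span-qud.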
1998-07-16 stolfi
=================

Fixed the scripts "compute-cond-tuple-info-new" and
"make-colorized-page-new" to treat digits as spaces (except for
Chinese, oops!), and to use the actual alphabet size seen in the
sample, instead of the fixed maximum size (71). Recomputed all pages.

Checking the effect of these changes on tuple entropy
(engl-wow/L3R0.inf):

  (1) alphabet size M fixed at 71;
  (2) M = actual alphabet size, including digits and [-'_];
  (3) M = actual alphabet size, including [-'_] but no digits.

English (War of the Worlds)

  fixed M=71    real M=37     alpha M=29
  ----------    ----------    ----------
  9.198 the-    9.156 the-    9.146 the-
  9.198 thet    9.156 thet    9.146 thet
  8.910 he_j    8.858 he_j    8.845 he_j
  8.910 he_z    8.858 he_z    8.845 he_z
  8.325 he_y    8.273 he_y    8.261 he_y
  8.183 _thu    8.149 _thu    8.141 _thu
  8.147 nd_j    8.058 nd_j    8.036 nd_j
  8.147 nd_v    8.058 nd_v    8.036 nd_v
  8.044 _ant    7.948 _ant    7.925 _ant
  8.044 _anx    7.948 _anx    7.925 _anx
  8.042 ing'    7.945 ing'    7.922 ing'
  8.042 ingd    7.945 ingd    7.922 ingd
  8.033 ed_j    7.937 ed_j    7.913 ed_j
  8.033 ed_q    7.937 ed_q    7.913 ed_q
  8.000 ng_j    7.901 ng_j    7.877 ng_j
  8.000 ng_k    7.901 ng_k    7.877 ng_k
  8.000 ng_q    7.901 ng_q    7.877 ng_q
  8.000 ng_z    7.901 ng_z    7.877 ng_z
  7.925 of_j    7.820 of_j    7.794 of_j
  ....          ....          ....
  0.825 with    0.573 _to_    0.475 rom_
  0.820 had_    0.546 t_th    0.475 ted_
  0.773 ight    0.504 had_    0.440 the_
  0.761 _was    0.490 tion    0.433 _the
  0.739 t_th    0.488 with    0.419 had_
  0.731 _to_    0.485 ight    0.408 ight
  0.591 that    0.450 the_    0.395 tion
  0.577 n_th    0.441 _the    0.395 with
  0.518 f_th    0.396 n_th    0.350 n_th
  0.515 hat_    0.377 that    0.322 that
  0.493 the_    0.315 _and    0.292 _and
  0.476 _the    0.309 f_th    0.255 f_th
  0.474 was_    0.307 hat_    0.253 hat_
  0.411 _and    0.263 was_    0.228 ing_
  0.348 ing_    0.252 ing_    0.209 was_
  0.257 _of_    0.157 and_    0.133 and_
  0.257 and_    0.154 _of_    0.129 _of_

Tabulating entropies etc:

  summarize-statistics ${samples} span-{qub,quc,qud} > stats.txt

  sample         chars   M    h_2    h_3    h_4
  ------------- ------  --  -----  -----  -----
  engl-wow       56147  29  3.299  2.724  2.542  War of the Worlds
  port-ate       30180  40  3.288  3.034  3.206  O Ateneu
  span-qui       32570  32  3.079  2.683  2.638  Don Quijote XLIV-XLV
  span-qub       48955  32  3.103  2.692  2.607  Don Quijote XXVI-XXVII
  span-quc       48156  34  3.117  2.708  2.649  Don Quijote I-IV
  span-qud      451577  34  3.110  2.618  2.269  Don Quijote I-XXVII
  ital-mnz       61078  32  3.191  2.764  2.660  I Promessi Sposi
  fran-mic       37317  39  3.269  2.746  2.796  Micromégas
  latn-bel       56410  24  3.295  2.689  2.418  De Bello Gallico
  latn-ben       47336  26  3.322  2.723  2.485  Benedictine Rule
  latn-ock       49347  25  3.257  2.525  2.222  Ockam's
  latn-gen       71165  24  3.302  2.651  2.334  Vulgate Genesis
  latn-dai       59435  22  2.420  1.909  1.696  Gabriel's dain daiin
  latn-zan       71165  24  2.974  2.388  2.347  Rene's letter sorting
  viet-www       23833  51  2.873  2.589  2.862  Vietnamese
  chin-tao       26737  31  2.063  1.821  1.923  Tao, numerical tones
  chin-acc       13281  49  2.638  2.706  3.039  Tao, diacritic tones
  voyn-pha       13326  20  2.123  1.938  2.064  VMs Pharmaceutical, EVA
  voyn-hea       44630  23  2.105  1.898  1.979  VMs Herbal-A, EVA
  voyn-heb       20029  20  2.106  1.882  2.003  VMs Herbal-B, EVA
  voyn-bio       37732  21  1.799  1.589  1.647  VMs Biological, EVA
  vmrc-bio-w10   26334  31  2.433  2.331  2.564  Jim's compress, word ×10
  vmrc-hea-w10   33175  33  2.708  2.582  2.856  Jim's compress, word ×10
  vmrc-heb-w10   14601  30  2.713  2.662  3.002  Jim's compress, word ×10
  vmrc-bio-w20   22562  41  2.795  2.821  3.182  Jim's compress, word ×20
  vmrc-hea-w20   28644  43  3.051  3.078  3.506  Jim's compress, word ×20
  vmrc-heb-w20   12613  40  3.103  3.231  3.659  Jim's compress, word ×20
  vmrc-bio-p10   24168  32  2.729  2.620  2.936  Jim's compress, para ×10
  vmrc-bio-p20   19171  42  3.357  3.368  3.809  Jim's compress, para ×20
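The summarize-statistics script itself is not shown in this note. A
rough stand-in that produces one table row (chars, actual alphabet
size M, classical h_2..h_4), lower-casing the text and turning digits
into spaces as the fixed scripts do for the non-Chinese samples,
might look like:

```python
import math
import re
from collections import Counter

def summarize(text, ks=(2, 3, 4)):
    """Sketch of one summarize-statistics-style row: sample size,
    alphabet size M actually seen in the sample, and classical
    (frequentistic) conditional entropies h_k."""
    text = re.sub(r'[0-9]', ' ', text.lower())   # digits as spaces
    text = re.sub(r'\s+', '_', text.strip())     # '_' marks word spaces
    row = {'chars': len(text), 'M': len(set(text))}
    for k in ks:
        n_pos = len(text) - k + 1
        ctx = Counter(text[i:i + k - 1] for i in range(n_pos))
        tup = Counter(text[i:i + k] for i in range(n_pos))
        row[f'h_{k}'] = sum(
            -math.log2(tup[text[i:i + k]] / ctx[text[i:i + k - 1]])
            for i in range(n_pos)) / n_pos
    return row
```

This uses the actual alphabet size seen in the sample only for the M
column; unlike the Bayesian variant, the classical h_k values do not
depend on M at all.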