Hacking at the Voynich manuscript - Side notes
026 Entropy-colorized version of the text

Last edited on 1998-07-16 21:55:08 by stolfi

I THE GENERAL IDEA

The idea is to produce a version of the text where the entropy is
localized to each character or character boundary. Let

  Info(C => E) = -log_2 Prob(C => E)
               = information added by event E, given that C was
                 already known.

  Prob(A!B => E!F) = Prob(A!B & E!F) / Prob(A!B)
                   = probability of a pattern E!F occurring around an
                     inter-character position (the `!') given that
                     pattern A!B occurs at that same spot.

  c[i:k] = c[i]c[i+1]...c[i+k-1]
         = the k-character substring of string c[0..n-1] starting
           at c[i].

If i is negative, or i >= n, c[i] is the "filler" letter `<' or `>',
respectively. Thus, for example, Prob( zy!z => y!zyz ) is the
probability of finding the substring `yzyz' in a text sample,
starting at some position i, given that there is a string `zyz'
starting at position i-1.

The kth-order entropy of a piece of text c[0..n-1] is the sum of the
information contributed by each character, given knowledge of the
previous k-1 characters:

  h_k(c) = SUM { Info( c[i-k+1:k-1]! => !c[i:1] ) : i IN Z }     (1)

where it is understood that Info( w! => !">" ) is 0 if w ends with
the `>' filler. Algebraic manipulation gives

  h_k(c) = SUM { Info( c[i-k+1:k-1]! & !c[i:1] ) : i IN Z }
         - SUM { Info( c[i-k+1:k-1] ) : i IN Z }

         = SUM { Info( c[i:k] ) : i IN Z }
         - SUM { Info( c[i:k-1] ) : i IN Z }

We can imagine each term of (1) as being attached to the
corresponding character c[i]. That is, if we define

  h_k(c)[i] = Info( c[i-k+1:k-1]! => !c[i:1] )

then

  h_k(c) = SUM { h_k(c)[i] : i IN Z }

More generally, we can define

  h_{r:1:s}(c)[i] = Info( c[i-r:r]!?c[i+1:s] => !c[i:1] )

where "?" is the pattern that matches any single letter. Then we can
interpret h_{r:1:s}(c)[i] as the information carried specifically by
the character occurrence c[i].
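The per-character quantities above are easy to compute from plain
frequency counts. The sketch below is not the actual colorizing
script, just a minimal illustration of h_k(c)[i] with the `<'/`>'
fillers, using "classical" (frequentistic) probability estimates:

```python
import math
from collections import Counter

def char_info(text, k):
    """Sketch of h_k(c)[i] = -log2 Prob(previous k-1 chars => c[i]),
    for k >= 2, with '<'/'>' fillers beyond the string ends and plain
    frequency estimates of the probabilities."""
    pad = '<' * (k - 1) + text + '>' * (k - 1)
    # Counts of (k-1)-tuples (contexts) and k-tuples in the padded text.
    ctx = Counter(pad[i:i + k - 1] for i in range(len(pad) - k + 2))
    tup = Counter(pad[i:i + k] for i in range(len(pad) - k + 1))
    info = []
    for i in range(len(text)):
        j = i + k - 1                      # position of c[i] in pad
        w, c = pad[j - k + 1:j], pad[j]    # context and character
        info.append(-math.log2(tup[w + c] / ctx[w]))
    return info
```

Summing the list gives h_k(c); a colorizer would then map each value
onto a color scale.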
II GATHERING THE SAMPLES

I gathered a sample "full.txt" of about 20,000-60,000 bytes for each
of several languages, and extracted from it a short sub-sample
"page.txt".

The Voynichese texts were extracted from my version of Landini's
interlinear file, much amended (see Note 019). They were cleaned with
a special script:

  foreach s ( voyn-bio voyn-hea voyn-heb voyn-pha )
    cat $s/full.evt | prettify-voynich-text > $s/full.txt
  end

The "dain daiin latin" sample (latn-dai) was provided by Gabriel
Landini. The word-scrambled Latin (latn-zan) was provided by Rene
Zandbergen.

The Chinese sample with diacritic-coded tones was created from the
one with numeric tones by the following script:

  cat chin-tao/full.txt \
    | pinyin-num-tone-to-accent \
    | sed \
        -e 's/ng/ñ/g' \
        -e 's/zh/ð/g' \
        -e 's/ch/þ/g' \
        -e 's/sh/ç/g' \
    > chin-acc/full.txt

The Reeds-compressed Voynichese samples (vmrc-bio-{w,p}{10,20},
vmrc-{hea,heb}-w10) were created as described in Note 027, then
copied to this directory and slightly edited:

  cp ../027/bio-10.txt    vmrc-bio-w10/full.txt
  cp ../027/bio-20.txt    vmrc-bio-w20/full.txt
  cp ../027/bio-sp-10.txt vmrc-bio-p10/full.txt
  cp ../027/bio-sp-20.txt vmrc-bio-p20/full.txt
  cp ../027/hea-10.txt    vmrc-hea-w10/full.txt
  cp ../027/hea-20.txt    vmrc-hea-w20/full.txt
  cp ../027/heb-10.txt    vmrc-heb-w10/full.txt
  cp ../027/heb-20.txt    vmrc-heb-w20/full.txt

III CREATING THE COLORIZED PAGES

Samples that require mapping to lower case, ignoring digits:

  set lcsamples = ( \
    engl-wow \
    port-ate span-qui ital-mnz fran-mic \
    latn-bel latn-ben latn-ock \
    latn-gen latn-dai latn-zan \
    viet-www chin-tao chin-acc \
    voyn-pha voyn-hea voyn-heb voyn-bio \
  )

  foreach sam ( $lcsamples )
    echo " "; echo ${sam}
    make-colorized-page-new -lc ${sam} 1 0 0.500 8 8.500
    echo " "; make-colorized-page-new -lc ${sam} 2 0 0.500 8 8.500
    echo " "; make-colorized-page-new -lc ${sam} 3 0 0.500 8 8.500
    echo " "; make-colorized-page-new -lc ${sam} 1 1 0.500 8 8.500
  end

Samples that require distinguishing case, and ignoring digits:
  set ucsamples = ( \
    vmrc-bio-w10 vmrc-bio-w20 \
    vmrc-hea-w10 vmrc-hea-w20 \
    vmrc-heb-w10 vmrc-heb-w20 \
    vmrc-bio-p10 vmrc-bio-p20 \
  )

  foreach sam ( $ucsamples )
    echo " "; echo ${sam}
    make-colorized-page-new ${sam} 1 0 0.500 8 8.500
    echo " "; make-colorized-page-new ${sam} 2 0 0.500 8 8.500
    echo " "; make-colorized-page-new ${sam} 3 0 0.500 8 8.500
    echo " "; make-colorized-page-new ${sam} 1 1 0.500 8 8.500
  end

Samples that require mapping to lower case, and handling digits as
letters:

  set dgsamples = ( \
    chin-tao \
  )

  foreach sam ( $dgsamples )
    echo " "; echo ${sam}
    make-colorized-page-new -lc -dg ${sam} 1 0 0.500 8 8.500
    echo " "; make-colorized-page-new -lc -dg ${sam} 2 0 0.500 8 8.500
    echo " "; make-colorized-page-new -lc -dg ${sam} 3 0 0.500 8 8.500
    echo " "; make-colorized-page-new -lc -dg ${sam} 1 1 0.500 8 8.500
  end

All samples:

  set samples = ( ${dgsamples} ${lcsamples} ${ucsamples} )

1998-07-13 stolfi
=================

I created three alternative samples of Don Quijote to test the effect
of sample size on entropy:

  span-qub : Don Quijote, Part III, Chapters XXVI-XXVII
  span-quc : Don Quijote, Part I, Chapters I-IV
  span-qud : Don Quijote, Parts I-III

  foreach sam ( span-{qub,quc,qud} )
    echo " "; echo ${sam}
    make-colorized-page-new -lc ${sam} 1 1 0.500 8 8.500
    make-colorized-page-new -lc ${sam} 2 0 0.500 8 8.500
    make-colorized-page-new -lc ${sam} 3 0 0.500 8 8.500
  end

Modified the script to use the "classical" ("frequentistic") formula
for probability estimation, and re-ran these tests:

  foreach sam ( span-{qui,qub,quc,qud} )
    echo " "; echo ${sam}
    make-colorized-page-test ${sam} 1 0 0.500 8 8.500
    make-colorized-page-test ${sam} 2 0 0.500 8 8.500
    make-colorized-page-test ${sam} 3 0 0.500 8 8.500
    make-colorized-page-test ${sam} 1 1 0.500 8 8.500
  end

Here are the numbers. "Bays" uses the Bayesian estimate with fixed
M=71.
                      Clss  Bays  Clss  Bays  Clss  Bays
  sample      chars   h[2]  h[2]  h[3]  h[3]  h[4]  h[4]
  --------  -------   ----  ----  ----  ----  ----  ----
  span-qui   32,600   3.06  3.12  2.44  2.93  1.76  3.22
  span-qub   49,000   3.09  3.13  2.51  2.89  1.88  3.13
  span-quc   48,000   3.10  3.15  2.51  2.90  1.87  3.14
  span-qud  450,000   3.11  3.11  2.59  2.66  2.08  2.43

Same test for the Voynichese samples:

  foreach sam ( voyn-{pha,hea,heb,bio} )
    echo " "; echo ${sam}
    make-colorized-page-test ${sam} 1 0 0.500 8 8.500
    make-colorized-page-test ${sam} 2 0 0.500 8 8.500
    make-colorized-page-test ${sam} 3 0 0.500 8 8.500
  end

                      Clss  Bays  Clss  Bays  Clss  Bays
  sample      chars   h[2]  h[2]  h[3]  h[3]  h[4]  h[4]
  --------  -------   ----  ----  ----  ----  ----  ----
  voyn-pha   13,200   2.10  2.22  1.77  2.29  1.55  2.77
  voyn-hea   44,500   2.09  2.13  1.83  2.04  1.71  2.35
  voyn-heb   20,000   2.09  2.17  1.76  2.16  1.60  2.60
  voyn-bio   37,500   1.79  1.83  1.52  1.74  1.42  1.99

1998-07-14 stolfi
=================

Converting the Chinese ASCII-pinyin to accented pinyin:

  cat chin-tao/full.txt \
    | egrep -v '^#' \
    | tr 'A-ZÜ' 'a-zü' \
    | tr -c 'a-zü0-9' '\012' \
    | egrep '.' \
    > chin-acc/full.wds

  cat chin-acc/full.wds \
    | sort | uniq \
    | sed \
        -e 's/\([aeiouüyw][aeiouüyw]*\)/-\1-/g' \
        -e 's/^-/@-/' \
        -e 's/-\([0-9]*\)$/-@-\1/' \
        -e 's/-\([nr][g]*\)\([0-9]*\)$/-\1-\2/' \
    | sort \
    > chin-acc/words.fac

  cat chin-acc/words.fac \
    | gawk -v FS='-' '/./{print $1}' \
    | sort | uniq \
    | fmt -w 60

    @ b c ch d f g h j k l m n p q r s sh t x z zh

  cat chin-acc/words.fac \
    | gawk -v FS='-' '/./{print $2}' \
    | sort | uniq \
    | fmt -w 60

    a ai ao e ei i ia iao ie io iu o ou u ua uai ue ui uo wa
    wai we wei wo wu ya yao ye yi yo you yu yua yue

  cat chin-acc/words.fac \
    | gawk -v FS='-' '/./{print $3}' \
    | sort | uniq \
    | fmt -w 60

    @ n ng r

  cat chin-acc/words.fac \
    | gawk -v FS='-' '/./{print $4}' \
    | sort | uniq \
    | fmt -w 60

    0 1 2 3 4

Hm, it seems that this sample does not distinguish between "u" and
"ü" (or indicates the latter by some non-standard convention).
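The exact Bayesian formula behind the "Bays" columns is not spelled
out in this note; assuming a Laplace-style rule P = (n+1)/(N+M) as a
stand-in (n = tuple count, N = context count, M = alphabet size), a
small sketch shows the pattern seen in the tables: smoothing toward a
uniform prior lifts the estimate, and more data washes the prior out,
so the "Bays" numbers drop toward the classical ones as the sample
grows.

```python
import math
from collections import Counter

def cond_entropy(text, k, M=None):
    """Average -log2 P(char | previous k-1 chars) over the sample.
    M=None: classical frequency estimate P = n/N.
    M=int:  assumed Laplace-style Bayesian estimate P = (n+1)/(N+M),
            a stand-in for the note's actual "Bays" formula."""
    n_pos = len(text) - k + 1
    ctx = Counter(text[i:i + k - 1] for i in range(n_pos))
    tup = Counter(text[i:i + k] for i in range(n_pos))
    total = 0.0
    for i in range(n_pos):
        w, t = text[i:i + k - 1], text[i:i + k]
        p = tup[t] / ctx[w] if M is None else (tup[t] + 1) / (ctx[w] + M)
        total += -math.log2(p)
    return total / n_pos
```

With M fixed at 71, a small sample inflates the Bayesian estimate
well above the classical one, matching the divergence of the h[4]
columns for the short span-qui sample versus the long span-qud.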
1998-07-16 stolfi
=================

Fixed the scripts "compute-cond-tuple-info-new" and
"make-colorized-page-new" to treat digits as spaces (except for
Chinese, oops!), and to use the actual alphabet size seen in the
sample, instead of the fixed maximum size (71). Recomputed all pages.

Checking the effect of these changes on tuple entropy
(engl-wow/L3R0.inf):

  (1) alphabet size M fixed at 71;
  (2) M = actual alphabet size, including digits and [-'_];
  (3) M = actual alphabet size, including [-'_] but no digits.

English (War of the Worlds)

  fixed M=71    real M=37     alpha M=29
  ----------    ----------    ----------
  9.198 the-    9.156 the-    9.146 the-
  9.198 thet    9.156 thet    9.146 thet
  8.910 he_j    8.858 he_j    8.845 he_j
  8.910 he_z    8.858 he_z    8.845 he_z
  8.325 he_y    8.273 he_y    8.261 he_y
  8.183 _thu    8.149 _thu    8.141 _thu
  8.147 nd_j    8.058 nd_j    8.036 nd_j
  8.147 nd_v    8.058 nd_v    8.036 nd_v
  8.044 _ant    7.948 _ant    7.925 _ant
  8.044 _anx    7.948 _anx    7.925 _anx
  8.042 ing'    7.945 ing'    7.922 ing'
  8.042 ingd    7.945 ingd    7.922 ingd
  8.033 ed_j    7.937 ed_j    7.913 ed_j
  8.033 ed_q    7.937 ed_q    7.913 ed_q
  8.000 ng_j    7.901 ng_j    7.877 ng_j
  8.000 ng_k    7.901 ng_k    7.877 ng_k
  8.000 ng_q    7.901 ng_q    7.877 ng_q
  8.000 ng_z    7.901 ng_z    7.877 ng_z
  7.925 of_j    7.820 of_j    7.794 of_j
  ....          ....          ....
  0.825 with    0.573 _to_    0.475 rom_
  0.820 had_    0.546 t_th    0.475 ted_
  0.773 ight    0.504 had_    0.440 the_
  0.761 _was    0.490 tion    0.433 _the
  0.739 t_th    0.488 with    0.419 had_
  0.731 _to_    0.485 ight    0.408 ight
  0.591 that    0.450 the_    0.395 tion
  0.577 n_th    0.441 _the    0.395 with
  0.518 f_th    0.396 n_th    0.350 n_th
  0.515 hat_    0.377 that    0.322 that
  0.493 the_    0.315 _and    0.292 _and
  0.476 _the    0.309 f_th    0.255 f_th
  0.474 was_    0.307 hat_    0.253 hat_
  0.411 _and    0.263 was_    0.228 ing_
  0.348 ing_    0.252 ing_    0.209 was_
  0.257 _of_    0.157 and_    0.133 and_
  0.257 and_    0.154 _of_    0.129 _of_

Tabulating entropies etc:

  summarize-statistics ${samples} span-{qub,quc,qud} > stats.txt

  sample         chars   M    h_2    h_3    h_4
  ------------- ------  --  -----  -----  -----
  engl-wow       56147  29  3.299  2.724  2.542  War of the Worlds
  port-ate       30180  40  3.288  3.034  3.206  O Ateneu
  span-qui       32570  32  3.079  2.683  2.638  Don Quijote XLIV-XLV
  span-qub       48955  32  3.103  2.692  2.607  Don Quijote XXVI-XXVII
  span-quc       48156  34  3.117  2.708  2.649  Don Quijote I-IV
  span-qud      451577  34  3.110  2.618  2.269  Don Quijote I-XXVII
  ital-mnz       61078  32  3.191  2.764  2.660  I Promessi Sposi
  fran-mic       37317  39  3.269  2.746  2.796  Micromégas
  latn-bel       56410  24  3.295  2.689  2.418  De Bello Gallico
  latn-ben       47336  26  3.322  2.723  2.485  Benedictine Rule
  latn-ock       49347  25  3.257  2.525  2.222  Ockam's
  latn-gen       71165  24  3.302  2.651  2.334  Vulgate Genesis
  latn-dai       59435  22  2.420  1.909  1.696  Gabriel's dain daiin
  latn-zan       71165  24  2.974  2.388  2.347  Rene's letter sorting
  viet-www       23833  51  2.873  2.589  2.862  Vietnamese
  chin-tao       26737  31  2.063  1.821  1.923  Tao, numerical tones
  chin-acc       13281  49  2.638  2.706  3.039  Tao, diacritic tones
  voyn-pha       13326  20  2.123  1.938  2.064  VMs Pharmaceutical, EVA
  voyn-hea       44630  23  2.105  1.898  1.979  VMs Herbal-A, EVA
  voyn-heb       20029  20  2.106  1.882  2.003  VMs Herbal-B, EVA
  voyn-bio       37732  21  1.799  1.589  1.647  VMs Biological, EVA
  vmrc-bio-w10   26334  31  2.433  2.331  2.564  Jim's compress, word ×10
  vmrc-hea-w10   33175  33  2.708  2.582  2.856  Jim's compress, word ×10
  vmrc-heb-w10   14601  30  2.713  2.662  3.002  Jim's compress, word ×10
  vmrc-bio-w20   22562  41  2.795  2.821  3.182  Jim's compress, word ×20
  vmrc-hea-w20   28644  43  3.051  3.078  3.506  Jim's compress, word ×20
  vmrc-heb-w20   12613  40  3.103  3.231  3.659  Jim's compress, word ×20
  vmrc-bio-p10   24168  32  2.729  2.620  2.936  Jim's compress, para ×10
  vmrc-bio-p20   19171  42  3.357  3.368  3.809  Jim's compress, para ×20
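The summarize-statistics script itself is not shown in this note. A
rough stand-in that produces one table row (chars, actual alphabet
size M, classical h_2..h_4), lower-casing the text and turning digits
into spaces as the fixed scripts do for the non-Chinese samples,
might look like:

```python
import math
import re
from collections import Counter

def summarize(text, ks=(2, 3, 4)):
    """Sketch of one summarize-statistics-style row: sample size,
    alphabet size M actually seen in the sample, and classical
    (frequentistic) conditional entropies h_k."""
    text = re.sub(r'[0-9]', ' ', text.lower())   # digits as spaces
    text = re.sub(r'\s+', '_', text.strip())     # '_' marks word spaces
    row = {'chars': len(text), 'M': len(set(text))}
    for k in ks:
        n_pos = len(text) - k + 1
        ctx = Counter(text[i:i + k - 1] for i in range(n_pos))
        tup = Counter(text[i:i + k] for i in range(n_pos))
        row[f'h_{k}'] = sum(
            -math.log2(tup[text[i:i + k]] / ctx[text[i:i + k - 1]])
            for i in range(n_pos)) / n_pos
    return row
```

This uses the actual alphabet size seen in the sample only for the M
column; unlike the Bayesian variant, the classical h_k values do not
depend on M at all.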