Hacking at the Voynich manuscript - Side notes
004 Effect of position-dependent ciphers on the word distribution

Last edited on 1999-07-28 01:48:10 by stolfi

  [ Used to be Notes/020, renumbered to Notes/004 on 1999-02-01 ]

1998-04-23 stolfi
=================

Let's examine the hypothesis that the VMs is in cipher, by comparing
the Voynichese word frequency distribution with those of 
natural languages, both plain and Vigenère-encoded.

Before reading this note, you must read these Web articles:

  Gabriel Landini
  Zipf's laws in the Voynich Manuscript.
  http://sun1.bham.ac.uk/G.Landini/evmt/zipf.htm

  Rene Zandbergen
  Currier A and B: two different languages?
  http://sun1.bham.ac.uk/G.Landini/evmt/lang.htm

We will compare the word frequencies of the following texts:

  vren-eva  Rene's Voynichese word frequency table,
            expanded by repeating each word according to its count.
            
  vhea-eva  Herbal section, "Language A" subset, Friedman's transcription
  vheb-eva  Herbal section, "Language B" subset, Friedman's transcription
  vbio-eva  Biological section (Language B), Friedman's transcription
  vmix-eva  Mixture of 40% Herbal-A and 60% Biological

  engl-poi  Modern English (a Poirot novel, lowercase)
  engl-wow  Modern English (Wells's War of the Worlds, lowercase)
  latn-ock  Medieval Latin (a text by William Ockam, lowercase)
  latn-bel  Classical Latin (Caesar's De Bello Gallico, lowercase)
  chin-mch  Modern Chinese (a beginner's reader, in pinyin).

  engl-v06  English coded with 6-letter Vigenère
  engl-v43  English coded with 43-letter Vigenère
  engl-vns  ditto, ignoring 1- and 2-letter words

  latn-v06  Latin coded with 6-letter Vigenère
  latn-v43  Latin coded with 43-letter Vigenère

All texts were randomly sampled so as to produce
files of roughly similar size (7500 words), except
when the original material was shorter than that limit.

The resulting ".wds" files have one word per line, in 
the original order.  The details of sample preparation
are shown below.
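
For readers without gawk, here is a minimal Python sketch of the
sampling step, assuming (as the gawk filters below suggest) that each
word is kept independently with a fixed probability.  The function
name and the fixed seed are illustrative, not part of the original
scripts.

```python
import random

def sample_words(words, keep_prob, seed=0):
    """Keep each word independently with probability keep_prob, preserving order."""
    rng = random.Random(seed)  # fixed seed, so that a run can be repeated
    return [w for w in words if rng.random() <= keep_prob]

words = "the intense interest aroused in the public by what was known".split()
sample = sample_words(words, 0.5)
print(len(words), len(sample), sample)
```

The expected size of the sample is keep_prob times the size of the
source, which is how the probabilities below were tuned to give
about 7500 words per file.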

---------------------------------------------------------
English

  cat engl-poi.txt \
    | head -5

    the intense interest aroused in the public by what was known at the
    time as the styles case has now somewhat subsided nevertheless in
    view of the world wide notoriety which attended it i have been asked
    both by my friend poirot and the family themselves to write an account
    of the whole story this we trust will effectually silence the

  cat engl-poi.txt \
    | tr ' ' '\012' \
    | grep '.' \
    | gawk '(rand() <= 0.13){print;}' \
    > engl-poi.wds

  dicio-wc engl-poi.wds

  cat engl-wow.txt \
    | head -5

    No one would have believed in the last years of the
    nineteenth century that this world was being watched keenly
    and closely by intelligences greater than man's and yet as
    mortal as his own; that as men busied themselves about their
    various concerns they were scrutinised and studied, perhaps

  cat engl-wow.txt \
    | tr 'A-Z' 'a-z' \
    | tr -c -d ' a-z\012' \
    | tr ' ' '\012' \
    | grep '.' \
    | gawk '(rand() <= 0.125){print;}' \
    > engl-wow.wds

  dicio-wc engl-wow.wds

  cat engl-poi.wds engl-wow.wds \
    | gawk '(rand() <= 0.51){print;}' \
    > engl-mix.wds

  dicio-wc engl-mix.wds

---------------------------------------------------------
Latin

  cat latn-ock.txt \
    | head -5

    Discipulus: Quoniam ista quinta assertio, via 
    media inter alias quatuor incedendo, cum qualibet illarum in 
    quibusdam concordat et in aliquibus discrepare dignoscitur, ipsam 
    quo ad alias partes eius exquisite discutere etiam alias 
    quodammodo pertractare propono. Ideo de ipsa diffuse aliquantulum 

  cat latn-ock.txt \
    | tr 'A-Z' 'a-z' \
    | tr -c -d ' a-z\012' \
    | tr ' ' '\012' \
    | grep '.' \
    | gawk '(rand() <= 0.31){print;}' \
    > latn-ock.wds
    
  dicio-wc latn-ock.wds

  cat latn-bel.txt \
    | head -5

    Gallia est omnis divisa in partes tres, quarum unam incolunt Belgae,
    aliam Aquitani, tertiam qui ipsorum lingua Celtae, nostra Galli
    appellantur. Hi omnes lingua, institutis, legibus inter se differunt.
    Gallos ab Aquitanis Garumna flumen, a Belgis Matrona et Sequana
    dividit. Horum omnium fortissimi sunt Belgae, propterea quod a cultu

  cat latn-bel.txt \
    | tr 'A-Z' 'a-z' \
    | tr -c -d ' a-z\012' \
    | tr ' ' '\012' \
    | grep '.' \
    | gawk '(rand() <= 0.92){print;}' \
    > latn-bel.wds
    
  dicio-wc latn-bel.wds

  cat latn-ock.wds latn-bel.wds \
    | gawk '(rand() <= 0.51){print;}' \
    > latn-mix.wds

  dicio-wc latn-mix.wds

---------------------------------------------------------
Chinese:

  cat chin-mch.txt \
    | head -5

    lu3 xun4 shi4 jin4 dai4 shi3 shang4 zui4 you3 ying3 xiang3 li4 de wen2
    xue2 jia1 gen1 pi1 ping2 jia1 zhi1 yi1 yi1 ba1 ba1 yi1 nian2 chu1
    sheng1 zai4 zhe4 jiang1 shao4 xing1 yi2 ge xiang1 dang1 fu4 yu4 de
    jia1 ting2 li3 tong2 nian2 de shi2 hou yin1 wei4 zu3 fu4 ru4 yu4 fu4
    qin sheng1 bing4 jia1 ting2 de jing1 ji4 qing2 kuang4 tu1 ran2 bian4
        
  cat chin-mch.txt \
    | tr ' ' '\012' \
    | grep '.' \
    | gawk '(rand() <= 0.99){print;}' \
    > chin-mch.wds

  dicio-wc chin-mch.wds
  
Checking repeated words:

  cat chin-mch.txt \
    | tr ' ' '\012' \
    | grep '.' \
    | gawk 'BEGIN{w = "_";} /./{print w, $0; w = $0;} END{print w, "_";}' \
    | egrep '^(.*) \1$' \
    > .reps
  
  dicio-wc chin-mch.txt .reps

   lines   words file        
  ------ ------- ------------
     265    3777 chin-mch.txt
      16      32 .reps

Ditto, ignoring tone:

  cat chin-mch.txt \
    | tr ' ' '\012' \
    | tr -d '0-9' \
    | grep '.' \
    | gawk 'BEGIN{w = "_";} /./{print w, $0; w = $0;} END{print w, "_";}' \
    | egrep '^(.*) \1$' \
    > .reps
  
  dicio-wc chin-mch.txt .reps

     lines   words     bytes file        
    ------ ------- --------- ------------
       265    3777     18287 chin-mch.txt
        27      54       218 .reps
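
The same check can be sketched in Python.  This is only an
illustration of the idea (pair each word with its successor and keep
the pairs that are equal); the function names are made up, and the
short word list is taken from the text sample shown above.

```python
import re

def adjacent_repeats(words):
    """Return the words that are immediately followed by an identical word."""
    return [a for a, b in zip(words, words[1:]) if a == b]

def strip_tones(words):
    """Drop the pinyin tone digits, e.g. 'yi1' becomes 'yi'."""
    return [re.sub(r"[0-9]", "", w) for w in words]

text = "wen2 xue2 jia1 gen1 pi1 ping2 jia1 zhi1 yi1 yi1 ba1 ba1 yi1 nian2"
words = text.split()
print(adjacent_repeats(words))               # repeats, tones kept
print(adjacent_repeats(strip_tones(words)))  # repeats, tones ignored
```

Ignoring tones can only add repeats, never remove them, which is
consistent with the counts going from 16 to 27 above.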

---------------------------------------------------------
Voynichese:

Let's recreate Rene's word list, by expanding his word histogram
and sampling it:

  cat Rene-words.frq \
    | gawk '(NF>=2){for(i=0;i<$1;i++){print $2;}}' \
    | gawk '(rand() <= 0.26){print;}' \
    > vren-eva.wds
    
  dicio-wc vren-eva.wds
    
Rene wondered whether the low frequency of the most common
word in Voynichese ("daiin" 2.7%, contrasted with "the" 4.6%)
is due to the presence of two different languages. 
Let's test that using the Friedman transcription of Herbal-A,
Herbal-B, and Biological, plus a 40:60 mixture of Herbal-A and
Biological (as Rene himself suggested).

  cat hea-f-eva-gut.wds \
    > vhea-eva.wds
  
  cat heb-f-eva-gut.wds \
    > vheb-eva.wds
  
  cat bio-f-eva-gut.wds \
    > vbio-eva.wds
  
  cat hea-f-eva-gut.wds \
    | gawk '(rand() <= 0.40){print;}' \
    > vmix-eva.wds
  cat bio-f-eva-gut.wds \
    | gawk '(rand() <= 0.60){print;}' \
    >>  vmix-eva.wds
    
  dicio-wc {vhea,vheb,vbio,vmix}-eva.wds
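
Note that the two filters above keep 40% of the Herbal-A words and
60% of the Biological words, so the make-up of the mixture also
depends on the sizes of the two sources.  Here is a small Python
sketch of the expected composition, using the file sizes listed in
the summary table later in this note (the function name is
illustrative).

```python
def expected_mix(size_a, keep_a, size_b, keep_b):
    """Expected word counts, and share of A, when two thinned samples are concatenated."""
    n_a = size_a * keep_a
    n_b = size_b * keep_b
    return n_a, n_b, n_a / (n_a + n_b)

# Sizes of vhea-eva.wds (7812 words) and vbio-eva.wds (6182 words).
n_a, n_b, share_a = expected_mix(7812, 0.40, 6182, 0.60)
print(round(n_a), round(n_b), round(share_a, 3))
```

Under these assumptions the Herbal-A share of vmix-eva comes out
somewhat above 40%, because the Herbal-A source is the larger one.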

Let's also prepare "q"-less versions of all the Voynichese
samples:

  foreach f ( vren vhea vbio vheb vmix )
    cat $f-eva.wds \
    | sed -e 's/^q//g' \
    > ${f}-enq.wds
  end
  
And, just in case, also versions with "k" and "t" identified:

  foreach f ( vren vhea vbio vheb vmix )
    cat $f-enq.wds \
    | sed -e 's/^q//g' -e 's/t/k/g' \
    > ${f}-qkt.wds
  end
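
For reference, the two sed substitutions above amount to the
following, sketched in Python (function names are made up for this
illustration).

```python
def strip_initial_q(word):
    """Remove one leading 'q', like sed 's/^q//'."""
    return word[1:] if word.startswith("q") else word

def merge_t_into_k(word):
    """Replace every 't' by 'k', like sed 's/t/k/g'."""
    return word.replace("t", "k")

for w in ["qokedy", "qotchy", "cthy", "daiin"]:
    enq = strip_initial_q(w)
    print(w, enq, merge_t_into_k(enq))
```

Note that only a word-initial "q" is removed, while "t" is replaced
wherever it occurs, so that "cthy" and "ckhy" become the same word.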
  
Let's count repeated words:

  cat hea-f-eva.txt \
    | tr ' ' '\012' \
    | grep '.' \
    | gawk 'BEGIN{w = "_";} /./{print w, $0; w = $0;} END{print w, "_";}' \
    | egrep '^(.*) \1$' \
    > .reps
  
  dicio-wc hea-f-eva.txt .reps

   lines   words file        
  ------ ------- ------------
    1216    8058 hea-f-eva.txt
      78     156 .reps

  cat bio-f-eva.txt \
    | tr ' ' '\012' \
    | grep '.' \
    | gawk 'BEGIN{w = "_";} /./{print w, $0; w = $0;} END{print w, "_";}' \
    | egrep '^(.*) \1$' \
    > .reps
  
  dicio-wc bio-f-eva.txt .reps

     lines   words file        
    ------ ------- ------------
       715    6281 bio-f-eva.txt
        68     136 .reps

---------------------------------------------------------
Vigenère-encoded texts:

  cat engl-wow.txt \
    | tr 'A-Z' 'a-z' \
    | vigenere -v key=qotchy \
    | head -5

    dc hpl ueief oyls ugsgujxf pl jvx nhqj mxcyq et mjl
    lybxvlcdha elljika afqh mjpq mcknk uqg ugplw ktvjfur dgllbm
    tpk abclgsw rm bpacbzbillssl iycqhxt afqb fcu'q qbw alr qg
    fqyrqz tu ogi cpp; afqh tu tcd pnupct hagtquzogz yrcnv afuwk
    xhpycnu jmdqxtuq jvxa dchs leysjwgkzct ogf zrkrbgk, nufacwq

  cat engl-wow.txt \
    | tr 'A-Z' 'a-z' \
    | vigenere -v key=thequickpuceopossumjumpsoverthelazycapybara \
    | head -5

    gv sdy eqeax jejt pwdcqeyp xf hci ctzx jezpu ou rie
    eiglxuyvvr rypxigm lzuf cbuh ocmpu phw mehli wprdhvd dlidfg
    cxs wnsgtzq ts uwnqadwbievlw rrdyveg riae mtu'w qhl aoi uu
    qcghsd ue qce dob; olrm hw xem zwsxce tyefzibpmu kqiwx hwsaj
    pmacajk qjrtxyrd tgca wtpf strnamdcagn phf whjrawx, bnltphg

  cat latn-ock.txt \
    | tr 'A-Z' 'a-z' \
    | tr -c -d ' a-z\012' \
    | vigenere -v key=qotchy \
    | head -5

    twlepnkznu xsebbct giht sbgdht czqufmkv tyo 
    fgkgq wgvlp qzbcz okomwvp ybvgkcdrh ebk gitnpzuh bnsyhif ku 
    okwuwzbqa vquaefwca cj wg csggibdbq twleycfokg kgwbhujgjik kwqqa 
    jwv yt oekhq fokvlq uwnu lvgibupru rbujsjskg lryof csgqg 
    jwvbqafqkm fskvyyshttl nhciqum yrxq kc ydlc kgvtnul ybwjwhljiewt 

  cat latn-ock.txt \
    | tr 'A-Z' 'a-z' \
    | tr -c -d ' a-z\012' \
    | vigenere -v key=thequickpuceopossumjumpsoverthelazycapybara \
    | head -5

    wpwscxwvjm syccwse cecu cjaboe rlzicthm xip 
    kfdza buxul inspm syoiigj czlyptfrj glf xyllhzgt xjmaiuf pr 
    goqdehxcq qdbuglpjn qi ab vpzjbmmur bksrpfprrx kmwhwumxnwv wegse 
    kgx up pdwvw gtyxps dgws tvruzsbai tcaeeiyti siwse uxrue 
    fmcyedfvhz pdpvrpauaie iysfivq ssyq hs xdks xuozghw ogmhnhreukso 

  foreach lang ( engl latn )
    cat ${lang}-mix.wds \
      | vigenere -v key=qotchy \
      > ${lang}-v06.wds

    cat ${lang}-mix.wds \
      | vigenere -v key=thequickpuceopossumjumpsoverthelazycapybara \
      > ${lang}-v43.wds
  end
  
  dicio-wc {engl,latn}-{v06,v43}.wds
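
The source of the "vigenere" filter is not shown here.  The Python
sketch below is a guess at its behavior that reproduces the first
line of the 6-letter English sample above, under these assumptions:
letters are shifted by the key letter (a=0 ... z=25) modulo 26, the
key advances only on the letters a-z, and it runs on across spaces,
punctuation, and line breaks.

```python
def vigenere(text, key):
    """Shift each letter a-z by the current key letter; copy other characters."""
    out = []
    i = 0  # number of letters encoded so far; selects the key letter
    for ch in text:
        if "a" <= ch <= "z":
            shift = ord(key[i % len(key)]) - ord("a")
            out.append(chr((ord(ch) - ord("a") + shift) % 26 + ord("a")))
            i += 1
        else:
            out.append(ch)  # spaces, newlines, punctuation: key does not advance
    return "".join(out)

print(vigenere("no one would have believed in the last years of the", "qotchy"))
print(vigenere("the the the the", "qotchy"))
```

The second call shows the position dependence that matters for this
note: the same plaintext word comes out as different ciphertext
words, depending on where it falls relative to the key.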

Now a version of engl-v43.wds without the 1- and 2-letter words:

  cat engl-v43.wds \
    | grep '...' \
    > engl-vns.wds

---------------------------------------------------------

OK, here are all the samples:

  dicio-wc ????-???.wds 
  
   lines   words     bytes file        
  ------ ------- --------- ------------
    3743    3743     18124 chin-mch.wds
    7507    7507     39983 engl-mix.wds
    7431    7431     39141 engl-poi.wds
    7507    7507     39983 engl-v06.wds
    7507    7507     39983 engl-v43.wds
    5707    5707     34358 engl-vns.wds
    7508    7508     40808 engl-wow.wds
    7528    7528     52005 latn-bel.wds
    7546    7546     52321 latn-mix.wds
    7500    7500     52621 latn-ock.wds
    7546    7546     52321 latn-v06.wds
    7546    7546     52321 latn-v43.wds
    6182    6182     35751 vbio-enq.wds
    6182    6182     37279 vbio-eva.wds
    6182    6182     35751 vbio-qkt.wds
    7812    7812     45130 vhea-enq.wds
    7812    7812     45838 vhea-eva.wds
    7812    7812     45130 vhea-qkt.wds
    3223    3223     18996 vheb-enq.wds
    3223    3223     19326 vheb-eva.wds
    3223    3223     18996 vheb-qkt.wds
    6696    6696     38681 vmix-enq.wds
    6696    6696     39871 vmix-eva.wds
    6696    6696     38681 vmix-qkt.wds
    7460    7460     43503 vren-enq.wds
    7460    7460     44670 vren-eva.wds
    7460    7460     43503 vren-qkt.wds

---------------------------------------------------------
Let's compute the frequency distributions:

  foreach f ( ????-???.wds )
    echo $f
    cat ${f} \
      | sort | uniq -c | expand \
      | sort -k1,1nr \
      | compute-freqs \
      | head -50 \
      > ${f:r}.frq
    cat ${f:r}.frq \
      | gawk '//{printf "%7.5f %s\n", $2, $3;}' \
      > ${f:r}.pct
  end
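
The "compute-freqs" script is not listed in this note.  Judging from
the gawk command above, each line of a .frq file holds a count, a
relative frequency, and the word.  A minimal Python sketch under that
assumption follows; breaking ties alphabetically is also an
assumption, chosen to match the order seen in the tables.

```python
from collections import Counter

def word_freqs(words, top=50):
    """Return (count, relative frequency, word) for the `top` most common words."""
    counts = Counter(words)
    total = sum(counts.values())
    # Sort by decreasing count, then alphabetically among equal counts.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(n, n / total, w) for w, n in ranked[:top]]

for n, f, w in word_freqs("the cat and the dog and the bird".split()):
    print("%7d %7.5f %s" % (n, f, w))
```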

---------------------------------------------------------
Comparing different samples of English:

  multicol -v lines=30 {engl-poi,engl-wow,engl-mix}.pct
  plot-word-freqs {engl-poi,engl-wow,engl-mix}.frq > .engl.gif
  xv .engl.gif &

    Poirot novel   Wells's WotW   Mixture
    -------------  -------------  -------------
    0.0444 the     0.0773 the     0.0613 the   
    0.0296 i       0.0465 and     0.0338 and   
    0.0240 to      0.0384 of      0.0305 of    
    0.0221 and     0.0264 a       0.0272 i     
    0.0218 a       0.0194 to      0.0258 a     
    0.0213 of      0.0190 i       0.0220 to    
    0.0201 it      0.0152 in      0.0156 it    
    0.0191 that    0.0135 was     0.0152 that  
    0.0183 was     0.0108 had     0.0147 was   
    0.0157 in      0.0104 it      0.0144 in    
    0.0151 you     0.0096 that    0.0095 he    
    0.0112 he      0.0085 my      0.0085 had   
    0.0098 not     0.0080 he      0.0085 my    
    0.0094 is      0.0079 at      0.0085 you   
    0.0092 had     0.0077 were    0.0075 not   
    0.0087 but     0.0073 as      0.0071 as    
    0.0087 she     0.0069 with    0.0068 at    
    0.0085 her     0.0059 from    0.0064 his   
    0.0078 his     0.0055 for     0.0061 me    
    0.0078 my      0.0052 me      0.0059 is    
    0.0078 said    0.0049 they    0.0057 but   
    0.0077 as      0.0048 we      0.0056 on    
    0.0074 have    0.0047 there   0.0056 were  
    0.0073 poirot  0.0045 this    0.0056 with  
    0.0065 me      0.0043 his     0.0055 have  
    0.0058 on      0.0043 on      0.0053 her   
    0.0058 with    0.0043 their   0.0052 she   
    0.0051 for     0.0040 but     0.0051 we    
    0.0050 at      0.0039 by      0.0049 for   
    0.0050 mrs     0.0037 out     0.0043 from  

We can see that

  (*) The most common word ("the") occurs at
      4.4% in Poirot, 7.7% in Wells.
  
  (*) The ranking of the most common words is quite different:
      for example, "i" is 2nd in Poirot but 6th in Wells.
  
  (*) The Wells text has an almost perfect Zipf-like
      distribution.
      
  (*) The Poirot text is a bit non-Zipfian:
      words 3-8 are an almost flat plateau, 
      there is a large drop around word 10, then 
      another nearly flat region until about
      word 20.
      
  (*) The mixed text is intermediate; 
      words 2-5 are somewhat flat, words 7-10 are a 
      small plateau, and ditto for words 13-15.
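
These judgements were made by eye from the plots.  One rough way to
put a number on "Zipf-like", sketched here in Python, is the
least-squares slope of log(frequency) against log(rank), which is -1
for an ideal 1/rank law.  This measure is an illustration only, not
the method used in this note; the data are the top 10 entries of the
table above.

```python
import math

def zipf_slope(freqs):
    """Least-squares slope of log(frequency) against log(rank)."""
    xs = [math.log(r) for r in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx

poirot = [0.0444, 0.0296, 0.0240, 0.0221, 0.0218,
          0.0213, 0.0201, 0.0191, 0.0183, 0.0157]
wells = [0.0773, 0.0465, 0.0384, 0.0264, 0.0194,
         0.0190, 0.0152, 0.0135, 0.0108, 0.0104]
print(zipf_slope(poirot), zipf_slope(wells))
```

A single slope hides the plateaus, so it complements the plots
rather than replacing them.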

---------------------------------------------------------
Comparing different samples of Latin:

  multicol -v lines=30 {latn-ock,latn-bel,latn-mix}.pct
  plot-word-freqs {latn-ock,latn-bel,latn-mix}.frq > .latn.gif
  xv .latn.gif &

    Ockam's book        Caesar's DBG       Mixture
    ------------------  -----------------  ------------------
    0.0412 et           0.0247 et          0.0329 et         
    0.0228 in           0.0224 in          0.0241 in         
    0.0213 non          0.0146 quod        0.0183 quod       
    0.0204 quod         0.0130 ad          0.0158 non        
    0.0195 est          0.0112 non         0.0144 ad         
    0.0133 ad           0.0109 cum         0.0125 est        
    0.0088 quam         0.0108 se          0.0087 ut         
    0.0084 ut           0.0088 ut          0.0080 qui        
    0.0084 vel          0.0085 qui         0.0077 quam       
    0.0076 qui          0.0082 esse        0.0076 cum        
    0.0075 si           0.0072 ex          0.0069 se         
    0.0071 quia         0.0070 a           0.0068 ex         
    0.0069 quae         0.0060 neque       0.0068 si         
    0.0067 propter      0.0060 quam        0.0062 de         
    0.0067 sed          0.0057 eo          0.0060 esse       
    0.0065 plures       0.0056 atque       0.0056 a          
    0.0063 autem        0.0056 si          0.0050 quae       
    0.0057 expedit      0.0053 caesar      0.0049 ab         
    0.0057 principatus  0.0052 est         0.0049 per        
    0.0056 de           0.0050 ab          0.0045 vel        
    0.0055 secundum     0.0049 sibi        0.0038 etiam      
    0.0049 fidelium     0.0045 aut         0.0038 quia       
    0.0048 etiam        0.0045 de          0.0038 sed        
    0.0048 unus         0.0044 eius        0.0037 plures     
    0.0047 ex           0.0041 ne          0.0037 propter    
    0.0047 hoc          0.0039 his         0.0034 neque      
    0.0045 per          0.0039 per         0.0034 principatus
    0.0043 unum         0.0037 quae        0.0034 sibi       
    0.0041 esse         0.0037 romani      0.0034 sunt       
    0.0041 sunt         0.0036 ea          0.0033 hoc        

We can see that

  (*) The most common word ("et") occurs at 2.5% in Caesar's text
      and 4.1% in Ockam's.
      
  (*) The Caesar sample is mostly Zipf-like after 
      the first 15 words; before that it drops 
      a bit more slowly than 1/x.
      
  (*) The Ockam sample has a plateau at words 2-5,
      then a sharp drop, another plateau at words 7-10,
      etc.
      
  (*) The mixed text looks the most Zipf-like of the three.

---------------------------------------------------------
Comparing different versions of Voynichese Herbal-A:

  multicol -v lines=30 {vhea-eva,vhea-enq,vhea-qkt}.pct
  plot-word-freqs {vhea-eva,vhea-enq,vhea-qkt}.frq > .vhea.gif
  xv .vhea.gif &
  
    Herbal-A       without "q"    also "k"="t"
    -------------  -------------  -------------
    0.0527 daiin   0.0527 daiin   0.0527 daiin 
    0.0283 chol    0.0283 chol    0.0283 chol  
    0.0192 chor    0.0192 chor    0.0192 chor  
    0.0125 shol    0.0125 shol    0.0169 okchy 
    0.0123 cthy    0.0123 cthy    0.0160 ckhy  
    0.0122 chy     0.0122 chy     0.0159 oky   
    0.0118 sho     0.0118 sho     0.0125 shol  
    0.0113 dy      0.0113 dy      0.0122 chy   
    0.0111 s       0.0111 s       0.0118 sho   
    0.0095 dain    0.0100 otchy   0.0114 okol  
    0.0091 dar     0.0095 dain    0.0113 dy    
    0.0081 shor    0.0091 dar     0.0111 s     
    0.0078 shy     0.0084 oty     0.0104 okaiin
    0.0070 chey    0.0081 shor    0.0095 dain  
    0.0067 or      0.0078 shy     0.0091 dar   
    0.0065 cthol   0.0074 oky     0.0084 ckhol 
    0.0063 qotchy  0.0072 or      0.0081 shor  
    0.0060 dal     0.0070 chey    0.0078 shy   
    0.0058 ol      0.0069 okchy   0.0076 okchol
    0.0054 cthor   0.0065 cthol   0.0072 or    
    0.0051 dol     0.0060 dal     0.0070 chey  
    0.0050 qokchy  0.0059 ol      0.0068 kchy  
    0.0050 shey    0.0059 otol    0.0067 ckhor 
    0.0049 oty     0.0055 okaiin  0.0067 okchor
    0.0044 chaiin  0.0055 okol    0.0063 okor  
    0.0044 cthey   0.0054 cthor   0.0060 dal   
    0.0044 dam     0.0051 dol     0.0059 ckhey 
    0.0044 dor     0.0050 shey    0.0059 ol    
    0.0042 cheor   0.0049 otaiin  0.0051 dol   
    0.0042 oky     0.0047 otchol  0.0050 shey  

Some observations: 

  (*) The Herbal-A sample has a "natural" (5.3%) frequency for the
      first word, and is quite Zipf-like after the first 10 
      words or so.
      
  (*) Words 4 thru 9 are sort of a plateau.
      
  (*) Deleting the "q"s has little effect on the histogram.
  
  (*) Equating "k" and "t" has no effect on the 
      first three entries, but widens the plateau to 
      entries 4-12 (except for a Zipf-like step 
      between words 6 and 7).
      
Conclusion: for Herbal-A, removing the "q"s is optional, and 
equating "k" and "t" may be detrimental to its Zipfness.

---------------------------------------------------------
Comparing different versions of Voynichese Biological:
  
  multicol -v lines=30 {vbio-eva,vbio-enq,vbio-qkt}.pct
  plot-word-freqs {vbio-eva,vbio-enq,vbio-qkt}.frq > .vbio.gif
  xv .vbio.gif &
  
    Biological      without "q"    also "k"="t"
    --------------  -------------  --------------
    0.0369 shedy    0.0505 ol      0.0505 ol     
    0.0312 chedy    0.0382 okaiin  0.0495 okaiin 
    0.0301 ol       0.0369 shedy   0.0469 okedy  
    0.0298 qokaiin  0.0315 okedy   0.0369 okeedy 
    0.0254 qokedy   0.0314 chedy   0.0369 shedy  
    0.0231 qokeedy  0.0278 okeedy  0.0314 chedy  
    0.0204 qol      0.0201 okal    0.0283 okal   
    0.0171 daiin    0.0171 daiin   0.0175 okeey  
    0.0165 qokal    0.0154 otedy   0.0175 oky    
    0.0136 chey     0.0144 okeey   0.0171 daiin  
    0.0134 shey     0.0137 chey    0.0149 okar   
    0.0118 qokeey   0.0134 shey    0.0137 chey   
    0.0115 dal      0.0115 dal     0.0134 shey   
    0.0104 dar      0.0113 otaiin  0.0115 dal    
    0.0095 qoky     0.0108 oky     0.0113 okain  
    0.0091 saiin    0.0104 dar     0.0110 okey   
    0.0089 or       0.0102 or      0.0104 dar    
    0.0084 okaiin   0.0097 okain   0.0102 or     
    0.0082 qokain   0.0091 oteedy  0.0091 saiin  
    0.0081 lchedy   0.0091 saiin   0.0089 chckhy 
    0.0081 qotedy   0.0086 okar    0.0081 lchedy 
    0.0081 sol      0.0082 otal    0.0081 sol    
    0.0078 dy       0.0081 lchedy  0.0078 dy     
    0.0073 otedy    0.0081 sol     0.0076 kedy   
    0.0063 cheey    0.0078 dy      0.0070 okol   
    0.0063 qokar    0.0076 okey    0.0063 cheey  
    0.0063 sheedy   0.0066 oty     0.0063 sheedy 
    0.0061 okedy    0.0063 cheey   0.0058 aiin   
    0.0061 otaiin   0.0063 otar    0.0058 shckhy 
    0.0060 qokey    0.0063 sheedy  0.0055 checkhy

Random observations: 

  (*) The raw distribution of the Biological words
      is practically flat for the first four entries ("shedy",
      "chedy", "ol", "qokaiin"), then gradually becomes quite
      Zipf-like.
      
  (*) Removing the "q"s has a drastic effect;
      "ol" and "okaiin" overtake "shedy".  The max frequency becomes
      5%, the plateau widens to words 1-5, but the distribution then
      becomes a bit more Zipf-like.
      
  (*) Equating "k" and "t" completely flattens the
      first three entries, and affects most frequencies below the
      maximum.  There appears a large jump after word 7, and plateaus
      in 4-5, 8-10, 12-13, etc.
      
  (*) We should try removing "q" and equating "ch" and "sh",
      while leaving "k" and "t" distinct.  The max frequency
      ("chedy") would become about 6.8% (0.0369 + 0.0314),
      comparable only to Chinese.
      The second entry would be "ol" at 5%; a bit too high for Zipf,
      but perhaps that can be reduced by fixing word-break 
      transcription errors.
      
Conclusion: for Bio-B, there are weak reasons for deleting 
"q", and weak reasons for NOT equating "k" and "t".

Herbal-B and other versions are analyzed further on.

---------------------------------------------------------
Comparing English, Latin, Chinese, and Voynichese:

  multicol -v lines=30 {vhea-enq,vbio-enq,engl-wow,latn-bel,chin-mch}.pct
  plot-word-freqs \
    engl-wow.frq:1 latn-bel.frq:5 chin-mch.frq:3 \
    vhea-enq.frq:2 vbio-enq.frq:6 > .natu.gif
  xv .natu.gif &

    Herbal-A       Biological     English/WotW  Latin/DBG      Chinese/Mch
    -------------  -------------  ------------  -------------  -------------
    0.0527 daiin   0.0505 ol      0.0773 the    0.0247 et      0.0647 de    
    0.0283 chol    0.0382 okaiin  0.0465 and    0.0224 in      0.0310 shi4  
    0.0192 chor    0.0369 shedy   0.0384 of     0.0146 quod    0.0206 ren2  
    0.0125 shol    0.0315 okedy   0.0264 a      0.0130 ad      0.0163 ta1   
    0.0123 cthy    0.0314 chedy   0.0194 to     0.0112 non     0.0163 you3  
    0.0122 chy     0.0278 okeedy  0.0190 i      0.0109 cum     0.0144 xue2  
    0.0118 sho     0.0201 okal    0.0152 in     0.0108 se      0.0142 wen2  
    0.0113 dy      0.0171 daiin   0.0135 was    0.0088 ut      0.0134 shi2  
    0.0111 s       0.0154 otedy   0.0108 had    0.0085 qui     0.0131 zai4  
    0.0100 otchy   0.0144 okeey   0.0104 it     0.0082 esse    0.0112 guo2  
    0.0095 dain    0.0137 chey    0.0096 that   0.0072 ex      0.0110 yi2   
    0.0091 dar     0.0134 shey    0.0085 my     0.0070 a       0.0107 yi4   
    0.0084 oty     0.0115 dal     0.0080 he     0.0060 neque   0.0099 le    
    0.0081 shor    0.0113 otaiin  0.0079 at     0.0060 quam    0.0091 shuo1 
    0.0078 shy     0.0108 oky     0.0077 were   0.0057 eo      0.0088 bu4   
    0.0074 oky     0.0104 dar     0.0073 as     0.0056 atque   0.0088 ge    
    0.0072 or      0.0102 or      0.0069 with   0.0056 si      0.0085 shi   
    0.0070 chey    0.0097 okain   0.0059 from   0.0053 caesar  0.0083 dao4  
    0.0069 okchy   0.0091 oteedy  0.0055 for    0.0052 est     0.0083 jia1  
    0.0065 cthol   0.0091 saiin   0.0052 me     0.0050 ab      0.0075 sheng1
    0.0060 dal     0.0086 okar    0.0049 they   0.0049 sibi    0.0069 bu2   
    0.0059 ol      0.0082 otal    0.0048 we     0.0045 aut     0.0069 duo1  
    0.0059 otol    0.0081 lchedy  0.0047 there  0.0045 de      0.0069 jiu4  
    0.0055 okaiin  0.0081 sol     0.0045 this   0.0044 eius    0.0067 hen3  
    0.0055 okol    0.0078 dy      0.0043 his    0.0041 ne      0.0067 nian2 
    0.0054 cthor   0.0076 okey    0.0043 on     0.0039 his     0.0064 ye3   
    0.0051 dol     0.0066 oty     0.0043 their  0.0039 per     0.0064 yi1   
    0.0050 shey    0.0063 cheey   0.0040 but    0.0037 quae    0.0061 mei3  
    0.0049 otaiin  0.0063 otar    0.0039 by     0.0037 romani  0.0061 neng2 
    0.0047 otchol  0.0063 sheedy  0.0037 out    0.0036 ea      0.0061 yu3   

Random observations: 

  (*) The Herbal-A and Biological samples have a max frequency of
      about 5%, within the range spanned by English (7.7%),
      Latin (2.5%), and Chinese (6.5%).
  
  (*) The Voynichese samples have Zipfness comparable to that
      of the other languages.
      
---------------------------------------------------------
The effect of Vigenère encoding:

  multicol -v lines=30 {engl-v06,latn-v06,engl-v43,latn-v43,engl-vns}.pct
  plot-word-freqs \
    engl-v06.frq:1 engl-v43.frq:2 engl-vns.frq:3 \
    latn-v06.frq:5 latn-v43.frq:6 \
    > .gene.gif
  xv .gene.gif &

    English,    Latin,       English,    Latin,       English,
    6-letter    6-letter     43-letter   43-letter    43-letter
    Vigenère    Vigenère     Vigenère    Vigenère     Vigenère, no 1-2 letter words
    engl-v06    latn-v06     engl-v43    latn-v43     engl-vns
    ----------  -----------  ----------  -----------  -----------
    0.0113 rxs  0.0064 uh    0.0049 a    0.0021 cv    0.0039 moi 
    0.0107 voc  0.0060 cj    0.0045 c    0.0021 yf    0.0023 xse 
    0.0105 jvx  0.0056 lr    0.0040 u    0.0019 dr    0.0021 hws 
    0.0103 hag  0.0052 sm    0.0037 m    0.0016 em    0.0021 kal 
    0.0099 afu  0.0052 xv    0.0036 i    0.0016 gt    0.0021 mwx 
    0.0087 mjl  0.0050 yb    0.0036 p    0.0016 ie    0.0021 ntt 
    0.0087 y    0.0049 ga    0.0032 moi  0.0016 vt    0.0021 ucs 
    0.0083 hh   0.0046 bp    0.0032 y    0.0016 xa    0.0019 izs 
    0.0067 qm   0.0045 pl    0.0027 h    0.0015 ah    0.0019 jbm 
    0.0065 hlt  0.0042 wg    0.0027 sfg  0.0015 cu    0.0019 tal 
    0.0061 qbw  0.0041 xser  0.0025 kal  0.0015 cz    0.0018 alp 
    0.0061 tpk  0.0040 ow    0.0023 npg  0.0015 gx    0.0018 olv 
    0.0056 q    0.0036 gqu   0.0023 olv  0.0015 kn    0.0018 tye 
    0.0055 o    0.0034 gd    0.0023 qp   0.0015 ui    0.0016 alu 
    0.0053 cub  0.0033 gzr   0.0023 vht  0.0013 pr    0.0016 hci 
    0.0053 k    0.0033 ku    0.0021 ntn  0.0013 tn    0.0016 hzw 
    0.0052 ogf  0.0032 gihf  0.0020 b    0.0013 yv    0.0016 lbq 
    0.0051 et   0.0029 bhp   0.0020 k    0.0012 jn    0.0016 tgc 
    0.0049 g    0.0029 hb    0.0020 nji  0.0012 sv    0.0014 ahh 
    0.0048 c    0.0029 jwvb  0.0020 o    0.0012 wl    0.0014 ehd 
    0.0048 cy   0.0029 okcw  0.0020 s    0.0012 wn    0.0014 iff 
    0.0048 mv   0.0029 umd   0.0020 t    0.0011 au    0.0014 ntn 
    0.0045 b    0.0028 sbmt  0.0020 x    0.0011 e     0.0014 phw 
    0.0045 ydr  0.0025 cih   0.0020 xse  0.0011 et    0.0014 rjai
    0.0043 re   0.0024 enqk  0.0019 dwy  0.0011 huhk  0.0014 uph 
    0.0043 vd   0.0024 leb   0.0019 ir   0.0011 ih    0.0014 vht 
    0.0039 mq   0.0024 qr    0.0019 r    0.0011 ik    0.0014 vls 
    0.0039 vv   0.0023 kh    0.0019 rje  0.0011 mv    0.0014 vrt 
    0.0039 w    0.0023 lqj   0.0017 bq   0.0011 pt    0.0014 xyx 
    0.0036 h    0.0023 slv   0.0017 cbq  0.0011 qlow  0.0012 eew 

Some observations:

  (*) The most common word has frequency between 0.2% and 1.1%,
      much less than in natural languages;
      
  (*) The distribution is very non-Zipfian: mostly flat
      for the first 30 words or so, with several multiword plateaus,
      then begins to decrease, but still not as fast as 1/i.
      
  (*) More importantly, the most common words are short,
      and the plateaus among words 1-50 are largely associated with
      words of the same length.
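
The plateaus have a simple explanation: a plaintext word of
frequency f is split by a key of period n into up to n different
ciphertext words, each with frequency about f/n.  The Python sketch
below checks this on the engl-v06 column, assuming the cipher adds
key letters modulo 26 (which is consistent with the samples shown
earlier).  The helper name is made up for this illustration.

```python
def key_letters(cipher, plain):
    """Key letters that turn `plain` into `cipher` under a Vigenère shift (a=0 ... z=25)."""
    return "".join(
        chr((ord(c) - ord(p)) % 26 + ord("a")) for c, p in zip(cipher, plain)
    )

key = "qotchy"
# The six most frequent words of the engl-v06 column, with their frequencies.
top6 = {"rxs": 0.0113, "voc": 0.0107, "jvx": 0.0105,
        "hag": 0.0103, "afu": 0.0099, "mjl": 0.0087}

for word, freq in top6.items():
    k = key_letters(word, "the")
    # A fragment found in key+key means: "the" encoded at some key offset.
    print(word, k, k in key + key, freq)

print(round(sum(top6.values()), 4))  # compare with 0.0613 for "the" in engl-mix
```

So the top six entries appear to be "the" at each of the six key
offsets.  The same check on "moi", the top 3-letter entry of the
43-letter columns, gives back "the", a fragment that occurs twice in
the 43-letter key; that may be why "moi" stands out there.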

---------------------------------------------------------
Comparing languages A, B and whatnot:

  multicol -v lines=30 {vhea,vbio,vheb,vmix,vren}-eva.pct
  plot-word-freqs \
      vhea-eva.frq:2 vbio-eva.frq:4 \
      vheb-eva.frq:1 vmix-eva.frq:5 vren-eva.frq:3 \
    > .veva.gif
  xv .veva.gif &

    Herbal-A       Biological      Herbal-B        HerA/Bio mix    Rene's list    
    vhea-eva.pct   vbio-eva.pct    vheb-eva.pct    vmix-eva.pct    vren-eva.pct  
    -------------  --------------  --------------  --------------  --------------
    0.0527 daiin   0.0369 shedy    0.0267 daiin    0.0329 daiin    0.0275 daiin  
    0.0283 chol    0.0312 chedy    0.0186 chedy    0.0200 shedy    0.0142 chedy  
    0.0192 chor    0.0301 ol       0.0171 or       0.0188 qokaiin  0.0130 ol     
    0.0125 shol    0.0298 qokaiin  0.0158 chdy     0.0179 ol       0.0127 shedy  
    0.0123 cthy    0.0254 qokedy   0.0155 dar      0.0172 chedy    0.0114 chey   
    0.0122 chy     0.0231 qokeedy  0.0127 qokedy   0.0142 chol     0.0107 ar     
    0.0118 sho     0.0204 qol      0.0124 aiin     0.0140 qokedy   0.0099 chol   
    0.0113 dy      0.0171 daiin    0.0121 ar       0.0122 qol      0.0099 dar    
    0.0111 s       0.0165 qokal    0.0109 chckhy   0.0121 qokeedy  0.0098 qokedy 
    0.0095 dain    0.0136 chey     0.0109 okaiin   0.0108 dar      0.0098 qokeedy
    0.0091 dar     0.0134 shey     0.0109 shedy    0.0102 chor     0.0095 qokain 
    0.0081 shor    0.0118 qokeey   0.0096 ol       0.0100 chey     0.0094 qokeey 
    0.0078 shy     0.0115 dal      0.0090 okar     0.0096 shey     0.0088 qokaiin
    0.0070 chey    0.0104 dar      0.0087 dy       0.0093 dal      0.0086 aiin   
    0.0067 or      0.0095 qoky     0.0084 qokar    0.0091 dy       0.0084 or     
    0.0065 cthol   0.0091 saiin    0.0078 dal      0.0087 qokal    0.0076 shey   
    0.0063 qotchy  0.0089 or       0.0078 okedy    0.0079 shol     0.0074 okaiin 
    0.0060 dal     0.0084 okaiin   0.0078 saiin    0.0076 chy      0.0071 dain   
    0.0058 ol      0.0082 qokain   0.0071 okal     0.0076 or       0.0071 dal    
    0.0054 cthor   0.0081 lchedy   0.0062 qokaiin  0.0073 qoky     0.0068 s      
    0.0051 dol     0.0081 qotedy   0.0059 cheky    0.0072 dain     0.0062 cheol  
    0.0050 qokchy  0.0081 sol      0.0056 kar      0.0070 qokeey   0.0062 qokal  
    0.0050 shey    0.0078 dy       0.0056 otedy    0.0069 s        0.0058 chckhy 
    0.0049 oty     0.0073 otedy    0.0053 dam      0.0066 saiin    0.0058 cheey  
    0.0044 chaiin  0.0063 cheey    0.0053 oky      0.0054 cthy     0.0055 otaiin 
    0.0044 cthey   0.0063 qokar    0.0050 okeedy   0.0052 sol      0.0054 al     
    0.0044 dam     0.0063 sheedy   0.0050 otar     0.0051 otaiin   0.0054 shol   
    0.0044 dor     0.0061 okedy    0.0050 shdy     0.0049 dol      0.0051 chor   
    0.0042 cheor   0.0061 otaiin   0.0047 chey     0.0049 okaiin   0.0048 okeey  
    0.0042 oky     0.0060 qokey    0.0047 kchdy    0.0048 lchedy   0.0048 saiin  

Some observations (mostly confirming those in Rene's article):

  (*) Note that Herbal-A and Biological have widely different
      vocabularies. Among the first 10 words in each list,
      only "daiin" is shared; and its frequencies are 5.3%
      against 1.7%.
      
  (*) The word frequency plots of Herbal-A (vhea-eva, blue)
      and Biological (vbio-eva, magenta) are also quite different.
  
  (*) On the other hand, the graphs of Herbal-B (vheb-eva, red) and of the 
      mixture of 40% Herbal-A and 60% Biological (vmix-eva, maroon) are
      surprisingly similar! 
      
  (*) However, even though the sorted frequencies of vheb-eva and
      vmix-eva are similar, the corresponding words are quite different.

Let's redo the comparison after omitting the "q"s:

  multicol -v lines=30 {vhea,vbio,vheb,vmix,vren}-enq.pct
  plot-word-freqs \
      vhea-enq.frq:2 vbio-enq.frq:4 \
      vheb-enq.frq:1 vmix-enq.frq:5 vren-enq.frq:3 \
    > .venq.gif
  xv .venq.gif &

    Herbal-A       Biological     Herbal-B       HerA/Bio mix   Rene's list    
    vhea-enq.pct   vbio-enq.pct   vheb-enq.pct   vmix-enq.pct   vren-enq.pct 
    -------------  -------------  -------------  -------------  -------------
    0.0527 daiin   0.0505 ol      0.0267 daiin   0.0329 daiin   0.0275 daiin 
    0.0283 chol    0.0382 okaiin  0.0205 okedy   0.0302 ol      0.0165 ol    
    0.0192 chor    0.0369 shedy   0.0186 chedy   0.0237 okaiin  0.0162 okaiin
    0.0125 shol    0.0315 okedy   0.0174 okar    0.0200 shedy   0.0142 chedy 
    0.0123 cthy    0.0314 chedy   0.0171 okaiin  0.0181 okedy   0.0142 okeey 
    0.0122 chy     0.0278 okeedy  0.0171 or      0.0173 chedy   0.0137 okedy 
    0.0118 sho     0.0201 okal    0.0158 chdy    0.0151 okeedy  0.0135 okeedy
    0.0113 dy      0.0171 daiin   0.0155 dar     0.0142 chol    0.0133 okain 
    0.0111 s       0.0154 otedy   0.0124 aiin    0.0114 okal    0.0127 shedy 
    0.0100 otchy   0.0144 okeey   0.0121 ar      0.0108 dar     0.0114 chey  
    0.0095 dain    0.0137 chey    0.0109 chckhy  0.0102 chor    0.0109 ar    
    0.0091 dar     0.0134 shey    0.0109 shedy   0.0100 chey    0.0102 okal  
    0.0084 oty     0.0115 dal     0.0102 okal    0.0096 shey    0.0099 chol  
    0.0081 shor    0.0113 otaiin  0.0099 ol      0.0093 dal     0.0099 dar   
    0.0078 shy     0.0108 oky     0.0093 otedy   0.0093 oky     0.0092 okar  
    0.0074 oky     0.0104 dar     0.0087 dy      0.0091 dy      0.0090 otaiin
    0.0072 or      0.0102 or      0.0087 oky     0.0090 or      0.0088 or    
    0.0070 chey    0.0097 okain   0.0084 okeedy  0.0085 okeey   0.0086 aiin  
    0.0069 okchy   0.0091 oteedy  0.0078 dal     0.0084 otaiin  0.0076 otedy 
    0.0065 cthol   0.0091 saiin   0.0078 saiin   0.0082 otedy   0.0076 shey  
    0.0060 dal     0.0086 okar    0.0071 otar    0.0079 shol    0.0071 dain  
    0.0059 ol      0.0082 otal    0.0062 okchdy  0.0076 chy     0.0071 dal   
    0.0059 otol    0.0081 lchedy  0.0059 cheky   0.0072 dain    0.0070 oteedy
    0.0055 okaiin  0.0081 sol     0.0056 kar     0.0069 s       0.0068 oky   
    0.0055 okol    0.0078 dy      0.0053 dam     0.0067 oty     0.0068 s     
    0.0054 cthor   0.0076 okey    0.0053 otal    0.0066 saiin   0.0067 otar  
    0.0051 dol     0.0066 oty     0.0050 ody     0.0057 okain   0.0064 oteey 
    0.0050 shey    0.0063 cheey   0.0050 okeey   0.0057 otal    0.0064 oty   
    0.0049 otaiin  0.0063 otar    0.0050 shdy    0.0055 cthy    0.0062 cheol 
    0.0047 otchol  0.0063 sheedy  0.0047 chey    0.0054 okey    0.0058 chckhy
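
The "-enq" columns above can be approximated from a one-word-per-line
".wds" file with standard tools. The sketch below is not the script
actually used for this note. It assumes that "omitting the q's" means
deleting every "q" from every word, and the names strip_q and word_pct
are made up for illustration.

```shell
# Hypothetical helpers, not the scripts used for this note.

# Delete every "q" from each word read on stdin (assumed meaning of
# "-enq"), and drop any word that becomes empty as a result.
strip_q() {
  sed -e 's/q//g' -e '/^$/d'
}

# Turn a one-word-per-line stream into "fraction word" lines, sorted
# by decreasing frequency (ties broken alphabetically).
word_pct() {
  LC_ALL=C sort | uniq -c \
    | awk '{ n[$2] = $1; tot += $1 }
           END { for (w in n) printf "%.4f %s\n", n[w] / tot, w }' \
    | LC_ALL=C sort -k1,1nr -k2,2
}

# Tiny demonstration on made-up input: "qokedy" and "okedy" merge.
printf '%s\n' qokedy okedy daiin qokedy | strip_q | word_pct
```

On the four made-up words the demonstration prints "0.7500 okedy"
followed by "0.2500 daiin", since the two "qokedy" tokens are counted
together with "okedy" once the "q" is gone.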

Observations:
  
  (*) Again, the Herbal-A and Biological vocabularies and distributions
      are quite different.
      
  (*) Again, the Herbal-B plot closely resembles that of the
      Herbal-A/Biological mixture, and the vocabularies now 
      do have some resemblance.
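
The degree of vocabulary resemblance can be made concrete by counting
the words that two lists share among their first N entries. The sketch
below is only an illustration: the helper shared_top is hypothetical,
and the four small tables are the first 10 lines of the "-enq" columns
above, copied by hand.

```shell
# Hypothetical helper, not part of the scripts used for this note.
# Usage: shared_top N TABLE_A TABLE_B
# Lists the words that occur in the first N lines of both tables,
# each table being "fraction word" lines sorted by decreasing frequency.
shared_top() {
  awk -v n="$1" '
    FNR > n    { next }                  # only the first n lines of each file
    NR == FNR  { seen[$2] = 1; next }    # first file: remember its words
    $2 in seen { print $2 }              # second file: report shared words
  ' "$2" "$3"
}

tmp=$(mktemp -d)
trap 'rm -rf "$tmp"' EXIT

# First 10 lines of four "-enq" columns, copied from the table above.
printf '%s\n' '0.0527 daiin' '0.0283 chol' '0.0192 chor' '0.0125 shol' \
  '0.0123 cthy' '0.0122 chy' '0.0118 sho' '0.0113 dy' '0.0111 s' \
  '0.0100 otchy' > "$tmp/vhea"
printf '%s\n' '0.0505 ol' '0.0382 okaiin' '0.0369 shedy' '0.0315 okedy' \
  '0.0314 chedy' '0.0278 okeedy' '0.0201 okal' '0.0171 daiin' \
  '0.0154 otedy' '0.0144 okeey' > "$tmp/vbio"
printf '%s\n' '0.0267 daiin' '0.0205 okedy' '0.0186 chedy' '0.0174 okar' \
  '0.0171 okaiin' '0.0171 or' '0.0158 chdy' '0.0155 dar' '0.0124 aiin' \
  '0.0121 ar' > "$tmp/vheb"
printf '%s\n' '0.0329 daiin' '0.0302 ol' '0.0237 okaiin' '0.0200 shedy' \
  '0.0181 okedy' '0.0173 chedy' '0.0151 okeedy' '0.0142 chol' \
  '0.0114 okal' '0.0108 dar' > "$tmp/vmix"

echo "Herbal-A vs Biological:"; shared_top 10 "$tmp/vhea" "$tmp/vbio"
echo "Herbal-B vs mixture:";    shared_top 10 "$tmp/vheb" "$tmp/vmix"
```

With these inputs, Herbal-A and Biological share only "daiin", while
Herbal-B and the mixture share five words: daiin, okaiin, okedy, chedy
and dar.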

Same, with "q" removed and "t" mapped to "k":

  multicol -v lines=30 {vhea,vbio,vheb,vmix,vren}-qkt.pct
  plot-word-freqs \
      vhea-qkt.frq:2 vbio-qkt.frq:4 \
      vheb-qkt.frq:1 vmix-qkt.frq:5 vren-qkt.frq:3 \
    > .vqkt.gif
  xv .vqkt.gif &

    Herbal-A       Biological      Herbal-B       HerA/Bio mix   Rene's list
    vhea-qkt.pct   vbio-qkt.pct    vheb-qkt.pct   vmix-qkt.pct   vren-qkt.pct 
    -------------  --------------  -------------  -------------  -------------
    0.0527 daiin   0.0505 ol       0.0298 okedy   0.0329 daiin   0.0275 daiin 
    0.0283 chol    0.0495 okaiin   0.0267 daiin   0.0321 okaiin  0.0252 okaiin
    0.0192 chor    0.0469 okedy    0.0245 okar    0.0302 ol      0.0213 okedy 
    0.0169 okchy   0.0369 okeedy   0.0217 okaiin  0.0263 okedy   0.0206 okeey 
    0.0160 ckhy    0.0369 shedy    0.0186 chedy   0.0200 shedy   0.0205 okeedy
    0.0159 oky     0.0314 chedy    0.0171 or      0.0193 okeedy  0.0172 okain 
    0.0125 shol    0.0283 okal     0.0158 chdy    0.0173 chedy   0.0165 ol    
    0.0122 chy     0.0175 okeey    0.0155 dar     0.0170 okal    0.0160 okar  
    0.0118 sho     0.0175 oky      0.0155 okal    0.0160 oky     0.0154 okal  
    0.0114 okol    0.0171 daiin    0.0133 chckhy  0.0142 chol    0.0142 chedy 
    0.0113 dy      0.0149 okar     0.0133 oky     0.0114 okeey   0.0133 oky   
    0.0111 s       0.0137 chey     0.0124 aiin    0.0108 dar     0.0127 shedy 
    0.0104 okaiin  0.0134 shey     0.0121 ar      0.0102 chor    0.0114 chey  
    0.0095 dain    0.0115 dal      0.0109 shedy   0.0100 chey    0.0109 ar    
    0.0091 dar     0.0113 okain    0.0099 okchdy  0.0096 shey    0.0099 chol  
    0.0084 ckhol   0.0110 okey     0.0099 okeedy  0.0093 dal     0.0099 dar   
    0.0081 shor    0.0104 dar      0.0099 ol      0.0093 okchy   0.0097 okol  
    0.0078 shy     0.0102 or       0.0087 dy      0.0091 dy      0.0088 or    
    0.0076 okchol  0.0091 saiin    0.0084 kar     0.0091 okol    0.0086 aiin  
    0.0072 or      0.0089 chckhy   0.0078 dal     0.0090 or      0.0084 okey  
    0.0070 chey    0.0081 lchedy   0.0078 saiin   0.0081 okar    0.0083 chckhy
    0.0068 kchy    0.0081 sol      0.0074 kedy    0.0079 shol    0.0076 shey  
    0.0067 ckhor   0.0078 dy       0.0068 cheky   0.0076 chy     0.0071 dain  
    0.0067 okchor  0.0076 kedy     0.0068 kchdy   0.0075 okey    0.0071 dal   
    0.0063 okor    0.0070 okol     0.0068 okeey   0.0072 dain    0.0068 s     
    0.0060 dal     0.0063 cheey    0.0062 okam    0.0069 s       0.0062 cheol 
    0.0059 ckhey   0.0063 sheedy   0.0062 ykedy   0.0067 okain   0.0059 okeol 
    0.0059 ol      0.0058 aiin     0.0056 okol    0.0066 ckhy    0.0058 cheey 
    0.0051 dol     0.0058 shckhy   0.0056 ykar    0.0066 saiin   0.0054 al    
    0.0050 shey    0.0055 checkhy  0.0053 dam     0.0063 chckhy  0.0054 okchy 
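
The "-qkt" columns above are consistent with taking the "-enq" tables,
replacing every "t" by "k", and adding up the fractions of words that
become identical. The sketch below illustrates that reading; the helper
merge_t_into_k is hypothetical and is not the script used for this note.

```shell
# Hypothetical helper, not part of the scripts used for this note.
# Reads "fraction word" lines on stdin, maps every "t" to "k" in the
# word, and sums the fractions of words that coincide after the mapping.
merge_t_into_k() {
  awk '{ w = $2; gsub(/t/, "k", w); f[w] += $1 }
       END { for (w in f) printf "%.4f %s\n", f[w], w }' \
    | LC_ALL=C sort -k1,1nr -k2,2
}

# Two Herbal-A entries copied from the "-enq" table.
printf '%s\n' '0.0100 otchy' '0.0069 okchy' | merge_t_into_k
```

This prints "0.0169 okchy", which is the Herbal-A entry for "okchy" in
the table above. Because the inputs are already rounded to four
decimals, other sums can differ from the table in the last digit (for
instance "oty" plus "oky" gives 0.0158 against the tabulated 0.0159).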

Observations:
  
  (*) Again, the Herbal-A and Biological vocabularies and distributions
      are quite different.
      
  (*) Again, the Herbal-B plot closely resembles that of the
      Herbal-A/Biological mixture, and the vocabularies now 
      do have some resemblance.
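
As a sanity check on the mixture column, one can compute the frequencies
expected if 40% of the word tokens come from Herbal-A and 60% from
Biological. The sketch below assumes that this is what the mixture
means; the helper mix_pct is hypothetical, and the three input entries
per table are copied by hand from the "-qkt" columns above.

```shell
# Hypothetical helper, not part of the scripts used for this note.
# Usage: mix_pct SHARE TABLE_A TABLE_B
# Prints the expected "fraction word" table of a mixture that takes
# SHARE of its word tokens from TABLE_A and the rest from TABLE_B.
mix_pct() {
  awk -v a="$1" '
    NR == FNR { f[$2] += a * $1; next }     # lines of the first table
              { f[$2] += (1 - a) * $1 }     # lines of the second table
    END       { for (w in f) printf "%.4f %s\n", f[w], w }
  ' "$2" "$3" | LC_ALL=C sort -k1,1nr -k2,2
}

# Three entries copied from the vhea-qkt and vbio-qkt columns.
tmp=$(mktemp -d)
trap 'rm -rf "$tmp"' EXIT
printf '%s\n' '0.0527 daiin' '0.0104 okaiin' '0.0059 ol' > "$tmp/vhea"
printf '%s\n' '0.0171 daiin' '0.0495 okaiin' '0.0505 ol' > "$tmp/vbio"

mix_pct 0.4 "$tmp/vhea" "$tmp/vbio"
```

The expected values come out as 0.0339 for "okaiin", 0.0327 for "ol" and
0.0313 for "daiin", against 0.0321, 0.0302 and 0.0329 in the vmix-qkt
column. The agreement is only rough, which is plausible given that the
files were randomly sampled.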