Hacking at the Voynich manuscript - Side notes
004 Effect of position-dependent ciphers on the word distribution

Last edited on 1999-07-28 01:48:10 by stolfi

[ Used to be Notes/020, renumbered to Notes/004 on 1999-02-01 ]

1998-04-23 stolfi
=================

  Let's examine the hypothesis that the VMs is in cipher, by
  comparing the Voynichese word frequency distribution with those of
  natural languages, both plain and Vigenère-encoded.

  Before reading this note, you must read these Web articles:

    Gabriel Landini
    Zipf's laws in the Voynich Manuscript.
    http://sun1.bham.ac.uk/G.Landini/evmt/zipf.htm

    Rene Zandbergen
    Currier A and B: two different languages?
    http://sun1.bham.ac.uk/G.Landini/evmt/lang.htm

  Let's compare the word frequencies of the following texts:

    vren-eva  Rene's Voynichese word frequency table, expanded
              by repeating each word according to its count.
    vhea-eva  Herbal section, "Language A" subset, Friedman's transcription
    vheb-eva  Herbal section, "Language B" subset, Friedman's transcription
    vbio-eva  Biological section (Language B), Friedman's transcription
    vmix-eva  Mixture of 40% Herbal-A and 60% Biological

    engl-poi  Modern English (a Poirot novel, lowercase)
    engl-wow  Modern English (Wells's War of the Worlds, lowercase)
    latn-ock  Medieval Latin (a text by William of Ockham, lowercase)
    latn-bel  Classical Latin (Caesar's De Bello Gallico, lowercase)
    chin-mch  Modern Chinese (a beginner's reader, in pinyin)

    engl-v06  English coded with 6-letter Vigenère
    engl-v43  English coded with 43-letter Vigenère
    engl-vns  ditto, ignoring 1- and 2-letter words
    latn-v06  Latin coded with 6-letter Vigenère
    latn-v43  Latin coded with 43-letter Vigenère

  All texts were randomly sampled so as to produce files of roughly
  similar size (7500 words), except when the original material was
  shorter than that limit. The resulting ".wds" files have one word
  per line, in the original order.

  The details of sample preparation are shown below.

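
  The random sampling described above keeps each word independently
  with a fixed probability, which preserves word order while bringing
  every sample to about the same size. Here is a minimal Python sketch
  of that idea. The function names, the seed, and the 7500-word default
  are illustrative assumptions, not part of the original tool chain,
  which uses gawk's rand() for the same purpose.

```python
import random

def rate_for_target(n_words, target=7500):
    """Sampling rate that yields about `target` words, capped at 1.0."""
    return min(1.0, target / n_words)

def sample_words(words, rate, seed=1998):
    """Keep each word independently with probability `rate`, in original order."""
    rng = random.Random(seed)  # fixed seed (an arbitrary choice) makes the sample repeatable
    return [w for w in words if rng.random() <= rate]

# Example: thin a 9000-word text down to roughly 1000 words.
words = ("the quick brown fox jumps over the lazy dog " * 1000).split()
picked = sample_words(words, rate_for_target(len(words), target=1000))
print(len(words), "->", len(picked))
```

  The sample size is only approximately equal to the target, since each
  word is kept or dropped by an independent coin toss. As far as I know,
  gawk's rand() also starts from the same seed on every run unless
  srand() is called, so the pipelines in this note should be repeatable
  in the same way.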
--------------------------------------------------------- English cat engl-poi.txt \ | head -5 the intense interest aroused in the public by what was known at the time as the styles case has now somewhat subsided nevertheless in view of the world wide notoriety which attended it i have been asked both by my friend poirot and the family themselves to write an account of the whole story this we trust will effectually silence the cat engl-poi.txt \ | tr ' ' '\012' \ | grep '.' \ | gawk '(rand() <= 0.13){print;}' \ > engl-poi.wds dicio-wc engl-poi.wds cat engl-wow.txt \ | head -5 No one would have believed in the last years of the nineteenth century that this world was being watched keenly and closely by intelligences greater than man's and yet as mortal as his own; that as men busied themselves about their various concerns they were scrutinised and studied, perhaps cat engl-wow.txt \ | tr 'A-Z' 'a-z' \ | tr -c -d ' a-z\012' \ | tr ' ' '\012' \ | grep '.' \ | gawk '(rand() <= 0.125){print;}' \ > engl-wow.wds dicio-wc engl-wow.wds cat engl-poi.wds engl-wow.wds \ | gawk '(rand() <= 0.51){print;}' \ > engl-mix.wds dicio-wc engl-mix.wds --------------------------------------------------------- Latin cat latn-ock.txt \ | head -5 Discipulus: Quoniam ista quinta assertio, via media inter alias quatuor incedendo, cum qualibet illarum in quibusdam concordat et in aliquibus discrepare dignoscitur, ipsam quo ad alias partes eius exquisite discutere etiam alias quodammodo pertractare propono. Ideo de ipsa diffuse aliquantulum cat latn-ock.txt \ | tr 'A-Z' 'a-z' \ | tr -c -d ' a-z\012' \ | tr ' ' '\012' \ | grep '.' \ | gawk '(rand() <= 0.31){print;}' \ > latn-ock.wds dicio-wc latn-ock.wds cat latn-bel.txt \ | head -5 Gallia est omnis divisa in partes tres, quarum unam incolunt Belgae, aliam Aquitani, tertiam qui ipsorum lingua Celtae, nostra Galli appellantur. Hi omnes lingua, institutis, legibus inter se differunt. 
Gallos ab Aquitanis Garumna flumen, a Belgis Matrona et Sequana dividit. Horum omnium fortissimi sunt Belgae, propterea quod a cultu cat latn-bel.txt \ | tr 'A-Z' 'a-z' \ | tr -c -d ' a-z\012' \ | tr ' ' '\012' \ | grep '.' \ | gawk '(rand() <= 0.92){print;}' \ > latn-bel.wds dicio-wc latn-bel.wds cat latn-ock.wds latn-bel.wds \ | gawk '(rand() <= 0.51){print;}' \ > latn-mix.wds dicio-wc latn-mix.wds --------------------------------------------------------- Chinese: cat chin-mch.txt \ | head -5 lu3 xun4 shi4 jin4 dai4 shi3 shang4 zui4 you3 ying3 xiang3 li4 de wen2 xue2 jia1 gen1 pi1 ping2 jia1 zhi1 yi1 yi1 ba1 ba1 yi1 nian2 chu1 sheng1 zai4 zhe4 jiang1 shao4 xing1 yi2 ge xiang1 dang1 fu4 yu4 de jia1 ting2 li3 tong2 nian2 de shi2 hou yin1 wei4 zu3 fu4 ru4 yu4 fu4 qin sheng1 bing4 jia1 ting2 de jing1 ji4 qing2 kuang4 tu1 ran2 bian4 cat chin-mch.txt \ | tr ' ' '\012' \ | grep '.' \ | gawk '(rand() <= 0.99){print;}' \ > chin-mch.wds dicio-wc chin-mch.wds Checking repeated words: cat chin-mch.txt \ | tr ' ' '\012' \ | grep '.' \ | gawk 'BEGIN{w = "_";} /./{print w, $0; w = $0;} END{print w, "_";}' \ | egrep '^(.*) \1$' \ > .reps dicio-wc chin-mch.txt .reps lines words file ------ ------- ------------ 265 3777 chin-mch.txt 16 32 .reps Ditto, ignoring tone: cat chin-mch.txt \ | tr ' ' '\012' \ | tr -d '0-9' \ | grep '.' 
\ | gawk 'BEGIN{w = "_";} /./{print w, $0; w = $0;} END{print w, "_";}' \ | egrep '^(.*) \1$' \ > .reps dicio-wc chin-mch.txt .reps lines words bytes file ------ ------- --------- ------------ 265 3777 18287 chin-mch.txt 27 54 218 .reps --------------------------------------------------------- Voynichese: Let's recreate Rene's word list, by expanding his word histogram and sampling it: cat Rene-words.frq \ | gawk '(NF>=2){for(i=0;i<$1;i++){print $2;}}' \ | gawk '(rand() <= 0.26){print;}' \ > vren-eva.wds dicio-wc vren-eva.wds Rene wondered whether the low frequency of the most common word in Voynichese ("daiin" 2.7%, contrasted with "the" 4.6%) is due to the presence of two different languages. Let's test that using the Friedman transcription of herbal-A, Herbal-B, and Biological, plus a 40:60 mixture of Herbal-A and Biological (as Rene himself suggested). cat hea-f-eva-gut.wds \ > vhea-eva.wds cat heb-f-eva-gut.wds \ > vheb-eva.wds cat bio-f-eva-gut.wds \ > vbio-eva.wds cat hea-f-eva-gut.wds \ | gawk '(rand() <= 0.40){print;}' \ > vmix-eva.wds cat bio-f-eva-gut.wds \ | gawk '(rand() <= 0.60){print;}' \ >> vmix-eva.wds dicio-wc {vhea,vheb,vbio,vmix}-eva.wds Let's also prepare "q"-less versions of all the Voynichese samples: foreach f ( vren vhea vbio vheb vmix ) cat $f-eva.wds \ | sed -e 's/^q//g' \ > ${f}-enq.wds end And, just in case, also versions with "k" and "t" identified: foreach f ( vren vhea vbio vheb vmix ) cat $f-enq.wds \ | sed -e 's/^q//g' -e 's/t/k/g' \ > ${f}-qkt.wds end Let's count repeated words: cat hea-f-eva.txt \ | tr ' ' '\012' \ | grep '.' \ | gawk 'BEGIN{w = "_";} /./{print w, $0; w = $0;} END{print w, "_";}' \ | egrep '^(.*) \1$' \ > .reps dicio-wc hea-f-eva.txt .reps lines words file ------ ------- ------------ 1216 8058 hea-f-eva.txt 78 156 .reps cat bio-f-eva.txt \ | tr ' ' '\012' \ | grep '.' 
\ | gawk 'BEGIN{w = "_";} /./{print w, $0; w = $0;} END{print w, "_";}' \ | egrep '^(.*) \1$' \ > .reps dicio-wc bio-f-eva.txt .reps lines words file ------ ------- ------------ 715 6281 bio-f-eva.txt 68 136 .reps --------------------------------------------------------- Vigenère-encoded texts: cat engl-wow.txt \ | tr 'A-Z' 'a-z' \ | vigenere -v key=qotchy \ | head -5 dc hpl ueief oyls ugsgujxf pl jvx nhqj mxcyq et mjl lybxvlcdha elljika afqh mjpq mcknk uqg ugplw ktvjfur dgllbm tpk abclgsw rm bpacbzbillssl iycqhxt afqb fcu'q qbw alr qg fqyrqz tu ogi cpp; afqh tu tcd pnupct hagtquzogz yrcnv afuwk xhpycnu jmdqxtuq jvxa dchs leysjwgkzct ogf zrkrbgk, nufacwq cat engl-wow.txt \ | tr 'A-Z' 'a-z' \ | vigenere -v key=thequickpuceopossumjumpsoverthelazycapybara \ | head -5 gv sdy eqeax jejt pwdcqeyp xf hci ctzx jezpu ou rie eiglxuyvvr rypxigm lzuf cbuh ocmpu phw mehli wprdhvd dlidfg cxs wnsgtzq ts uwnqadwbievlw rrdyveg riae mtu'w qhl aoi uu qcghsd ue qce dob; olrm hw xem zwsxce tyefzibpmu kqiwx hwsaj pmacajk qjrtxyrd tgca wtpf strnamdcagn phf whjrawx, bnltphg cat latn-ock.txt \ | tr 'A-Z' 'a-z' \ | tr -c -d ' a-z\012' \ | vigenere -v key=qotchy \ | head -5 twlepnkznu xsebbct giht sbgdht czqufmkv tyo fgkgq wgvlp qzbcz okomwvp ybvgkcdrh ebk gitnpzuh bnsyhif ku okwuwzbqa vquaefwca cj wg csggibdbq twleycfokg kgwbhujgjik kwqqa jwv yt oekhq fokvlq uwnu lvgibupru rbujsjskg lryof csgqg jwvbqafqkm fskvyyshttl nhciqum yrxq kc ydlc kgvtnul ybwjwhljiewt cat latn-ock.txt \ | tr 'A-Z' 'a-z' \ | tr -c -d ' a-z\012' \ | vigenere -v key=thequickpuceopossumjumpsoverthelazycapybara \ | head -5 wpwscxwvjm syccwse cecu cjaboe rlzicthm xip kfdza buxul inspm syoiigj czlyptfrj glf xyllhzgt xjmaiuf pr goqdehxcq qdbuglpjn qi ab vpzjbmmur bksrpfprrx kmwhwumxnwv wegse kgx up pdwvw gtyxps dgws tvruzsbai tcaeeiyti siwse uxrue fmcyedfvhz pdpvrpauaie iysfivq ssyq hs xdks xuozghw ogmhnhreukso foreach lang ( engl latn ) cat ${lang}-mix.wds \ | vigenere -v key=qotchy \ > ${lang}-v06.wds cat ${lang}-mix.wds \ 
        | vigenere -v key=thequickpuceopossumjumpsoverthelazycapybara \
        > ${lang}-v43.wds
    end
    dicio-wc {engl,latn}-{v06,v43}.wds

  Now a version of engl-v43.wds without the 1- and 2-letter words:

    cat engl-v43.wds \
      | grep '...' \
      > engl-vns.wds

---------------------------------------------------------
OK, here it is all:

    dicio-wc ????-???.wds

     lines   words     bytes file
    ------ ------- --------- ------------
      3743    3743     18124 chin-mch.wds
      7507    7507     39983 engl-mix.wds
      7431    7431     39141 engl-poi.wds
      7507    7507     39983 engl-v06.wds
      7507    7507     39983 engl-v43.wds
      5707    5707     34358 engl-vns.wds
      7508    7508     40808 engl-wow.wds
      7528    7528     52005 latn-bel.wds
      7546    7546     52321 latn-mix.wds
      7500    7500     52621 latn-ock.wds
      7546    7546     52321 latn-v06.wds
      7546    7546     52321 latn-v43.wds
      6182    6182     35751 vbio-enq.wds
      6182    6182     37279 vbio-eva.wds
      6182    6182     35751 vbio-qkt.wds
      7812    7812     45130 vhea-enq.wds
      7812    7812     45838 vhea-eva.wds
      7812    7812     45130 vhea-qkt.wds
      3223    3223     18996 vheb-enq.wds
      3223    3223     19326 vheb-eva.wds
      3223    3223     18996 vheb-qkt.wds
      6696    6696     38681 vmix-enq.wds
      6696    6696     39871 vmix-eva.wds
      6696    6696     38681 vmix-qkt.wds
      7460    7460     43503 vren-enq.wds
      7460    7460     44670 vren-eva.wds
      7460    7460     43503 vren-qkt.wds

---------------------------------------------------------
Let's compute the frequency distributions:

    foreach f ( ????-???.wds )
      echo $f
      cat ${f} \
        | sort | uniq -c | expand \
        | sort +0 -1nr \
        | compute-freqs \
        | head -50 \
        > ${f:r}.frq
      cat ${f:r}.frq \
        | gawk '//{printf "%7.5f %s\n", $2, $3;}' \
        > ${f:r}.pct
    end

---------------------------------------------------------
Comparing different samples of English:

    multicol -v lines=30 {engl-poi,engl-wow,engl-mix}.pct
    plot-word-freqs {engl-poi,engl-wow,engl-mix}.frq > .engl.gif
    xv .engl.gif &

    Poirot novel        Wells's WotW        Mixture
    -------------       -------------       -------------
    0.0444 the          0.0773 the          0.0613 the
    0.0296 i            0.0465 and          0.0338 and
    0.0240 to           0.0384 of           0.0305 of
    0.0221 and          0.0264 a            0.0272 i
    0.0218 a            0.0194 to           0.0258 a
    0.0213 of           0.0190 i            0.0220 to
    0.0201 it           0.0152 in           0.0156 it
    0.0191 that         0.0135 was          0.0152 that
    0.0183 was          0.0108 had          0.0147 was
    0.0157 in           0.0104 it           0.0144 in
    0.0151 you          0.0096 that         0.0095 he
    0.0112 he           0.0085 my           0.0085 had
    0.0098 not          0.0080 he           0.0085 my
    0.0094 is           0.0079 at           0.0085 you
    0.0092 had          0.0077 were         0.0075 not
    0.0087 but          0.0073 as           0.0071 as
    0.0087 she          0.0069 with         0.0068 at
    0.0085 her          0.0059 from         0.0064 his
    0.0078 his          0.0055 for          0.0061 me
    0.0078 my           0.0052 me           0.0059 is
    0.0078 said         0.0049 they         0.0057 but
    0.0077 as           0.0048 we           0.0056 on
    0.0074 have         0.0047 there        0.0056 were
    0.0073 poirot       0.0045 this         0.0056 with
    0.0065 me           0.0043 his          0.0055 have
    0.0058 on           0.0043 on           0.0053 her
    0.0058 with         0.0043 their        0.0052 she
    0.0051 for          0.0040 but          0.0051 we
    0.0050 at           0.0039 by           0.0049 for
    0.0050 mrs          0.0037 out          0.0043 from

  We can see that

    (*) The most common word ("the") occurs at 4.4% in Poirot,
        7.7% in Wells's.

    (*) Beyond "the", the ranking of the most common words is
        quite different in the two samples.

    (*) The Wells text has an almost perfect Zipf-like distribution.

    (*) The Poirot text is a bit non-Zipfian: words 3-8 are an almost
        flat plateau, there is a large drop around word 10, then
        another nearly flat region until about word 20.

    (*) The mixed text is intermediate; words 2-5 are somewhat flat,
        words 7-10 are a small plateau, and ditto for words 13-15.

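
  For readers without the compute-freqs and plot-word-freqs scripts,
  the following Python sketch shows one way to get the same kind of
  ranked relative-frequency list, plus a single number for
  "Zipf-likeness": the least-squares slope of log(frequency) against
  log(rank), which is close to -1 for an ideal 1/rank law. Both
  functions are illustrative stand-ins with made-up names; the slope
  is my own choice of summary, not something the original scripts are
  known to compute.

```python
from collections import Counter
import math

def rank_frequencies(words):
    """Return [(relative_frequency, word), ...] sorted by decreasing frequency."""
    counts = Counter(words)
    total = sum(counts.values())
    # Ties are broken alphabetically so that the output is deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(count / total, word) for word, count in ranked]

def zipf_slope(freqs):
    """Least-squares slope of log(freq) versus log(rank); about -1 for a 1/rank law."""
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx

table = rank_frequencies("a b a c a b".split())
for freq, word in table:
    print("%7.5f %s" % (freq, word))
```

  A slope well above -1 (that is, closer to 0) over the first few
  dozen ranks corresponds to the flat "plateaus" discussed in the
  observations.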
---------------------------------------------------------
Comparing different samples of Latin:

    multicol -v lines=30 {latn-ock,latn-bel,latn-mix}.pct
    plot-word-freqs {latn-ock,latn-bel,latn-mix}.frq > .latn.gif
    xv .latn.gif &

    Ockham's book       Caesar's DBG        Mixture
    ------------------  -----------------   ------------------
    0.0412 et           0.0247 et           0.0329 et
    0.0228 in           0.0224 in           0.0241 in
    0.0213 non          0.0146 quod         0.0183 quod
    0.0204 quod         0.0130 ad           0.0158 non
    0.0195 est          0.0112 non          0.0144 ad
    0.0133 ad           0.0109 cum          0.0125 est
    0.0088 quam         0.0108 se           0.0087 ut
    0.0084 ut           0.0088 ut           0.0080 qui
    0.0084 vel          0.0085 qui          0.0077 quam
    0.0076 qui          0.0082 esse         0.0076 cum
    0.0075 si           0.0072 ex           0.0069 se
    0.0071 quia         0.0070 a            0.0068 ex
    0.0069 quae         0.0060 neque        0.0068 si
    0.0067 propter      0.0060 quam         0.0062 de
    0.0067 sed          0.0057 eo           0.0060 esse
    0.0065 plures       0.0056 atque        0.0056 a
    0.0063 autem        0.0056 si           0.0050 quae
    0.0057 expedit      0.0053 caesar       0.0049 ab
    0.0057 principatus  0.0052 est          0.0049 per
    0.0056 de           0.0050 ab           0.0045 vel
    0.0055 secundum     0.0049 sibi         0.0038 etiam
    0.0049 fidelium     0.0045 aut          0.0038 quia
    0.0048 etiam        0.0045 de           0.0038 sed
    0.0048 unus         0.0044 eius         0.0037 plures
    0.0047 ex           0.0041 ne           0.0037 propter
    0.0047 hoc          0.0039 his          0.0034 neque
    0.0045 per          0.0039 per          0.0034 principatus
    0.0043 unum         0.0037 quae         0.0034 sibi
    0.0041 esse         0.0037 romani       0.0034 sunt
    0.0041 sunt         0.0036 ea           0.0033 hoc

  We can see that

    (*) The most common word ("et") occurs at 2.5% in Caesar's text
        and 4.1% in Ockham's.

    (*) The Caesar sample is mostly Zipf-like after the first 15
        words; before that it drops a bit more slowly than 1/x.

    (*) The Ockham sample has a plateau at words 2-5, then a sharp
        drop, another plateau at words 7-10, etc.

    (*) The mixed text looks the most Zipf-like of the three.

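
  The "plateaus" mentioned in these observations were judged by eye
  from the plots. As a rough cross-check, the Python sketch below
  flags runs of consecutive ranks in which each frequency stays within
  a given tolerance of the previous one. The 10% tolerance and the
  minimum run length of 3 are arbitrary assumptions of mine, chosen
  only for illustration.

```python
def find_plateaus(freqs, tol=0.10, min_len=3):
    """Return 1-based (first_rank, last_rank) runs where each frequency
    is at least (1 - tol) times the one before it."""
    plateaus = []
    start = 0  # 0-based index where the current run began
    for i in range(1, len(freqs) + 1):
        flat = i < len(freqs) and freqs[i] >= (1 - tol) * freqs[i - 1]
        if not flat:
            if i - start >= min_len:
                plateaus.append((start + 1, i))
            start = i
    return plateaus

# First ten frequencies of the Ockham column in the table above.
ockham_top10 = [0.0412, 0.0228, 0.0213, 0.0204, 0.0195,
                0.0133, 0.0088, 0.0084, 0.0084, 0.0076]
print(find_plateaus(ockham_top10))
```

  With these settings the Ockham figures give runs at ranks 2-5 and
  7-10, which agrees with the visual reading above. Other tolerances
  will of course move the boundaries.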
---------------------------------------------------------
Comparing different versions of Voynichese Herbal-A:

    multicol -v lines=30 {vhea-eva,vhea-enq,vhea-qkt}.pct
    plot-word-freqs {vhea-eva,vhea-enq,vhea-qkt}.frq > .vhea.gif
    xv .vhea.gif &

    Herbal-A            without "q"         also "k"="t"
    -------------       -------------       -------------
    0.0527 daiin        0.0527 daiin        0.0527 daiin
    0.0283 chol         0.0283 chol         0.0283 chol
    0.0192 chor         0.0192 chor         0.0192 chor
    0.0125 shol         0.0125 shol         0.0169 okchy
    0.0123 cthy         0.0123 cthy         0.0160 ckhy
    0.0122 chy          0.0122 chy          0.0159 oky
    0.0118 sho          0.0118 sho          0.0125 shol
    0.0113 dy           0.0113 dy           0.0122 chy
    0.0111 s            0.0111 s            0.0118 sho
    0.0095 dain         0.0100 otchy        0.0114 okol
    0.0091 dar          0.0095 dain         0.0113 dy
    0.0081 shor         0.0091 dar          0.0111 s
    0.0078 shy          0.0084 oty          0.0104 okaiin
    0.0070 chey         0.0081 shor         0.0095 dain
    0.0067 or           0.0078 shy          0.0091 dar
    0.0065 cthol        0.0074 oky          0.0084 ckhol
    0.0063 qotchy       0.0072 or           0.0081 shor
    0.0060 dal          0.0070 chey         0.0078 shy
    0.0058 ol           0.0069 okchy        0.0076 okchol
    0.0054 cthor        0.0065 cthol        0.0072 or
    0.0051 dol          0.0060 dal          0.0070 chey
    0.0050 qokchy       0.0059 ol           0.0068 kchy
    0.0050 shey         0.0059 otol         0.0067 ckhor
    0.0049 oty          0.0055 okaiin       0.0067 okchor
    0.0044 chaiin       0.0055 okol         0.0063 okor
    0.0044 cthey        0.0054 cthor        0.0060 dal
    0.0044 dam          0.0051 dol          0.0059 ckhey
    0.0044 dor          0.0050 shey         0.0059 ol
    0.0042 cheor        0.0049 otaiin       0.0051 dol
    0.0042 oky          0.0047 otchol       0.0050 shey

  Some observations:

    (*) The Herbal-A sample has a "natural" (5.3%) frequency for the
        first word, and is quite Zipf-like after the first 10 words
        or so.

    (*) Words 4 through 9 are sort of a plateau.

    (*) Deleting the "q"s has little effect on the histogram.

    (*) Equating "k" and "t" has no effect on the first three
        entries, but widens the plateau to entries 4-12 (except for
        a Zipf-like step between words 6 and 7).

  Conclusion: for Herbal-A, removing the "q"s is optional, and
  equating "k" and "t" may be detrimental to its Zipfness.

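
  The two letter substitutions compared in this table are simple
  enough to restate in Python. The sketch below mirrors what the sed
  commands in the sample-preparation section do: drop a word-initial
  "q", and replace every "t" by "k". The helper names, including
  remap_counts, are illustrative and not part of the original scripts.

```python
import re
from collections import Counter

def strip_initial_q(word):
    """Remove one leading 'q', like sed 's/^q//'."""
    return re.sub(r"^q", "", word)

def merge_t_into_k(word):
    """Replace every 't' by 'k', like sed 's/t/k/g'."""
    return word.replace("t", "k")

def remap_counts(counts, fn):
    """Re-tally word counts after applying `fn` to every word."""
    out = Counter()
    for word, n in counts.items():
        out[fn(word)] += n
    return out

counts = Counter({"qotchy": 5, "okchy": 4, "otchy": 3, "cthy": 10})
no_q = remap_counts(counts, strip_initial_q)
merged = remap_counts(no_q, merge_t_into_k)
print(sorted(merged.items()))
```

  Note that the "t" to "k" replacement also turns the "cth" group
  into "ckh", which is why "cthy" shows up as "ckhy" in the third
  column of the table. The counts in the example are made up.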
---------------------------------------------------------
Comparing different versions of Voynichese Biological:

    multicol -v lines=30 {vbio-eva,vbio-enq,vbio-qkt}.pct
    plot-word-freqs {vbio-eva,vbio-enq,vbio-qkt}.frq > .vbio.gif
    xv .vbio.gif &

    Biological          without "q"         also "k"="t"
    --------------      -------------       --------------
    0.0369 shedy        0.0505 ol           0.0505 ol
    0.0312 chedy        0.0382 okaiin       0.0495 okaiin
    0.0301 ol           0.0369 shedy        0.0469 okedy
    0.0298 qokaiin      0.0315 okedy        0.0369 okeedy
    0.0254 qokedy       0.0314 chedy        0.0369 shedy
    0.0231 qokeedy      0.0278 okeedy       0.0314 chedy
    0.0204 qol          0.0201 okal         0.0283 okal
    0.0171 daiin        0.0171 daiin        0.0175 okeey
    0.0165 qokal        0.0154 otedy        0.0175 oky
    0.0136 chey         0.0144 okeey        0.0171 daiin
    0.0134 shey         0.0137 chey         0.0149 okar
    0.0118 qokeey       0.0134 shey         0.0137 chey
    0.0115 dal          0.0115 dal          0.0134 shey
    0.0104 dar          0.0113 otaiin       0.0115 dal
    0.0095 qoky         0.0108 oky          0.0113 okain
    0.0091 saiin        0.0104 dar          0.0110 okey
    0.0089 or           0.0102 or           0.0104 dar
    0.0084 okaiin       0.0097 okain        0.0102 or
    0.0082 qokain       0.0091 oteedy       0.0091 saiin
    0.0081 lchedy       0.0091 saiin        0.0089 chckhy
    0.0081 qotedy       0.0086 okar         0.0081 lchedy
    0.0081 sol          0.0082 otal         0.0081 sol
    0.0078 dy           0.0081 lchedy       0.0078 dy
    0.0073 otedy        0.0081 sol          0.0076 kedy
    0.0063 cheey        0.0078 dy           0.0070 okol
    0.0063 qokar        0.0076 okey         0.0063 cheey
    0.0063 sheedy       0.0066 oty          0.0063 sheedy
    0.0061 okedy        0.0063 cheey        0.0058 aiin
    0.0061 otaiin       0.0063 otar         0.0058 shckhy
    0.0060 qokey        0.0063 sheedy       0.0055 checkhy

  Random observations:

    (*) The raw distribution of the Biological words is practically
        flat for the first four entries ("shedy", "chedy", "ol",
        "qokaiin"), then gradually becomes quite Zipf-like.

    (*) Removing the "q"s has a drastic effect; "ol" and "okaiin"
        overtake "shedy". The max frequency becomes 5%, the plateau
        widens to words 1-5, but the distribution then becomes a bit
        more Zipf-like.

    (*) Equating "k" and "t" completely flattens the first three
        entries, and affects most frequencies below the maximum.
        There appears a large jump after word 7, and plateaus in
        4-5, 8-10, 12-13, etc.

    (*) We should try removing "q" and equating "ch" and "sh", while
        leaving "k" and "t" distinct. The max frequency ("chedy")
        would become about 6.8% (3.69% + 3.14%), comparable only to
        Chinese (6.5%) and the Wells sample (7.7%). The second entry
        would be "ol" at 5%; a bit too high for Zipf, but perhaps
        that can be reduced by fixing word-break transcription
        errors.

  Conclusion: for Bio-B, there are weak reasons for deleting "q", and
  weak reasons for NOT equating "k" and "t".

  Herbal-B and other versions are analyzed further on.

---------------------------------------------------------
Comparing English, Latin, Chinese, and Voynichese:

    multicol -v lines=30 {vhea-enq,vbio-enq,engl-wow,latn-bel,chin-mch}.pct
    plot-word-freqs \
      engl-wow.frq:1 latn-bel.frq:5 chin-mch.frq:3 \
      vhea-enq.frq:2 vbio-enq.frq:6 > .natu.gif
    xv .natu.gif &

    Herbal-A        Biological      English/WotW    Latin/DBG       Chinese/Mch
    -------------   -------------   ------------    -------------   -------------
    0.0527 daiin    0.0505 ol       0.0773 the      0.0247 et       0.0647 de
    0.0283 chol     0.0382 okaiin   0.0465 and      0.0224 in       0.0310 shi4
    0.0192 chor     0.0369 shedy    0.0384 of       0.0146 quod     0.0206 ren2
    0.0125 shol     0.0315 okedy    0.0264 a        0.0130 ad       0.0163 ta1
    0.0123 cthy     0.0314 chedy    0.0194 to       0.0112 non      0.0163 you3
    0.0122 chy      0.0278 okeedy   0.0190 i        0.0109 cum      0.0144 xue2
    0.0118 sho      0.0201 okal     0.0152 in       0.0108 se       0.0142 wen2
    0.0113 dy       0.0171 daiin    0.0135 was      0.0088 ut       0.0134 shi2
    0.0111 s        0.0154 otedy    0.0108 had      0.0085 qui      0.0131 zai4
    0.0100 otchy    0.0144 okeey    0.0104 it       0.0082 esse     0.0112 guo2
    0.0095 dain     0.0137 chey     0.0096 that     0.0072 ex       0.0110 yi2
    0.0091 dar      0.0134 shey     0.0085 my       0.0070 a        0.0107 yi4
    0.0084 oty      0.0115 dal      0.0080 he       0.0060 neque    0.0099 le
    0.0081 shor     0.0113 otaiin   0.0079 at       0.0060 quam     0.0091 shuo1
    0.0078 shy      0.0108 oky      0.0077 were     0.0057 eo       0.0088 bu4
    0.0074 oky      0.0104 dar      0.0073 as       0.0056 atque    0.0088 ge
    0.0072 or       0.0102 or       0.0069 with     0.0056 si       0.0085 shi
    0.0070 chey     0.0097 okain    0.0059 from     0.0053 caesar   0.0083 dao4
    0.0069 okchy    0.0091 oteedy   0.0055 for      0.0052 est      0.0083 jia1
    0.0065 cthol    0.0091 saiin    0.0052 me       0.0050 ab       0.0075 sheng1
    0.0060 dal      0.0086 okar     0.0049 they     0.0049 sibi     0.0069 bu2
    0.0059 ol       0.0082 otal     0.0048 we       0.0045 aut      0.0069 duo1
    0.0059 otol     0.0081 lchedy   0.0047 there    0.0045 de       0.0069 jiu4
    0.0055 okaiin   0.0081 sol      0.0045 this     0.0044 eius     0.0067 hen3
    0.0055 okol     0.0078 dy       0.0043 his      0.0041 ne       0.0067 nian2
    0.0054 cthor    0.0076 okey     0.0043 on       0.0039 his      0.0064 ye3
    0.0051 dol      0.0066 oty      0.0043 their    0.0039 per      0.0064 yi1
    0.0050 shey     0.0063 cheey    0.0040 but      0.0037 quae     0.0061 mei3
    0.0049 otaiin   0.0063 otar     0.0039 by       0.0037 romani   0.0061 neng2
    0.0047 otchol   0.0063 sheedy   0.0037 out      0.0036 ea       0.0061 yu3

  Random observations:

    (*) The Herbal-A and Biological samples have a max frequency of
        about 5%, within the range spanned by Latin (2.5%), Chinese
        (6.5%), and English (7.7%).

    (*) The Voynichese samples have Zipfness comparable to that of
        the other languages.

---------------------------------------------------------
The effect of Vigenère encoding:

    multicol -v lines=30 {engl-v06,latn-v06,engl-v43,latn-v43,engl-vns}.pct
    plot-word-freqs \
      engl-v06.frq:1 engl-v43.frq:2 engl-vns.frq:3 \
      latn-v06.frq:5 latn-v43.frq:6 \
      > .gene.gif
    xv .gene.gif &

    English,        Latin,          English,        Latin,          English,
    6-letter        6-letter        43-letter       43-letter       43-letter
    Vigenère        Vigenère        Vigenère        Vigenère        Vigenère
                                                                    (engl-vns)
    ----------      -----------     ----------      -----------     -----------
    0.0113 rxs      0.0064 uh       0.0049 a        0.0021 cv       0.0039 moi
    0.0107 voc      0.0060 cj       0.0045 c        0.0021 yf       0.0023 xse
    0.0105 jvx      0.0056 lr       0.0040 u        0.0019 dr       0.0021 hws
    0.0103 hag      0.0052 sm       0.0037 m        0.0016 em       0.0021 kal
    0.0099 afu      0.0052 xv       0.0036 i        0.0016 gt       0.0021 mwx
    0.0087 mjl      0.0050 yb       0.0036 p        0.0016 ie       0.0021 ntt
    0.0087 y        0.0049 ga       0.0032 moi      0.0016 vt       0.0021 ucs
    0.0083 hh       0.0046 bp       0.0032 y        0.0016 xa       0.0019 izs
    0.0067 qm       0.0045 pl       0.0027 h        0.0015 ah       0.0019 jbm
    0.0065 hlt      0.0042 wg       0.0027 sfg      0.0015 cu       0.0019 tal
    0.0061 qbw      0.0041 xser     0.0025 kal      0.0015 cz       0.0018 alp
    0.0061 tpk      0.0040 ow       0.0023 npg      0.0015 gx       0.0018 olv
    0.0056 q        0.0036 gqu      0.0023 olv      0.0015 kn       0.0018 tye
    0.0055 o        0.0034 gd       0.0023 qp       0.0015 ui       0.0016 alu
    0.0053 cub      0.0033 gzr      0.0023 vht      0.0013 pr       0.0016 hci
    0.0053 k        0.0033 ku       0.0021 ntn      0.0013 tn       0.0016 hzw
    0.0052 ogf      0.0032 gihf     0.0020 b        0.0013 yv       0.0016 lbq
    0.0051 et       0.0029 bhp      0.0020 k        0.0012 jn       0.0016 tgc
    0.0049 g        0.0029 hb       0.0020 nji      0.0012 sv       0.0014 ahh
    0.0048 c        0.0029 jwvb     0.0020 o        0.0012 wl       0.0014 ehd
    0.0048 cy       0.0029 okcw     0.0020 s        0.0012 wn       0.0014 iff
    0.0048 mv       0.0029 umd      0.0020 t        0.0011 au       0.0014 ntn
    0.0045 b        0.0028 sbmt     0.0020 x        0.0011 e        0.0014 phw
    0.0045 ydr      0.0025 cih      0.0020 xse      0.0011 et       0.0014 rjai
    0.0043 re       0.0024 enqk     0.0019 dwy      0.0011 huhk     0.0014 uph
    0.0043 vd       0.0024 leb      0.0019 ir       0.0011 ih       0.0014 vht
    0.0039 mq       0.0024 qr       0.0019 r        0.0011 ik       0.0014 vls
    0.0039 vv       0.0023 kh       0.0019 rje      0.0011 mv       0.0014 vrt
    0.0039 w        0.0023 lqj      0.0017 bq       0.0011 pt       0.0014 xyx
    0.0036 h        0.0023 slv      0.0017 cbq      0.0011 qlow     0.0012 eew

  Some observations:

    (*) The most common word has frequency between 0.2% and 1.1%,
        much less than in natural languages.

    (*) The distribution is very non-Zipfian: mostly flat for the
        first 30 words or so, with several multiword plateaus, then
        begins to decrease, but still not as fast as 1/i (where i is
        the rank).

    (*) More importantly, the most common words are short, and the
        plateaus among words 1-50 are largely associated with words
        of the same length.

---------------------------------------------------------
Comparing languages A, B and whatnot:

    multicol -v lines=30 {vhea,vbio,vheb,vmix,vren}-eva.pct
    plot-word-freqs \
      vhea-eva.frq:2 vbio-eva.frq:4 \
      vheb-eva.frq:1 vmix-eva.frq:5 vren-eva.frq:3 \
      > .veva.gif
    xv .veva.gif &

    Herbal-A        Biological      Herbal-B        HerA/Bio mix    Rene's list
    vhea-eva.pct    vbio-eva.pct    vheb-eva.pct    vmix-eva.pct    vren-eva.pct
    -------------   --------------  --------------  --------------  --------------
    0.0527 daiin    0.0369 shedy    0.0267 daiin    0.0329 daiin    0.0275 daiin
    0.0283 chol     0.0312 chedy    0.0186 chedy    0.0200 shedy    0.0142 chedy
    0.0192 chor     0.0301 ol       0.0171 or       0.0188 qokaiin  0.0130 ol
    0.0125 shol     0.0298 qokaiin  0.0158 chdy     0.0179 ol       0.0127 shedy
    0.0123 cthy     0.0254 qokedy   0.0155 dar      0.0172 chedy    0.0114 chey
    0.0122 chy      0.0231 qokeedy  0.0127 qokedy   0.0142 chol     0.0107 ar
    0.0118 sho      0.0204 qol      0.0124 aiin     0.0140 qokedy   0.0099 chol
    0.0113 dy       0.0171 daiin    0.0121 ar       0.0122 qol      0.0099 dar
    0.0111 s        0.0165 qokal    0.0109 chckhy   0.0121 qokeedy  0.0098 qokedy
    0.0095 dain     0.0136 chey     0.0109 okaiin   0.0108 dar      0.0098 qokeedy
    0.0091 dar      0.0134 shey     0.0109 shedy    0.0102 chor     0.0095 qokain
    0.0081 shor     0.0118 qokeey   0.0096 ol       0.0100 chey     0.0094 qokeey
    0.0078 shy      0.0115 dal      0.0090 okar     0.0096 shey     0.0088 qokaiin
    0.0070 chey     0.0104 dar      0.0087 dy       0.0093 dal      0.0086 aiin
    0.0067 or       0.0095 qoky     0.0084 qokar    0.0091 dy       0.0084 or
    0.0065 cthol    0.0091 saiin    0.0078 dal      0.0087 qokal    0.0076 shey
    0.0063 qotchy   0.0089 or       0.0078 okedy    0.0079 shol     0.0074 okaiin
    0.0060 dal      0.0084 okaiin   0.0078 saiin    0.0076 chy      0.0071 dain
    0.0058 ol       0.0082 qokain   0.0071 okal     0.0076 or       0.0071 dal
    0.0054 cthor    0.0081 lchedy   0.0062 qokaiin  0.0073 qoky     0.0068 s
    0.0051 dol      0.0081 qotedy   0.0059 cheky    0.0072 dain     0.0062 cheol
    0.0050 qokchy   0.0081 sol      0.0056 kar      0.0070 qokeey   0.0062 qokal
    0.0050 shey     0.0078 dy       0.0056 otedy    0.0069 s        0.0058 chckhy
    0.0049 oty      0.0073 otedy    0.0053 dam      0.0066 saiin    0.0058 cheey
    0.0044 chaiin   0.0063 cheey    0.0053 oky      0.0054 cthy     0.0055 otaiin
    0.0044 cthey    0.0063 qokar    0.0050 okeedy   0.0052 sol      0.0054 al
    0.0044 dam      0.0063 sheedy   0.0050 otar     0.0051 otaiin   0.0054 shol
    0.0044 dor      0.0061 okedy    0.0050 shdy     0.0049 dol      0.0051 chor
    0.0042 cheor    0.0061 otaiin   0.0047 chey     0.0049 okaiin   0.0048 okeey
    0.0042 oky      0.0060 qokey    0.0047 kchdy    0.0048 lchedy   0.0048 saiin

  Some observations (mostly confirming those in Rene's article):

    (*) Note that Herbal-A and Biological have widely different
        vocabularies. Among the first 10 words in each list, only
        "daiin" is shared; and its frequencies are 5.3% against 1.7%.

    (*) The word frequency plots of Herbal-A (vhea-eva, blue) and
        Biological (vbio-eva, magenta) are also quite different.

    (*) On the other hand, the graphs of Herbal-B (vheb-eva, red) and
        of the mixture of 40% Herbal-A and 60% Biological (vmix-eva,
        maroon) are surprisingly similar!

    (*) However, even though the sorted frequencies of vheb-eva and
        vmix-eva are similar, the corresponding words are quite
        different.

  Let's redo the comparison after omitting the "q"s:

    multicol -v lines=30 {vhea,vbio,vheb,vmix,vren}-enq.pct
    plot-word-freqs \
      vhea-enq.frq:2 vbio-enq.frq:4 \
      vheb-enq.frq:1 vmix-enq.frq:5 vren-enq.frq:3 \
      > .venq.gif
    xv .venq.gif &

    Herbal-A        Biological      Herbal-B        HerA/Bio mix    Rene's list
    vhea-enq.pct    vbio-enq.pct    vheb-enq.pct    vmix-enq.pct    vren-enq.pct
    -------------   -------------   -------------   -------------   -------------
    0.0527 daiin    0.0505 ol       0.0267 daiin    0.0329 daiin    0.0275 daiin
    0.0283 chol     0.0382 okaiin   0.0205 okedy    0.0302 ol       0.0165 ol
    0.0192 chor     0.0369 shedy    0.0186 chedy    0.0237 okaiin   0.0162 okaiin
    0.0125 shol     0.0315 okedy    0.0174 okar     0.0200 shedy    0.0142 chedy
    0.0123 cthy     0.0314 chedy    0.0171 okaiin   0.0181 okedy    0.0142 okeey
    0.0122 chy      0.0278 okeedy   0.0171 or       0.0173 chedy    0.0137 okedy
    0.0118 sho      0.0201 okal     0.0158 chdy     0.0151 okeedy   0.0135 okeedy
    0.0113 dy       0.0171 daiin    0.0155 dar      0.0142 chol     0.0133 okain
    0.0111 s        0.0154 otedy    0.0124 aiin     0.0114 okal     0.0127 shedy
    0.0100 otchy    0.0144 okeey    0.0121 ar       0.0108 dar      0.0114 chey
    0.0095 dain     0.0137 chey     0.0109 chckhy   0.0102 chor     0.0109 ar
    0.0091 dar      0.0134 shey     0.0109 shedy    0.0100 chey     0.0102 okal
    0.0084 oty      0.0115 dal      0.0102 okal     0.0096 shey     0.0099 chol
    0.0081 shor     0.0113 otaiin   0.0099 ol       0.0093 dal      0.0099 dar
    0.0078 shy      0.0108 oky      0.0093 otedy    0.0093 oky      0.0092 okar
    0.0074 oky      0.0104 dar      0.0087 dy       0.0091 dy       0.0090 otaiin
    0.0072 or       0.0102 or       0.0087 oky      0.0090 or       0.0088 or
    0.0070 chey     0.0097 okain    0.0084 okeedy   0.0085 okeey    0.0086 aiin
    0.0069 okchy    0.0091 oteedy   0.0078 dal      0.0084 otaiin   0.0076 otedy
    0.0065 cthol    0.0091 saiin    0.0078 saiin    0.0082 otedy    0.0076 shey
    0.0060 dal      0.0086 okar     0.0071 otar     0.0079 shol     0.0071 dain
    0.0059 ol       0.0082 otal     0.0062 okchdy   0.0076 chy      0.0071 dal
    0.0059 otol     0.0081 lchedy   0.0059 cheky    0.0072 dain     0.0070 oteedy
    0.0055 okaiin   0.0081 sol      0.0056 kar      0.0069 s        0.0068 oky
    0.0055 okol     0.0078 dy       0.0053 dam      0.0067 oty      0.0068 s
    0.0054 cthor    0.0076 okey     0.0053 otal     0.0066 saiin    0.0067 otar
    0.0051 dol      0.0066 oty      0.0050 ody      0.0057 okain    0.0064 oteey
    0.0050 shey     0.0063 cheey    0.0050 okeey    0.0057 otal     0.0064 oty
    0.0049 otaiin   0.0063 otar     0.0050 shdy     0.0055 cthy     0.0062 cheol
    0.0047 otchol   0.0063 sheedy   0.0047 chey     0.0054 okey     0.0058 chckhy

  Observations:

    (*) Again, the Herbal-A and Biological vocabularies and
        distributions are quite different.

    (*) Again, the Herbal-B plot resembles very much that of the
        Herbal-A/Biological mixture, and the vocabularies now do have
        some resemblance.

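
  Statements such as "among the first 10 words, only daiin is shared"
  are easy to check mechanically. The Python sketch below lists the
  words common to the top n entries of two ranked lists. The function
  name and the choice n=10 are illustrative assumptions, and the word
  lists are copied from the first ten rows of the tables above.

```python
def shared_top_words(ranked_a, ranked_b, n=10):
    """Words found among the first n entries of both lists, in the order of ranked_a."""
    top_b = set(ranked_b[:n])
    return [w for w in ranked_a[:n] if w in top_b]

vhea_eva = ["daiin", "chol", "chor", "shol", "cthy",
            "chy", "sho", "dy", "s", "dain"]
vbio_eva = ["shedy", "chedy", "ol", "qokaiin", "qokedy",
            "qokeedy", "qol", "daiin", "qokal", "chey"]
vheb_enq = ["daiin", "okedy", "chedy", "okar", "okaiin",
            "or", "chdy", "dar", "aiin", "ar"]
vmix_enq = ["daiin", "ol", "okaiin", "shedy", "okedy",
            "chedy", "okeedy", "chol", "okal", "dar"]

print(shared_top_words(vhea_eva, vbio_eva))
print(shared_top_words(vheb_enq, vmix_enq))
```

  A count of shared words says nothing about how close their ranks
  or frequencies are, so it is only a crude measure of vocabulary
  resemblance.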
  Same, with "q" removed and "t" mapped to "k":

    multicol -v lines=30 {vhea,vbio,vheb,vmix,vren}-qkt.pct
    plot-word-freqs \
      vhea-qkt.frq:2 vbio-qkt.frq:4 \
      vheb-qkt.frq:1 vmix-qkt.frq:5 vren-qkt.frq:3 \
      > .vqkt.gif
    xv .vqkt.gif &

    Herbal-A        Biological      Herbal-B        HerA/Bio mix    Rene's list
    vhea-qkt.pct    vbio-qkt.pct    vheb-qkt.pct    vmix-qkt.pct    vren-qkt.pct
    -------------   --------------  -------------   -------------   -------------
    0.0527 daiin    0.0505 ol       0.0298 okedy    0.0329 daiin    0.0275 daiin
    0.0283 chol     0.0495 okaiin   0.0267 daiin    0.0321 okaiin   0.0252 okaiin
    0.0192 chor     0.0469 okedy    0.0245 okar     0.0302 ol       0.0213 okedy
    0.0169 okchy    0.0369 okeedy   0.0217 okaiin   0.0263 okedy    0.0206 okeey
    0.0160 ckhy     0.0369 shedy    0.0186 chedy    0.0200 shedy    0.0205 okeedy
    0.0159 oky      0.0314 chedy    0.0171 or       0.0193 okeedy   0.0172 okain
    0.0125 shol     0.0283 okal     0.0158 chdy     0.0173 chedy    0.0165 ol
    0.0122 chy      0.0175 okeey    0.0155 dar      0.0170 okal     0.0160 okar
    0.0118 sho      0.0175 oky      0.0155 okal     0.0160 oky      0.0154 okal
    0.0114 okol     0.0171 daiin    0.0133 chckhy   0.0142 chol     0.0142 chedy
    0.0113 dy       0.0149 okar     0.0133 oky      0.0114 okeey    0.0133 oky
    0.0111 s        0.0137 chey     0.0124 aiin     0.0108 dar      0.0127 shedy
    0.0104 okaiin   0.0134 shey     0.0121 ar       0.0102 chor     0.0114 chey
    0.0095 dain     0.0115 dal      0.0109 shedy    0.0100 chey     0.0109 ar
    0.0091 dar      0.0113 okain    0.0099 okchdy   0.0096 shey     0.0099 chol
    0.0084 ckhol    0.0110 okey     0.0099 okeedy   0.0093 dal      0.0099 dar
    0.0081 shor     0.0104 dar      0.0099 ol       0.0093 okchy    0.0097 okol
    0.0078 shy      0.0102 or       0.0087 dy       0.0091 dy       0.0088 or
    0.0076 okchol   0.0091 saiin    0.0084 kar      0.0091 okol     0.0086 aiin
    0.0072 or       0.0089 chckhy   0.0078 dal      0.0090 or       0.0084 okey
    0.0070 chey     0.0081 lchedy   0.0078 saiin    0.0081 okar     0.0083 chckhy
    0.0068 kchy     0.0081 sol      0.0074 kedy     0.0079 shol     0.0076 shey
    0.0067 ckhor    0.0078 dy       0.0068 cheky    0.0076 chy      0.0071 dain
    0.0067 okchor   0.0076 kedy     0.0068 kchdy    0.0075 okey     0.0071 dal
    0.0063 okor     0.0070 okol     0.0068 okeey    0.0072 dain     0.0068 s
    0.0060 dal      0.0063 cheey    0.0062 okam     0.0069 s        0.0062 cheol
    0.0059 ckhey    0.0063 sheedy   0.0062 ykedy    0.0067 okain    0.0059 okeol
    0.0059 ol       0.0058 aiin     0.0056 okol     0.0066 ckhy     0.0058 cheey
    0.0051 dol      0.0058 shckhy   0.0056 ykar     0.0066 saiin    0.0054 al
    0.0050 shey     0.0055 checkhy  0.0053 dam      0.0063 chckhy   0.0054 okchy

  Observations:

    (*) Again, the Herbal-A and Biological vocabularies and
        distributions are quite different.

    (*) Again, the Herbal-B plot resembles very much that of the
        Herbal-A/Biological mixture, and the vocabularies now do have
        some resemblance.
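
  The resemblance between two frequency plots, as opposed to two
  vocabularies, can be given a rough number by comparing the
  frequencies rank by rank and ignoring the words themselves. The
  Python sketch below uses the mean absolute difference of
  log-frequencies at equal rank. This particular measure is my own
  assumption, chosen because the plots are presumably read on
  logarithmic axes; it is not something computed in the original
  analysis. The sample figures are the first five rows of the
  "q"-less, "k"="t" table above.

```python
import math

def profile_distance(freqs_a, freqs_b):
    """Mean |log(a/b)| over ranks present in both frequency lists."""
    n = min(len(freqs_a), len(freqs_b))
    return sum(abs(math.log(a / b))
               for a, b in zip(freqs_a[:n], freqs_b[:n])) / n

vhea_qkt = [0.0527, 0.0283, 0.0192, 0.0169, 0.0160]
vbio_qkt = [0.0505, 0.0495, 0.0469, 0.0369, 0.0369]
vheb_qkt = [0.0298, 0.0267, 0.0245, 0.0217, 0.0186]
vmix_qkt = [0.0329, 0.0321, 0.0302, 0.0263, 0.0200]

print("Herbal-B vs mixture:     %.3f" % profile_distance(vheb_qkt, vmix_qkt))
print("Herbal-A vs Biological:  %.3f" % profile_distance(vhea_qkt, vbio_qkt))
```

  On these five ranks the Herbal-B and mixture profiles come out much
  closer to each other than the Herbal-A and Biological profiles do,
  in line with the observation above. A fair comparison would use the
  full 50-entry ".frq" files rather than five values.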