Hacking at the Voynich manuscript - Side notes
104 Listing repeated words in Voynichese and other languages

Last edited on 2012-05-03 20:45:19 by stolfilocal

INTRODUCTION

  In this note we list and tabulate repeated words found in
  Voynichese and other languages.

SETTING UP THE ENVIRONMENT

  Links:

    ln -s ../tr-stats/dat
    ln -s ../tr-stats/exp
    ln -s /home/staff/stolfi/voynich/work
    ln -s work/compute-freqs
    ln -s work/list-duplicate-words
    ln -s work/update-paper-include
    ln -s work/fix-words
    ln -s ../100/subsections.tags

LISTING REPEATED WORDS

  Language samples:

    set samples = ( \
        voyn/tak \
        \
        engl/wow \
        engl/cul \
        engl/twp \
        latn/ptt \
        latn/nwt \
        latn/ock \
        grek/nwt \
        span/qvi \
        ital/psp \
        fran/tal \
        port/csm \
        germ/sim \
        russ/pic \
        russ/ptt \
        arab/quf \
        arab/quv \
        arab/qud \
        arab/qph \
        arab/qcs \
        hebr/tav \
        hebr/tad \
        geez/gok \
        viet/ptt \
        viet/nwt \
        chin/ptt \
        chin/ptn \
        chin/red \
        chin/voa \
        chip/voa \
        tibe/vim \
        tibe/ccv \
        tibe/pmi \
        \
        enrc/wow \
        chrc/red \
        envt/wow \
        envg/wow \
        \
        voyp/grs \
        voyp/grm \
        viep/grs \
        viep/mky \
        \
        engl/wnm \
        engl/cpn \
      )

  Generating the dup-word listings:

    foreach sample ( $samples )
      set lang = ${sample:h}
      set book = ${sample:t}
      make LANG=${lang} BOOK=${book} -f dup-word-lists.make all
    end

  Printing the summary:

    set sfile = ".summary"
    rm -rf ${sfile}
    foreach sample ( $samples )
      set tfile = "exp/${sample}/tot.1/raw-dup-summary.tex"
      echo -n "${sample}" >> ${sfile}
      # Each numeric field: strip the TeX macro wrapper, keep the number.
      grep -h DupTotCt ${tfile} \
        | gawk '//{ s = $0; gsub(/[\\A-Za-z{}]/, "", s); printf " %5d", s; }' >> ${sfile}
      grep -h DupTotFr ${tfile} \
        | gawk '//{ s = $0; gsub(/[\\A-Za-z{}]/, "", s); printf " %.5f", s; }' >> ${sfile}
      grep -h DupMaxCt ${tfile} \
        | gawk '//{ s = $0; gsub(/[\\A-Za-z{}]/, "", s); printf " %5d", s; }' >> ${sfile}
      grep -h DupMaxFr ${tfile} \
        | gawk '//{ s = $0; gsub(/[\\A-Za-z{}]/, "", s); printf " %.5f", s; }' >> ${sfile}
      # Word field: strip only the leading "\def\Name", keep the braced word.
      grep -h DupMaxWd ${tfile} \
        | gawk '//{ s = $0; gsub(/^[\\]def[\\][A-Za-z]*/, "", s); printf " %s\n", s; }' >> ${sfile}
    end
    # Sort by total dup fraction (field 3), largest first.
    cat ${sfile} | sort -b -k3,3gr

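  The summary extraction above can also be sketched in Python, for
  readers without csh and gawk at hand. This is only a sketch: it
  assumes that each line of raw-dup-summary.tex has the form
  \def\DupTotCt{317}, which is inferred from the gawk patterns and
  not confirmed by the note. The function names parse_dup_summary
  and summary_row are hypothetical, and the example input is the
  voyn/tak row of the output table written back in that assumed form.

```python
import re

# Assumed line shape: \def\DupTotCt{317}. This is inferred from the
# gawk patterns in the note, not taken from the actual .tex files.
MACRO_RE = re.compile(r"\\def\\(Dup[A-Za-z]+)(\{.*\})\s*$")


def parse_dup_summary(tex_text):
    """Map each Dup* macro name to its braced body (braces kept)."""
    fields = {}
    for line in tex_text.splitlines():
        m = MACRO_RE.match(line.strip())
        if m:
            fields[m.group(1)] = m.group(2)
    return fields


def summary_row(sample, fields):
    """Format one row the way the csh/gawk loop does."""
    def num(key):
        return fields[key][1:-1]  # drop the outer braces

    return "%s %5d %.5f %5d %.5f %s" % (
        sample,
        int(num("DupTotCt")), float(num("DupTotFr")),
        int(num("DupMaxCt")), float(num("DupMaxFr")),
        fields["DupMaxWd"],  # the word keeps its braces, as in the note
    )


# Illustrative input: the voyn/tak values in the assumed macro form.
example = "\n".join([
    r"\def\DupTotCt{317}",
    r"\def\DupTotFr{0.00802}",
    r"\def\DupMaxCt{22}",
    r"\def\DupMaxFr{0.00056}",
    r"\def\DupMaxWd{Chol}",
])
row = summary_row("voyn/tak", parse_dup_summary(example))
print(row)
```

  Sorting the resulting rows on the third field in descending order
  would then reproduce the ordering used for the table.
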
  Output (manually resorted):

    sample   DupTotCt DupTotFr DupMaxCt DupMaxFr DupMaxWd
    voyn/tak      317  0.00802       22  0.00056 {Chol}
    chin/red      351  0.00995       44  0.00125 {lao\tn{3}{1}}
    chip/voa      151  0.00427       35  0.00099 {guo\tn{2}{}}
    viet/nwt      106  0.00294       18  0.00050 {nga`i}
    ital/psp       92  0.00258        8  0.00022 {no}
    tibe/ccv       90  0.00257       54  0.00154 {MA}
    chin/voa       91  0.00255       35  0.00098 {guo\tn{2}{}}
    chin/ptn       85  0.00238       12  0.00034 {ji\tn{4}{4}}
    russ/pic       80  0.00221        4  0.00011 {davaj}
    chin/ptt       67  0.00186        7  0.00019 {ji\tn{4}{4}}
    port/csm       60  0.00171        8  0.00023 {não}
    grek/nwt       63  0.00170       16  0.00043 {\alpha\mu\eta\nu}
    tibe/pmi       54  0.00154       10  0.00029 {KHANG}
    latn/nwt       57  0.00153       16  0.00043 {amen}
    engl/twp       54  0.00130        6  0.00014 {alas}
    viet/ptt       43  0.00119       10  0.00028 {ddo+`i}
    tibe/vim       27  0.00077       12  0.00034 {DE}
    hebr/tad       25  0.00066        3  0.00008 {b¤b¤qr}
    fran/tal       22  0.00061        3  0.00008 {de}
    geez/gok       17  0.00049        8  0.00023 {'alElene}
    engl/wow       16  0.00045        3  0.00008 {had}
    hebr/tav       16  0.00042        3  0.00008 {b¤äb¤öqêr}
    latn/ock       11  0.00031        4  0.00011 {et}
    russ/ptt       11  0.00031        3  0.00009 {ВЕСЬМА}
    engl/cul       11  0.00030        4  0.00011 {it}
    span/qvi       20  0.00028        4  0.00006 {el}
    latn/ptt        9  0.00024        2  0.00005 {septena}
    germ/sim        8  0.00023        2  0.00006 {halt}
    arab/qud        6  0.00016        1  0.00003 {alllh}
    arab/qcs        4  0.00011        1  0.00003 {allh}
    arab/quf        2  0.00005        1  0.00003 {allhî}
    arab/quv        2  0.00005        1  0.00003 {allhî}
    arab/qph        1  0.00003        1  0.00003 {fyhî}

    voyp/grm        4  0.00551        1  0.00138 {dal}
    voyp/grs        5  0.00256        1  0.00051 {Chedy}
    viep/mky       80  0.00222        7  0.00019 {cho}
    viep/grs       16  0.00051        4  0.00013 {nga}

    envg/wow        0  0.00000        0  0.00000 {-}
    envt/wow       39  0.00110       11  0.00031 {ngu+o+i}
    enrc/wow       16  0.00045        3  0.00008 {dlxxvi}
    chrc/red      351  0.00995       44  0.00125 {slkiy}

NOTES

  The VMS has 8.0 dups per 1000 tokens; that is only the second
  highest rate, slightly less than the Dream of the Red Chamber
  (10 dups per 1000). The next significant entry is the set of
  Voice of America broadcasts in Pinyin, with about half as many
  dups as the VMS (4.2/1000).
  Then comes a group of languages with broadly similar values,
  from Vietnamese (2.9/1000) down to Arabic (virtually zero).

  Rugg's method hand-applied to Voynichese produces about 70% of
  the duplication seen in the real VMS (about 5.5/1000); the
  software version produces much less (about 2.6/1000).

  When applied to Vietnamese, Rugg's method produces few
  duplicates (0.5/1000). A second-order Markov monkey gives
  2.2/1000, about the same as the true text (2.9/1000).

  Roman-coded English and Chinese, as expected, have the same
  dups as the plain versions.

  Viet-coded English has more dups than English, presumably
  because of the use of combinations of the form "X-Y" and "Y"
  for distinct English words; the sequence "X-Y Y" will be
  counted as a dup of Y.

  Wells's "War of the Worlds" has 16 dups (0.45 per 1000 tokens),
  but they are all destroyed by Vigenère encoding (the only file
  with zero dups).
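
  The counting behind these rates can be illustrated with a small
  Python sketch. It is not the list-duplicate-words script, whose
  source is not shown here. It assumes that a "dup" is a token
  identical to the token immediately before it (so "a a a" counts
  twice), and that the rate is dups divided by total tokens. The
  function name count_dups and the split_hyphens flag are
  hypothetical; the flag models the guess that "X-Y" is split at
  the hyphen, which is one way the sequence "X-Y Y" could come to
  be counted as a dup of Y.

```python
import re
from collections import Counter


def count_dups(text, split_hyphens=False):
    """Count tokens that repeat the immediately preceding token.

    Assumption: one dup per adjacent identical pair, so "a a a"
    yields two dups. The real script may count differently.
    Returns (total dups, dups per 1000 tokens, per-word Counter).
    """
    sep = r"[\s\-]+" if split_hyphens else r"\s+"
    tokens = [t for t in re.split(sep, text) if t]
    dups = Counter(b for a, b in zip(tokens, tokens[1:]) if a == b)
    total = sum(dups.values())
    rate = 1000.0 * total / len(tokens) if tokens else 0.0
    return total, rate, dups


sample = "X-Y Y had had it"
plain = count_dups(sample)                      # "X-Y" stays one token
split = count_dups(sample, split_hyphens=True)  # "X-Y" becomes "X", "Y"
print(plain[0], split[0])
```

  With whitespace-only splitting the sample has one dup ("had");
  with hyphen splitting it gains a second one ("Y"), which is the
  kind of inflation suggested above for Viet-coded English.
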