Hacking at the Voynich manuscript
Notebook - volume 4

Warning: these notebooks aren't strictly chronological logs. Sometimes
I go back and redo things, clarify comments, delete garbage, etc.

Summary of previous notebooks
=============================

On 97-07-05 I obtained Landini's interlinear transcription of the VMs,
version 1.6 (landini-interln16.evt) from
http://sun1.bham.ac.uk/G.Landini/evmt/intrln16.zip

I manually extracted from it a homogeneous, full-text sample
bio-m-evt.evt, consisting of pages 147-166 (f75r--f84v) of the
"biological" section, in Currier's Language B, hand 2.

This section includes Currier's and Friedman's transcriptions.
Currier's seems to be the most complete of them. The two versions
have many differences (affecting 5-10% of the words), and often
disagree even in the grouping of symbols: where one sees two words the
other sees a single word, what is [A] for one may be [CI] for the
other, and so on.

So I decided to break all characters down to individual "logical"
strokes, and use one (computer) character to encode each stroke. I
called this new encoding "jsa" (Jorge's Super-Analytic).

After mapping to jsa, I generated a "consensus" version of the
biological section:

    cat bio-m-evt.evt \
      | fsg2jsa \
      > bio-m-jsa.evt

    cat bio-m-jsa.evt \
      | make-consensus-interlin \
      > bio-x-jsa.evt

    cat bio-x-jsa.evt \
      | egrep '^<.*;J> ' \
      | sed \
          -e 's/{[^}]*}//g' \
      > bio-j-jsa.evt

    extract-words-from-interlin \
      -chars "qocilgysxju" \
      bio-j-jsa.evt \
      bio-j-jsa

     lines   words     bytes file
    ------ ------- --------- ------------
      7054    7054     62690 bio-j-jsa.wds
      2132    2132     24925 bio-j-jsa.dic
      4661    4661     40897 bio-j-jsa-gut.wds
       992     992      9720 bio-j-jsa-gut.dic
       840     840      2445 bio-j-jsa-fun.wds
         2       2         5 bio-j-jsa-fun.dic
      1553    1553     19348 bio-j-jsa-bad.wds
      1138    1138     15200 bio-j-jsa-bad.dic

Digraph counts:

                  q     o     c     i     l     g     y     s     x     j     u   TOT
        ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- -----
            .  1398   965  1877   361    60     .     .     .     .     .     .  4661
    q       1     .  1229    18     .     1   154     .     .     .   700     .  2103
    o      21   486     1    63  1087  1071     .     .     .     .     .     .  2729
    c       4   167   176  6137  1209   232  2114  2921  1019     .     .     . 13979
    i       4     1     1     8  1997     2     .     .   560  1616    37   457  4683
    l       .     .     .     .     .     .    16     .     .     .  1566     .  1582
    g      52     .    74  2150     4     4     .     .     .     .     .     .  2284
    y    2790    26     2    47    13    43     .     .     .     .     .     .  2921
    s     463     1    99  1013     1     2     .     .     .     .     .     .  1579
    x     827    24   105   488     5   167     .     .     .     .     .     .  1616
    j      46     .    76  2175     6     .     .     .     .     .     .     .  2303
    u     453     .     1     3     .     .     .     .     .     .     .     .   457
        ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- -----
    TOT  4661  2103  2729 13979  4683  1582  2284  2921  1579  1616  2303   457 40897

Some conclusions we get from this and other data:

The valid \i/ sequences are \ij/ \is/ \iis/ \iiu/ \iiiu/ \ix/; the
others are likely to be scription or transcription errors.

\ci/ and \o/ are lexically similar but distinct glyphs.

The suffixes \ij/, \iis/, \iiu/, and \iiiu/ are preceded almost
exclusively by \ci/ and are strictly word-final.

It seems plausible that these are errors:

    \oij/     (4 occurrences) should be \ciij/   ( 32 occurrences)
    \oiiu/    (2 occurrences) should be \ciiiu/  (109 occurrences)
    \ciiu/    (4 occurrences) should be \ciiiu/  (109 occurrences)
    \oiiiu/   (9 occurrences) should be \ciiiiu/ (329 occurrences)
    \ciiiiiu/ (4 occurrences) should be \ciiiiu/ (329 occurrences)
    \ciiix/   (2 occurrences) should be \ciix/   (403 occurrences)

\ciiis/ (19 occurrences) may also be a misreading of \ciis/ (291
occurrences).

\cg/ is always a glyph.

\qo/ is a combination that occurs only in word-initial position.

\qc/ is likely to be a misreading/miswriting of \qo/.

\cy/ is always a glyph, almost certainly a final form of \ci/.

\qj/, \lj/, \qg/, \lg/ are glyphs.

\cs/ is a glyph closely related to (but distinct from) \c/.

\ccg/ is almost always followed by \ci/ or \cy/.

Here "glyph" means a group of strokes that can be treated as a single
symbol for analysis; it may actually be part of a larger, still
unrecognized symbol.

Summarizing again:

  \iiiu/, \iiu/, \iis/, \ij/
    The ziggies: strictly final, preceded always by \ci/ or, more
    rarely, by \o/.
  \cy/
    Almost always final, but occasionally followed by other letters.
    Preceded by about the same letters as \ci/; indeed, it is
    probably the final form of \ci/.
  \cg/
    May be followed by many letters, most often \cy/ and \ci/.
    Almost always preceded by \c/, or initial; rarely by \ix/ or \o/.
  \cs/
    Most often followed by \c/, somewhat less often by \o/, \ci/, or
    word break. Most often initial, but also preceded by \ix/,
    gallows, \c/, \cy/, \cg/, \is/.
  \lg/, \qg/, \lj/, \qj/
    The capitals: very similar to each other, different from the
    rest. Probably to be combined with \c/ on both sides. It is very
    likely that \l/ and \q/ are exactly equivalent. Also, \lg/ and
    \qg/ may be the capital form of \cg/, used mainly in the first
    line of each paragraph (and perhaps of each page?)
  \qo/
    Strictly initial, almost always followed by a capital. Sometimes
    misread as \qc/?
  \ix/
    Usually initial or preceded by \ci/ or \o/; followed by any
    letter except ziggies and \qo/, \ix/, \is/.
  \is/
    Similar to \ix/ except that it cannot be followed by capitals or
    \cg/, either.
  \ci/
    May be followed only by the ziggies, \ix/, or \ir/. Often
    follows a capital, but also \cg/, \cs/, \c/, \ix/, \is/, or word
    break.
  \o/
    Similar to \a/, but is very often word-initial.

Other conclusions:

  * The manuscript does not appear to use any hyphenation mark.
    Either words are not broken across lines, which would be unusual,
    or they are broken without any extra marks. Such word breaks may
    result in statistical anomalies at the beginning and end of
    lines. Could this explain Currier's claim that lines are
    "functional units"?

  * Note that parsing sequences like \cij/, \ciis/, and \ciiis/
    requires some care: the right parsings are c+ij, c+iis, ci+iis.

  * The parsing of \ciis/ is ambiguous: ci+is or c+iis. Declaring
    \ciiis/ to be a misreading of \ciis/ would remove the ambiguity.

  * The parsing of \ciiiu/ is ambiguous, too; but since the \iu/
    series does not seem to follow a bare \c/, it seems safe to parse
    it as ci+iiu.

  * The gallows characters \qj/ and \lj/ appear to be closely
    related: for every common word with \lj/, there appears to be a
    word with \qj/ that occurs with about 1/4 the frequency.

  * There seems to be a kinship between the glyphs \cs/ (when not
    attached to the following \c/s), \ir/, and the gallows \lj/ and
    \qj/ (also, when unattached).

  * The same phenomenon can be noted with respect to prefixes
    containing \cc/ and \csc/: for every word beginning with \cc/,
    there is a word where the first \cc/ is replaced by \csc/, with
    practically the same frequency.

  * There appears to be much confusion between the suffixes \iu/ and
    \iiiu/.

  * There appears to be much confusion between \o/ and \ci/.

The strings of \c/, \cs/, \lj/, \qj/, \lg/, \qg/ must be treated
together, after collapsing the glyphs listed above, since there seem
to be glyphs consisting of gallows preceded and followed by \c/ or
\cc/. When this is taken into account, we can see that a single \c/
is not a glyph, but \cs/ is. In fact, after shrinking \ci/ to `a',
\cs/ to `z', the gallows to `H' or `P', the only possible glyphs of
the form [czHP]* with length at most 3 are

    freq  glyph
    ----  -----
     795  H
      52  P
     152  z
     138  cc
      70  zc
     482  Hc
     484  ccc
     439  zcc ?
     493  Hcc ?
      19  cHc
       4  cPc

The ones marked `?' may be composite, z+cc and H+cc, but this
hypothesis does not seem very likely (perhaps they are *sometimes*
composite?)

The significant strings of length 4 that cannot be parsed into the
glyphs above are

      20  cHcc
       4  cPcc

Strings with 4 or more [czHP]'s tend to be quite ambiguous.

97-07-29 stolfi
===============

Let's try to determine whether the \lg/ and \qg/ gallows are
legitimate glyphs, or merely an ornate form of some other glyph. It
seems that these sequences only occur at the beginning of a
paragraph.
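A claim of this kind can also be tested with standard tools alone. The sketch below is not the pipeline used in this notebook. It assumes that the last line of every paragraph carries an `=' marker (an assumption suggested by the `=' seen at the end of some transcript lines, not something the notebook states), and the function names p_line_positions and p_initial_summary are hypothetical, chosen only for illustration.

```shell
# Hypothetical helper: for every input line that contains `P', print the
# 1-based position of that line inside its paragraph.
# ASSUMPTION: the last line of a paragraph contains an `=' marker.
p_line_positions() {
  awk '
    { pos++ }            # advance the line counter within the paragraph
    /P/ { print pos }    # report where a P-bearing line sits
    /=/ { pos = 0 }      # paragraph ended: restart the counter
  '
}

# Hypothetical helper: report how many P-bearing lines are the first
# line of their paragraph, out of all P-bearing lines.
p_initial_summary() {
  p_line_positions | awk '
    { total++ }
    $1 == 1 { first++ }
    END { printf "%d of %d\n", first + 0, total + 0 }
  '
}
```

Feeding the recoded text to p_initial_summary gives a single "N of M" line. If end-of-paragraph markers are missing from the input, lines will be attributed to the wrong paragraph, so the count is only as good as the markers.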
Let's check: cat bio-j-jsa.txt \ | sed \ -e 's/^/_/g' \ -e 's/$/_/g' \ -e 's/[ql]j/H/g' \ -e 's/[ql]g/P/g' \ -e 's/cs/z/g' \ -e 's/ij/k/g' \ -e 's/ix/e/g' \ -e 's/is/r/g' \ -e 's/iiu/n/g' \ -e 's/y/i/g' \ -e 's/ci/a/g' \ -e 's/cg/8/g' \ -e 's/ir/v/g' \ -e 's/iin/m/g' \ -e 's/in/m/g' \ | tac \ | number-lines-from-end-of-paragraph \ | tac \ | number-lines-in-paragraph \ | egrep 'P' \ | sort +0 -1n 1 3 _Pccoe cPcoe zoe?Hcoe Hc8a qoHc8a qoHcc8a qoHcca oeHcc8a Hca?qoHc8a qoPor oea //_ 1 4 _H??e ccc8a qoHcca oHc8a 8aHc8a oHcc8a oezcc8 oPzcc8 aHzcc8a qo?jc8a oPoea //_ 1 4 _Pcccoe 8ar qoHcca ccccHa qoH??e 8ae c?cc8a Pcc8a roe qoHc8a roe //_ 1 4 _PoeHczcoe oPa zcca qoPzcc8a qoHcc8a qoHoe zca Hoezc8 qoHa //_ 1 4 _Poeccc8a zcc8a qoHcc8a qoHan o8a cccH?cz oHae //_ 1 4 _Poezcca oeHzcc8a zccoe aHcca oHcca cccor zccc8a oe //_ 1 4 _Por?ar?or aHcca Hcca o??ar ozcca qoHa ccca oHcca e8a or??e //_ 1 4 _Pzccoe8a 8Hzcca qoHoPa ror oPor oePocHca oeHa8a //_ 1 5 _Hzcc8a qoHc?m zcc8a qoHaz oHae qoPzcc8a qoHa eccc8a qoP??eo?? //_ 1 5 _P8aezcor zcHoe qoHa Pzcai? 
zcc8a oHae8a 8ar oHar oHc8a 8a roe //_ 1 5 _Pccca Hzccoe qoHc?m o?gccc8a oHaezc8a oeHav oHam oHcc8a //_ 1 5 _Poe zcar zcar Pccca oHzcca oHaoz am oHzcca 8aeHccca?ra //_ 1 5 _Poeam oeHcc8a qoHccca 8aHcc8a qoHcc8a oPccc8a zcoe ora //_ 1 6 _HoHoe oePccc8a qoHcc8a qoHc8ae zcoe qoHae oH8ae 8??e oezcc8a //_ 1 6 _P?cc8a 8oePccc8a qoHcc8a qoHc8a qoHoe?Pccc8a roea //_ 1 6 _Pcccoe?Hae 8ae Horcca qoHca qoHa rccc8a qoHae oeHae?zcc8a ccHa //_ 1 6 _Poe oe zcae??jc?m oHcca eHcca qoHae oHcc?s8a oHczc8a //_ 1 6 _PoeHcca qoHoe oHzc8a oeHa orH??r zcccPc?c8a oeHae //_ 1 6 _Poeccc8a qoHar zcca qoHe oe?c?cca qoHan ccca qoHa qoHar //_ 1 6 _c?jcc??a rcccHa zccPccc8a qoH??r oe qocHcca ?eccca qoHc8a Horom?a //_ 1 6 _qHor zcc8a zcca Hcc8a zae ram ???Pcc8a 8ar ccc8a qoPccc8a roroe //_ 1 6 _qo8a zcar acH?ca qoHccca Hccca oeHan oPccc8a q?goe zcHa orae //_ 1 7 _?g??ae?zc8a zcocPcc8a ??H??r zcc8a oPzcc8a oHzc8a qoHc8c?e zc8a zoe8a //_ 1 7 _Horoezcz8a oPccca zccPcca qoHam zccHca qoHc8a 8aea //_ 1 7 _Poe?Hc8a ezcccHca oeHa oH oeHcca rccca qcHca rccca rae //_ 1 7 _Poe?zcoe Hccca qoHoe zcc8a qoHzcca zaea Hccca zHoe?Pcca //_ 1 7 _Poecca cPaeov o?jc?m oHam cc?cca Ham aeor oeHcca qoHae //_ 1 7 _Pzcoe?PcccPc8a qoHcc8a 8a qoHc8a 8am zccHcc8a qoHam ccccHca 8ar ccccHca ak //_ 1 8 _Hccc8a ePccc8a oPzcc8a cccPoe Pccc8ar zcc8a qoPccc8a //_ 1 8 _Poe8aHa 8aeoe oHc8a ccHav oPccc8a q?Hae c?cc8a cccPccc8a oPccca //_ 1 8 _Pzcaroe zccHca qoHzc8a qo?jae8a oPc?cc8a qoHar or am?oe //_ 1 8 _zaHam oHcc8a ccc8a qoHc?m cPcca oPcccca oHa?z?am ??Hara //_ 1 9 _Hoecc8a qoHc8a qoPoe qoHc8or cco???ccc8a qoHae?ccc8a Hak //_ 1 9 _P??ecc8 ??ccc8a z?ccHca oeHa 8ar oPaeHam oqoPccc8a ??r?a?m oPoea oroea //_ 1 9 _Par zcca ?jcc8a zccHae 8ae 8ar oe Pccc8a zccH 8c?m oPae?zccHa //_ 1 9 _Pccc8ar zcc8ae qoHar zcc8ae oHcc8a qoHc8a ?oHaie zcc8a qoHa??ezcc?? 
//_ 1 9 _Pccor ccccPcc8a qoHc8a ezc?c8a qoHcc8a rzcc8?Hc8a qoPzc8a qoPa //_ 1 9 _Poe8zcc8a oeH??ra qoHoeoe oH??e8a oHc8oe?or oeoroe //_ 1 9 _Poearar oHor oPcccca aHcca oPccaea ezcc8a qo?gcc8ae eHo8ae oPa Horoe?s //_ 1 9 _Pzc?c?e8a oPaezcc8a qoHzcc8a qoHc8a 8or zcca oPccc8a 8ae ?so?Pccak //_ 1 10 _qcHc8a zcc8a qoHoe o8ae ccae 8ar qoPzcc8a qoHc8a qoHc8a qoHc8a 8ae //_ 1 11 _Pccc8ar oPccc8a qoHc8a oPccc8a qoP8a 8an cccHa?s ccc?gcca qoHak z //_ 1 11 _Poe zcc8a qocc8a qoHam cccPcca qoe eHam zcc8a qoe //_ 1 11 _Pzcoe Hc?m oeHar zcca qoHc?m 8??e oeHam oHan zcae qoHa //_ 1 12 _Poeaecc8a Pz?cc8a oPccc8a qoHa?s aHzccoe qoHcc8a oHa ezcc8 ccPzccc8a aHae //_ 1 13 _Har zcccHca qoHae qoHc?m cc?c?e Hc8a rccc8a Pccc8a rzc?oe 8ae //_ 1 13 _Pccc8a ezcccHcc8a qoHcca qoHam oeHa oPccc8a Pccc8a //_ 1 14 _Hcc??8cca ezcc8a 8aea 8ae 8zccc8a Pcccoe ePoe oe?c?cc8a qoHa //_ 1 14 _Po?sae?zca qoHc8a cPcae zcc8a zccoe Hccc8a eHam zcc8a qoea //_ 1 15 _Poecc?a qoHc8a zcoe ?Pccc8a oePccc8a o?gzc8a oea //_ 1 15 _Poezca 8ae zcc8a qoHan 8ar cc?s8a ?c?c??ga qo?jar zcc8a e8a //_ 1 15 _cPc8or zcc8ae qoPcc8a 8zcc8a zcc8a Hccc8a ezcccHae zcccPca 8am 8a //_ 1 16 _P??e?8ar?av??e qoHoe ccca qoPccc8a qoPccc8a 8a?ecccz ??eHc8a e??ea //_ 1 16 _Po??c?m oe qoHcc8a Hccoe Hccae?oeHcc aPccca oPccc8a Haea //_ 1 16 _Poecc8a qoPccoe qoHor oePccc8a oPoe or???sa //_ 1 17 _P8oe ?gzcc8a q?PoeHc?m ocHcor oHcc8a qoHcc8a qoq?cccoe oe?jom a8arzcca //_ 1 18 _Hor zcc8a oHc8a qoHc8a oHc8ar cPc8??roe //_ 1 18 _PoHan oHcc8a or cccza zoe?zcca qoHcca qoHc8a oeHc8a ccca?Hae 8a qoe //_ 1 18 _Poe?zca ?c??sca?Hcc8 qoP oHcc8a o??jc?c??ga ?oHc???8c??? ??ai??? //_ 1 19 _PaHc8a oePccc8a qoHc8a zPcca ccc8a roe 8??r oPccc8a qoHc8a //_ 1 19 _qo?gcccoe oPccc8a qoHc?m oPccc8a?eccca cPcar oe oHaeor //_ 1 21 _Pccoe?zcc8a zcH??e oHc8a oPccoe?or Hcc8a oPccc??a?z?am cPccca oez //_ 1 25 _Hzcc8??r zcc8a qoPccc8a qoHc8a 8a?qoHoe o?ja //_ 1 29 _???q??cc???c??? 
???Pc?c??a ?o??gzc??a ?q???g?c???c???c???n aza //_ 4 1 _Poe oeHccc?i?zccoe qoHcca // =_ 4 1 _azccca oezcca oeHzcc??a z?ccPzcca // =_ 5 1 _??o8c??cca ?jar oHc?m oPar oHc?m oeHca // =_ 6 1 _Hcccoe Hc8a Pcccoe Hc?m zcc8a qoHam oHca qoHc8a ccc8a // =_ 6 1 _Pcccoe zcc8a qoHam zcoe8a // =_ 8 1 _Pccc8ae oHc8a zcccHcc8cca qoHa ccc8a ccara // =_ 11 1 _Poezc8a qoHcc8a zcHa oeoeccca // =_ 14 1 _zcca oPccca cHcca oecca // =_ 16 1 _8am 8oe z?c?r cPcca e?gcc?c?e zccar qoecca // =_ 16 1 _eccc8a??Pccc8a qoHam 8??r // =_ 5 2 _zor oeHa qoHa?Ha?Hor c?cca?Ha HoHoe oPccc8a qoHc?m zccHa qoHc?m oe //_ 6 2 _Pccc8a qoHccc8a oHam cccHca zcc8a oHc?8a qoHa qoHc8a oe?oHc8a oHc8a rak //_ 8 2 _Poeccc8 oHan zcc8a zcc8a 8ae ccc8ar qoHcca ??Hcca ez??r?am ora //_ 8 2 _aH???e or zcc8a ???qoe?Hcc8a 8am 8Hc?m cPc?8a oe8a //_ 12 2 _8zcc8a Pccc8a qoHan ccc8a 8??ecce qoHcc8a qoH??e oeccca //_ 12 2 _Hzcc8a qoHc?m ccc8ae Pccc8a cccHa zc?m ccc8a qoHan oe //_ 15 2 _8an ccPam oPaea HoHaea //_ 3 3 _Poe oeor ccca qoHc?m zcc8a qoHc?m 8eccc?sa oe r?c?m?8ar //_ 5 3 _Poe ??e am oeHae zcar zcc8a qoHoe cc8a e8oe 8ar ae //_ 5 3 _Poe Har zcc8a qoHc8a oHae zcca qoHar?cccHca oHccca qoHccc8a cc???ca qoHam //_ 9 3 _8am cceccPzccca Hc?e cccoe 8a?? c?roHca 8am //_ 2 4 _8c?m zcca??jcc8a eH?? oPccc8a qoHc8a oHca Hae 8an oHcca oHa //_ 2 4 _??ccca 8ae?zccc8a cPccc8a oHc?m zcca qoPccca oe 8or?ccc8ccca //_ 2 4 _Hccc8a Pccc8a qoHcca ?soe oe8av zcccHca qoe e c?ccc8a qoHcc8a eoe?ccc8a //_ 3 4 _???qoHcc8a aHcca zccca or or am acPam ?cccHcca 8??? aHa //_ 3 4 _zam zcc8a Pz?cc8a qoHar zccoe qoeccca qoHa qoHae qoHa?? //_ 23 4 _Pccca Hzcca qoH?ca qoHae ?zc??ca qo?? ccc8a qoHak //_ 5 5 _zae?cccoe Har zcc8a zae?Hc8a zav qoHc8a q?Pc?c8a ecccPcc8a e8ar //_ 7 5 _zccca q?Pccc8a qoe cccc8a qoHcar cccca eoea 8a //_ 13 5 _azcca e?zcca?Ham eor a?m zccHca ?P?cca ???oe?Hcc8a?Har oHa ear //_ 25 5 _qoHc8a eHca Hae zcc8a qoPccca qoe Pccc8a oHcca cccHca zcca eoe ???e zccca?8a? 
//_ 4 6 _zaeHcca zor zcccHca 8am oHar ccPccc8a cPcca oHa oeor oHcca raecce //_ 11 6 _Pccc8a qoHca oHcoe qoe zccor zcc8a qcHc8az //_ 14 6 _Horc?m zcc8a ??ccor or zccH oHar?Pcc8a oPc?c?eor oHae zcc8a //_ 6 7 _Poecc8arcn zccHca qoHa aHar aeoe eHam oezcc8a oHc?m oHar oPar HoePa //_ 8 7 _Pccc8a rzccae 8ae8a qoHcc8a rzcc8a qo?jcc8a qoHcc8a eoccc8a //_ 9 7 _qoHccza qoHc8a qo?jcc8a cccPcca //_ 4 8 _Poe or oeHc?m ocHca qoHam oHc?m oHar zcca qoeHa //_ 9 8 _Par?o8a zcccPca cccoe qoHae?8ar oHc8a oea //_ 7 9 _Poezc8ae oHc8av oPzcc8ae qoHc8a zcc8a Pzcc?c8ae HzccoHcc8a ozccPoez //_ 7 10 _8zcc8a qoHcc8a qoPccc8a qoHaea 8ar oHam oHccae ak //_ 8 10 _???Pam zc?c?e qoHc?m cccHa qoHcc8a qoHar zccHca qoH??e zccc?jca qoHc?m oeHak //_ 10 10 _8zcc8a qo??c8a or?zcc8a Pccc8a qoHc?c8a oHc8a oPccc8a //_ 12 10 _qoHc8a zcca Hae oHc8a aPccc8a Hc8a eoeor //_ 17 10 _Pccc8ar zccPcca ec?cc8ara 8ae c?cae zca?Hoe zcc8a qoHak //_ 2 11 _qoeccca zccHca qoHca ecca oPccca 8an zc8a qoHc8ar ??ecc8a zo?? oHa 8a eccc8a //_ 3 11 _Pzc8a oPcc8a qoHc8a qoHcc8a qoHc8a qoe?Hc8a qoHc8a oHa //_ 8 11 _oHae oPae oHcca eoe oeoe??joe?Hc8a qoHc8a qoHa ePcc8a qoHcc8a eoe //_ 15 11 _zom Har Hc8a?P?c?ca Hcc8oeH8a //_ 5 12 _qoPccc8a qoe ccc8a qoHcca o8am rc?m 8aea //_ 7 12 _???s?c??c8a qocc8a oe cca rzc8a ezcc8a 8ar cc8a Pcc8a //_ 7 12 _Por?zcca oHan ccc8a qoe?zccoe oeccc8a //_ 18 12 _8aezcc8a qoe?zccc8a qoHae8a cccPa 8c?m ae?oeor oec?m ccc8a zcccHca qoHc?? eor //_ 5 14 _8zcc8a aHcc8a zcccH?a 8am oHc8a qoHcc8 qoHc8a e?ccP?c8a //_ 5 14 _Pom oe Hc8a oHc8a qoHa oHcc8a qoHca //_ 13 14 _Pccc8a Hcc8a qoHc8a qoHc8a qoHc8a qoHc8a qoHan oezcc8a //_ 4 15 _aHcz?ca 8c?cc8?a?Hc8a aHc8a 8ar aHc8a?Pca qoHa aHc8a oHae //_ 7 20 _Pccc8a qoHzc8a aHan ccc8a qoHar cca eoe ccc8a qoHa //_ 6 21 _Pccca?H??c?r oeHa 8ar oHca qoHan cccHca qoHcc8a qoHa //_ 7 23 _zcc8 ae?zccl?ca ?sc?m cccPcc8a an oeHcca eHar an oHcca eHan ccc8a 8ar 8aea //_ 5 25 _zoe?ccc8a qoHcc8a qoPccc8a qoH?cc? 
??oe zcc8a qoHc8a z?cca o?jcc?s ae ae ccc8c?m?8a //_ 2 28 _8zcc8a qoH8?8aar cHca?s cccP c?an oHan qoHcor zcc8a qoe ?i??iu zccae?s qoHcca //_ So, `P's occur mostly , but not exclusively, at the beginning of the paragraph (64 lines out of 126). It seems that the end-of-paragraph is often missing in the transcript file, especially when it precedes a page break. 97-07-30 stolfi =============== Let's repeat the investigation of [czHP] strings, but including the `8' letter: cat bio-j-jsa-gut.wds \ | sed \ -e 's/^/_/g' \ -e 's/$/_/g' \ -e 's/[ql]j/H/' \ -e 's/[ql]g/P/' \ -e 's/cs/z/g' \ -e 's/ij/k/g' \ -e 's/ix/e/g' \ -e 's/is/r/g' \ -e 's/iiu/n/g' \ -e 's/y/i/g' \ -e 's/ci/a/g' \ -e 's/cg/8/g' \ | enum-contexts -vPAT='[czHP8][czHP8]*' -vCTX=0 \ | wfreq 793 0.19 H 382 0.09 8 374 0.09 Hc8 314 0.07 Hcc8 305 0.07 ccc8 277 0.06 zcc8 178 0.04 Hcc 163 0.04 ccc 152 0.04 z 140 0.03 zcc 102 0.02 Hc 74 0.02 cc 56 0.01 cccHc 49 0.01 zccc8 49 0.01 Pccc8 49 0.01 P 48 0.01 cc8 46 0.01 zc 41 0.01 ccccHc 40 0.01 cccc 39 0.01 zccHc 35 0.01 zcccHc 35 0.01 Hccc8 34 0.01 zccc 27 0.01 cccc8 25 0.01 Hccc 24 0.01 zccH 20 0.00 zc8 18 0.00 cccH 18 0.00 8zcc8 16 0.00 Hzcc 15 0.00 Hzcc8 14 0.00 cHc 14 0.00 Pccc 13 0.00 zcccH 12 0.00 ccccH 12 0.00 cHcc 11 0.00 cccz 11 0.00 ccH 11 0.00 8ccc8 9 0.00 zccHcc 9 0.00 cccHc8 8 0.00 cHcc8 7 0.00 cccHcc8 7 0.00 Pzcc8 7 0.00 Hzc8 6 0.00 zccHcc8 6 0.00 8cc 5 0.00 zcccHcc8 5 0.00 zcH 5 0.00 ccccz 5 0.00 cHc8 5 0.00 c 5 0.00 Pcc8 5 0.00 8cc8 4 0.00 zzcc8 4 0.00 ccccHcc 4 0.00 cccHcc 4 0.00 cH 4 0.00 Pcc 4 0.00 Hcccc 4 0.00 8ccc 3 0.00 zcccz 3 0.00 cccP 3 0.00 ccHc8 3 0.00 cPcc 3 0.00 cPc 3 0.00 P8 3 0.00 8zcc 3 0.00 8zc 2 0.00 zzcc 2 0.00 zcccPc 2 0.00 zcccHcc 2 0.00 zcccHc8 2 0.00 zccPcc 2 0.00 zccHc8 2 0.00 ccz 2 0.00 ccccHcc8 2 0.00 cccPcc8 2 0.00 cccPcc 2 0.00 cP 2 0.00 Pzc8 2 0.00 Pzc 2 0.00 Pcccc 2 0.00 Hzc 2 0.00 Hczcc 2 0.00 Hczc 2 0.00 Hccz 2 0.00 Hcccc8 2 0.00 H8 2 0.00 8zccc8 2 0.00 8zccc 2 0.00 8cccc 2 0.00 8c8 2 0.00 8Hc8 1 0.00 zzcccHc 1 0.00 zzccH 
1 0.00 zzcHcc8 1 0.00 zcz8 1 0.00 zccz 1 0.00 zccccHcc 1 0.00 zcccc 1 0.00 zcccHcc8cc 1 0.00 zccPccc8 1 0.00 zccP 1 0.00 zccHccc 1 0.00 zcHcc 1 0.00 zcHc 1 0.00 zPcc 1 0.00 zHcc 1 0.00 zH 1 0.00 ccccc 1 0.00 ccccPcc8 1 0.00 cccPccc8 1 0.00 cccHccc8 1 0.00 ccc8cc 1 0.00 ccPzccc8 1 0.00 ccPzccc 1 0.00 ccPccc8 1 0.00 ccP 1 0.00 ccHcc8 1 0.00 ccHcc 1 0.00 cc8cc 1 0.00 cPccc8 1 0.00 cPccc 1 0.00 cPcc8 1 0.00 cPc8 1 0.00 cHccz 1 0.00 cHccc8 1 0.00 Pzcc 1 0.00 Hczc8 1 0.00 Hcz8 1 0.00 Hcz 1 0.00 Hccz8 1 0.00 Hc8zcc8 1 0.00 Hc8cc 1 0.00 Hc8c8 1 0.00 Hc8c 1 0.00 8zcccz 1 0.00 8zc8 1 0.00 8cccc8 1 0.00 8cccHcc8 1 0.00 8Hzcc 1 0.00 8Hcc 1 0.00 88 ----- ---- ---- 4282 1.00 TOT Apparently the `8' (\cg/) does not tend to be surrounded by [czHP] strokes, it is either preceded or followed by them. Thus `8' seems quite unlike `P'. Let's look at some `P' strings and try to find similar words with the `P' replaced by something else: cat bio-j-jsa-gut.wds \ | sed \ -e 's/^/_/g' \ -e 's/$/_/g' \ -e 's/[ql]j/H/' \ -e 's/[ql]g/P/' \ -e 's/cs/z/g' \ -e 's/ij/k/g' \ -e 's/ix/e/g' \ -e 's/is/r/g' \ -e 's/iiu/n/g' \ -e 's/y/i/g' \ -e 's/ci/a/g' \ -e 's/cg/8/g' \ | egrep '[^czHP]P[^czHP]' \ | wfreq 8 0.15 _Poe_ 2 0.04 _oPoea_ 2 0.04 _oPar_ 2 0.04 _oPa_ 2 0.04 _Poeccc8a_ 1 0.02 _qoPor_ 1 0.02 _qoPoe_ 1 0.02 _qoPa_ 1 0.02 _qoP_ 1 0.02 _qoP8a_ 1 0.02 _qoHoPa_ 1 0.02 _oePocHca_ 1 0.02 _oPor_ 1 0.02 _oPoe_ 1 0.02 _oPaezcc8a_ 1 0.02 _oPaea_ 1 0.02 _oPae_ 1 0.02 _oPaeHain_ 1 0.02 _ePoe_ 1 0.02 _Poin_ 1 0.02 _Poezcca_ 1 0.02 _Poezca_ 1 0.02 _Poezc8ae_ 1 0.02 _Poezc8a_ 1 0.02 _Poeccc8_ 1 0.02 _Poecca_ 1 0.02 _Poecc8arcn_ 1 0.02 _Poecc8a_ 1 0.02 _Poearar_ 1 0.02 _Poeain_ 1 0.02 _Poeaecc8a_ 1 0.02 _PoeHczcoe_ 1 0.02 _PoeHcca_ 1 0.02 _Poe8zcc8a_ 1 0.02 _Poe8aHa_ 1 0.02 _PoHan_ 1 0.02 _Par_ 1 0.02 _PaHc8a_ 1 0.02 _P8oe_ 1 0.02 _P8aezcor_ 1 0.02 _HoePa_ ----- ---- ---- 52 1.00 TOT set noglob foreach f ( \ '_'.'oe_' \ '_o'.'oea_' \ '_o'.'ar_' \ '_o'.'a_' \ '_'.'oeccc8a_' \ ) echo " " echo 
"-----------------------------------------------------------------------" echo " " cat bio-j-jsa-gut.wds \ | sed \ -e 's/^/_/g' \ -e 's/$/_/g' \ -e 's/[ql]j/H/' \ -e 's/[ql]g/P/' \ -e 's/cs/z/g' \ -e 's/ij/k/g' \ -e 's/ix/e/g' \ -e 's/is/r/g' \ -e 's/iiu/n/g' \ -e 's/y/i/g' \ -e 's/ci/a/g' \ -e 's/cg/8/g' \ | compare-contexts -rctx 0 -lctx 0 -colw 24 \ "${f:r}P${f:e}" "${f:r}[^P]${f:e}" "${f:r}[^P][^P]${f:e}" end unset noglob ----------------------------------------------------------------------- 8 1.00 _Poe_ 81 0.52 _qoe_ 11 0.19 _zcoe_ ----- ---- ---- 25 0.16 _zoe_ 11 0.19 _oHoe_ 8 1.00 TOT 17 0.11 _eoe_ 11 0.19 _ccoe_ 17 0.11 _8oe_ 8 0.14 _oeoe_ 9 0.06 _roe_ 5 0.09 _oroe_ 6 0.04 _Hoe_ 3 0.05 _Hcoe_ ----- ---- ---- 2 0.04 _aroe_ 155 1.00 TOT 1 0.02 _qooe_ 1 0.02 _oqoe_ 1 0.02 _eHoe_ 1 0.02 _e8oe_ 1 0.02 _aeoe_ 1 0.02 _8roe_ ----- ---- ---- 57 1.00 TOT ----------------------------------------------------------------------- 2 1.00 _oPoea_ 2 1.00 _oroea_ ----- ---- ---- ----- ---- ---- ----- ---- ---- 0 1.00 TOT 2 1.00 TOT 2 1.00 TOT ----------------------------------------------------------------------- 2 1.00 _oPar_ 35 0.92 _oHar_ 7 1.00 _oeHar_ ----- ---- ---- 1 0.03 _orar_ ----- ---- ---- 2 1.00 TOT 1 0.03 _oear_ 7 1.00 TOT 1 0.03 _o8ar_ ----- ---- ---- 38 1.00 TOT ----------------------------------------------------------------------- 2 1.00 _oPa_ 25 0.45 _oHa_ 21 0.53 _oHca_ ----- ---- ---- 23 0.41 _oea_ 10 0.25 _oeHa_ 2 1.00 TOT 6 0.11 _ora_ 9 0.23 _oe8a_ 2 0.04 _o8a_ ----- ---- ---- ----- ---- ---- 40 1.00 TOT 56 1.00 TOT ----------------------------------------------------------------------- 2 1.00 _Poeccc8a_ 6 0.67 _qoeccc8a_ ----- ---- ---- ----- ---- ---- 2 0.22 _zoeccc8a_ 0 1.00 TOT 2 1.00 TOT 1 0.11 _8oeccc8a_ ----- ---- ---- 9 1.00 TOT ----------------------------------------------------------------------- It ssems that isolated `P' = {\lg/,\qg/} is closely related to `r'=\is/, `q' = \q/, `z' = \cs/, `8' = \cg/, `e' = \ix/, `H' = {\lj/,\qj/}. 
cat bio-j-jsa-gut.wds \ | sed \ -e 's/^/_/g' \ -e 's/$/_/g' \ -e 's/[ql]j/H/' \ -e 's/[ql]g/P/' \ -e 's/cs/z/g' \ -e 's/ij/k/g' \ -e 's/ix/e/g' \ -e 's/is/r/g' \ -e 's/iiu/n/g' \ -e 's/y/i/g' \ -e 's/ci/a/g' \ -e 's/cg/8/g' \ | egrep '[^czHP]Pccc[^czHP]' \ | wfreq 14 0.22 _oPccc8a_ 14 0.22 _Pccc8a_ 8 0.13 _qoPccc8a_ 4 0.06 _oePccc8a_ 4 0.06 _oPccca_ 4 0.06 _Pcccoe_ 4 0.06 _Pccc8ar_ 3 0.05 _Pccca_ 2 0.03 _qoPccca_ 1 0.02 _oqoPccc8a_ 1 0.02 _ePccc8a_ 1 0.02 _aPccca_ 1 0.02 _aPccc8a_ 1 0.02 _Pccc8ae_ 1 0.02 _8oePccc8a_ ----- ---- ---- 63 1.00 TOT set noglob foreach f ( \ '_o'.'ccc8a_' \ '_'.'ccc8a_' \ '_qo'.'ccc8a_' \ '_oe'.'ccc8a_' \ '_o'.'ccca_' \ '_'.'cccoe_' \ '_'.'ccc8ar_' \ '_'.'ccca_' \ '_qo'.'ccca_' \ ) echo " " echo "-----------------------------------------------------------------------" echo " " cat bio-j-jsa-gut.wds \ | sed \ -e 's/^/_/g' \ -e 's/$/_/g' \ -e 's/[ql]j/H/' \ -e 's/[ql]g/P/' \ -e 's/cs/z/g' \ -e 's/ij/k/g' \ -e 's/ix/e/g' \ -e 's/is/r/g' \ -e 's/iiu/n/g' \ -e 's/y/i/g' \ -e 's/ci/a/g' \ -e 's/cg/8/g' \ | compare-contexts -rctx 0 -lctx 0 -colw 24 \ "${f:r}P${f:e}" "${f:r}[^P]${f:e}" "${f:r}[^P][^P]${f:e}" end unset noglob ----------------------------------------------------------------------- 14 1.00 _oPccc8a_ 23 0.79 _oeccc8a_ 1 0.50 _orzccc8a_ ----- ---- ---- 5 0.17 _oHccc8a_ 1 0.50 _oecccc8a_ 14 1.00 TOT 1 0.03 _o8ccc8a_ ----- ---- ---- ----- ---- ---- 2 1.00 TOT 29 1.00 TOT ----------------------------------------------------------------------- 14 1.00 _Pccc8a_ 52 0.37 _eccc8a_ 23 0.49 _oeccc8a_ ----- ---- ---- 36 0.26 _zccc8a_ 5 0.11 _oHccc8a_ 14 1.00 TOT 19 0.13 _cccc8a_ 4 0.09 _ezccc8a_ 14 0.10 _Hccc8a_ 2 0.04 _rzccc8a_ 9 0.06 _8ccc8a_ 2 0.04 _qoccc8a_ 5 0.04 _rccc8a_ 2 0.04 _ecccc8a_ 4 0.03 _accc8a_ 2 0.04 _eHccc8a_ 1 0.01 _qccc8a_ 1 0.02 _o8ccc8a_ 1 0.01 _occc8a_ 1 0.02 _eoccc8a_ ----- ---- ---- 1 0.02 _azccc8a_ 141 1.00 TOT 1 0.02 _aeccc8a_ 1 0.02 _acccc8a_ 1 0.02 _8zccc8a_ 1 0.02 _8cccc8a_ ----- ---- ---- 47 1.00 TOT 
----------------------------------------------------------------------- 8 1.00 _qoPccc8a_ 11 0.65 _qoHccc8a_ 2 0.40 _qoezccc8a_ ----- ---- ---- 6 0.35 _qoeccc8a_ 2 0.40 _qoHcccc8a_ 8 1.00 TOT ----- ---- ---- 1 0.20 _qoecccc8a_ 17 1.00 TOT ----- ---- ---- 5 1.00 TOT ----------------------------------------------------------------------- 4 1.00 _oePccc8a_ 1 1.00 _oecccc8a_ 1 1.00 _oeoHccc8a_ ----- ---- ---- ----- ---- ---- ----- ---- ---- 4 1.00 TOT 1 1.00 TOT 1 1.00 TOT ----------------------------------------------------------------------- 4 1.00 _oPccca_ 12 0.63 _oeccca_ 2 0.40 _oezccca_ ----- ---- ---- 6 0.32 _oHccca_ 2 0.40 _oecccca_ 4 1.00 TOT 1 0.05 _orccca_ 1 0.20 _oeHccca_ ----- ---- ---- ----- ---- ---- 19 1.00 TOT 5 1.00 TOT ----------------------------------------------------------------------- 4 1.00 _Pcccoe_ 3 0.27 _zcccoe_ 1 0.33 _oecccoe_ ----- ---- ---- 3 0.27 _8cccoe_ 1 0.33 _cccccoe_ 4 1.00 TOT 2 0.18 _ecccoe_ 1 0.33 _8ccccoe_ 2 0.18 _Hcccoe_ ----- ---- ---- 1 0.09 _acccoe_ 3 1.00 TOT ----- ---- ---- 11 1.00 TOT ----------------------------------------------------------------------- 4 1.00 _Pccc8ar_ 1 0.25 _eccc8ar_ ----- ---- ---- ----- ---- ---- 1 0.25 _cccc8ar_ 0 1.00 TOT 4 1.00 TOT 1 0.25 _accc8ar_ 1 0.25 _Hccc8ar_ ----- ---- ---- 4 1.00 TOT ----------------------------------------------------------------------- 3 1.00 _Pccca_ 31 0.39 _cccca_ 12 0.35 _oeccca_ ----- ---- ---- 23 0.29 _zccca_ 6 0.18 _oHccca_ 3 1.00 TOT 15 0.19 _eccca_ 3 0.09 _ezccca_ 5 0.06 _Hccca_ 3 0.09 _azccca_ 4 0.05 _rccca_ 2 0.06 _ecccca_ 1 0.01 _accca_ 2 0.06 _8zccca_ 1 0.01 _8ccca_ 1 0.03 _rcccca_ ----- ---- ---- 1 0.03 _qoccca_ 80 1.00 TOT 1 0.03 _orccca_ 1 0.03 _acccca_ 1 0.03 _Hcccca_ 1 0.03 _8cccca_ ----- ---- ---- 34 1.00 TOT ----------------------------------------------------------------------- 2 1.00 _qoPccca_ 8 0.50 _qoeccca_ 1 0.50 _qoeHccca_ ----- ---- ---- 8 0.50 _qoHccca_ 1 0.50 _qoHcccca_ 2 1.00 TOT ----- ---- ---- ----- ---- ---- 16 1.00 TOT 2 1.00 TOT It 
seems that `Pccc' is closely related to `Hccc' `eccc' `zccc' `8ccc' `cccc'. cat bio-j-jsa-gut.wds \ | sed \ -e 's/^/_/g' \ -e 's/$/_/g' \ -e 's/[ql]j/H/' \ -e 's/[ql]g/P/' \ -e 's/cs/z/g' \ -e 's/ij/k/g' \ -e 's/ix/e/g' \ -e 's/is/r/g' \ -e 's/iiu/n/g' \ -e 's/y/i/g' \ -e 's/ci/a/g' \ -e 's/cg/8/g' \ | egrep '[^czHP]Pzcc[^czHP]' \ | wfreq 3 0.38 _qoPzcc8a_ 2 0.25 _oPzcc8a_ 1 0.12 _oPzcc8ae_ 1 0.12 _oPzcc8_ 1 0.12 _Pzccoe8a_ ----- ---- ---- 8 1.00 TOT set noglob foreach f ( \ '_qo'.'zcc8a_' \ '_o'.'zcc8a_' \ '_o'.'zcc8ae_' \ '_o'.'zcc8_' \ '_'.'zccoe8a_' \ ) echo " " echo "-----------------------------------------------------------------------" echo " " cat bio-j-jsa-gut.wds \ | sed \ -e 's/^/_/g' \ -e 's/$/_/g' \ -e 's/[ql]j/H/' \ -e 's/[ql]g/P/' \ -e 's/cs/z/g' \ -e 's/ij/k/g' \ -e 's/ix/e/g' \ -e 's/is/r/g' \ -e 's/iiu/n/g' \ -e 's/y/i/g' \ -e 's/ci/a/g' \ -e 's/cg/8/g' \ | compare-contexts -rctx 0 -lctx 0 -colw 24 \ "${f:r}P${f:e}" "${f:r}[^P]${f:e}" "${f:r}[^P][^P]${f:e}" end unset noglob ----------------------------------------------------------------------- 3 1.00 _qoPzcc8a_ 7 0.88 _qoHzcc8a_ ----- ---- ---- ----- ---- ---- 1 0.12 _qoezcc8a_ 0 1.00 TOT 3 1.00 TOT ----- ---- ---- 8 1.00 TOT ----------------------------------------------------------------------- 2 1.00 _oPzcc8a_ 14 0.88 _oezcc8a_ 1 1.00 _oeHzcc8a_ ----- ---- ---- 2 0.12 _oHzcc8a_ ----- ---- ---- 2 1.00 TOT ----- ---- ---- 1 1.00 TOT 16 1.00 TOT ----------------------------------------------------------------------- 1 1.00 _oPzcc8ae_ ----- ---- ---- ----- ---- ---- ----- ---- ---- 0 1.00 TOT 0 1.00 TOT 1 1.00 TOT ----------------------------------------------------------------------- 1 1.00 _oPzcc8_ 2 1.00 _oezcc8_ ----- ---- ---- ----- ---- ---- ----- ---- ---- 0 1.00 TOT 1 1.00 TOT 2 1.00 TOT ----------------------------------------------------------------------- 1 1.00 _Pzccoe8a_ ----- ---- ---- ----- ---- ---- ----- ---- ---- 0 1.00 TOT 0 1.00 TOT 1 1.00 TOT Again the `P' seems to be similar 
to `H' and `e'.

And now for something completely different. Let's look at how the
words are distributed among the paragraphs:

    cat bio-j-jsa.wds \
      | sed \
          -e 's/[ql]j/H/g' \
          -e 's/[ql]g/P/g' \
          -e 's/ij/k/g' \
          -e 's/ix/e/g' \
          -e 's/is/r/g' \
          -e 's/iiu/n/g' \
          -e 's/cy/a/g' \
          -e 's/ci/a/g' \
          -e 's/in/m/g' \
          -e 's/ir/w/g' \
          -e 's/cs/z/g' \
          -e 's/cg/8/g' \
      | enum-words-in-blocks -vWPB=100 \
      | egrep -v '[^a-zA-Z0-9_ ]' \
      | sort +1 -2 +0 -1n \
      | make-word-location-map -vNBLOCKS=71 \
      > .foo

The result has been posted as
http://www.dcc.unicamp.br/~stolfi/voynich/word-distr-map.html

Recomputing with fewer, larger blocks:

    cat bio-j-jsa.wds \
      | sed \
          -e 's/[ql]j/H/g' \
          -e 's/[ql]g/P/g' \
          -e 's/ij/k/g' \
          -e 's/ix/e/g' \
          -e 's/is/r/g' \
          -e 's/iiu/n/g' \
          -e 's/cy/a/g' \
          -e 's/ci/a/g' \
          -e 's/in/m/g' \
          -e 's/ir/w/g' \
          -e 's/cs/z/g' \
          -e 's/cg/8/g' \
      | enum-words-in-blocks -vWPB=1010 \
      | egrep -v '[^a-zA-Z0-9_ ]' \
      | sort +1 -2 +0 -1n \
      | make-word-location-map -vCTWD=3 -vNBLOCKS=7 \
      > .foo

    cat .foo \
      | gawk '/./ { printf"%5d %-16s ", $1, $2; for (i=3; i<=NF; i++) printf " %2d", int(($(i)*99/$1)+0.5); printf "\n" }' \
      > .bar

Results posted in my Voynich page.

97-08-01 stolfi
===============

Recomputed the word distributions, adding the average positions and
deviation, and using only good words (so that the blocks would be
more uniform):
cat bio-j-jsa-gut.wds \ | sed \ -e 's/[ql]j/H/g' \ -e 's/[ql]g/P/g' \ -e 's/ij/k/g' \ -e 's/ix/e/g' \ -e 's/is/r/g' \ -e 's/iiu/n/g' \ -e 's/cy/a/g' \ -e 's/ci/a/g' \ -e 's/in/m/g' \ -e 's/ir/w/g' \ -e 's/cs/z/g' \ -e 's/cg/8/g' \ | enum-words-in-blocks -vWPB=100 \ | sort +1 -2 +0 -1n \ | make-word-location-map -vCTWD=1 -vPERCENT=1 -vNBLOCKS=47 \ > .baz Here are the table lines for the most popular words (at least 20 occurrences) sorted by length (L) and total frequency: TOTAL AVG DEV L WORD ABSOLUTE FREQUENCY BY BLOCK RELATIVE FREQUENCY BY BLOCK ----- ----- ----- - ---------------- ----------------------------------------------- ----------------------------------------------- 127 24.7 12.2 2 oe 15.66...2.11.12.3134926354.623697511211.4..2361 00.00...0.00.00.0000100000.000010000000.0..0000 40 21.4 12.9 2 or 21.121..4........2124411.2...22.2.......1.121.1 00.000..1........0001100.0...00.0.......0.000.0 35 22.8 15.1 2 8a 222.2..1.2...11..1111....1...21.3.2112...11..3. 111.1..0.1...00..0000....0...10.1.1001...00..1. 20 21.7 11.5 2 am ...1.11..1.12......2..11.3..1.2....1...1......1 ...0.00..0.01......1..00.1..0.1....0...0......0 81 22.9 12.7 3 qoe .4.441.1111.3.341..553.13.214123442211...141.12 .0.000.0000.0.000..110.00.000000000000...000.00 73 26.3 13.1 3 8am ..1.3..1.3.322334.611.21....12..7224.12.2312224 ..0.0..0.0.000000.100.00....00..1000.00.0000000 51 22.8 16.1 3 8ar 125312111.....12.122.11.2.1...1.121.2.1..12721. 001100000.....00.000.00.0.0...0.000.0.0..00100. 50 21.0 13.3 3 8ae 1131211.1.11.13341142.....1..2..111221.21.3...1 0010000.0.00.01110010.....0..0..000000.00.1...0 31 23.7 14.0 3 zam 1....1..211412..11.1.1.1....1...1..1.1132...21. 0....0..100101..00.0.0.0....0...0..0.0011...10. 25 23.3 11.4 3 oHa ..1....221.1......21....213..1.2.11..21....1... ..0....110.0......10....101..0.1.00..10....0... 25 26.7 12.7 3 zoe 11.1.......11.....1.1.12..121.1.1.1...21311.... 00.0.......00.....0.0.01..010.0.0.0...10100.... 
23 23.1 13.4 3 oea 21..1..1.1.........11.2211.1...22...1.....1.11. 10..0..0.0.........00.1100.0...11...0.....0.00. 79 24.4 13.6 4 qoHa 521311.21..3111.32...2.31352161.141.62323..222. 100000.00..0000.00...0.00010010.000.10000..000. 69 22.8 14.7 4 zcca 62.21.16111.2..22...224.2.2114311.1.12212121321 10.00.01000.0..00...001.0.0001000.0.00000000000 67 24.3 13.6 4 ccca .2.3.154.2213.1.1....21..111332.4123236121111.. .0.0.011.0000.0.0....00..000000.1000001000000.. 39 25.6 12.4 4 oHae ..2.2...21.....1.1111321.11214..21...211...1.22 ..0.0...00.....0.0000100.00001..00...000...0.00 37 24.7 11.8 4 oHam ..1.11.1...3.211..1.112131.123..1....11.2311... ..0.00.0...1.000..0.000010.001..0....00.0100... 35 15.4 12.6 4 oHar 21511.13221......211.1..2111..12.1.........1.1. 10100.01110......100.0..1000..01.0.........0.0. 25 25.9 15.0 4 Hc8a 1..3........1....4112....1..1..1.....1...112121 0..1........0....1001....0..0..0.....0...001010 21 21.8 14.1 4 oHca 1.1..4.....1.....2..11...2..1..21..1........21. 0.0..2.....0.....1..00...1..0..10..0........10. 204 25.0 14.4 5 zcc8a 26431595524383364713542211463334552525467789574 00000000000000000000000000000000000000000000000 172 22.7 13.8 5 ccc8a 46.1438463556719332362.413133235343254436532272 00.0000000000001000000.000000000000000000000000 113 25.8 12.3 5 qoHae 221411.311...31955...32.148.5432.45.665118312.. 000000.000...00100...00.001.0000.00.000001000.. 91 25.3 11.4 5 qoHam ......221..3874251.5.1111114521.1.74.362.433... ......000..0110000.0.0000000000.0.10.010.000... 83 25.2 16.4 5 oHc8a .15421.23432.211.417...111.....2211....32.31994 .01000.00000.000.001...000.....0000....00.00110 54 13.6 10.6 5 qoHan 8214212....11621426111.1..11..1..11.1........1. 1001000....00100101000.0..00..0..00.0........0. 48 23.2 13.4 5 qoHar 4..23.3....1........171..34211..121.222.11111.. 1..01.1....0........010..11000..000.000.00000.. 
43 21.7 14.4 5 qoHca .22.1243...1.1.2..1.22.1.12..112111.3.....11112 .00.0011...0.0.0..0.00.0.00..000000.1.....00000 34 23.8 12.9 5 oHcca 1...2122.1.1.......1..1233.1...3.311.....2111.. 0...1011.0.0.......0..0111.0...1.100.....1000.. 31 25.0 12.0 5 cccca 1..11....1...23.11..1.11.1..31.1.131.21.11...1. 0..00....0...11.00..0.00.0..10.0.010.10.00...0. 23 19.8 9.9 5 zccca ....1.1.2111.11.1..11.21211.1......2......1.... ....0.0.1000.00.0..00.10100.0......1......0.... 21 25.4 9.9 5 qoHoe ..........1.3.1..11..1.12...1311....11.....11.. ..........0.1.0..00..0.01...0100....00.....00.. 198 24.0 14.2 6 qoHc8a 41946411238.556699393.1271123..5345284285377583 00000000000.000001000.0000000..0000000000000000 81 25.3 12.7 6 qoHcca .11.244.11111311222..2.355.4.113.1264.31224.211 .00.000.00000000000..0.011.0.000.0010.00000.000 56 22.4 13.7 6 oHcc8a 111.22.128.11....113111335....1211..1.1..11.62. 000.00.001.00....000000001....0000..0.0..00.10. 52 21.7 12.6 6 eccc8a 11.1..4111331311221.21.2.....3..1.134.242...... 00.0..1000110100000.00.0.....1..0.011.010...... 50 23.0 12.8 6 cccHca 31...1321.1.12.11.11.212.25.2.22211.111..1221.. 10...0100.0.00.00.00.000.01.0.00000.000..0000.. 37 24.7 14.8 6 zccHca ..2.1.251.1.....1...211...222.2...1....2.113211 ..0.0.010.0.....0...000...000.0...0....0.001000 36 21.3 12.9 6 zccc8a 11.21.1..4.1.11.21..3.122...2..2..11..11.2..2.. 00.10.0..1.0.00.10..1.011...1..1..00..00.1..1.. 21 24.7 13.5 6 ezcc8a ...1..1.1.12..111...1.....1..2...2..2...1..11.1 ...0..0.0.01..000...0.....0..1...1..1...0..00.0 183 21.4 13.3 7 qoHcc8a 449.44212683996225532.3863523222.1584335343353. 001.00000000100000000.0000000000.0000000000000. 35 24.5 11.4 7 ccccHca .1...11..121.1.1111..1..213611...1...1111..2.1. .0...00..010.0.0000..0..101200...0...0000..1.0. 31 23.6 12.2 7 zcccHca .1.21.1..112....1.1.......4521....2.11.21.1.... .0.10.0..001....0.0.......1110....1.00.10.0.... 
23 20.4 14.8 7 oeccc8a ..1131..2.2....2...11.....1...1...2..11.1.....2 ..0010..1.1....1...00.....0...0...1..00.0.....1 Recomputing the coarse table: cat bio-j-jsa-gut.wds \ | sed \ -e 's/[ql]j/H/g' \ -e 's/[ql]g/P/g' \ -e 's/ij/k/g' \ -e 's/ix/e/g' \ -e 's/is/r/g' \ -e 's/iiu/n/g' \ -e 's/cy/a/g' \ -e 's/ci/a/g' \ -e 's/in/m/g' \ -e 's/ir/w/g' \ -e 's/cs/z/g' \ -e 's/cg/8/g' \ | enum-words-in-blocks -vWPB=666 \ | sort +1 -2 +0 -1n \ | make-word-location-map -vCTWD=3 -vPERCENT=1 -vNBLOCKS=7 \ > .bar Results posted in my Voynich page. For comparison, let's try English and Portuguese: cat engl.wds | tr '[A-Z]' '[a-z]' | head -4661 \ | enum-words-in-blocks -vWPB=100 \ | sort +1 -2 +0 -1n \ | make-word-location-map -vCTWD=1 -vPERCENT=1 -vNBLOCKS=47 \ > .baz TOTAL AVG DEV WORD ABSOLUTE FREQUENCY BY BLOCK RELATIVE FREQUENCY BY BLOCK ----- ----- ----- ---------------- ----------------------------------------------- ----------------------------------------------- 199 24.3 14.0 the 9.134354855516419225114342267572.41699419576541 0.000000000000000000000000000000.00000000000000 165 23.2 13.0 a 14335233324554364463275662314144313.55413634432 00000000000000000000000000000000000.00000000000 117 25.5 13.1 and 21122322313312.3323521..52255444113423313334332 00000000000000.0000000..00000000000000000000000 114 23.4 12.9 of 211242314.3123632464.23432.3141.1262.431842..31 000000000.0000000000.00000.0000.0000.000100..00 114 24.1 14.3 to 233324312221261.2325331321212321332331122246423 000000000000000.0000000000000000000000000000000 105 23.3 13.3 i 356.12.114113.34.312233117213512125221445132..1 001.00.000000.00.000000001000000000000000000..0 80 24.7 13.6 in 32.1.21.12.35132212221321.3221221..22221.324421 00.0.00.00.01000000000000.0000000..00000.000000 59 25.1 12.5 she ..21.214..11112...1311.14332.114222111...12.321 ..00.001..00000...0000.01000.001000000...00.000 58 24.4 15.0 was 1312.11131.221..112213.11.12.3..1.12....125432. 0000.00000.000..000000.00.00.0..0.00....001100. 
54 27.0 13.5 her ..2311.2....1112.1.31...313...1..3652..112.1213 ..0100.0....0000.0.10...101...0..1110..000.0001 51 26.1 13.2 that ..122.11111...22..2.3121..21.51.1..11.423.23.2. ..000.00000...00..0.1000..00.10.0..00.101.01.0. 50 22.8 11.3 you ..2..21411..22..11..3.3.144...119.12...3.....1. ..0..00100..00..00..1.1.011...002.00...1.....0. 45 21.7 17.0 had 142361...1.11..1.1.1.1.......32.....11124.1213. 010110...0.00..0.0.0.0.......10.....00001.0001. 43 23.7 15.9 as 11223121.2.1...1..1.1...111.1.1.312.21.1.3.1122 00001000.0.0...0..0.0...000.0.0.100.00.0.1.0000 42 22.3 13.3 my 222..1...121.123.1..11.1111222.2..2....412....1 000..0...000.001.0..00.0000000.0..0....100....0 38 18.6 12.6 he .222311.1.12...1.23..4.1....4.1.2....112...1... .000100.0.00...0.01..1.0....1.0.0....000...0... 38 20.6 13.6 at 11.121111142...112...11..21.2.1111.....112111.. 00.000000010...000...00..00.0.0000.....000000.. 38 24.1 14.4 with 1.11.1.111.2311..32....111.12.1....1..1.2.2.33. 0.00.0.000.0100..10....000.00.0....0..0.0.0.11. 34 19.9 10.9 it 11.....1323.1...131.1115..1..2..21....1...11... 00.....0111.0...010.0001..0..1..10....0...00... 30 24.5 15.4 for .2.212.11.1..1..1........111..1.11.121.2221...1 .1.101.00.0..0..0........000..0.00.010.1110...0 30 26.1 9.4 me .1......1.....11.11.22.2113211..1112..1.12..... .0......0.....00.00.11.1001100..0001..0.01..... 27 23.2 10.3 is ......12..2..1.2..2.22..1212.......12121....... ......01..1..0.1..1.11..0101.......01010....... 27 29.8 12.5 mrs ..1..1.........121.11..2..2.11.1.1..1.1..21231. ..0..0.........010.00..1..1.00.0.0..0.0..10110. 26 14.3 11.7 his .3.324....1..1...22111.......11..2........1.... .1.111....0..0...11000.......00..1........0.... 24 30.9 14.2 we 11.......1.1....2..........2.13..1...2....13.5. 00.......0.0....1..........1.01..0...1....01.2. 23 25.0 12.9 on ...1..2.1.1..1...121.1...2.....11.11.2..1.2..1. ...0..1.0.0..0...010.0...1.....00.00.1..0.1..0. 22 23.2 13.2 be ..2....2....22.1.....112..1.1..1......12.111... 
..1....1....11.0.....001..0.0..0......01.000...
   21  25.3  13.1 up               .1....1.1.21.1........1..1121....111...3..1...1 .0....0.0.10.0........0..0010....000...1..0...0
   20  21.9  14.9 an               1.11..1.1.111.11..1........1...2.1.1...1..1..2. 0.00..0.0.000.00..0........0...1.0.0...0..0..1.
   20  22.2  12.1 john             .1..11..11...111.1.......1111.2..1..112........ .0..00..00...000.0.......0000.1..0..001........
   20  23.3  13.5 but              ....2.22..........1..122..1.....111..1....1.11. ....1.11..........0..011..0.....000..0....0.00.

    cat port.wds | tr '[A-Z]' '[a-z]' \
      | egrep -v '^x$' \
      | head -4661 \
      | enum-words-in-blocks -vWPB=100 \
      | sort +1 -2 +0 -1n \
      | make-word-location-map -vCTWD=1 -vPERCENT=1 -vNBLOCKS=47 \
      > .boh

97-08-02 stolfi
===============

A wild guess: the `P' gallows may be just an ornate form of the \s/
plume that can occur on top of the \cc/ ligature (FSG [T], Currier [S],
Frogguy ). I.e. the `cPc' combination is a variant of `zc' (FSG [S],
Currier [Z], Frogguy ). Let's check:

    cat bio-j-jsa-gut.wds \
      | sed \
          -e 's/^/_/g' \
          -e 's/$/_/g' \
          -e 's/[ql]j/H/g' \
          -e 's/[ql]g/P/g' \
          -e 's/cs/z/g' \
          -e 's/ij/k/g' \
          -e 's/ix/e/g' \
          -e 's/is/r/g' \
          -e 's/iiu/n/g' \
          -e 's/y/i/g' \
          -e 's/ci/a/g' \
          -e 's/cg/8/g' \
          -e 's/ir/w/g' \
          -e 's/in/m/g' \
      | compare-contexts -lctx 1 -rctx 1 -colw 24 \
          'cPc' \
          'cHc' \
          'zc'

       10 0.45 ccPcc          170 0.61 ccHca          529 0.64 _zcc
        5 0.23 _cPcc           46 0.16 ccHcc           87 0.11 ezcc
        2 0.09 ccPca           16 0.06 ccHc8           32 0.04 Hzcc
        2 0.09 _cPca           10 0.04 ocHcc           27 0.03 8zcc
        1 0.05 ocPcc            8 0.03 _cHcc           19 0.02 azcc
        1 0.05 _cPco            6 0.02 ocHca           19 0.02 _zco
        1 0.05 _cPc8            4 0.01 qcHcc           16 0.02 _zca
    ----- ---- ----             3 0.01 _cHc8           13 0.02 ezc8
       22 1.00 TOT              2 0.01 zcHcc           12 0.01 rzcc
                                2 0.01 qcHc8           10 0.01 Pzcc
                                2 0.01 ccHco            8 0.01 zzcc
                                2 0.01 acHca            7 0.01 Hzc8
                                2 0.01 _cHco            6 0.01 _zcH
                                2 0.01 _cHca            6 0.01 _zc8
                                1 0.00 zcHco            4 0.00 ozcc
                                1 0.00 qcHca            3 0.00 _zce
                                1 0.00 ocHco            3 0.00 8zco
                                1 0.00 ccHc_            2 0.00 ezco
                                1 0.00 acHcc            2 0.00 ezca
                            ----- ---- ----             2 0.00 czcc
                              280 1.00 TOT              2 0.00 Pzc8
                                                        2 0.00 Hzca
                                                        1 0.00 zzcH
                                                        1 0.00 rzc8
                                                        1 0.00 ezcz
                                                        1 0.00 ezce
                                                        1 0.00 ezc_
                                                        1 0.00 ezcH
                                                        1 0.00 czco
                                                        1 0.00 czca
                                                        1 0.00 czc8
                                                        1 0.00 _zcr
                                                        1 0.00 _zc_
                                                        1 0.00 Pzco
                                                        1 0.00 Pzca
                                                        1 0.00 8zc8
                                                    ----- ---- ----
                                                      825 1.00 TOT

Hmmm... `cPc' is not much like `zc', but not too unlike either.
It resembles `zc' more than it resembles `cHc'.

    cat bio-j-jsa-gut.wds \
      | sed \
          -e 's/^/_/g' \
          -e 's/$/_/g' \
          -e 's/[ql]j/H/g' \
          -e 's/[ql]g/P/g' \
          -e 's/cs/z/g' \
          -e 's/ij/k/g' \
          -e 's/ix/e/g' \
          -e 's/is/r/g' \
          -e 's/iiu/n/g' \
          -e 's/y/i/g' \
          -e 's/ci/a/g' \
          -e 's/cg/8/g' \
          -e 's/ir/w/g' \
          -e 's/in/m/g' \
      > .wds

    set ff = ( 'cPc' 'zc' )
    set ofiles = ( )
    foreach f ( $ff )
      cat .wds \
        | grep $f \
        | sed -e "s/${f}/@/g" \
        | sort | uniq -c \
        > ${f}.wds
      set ofiles = ( ${ofiles} ${f}.wds )
    end
    /n/gnu/bin/join -a1 -e '---' -j1 2 -j2 2 -o 0,1.1,2.1 ${ofiles} \
      | gawk '/./ { printf "%5d %5d %-16s\n", $2, $3, $1 }' \
      | sort -nr
    unset ff
    unset ofiles

      zc  cPc word
    ---- ---- -------
      69    3 _@ca_
      36    1 _@cc8a_
      23    1 _@cca_
      11    1 _@oe_
       5    1 _@ar_
       1    1 _@ae_
       0    2 _zcc@a_
       0    2 _zc@ca_
       0    2 _cc@ca_
       0    1 _zco@c8a_
       0    1 _zc@cc8a_
       0    1 _ecc@c8a_
       0    1 _ccc@c8a_
       0    1 _cc@cc8a_
       0    1 _cc@c8a_
       0    1 _c@cc8a_
       0    1 _@8or_

Hmm. Not exactly impressive. But not discouraging either...

OK, now something completely different again.
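Before moving on: the frame comparison just done with grep, sed, uniq and join can be sketched in a few lines of Python. This is only an illustration under my own assumptions, not the scripts actually used; the function names `frame_counts` and `side_by_side` are hypothetical, and the input is assumed to be a list of already recoded words (without the `_' delimiters, which the sketch adds itself).

```python
from collections import Counter

def frame_counts(words, patterns, marker="@"):
    """For each pattern, replace its occurrences in every word by `marker`
    and count the resulting frames. Words are wrapped in '_' first, so that
    word-initial and word-final positions are visible in the frame."""
    tables = {}
    for pat in patterns:
        frames = Counter()
        for w in words:
            w = "_" + w + "_"
            if pat in w:
                frames[w.replace(pat, marker)] += 1
        tables[pat] = frames
    return tables

def side_by_side(tables, first, second):
    """Rows (count_first, count_second, frame) for every frame seen with
    `first`; frames that never occur with `second` get a count of 0."""
    rows = [(n, tables[second].get(frame, 0), frame)
            for frame, n in tables[first].items()]
    return sorted(rows, reverse=True)

if __name__ == "__main__":
    # Toy input, not real data from the manuscript.
    toy = ["zcca", "zcca", "cPcca", "cccPca"]
    for row in side_by_side(frame_counts(toy, ["cPc", "zc"]), "cPc", "zc"):
        print("%5d %5d %-16s" % row)
```

Like the `join -a1' call, the sketch lists only the frames that occur with the first pattern; frames that occur only with the second pattern are ignored.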
Let's look for collocates of the popular words:

    cat bio-j-jsa.wds \
      | sed \
          -e 's/[ql]j/H/g' \
          -e 's/[ql]g/P/g' \
          -e 's/cs/z/g' \
          -e 's/ij/k/g' \
          -e 's/ix/e/g' \
          -e 's/is/r/g' \
          -e 's/iiu/n/g' \
          -e 's/y/i/g' \
          -e 's/ci/a/g' \
          -e 's/cg/8/g' \
          -e 's/ir/w/g' \
          -e 's/in/m/g' \
      > .wds

    foreach f ( `cat .popular.wds` )
      ( echo ' ' ;\
        echo ' after '"$f" ;\
        echo ' -----------------------' ;\
        cat .wds \
          | enum-words-after -vWORD=${f} \
          | wfreq \
          | head -5 \
      ) \
      | gawk '/./ { printf "%-24s\n", $0 }'
    end

     after zcca               after ccca
    -----------------------  -----------------------
        4 0.06 qoHc?m            4 0.06 qoHc?m
        4 0.06 qoHc8a            4 0.06 //
        4 0.06 qoHam             3 0.04 qoHcc8a
        3 0.04 qoHcc8a           2 0.03 rae
        3 0.04 qoHa              2 0.03 qoe

     after zcc8a              after ccc8a              after oeHc8a
    -----------------------  -----------------------  -----------------------
       16 0.08 qoHc8a           11 0.06 //                4 0.21 //
       13 0.06 qoHcc8a           9 0.05 qoHc8a            2 0.11 qoHc8a
       11 0.05 qoHam             8 0.05 qoe               1 0.05 zcccHcc8a
        9 0.04 //                8 0.05 qoHc?m            1 0.05 r??r
        8 0.04 qoe               6 0.03 qoHar             1 0.05 qoe?ccca

     after ezcc8a             after oeccc8a
    -----------------------  -----------------------
        3 0.14 //                6 0.26 //
        2 0.10 qoHcc8a           2 0.09 qoHcc8a
        2 0.10 qoHa              1 0.04 qor
        1 0.05 qoeccc8a          1 0.04 qoe?zcc8a
        1 0.05 qoe               1 0.04 qoe

     after qoHc8a             after qoHcc8a            after Hc8a               after oHc8a
    -----------------------  -----------------------  -----------------------  -----------------------
       14 0.07 qoHcc8a          15 0.08 qoHc8a            2 0.08 qoHc8a            8 0.10 qoHc8a
       14 0.07 qoHc8a            8 0.04 qoHcc8a           2 0.08 oHc8a             4 0.05 zcc8a
        8 0.04 zcc8a             8 0.04 qoHae             2 0.08 ccc8a             4 0.05 oHc8a
        8 0.04 ccc8a             6 0.03 ccc8a             2 0.08 8ar               3 0.04 Hc8a
        7 0.04 oHc8a             5 0.03 qoHcca            2 0.08 //                3 0.04 //

     after zccc8a             after cccc8a             after eccc8a
    -----------------------  -----------------------  -----------------------
        5 0.14 qoHcc8a           3 0.16 qoHcc8a           9 0.17 //
        3 0.08 qoe               2 0.11 qoHam             3 0.06 zcc8a
        3 0.08 qoHc8a            2 0.11 qoHae             3 0.06 qoHcc8a
        2 0.06 qoHcca            2 0.11 //                3 0.06 qoHam
        1 0.03 zccHa             1 0.05 z??e              3 0.06 qoHa

     after oHcc8a
    -----------------------
        5 0.09 //
        4 0.07 qoHa
        3 0.05 zcc8a
        3 0.05
qoHcc8a 3 0.05 qoHc8a after zccHca ----------------------- 3 0.08 qoHc8a 2 0.05 qoea 2 0.05 qoHca 1 0.03 zoea 1 0.03 zcc8a after cccHca ----------------------- 4 0.08 qoHa 4 0.08 // 3 0.06 qoHc?m 2 0.04 zae 2 0.04 qoHcca after ccccHca after zcccHca ----------------------- ----------------------- 3 0.09 qoHae 3 0.10 qoHc?m 2 0.06 qoHc?m 3 0.10 qoHae 2 0.06 qoHa 2 0.06 qoHcc8a 2 0.06 oHae 2 0.06 qoHc8a 2 0.06 eor 1 0.03 zcca after oHcca ----------------------- 2 0.06 qoe 2 0.06 oHa 1 0.03 raecce 1 0.03 qoe?ccca 1 0.03 qoHoe after cccca after zccca ----------------------- ----------------------- 2 0.06 ram 2 0.09 qoHc?m 2 0.06 qoHc?m 2 0.09 qoHc8a 2 0.06 qoHam 2 0.09 qoHam 1 0.03 zcccHcca 2 0.09 or 1 0.03 zcccHca 1 0.04 ra after qoHca after oHca ----------------------- ----------------------- 3 0.07 qoHae 2 0.10 qoHc8a 3 0.07 qoHa 1 0.05 zccHcc8a 3 0.07 ezcc8a 1 0.05 qoezca 2 0.05 qoHc?m 1 0.05 qoHca 2 0.05 qoHc8a 1 0.05 qoHan after qoHcca after oeHcca ----------------------- ----------------------- 6 0.07 qoHc8a 1 0.05 zcoHam 4 0.05 oHcca 1 0.05 rccz 3 0.04 zcc8a 1 0.05 rccca 3 0.04 ram 1 0.05 ram 3 0.04 qoHae 1 0.05 rae after oe after qoe after or ----------------------- ----------------------- ----------------------- 16 0.13 // 9 0.11 ccc8a 3 0.07 zcc8a 10 0.08 zcc8a 9 0.11 // 3 0.07 // 6 0.05 ccc8a 6 0.07 zcc8a 2 0.05 or 3 0.02 zccc8a 4 0.05 cccc8a 2 0.05 ccca 3 0.02 oHcc8a 3 0.04 oe 2 0.05 ae after eoe ----------------------- 10 0.59 // 1 0.06 oeoe??joe?Hc8a 1 0.06 cccor 1 0.06 ccca??scca 1 0.06 ccca after qoHoe ----------------------- 3 0.14 zcc8a 2 0.10 ccoe 2 0.10 cc8a 1 0.05 zccc8a 1 0.05 zccHcca after qoHae after qoHam after qoHan after qoHar ----------------------- ----------------------- ----------------------- ----------------------- 9 0.08 ccc8a 6 0.07 zcc8a 6 0.11 cccHca 7 0.15 zcc8a 9 0.08 // 5 0.05 zccHca 3 0.06 ccc8a 4 0.08 oe 7 0.06 zcc8a 5 0.05 ccccHca 3 0.06 8ar 2 0.04 zcca 5 0.04 8ar 5 0.05 ccc8a 2 0.04 zccoe 2 0.04 cccHca 3 0.03 zcccHca 5 0.05 
// 2 0.04 oHc8a 2 0.04 ccc8a after oHae after oHam after oHar ----------------------- ----------------------- ----------------------- 4 0.10 ccc8a 3 0.08 zcc8a 4 0.11 zcc8a 4 0.10 // 3 0.08 // 3 0.09 oHc8a 2 0.05 8ae 2 0.05 oHc?m 3 0.09 ccc8a 1 0.03 zccca 2 0.05 oHc8a 2 0.06 oe 1 0.03 zccc8a 2 0.05 cccHca 2 0.06 8ar after qoHa ----------------------- 20 0.25 // 4 0.05 8am 3 0.04 zam 3 0.04 qoHae 3 0.04 ccc8a after 8am after 8ar after 8ae ----------------------- ----------------------- ----------------------- 8 0.11 // 5 0.10 // 8 0.16 // 5 0.07 ccca 4 0.08 zcc8a 4 0.08 zcc8a 3 0.04 zcca 4 0.08 oe 2 0.04 or 3 0.04 zcc8a 2 0.04 zcca 2 0.04 oeccc8a 2 0.03 zccHca 1 0.02 zccor 2 0.04 eccc8a after 8a ----------------------- 11 0.31 // 3 0.09 8am 2 0.06 qoHae 1 0.03 zcccHa 1 0.03 zcca after oHa ----------------------- 7 0.28 // 1 0.04 zcoe 1 0.04 zcca 1 0.04 qoHor 1 0.04 qoHca after zam after zoe ----------------------- ----------------------- 3 0.10 // 3 0.12 zcc8a 2 0.06 zcca 2 0.08 ccca 2 0.06 zcc8a 2 0.08 ccc8a 2 0.06 ccc8a 1 0.04 zcccoe 1 0.03 zcccHca 1 0.04 zccca after am ----------------------- 2 0.10 oHc?m 2 0.10 // 1 0.05 zcc8a 1 0.05 z?cc? 
1 0.05 z?Hae after oea ----------------------- 18 0.78 // 2 0.09 zcca 1 0.04 qoHcc8a 1 0.04 oHae 1 0.04 cccc8a after // ----------------------- 75 0.10 = 28 0.04 qoHcc8a 15 0.02 zoe 15 0.02 qoHc8a 15 0.02 8zcc8a after = ----------------------- 3 0.04 Poe 2 0.03 Poeccc8a 2 0.03 Pccc8ar 1 0.01 zcca 1 0.01 zaHam Once again, for the words that are often followed by `//' (which may be actually half-words), skipping the `//': foreach f ( ccc8a oeHc8a ezcc8a oeccc8a eccc8a oHcc8a oe qoe or eoe 8am 8ar 8ae 8a oHa zam oea ) ( echo ' ' ;\ echo ' after '"$f" ;\ echo ' -----------------------' ;\ cat .wds \ | grep -v '//' \ | enum-words-after -vWORD=${f} \ | wfreq \ | head -5 \ ) \ | gawk '/./ { printf "%-24s\n", $0 }' end after ccc8a after oeHc8a ----------------------- ----------------------- 9 0.05 qoHc8a 2 0.11 qoHc8a 8 0.05 qoe 1 0.05 zcccHcc8a 8 0.05 qoHc?m 1 0.05 r??r 6 0.03 qoHar 1 0.05 qoe?ccca 6 0.03 qoHae 1 0.05 qoHam after ezcc8a after eccc8a after oeccc8a after oHcc8a ----------------------- ----------------------- ----------------------- ----------------------- 2 0.10 qoHcc8a 3 0.06 zcc8a 2 0.09 qoHcc8a 5 0.09 qoHcc8a 2 0.10 qoHa 3 0.06 qoHcc8a 2 0.09 qoHc?m 4 0.07 qoHa 2 0.10 = 3 0.06 qoHam 1 0.04 qor 3 0.05 zcc8a 1 0.05 qoeccc8a 3 0.06 qoHa 1 0.04 qoe?zcc8a 3 0.05 qoHc8a 1 0.05 qoe 2 0.04 qoHc8a 1 0.04 qoe 2 0.04 oeHc?m after oe after qoe ----------------------- ----------------------- 10 0.08 zcc8a 9 0.11 ccc8a 6 0.05 ccc8a 6 0.07 zcc8a 3 0.02 zccc8a 4 0.05 cccc8a 3 0.02 qoHcc8a 3 0.04 qoe 3 0.02 oHcc8a 3 0.04 oe after or ----------------------- 3 0.07 zcc8a 2 0.05 or 2 0.05 ccca 2 0.05 ae 1 0.03 zccoeo after eoe ----------------------- 1 0.06 zzcc8a 1 0.06 qocHca 1 0.06 qoHcca 1 0.06 qoHcc8a 1 0.06 qoHan after 8am after 8ar after 8ae ----------------------- ----------------------- ----------------------- 5 0.07 ccca 4 0.08 zcc8a 4 0.08 zcc8a 3 0.04 zcca 4 0.08 oe 3 0.06 qoHc8a 3 0.04 zcc8a 2 0.04 zcca 2 0.04 or 3 0.04 oHc8a 2 0.04 qoHcca 2 0.04 oeccc8a 2 0.03 
zccHca 2 0.04 8ar 2 0.04 eccc8a after 8a ----------------------- 3 0.09 qoHcc8a 3 0.09 qoHae 3 0.09 8am 2 0.06 qoHam 1 0.03 zzcc8a after oHa ----------------------- 2 0.08 qoHca 1 0.04 zoeHa 1 0.04 zcoe 1 0.04 zcca 1 0.04 zc?m after zam ----------------------- 2 0.06 zcca 2 0.06 zcc8a 2 0.06 ccc8a 1 0.03 zoeHcc8a 1 0.03 zcccHca after oea ----------------------- 2 0.09 zcca 2 0.09 zc?m 2 0.09 qoe 1 0.04 zoeccca 1 0.04 zcoe Let's compare with English: cat engl.wds \ | head -7053 \ | tr '[A-Z]' '[a-z]' \ > .e.wds foreach f ( `cat .popengl.wds` ) ( echo ' ' ;\ echo ' after '"$f" ;\ echo ' -----------------------' ;\ cat .e.wds \ | enum-words-after -vWORD=${f} \ | wfreq \ | head -5 \ ) \ | gawk '/./ { printf "%-24s\n", $0 }' end after the ----------------------- 11 0.03 door 9 0.03 house 7 0.02 hall 6 0.02 village 4 0.01 same after a ----------------------- 7 0.03 very 7 0.03 great 7 0.03 few 5 0.02 little 4 0.02 man after and ----------------------- 13 0.07 i 9 0.05 the 4 0.02 we 4 0.02 that 4 0.02 he after of ----------------------- 34 0.20 the 12 0.07 a 9 0.05 his 8 0.05 her 8 0.05 course after to ----------------------- 23 0.12 the 11 0.06 be 9 0.05 me 7 0.04 her 5 0.03 see after i ----------------------- 15 0.09 had 8 0.05 was 6 0.04 will 6 0.04 have 6 0.04 asked after in ----------------------- 31 0.27 the 15 0.13 a 7 0.06 his 6 0.05 her 3 0.03 an after she ----------------------- 9 0.11 was 4 0.05 is 3 0.04 seemed 3 0.04 said 3 0.04 looked after was ----------------------- 19 0.18 a 5 0.05 to 4 0.04 in 3 0.03 waiting 3 0.03 very after her ----------------------- 5 0.07 hand 5 0.07 as 3 0.04 own 3 0.04 husband 2 0.03 to after that ----------------------- 6 0.08 i 3 0.04 we 3 0.04 she 3 0.04 night 3 0.04 he after you ----------------------- 5 0.06 think 5 0.06 know 4 0.05 could 4 0.05 are 3 0.04 in after had ----------------------- 13 0.19 a 9 0.13 been 2 0.03 taken 2 0.03 seen 2 0.03 occurred after as ----------------------- 9 0.16 i 8 0.14 a 6 0.11 we 5 0.09 she 
4 0.07 he after my ----------------------- 5 0.08 mind 4 0.07 dear 3 0.05 first 2 0.03 window 2 0.03 wife after he ----------------------- 9 0.14 was 8 0.13 had 3 0.05 turned 2 0.03 looked 2 0.03 came after at ----------------------- 11 0.22 the 5 0.10 once 4 0.08 styles 3 0.06 her 2 0.04 tadminster after with ----------------------- 14 0.24 a 9 0.16 the 3 0.05 us 2 0.03 you 2 0.03 some after it ----------------------- 11 0.17 was 5 0.08 to 5 0.08 is 3 0.05 seemed 2 0.03 the after for ----------------------- 6 0.15 the 5 0.13 some 5 0.13 a 3 0.08 me 2 0.05 us after me ----------------------- 3 0.07 that 2 0.04 with 2 0.04 to 2 0.04 the 2 0.04 she after is ----------------------- 4 0.10 a 3 0.08 very 2 0.05 up 2 0.05 to 2 0.05 mrs after mrs ----------------------- 24 0.56 inglethorp 10 0.23 cavendish 3 0.07 inglethorps 2 0.05 cavendishs 1 0.02 rolleston after his ----------------------- 4 0.09 wife 4 0.09 face 2 0.04 mothers 2 0.04 manner 2 0.04 brother after we ----------------------- 7 0.16 had 4 0.09 are 3 0.07 drove 2 0.04 were 2 0.04 should after on ----------------------- 15 0.44 the 3 0.09 a 2 0.06 our 2 0.06 his 1 0.03 to after be ----------------------- 4 0.12 done 4 0.12 a 2 0.06 mine 2 0.06 able 1 0.03 was after up ----------------------- 6 0.19 in 4 0.12 to 4 0.12 at 2 0.06 the 2 0.06 my after an ----------------------- 3 0.12 old 1 0.04 otherwise 1 0.04 orphan 1 0.04 inaccessible 1 0.04 impression after john ----------------------- 5 0.18 cavendish 2 0.07 he 1 0.04 with 1 0.04 will 1 0.04 was after but ----------------------- 4 0.12 she 4 0.12 im 2 0.06 there 2 0.06 the 2 0.06 as after him ----------------------- 3 0.09 i 3 0.09 and 2 0.06 from 2 0.06 at 1 0.03 youve Seems great, let's try it for Portuguese: cat port.wds \ | head -7053 \ | tr '[A-Z]' '[a-z]' \ > .p.wds foreach f ( `cat .popport.wds` ) ( echo ' ' ;\ echo ' after '"$f" ;\ echo ' -----------------------' ;\ cat .p.wds \ | enum-words-after -vWORD=${f} \ | wfreq \ | head -5 \ ) \ | gawk '/./ 
{ printf "%-24s\n", $0 }' end after de ----------------------- 94 0.23 x 36 0.09 um 16 0.04 colagem 14 0.03 triângulos 13 0.03 tipo after a ----------------------- 18 0.07 superfície 18 0.07 figura 13 0.05 topologia 12 0.04 x 12 0.04 aresta after e ----------------------- 46 0.25 x 11 0.06 a 7 0.04 o 7 0.04 faces 6 0.03 que after que ----------------------- 13 0.07 a 11 0.06 x 8 0.05 os 8 0.05 cada 6 0.03 são after um ----------------------- 22 0.16 complexo 10 0.07 vértice 7 0.05 ladrilho 6 0.04 modelo 6 0.04 arco after é ----------------------- 16 0.15 um 9 0.08 uma 9 0.08 o 6 0.06 a 5 0.05 possível after da ----------------------- 20 0.19 superfície 16 0.15 aresta 9 0.08 triangulação 9 0.08 mesma 6 0.06 figura after uma ----------------------- 11 0.11 aresta 8 0.08 função 5 0.05 variedade 5 0.05 superfície 5 0.05 configuração after o ----------------------- 6 0.06 ladrilho 5 0.05 complexo 5 0.05 arco 4 0.04 problema 4 0.04 primeiro after do ----------------------- 26 0.34 complexo 11 0.14 ladrilho 5 0.06 objeto 3 0.04 x 3 0.04 plano after aresta ----------------------- 19 0.28 x 9 0.13 de 5 0.07 orientada 5 0.07 dual 2 0.03 veja after para ----------------------- 13 0.17 a 7 0.09 o 6 0.08 todo 6 0.08 cada 4 0.05 os after complexo ----------------------- 25 0.41 celular 5 0.08 original 4 0.07 x 2 0.03 dado 2 0.03 com after em ----------------------- 6 0.09 x 5 0.08 cada 4 0.06 que 3 0.05 vez 3 0.05 três after os ----------------------- 14 0.21 vértices 5 0.07 arcos 4 0.06 pontos 4 0.06 dois 3 0.04 ângulos after por ----------------------- 11 0.18 um 9 0.15 x 6 0.10 uma 6 0.10 exemplo 3 0.05 outro after cada ----------------------- 12 0.20 aresta 10 0.16 ladrilho 6 0.10 um 5 0.08 vértice 5 0.08 face after as ----------------------- 16 0.23 arestas 8 0.11 faces 6 0.09 funções 5 0.07 duas 3 0.04 relações after com ----------------------- 8 0.14 x 5 0.09 a 4 0.07 vértices 4 0.07 o 4 0.07 mesma after arestas ----------------------- 9 0.15 e 5 0.08 x 5 0.08 de 5 0.08 
da 2 0.03 ve after superfície ----------------------- 5 0.10 e 5 0.10 de 3 0.06 que 3 0.06 como 2 0.04 seja after no ----------------------- 8 0.20 sentido 4 0.10 mesmo 3 0.07 espaço 2 0.05 plano 2 0.05 máximo after são ----------------------- 4 0.10 os 2 0.05 suficientes 2 0.05 similares 2 0.05 representados 2 0.05 partes after vértices ----------------------- 17 0.31 de 7 0.13 e 3 0.05 do 2 0.04 distintos 2 0.04 a after face ----------------------- 7 0.23 x 4 0.13 de 3 0.10 esquerda 2 0.06 seja 2 0.06 e after faces ----------------------- 5 0.15 de 4 0.12 adjacentes 3 0.09 e 2 0.06 vértices 2 0.06 triangulares after arco ----------------------- 11 0.39 x 4 0.14 encontrado 3 0.11 de 2 0.07 inicial 2 0.07 com after como ----------------------- 5 0.14 sendo 5 0.14 a 4 0.11 uma 3 0.08 um 2 0.05 x after na ----------------------- 9 0.23 figura 4 0.10 fronteira 3 0.07 verdade 3 0.07 superfície 3 0.07 etapa after se ----------------------- 6 0.16 x 2 0.05 toda 2 0.05 elas 2 0.05 comporta 2 0.05 a after figura ----------------------- 7 0.18 a 3 0.08 note 2 0.05 portanto 2 0.05 mostra 2 0.05 mais after ser ----------------------- 2 0.06 feita 2 0.06 colado 1 0.03 úteis 1 0.03 visto 1 0.03 variedades after ao ----------------------- 6 0.19 mesmo 5 0.16 longo 4 0.12 redor 4 0.12 avançarmos 2 0.06 ladrilho after ou ----------------------- 6 0.16 seja 5 0.14 x 3 0.08 em 2 0.05 dois 2 0.05 curvas after celular ----------------------- 8 0.32 x 3 0.12 é 3 0.12 e 2 0.08 original 2 0.08 existem after vértice ----------------------- 11 0.31 de 5 0.14 x 2 0.06 minimizar 2 0.06 mais 2 0.06 l after ladrilho ----------------------- 5 0.11 de 3 0.07 x 3 0.07 vértice 2 0.04 é 2 0.04 tem after não ----------------------- 3 0.11 é 2 0.07 tem 2 0.07 são 2 0.07 ocorrem 1 0.04 têm after dos ----------------------- 3 0.10 vértices 3 0.10 triângulos 3 0.10 quais 2 0.06 x 2 0.06 polinômios after topologia ----------------------- 6 0.30 de 4 0.20 do 3 0.15 da 2 0.10 das 1 0.05 prova-se after 
complexos
    -----------------------
       11 0.61 celulares
        4 0.22 que
        1 0.06 orientáveis
        1 0.06 não
        1 0.06 admitem

     after lado
    -----------------------
        4 0.24 da
        2 0.12 x
        2 0.12 do
        2 0.12 de
        2 0.12 a

     after colagem
    -----------------------
        6 0.18 de
        2 0.06 só
        2 0.06 que
        2 0.06 e
        2 0.06 deve

     after tem
    -----------------------
        2 0.10 um
        2 0.10 todos
        1 0.05 x
        1 0.05 uma
        1 0.05 todas

     after das
    -----------------------
        9 0.36 arestas
        5 0.20 funções
        4 0.16 faces
        2 0.08 energias
        1 0.04 idéias

97-08-08 stolfi
===============

Over the past weekend I stayed home and played a bit with the
word-pair tables above. I printed the Voynich word-pair table and cut
it up into little index cards, one for each left-word. Then I tried to
group the left-words into classes, based on the most popular words
that followed them. I identified the following classes:

(1) positional class: a coarse classification, based on how often the
    word occurs in line-final position, i.e. right before "//".

    Very often final:
      oea 8a qoHa oHa eoe

    Moderately often final:
      ezcc8a oeccc8a eccc8a am ccca ccc8a oeHc8a or oHae qoHae qoe oe
      qoHam oHam zam 8am 8ar 8ae oHcc8a cccHca Hc8a qoHcca oHc8a

    Rarely if ever final:
      cccca oeHcca oHcca oHca zccHca zcccHca qoHan qoHcc8a qoHc8a
      qoHca ccccHca zcc8a zccca zcca cccc8a zccc8a

    Presumably, if a word is unusually common in that position, it is
    because it often occurs at the end of sentences, and hence at the
    end of paragraphs, which always end at the end of a line.

(2) post-contextual class: a finer classification, based on the few
    most common words following the word in question (including "//",
    if common enough).
    Mostly followed by {// zcca}:
      oHa oea

    Mostly followed by {// zcc8a ccc8a}:
      qoHam qoHc?m oHam qoHae oHae zamm ram oHc?m 8am 8ar 8ae or oe qoe

    Mostly followed by {// 8am zam}:
      qoHa

    Mostly followed by {// qoHc8a}:
      ccc8a oeHc8a ccca

    Mostly followed by {// qoHa}:
      cccHca oHcc8a

    Mostly followed by {// qoHcc8a qoe}:
      oeccc8a eccc8a ezcc8a

    Mostly followed by {qoHc8a qoHcc8a ccc8a}:
      qoHc8a qoHcc8a

    Mostly followed by {zcc8a ccc8a oe}:
      qoHar oHar qoHoe zoe

    Mostly followed by {qoHam qoHc?m qoHc8a qoHar //}:
      zcca zccca zcc8a zcccHca

    Mostly followed by {qoHcc8a}:
      zccc8a cccc8a

    Mostly followed by {qoHae qoHc?m qoHa}:
      ccccHca qoHca

    Mostly followed by {qoHc8a qoHca}:
      oHca zccHca

    Mostly followed by {qoHc8a oHc8a}:
      Hc8a oHc8a qoHcca

The `qoHc?m' words are generally instances where Friedman has [4ODAM]
and Currier has [4ODAN].

The general impression is that of words in a natural language (as
opposed to random words).

I wrote a script to compute and print word-pair frequencies. To save
memory, the words are divided into two sets, the "keys" K (usually the
20 or so most common words) and the "bores" B (all the rest); and only
the K-K, K-B, and B-K sub-tables are computed.

    cat bio-j-jsa.wds \
      | sed \
          -e 's/[ql]j/H/g' \
          -e 's/[ql]g/P/g' \
          -e 's/cs/z/g' \
          -e 's/ij/k/g' \
          -e 's/ix/e/g' \
          -e 's/is/r/g' \
          -e 's/iiu/n/g' \
          -e 's/y/i/g' \
          -e 's/ci/a/g' \
          -e 's/cg/8/g' \
          -e 's/ir/w/g' \
          -e 's/in/m/g' \
      > .wds

    cat .keys
      //
      zcc8a
      ccc8a
      oe
      oHc8a
      qoHcc8a
      qoHc8a
      qoHa
      qoHae
      qoe
      eccc8a
      oHcc8a
      zccc8a
      ccca
      zcca
      cccHca
      zccHca
      ccccHca
      zcccHca
      zccca
      cccca
      zam
      8am
      8ar
      8ae
      oHae
      oHam
      oHar
      qoHan
      qoHam
      qoHar
      qoHcca
      oHcca
      or

     lines   words     bytes file
    ------ ------- --------- ------------
      7054    7054     43161 .wds
        34      34       192 .keys

To avoid an excessive number of distinct words, I decided to replace
every word containing a `?' by `???'.
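The key/bore split can be sketched as follows. This is a minimal illustration in Python, not the enum-word-pairs and count-diword-freqs scripts themselves; the names `mask_uncertain` and `count_key_pairs` are hypothetical, and I am assuming that a "pair" means two consecutive tokens of the word list (with "//" treated as an ordinary token).

```python
from collections import Counter

def mask_uncertain(words, placeholder="???"):
    """Replace every word containing a '?' by a single placeholder token."""
    return [placeholder if "?" in w else w for w in words]

def count_key_pairs(words, keys):
    """Count consecutive word pairs, keeping only pairs that involve a key.

    Returns three Counters indexed by (left, right): the key-key, key-bore
    and bore-key pairs. Bore-bore pairs are never stored, which is where
    the memory saving comes from."""
    keys = set(keys)
    kk, kb, bk = Counter(), Counter(), Counter()
    for left, right in zip(words, words[1:]):
        if left in keys and right in keys:
            kk[(left, right)] += 1
        elif left in keys:
            kb[(left, right)] += 1
        elif right in keys:
            bk[(left, right)] += 1
    return kk, kb, bk

if __name__ == "__main__":
    # Toy input, not real data from the manuscript.
    toy = mask_uncertain(["zcc8a", "qoHc8a", "qoHc?m", "//", "zcc8a", "qoHc8a"])
    kk, kb, bk = count_key_pairs(toy, ["//", "zcc8a", "qoHc8a"])
    for (left, right), n in sorted(kk.items()):
        print("%5d %s %s" % (n, left, right))
```

Whether the real script keeps one entry per bore word or lumps all bores together is not stated above; the sketch keeps them separate.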
Here are the tables (as redone on 97-08-08): cat .wds \ | sed -e '/?/s/^.*$/???/g' \ | enum-word-pairs \ | count-diword-freqs -v keyfile=.keys max word length = 11 (key,key) word pair counts: ----- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- q c z o q e o z c z c c q z c o H o q c H c c c c c z c q q q o o c c H c H q o c c c c z c c c c c c o o o o o o H H T c c c c c o H q c c c c c H H H H c c z 8 8 8 H H H H H H c c O / 8 8 o 8 8 8 H a o 8 8 8 c c c c c c c c a a a a a a a a a a c c o T / a a e a a a a e e a a a a a a a a a a a m m r e e m r n m r a a r ----- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- // 765 . 2 4 1 3 28 15 2 9 10 3 . 4 1 2 . . . . 1 1 13 15 4 3 1 . . 5 12 2 13 2 1 zcc8a 204 9 5 5 5 3 13 16 5 3 8 . 1 . 2 3 . . 1 . . . . . . 2 1 . . 2 11 2 4 1 . ccc8a 172 11 1 4 1 3 5 9 1 6 8 3 . . . . . 3 1 1 . . . 2 1 . . 1 1 4 4 6 4 . . oe 127 16 10 6 2 1 1 . 2 . . 1 3 3 2 1 2 . 1 . 2 2 1 1 1 1 . 3 . . . 1 . . 1 oHc8a 83 3 4 1 1 4 2 8 2 1 1 2 . 1 . . . 1 . . . 1 . . . . 1 1 . . . 1 . 1 . qoHcc8a 183 4 2 6 . 2 8 15 2 8 2 5 5 . 1 . 1 . 2 . 1 1 . 3 2 . 1 2 . 2 4 5 5 . . qoHc8a 198 6 8 8 2 7 14 14 5 6 4 2 2 1 1 1 2 1 . . . . . 1 2 3 2 1 2 3 . . 2 1 . qoHa 79 20 1 3 . . . 2 1 3 2 2 2 . 1 1 1 . . . . . 3 4 . . . 1 . . . 1 . . . qoHae 113 9 7 9 . . . 1 1 2 2 1 . 1 2 2 3 1 1 3 . . . 2 5 3 . . 1 . . . . 1 1 qoe 81 9 6 9 3 1 . . . 1 2 . . 2 2 . . . . 1 2 1 1 . . . . 2 1 . . . . 1 1 eccc8a 52 9 3 1 1 . 3 2 3 . . . . . 1 . 1 . . . . . . . . . . . . . 3 . . . . oHcc8a 56 5 3 2 1 2 3 3 4 1 1 . . . . 1 . . . . . . . . . . 1 . . . . 1 . . 1 zccc8a 36 . . 1 1 . 5 3 . 1 3 1 . . . . . . . . . 1 . 1 . . . . . . 1 1 2 . . ccca 67 4 . . 1 . 3 1 2 . 2 2 . . . . . . . . . . 1 1 1 . . . 1 1 2 1 1 1 . zcca 69 3 1 1 2 . 3 4 3 . . 1 . . . . . . . . . . . 1 1 . . . . 1 4 1 1 . . cccHca 50 4 1 . . . 1 . 
4 2 . 1 1 . . 1 . 1 . . . . . 1 . . . 1 1 2 1 . 2 . 1 zccHca 37 . 1 . . 1 . 3 1 . . . . . . . 1 . . . . . 1 . . . . . . 1 . . 1 . . ccccHca 35 . . . . . 1 . 2 3 . . 1 . . . . . . . . . . . 1 . 2 . . 1 . . . . 1 zcccHca 31 . . . 1 . 2 2 1 3 1 . . . 1 1 . . . . . . . 1 . 1 1 . . . . 1 1 . . zccca 23 . . . 1 . . 2 . 1 . . . . . . 1 . . . . . . . . 1 . . 1 . 2 . 1 . 2 cccca 31 . . . . 1 . . . . . 1 . . . 1 . . . 1 . . . 1 . 1 . . . . 2 . . . . zam 31 3 2 2 . . . . . . 1 . . 1 1 2 . 1 . 1 . . . 1 . 1 . . . . . . . . . 8am 73 8 3 2 2 2 . . . . . . . . 5 3 1 2 2 1 . . . 1 . . 1 2 1 . . . . . . 8ar 51 5 4 1 4 . . . . . . . . . . 2 . 1 1 . . . . . 1 . 1 1 1 . 1 1 1 . 1 8ae 50 8 4 2 1 . . 1 . . 1 2 . 1 . . 1 . . . . . . 2 1 1 . . . . . . . . 2 oHae 39 4 1 4 1 . . . . . . . 1 1 . 1 1 . . . 1 . . . 1 2 . . . . . . . . . oHam 37 3 3 1 1 2 . . . . . . 1 . 1 . 2 . . . . . . . . . . . . . . . . . . oHar 35 1 4 3 2 3 . 1 . . . . . . . 1 . . . . . . . . 2 1 . . 1 . . . . . . qoHan 54 1 1 3 1 2 . . . . . . . 1 2 . 6 1 . 1 1 1 . 1 3 2 1 2 1 1 . . . . 1 qoHam 91 5 6 5 1 1 1 1 . 1 . . 1 . 2 . 2 5 5 1 . . 2 . . 1 1 1 . . 2 1 . . . qoHar 48 1 7 2 4 . . . . 1 . . 1 1 1 2 2 1 1 . . . . . . . . . 1 . . . . . 1 qoHcca 81 3 3 3 1 1 2 6 1 3 . 1 2 . . . . . . . . . . 2 1 1 1 . . 1 2 . 1 4 1 oHcca 34 1 . . 1 . . . . 1 2 1 . . . . 1 . . . . . . 1 . . . . . . 1 1 1 . . or 40 3 3 1 1 . . . . . . . . 1 2 . . . 1 . 1 . . . . . . . . . . . . . 
2
----- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- ---
TOT 7054 765 204 172 127 83 183 198 79 113 81 52 56 36 67 69 50 37 35 31 23 31 31 73 51 50 39 37 35 54 91 48 81 34 40

Incidentally: Rene Zandbergen just posted his guess at the names of
the planets:

    EVA            Frogguy         JSA                   Tables above   freq
    -------------  --------------  --------------------  -------------  ----
    okal           olpax           oljciix               oHae           97
    dolchsody      8oxctso89       cgoixcccsocgcy        8oecczo8a      0
    yfain          9ljaiv          cylgciiiu             aPan           0
    ytoaiin        9qpoaiiv        cyqjociiiiu           aHoam          0
    ofar,oeoldain  olja2,ocox8aiv  olgciis,ocoixcgciiiu  oPar,ocoe8an   2,0
    opcholdy       oqjctox89       oqgccoixcgcy          oPccoe8a       0
    okain.am       olpaiv aig      oljciiiu ciiij        oHan aik       16 0

97-08-09 stolfi
===============

I prepared a WWW page with the word pair tables above.

I also prepared analogous tables for the English and Portuguese texts:

    cat engl.txt | sed -e 's@$@ //@g' | tr ' ' '\012' | egrep '.' > engl2.wds

    cat engl2-keys.dic
      // the a an and of in on at to for with as up but
      i you he she it is was had be my his her him me that
      mrs john cynthia inglethorp

    cat engl2.wds | tr '[A-Z]' '[a-z]' | head -4661 \
      | enum-word-pairs \
      | count-diword-freqs -v keyfile=engl2-keys.dic \
      > .baz

    cat port.txt | sed -e 's@$@ //@g' | tr ' ' '\012' | egrep '.' > port2.wds

    cat port2-keys.dic
      // a da na o do no ao as das os dos um uma cada
      de em por para e ou como que é ser não são
      aresta face arestas faces complexo vértices celular

    cat port2.wds | sed -e 's/^x$/???/g' | head -7000 \
      | enum-word-pairs \
      | count-diword-freqs -v keyfile=port2-keys.dic \
      > .baz

The results were posted on my Voynich WWW page.

Decided to recompute the tables, adding the left and right probabilities.
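For concreteness, here is one plausible reading of "left and right probabilities", sketched in Python. Both the reading and the names are my assumptions for this sketch: the "right probability" of a pair is taken to be its count divided by the total count of its left word (how likely the right word is, given the left one), and the "left probability" is its count divided by the total count of its right word. The function `pair_probabilities` is hypothetical and is not part of count-diword-freqs.

```python
def pair_probabilities(pair_counts):
    """Given {(left, right): count}, return (left_prob, right_prob).

    right_prob[(l, r)] = count / sum of counts of all pairs starting with l
    left_prob[(l, r)]  = count / sum of counts of all pairs ending with r
    (This naming convention is an assumption of the sketch.)"""
    left_tot, right_tot = {}, {}
    for (l, r), n in pair_counts.items():
        left_tot[l] = left_tot.get(l, 0) + n
        right_tot[r] = right_tot.get(r, 0) + n
    right_prob = {(l, r): n / left_tot[l] for (l, r), n in pair_counts.items()}
    left_prob = {(l, r): n / right_tot[r] for (l, r), n in pair_counts.items()}
    return left_prob, right_prob

if __name__ == "__main__":
    # Toy counts, not real data from the manuscript.
    toy = {("a", "b"): 3, ("a", "c"): 1, ("d", "b"): 1}
    left_prob, right_prob = pair_probabilities(toy)
    for pair in sorted(toy):
        print(pair, "%.2f %.2f" % (left_prob[pair], right_prob[pair]))
```

Note that when only the K-K, K-B and B-K sub-tables are kept, the totals for a bore word cover only its pairs with keys, so probabilities conditioned on a bore would be overestimated by this sketch.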
cat .wds \ | sed -e '/?/s/^.*$/???/g' \ | enum-word-pairs \ | count-diword-freqs -v keyfile=.keys \ > .baz 97-08-10 stolfi =============== Fiddled with the right-probability table, obtaining the following clustering ----- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- c z q c z c c z q o q e o c z c c c c c c z o H o o c H q q q q o c c c c c c c z c c c H c H H c c q o o o o o o o H T c c H H H H c c c c c c c c c c c o H q 8 H H 8 H H 8 H H z c O / c c c c c c c c 8 8 8 8 8 c 8 8 8 H a o o a a a a a a a a a a o c T / a a a a a a a a a a a a a a a a a a n e e r r r e e e m m m m r a -----++--+--+--+--+--+--+--+--+--++--+-----++--+--+--+--+-----++--++--+-----+--------+--------+-----------+--+--+ // 20 || | | | | | | | | || | || 1| 3| 1| | || || | 1 | | 1| 1 1 1| | | -----++--+--+--+--+--+--+--+--+--++--+-----++--+--+--+--+-----++--++--+-----+--------+--------+-----------+--+--+ cccca 25 || | | | | | | 3| | 3|| | || | | | 3| 3 || || | | | 3 | 3 6 | | | -----++--+--+--+--+--+--+--+--+--++--+-----++--+--+--+--+-----++--++--+-----+--------+--------+-----------+--+--+ zccca 52 || | | | 4| | | | | || | || 8| | 4| | || || | 4| 4 | 4 4| 8 | 8| | -----++--+--+--+--+--+--+--+--+--++--+-----++--+--+--+--+-----++--++--+-----+--------+--------+-----------+--+--+ cccHca 49 || 7| | | | 1| | | | 1|| | 1|| | 1| 3| | 1 1|| 7|| 3| | 1 | 3| 1 1 1 | 1| | -----++--+--+--+--+--+--+--+--+--++--+-----++--+--+--+--+-----++--++--+-----+--------+--------+-----------+--+--+ zccHca 27 || | | | 2| | | | | || | 2|| 8| | 2| 2| || 2|| 2| | | | 2| | | -----++--+--+--+--+--+--+--+--+--++--+-----++--+--+--+--+-----++--++--+-----+--------+--------+-----------+--+--+ ccccHca 34 || | | | | | | | | || | || | 2| | | 2|| 5|| 2| | 2 | 5 8| | 2| | -----++--+--+--+--+--+--+--+--+--++--+-----++--+--+--+--+-----++--++--+-----+--------+--------+-----------+--+--+ zcccHca 54 || | | | | | | | 3| 3|| | || 6| 6| 3| | || 3|| | 3 3| 3| 3 3 9| 3 | | | 
-----++--+--+--+--+--+--+--+--+--++--+-----++--+--+--+--+-----++--++--+-----+--------+--------+-----------+--+--+ ccca 37 || 5| | | | | | | | || | || 1| 4| 1| | 2 || 2|| 1| 2 1| 1 1 1| | 1 2 1| | 1| -----++--+--+--+--+--+--+--+--+--++--+-----++--+--+--+--+-----++--++--+-----+--------+--------+-----------+--+--+ zcca 39 || 4| | | | | | | | || | 1 1|| 5| 4| 1| | 1 || 4|| 1| 2| 1 1| | 1 5 | | | =====++==+==+==+==+==+==+==+==+==++==+=====++==+==+==+==+=====++==++==+=====+========+========+===========+==+==+ zccc8a 58 || | 2| | | | | | | || | 2 || 8|13| 5| | 2 || || | 8 2| 2| 2| 2 2 | | | -----++--+--+--+--+--+--+--+--+--++--+-----++--+--+--+--+-----++--++--+-----+--------+--------+-----------+--+--+ ccc8a 46 || 6| | | | 1| | | | || | 2 || 5| 2| 2| 1| 1 || || 2| 4 | 3| 3| 1 2 | | | zcc8a 49 || 4| | | | | | | | 1|| | 2 2|| 7| 6| 1| 1| || 2|| | 3 2| | 1| 5 | | | =====++==+==+==+==+==+==+==+==+==++==+=====++==+==+==+==+=====++==++==+=====+========+========+===========+==+==+ qoHc8a 51 || 3| | | 1| | | | | || | 4 4|| 7| 7| 1| 3| 1 1|| 2|| 1| 2 1| 1 1 | 1 1 3| | | | -----++--+--+--+--+--+--+--+--+--++--+-----++--+--+--+--+-----++--++--+-----+--------+--------+-----------+--+--+ qoHcc8a 48 || 2| | | | | 1| | | || | 3 1|| 8| 4| 2| 1| 2 2|| 1|| 1| 1 | 1 2| 4| 1 1 2 | | | -----++--+--+--+--+--+--+--+--+--++--+-----++--+--+--+--+-----++--++--+-----+--------+--------+-----------+--+--+ qoHcca 49 || 3| | | | | | | | || | 3 3|| 7| 2| 1| 1| 1 2|| 1|| 1| 1| 1 | 1 1 3| 2 2 | 1| 4| -----++--+--+--+--+--+--+--+--+--++--+-----++--+--+--+--+-----++--++--+-----+--------+--------+-----------+--+--+ oHc8a 43 || 3| 1| | | 1| | | | || 1| 1 4|| 9| 2| | 4| 2 || 2|| | 1 1| 1| 1 1| 1 | | 1| -----++--+--+--+--+--+--+--+--+--++--+-----++--+--+--+--+-----++--++--+-----+--------+--------+-----------+--+--+ eccc8a 51 ||17| | | 1| | | | 1| || | 1 5|| 3| 5| | | || 5|| | 1| | | 5 | | | oHcc8a 51 || 8| | | | | | | | 1|| | 3 5|| 5| 5| | 3| || 7|| | 1 1| 1| 1 1| | 1| | 
=====++==+==+==+==+==+==+==+==+==++==+=====++==+==+==+==+=====++==++==+=====+========+========+===========+==+==+ qoHa 60 ||25| | | 1| | | | 1| 1|| | 3 1|| 2| | | | 2 2|| 1|| | 2 | 1| 3| 5 1 3| | | =====++==+==+==+==+==+==+==+==+==++==+=====++==+==+==+==+=====++==++==+=====+========+========+===========+==+==+ qoHan 61 || 1| 1| 1|11| 1| | 1| 3| || 1| 5 1|| | | | 3| || || 1| 1| 5 1 | 3 1 | 1 3 | 1| | -----++--+--+--+--+--+--+--+--+--++--+-----++--+--+--+--+-----++--++--+-----+--------+--------+-----------+--+--+ qoe 55 ||11| 1| 2| | | | 1| 2| || 2|11 7|| | | | 1| || || | 2 3| 1 | 1| 2 1| 1| 1| oe 50 ||12| 1| 1| 1| | | | 1| || 2| 4 7|| | | | | 2|| 1|| | 1| | | 2 | | | -----++--+--+--+--+--+--+--+--+--++--+-----++--+--+--+--+-----++--++--+-----+--------+--------+-----------+--+--+ 8ar 50 || 9| | | | 1| 1| | | 3|| | 1 7|| | | 1| | || || | 7| 1 1 1| 1 | 1 1 | 1| | oHar 54 || 2| | | | | | | | 2|| | 8 11|| 2| | | 8| || || | 5| 5 2 | 2 | | | | qoHar 54 || 2| | | 4| 2| 2| | 2| 4|| 2| 4 14|| | | | | 2|| || | 8| 2 | 2| | 2| | -----++--+--+--+--+--+--+--+--+--++--+-----++--+--+--+--+-----++--++--+-----+--------+--------+-----------+--+--+ 8ae 53 ||15| | | 1| | | | | || 1| 3 7|| 1| | | | 3 || || | 1 1| 1 | 1 | 3 | 3| | oHae 46 ||10| | 2| 2| | | | | 2|| 2|10 2|| | | | | 2|| || | 2| 2 | 5 | | | | qoHae 51 || 7| | | 2| | | 2| 1| 1|| | 7 6|| | | | | || || | 1 | 4 | 2 1| 1 | | | -----++--+--+--+--+--+--+--+--+--++--+-----++--+--+--+--+-----++--++--+-----+--------+--------+-----------+--+--+ 8am 49 ||10| | | 1| 2| 2| 1| 6| 4|| | 2 4|| | | | 2| || || | 2| 1 | 1 | 1 2 | | | oHam 37 || 8| | | 5| | | | 2| || | 2 8|| | | | 5| 2|| || | 2| | | | | | qoHam 49 || 5| | | 2| 5| 5| 1| 2| || | 5 6|| 1| 1| | 1| 1|| || | 1| 1| 1 1 1| 1 2 2| | | zam 51 || 9| | | | 3| | 3| 3| 6|| 3| 6 6|| | | | | || || | 3 | | 3 | 3 | | | -----++--+--+--+--+--+--+--+--+--++--+-----++--+--+--+--+-----++--++--+-----+--------+--------+-----------+--+--+ or 37 || 7| | 2| | | 2| | 4| || 2| 2 7|| | | | | || || | 2| | | | 
4| | -----++--+--+--+--+--+--+--+--+--++--+-----++--+--+--+--+-----++--++--+-----+--------+--------+-----------+--+--+ oHcca 32 || 2| | | 2| | | | | || | || | | 2| | 2 || || | 5 2| 2| 2| 2 2 | | | =====++==+==+==+==+==+==+==+==+==++==+=====++==+==+==+==+=====++==++==+=====+========+========+===========+==+==+ TOT 44 ||10| | | | | | | | || | 2 2|| 2| 2| 1| 1| || 1|| | 1 1| | 1| 1 1 | | | It seems that the ending of one word determines somewhat the beginning of the next one. Here is the same table, with independent clustering of rows and columns: ----- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- c z q c z c c z q o q o e c z c c c c c c z o H o q q q q o H o c c c c c c c c z c c c H c q H o o o o o o o H c H c T c c H H H H c c c c c c c o q c H H H H 8 8 8 H H H c c c z c O / c c c c c c c c 8 8 8 8 8 H o c a a a a a a a a a a 8 8 c a 8 o o T / a a a a a a a a a a a a a a e a e m n r r m e e r m a a a m a e r -----++--+--+--+--+--+--+--+--+--+--+-----++--+--+--+--+--+-----------+--------+--------+-----+--+--+--+--+--+ // 20 || | | | | | | | | | | || 1| 3| | 1| 1| 1 1 | 1 | | | | 1| | | | -----++--+--+--+--+--+--+--+--+--+--+-----++--+--+--+--+--+-----------+--------+--------+-----+--+--+--+--+--+ oHcca 32 || 2| | | 2| | | | | | | || | | | 5| 2| 2 2 2| 2 | | | | | 2| 2| | -----++--+--+--+--+--+--+--+--+--+--+-----++--+--+--+--+--+-----------+--------+--------+-----+--+--+--+--+--+ zccca 52 || | | | 4| | | | | | | || 8| | | | 4| 4 8 | 4| 4 | | | | | 4| 8| -----++--+--+--+--+--+--+--+--+--+--+-----++--+--+--+--+--+-----------+--------+--------+-----+--+--+--+--+--+ zccHca 27 || | | | 2| | | | | | | 2|| 8| | 2| | 2| 2 | | | 2 | | 2| | | | -----++--+--+--+--+--+--+--+--+--+--+-----++--+--+--+--+--+-----------+--------+--------+-----+--+--+--+--+--+ zcccHca 54 || | | | | | | | 3| 3| | || 6| 6| 3| 3| 3| 9 3| 3 3| 3 | | | | | 3| | 
-----++--+--+--+--+--+--+--+--+--+--+-----++--+--+--+--+--+-----------+--------+--------+-----+--+--+--+--+--+ cccca 25 || | | | | | | 3| | 3| | || | | | | | 6 | 3 3| | 3 | | | 3| | | -----++--+--+--+--+--+--+--+--+--+--+-----++--+--+--+--+--+-----------+--------+--------+-----+--+--+--+--+--+ cccHca 49 || 7| | | | 1| | | | 1| | 1|| | 1| 7| | 3| 3 1 3 | 1 | 1 1| 1| | | 1| | 1| -----++--+--+--+--+--+--+--+--+--+--+-----++--+--+--+--+--+-----------+--------+--------+-----+--+--+--+--+--+ ccccHca 34 || | | | | | | | | | | || | 2| 5| | | 8 2 | 2 | 5 | 2| | | | | 2| -----++--+--+--+--+--+--+--+--+--+--+-----++--+--+--+--+--+-----------+--------+--------+-----+--+--+--+--+--+ ccca 37 || 5| | | | | | | | | | || 1| 4| 2| 2| 1| 2 1 1| 1 1 | 1 | | 1| 1| 2| 1| | -----++--+--+--+--+--+--+--+--+--+--+-----++--+--+--+--+--+-----------+--------+--------+-----+--+--+--+--+--+ zcca 39 || 4| | | | | | | | | | 1 1|| 5| 4| 4| | 1| 5 1 1| 1 1 | | | | | 1| 2| | -----++--+--+--+--+--+--+--+--+--+--+-----++--+--+--+--+--+-----------+--------+--------+-----+--+--+--+--+--+ zccc8a 58 || | 2| | | | | | | | | 2 || 8|13| | 8| 5| 2 2 2| 2 | | | | | 2| 2| | -----++--+--+--+--+--+--+--+--+--+--+-----++--+--+--+--+--+-----------+--------+--------+-----+--+--+--+--+--+ ccc8a 46 || 6| | | | 1| | | | | | 2 || 5| 2| | 4| 2| 3 2 2 3| 1 | | 1 | | | 1| | | zcc8a 49 || 4| | | | | | | | 1| | 2 2|| 7| 6| 2| 3| 1| 1 5 | | | 1 | | | | 2| | -----++--+--+--+--+--+--+--+--+--+--+-----++--+--+--+--+--+-----------+--------+--------+-----+--+--+--+--+--+ qoHc8a 51 || 3| | | 1| | | | | | | 4 4|| 7| 7| 2| 2| 1| 3 1 | 1 1| 1 1 | 3 1| | | 1| 1| | -----++--+--+--+--+--+--+--+--+--+--+-----++--+--+--+--+--+-----------+--------+--------+-----+--+--+--+--+--+ qoHcc8a 48 || 2| | | | | 1| | | | | 3 1|| 8| 4| 1| 1| 2| 4 2 1 2| 1 1 | 1| 1 2| | | 2| | | -----++--+--+--+--+--+--+--+--+--+--+-----++--+--+--+--+--+-----------+--------+--------+-----+--+--+--+--+--+ qoHcca 49 || 3| | | | | | | | | | 3 3|| 7| 2| 1| | 1| 3 2 1 | 1 2 
1| 1 | 1 2| 4| | 1| 1| 1| -----++--+--+--+--+--+--+--+--+--+--+-----++--+--+--+--+--+-----------+--------+--------+-----+--+--+--+--+--+ oHc8a 43 || 3| 1| | | 1| | | | | 1| 1 4|| 9| 2| 2| 1| | 1 1| | 1 1| 4 | 1| | 2| 1| | -----++--+--+--+--+--+--+--+--+--+--+-----++--+--+--+--+--+-----------+--------+--------+-----+--+--+--+--+--+ eccc8a 51 ||17| | | 1| | | | 1| | | 1 5|| 3| 5| 5| | | 5 | | | | | | | 1| | oHcc8a 51 || 8| | | | | | | | 1| | 3 5|| 5| 5| 7| 1| | 1 1| | 1 | 3 | | | | 1| 1| =====++==+==+==+==+==+==+==+==+==+==+=====++==+==+==+==+==+===========+========+========+=====+==+==+==+==+==+ qoHa 60 ||25| | | 1| | | | 1| 1| | 3 1|| 2| | 1| 2| | 3 1| 5 | 1| 2| | 3| 2| | | =====++==+==+==+==+==+==+==+==+==+==+=====++==+==+==+==+==+===========+========+========+=====+==+==+==+==+==+ qoHan 61 || 1| 1| 1|11| 1| | 1| 3| | 1| 5 1|| | | | | | 1 | 5 1 3| 1 1 3| 3 | | | | 1| 1| -----++--+--+--+--+--+--+--+--+--+--+-----++--+--+--+--+--+-----------+--------+--------+-----+--+--+--+--+--+ qoe 55 ||11| 1| 2| | | | 1| 2| | 2|11 7|| | | | 2| | 1 | | 1 2| 1 | 1| 1| | 3| 1| oe 50 ||12| 1| 1| 1| | | | 1| | 2| 4 7|| | | 1| | | | | 2| 2| | | | 1| | -----++--+--+--+--+--+--+--+--+--+--+-----++--+--+--+--+--+-----------+--------+--------+-----+--+--+--+--+--+ 8ar 50 || 9| | | | 1| 1| | | 3| | 1 7|| | | | | 1| 1 1| 1 | 1 1 1| | | | | 7| 1| oHar 54 || 2| | | | | | | | 2| | 8 11|| 2| | | | | | 5 2| 2 | 8 | | | | 5| | qoHar 54 || 2| | | 4| 2| 2| | 2| 4| 2| 4 14|| | | | | | 2 | | 2 | 2| | | | 8| 2| -----++--+--+--+--+--+--+--+--+--+--+-----++--+--+--+--+--+-----------+--------+--------+-----+--+--+--+--+--+ 8ae 53 ||15| | | 1| | | | | | 1| 3 7|| 1| | | 1| | | 1 3 1| | | | | 3| 1| 3| oHae 46 ||10| | 2| 2| | | | | 2| 2|10 2|| | | | | | | 2 5| | 2| | | | 2| | qoHae 51 || 7| | | 2| | | 2| 1| 1| | 7 6|| | | | 1| | 1 | 4 1 2| | | | | | | | -----++--+--+--+--+--+--+--+--+--+--+-----++--+--+--+--+--+-----------+--------+--------+-----+--+--+--+--+--+ 8am 49 ||10| | | 1| 2| 2| 1| 6| 4| | 2 4|| | | 
| | | | 1 | 1 1 2| 2 | | | | 2| | oHam 37 || 8| | | 5| | | | 2| | | 2 8|| | | | | | | | | 5 2| | | | 2| | qoHam 49 || 5| | | 2| 5| 5| 1| 2| | | 5 6|| 1| 1| | | | 1 2 1| 1| 1 1| 1 1| | 2| | 1| | zam 51 || 9| | | | 3| | 3| 3| 6| 3| 6 6|| | | | 3| | | 3 3| | | | | | | | -----++--+--+--+--+--+--+--+--+--+--+-----++--+--+--+--+--+-----------+--------+--------+-----+--+--+--+--+--+ or 37 || 7| | 2| | | 2| | 4| | 2| 2 7|| | | | | | | | | | | | | 2| 4| =====++==+==+==+==+==+==+==+==+==+==+=====++==+==+==+==+==+===========+========+========+=====+==+==+==+==+==+ TOT 44 ||10| | | | | | | | | | 2 2|| 2| 2| 1| 1| 1| 1 1 | 1 | | 1 | | | | 1| | Again, with more columns and rows: row probabilities ---- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- c z o q c c c z z c e e o e o q q c c c c z o c c c c z c z H o c q q H o q q q o q c c c c z c c H c c c c c c c o o o o c H c H q o o c H o o o H o T H H c c c c c c c c z H H c z c c o H H H H c c o c 8 8 8 c q o H H c c H H H c H O / c c c c c 8 8 8 8 8 a c c c o 8 8 H a a a c 8 c o e 8 o 8 a a a 8 o H a a 8 8 a a c c o T / a a a a a a a a a a m a a a e a a a e m r a a a e a a r a e m r a e a e m a a n r a a e ---- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- // 27 . . . . . . . . . . . 3 . . . 1 . . . . . . . . . . . . . . . 2 . . 1 . 1 3 3 1 . . . 1 . 8a 71 31 2 2 . . 2 . . 2 . . . . . . . 2 . . . . . . . . . . . . . . 8 . . 2 . 5 . 2 2 . 2 . . . oHa 55 27 . . . . 3 . . 3 . . . . . . . . . . . . . . . . . . . . 3 3 . 3 . . 3 . . . . . . 3 . . oea 99 78 . . . . 8 . . . 4 . . . . . . . . . 4 . . . . . . . . . . . . . . . . . . 4 . . . . . . qoHa 65 25 1 . . 1 1 3 1 . . . 3 . . . . 2 . . . 2 . . 2 . . . . . . . 6 . . 2 1 3 1 . 2 . 1 . . 1 Hc8a 63 7 . . . . . 7 . 7 . . . . . . . 3 . . . 3 . 3 . . 3 . . . . 3 . 7 . . . . . . 7 . . . 3 . oeccc8a 69 26 . 4 . . . 
. . 4 . . . . . . . . . . . . . 4 . . . . . . . . 4 . . 4 . 4 8 8 . . . . . . ezcc8a 57 14 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 . 4 . 4 9 . 4 9 . . . . . 4 eccc8a 51 17 1 . . 1 . 1 5 . . . . . . . . . . . . . . . . . 1 . . . . . . . . . 5 . 5 5 3 . . . . . cccc8a 68 10 . . 5 . . . . 5 . . . . . . . 5 . . . . . . . . . . . . . . . . . 5 . 10 10 15 . . . . . . zccc8a 63 . . . 2 . . 2 . . . . . . . . . 2 . . . 2 . . . . 2 . . . . . 2 . . 8 . 2 5 13 8 . 2 . 5 . cccHca 57 7 . . . . 1 . 1 . . . . 1 . . . 1 . . . 1 1 1 1 . . . . 1 . . 1 . . . 7 3 7 1 . 3 . . 3 . zccHca 35 . 2 . . . . . 2 2 . . 5 . . . . . . . . . . . . . . . . . . . . . . . 2 . . . 8 2 . 5 2 . ccccHca 51 . . . . . . . . . . . . . . . 2 . 2 . 5 2 . . 2 . . . 2 2 . . . 2 . . 5 8 5 2 . 2 . . . . zcccHca 64 . . . . 3 3 . . . . . . . . . . . . . 3 . . . . . 3 . . . . 3 3 . . 3 3 9 9 6 6 . 3 . 3 . oHca 47 4 . . . . . . . . . . . . . . . 4 . . . 4 . . . . . . . . 4 4 . 4 . . . . . . 9 4 . 4 . . qoHca 72 4 2 . . . . . . 4 . . 2 2 . . 2 . 6 . . . 2 . . 2 2 . 2 2 . . 2 . 2 2 6 6 4 . 4 2 . . 2 . oHcca 49 2 2 . . . . . . . . . . . . . . 2 . 5 . 2 . . . . 2 2 . . . . 2 . . 5 . 2 5 . . . 2 . 2 2 qoHcca 58 3 . . . . . 3 3 1 . . . . . . . 1 1 1 1 2 . . 2 4 1 . . 1 . 1 2 1 . . 1 3 3 2 7 1 . 2 1 . cccca 35 . . . . . 3 . . 3 . . . . 3 . . 3 . . . . . . . . . . . . . 3 3 . . . . . 12 . . . . 3 . . zccca 60 . 4 . . . . . . . . . . . . . . . . . . . 4 . . . 4 . . 8 . 4 . . . . . 4 17 . 8 . . . 4 . ccca 43 5 . . . . . . . . . . 1 . . . . 2 . . . . 1 . . 1 1 . . . . . 1 1 . 2 2 . 8 4 1 1 1 . 1 . zcca 50 4 . . . . . 1 1 . . . . . . . . 1 1 . . . . . . . 2 . . . . . 1 1 . . 4 . 11 4 5 1 1 2 1 1 oHc8a 54 3 . . 1 . . 1 4 4 . 1 . 1 . . 1 2 1 . 1 1 . 1 . 1 1 1 1 . 1 . . . 3 1 2 1 . 2 9 . 1 . . . qoHc8a 57 3 1 . . . . 4 4 3 . . . . . . . 1 . . 1 1 1 . 1 . 1 . . . . 1 1 1 . 2 2 3 . 7 7 1 . 1 1 . oHcc8a 53 8 . . . . 1 3 5 3 . . . . . . . . . . 1 . . . . . 1 . . 1 . . . . . 1 7 1 . 5 5 . 1 1 . . qoHcc8a 57 2 . 1 . . 
. 3 1 1 . . . . . . . 2 . 1 . 1 . . 2 . . 1 . . 1 . 1 1 . 1 1 4 3 4 8 1 2 1 2 1 ccc8a 57 6 . . . . . 2 . 1 . . . 1 . . . 1 . 1 . . . . . . . . . . 1 . 1 . . 4 . 3 6 2 5 2 3 1 2 . zcc8a 58 4 . . . . 1 2 2 1 . . . . . . . . . . . . . . . . 2 . . . . . . . . 3 2 1 9 6 7 . . . 1 1 8ae 57 15 1 . . . . 3 7 . . 1 . . . . . 3 . . . . . . . . 1 . 3 3 . 1 3 1 . 1 . . . . 1 . . . . . oHae 56 10 2 . . . 2 10 2 . . 2 . . . 2 . . . . . . . . 2 . 2 . . . 2 5 2 2 2 . . . 2 . . . . . . . qoHae 58 7 2 . . 1 1 7 6 . . . . . 2 . . . . . . . . . . . . . 1 . 1 2 3 4 . 1 . 1 . . . . . . . . oe 56 12 1 . 1 1 . 4 7 . . 2 . . . 1 . . . . . 2 . . 2 . 1 . . . . . 2 . 2 . 1 . . . . . . . . . qoHoe 42 . . . . 4 . . 14 4 4 4 . . . . . . . . . 4 . . 4 . . . . . . . . . . . . . . . . . . . . . qoe 61 11 . . 1 2 . 11 7 1 4 2 1 . 1 2 . . . . . 2 1 . . 1 3 . . 1 1 . . . . 2 . 1 . . . . . . . . zoe 59 . . . 3 7 3 7 11 . 3 3 . . . 3 . . 3 . . . . . . . 3 . . 3 . . . . . . . . . . . . . . . . 8am 58 8 . 1 . 7 2 3 3 2 . . . 1 . . . . . . . 2 . . . . 4 . . . 2 . 3 . . . . . . . . . . . . . oHam 43 3 3 . . 1 2 1 6 2 . . . . 2 . 1 . . . 2 3 2 . 1 . 3 2 . . . . . . . . . . . . . . . . . . qoHam 54 2 3 4 . 1 1 7 5 . . . . 2 . . . . . . 1 2 . . 1 . 2 . . . 1 . . . . . . . 1 . . . . . . . zam 53 5 . 1 . 1 3 5 7 . . 1 . 1 1 1 . . . 1 . 1 . . . . 3 3 . . . 1 1 . . 1 . . . . . 1 . . . . qoHan 62 1 11 . 1 3 . 5 1 3 1 1 . 1 1 1 . . . . 1 3 1 . . . 1 . . 1 . 3 1 5 . . . . . . . 1 . . . . 8ar 56 9 . 1 . . 3 1 7 . . . 1 1 . . . . . . 1 1 1 1 . . 7 . 1 1 . . . 1 . . . . 1 . . . 1 . 1 . oHar 62 2 . . . . 2 8 11 8 2 . . . . . . . . . . 2 2 2 . . 5 . . . . 2 . 5 . . . . . . 2 . . . . . qoHar 60 2 4 2 . 2 4 4 14 . . 2 . 2 . . . . . . . . 2 . 2 . 8 . 2 2 2 . . . . . . 2 . . . . . 2 . . or 39 7 . 2 . 4 . 2 7 . . 2 . . . 2 . . . . . . . . . . 2 2 . 4 . . . . . . . . . . . . . . . . 
---- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- TOT 50 5 . . . . . 2 3 1 . . . . . . . . . . . 1 . . . . 1 . . . . . 1 . . 1 1 1 3 2 3 . . . 1 . col probabilities ---- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- c z o q c c c z z c o e e e q o q c c c c c z c c c z o H o c c z q q q q o q H o q c c c c c c c c z c c c o o o H o c H c H c c q o o o o H o c H o T c H c c H c z c H c c H c z o H H H c H c c o c 8 8 8 c c c o H H H H c H c c H q O / 8 c c 8 c c a 8 c c 8 c c o H a a a 8 c 8 c o e 8 o 8 a a a 8 8 8 H a a a a 8 c 8 c o o T / a a a a a a m a a a a a a e a e m r a a a a e a a r a e m r a a a a e m n r a a a a e e ---- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- // 5 . 2 . 1 . . 3 44 . . 2 11 . 4 59 . 2 . . 3 . . 5 . . 4 2 . 5 19 7 . 5 9 2 7 11 9 4 7 13 15 16 14 12 8a 0 1 . 1 . . 2 . . . . 1 . . . . . . . . 1 . . . . . . . . . 2 . . 1 . . 1 . . 2 . . . . . 1 oHa 0 . . . . . . . . . . 1 . . . . . . . . 1 . . . . . . . 2 1 . 1 . . . 1 . . . . . 2 . . . . oea 0 2 . . . 5 . . . . . 2 . . . . . 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . qoHa 1 2 1 1 1 . . . 5 . . 1 . . . . . . 2 . . . 3 . . . . . . . 4 . . 3 . 1 2 . . 2 1 . . . 4 2 Hc8a 0 . 1 . . . . . . . . . . . . . . . 1 . 2 4 . . . . . . . 1 . 3 . 1 . . . . . . 1 . . 1 . . ezcc8a 0 . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 . 1 . . . 2 . . . . . . 1 . 4 1 eccc8a 0 1 . 1 1 . . . . 1 . . . . . . . . . . . . . . . . . . . . . . . . . 3 . 1 . . 1 . 1 . . . oeccc8a 0 . . . . . 2 . . . . . . . . . . . . . 1 4 . . . . . . . . . . . . . . . . . . . . 1 . . 1 cccc8a 0 . . . . . . 3 . . . . . . . . . . . . 1 . . . . . . . . . . . . 1 . . 1 . . . . . 1 . . 1 zccc8a 0 . . . . . . 3 . . . . . . . . . . 1 . . . . . . . . . 
. . . . . 1 . . . . . 2 1 . 2 2 . 3 ccccHca 0 . . . . . . . . . . . . . . 3 . 5 1 . . . 1 . . . 4 2 . . . 1 . . 4 2 2 . 1 . . . . . . . zcccHca 0 . . . 1 . . . . . . 1 . . . . . 2 . . . . . . . . . . . 1 . . . . . 1 2 1 . 2 1 . 1 1 . 1 cccHca 0 . . . . . . . . . 2 1 . . . . . . 1 2 . 4 1 . . . . 2 . . . . . 1 . 5 1 1 3 . . . . 2 . . zccHca 0 . . 1 . . . . 3 . . . . . . . . . . . 1 . . . . . . . . . . . . . . 1 . . 1 . 1 4 . 1 . . oHca 0 . . . . . . . . . . . . . . . . . 1 . . . . . . . . . 2 1 . 1 . 1 . . . . 1 . 1 2 . . . . qoHca 0 . . 1 . . . . 1 . 2 . . . . 3 . . . 2 2 . . 2 . . 4 2 . . . . 3 . 14 3 2 . 1 . 1 . . 1 . 1 oHcca 0 . . 1 . . . . . . . . . . . . 7 . 1 . . . . . . 4 . . . . . . . 1 . . . . . 2 . . . 1 4 2 qoHcca 1 . 1 . . . . . . 1 . . . . . . 3 2 2 . 1 . 3 11 . . . 2 . 1 1 1 . 1 4 1 2 1 1 . 3 4 1 1 . . cccca 0 . . . . . . . . . . 1 . 3 . . . . . . 1 . . . . . . . . 1 . . . 1 . . . 1 . . . 2 . . . . zccca 0 . . 1 . . . . . . . . . . . . . . . 2 . . . . . . . 4 . 1 . . . . . . . 1 . . 1 . . 1 . . ccca 0 . . . . . . . 1 . . . . . . . . . . 2 . . . 2 . . . . . . . 1 . 3 . 2 . 2 1 2 . . 1 1 . 2 zcca 0 . . . . . . . . . . . . . . . . . . . . . . . 1 . . . . . . 1 . 1 4 3 . 3 1 2 2 4 1 1 4 . oHc8a 1 . . . . . . 3 . 1 2 . 2 . . 3 . 2 1 . 4 4 . 2 . 4 4 . 2 . . . 11 3 4 2 . . . 2 4 . 1 . . 1 qoHc8a 3 . 4 3 1 . . . . 3 2 1 2 . . . 3 5 2 5 8 4 3 2 1 . 4 . 2 5 1 3 . 3 4 6 5 . 5 . 7 6 7 2 4 4 oHcc8a 0 . 1 . . . . . . 1 . 1 . . . . . 2 . . 2 . . . . . . 2 . . . . . . . 5 . . . 2 1 2 1 . . 1 qoHcc8a 2 . 3 1 1 5 5 3 . . . . . . 4 . 7 2 2 . 2 4 8 . . 8 . . 5 . 2 3 3 9 4 2 7 3 3 10 7 4 4 6 9 2 ccc8a 2 1 2 . . . 2 . . . 8 . . 3 . 3 7 . 1 2 3 . . . . 4 . . 5 . 1 1 . 5 4 1 5 5 7 12 4 6 2 4 4 9 zcc8a 3 1 2 . 2 . 2 . . 2 . 4 . . . 3 . 2 1 . 3 . 1 2 3 . . . 5 3 . . . . 9 6 2 9 3 4 8 . 7 4 14 9 8ae 0 1 1 1 . . . . . 1 . . 2 . . . . . . . . . . . . . 8 4 . 1 1 1 . 3 . . . . . . . . . . . 1 oHae 0 . 2 1 . . . . . . . 1 2 . 4 . . . . . . . 1 . . . . . 2 3 . 1 3 . . . . . . . . . 
. . . . qoHae 1 1 5 5 2 . 2 . . 3 2 2 2 9 . . . . . 2 . . . 2 . 4 8 2 5 5 3 9 . 1 . 1 1 . . . . . . . . 2 oe 1 2 3 3 2 5 2 6 1 4 . 1 8 . 8 3 . . 3 . 1 . 5 . 1 4 . 2 . 1 2 1 11 1 . 2 . . . 2 . . . . . . qoHoe 0 . . . 1 5 . . . 1 . . 2 . . . . . 1 . 1 . 1 . . . . . . . . . . . . . . . . . . . . . . . qoe 1 1 5 . 2 21 . 3 1 2 . . 5 3 8 . . . 2 2 1 . . 2 2 . . 2 2 . . . . . . . . . . . . . . . . 2 zoe 0 . 1 . 2 5 . 3 . 1 . 1 2 . 4 . . . . . . . . . . . . 2 . . . . . . 4 . . . . . . . . . . . 8am 1 1 2 1 11 5 5 . 1 1 5 4 . 3 . . 3 2 3 2 3 4 . . 3 4 . . 8 . 3 . . . . . . . . . . . . . . . oHam 0 . . 5 1 . . . . 2 . 2 . 6 . 3 . 5 3 5 2 . 1 . 2 8 . . . . . . . . . . . . . . . . . . . . qoHam 3 . 8 13 4 10 28 . 3 5 13 5 2 6 . 3 3 7 7 . 2 9 5 . 4 . 8 . 8 3 . 3 . . . . . 1 . 2 . . . 1 . . zam 0 . 1 . 1 . 2 . . 1 2 2 2 3 4 . 3 . 1 . . . . . 1 8 . . . 1 . . . . . . . . 1 . . . . . . 1 qoHan 0 . 1 11 2 5 . 3 . . 2 . 2 3 4 . . 2 2 2 2 . . . . . . 2 . 3 . 5 . . . . . . 1 . . . . . . . 8ar 0 . . . . . 2 . 1 1 2 2 . . . . . 2 1 2 . 4 . . 3 . 4 2 . . . 1 . . . . . . . 2 . . . 1 . . oHar 0 . 1 . . 5 . . . 1 . 1 . . . . . . 1 2 3 4 . . 1 . . . . 1 . 3 . . . . . . . . . . . . . . qoHar 0 . 1 3 1 . 2 . . 3 2 2 2 . . . . . . 2 . . 1 . 3 . 4 2 2 . . . . . . . . . . . . 2 . . . . or 0 . . . 2 . 2 . . 1 . . 2 . 4 . . . . . . . . . . 4 . 4 . . . . . . . . . . . . . . . . . . ---- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- TOT 50 27 61 71 52 73 68 32 69 55 51 52 58 41 47 91 43 51 52 45 60 52 42 38 45 56 56 47 59 57 62 66 35 63 71 60 56 59 49 56 58 58 56 53 66 67 Tried to recompute the table, collapsing the prefixes and suffixes into categories: --- collapse-words ------------------------ #! 
/n/gnu/bin/sed -f s/^cc\(..\)$/K\1/g s/^zc\(..\)$/K\1/g s/^ccc\(..\)$/K\1/g s/^zcc\(..\)$/K\1/g s/^cccH\(..\)$/K\1/g s/^zccH\(..\)$/K\1/g s/^ccccH\(..\)$/K\1/g s/^zcccH\(..\)$/K\1/g s/^cccc\(..\)$/K\1/g s/^zccc\(..\)$/K\1/g s/^qoH\(..\)$/Q\1/g s/^qoHc\(..\)$/Q\1/g s/^qoHcc\(..\)$/Q\1/g s/^oH\(..\)$/O\1/g s/^oHc\(..\)$/O\1/g s/^oHcc\(..\)$/O\1/g s/^8\(..\)$/B\1/g ------------------------------------------- cat .keys \ | collapse-words \ | sort | uniq \ > .keys.cooked cat .wds \ | sed -e 's/c?m$/am/g' \ | sed -e '/?/s/^.*$/???/g' \ | collapse-words \ | enum-word-pairs \ | count-diword-freqs -v keyfile=.keys.cooked \ > .baz Here are the results, with 0 and 1 mapped to "." raw pair counts ----- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- e c c q T B B B K K O O O O O Q Q Q Q Q Q c o q z O / a a a 8 c 8 a a a c 8 a a a a c 8 o o H o a T / e m r a a a e m r a a e m n r a a e r a e m ----- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- // 765 . 3 20 4 10 5 3 . . . 2 43 9 24 5 2 19 3 . . 2 10 23 Bae 50 8 . 2 . 7 . . . . . . . . . . . . 2 . 2 . . . Bam 100 9 . 4 . 10 17 3 . 3 . . . . . . . . . 5 . . . 1 Bar 51 5 . . . 6 4 . . . . . . . . . . . . 4 . . . 1 K8a 452 25 2 4 . 17 14 8 2 3 . . 56 12 35 8 10 14 5 7 . 6 20 . Kca 343 11 3 5 3 4 8 4 3 2 3 2 22 9 31 6 3 13 5 5 4 13 3 3 O8a 139 8 . . . 11 3 6 2 . . 2 16 2 . . 2 . 2 2 . 6 2 . Oae 40 4 2 . . 7 3 . . . . . . . . . . . . . . . . . Oam 76 3 . . . 7 8 3 3 3 2 . . . . . . . . 3 . . . . Oar 36 . . . 2 8 . 3 . . . . . . . . . . . 2 . . . . Oca 61 4 . 2 . . . . . 2 . . 2 . 2 . . 2 2 . . . 2 . Q8a 382 10 3 6 4 26 11 16 3 4 2 3 51 14 8 5 5 13 7 2 . 7 6 . Qae 114 9 3 4 5 17 12 . . . . . . 2 . . . . . . . . 2 . Qam 200 5 2 . 2 30 31 5 3 6 . 2 2 . 4 . . . . 6 . . . 2 Qan 54 . 2 . 3 6 12 2 . 2 . . . . . . . . . . . . . . Qar 49 . . . . 11 8 . . . . . . . . . . . . 4 . . . . Qca 132 5 . 3 . 6 2 5 . 2 . 5 11 6 6 2 . 5 . 2 2 6 . 1 eccc8a 52 9 . . 
. 4 2 . . . . . 6 . 3 . . . . . . 3 . . oe 127 16 . 3 . 22 10 4 . 3 . . . . . . . . . 2 . 2 . 1 or 40 3 . . . 6 4 . . . . . . . . . . . . . 2 . . . qoHa 79 20 . 5 . 4 3 2 . 2 . . 2 3 . . . . 2 . . . 2 3 qoe 81 9 . . . 22 6 . . 2 . . . . . . . . . 3 . . 2 1 zam 52 3 . . . 8 7 . . . . . . . . . . . . 2 . . . . ----- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- TOT 7054 765 50 100 51 452 343 139 40 76 36 61 382 114 200 54 49 132 52 127 40 79 81 52 next word probabilities ---- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- e c c q T B B B K K O O O O O Q Q Q Q Q Q c o q z O / a a a 8 c 8 a a a c 8 a a a a c 8 o o H o a T / e m r a a a e m r a a e m n r a a e r a e m ---- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- // 24 . . 2 . . . . . . . . 5 . 3 . . 2 . . . . . 3 Kca 48 3 . . . . 2 . . . . . 6 2 9 . . 3 . . . 3 . . Qca 56 3 . 2 . 4 . 3 . . . 3 8 4 4 . . 3 . . . 4 . . Oca 40 6 . 3 . . . . . 3 . . 3 . 3 . . 3 3 . . . 3 . K8a 55 5 . . . 3 3 . . . . . 12 2 7 . 2 3 . . . . 4 . O8a 48 5 . . . 7 2 4 . . . . 11 . . . . . . . . 4 . . Q8a 53 2 . . . 6 2 4 . . . . 13 3 2 . . 3 . . . . . . eccc8a 53 17 . . . 7 3 . . . . . 11 . 5 . . . . . . 5 . . Bam 54 8 . 3 . 9 16 2 . 2 . . . . . . . . . 4 . . . . Oam 43 3 . . . 9 10 3 3 3 2 . . . . . . . . 3 . . . . Qam 51 2 . . . 14 15 2 . 2 . . . . . . . . . 2 . . . . zam 48 5 . . . 15 13 . . . . . . . . . . . . 3 . . . . Bar 56 9 . . . 11 7 . . . . . . . . . . . . 7 . . . 1 Oar 61 2 2 . 5 22 2 8 . 2 2 2 2 . . . . . . 5 . . . . Qar 59 2 . . . 22 16 2 . . 2 . . 2 . . . 2 . 8 2 . . . Qan 62 . 3 . 5 11 22 3 . 3 . . . . . . . . . . . . . . Bae 53 15 . 3 . 13 . . . . . . . . . . . . 3 . 3 . . . Oae 57 9 4 2 2 17 7 2 . . . 2 . 2 2 . . . . 2 . . . . Qae 53 7 2 3 4 14 10 . . . . . . . . . . . . . . . . . or 39 7 . . . 14 9 . . . . . . . . . . . . 2 4 . . . qoe 61 11 . . . 27 7 . . 2 . . . . . . . . . 3 . . 2 1 oe 54 12 . 2 . 17 7 3 . 2 . . . . . . . . . . 
. . . . qoHa 64 25 . 6 . 5 3 2 . 2 . . 2 3 . . . . 2 . . . 2 3 ---- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- TOT 49 10 . . . 6 4 . . . . . 5 . 2 . . . . . . . . . prev word probabilities ---- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- e c c q T B B B K K O O O O O Q Q Q Q Q Q c o q z O / a a a 8 c 8 a a a c 8 a a a a c 8 o o H o a T / e m r a a a e m r a a e m n r a a e r a e m ---- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- // 10 . 5 19 7 2 . 2 2 . . 3 11 7 11 9 4 14 5 . 2 2 12 44 K8a 6 3 3 3 . 3 4 5 4 3 2 . 14 10 17 14 20 10 9 5 . 7 24 . O8a . . . . . 2 . 4 4 . . 3 4 . . . 4 . 3 . 2 7 2 . Q8a 5 . 5 5 7 5 3 11 7 5 5 4 13 12 3 9 10 9 13 . . 8 7 . eccc8a 0 . . . . . . . . . . . . . . . . . . . . 3 . . Bae 0 . . . . . . . . . . . . . . . . . 3 . 4 . . . Oae 0 . 3 . . . . . . . . . . . . . . . . . . . . . Qae . . 5 3 9 3 3 . . . 2 . . . . . . . . . 2 . 2 . Bam . . . 3 . 2 4 2 2 3 2 . . . . . . . . 3 . . . 1 Oam . . . . . . 2 2 7 3 5 . . . . . . . . 2 . . . . Qam 2 . 3 . 3 6 9 3 7 7 . 3 . . . . 2 . . 4 . . . 3 zam 0 . . . . . 2 . . . . . . . . . . . . . . . . . Qan 0 . 3 . 5 . 3 . 2 2 2 . . . . . . . . . 2 . . . Bar 0 . . . . . . . 2 . 2 . . . . . 2 . . 3 2 . . 1 Oar 0 . . . 3 . . 2 . . 2 . . . . . . . . . . . . . Qar 0 . . . . 2 2 . . . 2 . . . . . . . . 3 2 . . . Kca 4 . 5 4 5 . 2 2 7 2 8 3 5 7 15 11 6 9 9 3 9 16 3 5 Oca 0 . . . . . . . . 2 . . . . . . 2 . 3 . . . 2 . Qca . . . 2 . . . 3 2 2 2 8 2 5 2 3 . 3 . . 4 7 . 1 oe . 2 . 2 . 4 2 2 . 3 . . . . . . 2 . . . 2 2 . 1 qoe . . . . . 4 . . . 2 2 . . . . . . . . 2 2 . 2 1 qoHa . 2 . 4 . . . . . 2 . . . 2 . . 2 . 3 . . . 2 5 or 0 . . . . . . . . . . . . . . . . . . . 4 . . . 
---- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --
TOT 49 22 51 60 58 55 50 48 52 49 44 39 56 54 58 53 55 53 59 44 44 59 64 69

97-08-11 stolfi
===============

Recomputed word pair tables with the new count-diword-freqs:

  cat .wds \
    | sed -e 's/c?m$/am/g' \
    | sed -e '/?/s/^.*$/???/g' \
    > .fix.wds

  cat .fix.wds \
    | sort | uniq \
    > .fix.dic

  cat .fix.wds \
    | enum-word-pairs \
    | count-diword-freqs -v rows=.fix.dic -v cols=.ckeys \
    > .baz

  tac .fix.wds \
    | sed -e '/=/d' \
    | enum-word-pairs \
    | count-diword-freqs -v rows=.fix.dic -v cols=.rkeys \
    > .bar

I took all words with 8 or more occurrences, and looked at the
probabilities "in" and "fn" (in percent) of the "word" occurring at
beginning-of-line and end-of-line, respectively. I added the two
probabilities, and got the probability "ex" of each word occurring at an
extremal position in the line.

Here are the words, sorted by the probability "in" of being line-initial:

  word         freq in fn ex
  ------------ ---- -- -- --
  Poe             8 99  0 99
  8zcc8a         17 88  0 88
  zor            10 79  0 79
  azcc8a          8 74  0 74
  8ccc8a          9 66  0 66
  zoe            25 59  0 59
  Hccc8a         14 49  0 49
  zam            52 44  5 49
  Pccc8a         14 42  7 49
  aHcc8a         12 41  8 49
  8an            12 33  0 33
  zae            14 28 14 42
  zar            11 27 27 54
  aHc8a          12 24  8 32
  eoe            17 23 58 81
  zccor           9 22  0 22
  8am           100 19  8 27
  qoHccc8a       11 18  9 27
  zcoe           11 18  0 18
  8oe            17 17  0 17
  ???          1294 17 18 35
  qoHcca         81 16  3 19
  qoHcc8a       183 15  2 17
  oeHcc8a        14 14  0 14
  qoHoe          21 14  0 14
  qoHca          43 13  4 17
  qoPccc8a        8 12 12 24
  qoe            81 12 11 23
  qoeccca         8 12  0 12
  cccoe          17 11  0 11
  qoHam         200 11  2 13
  zccc8a         36 11  0 11
  oeHcca         19 10  0 10
  ccoe           11  9  0  9
  eor            10  9 39 48
  ezcc8a         21  9 14 23
  qoHan          54  9  1 10
  oeccca         12  8  8 16
  8ar            51  7  9 16
  oezcc8a        14  7 14 21
  qoHae         113  7  7 14
  qoHc8a        198  7  3 10
  Ham            16  6  0  6
  oHan           16  6 12 18
  8ae            50  5 15 20
  eccc8a         52  5 17 22
  oHcca          34  5  2  7
  zccoe          17  5  0  5
  oHc8a          83  4  3  7
  oeccc8a        23  4 26 30
  qoHar          48  4  2  6
  zcca           69  4  4  8
  zccca          23  4  0  4
  cccca          31  3  0  3
  ccc8a         172  2  6  8
  oHae           39  2 10 12
  or             40  2  7  9
  qoHa           79  2 25 27
  ccca           67  1  5  6
  8a             35  0 31 31
  Hae            10  0  9  9
  Hc8a           25  0  7  7
  Hcc8a          14  0  0  0
  aHcca           9  0  0  0
  ae             12  0 24 24
  am             20  0  9  9
  cc8a           16  0 12 12
  cccHa          12  0  0  0
  cccHc8a         8  0  0  0
  cccHca         50  0  7  7
  cccc8a         19  0 10 10
  ccccHa         12  0  0  0
  ccccHca        35  0  0  0
  cccz            8  0  0  0
  e8a             8  0 74 74
  eHam            8  0  0  0
  eHc8a           8  0  0  0
  eccca          15  0 19 19
  oHa            25  0 27 27
  oHam           76  0  3  3
  oHar           35  0  2  2
  oHca           21  0  4  4
  oHcc8a         56  0  8  8
  oHoe           11  0  0  0
  oPccc8a        14  0  7  7
  oe            127  0 12 12
  oe8a            9  0 77 77
  oeHa           10  0  0  0
  oeHam          22  0  4  4
  oeHc8a         19  0 21 21
  oea            23  0 78 78
  oeoe            8  0 49 49
  oeor           13  0 23 23
  oezcca          8  0 12 12
  qoHccca         8  0  0  0
  ram            14  0 21 21
  roe             9  0 33 33
  zca             9  0 11 11
  zcc8a         204  0  4  4
  zccHa          14  0  7  7
  zccHca         37  0  0  0
  zccHcca         8  0  0  0
  zcccHa         12  0  8  8
  zcccHca        31  0  0  0

By the probability "fn" of being line-final:

  word         freq in fn ex
  ------------ ---- -- -- --
  oea            23  0 78 78
  oe8a            9  0 77 77
  e8a             8  0 74 74
  eoe            17 23 58 81
  oeoe            8  0 49 49
  eor            10  9 39 48
  roe             9  0 33 33
  8a             35  0 31 31
  oHa            25  0 27 27
  zar            11 27 27 54
  oeccc8a        23  4 26 30
  qoHa           79  2 25 27
  ae             12  0 24 24
  oeor           13  0 23 23
  oeHc8a         19  0 21 21
  ram            14  0 21 21
  eccca          15  0 19 19
  ???          1294 17 18 35
  eccc8a         52  5 17 22
  8ae            50  5 15 20
  ezcc8a         21  9 14 23
  oezcc8a        14  7 14 21
  zae            14 28 14 42
  cc8a           16  0 12 12
  oHan           16  6 12 18
  oe            127  0 12 12
  oezcca          8  0 12 12
  qoPccc8a        8 12 12 24
  qoe            81 12 11 23
  zca             9  0 11 11
  cccc8a         19  0 10 10
  oHae           39  2 10 12
  8ar            51  7  9 16
  Hae            10  0  9  9
  am             20  0  9  9
  qoHccc8a       11 18  9 27
  8am           100 19  8 27
  aHc8a          12 24  8 32
  aHcc8a         12 41  8 49
  oHcc8a         56  0  8  8
  oeccca         12  8  8 16
  zcccHa         12  0  8  8
  Hc8a           25  0  7  7
  Pccc8a         14 42  7 49
  cccHca         50  0  7  7
  oPccc8a        14  0  7  7
  or             40  2  7  9
  qoHae         113  7  7 14
  zccHa          14  0  7  7
  ccc8a         172  2  6  8
  ccca           67  1  5  6
  zam            52 44  5 49
  oHca           21  0  4  4
  oeHam          22  0  4  4
  qoHca          43 13  4 17
  zcc8a         204  0  4  4
  zcca           69  4  4  8
  oHam           76  0  3  3
  oHc8a          83  4  3  7
  qoHc8a        198  7  3 10
  qoHcca         81 16  3 19
  oHar           35  0  2  2
  oHcca          34  5  2  7
  qoHam         200 11  2 13
  qoHar          48  4  2  6
  qoHcc8a       183 15  2 17
  qoHan          54  9  1 10
  8an            12 33  0 33
  8ccc8a          9 66  0 66
  8oe            17 17  0 17
  8zcc8a         17 88  0 88
  Ham            16  6  0  6
  Hcc8a          14  0  0  0
  Hccc8a         14 49  0 49
  Poe             8 99  0 99
  aHcca           9  0  0  0
  azcc8a          8 74  0 74
  cccHa          12  0  0  0
  cccHc8a         8  0  0  0
  ccccHa         12  0  0  0
  ccccHca        35  0  0  0
  cccca          31  3  0  3
  cccoe          17 11  0 11
  cccz            8  0  0  0
  ccoe           11  9  0  9
  eHam            8  0  0  0
  eHc8a           8  0  0  0
  oHoe           11  0  0  0
  oeHa           10  0  0  0
  oeHcc8a        14 14  0 14
  oeHcca         19 10  0 10
  qoHccca         8  0  0  0
  qoHoe          21 14  0 14
  qoeccca         8 12  0 12
  zccHca         37  0  0  0
  zccHcca         8  0  0  0
  zccc8a         36 11  0 11
  zcccHca        31  0  0  0
  zccca          23  4  0  4
  zccoe          17  5  0  5
  zccor           9 22  0 22
  zcoe           11 18  0 18
  zoe            25 59  0 59
  zor            10 79  0 79

By probability "ex" of being line-extreme:

  word         freq in fn ex
  ------------ ---- -- -- --
  Poe             8 99  0 99
  8zcc8a         17 88  0 88
  eoe            17 23 58 81
  zor            10 79  0 79
  oea            23  0 78 78
  oe8a            9  0 77 77
  azcc8a          8 74  0 74
  e8a             8  0 74 74
  8ccc8a          9 66  0 66
  zoe            25 59  0 59
  zar            11 27 27 54
  Hccc8a         14 49  0 49
  Pccc8a         14 42  7 49
  aHcc8a         12 41  8 49
  oeoe            8  0 49 49
  zam            52 44  5 49
  eor            10  9 39 48
  zae            14 28 14 42
  ???          1294 17 18 35
  8an            12 33  0 33
  roe             9  0 33 33
  aHc8a          12 24  8 32
  8a             35  0 31 31
  oeccc8a        23  4 26 30
  8am           100 19  8 27
  oHa            25  0 27 27
  qoHa           79  2 25 27
  qoHccc8a       11 18  9 27
  ae             12  0 24 24
  qoPccc8a        8 12 12 24
  ezcc8a         21  9 14 23
  oeor           13  0 23 23
  qoe            81 12 11 23
  eccc8a         52  5 17 22
  zccor           9 22  0 22
  oeHc8a         19  0 21 21
  oezcc8a        14  7 14 21
  ram            14  0 21 21
  8ae            50  5 15 20
  eccca          15  0 19 19
  qoHcca         81 16  3 19
  oHan           16  6 12 18
  zcoe           11 18  0 18
  8oe            17 17  0 17
  qoHca          43 13  4 17
  qoHcc8a       183 15  2 17
  8ar            51  7  9 16
  oeccca         12  8  8 16
  oeHcc8a        14 14  0 14
  qoHae         113  7  7 14
  qoHoe          21 14  0 14
  qoHam         200 11  2 13
  cc8a           16  0 12 12
  oHae           39  2 10 12
  oe            127  0 12 12
  oezcca          8  0 12 12
  qoeccca         8 12  0 12
  cccoe          17 11  0 11
  zca             9  0 11 11
  zccc8a         36 11  0 11
  cccc8a         19  0 10 10
  oeHcca         19 10  0 10
  qoHan          54  9  1 10
  qoHc8a        198  7  3 10
  Hae            10  0  9  9
  am             20  0  9  9
  ccoe           11  9  0  9
  or             40  2  7  9
  ccc8a         172  2  6  8
  oHcc8a         56  0  8  8
  zcca           69  4  4  8
  zcccHa         12  0  8  8
  Hc8a           25  0  7  7
  cccHca         50  0  7  7
  oHc8a          83  4  3  7
  oHcca          34  5  2  7
  oPccc8a        14  0  7  7
  zccHa          14  0  7  7
  Ham            16  6  0  6
  ccca           67  1  5  6
  qoHar          48  4  2  6
  zccoe          17  5  0  5
  oHca           21  0  4  4
  oeHam          22  0  4  4
  zcc8a         204  0  4  4
  zccca          23  4  0  4
  cccca          31  3  0  3
  oHam           76  0  3  3
  oHar           35  0  2  2
  Hcc8a          14  0  0  0
  aHcca           9  0  0  0
  cccHa          12  0  0  0
  cccHc8a         8  0  0  0
  ccccHa         12  0  0  0
  ccccHca        35  0  0  0
  cccz            8  0  0  0
  eHam            8  0  0  0
  eHc8a           8  0  0  0
  oHoe           11  0  0  0
  oeHa           10  0  0  0
  qoHccca         8  0  0  0
  zccHca         37  0  0  0
  zccHcca         8  0  0  0
  zcccHca        31  0  0  0

Since there are 765 occurrences of "//" in about 6900 words, the expected
probability of a word occurring at a specific end of a line is about 12%,
and of its occurring at either end about 24%. Taking 12% as the split
point for "in" or "fn", we get the following tentative categories:

Extremists:

  word         freq in fn ex
  ------------ ---- -- -- --
  ???
1294 17 18 35 eoe 17 23 58 81 qoPccc8a 8 12 12 24 zae 14 28 14 42 zar 11 27 27 54 Finalists: word freq in fn ex ------------ ---- -- -- -- 8a 35 0 31 31 8ae 50 5 15 20 ae 12 0 24 24 cc8a 16 0 12 12 e8a 8 0 74 74 eccc8a 52 5 17 22 eccca 15 0 19 19 eor 10 9 39 48 ezcc8a 21 9 14 23 oHa 25 0 27 27 oHan 16 6 12 18 oe 127 0 12 12 oe8a 9 0 77 77 oeHc8a 19 0 21 21 oea 23 0 78 78 oeccc8a 23 4 26 30 oeoe 8 0 49 49 oeor 13 0 23 23 oezcc8a 14 7 14 21 oezcca 8 0 12 12 qoHa 79 2 25 27 ram 14 0 21 21 roe 9 0 33 33 Initialists: word freq in fn ex ------------ ---- -- -- -- 8am 100 19 8 27 8an 12 33 0 33 8ccc8a 9 66 0 66 8oe 17 17 0 17 8zcc8a 17 88 0 88 Hccc8a 14 49 0 49 Pccc8a 14 42 7 49 Poe 8 99 0 99 aHc8a 12 24 8 32 aHcc8a 12 41 8 49 azcc8a 8 74 0 74 oeHcc8a 14 14 0 14 qoHca 43 13 4 17 qoHcc8a 183 15 2 17 qoHcca 81 16 3 19 qoHccc8a 11 18 9 27 qoHoe 21 14 0 14 qoe 81 12 11 23 qoeccca 8 12 0 12 zam 52 44 5 49 zccor 9 22 0 22 zcoe 11 18 0 18 zoe 25 59 0 59 zor 10 79 0 79 Medialists: word freq in fn ex ------------ ---- -- -- -- cccoe 17 11 0 11 qoHam 200 11 2 13 zccc8a 36 11 0 11 oeHcca 19 10 0 10 ccoe 11 9 0 9 qoHan 54 9 1 10 oeccca 12 8 8 16 8ar 51 7 9 16 qoHae 113 7 7 14 qoHc8a 198 7 3 10 Ham 16 6 0 6 oHcca 34 5 2 7 zccoe 17 5 0 5 oHc8a 83 4 3 7 qoHar 48 4 2 6 zcca 69 4 4 8 zccca 23 4 0 4 cccca 31 3 0 3 ccc8a 172 2 6 8 oHae 39 2 10 12 or 40 2 7 9 ccca 67 1 5 6 Hae 10 0 9 9 Hc8a 25 0 7 7 Hcc8a 14 0 0 0 aHcca 9 0 0 0 am 20 0 9 9 cccHa 12 0 0 0 cccHc8a 8 0 0 0 cccHca 50 0 7 7 cccc8a 19 0 10 10 ccccHa 12 0 0 0 ccccHca 35 0 0 0 cccz 8 0 0 0 eHam 8 0 0 0 eHc8a 8 0 0 0 oHam 76 0 3 3 oHar 35 0 2 2 oHca 21 0 4 4 oHcc8a 56 0 8 8 oHoe 11 0 0 0 oPccc8a 14 0 7 7 oeHa 10 0 0 0 oeHam 22 0 4 4 qoHccca 8 0 0 0 zca 9 0 11 11 zcc8a 204 0 4 4 zccHa 14 0 7 7 zccHca 37 0 0 0 zccHcca 8 0 0 0 zcccHa 12 0 8 8 zcccHca 31 0 0 0 Note that the average line has about 10 words. 
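  The four categories above follow mechanically from the 12% split
  point. Here is a minimal sketch of that rule, assuming input rows in
  the "word freq in fn ex" layout of the tables. The function name
  classify_words, the lowercase category labels, and the optional
  threshold argument are illustrative only; this is not one of the
  scripts actually used in this notebook.

```shell
#!/bin/sh
# Classify words by line-position preference.
# Input rows on stdin: word freq in fn [ex], with "in" and "fn" in percent.
# Header and rule lines are skipped because their second field is not a number.
# ASSUMPTION: classify_words and its output layout are hypothetical names
# chosen for this sketch; only the 12% split point comes from the notebook.
classify_words() {
  awk -v thr="${1:-12}" '
    NF >= 4 && $2 ~ /^[0-9]+$/ {
      pin = $3 + 0; pfn = $4 + 0; t = thr + 0
      if      (pin >= t && pfn >= t) kind = "extremist"
      else if (pfn >= t)             kind = "finalist"
      else if (pin >= t)             kind = "initialist"
      else                           kind = "medialist"
      printf "%s %s\n", $1, kind
    }'
}

# Example: four rows copied from the tables above.
classify_words <<'EOF'
word         freq in fn ex
------------ ---- -- -- --
eoe            17 23 58 81
oea            23  0 78 78
Poe             8 99  0 99
zcc8a         204  0  4  4
EOF
```

  On these rows the sketch puts `eoe' among the extremists, `oea'
  among the finalists, `Poe' among the initialists, and `zcc8a' among
  the medialists, in agreement with the hand-made lists. Passing a
  different number as the first argument moves the split point.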
  The average number of lines per paragraph is at most 10 (but an
  unknown number of paragraph breaks may have been lost in the
  transcription).

  Here are three explanations I can think of for a word w to have a
  marked preference for, or aversion to, these extremal positions:

  (1) Grammar: If w occurs preferably at the end of a sentence, it
  will often be found at the end of paragraphs, which are a
  significant fraction (10% or more) of all ends-of-line. This effect
  can only boost the end-of-line probability up to the fraction F of
  sentences that end at end-of-line. The extreme cases are `e8a' and
  `oe8a' (around 75%). Explaining these numbers by cause (1) would
  require at least 3/4 of all sentences to end at end-of-line.

  Conversely, if w has a preference for beginning-of-sentence, it will
  be found at end-of-line only if a paragraph contains two or more
  sentences. An extreme case is `qoHam', which occurs 200 times, but
  only 2% of those occurrences are at end-of-line. We tentatively
  conclude that at most 2% of the sentences begin one word before
  end-of-line. If the second and subsequent sentences of a paragraph
  begin at random positions of the line, then, with about 10 words
  per line, such sentences are less than 10 x 2% = 20% of all
  sentences, and hence at least 80% of all paragraphs contain only
  one sentence.

  (2) Word splitting: In the VMs, words may have been split across
  line breaks without obvious markings. The left halves of split words
  would then show up as end-loving and begin-loathing, and
  symmetrically for the right halves.

  This explanation cannot account for the many end-loving words ending
  in `8a', like `eccc8a', because `8a' rarely occurs in the middle of
  a word: it is almost always final, and a few times initial.
  Likewise, it cannot account for end-loathing words that begin with
  `qo', which appears to be strictly word-initial.
  Also, this effect cannot explain words that avoid both ends of the
  line, like

    word         freq
    ------------ ----
    Hcc8a          14
    aHcca           9
    cccHa          12
    cccHc8a         8
    ccccHa         12
    ccccHca        35
    cccz            8
    eHam            8
    eHc8a           8
    oHoe           11
    oeHa           10
    qoHccca         8
    zccHca         37
    zccHcca         8
    zcccHca        31

  (3) False word breaks: In a sense the opposite of (2). Suppose w is
  part of a longer word x, but the letter spacing is such that x is
  often transcribed as two or three separate words, one of them being
  w. Then w will seem to avoid end-of-line, beginning-of-line, or
  both, depending on the position of w in x.

  This effect can only explain end-avoidance, not end-attraction.
  Also, it seems unlikely to be due to bad judgement by the
  transcribers: the word spaces in the VMs are usually pretty clear,
  and anyway I only considered word breaks where both Friedman and
  Currier agreed. So this explanation only flies if the word spaces
  are bogus by design.

  Conclusion: the most likely explanation for most anomalous words
  seems to be (1).

  Posted an improved version of these comments to the "voynich" list.

97-08-12 stolfi
===============

  Did a general cleanup.