Hacking at the Voynich manuscript - Side notes
017 OKOKOKO: The fine structure of Voynichese words

Last edited on 1999-09-23 23:55:55 by stolfi

[ A first version of this note was posted around 1998-03-11, to the
  voynich mailing list. This version was extensively revised between
  1998-03-21 and 1998-03-29. The section about word and line breaks
  was added on 1999-02-01. ]

[ If you decide to print this note, be warned that some lines have
  almost 120 characters. ]

The basic QOIXEOIXEO paradigm
-----------------------------

  Let "X" be any set of letters. We can always break any string
  whatsoever into zero or more "X"s, each surrounded by letters which
  are not "X"s:

    N X N X N X N ... X N

  where "X" represents exactly one letter from the set, and "N" is
  any string (possibly empty) of non-"X" letters.

  Now let's apply this decomposition to the Voynichese words, using
  as "X" the set of letters

    { sh ch ee
      k ckh ck ikh
      t cth ct ith
      f cfh cf ifh
      p cph cp iph
      d r l s g m n }

  (I am using the basic EVA alphabet, without capitals).

  It turns out that, for this choice of "X", the intervening "N" strings
  are highly constrained. In fact, most words can be decomposed as

    Q O I X E O I X E O I X E O... I X E O

  where

    Q is empty or "q";
    O is zero or more elements from the set A = { a o y };
    I is empty, or one of { i ii iii };
    E is empty, or "e".

The QOKOKOKO schema
-------------------

  In fact, we can constrain these pieces even more. With very few
  exceptions,

    "E" may be non-empty only after { sh ch ee k ckh t cth p cph f cfh d }

    "I" may be non-empty only before { r l g m n s d }

  Note that "d" is exceptional in that it may be accompanied by either
  "e" or "i" strings; but the two are mutually exclusive. (In fact
  the letter pairs "id" and "de" are both extremely rare.)

  That is, we can write the generic word as

    Q O K O K O K ... K O K O

  where O is as above, and K is one of the "main elements"

    { k    t    p    f
      ke   te   pe   fe
      ckh  cth  cph  cfh
      ckhe cthe cphe cfhe
      ikh  ith  iph  ifh
      ck   ct   cf   cp
      sh   ch   ee
      she  che  eee
      de
      d    r    l    g    m    n    s
      id   ir   il   ig   im   in   is
      iid  iir  iil  iig  iim  iin  iis
      iiid iiir iiil iiig iiim iiin iiis }

  Note that

    * The letters "p" and "f" are probably ornate versions of other
      letters: most likely "k" and "t", but perhaps others.

    * Various statistics suggest that "k" and "t" may be the same letter.

    * Ditto for "p" and "f".

    * Ditto for "y" and "o".

    * Ditto for "g" and "m".

    * The letter "q" does not seem to be part of the word; it may be
      an abbreviation for "and".

    * The groups { ikh ith iph ifh } may be equivalent to
      { ckh cth cph cfh }, respectively.

    * Instances of { ee eee } may be instances of { ch che } with
      missing ligature.

  Finally, many of the "K" elements are so rare that they are probably
  errors. If we consider only elements with frequency 0.1% or higher,
  and exclude the elements with "i*h", "p", and "f", we are left with
  only 25 "significant" elements:

    K* = { k ke ckh ckhe
           t te cth cthe
           ch che sh she ee eee
           l m s d n r
           in ir iin iir iiin }

Parsing ambiguities
-------------------

  Note that the inclusion in "X" of the groups { ikh ith iph ifh }
  does not create any ambiguity with the "I" modifiers, since the
  presence of "h" after a tall letter forces one to parse the
  preceding letter (which must be "i" or "c") as part of the same
  element.

  Indeed, the elements { ikh ith iph ifh } may be merely calligraphic
  variants of { ckh cth cph cfh }, and are the only instances where
  the letters { k t p f } may be preceded by "i".

  On the other hand, including the string "ee" in the set "X" leads to
  an ambiguity in the parsing of words with three or more consecutive
  "e"s.
  For example, "okeeedy" could be parsed either as

    Q  O  I  X  E  O  I  X  E  O  I  X  E  O
    -  o  -  k  -  -  -  ee e  -  -  d  -  y

  or as

    Q  O  I  X  E  O  I  X  E  O  I  X  E  O
    -  o  -  k  e  -  -  ee -  -  -  d  -  y

  Several Voynichologists (Rene and Dennis, among others) are unhappy
  about this ambiguity; they favor excluding "ee" from the set "X",
  and perhaps allowing "ee" and "eee" as possible "E" modifiers.

  But there are reasons for including "ee" in "X". For one thing,
  while an isolated "e" is pretty common within words, it practically
  never occurs right after { d r l } or before the first "X"; but "ee"
  and "eee" often occur in those positions. That is, while a single
  "e" must always be attached to a preceding "X", the groups "ee" and
  "eee" can stand on their own, like the other "X" groups.

  (One could argue that the "c" in the elements { ck ct cf cp }, which
  may occur before any other "X" group in some words, is in fact an
  instance of "e". However, in the few cases I have checked, the "c"
  has a noticeable ligature, even though the matching "h" is missing.
  So it seems indeed valid to write those combinations with "c" and
  not with "e".)

  One must keep in mind also that an "ee" group may well be a "ch"
  element whose ligature was omitted (by the scribe or the
  transcriber). Similarly, the very rare occurrences of "se" may well
  be instances of "sh" with missing ligature.

  Conversely, it may be that the `natural' form of the letters
  { ch che sh she } is { ee eee se see }, respectively; and the
  ligatures are optional calligraphic devices added to clarify the
  parsing, almost as an afterthought.

Parsing the text
----------------

  The words that fail this "QOKOKOKO" pattern are quite rare.
  Let's count them in the following files:

    hea-u.wds  a few herbal-A pages, which I carefully transcribed
               from Jacques Guy's images;

    hea-f.wds  herbal-A pages in Friedman's transcription;

    heb-f.wds  herbal-B pages in Friedman's transcription;

    bio-f.wds  biological (language B) pages in Friedman's transcription;

    vdp-z.wds  a list of all words that occur at least twice,
               transcribed by the EVMT team.

  (The "-f" files were created between 97-11-11 and 98-11-12, as
  {hea,heb,bio}-f-gut.wds, from Landini's interlinear converted to
  EVA. The last one was created by expanding a word frequency list
  posted by Rene Zandbergen on march/98; an entry "N W" in that list
  generated "N" copies of word "W" in file "vdp-z.wds".)

    foreach file ( hea-u hea-f heb-f bio-f vdp-z )
      cat ${file}.wds \
        | egrep -v '[*]' \
        | sed -f factor-OK.sed \
        > ${file}.fac
      cat ${file}.fac \
        | egrep -e '[#@%=]' \
        > ${file}-weird.fac
      dicio-wc ${file}.fac ${file}-weird.fac
    end

    --- factor-OK.sed ------------------------
    # Map "sh", "ch", and "ee" to single letters to simplify the parsing.
    # Note that "eee" groups are paired off from left end.
    s/ch/C/g
    s/sh/S/g
    s/ee/E/g
    # Map platformed and half-platformed letters to capitals to simplify the parsing:
    s/ckh/K/g
    s/cth/T/g
    s/cfh/F/g
    s/cph/P/g
    #
    s/ikh/G/g
    s/ith/H/g
    s/ifh/M/g
    s/iph/N/g
    #
    s/ck/U/g
    s/ct/V/g
    s/cf/X/g
    s/cp/Y/g
    # Put down scanning head in "@" state
    s/$/@/
    :x
    # If in "@" state, copy "[aoy]" group, and switch to "#" state:
    s/\([aoy][aoy]*\)@/#\1/
    s/@/#_/
    # If in "#" state, copy next main letter and "e" complements,
    # insert "}" delimiter, and switch to "%" or "=" state depending on
    # whether "i"s are allowed or not:
    s/\([CSEktfpKTFPd]e\)#/=\1}/g
    s/\([CSEktfpKTFPGHMNUVXY]\)#/=\1}/g
    s/\([rlgmnsd]\)#/%\1}/g
    # If in "%" state, attach "i" string to group, go to "=" state:
    s/\(iii\)%/=\1/
    s/\(ii\)%/=\1/
    s/\(i\)%/=\1/
    s/%/=/
    # If in "=" state, insert "{" delimiter, and go back to "@" state:
    s/=/@{/
    tx
    # We should exit the loop only in the "#" state.
    # Split "q" prefix and discard scanning head if done:
    s/^[q]#/{q}/
    s/^#/{_}/
    # Unfold letter folding:
    s/U/ck/g
    s/V/ct/g
    s/X/cf/g
    s/Y/cp/g
    #
    s/G/ikh/g
    s/H/ith/g
    s/M/ifh/g
    s/N/iph/g
    #
    s/K/ckh/g
    s/T/cth/g
    s/P/cph/g
    s/F/cfh/g
    #
    s/C/ch/g
    s/S/sh/g
    s/E/ee/g
    ------------------------------------------

     lines   words     bytes file
    ------ ------- --------- ------------
       803     803     11751 hea-u.fac
         0       0         0 hea-u-weird.fac

     lines   words     bytes file
    ------ ------- --------- ------------
      7812    7812    113448 hea-f.fac
        93      93      1144 hea-f-weird.fac

     lines   words     bytes file
    ------ ------- --------- ------------
      3223    3223     47932 heb-f.fac
        46      46       564 heb-f-weird.fac

     lines   words     bytes file
    ------ ------- --------- ------------
      6182    6182     90650 bio-f.fac
        39      39       474 bio-f-weird.fac

     lines   words     bytes file
    ------ ------- --------- ------------
     28939   28939    420444 vdp-z.fac
       142     142      1339 vdp-z-weird.fac

  So, the exceptions to the QOKOKOKO pattern are less than 1.5% in
  Friedman's transcription, less than 0.5% in Rene's list, and none in
  my own transcription.

  (The last result is not that impressive, of course. Even though I
  did my transcription before I had worked out the structure above, I
  already had some intuition about it, so my reading was not
  impartial.)

The exceptions in Rene's word list
----------------------------------

  Here is a breakdown of the 142 words (counting multiple occurrences)
  in Rene's file that did not fit the QOKOKOKO pattern. (Let's keep
  in mind that Rene's file only includes words that occur at least
  twice.)

  It seems that some of these exceptions can be explained as
  "mutations" from other letters: scribal errors, calligraphic
  variations, pen running out of ink, vellum defects, spots, fading,
  and of course poor copy quality. Some are harder to explain,
  however, and may require extending the basic schema.
    * Words with groups { ckhh cthh cphh cfhh } (42 cases):

        chckhhy(9) cthhy(4) chcthhy(4) shcthhy(3) qcthhy(3) ckhhy(3)
        chcphhy(3) chcfhhy(3) shocthhy(2) shcphhy(2) qcphhedy(2)
        ockhhy(2) ocfhhy(2)

      These exceptions account for 0.15% of all words. I propose that
      these are calligraphic accidents; that is, "ckhh" is a "ckhe"
      whose ligature was overextended, and similarly for the other
      groups.

    * Words with "oe" (41 cases):

        qoedy(5) qoedaiin(3) oedy(2)
        qoeol(5) qoear(2) qoeor(2)
        qoekeey(3) oekaiin(3) qoekol(2) oekeey(2) qoekedy(2) oekey(2) oekeody(2)
        choety(2) choeky(2) sheoeky(2)

      These exceptions account for approximately 0.15% of all words.
      The cases with "eke" could be explained as instances of "ckh"
      with missing ligature. The others may be true exceptions to the
      schema. Note that the "oe" occurs only at the beginning of the
      word, or after the initial "q", or after an initial "ch" or
      "she" (which, in language A, seem to behave like "q" to some
      extent).

    * Words beginning with "e" or "qe" (20 cases):

        ety(6) qekeey(3) qekchdy(3) qety(2) qekor(2) qekaiin(2) etaiin(2)

      These word-initial "e"s could be explained as partly erased
      instances of { a o y }. Note that if we replace the initial "e"
      by "o" or "y" we get fairly common words in all these cases.

    * Words with the special letters "x" and "v" (20 cases):

        x(10) v(8) xar(2)

      Note that these letters (picnic table and caret) occur mostly as
      isolated letters. Therefore, they may be non-phonetic symbols,
      or abbreviations.

    * Words with "e" after "s" (5 cases):

        chsey(3) shese(2)

      These exceptions could be instances of "sh" without the ligature.

    * Isolated "e"s (4 cases):

        e(4)

      These exceptions could be instances of "s" with missing plume.

    * Words with "eeb" (3 cases):

        cheeb(3)

      I propose that "eeb" is merely a calligraphic variation of "an"
      or "iin".

    * Words with "ykh" (3 cases):

        ykhey(3)

      I can't think of a good explanation for these cases.

    * Letter "o" before "q" (2 cases):

        oqokain(2)

      Perhaps the extra "o" is a separate word, or part of the
      previous one?

    * Letter "i" in word-final position (2 cases):

        okai(2)

      These exceptions could be truncated "in" or "ir" groups.

Frequencies for "K" elements
----------------------------

  Here are the statistics for the "K" groups.

    foreach file ( hea-u hea-f heb-f bio-f vdp-z )
      cat ${file}.fac \
        | egrep -v '[@%#=]' \
        | sed \
            -e 's/^[^{}]*{//g' \
            -e 's/}[^{}]*$//g' \
            -e 's/}[^{}]*{/./g' \
        | tr '.' '\012' \
        | egrep -e '.' \
        | sort | uniq -c | expand | sort -b +0 -1nr \
        | compute-freqs | sed -e 's/^ //g' \
        > ${file}-k.frq
      dicio-wc ${file}-k.frq
    end

     lines file
    ------ ------------
        39 hea-u-k.frq
        41 hea-f-k.frq
        36 heb-f-k.frq
        35 bio-f-k.frq
        44 vdp-z-k.frq

    multicol {hea-u,hea-f,heb-f,bio-f,vdp-z}-k.frq

    hea-u             hea-f             heb-f             bio-f             vdp-z
    ----------------  ----------------  ----------------  ----------------  ----------------
      752 0.304 _      7024 0.297 _      2856 0.284 _      4627 0.242 _     24167 0.273 _
      292 0.118 ch     2524 0.107 ch     1459 0.145 d      2512 0.131 d      9928 0.112 d
      216 0.087 d      2194 0.093 d       765 0.076 k      2140 0.112 l      7523 0.085 l
      183 0.074 l      1702 0.072 l       608 0.060 l      1516 0.079 q      6470 0.073 k
      178 0.072 r      1466 0.062 r       600 0.060 r      1422 0.074 k      4855 0.055 r
      119 0.048 t      1257 0.053 k       557 0.055 ch      828 0.043 che    4698 0.053 ch
      102 0.041 k      1177 0.050 t       424 0.042 iin     804 0.042 r      4630 0.052 q
      101 0.041 iin    1090 0.046 iin     366 0.036 che     775 0.041 ee     3641 0.041 t
       93 0.038 sh      832 0.035 sh      353 0.035 t       723 0.038 iin    3545 0.040 iin
       76 0.031 s       695 0.029 q       321 0.032 q       670 0.035 she    3384 0.038 ee
       58 0.023 cth     632 0.027 s       250 0.025 ee      615 0.032 t      3328 0.038 che
       51 0.021 che     464 0.020 che     207 0.021 ke      476 0.025 ch     1663 0.019 she
       51 0.021 q       453 0.019 cth     176 0.017 s       377 0.020 ke     1644 0.019 sh
       36 0.015 m       353 0.015 ee      176 0.017 she     357 0.019 s      1608 0.018 s
       23 0.009 ee      253 0.011 m       175 0.017 sh      316 0.017 sh     1428 0.016 in
       22 0.009 in      216 0.009 p       123 0.012 p       194 0.010 te     1370 0.015 ke
       20 0.008 p       187 0.008 she     113 0.011 m       168 0.009 p       789 0.009 te
       19 0.008 she     186 0.008 in      110 0.011 te      142 0.007 ckh     734 0.008 p
       17 0.007 ckh     176 0.007 ckh      89 0.009 ckh     113 0.006 in      632 0.007 m
       11 0.004 te      130 0.005 ke       74 0.007 f        81 0.004 cth     573 0.006 cth
        8 0.003 ke       78 0.003 cph      67 0.007 in       72 0.004 m       511 0.006 ckh
        7 0.003 cph      75 0.003 f        51 0.005 ir       42 0.002 ckhe    379 0.004 ir
        5 0.002 ir       75 0.003 n        38 0.004 cth      31 0.002 ir      223 0.003 eee
        4 0.002 ct       70 0.003 te       24 0.002 cthe     29 0.002 cthe    177 0.002 ckhe
        4 0.002 iiin     65 0.003 cthe     19 0.002 ckhe     21 0.001 eee     134 0.002 cthe
        4 0.002 n        59 0.002 ir       13 0.001 eee      21 0.001 f       125 0.001 f
        3 0.001 cthe     57 0.002 eee      13 0.001 iir      12 0.001 cphe    116 0.001 iiin
        3 0.001 de       47 0.002 ckhe      9 0.001 iiin     10 0.001 n        95 0.001 iir
        3 0.001 eee      27 0.001 cfh       6 0.001 cphe      7 0.000 cph      82 0.001 cph
        3 0.001 f        24 0.001 iir       5 0.000 cfh       7 0.000 iiin     67 0.001 n
        3 0.001 iir      21 0.001 cphe      5 0.000 cph       5 0.000 de       43 0.000 cphe
        2 0.001 cfh      20 0.001 iiin      5 0.000 de        4 0.000 cfh      26 0.000 g
        2 0.001 ck        8 0.000 de        5 0.000 n         3 0.000 il       21 0.000 im
        1 0.000 cf        6 0.000 iim       4 0.000 cfhe      2 0.000 iir      18 0.000 cfh
        1 0.000 ckhe      3 0.000 cfhe      2 0.000 id        1 0.000 pe       12 0.000 ikh
        1 0.000 cphe      3 0.000 iid       1 0.000 iil                        10 0.000 ct
        1 0.000 iid       3 0.000 iil                                           8 0.000 ith
        1 0.000 iim       2 0.000 id                                            7 0.000 ck
        1 0.000 im        2 0.000 iis                                           7 0.000 iid
                          1 0.000 il                                            7 0.000 il
                          1 0.000 is                                            2 0.000 cfhe
                                                                                2 0.000 de
                                                                                2 0.000 iim
                                                                                2 0.000 iis

  In these tables, the "_" entry represents the empty "Q" slot.
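  For readers who do not have the local helper programs used above
  (compute-freqs, multicol), the same tabulation can be approximated
  with a short Python sketch. It assumes the factored format written
  by factor-OK.sed, in which every "Q" or "K" element is enclosed in
  braces and "_" marks an empty slot. The function name and the
  sample words are illustrative only; they are not part of the
  original tool set.

```python
import re
from collections import Counter

def k_element_freqs(fac_lines):
    """Count the brace-delimited elements found in factored words.

    Words that still contain a scanning-head character (@ % # =)
    failed to parse and are skipped, as in the egrep -v step above.
    """
    counts = Counter()
    for line in fac_lines:
        word = line.strip()
        if not word or re.search(r"[@%#=]", word):
            continue
        counts.update(re.findall(r"\{([^{}]*)\}", word))
    total = sum(counts.values())
    # Decreasing count first; ties are broken alphabetically (a choice
    # made here for reproducibility, not taken from the original script).
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(n, n / total, elem) for elem, n in ordered]

sample = ["{_}o{k}_{eee}_{d}y", "{q}o{k}a{iin}_", "{_}_{d}a{iin}_", "ok#e{d}y"]
for n, freq, elem in k_element_freqs(sample):
    print(f"{n:6d} {freq:5.3f} {elem}")
```

  As in the tables, the leading "{q}" or "{_}" group is counted along
  with the "K" elements, which is why "_" and "q" show up as entries.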
  Let's extract from those tables the elements that are not in the
  reduced set "K*" and are not simple uses of the `jokers' "p" and "f":

    foreach file ( hea-u hea-f heb-f bio-f vdp-z )
      cat ${file}-k.frq \
        | egrep -v ' (([ktpf]|c[ktpf]h|[cs]h|ee)(e|)|[_qlmsdnr]|i[nr]|ii[nr]|iiin)$' \
        > ${file}-knr.frq
    end

    multicol {hea-u,hea-f,heb-f,bio-f,vdp-z}-knr.frq

    hea-u             hea-f             heb-f             bio-f             vdp-z
    ---------------   ---------------   ---------------   --------------    ---------------
        4 0.002 ct        8 0.000 de        5 0.000 de        5 0.000 de       26 0.000 g
        3 0.001 de        6 0.000 iim       2 0.000 id        3 0.000 il       21 0.000 im
        2 0.001 ck        3 0.000 iid       1 0.000 iil                        12 0.000 ikh
        1 0.000 cf        3 0.000 iil                                          10 0.000 ct
        1 0.000 iid       2 0.000 id                                            8 0.000 ith
        1 0.000 iim       2 0.000 iis                                           7 0.000 ck
        1 0.000 im        1 0.000 il                                            7 0.000 iid
                          1 0.000 is                                            7 0.000 il
                                                                                2 0.000 de
                                                                                2 0.000 iim
                                                                                2 0.000 iis

  Recall that strings with three or more "e"s have ambiguous parsing,
  which affects the statistics of "ee" and all elements with the "e"
  modifier. The factor-OK script arbitrarily pairs the "e"s from the
  left, so that such strings are parsed as zero or more "ee"s followed
  by one "ee" or "eee".

  To assess the implications of this ambiguity, let's check how many
  ambiguous strings we have in each file:

    foreach file ( hea-u hea-f heb-f bio-f vdp-z )
      cat ${file}.wds \
        | egrep -v '[*]' \
        | sed -e 's/[^e]/./g' \
        | tr '.' '\012' \
        | egrep '.' \
        | sort | uniq -c | expand | sort +0 -1nr \
        | compute-freqs | sed -e 's/^ //g' \
        > ${file}-eee.frq
      dicio-wc ${file}-eee.frq
    end

    multicol {hea-u,hea-f,heb-f,bio-f,vdp-z}-eee.frq

    hea-u             hea-f             heb-f             bio-f             vdp-z
    ---------------   ----------------  ---------------   ---------------   ---------------
       97 0.789 e      1069 0.721 e       952 0.782 e      2187 0.732 e      7593 0.677 e
       23 0.187 ee      355 0.239 ee      253 0.208 ee      779 0.261 ee     3395 0.303 ee
        3 0.024 eee      57 0.038 eee      13 0.011 eee      21 0.007 eee     223 0.020 eee
                          2 0.001 eeee

  Note that, surprisingly, there are practically no words with four or
  more "e"s in a row.
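  The run-length count performed by the pipeline above can also be
  sketched in Python, for readers who prefer not to chain sed and tr.
  This is only an approximation of the original processing: it works
  on plain EVA words, one per list entry, and the function name and
  sample words are made up for the illustration.

```python
import re
from collections import Counter

def e_run_counts(words):
    """Count maximal runs of consecutive "e"s over a list of words."""
    counts = Counter()
    for word in words:
        if "*" in word:
            continue  # words with unreadable letters, as in egrep -v '[*]'
        counts.update(re.findall(r"e+", word))
    return counts

sample = ["okeeedy", "qokeey", "chedy", "ok*eey", "cheekeey"]
runs = e_run_counts(sample)
for run in sorted(runs, key=len):
    print(f"{runs[run]:5d} {run}")
```

  Since the EVA groups "ch" and "sh" contain no "e", the runs found
  this way are exactly the strings whose parsing is discussed here.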
  My factoring script will parse the "eee" strings as one "eee"
  element. In all files, the frequency of the "eee" element is less
  than 0.003 (i.e. 0.3% of the total "K" elements). Therefore, if I
  had used the other parsing ("e" + "ee"), the frequencies of "ee"
  and all other "e"-modified elements would increase by less than
  0.003 in total.

  By the way, the low frequency of "eee" probably means that its
  ambiguity would be no big problem for the intended readers.

  In fact, the absence of "eeee"s could be explained by the following
  theory: the letters "ch" and "sh" are officially written "ee" and
  "se"; since that would lead to ambiguities, the scribe routinely
  (but not invariably) adds ligatures to indicate the intended
  grouping.

Wordlength statistics in the OKOKOKO model
------------------------------------------

  Let's compute statistics on the number of O and K elements in words:

    foreach file ( hea-f heb-f bio-f vdp-z )
      cat ${file}.fac \
        | egrep -v '[@%#=]' \
        | sed \
            -e 's/[{}_]/ /g' \
            -e 's/^ *//g' \
            -e 's/ *$//g' \
            -e 's/  */ /g' \
        | egrep -e '.' \
        | count-okokoko-lengths \
        > ${file}-ok.lfr
      dicio-wc ${file}-ok.lfr
    end

    foreach file ( hea-f heb-f bio-f vdp-z )
      cat ${file}.fac \
        | egrep -v '[@%#=]' \
        | sed \
            -e 's/[{}_aoy]/ /g' \
            -e 's/^ *//g' \
            -e 's/ *$//g' \
            -e 's/  */ /g' \
        | egrep -e '.' \
        | count-okokoko-lengths \
        > ${file}-k.lfr
      dicio-wc ${file}-k.lfr
    end

    foreach file ( hea-f heb-f bio-f vdp-z )
      cat ${file}.fac \
        | egrep -v '[@%#=]' \
        | sed \
            -e 's/[{}_]/ /g' \
            -e 's/[^ aoy]/ /g' \
            -e 's/^ *//g' \
            -e 's/ *$//g' \
            -e 's/  */ /g' \
        | egrep -e '.' \
        | count-okokoko-lengths \
        > ${file}-o.lfr
      dicio-wc ${file}-o.lfr
    end

    foreach file ( hea-f heb-f bio-f vdp-z )
      set ff = ( ${file}-{ok,k,o}.lfr )
      multicol -v titles="$ff" $ff
    end

    hea-f-ok.lfr                          hea-f-k.lfr                   hea-f-o.lfr
    -----------------------------------   ---------------------------   ---------------------------
    avg length = 3.6                      avg length = 2.2              avg length = 1.5

    len nwds example                      len nwds example              len nwds example
    --- ---- ------------------           --- ---- ------------------   --- ---- ------------------
      1  205 sh                             1 1411 r                      1 4340 yay
      2 1096 a r                            2 4078 f s                    2 2789 y a
      3 2941 f yay s                        3 1801 sh l d                 3  302 o o y
      4 1737 y k a l                        4  376 sh k ch ee             4   13 o o o y
      5 1223 sh o l d y                     5   28 t sh d ee s
      6  412 r o l o t y                    6    2 k l s ch ee s
      7   92 t sh o d ee s y                7    0
      8   10 q o s ch o d a m               8    1 p ch d l ch p ch l
      9    1 o p ch o l o t o l
     10    1 d a l ch o d o l d y
     11    0
     12    1 p ch o d o l ch o p ch a l

    heb-f-ok.lfr                      heb-f-k.lfr                     heb-f-o.lfr
    -------------------------------   -----------------------------   -----------------------------
    avg length = 3.7                  avg length = 2.3                avg length = 1.5

    len nwords example                len nwords example              len nwords example
    --- ------ ------------------     --- ------ ------------------   --- ------ ------------------
      1     53 l                        1    476 q                      1   1575 oy
      2    377 q oy                     2   1605 d iir                  2   1342 o y
      3    999 d che y                  3    857 p she k                3    145 a o y
      4    932 o d a iir                4    198 q k ee d               4      1 a a o a
      5    578 p she o k y              5     28 q t ee d r
      6    189 a d ee o d y             6      4 k ee s ch ee s
      7     40 q o t ee d a r
      8      6 y k ee d l che d y
      9      3 o k ee o s ch ee o s

    bio-f-ok.lfr                      bio-f-k.lfr                     bio-f-o.lfr
    -------------------------------   -----------------------------   -----------------------------
    avg length = 3.8                  avg length = 2.4                avg length = 1.5

    len nwords example                len nwords example              len nwords example
    --- ------ ------------------     --- ------ ------------------   --- ------ ------------------
      1     80 r                        1    828 sh                     1   3252 y
      2    692 sh y                     2   2794 k r                    2   2601 a y
      3   2074 she d y                  3   1996 k che d                3    122 o a y
      4   1417 k che d y                4    464 q l che d              4      2 o a o o
      5   1394 q o k a r                5     40 q l k ee l
      6    418 q o l che d y            6      6 q k ee l she d
      7     54 q o k a r d y
      8     12 q o l k ee o l y
      9      2 q o k ee y l she d y

    vdp-z-ok.lfr                      vdp-z-k.lfr                     vdp-z-o.lfr
    -----------------------------     -----------------------------   -----------------------------
    avg length = 3.7                  avg length = 2.3                avg length = 1.5

    len nwords example                len nwords example              len nwords example
    --- ------ ------------------     --- ------ ------------------   --- ------ ------------------
      1    614 s                        1   4504 l                      1  14033 a
      2   3551 o l                      2  14200 d iin                  2  12928 o y
      3   8812 d a iin                  3   8165 q k ee                 3   1052 o o y
      4   7868 o k a iin                4   1690 q k ee d               4     14 o a o y
      5   5952 q o k ee y               5     72 q l k ee d
      6   1786 q o k ee d y
      7    192 q o k ee o d y
      8     22 q o k o l che d y

Frequencies of "K" elements in languages A and B
------------------------------------------------

  In the "K" frequency tables above we can already see a marked
  difference between languages A and B. Looking only at the reduced
  element subset K*, plus "q" and "_" (meaning no "q"):

    foreach file ( hea-f heb-f )
      cat ${file}-k.frq \
        | egrep ' (([kt]|c[kt]h|[cs]h|ee)(e|)|[_qlmsdnr]|i[nr]|ii[nr]|iiin)$' \
        > ${file}-kr.frq
    end

    multicol {hea-f,heb-f}-kr.frq

    hea-f             heb-f
    ----------------  ----------------
     7024 0.297 _      2856 0.284 _
     2524 0.107 ch     1459 0.145 d
     2194 0.093 d       765 0.076 k
     1702 0.072 l       608 0.060 l
     1466 0.062 r       600 0.060 r
     1257 0.053 k       557 0.055 ch
     1177 0.050 t       424 0.042 iin
     1090 0.046 iin     366 0.036 che
      832 0.035 sh      353 0.035 t
      695 0.029 q       321 0.032 q
      632 0.027 s       250 0.025 ee
      464 0.020 che     207 0.021 ke
      453 0.019 cth     176 0.017 s
      353 0.015 ee      176 0.017 she
      253 0.011 m       175 0.017 sh
      187 0.008 she     113 0.011 m
      186 0.008 in      110 0.011 te
      176 0.007 ckh      89 0.009 ckh
      130 0.005 ke       67 0.007 in
       75 0.003 n        51 0.005 ir
       70 0.003 te       38 0.004 cth
       65 0.003 cthe     24 0.002 cthe
       59 0.002 ir       19 0.002 ckhe
       57 0.002 eee      13 0.001 eee
       47 0.002 ckhe     13 0.001 iir
       24 0.001 iir       9 0.001 iiin
       20 0.001 iiin      5 0.000 n

  There is also a less marked but still significant difference
  between herbal-B and bio-B:

    foreach file ( heb-f bio-f )
      cat ${file}-k.frq \
        | egrep ' (([kt]|c[kt]h|[cs]h|ee)(e|)|[_qlmsdnr]|i[nr]|ii[nr]|iiin)$' \
        > ${file}-kr.frq
    end

    multicol {heb-f,bio-f}-kr.frq

    heb-f             bio-f
    ----------------  ----------------
     2856 0.284 _      4627 0.242 _
     1459 0.145 d      2512 0.131 d
      765 0.076 k      2140 0.112 l
      608 0.060 l      1516 0.079 q
      600 0.060 r      1422 0.074 k
      557 0.055 ch      828 0.043 che
      424 0.042 iin     804 0.042 r
      366 0.036 che     775 0.041 ee
      353 0.035 t       723 0.038 iin
      321 0.032 q       670 0.035 she
      250 0.025 ee      615 0.032 t
      207 0.021 ke      476 0.025 ch
      176 0.017 s       377 0.020 ke
      176 0.017 she     357 0.019 s
      175 0.017 sh      316 0.017 sh
      113 0.011 m       194 0.010 te
      110 0.011 te      142 0.007 ckh
       89 0.009 ckh     113 0.006 in
       67 0.007 in       81 0.004 cth
       51 0.005 ir       72 0.004 m
       38 0.004 cth      42 0.002 ckhe
       24 0.002 cthe     31 0.002 ir
       19 0.002 ckhe     29 0.002 cthe
       13 0.001 eee      21 0.001 eee
       13 0.001 iir      10 0.001 n
        9 0.001 iiin      7 0.000 iiin
        5 0.000 n         2 0.000 iir

  However, most of that difference disappears if we:

    (1) identify the letters { k t p f }, which we have good reasons
        to believe are the same letter;

    (2) omit the letter "q", which is believed to be a symbol for
        "and", and hence might be correlated with subject matter;

    (3) identify "ee" with "ch".

    foreach file ( hea-u hea-f heb-f bio-f vdp-z )
      cat ${file}.fac \
        | egrep -v '[@%#=]' \
        | sed \
            -e 's/^[^{}]*{//g' \
            -e 's/}[^{}]*$//g' \
            -e 's/}[^{}]*{/./g' \
            -e 's/[ktpf]/k/g' \
            -e 's/ee/ch/g' \
            -e 's/q//g' \
        | tr '.' '\012' \
        | egrep -e '.' \
        | sort | uniq -c | expand | sort -b +0 -1nr \
        | compute-freqs | sed -e 's/^ //g' \
        | egrep ' (([kt]|c[kt]h|[cs]h|ee)(e|)|[_qlmsdnr]|i[nr]|ii[nr]|iiin)$' \
        > ${file}-krr.frq
      dicio-wc ${file}-krr.frq
    end

    multicol {hea-u,hea-f,heb-f,bio-f,vdp-z}-krr.frq

    hea-u             hea-f             heb-f             bio-f             vdp-z
    ----------------  ----------------  ----------------  ----------------  ----------------
      752 0.310 _      7024 0.306 _      2856 0.293 _      4627 0.263 _     24167 0.288 _
      315 0.130 ch     2877 0.125 ch     1459 0.150 d      2512 0.143 d     10970 0.131 k
      244 0.101 k      2725 0.119 k      1315 0.135 k      2226 0.126 k      9928 0.118 d
      216 0.089 d      2194 0.096 d       807 0.083 ch     2140 0.122 l      8082 0.096 ch
      183 0.075 l      1702 0.074 l       608 0.062 l      1251 0.071 ch     7523 0.089 l
      178 0.073 r      1466 0.064 r       600 0.062 r       849 0.048 che    4855 0.058 r
      101 0.042 iin    1090 0.047 iin     424 0.043 iin     804 0.046 r      3551 0.042 che
       93 0.038 sh      832 0.036 sh      379 0.039 che     723 0.041 iin    3545 0.042 iin
       84 0.035 ckh     734 0.032 ckh     317 0.033 ke      670 0.038 she    2159 0.026 ke
       76 0.031 s       632 0.028 s       176 0.018 s       572 0.032 ke     1663 0.020 she
       54 0.022 che     521 0.023 che     176 0.018 she     357 0.020 s      1644 0.020 sh
       36 0.015 m       253 0.011 m       175 0.018 sh      316 0.018 sh     1608 0.019 s
       22 0.009 in      200 0.009 ke      137 0.014 ckh     234 0.013 ckh    1428 0.017 in
       19 0.008 ke      187 0.008 she     113 0.012 m       113 0.006 in     1184 0.014 ckh
       19 0.008 she     186 0.008 in       67 0.007 in       83 0.005 ckhe    632 0.008 m
        5 0.002 ckhe    136 0.006 ckhe     53 0.005 ckhe     72 0.004 m       379 0.005 ir
        5 0.002 ir       75 0.003 n        51 0.005 ir       31 0.002 ir      356 0.004 ckhe
        4 0.002 iiin     59 0.003 ir       13 0.001 iir      10 0.001 n       116 0.001 iiin
        4 0.002 n        24 0.001 iir       9 0.001 iiin      7 0.000 iiin     95 0.001 iir
        3 0.001 iir      20 0.001 iiin      5 0.001 n         2 0.000 iir      67 0.001 n

Statistics of "O" strings
-------------------------

  Now, what do we do with the "O" strings? Let's look at their
  statistics:

    foreach file ( hea-u hea-f heb-f bio-f vdp-z )
      cat ${file}.fac \
        | egrep -v '[@%#=]' \
        | sed -e 's/{[^{}]*}/./g' \
        | tr '.' '\012' \
        | egrep -e '.' \
        | sort | uniq -c | expand | sort -b +0 -1nr \
        | compute-freqs | sed -e 's/^ //g' \
        > ${file}-ooo.frq
      dicio-wc ${file}-ooo.frq
    end

     lines file
    ------ ------------
         9 hea-u-ooo.frq
        15 hea-f-ooo.frq
        11 heb-f-ooo.frq
        12 bio-f-ooo.frq
        11 vdp-z-ooo.frq

    multicol {hea-u,hea-f,heb-f,bio-f,vdp-z}-ooo.frq

    hea-u             hea-f             heb-f             bio-f             vdp-z
    --------------    ---------------   --------------    ---------------   --------------
     1364 0.551 _     12782 0.540 _      5371 0.533 _     10295 0.538 _     45585 0.514 _
      595 0.240 o      5444 0.230 o      1712 0.170 y      3558 0.186 o     18671 0.211 o
      262 0.106 y      3069 0.130 y      1616 0.160 o      3413 0.178 y     13615 0.154 y
      234 0.094 a      2188 0.092 a      1325 0.132 a      1835 0.096 a     10544 0.119 a
       11 0.004 oa       70 0.003 oa       16 0.002 oa        6 0.000 yo      171 0.002 oa
        6 0.002 oy       59 0.002 oy       11 0.001 oy        4 0.000 oy       51 0.001 oy
        2 0.001 ao       16 0.001 oo        8 0.001 oo        3 0.000 ay       23 0.000 oo
        2 0.001 oo       14 0.001 yo        6 0.001 yo        3 0.000 oa       12 0.000 yo
        1 0.000 yo        5 0.000 ay        2 0.000 ay        2 0.000 ao        6 0.000 ay
                          4 0.000 ya        1 0.000 ao        2 0.000 ya        6 0.000 ya
                          2 0.000 ao        1 0.000 ya        1 0.000 aoy       2 0.000 yy
                          2 0.000 yoa                         1 0.000 oaa
                          1 0.000 aa
                          1 0.000 oao
                          1 0.000 yay

  Thus, the only common alternatives are empty, "y", "a", and "o". In
  fact, as we know, the alternative "y" is common only in initial and
  final positions; and in those positions it seems to be equivalent to
  "o".

  Note that about half of the "O" slots are filled (i.e. the ratio K:O
  is about 2:1). Therefore, if the "K" elements were randomly mixed
  with "O" letters, the "O" slots should be about 67% empty, 22%
  single-letter, 7% double-letter, and 2% triple-letter. Instead we
  see about 50% empty, 50% single-letter, <1% double-letter, and <0.1%
  triple-letter.

  In fact, triple-letter "O"s are so rare that they can be assumed to
  be errors. In Rene's good-quality word list (vdp-z.wds) there are
  no triple-letter "O"s at all.

Statistics of "K" strings
-------------------------

  Let's now look at the clusters of "K" elements between consecutive
  non-empty "O"s.
  To reduce the size of the output, let's map the letters { k t p f }
  to "k", and "ch" to "ee":

    foreach file ( hea-f heb-f bio-f vdp-z )
      cat ${file}.fac \
        | egrep -v '[@%#=]' \
        | sed \
            -e 's/^{[q_]}//g' \
            -e 's/^_//g' \
            -e 's/_$//g' \
            -e 's/[oay]/./g' \
            -e 's/[{}]//g' \
            -e 's/[ktpf]/k/g' \
            -e 's/ch/ee/g' \
        | tr '.' '\012' \
        | egrep -e '.' \
        | sort | uniq -c | expand | sort -b +0 -1nr \
        | compute-freqs | sed -e 's/^ //g' \
        > ${file}-kkk.frq
      dicio-wc ${file}-kkk.frq
    end

     lines file
    ------ ------------
       257 hea-f-kkk.frq
       213 heb-f-kkk.frq
       265 bio-f-kkk.frq
       232 vdp-z-kkk.frq

    multicol {hea-f,heb-f,bio-f,vdp-z}-kkk.frq > multi-kkk.frq

    hea-f                      heb-f                      bio-f                      vdp-z
    -------------------------  -------------------------  -------------------------  -----------------------
     1733 0.131 d                663 0.136 k               1424 0.165 l               5345 0.119 k
     1441 0.109 l                568 0.116 r               1148 0.133 k               5296 0.118 l
     1380 0.104 r                550 0.113 d                729 0.084 r               4714 0.105 r
     1237 0.093 k                419 0.086 iin              720 0.083 iin             4146 0.093 d
     1164 0.088 ee               408 0.084 l                464 0.054 d               3534 0.079 iin
     1078 0.081 iin              172 0.035 ke_d             379 0.044 ke_d            1994 0.045 k_ee
      931 0.070 k_ee             161 0.033 k_ee_d           331 0.038 k_ee_d          1861 0.042 ee
      592 0.045 ckh              114 0.023 s                260 0.030 s               1424 0.032 in
      553 0.042 sh               107 0.022 m                258 0.030 she_d           1232 0.028 s
      426 0.032 s                105 0.022 k_ee             230 0.027 eee_d           1068 0.024 k_ee_d
      235 0.018 m                 94 0.019 ee               175 0.020 k_ee            1052 0.023 eee
      229 0.017 eee               92 0.019 ke               148 0.017 eee             1036 0.023 ke
      179 0.014 ke                87 0.018 ee_d             147 0.017 she              868 0.019 ke_d
      174 0.013 in                77 0.016 eee_d            112 0.013 in               813 0.018 sh
      149 0.011 k_eee             70 0.014 eee              111 0.013 l_k              643 0.014 she
      133 0.010 she               63 0.013 in               104 0.012 ke               632 0.014 m
      114 0.009 d_ee              60 0.012 sh                99 0.011 l_eee_d          631 0.014 eee_d
      112 0.008 ckhe              55 0.011 eee_k             87 0.010 k_eee_d          622 0.014 ckh
      110 0.008 k_sh              52 0.011 ee_ckh            67 0.008 m                459 0.010 she_d
      106 0.008 l_d               51 0.010 l_d               66 0.008 l_d              428 0.010 k_eee
       64 0.005 n                 50 0.010 ir                65 0.008 sh               406 0.009 l_k
       58 0.004 ee_ckh            49 0.010 she               63 0.007 ee_ckh           398 0.009 k_eee_d
     .... ..... ..........      .... ..... ...............  .... ..... ...............  .... ..... .............
        1 0.000 ckh_s_ee_s         1 0.000 ee_sh_d            2 0.000 l_l                6 0.000 she_ke
        1 0.000 ckh_sh             1 0.000 eee_ckh_d          2 0.000 l_sh_ee_s          5 0.000 d_sh_ee_d
        1 0.000 ckhe_iin           1 0.000 eee_ckhe           2 0.000 l_she_ckh          5 0.000 il
        1 0.000 ckhe_k_k_k_l       1 0.000 eee_ckhe_d         2 0.000 l_she_k            5 0.000 sh_ee_k_ee
        1 0.000 d_ee_ee_ckhe       1 0.000 eee_ee             2 0.000 r_ee_r             4 0.000 d_ee_ee_d
        1 0.000 d_ee_ee_s          1 0.000 eee_eee            2 0.000 r_eee_k            4 0.000 d_sh_d
        1 0.000 d_ee_eee           1 0.000 eee_k_ee_ee        2 0.000 r_k                4 0.000 ee_ee_k_ee
     .... ..... ..........      .... ..... ...............  .... ..... ...............  .... ..... .............

  Obviously, groups of two or more consecutive "K" elements are quite
  common. Here is the frequency for each repeat count:

    foreach file ( hea-f heb-f bio-f vdp-z )
      cat ${file}.fac \
        | egrep -v '[@%#=]' \
        | sed \
            -e 's/^{[q_]}//g' \
            -e 's/^_//g' \
            -e 's/_$//g' \
            -e 's/[oay]/./g' \
            -e 's/[{}]//g' \
            -e 's/[a-z][a-z]*/x/g' \
        | tr '.' '\012' \
        | egrep -e '.' \
        | sort | uniq -c | expand | sort -b +0 -1nr \
        | compute-freqs | sed -e 's/^ //g' \
        > ${file}-kn.frq
      dicio-wc ${file}-kn.frq
    end

     lines file
    ------ ------------
         5 hea-f-kn.frq
         6 heb-f-kn.frq
         6 bio-f-kn.frq
         4 vdp-z-kn.frq

    multicol {hea-f,heb-f,bio-f,vdp-z}-kn.frq

    hea-f                    heb-f                    bio-f                    vdp-z
    ---------------------    -----------------------  -----------------------  -------------------
    10849 0.819 x             3387 0.694 x             5527 0.640 x            33290 0.744 x
     2149 0.162 x_x           1034 0.212 x_x           1966 0.228 x_x           8124 0.181 x_x
      229 0.017 x_x_x          416 0.085 x_x_x         1038 0.120 x_x_x         3077 0.069 x_x_x
       20 0.002 x_x_x_x         38 0.008 x_x_x_x         99 0.011 x_x_x_x        280 0.006 x_x_x_x
        5 0.000 x_x_x_x_x        5 0.001 x_x_x_x_x        1 0.000 x_x_x_x_x
                                 2 0.000 x_x_x_x_x_x      1 0.000 x_x_x_x_x_x

  So strings of 3 consecutive "K" elements are relatively common,
  strings of 4 are rare, and no word that occurs twice has 5 or more
  "K"s in a row.

  Recall that about 50% of the "O" slots are empty, and about 50%
  consist of one letter only.
  If the "O" slots were filled or empty at random, then we would
  expect the following statistics

    0.500 x
    0.250 x_x
    0.125 x_x_x
    0.063 x_x_x_x
    0.031 x_x_x_x_x
    0.016 x_x_x_x_x_x
    0.008 x_x_x_x_x_x_x

  So the statistics above suggest that in language A the distribution
  of "O"s is more uniform than would be expected from chance. (The
  case is not clear because the presence of short words would bias the
  statistics towards entries with few consecutive "K"s.)

  Note the significant difference in K-repeat frequencies for language
  A and language B. The frequencies for language B are closer to the
  "random" model.

Analysis of "K" and "O" statistics
----------------------------------

  What can we conclude from these numbers? Let's consider the
  alternatives:

    (1) The EVA letters { a o y } are different Voynichese letters.

        This theory does not look very promising: if they were
        different letters, they should belong to the same class
        (vowel, consonant, whatever); but then we would expect to see
        a fair number of diphthongs (double-letter "O" strings), which
        we don't see.

    (2) The EVA letters { a o y } are the same Voynichese letter.

        This theory could explain why there are so few double-letter
        "O" slots: namely, because the Voynichese letter "o/a/y"
        cannot occur twice in a row (a common restriction in natural
        languages).

    (3) Each "O" string is a modifier for (i.e. a part of) the next
        "K" element; except for the final "O" string, which stands on
        its own.

    (4) Each "O" string is a modifier for the preceding "K" element;
        except for the initial "O" string, which stands on its own.

    (5) Some "K" elements may admit "O" letters as post-modifiers,
        some may admit them as pre-modifiers, some may admit both.

        After a quick look, I would guess that

          { sh ch ee she che eee } admit "a/o/y" only as post-modifiers

          { r l m n ir iir in iin iiin } admit "a/o/y" only as pre-modifiers

          { s d k t cth ckh ke te cthe ckhe } admit "a/o/y" in both positions.

        But this hunch needs to be confirmed...

    (6) None of the above.
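  Both "random" baselines used in this note (the one for the length of
  "O" slots and the one for runs of consecutive "K" elements) are
  geometric distributions, and can be recomputed with a few lines of
  Python. This is only a sketch under the note's own assumptions: a
  letter is an "O" letter with probability 1/3 when the ratio K:O is
  2:1, and an "O" slot is empty with probability 0.5. The function
  and parameter names are mine.

```python
def o_slot_length_probs(p_o=1/3, max_len=3):
    """P(an "O" slot holds exactly n letters), for n = 0..max_len, if each
    successive letter is independently an "O" letter with probability p_o."""
    return [(1 - p_o) * p_o ** n for n in range(max_len + 1)]

def k_run_length_probs(p_empty=0.5, max_len=7):
    """P(a run has exactly n consecutive "K" elements), for n = 1..max_len,
    if each "O" slot is independently empty with probability p_empty."""
    return [(1 - p_empty) * p_empty ** (n - 1) for n in range(1, max_len + 1)]

# "O" slot lengths 0, 1, 2, 3 under random mixing at K:O = 2:1
print([round(p, 4) for p in o_slot_length_probs()])

# Runs of 1 to 7 "K" elements when half of the "O" slots are empty
print([round(p, 4) for p in k_run_length_probs()])
```

  Changing p_empty to the observed fraction of empty slots in a given
  file gives the corresponding baseline for that file; the caveat
  about short words biasing the run statistics still applies.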
Appendix: A more flexible factoring script
------------------------------------------

  The logic of factor-OK.sed was rewritten in AWK as a
  "factor-field-OK" script that allows one to factor a selected field
  of a multifield file.

  Checking consistency of the two scripts:

    foreach file ( hea-u hea-f heb-f bio-f vdp-z )
      echo ${file}-old
      cat ${file}.wds \
        | sed -f factor-OK.sed \
        > .${file}-old.fac
      echo ${file}-new
      cat ${file}.wds \
        | factor-field-OK \
        | gawk '/./{ print $1; }' \
        > .${file}-new.fac
      dicio-wc .${file}-{old,new}.fac
      diff .${file}-{old,new}.fac
      /bin/rm .${file}-{old,new}.fac
    end

  The differences are confined to words that factor-OK can't parse.
  The new script will forcibly factor those words into elements {i+X},
  {X[eh]}, or {X} where {X} is a character other than [aoy].
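  For readers who want to experiment without sed or AWK, the QOKOKOKO
  factoring can also be sketched in Python. This is not the
  factor-field-OK script, which is not reproduced in this note. It is
  a simplified left-to-right version that reuses the letter folding
  and element classes of factor-OK.sed, and it returns None for words
  that do not fit the schema instead of forcing a factorization. It
  assumes lowercase basic EVA input; all names below are illustrative.

```python
import re

# Single-letter codes for multi-letter groups, applied in the same
# order as the substitutions at the top of factor-OK.sed.
FOLD = [("ch", "C"), ("sh", "S"), ("ee", "E"),
        ("ckh", "K"), ("cth", "T"), ("cfh", "F"), ("cph", "P"),
        ("ikh", "G"), ("ith", "H"), ("ifh", "M"), ("iph", "N"),
        ("ck", "U"), ("ct", "V"), ("cf", "X"), ("cp", "Y")]

# One "K" element: an e-capable letter with optional "e", a letter that
# takes no modifier, or an i-capable letter with up to three "i"s.
ELEM = r"(?:[CSEktfpKTFPd]e?|[GHMNUVXY]|i{0,3}[rlgmnsd])"
WORD = re.compile(rf"(q?)([aoy]*)((?:{ELEM}[aoy]*)*)")
STEP = re.compile(rf"({ELEM})([aoy]*)")

def factor(word):
    """Return the word as {Q}O{K}O...{K}O with "_" for empty slots,
    or None if the word does not fit the QOKOKOKO schema."""
    folded = word
    for group, code in FOLD:
        folded = folded.replace(group, code)
    match = WORD.fullmatch(folded)
    if match is None:
        return None
    q, first_o, rest = match.groups()
    out = "{" + (q or "_") + "}" + (first_o or "_")
    for k, o in STEP.findall(rest):
        out += "{" + k + "}" + (o or "_")
    for group, code in FOLD:
        out = out.replace(code, group)  # undo the letter folding
    return out

for w in ["okeeedy", "qokaiin", "daiin", "cthhy", "oedy"]:
    print(w, factor(w))
```

  Because "ee" is folded from the left before matching, a string of
  three "e"s comes out as a single "eee" element, as in the sed script.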