Hacking at the Voynich manuscript
Notebook - volume 3

Warning: these notebooks aren't strictly chronological logs. Sometimes I go
back and redo things, clarify comments, delete garbage, etc.

Summary of previous notebooks
=============================

On 97-07-05 I obtained Landini's interlinear transcription of the VMs,
version 1.6 (landini-interln16.evt) from
http://sun1.bham.ac.uk/G.Landini/evmt/intrln16.zip

I manually extracted from it a homogeneous, full-text sample bio-m-evt.evt,
consisting of pages 147-166 (f75r--f84v) of the "biological" section, in
Currier's Language B, hand 2.

This section includes Currier's and Friedman's transcriptions. Currier's
seems to be the most complete of them.

The two versions have many differences (affecting 5-10% of the words), and
often disagree even in the grouping of symbols: where one sees two words the
other sees a single word, what is [A] for one may be [CI] for the other, and
so on.

So I decided to break all characters down to individual "logical" strokes,
and use one (computer) character to encode each stroke. I called this new
encoding "jsa" (Jorge's Super-Analytic).

After mapping to jsa, I generated a "consensus" version of the biological
section:

    cat bio-m-evt.evt \
      | fsg2jsa \
      > bio-m-jsa.evt

    cat bio-m-jsa.evt \
      | make-consensus-interlin \
      > bio-x-jsa.evt

    cat bio-x-jsa.evt \
      | egrep '^<.*;J> ' \
      | sed \
          -e 's/{[^}]*}//g' \
      > bio-j-jsa.evt

    extract-words-from-interlin \
      -chars "qocilgysxju" \
      bio-j-jsa.evt \
      bio-j-jsa

     lines   words     bytes file
    ------ ------- --------- ------------
      7054    7054     62690 bio-j-jsa.wds
      2132    2132     24925 bio-j-jsa.dic
      4661    4661     40897 bio-j-jsa-gut.wds
       992     992      9720 bio-j-jsa-gut.dic
       840     840      2445 bio-j-jsa-fun.wds
         2       2         5 bio-j-jsa-fun.dic
      1553    1553     19348 bio-j-jsa-bad.wds
      1138    1138     15200 bio-j-jsa-bad.dic

Digraph counts (rows: first stroke of the pair; columns: second stroke; the
unlabeled row and column stand for the word boundary):

                  q     o     c     i     l     g     y     s     x     j     u   TOT
        ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- -----
            .  1398   965  1877   361    60     .     .     .     .     .     .  4661
    q       1     .  1229    18     .     1   154     .     .     .   700     .  2103
    o      21   486     1    63  1087  1071     .     .     .     .     .     .  2729
    c       4   167   176  6137  1209   232  2114  2921  1019     .     .     . 13979
    i       4     1     1     8  1997     2     .     .   560  1616    37   457  4683
    l       .     .     .     .     .     .    16     .     .     .  1566     .  1582
    g      52     .    74  2150     4     4     .     .     .     .     .     .  2284
    y    2790    26     2    47    13    43     .     .     .     .     .     .  2921
    s     463     1    99  1013     1     2     .     .     .     .     .     .  1579
    x     827    24   105   488     5   167     .     .     .     .     .     .  1616
    j      46     .    76  2175     6     .     .     .     .     .     .     .  2303
    u     453     .     1     3     .     .     .     .     .     .     .     .   457
        ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- -----
    TOT  4661  2103  2729 13979  4683  1582  2284  2921  1579  1616  2303   457 40897

Some conclusions we get from this and other data:

  \ci/ and \o/ are lexically similar but distinct letters.

  The valid \i/ sequences are \ij/ \is/ \iis/ \iiu/ \iiiu/ \ix/; the others
  are likely to be scription or transcription errors.

  \qo/ is a combination that occurs only in word-initial position.

97-07-26 stolfi
===============

Back from the math colloquium, I decided to review the \c/ stroke in JSA
encoding, to see if strings of \c/s could be parsed unambiguously into
letters.

First of all, let's review the grouping of \ci*/ into letters:

    cat bio-j-jsa-gut.wds \
      | sed \
          -e 's/^/_/g' \
          -e 's/$/_/g' \
          -e 's/ci/a/g' \
      | compare-contexts -lctx 0 -rctx 0 -colw 20 \
          'ai*' \
          'oi*'

      710 0.59 ai           1642 0.60 o
      330 0.27 aiii         1075 0.39 oi
      130 0.11 aii             9 0.00 oiii
       35 0.03 a               3 0.00 oii
        4 0.00 aiiii       ----- ---- ----
    ----- ---- ----         2729 1.00 TOT
     1209 1.00 TOT

Hm, it seems that \ci/ is not really a letter; it is most often attached to
the following \i/ strings.
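The compare-contexts helper is not reproduced in these notes. As a rough idea of the kind of tabulation it seems to perform, here is a minimal Python sketch. The names count_contexts and format_table, their parameters, and the assumption that the tool tallies every greedy regex match widened by lctx/rctx neighbouring characters are all guesses for illustration, not the original script.

```python
import re
from collections import Counter

def count_contexts(words, pattern, lctx=0, rctx=0):
    """Tally each greedy match of `pattern`, widened by lctx/rctx characters.

    Assumption: this mimics what compare-contexts appears to do; the real
    tool may differ in how it treats overlaps and word boundaries.
    """
    regex = re.compile(pattern)
    counts = Counter()
    for word in words:
        for m in regex.finditer(word):
            if m.end() == m.start():
                continue  # skip empty matches
            lo = max(0, m.start() - lctx)
            hi = min(len(word), m.end() + rctx)
            counts[word[lo:hi]] += 1
    return counts

def format_table(counts):
    """Return 'count fraction context' lines, most frequent first, plus a total."""
    total = sum(counts.values())
    rows = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    lines = [f"{n:5d} {n / total:4.2f} {ctx}" for ctx, n in rows]
    lines.append(f"{total:5d} {1.0:4.2f} TOT")
    return lines

# Toy input (hypothetical words, already delimited with '_' and with ci -> a).
toy_words = ["_qoaix_", "_aiiiu_", "_oix_", "_aix_"]
for line in format_table(count_contexts(toy_words, r"ai*", rctx=1)):
    print(line)
```

Feeding the real word list through the same function, once per pattern, would give one column of a table like the ones in this notebook; laying the columns side by side is left out of the sketch.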
Let's retry with some more context, and removing the \qo/ combination:

    cat bio-j-jsa-gut.wds \
      | sed \
          -e 's/^/_/g' \
          -e 's/$/_/g' \
          -e 's/_qo/_A/g' \
          -e 's/ci/a/g' \
      | compare-contexts -lctx 0 -rctx 1 -colw 20 \
          'ai*' \
          'oi*'

      403 0.33 aix           760 0.50 oix
      329 0.27 aiiiu         260 0.17 oq
      271 0.22 ais           250 0.17 ol
      109 0.09 aiiu          171 0.11 ois
       32 0.03 aij            35 0.02 oc
       19 0.02 aiis           17 0.01 o_
       15 0.01 ax              9 0.01 oiiiu
        8 0.01 ac              4 0.00 oij
        4 0.00 as              1 0.00 oiiu
        4 0.00 aiu             1 0.00 oiis
        4 0.00 aiiiiu      ----- ---- ----
        4 0.00 a_           1508 1.00 TOT
        2 0.00 al
        2 0.00 aiix
        1 0.00 aq
        1 0.00 ao
        1 0.00 aiiis
    ----- ---- ----
     1209 1.00 TOT

Let's look at the .dic file, instead of .wds, to lessen the effect of
common words:

    cat bio-j-jsa-gut.dic \
      | sed \
          -e 's/^/_/g' \
          -e 's/$/_/g' \
          -e 's/_qo/_A/g' \
          -e 's/ci/a/g' \
      | compare-contexts -lctx 0 -rctx 1 -colw 20 \
          'ai*' \
          'oi*'

      126 0.35 aix           233 0.49 oix
       86 0.24 ais            64 0.13 oq
       48 0.13 aiiiu          64 0.13 ois
       22 0.06 aiiu           61 0.13 ol
       21 0.06 aij            32 0.07 oc
       14 0.04 aiis           13 0.03 o_
       11 0.03 ax              5 0.01 oiiiu
        8 0.02 ac              4 0.01 oij
        4 0.01 as              1 0.00 oiiu
        4 0.01 aiiiiu          1 0.00 oiis
        3 0.01 a_          ----- ---- ----
        2 0.01 al            478 1.00 TOT
        2 0.01 aiu
        2 0.01 aiix
        1 0.00 aq
        1 0.00 ao
        1 0.00 aiiis
    ----- ---- ----
      356 1.00 TOT

Hm, it seems quite possible that the \o/ in \oiiiu/, \oij/, \oiiu/, \oiis/
is actually a misreading of \ci/.

Let's now compare the occurrences of \i/ strings after \c/ against other
letters:

    cat bio-j-jsa-gut.wds \
      | sed \
          -e 's/^/_/g' \
          -e 's/$/_/g' \
      | compare-contexts -lctx 0 -rctx 0 -colw 20 \
          'cii*' \
          '[^ci]ii*'

      710 0.59 cii          1075 0.73 oi
      330 0.27 ciiii         361 0.24 _i
      130 0.11 ciii           13 0.01 yi
       35 0.03 ci              9 0.01 oiii
        4 0.00 ciiiii          6 0.00 ji
    ----- ---- ----            5 0.00 xi
     1209 1.00 TOT             4 0.00 gi
                               3 0.00 oii
                               1 0.00 si
                           ----- ---- ----
                            1477 1.00 TOT

Hm. According to this table, strings of two or more \i/s occur only after a
\c/ stroke.
The exceptions \oiii/ (9 instances) and \oii/ (3 instances) can easily be
explained as misreadings of \ciiii/ (330 instances) and \ciii/ (130
instances). If the probability of misreading \ci/ as \o/ is independent of
context, then we can expect that about 20 of the \oi/s are actually \cii/s
(710 instances, at a misreading rate of 2-3%).

Let's retry with some additional context:

    cat bio-j-jsa-gut.wds \
      | sed \
          -e 's/^/_/g' \
          -e 's/$/_/g' \
      | compare-contexts -lctx 0 -rctx 1 -colw 20 \
          'cii*' \
          '[^ci]ii*'

      403 0.33 ciix          893 0.60 oix
      329 0.27 ciiiiu        282 0.19 _ix
      271 0.22 ciis          178 0.12 ois
      109 0.09 ciiiu          79 0.05 _is
       32 0.03 ciij            9 0.01 oiiiu
       19 0.02 ciiis           8 0.01 yix
       15 0.01 cix             6 0.00 jix
        8 0.01 cic             4 0.00 yis
        4 0.00 cis             4 0.00 oij
        4 0.00 ciiu            3 0.00 xix
        4 0.00 ciiiiiu         3 0.00 gix
        4 0.00 ci_             2 0.00 xis
        2 0.00 cil             2 0.00 oiiu
        2 0.00 ciiix           1 0.00 yij
        1 0.00 ciq             1 0.00 six
        1 0.00 cio             1 0.00 oiis
        1 0.00 ciiiis          1 0.00 gis
    ----- ---- ----        ----- ---- ----
     1209 1.00 TOT          1477 1.00 TOT

According to this table, it seems that:

  The pairs \ix/, \is/, and \ij/ are letters: they can appear after \o/,
  \ci/, and word-begin, but also after other strokes like \c/, \y/, \j/,
  \g/, \s/, \x/.

  Strings of one or more \i/ ending with the \u/ plume can only occur after
  \c/. The exceptions \oiiiu/ (9) and \oiiu/ (2) can be explained as
  misreadings of \ciiiiu/ (329) and \ciiiu/ (109). That would suggest that
  about 3% of the \ci/s were read as \o/s.

  Strings of two or more \i/s with the \s/, \j/, and \x/ plumes can appear
  only after \c/. The exception \oiis/ (1) can be explained as a misreading
  of \ciiis/ (19).

Here is the \c/ column again, manually sorted by last letter:

        4 0.00 ci_
        8 0.01 cic
        2 0.00 cil
        1 0.00 cio
        1 0.00 ciq
       32 0.03 ciij
        4 0.00 cis
      271 0.22 ciis
       19 0.02 ciiis
        1 0.00 ciiiis
        4 0.00 ciiu
      109 0.09 ciiiu
      329 0.27 ciiiiu
        4 0.00 ciiiiiu
       15 0.01 cix
      403 0.33 ciix
        2 0.00 ciiix
    ----- ---- ----
     1209 1.00 TOT

From all this data, it seems we can draw the following hypotheses:

  The strings \ij/, \is/, and \ix/ are letters.

  It is possible that \iis/ is a rare letter, too.

  The pair \ci/ is often a letter, but sometimes it is not. In particular,
  \cix/ is the letter \ix/ following a \c/.

  The only strings that end with \u/ plume are \ciiiu/ and \ciiiiu/.

The last observation has a number of possible explanations:

  (1) \ciiiu/ and \ciiiiu/ are letters; or

  (2) \iiu/ and \iiiu/ are letters that can occur only after \ci/; or

  (3) \iiiu/ and \iiiiu/ are letters that can occur only after \c/; or

  (4) \iiiu/ is a letter that can occur after \c/ or \ci/.

The mixed hypotheses

  (5) \iiu/ and \iiiu/ are letters that can occur after \c/ or \ci/

  (6) \iiiu/ and \iiiiu/ are letters that can occur after \c/ or \ci/

are rather unlikely, given the low frequency of \ciiu/ and \ciiiiiu/.

Hypothesis (2) has the merit that it provides an alternative explanation
for the rare occurrences of \oiiu/ and \oiiiu/, not depending on
transcription errors.

Let's be conservative, and lump only \ix/, \ij/, \is/, \iiu/ as letters,
leaving out the first \i/ of \iiiu/ and \iis/. Here is a table of these
\i/ letters and their left contexts:

    cat bio-j-jsa-gut.wds \
      | sed \
          -e 's/^/_/g' \
          -e 's/$/_/g' \
          -e 's/ij/k/g' \
          -e 's/ix/e/g' \
          -e 's/is/r/g' \
          -e 's/iiu/n/g' \
          -e 's/ci/a/g' \
      | compare-contexts -lctx 1 -rctx 0 -colw 17 \
          'i*k' \
          'i*e' \
          'i*r' \
          'i*n'

       32 0.86 ak         893 0.55 oe         271 0.48 ar         329 0.72 ain
        4 0.11 ok         403 0.25 ae         178 0.32 or         109 0.24 an
        1 0.03 yk         282 0.17 _e          79 0.14 _r           9 0.02 oin
    ----- ---- ---        15 0.01 ce          19 0.03 air           4 0.01 cn
       37 1.00 TOT          8 0.00 ye           4 0.01 yr           4 0.01 aii
                            6 0.00 je           4 0.01 cr           2 0.00 on
                            3 0.00 ge           2 0.00 er       ----- ---- ---
                            3 0.00 e            1 0.00 oir        457 1.00 TOT
                            2 0.00 aie          1 0.00 gr
                            1 0.00 se           1 0.00 aii
                        ----- ---- ---      ----- ---- ---
                         1616 1.00 TOT        560 1.00 TOT

Note the occurrences of \y/ before the \i/ letters, except \m/ and \n/.
This data confirms our previous guess that the \cy/ group is merely the
final form of \ci/.
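The recoding behind the table above is just a chain of global substitutions applied in a fixed order, and the order matters (for instance \iiu/ must become `n' before \ci/ becomes `a'). A minimal Python sketch of that chain follows; the names RULES, RULES_WITH_CY and recode are illustrative only, and plain string replacement is assumed to behave like the literal `s/x/y/g' expressions in the sed commands.

```python
# Ordered (old, new) pairs mirroring the sed expressions of the table above.
RULES = [("ij", "k"), ("ix", "e"), ("is", "r"), ("iiu", "n"), ("ci", "a")]

# Same chain with the extra reduction cy -> a placed before ci -> a.
RULES_WITH_CY = RULES[:4] + [("cy", "a"), ("ci", "a")]

def recode(word, rules=RULES):
    """Delimit the word with '_' and apply each substitution in order."""
    word = "_" + word + "_"
    for old, new in rules:
        # str.replace substitutes all non-overlapping occurrences, left to right.
        word = word.replace(old, new)
    return word

print(recode("ciix"))                 # c + ix, then ci -> a
print(recode("ciiiiu"))               # iiu -> n leaves "ciin", then ci -> a
print(recode("ocy", RULES_WITH_CY))   # final cy treated like ci
```

With these rules \ciix/ comes out as `ae' and \ciiiiu/ as `ain', which is how those strings show up in the table; adding the cy rule merges word-final \cy/ with \ci/.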
Let's do this reduction, too: cat bio-j-jsa-gut.wds \ | sed \ -e 's/^/_/g' \ -e 's/$/_/g' \ -e 's/ij/k/g' \ -e 's/ix/e/g' \ -e 's/is/r/g' \ -e 's/iiu/n/g' \ -e 's/cy/a/g' \ -e 's/ci/a/g' \ | compare-contexts -lctx 1 -rctx 0 -colw 17 \ 'i*k' \ 'i*e' \ 'i*r' \ 'i*n' 33 0.89 ak 893 0.55 oe 275 0.49 ar 329 0.72 ain 4 0.11 ok 411 0.25 ae 178 0.32 or 109 0.24 an ----- ---- --- 282 0.17 _e 79 0.14 _r 9 0.02 oin 37 1.00 TOT 15 0.01 ce 19 0.03 air 4 0.01 cn 6 0.00 je 4 0.01 cr 4 0.01 aii 3 0.00 ge 2 0.00 er 2 0.00 on 3 0.00 e 1 0.00 oir ----- ---- --- 2 0.00 aie 1 0.00 gr 457 1.00 TOT 1 0.00 se 1 0.00 aii ----- ---- --- ----- ---- --- 1616 1.00 TOT 560 1.00 TOT Now, let's check whether the combinations \cy/, \ci/ and \cg/ behave like \c/ on the left. To reduce the number of distinct patterns, I will collapse \cs/ to \c/, and erase all gallows: cat bio-j-jsa-gut.wds \ | sed \ -e 's/^/_/g' \ -e 's/$/_/g' \ -e 's/[ql][jg]//' \ -e 's/cs/c/g' \ -e 's/ci/a/g' \ -e 's/cy/a/g' \ -e 's/cg/8/g' \ -e 's/c\([^ca8o]\)/C\1/g' \ -e 's/o\([^ca8o]\)/O\1/g' \ | compare-contexts -lctx 1 -rctx 0 -colw 20 \ '[ca8o]*C' \ '[ca8o]*8' \ '[ca8o]*a' \ '[ca8o]*O' 10 0.19 _cccC 456 0.22 _ccc8 443 0.11 _ccc8a 517 0.46 _O 6 0.11 xC 322 0.16 _8 418 0.10 qoa 157 0.14 qO 6 0.11 _C 207 0.10 qoc8 264 0.07 _8a 102 0.09 xO 5 0.09 _ccC 201 0.10 qocc8 211 0.05 _ccca 65 0.06 _cccO 4 0.07 _ccccC 150 0.07 xccc8 203 0.05 qoc8a 61 0.05 _cO 3 0.06 xccccC 95 0.05 _oc8 196 0.05 qocc8a 41 0.04 _ccO 3 0.06 OC 74 0.04 _cccc8 180 0.04 _oa 36 0.03 sO 2 0.04 xcC 65 0.03 _occ8 175 0.04 _cccca 33 0.03 qoO 2 0.04 qocccC 57 0.03 xcc8 174 0.04 xa 26 0.02 _8O 2 0.04 qoaC 54 0.03 _cc8 140 0.03 xccc8a 18 0.02 _oO 1 0.02 xccC 51 0.02 x8 108 0.03 _a 9 0.01 xcccO 1 0.02 scccC 35 0.02 qoccc8 93 0.02 _oc8a 8 0.01 _ocO 1 0.02 sccC 31 0.02 xc8 87 0.02 _ccccc 6 0.01 xccO 1 0.02 qoccC 29 0.01 _occc8 85 0.02 qocca 6 0.01 qocO 1 0.02 qocC 27 0.01 _c8 85 0.02 _ca 6 0.01 _ccccO 1 0.02 qcc8aC 26 0.01 _8ccc8 72 0.02 _cccc8 6 0.01 _8ccO 1 0.02 _occca 
18 0.01 _ccccc 68 0.02 xccca 4 0.00 _occO 1 0.02 _cC 16 0.01 _accc8 64 0.02 sa 4 0.00 _ccc8O 1 0.02 _acccc 14 0.01 _acc8 62 0.02 _occ8a 4 0.00 _8cccO 1 0.02 _aC 12 0.01 xcccc8 59 0.01 _cca 3 0.00 jO 1 0.02 _8cccc 12 0.01 sccc8 55 0.01 xcc8a 3 0.00 _cc8O ----- ---- ---- 12 0.01 _ac8 50 0.01 xcca 2 0.00 xcO 54 1.00 TOT 11 0.01 _ccccc 49 0.01 _cc8a 2 0.00 qoccO 7 0.00 qo8 48 0.01 x8a 2 0.00 _occcO 6 0.00 _o8 45 0.01 qoca 2 0.00 _ccccc 5 0.00 qcc8 40 0.01 _occa 2 0.00 _acccO 5 0.00 _8cc8 34 0.01 qoccc8 1 0.00 xccccO 4 0.00 scccc8 29 0.01 xc8a 1 0.00 x8O 4 0.00 _a8 28 0.01 _occc8 1 0.00 uO 4 0.00 _8c8 27 0.01 _c8a 1 0.00 qoc8O 3 0.00 qocccc 26 0.01 _8ccc8 1 0.00 jcccO 3 0.00 qoa8 22 0.01 _oca 1 0.00 _oaO 3 0.00 qccc8 21 0.01 _occca 1 0.00 _coO 3 0.00 _8cccc 18 0.00 _ccccc 1 0.00 _c8aO 2 0.00 xoccc8 17 0.00 qoccca 1 0.00 _8cccc 2 0.00 xo8 16 0.00 xcccca ----- ---- ---- 2 0.00 xccccc 15 0.00 _accc8 1134 1.00 TOT 2 0.00 xa8 14 0.00 _acc8a 2 0.00 s8 12 0.00 sccca 2 0.00 _cocc8 12 0.00 _ac8a 2 0.00 _acccc 12 0.00 _aa 2 0.00 _8acc8 11 0.00 xcccc8 2 0.00 _8ac8 11 0.00 xca 1 0.00 xccccc 10 0.00 sccc8a 1 0.00 x8ccc8 9 0.00 _ccccc 1 0.00 x88 9 0.00 _acca 1 0.00 scc8 8 0.00 _accca 1 0.00 sa8 7 0.00 ja 1 0.00 qoc8cc 6 0.00 xccccc 1 0.00 qoc8c8 6 0.00 _o8a 1 0.00 qoacc8 6 0.00 _ccccc 1 0.00 qo8ccc 5 0.00 qo8a 1 0.00 qo8cc8 5 0.00 _acccc 1 0.00 qcccc8 5 0.00 _8cc8a 1 0.00 jcc8 4 0.00 scccc8 1 0.00 jc8 4 0.00 qcc8a 1 0.00 j8 4 0.00 qca 1 0.00 _occo8 4 0.00 _aca 1 0.00 _oa8 4 0.00 _a8a 1 0.00 _o8ccc 4 0.00 _8ccca 1 0.00 _co8 4 0.00 _8c8a 1 0.00 _ccocc 3 0.00 xoa 1 0.00 _ccocc 3 0.00 ua 1 0.00 _cco8 3 0.00 qocccc 1 0.00 _8oc8 3 0.00 qoa8a 1 0.00 _8cccc 3 0.00 qccc8a ----- ---- ---- 3 0.00 jca 2063 1.00 TOT 3 0.00 _occcc 3 0.00 _ccoa 3 0.00 _8cccc 3 0.00 _8cccc 3 0.00 _8cca 2 0.00 xoccc8 2 0.00 xccccc 2 0.00 xccccc 2 0.00 scca 2 0.00 qcca 2 0.00 jcca 2 0.00 ga 2 0.00 _cocc8 2 0.00 _acccc 2 0.00 _acccc 2 0.00 _8acca 2 0.00 _8acc8 2 0.00 _8ac8a 2 0.00 _8aa 1 0.00 xocca 1 0.00 xo8a 1 0.00 
xccccc 1 0.00 xc8ca 1 0.00 xa8a 1 0.00 x8ccc8 1 0.00 x88a 1 0.00 scccca 1 0.00 scc8a 1 0.00 sca 1 0.00 sacca 1 0.00 sa8a 1 0.00 s8a 1 0.00 qocccc 1 0.00 qoc8cc 1 0.00 qoc8c8 1 0.00 qoacc8 1 0.00 qoaa 1 0.00 qo8cca 1 0.00 qo8cc8 1 0.00 qo8aca 1 0.00 qccca 1 0.00 qcc8cc 1 0.00 qaa 1 0.00 qa 1 0.00 jcc8a 1 0.00 jc8a 1 0.00 j8a 1 0.00 _occo8 1 0.00 _oc8cc 1 0.00 _oaccc 1 0.00 _oa8a 1 0.00 _o8ccc 1 0.00 _coccc 1 0.00 _co8a 1 0.00 _ccocc 1 0.00 _ccocc 1 0.00 _ccocc 1 0.00 _cco8a 1 0.00 _cccoa 1 0.00 _ccccc 1 0.00 _ccccc 1 0.00 _cccaa 1 0.00 _ccc8c 1 0.00 _ccc8a 1 0.00 _ccacc 1 0.00 _caa 1 0.00 _aoa 1 0.00 _acccc 1 0.00 _8occa 1 0.00 _8oc8a 1 0.00 _8cccc 1 0.00 _8aca ----- ---- ---- 4017 1.00 TOT Let's recount, with narrower left contexts (all the \c/s and one more letter): cat bio-j-jsa-gut.wds \ | sed \ -e 's/^/_/g' \ -e 's/$/_/g' \ -e 's/[ql][jg]//' \ -e 's/cs/c/g' \ -e 's/ci/a/g' \ -e 's/cy/a/g' \ -e 's/cg/8/g' \ -e 's/c\([^ca8o]\)/C\1/g' \ | compare-contexts -lctx 1 -rctx 0 -colw 20 \ 'c*C' \ 'c*8' \ 'c*a' \ 'c*o' 10 0.19 _cccC 486 0.23 _ccc8 1949 0.47 8a 1230 0.45 qo 6 0.11 xC 367 0.17 _8 613 0.15 oa 1066 0.39 _o 6 0.11 _C 305 0.14 oc8 221 0.05 _ccca 110 0.04 xo 5 0.09 aC 269 0.13 occ8 216 0.05 _a 80 0.03 _co 5 0.09 _ccC 150 0.07 xccc8 180 0.04 _cccca 68 0.02 _ccco 4 0.07 _ccccC 77 0.04 _cccc8 175 0.04 xa 55 0.02 _cco 3 0.06 xccccC 67 0.03 occc8 128 0.03 occa 37 0.01 8o 3 0.06 oC 60 0.03 _cc8 92 0.02 _ca 36 0.01 so 2 0.04 xcC 57 0.03 xcc8 89 0.02 _ccccc 9 0.00 xccco 2 0.04 occcC 53 0.03 x8 73 0.02 _cca 6 0.00 xcco 1 0.02 xccC 32 0.02 _c8 68 0.02 xccca 6 0.00 _cccco 1 0.02 scccC 31 0.01 xc8 67 0.02 oca 6 0.00 8cco 1 0.02 sccC 21 0.01 o8 66 0.02 sa 4 0.00 8ccco 1 0.02 occC 19 0.01 _ccccc 50 0.01 xcca 3 0.00 jo 1 0.02 ocC 17 0.01 acc8 39 0.01 occca 3 0.00 ao 1 0.02 accccC 16 0.01 accc8 16 0.00 xcccca 2 0.00 xco 1 0.02 _cC 14 0.01 ac8 12 0.00 sccca 2 0.00 accco 1 0.02 8ccccC 12 0.01 xcccc8 11 0.00 xca 2 0.00 _ccccc ----- ---- ---- 12 0.01 sccc8 8 0.00 8cca 1 0.00 xcccco 
54 1.00 TOT 11 0.01 a8 7 0.00 ja 1 0.00 uo 11 0.01 _ccccc 7 0.00 _ccccc 1 0.00 jccco 5 0.00 qcc8 6 0.00 xccccc 1 0.00 8cccco 4 0.00 scccc8 4 0.00 qca ----- ---- ---- 3 0.00 qccc8 4 0.00 occcca 2729 1.00 TOT 3 0.00 occcc8 4 0.00 8ccca 2 0.00 xccccc 3 0.00 ua 2 0.00 s8 3 0.00 jca 2 0.00 acccc8 3 0.00 8cccca 1 0.00 xccccc 2 0.00 xccccc 1 0.00 scc8 2 0.00 scca 1 0.00 qcccc8 2 0.00 qcca 1 0.00 jcc8 2 0.00 qa 1 0.00 jc8 2 0.00 jcca 1 0.00 j8 2 0.00 ga ----- ---- ---- 1 0.00 scccca 2114 1.00 TOT 1 0.00 sca 1 0.00 qccca 1 0.00 _ccccc 1 0.00 8ca ----- ---- ---- 4131 1.00 TOT 97-07-27 stolfi =============== Again, ignoring repeated words: cat bio-j-jsa-gut.dic \ | sed \ -e 's/^/_/g' \ -e 's/$/_/g' \ -e 's/[ql][jg]//' \ -e 's/cs/c/g' \ -e 's/ci/a/g' \ -e 's/cy/a/g' \ -e 's/cg/8/g' \ -e 's/c\([^ca8o]\)/C\1/g' \ | compare-contexts -lctx 1 -rctx 0 -colw 20 \ 'c*C' \ 'c*8' \ 'c*a' \ 'c*o' 5 0.14 xC 79 0.19 _8 331 0.35 8a 258 0.41 _o 5 0.14 aC 37 0.09 _ccc8 108 0.11 _a 163 0.26 qo 3 0.08 xccccC 34 0.08 x8 79 0.08 oa 56 0.09 xo 3 0.08 oC 30 0.07 xccc8 75 0.08 xa 31 0.05 _cco 3 0.08 _cccC 27 0.06 xcc8 39 0.04 sa 30 0.05 _co 3 0.08 _ccC 26 0.06 occc8 38 0.04 _ccca 19 0.03 so 2 0.05 xcC 23 0.05 oc8 37 0.04 _cca 19 0.03 _ccco 2 0.05 occcC 21 0.05 occ8 32 0.03 _ca 18 0.03 8o 2 0.05 _ccccC 21 0.05 _cc8 23 0.02 _cccca 7 0.01 xccco 1 0.03 xccC 18 0.04 o8 22 0.02 xccca 5 0.01 8cco 1 0.03 scccC 14 0.03 _cccc8 21 0.02 occca 4 0.01 xcco 1 0.03 sccC 10 0.02 a8 18 0.02 xcca 4 0.01 _cccco 1 0.03 occC 9 0.02 _ccccc 16 0.02 occa 3 0.00 jo 1 0.03 ocC 9 0.02 _ccccc 14 0.01 _ccccc 3 0.00 ao 1 0.03 accccC 8 0.02 xc8 10 0.01 xcccca 2 0.00 xco 1 0.03 _cC 7 0.02 xcccc8 8 0.01 xca 2 0.00 accco 1 0.03 _C 7 0.02 _c8 8 0.01 sccca 2 0.00 _ccccc 1 0.03 8ccccC 6 0.01 accc8 7 0.01 oca 2 0.00 8ccco ----- ---- ---- 6 0.01 acc8 7 0.01 ja 1 0.00 xcccco 37 1.00 TOT 4 0.01 sccc8 7 0.01 8cca 1 0.00 uo 4 0.01 qcc8 6 0.01 _ccccc 1 0.00 jccco 4 0.01 ac8 4 0.00 qca 1 0.00 8cccco 3 0.01 scccc8 3 0.00 xccccc ----- ---- ---- 2 
0.00 xccccc 3 0.00 occcca 632 1.00 TOT 2 0.00 s8 3 0.00 jca 2 0.00 qccc8 3 0.00 8ccca 2 0.00 occcc8 2 0.00 xccccc 2 0.00 acccc8 2 0.00 ua 1 0.00 xccccc 2 0.00 scca 1 0.00 scc8 2 0.00 qcca 1 0.00 qcccc8 2 0.00 qa 1 0.00 jcc8 2 0.00 jcca 1 0.00 jc8 2 0.00 ga 1 0.00 j8 2 0.00 8cccca ----- ---- ---- 1 0.00 scccca 423 1.00 TOT 1 0.00 sca 1 0.00 qccca 1 0.00 _ccccc 1 0.00 8ca ----- ---- ---- 943 1.00 TOT These data suggest that, ignoring gallows and \s/-plumes, the \c/ strings always end in \ci/, \cy/, \cg/ or \o/. Let's look again at the \c/ strings and their relationship to \s/-plumes and gallows. For simplicity, let's map all gallows to `H', and \cs/ to `z'; for consistency, let's map \cy/ and \ci/ to `a', \cg/ to `8'. cat bio-j-jsa-gut.wds \ | sed \ -e 's/^/_/g' \ -e 's/$/_/g' \ -e 's/[ql][jg]/H/' \ -e 's/cs/z/g' \ -e 's/ij/k/g' \ -e 's/ix/e/g' \ -e 's/is/r/g' \ -e 's/iiu/n/g' \ -e 's/y/i/g' \ -e 's/ci/a/g' \ -e 's/cg/8/g' \ | compare-contexts -lctx 0 -rctx 0 -colw 24\ '[czH][czH]*a' \ '[czH][czH]*o' \ '[czH][czH]*8' \ '[czH][czH]*[^czHao8]' 718 0.40 Ha 106 0.30 Ho 379 0.23 Hc8 18 0.20 z_ 169 0.09 Hcca 63 0.18 zo 319 0.19 Hcc8 11 0.12 cce 133 0.07 ccca 35 0.10 zcco 317 0.19 ccc8 9 0.10 cccz_ 108 0.06 zcca 34 0.10 ccco 296 0.18 zcc8 7 0.08 H_ 79 0.04 Hca 24 0.07 zco 84 0.05 Hccc8 6 0.07 He 70 0.04 za 23 0.07 cco 54 0.03 cc8 5 0.05 ccccz_ 54 0.03 cccHca 19 0.05 Hco 51 0.03 zccc8 4 0.04 zce 48 0.03 cca 13 0.04 Hcco 28 0.02 cccc8 4 0.04 zcccz_ 40 0.02 ccccHca 6 0.02 Hccco 22 0.01 Hzcc8 3 0.03 zccH_ 39 0.02 zccHca 4 0.01 cHco 21 0.01 zc8 3 0.03 Hcn 39 0.02 cccca 4 0.01 Hzcco 10 0.01 cccHcc8 2 0.02 zc_ 37 0.02 zcccHca 3 0.01 zccco 9 0.01 cccHc8 2 0.02 ccz_ 33 0.02 zccca 3 0.01 cccco 9 0.01 cHcc8 2 0.02 ccr 33 0.02 Hccca 1 0.00 zzcco 9 0.01 Hzc8 2 0.02 cccH_ 21 0.01 zccHa 1 0.00 zccHo 6 0.00 zcccHcc8 1 0.01 ze 18 0.01 zca 1 0.00 zcHo 6 0.00 zccHcc8 1 0.01 zcr 18 0.01 cccHa 1 0.00 zcHco 6 0.00 cHc8 1 0.01 zccz_ 14 0.01 cHcca 1 0.00 czco 5 0.00 H8 1 0.01 cq 14 0.01 Hzcca 1 
0.00 ccccco 4 0.00 zzcc8 1 0.01 cn 13 0.01 zcccHa 1 0.00 ccccHco 4 0.00 c8 1 0.01 cl 12 0.01 ccccHa 1 0.00 cccHo 3 0.00 ccccHcc8 1 0.01 cccHc_ 12 0.01 cHca 1 0.00 cccHco 3 0.00 ccHc8 1 0.01 cc_ 11 0.01 zccHcca 1 0.00 ccHo 2 0.00 zcccHc8 1 0.01 cHccz_ 11 0.01 ccHa 1 0.00 cHcco 2 0.00 zccHc8 1 0.01 Hcz_ 6 0.00 cccHcca 1 0.00 Hzco 2 0.00 cccHccc8 1 0.01 Hcr 6 0.00 cHa ----- ---- ---- 2 0.00 cHccc8 1 0.01 Hccz_ 5 0.00 ca 349 1.00 TOT 2 0.00 Hcccc8 1 0.01 Hccccl 5 0.00 Hcccca 1 0.00 zzcHcc8 ----- ---- ---- 4 0.00 zcHa 1 0.00 zcz8 91 1.00 TOT 4 0.00 ccccHcca 1 0.00 zccHccc8 3 0.00 Hzca 1 0.00 ccHzccc8 2 0.00 zcccHcca 1 0.00 ccHccc8 2 0.00 zHcca 1 0.00 ccHcc8 2 0.00 cccza 1 0.00 Hczc8 2 0.00 Hczcca 1 0.00 Hcz8 1 0.00 zzcccHca 1 0.00 Hccz8 1 0.00 zzcca ----- ---- ---- 1 0.00 zzccHa 1664 1.00 TOT 1 0.00 zcccca 1 0.00 zccccHcca 1 0.00 zccHccca 1 0.00 zcHcca 1 0.00 zHa 1 0.00 ccHzccca 1 0.00 ccHcca 1 0.00 cHccca 1 0.00 Hczca 1 0.00 Hccza ----- ---- ---- 1798 1.00 TOT

From this table it seems (again) that \ci/ and \o/ are equivalent, and that
\cg/ is similar, but not as much. Also, virtually all \c/ strings end with
one of these three letters; only a very few end with \ix/.
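To put a number on "virtually all", the four column totals of the table above can be turned into shares. The sketch below is only arithmetic on those totals (1798, 349, 1664 and 91, copied from the TOT rows); the dictionary keys and the function name shares are illustrative, with `a' standing for \ci/ and \cy/, `8' for \cg/, and "other" for every other following symbol.

```python
# Column totals copied from the TOT rows of the table above.
TOTALS = {"a": 1798, "o": 349, "8": 1664, "other": 91}

def shares(totals):
    """Return each column's fraction of the grand total."""
    grand = sum(totals.values())
    return {key: n / grand for key, n in totals.items()}

for key, frac in shares(TOTALS).items():
    print(f"{key:6s} {frac:6.3f}")
```

On these totals only about 2% of the [czH] strings are followed by something other than `a', `o' or `8'.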
Below I have marked with `#/*', `@/&', `+', and `-' the contexts with at least one frequency greater than 0.20, 0.08, 0.04, and 0.02, respectively: # 718 0.40 Ha # 106 0.30 Ho * 379 0.23 Hc8 @ 169 0.09 Hcca & 63 0.18 zo @ 319 0.19 Hcc8 & 133 0.07 ccca & 35 0.10 zcco & 317 0.19 ccc8 & 108 0.06 zcca & 34 0.10 ccco & 296 0.18 zcc8 * 79 0.04 Hca + 24 0.07 zco + 84 0.05 Hccc8 & 70 0.04 za + 23 0.07 cco + 54 0.03 cc8 - 54 0.03 cccHca * 19 0.05 Hco - 51 0.03 zccc8 + 48 0.03 cca @ 13 0.04 Hcco 28 0.02 cccc8 40 0.02 ccccHca + 6 0.02 Hccco 22 0.01 Hzcc8 39 0.02 zccHca 4 0.01 cHco + 21 0.01 zc8 39 0.02 cccca 4 0.01 Hzcco 10 0.01 cccHcc8 37 0.02 zcccHca - 3 0.01 zccco - 9 0.01 cccHc8 - 33 0.02 zccca 3 0.01 cccco 9 0.01 cHcc8 + 33 0.02 Hccca 1 0.00 zzcco 9 0.01 Hzc8 21 0.01 zccHa 1 0.00 zccHo 6 0.00 zcccHcc8 + 18 0.01 zca 1 0.00 zcHo 6 0.00 zccHcc8 18 0.01 cccHa 1 0.00 zcHco 6 0.00 cHc8 14 0.01 cHcca 1 0.00 czco # 5 0.00 H8 14 0.01 Hzcca 1 0.00 ccccco 4 0.00 zzcc8 13 0.01 zcccHa 1 0.00 ccccHco 4 0.00 c8 12 0.01 ccccHa 1 0.00 cccHo 3 0.00 ccccHcc8 12 0.01 cHca - 1 0.00 cccHco 3 0.00 ccHc8 11 0.01 zccHcca 1 0.00 ccHo 2 0.00 zcccHc8 11 0.01 ccHa 1 0.00 cHcco 2 0.00 zccHc8 Checking again, without repeated words: cat bio-j-jsa-gut.dic \ | sed \ -e 's/^/_/g' \ -e 's/$/_/g' \ -e 's/[ql][jg]/H/' \ -e 's/cs/z/g' \ -e 's/ij/k/g' \ -e 's/ix/e/g' \ -e 's/is/r/g' \ -e 's/iiu/n/g' \ -e 's/y/i/g' \ -e 's/ci/a/g' \ -e 's/cg/8/g' \ | compare-contexts -lctx 0 -rctx 0 -colw 24\ '[czH][czH]*a' \ '[czH][czH]*o' \ '[czH][czH]*8' \ '[czH][czH]*[^czHao8]' 130 0.31 Ha 59 0.35 Ho 40 0.14 Hc8 12 0.18 z_ 28 0.07 cca 21 0.12 zo 37 0.13 ccc8 8 0.12 cce 24 0.06 ccca 12 0.07 zco 33 0.12 Hcc8 6 0.09 H_ 23 0.06 Hcca 11 0.06 Hco 23 0.08 zcc8 5 0.08 He 19 0.05 zcca 11 0.06 Hcco 23 0.08 cc8 4 0.06 zcccz_ 15 0.04 za 10 0.06 zcco 23 0.08 Hccc8 3 0.05 zce 15 0.04 Hccca 10 0.06 cco 12 0.04 zc8 3 0.05 ccccz_ 15 0.04 Hca 10 0.06 ccco 11 0.04 zccc8 2 0.03 zc_ 11 0.03 cHca 4 0.02 cHco 11 0.04 Hzcc8 2 0.03 ccz_ 10 0.02 
ccHa 4 0.02 Hzcco 9 0.03 cccc8 2 0.03 ccr 9 0.02 cHcca 3 0.02 cccco 7 0.02 Hzc8 2 0.03 cccz_ 8 0.02 zccHa 2 0.01 Hccco 6 0.02 cHcc8 2 0.03 cccH_ 8 0.02 Hzcca 1 0.01 zzcco 5 0.02 cccHcc8 1 0.02 ze 7 0.02 cccca 1 0.01 zccco 5 0.02 cHc8 1 0.02 zcr 6 0.01 zca 1 0.01 zccHo 5 0.02 H8 1 0.02 zccz_ 6 0.01 cccHa 1 0.01 zcHo 4 0.01 zcccHcc8 1 0.02 zccH_ 6 0.01 cHa 1 0.01 zcHco 3 0.01 ccccHcc8 1 0.02 cq 5 0.01 zccca 1 0.01 czco 3 0.01 cccHc8 1 0.02 cn 5 0.01 zcccHca 1 0.01 ccccco 3 0.01 c8 1 0.02 cl 5 0.01 ccccHca 1 0.01 ccccHco 2 0.01 zccHcc8 1 0.02 cccHc_ 5 0.01 cccHca 1 0.01 cccHo 2 0.01 cccHccc8 1 0.02 cc_ 5 0.01 ca 1 0.01 cccHco 2 0.01 ccHc8 1 0.02 cHccz_ 4 0.01 zccHcca 1 0.01 ccHo 2 0.01 cHccc8 1 0.02 Hcz_ 4 0.01 zccHca 1 0.01 cHcco 1 0.00 zzcc8 1 0.02 Hcr 4 0.01 ccccHcca 1 0.01 Hzco 1 0.00 zzcHcc8 1 0.02 Hcn 4 0.01 Hcccca ----- ---- ---- 1 0.00 zcz8 1 0.02 Hccz_ 3 0.01 zcccHa 170 1.00 TOT 1 0.00 zcccHc8 1 0.02 Hccccl 3 0.01 zcHa 1 0.00 zccHccc8 ----- ---- ---- 3 0.01 Hzca 1 0.00 zccHc8 66 1.00 TOT 2 0.00 zHcca 1 0.00 ccHzccc8 2 0.00 cccza 1 0.00 ccHccc8 2 0.00 ccccHa 1 0.00 ccHcc8 2 0.00 cccHcca 1 0.00 Hczc8 2 0.00 Hczcca 1 0.00 Hcz8 1 0.00 zzcccHca 1 0.00 Hccz8 1 0.00 zzcca 1 0.00 Hcccc8 1 0.00 zzccHa ----- ---- ---- 1 0.00 zcccca 284 1.00 TOT 1 0.00 zccccHcca 1 0.00 zcccHcca 1 0.00 zccHccca 1 0.00 zcHcca 1 0.00 zHa 1 0.00 ccHzccca 1 0.00 ccHcca 1 0.00 cHccca 1 0.00 Hczca 1 0.00 Hccza ----- ---- ---- 414 1.00 TOT Here is a summary of the most common \czH/ contexts. The numbers are the percentages from the tables above. Due to an earlier mix-up, all percentages except those in the `H' row were computed relative to the totals minus the `H' entry. 
in .wds in .dic ---------- ---------- ci cb cg ci cb cg -- -- -- -- -- -- c 0 0 0 2 0 1 z 6 26 0 5 19 0 zc 2 10 1 2 11 4 H 40 30 0 31 35 2 Hc 7 8 23 5 10 14 Hcc 16 5 19 8 10 12 cc 4 9 3 10 9 8 ccc 12 14 19 8 9 13 zcc 10 14 18 7 9 8 cccc 4 1 2 2 3 3 zccc 3 1 3 2 1 4 Hccc 3 2 5 5 2 8 cccHc 5 0 1 2 1 1 ccccHc 4 0 0 2 1 0 zccHc 4 0 0 1 0 0 zcccHc 3 0 0 3 0 0 It seems that \c/ alone is not a letter \z/, \zc/ are similar letters (most common before \o/) \cc/, \ccc/, \zcc/ are similar letters (equally common before \a/, \o/, \8/). \H/ is a common letter (most common before \a/ and \o/) \Hc/, \Hcc/ are similar letters (most common before \o/ and \8/). The latter does not seem to be a group \H/-\cc/. \Hccc/ may be a letter (most common before \a/ and \o/) but may be a group \Hc/-\cc/ or \H/-\ccc/ \cccc/, \zccc/ may be rare letters (equally common before \a/, \o/, \8/). but may also be groups \cc/-\cc/ and \zc/-\cc/ Let's enumerate these various patterns: cat bio-j-jsa-gut.wds \ | sed \ -e 's/^/_/g' \ -e 's/$/_/g' \ -e 's/[ql][jg]/H/' \ -e 's/cs/z/g' \ -e 's/ij/k/g' \ -e 's/ix/e/g' \ -e 's/is/r/g' \ -e 's/iiu/n/g' \ -e 's/y/i/g' \ -e 's/ci/a/g' \ -e 's/cg/8/g' \ | compare-contexts -lctx 0 -rctx 0 -colw 17 \ '[^czH]z[^czH]' \ '[^czH]zc[^czH]' \ '[^czH]cc[^czH]' \ '[^czH]ccc[^czH]' \ '[^czH]zcc[^czH]' \ '[^czH]H[^czH]' \ '[^czH]Hc[^czH]' \ '[^czH]Hcc[^czH]' 66 0.44 _za 19 0.27 _zco 22 0.16 _cc8 192 0.40 _ccc8 219 0.50 _zcc8 63 0.42 _zo 16 0.23 _zca 19 0.14 _cca 97 0.20 eccc8 77 0.18 _zcca 6 0.04 _z_ 13 0.19 ezc8 17 0.12 _cco 78 0.16 _ccca 43 0.10 ezcc8 5 0.03 ez_ 6 0.09 _zc8 15 0.11 ecca 44 0.09 eccca 30 0.07 _zcco 5 0.03 az_ 3 0.04 _zce 15 0.11 ecc8 25 0.05 _ccco 19 0.04 8zcc8 1 0.01 qza 3 0.04 8zco 9 0.07 _cce 11 0.02 8ccc8 18 0.04 ezcca 1 0.01 oza 2 0.03 ezco 7 0.05 occ8 7 0.01 rccca 8 0.02 azcc8 1 0.01 oz_ 2 0.03 ezca 6 0.04 8cca 6 0.01 rccc8 6 0.01 rzcc8 1 0.01 eza 1 0.01 rzc8 5 0.04 8cc8 6 0.01 accc8 6 0.01 azcca 1 0.01 aza 1 0.01 ezce 3 0.02 qcc8 5 0.01 eccco 4 0.01 
ezcco 1 0.01 _ze 1 0.01 ezc_ 3 0.02 ecco 4 0.01 occc8 3 0.01 rzcca ---- ---- ---- 1 0.01 _zcr 3 0.02 8cco 3 0.01 8ccco 2 0.00 ozcca 151 1.00 TOT 1 0.01 _zc_ 2 0.01 rcca 2 0.00 occca 2 0.00 8zcca 1 0.01 8zc8 2 0.01 occa 1 0.00 qccc8 1 0.00 ozcc8 ---- ---- ---- 2 0.01 jcca 1 0.00 accco 1 0.00 8zcco 70 1.00 TOT 2 0.01 ecce 1 0.00 accca ---- ---- ---- 1 0.01 rccr 1 0.00 8ccca 439 1.00 TOT 1 0.01 qcca ---- ---- ---- 1 0.01 jcc8 484 1.00 TOT 1 0.01 ecc_ 1 0.01 acc8 1 0.01 _ccr ---- ---- ---- 138 1.00 TOT 604 0.72 oHa 305 0.63 oHc8 254 0.51 oHcc8 61 0.07 eHa 66 0.14 oHca 120 0.24 oHcca 51 0.06 oHo 31 0.06 eHc8 31 0.06 eHcca 49 0.06 _Ho 27 0.06 _Hc8 29 0.06 eHcc8 35 0.04 _Ha 14 0.03 oHco 20 0.04 _Hcc8 18 0.02 aHa 14 0.03 aHc8 16 0.03 aHcc8 6 0.01 oH_ 7 0.01 eHca 10 0.02 aHcca 5 0.01 eHo 4 0.01 aHca 7 0.01 _Hcca 4 0.00 oHe 3 0.01 oHcn 6 0.01 oHcco 3 0.00 oH8 3 0.01 _Hco 6 0.01 _Hcco 2 0.00 eHe 2 0.00 eHco 1 0.00 eHcco 2 0.00 _H8 2 0.00 _Hca 1 0.00 8Hcca 1 0.00 qHo 2 0.00 8Hc8 ---- ---- ---- 1 0.00 _H_ 1 0.00 oHcr 501 1.00 TOT ---- ---- ---- ---- ---- ---- 842 1.00 TOT 481 1.00 TOT Ditto, distinct words only: cat bio-j-jsa-gut.dic \ | sed \ -e 's/^/_/g' \ -e 's/$/_/g' \ -e 's/[ql][jg]/H/' \ -e 's/cs/z/g' \ -e 's/ij/k/g' \ -e 's/ix/e/g' \ -e 's/is/r/g' \ -e 's/iiu/n/g' \ -e 's/y/i/g' \ -e 's/ci/a/g' \ -e 's/cg/8/g' \ | compare-contexts -lctx 0 -rctx 0 -colw 17 \ '[^czH]z[^czH]' \ '[^czH]zc[^czH]' \ '[^czH]cc[^czH]' \ '[^czH]ccc[^czH]' \ '[^czH]zcc[^czH]' \ '[^czH]H[^czH]' \ '[^czH]Hc[^czH]' \ '[^czH]Hcc[^czH]' 21 0.44 _zo 8 0.22 ezc8 10 0.14 ecc8 15 0.21 eccc8 8 0.15 ezcc8 11 0.23 _za 7 0.19 _zco 8 0.11 _cca 11 0.15 eccca 8 0.15 _zcc8 5 0.10 az_ 4 0.11 _zca 7 0.10 ecca 10 0.14 _ccc8 6 0.12 ezcca 4 0.08 ez_ 3 0.08 8zco 7 0.10 _cco 5 0.07 _ccca 6 0.12 _zcco 1 0.02 qza 2 0.06 ezco 6 0.08 _cce 4 0.06 rccca 5 0.10 _zcca 1 0.02 oza 2 0.06 ezca 5 0.07 _cc8 4 0.06 eccco 3 0.06 rzcca 1 0.02 oz_ 2 0.06 _zce 5 0.07 8cca 4 0.06 _ccco 3 0.06 ezcco 1 0.02 eza 2 0.06 _zc8 2 0.03 rcca 3 0.04 
occc8 3 0.06 azcca 1 0.02 aza 1 0.03 rzc8 2 0.03 qcc8 3 0.04 accc8 3 0.06 8zcc8 1 0.02 _ze 1 0.03 ezce 2 0.03 occa 3 0.04 8ccc8 2 0.04 rzcc8 1 0.02 _z_ 1 0.03 ezc_ 2 0.03 occ8 2 0.03 rccc8 1 0.02 ozcca ----- ---- ---- 1 0.03 _zcr 2 0.03 jcca 2 0.03 occca 1 0.02 ozcc8 48 1.00 TOT 1 0.03 _zc_ 2 0.03 ecce 1 0.01 qccc8 1 0.02 azcc8 1 0.03 8zc8 2 0.03 8cco 1 0.01 accco 1 0.02 8zcco ----- ---- ---- 2 0.03 8cc8 1 0.01 accca 1 0.02 8zcca 36 1.00 TOT 1 0.01 rccr 1 0.01 8ccco ----- ---- ---- 1 0.01 qcca 1 0.01 8ccca 52 1.00 TOT 1 0.01 jcc8 ----- ---- ---- 1 0.01 ecco 71 1.00 TOT 1 0.01 ecc_ 1 0.01 acc8 1 0.01 _ccr ----- ---- ---- 71 1.00 TOT 72 0.35 oHa 23 0.34 oHc8 13 0.19 oHcc8 34 0.17 _Ho 8 0.12 oHco 9 0.13 oHcca 23 0.11 eHa 8 0.12 eHc8 9 0.13 eHcc8 19 0.09 oHo 6 0.09 oHca 7 0.10 eHcca 19 0.09 _Ha 4 0.06 eHca 6 0.09 oHcco 16 0.08 aHa 4 0.06 aHca 6 0.09 _Hcc8 5 0.02 oH_ 4 0.06 aHc8 5 0.07 aHcc8 5 0.02 eHo 4 0.06 _Hc8 4 0.06 _Hcco 3 0.01 oHe 2 0.03 eHco 3 0.04 aHcca 3 0.01 oH8 1 0.01 oHcr 3 0.04 _Hcca 2 0.01 eHe 1 0.01 oHcn 1 0.01 eHcco 2 0.01 _H8 1 0.01 _Hco 1 0.01 8Hcca 1 0.00 qHo 1 0.01 _Hca ----- ---- ---- 1 0.00 _H_ 1 0.01 8Hc8 67 1.00 TOT ----- ---- ---- ----- ---- ---- 205 1.00 TOT 68 1.00 TOT Obviously the subset of these letters that begins with `H' is different from the others in that the `H' is usually preceded by \o/, whereas the rest is usually word-initial (in .wds) or preceded by \ix/ (in .dic). 
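The contrast in left neighbours can be checked by summing the context counts by their first character. The sketch below does this for two columns of the .wds table, `H' and `z', with the counts copied from that table; the names H_CONTEXTS, Z_CONTEXTS and left_neighbours are illustrative only.

```python
from collections import Counter

# Context counts copied from the 'H' and 'z' columns of the .wds table.
H_CONTEXTS = {"oHa": 604, "eHa": 61, "oHo": 51, "_Ho": 49, "_Ha": 35,
              "aHa": 18, "oH_": 6, "eHo": 5, "oHe": 4, "oH8": 3,
              "eHe": 2, "_H8": 2, "qHo": 1, "_H_": 1}
Z_CONTEXTS = {"_za": 66, "_zo": 63, "_z_": 6, "ez_": 5, "az_": 5,
              "qza": 1, "oza": 1, "oz_": 1, "eza": 1, "aza": 1, "_ze": 1}

def left_neighbours(contexts):
    """Sum the counts by the first character of each context string."""
    tally = Counter()
    for ctx, n in contexts.items():
        tally[ctx[0]] += n
    return tally

for name, table in (("H", H_CONTEXTS), ("z", Z_CONTEXTS)):
    tally = left_neighbours(table)
    total = sum(tally.values())
    best, n = tally.most_common(1)[0]
    print(f"{name}: most common left neighbour {best!r}, {n} of {total}")
```

For `H' the dominant left neighbour is `o' (668 of 842), while for `z' it is the word boundary `_' (136 of 151).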
Let's see what we have left out: cat bio-j-jsa-gut.wds \ | sed \ -e 's/^/_/g' \ -e 's/$/_/g' \ -e 's/[ql][jg]/H/' \ -e 's/cs/z/g' \ -e 's/ij/k/g' \ -e 's/ix/e/g' \ -e 's/is/r/g' \ -e 's/iiu/n/g' \ -e 's/y/i/g' \ -e 's/ci/a/g' \ -e 's/cg/8/g' \ | enum-contexts -vPAT='[czH][czH]*' -vCTX=0 \ | egrep -v -e '^(z|zc|cc|ccc|zcc|H|Hc|Hcc)$' \ | wfreq 123 0.15 Hccc 87 0.11 zccc 70 0.09 cccc 65 0.08 cccHc 41 0.05 zccHc 41 0.05 ccccHc 40 0.05 Hzcc 39 0.05 zcccHc 25 0.03 zccH 24 0.03 cHcc 22 0.03 cHc 21 0.03 cccH 17 0.02 zccHcc 16 0.02 cccHcc 13 0.02 zcccH 13 0.02 Hzc 12 0.02 ccccH 12 0.02 ccH 12 0.02 c 11 0.01 cccz 8 0.01 zcccHcc 8 0.01 Hcccc 7 0.01 ccccHcc 6 0.01 zzcc 6 0.01 cH 5 0.01 zcH 5 0.01 ccccz 4 0.01 zcccz 3 0.00 ccHc 3 0.00 cHccc 3 0.00 Hccz 2 0.00 zccHccc 2 0.00 zHcc 2 0.00 ccz 2 0.00 cccHccc 2 0.00 ccHzccc 2 0.00 ccHcc 2 0.00 Hczcc 2 0.00 Hczc 2 0.00 Hcz 1 0.00 zzcccHc 1 0.00 zzccH 1 0.00 zzcHcc 1 0.00 zcz 1 0.00 zccz 1 0.00 zccccHcc 1 0.00 zcccc 1 0.00 zcHcc 1 0.00 zcHc 1 0.00 zH 1 0.00 czc 1 0.00 ccccc 1 0.00 ccHccc 1 0.00 cHccz ----- ---- ---- 794 1.00 TOT Again, with some context: cat bio-j-jsa-gut.wds \ | sed \ -e 's/^/_/g' \ -e 's/$/_/g' \ -e 's/[ql][jg]/H/' \ -e 's/cs/z/g' \ -e 's/ij/k/g' \ -e 's/ix/e/g' \ -e 's/is/r/g' \ -e 's/iiu/n/g' \ -e 's/y/i/g' \ -e 's/ci/a/g' \ -e 's/cg/8/g' \ | enum-contexts -vPAT='[^czH][czH][czH]*[^czH]' -vCTX=0 \ | egrep -v -e '[^czH](z|zc|cc|ccc|zcc|H|Hc|Hcc)[^czH]' \ | wfreq 51 0.06 _cccHca 41 0.05 oHccc8 38 0.05 _zccc8 37 0.05 _zccHca 35 0.04 _zcccHca 35 0.04 _ccccHca 34 0.04 _Hccc8 32 0.04 _cccca 23 0.03 _zccca 21 0.03 oHccca 20 0.03 _cccc8 19 0.02 _zccHa 16 0.02 oHzcc8 16 0.02 _cccHa 12 0.02 _zcccHa 12 0.02 _ccccHa 11 0.01 _zccHcca 10 0.01 _ccHa 9 0.01 _cccHc8 8 0.01 eHccc8 8 0.01 _cccz_ 8 0.01 _cccHcc8 8 0.01 _Hccca 7 0.01 oHzcca 7 0.01 ezccc8 7 0.01 _cHcca 6 0.01 oHzc8 6 0.01 _zccHcc8 6 0.01 _cccHcca 6 0.01 _Hccco 5 0.01 ezccca 5 0.01 ecccc8 5 0.01 _zcccHcc8 5 0.01 _Hzcca 4 0.01 ocHcca 4 0.01 ocHca 4 0.01 ecccca 4 0.01 
eccccHca 4 0.01 _zzcc8 4 0.01 _cHcc8 4 0.01 _cHca 4 0.01 _cHc8 4 0.01 _Hzcc8 3 0.00 rzccc8 3 0.00 oHcccca 3 0.00 jca 3 0.00 ecccHca 3 0.00 eHccca 3 0.00 azccca 3 0.00 _zccco 3 0.00 _zccH_ 3 0.00 _zcHa 3 0.00 _ccccz_ 3 0.00 _ccHc8 3 0.00 _cHco 3 0.00 _cHa 3 0.00 _Hzcco 2 0.00 rcccHa 2 0.00 qcHcc8 2 0.00 qcHc8 2 0.00 qcHa 2 0.00 ocHcc8 2 0.00 oHcccc8 2 0.00 ezcccz_ 2 0.00 ezcccHca 2 0.00 ezccHca 2 0.00 ezccHa 2 0.00 acHca 2 0.00 _zcccHcca 2 0.00 _zcccHc8 2 0.00 _zccHc8 2 0.00 _zHcca 2 0.00 _cccza 2 0.00 _ccccHcca 2 0.00 _ccccHcc8 2 0.00 _cccHccc8 2 0.00 _cccH_ 2 0.00 _Hzc8 2 0.00 8zccca 2 0.00 8zccc8 2 0.00 8c8 1 0.00 rcn 1 0.00 rccz_ 1 0.00 rcccz_ 1 0.00 rcccca 1 0.00 rcccc8 1 0.00 qca 1 0.00 qcHccc8 1 0.00 qcHcca 1 0.00 qcHca 1 0.00 ozccHo 1 0.00 occHcc8 1 0.00 ocHco 1 0.00 ocHccz_ 1 0.00 ocHcco 1 0.00 oHzca 1 0.00 oHczcca 1 0.00 oHczca 1 0.00 oHczc8 1 0.00 oHcz_ 1 0.00 oHcz8 1 0.00 oHccza 1 0.00 oHccz_ 1 0.00 oHccz8 1 0.00 oHccccl 1 0.00 jczco 1 0.00 jc8 1 0.00 ezcz8 1 0.00 ezcccHcc8 1 0.00 ezcccHa 1 0.00 ezcHa 1 0.00 eccz_ 1 0.00 eccccz_ 1 0.00 ecccco 1 0.00 eccccHcca 1 0.00 eccccHcc8 1 0.00 ecccHcc8 1 0.00 eccHa 1 0.00 eHzcca 1 0.00 eHzcc8 1 0.00 eHczcca 1 0.00 azcccca 1 0.00 azccc8 1 0.00 accccz_ 1 0.00 acccca 1 0.00 accccHcca 1 0.00 accccHca 1 0.00 acccc8 1 0.00 acHcca 1 0.00 acHa 1 0.00 aHzcco 1 0.00 aHzcc8 1 0.00 aHzca 1 0.00 aHcccca 1 0.00 aHccca 1 0.00 aHccc8 1 0.00 _zzcco 1 0.00 _zzcccHca 1 0.00 _zzcca 1 0.00 _zzccHa 1 0.00 _zzcHcc8 1 0.00 _zccz_ 1 0.00 _zcccz_ 1 0.00 _zccccHcca 1 0.00 _zccHccca 1 0.00 _zccHccc8 1 0.00 _zcHo 1 0.00 _zcHco 1 0.00 _zcHcca 1 0.00 _zHa 1 0.00 _cccco 1 0.00 _ccccco 1 0.00 _ccccHco 1 0.00 _cccHo 1 0.00 _cccHco 1 0.00 _cccHc_ 1 0.00 _ccHzccc8 1 0.00 _ccHo 1 0.00 _ccHccc8 1 0.00 _ccHcca 1 0.00 _cHccca 1 0.00 _cHccc8 1 0.00 _Hzco 1 0.00 _Hzca 1 0.00 _Hcccca 1 0.00 8zcccz_ 1 0.00 8cccco 1 0.00 8cccca 1 0.00 8cccc8 1 0.00 8cccHcc8 1 0.00 8Hzcca ----- ---- ---- 785 1.00 TOT These may be groups of the letters above. 
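Whether a leftover string is a group of the letters above can be tested mechanically by listing every way of cutting it into those letters. Here is a small Python sketch; the inventory LETTERS is the set of eight strings excluded by the egrep filter, while the function name segmentations and the exhaustive recursive search are choices made for illustration.

```python
# The eight [czH] strings tentatively accepted as letters.
LETTERS = ("z", "zc", "cc", "ccc", "zcc", "H", "Hc", "Hcc")

def segmentations(s, letters=LETTERS):
    """Return every way to write s as a concatenation of the given letters."""
    if not s:
        return [[]]  # exactly one way to split the empty string
    found = []
    for letter in letters:
        if s.startswith(letter):
            for rest in segmentations(s[len(letter):], letters):
                found.append([letter] + rest)
    return found

for s in ("Hccc", "zccc", "cccc", "Hzcc", "cHc", "c"):
    ways = ["+".join(parts) for parts in segmentations(s)]
    print(f"{s:6s} {' | '.join(ways) if ways else 'no split'}")
```

A string with no split (such as `cHc' or a lone `c') has to be either a letter in its own right or an error; a string with several splits stays ambiguous on this evidence alone.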
So here is the situation for maximal `czH' strings (after collapsing
\ci/, \cy/ to `a', and \cg/ to `8', and all gallows to `H'):

    string  freq  plausible interpretations
    ------  ----  -------------------------
    c          8  invalid.
    z        151  a letter.
    H        842  a letter.

    cc       138  a letter.
    zc        70  a letter.
    Hc       481  a letter.
    cz         -  invalid.
    zz         -  invalid.
    Hz         -  invalid.
    cH         6  invalid.
    zH         1  invalid, or z+H.
    HH         -  invalid.

    ccc       71  a letter.
    zcc       52  a letter, or z+cc.
    Hcc       67  a letter, or H+cc.
    czc        1  invalid.
    zzc        -  invalid.
    Hzc       13  a letter, or H+zc.
    cHc       22  a letter (gallows with platform?).
    zHc        -  invalid.
    HHc        -  invalid.
    ccz        2  invalid, or cc+z.
    zcz        1  invalid, or zc+z.
    Hcz        2  invalid, or Hc+z.
    czz        -  invalid.
    zzz        -  invalid.
    Hzz        -  invalid.
    cHz        -  invalid.
    zHz        -  invalid.
    HHz        -  invalid.
    ccH       12  a letter, or cc+H.
    zcH        5  a letter, or zc+H.
    HcH        -  invalid.
    czH        -  invalid.
    zzH        -  invalid.
    HzH        -  invalid.
    cHH        -  invalid.
    zHH        -  invalid.
    HHH        -  invalid.

    cccc      70  a letter, or cc+cc.
    zccc      87  a letter, or zc+cc, or z+ccc.
    Hccc     123  a letter, or Hc+cc, or H+ccc.
    czcc       -  invalid.
    zzcc       6  invalid, or z+zcc, or z+z+cc.
    Hzcc      40  a letter, or H+zcc, or H+z+cc.
    cHcc      24  a letter.
    zHcc       2  invalid, or z+Hcc, or z+H+cc.
    HHcc       -  invalid.
    cczc       -  invalid.
    zczc       -  invalid.
    Hczc       2  invalid, or Hc+zc.
    czzc       -  invalid.
    zzzc       -  invalid.
    Hzzc       -  invalid.
    cHzc       -  invalid.
    zHzc       -  invalid.
    HHzc       -  invalid.
    ccHc       3  invalid, or cc+Hc.
    zcHc       1  invalid, or zc+Hc.
    HcHc       -  invalid.
    czHc       -  invalid.
    zzHc       -  invalid.
    HzHc       -  invalid.
    cHHc       -  invalid.
    zHHc       -  invalid.
    HHHc       -  invalid.
    cccz      11  letter, or ccc+z.
    zccz       1  invalid, or zcc+z, or z+cc+z.
    Hccz       3  invalid, or Hcc+z, or H+cc+z.
    czcz       -  invalid.
    zzcz       -  invalid.
    Hzcz       -  invalid.
    cHcz       -  invalid.
    zHcz       -  invalid.
    HHcz       -  invalid.
    cczz       -  invalid.
    zczz       -  invalid.
    Hczz       -  invalid.
    czzz       -  invalid.
    zzzz       -  invalid.
    Hzzz       -  invalid.
    cHzz       -  invalid.
    zHzz       -  invalid.
    HHzz       -  invalid.
    ccHz       -  invalid.
    zcHz       -  invalid.
    HcHz       -  invalid.
    czHz       -  invalid.
    zzHz       -  invalid.
    HzHz       -  invalid.
    cHHz       -  invalid.
    zHHz       -  invalid.
    HHHz       -  invalid.
    cccH      21  letter, or ccc+H.
    zccH      25  letter, or zcc+H, or z+cc+H.
    HccH       -  invalid.
    czcH       -  invalid.
    zzcH       -  invalid.
    HzcH       -  invalid.
    cHcH       -  invalid.
    zHcH       -  invalid.
    HHcH       -  invalid.
    cczH       -  invalid.
    zczH       -  invalid.
    HczH       -  invalid.
    czzH       -  invalid.
    zzzH       -  invalid.
    HzzH       -  invalid.
    cHzH       -  invalid.
    zHzH       -  invalid.
    HHzH       -  invalid.
    ccHH       -  invalid.
    zcHH       -  invalid.
    HcHH       -  invalid.
    czHH       -  invalid.
    zzHH       -  invalid.
    HzHH       -  invalid.
    cHHH       -  invalid.
    zHHH       -  invalid.
    HHHH       -  invalid.

97-07-28 stolfi
===============

Let's look more closely at the tall letters:

    cat bio-j-jsa-gut.wds \
      | sed \
          -e 's/^/_/g' \
          -e 's/$/_/g' \
          -e 's/cs/z/g' \
          -e 's/ij/k/g' \
          -e 's/ix/e/g' \
          -e 's/is/r/g' \
          -e 's/iiu/n/g' \
          -e 's/y/i/g' \
          -e 's/ci/a/g' \
          -e 's/cg/8/g' \
      | compare-contexts -lctx 0 -rctx 0 -colw 24 \
          '[czlqjg]*lj[czlqjg]*' \
          '[czlqjg]*qj[czlqjg]*' \
          '[czlqjg]*lg[czlqjg]*' \
          '[czlqjg]*qg[czlqjg]*'

    571 0.36 lj             230 0.33 qj              10 0.62 lgccc           53 0.34 qgccc
    376 0.24 ljcc           161 0.23 qjc              2 0.12 lg              50 0.32 qg
    322 0.21 ljc            118 0.17 qjcc             2 0.12 clgccc           8 0.05 qgzcc
     38 0.02 cccljc          28 0.04 qjccc            1 0.06 lgcc             8 0.05 qgcc
     32 0.02 ljccc           27 0.04 cccqjc           1 0.06 cclg             4 0.03 qgzc
     27 0.02 zccljc          18 0.03 ccccqjc      ----- ---- ----             4 0.03 cqgcc
     26 0.02 zcccljc         14 0.02 zccqjc          16 1.00 TOT              4 0.03 cqgc
     23 0.01 ccccljc         14 0.02 qjzcc                                    4 0.03 cccqgcc
     18 0.01 ljzcc           11 0.02 zcccqjc                                  3 0.02 cccqg
     17 0.01 zcclj            9 0.01 cqjcc                                    2 0.01 zccqgcc
     14 0.01 ccclj            8 0.01 zcccqj                                   2 0.01 zcccqgc
     10 0.01 zccljcc          7 0.01 zccqj                                    2 0.01 qgcccc
     10 0.01 cccljcc          7 0.01 cqjc                                     2 0.01 cqg
      9 0.01 cljcc            5 0.01 zccqjcc                                  2 0.01 ccqgzccc
      9 0.01 cljc             5 0.01 qjzc                                     1 0.01 zqgcc
      9 0.01 cccclj           5 0.01 ccqj                                     1 0.01 zccqgccc
      6 0.00 zcccljcc         4 0.01 cccqj                                    1 0.01 zccqg
      6 0.00 cclj             3 0.00 zcqj                                     1 0.01 ccqgccc
      5 0.00 zccclj           3 0.00 qcqjc                                    1 0.01 cccqgccc
      4 0.00 ljzc             3 0.00 ccccqj                                   1 0.01 ccccqgcc
      4 0.00 ljcccc           2 0.00 zcccqjcc                             ----- ---- ----
      4 0.00 ccccljcc         2 0.00 qjccz                                  154 1.00 TOT
      2 0.00 zclj             2 0.00 qcqj
      2 0.00 qcljcc           2 0.00 cccqjcc
      2 0.00 ljczcc           2 0.00 ccccqjcc
      2 0.00 ljczc            1 0.00 zcqjcc
      2 0.00 ccljcc           1 0.00 zccccqjcc
      2 0.00 ccljc            1 0.00 qjczc
      1 0.00 zzcljcc          1 0.00 qjcz
      1 0.00 zzcclj           1 0.00 qjcccc
      1 0.00 zzcccljc         1 0.00 qcqjccc
      1 0.00 zljcc            1 0.00 qcqjcc
      1 0.00 zlj              1 0.00 cqjccz
      1 0.00 zcljc            1 0.00 cqj
      1 0.00 zccljccc         1 0.00 ccqjc
      1 0.00 qlj          ----- ---- ----
      1 0.00 ljcz           700 1.00 TOT
      1 0.00 ljccz
      1 0.00 ljccccljc
      1 0.00 clj
      1 0.00 cccljccc
  ----- ---- ----
   1565 1.00 TOT

These statistics confirm the identification of the \l/ and \q/ in gallows.

The questions to decide now are whether \Hcc/ and \zcc/ are letters
or composites \H/+\cc/ and \z/+\cc/.

    cat bio-j-jsa-gut.wds \
      | sed \
          -e 's/^/_/g' \
          -e 's/$/_/g' \
          -e 's/[ql][jg]/H/' \
          -e 's/cs/z/g' \
          -e 's/ij/k/g' \
          -e 's/ix/e/g' \
          -e 's/is/r/g' \
          -e 's/iiu/n/g' \
          -e 's/y/i/g' \
          -e 's/ci/a/g' \
          -e 's/cg/8/g' \
      > .bar

    foreach f ( z H )
      cat .bar \
        | enum-contexts -vPAT='[^czH]'"${f}"'[^czH]' \
        | sed -e 's/.$//g' \
        | wfreq \
        > .${f}.L
      cat .bar \
        | enum-contexts -vPAT='[^czH]'"${f}cc"'[^czH]' \
        | sed -e 's/.$//g' \
        | wfreq \
        > .${f}cc.L
      cat .bar \
        | enum-contexts -vPAT='[^czH]'"cc"'[^czH]' \
        | sed -e 's/^.//g' \
        | wfreq \
        > .cc.R
      cat .bar \
        | enum-contexts -vPAT='[^czH]'"${f}cc"'[^czH]' \
        | sed -e 's/^.//g' \
        | wfreq \
        > .${f}cc.R
      pr -m -s' ' -t -i' '1 -w 96 .${f}cc.L .${f}.L .${f}cc.R .cc.R \
        | expand
    end

    326 0.74 _zcc           136 0.90 _z             296 0.67 zcc8            54 0.39 cc8
     65 0.15 ezcc             6 0.04 ez             108 0.25 zcca            47 0.34 cca
     22 0.05 8zcc             6 0.04 az              35 0.08 zcco            23 0.17 cco
     14 0.03 azcc             2 0.01 oz           ----- ---- ----            11 0.08 cce
      9 0.02 rzcc             1 0.01 qz             439 1.00 TOT              2 0.01 ccr
      3 0.01 ozcc         ----- ---- ----                                     1 0.01 cc_
  ----- ---- ----           151 1.00 TOT                                  ----- ---- ----
    439 1.00 TOT                                                            138 1.00 TOT

    380 0.76 oHcc           668 0.79 oH             319 0.64 Hcc8            54 0.39 cc8
     61 0.12 eHcc            87 0.10 _H             169 0.34 Hcca            47 0.34 cca
     33 0.07 _Hcc            68 0.08 eH              13 0.03 Hcco            23 0.17 cco
     26 0.05 aHcc            18 0.02 aH           ----- ---- ----            11 0.08 cce
      1 0.00 8Hcc             1 0.00 qH             501 1.00 TOT              2 0.01 ccr
  ----- ---- ----         ----- ---- ----                                     1 0.01 cc_
    501 1.00 TOT            842 1.00 TOT                                  ----- ---- ----
                                                                            138 1.00 TOT

From these numbers, it
seems plausible that `Hcc' and `zcc' are composites.

Collecting again all [czH] patterns, splitting H and P:

    cat bio-j-jsa-gut.wds \
      | sed \
          -e 's/^/_/g' \
          -e 's/$/_/g' \
          -e 's/[ql]j/H/' \
          -e 's/[ql]g/P/' \
          -e 's/cs/z/g' \
          -e 's/ij/k/g' \
          -e 's/ix/e/g' \
          -e 's/is/r/g' \
          -e 's/iiu/n/g' \
          -e 's/y/i/g' \
          -e 's/ci/a/g' \
          -e 's/cg/8/g' \
      | enum-contexts -vPAT='[czHP][czHP]*' -vCTX=0 \
      | wfreq

      795 0.20 H
      493 0.13 Hcc
      484 0.12 ccc
      482 0.12 Hc
      439 0.11 zcc
      152 0.04 z
      138 0.04 cc
       87 0.02 zccc
       70 0.02 zc
       70 0.02 cccc
       65 0.02 cccHc
       63 0.02 Pccc
       60 0.02 Hccc
       52 0.01 P
       41 0.01 zccHc
       41 0.01 ccccHc
       37 0.01 zcccHc
       32 0.01 Hzcc
       24 0.01 zccH
       20 0.01 cHcc
       19 0.00 cHc
       18 0.00 cccH
       15 0.00 zccHcc
       13 0.00 zcccH
       12 0.00 ccccH
       12 0.00 cccHcc
       11 0.00 cccz
       11 0.00 ccH
        9 0.00 c
        9 0.00 Pcc
        9 0.00 Hzc
        8 0.00 zcccHcc
        8 0.00 Pzcc
        6 0.00 zzcc
        6 0.00 ccccHcc
        6 0.00 Hcccc
        5 0.00 zcH
        5 0.00 ccccz
        4 0.00 zcccz
        4 0.00 cccPcc
        4 0.00 cPcc
        4 0.00 cPc
        4 0.00 cH
        4 0.00 Pzc
        3 0.00 cccP
        3 0.00 ccHc
        3 0.00 Hczc
        3 0.00 Hccz
        2 0.00 zcccPc
        2 0.00 zccPcc
        2 0.00 ccz
        2 0.00 ccPzccc
        2 0.00 ccHcc
        2 0.00 cPccc
        2 0.00 cP
        2 0.00 Pcccc
        2 0.00 Hczcc
        2 0.00 Hcz
        1 0.00 zzcccHc
        1 0.00 zzccH
        1 0.00 zzcHcc
        1 0.00 zcz
        1 0.00 zccz
        1 0.00 zccccHcc
        1 0.00 zcccc
        1 0.00 zccPccc
        1 0.00 zccP
        1 0.00 zccHccc
        1 0.00 zcHcc
        1 0.00 zcHc
        1 0.00 zPcc
        1 0.00 zHcc
        1 0.00 zH
        1 0.00 ccccc
        1 0.00 ccccPcc
        1 0.00 cccPccc
        1 0.00 cccHccc
        1 0.00 ccPccc
        1 0.00 ccP
        1 0.00 cHccz
        1 0.00 cHccc
    ----- ---- ----
     3906 1.00 TOT

Based on the analysis above, it seems that, after collapsing \ci/,
\cy/, \cg/, \cs/, and the tall characters (\[lq]j/ to `H' and
\[lq]g/ to `P'), we can parse most strings consisting of `c', `z',
`H', and `P' into the following "letters" (the frequencies are for
isolated occurrences only):

    freq  letter    code
    ----  --------  ----
     795  H         E
      52  P         P
     152  z         Z
     138  cc        M
      70  zc        R
     482  Hc        I
     484  ccc       O
     439  zcc       A
     493  Hcc       U
      19  cHc       X
       4  cPc       Y
      20  cHcc      K
       4  cPcc      G

Here is my best guess for the parsing of those strings:

    freq       string      best parsing        alt parsings
    ---- ----  ----------  ------------------  --------------------
     795 0.20  H           (H)
     493 0.13  Hcc         (Hcc)
     484 0.12  ccc         (ccc)
     482 0.12  Hc          (Hc)
     439 0.11  zcc         (zcc)
     152 0.04  z           (z)
     138 0.04  cc          (cc)
      87 0.02  zccc        (z)(ccc)            (zc)(cc)
      70 0.02  zc          (zc)
      70 0.02  cccc        (cc)(cc)
      65 0.02  cccHc       (ccc)(Hc)           (cc)(cHc)
      63 0.02  Pccc        (P)(ccc)
      60 0.02  Hccc        (H)(ccc)            (Hc)(cc)
      52 0.01  P
      41 0.01  zccHc       (zcc)(Hc)           (zc)(cHc)
      41 0.01  ccccHc      (ccc)(cHc)          (cc)(cc)(Hc)
      37 0.01  zcccHc      (zcc)(cHc)          (zc)(cc)(Hc)
      32 0.01  Hzcc        (H)(zcc)            (H)(z)(cc)
      24 0.01  zccH        (zcc)(H)
      20 0.01  cHcc        (cHcc)
      19 0.00  cHc         (cHc)
      18 0.00  cccH        (ccc)(H)
      15 0.00  zccHcc      (zcc)(Hcc)
      13 0.00  zcccH       (z)(ccc)(H)         (zc)(cc)(H)
      12 0.00  ccccH       (cc)(cc)(H)
      12 0.00  cccHcc      (ccc)(Hcc)          (cc)(cHcc)
      11 0.00  cccz        (ccc)(z)
      11 0.00  ccH         (cc)(H)
       9 0.00  c
       9 0.00  Pcc         (P)(cc)
       9 0.00  Hzc         (H)(zc)
       8 0.00  zcccHcc     (zcc)(cHcc)         (zc)(cc)(Hcc), (z)(cc)(cHcc)
       8 0.00  Pzcc        (P)(zcc)
       6 0.00  zzcc        (z)(zcc)
       6 0.00  ccccHcc     (ccc)(cHcc)         (cc)(cc)(Hcc)
       6 0.00  Hcccc       (Hcc)(cc)           (Hc)(ccc)
       5 0.00  zcH         (zc)(H)
       5 0.00  ccccz       (cc)(cc)(z)
       4 0.00  zcccz       (zc)(cc)(z)         (z)(ccc)(z)
       4 0.00  cccPcc      (ccc)(P)(cc)        (cc)(cPcc)
       4 0.00  cPcc        (cPcc)
       4 0.00  cPc         (cPc)
       4 0.00  cH
       4 0.00  Pzc         (P)(zc)
       3 0.00  cccP        (ccc)(P)
       3 0.00  ccHc        (cc)(Hc)
       3 0.00  Hczc        (Hc)(zc)
       3 0.00  Hccz        (H)(cc)(z)
       2 0.00  zcccPc      (zcc)(cPc)
       2 0.00  zccPcc      (zcc)(Pcc)          (z)(cc)(P)(cc)
       2 0.00  ccz         (cc)(z)
       2 0.00  ccPzccc     (cc)(P)(zc)(cc)     (cc)(P)(z)(ccc)
       2 0.00  ccHcc       (cc)(Hcc)
       2 0.00  cPccc       (cPc)(cc)
       2 0.00  cP
       2 0.00  Pcccc       (Pc)(ccc)           (P)(cc)(cc)
       2 0.00  Hczcc       (Hc)(zcc)
       2 0.00  Hcz         (Hc)(z)
       1 0.00  zzcccHc     (z)(zcc)(cHc)       (z)(zc)(cc)(Hc)
       1 0.00  zzccH       (z)(zcc)(H)
       1 0.00  zzcHcc      (z)(z)(cHcc)        (z)(zc)(H)(cc)
       1 0.00  zcz         (zc)(z)
       1 0.00  zccz        (zcc)(z)
       1 0.00  zccccHcc    (zc)(cc)(cHcc)      (z)(cc)(cc)(H)(cc)
       1 0.00  zcccc       (zc)(ccc)           (z)(cc)(cc)
       1 0.00  zccPccc     (zc)(cPc)(cc)       (z)(cc)(P)(ccc), (z)(cc)(Pc)(cc)
       1 0.00  zccP        (zcc)(P)
       1 0.00  zccHccc     (zc)(cHc)(cc)       (zcc)(H)(ccc), (zcc)(Hc)(cc)
       1 0.00  zcHcc       (zc)(Hcc)           (z)(cHcc)
       1 0.00  zcHc        (zc)(Hc)            (z)(cHc)
       1 0.00  zPcc        (z)(Pcc)
       1 0.00  zHcc        (z)(Hcc)
       1 0.00  zH          (z)(H)
       1 0.00  ccccc       (cc)(ccc)           (ccc)(cc)
       1 0.00  ccccPcc     (cc)(cc)(P)(cc)     (ccc)(cPcc)
       1 0.00  cccPccc     (cc)(cPc)(cc)       (ccc)(P)(ccc)
       1 0.00  cccHccc     (cc)(cHc)(cc)       (ccc)(H)(ccc)
       1 0.00  ccPccc      (cc)(P)(ccc)
       1 0.00  ccP         (cc)(P)
       1 0.00  cHccz       (cHcc)(z)
       1 0.00  cHccc       (cHc)(cc)

This still doesn't look quite right.... Let's try it anyway.

    jsa2hip
    -------------------------------------------------
    #! /n/gnu/bin/sed -f
    # Recoding superanalytic to "hip" encoding:
    /^[^#]/s/ij/k/g
    /^[^#]/s/ix/e/g
    /^[^#]/s/is/r/g
    /^[^#]/s/iiu/n/g
    /^[^#]/s/y/i/g
    /^[^#]/s/ci/a/g
    /^[^#]/s/cg/8/g
    /^[^#]/s/cs/z/g
    /^[^#]/s/iin/m/g
    /^[^#]/s/in/m/g
    /^[^#]/s/ir/v/g
    /^[^#]/s/qj/E/g
    /^[^#]/s/qg/P/g
    /^[^#]/s/lj/E/g
    /^[^#]/s/lg/P/g
    # Parsing of [czPE] strings:
    /^[^#]/s/[zcEP][zcEP][zcEP][zcEP][zcEP][zcEP][zcEP][zcEP]*/@/g
    /^[^#]/s/zccEcc/AU/g
    /^[^#]/s/ccccEc/OX/g
    /^[^#]/s/zcccEc/AX/g
    /^[^#]/s/ccccE/MME/g
    /^[^#]/s/zccEc/AI/g
    /^[^#]/s/cccEc/OI/g
    /^[^#]/s/zccc/RM/g
    /^[^#]/s/cccc/MM/g
    /^[^#]/s/Pccc/PO/g
    /^[^#]/s/Eccc/EO/g
    /^[^#]/s/Ezcc/EA/g
    /^[^#]/s/zccE/AE/g
    /^[^#]/s/cEcc/K/g
    /^[^#]/s/cccE/OE/g
    /^[^#]/s/cccz/OZ/g
    /^[^#]/s/Ecc/U/g
    /^[^#]/s/ccc/O/g
    /^[^#]/s/zcc/A/g
    /^[^#]/s/cEc/X/g
    /^[^#]/s/ccE/ME/g
    /^[^#]/s/Ec/I/g
    /^[^#]/s/cc/M/g
    /^[^#]/s/zc/R/g
    /^[^#]/s/E/E/g
    /^[^#]/s/z/Z/g
    /^[^#]/s/P/P/g
    -------------------------------------------------

    extract-words-from-interlin \
      -recode jsa2hip \
      -chars "qoa8HPZAEIOUMXRKermnkvc@" \
      bio-j-jsa.evt \
      bio-j-hip

     lines   words     bytes file
    ------ ------- --------- ------------
      7054    7054     36231 bio-j-hip.wds
      1967    1967     13458 bio-j-hip.dic
      4658    4658     22234 bio-j-hip-gut.wds
       862     862      4575 bio-j-hip-gut.dic
       843     843      2464 bio-j-hip-fun.wds
         5       5        24 bio-j-hip-fun.dic
      1553    1553     11533 bio-j-hip-bad.wds
      1100    1100      8859 bio-j-hip-bad.dic

Digraph counts (edited):

                q     o     a     8     R     M     A     O     P     E     Z     I     U     e     r     m     n     k     v     X     K     c     @   TOT
      ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- -----
          .  1239   965   161   363   129   149   440   436    65    91   149    32    29   282    79     .     .     .     .     7     8    16    18  4658
    q     .     .  1227     .     .     .     .     .     .     .     .     .     .     .     .     .     .     .     .     .     .     .     .     .  1247
      ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- -----
    o    21     .     .     .    18     .     9     .     .    61   714     .   394   380   893   178     9     .     .     .     7    10     .     .  2727
    a  2794     .     .     .    11     .     .    14     9     .    23     .    19    26   411   275   333   109    33    19     .     .     .     .  4104
      ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- -----
    R     .     .    26    22    31     .   107     .     .     .     .     .     .     .     .     .     .     .     .     .     .     .     .     .   199
    M     .     .    32   132   142     .    95     .     .     .    36    11     .     .    11     .     .     .     .     .     .     .     .     .   468
      ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- -----
    A     .     .    40   125   322     .     .     .     .     .    25     .    41    15     .     .     .     .     .     .    37     .     .     .   609
    O     .     .    40   167   404     .     .     .     .     7    18    11    77     .     .     .     .     .     .     .    41     .     .     .   765
      ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- -----
    e   825     .   105   114    53    36    49    71   154    10    76     7    43    61     .     .     .     .     .     .     .     .     .     .  1614
      ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- -----
    P     .     .    37    17     .     .    22     8    66     .     .     .     .     .     .     .     .     .     .     .     .     .     .     .   165
    8    50     .    37  1948     .     9    18    22    16     .     .     .     .     .     .     .     .     .     .     .     .     .     .     .  2113
      ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- -----
    E    10     .    75   795     .     9     .    32    61     .     .     .     .     .     .     .     .     .     .     .     .     .     .     .   996
    r   401     .    36    64     .     .     .     9    16     .     .     .     .     .     .     .     .     .     .     .     .     .     .     .   539
    Z    42     .    63    73     .     .     .     7     .     .     .     .     .     .     .     .     .     .     .     .     .     .     .     .   196
      ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- -----
    I     .     .    20   173   391     .     .     .     .     .     .     .     .     .     .     .     .     .     .     .     .     .    12     .   608
    U     .     .    10   179   321     .     .     .     .     .     .     .     .     .     .     .     .     .     .     .     .     .     .     .   513
      ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- -----
    m   339     .     .     .     .     .     .     .     .     .     .     .     .     .     .     .     .     .     .     .     .     .     .     .   342
    n   114     .     .     .     .     .     .     .     .     .     .     .     .     .     .     .     .     .     .     .     .     .     .     .   115
    k    36     .     .     .     .     .     .     .     .     .     .     .     .     .     .     .     .     .     .     .     .     .     .     .    37
    v    19     .     .     .     .     .     .     .     .     .     .     .     .     .     .     .     .     .     .     .     .     .     .     .    20
      ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- -----
    X     .     .     .    86    10     .     .     .     .     .     .     .     .     .     .     .     .     .     .     .     .     .     .     .   101
    K     .     .     .    14    10     .     .     .     .     .     .     .     .     .     .     .     .     .     .     .     .     .     .     .    26
    c     .     .     .    11    12     .     .     .     .    12     .     .     .     .     .     .     .     .     .     .     .     .     .     .    48
    @     .     .     .    11    13     .     .     .     .     .     .     .     .     .     .     .     .     .     .     .     .     .     .     .    24
      ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- -----
  TOT  4658  1247  2727  4104  2113   199   468   609   765   165   996   196   608   513  1614   539   342   115    37    20   101    26    48    24 22234

There are some nice things in this table: \ccc/ = `O' and \cscc/ = `A'
come out similar, and the same holds for \cc/ = `M' and \csc/ = `R'.

There are some surprises, such as the similarity between
\qg,lg/ = `P' and \cg/ = `8'; or between \lj,qj/ = `E',
\cs/ = `Z', and \is/ = `r'.

The slight differences between members of the same class may be
telling us something, too.

  \cc/ and \csc/ are similar, but
    only \cc/ is followed by \lj/, \qj/, \cs/, or \ix/

  \ccc/ and \cscc/ are similar, but
    only \ccc/ is followed by \lg/, \qg/, or \cs/
    only \cscc/ is followed by \ljcc/ or \qjcc/

  \ljc/ and \ljcc/ are similar, but
    only \ljc/ is followed by unparsed \c/
    only \ljcc/ can be preceded by \cscc/

Also, \cgci/ is probably a letter; indeed \cg/ is followed by \ci/
91% of the time, although \ci/ occurs in other contexts too.

What can we conclude from these bits, really?
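
Statements such as "\cg/ is followed by \ci/ 91% of the time" or "only \cc/ is followed by \lj/" can be rechecked from a word list with a small digraph counter. Here is a minimal Python sketch, assuming one character per letter (as in the hip encoding) and `_` as the word-boundary marker. The function names and the sample words are hypothetical and are not manuscript data; the program that produced the table above is not shown in this notebook.

```python
from collections import Counter

def digraph_counts(words, boundary="_"):
    """Count adjacent symbol pairs, with `boundary` marking word start and end."""
    pairs = Counter()
    for w in words:
        padded = boundary + w + boundary
        # zip pairs each symbol with its right-hand neighbour
        pairs.update(zip(padded, padded[1:]))
    return pairs

def follow_fraction(pairs, first, second):
    """Fraction of the occurrences of `first` that are followed by `second`."""
    total = sum(n for (a, _), n in pairs.items() if a == first)
    return pairs[(first, second)] / total if total else 0.0

if __name__ == "__main__":
    sample = ["8a", "8a", "8o", "qo8a"]  # made-up words in hip encoding
    pairs = digraph_counts(sample)
    print(follow_fraction(pairs, "8", "a"))
```

A row of the table is then the set of counts pairs[(x, y)] for a fixed x, and the row total is the sum that follow_fraction uses as its denominator.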