Hacking at the Voynich manuscript
Notebook - volume 10

Warning: these notebooks aren't strictly chronological logs.
Sometimes I go back and redo things, clarify comments, delete garbage, etc.

97-10-01 stolfi
===============

  Redoing the statistics. Should I correct the 2/R "mistakes"?
  Let's not do it for now. But I will combine H+D, P+F, S+T:

  Tetragram frequencies around line breaks, ignoring spaces:

    cat .voyn.fsg \
      | tr -d ' /=' \
      | sed -e 's/^\(..\).*\(..\)$/\1\2/g' \
      | tr -s '\012' ':' \
      | enum-ngraphs -v n=5 \
      | egrep -v '\*' \
      | egrep '^..:..$' \
      > .voyn-nl-2-2.grm

    cat .voyn-nl-2-2.grm \
      | sort | uniq -c | expand \
      | compute-freqs \
      > .voyn-nl-2-2.frq

  Tetragram frequencies around blanks (spaces and line breaks):

    cat .voyn.fsg \
      | tr -d '/=' \
      | tr -s ' \012' '__' \
      | enum-ngraphs -v n=7 \
      | egrep -v '\*' \
      | egrep '^..._...$' \
      | sed \
          -e 's/^\(...\)_\(...\)$/\1:\2/g' \
          -e 's/_//g' \
          -e 's/^.*\(..\):\(..\).*$/\1:\2/g' \
      > .voyn-sp-2-2.grm

    cat .voyn-sp-2-2.grm \
      | sort | uniq -c | expand \
      | compute-freqs \
      > .voyn-sp-2-2.frq

  Comparisons:

    compare-freqs \
        .voyn-tt-2-2.frq \
        .voyn-nl-2-2.frq \
      | compute-count-ratio \
      | sort +0.0 -0.2r +4 -5nr \
      > .voyn-tt-nl-2-2.cmp

    compare-freqs \
        .voyn-sp-2-2.frq \
        .voyn-nl-2-2.frq \
      | compute-count-ratio \
      | sort +0.0 -0.2r +4 -5nr \
      > .voyn-sp-nl-2-2.cmp

97-10-07 stolfi
===============

  Let's compute the distribution of letters at start, middle, and
  end of the pseudo-"words" (delimited by spaces). I will
  ignore line-start and line-end as they may be special.

    cat .voyn.fsg \
      | tr -d '/=' \
      | sed \
          -e 's/^ *//g' \
          -e 's/ *$//g' \
      | enum-ngraphs -v n=2 \
      > .voyn.dig

    cat .voyn.fsg \
      | tr -d '/=' \
      | sed \
          -e 's/^ *//g' \
          -e 's/ *$//g' \
      | enum-ngraphs -v n=3 \
      > .voyn.trg

    cat .voyn.dig \
      | egrep '^ .$' \
      | sed -e 's/^.\(.\)$/\1/g' \
      | sort | uniq -c | expand \
      > .voyn-ws.frq

    cat .voyn.dig \
      | egrep '^. $' \
      | sed -e 's/^\(.\).$/\1/g' \
      | sort | uniq -c | expand \
      > .voyn-we.frq

    cat .voyn.trg \
      | egrep '^[^ ][^ ][^ ]$' \
      | sed -e 's/^.\(.\).$/\1/' \
      | sort | uniq -c | expand \
      > .voyn-wm.frq

    join \
        -a 1 -a 2 -e 0 -j1 2 -j2 2 -o '0,1.1,2.1' \
        .voyn-wm.frq \
        .voyn-we.frq \
      > .tmp

    join \
        -a 1 -a 2 -e 0 -j1 2 -j2 1 -o '0,1.1,2.2,2.3' \
        .voyn-ws.frq \
        .tmp \
      | gawk ' {printf "%s %5d %5d %5d\n", $1, $2, $3, $4}' \
      > .voyn-wsme.frq

    let   ini   mid   fin
    --- ----- ----- -----
      4  1456    21     5
      O  1317  2558    27
      S   670   383     4
      T   746   689     1
      8   412  2154    61
      D   102  2071    14
      H   101   816     4
      P    28   112     4
      F     6    27     1
      A   126  1826     0
      C    23  4235     4
      I     1    71     0
      Z     0   343     1
      G    79   120  3126
      K     0     2     5
      L     2     3     3
      M     0    10   395
      N     0    16   458
      6     1     1     3
      *    11    15    10
      E   340   909   947
      2   140    28    57
      R   130   176   561

  Note that "E", "2" and "R" are the only letters that occur in
  significant numbers at all three positions. Note also that "2" and
  "R" are easily confused with each other, so the numbers are
  consistent with "2" being exclusively word-initial, "R" being
  exclusively word-final, and there being substantial misreadings in
  both directions (10% of the "R"s misread as "2"s, 40% of the "2"s
  misread as "R"s).

  Here is an attempt to recreate the blanks in the VMs according to
  simple rules. First, prepare a file where every two consecutive
  characters are separated by " " or ":". Then replace all blanks by
  ":", and replace some ":" by " " before "[42]" and after "[GKLMN6R]".

    cat .voyn.fsg \
      | tr -d '/=' \
      | sed -e 's/^ *//g' -e 's/ *$//g' \
      | sed \
          -e 's/\(.\)/\1:/g' \
          -e 's/: :/ /g' \
          -e 's/:$//g' \
      > .voyn-sp-org.fsg

    cat .voyn.fsg \
      | tr -d '/= ' \
      | sed \
          -e 's/\(.\)/\1:/g' \
          -e 's/:$//g' \
          -e 's/:\([42]\)/ \1/g' \
          -e 's/\([GKLMN6R]\):/\1 /g' \
      > .voyn-sp-syn.fsg

    compare-spaces \
        .voyn-sp-syn.fsg \
        .voyn-sp-org.fsg \
      | tr -d ':' \
      > .voyn-sp.cmp

                    :
          ----- -----
        |  4707   676
      : |   984 22311

    cat .voyn-sp.cmp \
      | tr -dc '+\- ' \
      | sed -e 's/\(.\)/\1@/g' \
      | tr '@ ' '\012_' \
      | egrep '.' \
      | sort | uniq -c | expand \
      > .voyn-sp-o-s.frq

       676 +
       984 -
      4707 _

    cat .voyn-sp.cmp \
      | tr ' ' '_' \
      | enum-ngraphs -v n=3 \
      | egrep '.[-_+].' \
      | sort | uniq -c | expand \
      | sort -b +1.0 -1.2 +0 -1nr \
      > .foo

  It seems that many of the errors made by these space-prediction
  rules are due to confusion between "2" and "R" by the transcriber.
  Let's try to "correct" these mistakes by changing, in the original,

    word-initial "R" to "2"
    non-word-initial "2" to "R"

  Let's do these changes:

    cat .voyn.fsg \
      | tr -d '/=' \
      | sed \
          -e 's/^ *//g' \
          -e 's/ *$//g' \
          -e 's/ R/ 2/g' \
          -e 's/\([^ ]\)2/\1R/g' \
      | tr -d ' ' \
      | sed \
          -e 's/\(.\)/\1:/g' \
          -e 's/:$//g' \
          -e 's/:\([42]\)/ \1/g' \
          -e 's/\([GKLMN6R]\):/\1 /g' \
      > .voyn-sp-fix.fsg

    compare-spaces \
        .voyn-sp-fix.fsg \
        .voyn-sp-org.fsg \
      | tr -d ':' \
      > .voyn-sp-fix.cmp

              R     2           :
          ----- ----- ----- -----
      R |   792    87     0     0
      2 |   130   278     0     0
        |     0     0  4759   507
      : |     0     0   932 22480

    cat .voyn-sp-fix.cmp \
      | tr -dc '+\- ' \
      | sed -e 's/\(.\)/\1@/g' \
      | tr '@ ' '\012_' \
      | egrep '.' \
      | sort | uniq -c | expand \
      > .voyn-sp-o-s-fix.frq

       507 +
       932 -
      4759 _

    cat .voyn-sp-fix.cmp \
      | tr ' ' '_' \
      | enum-ngraphs -v n=3 \
      | egrep '.[-_+].' \
      | sort | uniq -c | expand \
      | sort -b +1.0 -1.2 +0 -1nr \
      > .foo

  Let's compute what would be the initial/medial/final statistics
  with these R/2 changes but with the original spaces:

    cat .voyn-sp-fix.cmp \
      | tr -d '+' \
      | tr '\-' ' ' \
      > .voyn-sp-fixr2.fsg

    cat .voyn-sp-fixr2.fsg \
      | tr -d '/=' \
      | sed \
          -e 's/^ *//g' \
          -e 's/ *$//g' \
      | egrep ' .* ' \
      | sed \
          -e 's/^[^ ][^ ]* //g' \
          -e 's/ [^ ][^ ]*$//g' \
      | tr ' ' '\012' \
      | egrep '.' \
      > .voyn-sp-fixr2-nonend.wds

    cat .voyn-sp-fixr2-nonend.wds \
      | sed -e 's/^\(.\).*$/\1/g' \
      | sort | uniq -c | expand \
      > .voyn-sp-fixr2-ws.frq

    cat .voyn-sp-fixr2-nonend.wds \
      | sed -e 's/^.*\(.\)$/\1/g' \
      | sort | uniq -c | expand \
      > .voyn-sp-fixr2-we.frq

    cat .voyn-sp-fixr2-nonend.wds \
      | egrep '...' \
      | sed \
          -e 's/^.\(.*\).$/\1/' \
          -e 's/\(.\)/\1@/g' \
      | tr '@' '\012' \
      | egrep '.'
\ | sort | uniq -c | expand \ > .voyn-sp-fixr2-wm.frq join \ -a 1 -a 2 -e 0 -j1 2 -j2 2 -o '0,1.1,2.1' \ .voyn-sp-fixr2-wm.frq \ .voyn-sp-fixr2-we.frq \ > .tmp join \ -a 1 -a 2 -e 0 -j1 2 -j2 1 -o '0,1.1,2.2,2.3' \ .voyn-sp-fixr2-ws.frq \ .tmp \ | gawk ' {printf "%s %5d %5d %5d\n", $1, $2, $3, $4}' \ > .voyn-sp-fixr2-wsme.frq let ini mid fin --- ----- ----- ----- 2 189 8 2 R 15 79 524 97-10-08 stolfi =============== An intermezzo: for Denis's benefit, let's compute a table of digraph frequencies in Currier notation. cat .voyn.fsg \ | sed \ -e 's/HZ/q/g' \ -e 's/PZ/w/g' \ -e 's/DZ/x/g' \ -e 's/FZ/y/g' \ -e 's/IIIE/1/g' \ -e 's/IIE/h/g' \ -e 's/IE/g/g' \ -e 's/IIIR/0/g' \ -e 's/IIR/t/g' \ -e 's/IR/u/g' \ -e 's/IIIL/3/g' \ -e 's/IIIK/5/g' \ -e 's/IIK/l/g' \ -e 's/IK/k/g' \ | tr 'GTSHPDFLK' '9SZPBFVDJ' \ | tr 'qwxyhgtulk' 'QWXYHGTULK' \ > .voyn.cur cat .voyn.cur \ | tr -d '/= ' \ | tr 'IGHTUDL56' '*********' \ | count-digraph-freqs \ -vshowentropy=1 \ -vchars='PFBVQXWYSZC2RNMJ4AEO89IGH1TU0D3KL567' Digraph counts: TT P F B V Q X W Y S Z C 2 R N M J 4 A E O 8 9 * ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- P 852 . 1 . . . . . . 62 26 341 1 . . 1 . . 259 3 62 3 88 5 . F 1993 . . . . . . . . 72 30 869 . 1 . 2 . . 736 11 95 3 170 4 . B 195 . . . . . . . . 92 25 4 . . . . . . 13 . 51 3 7 . . V 32 . . . . . . . . 24 2 1 . . . . . . 2 . 3 . . . . Q 121 . . . . . . . . . . 31 1 . . . . . 3 . 5 6 74 . 1 X 199 . . . . . . . . 2 1 53 . . . . . . 5 . 3 10 125 . . W 21 . . . . . . . . . . 9 . . . . . . 2 . 2 3 5 . . Y 4 . . . . . . . . . . 2 . . . . . . . . . 2 . . . S 1453 8 17 4 1 31 66 8 2 1 3 1053 6 4 . . . . 27 13 49 96 62 2 . Z 1078 6 6 . . 19 39 . . 3 . 866 1 1 . . . 2 23 5 38 41 28 . . C 4268 38 79 13 3 39 69 4 . 15 9 953 45 8 . . . 2 53 4 175 1898 844 14 3 2 365 3 4 1 . 1 1 . . 18 19 2 2 . . . . 3 150 2 133 4 10 1 11 R 883 2 5 3 . 1 1 1 1 123 145 4 4 1 . . . 
25 147 3 272 22 54 6 63 N 503 1 . . . 3 . 2 . 117 104 3 5 1 . . . 19 9 2 169 30 9 . 29 M 438 . 2 . . . 2 . 1 114 89 1 7 1 . . . 16 4 2 127 25 13 1 33 J 53 . . . . . . . . . . . 1 . . . . . . 2 2 . . . 48 4 1676 1 5 2 2 5 1 . . . 1 10 2 . . . . 1 . . 1646 . . . . A 1952 . . 1 . . . 1 . . . . . 405 495 414 43 . . 552 . . . 41 . E 2344 64 310 15 8 2 1 1 . 501 344 19 41 28 . . 2 96 69 41 377 174 114 1 136 O 3964 571 1434 67 13 9 14 1 . 10 10 19 7 305 7 20 7 13 4 1349 15 41 20 19 9 8 2740 1 8 . 1 . . . . 41 43 15 2 2 . . . 21 417 14 98 4 2059 2 12 9 3781 107 115 17 2 9 3 2 . 233 190 6 101 121 . . 1 1277 18 312 556 266 34 9 402 * 113 . 1 1 . . 1 . . 8 14 4 1 1 1 1 . 3 11 3 28 5 6 8 16 763 50 6 71 2 2 1 1 . 17 23 3 138 4 . . . 198 . 26 58 104 59 . . ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- TOT 29791 852 1993 195 32 121 199 21 4 1453 1078 4268 365 883 503 438 53 1676 1952 2344 3964 2740 3781 113 763 Next-symbol probability (× 99): TT P F B V Q X W Y S Z C 2 R N M J 4 A E O 8 9 * -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- P 99 . . . . . . . . 7 3 40 . . . . . . 30 . 7 . 10 1 . F 99 . . . . . . . . 4 1 43 . . . . . . 37 1 5 . 8 . . B 99 . . . . . . . . 47 13 2 . . . . . . 7 . 26 2 4 . . V 99 . . . . . . . . 74 6 3 . . . . . . 6 . 9 . . . . Q 99 . . . . . . . . . . 25 1 . . . . . 2 . 4 5 61 . 1 X 99 . . . . . . . . 1 . 26 . . . . . . 2 . 1 5 62 . . W 99 . . . . . . . . . . 42 . . . . . . 9 . 9 14 24 . . Y 99 . . . . . . . . . . 50 . . . . . . . . . 50 . . . S 99 1 1 . . 2 4 1 . . . 72 . . . . . . 2 1 3 7 4 . . Z 99 1 1 . . 2 4 . . . . 80 . . . . . . 2 . 3 4 3 . . C 99 1 2 . . 1 2 . . . . 22 1 . . . . . 1 . 4 44 20 . . 2 99 1 1 . . . . . . 5 5 1 1 . . . . 1 41 1 36 1 3 . 3 R 99 . 1 . . . . . . 14 16 . . . . . . 3 16 . 30 2 6 1 7 N 99 . . . . 1 . . . 23 20 1 1 . . . . 4 2 . 33 6 2 . 6 M 99 . . . . . . . . 26 20 . 2 . . . . 4 1 . 29 6 3 . 
7 J 99 . . . . . . . . . . . 2 . . . . . . 4 4 . . . 90 4 99 . . . . . . . . . . 1 . . . . . . . . 97 . . . . A 99 . . . . . . . . . . . . 21 25 21 2 . . 28 . . . 2 . E 99 3 13 1 . . . . . 21 15 1 2 1 . . . 4 3 2 16 7 5 . 6 O 99 14 36 2 . . . . . . . . . 8 . . . . . 34 . 1 . . . 8 99 . . . . . . . . 1 2 1 . . . . . 1 15 1 4 . 74 . . 9 99 3 3 . . . . . . 6 5 . 3 3 . . . 33 . 8 15 7 1 . 11 * 99 . 1 1 . . 1 . . 7 12 4 1 1 1 1 . 3 10 3 25 4 5 7 14 99 6 1 9 . . . . . 2 3 . 18 1 . . . 26 . 3 8 13 8 . . -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- TOT 99 3 7 1 0 0 1 0 0 5 4 14 1 3 2 1 0 6 6 8 13 9 13 0 3 Previous-symbol probability (× 99): TT P F B V Q X W Y S Z C 2 R N M J 4 A E O 8 9 * -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- P 3 . . . . . . . . 4 2 8 . . . . . . 13 . 2 . 2 4 . F 7 . . . . . . . . 5 3 20 . . . . . . 37 . 2 . 4 4 . B 1 . . . . . . . . 6 2 . . . . . . . 1 . 1 . . . . V 0 . . . . . . . . 2 . . . . . . . . . . . . . . . Q 0 . . . . . . . . . . 1 . . . . . . . . . . 2 . . X 1 . . . . . . . . . . 1 . . . . . . . . . . 3 . . W 0 . . . . . . . . . . . . . . . . . . . . . . . . Y 0 . . . . . . . . . . . . . . . . . . . . . . . . S 5 1 1 2 3 25 33 38 50 . . 24 2 . . . . . 1 1 1 3 2 2 . Z 4 1 . . . 16 19 . . . . 20 . . . . . . 1 . 1 1 1 . . C 14 4 4 7 9 32 34 19 . 1 1 22 12 1 . . . . 3 . 4 69 22 12 . 2 1 . . 1 . 1 . . . 1 2 . 1 . . . . . 8 . 3 . . 1 1 R 3 . . 2 . 1 . 5 25 8 13 . 1 . . . . 1 7 . 7 1 1 5 8 N 2 . . . . 2 . 9 . 8 10 . 1 . . . . 1 . . 4 1 . . 4 M 1 . . . . . 1 . 25 8 8 . 2 . . . . 1 . . 3 1 . 1 4 J 0 . . . . . . . . . . . . . . . . . . . . . . . 6 4 6 . . 1 6 4 . . . . . . 1 . . . . . . . 41 . . . . A 6 . . 1 . . . 5 . . . . . 45 97 94 80 . . 23 . . . 36 . E 8 7 15 8 25 2 . 5 . 34 32 . 11 3 . . 4 6 3 2 9 6 3 1 18 O 13 66 71 34 40 7 7 5 . 1 1 . 2 34 1 5 13 1 . 57 . 1 1 17 1 8 9 . . . 3 . . . . 3 4 . 1 . . . . 1 21 1 2 . 54 2 2 9 13 12 6 9 6 7 1 9 . 16 17 . 27 14 . . 
2 75 1 13 14 10 1 8 52 * 0 . . 1 . . . . . 1 1 . . . . . . . 1 . 1 . . 7 2 3 6 . 36 6 2 . 5 . 1 2 . 37 . . . . 12 . 1 1 4 2 . . -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- TOT 99 99 99 99 99 99 99 99 99 99 99 99 99 99 99 99 99 99 99 99 99 99 99 99 99 Symbol entropy: 3.804 Next-symbol entropy: TT P F B V Q X W Y S Z C 2 R N M J 4 A E O 8 9 * ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- P 2.228 . 0.011 . . . . . . 0.275 0.154 0.529 0.011 . . 0.011 . . 0.522 0.029 0.275 0.029 0.338 0.044 . F 1.918 . . . . . . . . 0.173 0.091 0.522 . 0.005 . 0.010 . . 0.531 0.041 0.209 0.014 0.303 0.018 . B 2.038 . . . . . . . . 0.511 0.380 0.115 . . . . . . 0.260 . 0.506 0.093 0.172 . . V 1.288 . . . . . . . . 0.311 0.250 0.156 . . . . . . 0.250 . 0.320 . . . . Q 1.589 . . . . . . . . . . 0.503 0.057 . . . . . 0.132 . 0.190 0.215 0.434 . 0.057 X 1.476 . . . . . . . . 0.067 0.038 0.508 . . . . . . 0.134 . 0.091 0.217 0.421 . . W 2.064 . . . . . . . . . . 0.524 . . . . . . 0.323 . 0.323 0.401 0.493 . . Y 1.000 . . . . . . . . . . 0.500 . . . . . . . . . 0.500 . . . S 1.740 0.041 0.075 0.023 0.007 0.118 0.203 0.041 0.013 0.007 0.018 0.337 0.033 0.023 . . . . 0.107 0.061 0.165 0.259 0.194 0.013 . Z 1.313 0.042 0.042 . . 0.103 0.173 . . 0.024 . 0.254 0.009 0.009 . . . 0.017 0.118 0.036 0.170 0.179 0.137 . . C 2.283 0.061 0.107 0.025 0.007 0.062 0.096 0.009 . 0.029 0.019 0.483 0.069 0.017 . . . 0.005 0.079 0.009 0.189 0.520 0.462 0.027 0.007 2 2.262 0.057 0.071 0.023 . 0.023 0.023 . . 0.214 0.222 0.041 0.041 . . . . 0.057 0.527 0.041 0.531 0.071 0.142 0.023 0.152 R 2.867 0.020 0.042 0.028 . 0.011 0.011 0.011 0.011 0.396 0.428 0.035 0.035 0.011 . . . 0.146 0.431 0.028 0.523 0.133 0.247 0.049 0.272 N 2.608 0.018 . . . 0.044 . 0.032 . 0.489 0.470 0.044 0.066 0.018 . . . 0.179 0.104 0.032 0.529 0.243 0.104 . 0.237 M 2.676 . 0.036 . . . 0.036 . 
0.020 0.505 0.467 0.020 0.095 0.020 . . . 0.174 0.062 0.036 0.518 0.236 0.151 0.020 0.281 J 0.594 . . . . . . . . . . . 0.108 . . . . . . 0.178 0.178 . . . 0.129 4 0.180 0.006 0.025 0.012 0.012 0.025 0.006 . . . 0.006 0.044 0.012 . . . . 0.006 . . 0.026 . . . . A 2.212 . . 0.006 . . . 0.006 . . . . . 0.471 0.502 0.474 0.121 . . 0.515 . . . 0.117 . E 3.345 0.142 0.386 0.047 0.028 0.009 0.005 0.005 . 0.476 0.406 0.056 0.102 0.076 . . 0.009 0.189 0.150 0.102 0.424 0.279 0.212 0.005 0.238 O 2.324 0.403 0.531 0.099 0.027 0.020 0.029 0.003 . 0.022 0.022 0.037 0.016 0.285 0.016 0.039 0.016 0.027 0.010 0.529 0.030 0.068 0.039 0.037 0.020 8 1.317 0.004 0.025 . 0.004 . . . . 0.091 0.094 0.041 0.008 0.008 . . . 0.054 0.413 0.039 0.172 0.014 0.310 0.008 0.034 9 3.120 0.146 0.153 0.035 0.006 0.021 0.008 0.006 . 0.248 0.217 0.015 0.140 0.159 . . 0.003 0.529 0.037 0.297 0.407 0.269 0.061 0.021 0.344 * 3.434 . 0.060 0.060 . . 0.060 . . 0.270 0.373 0.171 0.060 0.060 0.060 0.060 . 0.139 0.327 0.139 0.499 0.199 0.225 0.270 0.399 3.125 0.258 0.055 0.319 0.022 0.022 0.013 0.013 . 0.122 0.152 0.031 0.446 0.040 . . . 0.505 . 0.166 0.283 0.392 0.286 . . ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- TOT 2.219 0.147 0.261 0.047 0.011 0.032 0.048 0.007 0.002 0.213 0.173 0.402 0.078 0.150 0.099 0.090 0.016 0.234 0.258 0.289 0.387 0.317 0.378 0.031 0.135 Previous-symbol entropy: TT P F B V Q X W Y S Z C 2 R N M J 4 A E O 8 9 * ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- P 0.147 . 0.005 . . . . . . 0.194 0.130 0.291 0.023 . . 0.020 . . 0.387 0.012 0.094 0.011 0.126 0.199 . F 0.261 . . . . . . . . 0.215 0.144 0.468 . 0.011 . 0.036 . . 0.531 0.036 0.129 0.011 0.201 0.171 . B 0.047 . . . . . . . . 0.252 0.126 0.009 . . . . . . 0.048 . 0.081 0.011 0.017 . . V 0.011 . . . . . . . . 
0.098 0.017 0.003 . . . . . . 0.010 . 0.008 . . . . Q 0.032 . . . . . . . . . . 0.052 0.023 . . . . . 0.014 . 0.012 0.019 0.111 . 0.013 X 0.048 . . . . . . . . 0.013 0.009 0.079 . . . . . . 0.022 . 0.008 0.030 0.163 . . W 0.007 . . . . . . . . . . 0.019 . . . . . . 0.010 . 0.006 0.011 0.013 . . Y 0.002 . . . . . . . . . . 0.005 . . . . . . . . . 0.008 . . . S 0.213 0.063 0.059 0.115 0.156 0.503 0.528 0.530 0.500 0.007 0.024 0.498 0.097 0.035 . . . . 0.085 0.042 0.078 0.169 0.097 0.103 . Z 0.173 0.050 0.025 . . 0.419 0.461 . . 0.018 . 0.467 0.023 0.011 . . . 0.012 0.075 0.019 0.064 0.091 0.052 . . C 0.402 0.200 0.185 0.260 0.320 0.526 0.530 0.456 . 0.068 0.058 0.483 0.372 0.061 . . . 0.012 0.141 0.016 0.199 0.367 0.483 0.373 0.031 2 0.078 0.029 0.018 0.039 . 0.057 0.038 . . 0.078 0.103 0.005 0.041 . . . . 0.016 0.284 0.009 0.164 0.014 0.023 0.060 0.088 R 0.150 0.021 0.022 0.093 . 0.057 0.038 0.209 0.500 0.302 0.389 0.009 0.071 0.011 . . . 0.090 0.281 0.012 0.265 0.056 0.088 0.225 0.297 N 0.099 0.011 . . . 0.132 . 0.323 . 0.293 0.325 0.007 0.085 0.011 . . . 0.073 0.036 0.009 0.194 0.071 0.021 . 0.179 M 0.090 . 0.010 . . . 0.067 . 0.500 0.288 0.297 0.003 0.109 0.011 . . . 0.064 0.018 0.009 0.159 0.062 0.028 0.060 0.196 J 0.016 . . . . . . . . . . . 0.023 . . . . . . 0.009 0.006 . . . 0.251 4 0.234 0.011 0.022 0.068 0.250 0.190 0.038 . . . 0.009 0.020 0.041 . . . . 0.006 . . 0.527 . . . . A 0.258 . . 0.039 . . . 0.209 . . . . . 0.516 0.023 0.077 0.245 . . 0.491 . . . 0.531 . E 0.289 0.281 0.418 0.285 0.500 0.098 0.038 0.209 . 0.530 0.526 0.035 0.354 0.158 . . 0.178 0.236 0.170 0.102 0.323 0.253 0.152 0.060 0.443 O 0.387 0.387 0.342 0.530 0.528 0.279 0.269 0.209 . 0.049 0.063 0.035 0.109 0.530 0.086 0.203 0.386 0.054 0.018 0.459 0.030 0.091 0.040 0.433 0.076 8 0.317 0.011 0.032 . 0.156 . . . . 0.145 0.185 0.029 0.041 0.020 . . . 0.079 0.476 0.044 0.132 0.014 0.477 0.103 0.094 9 0.378 0.376 0.237 0.307 0.250 0.279 0.091 0.323 . 0.423 0.441 0.013 0.513 0.393 . . 
0.108 0.299 0.062 0.387 0.397 0.327 0.061 0.291 0.487 * 0.031 . 0.005 0.039 . . 0.038 . . 0.041 0.081 0.009 0.023 0.011 0.018 0.020 . 0.016 0.042 0.012 0.050 0.017 0.015 0.270 0.117 0.135 0.240 0.025 0.531 0.250 0.098 0.038 0.209 . 0.075 0.118 0.007 0.531 0.035 . . . 0.364 . 0.072 0.089 0.179 0.094 . . ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- TOT 2.219 1.681 1.405 2.305 2.411 2.639 2.176 2.678 1.500 3.091 3.046 2.547 2.483 1.815 0.126 0.356 0.917 1.323 2.713 1.740 3.015 1.809 2.262 2.879 2.273 Denis would like me to remove the paragraph-initial lines. Testing the hypothesis that P,F are (often) HOE,DOE cat .voyn.cur \ | tr -d '/= ' \ | tr 'IGHTUDL56' '*********' \ | sed \ -e 's/POE/b/g' \ -e 's/FOE/v/g' \ | count-digraph-freqs \ -vshowentropy=1 \ -vchars='PFBVbvQXWYSZC2RNMJ4AEO89IGH1TU0D3KL567' Next-symbol probability (× 99): P F B V b v Q X W Y S Z C 2 R N M J 4 A E O 8 9 * -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- P . . . . . . . . . . 8 3 42 . . . . . . 32 . 2 . 11 1 . F . . . . . . . . . . 4 2 44 . . . . . . 38 1 2 . 9 . . B . . . . . . . . . . 47 13 2 . . . . . . 7 . 26 2 4 . . V . . . . . . . . . . 74 6 3 . . . . . . 6 . 9 . . . . R . . . . . . . . . . 14 16 . . . . . . 3 16 . 30 2 6 1 7 N . . . . . . 1 . . . 23 20 1 1 . . . . 4 2 . 33 6 2 . 6 M . . . . . . . . . . 26 20 . 2 . . . . 4 1 . 29 6 3 . 7 E 3 13 1 . . . . . . . 21 14 1 2 1 . . . 4 3 2 16 7 5 . 6 O 14 36 2 . 1 1 . . . . . . . . 8 . 1 . . . 32 . 1 1 . . A . . . . . . . . . . . . . . 21 25 21 2 . . 28 . . . 2 . 9 3 3 . . . . . . . . 6 5 . 3 3 . . . 33 . 8 15 7 1 . 11 b 7 9 2 2 . . . . . . 20 18 . . . . . . 4 . . 24 13 . . . v . 6 2 . . . . . . . 32 21 . . . . . . . . . 24 13 . . 2 Q . . . . . . . . . . . . 25 1 . . . . . 2 . 4 5 61 . 1 X . . . . . . . . . . 1 . 26 . . . . . . 2 . 1 5 62 . . W . . . . . . . . . . . . 42 . . . . . . 9 . 9 14 24 . . Y . . . . . 
. . . . . . . 50 . . . . . . . . . 50 . . . S . 1 . . . . 2 4 1 . . . 72 . . . . . . 2 1 3 7 4 . . Z . 1 . . . . 2 4 . . . . 80 . . . . . . 2 . 3 4 3 . . C 1 2 . . . . 1 2 . . . . 22 1 . . . . . 1 . 4 44 20 . . 2 1 1 . . . . . . . . 5 5 1 1 . . . . 1 41 1 36 1 3 . 3 J . . . . . . . . . . . . . 2 . . . . . . 4 4 . . . 90 4 . . . . . . . . . . . . 1 . . . . . . . . 97 . . . . 8 . . . . . . . . . . 1 2 1 . . . . . 1 15 1 4 . 74 . . * . 1 1 . . . . 1 . . 7 12 4 1 1 1 1 . 3 10 3 25 4 5 7 14 5 1 9 . 1 . . . . . 2 3 . 18 1 . . . 26 . 3 8 13 8 . . -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- Previous-symbol probability (× 99): P F B V b v Q X W Y S Z C 2 R N M J 4 A 9 E O 8 * -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- P . . . . . . . . . . 4 2 8 . . . . . . 13 2 . . . 4 . F . . . . . . . . . . 5 3 20 . . . . . . 37 4 . 1 . 4 . B . . . . . . . . . . 6 2 . . . . . . . 1 . . 1 . . . V . . . . . . . . . . 2 . . . . . . . . . . . . . . . b . . 1 3 . . . . . . 1 1 . . . . . . . . . . . . . . v . . 1 . . . . . . . 1 1 . . . . . . . . . . . . . . Q . . . . . . . . . . . . 1 . . . . . . . 2 . . . . . X . . . . . . . . . . . . 1 . . . . . . . 3 . . . . . W . . . . . . . . . . . . . . . . . . . . . . . . . . Y . . . . . . . . . . . . . . . . . . . . . . . . . . S 1 1 2 3 2 . 25 33 38 50 . . 24 2 . . . . . 1 2 1 1 3 2 . Z . . . . 7 . 16 19 . . . . 20 . . . . . . 1 1 . 1 1 . . C 5 4 7 9 . . 32 34 19 . 1 1 22 12 1 . . . . 3 22 . 4 69 12 . 2 . . 1 . 2 . 1 . . . 1 2 . 1 . . . . . 8 . . 3 . 1 1 R . . 2 . . 2 1 . 5 25 8 13 . 1 . . . . 1 7 1 . 7 1 5 8 N . . . . . . 2 . 9 . 8 10 . 1 . . . . 1 . . . 4 1 . 4 M . . . . . . . 1 . 25 8 8 . 2 . . . . 1 . . . 3 1 1 4 J . . . . . . . . . . . . . . . . . . . . . . . . . 6 4 . . 1 6 . 2 4 . . . . . . 1 . . . . . . . . 42 . . . A . . 1 . . . . . 5 . . . . . 45 97 94 80 . . . 24 . . 36 . E 7 15 7 22 7 11 2 . 5 . 32 30 . 11 3 . . 4 6 3 3 2 9 6 1 18 O 67 71 34 40 48 65 7 7 5 . 1 1 . 
2 34 1 5 13 1 . 1 55 . 1 17 1 8 . . . 3 . . . . . . 3 4 . 1 . . . . 1 21 54 1 3 . 2 2 9 13 5 9 6 11 15 7 1 9 . 16 17 . 27 14 . . 2 75 1 1 14 14 10 8 52 * . . 1 . . . . . . . 1 1 . . . . . . . 1 . . 1 . 7 2 5 . 36 6 22 4 2 . 5 . 1 2 . 37 . . . . 12 . 2 1 1 4 . . -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- Perhaps they are POE8/HOE8: cat .voyn.cur \ | tr -d '/= ' \ | tr 'IGHTUDL56' '*********' \ | sed \ -e 's/POE8/b/g' \ -e 's/FOE8/v/g' \ | count-digraph-freqs \ -vshowentropy=1 \ -vchars='PFBVbvQXWYSZC2RNMJ4AEO89IGH1TU0D3KL567' Digraph counts: TT P F B V b v Q X W Y S Z C 2 R N M J 4 A E O 8 9 * ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- P 846 . 1 . . . . . . . . 62 26 341 1 . . 1 . . 259 3 56 3 88 5 . F 1986 . . . . . . . . . . 72 30 869 . 1 . 2 . . 736 11 88 3 170 4 . B 195 . . . . . . . . . . 92 25 4 . . . . . . 13 . 51 3 7 . . V 32 . . . . . . . . . . 24 2 1 . . . . . . 2 . 3 . . . . b 6 . . . . . . . . . . . . . . . . . . . 2 . 1 . 3 . . v 7 . . . . . . . . . . . . . . . . . . . 1 . . . 6 . . Q 121 . . . . . . . . . . . . 31 1 . . . . . 3 . 5 6 74 . 1 X 199 . . . . . . . . . . 2 1 53 . . . . . . 5 . 3 10 125 . . W 21 . . . . . . . . . . . . 9 . . . . . . 2 . 2 3 5 . . Y 4 . . . . . . . . . . . . 2 . . . . . . . . . 2 . . . S 1453 8 17 4 1 . . 31 66 8 2 1 3 1053 6 4 . . . . 27 13 49 96 62 2 . Z 1078 6 6 . . . . 19 39 . . 3 . 866 1 1 . . . 2 23 5 38 41 28 . . C 4268 38 79 13 3 . . 39 69 4 . 15 9 953 45 8 . . . 2 53 4 175 1898 844 14 3 2 365 3 4 1 . . . 1 1 . . 18 19 2 2 . . . . 3 150 2 133 4 10 1 11 R 883 2 5 3 . . . 1 1 1 1 123 145 4 4 1 . . . 25 147 3 272 22 54 6 63 N 503 1 . . . . . 3 . 2 . 117 104 3 5 1 . . . 19 9 2 169 30 9 . 29 M 438 . 2 . . . . . 2 . 1 114 89 1 7 1 . . . 16 4 2 127 25 13 1 33 J 53 . . . . . . . . . . . . . 1 . . . . . . 2 2 . . . 48 4 1676 1 5 2 2 . . 5 1 . . . 1 10 2 . . . . 1 . . 1646 . . 
. . A 1952 . . 1 . . . . . 1 . . . . . 405 495 414 43 . . 552 . . . 41 . E 2331 64 309 15 8 . 1 2 1 1 . 501 344 19 41 28 . . 2 96 69 41 377 161 114 1 136 O 3951 567 1430 67 13 4 4 9 14 1 . 10 10 19 7 305 7 20 7 13 4 1336 15 41 20 19 9 8 2727 1 8 . 1 . . . . . . 41 43 15 2 2 . . . 21 414 14 97 4 2050 2 12 9 3781 106 113 17 2 1 2 9 3 2 . 233 190 6 101 121 . . 1 1277 18 312 556 266 34 9 402 * 113 . 1 1 . . . . 1 . . 8 14 4 1 1 1 1 . 3 11 3 28 5 6 8 16 763 49 6 71 2 1 . 2 1 1 . 17 23 3 138 4 . . . 198 . 26 58 104 59 . . ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- TOT 29752 846 1986 195 32 6 7 121 199 21 4 1453 1078 4268 365 883 503 438 53 1676 1952 2331 3951 2727 3781 113 763 Next-symbol probability (× 99): P F B V b v Q X W Y S Z C 2 R N M J 4 A E O 8 9 * -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- P . . . . . . . . . . 7 3 40 . . . . . . 30 . 7 . 10 1 . F . . . . . . . . . . 4 1 43 . . . . . . 37 1 4 . 8 . . B . . . . . . . . . . 47 13 2 . . . . . . 7 . 26 2 4 . . V . . . . . . . . . . 74 6 3 . . . . . . 6 . 9 . . . . b . . . . . . . . . . . . . . . . . . . 33 . 17 . 50 . . v . . . . . . . . . . . . . . . . . . . 14 . . . 85 . . Q . . . . . . . . . . . . 25 1 . . . . . 2 . 4 5 61 . 1 X . . . . . . . . . . 1 . 26 . . . . . . 2 . 1 5 62 . . W . . . . . . . . . . . . 42 . . . . . . 9 . 9 14 24 . . Y . . . . . . . . . . . . 50 . . . . . . . . . 50 . . . S 1 1 . . . . 2 4 1 . . . 72 . . . . . . 2 1 3 7 4 . . Z 1 1 . . . . 2 4 . . . . 80 . . . . . . 2 . 3 4 3 . . C 1 2 . . . . 1 2 . . . . 22 1 . . . . . 1 . 4 44 20 . . 2 1 1 . . . . . . . . 5 5 1 1 . . . . 1 41 1 36 1 3 . 3 R . 1 . . . . . . . . 14 16 . . . . . . 3 16 . 30 2 6 1 7 N . . . . . . 1 . . . 23 20 1 1 . . . . 4 2 . 33 6 2 . 6 M . . . . . . . . . . 26 20 . 2 . . . . 4 1 . 29 6 3 . 7 J . . . . . . . . . . . . . 2 . . . . . . 4 4 . . . 90 4 . . . . . . . . . . . . 
1 . . . . . . . . 97 . . . . A . . . . . . . . . . . . . . 21 25 21 2 . . 28 . . . 2 . E 3 13 1 . . . . . . . 21 15 1 2 1 . . . 4 3 2 16 7 5 . 6 O 14 36 2 . . . . . . . . . . . 8 . 1 . . . 33 . 1 1 . . 8 . . . . . . . . . . 1 2 1 . . . . . 1 15 1 4 . 74 . . 9 3 3 . . . . . . . . 6 5 . 3 3 . . . 33 . 8 15 7 1 . 11 * . 1 1 . . . . 1 . . 7 12 4 1 1 1 1 . 3 10 3 25 4 5 7 14 6 1 9 . . . . . . . 2 3 . 18 1 . . . 26 . 3 8 13 8 . . -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- Previous-symbol probability (× 99): P F B V b v Q X W Y S Z C 2 R N M J 4 A E O 8 9 * -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- P . . . . . . . . . . 4 2 8 . . . . . . 13 . 1 . 2 4 . F . . . . . . . . . . 5 3 20 . . . . . . 37 . 2 . 4 4 . B . . . . . . . . . . 6 2 . . . . . . . 1 . 1 . . . . V . . . . . . . . . . 2 . . . . . . . . . . . . . . . b . . . . . . . . . . . . . . . . . . . . . . . . . . v . . . . . . . . . . . . . . . . . . . . . . . . . . Q . . . . . . . . . . . . 1 . . . . . . . . . . 2 . . X . . . . . . . . . . . . 1 . . . . . . . . . . 3 . . W . . . . . . . . . . . . . . . . . . . . . . . . . . Y . . . . . . . . . . . . . . . . . . . . . . . . . . S 1 1 2 3 . . 25 33 38 50 . . 24 2 . . . . . 1 1 1 3 2 2 . Z 1 . . . . . 16 19 . . . . 20 . . . . . . 1 . 1 1 1 . . C 4 4 7 9 . . 32 34 19 . 1 1 22 12 1 . . . . 3 . 4 69 22 12 . 2 . . 1 . . . 1 . . . 1 2 . 1 . . . . . 8 . 3 . . 1 1 R . . 2 . . . 1 . 5 25 8 13 . 1 . . . . 1 7 . 7 1 1 5 8 N . . . . . . 2 . 9 . 8 10 . 1 . . . . 1 . . 4 1 . . 4 M . . . . . . . 1 . 25 8 8 . 2 . . . . 1 . . 3 1 . 1 4 J . . . . . . . . . . . . . . . . . . . . . . . . . 6 4 . . 1 6 . . 4 . . . . . . 1 . . . . . . . 41 . . . . A . . 1 . . . . . 5 . . . . . 45 97 94 80 . . 23 . . . 36 . E 7 15 8 25 . 14 2 . 5 . 34 32 . 11 3 . . 4 6 3 2 9 6 3 1 18 O 66 71 34 40 66 57 7 7 5 . 1 1 . 2 34 1 5 13 1 . 57 . 1 1 17 1 8 . . . 3 . . . . . . 3 4 . 1 . . . . 1 21 1 2 . 54 2 2 9 12 6 9 6 17 28 7 1 9 . 16 17 . 
27 14 . . 2 75 1 13 14 10 1 8 52 * . . 1 . . . . . . . 1 1 . . . . . . . 1 . 1 . . 7 2 6 . 36 6 17 . 2 . 5 . 1 2 . 37 . . . . 12 . 1 1 4 2 . . -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- Obviously B/V are not POE8/HOE8: the counts are too low and the next-symbol frequencies are all wrong. POE/HOE is still the best fit. But surely there is more to the story... 97-10-08 stolfi =============== Digraph frequencies ignoring blanks and line breaks, and collapsing 'DFT' to 'HPS': cat .voyn.fsg \ | tr -d ' /=\012' \ | tr 'DFT' 'HPS' \ | enum-ngraphs -v n=2 \ | egrep -v '\*' \ > .voyn-tt-2-r.grm cat .voyn-tt-2-r.grm \ | sed -e 's/^\(.\)\(.\)$/\1:\2/g' \ > .voyn-tt-1-1-r.grm cat .voyn-tt-1-1-r.grm \ | sort | uniq -c | expand \ | compute-freqs \ > .voyn-tt-1-1-r.frq Digraph frequencies around line breaks, ignoring spaces: cat .voyn.fsg \ | tr -d ' /=' \ | tr 'DFT' 'HPS' \ | sed -e 's/^\(..\).*\(..\)$/\1\2/g' \ | tr -s '\012' ':' \ | enum-ngraphs -v n=3 \ | egrep -v '\*' \ | egrep '^.:.$' \ > .voyn-nl-1-1-r.grm cat .voyn-nl-1-1-r.grm \ | sort | uniq -c | expand \ | compute-freqs \ > .voyn-nl-1-1-r.frq Digraph frequencies around interword blanks (omitting line breaks): cat .voyn.fsg \ | tr -d '/=\012' \ | tr 'DFT' 'HPS' \ | tr -s ' ' ':' \ | enum-ngraphs -v n=3 \ | egrep -v '\*' \ | egrep '^.:.$' \ > .voyn-sp-1-1-r.grm cat .voyn-sp-1-1-r.grm \ | sort | uniq -c | expand \ | compute-freqs \ > .voyn-sp-1-1-r.frq Now let's do the comparisons. First, line breaks against total occurrences: compare-freqs \ .voyn-tt-1-1-r.frq \ .voyn-nl-1-1-r.frq \ | compute-count-ratio \ -v nmin=10 -v mw=8 -v mc=40 \ | sort +0.0 -0.2r +4 -5nr \ > .voyn-tt-nl-1-1-r.cmp cat .voyn-tt-nl-1-1-r.cmp \ | print-pattern-classes \ -v rowchars='AI4FPDHCTSZ2L68OKMNREG' \ -v colchars='A6KLMNIZFC2PEDHSTR4G8O' A 6 K L M N I Z C 2 P E H S R 4 G 8 O -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- A | . . oo . oo oo oo . . . -? oo . . oo . . . . I | . . -? -? . . oo . . . . -? 
. . oo . . . . 4 | . . . . . . . . oo -? -? . oo -? . -? . . oo P | oo . . . . . . oo -? . . . . oo . . -? -? oo H | oo . . . -? . -? oo oo -? . oo -? oo -? . oo -? oo C | oo -? . -? . . . . oo oo oo -? oo oo -? -? oo -- oo S | oo . . . . . . . oo -? oo oo oo -? -? -? oo oo oo Z | oo . . . . . . . oo -? . . . -? . . oo oo -- 2 | oo . . . . . -? . -? +? -? -? || -- . +? oo -? -- L | . . . . . . . . . . . -? . -? . . -? . -? 6 | -? . . . . . . . . -? . . -? -? . +? . +? -? 8 | oo . . -? . . -? . oo -? -? || || oo -? || oo -? -- O | -? -? -? -? oo -? -? . oo ## oo oo oo -- oo || oo oo || K | . . . . . . . . . ## -? -? -? +? . ## +? +? +? M | -? . . . . . -? . -? ## +? -? +? -- -? || -- -- -- N | -? . . . . . . . -? ## +? +? -? -- -? || || ++ -- R | oo -? . . . . -? . -? ## ## +? || -- -? || -- || -- E | oo . -? . . . . . -- || || || -- -- oo || ++ ++ -- G | oo -? -? -? . . -? . +? || || -- || -- -- ++ || || -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- A 6 K L M N I Z C 2 P E H S R 4 G 8 O Now, intra-line spaces against all occurrences: compare-freqs \ .voyn-tt-1-1-r.frq \ .voyn-sp-1-1-r.frq \ | compute-count-ratio \ -v nmin=10 -v mw=5 -v mc=5 \ | sort +0.0 -0.2r +4 -5nr \ > .voyn-tt-sp-1-1-r.cmp cat .voyn-tt-sp-1-1-r.cmp \ | print-pattern-classes \ -v rowchars='AI4FPDHCTSZ2L68OKMNREG' \ -v colchars='A6KLMNIZFC2PEDHSTR4G8O' A 6 K L M N I Z C 2 P E H S R 4 G 8 O -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- A | . . oo . oo oo oo . . . +? oo . . oo . . . . I | . . +? +? . . oo . . . . -? . . oo . . . . 4 | . . . . . . . . oo +? +? . || +? . +? . . oo P | oo . . . . . . oo +? . . . . -- . . -? +? -- H | -- . . . +? . +? oo -- +? . oo +? -- +? . -- +? -- C | oo +? . +? . . . . oo oo oo +? oo oo +? +? -- -- -- S | oo . . . . . . . oo -? oo oo oo +? +? +? oo -- -- Z | oo . . . . . . . oo +? . . . +? . . oo oo ++ 2 | ++ . . . . . +? . +? +? +? +? ## || . +? ++ +? ++ L | . . . . . . . . . . . +? . +? . . +? . +? 6 | +? . . . . . . . . +? . . +? +? . 
+? . +? +? 8 | -- . . +? . . +? . oo +? +? || || ++ +? ## -- +? || O | +? -? -? +? oo -? +? . oo ## oo -- -- || -- ## ++ -- || K | . . . . . . . . . ## +? +? +? +? . ## +? +? +? M | +? . . . . . +? . +? ## +? +? +? ## +? ## ## ## ## N | +? . . . . . . . +? ## +? +? +? ## +? ## ## ## ## R | || +? . . . . +? . +? ## ## +? ## || +? ## || ## || E | || . +? . . . . . || || || ## ++ || || ## ++ || || G | ## +? +? +? . . +? . +? ## ## || || || || || ## || ## -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- A 6 K L M N I Z C 2 P E H S R 4 G 8 O There are some notable differences. Patterns that are strong space-contexts but weak or negligible line break contexts: [28OMNREG]:[ST] [2O]:G [N8]:O M:[G8O] R:[GO] E:[CRO] G:[AERO] Just for the sake of completeness, here is the comparison of spaces with line breaks: compare-freqs \ .voyn-sp-1-1-r.frq \ .voyn-nl-1-1-r.frq \ | compute-count-ratio \ -v nmin=10 -v mw=2 -v mc=8 \ | sort +0.0 -0.2r +4 -5nr \ > .voyn-sp-nl-1-1-r.cmp cat .voyn-sp-nl-1-1-r.cmp \ | print-pattern-classes \ -v rowchars='AI4FPDHCTSZ2L68OKMNREG' \ -v colchars='A6KLMNIZFC2PEDHSTR4G8O' A 6 L I C 2 P E H S R 4 G 8 O -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- 4 | . . . . . . -? . -? . . -? . . . P | . . . . -? . . . . -? . . . . -? H | -? . . . -? . . . . -? -? . -? -? -? C | . . . . . . . . . . -? -? -? -? -? S | . . . . . . . . . -? . -? . -? -? Z | . . . . . . . . . -? . . . . -? 2 | oo . . . . +? -? -? +? oo . -? -? -? oo L | . . . . . . . . . -? . . . . -? 6 | . . . . . -? . . -? -? . +? . +? -? 8 | -? . . . . -? -? -? +? -? . -- -? -? oo O | . . . . . +? . -? -? -? -? -? -? -? -? K | . . . . . ## -? -? -? +? . ## +? +? +? M | -? . . -? -? ## +? -? +? oo -? || -? oo oo N | -? . . . -? ## +? +? -? oo -? -- oo -- oo R | oo . . . -? ## ## +? ## -- -? ++ -- ++ -- E | oo . . . oo || ## -- -- -- oo ++ ## -- -- G | oo -? -? . -? 
++ || -- ++ -- oo -- || -- --

Let's write a sed script to split words and syllables according to the
patterns that occur at line breaks. I recomputed the ratio by the
more generous formula

    gawk '\
      { printf " %5d %5.3f %5d %5.3f %5.3f %s %s\n",\
          $1, $2, $3, $4, ($3)/($1+2), $6, $7 \
      }'

Then classified them as

    ++  very likely a word break       ratio >= 0.200 and NT >= 5
    +?  possibly a word break          ratio >= 0.200 and NT < 5
    ::  very likely a syllable break   0.200 > ratio >= 0.005 and NL >= 5
    :?  possible syllable break        0.200 > ratio >= 0.005 and NL < 5
    --  very likely unbreakable        0.005 > ratio and NT >= 80
    -?  possibly unbreakable           0.005 > ratio and NT < 80

Result is in .voyn-tt-nl-1-1-r-hand.cmp

    cat .voyn-tt-nl-1-1-r-hand.cmp \
      | print-pattern-classes \
          -v rowchars='AI4FPDHCTSZ2L68OKMNREG' \
          -v colchars='A6KLMNIZFC2PEDHSTR4G8O'

        A  6  K  L  M  N  I  Z  C  2  P  E  H  S  R  4  G  8  O
        -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --
    A | .  .  -? .  -- -- -? .  .  .  -? -- .  .  -- .  .  .  .
    I | .  .  -? -? .  .  -? .  .  .  .  -? .  .  -? .  .  .  .
    4 | .  .  .  .  .  .  .  .  -? -? -? .  -? -? .  -? .  .  --
    P | -? .  .  .  .  .  .  -? -? .  .  .  .  -- .  .  -? -? -?
    H | -- .  .  .  -? .  -? -- -- -? .  -? -? -- -? .  -- -? --
    C | -? -? .  -? .  .  .  .  -- -? -? -? -- -? :? +? -- -- --
    S | -? .  .  .  .  .  .  .  -- -? -? -? -- -? -? -? -- -- --
    Z | -? .  .  .  .  .  .  .  -- -? .  .  .  -? .  .  -- -? :?
    2 | -- .  .  .  .  .  -? .  -? ++ +? -? :? :? .  ++ -? -? :?
    L | .  .  .  .  .  .  .  .  .  .  .  -? .  -? .  .  -? .  -?
    6 | -? .  .  .  .  .  .  .  .  +? .  .  +? +? .  +? .  +? -?
    8 | -- .  .  -? .  .  -? .  -? +? -? :? :? -- -? :? -- :? :?
    O | -? -? -? -? -? -? -? .  -? ++ -- -- -- :? -- :? -? -? :?
    K | .  .  .  .  .  .  .  .  .  ++ +? -? +? +? .  ++ +? ++ ++
    M | -? .  .  .  .  .  -? .  -? ++ +? +? ++ :? -? ++ :? :? :?
    N | -? .  .  .  .  .  .  .  -? ++ ++ +? :? :? -? :: ++ :? :?
    R | -- -? .  .  .  .  -? .  -? ++ ++ ++ ++ :? +? ++ :? ++ ::
    E | -? .  -? .  .  .  .  .  :? ++ ++ :: :: :? -? ++ :: :: ::
    G | -? -? -? -? .  .  -? .  ++ ++ ++ :: :: :: :? :: ++ :: ::

Here are the rules ("+" means word split, ":" means syllable split,
"-" means no break).

    .-[A6KLMNIZ] [AI4FPDHCSTZ]-. [2MNRE]-G [L8O]-[CFPEDHSTR4G8O]

    [26KMNREG]+[FP] [2L68OKMNREG]+2 [G]+[CFPG] [MNR]+[EDHG] [MER2]+[4] [R]+[8R]

    [2]:[FPEDHSTRG8O] [MNR]:[ST] [E]:[EDHSTG8] [G]:[EDHSTR8] [MN]:[8] [MNREG]:[O]

    cat .voyn.fsg \
      | tr -d '/= ' \
      | sed -e 's/\(.\)/\1 /g' \
      | split-by-nl-patterns \
      | split-by-nl-patterns \
      | tr -d ' \-' \
      | tr '+' ' ' \
      > .voyn-nl-split.fsg

Global tetragram frequencies, ignoring line breaks and word spaces:

    cat .voyn.fsg \
      | tr -d ' /=\012' \
      | enum-ngraphs -v n=4 \
      | egrep -v '\*' \
      > .voyn-tt-4.grm

    cat .voyn-tt-4.grm \
      | sed -e 's/^\(..\)\(..\)$/\1:\2/g' \
      > .voyn-tt-2-2.grm

    cat .voyn-tt-2-2.grm \
      | sort | uniq -c | expand \
      | compute-freqs \
      > .voyn-tt-2-2.frq

97-10-09 stolfi
===============

Decided to create another error-tolerant encoding, even more "lossy"
than HOP. This one collapses FSG A with O, R with 2, S with T.
It also ignores spaces (periods):

--- fsg2ecc ------------------------
#! /n/gnu/bin/gawk -f

# Recoding an interlinear file from the FSG alphabet to
# my Super-Lossy Fault-Tolerant encoding

BEGIN {
  print "# Output of fsg2ecc - Stolfi's Semi-Analytic Fault-Tolerant alphabet"
}

/^ *$/ { print; next }

/^ *#/ { print; next }

/^<[^>.;]*>/ { print; next }

/^<[^>]*\.[^>]*;[A-Z]> / {
  curtxt = substr($0,20)

  # We discard "%" and "!" since the conversion
  # will destroy synchronism anyway.
  gsub(/[%!]/, "", curtxt);

  # We also discard spaces ("."
in the evt format), # since they are not reliable gsub(/[.]/, "", curtxt); # First, the conversion from FSG to JSA (Stolfi's super-analytic) gsub(/IIIK/, "iiiij", curtxt); gsub(/IIIL/, "iiiiu", curtxt); gsub(/IIIR/, "iiiis", curtxt); gsub(/IIIE/, "iiiix", curtxt); gsub(/IIE/, "iiix", curtxt); gsub(/IIR/, "iiis", curtxt); gsub(/IIK/, "iiij", curtxt); gsub(/HZ/, "cqjc", curtxt); gsub(/PZ/, "cqgc", curtxt); gsub(/DZ/, "cljc", curtxt); gsub(/FZ/, "clgc", curtxt); gsub(/IE/, "iix", curtxt); gsub(/IR/, "iis", curtxt); gsub(/IK/, "iij", curtxt); gsub(/2/, "cs", curtxt); gsub(/4/, "q", curtxt); gsub(/6/, "cj", curtxt); gsub(/7/, "ig", curtxt); gsub(/8/, "cg", curtxt); gsub(/A/, "ci", curtxt); gsub(/C/, "c", curtxt); gsub(/D/, "lj", curtxt); gsub(/E/, "ix", curtxt); gsub(/F/, "lg", curtxt); gsub(/G/, "cy", curtxt); gsub(/H/, "qj", curtxt); gsub(/I/, "i", curtxt); gsub(/K/, "ij", curtxt); gsub(/L/, "iu", curtxt); gsub(/M/, "iiiu", curtxt); gsub(/N/, "iiu", curtxt); gsub(/O/, "o", curtxt); gsub(/P/, "qg", curtxt); gsub(/R/, "is", curtxt); gsub(/S/, "cc", curtxt); # Was "csc" in JSA gsub(/T/, "cc", curtxt); gsub(/V/, "?", curtxt); gsub(/Y/, "?", curtxt); # Now, the conversion from JSA to ECC: gsub(/[ql]j/, "H", curtxt); gsub(/[ql]g/, "P", curtxt); gsub(/ij/, "k", curtxt); gsub(/ii*x/, "e", curtxt); gsub(/is/, "r", curtxt); gsub(/iiu/, "n", curtxt); gsub(/y/, "i", curtxt); gsub(/ci/, "a", curtxt); gsub(/cg/, "8", curtxt); gsub(/cs/, "r", curtxt); gsub(/ii*r/, "w", curtxt); gsub(/i*n/, "m", curtxt); gsub(/a/, "o", curtxt); print (substr($0,1,19) curtxt); next } ------------------------------------ cat bio-m-evt.evt \ | fsg2ecc \ > bio-m-ecc.evt cat bio-m-ecc.evt \ | make-consensus-interlin \ > bio-x-ecc.evt cat bio-x-ecc.evt \ | egrep '^<.*;J> ' \ | sed \ -e 's/{[^}]*}//g' \ > bio-j-ecc.evt extract-words-from-interlin \ -chars "8coqHPemrwk" \ bio-j-ecc.evt \ bio-j-ecc lines words bytes file ------ ------- --------- ------------ 1605 1605 35644 bio-j-ecc.wds 767 767 33204 
bio-j-ecc.dic 333 333 13811 bio-j-ecc-gut.wds 333 333 13811 bio-j-ecc-gut.dic 840 840 2445 bio-j-ecc-fun.wds 2 2 5 bio-j-ecc-fun.dic 432 432 19388 bio-j-ecc-bad.wds 432 432 19388 bio-j-ecc-bad.dic Here are the statistics. Keep in mind that spaces were deleted, and here " " means line break. Digraph counts: TT 8 c o q H P e m r w k ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- 333 . 39 15 51 89 24 38 11 . 66 . . 8 1166 4 2 92 1052 9 2 . 4 . 1 . . c 4351 1 909 2389 585 1 183 18 232 3 30 . . o 3864 189 113 211 261 576 972 41 683 402 384 10 22 q 728 . . 10 718 . . . . . . . . H 1347 . 2 853 484 . . . 5 1 2 . . P 109 . 1 75 33 . . . . . . . . e 958 64 67 360 224 29 162 10 18 . 24 . . m 406 24 24 188 148 13 1 . 2 . 6 . . r 517 31 9 153 302 11 3 2 3 . 3 . . w 10 . . 5 5 . . . . . . . . k 22 20 . . 1 . . . . . 1 . . ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- TOT 13811 333 1166 4351 3864 728 1347 109 958 406 517 10 22 Next-symbol probability (× 99): 8 c o q H P e m r w k -- -- -- -- -- -- -- -- -- -- -- -- . 12 4 15 26 7 11 3 . 20 . . c . 21 54 13 . 4 . 5 . 1 . . o 5 3 5 7 15 25 1 17 10 10 . 1 8 . . 8 89 1 . . . . . . . q . . 1 98 . . . . . . . . H . . 63 36 . . . . . . . . P . 1 68 30 . . . . . . . . w . . 50 50 . . . . . . . . e 7 7 37 23 3 17 1 2 . 2 . . m 6 6 46 36 3 . . . . 1 . . r 6 2 29 58 2 1 . 1 . 1 . . k 90 . . 5 . . . . . 5 . . -- -- -- -- -- -- -- -- -- -- -- -- TOT 2 8 31 28 5 10 1 7 3 4 0 0 Note that "e", "m", and "r" have become more similar. It is curious that "8" and "q" have very similar next-symbol statistics. Also curious that P and H become identical... Previous-symbol probability (× 99): TT w k m e H P q r 8 c o -- -- -- -- -- -- -- -- -- -- -- -- -- 2 . . . . 1 2 35 12 13 3 . 1 o 28 56 99 99 98 71 71 37 78 74 10 5 7 c 31 . . . 1 24 13 16 . 6 77 54 15 8 8 1 . . . . . . 1 . . 2 27 q 5 . . . . . . . . . . . 18 H 10 . . . . 1 . . . . . 19 12 P 1 . . . . . . . . . . 2 1 e 7 19 . . . 
2 12 9 4 5 6 8 6
    m   3   7   .   .   .   .   .   .   2   1   2   4   4
    r   4   9   .   .   .   .   .   2   1   1   1   3   8
    w   0   .   .   .   .   .   .   .   .   .   .   .   .
    k   0   6   .   .   .   .   .   .   .   .   .   .   .
         --  --  --  --  --  --  --  --  --  --  --  --  --

Symbol entropy: 2.693

An encouraging sign: with this encoding, all labels in f77v can be
found in the text of the bio section, hand B.

Let's try to discern word/syllable boundaries from the line breaks,
in this reduced encoding:

    cat bio-j-ecc-gut.wds \
      | tr -d '\012' \
      | enum-ngraphs -v n=2 \
      | egrep -v '\*' \
      > .bio-j-ecc-tt-2.grm

    cat .bio-j-ecc-tt-2.grm \
      | sed -e 's/^\(.\)\(.\)$/\1:\2/g' \
      > .bio-j-ecc-tt-1-1.grm

    cat .bio-j-ecc-tt-1-1.grm \
      | sort | uniq -c | expand \
      | compute-freqs \
      > .bio-j-ecc-tt-1-1.frq

Digraph frequencies around line breaks, ignoring spaces:

    cat bio-j-ecc-gut.wds \
      | sed -e 's/^\(..\).*\(..\)$/\1\2/g' \
      | tr -s '\012' ':' \
      | enum-ngraphs -v n=3 \
      | egrep -v '\*' \
      | egrep '^.:.$' \
      > .bio-j-ecc-nl-1-1.grm

    cat .bio-j-ecc-nl-1-1.grm \
      | sort | uniq -c | expand \
      | compute-freqs \
      > .bio-j-ecc-nl-1-1.frq

    compare-freqs \
        .bio-j-ecc-tt-1-1.frq \
        .bio-j-ecc-nl-1-1.frq \
      | compute-count-ratio \
          -v nmin=10 -v mw=10 -v mc=40 \
      | sort +0.0 -0.2r +4 -5nr \
      > .bio-j-ecc-tt-nl-1-1.cmp

    cat .bio-j-ecc-tt-nl-1-1.cmp \
      | print-pattern-classes \
          -v rowchars='co8qHPwemrk' \
          -v colchars='co8qHPwemrk'

Pattern classes:

        c  o  8  q  H  P  w  e  m  r  k
        -- -- -- -- -- -- -- -- -- -- --
    c | -- -- -- -? -- -- .  -- -? -- .
    q | -- -- .  .  .  .  .  .  .  .  .
    H | -- -- -? .  .  .  .  -? -? -? .
    P | -- -- -? .  .  .  .  .  .  .  .
    w | -? -? .  .  .  .  .  .  .  .  .
    8 | -- -- -? || -? -? .  -? .  -? .
    o | -- || || ++ -- || -- -- -- ++ --
    e | -- -- || || -- || .  || .  || .
    m | -- -- || || -? +? .  +? .  ## .
    r | -- -- || || +? +? .  -? .  +? .
    k | -? +? +? +? .  +? .  -? .  +? .

Fixing the count ratio and classification as in the previous manual
classification experiment:

--- compute-count-ratio-new ------------------------
#! /n/gnu/bin/gawk -f
#
# Usage: "$0 -v nmin=NNN -v mw=N.NNN -v mc=N.NNN"
#
# Computes the ratio of two counts for a list of patterns.
# The input must be the output of compare-freqs, in the
# format " NT FT NL FL patt", where "NT","NL" are
# two counts, and "FT","FL" the corresponding relative
# frequencies. The output will have the format
# " NT FT NL FL rat mk patt" where "rat=(NL)/(NT+2)".
#
# The "mk" field is a class code, assigned based on the
# ratio and its certainty, and the parameters "mw", "mc",
# and "nmin", as follows:

function classify(NT, NL, ratio, nmin, mw, mc)
{
  if (ratio >= 1.0/mw)
    { if (NT >= nmin)
        { return "++" } # Probably word break
      else
        { return "+?" } # unimportant but looks more like a word break
    }
  else if (ratio >= 0.005)
    { if (NL >= nmin)
        { return "::" } # possible syllable break
      else
        { return ":?" } # uncertain but looks more like syllable break
    }
  else
    { if (2*NT < mc)
        { return "??" } # too rare, can't tell
      else if (NT < 2*mc)
        { return "-?" } # uncertain but looks more like non-break
      else
        { return "--" } # non-break
    }
}

/^##/ {
  $0 = substr($0, 3);
  printf "##%11.11s %11.11s RelFr MK %s\n", $1, $2, $3;
  next
}

/^# / {
  $0 = substr($0, 3);
  printf "# %11.11s %11.11s ----- -- %s\n", $1, $2, $3;
  next
}

/[0-9]\.[0-9]/ {
  if (mw == 0) { print "must define mw" > "/dev/stderr"; exit 1; }
  if (mc == 0) { print "must define mc" > "/dev/stderr"; exit 1; }
  if (nmin == 0) { print "must define nmin" > "/dev/stderr"; exit 1; }
  NT = $1
  NL = $3
  rat = (NL/(NT+2));
  mark = classify(NT, NL, rat, nmin, mw, mc)
  printf " %5d %5.3f %5d %5.3f %6.3f %s %s\n", $1, $2, $3, $4, rat, mark, $5;
  next
}
----------------------------------------------------

    compare-freqs \
        .bio-j-ecc-tt-1-1.frq \
        .bio-j-ecc-nl-1-1.frq \
      | compute-count-ratio-new \
          -v nmin=5 -v mw=8 -v mc=40 \
      | sort +0.0 -0.2r +4 -5nr \
      > .bio-j-ecc-tt-nl-1-1-new.cmp

    cat .bio-j-ecc-tt-nl-1-1-new.cmp \
      | print-pattern-classes \
          -v rowchars='qHPwco8rekm' \
          -v colchars='mwkco8eHPqr'

        m  w  k  c  o  8  e  H  P  q  r
        -- -- -- -- -- -- -- -- -- -- --
    q | .  .  .  ?? -- .  .  .  .  .  .
    H | ?? .  .  -- -- ?? ?? .  .  .  ??
    P | .  .  .  -? -? ?? .  .  .  .  .
    w | .  .  .  ?? ?? .  .  .  .  .  .
    c | ?? .  .  -- -- -- -- -- ?? +? -?
    o | -- ?? -? :: :: ++ :: :: ++ :: ::
    8 | .  .  .  -- -- ?? ?? ?? +? ++ +?
    r | .  .  .  :? :? ++ ?? ++ ++ ++ ++
    e | .  .  .  :? :: :: ++ :? ++ ++ ++
    k | .  .  .  +? +? +? +? .  +? ++ ++
    m | .  .  .  -- :? :? +? ?? +? ++ ++

    Non-breaks:
      [qHPw]:. .:[mwk] [c]:[co8eHPr] [8]:[co] [m]:[c]

    "Word" breaks:
      [8rk]:[8] [8erkm]:[eHPqr] [o]:[8P] [k]:[co]

    Possible "Syllable" breaks:
      all else.

Recomputing with mw=5 instead of 8:

    compare-freqs \
        .bio-j-ecc-tt-1-1.frq \
        .bio-j-ecc-nl-1-1.frq \
      | compute-count-ratio-new \
          -v nmin=5 -v mw=5 -v mc=40 \
      | sort +0.0 -0.2r +4 -5nr \
      > .bio-j-ecc-tt-nl-1-1-new.cmp

    cat .bio-j-ecc-tt-nl-1-1-new.cmp \
      | print-pattern-classes \
          -v rowchars='qHPwco8rekm' \
          -v colchars='mwkco8eHPqr'

        m  w  k  c  o  8  e  H  P  q  r
        -- -- -- -- -- -- -- -- -- -- --
    q | .  .  .  ?? -- .  .  .  .  .  .
    H | ?? .  .  -- -- ?? ?? .  .  .  ??
    P | .  .  .  -? -? ?? .  .  .  .  .
    w | .  .  .  ?? ?? .  .  .  .  .  .
    c | ?? .  .  -- -- -- -- -- ?? +? -?
    o | -- ?? -? :: :: :: :: :: ++ :: ::
    8 | .  .  .  -- -- ?? ?? ?? +? :? +?
    e | .  .  .  :? :: :: :? :? ++ ++ ++
    r | .  .  .  :? :? ++ ?? ++ ++ ++ ++
    k | .  .  .  +? +? +? +? .  +? ++ ++
    m | .  .  .  -- :? :? +? ?? +? ++ ++

    Non-breaks:
      [qHPw]:. .:[mwk] [c]:[Pr] [8]:[co] [m]:[c]

    "Word" breaks:
      [8erkm]:[eHPqr] [8]:[8] [rkm]:[o8] [k]:[c]

    Possible "Syllable" breaks:
      all else (should check digraphs).
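
To see how the choice of mw moves a pattern from one class to another, here is a minimal sketch that applies the same decision rule as "classify" in compute-count-ratio-new to a single pair of counts. The function name classify_pair and the counts 38 and 6 are made up for illustration; nmin=5 and mc=40 are the values used in the two runs above. It uses plain awk instead of gawk.

```shell
# Hypothetical helper (not one of the notebook's scripts): classify one
# (NT, NL) pair with the decision rule of compute-count-ratio-new.
# Arguments: NT NL mw. nmin=5 and mc=40 are fixed, as in the runs above.
classify_pair() {
  awk -v NT="$1" -v NL="$2" -v mw="$3" -v nmin=5 -v mc=40 'BEGIN {
    ratio = NL / (NT + 2)
    if (ratio >= 1.0/mw)     { mark = (NT >= nmin) ? "++" : "+?" }
    else if (ratio >= 0.005) { mark = (NL >= nmin) ? "::" : ":?" }
    else if (2*NT < mc)      { mark = "??" }
    else if (NT < 2*mc)      { mark = "-?" }
    else                     { mark = "--" }
    printf "%.3f %s\n", ratio, mark
  }'
}

# Made-up counts: 38 occurrences overall, 6 of them across a line break.
classify_pair 38 6 8   # threshold 1/8 = 0.125
classify_pair 38 6 5   # threshold 1/5 = 0.200
```

With these made-up counts the ratio is 6/(38+2) = 0.150, which lies between 1/8 and 1/5, so the mark is "++" for mw=8 and "::" for mw=5. The o:8 entry changes in the same direction between the two tables above.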
Overall tetragram frequencies: cat bio-j-ecc-gut.wds \ | tr -d ' \012' \ | enum-ngraphs -v n=4 \ | egrep -v '\*' \ | sed \ -e 's/^\(..\)\(..\)$/\1:\2/g' \ > .bio-j-ecc-gut-tt-2-2.grm cat .bio-j-ecc-gut-tt-2-2.grm \ | egrep -v '[qHPw]:.|.:[mwk]|[c]:[co8eHPr]|[8]:[co]|[m]:[c]' \ | egrep -v '[8rk]:[8]|[8erkm]:[eHPqr]|[o]:[8P]|[k]:[co]' \ | sort | uniq -c | expand \ | compute-freqs \ > .bio-j-ecc-gut-tt-2-2.frq Tetragram frequencies around line breaks, ignoring spaces: cat bio-j-ecc-gut.wds \ | sed -e 's/^\(..\).*\(..\)$/\1\2/g' \ | tr -s '\012' ':' \ | enum-ngraphs -v n=5 \ | egrep -v '\*' \ | egrep '^..:..$' \ > .bio-j-ecc-gut-nl-2-2.grm cat .bio-j-ecc-gut-nl-2-2.grm \ | egrep -v '[qHPw]:.|.:[mwk]|[c]:[co8eHPr]|[8]:[co]|[m]:[c]' \ | egrep -v '[8rk]:[8]|[8erkm]:[eHPqr]|[o]:[8P]|[k]:[co]' \ | sort | uniq -c | expand \ | compute-freqs \ > .bio-j-ecc-gut-nl-2-2.frq Comparisons: compare-freqs \ .bio-j-ecc-gut-tt-2-2.frq \ .bio-j-ecc-gut-nl-2-2.frq \ | compute-count-ratio-new \ -v nmin=5 -v mw=8 -v mc=40 \ | sort +0.0 -0.2r +4 -5nr \ > .bio-j-ecc-gut-tt-nl-2-2-new.cmp cat .bio-j-ecc-gut-tt-nl-2-2-new.cmp \ | print-pattern-classes oc cc 8o 8c oH oP oe or om o8 oq oo ok ow qo qc ro rq Ho Hc eo ec rc e8 eq er ee eH eP r8 rH rP re rr ce cH cP cm co He H8 Hm 8P 8e 8r -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- oo | . ?? . . ?? . ?? . . . . . . . . . ?? . -? -- ?? -? ?? ?? ?? ?? ?? -? ?? . . ?? . . . ?? . . . . ?? . . . . qo | . ?? . . ?? . ?? ?? . . . . . . ?? . ?? ?? -- -- ?? -? ?? ?? ?? ?? ?? ?? ?? . . . . . . ?? . . . ?? . . . . . ko | . ?? . . ?? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . wo | . . . . . . . . . . . . . . . . ?? . ?? ?? . . . . . . . ?? . . . . . . . . . . . . . . . . . Ho | +? ++ . . :? ?? ?? +? . . . . . . ++ +? :? ?? ?? +? ?? ?? -? ?? . . . ?? ?? ?? . . . . . . . . . . . . . . . Po | . ?? . . . . . . . . . . . . . . ++ . ?? ?? ?? 
?? ?? . . . . ?? ?? . ?? . . . . ?? . . . . . . . . . eo | +? ++ . . +? . +? . ?? . . . . . ++ . ++ ?? :? -? ?? -? ?? ?? ?? ?? ?? ?? . ?? . ?? . ?? . . . . . . . . . . . mo | . ?? . . . . . . . . . . . . +? . +? +? -? -? ?? ?? ?? ?? ?? ?? ?? ?? ?? . . . . . . ?? . . . . . ?? . . . ro | +? ?? . . +? . +? . ?? +? . . . . +? . :? ?? ?? :? :? -? ?? ?? ?? ?? ?? -? ?? ?? ?? ?? ?? ?? . ?? . . . ?? . . . . . 8o | ++ :? . . :: ?? :? ?? ?? +? . ?? . . :: ?? ++ ?? ++ ++ ?? :? :? ?? ?? ?? . ?? ?? ?? ?? ?? . ?? ?? ?? +? . . . . . . . . co | +? :? . . :? ?? :? ?? . ?? . ?? ?? . :: ?? :? ?? :? :? -? :? :? ?? ?? ?? ?? -? ?? ?? ?? ?? . ?? . ?? ?? . . . . . . . . oe | ++ -- :? ++ :? ?? :? -? ?? ?? ?? ?? ?? . . . . . . . . . . . . . . . . . . . . . ?? ?? ?? . . . . . . . . om | ++ . :? :? :? ?? -? ?? . ?? ?? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ?? ?? or | ++ :? . . -? ?? :? -? -? ?? ?? ?? ?? ?? . . . . . . . . . . . . . . . . . . . . -? ?? ?? ?? . . . . . . . ce | +? :? :? :? ?? ?? :? ?? ?? ?? ?? ?? ?? . . . . . . . . . . . . . . . . . . . . . ?? ?? . . . . . . ?? . . Hc | . . . . . . . . . . . . . . +? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . cr | . ?? . . ?? . ?? ?? ?? . ?? ?? . . . . . . . . . . . . . . . . . . . . . . ?? . . . ?? . . . . . . er | . ?? . . . . ?? ?? ?? . . +? . . . . . . . . . . . . . . . . . . . . . . ?? . . . . . . . . . . kr | . . . . . . ?? ?? ?? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . mr | . . . . . . ?? ?? ?? . . . . . . . . . . . . . . . . . . . . . . . . . ?? . . . . . . . . . . rr | . ?? . . . . ?? ?? ?? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8e | . ?? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . He | . ?? . . . . ?? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . cc | . . . . . . . . . . . . . . ?? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ee | . ?? . . . . ?? 
?? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
    ke | .  .  .  .  .  .  ?? .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .
    me | .  ?? .  .  .  .  ?? .  .  ?? .  ?? .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .
    re | .  ?? .  .  .  .  .  .  .  .  .  .  ?? .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .
    Hm | .  .  .  .  .  .  .  ?? .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .
    cm | .  .  .  .  .  .  ?? .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .
    8r | .  .  .  .  .  .  ?? .  .  ?? .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .
    Hr | .  .  .  .  .  .  .  .  ?? .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .
         -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --
         oc cc 8o 8c oH oP oe or om o8 oq oo ok ow qo qc ro rq Ho Hc eo ec rc e8 eq er ee eH eP r8 rH rP re rr ce cH cP cm co He H8 Hm 8P 8e 8r

Note that :oH resembles :cH; could it be a misreading?

From this table, the only reasonably certain entries are

    "Word" boundary:
      eo:cc eo:ro eo:qo Ho:cc Ho:qo Po:ro 8o:oc 8o:ro 8o:Ho 8o:Hc
      oe:oc oe:8c om:oc or:oc

    Non-boundary:
      oo:Hc qo:Ho qo:Hc oe:cc

    "Syllable" boundary:
      8o:qo 8o:oH co:qo

We could extend these to "don't care" cases as follows:

    "Word" boundary:
      [HPerm8c]o:o[crm8] [HPemr]o:(cc|qo|qc) [emr]o:o[HPeqokw]
      [Pem8o]o:r[oq] 8o:H[oc] Ho:Hc (oe|om|or|ce):oc oe:8c Hc:qo

    "Syllable" boundary:
      [HP8c]o:o[HPeqokw] [8c]o:(cc|qo|qc|ec|rc) Ho:ro eo:Ho
      ro:(ro|Hc|eo) co:(r[oq]|H[oc]) (om|or|ce):(cc|8o|8c)
      oe:8o o[em]:oH (oe|or|ce):oe

    Non-break:
      ([cekmr8H]r|oo|qo|ko|wo):..
..:(e[8qreHP]|r[8HPer]|c[eHPmo]|8[Per]) ([HPem]o|oe|or|om|ce):(eo|ec|rc|8o|8c) [r8c]o:8[oc] (oe|om|or|ce):(o[Pr8mqokw]|q[oc]|r[oq]|H[oc]) (ro|Ho):(rq|Ho) (mo|Po):(Ho|Hc) ro:(ec|rc) co:eo eo:Hc om:oe or:oH ce:oH oe:cc cat bio-j-ecc-gut.wds \ | sed -e 's/\(.\)/\1 /g' -e 's/ $//g' \ | split-ecc-by-nl-patterns \ | split-ecc-by-nl-patterns \ | tr -d ' \-' | tr '+:' ' \-' \ > .bio-j-ecc-gut-split.ecc Here is a sample of the result: 8ocHcoe Hok ooHcco-eccco-Hce-8o-ccco-oHccco-qoHcc8o Pccc8o-qoHcc8o-oHomccc8o-qoHor-ccoe-oeccc8o-qoHo Pccc8o Hcc8o-qoHc8o-qoHc8o-qoHc8o-qoHc8o-qoHomoeccc8o rom qoHom qoe Hccoeo romccc8o r-o-eor-ccc8o-oHcc8o-qoHo Pccc8o-r-cccPcco-eccc8o ro 8ce-ccce-cco-Hoeccc8o-qoHok roecccc8o-qoeccc8o-qoe-o-Homccor ro-r-o-eo qoHccc8o-qoeccco-qoHo cccocHcco-qoHomor qoHomoe Hcco-qoe Ho-ro-romccccHcoeo r-oe 8omoecccoe-8omoe qoeo 8o ro 8o Hccc8o Pccc8o-qoHcco-r-o-e-oe-8owccccHco-qoe ecccc8o-qoHcc8oe-oeccc8o qo 8omccccHo qoHco-qoHomcccHo qoHce-8omccc8o-oHce-oeccc8o-oHo-r-o-eok roe Hc8o-oHce-8o roHo-oHo-roHo-r-oe Homoe Hc8o qoHc8o 8o-ccccHo qoHc8o-qoHcc8o-qoHccc8oe-oe qoHcc8o-qoHcc8o-qoHc8o-qoHc8o-qoHcc8oe-8o occc8o-qoHcc8o-qoHcc8o-oe Hcc8o-oHco-Hoe-8o 8ccc8o-qoHc8o-qoHcc8o-qoHcco-qoHcc8o 8or occc8o-cccHo-r-oe-8o-qoHomccHo-roHo-r-oe-8o Ditto, without "-"s: 8ocHcoe Hok ooHccoecccoHce8occcooHcccoqoHcc8o Pccc8oqoHcc8ooHomccc8oqoHorccoeoeccc8oqoHo Pccc8o Hcc8oqoHc8oqoHc8oqoHc8oqoHc8oqoHomoeccc8o rom qoHom qoe Hccoeo romccc8o roeorccc8ooHcc8oqoHo Pccc8orcccPccoeccc8o ro 8ceccceccoHoeccc8oqoHok roecccc8oqoeccc8oqoeoHomccor roroeo qoHccc8oqoecccoqoHo cccocHccoqoHomor qoHomoe Hccoqoe HororomccccHcoeo roe 8omoecccoe8omoe qoeo 8o ro 8o Hccc8o Pccc8oqoHccoroeoe8owccccHcoqoe ecccc8oqoHcc8oeoeccc8o qo 8omccccHo qoHcoqoHomcccHo qoHce8omccc8ooHceoeccc8ooHoroeok roe Hc8ooHce8o roHooHoroHoroe Homoe Hc8o qoHc8o 8occccHo qoHc8oqoHcc8oqoHccc8oeoe qoHcc8oqoHcc8oqoHc8oqoHc8oqoHcc8oe8o occc8oqoHcc8oqoHcc8ooe Hcc8ooHcoHoe8o 8ccc8oqoHc8oqoHcc8oqoHccoqoHcc8o 8or 
occc8occcHoroe8oqoHomccHoroHoroe8o
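
The post-processing applied after split-ecc-by-nl-patterns, and the step from the first sample to the dash-free one, can be sketched as follows. This is only a sketch: it assumes that the splitting script writes "+" for a word break, ":" for a syllable break, and "-" (or leaves a blank) for no break between adjacent characters, which is inferred from the pipeline and not from the script itself. The helper names are hypothetical.

```shell
# Hypothetical helpers; the marker conventions are read off the
# pipeline above, not from split-ecc-by-nl-patterns itself.

# Drop no-break marks ("-") and leftover blanks, then show word
# breaks ("+") as spaces and syllable breaks (":") as "-".
render_breaks() {
  sed -e 's/[ -]//g' -e 's/+/ /g' -e 's/:/-/g'
}

# Remove the syllable marks from already rendered text.
strip_syllable_marks() {
  sed -e 's/-//g'
}

# Made-up marked input, then a fragment of the sample above.
echo '8-o+H:c c' | render_breaks
echo 'ooHcco-eccco-Hce-8o' | strip_syllable_marks
```

The first command prints "8o H-cc" and the second prints "ooHccoecccoHce8o", the same run of symbols that appears in the dash-free sample.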