Hacking at the Voynich manuscript - Side notes
025 Classifying OKOKOKO elements as word-initial, -medial, and -final

Last edited on 1999-02-05 19:33:19 by stolfi

Word and line breaks in the OKOKOKO model
-----------------------------------------

In Notes/017 we saw that (practically) every Voynichese word can be
parsed into the paradigm QOKOKO...KO where Q, O, K are certain sets of
letters and letter clusters.

It is instructive to analyze the immediate contexts of definite word
spaces (std), breaks due to figures in the text (fig), intra-paragraph
line breaks (lin), and intra-word pairs (non), in terms of this
paradigm.

The source text
---------------

For this study we will use the majority-vote and consensus
transcriptions, which include Takeshi's new full transcription.

For simplicity, let's discard all data containing weirdos, extra
plumes, unreadable characters, or the rare letters [buvxz]. Let's also
map the upper case EVA letters [SCIKTPF] to their lower case variants,
since the capitalization carries no information in those cases.

    foreach vt ( m.A c.Y )
      set v = ${vt:r}; set t = ${vt:e}
      cat ../045/only-${v}.evt \
        | egrep -e '^<[^<>]*;'"$t"'>' \
        | tr 'SCIKTPF' 'sciktpf' \
        | tr -d '\!' \
        | sed \
            -e 's/^<[^<>]*> *//g' \
            -e 's/[{][^{}]*[}]//g' \
            -e 's/[&][0-9*?][0-9*?]*[;]\?/*/g' \
            -e 's/[buxvz]/*/g' \
            -e 's/[.,]*-[-.,]*/-/g' \
            -e 's/[,]*[.][,.]*/./g' \
            -e 's/[,][,]*/,/g' \
            -e 's/.['"'"'"]/?/g' \
            -e 's/[^-,./= ]*[%?*][^-,./= ]*/?/g' \
        > base-${v}.txt
    end

Let's now separate the text into the OKOKOKO elements.
We delete empty elements and put {} around the O strings too:

    foreach v ( m c )
      cat base-${v}.txt \
        | factor-field-OK \
            -v inField=1 \
            -v erase=1 \
            -v outField=1 \
        | sed \
            -e 's/{_}//g' \
            -e 's/_//g' \
            -e 's/\([aoy][aoy]*\)/{\1}/g' \
        > base-${v}.elt
    end

Besides these elements, we will use the following reduced alphabets:
the coarse set "clt"

    O = <[aoy]+>
    Q = [q]
    I = [i]+
    R = [djmg] and [rlsn]
    X = , , , , , [ci][ktpf][h], [ci][ktpf], [ktpf]
    E = [ehc] not included in the X letters

We consider also the finer set "flt" where the R and X sets get split
further as follows

    S = , , , ,
    H = [ktpf]
    G = [ci][ktpf][h], [c][ktpf]
    L = [rlsn]
    D = [djmg]

Finally we should consider the "simplified" elements where possible
errors and calligraphic variants are mapped to likely "correct"
versions:

    {p}    -> {t}      (also in composites)
    {f}    -> {k}      (also in composites)
    {g}    -> {m}
    {j}    -> {d}
    {iXh}  -> {cXh}
    {cXhh} -> {cXhe}
    {iXhh} -> {cXhe}
    {iid}  -> {ii} {d}
    etc.

Let's call these the "simple" elements (slt).

The conversion is done by the sed scripts elt2clt, elt2flt, elt2slt.
The script elt2elt, a no-op, is also provided for uniformity.
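The elt2slt script itself is not reproduced in these notes. As a rough illustration of the listed rules only, here is a minimal sed sketch. The function name simplify_elements is hypothetical. The sketch assumes that elements are written as {...}, that the letters p, f, g, j occur only where the listed substitutions apply, and that the {iid} rule extends to any run of i before d. It does not cover the cases hidden behind "etc.".

```shell
#!/bin/sh
# Hypothetical stand-in for elt2slt (NOT the author's script).
# Rules, in order:
#   1. p->t, f->k, g->m, j->d everywhere (assumes these letters occur
#      only where the listed rules apply)
#   2. {iXh...} -> {cXh...}
#   3. {cXhh}   -> {cXhe}   (also catches {iXhh} after rule 2)
#   4. {i..id}  -> {i..i}{d} (generalised from {iid}; an assumption)
simplify_elements() {
  sed \
    -e 'y/pfgj/tkmd/' \
    -e 's/{i\([kt]\)h/{c\1h/g' \
    -e 's/{c\([kt]\)hh}/{c\1he}/g' \
    -e 's/{\(ii*\)d}/{\1}{d}/g'
}

# Example line in the base-*.elt style ('.' and '-' are break markers).
printf '%s\n' '{o}{cph}{y}.{q}{o}{ifhh}{e}{d}{y}-{a}{iid}{g}' \
  | simplify_elements
```

On the sample line this should print {o}{cth}{y}.{q}{o}{ckhe}{e}{d}{y}-{a}{ii}{d}{m}.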
Converting:

    foreach v ( m c )
      foreach map ( clt flt slt )
        echo "map = ${map}"
        cat base-${v}.elt \
          | elt2${map} \
          > base-${v}.${map}
      end
    end

Checking the completeness of the conversion:

    foreach v ( m c )
      foreach ma ( elt.a-z clt.A-Z flt.A-Z slt.a-z )
        set map = "${ma:r}"; set alf = "${ma:e}"
        echo "map = ${map} alf = ${alf}"
        cat base-${v}.${map} \
          | egrep '[{][^{}]*[^{}'"${alf}"'?*][^{}]*[}]' \
          > .bugs-${v}.${map}
        cat base-${v}.${map} \
          | egrep '(^|[}])[^{}]*[^-,.=/ {}][^{}]*([{]|$)' \
          >> .bugs-${v}.${map}
        head -10 .bugs-${v}.${map}
      end
    end

Element frequencies
-------------------

Computing the element frequencies:

    foreach v ( m c )
      foreach map ( elt clt flt slt )
        cat base-${v}.${map} \
          | tr '{}' '\012\012' \
          | egrep '[_A-Za-z?*%]' \
          | sort | uniq -c | expand \
          | sort +0 -1nr \
          > ${map}-${v}.frq
        cat ${map}-${v}.frq \
          | gawk '/./{print $2;}' \
          | sort \
          > ${map}-${v}.dic
      end
    end

Element frequencies

  Majority version:                                 Consensus version:

  count clt   count flt   count slt   count elt     count clt   count flt   count slt   count elt
  ----- ----  ----- ----  ----- ----  ----- ----    ----- ----  ----- ----  ----- ----  ----- ----
  53662 O     53662 O     23747 o     23193 o       45290 O     45290 O     19966 o     19593 o
  38745 R     25157 L     16878 y     16681 y       32418 R     20716 L     14831 y     14702 y
  38037 X     19283 S     13568 a     13260 a       32223 X     16632 S     10891 d     10862 d
   9927 E     16647 H     12490 d     12440 d        8506 E     13888 H     10856 a     10635 a
   6351 I     13585 D     10463 ch    10019 l        8052 ?     11702 D      9094 ch     8745 l
   5138 Q      9927 E     10075 l      7683 k        4766 I      8506 E      8771 l      8052 ?
   2487 ?      6367 I      9919 e      6392 r        4451 Q      8052 ?      8503 e      6314 k
               5138 Q      9756 k      6383 ch                   4774 I      8052 ?      5491 ch
               2487 ?      7114 r      5138 q                    4451 Q      8042 k      5400 r
               2110 G      6891 t      4570 t                    1703 G      5951 r      4451 q
                           5580 n      4103 ee                               5846 t      3870 t
                           5138 q      4069 che                              4451 q      3660 iin
                           4443 ee     4017 iin                              4215 n      3599 che
                           4385 sh     2487 ?                                3782 ee     3535 ee
                           4204 ii     2369 s                                3759 sh     1964 sh
                           2487 ?      2308 sh                               3758 ii     1774 she
                           2380 s      2029 she                              1776 s      1772 s
                           2058 i      1690 ke                                981 i      1459 ke
                           1138 cth    1326 in                                936 cth    1103 p
                           1095 m      1316 p                                 811 m       864 te
                            972 ckh     996 m                                 767 ckh     756 m
                            106 iii     993 te                                 35 iii     630 cth
                                        730 cth                                           541 ckh
                                        654 ckh                                           471 ir
                                        582 ir                                            433 in
                                        371 f                                             266 f
                                        340 eee                                           247 eee
                                        260 oa                                            194 oa
                                        231 e                                             181 ckhe
                                        229 ckhe                                          145 cthe
                                        185 cthe                                          135 e
                                        145 cph                                           112 cph
                                        133 n                                              88 oy
                                        ... ...                                            87 n
                                                                                          ... ...

Observe that the frequency of {?} increased in the consensus version,
while all other counts decreased. Note that the counts of {r} and {s}
decreased even relative to the other letters. Otherwise the
differences are minimal.

Plotting the histograms

    foreach v ( m c )
      foreach map ( clt flt slt elt )
        gnuplot < non-${map}-${v}.frq

        echo "simple word breaks ..."
        cat base-${v}.${map} \
          | sed \
              -e 's/[.]\({[^{}]*}\)[.]/.\1\1./g' \
              -e 's/\({[^{}]*}\)[.]\({[^{}]*}\)/@\1.\2@/g' \
          | tr '@' '\012' \
          | egrep -e '^{[^{}]*}[.]{[^{}]*}$' \
          | compute-pair-freqs \
          > std-${map}-${v}.frq

        echo "figure breaks ..."
        cat base-${v}.${map} \
          | sed \
              -e 's/-\({[^{}]*}\)-/-\1\1-/g' \
              -e 's/\({[^{}]*}\)-\({[^{}]*}\)/@\1-\2@/g' \
          | tr '@' '\012' \
          | egrep -e '^{[^{}]*}-{[^{}]*}$' \
          | compute-pair-freqs \
          > fig-${map}-${v}.frq

        echo "line breaks ..."
        cat base-${v}.${map} \
          | sed \
              -e 's/^[-=., ]*\({[^{}]*}\)[-=., ]*$/\1\1/g' \
              -e 's/^[-=., ]*\({[^{}]*}\)/\1@/g' \
              -e 's/\({[^{}]*}\)[-., ]*$/@\1\//g' \
          | tr -d '\012' \
          | tr '@' '\012' \
          | egrep -e '^{[^{}]*}[/]{[^{}]*}$' \
          | compute-pair-freqs \
          > lin-${map}-${v}.frq
      end
    end

Created {clt,flt,slt,elt}.dic from the "-m" versions, reordering with
hindsight into important classes.

Format element pair frequencies in matrix form:

    foreach v ( m c )
      foreach map ( clt flt slt )
        foreach brk ( lin fig std non )
          echo "map = ${map} brk = ${brk}"
          cat ${brk}-${map}-${v}.frq \
            | gawk '/./{s=$3; gsub(/[\!:./]/, " ", s); print $1,s;}' \
            | count-diword-freqs \
                -v rows=${map}.dic -v cols=${map}.dic \
                -v counted=1 -v digits=5 \
            > ${map}-${brk}-${v}.dwtbl
        end
      end
    end

Analysis in terms of the Q O I R X E classes
--------------------------------------------

Here are the counts for the mapping to the coarse { Q O I R X E }
classes:

Pairs of adjacent characters inside words:

                   absolute counts                      per-row percentages
    --- ------- ----- ----- ----- ----- ----- -----   --- -- -- -- -- -- --
            TOT     Q     O     R     X     E     I        Q  O  R  X  E  I
    --- ------- ----- ----- ----- ----- ----- -----   --- -- -- -- -- -- --
    Q      5137     .  5048     4    52    33     .   Q    . 98  .  1  .  .
    O     37646    28     . 18339 12821   152  6306   O    .  . 48 34  . 16
    R     18953     3 14900   904  3072    59    15   R    . 78  4 16  .  .
    X     37748     . 16730  2859  8517  9629    13   X    . 44  7 22 25  .
    E      9860     .  5139  3770   948     2     1   E    . 52 38  9  .  .
    I      6341     .    11  6301    18     .    11   I    .  . 99  .  .  .
    --- ------- ----- ----- ----- ----- ----- -----   --- -- -- -- -- -- --
    TOT  115685    31 41828 32177 25428  9875  6346   TOT  . 36 27 21  8  5

Pairs around ordinary word breaks (std):

                   absolute counts                      per-row percentages
    --- ------- ----- ----- ----- ----- ----- -----   --- -- -- -- -- -- --
            TOT     Q     O     R     X     E     I        Q  O  R  X  E  I
    --- ------- ----- ----- ----- ----- ----- -----   --- -- -- -- -- -- --
    Q         0     .     .     .     .     .     .   Q    .  .  .  .  .  .
    O     12535  3287  2786  2748  3703    11     .   O   26 22 21 29  .  .
    R     15134  1068  5797  1879  6366    22     2   R    7 38 12 42  .  .
    X       174    10    59    19    85     1     .   X    5 33 10 48  .  .
    E        44     5    16     8    15     .     .   E    .  .  .  .  .  .
    I         5     .     2     2     1     .     .   I    .  .  .  .  .  .
    --- ------- ----- ----- ----- ----- ----- -----   --- -- -- -- -- -- --
    TOT   27892  4370  8660  4656 10170    34     2   TOT 15 31 16 36  .  .

Pairs around figure breaks (fig):

                   absolute counts                      per-row percentages
    --- ------- ----- ----- ----- ----- ----- -----   --- -- -- -- -- -- --
            TOT     Q     O     R     X     E     I        Q  O  R  X  E  I
    --- ------- ----- ----- ----- ----- ----- -----   --- -- -- -- -- -- --
    Q         0     .     .     .     .     .     .   Q    .  .  .  .  .  .
    O       360    16   134   107   100     3     .   O    4 37 29 27  .  .
    R       382     9   168    98   107     .     .   R    2 43 25 28  .  .
    X        19     .     4     6     9     .     .   X    . 21 31 47  .  .
    E         1     .     .     1     .     .     .   E    .  .  .  .  .  .
    I         0     .     .     .     .     .     .   I    .  .  .  .  .  .
    --- ------- ----- ----- ----- ----- ----- -----   --- -- -- -- -- -- --
    TOT     762    25   306   212   216     3     .   TOT  3 40 27 28  .  .

Pairs around line breaks (lin):

                   absolute counts                      per-row percentages
    --- ------- ----- ----- ----- ----- ----- -----   --- -- -- -- -- -- --
            TOT     Q     O     R     X     E     I        Q  O  R  X  E  I
    --- ------- ----- ----- ----- ----- ----- -----   --- -- -- -- -- -- --
    Q         1     .     .     1     .     .     .   Q    .  .  .  .  .  .
    O      1244   199   377   415   248     5     .   O   15 30 33 19  .  .
    R      1828   262   636   594   334     2     .   R   14 34 32 18  .  .
    X        28     2    18     5     2     1     .   X    7 64 17  7  3  .
    E         2     .     1     1     .     .     .   E    .  .  .  .  .  .
    I         2     1     1     .     .     .     .   I    .  .  .  .  .  .
    --- ------- ----- ----- ----- ----- ----- -----   --- -- -- -- -- -- --
    TOT    3105   464  1033  1016   584     8     .   TOT 14 33 32 18  .  .

So we can say that

  (1) Line breaks occur almost only between { O R } and { Q O R X }
      (with frequencies ranging from 6% to 20% of all line breaks);
      rarely between X and { Q O R X } (less than 0.9% of all line
      breaks); and essentially never after { Q I E } or before { I E }
      (less than 0.4% of all line breaks).

  (2) Ordinary word breaks follow the same pattern: the pairs between
      { O R } and { Q O R X } have frequencies between 3.8% and 22%;
      pairs between X and { Q O R X } have total frequency of 0.6%;
      and all the remaining pairs account for only 0.3% of the word
      breaks.

  (3) Figure breaks too follow almost the same pattern: the pairs
      between { O R } and { O R X } have frequencies ranging from 22%
      to 12%, but the pairs R-Q and O-Q are much rarer than around
      line breaks and ordinary spaces, about 1--2% each. Breaks
      between X and { Q O R X } are slightly more common (2.5% total)
      and all other pairs are almost absent (about 0.5%).

  (4) The relative frequencies of { Q O R X } are approximately
      1:2:2:1 after a line break, and 0:3:2:2 after a figure break,
      roughly independently of the character before the break.

  (5) The relative frequencies of { Q O R X } after ordinary word
      breaks seem to depend on the preceding letter: 1:1:1:1 after O,
      1:4:1:4 after R. However they are still of the same order of
      magnitude.

  (6) Inside words, the valid pairs are { QO, OX, OI, IX, XX, XE, XO,
      EX, EO } with frequencies ranging from 4.1% to 27%. The
      remaining pairs have much lower frequencies (OO accounts for
      0.46% of all pairs, and OE for only 0.13%).

These observations seem to imply that the "word spaces", line breaks,
and figure breaks are fairly similar when compared to all
inter-character pairs. Their similarity, and the relative independence
of the second letter from the first, strongly suggest that those
breaks are indeed word boundaries.

In that case we conclude that Voynichese words may end in O or R
(40-45% and 50-60%, respectively) or rarely X; and may begin only with
Q, O, R, or X.

(A more detailed analysis would show that the O at end of words is
almost always . Also the last R in a line is most often EVA , which is
only rarely seen at the other kinds of word breaks.)

Point (3) shows that figure breaks are more like line and word breaks
than like random inter-character breaks.

The main difference between line breaks and figure breaks is that the
probability of finding a Q is much higher after a line break (14%)
than after a figure break (3%).
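The 14% and 3% figures can be re-derived from the TOT rows of the coarse-class tables above. The following awk sketch does that; the function name share_after_break is hypothetical. It truncates rather than rounds, which is an inference from the printed values (for example 4370/27892 = 15.7% appears as 15 in the std table).

```shell
#!/bin/sh
# share_after_break (hypothetical helper): for each input row
#   label total count_Q count_O count_R count_X count_E count_I
# print the label followed by each count as a truncated percentage
# of the row total.
share_after_break() {
  awk '{
    printf "%s", $1
    for (i = 3; i <= NF; i++) printf " %d", int(100 * $i / $2)
    printf "\n"
  }'
}

# TOT rows copied from the std, fig and lin tables ("." written as 0).
share_after_break <<'EOF'
std 27892 4370 8660 4656 10170 34 2
fig 762 25 306 212 216 3 0
lin 3105 464 1033 1016 584 8 0
EOF
```

The first percentage on each output line is the share of Q after that kind of break: 15 for std, 3 for fig, 14 for lin.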
The main difference between line and figure breaks on one side, and
the ordinary word breaks on the other, is that the probability of the
letter after an ordinary word break is visibly dependent on the letter
before the break. The difference can be described as an enhancement of
Q.O pairs at the expense of O.O pairs; and an enhancement of R.O and
R.X pairs at the expense of R.R pairs.

Distribution of letters after a break, depending on the previous one,
ignoring pairs that end in Q:

           after R:              after O:

           O  R  X               O  R  X
          -- -- --              -- -- --
    lin   40 37 21        lin   36 39 23
    fig   45 26 28        fig   39 31 29
    std   41 13 45        std   30 29 40
    non   78  4 16        non    . 48 34

Point (6) is a partial restatement of the QOKOKOKO paradigm. Note that
the pairs { QO OI IX XE EX EO }, which are fairly common inside words,
are not legal places for word spaces, line, or figure breaks.

Unfortunately this data does not shed much light on whether each O (or
E) is attached to the preceding X, the following X, or sometimes both,
or neither.

There are (practically) no figure breaks adjacent to an E.

Analysis for the Q O I G H S L D E classes
------------------------------------------

Here are the counts for the mapping to finer classes
{ Q O I G H S L D E }

Pairs around ordinary word breaks (std):

                            absolute counts                                   per-row percentages
    --- ------- ----- ----- ----- ----- ----- ----- ----- ----- -----   --- -- -- -- -- -- -- --
            TOT     Q     O     L     D     S     H     G     E     I        Q  O  L  D  S  H  G
    --- ------- ----- ----- ----- ----- ----- ----- ----- ----- -----   --- -- -- -- -- -- -- --
    Q         0     .     .     .     .     .     .     .     .     .   Q    .  .  .  .  .  .  .
    O     12535  3287  2786  1493  1255  2359  1050   294    11     .   O   26 22 11 10 18  8  2
    L     14499   957  5570   604  1186  5127   568   465    21     1   L    6 38  4  8 35  3  3
    D       635   111   227    51    38   183    13    10     1     1   D   17 35  8  5 28  2  1
    S        57     5    23     4     5     7    12     1     .     .   S    8 40  7  8 12 21  1
    H       112     5    34     4     4    63     1     .     1     .   H    4 30  3  3 56  .  .
    G         5     .     2     1     1     1     .     .     .     .   G    .  .  .  .  .  .  .
    E        44     5    16     2     6     5     9     1     .     .   E   11 36  4 13 11 20  2
    I         5     .     2     1     1     1     .     .     .     .   I    .  .  .  .  .  .  .
    --- ------- ----- ----- ----- ----- ----- ----- ----- ----- -----   --- -- -- -- -- -- -- --
    TOT   27892  4370  8660  2160  2496  7746  1653   771    34     2   TOT 15 31  7  8 27  5  2

Pairs around figure breaks (fig):

                            absolute counts                                   per-row percentages
    --- ------- ----- ----- ----- ----- ----- ----- ----- ----- -----   --- -- -- -- -- -- -- --
            TOT     Q     O     L     D     S     H     G     E     I        Q  O  L  D  S  H  G
    --- ------- ----- ----- ----- ----- ----- ----- ----- ----- -----   --- -- -- -- -- -- -- --
    Q         0     .     .     .     .     .     .     .     .     .   Q    .  .  .  .  .  .  .
    O       360    16   134    38    69    85     1    14     3     .   O    4 37 10 19 23  .  3
    L       315     9   133    28    53    86     .     6     .     .   L    2 42  8 16 27  .  1
    D        67     .    35     8     9    12     3     .     .     .   D    . 52 11 13 17  4  .
    S         2     .     .     .     .     2     .     .     .     .   S    .  .  .  .  .  .  .
    H         9     .     2     2     2     3     .     .     .     .   H    .  .  .  .  .  .  .
    G         8     .     2     1     1     4     .     .     .     .   G    .  .  .  .  .  .  .
    E         1     .     .     .     1     .     .     .     .     .   E    .  .  .  .  .  .  .
    I         0     .     .     .     .     .     .     .     .     .   I    .  .  .  .  .  .  .
    --- ------- ----- ----- ----- ----- ----- ----- ----- ----- -----   --- -- -- -- -- -- -- --
    TOT     762    25   306    77   135   192     4    20     3     .   TOT  3 40 10 17 25  .  2

Pairs around line breaks (lin):

                            absolute counts                                   per-row percentages
    --- ------- ----- ----- ----- ----- ----- ----- ----- ----- -----   --- -- -- -- -- -- -- --
            TOT     Q     O     L     D     S     H     G     E     I        Q  O  L  D  S  H  G
    --- ------- ----- ----- ----- ----- ----- ----- ----- ----- -----   --- -- -- -- -- -- -- --
    Q         1     .     .     1     .     .     .     .     .     .   Q    .  .  .  .  .  .  .
    O      1244   199   377   172   243   124   123     1     5     .   O   15 30 13 19  9  9  .
    L      1160   187   385   165   206   121    89     5     2     .   L   16 33 14 17 10  7  .
    D       668    75   251    99   124    58    59     2     .     .   D   11 37 14 18  8  8  .
    S         5     .     5     .     .     .     .     .     .     .   S    .  .  .  .  .  .  .
    H        20     1    11     3     2     .     2     .     1     .   H    4 54 14  9  .  9  .
    G         3     1     2     .     .     .     .     .     .     .   G    .  .  .  .  .  .  .
    E         2     .     1     1     .     .     .     .     .     .   E    .  .  .  .  .  .  .
    I         2     1     1     .     .     .     .     .     .     .   I    .  .  .  .  .  .  .
    --- ------- ----- ----- ----- ----- ----- ----- ----- ----- -----   --- -- -- -- -- -- -- --
    TOT    3105   464  1033   441   575   303   273     8     8     .   TOT 14 33 14 18  9  8  .

These numbers can be summarized as follows: line breaks occur only
between { O L D } and { Q O L D S H }, and practically never after
{ Q S H G E I } or before { G E I }. The distribution of the first
letter of the line does not depend much on the last letter of the
previous line.

Ordinary word breaks have almost the same distribution, except that
they may also occur before G. Figure breaks are even more extreme in
that they occur before G but hardly ever before H.

Also the distribution of the letter after a word break depends
significantly on the letter before the break. In particular the pairs
L.S and D.S are far more common around word breaks than they are
around line breaks.

Analysis for the "corrected" elements
-------------------------------------

In tabular form:

Pairs around ordinary word breaks (std):

    --- ------- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ----
            TOT    q    a    o    y    r    l    s    n    d    m   ch   sh   ee    k    t  ckh  cth
    --- ------- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ----
    a        29    3    2    5    1    5    3    .    .    4    .    3    .    .    2    .    1    .
    o       836   75   32   80   22   59  108   31    .  120    1   98   42    1   78   37   17   33
    y     11670 3209  103 2250  291  228  726  332    1 1128    2 1519  689    7  501  432   61  182
    r      4520  232  757 1205  188   15   48   58    .  218    3  965  594    8   54   33   39   94
    l      4688  371  150  908  113   56  168  124    1  602    1 1049  600   15  294  118   26   85
    s       763   28  199  204   44    2   13    9    .   16    1  136   69    3    6    6   12   14
    n      4528  326  209 1383  210   11   30   69    .  342    3 1116  571    1   31   26   48  147
    d       371   88   29   81   20    5   27    5    1   19    .   51   32    .    6    2    2    2
    m       264   23   10   76   11    .    2   11    .   19    .   69   31    .    5    .    .    6
    --- ------- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ----
    TOT   27892 4370 1509 6241  910  383 1131  643    3 2485   11 5054 2656   36  996  657  206  565

Pairs around figure breaks (fig):

    --- ------- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ----
            TOT    q    a    o    y    r    l    s    n    d    m   ch   sh   ee    k    t  ckh  cth
    --- ------- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ----
    a         0    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .
    o        19    2    .    2    4    1    3    .    .    1    .    2    2    .    .    1    .    1
    y       341   14    4   77   47    .    2   32    .   68    .   58   22    1    .    .    2   11
    r        56    3    2   11   10    .    .    4    .   10    .   12    4    .    .    .    .    .
    l       103    4    3   25   16    2    1    9    .   19    .   12    9    .    .    .    .    3
    s        55    .    5   16    5    .    .    3    .   10    .   11    5    .    .    .    .    .
    n       101    2    .   25   15    .    .    9    .   14    .   21   12    .    .    .    .    3
    d        45    .    2   10    9    2    .    4    .    9    .    4    5    .    .    .    .    .
    m        22    .    .   11    3    .    .    2    .    .    .    2    1    .    2    1    .    .
    --- ------- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ----
    TOT     762   25   17  179  110    7    6   64    .  133    2  130   61    1    2    2    2   18

Pairs around line breaks (lin):

    --- ------- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ----
            TOT    q    a    o    y    r    l    s    n    d    m   ch   sh   ee    k    t  ckh  cth
    --- ------- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ----
    a         7    4    .    1    1    .    .    .    .    1    .    .    .    .    .    .    .    .
    o        47    8    1    9    6    2    3    6    .    6    .    2    1    .    1    2    .    .
    y      1190  187    2  168  189    2   25  134    .  236    .   59   62    .   12  108    .    1
    r       276   41    1   55   48    3    6   19    .   49    .   16   17    .    2   18    .    1
    l       364   57    .   54   59    2   15   53    .   70    .   10   16    .    6   21    .    1
    s       109   21    .   19   18    1    2   10    .   14    .    3    8    .    .   12    .    .
    n       411   68    1   66   64    1    4   49    .   73    .   16   35    .    5   25    .    3
    d       108   19    1   21   22    1    4    8    .   17    .    6    3    .    1    4    .    1
    m       560   56    3   89  115    1    8   77    .  107    .   17   31    1    2   52    .    1
    --- ------- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ----
    TOT    3105  464    9  489  535   14   67  360    .  575    .  129  173    1   30  243    .    8

Distribution (in percent) of letters after each type of break:

    ---  --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- ---
           q   a   o   y   r   l   s   n   d   m  ch  sh  ee   k   t ckh cth
    ---  --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- ---
    non    .  10  13  12   5   7   5   4   7   .   4   1   3   7   4   .   .
    std   15   5  22   3   1   4   2   .   8   .  18   9   .   3   2   .   2
    fig    3   2  23  14   .   .   8   .  17   .  17   8   .   .   .   .   2
    lin   14   .  15  17   .   2  11   .  18   .   4   5   .   .   7   .   .
Distribution (in percent) of letters before each type of break:

    ---  ---- ---- ---- ----
          non  std  fig  lin
    ---  ---- ---- ---- ----
    q       4    0    0    0
    a      11    0    0    0
    o      19    2    2    1
    y       1   41   44   38
    r       1   16    7    8
    l       3   16   13   11
    s       1    2    7    3
    n       0   16   13   13
    d      10    1    5    3
    m       0    0    2   18
    ch      8    0    0    0
    sh      3    0    0    0
    ee      3    0    0    0
    k       8    0    0    0
    t       5    0    0    0
    ckh     0    0    0    0
    cth     0    0    0    0
    e       8    0    0    0
    i       1    0    0    0
    ii      3    0    0    0
    iii     0    0    0    0
    ---  ---- ---- ---- ----

From the analysis with "flt" classes, we would expect the following
pairs to occur:

    { a o y r l s n d m } and
    { q a o y r l s n d m ch sh ee k t ckh cth }

Notable absences in all breaks:

    */m - the ratio d:m is 10:1 but the ratio */d : */m is over 200:1

    */n - compared to */r, */l, */s and considering the element
          frequencies. Of course that is because only occurs after
          whereas also occur in other contexts.

    */ee - the ratio ch:sh:ee is 2:1:1 but */ch : */sh : */ee is 2:1:0

    a/*, o/* - while a:o:y is 3:6:4, a/*:o/*:y/* is 0:10:150 in
          ordinary breaks, and similarly absent in other breaks.

Notable absences in line and figure breaks:

    */a - ratio */a:*/o:*/y is 2:8:1 in std, 1:50:50 in lin, while
          a:o:y is 3:6:4

    */r - 1.3% of std, 1% of fig, 0.4% of lin. Also, the ratio r:s is
          3:1 but */r:*/s is 1:2 in std, 1:25 in lin.

    */k - 3.5% of std, 0.3% of fig, 1% of lin

    */l - 4% of std, 1% of fig, 2% of lin

Notable absences in line breaks:

    */ckh, */cth - 2.7% of std, 2.6% of fig, 0.2% of lin

Notable absences in figure breaks:

    */q - 15% of std, 3% of fig, 15% of lin

    */k, */t - 6% of std, 0.5% of fig, 9% of lin

    (seen in flt) s/r, s/s, d/r, d/s

Notable anomalies:

    */s - 2% of std, 8% of fig, 12% of lin.

This said, the significant pairs found around line breaks are mostly
between

    { o y r l s n d m } and { q o y l s d ch sh t }

where o/* pairs are actually quite rare.

Pairs with second element or are in fact more common around line
breaks than around word breaks. Thus it may be that such word breaks
are preferentially omitted in transcription.
Pairs with first element have the same discrepancy, but there the likely explanation is that is an abbreviation or a calligraphic variant that is specifically used at end of line. (Since also occurs before figure breaks, the abbreviation theory seems more likely.)
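
One way to put a number on "more common around line breaks than around word breaks" is to compare an element's share of all lin pairs with its share of all std pairs. The awk sketch below does this for a few second elements, using counts copied from the TOT rows of the slt tables. The function name break_ratio and the choice of elements are illustrative assumptions, not part of the original analysis.

```shell
#!/bin/sh
# break_ratio STD_TOTAL LIN_TOTAL  (hypothetical helper)
# stdin rows: element std_count lin_count
# output:     element std_share% lin_share% lin_share/std_share
break_ratio() {
  awk -v st="$1" -v lt="$2" '{
    s = 100 * $2 / st
    l = 100 * $3 / lt
    printf "%s %.1f %.1f %.1f\n", $1, s, l, l / s
  }'
}

# Column totals for second elements y, s, d, ch (std and lin tables).
break_ratio 27892 3105 <<'EOF'
y 910 535
s 643 360
d 2485 575
ch 5054 129
EOF
```

A ratio well above 1 marks an element over-represented at line starts relative to ordinary word starts; a ratio well below 1, as for ch here, marks the opposite.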