Let's confirm the split rules based on glyphs by comparing the global
distribution of letter groups (spaces omitted) with the distribution
of letter groups that straddle or are adjacent to the line breaks.

    cat bio-m-evt.evt \
      | grep ';C>' \
      | sed \
          -e 's/{[^}]*}//g' \
          -e 's/[\!%]//g' \
      > .tmp-c-fsg.evt

    extract-words-from-interlin \
      -chars 'COG8EDA4TSHRNM2ZPIKLF6' \
      .tmp-c-fsg.evt \
      .tmp-c-fsg

     lines   words     bytes file
    ------ ------- --------- ------------
      7227    7227     38289 .tmp-c-fsg.wds
      1699    1699     10717 .tmp-c-fsg.dic
      6420    6420     35789 .tmp-c-fsg-gut.wds
      1665    1665     10530 .tmp-c-fsg-gut.dic
       807     807      2500 .tmp-c-fsg-fun.wds
        34      34       187 .tmp-c-fsg-fun.dic
         0       0         0 .tmp-c-fsg-bad.wds
         0       0         0 .tmp-c-fsg-bad.dic

(In the three tables below, the unlabeled first row and column stand
for the blank between words.)

Digraph counts:

           TT         C    O    G    8    E    D    A    4    T    S    H    R    N    M    2    Z    P    I    K    L    F    6
        ----- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ----
         6420    .   24 1363  138  514  365  108  133 1646  761  692  153  134    .    .  278    .   99    1    .    2    8    1
      C  4278    7  951  172  838 1895    4  155   55    1   15    9   80    8    .    8   45    .   17   11    .    2    3    2
      O  3895   34   19    4   13   31 1342 1430    3    8    7    9  567  300    7   14    7    .   68    9    7    1   13    2
      G  3764 3510    1    7    .   17   21   71    2   10   20   25   55   14    .    .    6    .    1    1    1    .    2    .
      8  2728   73   19   72 2045    2   10    8  418    1   36   38    1    2    .    .    1    .    .    1    .    1    .    .
      E  2349 1085    9  159  106   84    7  270   55    2  306  181   37   13    .    .   16    .   11    .    2    .    6    .
      D  2185   14  871   79  169    2   11    .  740    .   69   28    .    .    .    1    .  198    .    3    .    .    .    .
      A  1969    6    .    5    4    8  551    4    1    .    .    1    4  394  471  399    7    .    2   51   43   12    .    6
      4  1668    5   19 1622    3    .    .    4    4    .    .    1    5    .    .    .    2    .    2    .    .    .    1    .
      T  1447    1 1050   49   62   96   13   83   26    .    1    2   39    4    .    .    6    .   12    .    .    .    3    .
      S  1073    4  864   37   27   40    5   45   21    .    3    .   25    1    .    .    1    .    .    .    .    .    .    .
      H   969    4  343   58   88    3    3    1  258    .   60   25    .    .    .    .    1  121    .    4    .    .    .    .
      R   913  619    4   82   44    5    1    1   93    .   37   22    1    .    .    .    .    .    2    1    .    .    .    1
      N   478  462    .    7    2    3    .    .    2    .    1    .    .    .    .    .    1    .    .    .    .    .    .    .
      M   422  412    .    2    5    1    .    .    1    .    1    .    .    .    .    .    .    .    .    .    .    .    .    .
      2   372   73    4  114   10    3    1    5  131    .   14   13    2    .    .    .    .    .    1    1    .    .    .    .
      Z   344    2   96   10  203   21    .    .    9    .    2    .    .    .    .    .    1    .    .    .    .    .    .    .
      P   215    4    3   48    6    3    .    .   14    .   91   25    .    .    .    .    .   21    .    .    .    .    .    .
      I   152    .    .    .    .    .   11    .    .    .    .    .    .   43    .    .    .    .    .   69    4   25    .    .
      K    57   55    .    1    .    .    1    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .
      L    43   38    .    1    1    .    3    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .
      F    36    1    1    3    .    .    .    .    2    .   23    2    .    .    .    .    .    4    .    .    .    .    .    .
      6    12   11    .    .    .    .    .    .    1    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .
        ----- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ----
    TOT 35789 6420 4278 3895 3764 2728 2349 2185 1969 1668 1447 1073  969  913  478  422  372  344  215  152   57   43   36   12

Next-symbol probability (× 99):

        TT     C  O  G  8  E  D  A  4  T  S  H  R  N  M  2  Z  P  I  K  L  F  6
        -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --
        99  .  . 21  2  8  6  2  2 25 12 11  2  2  .  .  4  .  2  .  .  .  .  .
      C 99  . 22  4 19 44  .  4  1  .  .  .  2  .  .  .  1  .  .  .  .  .  .  .
      O 99  1  .  .  .  1 34 36  .  .  .  . 14  8  .  .  .  .  2  .  .  .  .  .
      G 99 92  .  .  .  .  1  2  .  .  1  1  1  .  .  .  .  .  .  .  .  .  .  .
      8 99  3  1  3 74  .  .  . 15  .  1  1  .  .  .  .  .  .  .  .  .  .  .  .
      E 99 46  .  7  4  4  . 11  2  . 13  8  2  1  .  .  1  .  .  .  .  .  .  .
      D 99  1 39  4  8  .  .  . 34  .  3  1  .  .  .  .  .  9  .  .  .  .  .  .
      A 99  .  .  .  .  . 28  .  .  .  .  .  . 20 24 20  .  .  .  3  2  1  .  .
      4 99  .  1 96  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .
      T 99  . 72  3  4  7  1  6  2  .  .  .  3  .  .  .  .  .  1  .  .  .  .  .
      S 99  . 80  3  2  4  .  4  2  .  .  .  2  .  .  .  .  .  .  .  .  .  .  .
      H 99  . 35  6  9  .  .  . 26  .  6  3  .  .  .  .  . 12  .  .  .  .  .  .
      R 99 67  .  9  5  1  .  . 10  .  4  2  .  .  .  .  .  .  .  .  .  .  .  .
      N 99 96  .  1  .  1  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .
      M 99 97  .  .  1  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .
      2 99 19  1 30  3  1  .  1 35  .  4  3  1  .  .  .  .  .  .  .  .  .  .  .
      Z 99  1 28  3 58  6  .  .  3  .  1  .  .  .  .  .  .  .  .  .  .  .  .  .
      P 99  2  1 22  3  1  .  .  6  . 42 12  .  .  .  .  . 10  .  .  .  .  .  .
      I 99  .  .  .  .  .  7  .  .  .  .  .  . 28  .  .  .  .  . 45  3 16  .  .
      K 99 96  .  2  .  .  2  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .
      L 99 87  .  2  2  .  7  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .
      F 99  3  3  8  .  .  .  .  6  . 63  6  .  .  .  .  . 11  .  .  .  .  .  .
      6 99 91  .  .  .  .  .  .  8  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .
        -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --
    TOT 99 18 12 11 10  8  6  6  5  5  4  3  3  3  1  1  1  1  1  0  0  0  0  0

Previous-symbol probability (× 99):

        TT     C  O  G  8  E  D  A  4  T  S  H  R  N  M  2  Z  P  I  K  L  F  6
        -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --
        18  .  1 35  4 19 15  5  7 98 52 64 16 15  .  . 74  . 46  1  .  5 22  8
      C 12  . 22  4 22 69  .  7  3  .  1  1  8  1  .  2 12  .  8  7  .  5  8 17
      O 11  1  .  .  .  1 57 65  .  .  .  1 58 33  1  3  2  . 31  6 12  2 36 17
      G 10 54  .  .  .  1  1  3  .  1  1  2  6  2  .  .  2  .  .  1  2  .  6  .
      8  8  1  .  2 54  .  .  . 21  .  2  4  .  .  .  .  .  .  .  1  .  2  .  .
      E  6 17  .  4  3  3  . 12  3  . 21 17  4  1  .  .  4  .  5  .  3  . 17  .
      D  6  . 20  2  4  .  .  . 37  .  5  3  .  .  .  .  . 57  .  2  .  .  .  .
      A  5  .  .  .  .  . 23  .  .  .  .  .  . 43 98 94  2  .  1 33 75 28  . 50
      4  5  .  . 41  .  .  .  .  .  .  .  .  1  .  .  .  1  .  1  .  .  .  3  .
      T  4  . 24  1  2  3  1  4  1  .  .  .  4  .  .  .  2  .  6  .  .  .  8  .
      S  3  . 20  1  1  1  .  2  1  .  .  .  3  .  .  .  .  .  .  .  .  .  .  .
      H  3  .  8  1  2  .  .  . 13  .  4  2  .  .  .  .  . 35  .  3  .  .  .  .
      R  3 10  .  2  1  .  .  .  5  .  3  2  .  .  .  .  .  .  1  1  .  .  .  8
      N  1  7  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .
      M  1  6  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .
      2  1  1  .  3  .  .  .  .  7  .  1  1  .  .  .  .  .  .  .  1  .  .  .  .
      Z  1  .  2  .  5  1  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .
      P  1  .  .  1  .  .  .  .  1  .  6  2  .  .  .  .  .  6  .  .  .  .  .  .
      I  0  .  .  .  .  .  .  .  .  .  .  .  .  5  .  .  .  .  . 45  7 58  .  .
      K  0  1  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .
      L  0  1  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .
      F  0  .  .  .  .  .  .  .  .  .  2  .  .  .  .  .  .  1  .  .  .  .  .  .
      6  0  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .
        -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --
    TOT 99 99 99 99 99 99 99 99 99 99 99 99 99 99 99 99 99 99 99 99 99 99 99 99

    Symbol entropy: 3.749
    Next-symbol entropy: 1.975

Let's generate a working text, correcting a few obvious mistakes,
such as the "I" and "L" mistakes.

    cat .tmp-c-fsg.txt \
      | /n/gnu/bin/sed \
          -e 's/^ *//g' -e 's/ *$//g' -e 's/  */ /g' \
      | correct-fsg \
      > .voyn.fsg

    --- correct-fsg ------------------------
    #! /n/gnu/bin/sed -f
    # Corrects "transcription errors" in FSG notation
    #
    s/$/ /g
    s/^/ /g
    s/CI/A/g
    s/IIIL/M/g
    s/CM/AN/g
    s/AL/AN/g
    s/A2/AR/g
    s/4A/4O/g
    s/A /G /g
    s/4CD/4OD/g
    s/4CH/4OH/g
    s/4G/4O/g
    s/A\([^KMNRIEFP]\)/O\1/g
    s/^ *//g
    s/ *$//g
    ----------------------------------------

First, single characters at end-of-line:

    cat .voyn.fsg \
      | tr -d ' /=\012' \
      | enum-ngraphs -v n=1 \
      | egrep -v '\*' \
      > .voyn-tt-1.grm

    cat .voyn-tt-1.grm \
      | sed -e 's/^\(.\)$/\1:/g' \
      | sort | uniq -c | expand \
      | compute-freqs \
      > .voyn-tt-1-0.frq

    cat .voyn.fsg \
      | tr -d ' /=' \
      | sed -e 's/^\(..\).*\(..\)$/\1\2/g' \
      | tr -s '\012' ':' \
      | enum-ngraphs -v n=2 \
      | egrep -v '\*' \
      | egrep '^.:$' \
      | sort | uniq -c | expand \
      | compute-freqs \
      > .voyn-nl-1-0.frq

    compare-freqs \
        .voyn-tt-1-0.frq \
        .voyn-nl-1-0.frq \
      | compute-count-ratio \
          -v nmin=25 -v mw=5 -v mc=40 \
      | sort -b +0.0 -0.2r +5 -6 +4 -5nr +0 -1nr \
      > .voyn-tt-nl-1-0.cmp

The output is shown below. The first two columns are the total
occurrences of the pattern (NT), and the ratio NT/(total characters).
The next two columns are the count of occurrences centered on line
breaks (NL) and the ratio NL/(total line breaks). The "ratio" column
is the ratio NL/NT (smoothed, as explained below).

The "mk" field is a classification of the pattern based on the two
counts. First, if the total occurrence count NT of the pattern is
less than a certain minimum count "nmin", it doesn't really matter
how we classify it. Such patterns are marked "-?" or "+?", meaning
"don't care, but low" and "don't care, but high", respectively. The
choice between the two is the same as for the "--" and "++" marks,
explained below. Here we have taken nmin=25.

Now suppose NT is greater than or equal to "nmin". If NL is too
small, it means that the pattern never occurs at line breaks, and
hence is probably not a valid word boundary. Such patterns are
marked "oo".

If NL is close to NT (i.e.
the ratio is close to 1.000), it means that the group ONLY occurs
around LINE breaks; presumably because it only occurs around
paragraph breaks, or is affected by abbreviations and calligraphic
ornament. Such patterns are marked "##".

Let the average number of "true" words per line be "mw". If line
breaks are a random subset of the "true" word breaks, a pattern that
ONLY occurs at word breaks would have ratio NL/NT equal to 1/mw.
Conversely, if the ratio NL/NT is 1/mw or more, it means that the
pattern is an almost certain indicator of WORD break (unless its
line-break statistics are biased for some other reason). Such
patterns are marked "||". For starters, we have guessed mw=5.

Finally, the probability that an occurrence of a pattern P marks a
true word boundary is NB/NT, where NB is the number of occurrences
of P at a true word boundary. If we use mw*NL as an estimate for NB,
then P is most probably a word break when mw*NL/NT >= 1/2. In that
case, we mark the pattern with "++". Conversely, if mw*NL/NT < 1/2,
the pattern is most likely *not* a word break; we mark such a
pattern with "--".

The counts can be contaminated by sampling error, transcription
mistakes, non-text usage, etc. To allow for these perturbations, the
ratio is computed as (NL+1)/(NT+mc), and the conditions have
actually been modified as shown below. The basic idea is to assign a
"##", "||", or "oo" mark only if we are reasonably sure that by
taking those marks at face value we would not make more than "nmin"
mistakes for each pattern.

    function classify(NT, NL, nmin, mw, mc)
      {
        if ((NT < nmin) && (2*mw*(NL+1) < (NT+mc)))
          { return "-?" } # unimportant but NL low
        else if ((NT < nmin) && (2*mw*(NL+1) >= (NT+mc)))
          { return "+?" } # unimportant but NL high
        else if (mw*(NL+1) < nmin)
          { return "oo" } # NL practically zero
        else if ((NL-1) > NT - nmin)
          { return "##" } # NL practically NT
        else if (mw*(NL-1) > NT - nmin)
          { return "||" } # NL practically maximum expected
        else if (2*mw*(NL+1) < (NT+mc))
          { return "--" } # NL on the low side
        else if (2*mw*(NL+1) >= (NT+mc))
          { return "++" } # NL on the high side
        else
          { return "!!" } # program error
      }

Here is the output:

     tot occurs   at newline  ratio  mk  group
    -----------  -----------  -----  --  -----------
       57 0.002     50 0.066  0.526  ##  K:
       12 0.000      8 0.011  0.173  +?  6:
     3781 0.129    402 0.529  0.105  ++  G:
      438 0.015     33 0.043  0.071  --  M:
      922 0.031     64 0.084  0.068  --  R:
     2353 0.080    138 0.182  0.058  --  E:
      503 0.017     29 0.038  0.055  --  N:
      365 0.012     11 0.014  0.030  --  2:
     2740 0.093     12 0.016  0.005  --  8:
     3964 0.135      9 0.012  0.002  --  O:
        8 0.000      0 0.000  0.021  -?  L:
       36 0.001      0 0.000  0.013  oo  F:
       72 0.002      0 0.000  0.009  oo  I:
      345 0.012      1 0.001  0.005  oo  Z:
      216 0.007      0 0.000  0.004  oo  P:
     4268 0.145      3 0.004  0.001  oo  C:
     1952 0.066      0 0.000  0.001  oo  A:
     1676 0.057      0 0.000  0.001  oo  4:
     1453 0.049      0 0.000  0.001  oo  T:
     1078 0.037      0 0.000  0.001  oo  S:
      973 0.033      0 0.000  0.001  oo  H:
     2192 0.075      0 0.000  0.000  oo  D:

Interpretation: we can safely insert a word break after every
occurrence of "K", and suppress a word break after "F", "I", "Z",
"A", "P", "H", "S", "T", "4", "C", and "D". By doing so we will
probably not make more than 25 mistakes for each pattern.

Also, we can either break or not after "6" and "L". Given the
ratios, it seems safer to break after "6" but not break after "L".

Note that, by our stated criteria, we cannot safely break after FSG
"G". If there are only 5 true words per line (our working guess),
then there should be about 5*402 = 2010 true words ending in "G",
and therefore another 3781 - 2010 = 1771 "G"s that are not
word-final. In fact, the many lines that *start* with "G" already
told us that.
Moreover, since we know that "G" is not always word-final, we can
(tentatively) conclude that mw is less than 3781/402 = 9.41. But we
should not bet the house on this...

By the same reasoning, it is not safe to always suppress word breaks
after "O". Given that there are 9 occurrences of "O" at end-of-line,
we expect 5*9 = 45 "O"s at the end of "true" words. Suppressing
those word breaks would mean 45 mistakes, which is more than our
specified limit "nmin". (Of course that may also mean that our
choice of "nmin" is too low...)

Summarizing, we get the following word-splitting rules:

    always break:   K:
    likely break:   6: G:
    unlikely break: 2: 8: E: M: N: O: R:
    never break:    4: A: C: D: F: H: I: L: P: S: T: Z:

Now let's do the same for single characters at line-start:

    cat .voyn-tt-1.grm \
      | sed -e 's/^\(.\)$/:\1/g' \
      | sort | uniq -c | expand \
      | compute-freqs \
      > .voyn-tt-0-1.frq

    cat .voyn.fsg \
      | tr -d ' /=' \
      | sed -e 's/^\(..\).*\(..\)$/\1\2/g' \
      | tr -s '\012' ':' \
      | enum-ngraphs -v n=2 \
      | egrep -v '\*' \
      | egrep '^:.$' \
      | sort | uniq -c | expand \
      | compute-freqs \
      > .voyn-nl-0-1.frq

    compare-freqs \
        .voyn-tt-0-1.frq \
        .voyn-nl-0-1.frq \
      | compute-count-ratio \
          -v nmin=25 -v mw=5 -v mc=40 \
      | sort -b +0.0 -0.2r +5 -6 +4 -5nr +0 -1nr \
      > .voyn-tt-nl-0-1.cmp

     tot occurs   at newline  ratio  mk  group
    -----------  -----------  -----  --  -----------
      365 0.012    138 0.181  0.343  ||  :2
      216 0.007     72 0.094  0.285  ||  :P
     1676 0.057    198 0.260  0.116  ++  :4
      973 0.033     52 0.068  0.052  --  :H
     2740 0.093    104 0.136  0.038  --  :8
     1078 0.037     23 0.030  0.021  --  :S
     3781 0.129     59 0.077  0.016  --  :G
     3964 0.135     58 0.076  0.015  --  :O
     1453 0.049     17 0.022  0.012  --  :T
     2353 0.080     26 0.034  0.011  --  :E
      922 0.031      4 0.005  0.005  --  :R
     2192 0.075      7 0.009  0.004  --  :D
        8 0.000      0 0.000  0.021  -?  :L
       12 0.000      0 0.000  0.019  -?  :6
       36 0.001      1 0.001  0.026  oo  :F
       57 0.002      0 0.000  0.010  oo  :K
       72 0.002      0 0.000  0.009  oo  :I
      345 0.012      0 0.000  0.003  oo  :Z
      503 0.017      0 0.000  0.002  oo  :N
      438 0.015      0 0.000  0.002  oo  :M
     4268 0.145      3 0.004  0.001  oo  :C
     1952 0.066      0 0.000  0.001  oo  :A

In words, "2" and "P" can be considered word-start indicators. Note
that character "4" would only be a sure word-start indicator if the
average number of true words per line was 1676/198 = 8.46; and that
is probably an upper bound for "mw".

We can also assume safely that no words begin with "F", "C", "K",
"I", "Z", "M", "N", and "A", and suppress word breaks before those
characters.

Characters "6" and "L" are so rare that we can either break or
suppress breaks before them. Since "6" and "L" do not occur at
line-start, but do occur at line-end, it is safer not to break
before them (unless we have other reasons to).

Note the strong difference between the "word-startiness" of "H" and
"D". Since those characters seem equivalent by many other criteria,
we conjecture the difference is a calligraphic effect, namely that
"D" is almost always written as "H" when it is line-initial.
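
The back-of-envelope estimates used above for "G:", "O:" and ":4"
all follow from the same two counts, NT and NL. Here is a minimal
sketch of that arithmetic, assuming a POSIX shell with awk. The
helper name "break_estimate" is ours (hypothetical, not one of the
scripts used in these notes), and it works on the raw counts,
ignoring the (NL+1)/(NT+mc) smoothing.

```shell
# break_estimate NT NL [mw]   (hypothetical helper, not an original script)
# Prints three numbers for one pattern:
#   mw*NL        estimated occurrences at true word breaks
#   NT - mw*NL   estimated occurrences that are not at word breaks
#   NT/NL        the value of mw at which every occurrence would be a break
# mw defaults to 5, the working guess used in the text.
break_estimate() {
  LC_ALL=C awk -v nt="$1" -v nl="$2" -v mw="${3:-5}" 'BEGIN {
    printf "%d %d %.2f\n", mw * nl, nt - mw * nl, nt / nl
  }'
}

break_estimate 3781 402   # "G:" at line end
break_estimate 3964 9     # "O:" at line end
break_estimate 1676 198   # ":4" at line start
```

For "G:" the first two numbers are the 2010 and 1771 quoted in the
text; the third number is the bound on mw, about 9.4 for "G:" and
about 8.5 for ":4".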
In tabular form:

    always break:   :2 :P
    likely break:   :4
    unlikely break: :8 :D :E :G :H :O :R :S :T
    never break:    :6 :A :C :F :I :K :L :M :N :Z

Let's recompute these statistics, excluding the "sure" break and
non-break patterns that we have already identified, namely

    K: 4: A: C: D: F: H: I: P: S: T: Z:

and

    :2 :P :6 :A :C :F :I :K :L :M :N :Z

    cat .voyn.fsg \
      | tr -d ' /=\012' \
      | enum-ngraphs -v n=2 \
      | egrep -v '\*' \
      > .voyn-tt-2.grm

    cat .voyn-tt-2.grm \
      | sed -e 's/^\(.\)\(.\)$/\1:\2/g' \
      | egrep -v '[K4ACDFHIPSTZ]:' \
      | egrep -v ':[2P6ACFIKLMNZ]' \
      > .voyn-tt-1-1-x.grm

    cat .voyn.fsg \
      | tr -d ' /=' \
      | sed -e 's/^\(..\).*\(..\)$/\1\2/g' \
      | tr -s '\012' ':' \
      | enum-ngraphs -v n=3 \
      | egrep -v '\*' \
      | egrep '^.:.$' \
      | egrep -v '[K4ACDFHIPSTZ]:' \
      | egrep -v ':[2P6ACFIKLMNZ]' \
      > .voyn-nl-1-1-x.grm

Let's now have a look at the line-final characters in this reduced
sample:

    cat .voyn-tt-1-1-x.grm \
      | sed -e 's/^\(.\):.$/\1:/g' \
      | sort | uniq -c | expand \
      | compute-freqs \
      > .voyn-tt-1-0-x.frq

    cat .voyn-nl-1-1-x.grm \
      | sed -e 's/^\(.\):.$/\1:/g' \
      | sort | uniq -c | expand \
      | compute-freqs \
      > .voyn-nl-1-0-x.frq

    compare-freqs \
        .voyn-tt-1-0-x.frq \
        .voyn-nl-1-0-x.frq \
      | compute-count-ratio \
          -v nmin=25 -v mw=5 -v mc=40 \
      | sort -b +0.0 -0.2r +5 -6 +4 -5nr +0 -1nr \
      > .voyn-tt-nl-1-0-x.cmp

Here are the results:

     tot occurs   at newline  ratio  mk  group
    -----------  -----------  -----  --  -----------
        9 0.001      7 0.014  0.163  +?  6:
     3514 0.258    291 0.572  0.082  --  G:
      732 0.054     47 0.092  0.062  --  R:
      413 0.030     22 0.043  0.051  --  M:
     2155 0.158     98 0.193  0.045  --  E:
      475 0.035     20 0.039  0.041  --  N:
      205 0.015      7 0.014  0.033  --  2:
     2302 0.169     11 0.022  0.005  --  8:
     3797 0.279      6 0.012  0.002  --  O:
        8 0.001      0 0.000  0.021  -?  L:

Note that "G:" changed from "++" to "--" when we excluded the "sure"
patterns. Otherwise there was no change.
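
The change of mark for "G:" can be rechecked from the two pairs of
counts alone. The sketch below assumes a POSIX shell with awk; the
wrapper name "classify_mark" is ours (hypothetical), and its awk
function repeats the conditions of "classify" given earlier, with
the two pairs of complementary tests folded together and the
unreachable "!!" branch omitted.

```shell
# classify_mark NT NL [nmin mw mc]   (hypothetical wrapper)
# Prints the "mk" code for one pattern. The defaults are the values
# used in the text: nmin=25, mw=5, mc=40.
classify_mark() {
  awk -v nt="$1" -v nl="$2" \
      -v nmin="${3:-25}" -v mw="${4:-5}" -v mc="${5:-40}" '
    function classify(NT, NL, NMIN, MW, MC) {
      if (NT < NMIN)               # rare pattern: mark is only a hint
        { return (2*MW*(NL+1) < NT+MC) ? "-?" : "+?" }
      if (MW*(NL+1) < NMIN)        # NL practically zero
        { return "oo" }
      if ((NL-1) > NT - NMIN)      # NL practically NT
        { return "##" }
      if (MW*(NL-1) > NT - NMIN)   # NL practically maximum expected
        { return "||" }
      return (2*MW*(NL+1) < NT+MC) ? "--" : "++"
    }
    BEGIN { print classify(nt+0, nl+0, nmin+0, mw+0, mc+0) }'
}

classify_mark 3781 402   # "G:" in the full sample
classify_mark 3514 291   # "G:" in the reduced sample
```

With the full-sample counts 2*mw*(NL+1) = 4030 exceeds NT+mc = 3821,
giving "++"; with the reduced counts 2920 falls below 3554, giving
"--".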
Now let's look again at the line-initial characters:

    cat .voyn-tt-1-1-x.grm \
      | sed -e 's/^.:\(.\)$/:\1/g' \
      | sort | uniq -c | expand \
      | compute-freqs \
      > .voyn-tt-0-1-x.frq

    cat .voyn-nl-1-1-x.grm \
      | sed -e 's/^.:\(.\)$/:\1/g' \
      | sort | uniq -c | expand \
      | compute-freqs \
      > .voyn-nl-0-1-x.frq

    compare-freqs \
        .voyn-tt-0-1-x.frq \
        .voyn-nl-0-1-x.frq \
      | compute-count-ratio \
          -v nmin=25 -v mw=5 -v mc=40 \
      | sort -b +0.0 -0.2r +5 -6 +4 -5nr +0 -1nr \
      > .voyn-tt-nl-0-1-x.cmp

     tot occurs   at newline  ratio  mk  group
    -----------  -----------  -----  --  -----------
      660 0.048     94 0.185  0.136  ++  :8
     1657 0.122    185 0.363  0.110  ++  :4
      825 0.061     51 0.100  0.060  --  :H
     1819 0.134     52 0.102  0.029  --  :O
     2373 0.174     56 0.110  0.024  --  :G
      976 0.072     21 0.041  0.022  --  :S
     1754 0.129     26 0.051  0.015  --  :E
     1177 0.086     14 0.028  0.012  --  :T
     1907 0.140      7 0.014  0.004  --  :D
      462 0.034      3 0.006  0.008  oo  :R

The last line says: if we have already decided to suppress word
breaks after [4ACDFHIPSTZ], and break after [K], then we might as
well suppress breaks before "R", since we would be making fewer than
25 errors because of that decision.

Now let's look at 2-char patterns, with one character on either side
of the line break. First, let's omit the patterns already fixed:

    cat .voyn-tt-1-1-x.grm \
      | sort | uniq -c | expand \
      | compute-freqs \
      > .voyn-tt-1-1-x.frq

    cat .voyn-nl-1-1-x.grm \
      | sort | uniq -c | expand \
      | compute-freqs \
      > .voyn-nl-1-1-x.frq

    compare-freqs \
        .voyn-tt-1-1-x.frq \
        .voyn-nl-1-1-x.frq \
      | compute-count-ratio \
          -v nmin=10 -v mw=5 -v mc=40 \
      | sort -b +0.0 -0.2r +5 -6 +4 -5nr +0 -1nr \
      > .voyn-tt-nl-1-1-x.cmp

Note that we have lowered nmin to 10, because there are many more
patterns.

The results are better examined in tabular form:

    cat .voyn-tt-nl-1-1-x.cmp \
      | print-pattern-classes \
          -v rowchars='O28ERMNGL6' \
          -v colchars='4E8RDHGSTO'

           4  E  8  R  D  H  G  S  T  O
          -- -- -- -- -- -- -- -- -- --
      O | -- oo oo oo oo oo oo -- oo ||
      2 | -? -? -?  . -? -? oo -- oo --
      8 | -- -- -? -? -? -? oo oo oo --
      E | || -- ++ oo -- -- -- -- -- --
      R | || +? || -? -? +? -- -- -- --
      M | || -? -- -? -? -? -- -- -- --
      N | || -? -- -?  . -? || -- -- --
      G | -- -- ++ -- -- || || -- -- --
      L |  . -?  .  .  .  . -?  . -? -?
      6 | -?  . -?  .  . -?  . -?  . -?

The table only shows pairs that still occur at least once. The table
says that, after we have decided to split at

    K:.  .:[2P],

and not to split at

    [4ACDFHILPSTZ]:.  .:[6ACFIKLMNZ],

we can also break at

    [ERMN]:[4]  [NG]:[HG]  R:8  O:O

and suppress breaks at

    [O82]:[8RDHGT]  8:S  E:R  O:E

If we have to choose, it is best to suppress breaks at

    [L6]:.  2:[4E]  [RMN]:[RD]  [MN]:[HE]

and perhaps to break at

    R:[EH]

Here are some particular entries:

     tot occurs   at newline  ratio  mk  group
    -----------  -----------  -----  --  -----------
        5 0.000      2 0.004  0.067  -?  2:4
      135 0.010      2 0.004  0.017  --  2:O
       20 0.001      1 0.002  0.033  --  2:S
       18 0.001      0 0.000  0.017  oo  2:T
       25 0.002      4 0.008  0.077  --  8:4
       16 0.001      2 0.004  0.054  --  8:E
      100 0.007      2 0.004  0.021  --  8:O
      312 0.023      1 0.002  0.006  --  E:D
       47 0.003      6 0.012  0.080  --  E:E
      126 0.009     12 0.024  0.078  --  E:G
       71 0.005      5 0.010  0.054  --  E:H
     1377 0.101    100 0.196  0.071  --  G:4
      123 0.009      5 0.010  0.037  --  G:D
      323 0.024     11 0.022  0.033  --  G:E
      150 0.011     34 0.067  0.184  ||  G:H
      123 0.009      2 0.004  0.018  --  G:R
       15 0.001      2 0.004  0.055  --  O:4
       10 0.001      0 0.000  0.020  oo  O:T
       11 0.001      1 0.002  0.039  --  O:S

Pattern 2:S is actually very similar to 2:T. Pattern 2:O is also
almost a non-break.

Note that E:D is almost a non-break, whereas E:H is only moderately
unlikely. As we remarked, the difference is probably a calligraphic
effect.

Pattern O:S is actually very close to O:T; they are just above the
"nmin" threshold.

Let's reanalyze the 1:1 patterns without eliminating the 1:0 and 0:1
extremal cases.

    cat .voyn-tt-2.grm \
      | sed -e 's/^\(.\)\(.\)$/\1:\2/g' \
      > .voyn-tt-1-1.grm

    cat .voyn.fsg \
      | tr -d ' /=' \
      | sed -e 's/^\(..\).*\(..\)$/\1\2/g' \
      | tr -s '\012' ':' \
      | enum-ngraphs -v n=3 \
      | egrep -v '\*' \
      | egrep '^.:.$' \
      > .voyn-nl-1-1.grm

    cat .voyn-tt-1-1.grm \
      | sort | uniq -c | expand \
      | compute-freqs \
      > .voyn-tt-1-1.frq

    cat .voyn-nl-1-1.grm \
      | sort | uniq -c | expand \
      | compute-freqs \
      > .voyn-nl-1-1.frq

    compare-freqs \
        .voyn-tt-1-1.frq \
        .voyn-nl-1-1.frq \
      | compute-count-ratio \
          -v nmin=10 -v mw=5 -v mc=40 \
      | sort -b +0.0 -0.2r +5 -6 +4 -5nr +0 -1nr \
      > .voyn-tt-nl-1-1.cmp

Summarizing:

    cat .voyn-tt-nl-1-1.cmp \
      | print-pattern-classes \
          -v rowchars='AI4FPDHCTSZ2L68OKMNREG' \
          -v colchars='A6KLMNIZFC2PEDHSTR4G8O'

           A  6  K  L  M  N  I  Z  F  C  2  P  4  E  8  R  D  H  G  S  T  O
          -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --
      A |  .  . oo  . oo oo oo  .  .  .  . -?  . oo  . oo  .  .  .  .  .  .
      I |  .  . -? -?  .  . oo  .  .  .  .  .  . -?  . oo  .  .  .  .  .  .
      4 |  .  .  .  .  .  .  .  . -? oo -? -? -?  .  .  . -? -?  . -?  . oo
      D | oo  .  .  . -?  . -? oo  . oo  .  .  . oo -? -?  .  . oo oo oo oo
      H | oo  .  .  . -?  . -? oo  . oo -?  .  . -? -?  . -?  . oo oo oo oo
      P | oo  .  .  .  .  .  . oo  . -?  .  .  .  . -?  .  .  . -? oo oo oo
      F | -?  .  .  .  .  .  . -?  . -?  .  .  .  .  .  .  .  .  . -? oo -?
      S | oo  .  .  .  .  .  .  .  . oo -?  . -? -? oo -? oo oo oo  . -? oo
      T | oo  .  .  .  .  .  .  . -? oo -? oo  . oo oo -? oo oo oo -? -? oo
      C | oo -?  . -?  .  .  .  . -? oo oo oo -? -? -- -? oo oo oo -? oo oo
      Z | oo  .  .  .  .  .  .  .  . oo -?  .  .  . oo  .  .  . oo -? -? --
      L |  .  .  .  .  .  .  .  .  .  .  .  .  . -?  .  .  .  . -?  . -? -?
      6 | -?  .  .  .  .  .  .  .  .  . -?  . -?  . -?  .  . -?  . -?  . -?
      K |  .  .  .  .  .  .  .  .  .  . ## -? ## -? +?  .  . -? -? -? -? +?
      2 | oo  .  .  .  .  . -?  .  . -? -? -? -? -? -?  . -? -? oo -- oo --
      8 | oo  .  . -?  .  . -?  . -? oo -?  . -- -- -? -? -? -? oo oo oo --
      O | -? -? -? -? oo -? -?  . oo oo ## oo -- oo oo oo oo oo oo -- oo ||
      E | oo  . -?  .  .  .  .  . -? -- || || || -- ++ oo -- -- -- -- -- --
      R | oo -?  .  .  .  . -?  . -? -? ## ## || +? || -? -? +? -- -- -- --
      M | -?  .  .  .  .  . -?  . -? -? ## -? || -? -- -? -? -? -- -- -- --
      N | -?  .  .  .  .  .  .  .  . -? ## -? || -? -- -?  . -? || -- -- --
      G | oo -? -? -?  .  . -?  . -? -? || || -- -- ++ -- -- || || -- -- --

Note that this table says that some single-character
break-suppressing patterns that we had found before, like C:8 Z:O
E:C, are actually weakly possible:

     tot occurs   at newline  ratio  mk  group
    -----------  -----------  -----  --  -----------
       11 0.000      1 0.001  0.039  --  Z:O
     1899 0.065      1 0.001  0.001  --  C:8
       20 0.001      1 0.001  0.033  --  E:C

Pattern C:8 should obviously have been "oo"; the classifier needs
more work. Also Z:O and E:C are practically "-?".

Also, the breaking pattern K:. is not optimal: we can probably do
better by breaking at K:[248O] and suppressing breaks at other
combinations.

Similarly, the breaking pattern .:[2P] is not optimal; it seems
safer to break at

    [KOERMNG]:2, [ERG]:P

but suppress breaks at

    [OMN]:P, [28]:[2P]

Let's look closely at the ++ and -- entries:

     tot occurs   at newline  ratio  mk  group
    -----------  -----------  -----  --  -----------
      197 0.007     23 0.030  0.101  ++  E:8
      321 0.011     55 0.072  0.155  ++  G:8

These seem legitimately ambiguous.
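
One way to see why these two patterns are ambiguous is to vary the
guess for mw and watch the estimated break probability mw*NL/NT.
The sketch below assumes a POSIX shell with awk; the helper name
"break_prob" is ours (hypothetical), and it uses the raw counts
rather than the smoothed ratio (NL+1)/(NT+mc).

```shell
# break_prob NT NL mw   (hypothetical helper)
# Estimated probability that an occurrence of the pattern is a true
# word break, i.e. mw*NL/NT, capped at 1.
break_prob() {
  LC_ALL=C awk -v nt="$1" -v nl="$2" -v mw="$3" 'BEGIN {
    p = mw * nl / nt
    if (p > 1) p = 1
    printf "%.2f\n", p
  }'
}

# Try a few plausible values of mw for both patterns.
for mw in 3 4 5 6; do
  echo "mw=$mw  E:8 $(break_prob 197 23 "$mw")  G:8 $(break_prob 321 55 "$mw")"
done
```

Under these assumptions E:8 crosses the 1/2 threshold only between
mw=4 and mw=5, while G:8 stays above it for every mw from 3 up, so
the verdict on E:8 in particular depends on how well we know mw.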

     tot occurs   at newline  ratio  mk  group
    -----------  -----------  -----  --  -----------
       20 0.001      1 0.001  0.033  --  2:S
     1899 0.065      1 0.001  0.001  --  C:8
       20 0.001      1 0.001  0.033  --  E:C
      312 0.011      1 0.001  0.006  --  E:D
       14 0.000      1 0.001  0.037  --  M:G
      115 0.004      1 0.001  0.013  --  M:T
      105 0.004      1 0.001  0.014  --  N:S
      118 0.004      1 0.001  0.013  --  N:T
       11 0.000      1 0.001  0.039  --  O:S
       11 0.000      1 0.001  0.039  --  Z:O

      135 0.005      2 0.003  0.017  --  2:O
       16 0.001      2 0.003  0.054  --  8:E
      100 0.003      2 0.003  0.021  --  8:O
      348 0.012      2 0.003  0.008  --  E:S
      503 0.017      2 0.003  0.006  --  E:T
      123 0.004      2 0.003  0.018  --  G:R
       91 0.003      2 0.003  0.023  --  M:S
       15 0.001      2 0.003  0.055  --  O:4
      155 0.005      2 0.003  0.015  --  R:S
      130 0.004      2 0.003  0.018  --  R:T

       25 0.001      4 0.005  0.077  --  8:4
       47 0.002      6 0.008  0.080  --  E:E
      126 0.004     12 0.016  0.078  --  E:G
       71 0.002      5 0.007  0.054  --  E:H
      384 0.013      5 0.007  0.014  --  E:O
     1377 0.047    100 0.132  0.071  --  G:4
      123 0.004      5 0.007  0.037  --  G:D
      323 0.011     11 0.014  0.033  --  G:E
      585 0.020     29 0.038  0.048  --  G:O
      201 0.007     11 0.014  0.050  --  G:S
      241 0.008      8 0.011  0.032  --  G:T
       27 0.001      2 0.003  0.045  --  M:8
      130 0.004      3 0.004  0.024  --  M:O
       34 0.001      4 0.005  0.068  --  N:8
      172 0.006      3 0.004  0.019  --  N:O
       61 0.002      4 0.005  0.050  --  R:G
      291 0.010      5 0.007  0.018  --  R:O

It seems that the "--" of the first group above could be changed to
"oo" without much harm; each is likely to generate 5 mistakes
(omission of a true break). Each pattern in the second group would
generate about 10 mistakes. Perhaps fewer in some cases like E:T,
since we aren't taking into account the possibility of accidental
line breaking inside words. The patterns in the third group seem
legitimate ambiguities.

Here is the table again, with the 1st and 2nd groups above manually
changed to "-?":

           A  6  K  L  M  N  I  Z  F  C  2  P  E  D  H  S  T  R  4  G  8  O
          -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --
      A |  .  . oo  . oo oo oo  .  .  .  . -? oo  .  .  .  . oo  .  .  .  .
      I |  .  . -? -?  .  . oo  .  .  .  .  . -?  .  .  .  . oo  .  .  .  .
      4 |  .  .  .  .  .  .  .  . -? oo -? -?  . -? -? -?  .  . -?  .  . oo
      F | -?  .  .  .  .  .  . -?  . -?  .  .  .  .  . -? oo  .  .  .  . -?
      P | oo  .  .  .  .  .  . oo  . -?  .  .  .  .  . oo oo  .  . -? -? oo
      D | oo  .  .  . -?  . -? oo  . oo  .  . oo  .  . oo oo -?  . oo -? oo
      H | oo  .  .  . -?  . -? oo  . oo -?  . -? -?  . oo oo  .  . oo -? oo
      C | oo -?  . -?  .  .  .  . -? oo oo oo -? oo oo -? oo -? -? oo -? oo
      T | oo  .  .  .  .  .  .  . -? oo -? oo oo oo oo -? -? -?  . oo oo oo
      S | oo  .  .  .  .  .  .  .  . oo -?  . -? oo oo  . -? -? -? oo oo oo
      Z | oo  .  .  .  .  .  .  .  . oo -?  .  .  .  . -? -?  .  . oo oo -?
      2 | oo  .  .  .  .  . -?  .  . -? -? -? -? -? -? -? oo  . -? oo -? -?
      L |  .  .  .  .  .  .  .  .  .  .  .  . -?  .  .  . -?  .  . -?  . -?
      6 | -?  .  .  .  .  .  .  .  .  . -?  .  .  . -? -?  .  . -?  . -? -?
      8 | oo  .  . -?  .  . -?  . -? oo -?  . -? -? -? oo oo -? -- oo -? -?
      O | -? -? -? -? oo -? -?  . oo oo ## oo oo oo oo -? oo oo -? oo oo ||
      K |  .  .  .  .  .  .  .  .  .  . ## -? -?  . -? -? -?  . ## -? +? +?
      M | -?  .  .  .  .  . -?  . -? -? ## -? -? -? -? -? -? -? || -? -- --
      N | -?  .  .  .  .  .  .  .  . -? ## -? -?  . -? -? -? -? || || -- --
      R | oo -?  .  .  .  . -?  . -? -? ## ## +? -? +? -? -? -? || -- || --
      E | oo  . -?  .  .  .  .  . -? -? || || -- -? -- -? -? oo || -- ++ --
      G | oo -? -? -?  .  . -?  . -? -? || || -- -- || -- -- -? -- || ++ --

Here are the tentative word-breaking rules, derived from this table:

    9. Suppress breaks at

         [24AIDHPFSTCZ6L]:.  .:[6AKLMNIZFCR]
         [8O]:[8G]  O:4  8:2  [8KMNO]:[PEDHSTR]

    8. Insert break at

         K:[8O]  [KERMN]:[24]  [OG]:2  [ERG]:P
         R:[E8]  G:H  [NG]:G  O:O

For the remaining cases, we must look at digraphs on either side of
line breaks. First, let's prepare files of tetragrams with 2 chars
on each side of the "cursor", discarding the unambiguous
single-character patterns that we have found already.

    cat .voyn.fsg \
      | tr -d ' /=\012' \
      | enum-ngraphs -v n=4 \
      | egrep -v '\*' \
      > .voyn-tt-4.grm

    cat .voyn-tt-4.grm \
      | sed -e 's/^\(..\)\(..\)$/\1:\2/g' \
      | egrep -v '[24AIDHPFSTCZ6L]:' \
      | egrep -v ':[6AKLMNIZFCR]' \
      | egrep -v '[8O]:[8G]|O:4|8:2|[8KMNO]:[PEDHSTR]' \
      | egrep -v 'K:[8O]|[KERMN]:[24]|[OG]:2|[ERG]:P|R:[E8]|G:H|[NG]:G|O:O' \
      > .voyn-tt-2-2-x.grm

    cat .voyn.fsg \
      | tr -d ' /=' \
      | sed -e 's/^\(..\).*\(..\)$/\1\2/g' \
      | tr -s '\012' ':' \
      | enum-ngraphs -v n=5 \
      | egrep -v '\*' \
      | egrep '^..:..$' \
      | egrep -v '[24AIDHPFSTCZ6L]:' \
      | egrep -v ':[6AKLMNIZFCR]' \
      | egrep -v '[8O]:[8G]|O:4|8:2|[8KMNO]:[PEDHSTR]' \
      | egrep -v 'K:[8O]|[KERMN]:[24]|[OG]:2|[ERG]:P|R:[E8]|G:H|[NG]:G|O:O' \
      > .voyn-nl-2-2-x.grm

Now let's first look at 2-character word-end patterns:

    cat .voyn-tt-2-2-x.grm \
      | sed -e 's/\(..\):..$/\1:/g' \
      | sort | uniq -c | expand \
      | compute-freqs \
      > .voyn-tt-2-0-x.frq

    cat .voyn-nl-2-2-x.grm \
      | sed -e 's/^\(..\):..$/\1:/g' \
      | sort | uniq -c | expand \
      | compute-freqs \
      > .voyn-nl-2-0-x.frq

    compare-freqs \
        .voyn-tt-2-0-x.frq \
        .voyn-nl-2-0-x.frq \
      | compute-count-ratio \
          -v nmin=7 -v mw=5 -v mc=40 \
      | sort -b +0.0 -0.2r +5 -6 +4 -5nr +0 -1nr \
      > .voyn-tt-nl-2-0-x.cmp

Here are the results. These seem to be sure word-ends:

     tot occurs   at newline  ratio  mk  group
    -----------  -----------  -----  --  -----------
       90 0.014     37 0.118  0.292  ||  EG:
       44 0.007     19 0.061  0.238  ||  RG:
       44 0.007      9 0.029  0.119  ||  TG:
       10 0.002      3 0.010  0.080  ||  MG:

The ones below seem possible but not certain. Note that DG: and HG:
are actually quite similar.

     tot occurs   at newline  ratio  mk  group
    -----------  -----------  -----  --  -----------
      138 0.022     19 0.061  0.112  ++  DG:
       75 0.012     10 0.032  0.096  --  HG:
     1764 0.280     97 0.310  0.054  --  8G:
      214 0.034     12 0.038  0.051  --  OR:
      160 0.025      6 0.019  0.035  --  AM:
       45 0.007      2 0.006  0.035  --  C8:
      438 0.069     15 0.048  0.033  --  AE:
      203 0.032      7 0.022  0.033  --  AN:
       26 0.004      1 0.003  0.030  --  SG:
     1149 0.182     32 0.102  0.028  --  OE:
       40 0.006      1 0.003  0.025  --  G8:
       44 0.007      1 0.003  0.024  --  EE:
      181 0.029      4 0.013  0.023  --  ZG:
      713 0.113     15 0.048  0.021  --  CG:
       79 0.013      1 0.003  0.017  --  GR:
      296 0.047      4 0.013  0.015  --  GE:
      303 0.048      3 0.010  0.012  --  AR:

These actually seem to be "+?"; the classifier is wrong:

     tot occurs   at newline  ratio  mk  group
    -----------  -----------  -----  --  -----------
        3 0.000      3 0.010  0.093  -?  AK:
        2 0.000      2 0.006  0.071  -?  O8:
        1 0.000      1 0.003  0.049  -?  KE:
        3 0.000      1 0.003  0.047  -?  LE:
        4 0.001      1 0.003  0.045  -?  2G:
        5 0.001      1 0.003  0.044  -?  PG:
        6 0.001      1 0.003  0.043  -?  TE:

These truly deserve "-?":

     tot occurs   at newline  ratio  mk  group
    -----------  -----------  -----  --  -----------
        8 0.001      1 0.003  0.042  --  NG:
        9 0.001      1 0.003  0.041  --  DE:
       11 0.002      1 0.003  0.039  --  ER:
       16 0.003      1 0.003  0.036  --  OG:
       15 0.002      1 0.003  0.036  --  E8:
        2 0.000      0 0.000  0.024  -?  8R:
        2 0.000      0 0.000  0.024  -?  CE:
        2 0.000      0 0.000  0.024  -?  KG:
        2 0.000      0 0.000  0.024  -?  RR:
        2 0.000      0 0.000  0.024  -?  TR:
        1 0.000      0 0.000  0.024  -?  2E:
        1 0.000      0 0.000  0.024  -?  68:
        1 0.000      0 0.000  0.024  -?  DM:
        1 0.000      0 0.000  0.024  -?  DR:
        1 0.000      0 0.000  0.024  -?  K8:
        1 0.000      0 0.000  0.024  -?  P8:
        1 0.000      0 0.000  0.024  -?  S8:
        4 0.001      0 0.000  0.023  -?  IE:
        4 0.001      0 0.000  0.023  -?  NE:
        4 0.001      0 0.000  0.023  -?  R8:
        3 0.000      0 0.000  0.023  -?  HE:
        3 0.000      0 0.000  0.023  -?  M8:
        3 0.000      0 0.000  0.023  -?  ME:
        3 0.000      0 0.000  0.023  -?  N8:
        3 0.000      0 0.000  0.023  -?  ON:
        3 0.000      0 0.000  0.023  -?  T8:
        3 0.000      0 0.000  0.023  -?  Z8:
        6 0.001      0 0.000  0.022  -?  CR:
        6 0.001      0 0.000  0.022  -?  RE:
        5 0.001      0 0.000  0.022  -?  SE:

Finally, these are non-enders:

     tot occurs   at newline  ratio  mk  group
    -----------  -----------  -----  --  -----------
       10 0.002      0 0.000  0.020  oo  OM:
       12 0.002      0 0.000  0.019  oo  8E:
       30 0.005      0 0.000  0.014  oo  IR:
       49 0.008      0 0.000  0.011  oo  GG:

In tabular form:

    always break:   [ERTM]G:.. AK:.. O8:.. [KLT]E:..
    likely break:   [DH2P]G:..
    unlikely break: [CG]8:.. [EGO]E:.. [8CSZ]G:.. A[EMNRGO]:..
    never break:    OM:.. 8E:.. IR:.. GG:..

Now the 2 characters after line-start:

    cat .voyn-tt-2-2-x.grm \
      | sed -e 's/^..:\(..\)$/:\1/g' \
      | sort | uniq -c | expand \
      | compute-freqs \
      > .voyn-tt-0-2-x.frq

    cat .voyn-nl-2-2-x.grm \
      | sed -e 's/^..:\(..\)$/:\1/g' \
      | sort | uniq -c | expand \
      | compute-freqs \
      > .voyn-nl-0-2-x.frq

    compare-freqs \
        .voyn-tt-0-2-x.frq \
        .voyn-nl-0-2-x.frq \
      | compute-count-ratio \
          -v nmin=7 -v mw=5 -v mc=40 \
      | sort -b +0.0 -0.2r +5 -6 +4 -5nr +0 -1nr \
      > .voyn-tt-nl-0-2-x.cmp

The ones below seem sure word-starts:

     tot occurs   at newline  ratio  mk  group
    -----------  -----------  -----  --  -----------
        7 0.001      3 0.010  0.085  ##  :HS
       34 0.005     23 0.073  0.324  ||  :8S
       22 0.003     14 0.045  0.242  ||  :8T
       17 0.003      7 0.022  0.140  ||  :GS
       16 0.003      6 0.019  0.125  ||  :GD
       13 0.002      5 0.016  0.113  ||  :GT
        9 0.001      3 0.010  0.082  ||  :HT
        9 0.001      2 0.006  0.061  ||  :8E

The ones below seem probable word-starts:

     tot occurs   at newline  ratio  mk  group
    -----------  -----------  -----  --  -----------
      302 0.048     35 0.112  0.105  ++  :8A
       60 0.010      8 0.026  0.090  --  :8O
        5 0.001      2 0.006  0.067  -?  :HO
        6 0.001      2 0.006  0.065  -?  :OS

The ones below seem possible word-starts:

     tot occurs   at newline  ratio  mk  group
    -----------  -----------  -----  --  -----------
     1376 0.218     99 0.316  0.071  --  :4O
       26 0.004      3 0.010  0.061  --  :SO
      128 0.020      9 0.029  0.060  --  :ET
       61 0.010      5 0.016  0.059  --  :EO
       15 0.002      2 0.006  0.055  --  :TA
       18 0.003      2 0.006  0.052  --  :SA
       27 0.004      2 0.006  0.045  --  :DO
       27 0.004      2 0.006  0.045  --  :TO
       59 0.009      3 0.010  0.040  --  :ES
       13 0.002      1 0.003  0.038  --  :O8
       15 0.002      1 0.003  0.036  --  :GH
       21 0.003      1 0.003  0.033  --  :DT
       25 0.004      1 0.003  0.031  --  :G2
      323 0.051     10 0.032  0.030  --  :OD
      721 0.114     21 0.067  0.029  --  :OE
       31 0.005      1 0.003  0.028  --  :HC
      290 0.046      7 0.022  0.024  --  :OH
       45 0.007      1 0.003  0.024  --  :TD
      572 0.091     10 0.032  0.018  --  :SC
      187 0.030      2 0.006  0.013  --  :OR
      131 0.021      1 0.003  0.012  --  :8G
      670 0.106      7 0.022  0.011  --  :TC
      206 0.033      1 0.003  0.008  --  :DC

The ones below are uncertain:

     tot occurs   at newline  ratio  mk  group
    -----------  -----------  -----  --  -----------
       36 0.006      0 0.000  0.013  oo  :G4
       32 0.005      0 0.000  0.014  oo  :G8
       29 0.005      0 0.000  0.014  oo  :DG
       25 0.004      0 0.000  0.015  oo  :SD
       25 0.004      0 0.000  0.015  oo  :TG
       23 0.004      0 0.000  0.016  oo  :E8
       22 0.003      0 0.000  0.016  oo  :S8
       21 0.003      0 0.000  0.016  oo  :TH
       19 0.003      0 0.000  0.017  oo  :GO
       19 0.003      0 0.000  0.017  oo  :HA
       17 0.003      0 0.000  0.018  oo  :SH
       16 0.003      0 0.000  0.018  oo  :SG
       13 0.002      0 0.000  0.019  oo  :EA
       13 0.002      0 0.000  0.019  oo  :GE
       11 0.002      0 0.000  0.020  oo  :DS
       11 0.002      0 0.000  0.020  oo  :GG
       11 0.002      0 0.000  0.020  oo  :OM
        9 0.001      0 0.000  0.020  oo  :EG
        9 0.001      1 0.003  0.041  --  :O4
        8 0.001      0 0.000  0.021  oo  :4C
        8 0.001      0 0.000  0.021  oo  :EH
        8 0.001      0 0.000  0.021  oo  :TP
        7 0.001      0 0.000  0.021  oo  :8C
        7 0.001      0 0.000  0.021  oo  :OC
        7 0.001      0 0.000  0.021  oo  :OF
        7 0.001      0 0.000  0.021  oo  :OO
        7 0.001      0 0.000  0.021  oo  :TE
        7 0.001      1 0.003  0.043  --  :O2
        6 0.001      0 0.000  0.022  -?  :OK
        6 0.001      1 0.003  0.043  -?  :DZ
        6 0.001      2 0.006  0.065  -?  :OS
        5 0.001      0 0.000  0.022  -?  :4H
        5 0.001      0 0.000  0.022  -?  :8D
        5 0.001      0 0.000  0.022  -?  :ON
        5 0.001      0 0.000  0.022  -?  :T2
        5 0.001      1 0.003  0.044  -?  :OT
        5 0.001      3 0.010  0.089  -?  :4D
        4 0.001      0 0.000  0.023  -?  :DE
        4 0.001      0 0.000  0.023  -?  :E2
        4 0.001      0 0.000  0.023  -?  :ER
        4 0.001      0 0.000  0.023  -?  :OG
        4 0.001      0 0.000  0.023  -?  :SE
        4 0.001      1 0.003  0.045  -?  :84
        3 0.000      0 0.000  0.023  -?  :EE
        3 0.000      0 0.000  0.023  -?  :EP
        3 0.000      0 0.000  0.023  -?  :HZ
        3 0.000      0 0.000  0.023  -?  :O6
        3 0.000      0 0.000  0.023  -?  :OI
        3 0.000      0 0.000  0.023  -?  :TR
        2 0.000      0 0.000  0.024  -?  :42
        2 0.000      0 0.000  0.024  -?  :4P
        2 0.000      0 0.000  0.024  -?  :EC
        2 0.000      0 0.000  0.024  -?  :EF
        2 0.000      0 0.000  0.024  -?  :GC
        2 0.000      0 0.000  0.024  -?  :GP
        2 0.000      0 0.000  0.024  -?  :GR
        2 0.000      0 0.000  0.024  -?  :ST
        2 0.000      0 0.000  0.024  -?  :TS
        1 0.000      0 0.000  0.024  -?  :4F
        1 0.000      0 0.000  0.024  -?  :82
        1 0.000      0 0.000  0.024  -?  :88
        1 0.000      0 0.000  0.024  -?  :8H
        1 0.000      0 0.000  0.024  -?  :8L
        1 0.000      0 0.000  0.024  -?  :8R
        1 0.000      0 0.000  0.024  -?  :DI
        1 0.000      0 0.000  0.024  -?  :E4
        1 0.000      0 0.000  0.024  -?  :EK
        1 0.000      0 0.000  0.024  -?  :GA
        1 0.000      0 0.000  0.024  -?  :H8
        1 0.000      0 0.000  0.024  -?  :HD
        1 0.000      0 0.000  0.024  -?  :HE
        1 0.000      0 0.000  0.024  -?  :HG
        1 0.000      0 0.000  0.024  -?  :OA
        1 0.000      0 0.000  0.024  -?  :S2
        1 0.000      0 0.000  0.024  -?  :SR
        1 0.000      0 0.000  0.024  -?  :TF
        1 0.000      0 0.000  0.024  -?  :TT
        1 0.000      1 0.003  0.049  -?  :44
        1 0.000      1 0.003  0.049  -?  :4S
        1 0.000      1 0.003  0.049  -?  :DR

The ones below seem invalid as word-starts:

     tot occurs   at newline  ratio  mk  group
    -----------  -----------  -----  --  -----------
       44 0.007      0 0.000  0.012  oo  :T8
       42 0.007      0 0.000  0.012  oo  :OP
       49 0.008      0 0.000  0.011  oo  :ED
      134 0.021      0 0.000  0.006  oo  :DA

In tabular form:

    always break:   :H[ST] :8[STE] :G[STD]
    likely break:   :8[AO] :HO :OS
    unlikely break: :4O :8G :D[COT] :E[OST] :G[2H] :HC :O[8DHER] :S[ACO] :T[ACDO]
    never break:    :T8 :OP :ED :DA
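
To use this last table mechanically, the four rows can be written
as shell glob patterns. The sketch below is only an illustration: the
function name "start_class" and the "undecided" fallback are ours
(hypothetical), and the patterns are copied from the table above
without any further checking.

```shell
# start_class GROUP   (hypothetical helper)
# Looks up a 2-character line-start group, with or without its leading
# ":", and prints the row of the table above that it falls in.
start_class() {
  g=${1#:}                      # accept ":8S" as well as "8S"
  case "$g" in
    H[ST]|8[STE]|G[STD])
      echo "always" ;;
    8[AO]|HO|OS)
      echo "likely" ;;
    4O|8G|D[COT]|E[OST]|G[2H]|HC|O[8DHER]|S[ACO]|T[ACDO])
      echo "unlikely" ;;
    T8|OP|ED|DA)
      echo "never" ;;
    *)
      echo "undecided" ;;       # group not covered by the table
  esac
}

start_class :8S
start_class :DA
start_class :4O
```

Groups that were left among the uncertain entries simply fall
through to "undecided".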