Let's recompute the frequency of each glyph, excluding garbage, without classification:

    cat .voyn.glp \
      | sed \
          -e 's:([/ _=]*):@:g' \
          -e 's/)(/)@(/g' \
      | tr '@' '\n' \
      | egrep '.' \
      | grep -v '\*' \
      | sort | uniq -c | expand \
      | sort -k1.9,1.99 \
      | compute-freqs \
      > .voyn-glyphs.frq

There are 763 lines, and a total of 17561 glyphs by the current parsing. Hence the average number of glyphs per line is 23.0.

If we believe that most Bio lines are just paragraph continuation lines, then the line-initial and line-final "glyphs" should reflect the beginnings and endings of true words.

If we take the number of words per line in .voyn.fsg (9.44) as basically correct, then a glyph that is strictly word-final in the VMs language should have 1/9.44 of its occurrences in line-final position. The probability that a given occurrence of a glyph is a word-end can therefore be estimated as

    Pwe = 9.44*NLE/NT,

where NLE is the number of occurrences at line-end and NT is the total number of occurrences. The probability Pws of it being a word-start is estimated the same way from the number NLS of occurrences at line-start. (An estimate above 1 means that the glyph is more common at line-end, or line-start, than this model predicts.)

Let's compute the frequencies of glyphs at line-end and word-end (where the word division is taken from the transcription file):

    cat .voyn.glp \
      | sed \
          -e 's:([_/=]*)::g' \
          -e 's:([^)]*)$:-@&@:g' \
      | tr '@' '\012' \
      | egrep -v -e '-$' \
      | egrep '.' \
      | egrep -v '\*' \
      | sort | uniq -c | expand \
      | sort -k1.1,1.7nr \
      | compute-freqs \
      > .voyn-line-fin-g.frq

    cat .voyn.glp \
      | sed \
          -e 's:([/=]*)::g' \
          -e 's:(_[(_)]*)$:-:g' \
          -e 's:\(([^)]*)\)(_):-@\1@:g' \
      | tr '@' '\012' \
      | egrep -v -e '-$' \
      | egrep -v '\*' \
      | egrep '.' \
      | egrep -v '^\(_\)' \
      | sort | uniq -c | expand \
      | sort -k1.1,1.7nr \
      | compute-freqs \
      > .voyn-word-fin-g.frq

    compare-freqs \
        .voyn-glyphs.frq \
        .voyn-line-fin-g.frq \
        .voyn-word-fin-g.frq \
      | gawk '/[0-9]\.[0-9]/ {printf "%-63s %5.2f\n", $0, (9.44*$5/$3)}' \
      > .voyn-fin.cmp

    tot occurs   line ends    word ends    glyph       Pwe
    -----------  -----------  -----------  --------- -----
       43 0.002     40 0.053      2 0.000  (ak)       8.78
       10 0.001      8 0.011      1 0.000  (K?)       7.55
       12 0.001      8 0.011      3 0.001  (6?)       6.29
        2 0.000      1 0.001      1 0.000  (IIIK?)    4.72
        2 0.000      1 0.001      1 0.000  (IIK?)     4.72
     1726 0.098    218 0.287   1308 0.230  (g)        1.19
      286 0.016     30 0.039    178 0.031  (or)       0.99
      557 0.032     54 0.071    343 0.060  (ae)       0.92
     2055 0.117    184 0.242   1818 0.320  (bg)       0.85
       23 0.001      2 0.003     20 0.004  (M?)       0.82
      414 0.024     31 0.041    374 0.066  (am)       0.71
      405 0.023     25 0.033    313 0.055  (ar)       0.58
      495 0.028     29 0.038    451 0.079  (an)       0.55
       17 0.001      1 0.001      7 0.001  (qor)      0.56
     1152 0.066     57 0.075    490 0.086  (oe)       0.47
      193 0.011      9 0.012     81 0.014  (qoe)      0.44
      186 0.011      8 0.011     39 0.007  (r)        0.41
      456 0.026     18 0.024     34 0.006  (e)        0.37
      365 0.021     11 0.014     57 0.010  (z)        0.28
      685 0.039     12 0.016     61 0.011  (b)        0.17
       30 0.002      0 0.000      5 0.001  (4?)       0.00
       10 0.001      0 0.000      0 0.000  (I?)       0.00
       10 0.001      0 0.000      0 0.000  (II?)      0.00
        2 0.000      0 0.000      2 0.000  (IIL?)     0.00
        6 0.000      0 0.000      1 0.000  (L?)       0.00
        7 0.000      0 0.000      6 0.001  (N?)       0.00
       28 0.002      0 0.000     24 0.004  (air?)     0.00
     1141 0.065      0 0.000     14 0.002  (d)        0.00
      412 0.023      0 0.000      0 0.000  (dc)       0.00
      440 0.025      0 0.000      1 0.000  (dcc)      0.00
      147 0.008      0 0.000      1 0.000  (dz)       0.00
       52 0.003      0 0.000      0 0.000  (dzc)      0.00
       32 0.002      0 0.000      1 0.000  (f)        0.00
        3 0.000      0 0.000      0 0.000  (fz)       0.00
        1 0.000      0 0.000      0 0.000  (fzc)      0.00
      513 0.029      0 0.000      4 0.001  (h)        0.00
      210 0.012      0 0.000      0 0.000  (hc)       0.00
      129 0.007      0 0.000      0 0.000  (hcc)      0.00
       91 0.005      1 0.001      0 0.000  (hz)       0.10
       30 0.002      0 0.000      0 0.000  (hzc)      0.00
      885 0.050      8 0.011     16 0.003  (o)        0.09
      195 0.011      0 0.000      4 0.001  (p)        0.00
       12 0.001      0 0.000      0 0.000  (pz)       0.00
        9 0.001      0 0.000      0 0.000  (pzc)      0.00
     1436 0.082      1 0.001     10 0.002  (qo)       0.01
      212 0.012      0 0.000      4 0.001  (s)        0.00
      716 0.041      1 0.001      0 0.000  (sc)       0.01
      150 0.009      0 0.000      1 0.000  (scc)      0.00
      400 0.023      0 0.000      1 0.000  (t)        0.00
      927 0.053      0 0.000      1 0.000  (tc)       0.00
      126 0.007      0 0.000      1 0.000  (tcc)      0.00

The statistics for line-ends roughly agree with those of word-ends.

By the Pwe criterion, the glyphs that are almost certain to be word-ends are

    tot occurs   line ends    word ends    glyph       Pwe
    -----------  -----------  -----------  --------- -----
     1726 0.098    218 0.287   1308 0.230  (g)        1.19
      286 0.016     30 0.039    178 0.031  (or)       0.99

The following glyphs are anomalous in that they occur at line-end (rarely, except for (ak)) but hardly ever at other word-ends. Thus they may be abbreviations, or continuation signs, or truncated words.

    tot occurs   line ends    word ends    glyph       Pwe
    -----------  -----------  -----------  --------- -----
       43 0.002     40 0.053      2 0.000  (ak)       8.78
       12 0.001      8 0.011      3 0.001  (6?)       6.29
       10 0.001      8 0.011      1 0.000  (K?)       7.55

The following glyphs appear to be very likely, but not certain, word-ends (about 75-90% chance):

    tot occurs   line ends    word ends    glyph       Pwe
    -----------  -----------  -----------  --------- -----
     2055 0.117    184 0.242   1818 0.320  (bg)       0.85
      557 0.032     54 0.071    343 0.060  (ae)       0.92
      414 0.024     31 0.041    374 0.066  (am)       0.71

The following glyphs appear to be likely, but not certain, word-ends (about 50% chance):

    tot occurs   line ends    word ends    glyph       Pwe
    -----------  -----------  -----------  --------- -----
      405 0.023     25 0.033    313 0.055  (ar)       0.58
      495 0.028     29 0.038    451 0.079  (an)       0.55
     1152 0.066     57 0.075    490 0.086  (oe)       0.47
      193 0.011      9 0.012     81 0.014  (qoe)      0.44
       17 0.001      1 0.001      7 0.001  (qor)      0.56

The following glyphs MAY occur at end-of-word, but apparently occur more often in non-word-final position:

    tot occurs   line ends    word ends    glyph       Pwe
    -----------  -----------  -----------  --------- -----
      685 0.039     12 0.016     61 0.011  (b)        0.17
      365 0.021     11 0.014     57 0.010  (z)        0.28
      456 0.026     18 0.024     34 0.006  (e)        0.37
      186 0.011      8 0.011     39 0.007  (r)        0.41

Numerically, (air?) would not seem to be a valid word-end, but structurally it is like (ar), and its occurrences suggest it should be allowed:

    tot occurs   line ends    word ends    glyph       Pwe
    -----------  -----------  -----------  --------- -----
       28 0.002      0 0.000     24 0.004  (air?)     0.00

Finally, these glyphs do NOT seem to be valid word-ends:

    tot occurs   line ends    word ends    glyph       Pwe
    -----------  -----------  -----------  --------- -----
       30 0.002      0 0.000      5 0.001  (4?)       0.00
      885 0.050      8 0.011     16 0.003  (o)        0.09
     1436 0.082      1 0.001     10 0.002  (qo)       0.01
     1141 0.065      0 0.000     14 0.002  (d)        0.00
      412 0.023      0 0.000      0 0.000  (dc)       0.00
      440 0.025      0 0.000      1 0.000  (dcc)      0.00
      147 0.008      0 0.000      1 0.000  (dz)       0.00
       52 0.003      0 0.000      0 0.000  (dzc)      0.00
      513 0.029      0 0.000      4 0.001  (h)        0.00
      210 0.012      0 0.000      0 0.000  (hc)       0.00
      129 0.007      0 0.000      0 0.000  (hcc)      0.00
       91 0.005      1 0.001      0 0.000  (hz)       0.10
       30 0.002      0 0.000      0 0.000  (hzc)      0.00
      212 0.012      0 0.000      4 0.001  (s)        0.00
      716 0.041      1 0.001      0 0.000  (sc)       0.01
      150 0.009      0 0.000      1 0.000  (scc)      0.00
      400 0.023      0 0.000      1 0.000  (t)        0.00
      927 0.053      0 0.000      1 0.000  (tc)       0.00
      126 0.007      0 0.000      1 0.000  (tcc)      0.00
       38 0.002      2 0.003      0 0.000  (c?)       0.50
       48 0.003      0 0.000      0 0.000  (cc?)      0.00
       29 0.002      0 0.000      0 0.000  (ccc?)     0.00
       32 0.002      0 0.000      1 0.000  (f)        0.00
        3 0.000      0 0.000      0 0.000  (fz)       0.00
        1 0.000      0 0.000      0 0.000  (fzc)      0.00
      195 0.011      0 0.000      4 0.001  (p)        0.00
       12 0.001      0 0.000      0 0.000  (pz)       0.00
        9 0.001      0 0.000      0 0.000  (pzc)      0.00

Let's now look at the frequencies of line-starts and word-starts:

    cat .voyn.glp \
      | sed \
          -e 's/(_)//g' \
          -e 's:^([^)]*):@&@-:g' \
      | tr '@' '\012' \
      | egrep -v '^-' \
      | egrep -v '\*' \
      | egrep '.' \
      | egrep -v '^\(//\)$' \
      | sort | uniq -c | expand \
      | sort -k1.1,1.7nr \
      | compute-freqs \
      > .voyn-line-ini-g.frq

    cat .voyn.glp \
      | sed \
          -e 's:([/=]*)::g' \
          -e 's:^(_):-:g' \
          -e 's:(_)\(([^)]*)\):@\1@-:g' \
      | tr '@' '\012' \
      | egrep -v '^-' \
      | egrep -v '\*' \
      | egrep -v '^\(_\)$' \
      | egrep '.' \
      | sort | uniq -c | expand \
      | sort -k1.1,1.7nr \
      | compute-freqs \
      > .voyn-word-ini-g.frq

    compare-freqs \
        .voyn-glyphs.frq \
        .voyn-line-ini-g.frq \
        .voyn-word-ini-g.frq \
      | gawk '/[0-9]\.[0-9]/ {printf "%-63s %5.2f\n", $0, (9.44*$5/$3)}' \
      > .voyn-ini.cmp

    tot occurs   line start   word start   glyph       Pws
    -----------  -----------  -----------  --------- -----
      365 0.021    138 0.181    140 0.025  (z)        3.57
      195 0.011     71 0.093     22 0.004  (p)        3.44
       17 0.001      5 0.007     11 0.002  (qor)      2.78
       30 0.002      6 0.008     22 0.004  (4?)       1.89
      685 0.039    103 0.135    350 0.062  (b)        1.42
     1436 0.082    169 0.221   1251 0.220  (qo)       1.11
      193 0.011     18 0.024    172 0.030  (qoe)      0.88
      513 0.029     44 0.058     50 0.009  (h)        0.81
       32 0.002      2 0.003      4 0.001  (f)        0.59
      456 0.026     26 0.034    340 0.060  (e)        0.54
      885 0.050     32 0.042    678 0.119  (o)        0.34
     1726 0.098     59 0.077     79 0.014  (g)        0.32
      212 0.012      7 0.009    129 0.023  (s)        0.31
      186 0.011      4 0.005    130 0.023  (r)        0.20
      716 0.041     15 0.020    445 0.078  (sc)       0.20
     1152 0.066     23 0.030    543 0.096  (oe)       0.19
      400 0.023      5 0.007    205 0.036  (t)        0.12
      927 0.053     11 0.014    457 0.080  (tc)       0.11
       38 0.002      2 0.003     11 0.002  (c?)       0.50
       48 0.003      1 0.001      9 0.002  (cc?)      0.20
       29 0.002      0 0.000      3 0.001  (ccc?)     0.00
      210 0.012      3 0.004     27 0.005  (hc)       0.13
      129 0.007      3 0.004      9 0.002  (hcc)      0.22
       30 0.002      1 0.001      6 0.001  (hzc)      0.31
       12 0.001      1 0.001      4 0.001  (pz)       0.79
       91 0.005      1 0.001      9 0.002  (hz)       0.10
      286 0.016      3 0.004     96 0.017  (or)       0.10
      147 0.008      1 0.001      3 0.001  (dz)       0.06
      126 0.007      1 0.001     84 0.015  (tcc)      0.07
      150 0.009      1 0.001     96 0.017  (scc)      0.06
     1141 0.065      5 0.007     67 0.012  (d)        0.04
      412 0.023      1 0.001     15 0.003  (dc)       0.02
        1 0.000      0 0.000      0 0.000  (fzc)      0.00
        2 0.000      0 0.000      0 0.000  (IIIK?)    0.00
        2 0.000      0 0.000      0 0.000  (IIK?)     0.00
        2 0.000      0 0.000      0 0.000  (IIL?)     0.00
        3 0.000      0 0.000      2 0.000  (fz)       0.00
        6 0.000      0 0.000      2 0.000  (L?)       0.00
        7 0.000      0 0.000      0 0.000  (N?)       0.00
        9 0.001      0 0.000      2 0.000  (pzc)      0.00
       10 0.001      0 0.000      0 0.000  (II?)      0.00
       10 0.001      0 0.000      0 0.000  (K?)       0.00
       10 0.001      0 0.000      1 0.000  (I?)       0.00
       12 0.001      0 0.000      1 0.000  (6?)       0.00
       23 0.001      0 0.000      0 0.000  (M?)       0.00
       28 0.002      0 0.000      1 0.000  (air?)     0.00
       43 0.002      0 0.000      5 0.001  (ak)       0.00
       52 0.003      0 0.000      3 0.001  (dzc)      0.00
      405 0.023      0 0.000     28 0.005  (ar)       0.00
      414 0.024      0 0.000     36 0.006  (am)       0.00
      440 0.025      0 0.000     14 0.002  (dcc)      0.00
      495 0.028      0 0.000     17 0.003  (an)       0.00
      557 0.032      0 0.000     39 0.007  (ae)       0.00
     2055 0.117      1 0.001     62 0.011  (bg)       0.00

On the other hand, there are noticeable discrepancies between the statistics of line-starts and word-starts.
The following glyphs can be taken as sure word-starts:

    tot occurs   line start   word start   glyph       Pws
    -----------  -----------  -----------  --------- -----
      365 0.021    138 0.181    140 0.025  (z)        3.57
      685 0.039    103 0.135    350 0.062  (b)        1.42
       30 0.002      6 0.008     22 0.004  (4?)       1.89
     1436 0.082    169 0.221   1251 0.220  (qo)       1.11

The following glyphs are word-starters with high probability:

    tot occurs   line start   word start   glyph       Pws
    -----------  -----------  -----------  --------- -----
       17 0.001      5 0.007     11 0.002  (qor)      2.78
      193 0.011     18 0.024    172 0.030  (qoe)      0.88

These are word-starters with fair probability (~50%):

    tot occurs   line start   word start   glyph       Pws
    -----------  -----------  -----------  --------- -----
      456 0.026     26 0.034    340 0.060  (e)        0.54
     1726 0.098     59 0.077     79 0.014  (g)        0.32

The (o) glyph is structurally like (qo) and hence should be a word-start with high probability. However, its frequency as line-start is rather low. Perhaps it is not a word-start after all.

    tot occurs   line start   word start   glyph       Pws
    -----------  -----------  -----------  --------- -----
      885 0.050     32 0.042    678 0.119  (o)        0.34

These MAY occur as word-start, but appear to be fairly common in mid-word too:

    tot occurs   line start   word start   glyph       Pws
    -----------  -----------  -----------  --------- -----
      186 0.011      4 0.005    130 0.023  (r)        0.20
     1152 0.066     23 0.030    543 0.096  (oe)       0.19
      286 0.016      3 0.004     96 0.017  (or)       0.10
      212 0.012      7 0.009    129 0.023  (s)        0.31
      716 0.041     15 0.020    445 0.078  (sc)       0.20
      400 0.023      5 0.007    205 0.036  (t)        0.12
      927 0.053     11 0.014    457 0.080  (tc)       0.11

It would appear that glyph (d) is unpopular as a line-start, whereas (h) is a word-start with high probability (~80%). However, if (d) and (h) are equivalent, the difference can easily be explained as a calligraphic effect. In that case the estimated probability of (d)/(h) being a word-start drops to 0.28, which is more reasonable. The same can be said about the other gallows, except that they are less probable as word-starts.
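The pooled (d)/(h) figure can be reproduced by summing the counts of the two glyphs before applying the same formula, Pws = 9.44*NLS/NT. The following is only a minimal sketch: the function name `pool_pws` and its three-column input format (NT, NLS, glyph) are hypothetical conveniences and are not part of compute-freqs or compare-freqs. The counts are copied from the (h) and (d) rows of the line-start table.

```shell
#!/bin/sh
# Pool the counts of several glyphs and re-estimate Pws = WPL * NLS / NT.
# Input lines: NT NLS glyph  (total occurrences, line-start occurrences, label).
# WPL (words per line) is the optional first argument; it defaults to 9.44,
# the figure quoted for .voyn.fsg.
pool_pws() {
  awk -v wpl="${1:-9.44}" '
    { nt += $1; nls += $2 }                       # accumulate both counts
    END { if (nt > 0) printf "%d %d %.2f\n", nt, nls, wpl * nls / nt }
  '
}

# Counts for (h) and (d); prints "1654 49 0.28".
printf '%s\n' '513 44 (h)' '1141 5 (d)' | pool_pws
```

Feeding a single row returns that glyph's own estimate, so `printf '%s\n' '513 44 (h)' | pool_pws` gives 0.81, as in the table.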
    tot occurs   line start   word start   glyph       Pws
    -----------  -----------  -----------  --------- -----
      513 0.029     44 0.058     50 0.009  (h)        0.81
     1141 0.065      5 0.007     67 0.012  (d)        0.04
     1654 0.094     49 0.065    117 0.021  ([dh])     0.28

      210 0.012      3 0.004     27 0.005  (hc)       0.13
      412 0.023      1 0.001     15 0.003  (dc)       0.02
      622 0.035      4 0.005     42 0.008  ([dh]c)    0.06

      129 0.007      3 0.004      9 0.002  (hcc)      0.22
      440 0.025      0 0.000     14 0.002  (dcc)      0.00
      569 0.032      4 0.005     13 0.004  ([dh]cc)   0.07

       91 0.005      1 0.001      9 0.002  (hz)       0.10
      147 0.008      1 0.001      3 0.001  (dz)       0.06
      238 0.013      2 0.002     12 0.003  ([dh]z)    0.08

       30 0.002      1 0.001      6 0.001  (hzc)      0.31
       52 0.003      0 0.000      3 0.001  (dzc)      0.00
       82 0.005      1 0.001      9 0.001  ([dh]zc)   0.12

Thus, an HD-gallows MAY be a word-start, but is not likely to be.

The gallows (p) and (f) can be taken as probable word-starts. The (p) sign is in fact anomalous, in that it occurs more often at line-start than at (internal) word-start. This observation is consistent with the theory that (p) and (f) are ornate "capitals" and hence common as paragraph-starts.

    tot occurs   line start   word start   glyph       Pws
    -----------  -----------  -----------  --------- -----
      195 0.011     71 0.093     22 0.004  (p)        3.44
       32 0.002      2 0.003      4 0.001  (f)        0.59
      227 0.013     73 0.096     26 0.005  ([fp])     3.04

The other FP-gallows are possibly word-starts, but are so rare that we should not trust them:

    tot occurs   line start   word start   glyph       Pws
    -----------  -----------  -----------  --------- -----
       12 0.001      1 0.001      4 0.001  (pz)       0.79
        3 0.000      0 0.000      2 0.000  (fz)       0.00
       15 0.001      1 0.001      6 0.001  ([fp]z)    0.63

        1 0.000      0 0.000      0 0.000  (fzc)      0.00
        9 0.001      0 0.000      2 0.000  (pzc)      0.00
       10 0.001      0 0.000      2 0.000  ([fp]zc)   0.00

The unattached groups (c?), (cc?), and (ccc?) have a small probability of being word-starts, but since they are likely to be errors, we should not trust them.

    tot occurs   line start   word start   glyph       Pws
    -----------  -----------  -----------  --------- -----
       38 0.002      2 0.003     11 0.002  (c?)       0.50
       48 0.003      1 0.001      9 0.002  (cc?)      0.20
       29 0.002      0 0.000      3 0.001  (ccc?)     0.00

Finally, these glyphs seem NOT to be valid word-starts:

    tot occurs   line start   word start   glyph       Pws
    -----------  -----------  -----------  --------- -----
      150 0.009      1 0.001     96 0.017  (scc)      0.06
      126 0.007      1 0.001     84 0.015  (tcc)      0.07
       43 0.002      0 0.000      5 0.001  (ak)       0.00
      405 0.023      0 0.000     28 0.005  (ar)       0.00
      414 0.024      0 0.000     36 0.006  (am)       0.00
      495 0.028      0 0.000     17 0.003  (an)       0.00
      557 0.032      0 0.000     39 0.007  (ae)       0.00
     2055 0.117      1 0.001     62 0.011  (bg)       0.00
        2 0.000      0 0.000      0 0.000  (IIIK?)    0.00
        2 0.000      0 0.000      0 0.000  (IIK?)     0.00
        2 0.000      0 0.000      0 0.000  (IIL?)     0.00
        6 0.000      0 0.000      2 0.000  (L?)       0.00
        7 0.000      0 0.000      0 0.000  (N?)       0.00
       10 0.001      0 0.000      0 0.000  (II?)      0.00
       10 0.001      0 0.000      0 0.000  (K?)       0.00
       10 0.001      0 0.000      1 0.000  (I?)       0.00
       12 0.001      0 0.000      1 0.000  (6?)       0.00
       23 0.001      0 0.000      0 0.000  (M?)       0.00
       28 0.002      0 0.000      1 0.000  (air?)     0.00

By the way, note that the following glyphs appear to occur often at word-start, IF we trust the word divisions as transcribed by FSG/Currier. However, if we consider only line-starts, their frequency drops considerably. Hence they are probably false word-starts:

    tot occurs   line start   word start   glyph       Pws
    -----------  -----------  -----------  --------- -----
     2055 0.117      1 0.001     62 0.011  (bg)       0.00
     1141 0.065      5 0.007     67 0.012  (d)        0.04
      286 0.016      3 0.004     96 0.017  (or)       0.10
      186 0.011      4 0.005    130 0.023  (r)        0.20
      212 0.012      7 0.009    129 0.023  (s)        0.31
      150 0.009      1 0.001     96 0.017  (scc)      0.06
      400 0.023      5 0.007    205 0.036  (t)        0.12
      126 0.007      1 0.001     84 0.015  (tcc)      0.07
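One way to put a number on that drop is to divide a glyph's relative frequency among line-starts by its relative frequency among transcribed word-starts, using the two frequency columns already printed in the tables. This is only an illustrative sketch: the ratio is not a statistic used elsewhere in these notes, and the function name `start_ratio` and its input format are hypothetical. The (qo) row is included as a contrast case, since it was classified above as a sure word-start.

```shell
#!/bin/sh
# Ratio of a glyph's relative frequency among line-starts (FLS) to its
# relative frequency among transcribed word-starts (FWS).
# Input lines: FLS FWS glyph. Rows with FWS = 0 are skipped.
start_ratio() {
  awk '$2 > 0 { printf "%s %.2f\n", $3, $1 / $2 }'
}

# Frequencies copied from the line-start tables.
printf '%s\n' \
  '0.221 0.220 (qo)' \
  '0.001 0.011 (bg)' \
  '0.009 0.023 (s)' \
  '0.001 0.015 (tcc)' \
  | start_ratio
# prints:
#   (qo) 1.00
#   (bg) 0.09
#   (s) 0.39
#   (tcc) 0.07
```

A ratio near 1 means the glyph is about as common at line-start as at transcribed word-start; the suspect glyphs fall well below that. Since the input frequencies are rounded to three decimals, the ratios for rare glyphs are coarse.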