Some interesting patterns are apparent above, but things may become
clearer if we remove the garbage.

Meanwhile, here is a count of the digraphs in the "good" words
(counting repeated words):

                  o     c     t     i     q     l |     v     x     y     j     g     s   TOT
        ----- ----- ----- ----- ----- ----- -----   ----- ----- ----- ----- ----- ----- -----
            0  1363  2574     0   502  1874   107 |     0     0     0     0     0     0  6420
    o      34     4   112     0  1680   635  1430 |     0     0     0     0     0     0  3895
    c       7   172  3922  1447  2002   197   291 |     0     0  3764    12  2728  1445 15987
    t       7    96  2696     0    23    18    24 |     0     0     0     0     0     0  2864
    i       6     5    31     0  3395     3     3 |   943  2349     0    57     0   913  7705
    q       5  1622    35     0     0     2     4 |     0     0     0   969   215     0  2852
    l       0     0     0     0     0     0     0 |     0     0     0  2185    36     0  2221
        ----- ----- ----- ----- ----- ----- -----   ----- ----- ----- ----- ----- ----- -----
    v     912    10    18     0     3     0     0 |     0     0     0     0     0     0   943
    x    1085   159   757     0    22    50   276 |     0     0     0     0     0     0  2349
    y    3510     7    72     0    37    66    72 |     0     0     0     0     0     0  3764
    j      84   138  2658   319    23     0     1 |     0     0     0     0     0     0  3223
    g      78   123  2729    25    14     2     8 |     0     0     0     0     0     0  2979
    s     692   196   383  1073     4     5     5 |     0     0     0     0     0     0  2358
        ----- ----- ----- ----- ----- ----- -----   ----- ----- ----- ----- ----- ----- -----
    TOT  6420  3895 15987  2864  7705  2852  2221 |   943  2349  3764  3223  2979  2358 57560

Next-symbol probabilities (× 99):

                  o     c     t     i     q     l |     v     x     y     j     g     s   TOT
        ----- ----- ----- ----- ----- ----- -----   ----- ----- ----- ----- ----- ----- -----
            .    21    40     .     8    29     2 |     .     .     .     .     .     .    99
    o       1     .     3     .    43    16    36 |     .     .     .     .     .     .    99
    c       .     1    24     9    12     1     2 |     .     .    23     .    17     9    99
    t       .     3    93     .     1     1     1 |     .     .     .     .     .     .    99
    i       .     .     .     .    44     .     . |    12    30     .     1     .    12    99
    q       .    56     1     .     .     .     . |     .     .     .    34     7     .    99
    l       .     .     .     .     .     .     . |     .     .     .    97     2     .    99
        ----- ----- ----- ----- ----- ----- -----   ----- ----- ----- ----- ----- ----- -----
    v      96     1     2     .     .     .     . |     .     .     .     .     .     .    99
    x      46     7    32     .     1     2    12 |     .     .     .     .     .     .    99
    y      92     .     2     .     1     2     2 |     .     .     .     .     .     .    99
    j       3     4    82    10     1     .     . |     .     .     .     .     .     .    99
    g       3     4    91     1     .     .     . |     .     .     .     .     .     .    99
    s      29     8    16    45     .     .     . |     .     .     .     .     .     .    99
        ----- ----- ----- ----- ----- ----- -----   ----- ----- ----- ----- ----- ----- -----
    TOT    11     7    27     5    13     5     4 |     2     4     6     6     5     4 57560

Previous-symbol probabilities (× 99):

                  o     c     t     i     q     l |     v     x     y     j     g     s   TOT
        ----- ----- ----- ----- ----- ----- -----   ----- ----- ----- ----- ----- ----- -----
            .    35    16     .     6    65     5 |     .     .     .     .     .     .    11
    o       1     .     1     .    22    22    64 |     .     .     .     .     .     .     7
    c       .     4    24    50    26     7    13 |     .     .    99     .    91    61    27
    t       .     2    17     .     .     1     1 |     .     .     .     .     .     .     5
    i       .     .     .     .    44     .     . |    99    99     .     2     .    38    13
    q       .    41     .     .     .     .     . |     .     .     .    30     7     .     5
    l       .     .     .     .     .     .     . |     .     .     .    67     1     .     4
        ----- ----- ----- ----- ----- ----- -----   ----- ----- ----- ----- ----- ----- -----
    v      14     .     .     .     .     .     . |     .     .     .     .     .     .     2
    x      17     4     5     .     .     2    12 |     .     .     .     .     .     .     4
    y      54     .     .     .     .     2     3 |     .     .     .     .     .     .     6
    j       1     4    16    11     .     .     . |     .     .     .     .     .     .     6
    g       1     3    17     1     .     .     . |     .     .     .     .     .     .     5
    s      11     5     2    37     .     .     . |     .     .     .     .     .     .     4
        ----- ----- ----- ----- ----- ----- -----   ----- ----- ----- ----- ----- ----- -----
    TOT    99    99    99    99    99    99    99 |    99    99    99    99    99    99 57560

Note that the stroke `l' is always followed by either `j' or `g',
hence `lj' and `lg' should be single letters.

Note also that there are two clearly different kinds of strokes,
"body" B = {`c',`o',`t',`i',`q',`l'} and "limb" L =
{`v',`x',`y',`j',`g',`s'}. If we reduce the digraph count matrix to
these two classes, plus word break W, we get

    cat bio-c-jsa-gut.wds \
      | tr 'cotiqlvxyjgs' 'BBBBBBLLLLLL' \
      | count-digraph-freqs

Digraph counts:

                  B     L
        ----- ----- -----
            .  6420     .
    B      59 19849 15616
    L    6361  9255     .

Next-symbol probabilities (× 99):

                  B     L
        ----- ----- -----
            .    99     .
    B       .    55    44
    L      40    59     .

Previous-symbol probabilities (× 99):

                  B     L
        ----- ----- -----
            .    18     .
    B       1    55    99
    L      98    26     .

Note that every word begins with a body stroke; this was expected
from the definition of the limb strokes (they can be recognized only
by their relationship to a previous stroke).

Note also that a limb stroke cannot be followed by another limb
stroke; this too is not wholly unexpected.
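The body/limb reduction above can be sketched in a few lines of Python. This is only an illustration: the `count-digraph-freqs` tool itself is not shown in these notes, so the function names, the "." stand-in for the word break, and the toy word list are assumptions of the sketch, not the tool's actual behaviour.

```python
from collections import Counter

BODY = set("cotiql")   # body strokes, as listed in the text
LIMB = set("vxyjgs")   # limb strokes, as listed in the text
BREAK = "."            # stand-in symbol for the word break (assumption)

def stroke_class(ch):
    """Map a stroke to 'B' (body) or 'L' (limb); keep anything else as is."""
    if ch in BODY:
        return "B"
    if ch in LIMB:
        return "L"
    return ch

def count_digraphs(words, reduce_classes=False):
    """Count adjacent symbol pairs, with a word break on each side of a word."""
    counts = Counter()
    for word in words:
        symbols = [stroke_class(c) if reduce_classes else c for c in word]
        padded = [BREAK] + symbols + [BREAK]
        for a, b in zip(padded, padded[1:]):
            counts[a, b] += 1
    return counts

def next_symbol_probs(counts):
    """Turn pair counts into P(next | current), one dict per current symbol."""
    row_totals = Counter()
    for (a, _), n in counts.items():
        row_totals[a] += n
    probs = {}
    for (a, b), n in counts.items():
        probs.setdefault(a, {})[b] = n / row_totals[a]
    return probs

words = ["qoljcy", "cgciix", "oix"]   # toy sample, not the real word list
bl = count_digraphs(words, reduce_classes=True)
probs = next_symbol_probs(bl)
```

With the real word list, one word per line, `count_digraphs(open(path).read().split(), reduce_classes=True)` should give counts comparable to the 3x3 table, provided the tool pads words with breaks the same way.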
The surprise is that almost no words *end* in a body stroke. The least rare body stroke in word-final position is `o'. Here are all the words that end in body strokes, in context. (The "<<" marks the error) ctix ois cstcoixo << // ciix ciix ctccgciiivcgci << // qoljciiiv qjcy ixcstcgcyqo << // qoqjcccgcy ixo << // oljccgciis cgci << // qoqjciix ctoixo << // cgciiscy cgciixo << // cgcy cgciiiiv ctcqjt << // oljcccg qoljciixoiso << // qoljciix qoljccgcy ixo << // cstcccgcy ixctccgcy ixc << // cstccgcy qoljcyisixcstc << // cstcljtcy oisciiisciiso << // isctccgcy qoqjcccgcy ixo << // qoljcccy qoqjcccgcy ixoc << // ljctoix qjctciixoixljcc << cylgctccy isctcs ctcc << oixcs ciiiiv csljciix oixcstccy oixcstcc << qoixcstcccy qoqjcccy cyctcccgcy csciiivo oix ctc << cgcy qoljciis csciiiiv cstccgcci << cstccqjtcy ctccy ixctcgcy cgctcgcy cgci << qciis ixljocgciix oqgci << ljoisoixis // qcqjtcccg cstcgcy qci << oixljcccy cgciiiv // o << cgctcccgcy qoixctccy cyctccciis o << oiiiv occcgcy ctccy qoljciiiv o << isoiiiiv // ciiiv isciiiv o << ljciiv ctixciiiiiv // cycstccis ciiiiv o << cyljcccgcy qoljcccgcy // csciiiv oljci*o << ctccgcy ixljctccgcy cstccqjtcy qoljcio << ixois // cgcicljtcy ixljciijo << cyljcccy ixcstccy cstccyljcccgcy ixljo << oqgctccgcy qoix oixciiiv cstcoix qo << qoljciix cstcivix // qoixctccy qo << ixctccg qoixljcy ctcccgcy qo << ctciis ciiiiv // oljcccy ixctccgcy qo << oixciiiv oqj // cycstccgcy qo << ois oljciiiv cstcqjtcy ctcocy qo << qoljccy cgciixciiiiv ctcccgcy qoix oqo << qoljciiiv oixcstcciij // qocgctccg oixqo << cgciis ctccljto ixoixoix cycstcccyqo << isciis oix ctcccy oixqo cgciis ctccljto << ixoixoix // ixcsto << qoljccy ixcstccgcy cyctcccgcy csciiivo << oix ctc cgcy qoljciis qoljctcgcy ctcqjccy ixo << qoljccgcy qoljciiv qoljctcgcy ctccy ixo << ctcljcy oix qois *cccy ixctccy ixo << cycgciiiv cstccy csciiiisciix csciix cgciixo << qjciiiv cgciiscy cgciixo // oisoix ccccsciix oixo << qjcoix oiscy // // q << ljciiiv cstccqjcy qoljciiiv // q << ljccccy 
cstccgcy qoljcccgcy // q << qoqjccgcstccgcy ctqjciiis oqgctccgcy q << qjciix ctccgcy ctcqgctccgcy csciix ctccgcy cstcq << lgctois qoqjcicsoixljcy qoljciis cstcccgcy ixct << cstoljciiiv ct* // ctccy ctcljt << cstcoqccccy ctoix // cgciiiiv cst << qoixctccy oqjciixcy qoljciix cst << oixcstciixcscy // // qoljccst << qoljccgcy oqjccgcy // cgciiiv ctccy ixcst << cgciiiiv ctccy //

Those of the first group appear to be interference by the line
break. (Note that the manuscript does not appear to use any
hyphenation mark. Either words are not broken across lines, which
would be unusual, or they are broken without any extra marks, which
would produce the line-final anomalies listed above.)

Those of the second group appear to be due to bogus word breaks in
the transcription (e.g. between the `q' and `l') or transcription
errors.

An interesting observation from the body/limb frequency tables above
is that the transition probabilities from a body stroke to a body
stroke and to a limb stroke are about 56% and 44%, respectively
(19849 and 15616 out of 35524). Thus, if the limb strokes mark the
end of a syllable (or letter?), the average number of body strokes in
a syllable is slightly over 2. (Considering that we are counting each
"i" as a body stroke, the correct number may well be precisely 2.)

I decided that, before spending more time on the analysis, I must
first prepare a "corrected" interlinear where discrepancies between
FSG and Currier are resolved taking into account the probabilities
above.

The idea is to make a dictionary of 5-tuples, and try to use it to
decide on the corrections. Namely, define the context of a letter
occurrence in a text as its four nearest letter occurrences. We can
represent a context in sed-like notation as wx.yz where the "." is
the position of the central letter. We scan some training text,
collecting for each possible context the frequency distribution of
the middle letter.
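A minimal Python sketch of this context dictionary follows. It also covers the rule-selection step (a central letter that outnumbers all the others combined, in a context seen often enough). The `generate-fix-patterns` script is not reproduced in these notes, so every name here is hypothetical and the details are assumptions, including the `min_occ` threshold, the use of "?" for an unreadable letter, and the treatment of the text as one plain string.

```python
from collections import Counter, defaultdict

def collect_contexts(text):
    """Map each context (two letters left, two right) to centre-letter counts."""
    table = defaultdict(Counter)
    for k in range(2, len(text) - 2):
        context = (text[k - 2:k], text[k + 1:k + 3])
        table[context][text[k]] += 1
    return table

def make_rules(table, min_occ=1):
    """Keep contexts seen at least min_occ times in which one centre letter
    is more frequent than all the other centre letters combined."""
    rules = {}
    for context, centres in table.items():
        total = sum(centres.values())
        letter, n = centres.most_common(1)[0]
        if total >= min_occ and 2 * n > total:
            rules[context] = letter
    return rules

def apply_rules(text, rules, unknown="?"):
    """Replace each unknown mark whose surrounding context has a rule."""
    out = list(text)
    for k in range(2, len(text) - 2):
        if text[k] == unknown:
            context = (text[k - 2:k], text[k + 1:k + 3])
            if context in rules:
                out[k] = rules[context]
    return "".join(out)

training = "qoljcy.qoljcy.qoljcy.qoljcg"   # toy training text, not real data
rules = make_rules(collect_contexts(training), min_occ=2)
fixed = apply_rules("qo?jcy", rules)
```

The test `2 * n > total` is the "more likely than all the others combined" condition; a plain plurality would be a weaker criterion and would fire in many more contexts.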
At the end, if, for a given context wx.yz, there is some central
letter t which is more likely than all the others combined, we output
a correction rule of the form wx?yz -> wxtyz.

So, here is the work. First, I generated a training data set:

    cat bio-m-evt.evt \
      | egrep '^<.*;[FC]> ' \
      | grep -v '[][%*_]' \
      | sed \
          -e 's/<.*;[FC]> */ /g' \
          -e 's/{[^}]*}//g' \
          -e 's/\!//g' \
      > .train.txt

     lines   words     bytes file
    ------ ------- --------- ------------
       858     858     42866 .train.txt

Next, I generated the correction patterns from it:

    cat .train.txt \
      | generate-fix-patterns -vMINOCC=10 \
      > .fixit.sed

     lines   words     bytes file
    ------ ------- --------- ------------
       592     688     10219 .fixit.sed

The parameter MINOCC is the minimum number of times a context must
occur before we try to generate a correction rule for it.

Next, I generated a "consensus" interlinear file:

    cat bio-m-evt.evt \
      | make-consensus-interlin \
      > bio-x-evt.evt

I extracted the consensus text from it:

    cat bio-x-evt.evt \
      | egrep '^<.*;J> ' \
      | sed \
          -e 's/{[^}]*}//g' \
          -e 's/[\!]//g' \
      > bio-j-evt-raw.evt

I applied the corrections:

    cat bio-j-evt-raw.evt \
      | sed -f .fixit.sed \
      > bio-j-evt.evt

Now let's extract the words and check how many good ones we got:

    cat bio-j-evt.evt \
      | sed \
          -e 's/<.*;[A-Z]> *//g' \
          -e 's/- *$/.\/\//g' \
          -e 's/= *$/.\/\/.=/g' \
      | tr '.' '\012' \
      | egrep '.' \
      > bio-j-evt.wds

    cat bio-j-evt.wds | sort | uniq -c | sort +0 -1nr > bio-j-evt.frq
    cat bio-j-evt.wds | sort | uniq > bio-j-evt.dic

     lines   words     bytes file
    ------ ------- --------- ------------
      7216    7216     39223 bio-j-evt.wds
      1761    1761     12154 bio-j-evt.dic

I extracted the good words:

    cat bio-j-evt.wds | grep -v '?' > bio-j-evt-gut.wds
    cat bio-j-evt-gut.wds | sort | uniq > bio-j-evt-gut.dic
    cat bio-j-evt-gut.wds | sort | uniq -c | sort +0 -1nr > bio-j-evt-gut.frq

     lines   words     bytes file
    ------ ------- --------- ------------
      6188    6188     31705 bio-j-evt-gut.wds
      1085    1085      6532 bio-j-evt-gut.dic

I created an automaton for bio-j-evt-gut.dic:

    cat bio-j-evt-gut.dic \
      | nice MaintainAutomaton \
          -add - \
          -dump bio-j-evt-gut.dmp

     strings  letters   states   finals     arcs  sub-sts lets/arc
    -------- -------- -------- -------- -------- -------- --------
        1085     5447      422       90     1263     1027    4.313

I looked for unproductive states:

    nice AutoAnalysis \
      -load bio-j-evt-gut.dmp \
      -unprod bio-j-evt-gut-1-unp.sts \
      -maxUnprod 1 \
      -unprodSugg bio-j-evt-gut-1-unp.sugg

    46 unproductive states
    46 strange words (with repetitions) listed
    31 strange words (without repetitions) listed

    // 2PTG 42OEDCC8G 4GDAM 4O4ODCCG 4ODGE88G 8ARTCC8AE 8OEDCC2OE
    CPAEOIR DOHAEG EODAK GCPAM GDT8AR HG4ODG HOROES28G HT8OEH8G
    ODAROEOK OEPODZG OGSCG OHTOHAR OSCPOE2 P8AESOR PGDC8G PODAN
    POECC8ARAL POEHCSOE PSAROE RTCAE8 TCIROR TETPSCCG TOEDCCCG

I removed these words and tried again:

    cat bio-j-evt-gut-1-unp.sugg \
      | sort -u \
      | bool 1-2 bio-j-evt-gut.dic - \
      > bio-j-evt-cln-1.dic

    cat bio-j-evt-cln-1.dic \
      | nice MaintainAutomaton \
          -add - \
          -dump bio-j-evt-cln-1.dmp

     strings  letters   states   finals     arcs  sub-sts lets/arc
    -------- -------- -------- -------- -------- -------- --------
        1054     5237      365       87     1176      950    4.453

    nice AutoAnalysis \
      -load bio-j-evt-cln-1.dmp \
      -unprod bio-j-evt-cln-1-unp.sts \
      -maxUnprod 1 \
      -unprodSugg bio-j-evt-cln-1-unp.sugg

    4 unproductive states
    4 strange words (with repetitions) listed
    3 strange words (without repetitions) listed

    42AN HSCODCC8G CPTG

I removed these and tried again:

    cat bio-j-evt-cln-1-unp.sugg \
      | sort -u \
      | bool 1-2 bio-j-evt-cln-1.dic - \
      > bio-j-evt-cln-2.dic

    cat bio-j-evt-cln-2.dic \
      | nice MaintainAutomaton \
          -add - \
          -dump bio-j-evt-cln-2.dmp

     strings  letters   states   finals     arcs  sub-sts lets/arc
    -------- -------- -------- -------- -------- -------- --------
        1051     5220      360       87     1167      944    4.473

    nice AutoAnalysis \
      -load bio-j-evt-cln-2.dmp \
      -unprod bio-j-evt-cln-2-unp.sts \
      -maxUnprod 1 \
      -unprodSugg bio-j-evt-cln-2-unp.sugg

    0 unproductive states
    0 strange words (with repetitions) listed
    0 strange words (without repetitions) listed

I recoded it into the "super-analytic" encoding, but this time
treating `qj', `qg', `lj', `lg' as single letters (`h', `k', `f', `p'
respectively):

    cat bio-j-evt.wds \
      | fsg2jsa \
      | jsa2hoc \
      > bio-j-hoc.wds

    cat bio-j-hoc.wds | sort | uniq > bio-j-hoc.dic
    cat bio-j-hoc.wds | sort | uniq -c | sort +0 -1nr > bio-j-hoc.frq

     lines   words     bytes file
    ------ ------- --------- ------------
      7216    7216     56394 bio-j-hoc.wds
      1761    1761     17523 bio-j-hoc.dic

Next I separated the good words:

    cat bio-j-hoc.wds \
      | egrep '^[a-z+^]*$' \
      > bio-j-hoc-gut.wds

    cat bio-j-hoc.dic \
      | egrep '^[a-z+^]*$' \
      > bio-j-hoc-gut.dic

    bool 1-2 bio-j-hoc.dic bio-j-hoc-gut.dic \
      > bio-j-hoc-bad.dic

     lines   words     bytes file
    ------ ------- --------- ------------
      5427    5427     44172 bio-j-hoc-gut.wds
      1083    1083      9958 bio-j-hoc-gut.dic
       678     678      7565 bio-j-hoc-bad.dic

Next I built the automaton:

    cat bio-j-hoc-gut.dic \
      | nice MaintainAutomaton \
          -add - \
          -dump bio-j-hoc-gut.dmp

     strings  letters   states   finals     arcs  sub-sts lets/arc
    -------- -------- -------- -------- -------- -------- --------
        1083     8875      701       91     1492     1258    5.948

Digraph statistics:

    cat bio-j-hoc-gut.wds \
      | count-digraph-freqs

Digraph counts:

              |     i     o     c     t     q     f     p     h     k |     v     j     x     s     y     g   TOT
        ----- + ----- ----- ----- ----- ----- ----- ----- ----- ----- + ----- ----- ----- ----- ----- ----- -----
            . |   396  1146  2210     .  1398    94     1   112    70 |     .     .     .     .     .     .  5427
        ----- + ----- ----- ----- ----- ----- ----- ----- ----- ----- + ----- ----- ----- ----- ----- ----- -----
    i       4 |  2248     2     8     .     .     2     .     .     . |   497    40  1979   650     .     .  5430
    o      19 |  1371     1    69     .     5  1190     8   455    60 |     .     .     .     .     .     .  3178
    c       1 |  1367   150  3487  1201     .   245     2   134    25 |     .     .     .  1187  3301  2408 13508
    t       4 |    17    73  2320     .     .    14     1    10     3 |     .     .     .     .     .     .  2442
    q       1 |     .  1383    21     .     .     1     .     .     . |     .     .     .     .     .     .  1406
    f       6 |     5    47  1543   180     .     .     .     .     . |     .     .     .     .     .     .  1781
    p       . |     .     2    15     1     .     .     .     .     . |     .     .     .     .     .     .    18
    h       3 |     2    41   606   103     .     .     .     .     . |     .     .     .     .     .     .   755
    k       2 |     .    38   111    14     .     .     .     .     . |     .     .     .     .     .     .   165
        ----- + ----- ----- ----- ----- ----- ----- ----- ----- ----- + ----- ----- ----- ----- ----- ----- -----
    v     493 |     .     1     3     .     .     .     .     .     . |     .     .     .     .     .     .   497
    j      40 |     .     .     .     .     .     .     .     .     . |     .     .     .     .     .     .    40
    x    1101 |     5   116   545     .     1   183     4    18     6 |     .     .     .     .     .     .  1979
    s     540 |     1   128   222   943     .     2     .     .     1 |     .     .     .     .     .     .  1837
    y    3161 |    12     3    49     .     2    46     2    26     . |     .     .     .     .     .     .  3301
    g      52 |     6    47  2299     .     .     4     .     .     . |     .     .     .     .     .     .  2408
        ----- + ----- ----- ----- ----- ----- ----- ----- ----- ----- + ----- ----- ----- ----- ----- ----- -----
    TOT  5427 |  5430  3178 13508  2442  1406  1781    18   755   165 |   497    40  1979  1837  3301  2408 44172

Again, it is obvious that `ix' `ij' `iv' are single letters; we will
drop the `i' from them. Same goes for `cy' and `cg'; we will drop the
`c'.

We may also let `cs' and `is' be single letters. This is the right
thing to do if the distribution of the letter after the `s' depends
on the letter before the `s':

                  o     c     t     i     k     f   TOT
        ----- ----- ----- ----- ----- ----- ----- -----
    cs     45    86   109   943     1     1     2  1187
    is    495    42   113     .     .     .     .   650

Next-symbol probabilities (× 99):

                  o     c     t     i     k     f   TOT
        ----- ----- ----- ----- ----- ----- ----- -----
    cs      4     7     9    79     .     .     .    99
    is     75     6    17     .     .     .     .    99

They are similar, except that `cs' is often followed by `t', whereas
`is' is often terminal and is never followed by `t'. (Not surprising,
since `t' only appears after `c' in this corpus.)
But OK, let's replace `cs' by `s' and `is' by `r':

    cat j.wds \
      | sed -f fsg2jsa.sed \
      > bio-j-hoc.wds

    cat bio-j-hoc.wds | sort | uniq > bio-j-hoc.dic
    cat bio-j-hoc.wds | sort | uniq -c | sort +0 -1nr > bio-j-hoc.frq

     lines   words     bytes file
    ------ ------- --------- ------------
      7216    7216     44840 bio-j-hoc.wds
      1761    1761     13898 bio-j-hoc.dic

    cat bio-j-hoc.wds \
      | egrep '^[a-z67+^]*$' \
      > bio-j-hoc-gut.wds

    cat bio-j-hoc.dic \
      | egrep '^[a-z67+^]*$' \
      > bio-j-hoc-gut.dic

    bool 1-2 bio-j-hoc.dic bio-j-hoc-gut.dic \
      > bio-j-hoc-bad.dic

     lines   words     bytes file
    ------ ------- --------- ------------
      5427    5427     34110 bio-j-hoc-gut.wds
      1083    1083      7646 bio-j-hoc-gut.dic
       678     678      6252 bio-j-hoc-bad.dic

Digraph statistics:

    cat bio-j-hoc-gut.wds \
      | count-digraph-freqs

Digraph counts:

                  q     o     c     s     y     g     x     t     i     r     f     h     p     k     v     j   TOT
        ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- -----
            .  1398  1146   810   865   104   431   310     .     .    86    94   112     1    70     .     .  5427
    q       1     .  1383    18     2     1     .     .     .     .     .     1     .     .     .     .     .  1406
    o      19     5     1    40     8     3    18  1139     .    12   215  1190   455     8    60     .     5  3178
    c       1     .   150   964    36   731  1756     .  1201  1366     1   245   134     2    25     .     .  6612
    s      45     .    86    92    10     4     3     1   943     .     .     2     .     .     1     .     .  1187
    y    3161     2     3    17    23     .     9     7     .     .     4    46    26     2     .     .     1  3301
    g      52     .    47   403    35  1860     1     5     .     .     1     4     .     .     .     .     .  2408
    x    1101     1   116   262   126    98    59     3     .     .     2   183    18     4     6     .     .  1979
    t       4     .    73  1953     4   243   120    14     .     .     3    14    10     1     3     .     .  2442
    i       4     .     2     2     4     .     2   493     .   886   338     2     .     .     .   497    34  2264
    r     495     .    42    69    14    27     3     .     .     .     .     .     .     .     .     .     .   650
    f       6     .    47  1370    21   151     1     5   180     .     .     .     .     .     .     .     .  1781
    h       3     .    41   513    21    70     2     2   103     .     .     .     .     .     .     .     .   755
    p       .     .     2    14     1     .     .     .     1     .     .     .     .     .     .     .     .    18
    k       2     .    38    85    17     6     3     .    14     .     .     .     .     .     .     .     .   165
    v     493     .     1     .     .     3     .     .     .     .     .     .     .     .     .     .     .   497
    j      40     .     .     .     .     .     .     .     .     .     .     .     .     .     .     .     .    40
        ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- -----
    TOT  5427  1406  3178  6612  1187  3301  2408  1979  2442  2264   650  1781   755    18   165   497    40 34110

There is something funny about the `t'.
I must try to either (a) identify it with `c', or (b) join it with the preceding `c' or `s' as a single letter. Since most `t's have been misidentified as `c's, it is safer to do (a).