I decided to join `iv' to make `w' and identify `t' with `c':

    cat j.wds \
      | sed -f fsg2jsa.sed \
      > bio-j-hoc.wds

    cat bio-j-hoc.wds | sort | uniq > bio-j-hoc.dic
    cat bio-j-hoc.wds | sort | uniq -c | sort +0 -1nr > bio-j-hoc.frq

    cat bio-j-hoc.wds \
      | egrep '^[a-z67+^]*$' \
      > bio-j-hoc-gut.wds

    cat bio-j-hoc.dic \
      | egrep '^[a-z67+^]*$' \
      > bio-j-hoc-gut.dic

    cat bio-j-hoc-gut.wds | sort | uniq -c | sort +0 -1nr > bio-j-hoc-gut.frq

    bool 1-2 bio-j-hoc.dic bio-j-hoc-gut.dic \
      > bio-j-hoc-bad.dic

      lines   words     bytes file
     ------ ------- --------- ------------
       7216    7216     44287 bio-j-hoc.wds
       1712    1712     13418 bio-j-hoc.dic
       5427    5427     33613 bio-j-hoc-gut.wds
       1035    1035      7223 bio-j-hoc-gut.dic
        677     677      6195 bio-j-hoc-bad.dic

Digraph statistics:

    cat bio-j-hoc-gut.wds \
      | count-digraph-freqs

In the tables below, `_' stands for the word boundary (shown as a
blank label in the raw output). Rows are the first symbol of each
pair, columns the second.

Digraph counts:

            _     o     s     y     c     g     x     r     f     p     h     k     q     j     w     i   TOT
        ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- -----
    _       .  1146   865   104   810   431   310    86    94     1   112    70  1398     .     .     .  5427
    o      19     1     8     3    40    18  1139   215  1190     8   455    60     5     5     2    10  3178
    s      45    86    10     4  1035     3     1     .     2     .     .     1     .     .     .     .  1187
    y    3161     3    23     .    17     9     7     4    46     2    26     .     2     1     .     .  3301
    c       5   223    40   974  4118  1876    14     4   259     3   144    28     .     .     4  1362  9054
    g      52    47    35  1860   403     1     5     1     4     .     .     .     .     .     .     .  2408
    x    1101   116   126    98   262    59     3     2   183     4    18     6     1     .     .     .  1979
    r     495    42    14    27    69     3     .     .     .     .     .     .     .     .     .     .   650
    f       6    47    21   151  1550     1     5     .     .     .     .     .     .     .     .     .  1781
    p       .     2     1     .    15     .     .     .     .     .     .     .     .     .     .     .    18
    h       3    41    21    70   616     2     2     .     .     .     .     .     .     .     .     .   755
    k       2    38    17     6    99     3     .     .     .     .     .     .     .     .     .     .   165
    q       1  1383     2     1    18     .     .     .     1     .     .     .     .     .     .     .  1406
    j      40     .     .     .     .     .     .     .     .     .     .     .     .     .     .     .    40
    w     493     1     .     3     .     .     .     .     .     .     .     .     .     .     .     .   497
    i       4     2     4     .     2     2   493   338     2     .     .     .     .    34   491   395  1767
        ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- -----
  TOT    5427  3178  1187  3301  9054  2408  1979   650  1781    18   755   165  1406    40   497  1767 33613

I computed a "strangeness number" by the formula

    function strangeness(n, xk, yk, xyk) {
      if ((xk == 0) || (yk == 0)) {
        return 0
      } else {
        # Observed frequencies of "x", of "y", and of the pair "xy".
        fx = xk/n; fy = yk/n; fxy = xyk/n;
        fmax = (fx < fy ? fx : fy);   # largest possible pair frequency
        fexp = fx*fy;                 # pair frequency expected by chance
        fmin = 0;                     # smallest possible pair frequency
        if (fxy <= fmin) {
          return -1
        } else if (fxy >= fmax) {
          return +1
        } else {
          tmax = (fmax - fxy)/(fmax - fexp);
          tmin = (fxy - fmin)/(fexp - fmin);
          tsum = (log(tmin) - log(tmax))/log(2.0);
          # Both branches compute tanh(tsum) without overflowing exp().
          if ( tsum > 0 ) {
            texp = exp(-2*tsum);
            return (1 - texp)/(1 + texp)
          } else {
            texp = exp( 2*tsum);
            return (texp - 1)/(texp + 1)
          }
        }
      }
    }

    function normalness(n, xk, yk, xyk) {
      str = strangeness(n, xk, yk, xyk);
      return 1 - str*str
    }

where n is the total number of pairs tested, xk the number of "x"
occurrences, yk the number of "y" occurrences, and xyk the number of
"xy" pairs. The result is 0 if xyk is the expected number, +1 if it
is the maximum possible = min(xk,yk), and -1 if it is the minimum
possible (0).

Here is the table, scaled from [-1..+1] to [01..99]:

Strangeness (× 99):

            s     y     c     o     _     g     f     p     h     k     q     x     r     j     w     i   TOT
        ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- -----
    s       1     .    99    30     1     .     .     .     .     1     .     .     .     .     .     .    50
    y       1     .     .     .    99     .     2    59     4     .     .     .     .     2     .     .    50
    c       .    58    90     1     .    99    10    14    21    15     .     .     .     .     .    99    50
    o       .     .     .     .     .     .    99    99    99    98     .    99    98    70     .     .    50
    _      99     1    10    95     .    58     3     3    42    97    99    47    33     .     .     .    50
    g       6    99    15     1     .     .     .     .     .     .     .     .     .     .     .     .    50
    f       4    38    99     2     .     .     .     .     .     .     .     .     .     .     .     .    50
    p      79     .    99    62     .     .     .     .     .     .     .     .     .     .     .     .    50
    h      33    45    99    15     .     .     .     .     .     .     .     .     .     .     .     .    50
    k      95     4    97    94     .     2     .     .     .     .     .     .     .     .     .     .    50
    q       .     .     .    99     .     .     .     .     .     .     .     .     .     .     .     .    50
    x      86    11     7    18    99     6    84    98     6    19     .     .     .     .     .     .    50
    r      19     6     4    23    99     .     .     .     .     .     .     .     .     .     .     .    50
    j       .     .     .     .    99     .     .     .     .     .     .     .     .     .     .     .    50
    w       .     .     .     .    99     .     .     .     .     .     .     .     .     .     .     .    50
    i       .     .     .     .     .     .     .     .     .     .     .    98    99    99    99    98    50
        ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- -----
  TOT      50    50    50    50    50    50    50    50    50    50    50    50    50    50    50    50 33613

Normalness (× 99):

            _     x     y     o     s     c     g     k     p     f     h     r     j     i     w     q   TOT
        ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- -----
    _       .    99     2    16     .    37    96     8    12    10    97    89     .     .     .     .    99
    x       2     .    38    59    47    27    24    61     5    50    23     .     .     .     .     .    99
    y       .     .     .     .     3     .     .     .    95     6    15     .     6     .     .     .    99
    o       .     .     .     .     .     .     .     3     1     .     .     4    81     .     .     .    99
    s       4     .     .    83     6     .     .     2     .     .     .     .     .     .     .     .    99
    c       .     .    96     4     .    31     1    52    49    35    67     .     .     1     .     .    99
    g       1     .     .     3    24    50     .     .     .     .     .     .     .     .     .     .    99
    k       .     .    17    17    14     7     6     .     .     .     .     .     .     .     .     .    99
    p       .     .     .    93    64     .     .     .     .     .     .     .     .     .     .     .    99
    f       .     .    94     8    14     .     .     .     .     .     .     .     .     .     .     .    99
    h       .     .    98    51    87     .     .     .     .     .     .     .     .     .     .     .    99
    r       .     .    24    71    60    14     .     .     .     .     .     .     .     .     .     .    99
    j       .     .     .     .     .     .     .     .     .     .     .     .     .     .     .     .    99
    i       .     2     .     .     .     .     .     .     .     .     .     .     .     3     .     .    99
    w       .     .     .     .     .     .     .     .     .     .     .     .     .     .     .     .    99
    q       .     .     .     .     .     .     .     .     .     .     .     .     .     .     .     .    99
        ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- -----
  TOT      99    99    99    99    99    99    99    99    99    99    99    99    99    99    99    99 33613

These below are all the `qc' words in the file. They look like
misreadings of popular `qo' words.

    egrep 'qc' bio-j-hoc-gut.frq

      qccgy    qccgy    qcfccgy  qcfccgy  qcccgy   qccgccy
      qccy     qcgy     qchcccg  qchccy   qchcgy   qchcgys
      qchcix   qchcy    qchy     qci      qcixox   qcy

07-07-09 stolfi
===============

Summarizing, so far it seems that breaking down all characters into
strokes was a very good idea. It led (somewhat indirectly) to two
discoveries: that the difference between Guy2 and / is not important,
and highly contaminated by error; and that Guy2 `a' is probably not a
letter --- it is a `c' stroke (possibly half of the preceding letter)
accidentally connected to an `i' stroke (probably the beginning of the
next letter).

Looking at the above tables, it is now almost certain that `sc' and
`qo' are letters on their own.
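As a rough cross-check of that claim, the strangeness formula can be re-evaluated directly from the digraph counts. The sketch below is a Python port of the awk function given earlier. The port and the variable names `N`, `q_o`, `s_c` and `b_x` are mine, not part of the original scripts. The counts are copied from the digraph count table (row total of the first symbol, column total of the second, and the pair count).

```python
import math

def strangeness(n, xk, yk, xyk):
    """Python port of the awk `strangeness' function (a sketch, not the original)."""
    if xk == 0 or yk == 0:
        return 0.0
    fx, fy, fxy = xk / n, yk / n, xyk / n
    fmax = min(fx, fy)   # largest possible pair frequency
    fexp = fx * fy       # pair frequency expected under independence
    fmin = 0.0           # smallest possible pair frequency
    if fxy <= fmin:
        return -1.0
    if fxy >= fmax:
        return 1.0
    tmax = (fmax - fxy) / (fmax - fexp)
    tmin = (fxy - fmin) / (fexp - fmin)
    tsum = (math.log(tmin) - math.log(tmax)) / math.log(2.0)
    # The two awk branches are both algebraically equal to tanh(tsum).
    return math.tanh(tsum)

N = 33613                                  # total number of pairs in the table
q_o = strangeness(N, 1406, 3178, 1383)     # `q' followed by `o'
s_c = strangeness(N, 1187, 9054, 1035)     # `s' followed by `c'
b_x = strangeness(N, 5427, 1979, 310)      # word boundary followed by `x'

print("qo:", q_o)
print("sc:", s_c)
print("_x:", b_x)
```

Both `qo' and `sc' come out very close to +1, while an unremarkable pair such as word boundary followed by `x' stays near 0, which agrees with the pattern in the scaled table.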
(Note that `sc' is represented as [2C], [2A], [S], [2T] in the
interlinear file. In other words, the plume on the is not really
attached to the but to the following letter, which is always a `c'
stroke. This may be an explanation for the ligature in [S] = , and
the reported ligature.)

Summarizing, I am now going to use the following FSG -> JSA
pre-encoding

    IIIK -> iiiij    IE -> iix    A -> ci      N -> iiu
    IIIL -> iiiiu    IR -> iis    C -> c       O -> o
    IIIR -> iiiis    IK -> iij    D -> lj      P -> ag
    IIIE -> iiiix    2  -> cs     E -> ix      R -> is
    IIE  -> iiix     4  -> a      F -> lg      S -> csc
    IIR  -> iiis     6  -> cj     G -> cy      T -> cc
    IIK  -> iiij     7  -> ig     H -> aj      V -> ^
    HZ   -> cajc     8  -> cg     I -> i       Y -> +
    PZ   -> cagc                  K -> ij
    DZ   -> cljc                  L -> iu
    FZ   -> clgc                  M -> iiiu

followed by the JSA -> ad-hoc post-encoding:

    sc -> s    ij -> 7    ig -> 8    aj -> H    a -> 4 (if unpaired)
    ao -> A    ix -> e    cg -> 8    ag -> H
               iu -> v    cy -> 9    lj -> H
               is -> r               lg -> H

Moreover, I am going to use this encoding before preparing the
consensus transcription. The consensus-maker will have to be sort of
a dynamic programming algorithm...

OK, I coded the dynamic consensus-maker, and modified the script
fsg2jsa to work on the interlinear file.
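A minimal sketch of how the pre-encoding table could be applied, assuming longest-match-first substitution so that `HZ' or `IIIK' is consumed before `H' or `I'. The actual fsg2jsa script is not reproduced in these notes, so the matching order, the Python names, and the pass-through of unlisted characters are all assumptions of mine. Only the table entries are copied from the list above, and the input strings are illustrative.

```python
import re

# FSG code -> JSA stroke string, as listed in the pre-encoding table.
FSG_TO_JSA = {
    "IIIK": "iiiij", "IIIL": "iiiiu", "IIIR": "iiiis", "IIIE": "iiiix",
    "IIE": "iiix", "IIR": "iiis", "IIK": "iiij",
    "HZ": "cajc", "PZ": "cagc", "DZ": "cljc", "FZ": "clgc",
    "IE": "iix", "IR": "iis", "IK": "iij",
    "2": "cs", "4": "a", "6": "cj", "7": "ig", "8": "cg",
    "A": "ci", "C": "c", "D": "lj", "E": "ix", "F": "lg", "G": "cy",
    "H": "aj", "I": "i", "K": "ij", "L": "iu", "M": "iiiu",
    "N": "iiu", "O": "o", "P": "ag", "R": "is", "S": "csc", "T": "cc",
    "V": "^", "Y": "+",
}

# Longest codes first, so multi-character codes win over their prefixes
# (assumed behaviour; Python tries regex alternatives left to right).
_PATTERN = re.compile("|".join(
    re.escape(code) for code in sorted(FSG_TO_JSA, key=len, reverse=True)))

def fsg_to_jsa(text):
    """Replace each FSG code by its strokes; other characters pass through."""
    return _PATTERN.sub(lambda m: FSG_TO_JSA[m.group(0)], text)

print(fsg_to_jsa("HZ"))     # uses the two-character entry, not H alone
print(fsg_to_jsa("IL"))     # same strokes as N
print(fsg_to_jsa("4ODAM"))
```

Note that `IL' and `N' produce the same stroke string, which is the point of the stroke decomposition: readings that differ only in how the strokes were grouped become identical.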
So:

    cat bio-m-evt.evt \
      | fsg2jsa \
      > bio-m-jsa-bug.evt

Now I extracted the training dataset, and generated a new set of
correction patterns from it:

    cat bio-m-jsa-bug.evt \
      | egrep '^<.*;[FC]> ' \
      | sed \
          -e 's/<.*;[FC]> */ /g' \
          -e 's/{[^}]*}//g' \
      | grep -v '[*]' \
      > .train.txt

      lines   words     bytes file
     ------ ------- --------- ------------
       1470    1470    115821 .train.txt

    cat .train.txt \
      | generate-fix-patterns -vMINOCC=10 \
      > .fixit.sed

      lines   words     bytes file
     ------ ------- --------- ------------
        596     716      9932 .fixit.sed

Next I generated the consensus interlinear, and ran the automatic
context-fixer above:

    cat bio-m-jsa-bug.evt \
      | make-consensus-interlin \
      > bio-m-jsa.evt

I extracted the consensus text from it, and applied the automatic
corrector:

    cat bio-m-jsa.evt \
      | egrep '^<.*;J> ' \
      | sed \
          -e 's/{[^}]*}//g' \
          -e 's/[\!]//g' \
      > bio-j-jsa-raw.evt

    cat bio-j-jsa-raw.evt \
      | sed -f .fixit.sed \
      > bio-j-jsa-fix.evt

I wrote a script "extract-words-from-interlin" that extracts the words
from the consensus file, remaps them through an arbitrary encoding,
extracts the dictionary, and runs the digraph statistics:

    extract-words-from-interlin \
      -recode jsa2hoc \
      bio-j-jsa-fix.evt \
      jh-1

    cat bio-j-hoc-1-gut.wds \
      | count-digraph-freqs \
          -vchars=' c9po8idervqs74gy'

      lines   words     bytes file
     ------ ------- --------- ------------
       7358    7358     46402 bio-j-hoc-1.wds
       1553    1553     14124 bio-j-hoc-1.dic
       5873    5873     36448 bio-j-hoc-1-gut.wds
       1001    1001      7199 bio-j-hoc-1-gut.dic
        552     552      6925 bio-j-hoc-1-bad.dic
      16337   16337    111098 total

Digraph counts (`_' stands for the word boundary, the blank that is
the first character of `chars'; rows are the first symbol of each
pair, columns the second):

            _     c     9     p     o     8     i     d     e     r     v     q     s     7     4     g     y   TOT
        ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- -----
    _       .  1780   128   359  1206   467    68     .   322    28     .  1493     .     .    22     .     .  5873
    c       4  3528  1003   473   187  1875  1548  1129    11     4     .     .   159     .     .     .     .  9921
    9    3238    45     .    80     3    11     2     .    10     2     .     4     .     1     1     .     .  3397
    p      14  2629   245     .   142     7     .     .     8     .     .     .     .     .     .     .     .  3045
    o      15    26     1   605     .    12    44     .   972   195     .     5     .     6     .     .     .  1881
    8      58   475  1888     5    48     1     .     .     6     1     .     .     .     .     .     .     .  2482
    i       5     8     .     3     1     3  1558   130   482   326   828     .     .    40     .     .     .  3384
    d       2   937    24    10    34    36   160    27     4     1     .     .     .     .     .     8    43  1286
    e    1035   452    94   230   121    61     .     .     5     2     .     1     .     .     .     .     .  2001
    r     519     .     .     1    46     .     .     .     .     .     .     .     .     .     .     .     .   566
    v     824     .     3     .     1     .     .     .     .     .     .     .     .     .     .     .     .   828
    q       7    23     1  1273     1     8     4     .   179     7     .     1     .     .     .     .     .  1504
    s      63     .     .     5    90     .     .     .     1     .     .     .     .     .     .     .     .   159
    7      46     .     .     .     1     .     .     .     .     .     .     .     .     .     .     .     .    47
    4       1    18     3     1     .     .     .     .     .     .     .     .     .     .     .     .     .    23
    g       1     .     7     .     .     .     .     .     .     .     .     .     .     .     .     .     .     8
    y      41     .     .     .     .     1     .     .     1     .     .     .     .     .     .     .     .    43
        ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- -----
  TOT    5873  9921  3397  3045  1881  2482  3384  1286  2001   566   828  1504   159    47    23     8    43 36448
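The source of `count-digraph-freqs' is not reproduced in these notes, so the following Python sketch only shows one plausible way such a table could be tallied. It assumes each word is padded with a single word-boundary symbol on both sides, so that a word of k letters contributes k+1 pairs. That assumption is at least consistent with the totals above: the grand total of 36448 pairs equals the byte count of bio-j-hoc-1-gut.wds (one byte per letter plus one newline per word), and the boundary row total of 5873 equals the number of words. The function name and the sample words are hypothetical.

```python
from collections import Counter

BOUNDARY = " "  # word-boundary symbol, assumed to match the blank in `chars'

def count_digraphs(words):
    """Tally adjacent symbol pairs, reading each word as BOUNDARY+word+BOUNDARY."""
    pairs = Counter()
    for word in words:
        padded = BOUNDARY + word + BOUNDARY
        for first, second in zip(padded, padded[1:]):
            pairs[first, second] += 1
    return pairs

# Hypothetical sample, not taken from the actual word list.
sample = ["qpc9", "qpc9", "8c9"]
pairs = count_digraphs(sample)

total = sum(pairs.values())
boundary_row = sum(n for (first, _), n in pairs.items() if first == BOUNDARY)
print("total pairs:", total)
print("pairs starting at a word boundary:", boundary_row)
```

With this convention every row total equals the matching column total, as in the table, because each symbol occurrence is counted once as the first member of a pair and once as the second.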