Hacking at the Voynich manuscript
Notebook - volume 1

Warning: these notebooks aren't strictly chronological logs.
Sometimes I go back and redo things, clarify comments, delete
garbage, etc.

97-07-05 stolfi
===============

  I obtained Landini's interlinear transcription of the VMs,
  version 1.6 (landini-interln16.evt) from

    http://sun1.bham.ac.uk/G.Landini/evmt/intrln16.zip

  I manually extracted from it a homogeneous, full-text sample
  bio-m-evt.evt, consisting of pages 147-166 of the "biological"
  section, in Currier's Language B, hand 2. This section includes
  Currier's and Friedman's transcriptions. Currier's seems to be
  the most complete of them.

  I reduced this sample to a file bio-c-evt.txt in "plain" text
  format as follows:

    * I eliminated comments and Friedman's lines:

        cat bio-m-evt.evt \
          | egrep -v '^#' \
          | grep ';C>' \
          > bio-c-evt.evt

    * I eliminated (with Emacs) the line numbers <...> at the
      beginning of each line;

    * I replaced 3 or more consecutive occurrences of "%" by "%%"

    * I eliminated the strings of "!" present at the end of some lines.

    * I replaced the end-of-line "-" by ".//."

    * I replaced the end-of-paragraph "=" by ".//.=."
      (Note that not all pages end in "=" in the original Currier file.)

    * I replaced "." by " " (with tr).

    * I removed " " at the end of all lines.

    lines   words     bytes file
    ------ ------- --------- ------------
       765    7230     38906 bio-c-evt.txt

  Next I computed the word frequencies:

    cat bio-c-evt.txt \
      | tr ' ' '\012' \
      | sort \
      | uniq -c \
      | sort +0 -1nr \
      > bio-c-evt.frq

    cat bio-c-evt.txt \
      | tr ' ' '\012' \
      | sort \
      | uniq \
      > bio-c-evt.dic

  I removed all "bad" words (those containing "%", "/", "=", or "*"):

    cat bio-c-evt.dic \
      | grep -v '[%*/=]' \
      | egrep '.' \
      > bio-c-evt-gut.dic

    lines   words     bytes file
    ------ ------- --------- ------------
      1851    1850     12117 bio-c-evt.dic
      1381    1381      8277 bio-c-evt-gut.dic

  I created an automaton for bio-c-evt-gut.dic:

    cat bio-c-evt-gut.dic \
      | nice MaintainAutomaton \
          -add - \
          -dump bio-c-evt-gut.dmp

    strings  letters   states   finals     arcs  sub-sts lets/arc
    -------- -------- -------- -------- -------- -------- --------
        1381     6896      535      114     1633     1341    4.223

  I ran AutoAnalysis, looking for unproductive states (2 words or
  less) and strange words:

    nice AutoAnalysis \
      -load bio-c-evt-gut.dmp \
      -unprod bio-c-evt-gut-1-unp.sts \
      -maxUnprod 2 \
      -unprodSugg bio-c-evt-gut-1-unp.sugg

    161 unproductive states
    255 strange words (with repetitions) listed
    200 strange words (without repetitions) listed

  I ran it again, this time counting a state as unproductive only
  if it is used by a single word:

    nice AutoAnalysis \
      -load bio-c-evt-gut.dmp \
      -unprod bio-c-evt-gut-2-unp.sts \
      -maxUnprod 1 \
      -unprodSugg bio-c-evt-gut-2-unp.sugg

    67 unproductive states
    67 strange words (with repetitions) listed
    45 strange words (without repetitions) listed

  Here are the strange words:

    2AESHZ8G 2AIIRAE 42OEDCC8G 4GDAM 4O4ODCCG 4ODGE88G 8ETCRG C2C2
    CESC8G CPAEOIR DOHAEG DOROER EE8GR EEAIIRG EFTCAE EODAK EOIIIK
    FZ8OROE G4ODAN GCPAM GDT8AR GROHCG HCGHC6 HG4ODG HOROES28G
    HT8OEH8G IEPT8G ODAROEOK OGSCG OHTOHAR OPTOEOR OSCPOE2 P8AESOR
    PGDC8G POECC8ARAL POEHCSOE POHTODAR PSTC8AE SC8CA SETCR TCIROR
    TEAIIIL TETPSCCG TOEDCCCG TSCDZG

  I looked for popular states and radicals:

    nice AutoAnalysis \
      -load bio-c-evt-gut.dmp \
      -classes bio-c-evt-gut-1.cls \
      -minClassPrefs 5 \
      -radicals bio-c-evt-gut-1.rad

    37 lexical classes found

  I modified AutoAnalysis, adding a new command "-prod" to list
  productive states:

    AutoAnalysis \
      -load bio-c-evt-gut.dmp \
      -prod bio-c-evt-gut-1-prd.pst \
      -minProductivity 1 \
      -maxPrefSize 30 \
      -maxSuffSize 30

    25 productive states

  Here they are:

     state  nprefs  nsuffs  nwords  prodty  prefs/suffs
    ------- ------- ------- ------- ------- -----------------
        180      35       2      70      34  { TDZ8, TC2, TAR, RTC8, RSC8, ... }:{ (), G }
         60      30       2      60      29  { THZC, TDCC, TCHC, SCDZC, OHTC, ... }:{ 8G, G }
        186      15       2      30      14  { TC8O, PZA, PTO, PTC8A, OESCO, ... }:{ E, R }
        681       4       4      16       9  { TC8A, OEHA, O8A, GHA }:{ E, M, N, R }
         71       5       3      15       8  { 8TCC, 4OPTC, 4OHTC, DSC, 2OETC }:{ 8G, G, OE }
         85       8       2      16       7  { OHAES, OFT, EPT, EHC, 8GDC, ... }:{ 8G, C8G }
         58       7       2      14       6  { TR, SOE, SC8AE, OHOE, ODOE, ... }:{ (), 8G }
        206       4       3      12       6  { OHS, AET, 4ORC, 4ODS }:{ 8G, C8G, CG }
        233       4       3      12       6  { TCHZ, ETCDZ, 4OT, 4ODZ }:{ C8G, CG, G }
        255       4       3      12       6  { OCC, 8OETC, 8CC, 4OESC }:{ 8G, C8G, G }
        123       6       2      12       5  { TCCD, TC8T, SCCDC, ODZ, EDC8, ... }:{ CG, G }
        189       5       2      10       4  { SHZCG, SCOEO, OHC8G, 4ODT8G, ... }:{ (), E }
        193       3       3       9       4  { OPSC8, ODCC8, 4ODCC8 }:{ (), AE, G }
        301       5       2      10       4  { TDAR, ROR, OEOR, HC8G, 4OHOE }:{ (), OE }
        463       5       2      10       4  { TCC8, OET8, HTC8, GHCC8, CC8 }:{ AR, G }
         30       4       2       8       3  { O2, HAR, CCC2, 2AR }:{ (), AE }
        151       4       2       8       3  { ODS, GFT, EDT, 4O8C }:{ C8G, CG }
        304       2       4       8       3  { TPZ, 4OHS }:{ 8G, C8G, CG, G }
        229       2       3       6       2  { EDCC, 4ODTC }:{ 8, 8G, G }
        368       3       2       6       2  { ODAI, AI, 8CI }:{ IIL, R }
        372       3       2       6       2  { TCOET, OTC, DCT }:{ 8G, CG }
        408       3       2       6       2  { S8, POES8, 8SCC8 }:{ AE, G }
        588       2       3       6       2  { SODA, EORA }:{ E, M, N }
        662       2       3       6       2  { SCDC, ODCS }:{ 8G, CG, G }
        819       3       2       6       2  { SCAE, OEDCCG, ODC8G }:{ (), R }

  There appear to be two major kinds of words: those that inflect
  with [M,E,N,R] and those that inflect with [(),G,8G,CG,C8G], or
  [G] for short.

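
  The "prodty" column in the table above is not defined in these
  notes. In every listed row it agrees with nwords minus the two set
  sizes plus one, where nwords is the product of the number of
  prefixes and the number of suffixes. The sketch below checks that
  reading against a few rows; the function name is hypothetical, and
  the formula is an inference from the table, not a documented
  property of AutoAnalysis.

```python
def productivity(n_prefixes, n_suffixes):
    """Return (nwords, prodty) under the ASSUMED reading of the table:
    nwords = prefixes x suffixes, and prodty = nwords minus the number
    of strings needed to list both sets, plus one."""
    n_words = n_prefixes * n_suffixes
    prodty = n_words - n_prefixes - n_suffixes + 1
    return n_words, prodty


# (prefixes, suffixes, nwords, prodty) copied from four rows of the table.
rows = [(35, 2, 70, 34), (4, 4, 16, 9), (5, 3, 15, 8), (2, 4, 8, 3)]
for n_pref, n_suff, n_words, prodty in rows:
    assert productivity(n_pref, n_suff) == (n_words, prodty)
print("formula agrees with the sampled rows")
```

  Under this reading a state scores high only when both sets are
  large, which is why the 4x4 state 681 outranks several states with
  more prefixes but only two suffixes.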
  I collected the radicals of the class [M,N,E,R]:

    cat bio-c-evt.frq \
      | egrep -v '[%*]' \
      | egrep '[MENR]$' \
      | sort +1 -2 \
      > foo.frq

  Here are some popular radicals of the [M,E,N,R] class, with
  approximate counts of the four inflections:

    4ODA(383) O(238) 8A(225) ODA(111) 2A(94) 4O(93) OHA(79) A(75)
    4OHA(73) 2O(49) OEDA(41) RA(35) SCO(31) EO(31) TCO(30) OEO(28)
    DA(24) 4ODO(23) RO(22) ''(20) HA(19) EDA(15) SO(14) SA(13)
    TC8A(12) OEA(11) PO(11) SC8A(10) TA(10) EA(9) HO(9) ORA(9) SCA(9)

  Note that all the non-empty radicals end in "A" or "O".

  I collected the radicals of the class [G]:

    cat bio-c-evt.frq \
      | egrep -v '[%*]' \
      | egrep '[G]$' \
      | sort +1 -2 \
      > foo.frq

  Looking at the most frequent words in this file, it seems that the
  main suffixes of this class are actually, in decreasing frequency
  order for each radical:

    4OD{C8G,CC8G,CCG,CG,G,SC8G,T8G,TC8G,TCG,TG,ZCG,ZG}
    4OH{C8G,CC8G,G,CG,CCG,TC8G,TG}
    8{G,SC8G,TC8G,C8G,CC8G,...}
    D{C8G,CC8G,TCG,CCC8G,CG}
    E{TC8G,SC8G,TCG,8G,DC8G,G,DCC8G,SCG,TG,SCC8G,OEG,SCCG,...}
    GD{CC8G,CCG,C8G,}
    GH{C8G,CCG,CC8G}
    H{C8G,CC8G,CCG}
    OD{C8G,CC8G,CCG,G,CG,TCG,SCG,T8G,CSG,SC8G,TG,ZG,CCC8G,...}
    OE{G,TC8G,SC8G,TCG,8G,...}
    OED{C8G,CCG,CC8G,...}
    OH{C8G,CC8G,CCG,G,CG,C8CCG,CCC8G,...}
    OP{TC8G,SC8G,TCG,TCCG,...}
    S{C8G,CG,CC8G,CCG,G,8G,...}
    T{C8G,CG,CCG,CC8G,8G,G,...}

  So it seems that the [G] class is actually [C8G,CC8G,CCG,CG,G].
  However, the preceding letter is quite often [S,T,C,Z,D], so the
  class may include [SC8G,TC8G,...].

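
  The conjecture above (endings [C8G,CC8G,CCG,CG,G], with a frequent
  preceding letter) can be explored with a small sketch like the
  following. It is not the procedure used in these notes, which rely
  on egrep and sed; the function names and the sample word list are
  illustrative assumptions.

```python
import re
from collections import Counter

# The five endings proposed in the text. The pattern is anchored at the
# end of the word, and the search returns the leftmost (longest) match.
G_ENDING = re.compile(r"(CC8G|C8G|CCG|CG|G)$")


def split_g_word(word):
    """Split a word into (radical, ending); None if it does not end in G."""
    m = G_ENDING.search(word)
    if m is None:
        return None
    return word[:m.start()], m.group(1)


def tally_preceding_letters(words):
    """Count the last letter of the radical over all words ending in G."""
    counts = Counter()
    for w in words:
        parts = split_g_word(w)
        if parts and parts[0]:
            counts[parts[0][-1]] += 1
    return counts


# Illustrative sample only; a real run would read the word list file.
sample = ["4ODC8G", "4ODCC8G", "SC8G", "TC8G", "OHCG", "8G", "4ODAM"]
print([split_g_word(w) for w in sample])
print(tally_preceding_letters(sample))
```

  A tally dominated by a handful of letters would support treating
  those letters as part of the ending rather than of the radical.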
I extracted the words with those popular radicals combined with the selected endings, and counted their frequencies: /bin/rm .foo foreach f ( '' 4OD E OH 4OH OD 8 OE H D GD 4OED ) echo 'Prefix "'"${f}"'":' >> .foo echo ' ' >> .foo cat bio-c-evt.frq \ | egrep -v '[%*]' \ | egrep ' '"${f}"'[STZ]?[C]*[8]?[G]?$' \ | sed -e 's@[ ]'"${f}"'@ @' \ | sort -b +0 -1nr \ | format-counts \ | sed -e 's@^@ @' \ >> .foo echo ' ' >> .foo end Prefix "": SC8G(210) TC8G(193) SCG(78) TCG(74) 8G(41) TCCG(34) SCC8G(33) SCCG(30) TCC8G(21) T8G(19) SG(9) G(5) S8G(5) TC8(5) TG(5) CC8G(3) S(2) SC8(2) (1) 8(1) CCC8G(1) CCG(1) SCC8(1) TCCCG(1) Prefix "4OD": C8G(159) CC8G(149) CCG(83) G(58) CG(39) T8G(10) TC8G(7) SC8G(6) TG(5) C8(4) CC8(4) ZCG(4) (3) CCCG(3) TCG(3) CCC8G(2) 8G(1) S8G(1) SCG(1) TC8(1) ZC8G(1) ZG(1) Prefix "E": TC8G(55) SC8G(23) TCG(18) 8G(9) (7) G(7) SCG(6) TG(6) SCC8G(5) SCCG(4) T8G(4) TCCG(4) TC8(3) S8G(2) SC8(2) TCC8G(2) 8(1) C(1) CC8G(1) S(1) T(1) Prefix "OH": C8G(48) CC8G(25) CCG(18) G(17) CG(13) SC8G(4) S8G(3) TC8G(3) TCG(3) (2) SCG(2) TG(2) C8(1) CCC8G(1) T8G(1) ZCG(1) ZG(1) Prefix "4OH": C8G(46) CC8G(39) G(26) CG(8) CCG(7) TC8G(3) TG(3) CC8(2) SC8G(2) TCG(2) (1) CCC8G(1) S8G(1) SCG(1) SG(1) T8G(1) ZCG(1) Prefix "OD": C8G(43) CC8G(32) CCG(16) G(12) CG(11) TCG(6) SCG(3) T8G(3) CC8(2) SC8G(2) TG(2) ZG(2) (1) CCC8G(1) T8(1) TC8G(1) ZCG(1) Prefix "8": G(41) SC8G(17) TC8G(9) C8G(2) CC8G(2) SCC8G(2) SCCG(2) SCG(2) T8G(2) TCG(2) (1) CCC8G(1) CCG(1) S8G(1) SC8(1) TCC8G(1) TCCG(1) TG(1) Prefix "OE": (176) G(26) TC8G(24) SC8G(14) TCG(13) 8G(12) SCG(8) TG(5) S8G(4) T8G(3) TCCG(3) SC8(2) SCCG(2) TCC8G(2) CCC8(1) CCC8G(1) SCC8(1) SCC8G(1) TC8(1) Prefix "H": C8G(20) TC8G(10) CC8G(4) CCG(4) SC8G(3) SCG(3) ZC8G(3) ZCG(3) G(2) TCG(2) Z8G(2) T8G(1) TCCG(1) TG(1) Prefix "D": C8G(11) CC8G(10) TCG(4) CC8(3) (2) CCC8G(2) CG(2) SCG(2) T8G(2) ZCG(2) S8G(1) SC8G(1) TC8G(1) Z8G(1) ZC8G(1) ZG(1) Prefix "GD": CC8G(10) CCG(7) C8G(5) CCCG(1) SC8G(1) ZCG(1) Prefix "4OED": CC8G(4) CCG(4) G(4) CG(1) It 
seems that the class consists of the suffixes [S,T,Z,][C]*[8,][G]. I collected all prefixes with these suffixes: /bin/rm .bar cat bio-c-evt.frq \ | egrep -v '[%*]' \ | egrep 'G$' \ | /bin/sed \ -e 's@[ ]@ :@' \ -e 's@[ ]\(.*\)\([STZ][C]*[8]\{0,1\}G\)$@ \1- \2@' \ -e 's@[ ]\(.*[^STZC]\)\(C[C]*[8]\{0,1\}G\)$@ \1- \2@' \ -e 's@[ ]\(.*[^STZC]\)\(8G\)$@ \1- \2@' \ -e 's@[ ]\(.*[^STZC8]\)\(G\)$@ \1- \2@' \ -e 's@ :@ @' \ | sort -b +2 -3 +0 -1nr \ > .bar Suffix "8G": (41) OE(12) E(9) 8AE(4) 4O(3) DOE(3) O(3) SOE(3) 4ODG(2) 4ODO(2) 4OHAE(2) PSCOE(2) 2OE(1) 4OD(1) 4ODA(1) 4ODAE(1) 4ODC2(1) 4ODGE8(1) 4ODOE(1) 4ODT2(1) 4OE(1) 4OHAR(1) 4OP(1) 8AIRG(1) 8AR(1) A(1) AEAE(1) EDCOE(1) EO(1) EOE(1) G(1) GSCAE(1) HOROES2(1) HT8OEH(1) ODA(1) ODAE(1) ODOE(1) OEAO(1) OEHG(1) OHAE(1) OHCCO(1) OHCOE(1) OHOE(1) OHTR(1) ORO(1) PZO(1) SC8AE(1) SCCOE(1) SCOE(1) SE(1) TO(1) TR(1) Suffix "C8G": 4OD(159) OH(48) 4OH(46) OD(43) H(20) OED(18) D(11) ED(7) GH(7) GD(5) OEH(3) TD(3) 8(2) 8D(2) EH(2) SCD(2) TCH(2) 2OED(1) 4(1) 4CH(1) 4ODC8(1) 4ODO(1) 4OR(1) 8GD(1) 8GH(1) 8OD(1) 8OE(1) 8OED(1) AED(1) CD(1) OR(1) PGD(1) TH(1) Suffix "CC8G": 4OD(149) 4OH(39) OD(32) OH(25) OED(13) D(10) GD(10) 4O(6) ED(6) 4OED(4) H(4) (3) 2OED(3) GH(3) 2OD(2) 4(2) 8(2) 8GD(2) EH(2) OEH(2) 2OEH(1) 42OED(1) 4O8(1) 4OCCD(1) 4OEH(1) 4OR(1) E(1) EO(1) G8(1) HD(1) HOEOD(1) HSCOD(1) LED(1) O(1) OCD(1) PO(1) TCD(1) TD(1) Suffix "CCC8G": 4O(2) 4OD(2) D(2) (1) 4(1) 4OH(1) 8(1) O(1) OD(1) OE(1) OH(1) Suffix "CCCG": 4OD(3) OED(2) GD(1) TOED(1) Suffix "CCG": 4OD(83) OED(18) OH(18) OD(16) 4OH(7) GD(7) 4OED(4) GH(4) H(4) 2OED(3) ED(3) 8D(2) (1) 2AED(1) 2D(1) 4(1) 4CC8(1) 4CD(1) 4O(1) 4O4OD(1) 4O8(1) 4OR(1) 8(1) 8OH(1) EO(1) EP(1) O(1) OEOED(1) OHC8(1) POED(1) SCCD(1) SCD(1) TD(1) TGH(1) Suffix "CG": 4OD(39) OH(13) OD(11) 4OH(8) OED(5) TCH(3) D(2) SCD(2) TCCD(2) 2OED(1) 4(1) 4CD(1) 4O8GD(1) 4OED(1) 8GH(1) AH(1) CC2(1) ED(1) EDC8(1) GH(1) GROH(1) HOED(1) OEHS8(1) SCCD(1) SCCH(1) SCH(1) TCD(1) Suffix "G": 4OD(58) 4OH(26) OE(26) OH(17) OD(12) 
TCD(12) OED(11) SCD(11) TCCD(11) SCCD(8) 4OE(7) 8AE(7) E(7) SCCH(7) 4ODAE(6) OR(6) SCH(6) (5) 4OED(4) 8AR(4) AE(4) EOE(4) TCH(4) AR(3) ED(3) OHAE(3) 4ODCO(2) AM(2) ESCH(2) H(2) ODAE(2) OEOD(2) OROE(2) R(2) RTCD(2) SCOD(2) SD(2) TCAE(2) TCCH(2) TH(2) 2(1) 2AE(1) 2OE(1) 2OED(1) 2SCD(1) 2TCH(1) 4CH(1) 4ODAO(1) 4ODCOE(1) 4ODO(1) 4ODOP(1) 4OEDAR(1) 4OEDCCOE(1) 4OGD(1) 4OHAE(1) 4OHAR(1) 4OHCC2(1) 4OP(1) 8AEAR(1) 8AED(1) 8AROR(1) 8ETCR(1) 8GD(1) 8GR(1) 8OE(1) A(1) A2(1) AEOR(1) CH(1) DAR(1) DOEP(1) DOHAE(1) EAE(1) EEAIIR(1) ETC8AR(1) GH(1) GO(1) GOD(1) GR(1) HAE(1) HG4OD(1) O4OD(1) ODAN(1) ODGED(1) ODO(1) OE2AE(1) OEAM(1) OEDOR(1) OESH(1) OFOE(1) OHAR(1) OHOR(1) ONOE(1) OOR(1) OP(1) OPAE(1) OPOE(1) OPTAE(1) OPTCCD(1) OROIR(1) POE8AD(1) POETO(1) RAE(1) RAR(1) ROE(1) SCGD(1) SCO(1) SH(1) SOD(1) TAE(1) TAEOE(1) TAR(1) TC2(1) TCCCH(1) TCO(1) TCOD(1) TCOE(1) TCP(1) TD(1) TDCA(1) THZO(1) TOE(1) TOH(1) Suffix "S8G": (5) OE(4) OH(3) 8AE(2) E(2) 4OD(1) 4OE(1) 4OH(1) 4OP(1) 8(1) D(1) ODC(1) OF(1) OHAE(1) OP(1) P(1) POE(1) R(1) SOD(1) Suffix "SC8G": (210) E(23) 8(17) OE(14) G(8) 4OD(6) R(6) 2(5) OH(4) 4OP(3) H(3) OP(3) 4OH(2) OD(2) OEH(2) 4(1) 4O(1) 4OE(1) 4OHC8(1) 8AE(1) 8E(1) AE(1) CE(1) D(1) GD(1) O(1) OHAE(1) OPAE(1) P(1) POE8(1) Suffix "SCC8G": (33) E(5) 4OE(2) 8(2) R(2) 2(1) 4O(1) G(1) O(1) OE(1) OR(1) TP(1) Suffix "SCCG": (30) E(4) G(3) 8(2) OE(2) TETP(1) Suffix "SCG": (78) OE(8) E(6) G(4) H(3) O(3) OD(3) 2(2) 8(2) D(2) OH(2) 2OE(1) 4OD(1) 4OE(1) 4OH(1) 8D(1) G8AR(1) GDC(1) ODC(1) OED(1) OEDC(1) OG(1) OP(1) POE(1) R(1) SCCP(1) Suffix "SG": (9) ODC(2) 4OE(1) 4OF(1) 4OH(1) 8GD(1) AE(1) HOE(1) POE(1) Suffix "T8G": (19) 4OD(10) E(4) OD(3) OE(3) 8(2) D(2) P(2) 2(1) 2SD(1) 4CD(1) 4ODC(1) 4ODG(1) 4OE(1) 4OH(1) 4OP(1) AE(1) DC(1) DOE(1) EP(1) ETP(1) GE(1) H(1) IEP(1) OF(1) OH(1) OP(1) POE(1) POEAE(1) S(1) TCOE(1) Suffix "TC8G": (193) E(55) OE(24) OP(13) P(13) H(10) 4OP(9) 8(9) 4OD(7) 4OE(6) R(6) 2(5) 2OE(4) F(4) G(4) 4OH(3) OH(3) 4ODC(2) ED(2) OEF(2) OEP(2) OF(2) POE(2) 4O(1) 
4ODOE(1) 4OF(1) 4OR(1) 4P(1) 8AE(1) 8OE(1) 8OEF(1) AE(1) CF(1) D(1) EO(1) EP(1) GF(1) O(1) O4OF(1) O8(1) OD(1) OEOH(1) SCCP(1) SCP(1) TCP(1) TP(1) Suffix "TCC8G": (21) E(2) OE(2) 4OE(1) 8(1) 8OE(1) G(1) Suffix "TCCCG": (1) Suffix "TCCG": (34) E(4) OE(3) OP(2) 2OD(1) 8(1) G(1) H(1) O(1) R(1) Suffix "TCG": (74) E(18) OE(13) 4OE(8) OD(6) R(5) D(4) P(4) 2OE(3) 4OD(3) OH(3) OP(3) 4OH(2) 8(2) AE(2) H(2) 2(1) 4O(1) 4ODAE(1) 4ODC(1) 4OF(1) 4OP(1) 8AR(1) 8OE(1) DC(1) DZ(1) ED(1) EDE(1) G(1) GF(1) GS(1) OEH(1) OEOE(1) OR(1) TC8(1) TCOE(1) TOE(1) Suffix "TG": E(6) (5) 4OD(5) OE(5) 4OH(3) OD(2) OEE(2) OH(2) SCP(2) 2OE(1) 2P(1) 4CD(1) 4O(1) 4OE(1) 8(1) CP(1) DOR(1) GDCC(1) H(1) ODAE(1) ODC(1) OED(1) OPC(1) POE(1) R(1) SCCD(1) SCHZC8(1) TC(1) TC8(1) Suffix "Z8G": TD(5) H(2) SCD(2) TP(2) 2AESH(1) D(1) TCD(1) TH(1) Suffix "ZC8G": SD(6) TD(5) H(3) SCD(3) TH(2) 4D(1) 4OD(1) 8TD(1) D(1) ESCD(1) ETCD(1) ETP(1) SCH(1) SH(1) SOP(1) TCH(1) TCP(1) TP(1) Suffix "ZCCG": F(1) Suffix "ZCG": TD(9) SD(5) 4OD(4) H(3) SCD(3) SH(3) D(2) P(2) TCD(2) TCH(2) TP(2) 2H(1) 2OD(1) 4H(1) 4OH(1) AH(1) CP(1) ETCD(1) GD(1) GTCD(1) HOH(1) OD(1) OH(1) SCCH(1) TCCH(1) TF(1) TH(1) Suffix "ZG": TD(31) SD(23) TCD(21) TH(20) SCD(19) TCH(17) SCH(12) SH(12) ETCD(3) ESCD(2) OD(2) OETH(2) SCP(2) TP(2) 2D(1) 2SCD(1) 2TD(1) 2TH(1) 4H(1) 4OD(1) 4ODCTD(1) 8AD(1) 8SCD(1) D(1) ESD(1) ESH(1) ETH(1) GTCH(1) HTH(1) OEPOD(1) OETCD(1) OH(1) RAH(1) SCCD(1) SOD(1) TSCD(1) From looking at these distributions, it seems that the "S", "T", and "Z" are actually part of the roots; and that either the "G" suffix also occurs in words of unrelated classes, or there are more suffixes that end in "G" within this class. 
  Here again are the prefixes of "G" that are not in the other
  classes, this time sorted by the last letter of the prefix:

    (5) 2(1) A2(1) 4OHCC2(1) TC2(1)

    A(1) TDCA(1)

    4OD(58) OD(12) TCCD(11) SCD(11) OED(11) SCCD(8) POE8AD(1)
    OPTCCD(1) 2SCD(1) TCD(12) 4OED(4) RTCD(2) ED(3) 8AED(1) ODGED(1)
    2OED(1) 8GD(1) SCGD(1) 4OGD(1) HG4OD(1) O4OD(1) SCOD(2) TCOD(1)
    OEOD(2) GOD(1) SOD(1) SD(2) TD(1)

    OE(26) 8AE(7) E(7) AE(4) 2AE(1) OE2AE(1) TCAE(2) ODAE(2)
    4ODAE(6) EAE(1) HAE(1) OHAE(3) 4OHAE(1) DOHAE(1) OPAE(1) RAE(1)
    TAE(1) OPTAE(1) 2OE(1) 4OE(7) 8OE(1) 4OEDCCOE(1) 4ODCOE(1)
    TCOE(1) EOE(4) TAEOE(1) OFOE(1) ONOE(1) OPOE(1) ROE(1) OROE(2)
    TOE(1)

    4OH(26) OH(17) SCCH(7) SCH(6) TCH(4) H(2) CH(1) 4CH(1) TCCCH(1)
    TCCH(2) ESCH(2) 2TCH(1) GH(1) TOH(1) SH(1) OESH(1) TH(2)

    AM(2) OEAM(1)

    ODAN(1)

    4ODCO(2) 4ODAO(1) SCO(1) TCO(1) ODO(1) 4ODO(1) GO(1) POETO(1)
    THZO(1)

    TCP(1) DOEP(1) OP(1) 4OP(1) 4ODOP(1)

    OR(6) 8AR(4) R(2) AR(3) ETC8AR(1) DAR(1) 4OEDAR(1) 8AEAR(1)
    OHAR(1) 4OHAR(1) RAR(1) TAR(1) 8ETCR(1) GR(1) 8GR(1) EEAIIR(1)
    OROIR(1) OEDOR(1) AEOR(1) OHOR(1) OOR(1) 8AROR(1)

  There is a hint that "DG" and "HG" may be related suffixes.
  Indeed, there is a hint that the final "D" and "H" in most of
  these stems may be part of the suffix.

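
  Grouping the prefixes by their last letter, as in the listing
  above, can be sketched as follows. This is an illustrative
  stand-in, not the tool used for these notes; the function name is
  hypothetical, and ordering each group by decreasing count is a
  choice of this sketch (the listing above orders groups
  differently). The sample counts are copied from that listing.

```python
from collections import defaultdict


def group_by_last_letter(prefix_counts):
    """Group (prefix, count) pairs by the last letter of the prefix.

    The empty prefix gets the key "". Within each group, entries are
    ordered by decreasing count, then alphabetically.
    """
    groups = defaultdict(list)
    for prefix, count in prefix_counts.items():
        key = prefix[-1] if prefix else ""
        groups[key].append((prefix, count))
    for key in groups:
        groups[key].sort(key=lambda pc: (-pc[1], pc[0]))
    return dict(sorted(groups.items()))


counts = {"4OD": 58, "OD": 12, "TCD": 12, "4OH": 26, "OH": 17, "OE": 26, "": 5}
for letter, items in group_by_last_letter(counts).items():
    total = sum(n for _, n in items)
    line = " ".join(f"{p}({n})" for p, n in items)
    print(f"{letter!r:4} total={total:3d}  {line}")
```

  Printing a total per group makes it easy to see how much of the
  residual "G" class is carried by stems ending in "D" and "H".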
  So here is the new guess about the main class of words that end
  in "G":

    {DAE8G,HAE8G,
     8G,
     DC8G,HC8G,
     DCC8G,HCC8G,
     DCCC8G,HCCC8G,CCC8G,
     DCCCG,
     EDCCG,DCCG,HCCG,
     EDCG,DCG,HCG,
     DG,HG,EG,EDG,CCDG,CCHG,
     ES8G,DS8G,HS8G,
     ESC8G,DSC8G,HSC8G,SC8G,
     ESCC8G,SCC8G,TCC8G,CC8G,
     ESCCG,CCG,
     DSCG,HSCG,ESCG,
     ESG,HSG,
     ET8G,DT8G,HT8G,PT8G,
     ETC8G,DTC8G,HTC8G,PTC8G,TC8G,
     ETCC8G,
     ETCCG,PTCCG,DTCCG,
     ETCG,DTCG,HTCG,PTCG,DAETCG,DCTCG,
     ETG,DTG,HTG,DCTG,DAETG,
     DZ8G,HZ8G,
     DZC8G,HZC8G,
     DZCG,HZCG,PZCG,
     DZG,CDZG,CHZG
    }

  I tried to separate these suffixes with a giant "sed":

    /bin/rm .bax
    cat bio-c-evt.frq \
      | egrep -v '[%*]' \
      | egrep 'G$' \
      | /bin/sed -f split-G-suffs.sed \
      | /bin/sed -e 's/:$//' \
      | sort -b +2 -3 +0 -1nr \
      > .bax

  The results were not very good, so I decided to look closely at
  the most popular prefixes among the [G] words, which are
  [4O,O,G,8,2]:

    /bin/rm .bax
    cat bio-c-evt.frq \
      | egrep -v '[%*]' \
      | egrep '[ ](4O|O|G|8|2).*G$' \
      | /bin/sed \
          -e 's@[ ]4O@ 4O- @' \
          -e 's@[ ]O@ O- @' \
          -e 's@[ ]G@ G- @' \
          -e 's@[ ]8@ 8- @' \
          -e 's@[ ]2@ 2- @' \
      | sort -b +1 -2 +0 -1nr \
      > .bax

  Suffixes of "2":

    SC8G(5) TC8G(5) OETC8G(4) OEDCC8G(3) OEDCCG(3) OETCG(3)
    ODCC8G(2) SCG(2) AEDCCG(1) AEG(1) AESHZ8G(1) DCCG(1) DZG(1) G(1)
    HZCG(1) ODTCCG(1) ODZCG(1) OE8G(1) OEDC8G(1) OEDCG(1) OEDG(1)
    OEG(1) OEHCC8G(1) OESCG(1) OETG(1) PTG(1) SCC8G(1) SCDG(1)
    SCDZG(1) SDT8G(1) T8G(1) TCG(1) TCHG(1) TDZG(1) THZG(1) (0)

  Suffixes of "8":

    G(41) SC8G(17) TC8G(9) AEG(7) AE8G(4) ARG(4) AES8G(2) C8G(2)
    CC8G(2) DC8G(2) DCCG(2) GDCC8G(2) SCC8G(2) SCCG(2) SCG(2) T8G(2)
    TCG(2) ADZG(1) AEARG(1) AEDG(1) AESC8G(1) AETC8G(1) AIRG8G(1)
    AR8G(1) ARORG(1) ARTCG(1) CCC8G(1) CCG(1) DSCG(1) ESC8G(1)
    ETCRG(1) GDC8G(1) GDG(1) GDSG(1) GHC8G(1) GHCG(1) GRG(1)
    ODC8G(1) OEC8G(1) OEDC8G(1) OEFTC8G(1) OEG(1) OETC8G(1)
    OETCC8G(1) OETCG(1) OHCCG(1) S8G(1) SCDZG(1) TCC8G(1) TCCG(1)
    TDZC8G(1) TG(1) (0)

  Suffixes of "4O":

    DC8G(159) DCC8G(149) DCCG(83) DG(58) HC8G(46) DCG(39) HCC8G(39)
    HG(26) DT8G(10) PTC8G(9) ETCG(8) HCG(8) DTC8G(7) EG(7) HCCG(7)
CC8G(6) DAEG(6) DSC8G(6) ETC8G(6) DTG(5) DZCG(4) EDCC8G(4) EDCCG(4) EDG(4) 8G(3) DCCCG(3) DTCG(3) HTC8G(3) HTG(3) PSC8G(3) CCC8G(2) DCCC8G(2) DCOG(2) DCTC8G(2) DG8G(2) DO8G(2) ESCC8G(2) HAE8G(2) HSC8G(2) HTCG(2) 4ODCCG(1) 8CC8G(1) 8CCG(1) 8GDCG(1) CCDCC8G(1) CCG(1) D8G(1) DA8G(1) DAE8G(1) DAETCG(1) DAOG(1) DC28G(1) DC8C8G(1) DCOEG(1) DCT8G(1) DCTCG(1) DCTDZG(1) DGE88G(1) DGT8G(1) DOC8G(1) DOE8G(1) DOETC8G(1) DOG(1) DOPG(1) DS8G(1) DSCG(1) DT28G(1) DZC8G(1) DZG(1) E8G(1) EDARG(1) EDCCOEG(1) EDCG(1) EHCC8G(1) ES8G(1) ESC8G(1) ESCG(1) ESG(1) ET8G(1) ETCC8G(1) ETG(1) FSG(1) FTC8G(1) FTCG(1) GDG(1) HAEG(1) HAR8G(1) HARG(1) HC8SC8G(1) HCC2G(1) HCCC8G(1) HS8G(1) HSCG(1) HSG(1) HT8G(1) HZCG(1) P8G(1) PG(1) PS8G(1) PT8G(1) PTCG(1) RC8G(1) RCC8G(1) RCCG(1) RTC8G(1) SC8G(1) SCC8G(1) TC8G(1) TCG(1) TG(1) (0) (0) Suffixes of "G": DCC8G(10) SC8G(8) DCCG(7) HC8G(7) DC8G(5) HCCG(4) SCG(4) TC8G(4) HCC8G(3) SCCG(3) 8ARSCG(1) 8CC8G(1) 8G(1) DCCCG(1) DCCTG(1) DCSCG(1) DSC8G(1) DZCG(1) ET8G(1) FTC8G(1) FTCG(1) HCG(1) HG(1) ODG(1) OG(1) RG(1) ROHCG(1) SCAE8G(1) SCC8G(1) STCG(1) TCC8G(1) TCCG(1) TCDZCG(1) TCG(1) TCHZG(1) Suffixes of "O": HC8G(48) DC8G(43) DCC8G(32) EG(26) HCC8G(25) ETC8G(24) EDC8G(18) EDCCG(18) HCCG(18) HG(17) DCCG(16) ESC8G(14) EDCC8G(13) ETCG(13) HCG(13) PTC8G(13) DG(12) E8G(12) DCG(11) EDG(11) ESCG(8) DTCG(6) RG(6) EDCG(5) ETG(5) ES8G(4) HSC8G(4) 8G(3) DSCG(3) DT8G(3) EHC8G(3) ET8G(3) ETCCG(3) HAEG(3) HS8G(3) HTC8G(3) HTCG(3) PSC8G(3) PTCG(3) SCG(3) DAEG(2) DCSG(2) DSC8G(2) DTG(2) DZG(2) EDCCCG(2) EETG(2) EFTC8G(2) EHCC8G(2) EHSC8G(2) EODG(2) EPTC8G(2) ESCCG(2) ETCC8G(2) ETHZG(2) FTC8G(2) HSCG(2) HTG(2) PTCCG(2) ROEG(2) 4ODG(1) 4OFTC8G(1) 8TC8G(1) CC8G(1) CCC8G(1) CCG(1) CDCC8G(1) DA8G(1) DAE8G(1) DAETG(1) DANG(1) DCCC8G(1) DCS8G(1) DCSCG(1) DCTG(1) DGEDG(1) DOE8G(1) DOG(1) DTC8G(1) DZCG(1) E2AEG(1) EAMG(1) EAO8G(1) ECCC8G(1) EDCSCG(1) EDORG(1) EDSCG(1) EDTG(1) EHG8G(1) EHS8CG(1) EHTCG(1) EOEDCCG(1) EOETCG(1) EOHTC8G(1) EPODZG(1) ESCC8G(1) ESHG(1) ETCDZG(1) FOEG(1) 
    FS8G(1) FT8G(1) GSCG(1) HAE8G(1) HAES8G(1) HAESC8G(1) HARG(1)
    HC8CCG(1) HCCC8G(1) HCCO8G(1) HCOE8G(1) HOE8G(1) HORG(1) HT8G(1)
    HTR8G(1) HZCG(1) HZG(1) NOEG(1) ORG(1) PAEG(1) PAESC8G(1)
    PCTG(1) PG(1) POEG(1) PS8G(1) PSCG(1) PT8G(1) PTAEG(1) PTCCDG(1)
    RC8G(1) RO8G(1) ROIRG(1) RSCC8G(1) RTCG(1) SC8G(1) SCC8G(1)
    TC8G(1) TCCG(1)

  Common suffixes:

    cat bio-c-evt.txt \
      | tr ' ' '\012' \
      | egrep -v '[%*/=]' \
      | egrep '.' \
      > bio-c-evt.wds

    lines   words     bytes file
    ------ ------- --------- ------------
      5928    5928     32257 bio-c-evt.wds

    /bin/rm .sf-join.frq
    /bin/touch .sf-join.frq
    set ofmt = "0"
    @ i = 1
    set noglob
    set prefs = ( \
      4OD 4OE 4OH 4O'[^DEH]' \
      OD OED OE8 OE'[^D8]' OH OP OR O'[^DEHPR]' \
      S T \
      '[^4OST]' \
    )
    foreach f ( ${prefs} )
      echo "${i} ${f}"
      /bin/rm .sf-part-${i}.frq
      cat bio-c-evt.wds \
        | egrep '^'"${f}" \
        | /bin/sed \
            -e 's@^'"${f}"'\(.*\)$@-\1@' \
        | sort \
        | uniq -c \
        > .sf-part-${i}.frq
      /n/gnu/bin/join -a 1 -a 2 -j1 2 -e 0 -o "${ofmt},1.1" \
        .sf-part-${i}.frq .sf-join.frq > .sf-tmp.frq
      @ i = ${i} + 1
      set ofmt = "${ofmt},2.${i}"
      mv .sf-tmp.frq .sf-join.frq
    end
    unset ofmt i noglob

    ( echo "# ${prefs}"; cat .sf-join.frq ) \
      | add-counts \
      > .sf-suffs.frq

  From the output, it seems indeed there are two classes of words,
  [A,O][EKLMNR] and [G].

  Split bio-c-evt.wds into the two classes plus "misc":

    egrep '[AO][EKLMNR]$' bio-c-evt.wds > bio-c-evt-eklmnr.wds
    egrep 'G$' bio-c-evt.wds > bio-c-evt-g.wds
    egrep -v '([AO][EKLMNR]|G)$' bio-c-evt.wds > bio-c-evt-other.wds

    lines   words     bytes file
    ------ ------- --------- ------------
      3215    3215     19119 bio-c-evt-g.wds
      2347    2347     11361 bio-c-evt-eklmnr.wds
       366     366      1777 bio-c-evt-other.wds
      5928    5928     32257 bio-c-evt.wds

  Next step: AutoAnalysis on each category.

97-07-06 stolfi
===============

  Reading again the description of Landini's file, I found out that
  the `%' and `!' marks should have been handled differently from
  the way I did.
  Also, looking at the actual shape of the characters, I realized
  that the FSG encoding was not very good for my purposes, since it
  assigns completely different codes to glyphs which may be just
  calligraphic variations of the same grapheme.

  Thus I decided to redo everything from the beginning, using a more
  analytical encoding. I considered using Jacques Guy's
  "Neo-Frogguy" or "Gui2" encoding, but even that is a bit too
  synthetic: for example, his <2> should be "i'", and his <9> should
  be `c)', for consistency. (The statistics on the occurrence of
  repeated s apparently confirm this choice.)

  Thus I decided to define my own "super-analytic" or "SA" encoding.
  The idea is to break all characters down to individual "logical"
  strokes, and use one (computer) character to encode each stroke.

  There is some question as to what is a logical stroke, and when
  two strokes are different. Obviously, the definition of a stroke
  must include not only its shape but also the way it connects to
  the neighboring strokes; and, given the irregularity of
  handwritten glyphs, that may be hard to decide.

  For instance, FSG's [A] character can be broken down into two
  strokes, shaped like the [C] and [I] glyphs. Supposedly, the
  difference between an [A] and a [CI] is that in the former the
  strokes are connected into a closed shape. Is this difference
  significant?

  I checked the occurrences of [CI], [CM], and [CN] in the
  interlinear file. Two things are curious. First, these
  combinations are extremely rare. Second, a good many of them are
  transcribed differently by Currier and the FSG: where one has
  [CIIR] the other often has [AIR], and vice versa. The same holds
  for [CM] versus [AN], etc.

  In light of these observations, I have decided to treat all
  occurrences of [A] as [CI]. If the two are indeed different, that
  will be just one more ambiguity added to the inherent ambiguity of
  natural language; so it cannot make the decipherment task more
  difficult.
  Confusing the two will change the letter frequencies, it is true;
  but, since the language does not appear to be a standardized one,
  there is not much information we can extract from absolute letter
  frequencies. The methods we hope to use, such as automaton
  analysis, are not significantly disturbed by collapsing letters.
  On the other hand, if [A] and [CI] are the same grapheme, using
  different encodings will seriously confuse statistics, especially
  if the spacing depends on the immediate context.

  I took the file bio-c-evt.txt and removed all `%' from it. Many of
  them should have been spaces, but so what. I also added a space
  before and after each line; that may be helpful when doing greps.

    lines   words     bytes file
    ------ ------- --------- ------------
       765    7227     39823 bio-c-evt.txt

  Next I applied a recoding:

    cat bio-c-evt.txt \
      | fsg2jsa \
      > bio-c-jsa.txt

  Next I extracted words:

    cat bio-c-jsa.txt \
      | tr ' ' '\012' \
      | egrep '.' \
      > bio-c-jsa.wds

    cat bio-c-jsa.wds \
      | sort \
      | uniq \
      > bio-c-jsa.dic

    cat bio-c-jsa.wds \
      | sort \
      | uniq -c \
      | sort +0 -1nr \
      > bio-c-jsa.frq

    lines   words     bytes file
    ------ ------- --------- ------------
       765    7227     61678 bio-c-jsa.txt
      7227    7227     60144 bio-c-jsa.wds
      1687    1687     17398 bio-c-jsa.dic
      1687    3374     30894 bio-c-jsa.frq

  Next I separated the good words:

    cat bio-c-jsa.wds \
      | egrep '^[a-z+^]*$' \
      > bio-c-jsa-gut.wds

    cat bio-c-jsa.dic \
      | egrep '^[a-z+^]*$' \
      > bio-c-jsa-gut.dic

    bool 1-2 bio-c-jsa.dic bio-c-jsa-gut.dic \
      > bio-c-jsa-bad.dic

    lines   words     bytes file
    ------ ------- --------- ------------
        34      34       270 bio-c-jsa-bad.dic
      6420    6420     57560 bio-c-jsa-gut.wds
      1653    1653     17128 bio-c-jsa-gut.dic

  Next I built the automaton:

    cat bio-c-jsa-gut.dic \
      | nice MaintainAutomaton \
          -add - \
          -dump bio-c-jsa-gut.dmp

    strings  letters   states   finals     arcs  sub-sts lets/arc
    -------- -------- -------- -------- -------- -------- --------
        1653    15475     1373      178     2609     2234    5.931

  Note that the efficiency (lets/arc) increased, from 4.223 to
  5.931, even though the words got considerably longer!

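
  The comparison can be reproduced from the two automaton summary
  lines in these notes. The sketch below assumes that the "lets/arc"
  column is simply letters divided by arcs, which matches both
  printed values; the helper name is hypothetical.

```python
def ratio(numerator, denominator):
    """Plain quotient, kept as a helper so each use is labeled below."""
    return numerator / denominator


# Figures copied from the two MaintainAutomaton summary lines.
fsg = {"strings": 1381, "letters": 6896, "arcs": 1633}
jsa = {"strings": 1653, "letters": 15475, "arcs": 2609}

for name, a in (("FSG", fsg), ("JSA", jsa)):
    lets_per_arc = ratio(a["letters"], a["arcs"])        # ASSUMED meaning of lets/arc
    mean_word_len = ratio(a["letters"], a["strings"])    # average letters per word
    print(f"{name}: lets/arc = {lets_per_arc:.3f}, mean word length = {mean_word_len:.2f}")
```

  The mean word length nearly doubles under the stroke encoding,
  while the number of arcs grows by a smaller factor, which is what
  drives the higher letters-per-arc figure.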
I ran AutoAnalysis, looking for unproductive states (2 words or less) and strange words: nice AutoAnalysis \ -load bio-c-jsa-gut.dmp \ -unprod bio-c-jsa-gut-1-unp.sts \ -maxUnprod 2 \ -unprodSugg bio-c-jsa-gut-1-unp.sugg 546 unproductive states 826 strange words (with repetitions) listed 389 strange words (without repetitions) listed I redid again, considering a state unproductive if it is used by only one word): nice AutoAnalysis \ -load bio-c-jsa-gut.dmp \ -unprod bio-c-jsa-gut-2-unp.sts \ -maxUnprod 1 \ -unprodSugg bio-c-jsa-gut-2-unp.sugg 266 unproductive states 266 strange words (with repetitions) listed 96 strange words (without repetitions) listed Here are the strange words: ccljtciix ccqgtccy cgciiiivciiscyixcy cgciiiixciixcgcy cgciixcyiscg cgciixljccccyiscy cgcstoixcycg cgctcljtccgcy cgcyixctccs cgcyqoljoix cgixctciscy cgoisctccgctccy cgoixcccsoixctccy cicstcyqjcccg ciixcstclgccy cqgciixoiis cqjtcyljoisoix csciixcstcqjtcgcy csciixctqjccgcyqjciis cscstoixcstccqjtcy csolgctciij csqjoixqgctcy cstccgljcoixcy cstcoisoixljctcgcy cstixctcis cstocqgtccgcy cstqjciixctcccgcy ctccgciiivcgci ctccgcyqjctcg ctccicgis ctcciixisois ctccycsctcy ctccyqcyqjciiiiv ctcoixqjciiivcscy ctcstccljtcy ctixctqgcstcccy ctoixljccccy cycgoixcstcccy cycqgciiiiv cyisoqjccy cyljccgcicqgtcy cyqjcccsoixcgcy cyqjctixctccgcy cyqjctoisoixljcy cyqoixcyqois iixqgctcgcy isoisciisoix ixcsciiis ixctccgcyisqgctccgcy ixixcgcyis ixixciiiiscy ixlgctcciix ixoixljccgcyljciiivoix lgctccgcyljciiscy ljccgcyqgocstcy ljciixcyctccy ljoisoiiivcy ljoqjciixcy ocstcqgoixcs ocycstccy oisciiisciiso oixcstciixcscy olgciixcstcljcy olgctcivcycs oljciixctccgcyqjoiscy oqjciisqgcccgcy qcljcyqjccgciis qcsoixljcccgcy qcyisctcs qgcgciixcstois qgciixcgciisciiisciix qgcstctccgciix qgcstoixqgctclgtcgcy qgctccyljccciis qgoisciixcstcy qgoixcccgciisciiv qgoixqjccstoix qgoqjctoljciis qjccciixciiiv qjccyqjccj qjcoixcstcoix qjctcgoixqjcgcy qjctciixoixljcc qjcyqoljcy qjocqjtccy qjoisoixcstcscgcy qjoixoiscstcccy qoljciixljcoix 
qoljcyisixcstc qoljcyixcgcgcy qoqjcgcgcyciis qoqjcyqjcyqjois qoqjixoixljciix qoqjoisciixoij qoqoljcccy qqgoixljciiiv I looked for popular states and radicals: nice AutoAnalysis \ -load bio-c-jsa-gut.dmp \ -classes bio-c-jsa-gut-1.cls \ -minClassPrefs 5 \ -radicals bio-c-jsa-gut-1.rad 77 lexical classes found The output wasn't very illuminating, though. I looked for productive states instead: AutoAnalysis \ -load bio-c-jsa-gut.dmp \ -prod bio-c-jsa-gut-1-prd.pst \ -minProductivity 1 \ -maxPrefSize 30 \ -maxSuffSize 30 Here is the result: nprefs nsuffs nwords prodty prefs/suffs ------- ------- ------- ------- ----------------- 40 2 80 39 { qoqjciisc, qoljcctcc, qoljcccc, ... }:{ gcy, y } 35 2 70 34 { qoqjcccs, qoljcio, qoixoix, ... }:{ (), cy } 24 2 48 23 { qgciii, oljciiii, oisciiii, ... }:{ iv, v } 15 2 30 14 { qoljccii, ctccgcyi, cgctoi, ... }:{ s, x } 7 3 21 12 { qoixctcc, qoccc, oqjccc, ... }:{ cgcy, gcy, y } 12 2 24 11 { qoljccgci, qljo, qjoixcstco, ... }:{ is, ix } 11 2 22 10 { oixqjcc, ixqjcc, cgcyljcc, ... }:{ cgcy, gcy } 10 2 20 9 { cyctccgc, qoqjciixcgc, ... }:{ iis, y } 4 4 16 9 { qoixcstcc, ctcqjtc, qoqjcstc, ... }:{ cgcy, cy, gcy, y } 9 2 18 8 { qoqjciixcg, qoljciixcg, ... }:{ ciis, cy } 5 3 15 8 { ctccljtc, cstcljcc, csoixctcc, ... }:{ cy, gcy, y } 5 3 15 8 { qocljtc, qoctc, oisctcc, ... }:{ cgcy, cy, y } 8 2 16 7 { qocgcc, qcljct, ixocc, ... }:{ cgcy, cy } 8 2 16 7 { qcljcc, ctccljc, cstccqjc, ... }:{ cy, y } 7 2 14 6 { qoqgcst, oqjciixcst, qoixqjc, ... }:{ ccgcy, cgcy } 4 3 12 6 { qoiscc, ljctc, qoljcstc, ... }:{ cgcy, cy, gcy } 4 3 12 6 { oljcic, ixctccc, cqjtcc, ... }:{ gcy, s, y } 4 3 12 6 { qjciii, isoii, isciii, csoii }:{ iiv, iv, v } 4 3 12 6 { qoct, oisctc, ixctccljt, ... }:{ ccgcy, ccy, cy } 6 2 12 5 { oqgcstccgc, oljcccgc, qjoixcgc, ... }:{ iix, y } 6 2 12 5 { oixoii, cstcljciii, qoiscii, ... }:{ iiv, iv } 6 2 12 5 { cgljcc, qgoixljcc, oixljcstc, ... }:{ cy, gcy } 5 2 10 4 { qgoixctcc, csciixctcc, ... 
}:{ g, gcy } 2 5 10 4 { cgctcc, qoqjctc }:{ cgcy, cy, gcy, oix, y } 5 2 10 4 { qgoixljc, oixljcst, ljoixct, ... }:{ ccy, cgcy } 5 2 10 4 { qoljciiiv, ctis, oqjcoix, ... }:{ (), cgcy } 5 2 10 4 { qoljciixct, ocgct, ixljct, ... }:{ ccgcy, ccy } 3 3 9 4 { qoixciii, oqjciii, oixljciii }:{ iv, s, v } 2 4 8 3 { cgcc, cgoixctc }:{ ccgcy, cgcy, cy, gcy } 4 2 8 3 { qjoixcg, qgoixcstcg, cstcg, ... }:{ ciix, cy } 2 4 8 3 { qoqgctc, ljcstc }:{ cgcy, cy, gcy, oix } 4 2 8 3 { qoisci, ctcoixljci, csoixljci, ... }:{ iiiv, iiv } 4 2 8 3 { oljccgcy, oixljcccy, ctoixo, ... }:{ (), is } 4 2 8 3 { ocljt, ixljccg, ctoixct, ... }:{ ccy, cy } 4 2 8 3 { qoljctcgcy, oqjccgcy, ... }:{ (), ix } 2 4 8 3 { qoqjcst, ctcqgt }:{ ccgcy, ccy, cgcy, cy } 4 2 8 3 { qoqjciiiv, qjccgcy, oixois, ... }:{ (), oix } 2 4 8 3 { oixqjcii, ocgcii }:{ iiv, iv, s, x } 2 4 8 3 { oixqjci, ocgci }:{ iiiv, iiv, is, ix } 3 2 6 2 { oqjccgciii, ctcqjcii, ... }:{ iv, s } 2 3 6 2 { csciixctc, cgoixcstc }:{ cg, cgcy, oix } 2 3 6 2 { ixctccgcii, ciisoi }:{ s, scy, x } 2 3 6 2 { ixctccgci, ciiso }:{ is, iscy, ix } 3 2 6 2 { oqjoixcgcy, oqgctccgcy, ... }:{ (), ixctccy } 3 2 6 2 { oqoljc, ctcoljc, ctccyljc }:{ iiiv, y } 3 2 6 2 { oqolj, ctcolj, ctccylj }:{ ciiiv, cy } 2 3 6 2 { qoljctcc, ixljccc }:{ g, gcy, y } 2 3 6 2 { qoljcst, oqjcst }:{ ccgcy, ccy, cgcy } 3 2 6 2 { qoixqj, qocst, oiscst }:{ cccgcy, ccgcy } 2 3 6 2 { oqgcstccg, oljcccg }:{ (), ciix, cy } 2 3 6 2 { qoljcoi, oqgoi }:{ s, x, xcy } 2 3 6 2 { qoljcs, oqjcs }:{ tccgcy, tccy, tcgcy } 97-07-07 stolfi =============== Some interesting patterns are apparent above, but things may become clearer if we remove the garbage. 
  Meanwhile, here is a count of the digraphs in the "good" words
  (counting repeated words). The unlabeled first row and first
  column stand for the word break:

                  o     c     t     i     q     l |     v     x     y     j     g     s    TOT
        ----- ----- ----- ----- ----- ----- -----   ----- ----- ----- ----- ----- -----  -----
            0  1363  2574     0   502  1874   107 |     0     0     0     0     0     0   6420
      o    34     4   112     0  1680   635  1430 |     0     0     0     0     0     0   3895
      c     7   172  3922  1447  2002   197   291 |     0     0  3764    12  2728  1445  15987
      t     7    96  2696     0    23    18    24 |     0     0     0     0     0     0   2864
      i     6     5    31     0  3395     3     3 |   943  2349     0    57     0   913   7705
      q     5  1622    35     0     0     2     4 |     0     0     0   969   215     0   2852
      l     0     0     0     0     0     0     0 |     0     0     0  2185    36     0   2221
        ----- ----- ----- ----- ----- ----- -----   ----- ----- ----- ----- ----- -----  -----
      v   912    10    18     0     3     0     0 |     0     0     0     0     0     0    943
      x  1085   159   757     0    22    50   276 |     0     0     0     0     0     0   2349
      y  3510     7    72     0    37    66    72 |     0     0     0     0     0     0   3764
      j    84   138  2658   319    23     0     1 |     0     0     0     0     0     0   3223
      g    78   123  2729    25    14     2     8 |     0     0     0     0     0     0   2979
      s   692   196   383  1073     4     5     5 |     0     0     0     0     0     0   2358
        ----- ----- ----- ----- ----- ----- -----   ----- ----- ----- ----- ----- -----  -----
    TOT  6420  3895 15987  2864  7705  2852  2221 |   943  2349  3764  3223  2979  2358  57560

  Next-symbol probabilities (× 99), where "." marks an entry that
  rounds to zero:

                  o     c     t     i     q     l |     v     x     y     j     g     s    TOT
        ----- ----- ----- ----- ----- ----- -----   ----- ----- ----- ----- ----- -----  -----
            .    21    40     .     8    29     2 |     .     .     .     .     .     .     99
      o     1     .     3     .    43    16    36 |     .     .     .     .     .     .     99
      c     .     1    24     9    12     1     2 |     .     .    23     .    17     9     99
      t     .     3    93     .     1     1     1 |     .     .     .     .     .     .     99
      i     .     .     .     .    44     .     . |    12    30     .     1     .    12     99
      q     .    56     1     .     .     .     . |     .     .     .    34     7     .     99
      l     .     .     .     .     .     .     . |     .     .     .    97     2     .     99
        ----- ----- ----- ----- ----- ----- -----   ----- ----- ----- ----- ----- -----  -----
      v    96     1     2     .     .     .     . |     .     .     .     .     .     .     99
      x    46     7    32     .     1     2    12 |     .     .     .     .     .     .     99
      y    92     .     2     .     1     2     2 |     .     .     .     .     .     .     99
      j     3     4    82    10     1     .     . |     .     .     .     .     .     .     99
      g     3     4    91     1     .     .     . |     .     .     .     .     .     .     99
      s    29     8    16    45     .     .     . |     .     .     .     .     .     .     99
        ----- ----- ----- ----- ----- ----- -----   ----- ----- ----- ----- ----- -----  -----
    TOT    11     7    27     5    13     5     4 |     2     4     6     6     5     4  57560

  Previous-symbol probabilities (× 99):

                  o     c     t     i     q     l |     v     x     y     j     g     s    TOT
        ----- ----- ----- ----- ----- ----- -----   ----- ----- ----- ----- ----- -----  -----
            .    35    16     .     6    65     5 |     .     .     .     .     .     .     11
      o     1     .     1     .    22    22    64 |     .     .     .     .     .     .      7
      c     .     4    24    50    26     7    13 |     .     .    99     .    91    61     27
      t     .     2    17     .     .     1     1 |     .     .     .     .     .     .      5
      i     .     .     .     .    44     .     . |    99    99     .     2     .    38     13
      q     .    41     .     .     .     .     . |     .     .     .    30     7     .      5
      l     .     .     .     .     .     .     . |     .     .     .    67     1     .      4
        ----- ----- ----- ----- ----- ----- -----   ----- ----- ----- ----- ----- -----  -----
      v    14     .     .     .     .     .     . |     .     .     .     .     .     .      2
      x    17     4     5     .     .     2    12 |     .     .     .     .     .     .      4
      y    54     .     .     .     .     2     3 |     .     .     .     .     .     .      6
      j     1     4    16    11     .     .     . |     .     .     .     .     .     .      6
      g     1     3    17     1     .     .     . |     .     .     .     .     .     .      5
      s    11     5     2    37     .     .     . |     .     .     .     .     .     .      4
        ----- ----- ----- ----- ----- ----- -----   ----- ----- ----- ----- ----- -----  -----
    TOT    99    99    99    99    99    99    99 |    99    99    99    99    99    99  57560

  Note that the stroke `l' is always followed by either `j' or `g',
  hence `lj' and `lg' should be single letters.

  Note also that there are two clearly different kinds of strokes,
  "body" B = {`c',`o',`t',`i',`q',`l'} and "limb" L =
  {`v',`x',`y',`j',`g',`s'}. If we reduce the digraph count matrix
  to these two classes, plus word break W, we get

    cat bio-c-jsa-gut.wds \
      | tr 'cotiqlvxyjgs' 'BBBBBBLLLLLL' \
      | count-digraph-freqs

  Digraph counts:

                  B     L
        ----- ----- -----
            .  6420     .
      B    59 19849 15616
      L  6361  9255     .

  Next-symbol probabilities (× 99):

                  B     L
        ----- ----- -----
            .    99     .
      B     .    55    44
      L    40    59     .

  Previous-symbol probabilities (× 99):

                  B     L
        ----- ----- -----
            .    18     .
      B     1    55    99
      L    98    26     .

  Note that every word begins with a body stroke; this was expected
  from the definition of the limb strokes (they can be recognized
  only by their relationship to a previous stroke). Note also that a
  limb stroke is never followed by another limb stroke; this too is
  not wholly unexpected.

  The surprise is that almost no words *end* in a body stroke: only
  59 of the 6420 words do. The least rare body stroke in word-final
  position is `o' (34 of the 59 cases).

  Here are all the words that end in body strokes, in context.
(The "<<" marks the error) ctix ois cstcoixo << // ciix ciix ctccgciiivcgci << // qoljciiiv qjcy ixcstcgcyqo << // qoqjcccgcy ixo << // oljccgciis cgci << // qoqjciix ctoixo << // cgciiscy cgciixo << // cgcy cgciiiiv ctcqjt << // oljcccg qoljciixoiso << // qoljciix qoljccgcy ixo << // cstcccgcy ixctccgcy ixc << // cstccgcy qoljcyisixcstc << // cstcljtcy oisciiisciiso << // isctccgcy qoqjcccgcy ixo << // qoljcccy qoqjcccgcy ixoc << // ljctoix qjctciixoixljcc << cylgctccy isctcs ctcc << oixcs ciiiiv csljciix oixcstccy oixcstcc << qoixcstcccy qoqjcccy cyctcccgcy csciiivo oix ctc << cgcy qoljciis csciiiiv cstccgcci << cstccqjtcy ctccy ixctcgcy cgctcgcy cgci << qciis ixljocgciix oqgci << ljoisoixis // qcqjtcccg cstcgcy qci << oixljcccy cgciiiv // o << cgctcccgcy qoixctccy cyctccciis o << oiiiv occcgcy ctccy qoljciiiv o << isoiiiiv // ciiiv isciiiv o << ljciiv ctixciiiiiv // cycstccis ciiiiv o << cyljcccgcy qoljcccgcy // csciiiv oljci*o << ctccgcy ixljctccgcy cstccqjtcy qoljcio << ixois // cgcicljtcy ixljciijo << cyljcccy ixcstccy cstccyljcccgcy ixljo << oqgctccgcy qoix oixciiiv cstcoix qo << qoljciix cstcivix // qoixctccy qo << ixctccg qoixljcy ctcccgcy qo << ctciis ciiiiv // oljcccy ixctccgcy qo << oixciiiv oqj // cycstccgcy qo << ois oljciiiv cstcqjtcy ctcocy qo << qoljccy cgciixciiiiv ctcccgcy qoix oqo << qoljciiiv oixcstcciij // qocgctccg oixqo << cgciis ctccljto ixoixoix cycstcccyqo << isciis oix ctcccy oixqo cgciis ctccljto << ixoixoix // ixcsto << qoljccy ixcstccgcy cyctcccgcy csciiivo << oix ctc cgcy qoljciis qoljctcgcy ctcqjccy ixo << qoljccgcy qoljciiv qoljctcgcy ctccy ixo << ctcljcy oix qois *cccy ixctccy ixo << cycgciiiv cstccy csciiiisciix csciix cgciixo << qjciiiv cgciiscy cgciixo // oisoix ccccsciix oixo << qjcoix oiscy // // q << ljciiiv cstccqjcy qoljciiiv // q << ljccccy cstccgcy qoljcccgcy // q << qoqjccgcstccgcy ctqjciiis oqgctccgcy q << qjciix ctccgcy ctcqgctccgcy csciix ctccgcy cstcq << lgctois qoqjcicsoixljcy qoljciis cstcccgcy ixct << cstoljciiiv 
ct* // ctccy ctcljt << cstcoqccccy ctoix // cgciiiiv cst << qoixctccy oqjciixcy qoljciix cst << oixcstciixcscy // // qoljccst << qoljccgcy oqjccgcy // cgciiiv ctccy ixcst << cgciiiiv ctccy //

Those of the first group appear to be interference by the line break. (Note that the manuscript does not appear to use any hyphenation mark. Either words are not broken across lines, which would be unusual, or they are broken without any extra marks, which would produce the word-final body strokes seen here.)

Those of the second group appear to be due to bogus word breaks in the transcription (e.g. between the `q' and `l') or to transcription errors.

An interesting observation from the body/limb frequency tables above is that the transition probabilities from body stroke to body and limb are respectively 55% and 45%. Thus, if the limb strokes mark the end of a syllable (or letter?), then the average number of body strokes in a syllable is slightly over 2: a run of body strokes that ends with probability 0.45 at each stroke has mean length 1/0.45, about 2.2. (Considering that we are counting each "i" as a body stroke, the correct number may well be precisely 2.)

I decided that, before spending more time in the analysis, I must first prepare a "corrected" interlinear where discrepancies between FSG and Currier are resolved taking into account the probabilities above.

The idea is to make a dictionary of 5-tuples, and try to use it to decide on the corrections. Namely, define the context of a letter occurrence in a text as its four nearest letter occurrences (two on each side). We can represent a context in sed-like notation as wx.yz where the "." is the position of the central letter. We scan some training text, collecting for each possible context the frequency distribution of the middle letter. At the end, if, for a given context wx.yz, there is some central letter t which is more likely than all the others combined, we output a correction rule of the form wx?yz -> wxtyz.

So, here is the work.
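The rule-generation idea described above can be sketched in sh and awk as follows. This is only an illustration, not the actual generate-fix-patterns script, whose source is not shown in these notes. The function name gen_fix_patterns is hypothetical, and the sketch assumes that "?" marks an unreadable letter and that contexts contain no sed metacharacters (a context containing "." or "*" would need escaping first).

```shell
# Hypothetical stand-in for generate-fix-patterns (illustrative only).
# Usage: gen_fix_patterns MINOCC < training.txt > rules.sed
gen_fix_patterns() {
  awk -v MINOCC="${1:-10}" '
    {
      n = length($0)
      for (i = 3; i <= n - 2; i++) {
        mid = substr($0, i, 1)
        if (mid == "?") continue      # assumed marker for an unreadable letter
        # context wx.yz: two letters before and two after the middle one
        ctx = substr($0, i - 2, 2) SUBSEP substr($0, i + 1, 2)
        tot[ctx]++
        cnt[ctx, mid]++
      }
    }
    END {
      for (key in cnt) {
        split(key, p, SUBSEP)         # p[1] = wx, p[2] = yz, p[3] = t
        ctx = p[1] SUBSEP p[2]
        # emit a rule only if t outnumbers all other middle letters combined
        if (tot[ctx] >= MINOCC && 2 * cnt[key] > tot[ctx])
          printf "s/%s?%s/%s%s%s/g\n", p[1], p[2], p[1], p[3], p[2]
      }
    }
  ' | LC_ALL=C sort
}
```

The output is a sed script, so it can be applied with "sed -f" exactly as the .fixit.sed file is used below. Contexts seen fewer than MINOCC times produce no rule.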
First, I generated a training data set: cat bio-m-evt.evt \ | egrep '^<.*;[FC]> ' \ | grep -v '[][%*_]' \ | sed \ -e 's/<.*;[FC]> */ /g' \ -e 's/{[^}]*}//g' \ -e 's/\!//g'\ > .train.txt lines words bytes file ------ ------- --------- ------------ 858 858 42866 .train.txt Next, I generated the correction patterns from it: cat .train.txt \ | generate-fix-patterns -vMINOCC=10 \ > .fixit.sed lines words bytes file ------ ------- --------- ------------ 592 688 10219 .fixit.sed The parameter MINOCC is the minimum number of times a context must occur before we try to generate a correction rule for it. Next, I generated a "consensus" interlinear file: cat bio-m-evt.evt \ | make-consensus-interlin \ > bio-x-evt.evt I extracted the consensus text from it: cat bio-x-evt.evt \ | egrep '^<.*;J> ' \ | sed \ -e 's/{[^}]*}//g' \ -e 's/[\!]//g' \ > bio-j-evt-raw.evt I applied the corrections: cat bio-j-evt-raw.evt \ | sed -f .fixit.sed \ > bio-j-evt.evt Now let's extract the words and check how many good ones we got: cat bio-j-evt.evt \ | sed \ -e 's/<.*;[A-Z]> *//g' \ -e 's/- *$/.\/\//g' \ -e 's/= *$/.\/\/.=/g' \ | tr '.' '\012' \ | egrep '.' \ > bio-j-evt.wds cat bio-j-evt.wds | sort | uniq -c | sort +0 -1nr > bio-j-evt.frq cat bio-j-evt.wds | sort | uniq > bio-j-evt.dic lines words bytes file ------ ------- --------- ------------ 7216 7216 39223 bio-j-evt.wds 1761 1761 12154 bio-j-evt.dic I extracted the good words: cat bio-j-evt.wds | grep -v '?' 
> bio-j-evt-gut.wds cat bio-j-evt-gut.wds | sort | uniq > bio-j-evt-gut.dic cat bio-j-evt-gut.wds | sort | uniq -c | sort +0 -1nr > bio-j-evt-gut.frq lines words bytes file ------ ------- --------- ------------ 6188 6188 31705 bio-j-evt-gut.wds 1085 1085 6532 bio-j-evt-gut.dic I created an automaton for bio-j-evt-gut.dic: cat bio-j-evt-gut.dic \ | nice MaintainAutomaton \ -add - \ -dump bio-j-evt-gut.dmp strings letters states finals arcs sub-sts lets/arc -------- -------- -------- -------- -------- -------- -------- 1085 5447 422 90 1263 1027 4.313 I looked for unproductive states: nice AutoAnalysis \ -load bio-j-evt-gut.dmp \ -unprod bio-j-evt-gut-1-unp.sts \ -maxUnprod 1 \ -unprodSugg bio-j-evt-gut-1-unp.sugg 46 unproductive states 46 strange words (with repetitions) listed 31 strange words (without repetitions) listed // 2PTG 42OEDCC8G 4GDAM 4O4ODCCG 4ODGE88G 8ARTCC8AE 8OEDCC2OE CPAEOIR DOHAEG EODAK GCPAM GDT8AR HG4ODG HOROES28G HT8OEH8G ODAROEOK OEPODZG OGSCG OHTOHAR OSCPOE2 P8AESOR PGDC8G PODAN POECC8ARAL POEHCSOE PSAROE RTCAE8 TCIROR TETPSCCG TOEDCCCG I removed these words and tried again: cat bio-j-evt-gut-1-unp.sugg \ | sort -u \ | bool 1-2 j-gut.dic - \ > bio-j-evt-cln-1.dic cat bio-j-evt-cln-1.dic \ | nice MaintainAutomaton \ -add - \ -dump bio-j-evt-cln-1.dmp strings letters states finals arcs sub-sts lets/arc -------- -------- -------- -------- -------- -------- -------- 1054 5237 365 87 1176 950 4.453 nice AutoAnalysis \ -load bio-j-evt-cln-1.dmp \ -unprod bio-j-evt-cln-1-unp.sts \ -maxUnprod 1 \ -unprodSugg bio-j-evt-cln-1-unp.sugg 4 unproductive states 4 strange words (with repetitions) listed 3 strange words (without repetitions) listed 42AN HSCODCC8G CPTG I removed these and tried again: cat bio-j-evt-cln-1-unp.sugg \ | sort -u \ | bool 1-2 bio-j-evt-cln-1.dic - \ > bio-j-evt-cln-2.dic cat bio-j-evt-cln-2.dic \ | nice MaintainAutomaton \ -add - \ -dump bio-j-evt-cln-2.dmp strings letters states finals arcs sub-sts lets/arc -------- -------- 
-------- -------- -------- -------- -------- 1051 5220 360 87 1167 944 4.473

    nice AutoAnalysis \
      -load bio-j-evt-cln-2.dmp \
      -unprod bio-j-evt-cln-2-unp.sts \
      -maxUnprod 1 \
      -unprodSugg bio-j-evt-cln-2-unp.sugg

    0 unproductive states
    0 strange words (with repetitions) listed
    0 strange words (without repetitions) listed

I recoded it into the "super-analytic" encoding, but this time treating `qj', `qg', `lj', `lg' as single letters (`h', `k', `f', `p' respectively):

    cat bio-j-evt.wds \
      | fsg2jsa \
      | jsa2hoc \
      > bio-j-hoc.wds

    cat bio-j-hoc.wds | sort | uniq > bio-j-hoc.dic
    cat bio-j-hoc.wds | sort | uniq -c | sort +0 -1nr > bio-j-hoc.frq

     lines   words     bytes file
    ------ ------- --------- ------------
      7216    7216     56394 bio-j-hoc.wds
      1761    1761     17523 bio-j-hoc.dic

Next I separated the good words:

    cat bio-j-hoc.wds \
      | egrep '^[a-z+^]*$' \
      > bio-j-hoc-gut.wds

    cat bio-j-hoc.dic \
      | egrep '^[a-z+^]*$' \
      > bio-j-hoc-gut.dic

    bool 1-2 bio-j-hoc.dic bio-j-hoc-gut.dic \
      > bio-j-hoc-bad.dic

     lines   words     bytes file
    ------ ------- --------- ------------
      5427    5427     44172 bio-j-hoc-gut.wds
      1083    1083      9958 bio-j-hoc-gut.dic
       678     678      7565 bio-j-hoc-bad.dic

Next I built the automaton:

    cat bio-j-hoc-gut.dic \
      | nice MaintainAutomaton \
        -add - \
        -dump bio-j-hoc-gut.dmp

     strings  letters   states   finals     arcs  sub-sts lets/arc
    -------- -------- -------- -------- -------- -------- --------
        1083     8875      701       91     1492     1258    5.948

Digraph statistics:

    cat bio-j-hoc-gut.wds \
      | count-digraph-freqs

Digraph counts (rows: first symbol, columns: second symbol; the unlabeled row and column are the word break):

               |     i     o     c     t     q     f     p     h     k |     v     j     x     s     y     g    TOT
         ----- + ----- ----- ----- ----- ----- ----- ----- ----- ----- + ----- ----- ----- ----- ----- -----  -----
             . |   396  1146  2210     .  1398    94     1   112    70 |     .     .     .     .     .     .   5427
         ----- + ----- ----- ----- ----- ----- ----- ----- ----- ----- + ----- ----- ----- ----- ----- -----  -----
    i        4 |  2248     2     8     .     .     2     .     .     . |   497    40  1979   650     .     .   5430
    o       19 |  1371     1    69     .     5  1190     8   455    60 |     .     .     .     .     .     .   3178
    c        1 |  1367   150  3487  1201     .   245     2   134    25 |     .     .     .  1187  3301  2408  13508
    t        4 |    17    73  2320     .     .    14     1    10     3 |     .     .     .     .     .     .   2442
    q        1 |     .  1383    21     .     .     1     .     .     . |     .     .     .     .     .     .   1406
    f        6 |     5    47  1543   180     .     .     .     .     . |     .     .     .     .     .     .   1781
    p        . |     .     2    15     1     .     .     .     .     . |     .     .     .     .     .     .     18
    h        3 |     2    41   606   103     .     .     .     .     . |     .     .     .     .     .     .    755
    k        2 |     .    38   111    14     .     .     .     .     . |     .     .     .     .     .     .    165
         ----- + ----- ----- ----- ----- ----- ----- ----- ----- ----- + ----- ----- ----- ----- ----- -----  -----
    v      493 |     .     1     3     .     .     .     .     .     . |     .     .     .     .     .     .    497
    j       40 |     .     .     .     .     .     .     .     .     . |     .     .     .     .     .     .     40
    x     1101 |     5   116   545     .     1   183     4    18     6 |     .     .     .     .     .     .   1979
    s      540 |     1   128   222   943     .     2     .     .     1 |     .     .     .     .     .     .   1837
    y     3161 |    12     3    49     .     2    46     2    26     . |     .     .     .     .     .     .   3301
    g       52 |     6    47  2299     .     .     4     .     .     . |     .     .     .     .     .     .   2408
         ----- + ----- ----- ----- ----- ----- ----- ----- ----- ----- + ----- ----- ----- ----- ----- -----  -----
    TOT   5427 |  5430  3178 13508  2442  1406  1781    18   755   165 |   497    40  1979  1837  3301  2408  44172

Again, it is obvious that `ix' `ij' `iv' are single letters; we will drop the `i' from them. Same goes for `cy' and `cg'; we will drop the `c'.

We may also let `cs' and `is' be single letters. This is the right thing to do if the distribution of the letter after the `s' depends on the letter before the `s':

                     o     c     t     i     k     f    TOT
         ----- ----- ----- ----- ----- ----- -----  -----
    cs      45    86   109   943     1     1     2   1187
    is     495    42   113     .     .     .     .    650

Next-symbol probabilities (× 99):

                     o     c     t     i     k     f    TOT
         ----- ----- ----- ----- ----- ----- -----  -----
    cs       4     7     9    79     .     .     .     99
    is      75     6    17     .     .     .     .     99

They are similar except that `cs' is often followed by `t', whereas `is' is often terminal and is never followed by `t'. (Not surprising, since `t' only appears after `c' in this corpus.)
But OK, let's replace `cs' by `s' and `is' by `r': cat j.wds \ | sed -f fsg2jsa.sed \ > bio-j-hoc.wds cat bio-j-hoc.wds | sort | uniq > bio-j-hoc.dic cat bio-j-hoc.wds | sort | uniq -c | sort +0 -1nr > bio-j-hoc.frq lines words bytes file ------ ------- --------- ------------ 7216 7216 44840 bio-j-hoc.wds 1761 1761 13898 bio-j-hoc.dic cat bio-j-hoc.wds \ | egrep '^[a-z67+^]*$' \ > bio-j-hoc-gut.wds cat bio-j-hoc.dic \ | egrep '^[a-z67+^]*$' \ > bio-j-hoc-gut.dic bool 1-2 bio-j-hoc.dic bio-j-hoc-gut.dic \ > bio-j-hoc-bad.dic lines words bytes file ------ ------- --------- ------------ 5427 5427 34110 bio-j-hoc-gut.wds 1083 1083 7646 bio-j-hoc-gut.dic 678 678 6252 bio-j-hoc-bad.dic Digraph statistics: cat bio-j-hoc-gut.wds \ | count-digraph-freqs Digraph counts: q o c s y g x t i r f h p k v j TOT ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- . 1398 1146 810 865 104 431 310 . . 86 94 112 1 70 . . 5427 q 1 . 1383 18 2 1 . . . . . 1 . . . . . 1406 o 19 5 1 40 8 3 18 1139 . 12 215 1190 455 8 60 . 5 3178 c 1 . 150 964 36 731 1756 . 1201 1366 1 245 134 2 25 . . 6612 s 45 . 86 92 10 4 3 1 943 . . 2 . . 1 . . 1187 y 3161 2 3 17 23 . 9 7 . . 4 46 26 2 . . 1 3301 g 52 . 47 403 35 1860 1 5 . . 1 4 . . . . . 2408 x 1101 1 116 262 126 98 59 3 . . 2 183 18 4 6 . . 1979 t 4 . 73 1953 4 243 120 14 . . 3 14 10 1 3 . . 2442 i 4 . 2 2 4 . 2 493 . 886 338 2 . . . 497 34 2264 r 495 . 42 69 14 27 3 . . . . . . . . . . 650 f 6 . 47 1370 21 151 1 5 180 . . . . . . . . 1781 h 3 . 41 513 21 70 2 2 103 . . . . . . . . 755 p . . 2 14 1 . . . 1 . . . . . . . . 18 k 2 . 38 85 17 6 3 . 14 . . . . . . . . 165 v 493 . 1 . . 3 . . . . . . . . . . . 497 j 40 . . . . . . . . . . . . . . . . 40 ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- TOT 5427 1406 3178 6612 1187 3301 2408 1979 2442 2264 650 1781 755 18 165 497 40 34110 There is something funny about the `t'. 
I must try to either (a) identify it with `c', or (b) join it with the preceding `c' or `s' as a single letter. Since most `t's have been misidentified as `c's, it is safer to do (a). 97-07-08 stolfi =============== I decided to join `iv' to make `w' and identify `t' with `c': cat j.wds \ | sed -f fsg2jsa.sed \ > bio-j-hoc.wds cat bio-j-hoc.wds | sort | uniq > bio-j-hoc.dic cat bio-j-hoc.wds | sort | uniq -c | sort +0 -1nr > bio-j-hoc.frq cat bio-j-hoc.wds \ | egrep '^[a-z67+^]*$' \ > bio-j-hoc-gut.wds cat bio-j-hoc.dic \ | egrep '^[a-z67+^]*$' \ > bio-j-hoc-gut.dic cat bio-j-hoc-gut.wds | sort | uniq -c | sort +0 -1nr > bio-j-hoc-gut.frq bool 1-2 bio-j-hoc.dic bio-j-hoc-gut.dic \ > bio-j-hoc-bad.dic lines words bytes file ------ ------- --------- ------------ 7216 7216 44287 bio-j-hoc.wds 1712 1712 13418 bio-j-hoc.dic 5427 5427 33613 bio-j-hoc-gut.wds 1035 1035 7223 bio-j-hoc-gut.dic 677 677 6195 bio-j-hoc-bad.dic Digraph statistics: cat bio-j-hoc-gut.wds \ | count-digraph-freqs Digraph counts: o s y c g x r f p h k q j w i TOT ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- . 1146 865 104 810 431 310 86 94 1 112 70 1398 . . . 5427 o 19 1 8 3 40 18 1139 215 1190 8 455 60 5 5 2 10 3178 s 45 86 10 4 1035 3 1 . 2 . . 1 . . . . 1187 y 3161 3 23 . 17 9 7 4 46 2 26 . 2 1 . . 3301 c 5 223 40 974 4118 1876 14 4 259 3 144 28 . . 4 1362 9054 g 52 47 35 1860 403 1 5 1 4 . . . . . . . 2408 x 1101 116 126 98 262 59 3 2 183 4 18 6 1 . . . 1979 r 495 42 14 27 69 3 . . . . . . . . . . 650 f 6 47 21 151 1550 1 5 . . . . . . . . . 1781 p . 2 1 . 15 . . . . . . . . . . . 18 h 3 41 21 70 616 2 2 . . . . . . . . . 755 k 2 38 17 6 99 3 . . . . . . . . . . 165 q 1 1383 2 1 18 . . . 1 . . . . . . . 1406 j 40 . . . . . . . . . . . . . . . 40 w 493 1 . 3 . . . . . . . . . . . . 497 i 4 2 4 . 2 2 493 338 2 . . . . 
34 491 395 1767 ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- TOT 5427 3178 1187 3301 9054 2408 1979 650 1781 18 755 165 1406 40 497 1767 33613

I computed a "strangeness number" by the formula

    function strangeness(n, xk, yk, xyk)
    {
      if ((xk == 0) || (yk == 0))
        { return 0 }
      else
        { fx = xk/n; fy = yk/n; fxy = xyk/n;
          fmax = (fx < fy ? fx : fy);   # largest possible pair frequency
          fexp = fx*fy;                 # pair frequency expected under independence
          fmin = 0;                     # smallest possible pair frequency
          if (fxy <= fmin)
            { return -1 }
          else if (fxy >= fmax)
            { return +1 }
          else
            { tmax = (fmax - fxy)/(fmax - fexp);
              tmin = (fxy - fmin)/(fexp - fmin);
              tsum = (log(tmin) - log(tmax))/log(2.0);
              if ( tsum > 0 )
                { texp = exp(-2*tsum); return (1 - texp)/(1 + texp) }
              else
                { texp = exp( 2*tsum); return (texp - 1)/(texp + 1) }
            }
        }
    }

    function normalness(n, xk, yk, xyk)
    {
      str = strangeness(n, xk, yk, xyk);
      return 1 - str*str
    }

where n is the total number of pairs tested, xk the number of "x" occurrences, yk the number of "y" occurrences, and xyk the number of "xy" pairs. The result is 0 if xyk is the expected number, +1 if it is the maximum possible, min(xk,yk), and -1 if it is the minimum possible (0). Here is the table, scaled from [-1..+1] to [01..99] (the unlabeled row and column are the word break):

Strangeness (× 99):

             s     y     c     o           g     f     p     h     k     q     x     r     j     w     i    TOT
         ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- -----  -----
    s        1     .    99    30     1     .     .     .     .     1     .     .     .     .     .     .     50
    y        1     .     .     .    99     .     2    59     4     .     .     .     .     2     .     .     50
    c        .    58    90     1     .    99    10    14    21    15     .     .     .     .     .    99     50
    o        .     .     .     .     .     .    99    99    99    98     .    99    98    70     .     .     50
            99     1    10    95     .    58     3     3    42    97    99    47    33     .     .     .     50
    g        6    99    15     1     .     .     .     .     .     .     .     .     .     .     .     .     50
    f        4    38    99     2     .     .     .     .     .     .     .     .     .     .     .     .     50
    p       79     .    99    62     .     .     .     .     .     .     .     .     .     .     .     .     50
    h       33    45    99    15     .     .     .     .     .     .     .     .     .     .     .     .     50
    k       95     4    97    94     .     2     .     .     .     .     .     .     .     .     .     .     50
    q        .     .     .    99     .     .     .     .     .     .     .     .     .     .     .     .     50
    x       86    11     7    18    99     6    84    98     6    19     .     .     .     .     .     .     50
    r       19     6     4    23    99     .     .     .     .     .     .     .     .     .     .     .     50
    j        .     .     .     .    99     .     .     .     .     .     .     .     .     .     .     .     50
    w        .     .     .     .    99     .     .     .     .     .     .     .     .     .     .     .     50
    i        .     .     .     .     .     .     .     .     .     .     .    98    99    99    99    98     50
         ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- -----  -----
    TOT     50    50    50    50    50    50    50    50    50    50    50    50    50    50    50    50  33613

Normalness (× 99):

                   x     y     o     s     c     g     k     p     f     h     r     j     i     w     q    TOT
         ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- -----  -----
             .    99     2    16     .    37    96     8    12    10    97    89     .     .     .     .     99
    x        2     .    38    59    47    27    24    61     5    50    23     .     .     .     .     .     99
    y        .     .     .     .     3     .     .     .    95     6    15     .     6     .     .     .     99
    o        .     .     .     .     .     .     .     3     1     .     .     4    81     .     .     .     99
    s        4     .     .    83     6     .     .     2     .     .     .     .     .     .     .     .     99
    c        .     .    96     4     .    31     1    52    49    35    67     .     .     1     .     .     99
    g        1     .     .     3    24    50     .     .     .     .     .     .     .     .     .     .     99
    k        .     .    17    17    14     7     6     .     .     .     .     .     .     .     .     .     99
    p        .     .     .    93    64     .     .     .     .     .     .     .     .     .     .     .     99
    f        .     .    94     8    14     .     .     .     .     .     .     .     .     .     .     .     99
    h        .     .    98    51    87     .     .     .     .     .     .     .     .     .     .     .     99
    r        .     .    24    71    60    14     .     .     .     .     .     .     .     .     .     .     99
    j        .     .     .     .     .     .     .     .     .     .     .     .     .     .     .     .     99
    i        .     2     .     .     .     .     .     .     .     .     .     .     .     3     .     .     99
    w        .     .     .     .     .     .     .     .     .     .     .     .     .     .     .     .     99
    q        .     .     .     .     .     .     .     .     .     .     .     .     .     .     .     .     99
         ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- -----  -----
    TOT     99    99    99    99    99    99    99    99    99    99    99    99    99    99    99    99  33613

These below are all the `qc' words in the file. They look like misreadings of popular `qo' words.

    egrep 'qc' bio-j-hoc-gut.frq

    qccgy qccgy qcfccgy qcfccgy qcccgy qccgccy qccy qcgy qchcccg qchccy qchcgy qchcgys qchcix qchcy qchy qci qcixox qcy

97-07-09 stolfi
===============

Summarizing, so far it seems that breaking down all characters into strokes was a very good idea. It led (somewhat indirectly) to two discoveries: that the difference between Guy2 and / is not important, and highly contaminated by error; and that Guy2 `a' is probably not a letter: it is a `c' stroke (possibly half of the preceding letter) accidentally connected to an `i' stroke (probably the beginning of the next letter).

Looking at the above tables, it is now almost certain that `sc' and `qo' are letters on their own.
(Note that `sc' is represented as [2C], [2A], [S], [2T] in the interlinear file. In other words, the plume on the is not really attached to the but to the following letter, which is always a `c' stroke. This may be an explanation for the ligature in [S] = , and the reported ligature.)

Summarizing, I am now going to use the following FSG -> JSA pre-encoding

    IIIK -> iiiij    IE -> iix    A -> ci    N -> iiu
    IIIL -> iiiiu    IR -> iis    C -> c     O -> o
    IIIR -> iiiis    IK -> iij    D -> lj    P -> ag
    IIIE -> iiiix    2  -> cs     E -> ix    R -> is
    IIE  -> iiix     4  -> a      F -> lg    S -> csc
    IIR  -> iiis     6  -> cj     G -> cy    T -> cc
    IIK  -> iiij     7  -> ig     H -> aj    V -> ^
    HZ   -> cajc     8  -> cg     I -> i     Y -> +
    PZ   -> cagc                  K -> ij
    DZ   -> cljc                  L -> iu
    FZ   -> clgc                  M -> iiiu

followed by the JSA -> ad-hoc post-encoding:

    sc -> s   ij -> 7   ig -> 8   aj -> H
    a -> 4 (if unpaired)   ao -> A   ix -> e   cg -> 8   ag -> H
    iu -> v   cy -> 9   lj -> H
    is -> r   lg -> H

Moreover, I am going to use this encoding before preparing the consensus transcription. The consensus-maker will have to be sort of a dynamic programming algorithm...

OK, I coded the dynamic consensus-maker, and modified the script fsg2jsa to work on the interlinear file.
So: cat bio-m-evt.evt \ | fsg2jsa \ > bio-m-jsa-bug.evt Now extracted the training dataset, and generated a new set of correction patterns from it: cat bio-m-jsa-bug.evt \ | egrep '^<.*;[FC]> ' \ | sed \ -e 's/<.*;[FC]> */ /g' \ -e 's/{[^}]*}//g' \ | grep -v '[*]' \ > .train.txt lines words bytes file ------ ------- --------- ------------ 1470 1470 115821 .train.txt cat .train.txt \ | generate-fix-patterns -vMINOCC=10 \ > .fixit.sed lines words bytes file ------ ------- --------- ------------ 596 716 9932 .fixit.sed Next I generated the consensus interlinear, and ran the automatic context-fixer above: cat bio-m-jsa-bug.evt \ | make-consensus-interlin \ > bio-m-jsa.evt I extracted the consensus text from it, and applied the automatic corrector: cat bio-m-jsa.evt \ | egrep '^<.*;J> ' \ | sed \ -e 's/{[^}]*}//g' \ -e 's/[\!]//g' \ > bio-j-jsa-raw.evt cat bio-j-jsa-raw.evt \ | sed -f .fixit.sed \ > bio-j-jsa-fix.evt I wrote a script "extract-words" that extracts the words from the consensus file, remaps them through an arbitrary encoding, extracts the dictionary, and runs the digraph statistics: ------------------------------ ------------------------------ extract-words-from-interlin \ -recode jsa2hoc \ bio-j-jsa-fix.evt \ jh-1 cat bio-j-hoc-1-gut.wds \ | count-digraph-freqs \ -vchars=' c9po8idervqs74gy' lines words bytes file ------ ------- --------- ------------ 7358 7358 46402 bio-j-hoc-1.wds 1553 1553 14124 bio-j-hoc-1.dic 5873 5873 36448 bio-j-hoc-1-gut.wds 1001 1001 7199 bio-j-hoc-1-gut.dic 552 552 6925 bio-j-hoc-1-bad.dic 16337 16337 111098 total Digraph counts: c 9 p o 8 i d e r v q s 7 4 g y TOT ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- . 1780 128 359 1206 467 68 . 322 28 . 1493 . . 22 . . 5873 c 4 3528 1003 473 187 1875 1548 1129 11 4 . . 159 . . . . 9921 9 3238 45 . 80 3 11 2 . 10 2 . 4 . 1 1 . . 3397 p 14 2629 245 . 142 7 . . 8 . . . . . . . . 3045 o 15 26 1 605 . 12 44 . 972 195 . 5 . 6 . . 
. 1881 8 58 475 1888 5 48 1 . . 6 1 . . . . . . . 2482 i 5 8 . 3 1 3 1558 130 482 326 828 . . 40 . . . 3384 d 2 937 24 10 34 36 160 27 4 1 . . . . . 8 43 1286 e 1035 452 94 230 121 61 . . 5 2 . 1 . . . . . 2001 r 519 . . 1 46 . . . . . . . . . . . . 566 v 824 . 3 . 1 . . . . . . . . . . . . 828 q 7 23 1 1273 1 8 4 . 179 7 . 1 . . . . . 1504 s 63 . . 5 90 . . . 1 . . . . . . . . 159 7 46 . . . 1 . . . . . . . . . . . . 47 4 1 18 3 1 . . . . . . . . . . . . . 23 g 1 . 7 . . . . . . . . . . . . . . 8 y 41 . . . . 1 . . 1 . . . . . . . . 43 ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- TOT 5873 9921 3397 3045 1881 2482 3384 1286 2001 566 828 1504 159 47 23 8 43 36448
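
The digraph count tables in these notes can be reproduced, in spirit, with a few lines of sh and awk. The sketch below is not the actual count-digraph-freqs script (its source and its -vchars option are not shown here). The name count_digraphs, the use of "_" for the word break, and the one-pair-per-line output format are all assumptions made for illustration.

```shell
# Hypothetical stand-in for count-digraph-freqs (illustrative only).
# Reads one word per line; prints "pair<TAB>count" sorted by pair,
# with "_" standing for the word break.
count_digraphs() {
  awk '
    length($0) > 0 {
      w = "_" $0 "_"          # pad so word-initial and word-final pairs count too
      for (i = 1; i < length(w); i++) cnt[substr(w, i, 2)]++
    }
    END { for (d in cnt) printf "%s\t%d\n", d, cnt[d] }
  ' | LC_ALL=C sort
}

# Body/limb reduction as in the notes, followed by the pair counts:
#   tr 'cotiqlvxyjgs' 'BBBBBBLLLLLL' < bio-c-jsa-gut.wds | count_digraphs
```

Dividing each count by the total for its first symbol (or its second symbol) and scaling by 99 would give tables comparable to the next-symbol and previous-symbol tables; the rounding used for those tables is not stated in the notes.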