Reading again the description of Landini's file, I found out that the `%' and `!' marks should have been handled differently from the way I did. Also, looking at the actual shape of the characters, I realized that the FSG encoding was not very good for my purposes, since it assigns completely different codes to glyphs which may be just calligraphic variations of the same grapheme. Thus I decided to redo everything from the beginning, using a more analytical encoding.

I considered using Jacques Guy's "Neo-Frogguy" or "Gui2" encoding, but even that is a bit too synthetic --- for example, his <2> should be "i'", and his <9> should be `c)', for consistency. (The statistics on the occurrence of repeated s apparently confirm this choice.) Thus I decided to define my own "super-analytic" or "SA" encoding. The idea is to break all characters down into individual "logical" strokes, and use one (computer) character to encode each stroke.

There is some question as to what counts as a logical stroke, and when two strokes are different. Obviously, the definition of a stroke must include not only its shape but also the way it connects to the neighboring strokes; and, given the irregularity of handwritten glyphs, that may be hard to decide. For instance, FSG's [A] character can be broken down into two strokes, shaped like the [C] and [I] glyphs. Supposedly, the difference between an [A] and a [CI] is that in the former the strokes are connected into a closed shape. Is this difference significant?

I checked the occurrences of [CI], [CM], and [CN] in the interlinear file. Two things are curious. First, these combinations are extremely rare. Second, a good many of them are transcribed differently by Currier and the FSG: where one has [CIIR] the other often has [AIR], and vice-versa. Same for [CM] versus [AN], etc. In light of these observations, I have decided to treat all occurrences of [A] as [CI].
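In itself, the [A]-to-[CI] collapse is just a stream rewrite over the FSG text. A minimal sketch (the fsg2jsa recoder used below does the full FSG-to-SA mapping, which is not reproduced in these notes):

```shell
# Hypothetical one-glyph rewrite: every FSG [A] becomes [CI].
# The real fsg2jsa tool handles the whole stroke alphabet, not just this case.
echo 'CIIR AIR AN' | sed 's/A/CI/g'
# -> CIIR CIIR CIN
```

Note how the mismatched transcriptions [CIIR] and [AIR] become identical after the rewrite, which is exactly the point of the collapse.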
If the two are indeed different, that will be just one more ambiguity added to the inherent ambiguity of natural language; so it cannot make the decipherment task more difficult. Confusing the two will change the letter frequencies, it is true; but, since the language does not appear to be a standardized one, there is not much information we can extract from absolute letter frequencies. The methods we hope to use --- such as automaton analysis --- are not significantly disturbed by collapsing letters. On the other hand, if [A] and [CI] are the same grapheme, using different encodings will seriously confuse the statistics --- especially if the spacing depends on the immediate context.

I took the file bio-c-evt.txt and removed all `%' from it. Many of them should have been spaces, but so what. I also added a space before and after each line; that may be helpful when doing greps.

   lines   words      bytes  file
  ------  ------  ---------  ------------
     765    7227      39823  bio-c-evt.txt

Next I applied a recoding:

  cat bio-c-evt.txt \
    | fsg2jsa \
    > bio-c-jsa.txt

Next I extracted words:

  cat bio-c-jsa.txt \
    | tr ' ' '\012' \
    | egrep '.' \
    > bio-c-jsa.wds

  cat bio-c-jsa.wds \
    | sort \
    | uniq \
    > bio-c-jsa.dic

  cat bio-c-jsa.wds \
    | sort \
    | uniq -c \
    | sort +0 -1nr \
    > bio-c-jsa.frq

   lines   words      bytes  file
  ------  ------  ---------  ------------
     765    7227      61678  bio-c-jsa.txt
    7227    7227      60144  bio-c-jsa.wds
    1687    1687      17398  bio-c-jsa.dic
    1687    3374      30894  bio-c-jsa.frq

Next I separated the good words:

  cat bio-c-jsa.wds \
    | egrep '^[a-z+^]*$' \
    > bio-c-jsa-gut.wds

  cat bio-c-jsa.dic \
    | egrep '^[a-z+^]*$' \
    > bio-c-jsa-gut.dic

  bool 1-2 bio-c-jsa.dic bio-c-jsa-gut.dic \
    > bio-c-jsa-bad.dic

   lines   words      bytes  file
  ------  ------  ---------  ------------
      34      34        270  bio-c-jsa-bad.dic
    6420    6420      57560  bio-c-jsa-gut.wds
    1653    1653      17128  bio-c-jsa-gut.dic

Next I built the automaton:

  cat bio-c-jsa-gut.dic \
    | nice MaintainAutomaton \
        -add - \
        -dump bio-c-jsa-gut.dmp

   strings  letters   states   finals     arcs  sub-sts  lets/arc
  -------- -------- -------- -------- -------- -------- --------
      1653    15475     1373      178     2609     2234     5.931

Note that the efficiency increased, even though the words got considerably longer!
I ran AutoAnalysis, looking for unproductive states (used by 2 words or fewer) and strange words:

  nice AutoAnalysis \
    -load bio-c-jsa-gut.dmp \
    -unprod bio-c-jsa-gut-1-unp.sts \
    -maxUnprod 2 \
    -unprodSugg bio-c-jsa-gut-1-unp.sugg

  546 unproductive states
  826 strange words (with repetitions) listed
  389 strange words (without repetitions) listed

I ran it again, this time considering a state unproductive if it is used by only one word:

  nice AutoAnalysis \
    -load bio-c-jsa-gut.dmp \
    -unprod bio-c-jsa-gut-2-unp.sts \
    -maxUnprod 1 \
    -unprodSugg bio-c-jsa-gut-2-unp.sugg

  266 unproductive states
  266 strange words (with repetitions) listed
  96 strange words (without repetitions) listed

Here are the strange words:

  ccljtciix ccqgtccy cgciiiivciiscyixcy cgciiiixciixcgcy
  cgciixcyiscg cgciixljccccyiscy cgcstoixcycg cgctcljtccgcy
  cgcyixctccs cgcyqoljoix cgixctciscy cgoisctccgctccy
  cgoixcccsoixctccy cicstcyqjcccg ciixcstclgccy cqgciixoiis
  cqjtcyljoisoix csciixcstcqjtcgcy csciixctqjccgcyqjciis cscstoixcstccqjtcy
  csolgctciij csqjoixqgctcy cstccgljcoixcy cstcoisoixljctcgcy
  cstixctcis cstocqgtccgcy cstqjciixctcccgcy ctccgciiivcgci
  ctccgcyqjctcg ctccicgis ctcciixisois ctccycsctcy
  ctccyqcyqjciiiiv ctcoixqjciiivcscy ctcstccljtcy ctixctqgcstcccy
  ctoixljccccy cycgoixcstcccy cycqgciiiiv cyisoqjccy
  cyljccgcicqgtcy cyqjcccsoixcgcy cyqjctixctccgcy cyqjctoisoixljcy
  cyqoixcyqois iixqgctcgcy isoisciisoix ixcsciiis
  ixctccgcyisqgctccgcy ixixcgcyis ixixciiiiscy ixlgctcciix
  ixoixljccgcyljciiivoix lgctccgcyljciiscy ljccgcyqgocstcy ljciixcyctccy
  ljoisoiiivcy ljoqjciixcy ocstcqgoixcs ocycstccy
  oisciiisciiso oixcstciixcscy olgciixcstcljcy olgctcivcycs
  oljciixctccgcyqjoiscy oqjciisqgcccgcy qcljcyqjccgciis qcsoixljcccgcy
  qcyisctcs qgcgciixcstois qgciixcgciisciiisciix qgcstctccgciix
  qgcstoixqgctclgtcgcy qgctccyljccciis qgoisciixcstcy qgoixcccgciisciiv
  qgoixqjccstoix qgoqjctoljciis qjccciixciiiv qjccyqjccj
  qjcoixcstcoix qjctcgoixqjcgcy qjctciixoixljcc qjcyqoljcy
  qjocqjtccy qjoisoixcstcscgcy qjoixoiscstcccy qoljciixljcoix
  qoljcyisixcstc qoljcyixcgcgcy qoqjcgcgcyciis qoqjcyqjcyqjois
  qoqjixoixljciix qoqjoisciixoij qoqoljcccy qqgoixljciiiv

I looked for popular states and radicals:

  nice AutoAnalysis \
    -load bio-c-jsa-gut.dmp \
    -classes bio-c-jsa-gut-1.cls \
    -minClassPrefs 5 \
    -radicals bio-c-jsa-gut-1.rad

  77 lexical classes found

The output wasn't very illuminating, though. I looked for productive states instead:

  AutoAnalysis \
    -load bio-c-jsa-gut.dmp \
    -prod bio-c-jsa-gut-1-prd.pst \
    -minProductivity 1 \
    -maxPrefSize 30 \
    -maxSuffSize 30

Here is the result:

   nprefs  nsuffs  nwords  prodty  prefs/suffs
  ------- ------- ------- ------- -----------------
       40       2      80      39  { qoqjciisc, qoljcctcc, qoljcccc, ... }:{ gcy, y }
       35       2      70      34  { qoqjcccs, qoljcio, qoixoix, ... }:{ (), cy }
       24       2      48      23  { qgciii, oljciiii, oisciiii, ... }:{ iv, v }
       15       2      30      14  { qoljccii, ctccgcyi, cgctoi, ... }:{ s, x }
        7       3      21      12  { qoixctcc, qoccc, oqjccc, ... }:{ cgcy, gcy, y }
       12       2      24      11  { qoljccgci, qljo, qjoixcstco, ... }:{ is, ix }
       11       2      22      10  { oixqjcc, ixqjcc, cgcyljcc, ... }:{ cgcy, gcy }
       10       2      20       9  { cyctccgc, qoqjciixcgc, ... }:{ iis, y }
        4       4      16       9  { qoixcstcc, ctcqjtc, qoqjcstc, ... }:{ cgcy, cy, gcy, y }
        9       2      18       8  { qoqjciixcg, qoljciixcg, ... }:{ ciis, cy }
        5       3      15       8  { ctccljtc, cstcljcc, csoixctcc, ... }:{ cy, gcy, y }
        5       3      15       8  { qocljtc, qoctc, oisctcc, ... }:{ cgcy, cy, y }
        8       2      16       7  { qocgcc, qcljct, ixocc, ... }:{ cgcy, cy }
        8       2      16       7  { qcljcc, ctccljc, cstccqjc, ... }:{ cy, y }
        7       2      14       6  { qoqgcst, oqjciixcst, qoixqjc, ... }:{ ccgcy, cgcy }
        4       3      12       6  { qoiscc, ljctc, qoljcstc, ... }:{ cgcy, cy, gcy }
        4       3      12       6  { oljcic, ixctccc, cqjtcc, ... }:{ gcy, s, y }
        4       3      12       6  { qjciii, isoii, isciii, csoii }:{ iiv, iv, v }
        4       3      12       6  { qoct, oisctc, ixctccljt, ... }:{ ccgcy, ccy, cy }
        6       2      12       5  { oqgcstccgc, oljcccgc, qjoixcgc, ... }:{ iix, y }
        6       2      12       5  { oixoii, cstcljciii, qoiscii, ... }:{ iiv, iv }
        6       2      12       5  { cgljcc, qgoixljcc, oixljcstc, ... }:{ cy, gcy }
        5       2      10       4  { qgoixctcc, csciixctcc, ... }:{ g, gcy }
        2       5      10       4  { cgctcc, qoqjctc }:{ cgcy, cy, gcy, oix, y }
        5       2      10       4  { qgoixljc, oixljcst, ljoixct, ... }:{ ccy, cgcy }
        5       2      10       4  { qoljciiiv, ctis, oqjcoix, ... }:{ (), cgcy }
        5       2      10       4  { qoljciixct, ocgct, ixljct, ... }:{ ccgcy, ccy }
        3       3       9       4  { qoixciii, oqjciii, oixljciii }:{ iv, s, v }
        2       4       8       3  { cgcc, cgoixctc }:{ ccgcy, cgcy, cy, gcy }
        4       2       8       3  { qjoixcg, qgoixcstcg, cstcg, ... }:{ ciix, cy }
        2       4       8       3  { qoqgctc, ljcstc }:{ cgcy, cy, gcy, oix }
        4       2       8       3  { qoisci, ctcoixljci, csoixljci, ... }:{ iiiv, iiv }
        4       2       8       3  { oljccgcy, oixljcccy, ctoixo, ... }:{ (), is }
        4       2       8       3  { ocljt, ixljccg, ctoixct, ... }:{ ccy, cy }
        4       2       8       3  { qoljctcgcy, oqjccgcy, ... }:{ (), ix }
        2       4       8       3  { qoqjcst, ctcqgt }:{ ccgcy, ccy, cgcy, cy }
        4       2       8       3  { qoqjciiiv, qjccgcy, oixois, ... }:{ (), oix }
        2       4       8       3  { oixqjcii, ocgcii }:{ iiv, iv, s, x }
        2       4       8       3  { oixqjci, ocgci }:{ iiiv, iiv, is, ix }
        3       2       6       2  { oqjccgciii, ctcqjcii, ... }:{ iv, s }
        2       3       6       2  { csciixctc, cgoixcstc }:{ cg, cgcy, oix }
        2       3       6       2  { ixctccgcii, ciisoi }:{ s, scy, x }
        2       3       6       2  { ixctccgci, ciiso }:{ is, iscy, ix }
        3       2       6       2  { oqjoixcgcy, oqgctccgcy, ... }:{ (), ixctccy }
        3       2       6       2  { oqoljc, ctcoljc, ctccyljc }:{ iiiv, y }
        3       2       6       2  { oqolj, ctcolj, ctccylj }:{ ciiiv, cy }
        2       3       6       2  { qoljctcc, ixljccc }:{ g, gcy, y }
        2       3       6       2  { qoljcst, oqjcst }:{ ccgcy, ccy, cgcy }
        3       2       6       2  { qoixqj, qocst, oiscst }:{ cccgcy, ccgcy }
        2       3       6       2  { oqgcstccg, oljcccg }:{ (), ciix, cy }
        2       3       6       2  { qoljcoi, oqgoi }:{ s, x, xcy }
        2       3       6       2  { qoljcs, oqjcs }:{ tccgcy, tccy, tcgcy }
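The prodty column is not explained here, but the numbers are consistent with a simple reading: every row satisfies nwords = nprefs * nsuffs and prodty = nwords - nprefs - nsuffs + 1, i.e. the words accounted for minus the prefixes and suffixes spent describing them. This is inferred from the table itself, not from any AutoAnalysis documentation; a quick check on three sample rows:

```shell
# Verify the apparent invariants on sample rows (nprefs nsuffs nwords prodty):
# nwords = nprefs * nsuffs, prodty = nwords - nprefs - nsuffs + 1.
printf '%s\n' '40 2 80 39' '7 3 21 12' '2 5 10 4' \
  | awk '{ print ($3 == $1*$2 && $4 == $3-$1-$2+1) ? "ok" : "mismatch" }'
# -> ok (three times)
```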