I obtained Landini's interlinear transcription of the VMs, version 1.6 (landini-interln16.evt), from

    http://sun1.bham.ac.uk/G.Landini/evmt/intrln16.zip

From it I manually extracted a homogeneous, full-text sample, bio-m-evt.evt, consisting of pages 147-166 of the "biological" section, in Currier's Language B, hand 2. This section includes Currier's and Friedman's transcriptions; Currier's seems to be the more complete of the two.

I reduced this sample to a file cb.txt in "plain" text format as follows:

  * I eliminated comments and Friedman's lines:

        cat bio-m-evt.evt \
          | egrep -v '^#' \
          | grep ';C>' \
          > bio-c-evt.evt

  * I eliminated (with Emacs) the line numbers <...> at the beginning of each line;
  * I replaced 3 or more consecutive occurrences of "%" by "%%"
  * I eliminated the strings of "!", present at end of some lines.
  * I replaced the end-of-line "-" by ".//."
  * I replaced the end-of-paragraph "=" by ".//.=." (Note that not all pages end in "=" in the original Currier file.)
  * I replaced "." by " " (with tr).
  * I removed " " at end of all lines.

     lines   words     bytes file
    ------ ------- --------- ------------
       765    7230     38906 bio-c-evt.txt

Next I computed the word frequencies:

    cat bio-c-evt.txt \
      | tr ' ' '\012' \
      | sort \
      | uniq -c \
      | sort +0 -1nr \
      > bio-c-evt.frq

    cat bio-c-evt.txt \
      | tr ' ' '\012' \
      | sort \
      | uniq \
      > bio-c-evt.dic

I removed all "bad" words (with "%", "/", "=", "*"):

    cat bio-c-evt.dic \
      | grep -v '[%*/=]' \
      | egrep '.' \
      > bio-c-evt-gut.dic

     lines   words     bytes file
    ------ ------- --------- ------------
      1851    1850     12117 bio-c-evt.dic
      1381    1381      8277 bio-c-evt-gut.dic

I created an automaton for bio-c-evt-gut.dic:

    cat bio-c-evt-gut.dic \
      | nice MaintainAutomaton \
          -add - \
          -dump bio-c-evt-gut.dmp

     strings  letters   states   finals     arcs  sub-sts lets/arc
    -------- -------- -------- -------- -------- -------- --------
        1381     6896      535      114     1633     1341    4.223

I ran AutoAnalysis, looking for unproductive states (used by 2 words or fewer) and strange words:

    nice AutoAnalysis \
      -load bio-c-evt-gut.dmp \
      -unprod bio-c-evt-gut-1-unp.sts \
      -maxUnprod 2 \
      -unprodSugg bio-c-evt-gut-1-unp.sugg

    161 unproductive states
    255 strange words (with repetitions) listed
    200 strange words (without repetitions) listed

I repeated the analysis, redefining a state as unproductive if it is used by only one word:

    nice AutoAnalysis \
      -load bio-c-evt-gut.dmp \
      -unprod bio-c-evt-gut-2-unp.sts \
      -maxUnprod 1 \
      -unprodSugg bio-c-evt-gut-2-unp.sugg

    67 unproductive states
    67 strange words (with repetitions) listed
    45 strange words (without repetitions) listed

Here are the strange words:

    2AESHZ8G 2AIIRAE 42OEDCC8G 4GDAM 4O4ODCCG 4ODGE88G 8ETCRG C2C2
    CESC8G CPAEOIR DOHAEG DOROER EE8GR EEAIIRG EFTCAE EODAK EOIIIK
    FZ8OROE G4ODAN GCPAM GDT8AR GROHCG HCGHC6 HG4ODG HOROES28G
    HT8OEH8G IEPT8G ODAROEOK OGSCG OHTOHAR OPTOEOR OSCPOE2 P8AESOR
    PGDC8G POECC8ARAL POEHCSOE POHTODAR PSTC8AE SC8CA SETCR TCIROR
    TEAIIIL TETPSCCG TOEDCCCG TSCDZG

I then looked for popular states and radicals:

    nice AutoAnalysis \
      -load bio-c-evt-gut.dmp \
      -classes bio-c-evt-gut-1.cls \
      -minClassPrefs 5 \
      -radicals bio-c-evt-gut-1.rad

    37 lexical classes found

I modified AutoAnalysis, adding a new command "-prod" to list productive states:

    AutoAnalysis \
      -load bio-c-evt-gut.dmp \
      -prod bio-c-evt-gut-1-prd.pst \
      -minProductivity 1 \
      -maxPrefSize 30 \
      -maxSuffSize 30

    25 productive states

Here they are:

      state  nprefs  nsuffs  nwords  prodty  prefs/suffs
    ------- ------- ------- ------- -------  -----------------
        180      35       2      70      34  { TDZ8, TC2, TAR, RTC8, RSC8, ... }:{ (), G }
         60      30       2      60      29  { THZC, TDCC, TCHC, SCDZC, OHTC, ... }:{ 8G, G }
        186      15       2      30      14  { TC8O, PZA, PTO, PTC8A, OESCO, ... }:{ E, R }
        681       4       4      16       9  { TC8A, OEHA, O8A, GHA }:{ E, M, N, R }
         71       5       3      15       8  { 8TCC, 4OPTC, 4OHTC, DSC, 2OETC }:{ 8G, G, OE }
         85       8       2      16       7  { OHAES, OFT, EPT, EHC, 8GDC, ... }:{ 8G, C8G }
         58       7       2      14       6  { TR, SOE, SC8AE, OHOE, ODOE, ... }:{ (), 8G }
        206       4       3      12       6  { OHS, AET, 4ORC, 4ODS }:{ 8G, C8G, CG }
        233       4       3      12       6  { TCHZ, ETCDZ, 4OT, 4ODZ }:{ C8G, CG, G }
        255       4       3      12       6  { OCC, 8OETC, 8CC, 4OESC }:{ 8G, C8G, G }
        123       6       2      12       5  { TCCD, TC8T, SCCDC, ODZ, EDC8, ... }:{ CG, G }
        189       5       2      10       4  { SHZCG, SCOEO, OHC8G, 4ODT8G, ... }:{ (), E }
        193       3       3       9       4  { OPSC8, ODCC8, 4ODCC8 }:{ (), AE, G }
        301       5       2      10       4  { TDAR, ROR, OEOR, HC8G, 4OHOE }:{ (), OE }
        463       5       2      10       4  { TCC8, OET8, HTC8, GHCC8, CC8 }:{ AR, G }
         30       4       2       8       3  { O2, HAR, CCC2, 2AR }:{ (), AE }
        151       4       2       8       3  { ODS, GFT, EDT, 4O8C }:{ C8G, CG }
        304       2       4       8       3  { TPZ, 4OHS }:{ 8G, C8G, CG, G }
        229       2       3       6       2  { EDCC, 4ODTC }:{ 8, 8G, G }
        368       3       2       6       2  { ODAI, AI, 8CI }:{ IIL, R }
        372       3       2       6       2  { TCOET, OTC, DCT }:{ 8G, CG }
        408       3       2       6       2  { S8, POES8, 8SCC8 }:{ AE, G }
        588       2       3       6       2  { SODA, EORA }:{ E, M, N }
        662       2       3       6       2  { SCDC, ODCS }:{ 8G, CG, G }
        819       3       2       6       2  { SCAE, OEDCCG, ODC8G }:{ (), R }

There appear to be two major kinds of words: those that inflect with [M,E,N,R] and those that inflect with [(),G,8G,CG,C8G], or [G] for short.
I collected the radicals of the class [M,N,E,R]:

    cat bio-c-evt.frq \
      | egrep -v '[%*]' \
      | egrep '[MENR]$' \
      | sort +1 -2 \
      > foo.frq

Here are some popular radicals of the [M,E,N,R] class, with approximate counts of the four inflections:

    4ODA(383) O(238) 8A(225) ODA(111) 2A(94) 4O(93) OHA(79) A(75)
    4OHA(73) 2O(49) OEDA(41) RA(35) SCO(31) EO(31) TCO(30) OEO(28)
    DA(24) 4ODO(23) RO(22) ''(20) HA(19) EDA(15) SO(14) SA(13)
    TC8A(12) OEA(11) PO(11) SC8A(10) TA(10) EA(9) HO(9) ORA(9) SCA(9)

Note that the radicals all end in "A" or "O" (apart from the empty radical '').

I collected the radicals of the class [G]:

    cat bio-c-evt.frq \
      | egrep -v '[%*]' \
      | egrep '[G]$' \
      | sort +1 -2 \
      > foo.frq

Looking at the most frequent words in this file, it seems that the main suffixes of this class are actually, in decreasing frequency order for each radical:

    4OD{C8G,CC8G,CCG,CG,G,SC8G,T8G,TC8G,TCG,TG,ZCG,ZG}
    4OH{C8G,CC8G,G,CG,CCG,TC8G,TG}
    8{G,SC8G,TC8G,C8G,CC8G,...}
    D{C8G,CC8G,TCG,CCC8G,CG}
    E{TC8G,SC8G,TCG,8G,DC8G,G,DCC8G,SCG,TG,SCC8G,OEG,SCCG,...}
    GD{CC8G,CCG,C8G,}
    GH{C8G,CCG,CC8G}
    H{C8G,CC8G,CCG}
    OD{C8G,CC8G,CCG,G,CG,TCG,SCG,T8G,CSG,SC8G,TG,ZG,CCC8G,...}
    OE{G,TC8G,SC8G,TCG,8G,...}
    OED{C8G,CCG,CC8G,...}
    OH{C8G,CC8G,CCG,G,CG,C8CCG,CCC8G,...}
    OP{TC8G,SC8G,TCG,TCCG,...}
    S{C8G,CG,CC8G,CCG,G,8G,...}
    T{C8G,CG,CCG,CC8G,8G,G,...}

So it seems that the [G] class is actually [C8G,CC8G,CCG,CG,G]. However, the preceding letter is quite often [S,T,C,Z,D], so the class may include [SC8G,TC8G,...].
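
The radical counts for the [M,E,N,R] class can be reproduced in a few lines. This is a minimal Python sketch under stated assumptions: the word frequencies are already loaded into a dictionary (the function name `radical_counts` and the sample counts below are hypothetical, not taken from bio-c-evt.frq), and a radical is simply the word minus its final letter.

```python
from collections import Counter

def radical_counts(freqs, endings="MENR"):
    """Sum word counts per radical, where radical = word minus a final ending letter."""
    totals = Counter()
    for word, count in freqs.items():
        # Skip words with unreadable characters, as egrep -v '[%*]' does.
        if "%" in word or "*" in word:
            continue
        if word and word[-1] in endings:
            totals[word[:-1]] += count
    # Most frequent radicals first.
    return totals.most_common()

# Hypothetical counts, for illustration only.
sample = {"4ODAM": 10, "4ODAE": 5, "OR": 3, "OE": 4, "4ODCC8G": 7, "R": 2, "O*R": 9}
for radical, total in radical_counts(sample):
    print(repr(radical), total)
```

The empty radical shows up as `''`, matching the `''(20)` entry in the list above.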
I extracted the words with those popular radicals combined with the selected endings, and counted their frequencies:

    /bin/rm .foo
    foreach f ( '' 4OD E OH 4OH OD 8 OE H D GD 4OED )
      echo 'Prefix "'"${f}"'":' >> .foo
      echo ' ' >> .foo
      cat bio-c-evt.frq \
        | egrep -v '[%*]' \
        | egrep ' '"${f}"'[STZ]?[C]*[8]?[G]?$' \
        | sed -e 's@[ ]'"${f}"'@ @' \
        | sort -b +0 -1nr \
        | format-counts \
        | sed -e 's@^@ @' \
        >> .foo
      echo ' ' >> .foo
    end

Prefix "":

    SC8G(210) TC8G(193) SCG(78) TCG(74) 8G(41) TCCG(34) SCC8G(33)
    SCCG(30) TCC8G(21) T8G(19) SG(9) G(5) S8G(5) TC8(5) TG(5) CC8G(3)
    S(2) SC8(2) (1) 8(1) CCC8G(1) CCG(1) SCC8(1) TCCCG(1)

Prefix "4OD":

    C8G(159) CC8G(149) CCG(83) G(58) CG(39) T8G(10) TC8G(7) SC8G(6)
    TG(5) C8(4) CC8(4) ZCG(4) (3) CCCG(3) TCG(3) CCC8G(2) 8G(1) S8G(1)
    SCG(1) TC8(1) ZC8G(1) ZG(1)

Prefix "E":

    TC8G(55) SC8G(23) TCG(18) 8G(9) (7) G(7) SCG(6) TG(6) SCC8G(5)
    SCCG(4) T8G(4) TCCG(4) TC8(3) S8G(2) SC8(2) TCC8G(2) 8(1) C(1)
    CC8G(1) S(1) T(1)

Prefix "OH":

    C8G(48) CC8G(25) CCG(18) G(17) CG(13) SC8G(4) S8G(3) TC8G(3) TCG(3)
    (2) SCG(2) TG(2) C8(1) CCC8G(1) T8G(1) ZCG(1) ZG(1)

Prefix "4OH":

    C8G(46) CC8G(39) G(26) CG(8) CCG(7) TC8G(3) TG(3) CC8(2) SC8G(2)
    TCG(2) (1) CCC8G(1) S8G(1) SCG(1) SG(1) T8G(1) ZCG(1)

Prefix "OD":

    C8G(43) CC8G(32) CCG(16) G(12) CG(11) TCG(6) SCG(3) T8G(3) CC8(2)
    SC8G(2) TG(2) ZG(2) (1) CCC8G(1) T8(1) TC8G(1) ZCG(1)

Prefix "8":

    G(41) SC8G(17) TC8G(9) C8G(2) CC8G(2) SCC8G(2) SCCG(2) SCG(2)
    T8G(2) TCG(2) (1) CCC8G(1) CCG(1) S8G(1) SC8(1) TCC8G(1) TCCG(1)
    TG(1)

Prefix "OE":

    (176) G(26) TC8G(24) SC8G(14) TCG(13) 8G(12) SCG(8) TG(5) S8G(4)
    T8G(3) TCCG(3) SC8(2) SCCG(2) TCC8G(2) CCC8(1) CCC8G(1) SCC8(1)
    SCC8G(1) TC8(1)

Prefix "H":

    C8G(20) TC8G(10) CC8G(4) CCG(4) SC8G(3) SCG(3) ZC8G(3) ZCG(3) G(2)
    TCG(2) Z8G(2) T8G(1) TCCG(1) TG(1)

Prefix "D":

    C8G(11) CC8G(10) TCG(4) CC8(3) (2) CCC8G(2) CG(2) SCG(2) T8G(2)
    ZCG(2) S8G(1) SC8G(1) TC8G(1) Z8G(1) ZC8G(1) ZG(1)

Prefix "GD":

    CC8G(10) CCG(7) C8G(5) CCCG(1) SC8G(1) ZCG(1)

Prefix "4OED":

    CC8G(4) CCG(4) G(4) CG(1)

It seems that the class consists of the suffixes [S,T,Z,][C]*[8,][G]. I collected all prefixes with these suffixes:

    /bin/rm .bar
    cat bio-c-evt.frq \
      | egrep -v '[%*]' \
      | egrep 'G$' \
      | /bin/sed \
          -e 's@[ ]@ :@' \
          -e 's@[ ]\(.*\)\([STZ][C]*[8]\{0,1\}G\)$@ \1- \2@' \
          -e 's@[ ]\(.*[^STZC]\)\(C[C]*[8]\{0,1\}G\)$@ \1- \2@' \
          -e 's@[ ]\(.*[^STZC]\)\(8G\)$@ \1- \2@' \
          -e 's@[ ]\(.*[^STZC8]\)\(G\)$@ \1- \2@' \
          -e 's@ :@ @' \
      | sort -b +2 -3 +0 -1nr \
      > .bar

Suffix "8G":

    (41) OE(12) E(9) 8AE(4) 4O(3) DOE(3) O(3) SOE(3) 4ODG(2) 4ODO(2)
    4OHAE(2) PSCOE(2) 2OE(1) 4OD(1) 4ODA(1) 4ODAE(1) 4ODC2(1)
    4ODGE8(1) 4ODOE(1) 4ODT2(1) 4OE(1) 4OHAR(1) 4OP(1) 8AIRG(1) 8AR(1)
    A(1) AEAE(1) EDCOE(1) EO(1) EOE(1) G(1) GSCAE(1) HOROES2(1)
    HT8OEH(1) ODA(1) ODAE(1) ODOE(1) OEAO(1) OEHG(1) OHAE(1) OHCCO(1)
    OHCOE(1) OHOE(1) OHTR(1) ORO(1) PZO(1) SC8AE(1) SCCOE(1) SCOE(1)
    SE(1) TO(1) TR(1)

Suffix "C8G":

    4OD(159) OH(48) 4OH(46) OD(43) H(20) OED(18) D(11) ED(7) GH(7)
    GD(5) OEH(3) TD(3) 8(2) 8D(2) EH(2) SCD(2) TCH(2) 2OED(1) 4(1)
    4CH(1) 4ODC8(1) 4ODO(1) 4OR(1) 8GD(1) 8GH(1) 8OD(1) 8OE(1) 8OED(1)
    AED(1) CD(1) OR(1) PGD(1) TH(1)

Suffix "CC8G":

    4OD(149) 4OH(39) OD(32) OH(25) OED(13) D(10) GD(10) 4O(6) ED(6)
    4OED(4) H(4) (3) 2OED(3) GH(3) 2OD(2) 4(2) 8(2) 8GD(2) EH(2)
    OEH(2) 2OEH(1) 42OED(1) 4O8(1) 4OCCD(1) 4OEH(1) 4OR(1) E(1) EO(1)
    G8(1) HD(1) HOEOD(1) HSCOD(1) LED(1) O(1) OCD(1) PO(1) TCD(1)
    TD(1)

Suffix "CCC8G":

    4O(2) 4OD(2) D(2) (1) 4(1) 4OH(1) 8(1) O(1) OD(1) OE(1) OH(1)

Suffix "CCCG":

    4OD(3) OED(2) GD(1) TOED(1)

Suffix "CCG":

    4OD(83) OED(18) OH(18) OD(16) 4OH(7) GD(7) 4OED(4) GH(4) H(4)
    2OED(3) ED(3) 8D(2) (1) 2AED(1) 2D(1) 4(1) 4CC8(1) 4CD(1) 4O(1)
    4O4OD(1) 4O8(1) 4OR(1) 8(1) 8OH(1) EO(1) EP(1) O(1) OEOED(1)
    OHC8(1) POED(1) SCCD(1) SCD(1) TD(1) TGH(1)

Suffix "CG":

    4OD(39) OH(13) OD(11) 4OH(8) OED(5) TCH(3) D(2) SCD(2) TCCD(2)
    2OED(1) 4(1) 4CD(1) 4O8GD(1) 4OED(1) 8GH(1) AH(1) CC2(1) ED(1)
    EDC8(1) GH(1) GROH(1) HOED(1) OEHS8(1) SCCD(1) SCCH(1) SCH(1)
    TCD(1)

Suffix "G":

    4OD(58) 4OH(26) OE(26) OH(17) OD(12) TCD(12) OED(11) SCD(11)
    TCCD(11) SCCD(8) 4OE(7) 8AE(7) E(7) SCCH(7) 4ODAE(6) OR(6) SCH(6)
    (5) 4OED(4) 8AR(4) AE(4) EOE(4) TCH(4) AR(3) ED(3) OHAE(3)
    4ODCO(2) AM(2) ESCH(2) H(2) ODAE(2) OEOD(2) OROE(2) R(2) RTCD(2)
    SCOD(2) SD(2) TCAE(2) TCCH(2) TH(2) 2(1) 2AE(1) 2OE(1) 2OED(1)
    2SCD(1) 2TCH(1) 4CH(1) 4ODAO(1) 4ODCOE(1) 4ODO(1) 4ODOP(1)
    4OEDAR(1) 4OEDCCOE(1) 4OGD(1) 4OHAE(1) 4OHAR(1) 4OHCC2(1) 4OP(1)
    8AEAR(1) 8AED(1) 8AROR(1) 8ETCR(1) 8GD(1) 8GR(1) 8OE(1) A(1)
    A2(1) AEOR(1) CH(1) DAR(1) DOEP(1) DOHAE(1) EAE(1) EEAIIR(1)
    ETC8AR(1) GH(1) GO(1) GOD(1) GR(1) HAE(1) HG4OD(1) O4OD(1)
    ODAN(1) ODGED(1) ODO(1) OE2AE(1) OEAM(1) OEDOR(1) OESH(1) OFOE(1)
    OHAR(1) OHOR(1) ONOE(1) OOR(1) OP(1) OPAE(1) OPOE(1) OPTAE(1)
    OPTCCD(1) OROIR(1) POE8AD(1) POETO(1) RAE(1) RAR(1) ROE(1)
    SCGD(1) SCO(1) SH(1) SOD(1) TAE(1) TAEOE(1) TAR(1) TC2(1)
    TCCCH(1) TCO(1) TCOD(1) TCOE(1) TCP(1) TD(1) TDCA(1) THZO(1)
    TOE(1) TOH(1)

Suffix "S8G":

    (5) OE(4) OH(3) 8AE(2) E(2) 4OD(1) 4OE(1) 4OH(1) 4OP(1) 8(1) D(1)
    ODC(1) OF(1) OHAE(1) OP(1) P(1) POE(1) R(1) SOD(1)

Suffix "SC8G":

    (210) E(23) 8(17) OE(14) G(8) 4OD(6) R(6) 2(5) OH(4) 4OP(3) H(3)
    OP(3) 4OH(2) OD(2) OEH(2) 4(1) 4O(1) 4OE(1) 4OHC8(1) 8AE(1) 8E(1)
    AE(1) CE(1) D(1) GD(1) O(1) OHAE(1) OPAE(1) P(1) POE8(1)

Suffix "SCC8G":

    (33) E(5) 4OE(2) 8(2) R(2) 2(1) 4O(1) G(1) O(1) OE(1) OR(1) TP(1)

Suffix "SCCG":

    (30) E(4) G(3) 8(2) OE(2) TETP(1)

Suffix "SCG":

    (78) OE(8) E(6) G(4) H(3) O(3) OD(3) 2(2) 8(2) D(2) OH(2) 2OE(1)
    4OD(1) 4OE(1) 4OH(1) 8D(1) G8AR(1) GDC(1) ODC(1) OED(1) OEDC(1)
    OG(1) OP(1) POE(1) R(1) SCCP(1)

Suffix "SG":

    (9) ODC(2) 4OE(1) 4OF(1) 4OH(1) 8GD(1) AE(1) HOE(1) POE(1)

Suffix "T8G":

    (19) 4OD(10) E(4) OD(3) OE(3) 8(2) D(2) P(2) 2(1) 2SD(1) 4CD(1)
    4ODC(1) 4ODG(1) 4OE(1) 4OH(1) 4OP(1) AE(1) DC(1) DOE(1) EP(1)
    ETP(1) GE(1) H(1) IEP(1) OF(1) OH(1) OP(1) POE(1) POEAE(1) S(1)
    TCOE(1)

Suffix "TC8G":

    (193) E(55) OE(24) OP(13) P(13) H(10) 4OP(9) 8(9) 4OD(7) 4OE(6)
    R(6) 2(5) 2OE(4) F(4) G(4) 4OH(3) OH(3) 4ODC(2) ED(2) OEF(2)
    OEP(2) OF(2) POE(2) 4O(1) 4ODOE(1) 4OF(1) 4OR(1) 4P(1) 8AE(1)
    8OE(1) 8OEF(1) AE(1) CF(1) D(1) EO(1) EP(1) GF(1) O(1) O4OF(1)
    O8(1) OD(1) OEOH(1) SCCP(1) SCP(1) TCP(1) TP(1)

Suffix "TCC8G":

    (21) E(2) OE(2) 4OE(1) 8(1) 8OE(1) G(1)

Suffix "TCCCG":

    (1)

Suffix "TCCG":

    (34) E(4) OE(3) OP(2) 2OD(1) 8(1) G(1) H(1) O(1) R(1)

Suffix "TCG":

    (74) E(18) OE(13) 4OE(8) OD(6) R(5) D(4) P(4) 2OE(3) 4OD(3) OH(3)
    OP(3) 4OH(2) 8(2) AE(2) H(2) 2(1) 4O(1) 4ODAE(1) 4ODC(1) 4OF(1)
    4OP(1) 8AR(1) 8OE(1) DC(1) DZ(1) ED(1) EDE(1) G(1) GF(1) GS(1)
    OEH(1) OEOE(1) OR(1) TC8(1) TCOE(1) TOE(1)

Suffix "TG":

    E(6) (5) 4OD(5) OE(5) 4OH(3) OD(2) OEE(2) OH(2) SCP(2) 2OE(1)
    2P(1) 4CD(1) 4O(1) 4OE(1) 8(1) CP(1) DOR(1) GDCC(1) H(1) ODAE(1)
    ODC(1) OED(1) OPC(1) POE(1) R(1) SCCD(1) SCHZC8(1) TC(1) TC8(1)

Suffix "Z8G":

    TD(5) H(2) SCD(2) TP(2) 2AESH(1) D(1) TCD(1) TH(1)

Suffix "ZC8G":

    SD(6) TD(5) H(3) SCD(3) TH(2) 4D(1) 4OD(1) 8TD(1) D(1) ESCD(1)
    ETCD(1) ETP(1) SCH(1) SH(1) SOP(1) TCH(1) TCP(1) TP(1)

Suffix "ZCCG":

    F(1)

Suffix "ZCG":

    TD(9) SD(5) 4OD(4) H(3) SCD(3) SH(3) D(2) P(2) TCD(2) TCH(2) TP(2)
    2H(1) 2OD(1) 4H(1) 4OH(1) AH(1) CP(1) ETCD(1) GD(1) GTCD(1)
    HOH(1) OD(1) OH(1) SCCH(1) TCCH(1) TF(1) TH(1)

Suffix "ZG":

    TD(31) SD(23) TCD(21) TH(20) SCD(19) TCH(17) SCH(12) SH(12)
    ETCD(3) ESCD(2) OD(2) OETH(2) SCP(2) TP(2) 2D(1) 2SCD(1) 2TD(1)
    2TH(1) 4H(1) 4OD(1) 4ODCTD(1) 8AD(1) 8SCD(1) D(1) ESD(1) ESH(1)
    ETH(1) GTCH(1) HTH(1) OEPOD(1) OETCD(1) OH(1) RAH(1) SCCD(1)
    SOD(1) TSCD(1)

From looking at these distributions, it seems that the "S", "T", and "Z" are actually part of the roots; and that either the "G" suffix also occurs in words of unrelated classes, or there are more suffixes that end in "G" within this class.
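
The prefix/suffix split performed by the sed cascade can be mirrored in Python. This is a minimal sketch, not a translation verified against the original output: the function name `split_g_word` is hypothetical, an empty prefix is assumed to be allowed (the lists above contain empty-prefix entries such as "(41)" under suffix "8G"), and the four patterns are tried in the same order as the four sed substitutions, with the first match winning.

```python
import re

# One pattern per sed rule, in the same order. The lookbehind (?<![...]) stands in
# for the sed requirement that the prefix not end in one of the listed letters,
# while still allowing an empty prefix (an assumption of this sketch).
_RULES = [
    re.compile(r"(.*)([STZ]C*8?G)"),          # [S,T,Z] C* 8? G
    re.compile(r"(.*(?<![STZC]))(C+8?G)"),    # C+ 8? G, not preceded by S, T, Z or C
    re.compile(r"(.*(?<![STZC]))(8G)"),       # bare 8G
    re.compile(r"(.*(?<![STZC8]))(G)"),       # bare G
]

def split_g_word(word):
    """Return (prefix, suffix) for a word ending in G, or None if no rule applies."""
    for rule in _RULES:
        m = rule.fullmatch(word)
        if m:
            return m.group(1), m.group(2)
    return None

for w in ["4ODCC8G", "OESC8G", "TDZG", "8G", "OEG", "4ODAM"]:
    print(w, split_g_word(w))
```

Because `.*` is greedy, each rule takes the longest possible prefix, so `TDZG` is split as `TD` + `ZG` rather than at the earlier `T`, in line with the `TD(31)` entry under suffix "ZG".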
Here again are the prefixes of the "G" suffix that do not fall in the other classes, this time sorted by the last letter of the prefix:

    (5)

    2(1) A2(1) 4OHCC2(1) TC2(1)

    A(1) TDCA(1)

    4OD(58) OD(12) TCCD(11) SCD(11) OED(11) SCCD(8) POE8AD(1)
    OPTCCD(1) 2SCD(1) TCD(12) 4OED(4) RTCD(2) ED(3) 8AED(1) ODGED(1)
    2OED(1) 8GD(1) SCGD(1) 4OGD(1) HG4OD(1) O4OD(1) SCOD(2) TCOD(1)
    OEOD(2) GOD(1) SOD(1) SD(2) TD(1)

    OE(26) 8AE(7) E(7) AE(4) 2AE(1) OE2AE(1) TCAE(2) ODAE(2) 4ODAE(6)
    EAE(1) HAE(1) OHAE(3) 4OHAE(1) DOHAE(1) OPAE(1) RAE(1) TAE(1)
    OPTAE(1) 2OE(1) 4OE(7) 8OE(1) 4OEDCCOE(1) 4ODCOE(1) TCOE(1)
    EOE(4) TAEOE(1) OFOE(1) ONOE(1) OPOE(1) ROE(1) OROE(2) TOE(1)

    4OH(26) OH(17) SCCH(7) SCH(6) TCH(4) H(2) CH(1) 4CH(1) TCCCH(1)
    TCCH(2) ESCH(2) 2TCH(1) GH(1) TOH(1) SH(1) OESH(1) TH(2)

    AM(2) OEAM(1)

    ODAN(1)

    4ODCO(2) 4ODAO(1) SCO(1) TCO(1) ODO(1) 4ODO(1) GO(1) POETO(1)
    THZO(1)

    TCP(1) DOEP(1) OP(1) 4OP(1) 4ODOP(1)

    OR(6) 8AR(4) R(2) AR(3) ETC8AR(1) DAR(1) 4OEDAR(1) 8AEAR(1)
    OHAR(1) 4OHAR(1) RAR(1) TAR(1) 8ETCR(1) GR(1) 8GR(1) EEAIIR(1)
    OROIR(1) OEDOR(1) AEOR(1) OHOR(1) OOR(1) 8AROR(1)

There is a hint that "DG" and "HG" may be related suffixes. Indeed, there is a hint that the final "D" and "H" in most of these stems may be part of the suffix.
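
Grouping the prefixes by their final letter, with a total per group, makes the weight of the "D" and "H" endings easier to see. The following is a minimal Python sketch; the function name `group_by_last_letter` is hypothetical, and the input is a small subset of the counts listed above, not the full list.

```python
from collections import defaultdict

def group_by_last_letter(prefix_counts):
    """Return [(last_letter, total_count, [(prefix, count), ...]), ...] sorted by letter."""
    groups = defaultdict(list)
    for prefix, count in prefix_counts.items():
        # prefix[-1:] is "" for the empty prefix, so it gets its own group.
        groups[prefix[-1:]].append((prefix, count))
    result = []
    for letter in sorted(groups):
        # Within a group, list the most frequent prefixes first.
        members = sorted(groups[letter], key=lambda pc: (-pc[1], pc[0]))
        total = sum(count for _, count in members)
        result.append((letter, total, members))
    return result

# A few of the prefixes of the bare "G" suffix, with their counts from the list above.
subset = {"": 5, "4OD": 58, "OD": 12, "4OH": 26, "OH": 17, "OE": 26, "E": 7}
for letter, total, members in group_by_last_letter(subset):
    print(repr(letter), total, members)
```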
So here is the new guess about the main class of words that end in "G":

    { DAE8G,HAE8G,
      8G,
      DC8G,HC8G,
      DCC8G,HCC8G,
      DCCC8G,HCCC8G,CCC8G,
      DCCCG,
      EDCCG,DCCG,HCCG,
      EDCG,DCG,HCG,
      DG,HG,EG,EDG,CCDG,CCHG,
      ES8G,DS8G,HS8G,
      ESC8G,DSC8G,HSC8G,SC8G,
      ESCC8G,SCC8G,TCC8G,CC8G,
      ESCCG,CCG,
      DSCG,HSCG,ESCG,
      ESG,HSG,
      ET8G,DT8G,HT8G,PT8G,
      ETC8G,DTC8G,HTC8G,PTC8G,TC8G,
      ETCC8G,
      ETCCG,PTCCG,DTCCG,
      ETCG,DTCG,HTCG,PTCG,DAETCG,DCTCG,
      ETG,DTG,HTG,DCTG,DAETG,
      DZ8G,HZ8G,
      DZC8G,HZC8G,
      DZCG,HZCG,PZCG,
      DZG,CDZG,CHZG }

I tried to separate these suffixes with a giant "sed":

    /bin/rm .bax
    cat bio-c-evt.frq \
      | egrep -v '[%*]' \
      | egrep 'G$' \
      | /bin/sed -f split-G-suffs.sed \
      | /bin/sed -e 's/:$//' \
      | sort -b +2 -3 +0 -1nr \
      > .bax

The results were not very good, so I decided to look closely at the most popular prefixes among the [G] words, which are [4O,O,G,8,2]:

    /bin/rm .bax
    cat bio-c-evt.frq \
      | egrep -v '[%*]' \
      | egrep '[ ](4O|O|G|8|2).*G$' \
      | /bin/sed \
          -e 's@[ ]4O@ 4O- @' \
          -e 's@[ ]O@ O- @' \
          -e 's@[ ]G@ G- @' \
          -e 's@[ ]8@ 8- @' \
          -e 's@[ ]2@ 2- @' \
      | sort -b +1 -2 +0 -1nr \
      > .bax

Suffixes of "2":

    SC8G(5) TC8G(5) OETC8G(4) OEDCC8G(3) OEDCCG(3) OETCG(3) ODCC8G(2)
    SCG(2) AEDCCG(1) AEG(1) AESHZ8G(1) DCCG(1) DZG(1) G(1) HZCG(1)
    ODTCCG(1) ODZCG(1) OE8G(1) OEDC8G(1) OEDCG(1) OEDG(1) OEG(1)
    OEHCC8G(1) OESCG(1) OETG(1) PTG(1) SCC8G(1) SCDG(1) SCDZG(1)
    SDT8G(1) T8G(1) TCG(1) TCHG(1) TDZG(1) THZG(1) (0)

Suffixes of "8":

    G(41) SC8G(17) TC8G(9) AEG(7) AE8G(4) ARG(4) AES8G(2) C8G(2)
    CC8G(2) DC8G(2) DCCG(2) GDCC8G(2) SCC8G(2) SCCG(2) SCG(2) T8G(2)
    TCG(2) ADZG(1) AEARG(1) AEDG(1) AESC8G(1) AETC8G(1) AIRG8G(1)
    AR8G(1) ARORG(1) ARTCG(1) CCC8G(1) CCG(1) DSCG(1) ESC8G(1)
    ETCRG(1) GDC8G(1) GDG(1) GDSG(1) GHC8G(1) GHCG(1) GRG(1) ODC8G(1)
    OEC8G(1) OEDC8G(1) OEFTC8G(1) OEG(1) OETC8G(1) OETCC8G(1)
    OETCG(1) OHCCG(1) S8G(1) SCDZG(1) TCC8G(1) TCCG(1) TDZC8G(1)
    TG(1) (0)

Suffixes of "4O":

    DC8G(159) DCC8G(149) DCCG(83) DG(58) HC8G(46) DCG(39) HCC8G(39)
    HG(26) DT8G(10) PTC8G(9) ETCG(8) HCG(8) DTC8G(7) EG(7) HCCG(7)
    CC8G(6) DAEG(6) DSC8G(6) ETC8G(6) DTG(5) DZCG(4) EDCC8G(4)
    EDCCG(4) EDG(4) 8G(3) DCCCG(3) DTCG(3) HTC8G(3) HTG(3) PSC8G(3)
    CCC8G(2) DCCC8G(2) DCOG(2) DCTC8G(2) DG8G(2) DO8G(2) ESCC8G(2)
    HAE8G(2) HSC8G(2) HTCG(2) 4ODCCG(1) 8CC8G(1) 8CCG(1) 8GDCG(1)
    CCDCC8G(1) CCG(1) D8G(1) DA8G(1) DAE8G(1) DAETCG(1) DAOG(1)
    DC28G(1) DC8C8G(1) DCOEG(1) DCT8G(1) DCTCG(1) DCTDZG(1) DGE88G(1)
    DGT8G(1) DOC8G(1) DOE8G(1) DOETC8G(1) DOG(1) DOPG(1) DS8G(1)
    DSCG(1) DT28G(1) DZC8G(1) DZG(1) E8G(1) EDARG(1) EDCCOEG(1)
    EDCG(1) EHCC8G(1) ES8G(1) ESC8G(1) ESCG(1) ESG(1) ET8G(1)
    ETCC8G(1) ETG(1) FSG(1) FTC8G(1) FTCG(1) GDG(1) HAEG(1) HAR8G(1)
    HARG(1) HC8SC8G(1) HCC2G(1) HCCC8G(1) HS8G(1) HSCG(1) HSG(1)
    HT8G(1) HZCG(1) P8G(1) PG(1) PS8G(1) PT8G(1) PTCG(1) RC8G(1)
    RCC8G(1) RCCG(1) RTC8G(1) SC8G(1) SCC8G(1) TC8G(1) TCG(1) TG(1)
    (0) (0)

Suffixes of "G":

    DCC8G(10) SC8G(8) DCCG(7) HC8G(7) DC8G(5) HCCG(4) SCG(4) TC8G(4)
    HCC8G(3) SCCG(3) 8ARSCG(1) 8CC8G(1) 8G(1) DCCCG(1) DCCTG(1)
    DCSCG(1) DSC8G(1) DZCG(1) ET8G(1) FTC8G(1) FTCG(1) HCG(1) HG(1)
    ODG(1) OG(1) RG(1) ROHCG(1) SCAE8G(1) SCC8G(1) STCG(1) TCC8G(1)
    TCCG(1) TCDZCG(1) TCG(1) TCHZG(1)

Suffixes of "O":

    HC8G(48) DC8G(43) DCC8G(32) EG(26) HCC8G(25) ETC8G(24) EDC8G(18)
    EDCCG(18) HCCG(18) HG(17) DCCG(16) ESC8G(14) EDCC8G(13) ETCG(13)
    HCG(13) PTC8G(13) DG(12) E8G(12) DCG(11) EDG(11) ESCG(8) DTCG(6)
    RG(6) EDCG(5) ETG(5) ES8G(4) HSC8G(4) 8G(3) DSCG(3) DT8G(3)
    EHC8G(3) ET8G(3) ETCCG(3) HAEG(3) HS8G(3) HTC8G(3) HTCG(3)
    PSC8G(3) PTCG(3) SCG(3) DAEG(2) DCSG(2) DSC8G(2) DTG(2) DZG(2)
    EDCCCG(2) EETG(2) EFTC8G(2) EHCC8G(2) EHSC8G(2) EODG(2) EPTC8G(2)
    ESCCG(2) ETCC8G(2) ETHZG(2) FTC8G(2) HSCG(2) HTG(2) PTCCG(2)
    ROEG(2) 4ODG(1) 4OFTC8G(1) 8TC8G(1) CC8G(1) CCC8G(1) CCG(1)
    CDCC8G(1) DA8G(1) DAE8G(1) DAETG(1) DANG(1) DCCC8G(1) DCS8G(1)
    DCSCG(1) DCTG(1) DGEDG(1) DOE8G(1) DOG(1) DTC8G(1) DZCG(1)
    E2AEG(1) EAMG(1) EAO8G(1) ECCC8G(1) EDCSCG(1) EDORG(1) EDSCG(1)
    EDTG(1) EHG8G(1) EHS8CG(1) EHTCG(1) EOEDCCG(1) EOETCG(1)
    EOHTC8G(1) EPODZG(1) ESCC8G(1) ESHG(1) ETCDZG(1) FOEG(1) FS8G(1)
    FT8G(1) GSCG(1) HAE8G(1) HAES8G(1) HAESC8G(1) HARG(1) HC8CCG(1)
    HCCC8G(1) HCCO8G(1) HCOE8G(1) HOE8G(1) HORG(1) HT8G(1) HTR8G(1)
    HZCG(1) HZG(1) NOEG(1) ORG(1) PAEG(1) PAESC8G(1) PCTG(1) PG(1)
    POEG(1) PS8G(1) PSCG(1) PT8G(1) PTAEG(1) PTCCDG(1) RC8G(1)
    RO8G(1) ROIRG(1) RSCC8G(1) RTCG(1) SC8G(1) SCC8G(1) TC8G(1)
    TCCG(1)

Common suffixes:

    cat bio-c-evt.txt \
      | tr ' ' '\012' \
      | egrep -v '[%*/=]' \
      | egrep '.' \
      > bio-c-evt.wds

     lines   words     bytes file
    ------ ------- --------- ------------
      5928    5928     32257 bio-c-evt.wds

    /bin/rm .sf-join.frq
    /bin/touch .sf-join.frq
    set ofmt = "0"
    @ i = 1
    set noglob
    set prefs = ( \
      4OD 4OE 4OH 4O'[^DEH]' \
      OD OED OE8 OE'[^D8]' OH OP OR O'[^DEHPR]' \
      S T \
      '[^4OST]' \
    )
    foreach f ( ${prefs} )
      echo "${i} ${f}"
      /bin/rm .sf-part-${i}.frq
      cat bio-c-evt.wds \
        | egrep '^'"${f}" \
        | /bin/sed \
            -e 's@^'"${f}"'\(.*\)$@-\1@' \
        | sort \
        | uniq -c \
        > .sf-part-${i}.frq
      /n/gnu/bin/join -a 1 -a 2 -j1 2 -e 0 -o "${ofmt},1.1" \
        .sf-part-${i}.frq .sf-join.frq > .sf-tmp.frq
      @ i = ${i} + 1
      set ofmt = "${ofmt},2.${i}"
      mv .sf-tmp.frq .sf-join.frq
    end
    unset ofmt i noglob

    ( echo "# ${prefs}"; cat .sf-join.frq ) \
      | add-counts \
      > .sf-suffs.frq

From the output, it does indeed seem that there are two classes of words, [A,O][EKLMNR] and [G].

Split bio-c-evt.wds into the two classes plus "misc":

    egrep '[AO][EKLMNR]$' bio-c-evt.wds > bio-c-evt-eklmnr.wds
    egrep 'G$' bio-c-evt.wds > bio-c-evt-g.wds
    egrep -v '([AO][EKLMNR]|G)$' bio-c-evt.wds > bio-c-evt-other.wds

     lines   words     bytes file
    ------ ------- --------- ------------
      3215    3215     19119 bio-c-evt-g.wds
      2347    2347     11361 bio-c-evt-eklmnr.wds
       366     366      1777 bio-c-evt-other.wds
      5928    5928     32257 bio-c-evt.wds

Next step: AutoAnalysis on each category.
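
The three-way split can also be expressed as a single pass over the word list. This is a minimal Python sketch using the same two regular expressions as the egrep commands; the function name `split_classes`, the dictionary keys, and the short word list are hypothetical.

```python
import re

# Same patterns as the egrep commands: a word goes to "eklmnr" if it ends in
# [AO][EKLMNR], to "g" if it ends in G, and to "other" otherwise.
EKLMNR_RE = re.compile(r"[AO][EKLMNR]$")
G_RE = re.compile(r"G$")

def split_classes(words):
    """Partition words into the [A,O][EKLMNR] class, the [G] class, and the rest."""
    classes = {"eklmnr": [], "g": [], "other": []}
    for w in words:
        if EKLMNR_RE.search(w):
            classes["eklmnr"].append(w)
        elif G_RE.search(w):
            classes["g"].append(w)
        else:
            classes["other"].append(w)
    return classes

# Hypothetical mini-sample of words in Currier notation.
sample = ["4ODAM", "4ODCC8G", "SC8G", "OE", "8AR", "TC8", "OR", "2"]
result = split_classes(sample)
for name, members in result.items():
    print(name, len(members), members)
```

Since the two patterns cannot both match the same word, every word lands in exactly one class, and the three class sizes add up to the size of the input, as in the table above (3215 + 2347 + 366 = 5928).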