I had already remarked that there were two main categories of words,
\qoljc/-\qoqjc/ and \ccc/-\cscc/. Comparing the two lists, it seems
that the latter can be split into two subclasses, with and without a
gallows \lj/ (or \cljc/) in their "suffix". So I tabulated the three
classes of words:

    cat bio-j-huc-gut.wds \
      | sed -e 's/^/_/g' -e 's/$/_/g' \
      | compare-contexts '_AHc.*_' '_ccc[^HP]*_' '_ccc.*[HP].*_'

     201 0.20 AHccgci         387 0.46 ccccgci          90 0.29 cccHcci
     191 0.19 AHcccgci        139 0.16 cccci            68 0.22 ccccHcci
     116 0.11 AHcie            61 0.07 ccccci           31 0.10 cccHci
      94 0.09 AHcim            60 0.07 cccccgci         26 0.08 ccccHci
      87 0.09 AHccci           33 0.04 cccoe            15 0.05 cccHcccgci
      79 0.08 AHci             21 0.02 cccgci           11 0.04 cccHccci
      54 0.05 AHcin            17 0.02 cccor            10 0.03 cccHccgci
      50 0.05 AHcir            14 0.02 ccci              5 0.02 ccccHcccgci
      43 0.04 AHcci            10 0.01 cccir             5 0.02 cccPccci
      19 0.02 AHccccgci        10 0.01 cccc              4 0.01 ccccHccci
      12 0.01 AHcccci           9 0.01 ccccgcim          4 0.01 cccH
       7 0.01 AHcieci           9 0.01 ccccgcie          3 0.01 cccHcir
       5 0.00 AHcccg            8 0.01 ccccir            2 0.01 cccoHci
       4 0.00 AHcor             7 0.01 cccie             2 0.01 ccccPcci
       4 0.00 AHcif             7 0.01 ccccie            2 0.01 ccccHccie
       4 0.00 AHccgcir          7 0.01 ccccg             2 0.01 ccccHccgci
       3 0.00 AHciecgci         4 0.00 ccccoe            2 0.01 cccPccccgci
       3 0.00 AHcicgci          4 0.00 ccccgcir          2 0.01 cccHcim
       3 0.00 AHccg             4 0.00 ccccc             2 0.01 cccHcif
       2 0.00 AHcic             3 0.00 cccif             2 0.01 cccHc
       2 0.00 AHccgcie          3 0.00 cccgcir           1 0.00 ccciHccci
       2 0.00 AHcccgcie         3 0.00 ccccgoe           1 0.00 cccciHci
       2 0.00 AHcccccgci        2 0.00 ccccieci          1 0.00 ccccgciHcir
       1 0.00 AHcoeci           1 0.00 cccoeoe           1 0.00 cccccHcci
       1 0.00 AHcoe             1 0.00 cccoeo            1 0.00 cccccHccci
       1 0.00 AHcirci           1 0.00 cccoecgci         1 0.00 ccccPcccgci
       1 0.00 AHcircgci         1 0.00 cccoecccgci       1 0.00 ccccHco
       1 0.00 AHciie            1 0.00 cccoecccci        1 0.00 ccccHcccgccci
       1 0.00 AHcieoe           1 0.00 ccciror           1 0.00 cccPoe
       1 0.00 AHciecgcgci       1 0.00 cccirci           1 0.00 cccPci
       1 0.00 AHciecccci        1 0.00 cccieoeci         1 0.00 cccPcccgci
       1 0.00 AHciec            1 0.00 cccgoe            1 0.00 cccP
       1 0.00 AHcicccgci        1 0.00 cccgcin           1 0.00 cccHcor
       1 0.00 AHcgci            1 0.00 cccgcif           1 0.00 cccHcie
       1 0.00 AHccoe            1 0.00 cccgcie           1 0.00 cccHccie
       1 0.00 AHccir            1 0.00 ccccor            1 0.00 cccHccg
       1 0.00 AHccie            1 0.00 ccccieor          1 0.00 cccHcccie
       1 0.00 AHccgor           1 0.00 ccccgor           1 0.00 cccHcccci
       1 0.00 AHccgcim          1 0.00 ccccgcif          1 0.00 cccHccccg
       1 0.00 AHccgccgci        1 0.00 ccccgciecgci   ---- ---- ----
       1 0.00 AHccgccccgci      1 0.00 ccccgccci       307 1.00 TOT
       1 0.00 AHccccg           1 0.00 cccccoe
       1 0.00 AHccccci          1 0.00 cccccim
       1 0.00 AHccccHcci        1 0.00 cccccie
       1 0.00 AHccc             1 0.00 cccccgcir
       1 0.00 AHcc              1 0.00 cccccg
    ---- ---- ----              1 0.00 cccccci
    1010 1.00 TOT            ---- ---- ----
                              846 1.00 TOT

There are only 34 words that begin with "AH" but not "AHc". Of these,
21 are "AHoe", which may well be misreadings of "AHcie".

Among the remaining words, there seem to be four additional major
classes:

    cat bio-j-huc-gut.wds \
      | sed -e 's/^/_/g' -e 's/$/_/g' \
      | compare-contexts '_oHc.*_' '_cgc.*_' '_oeHc.*_' '_Hc.*_'

     85 0.20 oHccgci          72 0.22 cgcim            28 0.21 Hccgci           20 0.19 oeHccgci
     57 0.13 oHcccgci         51 0.16 cgcir            18 0.13 Hccccgci         19 0.18 oeHccci
     41 0.09 oHcim            50 0.15 cgcie            16 0.12 Hcccgci          15 0.15 oeHcccgci
     40 0.09 oHcie            36 0.11 cgci             10 0.07 Hcie             10 0.10 oeHci
     36 0.08 oHcir            27 0.08 cgccccgci        10 0.07 Hcccci            7 0.07 oeHcir
     35 0.08 oHccci           12 0.04 cgcin             8 0.06 Hcir              7 0.07 oeHcim
     28 0.06 oHci              6 0.02 cgcif             7 0.05 Hcim              5 0.05 oeHcin
     22 0.05 oHcci             6 0.02 cgcieci           5 0.04 Hccci             5 0.05 oeHcie
     16 0.04 oHcin             5 0.02 cgcirci           4 0.03 Hccoe             4 0.04 oeHcci
     13 0.03 oHcccci           5 0.02 cgcccgci          3 0.02 Hcoe              3 0.03 oeHcccci
     10 0.02 oHccccgci         4 0.01 cgcccoe           3 0.02 Hcccoe            2 0.02 oeHccccgci
      6 0.01 oHcoe             3 0.01 cgciecgci         2 0.01 Hcin              1 0.01 oeHcoe
      4 0.01 oHcieor           3 0.01 cgccoe            2 0.01 Hci               1 0.01 oeHcif
      4 0.01 oHcieci           3 0.01 cgcccci           2 0.01 Hcci              1 0.01 oeHcicgci
      4 0.01 oHccgcir          3 0.01 cgccccci          2 0.01 Hcccg             1 0.01 oeHccg
      3 0.01 oHcccg            2 0.01 cgcieo            1 0.01 Hcircie           1 0.01 oeHcccir
      2 0.00 oHcieoe           2 0.01 cgciecccgci       1 0.01 Hcirci            1 0.01 oeHccccci
      2 0.00 oHciecgci         2 0.01 cgcieccccgci      1 0.01 Hcif            --- ---- ----
      2 0.00 oHccgcie          2 0.01 cgciHccgci        1 0.01 Hcieci          103 1.00 TOT
      1 0.00 oHcoeor           2 0.01 cgciHcccgci       1 0.01 Hcic
      1 0.00 oHcoecgci         2 0.01 cgccor            1 0.01 HciAHci
      1 0.00 oHciroeof         2 0.01 cgccgci           1 0.01 Hccgcioe
      1 0.00 oHcircif          2 0.01 cgccci            1 0.01 Hccgcie
      1 0.00 oHcioc            2 0.01 cgcccccgci        1 0.01 HcccoHcccgci
      1 0.00 oHcieccci         1 0.00 cgcirorci         1 0.01 Hcccir
      1 0.00 oHciecccgci       1 0.00 cgcirof           1 0.01 Hcccie
      1 0.00 oHcieccccgci      1 0.00 cgcircccci        1 0.01 HcccgoeHcgci
      1 0.00 oHcieHci          1 0.00 cgcircccccgcie    1 0.01 Hcccgcif
      1 0.00 oHcicgci          1 0.00 cgciie            1 0.01 Hccccgcir
      1 0.00 oHcgcie           1 0.00 cgcieoe           1 0.01 Hccccci
      1 0.00 oHccor            1 0.00 cgciecirci      --- ---- ----
      1 0.00 oHccoe            1 0.00 cgciecircg      135 1.00 TOT
      1 0.00 oHccocgci         1 0.00 cgciecir
      1 0.00 oHccoHcir         1 0.00 cgciecim
      1 0.00 oHccgcim          1 0.00 cgciecie
      1 0.00 oHccgcif          1 0.00 cgcieHci
      1 0.00 oHccgcieor        1 0.00 cgcicHcci
      1 0.00 oHccgccci         1 0.00 cgciHcim
      1 0.00 oHccg             1 0.00 cgciHci
      1 0.00 oHcccir           1 0.00 cgciHcci
      1 0.00 oHcccie           1 0.00 cgciHccci
      1 0.00 oHcccgcie         1 0.00 cgccoecicg
      1 0.00 oHccccic          1 0.00 cgccccoe
      1 0.00 oHccccci          1 0.00 cgcccccgcie
    --- ---- ----              1 0.00 cgccccc
    435 1.00 TOT               1 0.00 cgcccHcccgci
                             --- ---- ----
                             326 1.00 TOT

The rest, again, seems to contain two major classes of words:

    cat bio-j-huc-gut.wds \
      | sed -e 's/^/_/g' -e 's/$/_/g' \
      | compare-contexts '_oec.*_' '_ec.*_'

     73 0.39 eccccgci       38 0.29 oeccccgci
     20 0.11 ecccci         22 0.17 oeci
      8 0.04 ecgci          19 0.15 oecccci
      7 0.04 ecccgci         9 0.07 oecgci
      6 0.03 eci             6 0.05 oecccgci
      6 0.03 eccccci         4 0.03 oeccci
      6 0.03 ecccccgci       4 0.03 oeccccci
      5 0.03 eccci           3 0.02 oecin
      5 0.03 eccccg          3 0.02 oecim
      5 0.03 eccccHcci       3 0.02 oecccoe
      3 0.02 ecie            3 0.02 oeccccg
      3 0.02 eccor           2 0.02 oecie
      3 0.02 ecccoe          2 0.02 oecccHcci
      3 0.02 ecccHcci        2 0.02 oec
      2 0.01 ecir            1 0.01 oecir
      2 0.01 ecim            1 0.01 oecimci
      2 0.01 ecce            1 0.01 oeccieci
      2 0.01 eccccHcccgci    1 0.01 oecccor
      2 0.01 ecccHci         1 0.01 oecccgcir
      2 0.01 ecc             1 0.01 oeccccif
      1 0.01 ecif            1 0.01 oecccccgci
      1 0.01 ecgoe           1 0.01 oecccccg
      1 0.01 ecgcir          1 0.01 oeccccc
      1 0.01 ecgcim          1 0.01 oeccccHcci
      1 0.01 ecgcieor        1 0.01 oeccHci
      1 0.01 ecg           --- ---- ----
      1 0.01 ecco          131 1.00 TOT
      1 0.01 ecccir
      1 0.01 ecccgciA
      1 0.01 eccccor
      1 0.01 eccccir
      1 0.01 eccccif
      1 0.01 eccccgcirci
      1 0.01 eccccgcir
      1 0.01 eccccgcie
      1 0.01 eccccc
      1 0.01 eccccHcie
      1 0.01 eccccHccci
      1 0.01 eccc
      1 0.01 ecccPcccgci
    --- ---- ----
    185 1.00 TOT

Yet some more:

    cat bio-j-huc-gut.wds \
      | sed -e 's/^/_/g' -e 's/$/_/g' \
      | compare-contexts '_oPc.*_' '_Pc.*_' '_cic.*_' '_ciHc.*_'

     18 0.49 oPccccgci      17 0.40 Pccccgci       12 0.32 ciccccgci      12 0.23 ciHccgci
      4 0.11 oPcccci         4 0.10 Pcccoe          5 0.13 cicccci        12 0.23 ciHcccgci
      2 0.05 oPcir           4 0.10 Pccccgcir       4 0.11 ciccccci       10 0.19 ciHccci
      2 0.05 oPci            3 0.07 Pcccgci         2 0.05 cicccccgci      4 0.08 ciHcim
      2 0.05 oPcccgci        3 0.07 Pcccci          1 0.03 cicir           3 0.06 ciHcie
      2 0.05 oPccccci        2 0.05 Pccor           1 0.03 cicgcircie      2 0.04 ciHcir
      1 0.03 oPcieci         2 0.05 Pccoe           1 0.03 cicgcircccci    2 0.04 ciHcin
      1 0.03 oPcieccccgci    1 0.02 Pcir            1 0.03 cicgcim         2 0.04 ciHcci
      1 0.03 oPcie           1 0.02 PciHccgci       1 0.03 cicgci          2 0.04 ciHcccgcir
      1 0.03 oPcieHcim       1 0.02 Pcgoe           1 0.03 cicccoe         1 0.02 ciHci
      1 0.03 oPcccieci       1 0.02 Pcgcieccor      1 0.03 cicccciecgci    1 0.02 ciHcccoe
      1 0.03 oPccccgcie      1 0.02 Pcccoecgci      1 0.03 ciccccgcir      1 0.02 ciHccccgci
      1 0.03 oPccccg         1 0.02 Pccciroe        1 0.03 ciccccg         1 0.02 ciHccccci
    --- ---- ----            1 0.02 Pccccgcie       1 0.03 cicccccci     --- ---- ----
     37 1.00 TOT           --- ---- ----            1 0.03 ciccccc        53 1.00 TOT
                            42 1.00 TOT             1 0.03 ciccccHcci
                                                    1 0.03 ciccccHccci
                                                    1 0.03 cicPcim
                                                    1 0.03 cicHccci
                                                  --- ---- ----
                                                   38 1.00 TOT

Yet some more (yawn):

    cat bio-j-huc-gut.wds \
      | sed -e 's/^/_/g' -e 's/$/_/g' \
      | compare-contexts '_APc.*_' '_rc.*_' '_Aec.*_' '_eHc.*_'

     13 0.22 rcim          8 0.20 eHccgci       9 0.24 Aecccci      11 0.55 APccccgci
     10 0.17 rccccgci      7 0.17 eHcccgci      7 0.19 Aeci          2 0.10 APcccgci
      6 0.10 rcie          6 0.15 eHcim         7 0.19 Aeccccgci     2 0.10 APcccci
      5 0.08 rcccci        4 0.10 eHcir         3 0.08 Aecccccgci    1 0.05 APci
      3 0.05 rcir          3 0.07 eHccci        2 0.05 Aecim         1 0.05 APcgci
      3 0.05 rcif          2 0.05 eHcin         2 0.05 Aecie         1 0.05 APccoe
      2 0.03 rci           2 0.05 eHci          2 0.05 Aeccci        1 0.05 APcccoe
      2 0.03 rccccg        2 0.05 eHccccgci     1 0.03 Aecir         1 0.05 APcccgcie
      2 0.03 rcccccgci     1 0.02 eHcoecgci     1 0.03 Aecin       --- ---- ----
      2 0.03 rcccHci       1 0.02 eHcifo        1 0.03 Aecgci       20 1.00 TOT
      1 0.02 rcirci        1 0.02 eHcieor       1 0.03 Aecccoe
      1 0.02 rcin          1 0.02 eHcci         1 0.03 Aecccgci
      1 0.02 rcieci        1 0.02 eHccgcci    --- ---- ----
      1 0.02 rciecce       1 0.02 eHcccg       37 1.00 TOT
      1 0.02 rcicHcci      1 0.02 eHcccci
      1 0.02 rccci       --- ---- ----
      1 0.02 rcccgci      41 1.00 TOT
      1 0.02 rcccciecg
      1 0.02 rccccie
      1 0.02 rccccci
      1 0.02 rcccc
      1 0.02 rccc
    --- ---- ----
     60 1.00 TOT

On casual inspection, many of these classes seem to consist of two
superimposed classes, differing by a "c/cc" switch. For instance, the
'_ccc.*[HP].*_' class seems to be the union of '_cccHc.*_' and
'_ccccHc.*_'.

Let's try to identify the suffixes:

    /bin/rm -f .title
    /bin/rm -f .table
    /bin/echo "_" > .table
    set noglob
    set ofmt = "0"
    set npat = 1
    foreach pat ( \
        'cc[^H]' 'cc.*H' \
        'ci[^H]' 'ci.*H' \
        'cg[^H]' 'cg.*H' \
        'oe[^H]' 'oe.*H' \
        'Ae[^H]' 'Ae.*H' \
        'e[^H]' 'e.*H' \
        'r[^H]' 'r.*H' \
        'AH' 'oH' \
        'AP' 'oP' \
        'H' 'P' \
      )
      /n/gnu/bin/printf " %7s" "${pat}" >> .title
      /bin/cat bio-j-huc-gut.wds \
        | /n/gnu/bin/sed -e 's/^/_/g' -e 's/$/_/g' \
        | /n/gnu/bin/egrep "_${pat}[^H]*_" \
        | /n/gnu/bin/sed -e "s/_${pat}//g" \
        | /n/gnu/bin/sort | uniq -c \
        | /n/gnu/bin/expand \
        > .suff.frq
      /n/gnu/bin/join -a 1 -a 2 -1 1 -2 2 -o"${ofmt},2.1" -e 0 .table .suff.frq > .tmp
      /bin/mv .tmp .table
      @ npat = ${npat} + 1
      set ofmt = "${ofmt},1.${npat}"
    end
    unset noglob
    /n/gnu/bin/printf "\n" >> .title
    cat .table \
      | /n/gnu/bin/gawk '/./ { s=0; for(i=2;i<=NF;i++) s+=$(i); print s, $0 }' \
      | sort -nr \
      > .tbsort
    cat .title .tbsort \
      | format-suffix-table

Here are the results:

    SUFFIX       TOTAL  cc[^H]   cc.*H  ci[^H]   ci.*H  cg[^H]   cg.*H  oe[^H]   oe.*H
    ------------ ----- ------- ------- ------- ------- ------- ------- ------- -------
    TOTALORUM     4262     966     322     103      58     345      18     158     115
    ------------ ----- ------- ------- ------- ------- ------- ------- ------- -------
    cccgci         503       0      21      14      12      27       3      38      15
ccgci 455 60 15 0 12 5 6 6 20 cgci 392 388 0 0 0 2 0 0 0 ci 353 139 68 7 2 0 2 0 13 cci 327 61 160 0 4 2 2 4 8 ccci 254 1 20 5 12 3 3 21 20 cie 187 7 4 0 3 0 0 0 5 cim 167 1 5 0 4 0 1 0 7 cir 125 8 5 1 2 0 0 0 8 ccccgci 125 1 0 4 1 3 0 5 3 im 92 0 0 0 0 72 0 3 0 oe 90 33 3 3 0 1 0 0 1 e 90 37 0 0 0 17 0 7 0 i 87 14 0 0 0 36 0 22 0 cin 81 0 0 0 2 0 0 0 5 _ 78 7 4 47 0 4 0 4 0 cccci 71 0 2 5 0 3 1 4 3 ie 70 7 0 0 0 50 0 2 0 ir 69 10 0 1 0 51 0 1 0 r 42 13 0 0 0 3 0 13 0 gci 40 21 0 1 0 0 0 9 0 m 32 30 0 0 0 0 0 0 0 or 31 17 0 2 0 0 0 0 0 ccoe 22 1 0 1 0 4 0 3 0 cccg 22 0 0 1 0 0 0 3 0 coe 19 4 1 0 0 3 0 0 1 in 17 0 0 0 0 12 0 3 0 cieci 16 2 0 0 0 0 0 1 0 c 15 11 2 0 0 0 0 0 0 if 13 3 0 0 0 6 0 0 0 cor 11 1 1 0 0 2 0 0 0 cgcim 11 11 0 0 0 0 0 0 0 cgcie 10 9 0 0 0 0 0 0 0 ccgcir 10 1 0 0 0 0 0 1 0 cccoe 10 0 0 0 1 1 0 0 0 cif 9 0 2 0 0 0 0 0 1 cg 8 7 0 0 0 0 0 0 0 ccccci 8 0 0 1 1 0 0 0 1 irci 7 1 0 0 0 5 0 0 0 ieci 7 0 0 0 0 6 0 0 0 eci 7 2 0 0 0 0 0 0 0 ccg 7 1 1 0 0 0 0 0 1 cc 7 4 0 0 0 0 0 0 0 cieor 6 1 0 0 0 0 0 0 0 ciecgci 6 0 0 1 0 0 0 0 0 cicgci 5 0 0 0 0 0 0 0 1 cgcir 5 4 0 0 0 0 0 0 0 ccie 5 1 3 0 0 0 0 0 0 ccgcie 5 0 0 0 0 0 0 0 0 cccgcie 5 0 0 0 0 0 0 0 0 ccccgcir 5 0 0 0 0 0 0 0 0 Pccci 5 5 0 0 0 0 0 0 0 oecgci 4 1 0 0 0 0 0 0 0 oecccgci 4 1 0 0 0 0 0 0 0 gcir 4 3 0 0 0 0 0 0 0 ecgci 4 3 0 0 0 0 0 0 0 cgoe 4 3 0 0 0 0 0 0 0 ccor 4 0 0 0 0 0 0 1 0 cccir 4 0 0 0 0 0 0 0 1 cccie 4 0 1 0 0 0 0 0 0 cccgcir 4 0 0 1 2 0 0 0 0 ccccg 4 0 1 0 0 0 0 1 0 cccc 4 0 0 1 0 1 0 1 0 om 3 0 0 0 0 0 0 0 1 oeccccgci 3 0 0 0 0 0 0 0 0 iecgci 3 0 0 0 0 3 0 0 0 ecccci 3 1 0 0 0 0 0 1 0 circi 3 0 0 0 0 0 0 0 0 cieoe 3 0 0 0 0 0 0 0 0 cic 3 0 0 0 0 0 0 0 0 ccccgcie 3 0 0 0 0 1 0 0 0 cccccgci 3 1 0 0 0 0 0 0 0 roe 2 0 0 0 0 0 0 1 0 rcim 2 0 0 0 0 1 0 0 0 rcie 2 1 0 0 0 0 0 0 0 oeoe 2 1 0 0 0 0 0 0 0 oeci 2 0 0 0 0 0 0 0 0 oeccci 2 0 0 0 0 0 0 0 0 oecccci 2 1 0 0 0 0 0 0 0 ieo 2 0 0 0 0 2 0 0 0 iecccgci 2 0 0 0 0 2 0 0 0 ieccccgci 2 0 0 0 0 2 0 0 0 goe 2 1 0 0 0 0 0 0 0 gcim 2 0 0 1 
0 0 0 0 0 eor 2 0 0 0 0 0 0 0 0 coecgci 2 0 0 0 0 0 0 0 0 co 2 0 1 0 0 0 0 0 0 cieccccgci 2 0 0 0 0 0 0 0 0 ce 2 0 0 0 0 0 0 0 0 ccir 2 0 0 0 0 0 0 0 0 ccgcim 2 0 0 0 0 0 0 0 0 cccif 2 0 0 0 0 0 0 1 0 ccc 2 0 0 0 0 0 0 0 0 cPcci 2 2 0 0 0 0 0 0 0 cPcccgci 2 2 0 0 0 0 0 0 0 Pccccgci 2 2 0 0 0 0 0 0 0 rcin 1 0 0 0 0 0 0 0 0 oroeccccgci 1 0 0 0 0 0 0 0 0 orci 1 0 0 1 0 0 0 0 0 orccci 1 0 0 0 0 0 0 0 0 of 1 0 0 0 0 0 0 1 0 oeo 1 1 0 0 0 0 0 0 0 oecircir 1 0 0 0 0 0 0 0 0 oecim 1 0 0 0 0 0 0 0 0 oeciecccgci 1 0 0 0 0 0 0 0 0 oecgcie 1 0 0 0 0 0 0 0 0 oecgccccgci 1 0 0 0 0 0 0 0 0 oecccgcie 1 0 0 0 0 0 0 0 0 oecccg 1 0 0 0 0 0 0 0 0 oeccccg 1 0 0 0 0 0 0 0 0 oePci 1 0 0 0 0 0 0 0 0 ocgcie 1 0 0 0 0 0 0 0 0 oPci 1 0 0 0 0 0 0 0 0 no 1 1 0 0 0 0 0 0 0 irorci 1 0 0 0 0 1 0 0 0 iror 1 1 0 0 0 0 0 0 0 irof 1 0 0 0 0 1 0 0 0 ircccci 1 0 0 0 0 1 0 0 0 ircccccgcie 1 0 0 0 0 1 0 0 0 imci 1 0 0 0 0 0 0 1 0 iie 1 0 0 0 0 1 0 0 0 ieoeci 1 1 0 0 0 0 0 0 0 ieoe 1 0 0 0 0 1 0 0 0 iecirci 1 0 0 0 0 1 0 0 0 iecircg 1 0 0 0 0 1 0 0 0 iecir 1 0 0 0 0 1 0 0 0 iecim 1 0 0 0 0 1 0 0 0 iecie 1 0 0 0 0 1 0 0 0 iecce 1 0 0 0 0 0 0 0 0 gcircie 1 0 0 1 0 0 0 0 0 gcircccci 1 0 0 1 0 0 0 0 0 gcin 1 1 0 0 0 0 0 0 0 gcif 1 1 0 0 0 0 0 0 0 gcieor 1 0 0 0 0 0 0 0 0 gcie 1 1 0 0 0 0 0 0 0 g 1 0 0 0 0 0 0 0 0 eof 1 0 0 0 0 0 0 0 0 eoe 1 0 0 0 0 0 0 0 0 eo 1 1 0 0 0 0 0 0 0 eccccgci 1 0 0 0 0 1 0 0 0 eccccg 1 1 0 0 0 0 0 0 0 ecccccgci 1 0 0 0 0 1 0 0 0 ePccccgci 1 0 0 0 0 1 0 0 0 coeor 1 0 0 0 0 0 0 0 0 coecicg 1 0 0 0 0 1 0 0 0 coeci 1 0 0 0 0 0 0 0 0 ciroeof 1 0 0 0 0 0 0 0 0 ciroe 1 0 1 0 0 0 0 0 0 circif 1 0 0 0 0 0 0 0 0 circie 1 0 0 0 0 0 0 0 0 circgci 1 0 0 0 0 0 0 0 0 cioc 1 0 0 0 0 0 0 0 0 ciie 1 0 0 0 0 0 0 0 0 cifo 1 0 0 0 0 0 0 0 0 ciecgcgci 1 0 0 0 0 0 0 0 0 cieccci 1 0 0 0 0 0 0 0 0 ciecccgci 1 0 0 0 0 0 0 0 0 ciecccci 1 0 0 0 0 0 0 0 0 ciec 1 0 0 0 0 0 0 0 0 cicccgci 1 0 0 0 0 0 0 0 0 cgor 1 1 0 0 0 0 0 0 0 cgcif 1 1 0 0 0 0 0 0 0 cgciecgci 1 1 0 0 0 0 0 0 0 cgcieccor 1 0 0 0 0 0 0 0 0 
cgccci 1 1 0 0 0 0 0 0 0 ccoeci 1 0 0 0 0 0 0 0 0 ccocgci 1 0 0 0 0 0 0 0 0 ccim 1 1 0 0 0 0 0 0 0 ccgor 1 0 0 0 0 0 0 0 0 ccgcioe 1 0 0 0 0 0 0 0 0 ccgcif 1 0 0 0 0 0 0 0 0 ccgcieor 1 0 0 0 0 0 0 0 0 ccgciA 1 0 0 0 0 0 0 0 0 ccgcci 1 0 0 0 0 0 0 0 0 ccgccgci 1 0 0 0 0 0 0 0 0 ccgccci 1 0 0 0 0 0 0 0 0 ccgccccgci 1 0 0 0 0 0 0 0 0 cccor 1 0 0 0 0 0 0 0 0 cccoecgci 1 0 0 0 0 0 0 0 0 ccciroe 1 0 0 0 0 0 0 0 0 cccieci 1 0 0 0 0 0 0 0 0 ccciecgci 1 0 0 1 0 0 0 0 0 ccciecg 1 0 0 0 0 0 0 0 0 cccgcirci 1 0 0 0 0 0 0 0 0 cccgcif 1 0 0 0 0 0 0 0 0 cccgccci 1 0 1 0 0 0 0 0 0 ccccic 1 0 0 0 0 0 0 0 0 ccccc 1 0 0 1 0 0 0 0 0 ccPcccgci 1 0 0 0 0 0 0 0 0 ccPccccci 1 1 0 0 0 0 0 0 0 Poe 1 1 0 0 0 0 0 0 0 Pcim 1 0 0 1 0 0 0 0 0 Pci 1 1 0 0 0 0 0 0 0 Pcccgci 1 1 0 0 0 0 0 0 0 P 1 1 0 0 0 0 0 0 0 SUFFIX TOTAL Ae[^H] Ae.*H e[^H] e.*H r[^H] r.*H AH oH AP oP H P ------------ ----- ------ ------ ------ ------ ------ ------ ------ ------ ------ ------ ------ TOTALORUM 4262 38 13 219 61 73 3 1043 451 23 40 150 63 ------------ ----- ------ ------ ------ ------ ------ ------ ------ ------ ------ ------ ------ cccgci 503 7 2 74 9 10 0 191 57 2 2 16 3 ccgci 455 1 0 7 8 1 0 201 85 0 0 28 0 cgci 392 0 0 0 0 0 0 1 0 1 0 0 0 ci 353 0 4 0 4 0 2 79 28 1 2 2 0 cci 327 2 1 5 9 1 1 43 22 0 0 2 0 ccci 254 9 3 21 4 5 0 87 35 0 0 5 0 cie 187 0 0 0 1 0 0 116 40 0 1 10 0 cim 167 0 0 0 7 0 0 94 41 0 0 7 0 cir 125 0 0 0 4 0 0 50 36 0 2 8 1 ccccgci 125 3 0 8 2 2 0 19 10 11 18 18 17 im 92 2 0 2 0 13 0 0 0 0 0 0 0 oe 90 0 0 1 1 0 0 21 10 1 1 6 8 e 90 1 0 16 1 9 0 2 0 0 0 0 0 i 87 7 0 6 0 2 0 0 0 0 0 0 0 cin 81 0 0 0 2 0 0 54 16 0 0 2 0 _ 78 0 0 5 0 0 0 2 3 1 0 1 0 cccci 71 0 1 6 1 1 0 12 13 2 4 10 3 ie 70 2 0 3 0 6 0 0 0 0 0 0 0 ir 69 1 0 2 0 3 0 0 0 0 0 0 0 r 42 0 0 10 0 3 0 0 0 0 0 0 0 gci 40 1 0 8 0 0 0 0 0 0 0 0 0 m 32 0 0 0 0 2 0 0 0 0 0 0 0 or 31 0 0 0 0 0 0 4 2 1 1 4 0 ccoe 22 1 0 3 0 0 0 1 1 1 0 4 2 cccg 22 0 0 5 1 2 0 5 3 0 0 2 0 coe 19 0 0 0 0 0 0 1 6 0 0 3 0 in 17 1 0 0 0 1 0 0 0 0 0 0 0 cieci 16 0 0 
0 0 0 0 7 4 0 1 1 0 c 15 0 0 2 0 0 0 0 0 0 0 0 0 if 13 0 0 1 0 3 0 0 0 0 0 0 0 cor 11 0 0 3 0 0 0 4 0 0 0 0 0 cgcim 11 0 0 0 0 0 0 0 0 0 0 0 0 cgcie 10 0 0 0 0 0 0 0 1 0 0 0 0 ccgcir 10 0 0 0 0 0 0 4 4 0 0 0 0 cccoe 10 0 0 0 0 0 0 0 0 1 0 3 4 cif 9 0 0 0 1 0 0 4 0 0 0 1 0 cg 8 0 0 1 0 0 0 0 0 0 0 0 0 ccccci 8 0 0 0 0 0 0 1 1 0 2 1 0 irci 7 0 0 0 0 1 0 0 0 0 0 0 0 ieci 7 0 0 0 0 1 0 0 0 0 0 0 0 eci 7 0 0 4 0 1 0 0 0 0 0 0 0 ccg 7 0 0 0 0 0 0 3 1 0 0 0 0 cc 7 0 0 1 0 1 0 1 0 0 0 0 0 cieor 6 0 0 0 1 0 0 0 4 0 0 0 0 ciecgci 6 0 0 0 0 0 0 3 2 0 0 0 0 cicgci 5 0 0 0 0 0 0 3 1 0 0 0 0 cgcir 5 0 0 1 0 0 0 0 0 0 0 0 0 ccie 5 0 0 0 0 0 0 1 0 0 0 0 0 ccgcie 5 0 0 0 0 0 0 2 2 0 0 1 0 cccgcie 5 0 0 1 0 0 0 2 1 1 0 0 0 ccccgcir 5 0 0 0 0 0 0 0 0 0 0 1 4 Pccci 5 0 0 0 0 0 0 0 0 0 0 0 0 oecgci 4 0 0 0 0 0 0 1 1 0 0 1 0 oecccgci 4 0 0 0 0 0 0 0 0 0 0 1 2 gcir 4 0 0 1 0 0 0 0 0 0 0 0 0 ecgci 4 0 0 1 0 0 0 0 0 0 0 0 0 cgoe 4 0 0 0 0 0 0 0 0 0 0 0 1 ccor 4 0 0 0 0 0 0 0 1 0 0 0 2 cccir 4 0 0 1 0 0 0 0 1 0 0 1 0 cccie 4 0 0 0 0 1 0 0 1 0 0 1 0 cccgcir 4 0 0 1 0 0 0 0 0 0 0 0 0 ccccg 4 0 0 0 0 0 0 1 0 0 1 0 0 cccc 4 0 0 1 0 0 0 0 0 0 0 0 0 om 3 0 0 0 0 0 0 1 0 0 0 0 1 oeccccgci 3 0 0 0 0 0 0 1 0 0 0 0 2 iecgci 3 0 0 0 0 0 0 0 0 0 0 0 0 ecccci 3 0 0 0 1 0 0 0 0 0 0 0 0 circi 3 0 1 0 0 0 0 1 0 0 0 1 0 cieoe 3 0 0 0 0 0 0 1 2 0 0 0 0 cic 3 0 0 0 0 0 0 2 0 0 0 1 0 ccccgcie 3 0 0 0 0 0 0 0 0 0 1 0 1 cccccgci 3 0 0 0 0 0 0 2 0 0 0 0 0 roe 2 0 0 0 0 1 0 0 0 0 0 0 0 rcim 2 0 0 1 0 0 0 0 0 0 0 0 0 rcie 2 0 0 1 0 0 0 0 0 0 0 0 0 oeoe 2 0 0 0 0 0 0 1 0 0 0 0 0 oeci 2 0 0 0 0 0 0 0 0 0 2 0 0 oeccci 2 0 0 0 0 0 0 0 0 0 0 0 2 oecccci 2 0 0 0 0 0 0 0 0 0 0 0 1 ieo 2 0 0 0 0 0 0 0 0 0 0 0 0 iecccgci 2 0 0 0 0 0 0 0 0 0 0 0 0 ieccccgci 2 0 0 0 0 0 0 0 0 0 0 0 0 goe 2 0 0 1 0 0 0 0 0 0 0 0 0 gcim 2 0 0 1 0 0 0 0 0 0 0 0 0 eor 2 0 0 1 0 0 0 0 1 0 0 0 0 coecgci 2 0 0 0 1 0 0 0 1 0 0 0 0 co 2 0 0 1 0 0 0 0 0 0 0 0 0 cieccccgci 2 0 0 0 0 0 0 0 1 0 1 0 0 ce 2 0 0 2 0 0 0 0 0 0 0 0 0 ccir 2 0 0 1 0 0 0 1 0 0 0 
0 0 ccgcim 2 0 0 0 0 0 0 1 1 0 0 0 0 cccif 2 0 0 1 0 0 0 0 0 0 0 0 0 ccc 2 0 0 0 0 1 0 1 0 0 0 0 0 cPcci 2 0 0 0 0 0 0 0 0 0 0 0 0 cPcccgci 2 0 0 0 0 0 0 0 0 0 0 0 0 Pccccgci 2 0 0 0 0 0 0 0 0 0 0 0 0 rcin 1 0 0 1 0 0 0 0 0 0 0 0 0 oroeccccgci 1 0 0 0 0 0 0 0 0 0 0 1 0 orci 1 0 0 0 0 0 0 0 0 0 0 0 0 orccci 1 0 0 0 0 0 0 0 0 0 0 1 0 of 1 0 0 0 0 0 0 0 0 0 0 0 0 oeo 1 0 0 0 0 0 0 0 0 0 0 0 0 oecircir 1 0 0 0 0 0 0 0 0 0 0 0 1 oecim 1 0 0 0 0 0 0 0 0 0 0 0 1 oeciecccgci 1 0 0 0 0 0 0 0 0 0 0 0 1 oecgcie 1 0 0 0 0 0 0 0 0 0 0 1 0 oecgccccgci 1 0 0 0 0 0 0 0 0 0 0 0 1 oecccgcie 1 0 0 0 0 0 0 0 0 0 0 0 1 oecccg 1 0 0 0 0 0 0 0 0 0 0 1 0 oeccccg 1 0 0 0 0 0 0 0 0 0 0 0 1 oePci 1 0 0 0 0 0 0 0 0 0 0 1 0 ocgcie 1 0 0 0 1 0 0 0 0 0 0 0 0 oPci 1 0 0 0 0 0 0 1 0 0 0 0 0 no 1 0 0 0 0 0 0 0 0 0 0 0 0 irorci 1 0 0 0 0 0 0 0 0 0 0 0 0 iror 1 0 0 0 0 0 0 0 0 0 0 0 0 irof 1 0 0 0 0 0 0 0 0 0 0 0 0 ircccci 1 0 0 0 0 0 0 0 0 0 0 0 0 ircccccgcie 1 0 0 0 0 0 0 0 0 0 0 0 0 imci 1 0 0 0 0 0 0 0 0 0 0 0 0 iie 1 0 0 0 0 0 0 0 0 0 0 0 0 ieoeci 1 0 0 0 0 0 0 0 0 0 0 0 0 ieoe 1 0 0 0 0 0 0 0 0 0 0 0 0 iecirci 1 0 0 0 0 0 0 0 0 0 0 0 0 iecircg 1 0 0 0 0 0 0 0 0 0 0 0 0 iecir 1 0 0 0 0 0 0 0 0 0 0 0 0 iecim 1 0 0 0 0 0 0 0 0 0 0 0 0 iecie 1 0 0 0 0 0 0 0 0 0 0 0 0 iecce 1 0 0 0 0 1 0 0 0 0 0 0 0 gcircie 1 0 0 0 0 0 0 0 0 0 0 0 0 gcircccci 1 0 0 0 0 0 0 0 0 0 0 0 0 gcin 1 0 0 0 0 0 0 0 0 0 0 0 0 gcif 1 0 0 0 0 0 0 0 0 0 0 0 0 gcieor 1 0 0 1 0 0 0 0 0 0 0 0 0 gcie 1 0 0 0 0 0 0 0 0 0 0 0 0 g 1 0 0 1 0 0 0 0 0 0 0 0 0 eof 1 0 0 1 0 0 0 0 0 0 0 0 0 eoe 1 0 0 0 0 0 0 0 1 0 0 0 0 eo 1 0 0 0 0 0 0 0 0 0 0 0 0 eccccgci 1 0 0 0 0 0 0 0 0 0 0 0 0 eccccg 1 0 0 0 0 0 0 0 0 0 0 0 0 ecccccgci 1 0 0 0 0 0 0 0 0 0 0 0 0 ePccccgci 1 0 0 0 0 0 0 0 0 0 0 0 0 coeor 1 0 0 0 0 0 0 0 1 0 0 0 0 coecicg 1 0 0 0 0 0 0 0 0 0 0 0 0 coeci 1 0 0 0 0 0 0 1 0 0 0 0 0 ciroeof 1 0 0 0 0 0 0 0 1 0 0 0 0 ciroe 1 0 0 0 0 0 0 0 0 0 0 0 0 circif 1 0 0 0 0 0 0 0 1 0 0 0 0 circie 1 0 0 0 0 0 0 0 0 0 0 1 0 circgci 1 0 0 0 0 0 0 1 0 0 0 
0 0
    cioc             1      0      0      0      0      0      0      0      1      0      0      0      0
    ciie             1      0      0      0      0      0      0      1      0      0      0      0      0
    cifo             1      0      0      0      1      0      0      0      0      0      0      0      0
    ciecgcgci        1      0      0      0      0      0      0      1      0      0      0      0      0
    cieccci          1      0      0      0      0      0      0      0      1      0      0      0      0
    ciecccgci        1      0      0      0      0      0      0      0      1      0      0      0      0
    ciecccci         1      0      0      0      0      0      0      1      0      0      0      0      0
    ciec             1      0      0      0      0      0      0      1      0      0      0      0      0
    cicccgci         1      0      0      0      0      0      0      1      0      0      0      0      0
    cgor             1      0      0      0      0      0      0      0      0      0      0      0      0
    cgcif            1      0      0      0      0      0      0      0      0      0      0      0      0
    cgciecgci        1      0      0      0      0      0      0      0      0      0      0      0      0
    cgcieccor        1      0      0      0      0      0      0      0      0      0      0      0      1
    cgccci           1      0      0      0      0      0      0      0      0      0      0      0      0
    ccoeci           1      0      1      0      0      0      0      0      0      0      0      0      0
    ccocgci          1      0      0      0      0      0      0      0      1      0      0      0      0
    ccim             1      0      0      0      0      0      0      0      0      0      0      0      0
    ccgor            1      0      0      0      0      0      0      1      0      0      0      0      0
    ccgcioe          1      0      0      0      0      0      0      0      0      0      0      1      0
    ccgcif           1      0      0      0      0      0      0      0      1      0      0      0      0
    ccgcieor         1      0      0      0      0      0      0      0      1      0      0      0      0
    ccgciA           1      0      0      1      0      0      0      0      0      0      0      0      0
    ccgcci           1      0      0      0      1      0      0      0      0      0      0      0      0
    ccgccgci         1      0      0      0      0      0      0      1      0      0      0      0      0
    ccgccci          1      0      0      0      0      0      0      0      1      0      0      0      0
    ccgccccgci       1      0      0      0      0      0      0      1      0      0      0      0      0
    cccor            1      0      0      1      0      0      0      0      0      0      0      0      0
    cccoecgci        1      0      0      0      0      0      0      0      0      0      0      0      1
    ccciroe          1      0      0      0      0      0      0      0      0      0      0      0      1
    cccieci          1      0      0      0      0      0      0      0      0      0      1      0      0
    ccciecgci        1      0      0      0      0      0      0      0      0      0      0      0      0
    ccciecg          1      0      0      0      0      1      0      0      0      0      0      0      0
    cccgcirci        1      0      0      1      0      0      0      0      0      0      0      0      0
    cccgcif          1      0      0      0      0      0      0      0      0      0      0      1      0
    cccgccci         1      0      0      0      0      0      0      0      0      0      0      0      0
    ccccic           1      0      0      0      0      0      0      0      1      0      0      0      0
    ccccc            1      0      0      0      0      0      0      0      0      0      0      0      0
    ccPcccgci        1      0      0      1      0      0      0      0      0      0      0      0      0
    ccPccccci        1      0      0      0      0      0      0      0      0      0      0      0      0
    Poe              1      0      0      0      0      0      0      0      0      0      0      0      0
    Pcim             1      0      0      0      0      0      0      0      0      0      0      0      0
    Pci              1      0      0      0      0      0      0      0      0      0      0      0      0
    Pcccgci          1      0      0      0      0      0      0      0      0      0      0      0      0
    P                1      0      0      0      0      0      0      0      0      0      0      0      0

Beware that there is some ambiguity in the "cc[^H]" column: words like
"ccccic" could be parsed as "cc" + "ccic" or as "c" + "cccic" or as
"" + "ccccic". Thus, we probably should exclude the "cc" prefix while
we decide which suffixes are valid.

Also, the "[^H]" in some prefixes should be removed, since it requires
non-empty suffixes.
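Both points can be checked on single words. The following is only a sketch in plain sh; the helper names `cc_parses` and `strip_prefix` are hypothetical and are not part of the tools used in these notes. `cc_parses` lists the three parses mentioned above, and `strip_prefix` mimics the sed step of the pipeline for one word, showing that a pattern ending in "[^H]" also swallows the first letter of the suffix.

```shell
#!/bin/sh
# Illustrative sketch only; both helpers are hypothetical.

# List the ways a word can be cut after 0, 1 or 2 leading "c" letters.
cc_parses() {
  word="$1"; prefix=""; rest="$word"; n=0
  printf '"%s" + "%s"\n' "$prefix" "$rest"
  while [ "$n" -lt 2 ] && [ "${rest#c}" != "$rest" ]; do
    # Move one leading "c" from the suffix to the prefix.
    prefix="${prefix}c"; rest="${rest#c}"; n=$((n + 1))
    printf '"%s" + "%s"\n' "$prefix" "$rest"
  done
}

# Mimic the sed step of the pipeline for one word.
# $1 = prefix pattern (a sed regex), $2 = word.
strip_prefix() {
  printf '_%s_\n' "$2" | sed -e "s/_$1//"
}

cc_parses ccccic
# With the trailing "[^H]" the pattern consumes one extra letter,
# so the reported suffix is one letter short.
strip_prefix 'cc[^H]' ccccgci
# Without it, the whole remainder after "cc" is kept.
strip_prefix 'cc' ccccgci
```

The first call to `strip_prefix` yields "cgci_" and the second "ccgci_", which would explain why "cgci" dominates the "cc[^H]" column of the table.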
Let's try again, fixing these bugs, sorting the prefixes by importance,
and also excluding the "r" prefix:

    /bin/rm -f .title
    /bin/rm -f .table
    /bin/echo "_" > .table
    set noglob
    set ofmt = "0"
    set npat = 1
    foreach pat ( \
        'AH' 'oH' 'cg' 'cc.*H' \
        'e' 'oe' 'H' 'oe.*H' \
        'ci' 'P' 'e.*H' 'ci.*H' \
        'oP' 'Ae' 'AP' 'cg.*H' \
        'Ae.*H' \
      )
      /n/gnu/bin/printf " %7s" "${pat}" >> .title
      /bin/cat bio-j-huc-gut.wds \
        | /n/gnu/bin/sed -e 's/^/_/g' -e 's/$/_/g' \
        | /n/gnu/bin/egrep "_${pat}[^H]*_" \
        | /n/gnu/bin/sed -e "s/_${pat}//g" -e '/../s/_$//g' \
        | /n/gnu/bin/sort | uniq -c \
        | /n/gnu/bin/expand \
        > .suff.frq
      /n/gnu/bin/join -a 1 -a 2 -1 1 -2 2 -o"${ofmt},2.1" -e 0 .table .suff.frq > .tmp
      /bin/mv .tmp .table
      @ npat = ${npat} + 1
      set ofmt = "${ofmt},1.${npat}"
    end
    unset noglob
    /n/gnu/bin/printf "\n" >> .title
    cat .table \
      | /n/gnu/bin/gawk '/./ { s=0; for(i=2;i<=NF;i++) s+=$(i); print s, $0 }' \
      | sort -nr \
      > .tbsort
    cat .title .tbsort \
      | format-suffix-table

Here are the results:

    SUFFIX       TOTAL    AH    oH    cg cc.*H     e    oe     H oe.*H    ci
    ------------ ----- ----- ----- ----- ----- ----- ----- ----- ----- -----
    TOTALORUM     3437  1043   451   345   322   223   289   150   115   104
    ------------ ----- ----- ----- ----- ----- ----- ----- ----- ----- -----
    ccgci          377   201    85     2    15     0     0    28    20     0
    cccgci         352   191    57     5    21     7     6    16    15     0
    ci             276    79    28    36    68     6    22     2    13     0
    ccccgci        256    19    10    27     0    73    38    18     3    12
    cci            251    43    22     0   160     0     0     2     8     0
    cim            245    94    41    72     5     2     3     7     7     0
    cie            237   116    40    50     4     3     2    10     5     0
    _              228     2     3     0     4     4   131     1     0     1
    ccci           202    87    35     2    20     5     4     5    20     0
    cir            172    50    36    51     5     2     1     8     8     1
    cccci          108    12    13     3     2    20    19    10     3     5
    cin             97    54    16    12     0     0     3     2     5     0
    oe              93    21    10    17     3    16     7     6     1     0
    or              38     4     2     3     0    10    13     4     0     0
    ccccci          24     1     1     3     0     6     4     1     1     4
    m               21     0     0     0     0     0     0     0     0    21
    cgci            21     1     0     0     0     8     9     0     0     1
    cccoe           21     0     0     4     0     3     3     3     0     1
    e               19     2     0     4     0     0     0     0     0    12
    cieci           19     7     4     6     0     0     0     1     0     0
    cif             16     4     0     6     2     1     0     1     1     0
    cccccgci        16     2     0     2     0     6     1     0     0     2
    coe             12     1     6     0     1     0     0     3     1     0
    ccoe            12     1     1     3     0     0     0     4     0     0
    ccccg           12     1     0     0     1     5     3     0     0     1
    cccg            11     5     3     0     0     0     0     2     0     0
    r                8     0     0     0     0     1     0     0     0     7
    circi
8 1 0 5 0 0 0 1 0 0 ciecgci 8 3 2 3 0 0 0 0 0 0 ccor 8 0 1 2 0 3 0 0 0 0 ccgcir 8 4 4 0 0 0 0 0 0 0 ccccgcir 7 0 0 0 0 1 0 1 0 1 oeci 6 0 0 0 0 4 0 0 0 0 ccg 6 3 1 0 1 0 0 0 1 0 Pccccgci 6 0 0 0 0 1 4 0 0 1 o 5 0 0 0 0 4 1 0 0 0 cor 5 4 0 0 1 0 0 0 0 0 cieor 5 0 4 0 0 0 0 0 0 0 cicgci 5 3 1 0 0 0 0 0 1 0 ccgcie 5 2 2 0 0 0 0 1 0 0 oecgci 4 1 1 0 0 1 0 1 0 0 oeccccgci 4 1 0 1 0 0 0 0 0 0 n 4 0 0 0 0 0 0 0 0 4 cieoe 4 1 2 1 0 0 0 0 0 0 cieccccgci 4 0 1 2 0 0 0 0 0 0 ccie 4 1 0 0 3 0 0 0 0 0 cccir 4 0 1 0 0 1 0 1 1 0 cccgcie 4 2 1 0 0 0 0 0 0 0 ccccc 4 0 0 1 0 1 1 0 0 1 c 4 0 0 0 2 0 2 0 0 0 roe 3 0 0 1 0 0 0 0 0 2 om 3 1 0 0 0 0 0 0 1 0 oecccgci 3 0 0 0 0 0 0 1 0 0 f 3 0 0 0 0 0 0 0 0 3 eci 3 0 0 0 0 0 0 0 0 3 ciecccgci 3 0 1 2 0 0 0 0 0 0 cic 3 2 0 0 0 0 0 1 0 0 cccie 3 0 1 0 1 0 0 1 0 0 cccgcir 3 0 0 0 0 0 1 0 0 0 ccccgcie 3 0 0 0 0 1 0 0 0 0 cc 3 1 0 0 0 2 0 0 0 0 rci 2 0 0 0 0 0 0 0 0 2 orcim 2 0 0 1 0 1 0 0 0 0 oeccci 2 0 0 0 0 0 0 0 0 0 oecccci 2 0 0 0 0 0 1 0 0 0 mci 2 0 0 0 0 0 0 0 0 2 eor 2 0 1 0 0 0 0 0 0 1 eoe 2 0 1 0 0 0 0 0 0 1 eccci 2 0 0 0 0 0 2 0 0 0 ecccgci 2 0 0 0 0 0 0 0 0 2 eccccgci 2 0 0 1 0 0 0 0 0 1 coecgci 2 0 1 0 0 0 0 0 0 0 ciie 2 1 0 1 0 0 0 0 0 0 cieo 2 0 0 2 0 0 0 0 0 0 cgoe 2 0 0 0 0 1 0 0 0 0 cgcim 2 0 0 0 0 1 0 0 0 1 ccgcim 2 1 1 0 0 0 0 0 0 0 cce 2 0 0 0 0 2 0 0 0 0 ccccif 2 0 0 0 0 1 1 0 0 0 ccc 2 1 0 0 0 1 0 0 0 0 ror 1 0 0 0 0 0 0 0 0 1 rcir 1 0 0 0 0 0 0 0 0 1 oroeccccgci 1 0 0 0 0 0 0 1 0 0 oroe 1 0 0 0 0 0 1 0 0 0 orcin 1 0 0 0 0 1 0 0 0 0 orcie 1 0 0 0 0 1 0 0 0 0 orccci 1 0 0 0 0 0 0 1 0 0 oeor 1 0 0 0 0 1 0 0 0 0 oeof 1 0 0 0 0 1 0 0 0 0 oeoe 1 1 0 0 0 0 0 0 0 0 oecircir 1 0 0 0 0 0 0 0 0 0 oecim 1 0 0 0 0 0 0 0 0 0 oeciecccgci 1 0 0 0 0 0 0 0 0 0 oecgcie 1 0 0 0 0 0 0 1 0 0 oecgccccgci 1 0 0 0 0 0 0 0 0 0 oecccgcie 1 0 0 0 0 0 0 0 0 0 oecccg 1 0 0 0 0 0 0 1 0 0 oeccccg 1 0 0 0 0 0 0 0 0 0 oecccccgci 1 0 0 1 0 0 0 0 0 0 oePci 1 0 0 0 0 0 0 1 0 0 oePccccgci 1 0 0 1 0 0 0 0 0 0 ocgcie 1 0 0 0 0 0 0 0 0 0 ocg 1 0 0 0 0 1 0 0 0 0 
occci 1 0 0 0 0 1 0 0 0 0 occccgci 1 0 0 0 0 1 0 0 0 0 oPci 1 1 0 0 0 0 0 0 0 0 eorci 1 0 0 0 0 0 0 0 0 1 eof 1 0 0 0 0 0 1 0 0 0 eciecgci 1 0 0 0 0 0 0 0 0 1 ecgcir 1 0 0 0 0 1 0 0 0 0 ecccci 1 0 0 0 0 0 0 0 0 0 eccccc 1 0 0 0 0 0 0 0 0 1 coeor 1 0 1 0 0 0 0 0 0 0 coeci 1 1 0 0 0 0 0 0 0 0 co 1 0 0 0 1 0 0 0 0 0 cirorci 1 0 0 1 0 0 0 0 0 0 cirof 1 0 0 1 0 0 0 0 0 0 ciroeof 1 0 1 0 0 0 0 0 0 0 ciroe 1 0 0 0 1 0 0 0 0 0 circif 1 0 1 0 0 0 0 0 0 0 circie 1 0 0 0 0 0 0 1 0 0 circgci 1 1 0 0 0 0 0 0 0 0 circccci 1 0 0 1 0 0 0 0 0 0 circccccgcie 1 0 0 1 0 0 0 0 0 0 cioc 1 0 1 0 0 0 0 0 0 0 cimci 1 0 0 0 0 0 1 0 0 0 cifo 1 0 0 0 0 0 0 0 0 0 ciecirci 1 0 0 1 0 0 0 0 0 0 ciecircg 1 0 0 1 0 0 0 0 0 0 ciecir 1 0 0 1 0 0 0 0 0 0 ciecim 1 0 0 1 0 0 0 0 0 0 ciecie 1 0 0 1 0 0 0 0 0 0 ciecgcgci 1 1 0 0 0 0 0 0 0 0 cieccci 1 0 1 0 0 0 0 0 0 0 ciecccci 1 1 0 0 0 0 0 0 0 0 ciec 1 1 0 0 0 0 0 0 0 0 cicccgci 1 1 0 0 0 0 0 0 0 0 cgcircie 1 0 0 0 0 0 0 0 0 1 cgcircccci 1 0 0 0 0 0 0 0 0 1 cgcir 1 0 0 0 0 1 0 0 0 0 cgcieor 1 0 0 0 0 1 0 0 0 0 cgcieccor 1 0 0 0 0 0 0 0 0 0 cgcie 1 0 1 0 0 0 0 0 0 0 cg 1 0 0 0 0 1 0 0 0 0 ccoecicg 1 0 0 1 0 0 0 0 0 0 ccoeci 1 0 0 0 0 0 0 0 0 0 ccocgci 1 0 1 0 0 0 0 0 0 0 cco 1 0 0 0 0 1 0 0 0 0 ccir 1 1 0 0 0 0 0 0 0 0 ccieci 1 0 0 0 0 0 1 0 0 0 ccgor 1 1 0 0 0 0 0 0 0 0 ccgcioe 1 0 0 0 0 0 0 1 0 0 ccgcif 1 0 1 0 0 0 0 0 0 0 ccgcieor 1 0 1 0 0 0 0 0 0 0 ccgcci 1 0 0 0 0 0 0 0 0 0 ccgccgci 1 1 0 0 0 0 0 0 0 0 ccgccci 1 0 1 0 0 0 0 0 0 0 ccgccccgci 1 1 0 0 0 0 0 0 0 0 cccor 1 0 0 0 0 0 1 0 0 0 cccoecgci 1 0 0 0 0 0 0 0 0 0 ccciroe 1 0 0 0 0 0 0 0 0 0 cccieci 1 0 0 0 0 0 0 0 0 0 cccgcif 1 0 0 0 0 0 0 1 0 0 cccgciA 1 0 0 0 0 1 0 0 0 0 cccgccci 1 0 0 0 1 0 0 0 0 0 ccccor 1 0 0 0 0 1 0 0 0 0 ccccoe 1 0 0 1 0 0 0 0 0 0 ccccir 1 0 0 0 0 1 0 0 0 0 cccciecgci 1 0 0 0 0 0 0 0 0 1 ccccic 1 0 1 0 0 0 0 0 0 0 ccccgcirci 1 0 0 0 0 1 0 0 0 0 cccccgcie 1 0 0 1 0 0 0 0 0 0 cccccg 1 0 0 0 0 0 1 0 0 0 cccccci 1 0 0 0 0 0 0 0 0 1 cccPcccgci 1 0 0 0 0 1 0 0 0 0 cPcim 1 0 0 0 0 
0 0 0 0 1 Poe 1 0 0 0 0 1 0 0 0 0 Pcccgci 1 0 0 0 0 1 0 0 0 0 Pcccci 1 0 0 0 0 0 0 0 0 1 A 1 0 0 0 0 0 1 0 0 0 SUFFIX TOTAL P e.*H ci.*H oP Ae AP cg.*H Ae.*H ------------ ----- ----- ----- ----- ----- ----- ----- ----- ----- TOTALORUM 3437 63 61 58 40 119 23 18 13 ------------ ----- ----- ----- ----- ----- ----- ----- ----- ----- ccgci 377 0 8 12 0 0 0 6 0 cccgci 352 3 9 12 2 1 2 3 2 ci 276 0 4 2 2 7 1 2 4 ccccgci 256 17 2 1 18 7 11 0 0 cci 251 0 9 4 0 0 0 2 1 cim 245 0 7 4 0 2 0 1 0 cie 237 0 1 3 1 2 0 0 0 _ 228 0 0 0 0 81 1 0 0 ccci 202 0 4 12 0 2 0 3 3 cir 172 1 4 2 2 1 0 0 0 cccci 108 3 1 0 4 9 2 1 1 cin 97 0 2 2 0 1 0 0 0 oe 93 8 1 0 1 1 1 0 0 or 38 0 0 0 1 0 1 0 0 ccccci 24 0 0 1 2 0 0 0 0 m 21 0 0 0 0 0 0 0 0 cgci 21 0 0 0 0 1 1 0 0 cccoe 21 4 0 1 0 1 1 0 0 e 19 0 1 0 0 0 0 0 0 cieci 19 0 0 0 1 0 0 0 0 cif 16 0 1 0 0 0 0 0 0 cccccgci 16 0 0 0 0 3 0 0 0 coe 12 0 0 0 0 0 0 0 0 ccoe 12 2 0 0 0 0 1 0 0 ccccg 12 0 0 0 1 0 0 0 0 cccg 11 0 1 0 0 0 0 0 0 r 8 0 0 0 0 0 0 0 0 circi 8 0 0 0 0 0 0 0 1 ciecgci 8 0 0 0 0 0 0 0 0 ccor 8 2 0 0 0 0 0 0 0 ccgcir 8 0 0 0 0 0 0 0 0 ccccgcir 7 4 0 0 0 0 0 0 0 oeci 6 0 0 0 2 0 0 0 0 ccg 6 0 0 0 0 0 0 0 0 Pccccgci 6 0 0 0 0 0 0 0 0 o 5 0 0 0 0 0 0 0 0 cor 5 0 0 0 0 0 0 0 0 cieor 5 0 1 0 0 0 0 0 0 cicgci 5 0 0 0 0 0 0 0 0 ccgcie 5 0 0 0 0 0 0 0 0 oecgci 4 0 0 0 0 0 0 0 0 oeccccgci 4 2 0 0 0 0 0 0 0 n 4 0 0 0 0 0 0 0 0 cieoe 4 0 0 0 0 0 0 0 0 cieccccgci 4 0 0 0 1 0 0 0 0 ccie 4 0 0 0 0 0 0 0 0 cccir 4 0 0 0 0 0 0 0 0 cccgcie 4 0 0 0 0 0 1 0 0 ccccc 4 0 0 0 0 0 0 0 0 c 4 0 0 0 0 0 0 0 0 roe 3 0 0 0 0 0 0 0 0 om 3 1 0 0 0 0 0 0 0 oecccgci 3 2 0 0 0 0 0 0 0 f 3 0 0 0 0 0 0 0 0 eci 3 0 0 0 0 0 0 0 0 ciecccgci 3 0 0 0 0 0 0 0 0 cic 3 0 0 0 0 0 0 0 0 cccie 3 0 0 0 0 0 0 0 0 cccgcir 3 0 0 2 0 0 0 0 0 ccccgcie 3 1 0 0 1 0 0 0 0 cc 3 0 0 0 0 0 0 0 0 rci 2 0 0 0 0 0 0 0 0 orcim 2 0 0 0 0 0 0 0 0 oeccci 2 2 0 0 0 0 0 0 0 oecccci 2 1 0 0 0 0 0 0 0 mci 2 0 0 0 0 0 0 0 0 eor 2 0 0 0 0 0 0 0 0 eoe 2 0 0 0 0 0 0 0 0 eccci 2 0 0 0 0 0 0 0 0 ecccgci 
2 0 0 0 0 0 0 0 0 eccccgci 2 0 0 0 0 0 0 0 0 coecgci 2 0 1 0 0 0 0 0 0 ciie 2 0 0 0 0 0 0 0 0 cieo 2 0 0 0 0 0 0 0 0 cgoe 2 1 0 0 0 0 0 0 0 cgcim 2 0 0 0 0 0 0 0 0 ccgcim 2 0 0 0 0 0 0 0 0 cce 2 0 0 0 0 0 0 0 0 ccccif 2 0 0 0 0 0 0 0 0 ccc 2 0 0 0 0 0 0 0 0 ror 1 0 0 0 0 0 0 0 0 rcir 1 0 0 0 0 0 0 0 0 oroeccccgci 1 0 0 0 0 0 0 0 0 oroe 1 0 0 0 0 0 0 0 0 orcin 1 0 0 0 0 0 0 0 0 orcie 1 0 0 0 0 0 0 0 0 orccci 1 0 0 0 0 0 0 0 0 oeor 1 0 0 0 0 0 0 0 0 oeof 1 0 0 0 0 0 0 0 0 oeoe 1 0 0 0 0 0 0 0 0 oecircir 1 1 0 0 0 0 0 0 0 oecim 1 1 0 0 0 0 0 0 0 oeciecccgci 1 1 0 0 0 0 0 0 0 oecgcie 1 0 0 0 0 0 0 0 0 oecgccccgci 1 1 0 0 0 0 0 0 0 oecccgcie 1 1 0 0 0 0 0 0 0 oecccg 1 0 0 0 0 0 0 0 0 oeccccg 1 1 0 0 0 0 0 0 0 oecccccgci 1 0 0 0 0 0 0 0 0 oePci 1 0 0 0 0 0 0 0 0 oePccccgci 1 0 0 0 0 0 0 0 0 ocgcie 1 0 1 0 0 0 0 0 0 ocg 1 0 0 0 0 0 0 0 0 occci 1 0 0 0 0 0 0 0 0 occccgci 1 0 0 0 0 0 0 0 0 oPci 1 0 0 0 0 0 0 0 0 eorci 1 0 0 0 0 0 0 0 0 eof 1 0 0 0 0 0 0 0 0 eciecgci 1 0 0 0 0 0 0 0 0 ecgcir 1 0 0 0 0 0 0 0 0 ecccci 1 0 1 0 0 0 0 0 0 eccccc 1 0 0 0 0 0 0 0 0 coeor 1 0 0 0 0 0 0 0 0 coeci 1 0 0 0 0 0 0 0 0 co 1 0 0 0 0 0 0 0 0 cirorci 1 0 0 0 0 0 0 0 0 cirof 1 0 0 0 0 0 0 0 0 ciroeof 1 0 0 0 0 0 0 0 0 ciroe 1 0 0 0 0 0 0 0 0 circif 1 0 0 0 0 0 0 0 0 circie 1 0 0 0 0 0 0 0 0 circgci 1 0 0 0 0 0 0 0 0 circccci 1 0 0 0 0 0 0 0 0 circccccgcie 1 0 0 0 0 0 0 0 0 cioc 1 0 0 0 0 0 0 0 0 cimci 1 0 0 0 0 0 0 0 0 cifo 1 0 1 0 0 0 0 0 0 ciecirci 1 0 0 0 0 0 0 0 0 ciecircg 1 0 0 0 0 0 0 0 0 ciecir 1 0 0 0 0 0 0 0 0 ciecim 1 0 0 0 0 0 0 0 0 ciecie 1 0 0 0 0 0 0 0 0 ciecgcgci 1 0 0 0 0 0 0 0 0 cieccci 1 0 0 0 0 0 0 0 0 ciecccci 1 0 0 0 0 0 0 0 0 ciec 1 0 0 0 0 0 0 0 0 cicccgci 1 0 0 0 0 0 0 0 0 cgcircie 1 0 0 0 0 0 0 0 0 cgcircccci 1 0 0 0 0 0 0 0 0 cgcir 1 0 0 0 0 0 0 0 0 cgcieor 1 0 0 0 0 0 0 0 0 cgcieccor 1 1 0 0 0 0 0 0 0 cgcie 1 0 0 0 0 0 0 0 0 cg 1 0 0 0 0 0 0 0 0 ccoecicg 1 0 0 0 0 0 0 0 0 ccoeci 1 0 0 0 0 0 0 0 1 ccocgci 1 0 0 0 0 0 0 0 0 cco 1 0 0 0 0 0 0 0 0 ccir 1 0 0 0 0 0 0 0 0 
    ccieci           1     0     0     0     0     0     0     0     0
    ccgor            1     0     0     0     0     0     0     0     0
    ccgcioe          1     0     0     0     0     0     0     0     0
    ccgcif           1     0     0     0     0     0     0     0     0
    ccgcieor         1     0     0     0     0     0     0     0     0
    ccgcci           1     0     1     0     0     0     0     0     0
    ccgccgci         1     0     0     0     0     0     0     0     0
    ccgccci          1     0     0     0     0     0     0     0     0
    ccgccccgci       1     0     0     0     0     0     0     0     0
    cccor            1     0     0     0     0     0     0     0     0
    cccoecgci        1     1     0     0     0     0     0     0     0
    ccciroe          1     1     0     0     0     0     0     0     0
    cccieci          1     0     0     0     1     0     0     0     0
    cccgcif          1     0     0     0     0     0     0     0     0
    cccgciA          1     0     0     0     0     0     0     0     0
    cccgccci         1     0     0     0     0     0     0     0     0
    ccccor           1     0     0     0     0     0     0     0     0
    ccccoe           1     0     0     0     0     0     0     0     0
    ccccir           1     0     0     0     0     0     0     0     0
    cccciecgci       1     0     0     0     0     0     0     0     0
    ccccic           1     0     0     0     0     0     0     0     0
    ccccgcirci       1     0     0     0     0     0     0     0     0
    cccccgcie        1     0     0     0     0     0     0     0     0
    cccccg           1     0     0     0     0     0     0     0     0
    cccccci          1     0     0     0     0     0     0     0     0
    cccPcccgci       1     0     0     0     0     0     0     0     0
    cPcim            1     0     0     0     0     0     0     0     0
    Poe              1     0     0     0     0     0     0     0     0
    Pcccgci          1     0     0     0     0     0     0     0     0
    Pcccci           1     0     0     0     0     0     0     0     0
    A                1     0     0     0     0     0     0     0     0

Analysis: The prefixes "e", "oe", "ci" (without "H") do not appear to
be equivalent to the other prefixes above. However, "ec", "oec", and
"cic" do appear to fit in. The empty string does not appear to be a
valid suffix (yay!)
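One way to see why "ec" fits where "e" does not is to move one "c" from the suffix to the prefix and look at the counts again. The sketch below does this for three counts copied from the "e" column of the table above; the helper name `shift_c` is hypothetical, and the reading of the result is only suggestive, not a test of equivalence.

```shell
#!/bin/sh
# Illustrative sketch only. Counts for prefix "e", copied from the table
# above, one "suffix count" pair per line.
e_counts='ccccgci 73
cccgci 7
cccci 20'

# Hypothetical helper: re-express "suffix count" lines for a prefix P as
# lines for the prefix P+"c", by moving one leading "c" from the suffix to
# the prefix. Suffixes that do not start with "c" are dropped, since those
# words do not carry the longer prefix.
shift_c() {
  awk '$1 ~ /^c/ { print substr($1, 2), $2 }'
}

printf '%s\n' "$e_counts" | shift_c
```

After the shift, the most frequent "e" suffix "ccccgci" becomes "cccgci", which is also among the most frequent suffixes in the "AH" and "oH" columns.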
Let's redo again:

    /bin/rm -f .title
    /bin/rm -f .table
    /bin/touch .table
    set noglob
    set ofmt = "0"
    set npat = 1
    foreach pat ( \
        'AH' 'oH' 'cg' 'cc.*H' \
        'ec' 'oec' 'H' 'oe.*H' \
        'cic' 'P' 'e.*H' 'ci.*H' \
        'oP' 'Ae' 'AP' 'cg.*H' \
        'Ae.*H' \
      )
      /n/gnu/bin/printf " %7s" "${pat}" >> .title
      /bin/cat bio-j-huc-gut.wds \
        | /n/gnu/bin/sed -e 's/^/_/g' -e 's/$/_/g' \
        | /n/gnu/bin/egrep "_${pat}[^H][^H]*_" \
        | /n/gnu/bin/sed -e "s/_${pat}//g" -e '/../s/_$//g' \
        | /n/gnu/bin/sort | uniq -c \
        | /n/gnu/bin/expand \
        > .suff.frq
      /n/gnu/bin/join -a 1 -a 2 -1 1 -2 2 -o"${ofmt},2.1" -e 0 .table .suff.frq > .tmp
      /bin/mv .tmp .table
      @ npat = ${npat} + 1
      set ofmt = "${ofmt},1.${npat}"
    end
    unset noglob
    /n/gnu/bin/printf "\n" >> .title
    cat .table \
      | /n/gnu/bin/gawk '/./ { s=0; for(i=2;i<=NF;i++) s+=$(i); print s, $0 }' \
      | sort -nr \
      > .tbsort
    cat .title .tbsort \
      | format-suffix-table

    SUFFIX       TOTAL    AH    oH    cg cc.*H    ec   oec     H oe.*H
    ------------ ----- ----- ----- ----- ----- ----- ----- ----- -----
    TOTALORUM     3060  1041   448   345   318   171   125   149   115
    ------------ ----- ----- ----- ----- ----- ----- ----- ----- -----
    cccgci         462   191    57     5    21    73    38    16    15
    ccgci          390   201    85     2    15     7     6    28    20
    cci            260    43    22     0   160     5     4     2     8
    ci             248    79    28    36    68     0     0     2    13
    cim            240    94    41    72     5     0     0     7     7
    ccci           237    87    35     2    20    20    19     5    20
    cie            232   116    40    50     4     0     0    10     5
    cir            168    50    36    51     5     0     0     8     8
    ccccgci        142    19    10    27     0     6     1    18     3
    cin             94    54    16    12     0     0     0     2     5
    cccci           78    12    13     3     2     6     4    10     3
    oe              70    21    10    17     3     0     0     6     1
    i               28     0     0     0     0     6    22     0     0
    cieci           20     7     4     6     0     0     1     1     0
    cccg            20     5     3     0     0     5     3     2     0
    ccoe            19     1     1     3     0     3     3     4     0
    gci             18     0     0     0     0     8     9     0     0
    or              15     4     2     3     0     0     0     4     0
    cif             15     4     0     6     2     0     0     1     1
    cccoe           14     0     0     4     0     0     0     3     0
    coe             12     1     6     0     1     0     0     3     1
    ccccci          11     1     1     3     0     0     0     1     1
    ccgcir           9     4     4     0     0     0     1     0     0
    cor              8     4     0     0     1     3     0     0     0
    circi            8     1     0     5     0     0     0     1     0
    ciecgci          8     3     2     3     0     0     0     0     0
    e                7     2     0     4     0     0     0     0     0
    cccccgci         7     2     0     2     0     0     0     0     0
    ccor             6     0     1     2     0     0     1     0     0
    ccg              6     3     1     0     1     0     0     0     1
    im               5     0     0     0     0     2     3     0     0
    ie               5     0     0     0     0     3     2     0     0
    cieor            5     0     4     0     0     0     0     0     0
    cicgci           5     3     1     0     0     0     0     0     1
    ccgcie           5     2     2     0     0     0     0     1     0
    cccgcie          5
2 1 0 0 1 0 0 0 ccccgcir 5 0 0 0 0 0 0 1 0 oeccccgci 4 1 0 1 0 0 0 0 0 ir 4 0 0 0 0 2 1 0 0 cieoe 4 1 2 1 0 0 0 0 0 cieccccgci 4 0 1 2 0 0 0 0 0 ccie 4 1 0 0 3 0 0 0 0 cccir 4 0 1 0 0 1 0 1 1 cccgcir 4 0 0 0 0 1 0 0 0 ccccg 4 1 0 0 1 0 1 0 0 c 4 0 0 0 2 2 0 0 0 om 3 1 0 0 0 0 0 0 1 oecgci 3 1 1 0 0 0 0 1 0 oecccgci 3 0 0 0 0 0 0 1 0 in 3 0 0 0 0 0 3 0 0 ciecccgci 3 0 1 2 0 0 0 0 0 cic 3 2 0 0 0 0 0 1 0 cgci 3 1 0 0 0 0 0 0 0 cccie 3 0 1 0 1 0 0 1 0 cccc 3 0 0 0 0 1 1 0 0 oeci 2 0 0 0 0 0 0 0 0 oeccci 2 0 0 0 0 0 0 0 0 gcim 2 0 0 0 0 1 0 0 0 coecgci 2 0 1 0 0 0 0 0 0 co 2 0 0 0 1 1 0 0 0 ciie 2 1 0 1 0 0 0 0 0 cieo 2 0 0 2 0 0 0 0 0 ce 2 0 0 0 0 2 0 0 0 ccir 2 1 0 0 0 1 0 0 0 ccgcim 2 1 1 0 0 0 0 0 0 cccif 2 0 0 0 0 1 1 0 0 ccccgcie 2 0 0 0 0 0 0 0 0 cc 2 1 0 0 0 1 0 0 0 roe 1 0 0 1 0 0 0 0 0 oroeccccgci 1 0 0 0 0 0 0 1 0 orcim 1 0 0 1 0 0 0 0 0 orccci 1 0 0 0 0 0 0 1 0 oeoe 1 1 0 0 0 0 0 0 0 oecircir 1 0 0 0 0 0 0 0 0 oecim 1 0 0 0 0 0 0 0 0 oeciecccgci 1 0 0 0 0 0 0 0 0 oecgcie 1 0 0 0 0 0 0 1 0 oecgccccgci 1 0 0 0 0 0 0 0 0 oecccgcie 1 0 0 0 0 0 0 0 0 oecccg 1 0 0 0 0 0 0 1 0 oecccci 1 0 0 0 0 0 0 0 0 oeccccg 1 0 0 0 0 0 0 0 0 oecccccgci 1 0 0 1 0 0 0 0 0 oePci 1 0 0 0 0 0 0 1 0 oePccccgci 1 0 0 1 0 0 0 0 0 ocgcie 1 0 0 0 0 0 0 0 0 oPci 1 1 0 0 0 0 0 0 0 imci 1 0 0 0 0 0 1 0 0 if 1 0 0 0 0 1 0 0 0 goe 1 0 0 0 0 1 0 0 0 gcircie 1 0 0 0 0 0 0 0 0 gcircccci 1 0 0 0 0 0 0 0 0 gcir 1 0 0 0 0 1 0 0 0 gcieor 1 0 0 0 0 1 0 0 0 g 1 0 0 0 0 1 0 0 0 eor 1 0 1 0 0 0 0 0 0 eoe 1 0 1 0 0 0 0 0 0 ecccci 1 0 0 0 0 0 0 0 0 eccccgci 1 0 0 1 0 0 0 0 0 coeor 1 0 1 0 0 0 0 0 0 coeci 1 1 0 0 0 0 0 0 0 cirorci 1 0 0 1 0 0 0 0 0 cirof 1 0 0 1 0 0 0 0 0 ciroeof 1 0 1 0 0 0 0 0 0 ciroe 1 0 0 0 1 0 0 0 0 circif 1 0 1 0 0 0 0 0 0 circie 1 0 0 0 0 0 0 1 0 circgci 1 1 0 0 0 0 0 0 0 circccci 1 0 0 1 0 0 0 0 0 circccccgcie 1 0 0 1 0 0 0 0 0 cioc 1 0 1 0 0 0 0 0 0 cifo 1 0 0 0 0 0 0 0 0 ciecirci 1 0 0 1 0 0 0 0 0 ciecircg 1 0 0 1 0 0 0 0 0 ciecir 1 0 0 1 0 0 0 0 0 ciecim 1 0 0 1 0 0 0 0 0 ciecie 
1 0 0 1 0 0 0 0 0 ciecgcgci 1 1 0 0 0 0 0 0 0 cieccci 1 0 1 0 0 0 0 0 0 ciecccci 1 1 0 0 0 0 0 0 0 ciec 1 1 0 0 0 0 0 0 0 cicccgci 1 1 0 0 0 0 0 0 0 cgoe 1 0 0 0 0 0 0 0 0 cgcieccor 1 0 0 0 0 0 0 0 0 cgcie 1 0 1 0 0 0 0 0 0 ccoecicg 1 0 0 1 0 0 0 0 0 ccoeci 1 0 0 0 0 0 0 0 0 ccocgci 1 0 1 0 0 0 0 0 0 ccgor 1 1 0 0 0 0 0 0 0 ccgcioe 1 0 0 0 0 0 0 1 0 ccgcif 1 0 1 0 0 0 0 0 0 ccgcieor 1 0 1 0 0 0 0 0 0 ccgciA 1 0 0 0 0 1 0 0 0 ccgcci 1 0 0 0 0 0 0 0 0 ccgccgci 1 1 0 0 0 0 0 0 0 ccgccci 1 0 1 0 0 0 0 0 0 ccgccccgci 1 1 0 0 0 0 0 0 0 cccor 1 0 0 0 0 1 0 0 0 cccoecgci 1 0 0 0 0 0 0 0 0 ccciroe 1 0 0 0 0 0 0 0 0 cccieci 1 0 0 0 0 0 0 0 0 ccciecgci 1 0 0 0 0 0 0 0 0 cccgcirci 1 0 0 0 0 1 0 0 0 cccgcif 1 0 0 0 0 0 0 1 0 cccgccci 1 0 0 0 1 0 0 0 0 ccccoe 1 0 0 1 0 0 0 0 0 ccccic 1 0 1 0 0 0 0 0 0 cccccgcie 1 0 0 1 0 0 0 0 0 ccccc 1 0 0 1 0 0 0 0 0 ccc 1 1 0 0 0 0 0 0 0 ccPcccgci 1 0 0 0 0 1 0 0 0 Pcim 1 0 0 0 0 0 0 0 0 SUFFIX TOTAL cic P e.*H ci.*H oP Ae AP cg.*H Ae.*H ------------ ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- TOTALORUM 3060 35 63 61 58 40 38 22 18 13 ------------ ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- cccgci 462 12 3 9 12 2 1 2 3 2 ccgci 390 0 0 8 12 0 0 0 6 0 cci 260 0 0 9 4 0 0 0 2 1 ci 248 0 0 4 2 2 7 1 2 4 cim 240 0 0 7 4 0 2 0 1 0 ccci 237 5 0 4 12 0 2 0 3 3 cie 232 0 0 1 3 1 2 0 0 0 cir 168 0 1 4 2 2 1 0 0 0 ccccgci 142 2 17 2 1 18 7 11 0 0 cin 94 0 0 2 2 0 1 0 0 0 cccci 78 4 3 1 0 4 9 2 1 1 oe 70 0 8 1 0 1 1 1 0 0 i 28 0 0 0 0 0 0 0 0 0 cieci 20 0 0 0 0 1 0 0 0 0 cccg 20 1 0 1 0 0 0 0 0 0 ccoe 19 1 2 0 0 0 0 1 0 0 gci 18 1 0 0 0 0 0 0 0 0 or 15 0 0 0 0 1 0 1 0 0 cif 15 0 0 1 0 0 0 0 0 0 cccoe 14 0 4 0 1 0 1 1 0 0 coe 12 0 0 0 0 0 0 0 0 0 ccccci 11 1 0 0 1 2 0 0 0 0 ccgcir 9 0 0 0 0 0 0 0 0 0 cor 8 0 0 0 0 0 0 0 0 0 circi 8 0 0 0 0 0 0 0 0 1 ciecgci 8 0 0 0 0 0 0 0 0 0 e 7 0 0 1 0 0 0 0 0 0 cccccgci 7 0 0 0 0 0 3 0 0 0 ccor 6 0 2 0 0 0 0 0 0 0 ccg 6 0 0 0 0 0 0 0 0 0 im 5 0 0 0 0 0 0 0 0 0 ie 5 0 0 0 0 0 0 0 0 0 
cieor 5 0 0 1 0 0 0 0 0 0 cicgci 5 0 0 0 0 0 0 0 0 0 ccgcie 5 0 0 0 0 0 0 0 0 0 cccgcie 5 0 0 0 0 0 0 1 0 0 ccccgcir 5 0 4 0 0 0 0 0 0 0 oeccccgci 4 0 2 0 0 0 0 0 0 0 ir 4 1 0 0 0 0 0 0 0 0 cieoe 4 0 0 0 0 0 0 0 0 0 cieccccgci 4 0 0 0 0 1 0 0 0 0 ccie 4 0 0 0 0 0 0 0 0 0 cccir 4 0 0 0 0 0 0 0 0 0 cccgcir 4 1 0 0 2 0 0 0 0 0 ccccg 4 0 0 0 0 1 0 0 0 0 c 4 0 0 0 0 0 0 0 0 0 om 3 0 1 0 0 0 0 0 0 0 oecgci 3 0 0 0 0 0 0 0 0 0 oecccgci 3 0 2 0 0 0 0 0 0 0 in 3 0 0 0 0 0 0 0 0 0 ciecccgci 3 0 0 0 0 0 0 0 0 0 cic 3 0 0 0 0 0 0 0 0 0 cgci 3 0 0 0 0 0 1 1 0 0 cccie 3 0 0 0 0 0 0 0 0 0 cccc 3 1 0 0 0 0 0 0 0 0 oeci 2 0 0 0 0 2 0 0 0 0 oeccci 2 0 2 0 0 0 0 0 0 0 gcim 2 1 0 0 0 0 0 0 0 0 coecgci 2 0 0 1 0 0 0 0 0 0 co 2 0 0 0 0 0 0 0 0 0 ciie 2 0 0 0 0 0 0 0 0 0 cieo 2 0 0 0 0 0 0 0 0 0 ce 2 0 0 0 0 0 0 0 0 0 ccir 2 0 0 0 0 0 0 0 0 0 ccgcim 2 0 0 0 0 0 0 0 0 0 cccif 2 0 0 0 0 0 0 0 0 0 ccccgcie 2 0 1 0 0 1 0 0 0 0 cc 2 0 0 0 0 0 0 0 0 0 roe 1 0 0 0 0 0 0 0 0 0 oroeccccgci 1 0 0 0 0 0 0 0 0 0 orcim 1 0 0 0 0 0 0 0 0 0 orccci 1 0 0 0 0 0 0 0 0 0 oeoe 1 0 0 0 0 0 0 0 0 0 oecircir 1 0 1 0 0 0 0 0 0 0 oecim 1 0 1 0 0 0 0 0 0 0 oeciecccgci 1 0 1 0 0 0 0 0 0 0 oecgcie 1 0 0 0 0 0 0 0 0 0 oecgccccgci 1 0 1 0 0 0 0 0 0 0 oecccgcie 1 0 1 0 0 0 0 0 0 0 oecccg 1 0 0 0 0 0 0 0 0 0 oecccci 1 0 1 0 0 0 0 0 0 0 oeccccg 1 0 1 0 0 0 0 0 0 0 oecccccgci 1 0 0 0 0 0 0 0 0 0 oePci 1 0 0 0 0 0 0 0 0 0 oePccccgci 1 0 0 0 0 0 0 0 0 0 ocgcie 1 0 0 1 0 0 0 0 0 0 oPci 1 0 0 0 0 0 0 0 0 0 imci 1 0 0 0 0 0 0 0 0 0 if 1 0 0 0 0 0 0 0 0 0 goe 1 0 0 0 0 0 0 0 0 0 gcircie 1 1 0 0 0 0 0 0 0 0 gcircccci 1 1 0 0 0 0 0 0 0 0 gcir 1 0 0 0 0 0 0 0 0 0 gcieor 1 0 0 0 0 0 0 0 0 0 g 1 0 0 0 0 0 0 0 0 0 eor 1 0 0 0 0 0 0 0 0 0 eoe 1 0 0 0 0 0 0 0 0 0 ecccci 1 0 0 1 0 0 0 0 0 0 eccccgci 1 0 0 0 0 0 0 0 0 0 coeor 1 0 0 0 0 0 0 0 0 0 coeci 1 0 0 0 0 0 0 0 0 0 cirorci 1 0 0 0 0 0 0 0 0 0 cirof 1 0 0 0 0 0 0 0 0 0 ciroeof 1 0 0 0 0 0 0 0 0 0 ciroe 1 0 0 0 0 0 0 0 0 0 circif 1 0 0 0 0 0 0 0 0 0 circie 1 0 0 0 0 0 0 0 0 0 
    circgci          1     0     0     0     0     0     0     0     0     0
    circccci         1     0     0     0     0     0     0     0     0     0
    circccccgcie     1     0     0     0     0     0     0     0     0     0
    cioc             1     0     0     0     0     0     0     0     0     0
    cifo             1     0     0     1     0     0     0     0     0     0
    ciecirci         1     0     0     0     0     0     0     0     0     0
    ciecircg         1     0     0     0     0     0     0     0     0     0
    ciecir           1     0     0     0     0     0     0     0     0     0
    ciecim           1     0     0     0     0     0     0     0     0     0
    ciecie           1     0     0     0     0     0     0     0     0     0
    ciecgcgci        1     0     0     0     0     0     0     0     0     0
    cieccci          1     0     0     0     0     0     0     0     0     0
    ciecccci         1     0     0     0     0     0     0     0     0     0
    ciec             1     0     0     0     0     0     0     0     0     0
    cicccgci         1     0     0     0     0     0     0     0     0     0
    cgoe             1     0     1     0     0     0     0     0     0     0
    cgcieccor        1     0     1     0     0     0     0     0     0     0
    cgcie            1     0     0     0     0     0     0     0     0     0
    ccoecicg         1     0     0     0     0     0     0     0     0     0
    ccoeci           1     0     0     0     0     0     0     0     0     1
    ccocgci          1     0     0     0     0     0     0     0     0     0
    ccgor            1     0     0     0     0     0     0     0     0     0
    ccgcioe          1     0     0     0     0     0     0     0     0     0
    ccgcif           1     0     0     0     0     0     0     0     0     0
    ccgcieor         1     0     0     0     0     0     0     0     0     0
    ccgciA           1     0     0     0     0     0     0     0     0     0
    ccgcci           1     0     0     1     0     0     0     0     0     0
    ccgccgci         1     0     0     0     0     0     0     0     0     0
    ccgccci          1     0     0     0     0     0     0     0     0     0
    ccgccccgci       1     0     0     0     0     0     0     0     0     0
    cccor            1     0     0     0     0     0     0     0     0     0
    cccoecgci        1     0     1     0     0     0     0     0     0     0
    ccciroe          1     0     1     0     0     0     0     0     0     0
    cccieci          1     0     0     0     0     1     0     0     0     0
    ccciecgci        1     1     0     0     0     0     0     0     0     0
    cccgcirci        1     0     0     0     0     0     0     0     0     0
    cccgcif          1     0     0     0     0     0     0     0     0     0
    cccgccci         1     0     0     0     0     0     0     0     0     0
    ccccoe           1     0     0     0     0     0     0     0     0     0
    ccccic           1     0     0     0     0     0     0     0     0     0
    cccccgcie        1     0     0     0     0     0     0     0     0     0
    ccccc            1     0     0     0     0     0     0     0     0     0
    ccc              1     0     0     0     0     0     0     0     0     0
    ccPcccgci        1     0     0     0     0     0     0     0     0     0
    Pcim             1     1     0     0     0     0     0     0     0     0

Analysis:

The prefix "cg" seems a bit anomalous. It looks as if the "cg" were actually the first "c" of the suffix. Most valid suffixes apparently begin with "c". Thus we should incorporate the "c" into the prefix.

The "i" suffix is bogus; it entered only because of "eci" and "oeci". Note that "ec", "cic" and "oec" incorporate the "c" while the other prefixes don't.

The prefixes "cc.*H", "oe.*H", etc. seem anomalous, probably because some "H"s are actually "cHc"s. We should find out what the actual prefixes are, and see if we can take the two classes apart.

Most productive suffixes come in pairs that differ by an extra "c" at the beginning.
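The last remark, that productive suffixes come in pairs differing by one extra leading "c", can be checked mechanically. What follows is only a sketch under stated assumptions: the helper name pair_c_suffixes is hypothetical (it is not one of the tools used in these notes), and the input is assumed to be a plain two-column "SUFFIX COUNT" listing, such as the first two columns of the table above.

```sh
# pair_c_suffixes: hypothetical helper, not part of the original tool set.
# Reads lines of the form "SUFFIX COUNT" on standard input.
# For every suffix S whose "c"-extended form cS is also listed, prints
#   S COUNT(S) cS COUNT(cS)
# in the order in which S appeared in the input.
pair_c_suffixes() {
  awk '
    NF >= 2 { cnt[$1] = $2; order[++n] = $1 }
    END {
      for (i = 1; i <= n; i++) {
        s = order[i]
        t = "c" s
        if (t in cnt) printf "%s %d %s %d\n", s, cnt[s], t, cnt[t]
      }
    }
  '
}
```

For example, feeding it the rows "ci 248", "cci 260" and "ccci 237" from the table above pairs "ci" with "cci" and "cci" with "ccci"; a suffix such as "cim", whose "c"-extended form is not listed, produces no output line.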
Redoing again with the extra "c"s:

    /bin/rm -f .title
    /bin/rm -f .table
    /bin/touch .table
    set noglob
    set ofmt = "0"
    set npat = 1
    foreach pat ( \
        'AH' 'oH' 'cg' 'cc.*H' \
        'e' 'oe' 'H' 'Ae' \
        'oe.*H' 'ci' 'P' 'e.*H' \
        'ci.*H' 'oP' 'AP' 'cg.*H' \
        'Ae.*H' \
      )
      /n/gnu/bin/printf " %7s" "${pat}" >> .title
      /bin/cat bio-j-huc-gut.wds \
        | /n/gnu/bin/sed -e 's/^/_/g' -e 's/$/_/g' \
        | /n/gnu/bin/egrep "_${pat}c[^HPA]*_" \
        | /n/gnu/bin/sed -e "s/_${pat}//g" -e '/../s/_$//g' \
        | /n/gnu/bin/sort | uniq -c \
        | /n/gnu/bin/expand \
        > .suff.frq
      /n/gnu/bin/join -a 1 -a 2 -1 1 -2 2 -o"${ofmt},2.1" -e 0 .table .suff.frq > .tmp
      /bin/mv .tmp .table
      @ npat = ${npat} + 1
      set ofmt = "${ofmt},1.${npat}"
    end
    unset noglob
    /n/gnu/bin/printf "\n" >> .title
    cat .table \
      | /n/gnu/bin/gawk '/./ { s=0; for(i=2;i<=NF;i++) s+=$(i); print s, $0 }' \
      | sort -nr \
      > .tbsort
    cat .title .tbsort \
      | format-suffix-table

Results:

    SUFFIX       TOTAL    AH    oH    cg cc.*H     e    oe     H oe.*H
    ------------ ----- ----- ----- ----- ----- ----- ----- ----- -----
    TOTALORUM     2927  1009   433   315   315   169   127   132   113
    ------------ ----- ----- ----- ----- ----- ----- ----- ----- -----
    ccgci          377   201    85     2    15     0     0    28    20
    cccgci         352   191    57     5    21     7     6    16    15
    ci             276    79    28    36    68     6    22     2    13
    ccccgci        256    19    10    27     0    73    38    18     3
    cci            251    43    22     0   160     0     0     2     8
    cim            245    94    41    72     5     2     3     7     7
    cie            237   116    40    50     4     3     2    10     5
    ccci           202    87    35     2    20     5     4     5    20
    cir            172    50    36    51     5     2     1     8     8
    cccci          108    12    13     3     2    20    19    10     3
    cin             97    54    16    12     0     0     3     2     5
    ccccci          24     1     1     3     0     6     4     1     1
    cgci            21     1     0     0     0     8     9     0     0
    cccoe           21     0     0     4     0     3     3     3     0
    cieci           19     7     4     6     0     0     0     1     0
    cif             16     4     0     6     2     1     0     1     1
    cccccgci        16     2     0     2     0     6     1     0     0
    coe             12     1     6     0     1     0     0     3     1
    ccoe            12     1     1     3     0     0     0     4     0
    ccccg           12     1     0     0     1     5     3     0     0
    cccg            11     5     3     0     0     0     0     2     0
    circi            8     1     0     5     0     0     0     1     0
    ciecgci          8     3     2     3     0     0     0     0     0
    ccor             8     0     1     2     0     3     0     0     0
    ccgcir           8     4     4     0     0     0     0     0     0
    ccccgcir         7     0     0     0     0     1     0     1     0
    ccg              6     3     1     0     1     0     0     0     1
    cor              5     4     0     0     1     0     0     0     0
    cieor            5     0     4     0     0     0     0     0     0
    cicgci           5     3     1     0     0     0     0     0     1
    ccgcie           5     2     2     0     0     0     0     1     0
    cieoe            4     1     2     1     0     0     0     0     0
    cieccccgci       4     0     1     2     0     0     0     0     0
    ccie             4     1     0
0 3 0 0 0 0 cccir 4 0 1 0 0 1 0 1 1 cccgcie 4 2 1 0 0 0 0 0 0 ccccc 4 0 0 1 0 1 1 0 0 c 4 0 0 0 2 0 2 0 0 ciecccgci 3 0 1 2 0 0 0 0 0 cic 3 2 0 0 0 0 0 1 0 cccie 3 0 1 0 1 0 0 1 0 cccgcir 3 0 0 0 0 0 1 0 0 ccccgcie 3 0 0 0 0 1 0 0 0 cc 3 1 0 0 0 2 0 0 0 coecgci 2 0 1 0 0 0 0 0 0 ciie 2 1 0 1 0 0 0 0 0 cieo 2 0 0 2 0 0 0 0 0 cgoe 2 0 0 0 0 1 0 0 0 cgcim 2 0 0 0 0 1 0 0 0 ccgcim 2 1 1 0 0 0 0 0 0 cce 2 0 0 0 0 2 0 0 0 ccccif 2 0 0 0 0 1 1 0 0 ccc 2 1 0 0 0 1 0 0 0 coeor 1 0 1 0 0 0 0 0 0 coeci 1 1 0 0 0 0 0 0 0 co 1 0 0 0 1 0 0 0 0 cirorci 1 0 0 1 0 0 0 0 0 cirof 1 0 0 1 0 0 0 0 0 ciroeof 1 0 1 0 0 0 0 0 0 ciroe 1 0 0 0 1 0 0 0 0 circif 1 0 1 0 0 0 0 0 0 circie 1 0 0 0 0 0 0 1 0 circgci 1 1 0 0 0 0 0 0 0 circccci 1 0 0 1 0 0 0 0 0 circccccgcie 1 0 0 1 0 0 0 0 0 cioc 1 0 1 0 0 0 0 0 0 cimci 1 0 0 0 0 0 1 0 0 cifo 1 0 0 0 0 0 0 0 0 ciecirci 1 0 0 1 0 0 0 0 0 ciecircg 1 0 0 1 0 0 0 0 0 ciecir 1 0 0 1 0 0 0 0 0 ciecim 1 0 0 1 0 0 0 0 0 ciecie 1 0 0 1 0 0 0 0 0 ciecgcgci 1 1 0 0 0 0 0 0 0 cieccci 1 0 1 0 0 0 0 0 0 ciecccci 1 1 0 0 0 0 0 0 0 ciec 1 1 0 0 0 0 0 0 0 cicccgci 1 1 0 0 0 0 0 0 0 cgcircie 1 0 0 0 0 0 0 0 0 cgcircccci 1 0 0 0 0 0 0 0 0 cgcir 1 0 0 0 0 1 0 0 0 cgcieor 1 0 0 0 0 1 0 0 0 cgcieccor 1 0 0 0 0 0 0 0 0 cgcie 1 0 1 0 0 0 0 0 0 cg 1 0 0 0 0 1 0 0 0 ccoecicg 1 0 0 1 0 0 0 0 0 ccoeci 1 0 0 0 0 0 0 0 0 ccocgci 1 0 1 0 0 0 0 0 0 cco 1 0 0 0 0 1 0 0 0 ccir 1 1 0 0 0 0 0 0 0 ccieci 1 0 0 0 0 0 1 0 0 ccgor 1 1 0 0 0 0 0 0 0 ccgcioe 1 0 0 0 0 0 0 1 0 ccgcif 1 0 1 0 0 0 0 0 0 ccgcieor 1 0 1 0 0 0 0 0 0 ccgcci 1 0 0 0 0 0 0 0 0 ccgccgci 1 1 0 0 0 0 0 0 0 ccgccci 1 0 1 0 0 0 0 0 0 ccgccccgci 1 1 0 0 0 0 0 0 0 cccor 1 0 0 0 0 0 1 0 0 cccoecgci 1 0 0 0 0 0 0 0 0 ccciroe 1 0 0 0 0 0 0 0 0 cccieci 1 0 0 0 0 0 0 0 0 cccgcif 1 0 0 0 0 0 0 1 0 cccgccci 1 0 0 0 1 0 0 0 0 ccccor 1 0 0 0 0 1 0 0 0 ccccoe 1 0 0 1 0 0 0 0 0 ccccir 1 0 0 0 0 1 0 0 0 cccciecgci 1 0 0 0 0 0 0 0 0 ccccic 1 0 1 0 0 0 0 0 0 ccccgcirci 1 0 0 0 0 1 0 0 0 cccccgcie 1 0 0 1 0 0 0 0 0 cccccg 1 0 0 0 0 0 1 
0 0 cccccci 1 0 0 0 0 0 0 0 0 SUFFIX TOTAL Ae ci P e.*H ci.*H oP AP cg.*H Ae.*H ------------ ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- TOTALORUM 2927 37 34 41 57 58 36 20 18 13 ------------ ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ccgci 377 0 0 0 8 12 0 0 6 0 cccgci 352 1 0 3 9 12 2 2 3 2 ci 276 7 0 0 4 2 2 1 2 4 ccccgci 256 7 12 17 2 1 18 11 0 0 cci 251 0 0 0 9 4 0 0 2 1 cim 245 2 0 0 7 4 0 0 1 0 cie 237 2 0 0 1 3 1 0 0 0 ccci 202 2 0 0 4 12 0 0 3 3 cir 172 1 1 1 4 2 2 0 0 0 cccci 108 9 5 3 1 0 4 2 1 1 cin 97 1 0 0 2 2 0 0 0 0 ccccci 24 0 4 0 0 1 2 0 0 0 cgci 21 1 1 0 0 0 0 1 0 0 cccoe 21 1 1 4 0 1 0 1 0 0 cieci 19 0 0 0 0 0 1 0 0 0 cif 16 0 0 0 1 0 0 0 0 0 cccccgci 16 3 2 0 0 0 0 0 0 0 coe 12 0 0 0 0 0 0 0 0 0 ccoe 12 0 0 2 0 0 0 1 0 0 ccccg 12 0 1 0 0 0 1 0 0 0 cccg 11 0 0 0 1 0 0 0 0 0 circi 8 0 0 0 0 0 0 0 0 1 ciecgci 8 0 0 0 0 0 0 0 0 0 ccor 8 0 0 2 0 0 0 0 0 0 ccgcir 8 0 0 0 0 0 0 0 0 0 ccccgcir 7 0 1 4 0 0 0 0 0 0 ccg 6 0 0 0 0 0 0 0 0 0 cor 5 0 0 0 0 0 0 0 0 0 cieor 5 0 0 0 1 0 0 0 0 0 cicgci 5 0 0 0 0 0 0 0 0 0 ccgcie 5 0 0 0 0 0 0 0 0 0 cieoe 4 0 0 0 0 0 0 0 0 0 cieccccgci 4 0 0 0 0 0 1 0 0 0 ccie 4 0 0 0 0 0 0 0 0 0 cccir 4 0 0 0 0 0 0 0 0 0 cccgcie 4 0 0 0 0 0 0 1 0 0 ccccc 4 0 1 0 0 0 0 0 0 0 c 4 0 0 0 0 0 0 0 0 0 ciecccgci 3 0 0 0 0 0 0 0 0 0 cic 3 0 0 0 0 0 0 0 0 0 cccie 3 0 0 0 0 0 0 0 0 0 cccgcir 3 0 0 0 0 2 0 0 0 0 ccccgcie 3 0 0 1 0 0 1 0 0 0 cc 3 0 0 0 0 0 0 0 0 0 coecgci 2 0 0 0 1 0 0 0 0 0 ciie 2 0 0 0 0 0 0 0 0 0 cieo 2 0 0 0 0 0 0 0 0 0 cgoe 2 0 0 1 0 0 0 0 0 0 cgcim 2 0 1 0 0 0 0 0 0 0 ccgcim 2 0 0 0 0 0 0 0 0 0 cce 2 0 0 0 0 0 0 0 0 0 ccccif 2 0 0 0 0 0 0 0 0 0 ccc 2 0 0 0 0 0 0 0 0 0 coeor 1 0 0 0 0 0 0 0 0 0 coeci 1 0 0 0 0 0 0 0 0 0 co 1 0 0 0 0 0 0 0 0 0 cirorci 1 0 0 0 0 0 0 0 0 0 cirof 1 0 0 0 0 0 0 0 0 0 ciroeof 1 0 0 0 0 0 0 0 0 0 ciroe 1 0 0 0 0 0 0 0 0 0 circif 1 0 0 0 0 0 0 0 0 0 circie 1 0 0 0 0 0 0 0 0 0 circgci 1 0 0 0 0 0 0 0 0 0 circccci 1 0 0 0 0 0 0 0 0 0 circccccgcie 1 0 0 0 0 0 0 0 
         0     0
    cioc             1     0     0     0     0     0     0     0     0     0
    cimci            1     0     0     0     0     0     0     0     0     0
    cifo             1     0     0     0     1     0     0     0     0     0
    ciecirci         1     0     0     0     0     0     0     0     0     0
    ciecircg         1     0     0     0     0     0     0     0     0     0
    ciecir           1     0     0     0     0     0     0     0     0     0
    ciecim           1     0     0     0     0     0     0     0     0     0
    ciecie           1     0     0     0     0     0     0     0     0     0
    ciecgcgci        1     0     0     0     0     0     0     0     0     0
    cieccci          1     0     0     0     0     0     0     0     0     0
    ciecccci         1     0     0     0     0     0     0     0     0     0
    ciec             1     0     0     0     0     0     0     0     0     0
    cicccgci         1     0     0     0     0     0     0     0     0     0
    cgcircie         1     0     1     0     0     0     0     0     0     0
    cgcircccci       1     0     1     0     0     0     0     0     0     0
    cgcir            1     0     0     0     0     0     0     0     0     0
    cgcieor          1     0     0     0     0     0     0     0     0     0
    cgcieccor        1     0     0     1     0     0     0     0     0     0
    cgcie            1     0     0     0     0     0     0     0     0     0
    cg               1     0     0     0     0     0     0     0     0     0
    ccoecicg         1     0     0     0     0     0     0     0     0     0
    ccoeci           1     0     0     0     0     0     0     0     0     1
    ccocgci          1     0     0     0     0     0     0     0     0     0
    cco              1     0     0     0     0     0     0     0     0     0
    ccir             1     0     0     0     0     0     0     0     0     0
    ccieci           1     0     0     0     0     0     0     0     0     0
    ccgor            1     0     0     0     0     0     0     0     0     0
    ccgcioe          1     0     0     0     0     0     0     0     0     0
    ccgcif           1     0     0     0     0     0     0     0     0     0
    ccgcieor         1     0     0     0     0     0     0     0     0     0
    ccgcci           1     0     0     0     1     0     0     0     0     0
    ccgccgci         1     0     0     0     0     0     0     0     0     0
    ccgccci          1     0     0     0     0     0     0     0     0     0
    ccgccccgci       1     0     0     0     0     0     0     0     0     0
    cccor            1     0     0     0     0     0     0     0     0     0
    cccoecgci        1     0     0     1     0     0     0     0     0     0
    ccciroe          1     0     0     1     0     0     0     0     0     0
    cccieci          1     0     0     0     0     0     1     0     0     0
    cccgcif          1     0     0     0     0     0     0     0     0     0
    cccgccci         1     0     0     0     0     0     0     0     0     0
    ccccor           1     0     0     0     0     0     0     0     0     0
    ccccoe           1     0     0     0     0     0     0     0     0     0
    ccccir           1     0     0     0     0     0     0     0     0     0
    cccciecgci       1     0     1     0     0     0     0     0     0     0
    ccccic           1     0     0     0     0     0     0     0     0     0
    ccccgcirci       1     0     0     0     0     0     0     0     0     0
    cccccgcie        1     0     0     0     0     0     0     0     0     0
    cccccg           1     0     0     0     0     0     0     0     0     0
    cccccci          1     0     1     0     0     0     0     0     0     0

Analysis:

To a first approximation, the "ordinary" words are

    { AH oH cg cc.*H e oe H oe.*H Ae ci P e.*H ci.*H oP AP cg.*H Ae.*H }
    × { ccgci cccgci ci ccccgci cci cim cie ccci cir cccci cin }

Rarer suffixes are

    { ccccci cgci cccoe cieci cif cccccgci coe ccoe ... }

Looking back, we see that the prefixes "cc.*H" are actually "cccH" and "ccccH". There is also a "ciH" prefix.

The "ccc[^HP]*" words that we were using before appear to be words with a "c" prefix.
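The product structure claimed in this analysis can be tested directly: build one anchored regular expression from the two sets and count how many tokens it accepts. This is a minimal sketch, not the procedure actually used in these notes. The function name count_ordinary is hypothetical, the input is assumed to be one word per line (as bio-j-huc-gut.wds appears to be), and the ".*" parts of the prefix patterns are kept exactly as written, so the match is deliberately loose.

```sh
# Prefix and suffix alternations copied from the analysis above.
PREFS='AH|oH|cg|cc.*H|e|oe|H|oe.*H|Ae|ci|P|e.*H|ci.*H|oP|AP|cg.*H|Ae.*H'
SUFFS='ccgci|cccgci|ci|ccccgci|cci|cim|cie|ccci|cir|cccci|cin'

# count_ordinary: hypothetical helper, not part of the original tool set.
# Reads one word per line and prints "MATCHED TOTAL", where MATCHED is
# the number of non-empty lines of the form <prefix><suffix> with the
# prefix and suffix taken from the two sets above.
count_ordinary() {
  awk -v re="^(${PREFS})(${SUFFS})\$" '
    /./ { total++; if ($0 ~ re) matched++ }
    END { printf "%d %d\n", matched, total }
  '
}
```

A possible use would be

    cat bio-j-huc-gut.wds | count_ordinary

Note that a word such as "ccccgci" is not counted as matched, because a bare "c" prefix is not in the prefix set; that is consistent with the remark that the "ccc[^HP]*" words form a separate case.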
Looked for more info on "H"-containing prefixes: cat bio-j-huc-gut.wds \ | sed -e 's/^/_/g' -e 's/$/_/g' \ | compare-contexts '_ci.*H.*_' '_e.*H.*_' '_cg.*H.*_' '_Ae.*H.*_' '_oe.*H.*_' 8 0.13 eHccgci 12 0.21 ciHccgci 7 0.11 eHcccgci 12 0.21 ciHcccgci 6 0.10 eHcim 10 0.17 ciHccci 5 0.08 eccccHcci 4 0.07 ciHcim 4 0.07 eHcir 3 0.05 ciHcie 3 0.05 ecccHcci 2 0.03 ciHcir 3 0.05 eHccci 2 0.03 ciHcin 2 0.03 eccccHcccgci 2 0.03 ciHcci 2 0.03 ecccHci 2 0.03 ciHcccgcir 2 0.03 eHcin 1 0.02 ciroHcci 2 0.03 eHci 1 0.02 cioHci 2 0.03 eHccccgci 1 0.02 ciccccHcci 1 0.02 eoeHcim 1 0.02 ciccccHccci 1 0.02 eoHcif 1 0.02 cicHccci 1 0.02 eccccHcie 1 0.02 ciHci 1 0.02 eccccHccci 1 0.02 ciHcccoe 1 0.02 eHoe 1 0.02 ciHccccgci 1 0.02 eHocgcie 1 0.02 ciHccccci 1 0.02 eHecccci ----- ---- ---- 1 0.02 eHe 58 1.00 TOT 1 0.02 eHcoecgci 1 0.02 eHcifo 1 0.02 eHcieor 1 0.02 eHcci 1 0.02 eHccgcci 1 0.02 eHcccg 1 0.02 eHcccci ----- ---- ---- 61 1.00 TOT 20 0.17 oeHccgci 2 0.11 cgciHccgci 4 0.31 AeHci 19 0.17 oeHccci 2 0.11 cgciHcccgci 3 0.23 AeHccci 15 0.13 oeHcccgci 2 0.11 cgHccgci 2 0.15 AeHcccgci 10 0.09 oeHci 1 0.06 cgoeHccgci 1 0.08 AeHcirci 7 0.06 oeHcir 1 0.06 cgoHccgci 1 0.08 AeHccoeci 7 0.06 oeHcim 1 0.06 cgoHccci 1 0.08 AeHcci 5 0.04 oeHcin 1 0.06 cgcieHci 1 0.08 AeHcccci 5 0.04 oeHcie 1 0.06 cgcicHcci ----- ---- ---- 4 0.03 oeHcci 1 0.06 cgciHcim 13 1.00 TOT 3 0.03 oeHcccci 1 0.06 cgciHci 2 0.02 oeoHci 1 0.06 cgciHcci 2 0.02 oecccHcci 1 0.06 cgciHccci 2 0.02 oeHccccgci 1 0.06 cgcccHcccgci 1 0.01 oeoeHccci 1 0.06 cgHccci 1 0.01 oeoHcir 1 0.06 cgHcccci 1 0.01 oeoHccccgci ----- ---- ---- 1 0.01 oeccccHcci 18 1.00 TOT 1 0.01 oeccHci 1 0.01 oePocHcci 1 0.01 oeHom 1 0.01 oeHoe 1 0.01 oeHcoe 1 0.01 oeHcif 1 0.01 oeHcicgci 1 0.01 oeHccg 1 0.01 oeHcccir 1 0.01 oeHccccci ----- ---- ---- 115 1.00 TOT So it seems we got some new prefixes: { eH eccccH ciH ciccccH oeH cgciH AeH } Now that we know that "ccgci" is the most common suffix, let's look for all its prefixes: cat bio-j-huc-gut.wds \ | egrep 
'ccgci$' \ | sed -e 's/ccgci$//g' \ | wfreq 387 0.25 cc 201 0.13 AH 191 0.12 AHc 85 0.05 oH 73 0.05 ecc 60 0.04 ccc 57 0.04 oHc 38 0.02 oecc 28 0.02 H 27 0.02 cgcc 21 0.01 c 20 0.01 oeH 19 0.01 AHcc 18 0.01 oPcc 18 0.01 Hcc 17 0.01 Pcc 16 0.01 Hc 15 0.01 oeHc 15 0.01 cccHc 12 0.01 cicc 12 0.01 ciHc 12 0.01 ciH 11 0.01 APcc 10 0.01 rcc 10 0.01 oHcc 10 0.01 cccH 8 0.01 eH 7 0.00 ec 7 0.00 eHc 7 0.00 Aecc 6 0.00 oec 6 0.00 eccc 6 0.00 Ac 5 0.00 cgc 5 0.00 ccccHc 4 0.00 oePcc 4 0.00 coeHc 4 0.00 cHc 3 0.00 ccH 3 0.00 cH 3 0.00 Pc 3 0.00 Aeccc 2 0.00 rccc 2 0.00 oeHcc 2 0.00 occ 2 0.00 oPc 2 0.00 eccccHc 2 0.00 eHcc 2 0.00 coecc 2 0.00 coHc 2 0.00 ciec 2 0.00 ciccc 2 0.00 cgciecc 2 0.00 cgciec 2 0.00 cgciHc 2 0.00 cgciH 2 0.00 cgccc 2 0.00 cgH 2 0.00 cg 2 0.00 ccccH 2 0.00 cccPcc 2 0.00 Poecc 2 0.00 Poec 2 0.00 AeHc 2 0.00 Acc 2 0.00 APc 2 0.00 AHccc 1 0.00 rc 1 0.00 orccc 1 0.00 oeoHcc 1 0.00 oeccc 1 0.00 ocgcc 1 0.00 ocHc 1 0.00 oc 1 0.00 oPciecc 1 0.00 oHciecc 1 0.00 oHciec 1 0.00 oAPcc 1 0.00 eocc 1 0.00 ecccPc 1 0.00 ePcc 1 0.00 ePc 1 0.00 coeH 1 0.00 ciecc 1 0.00 ciPcc 1 0.00 ciHcc 1 0.00 cgoeccc 1 0.00 cgoecc 1 0.00 cgoePcc 1 0.00 cgoeH 1 0.00 cgoH 1 0.00 cgecc 1 0.00 cgcccHc 1 0.00 ccocPc 1 0.00 ccoHc 1 0.00 cccoec 1 0.00 ccccPc 1 0.00 cccPc 1 0.00 ccPccc 1 0.00 ccPcc 1 0.00 cPcc 1 0.00 Poeciec 1 0.00 Poecgcc 1 0.00 PciH 1 0.00 Horoecc 1 0.00 Hoec 1 0.00 HcccoHc 1 0.00 Aec 1 0.00 Acgc 1 0.00 AccHc 1 0.00 AcHc 1 0.00 AHoecc 1 0.00 AHcic 1 0.00 AHccgcc 1 0.00 AHccg ----- ---- ---- 1562 1.00 TOT Ditto with "cie": cat bio-j-huc-gut.wds \ | egrep 'cie$' \ | sed -e 's/cie$//g' \ | wfreq 116 0.35 AH 50 0.15 cg 40 0.12 oH 14 0.04 c 12 0.04 10 0.03 H 9 0.03 ccccg 7 0.02 ccc 7 0.02 cc 6 0.02 r 5 0.02 oeH 3 0.01 e 3 0.01 ciH 2 0.01 oe 2 0.01 oHccg 2 0.01 ccccHc 2 0.01 ccH 2 0.01 Ae 2 0.01 AHccg 2 0.01 AHcccg 1 0.00 rccc 1 0.00 or 1 0.00 ocg 1 0.00 oc 1 0.00 oPccccg 1 0.00 oP 1 0.00 oHcg 1 0.00 oHcccg 1 0.00 oHcc 1 0.00 eor 1 0.00 eccccg 1 0.00 eccccH 1 0.00 eHocg 1 0.00 
coeccH 1 0.00 cicgcir 1 0.00 cgcircccccg 1 0.00 cgcie 1 0.00 cgcccccg 1 0.00 ccoH 1 0.00 ccir 1 0.00 cccg 1 0.00 cccc 1 0.00 cccHcc 1 0.00 cccHc 1 0.00 cccH 1 0.00 cPc 1 0.00 cHc 1 0.00 cH 1 0.00 Poecccg 1 0.00 Pccccg 1 0.00 Hoecg 1 0.00 Hcir 1 0.00 Hccg 1 0.00 Hcc 1 0.00 APcccg 1 0.00 AHc 1 0.00 A ----- ---- ---- 333 1.00 TOT Looking for suffixes that do not begin with "c": cat bio-j-huc-gut.wds \ | sed -e 's/^/_/g' -e 's/$/_/g' \ | compare-contexts '_AH.*_' '_cg.*_' '_cccH.*_' 72 0.20 cgcim 201 0.19 AHccgci 90 0.51 cccHcci 51 0.14 cgcir 191 0.18 AHcccgci 31 0.18 cccHci 50 0.14 cgcie 116 0.11 AHcie 15 0.08 cccHcccgci 36 0.10 cgci 94 0.09 AHcim 11 0.06 cccHccci 27 0.07 cgccccgci 87 0.08 AHccci 10 0.06 cccHccgci 17 0.05 cgoe 79 0.08 AHci 4 0.02 cccH 12 0.03 cgcin 54 0.05 AHcin 3 0.02 cccHcir 6 0.02 cgcif 50 0.05 AHcir 2 0.01 cccHcim 6 0.02 cgcieci 43 0.04 AHcci 2 0.01 cccHcif 5 0.01 cgcirci 21 0.02 AHoe 2 0.01 cccHc 5 0.01 cgcccgci 19 0.02 AHccccgci 1 0.01 cccHcor 4 0.01 cge 12 0.01 AHcccci 1 0.01 cccHcie 4 0.01 cgcccoe 7 0.01 AHcieci 1 0.01 cccHccie 3 0.01 cgor 5 0.00 AHcccg 1 0.01 cccHccg 3 0.01 cgciecgci 4 0.00 AHor 1 0.01 cccHcccie 3 0.01 cgccoe 4 0.00 AHcor 1 0.01 cccHcccci 3 0.01 cgcccci 4 0.00 AHcif 1 0.01 cccHccccg 3 0.01 cgccccci 4 0.00 AHccgcir ----- ---- ---- 2 0.01 cgcieo 3 0.00 AHciecgci 177 1.00 TOT 2 0.01 cgciecccgci 3 0.00 AHcicgci 2 0.01 cgcieccccgci 3 0.00 AHccg 2 0.01 cgciHccgci 2 0.00 AHe 2 0.01 cgciHcccgci 2 0.00 AHcic 2 0.01 cgccor 2 0.00 AHccgcie 2 0.01 cgccgci 2 0.00 AHcccgcie 2 0.01 cgccci 2 0.00 AHcccccgci 2 0.01 cgcccccgci 2 0.00 AH 2 0.01 cgHccgci 1 0.00 AHom 1 0.00 cgroe 1 0.00 AHoeoe 1 0.00 cgorcim 1 0.00 AHoecgci 1 0.00 cgoeccccgci 1 0.00 AHoeccccgci 1 0.00 cgoecccccgci 1 0.00 AHoPci 1 0.00 cgoePccccgci 1 0.00 AHcoeci 1 0.00 cgoeHccgci 1 0.00 AHcoe 1 0.00 cgoHccgci 1 0.00 AHcirci 1 0.00 cgoHccci 1 0.00 AHcircgci 1 0.00 cgeccccgci 1 0.00 AHciie 1 0.00 cgcirorci 1 0.00 AHcieoe 1 0.00 cgcirof 1 0.00 AHciecgcgci 1 0.00 cgcircccci 1 0.00 
                                          AHciecccci
        1 0.00 cgcircccccgcie      1 0.00 AHciec
        1 0.00 cgciie              1 0.00 AHcicccgci
        1 0.00 cgcieoe             1 0.00 AHcgci
        1 0.00 cgciecirci          1 0.00 AHccoe
        1 0.00 cgciecircg          1 0.00 AHccir
        1 0.00 cgciecir            1 0.00 AHccie
        1 0.00 cgciecim            1 0.00 AHccgor
        1 0.00 cgciecie            1 0.00 AHccgcim
        1 0.00 cgcieHci            1 0.00 AHccgccgci
        1 0.00 cgcicHcci           1 0.00 AHccgccccgci
        1 0.00 cgciHcim            1 0.00 AHccccg
        1 0.00 cgciHci             1 0.00 AHccccci
        1 0.00 cgciHcci            1 0.00 AHccccHcci
        1 0.00 cgciHccci           1 0.00 AHccc
        1 0.00 cgccoecicg          1 0.00 AHcc
        1 0.00 cgccccoe           ----- ---- ----
        1 0.00 cgcccccgcie         1044 1.00 TOT
        1 0.00 cgccccc
        1 0.00 cgcccHcccgci
        1 0.00 cgHccci
        1 0.00 cgHcccci
    ----- ---- ----
      363 1.00 TOT

So it seems that { oe or om e } are also valid suffixes.

OK, let's try to parse what we can into prefix:suffix:

    split-prefix-suffix
    ------------------------------------------------
    #! /n/gnu/bin/gawk -f

    # Attempts to split words into prefix/suffix inserting ":" in between.

    BEGIN {
      PREFS = "^(AH|AP|Ae|AeH|H|P|cH|ccH|cccH|ccccH|cg|cgciH|ci|ciH|e|eH|eccccH|oH|oP|oe|oeH|r)"
      SUFFS = "([co][^HP]*)$"
      SPLIT = ( PREFS SUFFS )
    }

    ( $0 ~ SPLIT ) {
      match($0, PREFS)
      k = RLENGTH
      $0 = (substr($0, 1, k) ":" substr($0, k + 1))
      print
      next
    }

    /./ { print; next }
    ------------------------------------------------

    cat bio-j-huc-gut.wds \
      | split-prefix-suffix \
      | egrep -v ':' \
      | wfreq

      387 0.24 ccccgci
      139 0.09 cccci
      131 0.08 oe
       81 0.05 Ae
       61 0.04 ccccci
       60 0.04 cccccgci
       41 0.03 or
       33 0.02 cccoe
       30 0.02 ccim
       25 0.02 coe
       23 0.01 ccoe
       21 0.01 cim
       21 0.01 cccgci
       17 0.01 cccor
       14 0.01 ccie
       14 0.01 ccci
       12 0.01 cie
       11 0.01 ccir
       10 0.01 cor
       10 0.01 cccir
       10 0.01 cccc
        9 0.01 ccccgcim
        9 0.01 ccccgcie
        8 0.01 ccccir
        7 0.00 cir
        7 0.00 cccie
        7 0.00 ccccie
        7 0.00 ccccg
        6 0.00 cce
        6 0.00 c
        6 0.00 Ar
        6 0.00 Acccgci
        5 0.00 r
        5 0.00 oroe
        5 0.00 orci
        5 0.00 com
        5 0.00 cccPccci
        4 0.00 oePccccgci
        4 0.00 e
        4 0.00 coeHcccgci
        4 0.00 cin
        4 0.00 cge
        4 0.00 ccccoe
        4 0.00 ccccgcir
        4 0.00 ccccc
        4 0.00 cccH
        4 0.00 cPccci
        4 0.00 AcHccci
        4 0.00 A
        3 0.00 orcim
        3 0.00 ocHcci
        3 0.00
oH 3 0.00 ecccHcci 3 0.00 coecccci 3 0.00 coeHccci 3 0.00 cif 3 0.00 cieci 3 0.00 ccoecgci 3 0.00 cccif 3 0.00 cccgcir 3 0.00 ccccgoe 3 0.00 Acgci 2 0.00 rcccHci 2 0.00 oroeci 2 0.00 om 2 0.00 oeoHci 2 0.00 oeeccci 2 0.00 oecccHcci 2 0.00 ocgci 2 0.00 occcci 2 0.00 occccgci 2 0.00 ocHccci 2 0.00 ecccHci 2 0.00 coeccccgci 2 0.00 coHcccgci 2 0.00 ciroe 2 0.00 circi 2 0.00 cimci 2 0.00 ciecccgci 2 0.00 cgHccgci 2 0.00 ccoHci 2 0.00 cccoHci 2 0.00 ccccieci 2 0.00 ccccPcci 2 0.00 cccPccccgci 2 0.00 HoHoe 2 0.00 Accci 2 0.00 Accccgci 2 0.00 AHe 2 0.00 AH 1 0.00 rcicHcci 1 0.00 orcir 1 0.00 orcin 1 0.00 orcie 1 0.00 orcccci 1 0.00 orcccccgci 1 0.00 on 1 0.00 oeoeHccci 1 0.00 oeoHcir 1 0.00 oeoHccccgci 1 0.00 oeeof 1 0.00 oeccccHcci 1 0.00 oeccHci 1 0.00 oePocHcci 1 0.00 oeA 1 0.00 ocicccci 1 0.00 ocgcir 1 0.00 ocgcim 1 0.00 ocgcie 1 0.00 ocgccccgci 1 0.00 occie 1 0.00 occcgci 1 0.00 occccin 1 0.00 occccci 1 0.00 occcPoec 1 0.00 ocHcor 1 0.00 ocHccoe 1 0.00 ocHcccgci 1 0.00 oPcieHcim 1 0.00 oHeor 1 0.00 oHeoe 1 0.00 oHcieHci 1 0.00 oHccoHcir 1 0.00 oAe 1 0.00 oAPccccgci 1 0.00 oAHci 1 0.00 oA 1 0.00 o 1 0.00 er 1 0.00 eoeHcim 1 0.00 eoHcif 1 0.00 eecgcir 1 0.00 ecccPcccgci 1 0.00 ePoe 1 0.00 ePcccgci 1 0.00 ePccccgci 1 0.00 eHecccci 1 0.00 eHe 1 0.00 coeor 1 0.00 coeci 1 0.00 coecgci 1 0.00 coecccoe 1 0.00 coeccHcie 1 0.00 coeHci 1 0.00 coeHcci 1 0.00 coeHccgci 1 0.00 cocgcir 1 0.00 cocHccci 1 0.00 coHoe 1 0.00 ciror 1 0.00 ciroHcci 1 0.00 circir 1 0.00 cioHci 1 0.00 cieorci 1 0.00 cieor 1 0.00 cieoe 1 0.00 cieciecgci 1 0.00 cieccccgci 1 0.00 cieccccc 1 0.00 ciccccHcci 1 0.00 ciccccHccci 1 0.00 cicPcim 1 0.00 cicHccci 1 0.00 ciPcccci 1 0.00 ciPccccgci 1 0.00 ci 1 0.00 cgroe 1 0.00 cgoePccccgci 1 0.00 cgoeHccgci 1 0.00 cgoHccgci 1 0.00 cgoHccci 1 0.00 cgeccccgci 1 0.00 cgcieHci 1 0.00 cgcicHcci 1 0.00 cgcccHcccgci 1 0.00 cgHccci 1 0.00 cgHcccci 1 0.00 cec 1 0.00 ccor 1 0.00 ccoeo 1 0.00 ccoeci 1 0.00 ccoecccci 1 0.00 ccoeHcccci 1 0.00 ccocgcim 1 0.00 ccocPcccgci 1 0.00 
ccocHcci 1 0.00 ccoHcim 1 0.00 ccoHcie 1 0.00 ccoHcccgci 1 0.00 ccircie 1 0.00 ccino 1 0.00 ccieci 1 0.00 ccieccccg 1 0.00 ccieHccci 1 0.00 cciHcim 1 0.00 cci 1 0.00 ccer 1 0.00 ccecgcim 1 0.00 ccecgci 1 0.00 cceccPccccci 1 0.00 ccec 1 0.00 cccoeoe 1 0.00 cccoeo 1 0.00 cccoecgci 1 0.00 cccoecccgci 1 0.00 cccoecccci 1 0.00 ccciror 1 0.00 cccirci 1 0.00 cccieoeci 1 0.00 ccciHccci 1 0.00 cccgoe 1 0.00 cccgcin 1 0.00 cccgcif 1 0.00 cccgcie 1 0.00 ccccor 1 0.00 ccccieor 1 0.00 cccciHci 1 0.00 ccccgor 1 0.00 ccccgcif 1 0.00 ccccgciecgci 1 0.00 ccccgciHcir 1 0.00 ccccgccci 1 0.00 cccccoe 1 0.00 cccccim 1 0.00 cccccie 1 0.00 cccccgcir 1 0.00 cccccg 1 0.00 cccccci 1 0.00 cccccHcci 1 0.00 cccccHccci 1 0.00 ccccPcccgci 1 0.00 cccPoe 1 0.00 cccPci 1 0.00 cccPcccgci 1 0.00 cccP 1 0.00 ccPcim 1 0.00 ccPccccgci 1 0.00 ccPcccccgci 1 0.00 cc 1 0.00 cPcoe 1 0.00 cPccir 1 0.00 cPccie 1 0.00 cPccgor 1 0.00 cPcccci 1 0.00 cPccccgci 1 0.00 PoecgciHci 1 0.00 PoeHcccoe 1 0.00 PoeHccci 1 0.00 PoHcin 1 0.00 PciHccgci 1 0.00 HoePci 1 0.00 HoeHcci 1 0.00 HocHccci 1 0.00 HoHcieci 1 0.00 HciAHci 1 0.00 HcccoHcccgci 1 0.00 HcccgoeHcgci 1 0.00 H 1 0.00 Arcim 1 0.00 Aoe 1 0.00 An 1 0.00 Acir 1 0.00 Acim 1 0.00 Acie 1 0.00 AciHci 1 0.00 AcgciHcci 1 0.00 Acgccci 1 0.00 Acgcccgci 1 0.00 Acgccccg 1 0.00 Acccci 1 0.00 AccHcccgci 1 0.00 AcHcci 1 0.00 AcHcccgci 1 0.00 AcHccc 1 0.00 AP 1 0.00 AHoPci 1 0.00 AHccccHcci 1 0.00 AAHccci ----- ---- ---- 1585 1.00 TOT I must do something about the "c" prefix.... 
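Before changing the prefix list, it may help to measure how much of the unsplit residue is due to words beginning with "c". The following is a small sketch for that purpose only; the helper name residue_summary is hypothetical, and it assumes the output format of split-prefix-suffix shown above (one word per line, with ":" present exactly in the words that were split).

```sh
# residue_summary: hypothetical helper, not part of the original tool set.
# Reads the output of split-prefix-suffix and prints three counts:
#   split      words containing ":" (successfully split)
#   unsplit    non-empty words without ":"
#   unsplit_c  unsplit words that begin with "c"
residue_summary() {
  awk '
    /:/ { ns++; next }
    /./ { nu++; if ($0 ~ /^c/) nc++ }
    END { printf "split=%d unsplit=%d unsplit_c=%d\n", ns, nu, nc }
  '
}
```

It would be run as

    cat bio-j-huc-gut.wds | split-prefix-suffix | residue_summary

A large unsplit_c relative to unsplit would confirm that the "c" prefix is the main thing left to handle.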
cat bio-j-huc-gut.wds \ | split-prefix-suffix \ | egrep ':' \ | sed -e 's/^.*://g' \ | wfreq All suffixes recognized by the code: 376 0.12 ccgci 355 0.11 cccgci 267 0.08 ci 265 0.08 ccccgci 255 0.08 cim 243 0.08 cie 240 0.08 cci 201 0.06 ccci 174 0.06 cir 111 0.04 cccci 102 0.03 oe 98 0.03 cin 41 0.01 or 25 0.01 ccccci 21 0.01 cgci 21 0.01 cccoe 20 0.01 cieci 18 0.01 cif 18 0.01 cccccgci 14 0.00 ccccg 13 0.00 coe 12 0.00 ccoe 11 0.00 cccg 9 0.00 circi 8 0.00 ciecgci 8 0.00 ccor 8 0.00 ccgcir 7 0.00 oeci 7 0.00 ccccgcir 6 0.00 cor 6 0.00 ccg 5 0.00 om 5 0.00 o 5 0.00 cieor 5 0.00 cicgci 5 0.00 ccie 5 0.00 ccgcie 4 0.00 oecgci 4 0.00 oeccccgci 4 0.00 cieoe 4 0.00 cieccccgci 4 0.00 cccir 4 0.00 cccgcie 4 0.00 ccccc 4 0.00 c 3 0.00 oecccgci 3 0.00 ciecccgci 3 0.00 cic 3 0.00 cccie 3 0.00 cccgcir 3 0.00 ccccgcie 3 0.00 ccc 3 0.00 cc 2 0.00 oroe 2 0.00 orcim 2 0.00 oeccci 2 0.00 oecccci 2 0.00 coecgci 2 0.00 ciie 2 0.00 cieo 2 0.00 cgoe 2 0.00 cgcim 2 0.00 ccgcim 2 0.00 cce 2 0.00 ccccif 1 0.00 oroeccccgci 1 0.00 orcin 1 0.00 orcie 1 0.00 orccci 1 0.00 oeor 1 0.00 oeof 1 0.00 oeoe 1 0.00 oecircir 1 0.00 oecim 1 0.00 oeciecccgci 1 0.00 oecgcie 1 0.00 oecgccccgci 1 0.00 oecccgcie 1 0.00 oecccg 1 0.00 oeccccg 1 0.00 oecccccgci 1 0.00 ocgcie 1 0.00 ocg 1 0.00 occci 1 0.00 occccgci 1 0.00 coeor 1 0.00 coeci 1 0.00 co 1 0.00 cirorci 1 0.00 cirof 1 0.00 ciroeof 1 0.00 ciroe 1 0.00 circif 1 0.00 circie 1 0.00 circgci 1 0.00 circccci 1 0.00 circccccgcie 1 0.00 cioc 1 0.00 cimci 1 0.00 cifo 1 0.00 ciecirci 1 0.00 ciecircg 1 0.00 ciecir 1 0.00 ciecim 1 0.00 ciecie 1 0.00 ciecgcgci 1 0.00 ciecce 1 0.00 cieccci 1 0.00 ciecccci 1 0.00 ciec 1 0.00 cicccgci 1 0.00 cgcircie 1 0.00 cgcircccci 1 0.00 cgcir 1 0.00 cgcieor 1 0.00 cgcieccor 1 0.00 cgcie 1 0.00 cg 1 0.00 ccoecicg 1 0.00 ccoeci 1 0.00 ccocgci 1 0.00 cco 1 0.00 ccir 1 0.00 ccin 1 0.00 ccieci 1 0.00 ccgor 1 0.00 ccgcioe 1 0.00 ccgcif 1 0.00 ccgcieor 1 0.00 ccgcci 1 0.00 ccgccgci 1 0.00 ccgccci 1 0.00 ccgccccgci 1 0.00 cccor 1 
0.00 cccoecgci 1 0.00 ccciroe 1 0.00 cccieci 1 0.00 cccgcif 1 0.00 cccgciA 1 0.00 cccgccci 1 0.00 ccccor 1 0.00 ccccoe 1 0.00 ccccir 1 0.00 cccciecgci 1 0.00 cccciecg 1 0.00 ccccie 1 0.00 ccccic 1 0.00 ccccgcirci 1 0.00 cccccgcie 1 0.00 cccccg 1 0.00 cccccci 1 0.00 cccc ----- ---- ---- 3157 1.00 TOT In reverse-lex order: 1 0.00 cccgciA 4 0.00 c 3 0.00 cc 3 0.00 ccc 1 0.00 cccc 4 0.00 ccccc 1 0.00 ciec 3 0.00 cic 1 0.00 ccccic 1 0.00 cioc 2 0.00 cce 1 0.00 ciecce 243 0.08 cie 5 0.00 ccie 3 0.00 cccie 1 0.00 ccccie 1 0.00 ciecie 1 0.00 cgcie 5 0.00 ccgcie 4 0.00 cccgcie 3 0.00 ccccgcie 1 0.00 cccccgcie 1 0.00 circccccgcie 1 0.00 oecccgcie 1 0.00 oecgcie 1 0.00 ocgcie 1 0.00 circie 1 0.00 cgcircie 1 0.00 orcie 2 0.00 ciie 102 0.03 oe 13 0.00 coe 12 0.00 ccoe 21 0.01 cccoe 1 0.00 ccccoe 4 0.00 cieoe 1 0.00 oeoe 2 0.00 cgoe 1 0.00 ccgcioe 1 0.00 ciroe 1 0.00 ccciroe 2 0.00 oroe 18 0.01 cif 2 0.00 ccccif 1 0.00 ccgcif 1 0.00 cccgcif 1 0.00 circif 1 0.00 oeof 1 0.00 ciroeof 1 0.00 cirof 1 0.00 cg 6 0.00 ccg 11 0.00 cccg 14 0.00 ccccg 1 0.00 cccccg 1 0.00 oeccccg 1 0.00 oecccg 1 0.00 cccciecg 1 0.00 ccoecicg 1 0.00 ocg 1 0.00 ciecircg 267 0.08 ci 240 0.08 cci 201 0.06 ccci 111 0.04 cccci 25 0.01 ccccci 1 0.00 cccccci 1 0.00 ciecccci 2 0.00 oecccci 1 0.00 circccci 1 0.00 cgcircccci 1 0.00 cieccci 2 0.00 oeccci 1 0.00 ccgccci 1 0.00 cccgccci 1 0.00 occci 1 0.00 orccci 1 0.00 ccgcci 20 0.01 cieci 1 0.00 ccieci 1 0.00 cccieci 7 0.00 oeci 1 0.00 coeci 1 0.00 ccoeci 21 0.01 cgci 376 0.12 ccgci 355 0.11 cccgci 265 0.08 ccccgci 18 0.01 cccccgci 1 0.00 oecccccgci 4 0.00 cieccccgci 4 0.00 oeccccgci 1 0.00 oroeccccgci 1 0.00 ccgccccgci 1 0.00 oecgccccgci 1 0.00 occccgci 3 0.00 ciecccgci 1 0.00 oeciecccgci 3 0.00 oecccgci 1 0.00 cicccgci 1 0.00 ccgccgci 8 0.00 ciecgci 1 0.00 cccciecgci 4 0.00 oecgci 2 0.00 coecgci 1 0.00 cccoecgci 1 0.00 ciecgcgci 5 0.00 cicgci 1 0.00 ccocgci 1 0.00 circgci 1 0.00 cimci 9 0.00 circi 1 0.00 ciecirci 1 0.00 ccccgcirci 1 0.00 cirorci 255 0.08 cim 1 0.00 
ciecim
     1 0.00 oecim
     2 0.00 cgcim
     2 0.00 ccgcim
     2 0.00 orcim
     5 0.00 om
    98 0.03 cin
     1 0.00 ccin
     1 0.00 orcin
     5 0.00 o
     1 0.00 co
     1 0.00 cco
     2 0.00 cieo
     1 0.00 cifo
   174 0.06 cir
     1 0.00 ccir
     4 0.00 cccir
     1 0.00 ccccir
     1 0.00 ciecir
     1 0.00 cgcir
     8 0.00 ccgcir
     3 0.00 cccgcir
     7 0.00 ccccgcir
     1 0.00 oecircir
    41 0.01 or
     6 0.00 cor
     8 0.00 ccor
     1 0.00 cccor
     1 0.00 ccccor
     1 0.00 cgcieccor
     5 0.00 cieor
     1 0.00 cgcieor
     1 0.00 ccgcieor
     1 0.00 oeor
     1 0.00 coeor
     1 0.00 ccgor

Note the isolated peaks at suffixes that begin with "ci" or "o". These sharp peaks are evidence that the prefixes recognized above do NOT have alternates with a "c" appended. That is, the suffixes

    21 0.01 cgci
   376 0.12 ccgci
   355 0.11 cccgci
   265 0.08 ccccgci
    18 0.01 cccccgci

appear to be different suffixes, and not the same "ccgci" suffix attached to different prefixes.

Here are the most significant suffixes:

   243 0.08 cie
   102 0.03 oe
    13 0.00 coe
    12 0.00 ccoe
    21 0.01 cccoe
    18 0.01 cif
    11 0.00 cccg
    14 0.00 ccccg
   267 0.08 ci
   240 0.08 cci
   201 0.06 ccci
   111 0.04 cccci
    25 0.01 ccccci
    20 0.01 cieci
    21 0.01 cgci
   376 0.12 ccgci
   355 0.11 cccgci
   265 0.08 ccccgci
    18 0.01 cccccgci
   255 0.08 cim
    98 0.03 cin
   174 0.06 cir
    41 0.01 or

Removing the strings of "c":

    cat bio-j-huc-gut.wds \
      | split-prefix-suffix \
      | egrep ':c' \
      | sed -e 's/^.*:c*//g' \
      | wfreq

  1035 0.35 gci
   845 0.29 i
   255 0.09 im
   252 0.09 ie
   180 0.06 ir
    99 0.03 in
    47 0.02 oe
    33 0.01 g
    22 0.01 ieci
    20 0.01 if
    19 0.01 gcir
    16 0.01 or
    15 0.01
    14 0.00 gcie
     9 0.00 irci
     9 0.00 iecgci
     5 0.00 ieor
     5 0.00 icgci
     4 0.00 ieoe
     4 0.00 ieccccgci
     4 0.00 ic
     4 0.00 gcim
     3 0.00 oecgci
     3 0.00 iecccgci
     2 0.00 oeci
     2 0.00 o
     2 0.00 iroe
     2 0.00 iie
     2 0.00 ieo
     2 0.00 goe
     2 0.00 gcif
     2 0.00 gcieor
     2 0.00 gccci
     2 0.00 e
     1 0.00 oeor
     1 0.00 oecicg
     1 0.00 ocgci
     1 0.00 irorci
     1 0.00 irof
     1 0.00 iroeof
     1 0.00 ircif
     1 0.00 ircie
     1 0.00 ircgci
     1 0.00 ircccci
     1 0.00 ircccccgcie
     1 0.00 ioc
     1 0.00 imci
     1 0.00 ifo
     1 0.00 iecirci
     1 0.00 iecircg
     1 0.00 iecir
     1 0.00 iecim
     1 0.00 iecie
     1 0.00 iecgcgci
     1 0.00 iecg
     1 0.00 iecce
     1 0.00 ieccci
     1 0.00 iecccci
     1 0.00 iec
     1 0.00 icccgci
     1 0.00 gor
     1 0.00 gcircie
     1 0.00 gcirci
     1 0.00 gcircccci
     1 0.00 gcioe
     1 0.00 gcieccor
     1 0.00 gciA
     1 0.00 gcci
     1 0.00 gccgci
     1 0.00 gccccgci
 ----- ---- ----
  2958 1.00 TOT

Redefined suffixes, and added a few prefixes:

    split-prefix-suffix
    ------------------------------------------------
    #! /n/gnu/bin/gawk -f
    # Attempts to split words into prefix/suffix inserting ":" in between.

    BEGIN {
      PREFS = "^(AH|AP|Ae|AeH|H|P|cH|ccH|cccH|ccccH|cg|cgciH|ci|ciH|e|eH|ecccH|eccccH|oH|oP|oe|oeH|oecccH|coeH|r|c)"
      SUFFS = "([coe][cgiroe]*([mnrfe]|))$"
      SPLIT = ( PREFS SUFFS )
    }

    # Split only words that are exactly one prefix followed by one suffix.
    ( $0 ~ SPLIT ) {
      # Cut after the longest prefix that matches at the start of the word.
      match($0, PREFS)
      k = RLENGTH
      $0 = (substr($0, 1, k) ":" substr($0, k + 1))
      print
      next
    }

    # All other non-empty words are printed unchanged.
    /./ { print; next }
    ------------------------------------------------

Here are the suffixes recognized by this code:

   746 0.18 cccgci
   398 0.09 ccgci
   343 0.08 ccci
   325 0.08 ccccgci
   285 0.07 cim
   271 0.06 ci
   260 0.06 cci
   257 0.06 cie
   185 0.04 cir
   172 0.04 cccci
   127 0.03 oe
    98 0.02 cin
    51 0.01 or
    45 0.01 ccoe
    36 0.01 coe
    26 0.01 ccccci
    25 0.01 ccor
    25 0.01 cccoe
    21 0.00 cieci
    21 0.00 cgci
    19 0.00 e
    18 0.00 cif
    18 0.00 cccg
    18 0.00 cccccgci
    15 0.00 ccccg
    13 0.00 cccgcie
    13 0.00 ccc
    12 0.00 ccie
    12 0.00 cccir
    11 0.00 ccir
    11 0.00 ccgcir
    10 0.00 om
    10 0.00 cccie
     9 0.00 circi
     9 0.00 cccgcim
     8 0.00 oeci
     8 0.00 ciecgci
     8 0.00 ccccgcir
     7 0.00 cor
     7 0.00 cccgcir
     6 0.00 oeccccgci
     6 0.00 ce
     6 0.00 ccgcie
     6 0.00 ccg
     5 0.00 oecgci
     5 0.00 oecccci
     5 0.00 o
     5 0.00 coecgci
     5 0.00 cieor
     5 0.00 cicgci
     5 0.00 cccc
     5 0.00 c
     4 0.00 cieoe
     4 0.00 cieccccgci
     4 0.00 ccccc
     3 0.00 oecccgci
     3 0.00 eci
     3 0.00 ciecccgci
     3 0.00 cic
     3 0.00 ccif
     3 0.00 cccieci
     3 0.00 cccgoe
     3 0.00 ccccgcie
     3 0.00 cc
     2 0.00 oroe
     2 0.00 orcim
     2 0.00 oeor
     2 0.00 oeccci
     2 0.00 eor
     2 0.00 eoe
     2 0.00 eccci
     2 0.00 ecccgci
     2 0.00 eccccgci
     2 0.00 coeci
     2 0.00 circie
     2 0.00 ciie
     2 0.00 cieo
     2 0.00 cgoe
     2 0.00 cgcim
     2 0.00 ccgcim
     2 0.00 ccgcif
     2 0.00 cce
     2 0.00 cccor
     2 0.00 cccgcif
     2 0.00 cccgccci
     2 0.00 
ccccoe 2 0.00 ccccif 2 0.00 ccccie 1 0.00 oroeccccgci 1 0.00 orcin 1 0.00 orcie 1 0.00 orccci 1 0.00 oeof 1 0.00 oeoe 1 0.00 oecircir 1 0.00 oecim 1 0.00 oeciecccgci 1 0.00 oecgcie 1 0.00 oecgccccgci 1 0.00 oecccoe 1 0.00 oecccgcie 1 0.00 oecccg 1 0.00 oeccccg 1 0.00 oecccccgci 1 0.00 ocgcir 1 0.00 ocgcie 1 0.00 ocg 1 0.00 occci 1 0.00 occccgci 1 0.00 eorci 1 0.00 eof 1 0.00 eciecgci 1 0.00 ecgcir 1 0.00 ecccci 1 0.00 eccccc 1 0.00 ec 1 0.00 coeor 1 0.00 coeo 1 0.00 coecccci 1 0.00 cocgcim 1 0.00 co 1 0.00 cirorci 1 0.00 cirof 1 0.00 ciroeof 1 0.00 ciroe 1 0.00 circif 1 0.00 circgci 1 0.00 circccci 1 0.00 circccccgcie 1 0.00 cioc 1 0.00 ciecirci 1 0.00 ciecircg 1 0.00 ciecir 1 0.00 ciecim 1 0.00 ciecie 1 0.00 ciecgcgci 1 0.00 ciecce 1 0.00 cieccci 1 0.00 ciecccci 1 0.00 cieccccg 1 0.00 ciec 1 0.00 cicccgci 1 0.00 cgcircie 1 0.00 cgcircccci 1 0.00 cgcir 1 0.00 cgcieor 1 0.00 cgcieccor 1 0.00 cgcie 1 0.00 cg 1 0.00 cer 1 0.00 cecgcim 1 0.00 cecgci 1 0.00 cec 1 0.00 ccoeoe 1 0.00 ccoeo 1 0.00 ccoecicg 1 0.00 ccoeci 1 0.00 ccoecgci 1 0.00 ccoecccgci 1 0.00 ccoecccci 1 0.00 ccocgci 1 0.00 cco 1 0.00 cciror 1 0.00 ccirci 1 0.00 ccin 1 0.00 ccieoeci 1 0.00 ccieci 1 0.00 ccgor 1 0.00 ccgoe 1 0.00 ccgcioe 1 0.00 ccgcin 1 0.00 ccgcieor 1 0.00 ccgcci 1 0.00 ccgccgci 1 0.00 ccgccci 1 0.00 ccgccccgci 1 0.00 cccoecgci 1 0.00 ccciroe 1 0.00 cccieor 1 0.00 cccgor 1 0.00 cccgciecgci 1 0.00 ccccor 1 0.00 ccccir 1 0.00 ccccim 1 0.00 cccciecgci 1 0.00 cccciecg 1 0.00 ccccic 1 0.00 ccccgcirci 1 0.00 cccccgcie 1 0.00 cccccg 1 0.00 cccccci ----- ---- ---- 4207 1.00 TOT In reverse lex order: 5 0.00 c 3 0.00 cc 13 0.00 ccc 5 0.00 cccc 4 0.00 ccccc 1 0.00 eccccc 1 0.00 ec 1 0.00 cec 1 0.00 ciec 3 0.00 cic 1 0.00 ccccic 1 0.00 cioc 19 0.00 e 6 0.00 ce 2 0.00 cce 1 0.00 ciecce 257 0.06 cie 12 0.00 ccie 10 0.00 cccie 2 0.00 ccccie 1 0.00 ciecie 1 0.00 cgcie 6 0.00 ccgcie 13 0.00 cccgcie 3 0.00 ccccgcie 1 0.00 cccccgcie 1 0.00 circccccgcie 1 0.00 oecccgcie 1 0.00 oecgcie 1 0.00 ocgcie 2 0.00 
circie 1 0.00 cgcircie 1 0.00 orcie 2 0.00 ciie 127 0.03 oe 36 0.01 coe 45 0.01 ccoe 25 0.01 cccoe 2 0.00 ccccoe 1 0.00 oecccoe 2 0.00 eoe 4 0.00 cieoe 1 0.00 oeoe 1 0.00 ccoeoe 2 0.00 cgoe 1 0.00 ccgoe 3 0.00 cccgoe 1 0.00 ccgcioe 1 0.00 ciroe 1 0.00 ccciroe 2 0.00 oroe 18 0.00 cif 3 0.00 ccif 2 0.00 ccccif 2 0.00 ccgcif 2 0.00 cccgcif 1 0.00 circif 1 0.00 eof 1 0.00 oeof 1 0.00 ciroeof 1 0.00 cirof 1 0.00 cg 6 0.00 ccg 18 0.00 cccg 15 0.00 ccccg 1 0.00 cccccg 1 0.00 cieccccg 1 0.00 oeccccg 1 0.00 oecccg 1 0.00 cccciecg 1 0.00 ccoecicg 1 0.00 ocg 1 0.00 ciecircg 271 0.06 ci 260 0.06 cci 343 0.08 ccci 172 0.04 cccci 26 0.01 ccccci 1 0.00 cccccci 1 0.00 ecccci 1 0.00 ciecccci 5 0.00 oecccci 1 0.00 coecccci 1 0.00 ccoecccci 1 0.00 circccci 1 0.00 cgcircccci 2 0.00 eccci 1 0.00 cieccci 2 0.00 oeccci 1 0.00 ccgccci 2 0.00 cccgccci 1 0.00 occci 1 0.00 orccci 1 0.00 ccgcci 3 0.00 eci 21 0.00 cieci 1 0.00 ccieci 3 0.00 cccieci 8 0.00 oeci 2 0.00 coeci 1 0.00 ccoeci 1 0.00 ccieoeci 21 0.00 cgci 398 0.09 ccgci 746 0.18 cccgci 325 0.08 ccccgci 18 0.00 cccccgci 1 0.00 oecccccgci 2 0.00 eccccgci 4 0.00 cieccccgci 6 0.00 oeccccgci 1 0.00 oroeccccgci 1 0.00 ccgccccgci 1 0.00 oecgccccgci 1 0.00 occccgci 2 0.00 ecccgci 3 0.00 ciecccgci 1 0.00 oeciecccgci 3 0.00 oecccgci 1 0.00 ccoecccgci 1 0.00 cicccgci 1 0.00 ccgccgci 1 0.00 cecgci 8 0.00 ciecgci 1 0.00 cccciecgci 1 0.00 eciecgci 1 0.00 cccgciecgci 5 0.00 oecgci 5 0.00 coecgci 1 0.00 ccoecgci 1 0.00 cccoecgci 1 0.00 ciecgcgci 5 0.00 cicgci 1 0.00 ccocgci 1 0.00 circgci 9 0.00 circi 1 0.00 ccirci 1 0.00 ciecirci 1 0.00 ccccgcirci 1 0.00 eorci 1 0.00 cirorci 285 0.07 cim 1 0.00 ccccim 1 0.00 ciecim 1 0.00 oecim 2 0.00 cgcim 2 0.00 ccgcim 9 0.00 cccgcim 1 0.00 cecgcim 1 0.00 cocgcim 2 0.00 orcim 10 0.00 om 98 0.02 cin 1 0.00 ccin 1 0.00 ccgcin 1 0.00 orcin 5 0.00 o 1 0.00 co 1 0.00 cco 2 0.00 cieo 1 0.00 coeo 1 0.00 ccoeo 1 0.00 cer 185 0.04 cir 11 0.00 ccir 12 0.00 cccir 1 0.00 ccccir 1 0.00 ciecir 1 0.00 cgcir 11 0.00 ccgcir 7 
0.00 cccgcir 8 0.00 ccccgcir 1 0.00 ecgcir 1 0.00 ocgcir 1 0.00 oecircir 51 0.01 or 7 0.00 cor 25 0.01 ccor 2 0.00 cccor 1 0.00 ccccor 1 0.00 cgcieccor 2 0.00 eor 5 0.00 cieor 1 0.00 cccieor 1 0.00 cgcieor 1 0.00 ccgcieor 2 0.00 oeor 1 0.00 coeor 1 0.00 ccgor 1 0.00 cccgor 1 0.00 cciror These are the significant ones (15 or more occurrences): 19 0.00 e 257 0.06 cie 127 0.03 oe 36 0.01 coe 45 0.01 ccoe 25 0.01 cccoe 18 0.00 cif 18 0.00 cccg 15 0.00 ccccg 271 0.06 ci 260 0.06 cci 343 0.08 ccci 172 0.04 cccci 26 0.01 ccccci 21 0.00 cieci 21 0.00 cgci 398 0.09 ccgci 746 0.18 cccgci 325 0.08 ccccgci 18 0.00 cccccgci 285 0.07 cim 98 0.02 cin 185 0.04 cir 51 0.01 or 25 0.01 ccor Stripping the "[coe]c*" prefix of all suffixes: cat bio-j-huc-gut.wds \ | split-prefix-suffix \ | egrep ':' \ | sed -e 's/^.*:[coe]c*//g' \ | wfreq 1513 0.36 gci 1080 0.26 i 286 0.07 im 281 0.07 ie 209 0.05 ir 135 0.03 e 110 0.03 oe 99 0.02 in 56 0.01 51 0.01 r 42 0.01 g 37 0.01 or 29 0.01 gcir 25 0.01 ieci 25 0.01 gcie 23 0.01 if 13 0.00 gcim 10 0.00 m 10 0.00 irci 10 0.00 iecgci 8 0.00 eci 7 0.00 oecgci 6 0.00 ieor 6 0.00 goe 6 0.00 ecgci 6 0.00 eccccgci 5 0.00 icgci 5 0.00 ecccci 4 0.00 ieoe 4 0.00 ieccccgci 4 0.00 ic 4 0.00 gcif 3 0.00 oeci 3 0.00 iecccgci 3 0.00 gccci 3 0.00 ecccgci 2 0.00 roe 2 0.00 rcim 2 0.00 oeo 2 0.00 oecccci 2 0.00 o 2 0.00 iroe 2 0.00 ircie 2 0.00 iie 2 0.00 ieo 2 0.00 gor 2 0.00 gcieor 2 0.00 eor 2 0.00 eccci 1 0.00 roeccccgci 1 0.00 rcin 1 0.00 rcie 1 0.00 rccci 1 0.00 orci 1 0.00 of 1 0.00 oeor 1 0.00 oeoe 1 0.00 oecicg 1 0.00 oecccgci 1 0.00 ocgcim 1 0.00 ocgci 1 0.00 irorci 1 0.00 iror 1 0.00 irof 1 0.00 iroeof 1 0.00 ircif 1 0.00 ircgci 1 0.00 ircccci 1 0.00 ircccccgcie 1 0.00 ioc 1 0.00 ieoeci 1 0.00 iecirci 1 0.00 iecircg 1 0.00 iecir 1 0.00 iecim 1 0.00 iecie 1 0.00 iecgcgci 1 0.00 iecg 1 0.00 iecce 1 0.00 ieccci 1 0.00 iecccci 1 0.00 ieccccg 1 0.00 iec 1 0.00 icccgci 1 0.00 gcircie 1 0.00 gcirci 1 0.00 gcircccci 1 0.00 gcioe 1 0.00 gcin 1 0.00 gciecgci 1 0.00 
gcieccor
     1 0.00 gcci
     1 0.00 gccgci
     1 0.00 gccccgci
     1 0.00 er
     1 0.00 eof
     1 0.00 eoe
     1 0.00 ecircir
     1 0.00 ecim
     1 0.00 eciecccgci
     1 0.00 ecgcim
     1 0.00 ecgcie
     1 0.00 ecgccccgci
     1 0.00 ecccoe
     1 0.00 ecccgcie
     1 0.00 ecccg
     1 0.00 eccccg
     1 0.00 ecccccgci
     1 0.00 ec
 ----- ---- ----
  4207 1.00 TOT

The significant ones (mostly >40 cases):

    56 0.01 _
   135 0.03 e
    42 0.01 g
  1513 0.36 gci
  1080 0.26 i
   281 0.07 ie
    23 0.01 if
   286 0.07 im
    99 0.02 in
   209 0.05 ir
   110 0.03 oe
    37 0.01 or
    51 0.01 r

Redefined split-prefix-suffix accordingly:

    SUFFS = "([coe]c*(|e|g|gci|i|ie|if|im|in|ir|oe|or|r))$"

Let's see what prefixes we get now:

    cat bio-j-huc-gut.wds \
      | split-prefix-suffix \
      | egrep ':' \
      | sed -e 's/:.*$//g' \
      | wfreq

  1000 0.25 AH
   921 0.23 c
   412 0.11 oH
   307 0.08 cg
   194 0.05 e
   173 0.04 cccH
   147 0.04 oe
   134 0.03 H
   107 0.03 ccccH
   103 0.03 oeH
    65 0.02 r
    51 0.01 ciH
    50 0.01 ci
    40 0.01 eH
    40 0.01 P
    38 0.01 Ae
    34 0.01 oP
    21 0.01 cH
    21 0.01 AP
    19 0.00 ccH
    11 0.00 AeH
    10 0.00 coeH
     9 0.00 eccccH
     8 0.00 cgciH
     5 0.00 ecccH
     2 0.00 oecccH
 ----- ---- ----
  3922 1.00 TOT

Keeping only the significant ones (>20 occurrences):

  1000 0.25 AH
    21 0.01 AP
    38 0.01 Ae
   134 0.03 H
    40 0.01 P
   921 0.23 c
    21 0.01 cH
   173 0.04 cccH
   107 0.03 ccccH
   307 0.08 cg
    50 0.01 ci
    51 0.01 ciH
   194 0.05 e
    40 0.01 eH
   412 0.11 oH
    34 0.01 oP
   147 0.04 oe
   103 0.03 oeH
    65 0.02 r

Fixing split-prefix-suffix:

    PREFS = "^(AH|AP|Ae|H|P|c|cH|cccH|ccccH|cg|ci|ciH|e|eH|oH|oP|oe|oeH|r)"

Listing again the suffixes, without "[coe]c*" prefixes:

    cat bio-j-huc-gut.wds \
      | split-prefix-suffix \
      | egrep ':' \
      | sed -e 's/^.*:[coe]c*//g' \
      | wfreq

  1497 0.39 gci
  1042 0.27 i
   284 0.07 im
   278 0.07 ie
   208 0.05 ir
   132 0.03 e
   109 0.03 oe
    99 0.03 in
    56 0.01
    51 0.01 r
    42 0.01 g
    37 0.01 or
    23 0.01 if
 ----- ---- ----
  3858 1.00 TOT

The complete suffixes:

   736 0.19 cccgci
   392 0.10 ccgci
   333 0.09 ccci
   325 0.08 ccccgci
   283 0.07 cim
   257 0.07 ci
   254 0.07 cie
   247 0.06 cci
   184 0.05 cir
   171 0.04 cccci
   124 0.03 oe
    98 0.03 cin
    51 0.01 or
    45 0.01 ccoe
    35 0.01 coe
    26 0.01 ccccci
    25 0.01 ccor
    25 0.01 cccoe
    21 0.01 cgci
    19 0.00 e
    18 0.00 cif
    18 0.00 cccg
    18 0.00 cccccgci
    15 0.00 ccccg
    13 0.00 ccc
    12 0.00 ccie
    12 0.00 cccir
    11 0.00 ccir
    10 0.00 cccie
     7 0.00 cor
     6 0.00 ce
     6 0.00 ccg
     5 0.00 o
     5 0.00 cccc
     5 0.00 c
     4 0.00 ccccc
     3 0.00 eci
     3 0.00 ccif
     3 0.00 cc
     2 0.00 eor
     2 0.00 eoe
     2 0.00 eccci
     2 0.00 ecccgci
     2 0.00 eccccgci
     2 0.00 cce
     2 0.00 cccor
     2 0.00 ccccoe
     2 0.00 ccccif
     2 0.00 ccccie
     1 0.00 ocg
     1 0.00 occci
     1 0.00 occccgci
     1 0.00 ecccci
     1 0.00 eccccc
     1 0.00 ec
     1 0.00 cg
     1 0.00 ccin
     1 0.00 ccccor
     1 0.00 ccccir
     1 0.00 ccccim
     1 0.00 cccccg
     1 0.00 cccccci
 ----- ---- ----
  3858 1.00 TOT

Removed suffixes beginning with "e":

    SUFFS = "([co]c*(|e|g|gci|i|ie|if|im|in|ir|oe|or|r))$"

Tabulating prefixes:

   998 0.26 AH
   920 0.24 c
   410 0.11 oH
   302 0.08 cg
   194 0.05 e
   173 0.05 cccH
   145 0.04 oe
   134 0.04 H
   107 0.03 ccccH
   103 0.03 oeH
    65 0.02 r
    51 0.01 ciH
    40 0.01 P
    38 0.01 eH
    38 0.01 Ae
    34 0.01 oP
    29 0.01 ci
    21 0.01 cH
    21 0.01 AP
 ----- ---- ----
  3823 1.00 TOT

It looks like "P" is equivalent to "H": the prefixes "P", "AP", and "oP" mirror "H", "AH", and "oH".
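The split-and-tabulate step above can be sketched in Python for readers without the original tools. This is only an approximation under stated assumptions: the prefix and suffix inventories are copied from the last PREFS and SUFFS definitions above, `split_word` and `tabulate` are hypothetical names, `tabulate` only guesses at the output of `wfreq` (count, fraction, item, most frequent first), and the word list is a toy sample rather than the real bio-j-huc-gut.wds file.

```python
import re
from collections import Counter

# Prefix and suffix inventories copied from the final PREFS / SUFFS above.
PREFIXES = ["AH", "AP", "Ae", "H", "P", "c", "cH", "cccH", "ccccH", "cg",
            "ci", "ciH", "e", "eH", "oH", "oP", "oe", "oeH", "r"]
SUFFIX_RE = re.compile(r"[co]c*(|e|g|gci|i|ie|if|im|in|ir|oe|or|r)")


def split_word(word):
    """Return 'prefix:suffix' if the word fits the model, else the word unchanged."""
    candidates = [p for p in PREFIXES if word.startswith(p)]
    # The word qualifies if some prefix leaves a remainder that is a valid suffix.
    if not any(SUFFIX_RE.fullmatch(word[len(p):]) for p in candidates):
        return word
    # Like the gawk script, cut after the longest prefix matching at the start.
    k = max(len(p) for p in candidates)
    return word[:k] + ":" + word[k:]


def tabulate(items):
    """Assumed wfreq-like summary: (count, fraction, item), most frequent first."""
    counts = Counter(items)
    total = sum(counts.values())
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(n, n / total, item) for item, n in ordered]


# Toy sample, NOT the real word list.
words = ["AHccgci", "AHccgci", "oHcim", "cgcim", "ccccgci", "oeHcim"]
split = [split_word(w) for w in words]
prefixes = [s.split(":")[0] for s in split if ":" in s]
for n, frac, item in tabulate(prefixes):
    print(f"{n:5d} {frac:4.2f} {item}")
```

Python's `re` module tries alternatives in order rather than picking the longest match as gawk does, which is why the sketch selects the longest prefix explicitly instead of reusing the PREFS regular expression.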
Recomputing prefix/suffix table:

    /bin/rm -f .title
    /bin/rm -f .table
    /bin/touch .table
    set noglob
    set ofmt = "0"
    set npat = 1
    foreach pat ( \
        'AH'    'c'    'oH'  'cg'  \
        'e'     'cccH' 'oe'  'H'   \
        'ccccH' 'oeH'  'r'   'ciH' \
        'P'     'eH'   'Ae'  'oP'  \
        'ci'    'cH'   'AP'        \
      )
      /n/gnu/bin/printf " %7s" "${pat}" >> .title
      /bin/cat bio-j-huc-gut.wds \
        | /n/gnu/bin/sed -e 's/^/_/g' -e 's/$/_/g' \
        | /n/gnu/bin/egrep "_${pat}[co][^HPA]*_" \
        | /n/gnu/bin/sed -e "s/_${pat}//g" -e '/../s/_$//g' \
        | /n/gnu/bin/sort | uniq -c \
        | /n/gnu/bin/expand \
        > .suff.frq
      /n/gnu/bin/join -a 1 -a 2 -1 1 -2 2 -o"${ofmt},2.1" -e 0 .table .suff.frq > .tmp
      /bin/mv .tmp .table
      @ npat = ${npat} + 1
      set ofmt = "${ofmt},1.${npat}"
    end
    unset noglob
    /n/gnu/bin/printf "\n" >> .title

    cat .table \
      | /n/gnu/bin/gawk '/./ { s=0; for(i=2;i<=NF;i++) s+=$(i); print s, $0 }' \
      | sort -nr \
      > .tbsort

    cat .title .tbsort \
      | format-suffix-table

              TOTAL    AH    oH     c     H   oeH   ciH    cH    eH  cccH ccccH    cg     P     e    oe    Ae     r    oP    ci    AP
              ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- -----
SUFFIX \ TOT   4104  1038   446   998   148   105    53    21    43   173   109   338    63   212   150    38    73    40    34    22
------------ ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- -----
ci              257    79    28     1     2    10     1     1     2    31    26    36     0     6    22     7     2     2     0     1
cci             247    43    22    14     2     4     2     1     1    90    68     0     0     0     0     0     0     0     0     0
ccci            333    87    35   139     5    19    10     6     3    11     4     2     0     5     4     2     1     0     0     0
cccci           171    12    13    61    10     3     0     0     1     1     0     3     3    20    19     9     5     4     5     2
ccccci           26     1     1     1     1     1     1     0     0     0     0     3     0     6     4     0     1     2     4     0
cccccci           1     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     1     0
cgci             21     1     0     0     0     0     0     0     0     0     0     0     0     8     9     1     0     0     1     1
ccgci           392   201    85    21    28    20    12     3     8    10     2     2     0     0     0     0     0     0     0     0
cccgci          736   191    57   387    16    15    12     4     7    15     5     5     3     7     6     1     1     2     0     2
ccccgci         325    19    10    60    18     2     1     0     2     0     0    27    17    73    38     7    10    18    12    11
cccccgci         18     2     0     0     0     0     0     0     0     0     0     2     0     6     1     3     2     0     2     0
cim             283    94    41    30     7     7     4     0     6     2     0    72     0     2     3     2    13     0     0     0
ccccim            1     0     0     1     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0
cie             254   116    40    14    10     5     3     1     0     1 
0 50 0 3 2 2 6 1 0 0 ccie 12 1 0 7 0 0 0 1 0 1 2 0 0 0 0 0 0 0 0 0 cccie 10 0 1 7 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 ccccie 2 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 cir 184 50 36 11 8 7 2 1 4 3 0 51 1 2 1 1 3 2 1 0 ccir 11 1 0 10 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 cccir 12 0 1 8 1 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 ccccir 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 cin 98 54 16 0 2 5 2 0 2 0 0 12 0 0 3 1 1 0 0 0 ccin 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 cif 18 4 0 0 1 1 0 0 0 2 0 6 0 1 0 0 3 0 0 0 ccif 3 0 0 3 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 oe 124 21 10 25 6 1 0 0 1 0 0 17 8 16 7 1 9 1 0 1 coe 35 1 6 23 3 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 ccoe 45 1 1 33 4 0 0 0 0 0 0 3 2 0 0 0 0 0 0 1 cccoe 25 0 0 4 3 0 1 0 0 0 0 4 4 3 3 1 0 0 1 1 or 51 4 2 10 4 0 0 0 0 0 0 3 0 10 13 0 3 1 0 1 cor 7 4 0 1 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 ccor 25 0 1 17 0 0 0 0 0 0 0 2 2 3 0 0 0 0 0 0 cccor 2 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 ccccor 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 om 10 1 0 5 0 1 0 0 0 0 0 0 1 0 0 0 2 0 0 0 cccg 18 5 3 7 2 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 ccccg 15 1 0 1 0 0 0 0 0 1 0 0 0 5 3 0 2 1 1 0 cieci 21 7 4 1 1 0 0 0 0 0 0 6 0 0 0 0 1 1 0 0 cccgcie 13 2 1 9 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 ccc 13 1 0 10 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 ccgcir 11 4 4 3 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 cccgcim 9 0 0 9 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 oeci 8 0 0 1 0 0 0 0 0 0 0 0 0 4 0 0 1 2 0 0 circi 8 1 0 0 1 0 0 0 0 0 0 5 0 0 0 0 1 0 0 0 ciecgci 8 3 2 0 0 0 0 0 0 0 0 3 0 0 0 0 0 0 0 0 ccccgcir 8 0 0 1 1 0 0 0 0 0 0 0 4 1 0 0 0 0 1 0 cccgcir 7 0 0 4 0 0 2 0 0 0 0 0 0 0 1 0 0 0 0 0 oeccccgci 6 1 0 2 0 0 0 0 0 0 0 1 2 0 0 0 0 0 0 0 ccg 6 3 1 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 ce 6 0 0 6 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ccgcie 6 2 2 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 oecgci 5 1 1 1 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 oecccci 5 0 0 3 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 o 5 0 0 0 0 0 0 0 0 0 0 0 0 4 1 0 0 0 0 0 coecgci 5 0 1 3 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 cieor 5 0 4 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 cicgci 5 3 1 0 0 
1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 cccc 5 0 0 4 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 c 5 0 0 1 0 0 0 0 0 2 0 0 0 0 2 0 0 0 0 0 cieoe 4 1 2 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 cieccccgci 4 0 1 0 0 0 0 0 0 0 0 2 0 0 0 0 0 1 0 0 ccccc 4 0 0 0 0 0 0 0 0 0 0 1 0 1 1 0 0 0 1 0 oecccgci 3 0 0 0 1 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 ciecccgci 3 0 1 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 cic 3 2 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 cccieci 3 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 cccgoe 3 0 0 3 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ccccgcie 3 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 1 0 0 cc 3 1 0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 oroe 2 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 orcim 2 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 oeor 2 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 oeccci 2 0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 coeci 2 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 circie 2 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ciie 2 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 cieo 2 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 cgoe 2 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 cgcim 2 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 ccgcim 2 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ccgcif 2 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 cce 2 0 0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 cccgcif 2 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 cccgccci 2 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 ccccoe 2 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 ccccif 2 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 oroeccccgci 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 orcin 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 orcie 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 orccci 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 oeof 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 oeoe 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 oecircir 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 oecim 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 oeciecccgci 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 oecgcie 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 oecgccccgci 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 oecccoe 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 oecccgcie 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 
0 oecccg 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 oeccccg 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 oecccccgci 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 ocgcir 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ocgcie 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 ocg 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 occci 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 occccgci 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 coeor 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 coeo 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 coecccci 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 cocgcim 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 co 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 cirorci 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 cirof 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 ciroeof 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 circif 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 circgci 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 circccci 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 circccccgcie 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 cioc 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 cino 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 cimci 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 cifo 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 ciecirci 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 ciecircg 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 ciecir 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 ciecim 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 ciecie 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 ciecgcgci 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ciecce 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 cieccci 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ciecccci 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 cieccccg 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ciec 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 cicccgci 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 cgcircie 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 cgcircccci 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 cgcir 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 cgcieor 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 cgcieccor 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 cgcie 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
cg 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 cer 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 cecgcim 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 cecgci 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 cec 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ccoeoe 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ccoeo 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ccoecicg 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 ccoecgci 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ccoecccgci 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ccoecccci 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ccocgci 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 cco 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 cciror 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ccirci 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ccieoeci 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ccieci 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 ccgor 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ccgoe 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ccgcioe 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ccgcin 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ccgcieor 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ccgcci 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 ccgccgci 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ccgccci 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ccgccccgci 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 cccoecgci 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 ccciroe 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 cccieor 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 cccgor 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 cccgciecgci 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 cccciecgci 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 cccciecg 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 ccccic 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ccccgcirci 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 cccccgcie 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 cccccg 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 Redefined split-prefix-suffix with only the significant suffixes: SUFFS = "(ccc3ci|cc3ci|ccci|cccc3ci|cim|ci|cix|cci|ci2|cccci|ox|cin|o2|ccox|cox|ccccci|cccox|cco2)$"
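
The prefix-by-suffix table above was built with a csh loop around egrep, sort, uniq and join. As a rough illustration only, the same cross-tabulation can be sketched in Python. The names `suffix_table` and `sorted_rows` are hypothetical, the filter copies the `_${pat}[co][^HPA]*_` pattern from the loop, and the words are a toy sample rather than the real bio-j-huc-gut.wds file.

```python
import re
from collections import Counter, defaultdict

# Prefix patterns in the order used by the csh foreach loop above.
PREFIXES = ["AH", "c", "oH", "cg", "e", "cccH", "oe", "H", "ccccH", "oeH",
            "r", "ciH", "P", "eH", "Ae", "oP", "ci", "cH", "AP"]


def suffix_table(words, prefixes=PREFIXES):
    """Map suffix -> Counter of prefixes, using the same filter as the egrep step."""
    table = defaultdict(Counter)
    for pat in prefixes:
        # Whole word = prefix, then a suffix starting with c/o and free of H, P, A.
        rx = re.compile(re.escape(pat) + r"([co][^HPA]*)")
        for w in words:
            m = rx.fullmatch(w)
            if m:
                table[m.group(1)][pat] += 1
    return table


def sorted_rows(table, prefixes=PREFIXES):
    """Rows of (row total, suffix, per-prefix counts), largest totals first."""
    rows = [(sum(c.values()), suf, [c[p] for p in prefixes])
            for suf, c in table.items()]
    rows.sort(key=lambda r: (-r[0], r[1]))
    return rows


# Toy sample, NOT the real word list.
words = ["AHccgci", "AHccgci", "oHccgci", "ccccgci", "AHcim", "cgcim"]
table = suffix_table(words)
print(" " * 18 + " ".join(f"{p:>5s}" for p in PREFIXES))
for total, suf, cells in sorted_rows(table):
    print(f"{suf:<12s} {total:5d} " + " ".join(f"{n:5d}" for n in cells))
```

Keeping one counter per suffix replaces the repeated join against a growing .table file; ties between rows with equal totals are broken alphabetically here, which may differ from the order produced by `sort -nr`.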