Hacking at the Voynich manuscript
Notebook - volume 2

Warning: these notebooks aren't strictly chronological logs.
Sometimes I go back and redo things, clarify comments, delete
garbage, etc.

Summary of previous notebooks
=============================

  On 97-07-05 I obtained Landini's interlinear transcription of the
  VMs, version 1.6 (landini-interln16.evt) from
  http://sun1.bham.ac.uk/G.Landini/evmt/intrln16.zip

  I manually extracted from it a homogeneous, full-text sample
  bio-m-evt.evt, consisting of pages 147-166 (f75r--f84v) of the
  "biological" section, in Currier's Language B, hand 2. This section
  includes Currier's and Friedman's transcriptions. Currier's seems
  to be the more complete of the two.

  I played around with the file over the next three or four days. The
  boring details are in Notebook-1.txt. I then decided it was time to
  start all over again from the beginning.

97-07-09 stolfi
===============

  From this preliminary hacking, I drew the following conclusions:

    The manuscript does not appear to use any hyphenation mark.
    Either words are not broken across lines, which would be
    unusual, or they are broken without any extra marks. Such word
    breaks may result in statistical anomalies at the beginning and
    end of lines. Could this explain Currier's claim that lines are
    "functional units"?

    Comparing the two versions (Currier and Friedman), and looking
    at the word statistics, it seems that both are highly
    contaminated with errors (5-10% of the words). This large amount
    of noise will mess up any statistical analysis based on either
    text alone.

  Therefore, before spending more time on the analysis, I must first
  prepare a "corrected" interlinear where discrepancies between FSG
  and Currier are resolved, taking into account the probabilities
  above.

  Looking at the actual shape of the characters, I realized that the
  FSG encoding was not very good for my purposes, since it assigns
  completely different codes to glyphs which may be just calligraphic
  variations of the same grapheme.
  Thus I decided to do most processing using a more analytical
  encoding, which can be lumped later. I considered using Jacques
  Guy's "Neo-Frogguy" or "Gui2" encoding, but even that is a bit too
  synthetic --- for example, his <2> should be "i'", and his <9>
  should be `c)', for consistency. (The statistics on the occurrence
  of repeated s apparently confirm this choice). Thus I decided to
  define my own "super-analytic" or "JSA" encoding.

My super-analytic encoding
--------------------------

  The idea is to break all characters down to individual "logical"
  strokes, and use one (computer) character to encode each stroke.

  There is some question as to what is a logical stroke, and when two
  strokes are different. Obviously, the definition of a stroke must
  include not only its shape but also the way it connects to the
  neighboring strokes; and, given the irregularity of handwritten
  glyphs, that may be hard to decide.

  For instance, FSG's [A] character can be broken down into two
  strokes, shaped like the [C] and [I] glyphs. Supposedly, the
  difference between an [A] and a [CI] is that in the former the
  strokes are connected into a closed shape. Is this difference
  significant?

  I checked the occurrences of [CI], [CM], and [CN] in the
  interlinear file. Two things are curious. First, these combinations
  are extremely rare. Second, a good many of them are transcribed
  differently by Currier and the FSG: where one has [CIIR] the other
  often has [AIR], and vice-versa. Same for [CM] versus [AN], etc.

  In light of these observations, I have decided to treat all
  occurrences of [A] as [CI]. If the two are indeed different, that
  will be just one more ambiguity added to the inherent ambiguity of
  natural language; so it cannot make the decipherment task much more
  difficult. Confusing the two will change the letter frequencies, it
  is true; but, since the language does not appear to be a
  standardized one, there is not much information we can extract from
  absolute letter frequencies.
  The methods we hope to use --- such as automaton analysis --- are
  not significantly disturbed by collapsing letters. On the other
  hand, if [A] and [CI] are the same grapheme, using different
  encodings will seriously confuse statistics --- especially if the
  spacing depends on the immediate context.

  For similar reasons, it is best to ignore the distinction between
  [T] and [CC], or between [S] and [2C]. The ligature is often lost,
  and we don't know whether it is significant.

  Also, the characters that Currier transcribes as [6] are usually
  transcribed [K] by Friedman, and the two are very similar.
  Strangely [K] seems to occur mostly at the end of *lines*.

  The characters [7] [V] [Y] do not occur in this corpus.

  Summarizing, the JSA encoding breaks down every character into
  strokes, which are cast into one of these types:

    1. "Body" strokes:

         q   same as FSG [4], Guy <4>; also part of [H], [P], [HZ], ...

         o   same as [O]

         c   same as [C]; also part of [A], [8], etc.

         i   same as [I]; also part of [A], [M], [N], [R], etc.

         l   long vertical bar of [D], [F], [DZ], [FZ]

    2. "Limb" strokes ("flourishes", "plumes", ...)

         g   an 8-shaped loop with both ends attached to the previous
             letter, as the right three-fourths of [8] and [7]; and
             also the right-hand swirls of [P], [F], [PZ], [FZ].

         y   a curving descender shaped like a right-parenthesis,
             attached to the top of the preceding stroke; the
             right-hand stroke of [G] = <9>

         s   a plume attached to the top of the preceding char,
             pointing NE and curving up, as in [2], [R] = <2>, and [S]

         x   a hook attached to the top of an \i/ stroke, curving
             sharply down and crossing the \i/; half of [E].

         j   a P-shaped loop with one end attached to the top of the
             previous stroke, and the other extending straight down;
             as in the right half of [H], [D], [HZ], [DZ], and [K].

         u   a plume similar to \s/, but attached to the *bottom* of
             the preceding stroke; as in [L], [N], [M].

  The ligature in [T] is ignored, i.e.
  Guy's and are identified with his , and denoted uniformly by \c/.
  This identification is consistent with the digraph statistics.

  The character is rendered \ci/. In fact, is probably not a letter
  --- it appears to be a \c/ stroke (possibly half of the preceding
  letter) accidentally connected to an \i/ stroke (probably the
  beginning of the next letter).

  The weirdo symbols [Y], [V], etc. will be translated as \?/.

  The FSG -> JSA correspondence is, therefore

    IIIK -> iiiij     IE -> iix     A -> ci      N -> iiu
    IIIL -> iiiiu     IR -> iis     C -> c       O -> o
    IIIR -> iiiis     IK -> iij     D -> lj      P -> qg
    IIIE -> iiiix     2  -> cs      E -> ix      R -> is
    IIE  -> iiix      4  -> q       F -> lg      S -> csc
    IIR  -> iiis      6  -> cj      G -> cy      T -> cc
    IIK  -> iiij      7  -> ig      H -> qj      V -> ?
    HZ   -> cqjc      8  -> cg      I -> i       Y -> ?
    PZ   -> cqgc                    K -> ij
    DZ   -> cljc                    L -> iu
    FZ   -> clgc                    M -> iiiu

  Note that the \i/ groups have one more \i/ in JSA than they have in
  Guy's encoding. This is redundant but makes it more evident that , ,
  <2> are homologous members of their respective series. Also, this
  encoding fixes a minor discrepancy of Guy2, which uses one extra
  \i/ in the series , , ... .

Ad-hoc encodings
----------------

  After mapping everything to the JSA encoding, and looking at the
  digraph frequency tables, I observed that:

    The stroke `l' is always followed by either `j' or `g', hence
    `lj' and `lg' should be single letters.

  Note also that there are two clearly different kinds of strokes,
  "body" B = {`c',`o',`t',`i',`q',`l'} and "limb" L =
  {`u',`x',`y',`j',`g',`s'}. If we reduce the digraph count matrix to
  these two classes, plus word break W (the unlabeled first row and
  first column), we get

                    B      L
          -----  -----  -----
              .   6420      .
        B    59  19849  15616
        L  6361   9255      .
          -----  -----  -----

  Next-symbol probabilities (× 99):

                    B      L
          -----  -----  -----
              .     99      .
        B     .     55     44
        L    40     59      .
          -----  -----  -----

  Previous-symbol probabilities (× 99):

                    B      L
          -----  -----  -----
              .     18      .
        B     1     55     99
        L    98     26      .
          -----  -----  -----

  Note that every word begins with a body stroke; this was expected
  from the definition of the limb strokes (they can be recognized
  only by their relationship to a previous stroke). Note also that a
  limb stroke cannot be followed by another limb stroke; this too is
  not wholly unexpected.

  The surprise is that almost no words *end* in a body stroke. The
  least rare body stroke in word-final position is `o'. The words
  that end in body strokes appear to be errors or the result of
  breaking a line in the middle of a word.

  An interesting observation from the body/limb frequency tables
  above is that the transition probabilities from a body stroke to a
  body stroke and to a limb stroke are respectively 55% and 45%.
  Thus, if the limb strokes mark the end of a syllable (or letter?),
  the average number of body strokes in a syllable is 1/0.45, i.e.
  slightly over 2. (Considering that we are counting each \i/ as a
  body stroke, the correct number may well be precisely 2.)

97-07-10 stolfi
===============

  Restarted everything from the beginning.

    cat bio-m-evt.evt \
      | fsg2jsa \
      > bio-m-jsa.evt

  Prepared a raw text file for training

    cat bio-m-jsa.evt \
      | egrep '^<.*;[FC]> ' \
      | sed \
          -e 's/<.*;[FC]> *//g' \
          -e 's/{[^}]*}//g' \
      > bio-m-jsa.txt

  Note that fsg2jsa removes the "%" and "!" characters, so the lines
  in the *-jsa.evt output files are not aligned (to align them we
  must run some dynamic programming).
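  The fsg2jsa script itself is not reproduced in this notebook. As a
  rough illustration only, the FSG -> JSA correspondence table above
  could be applied with a sed pipeline along the following lines. The
  function name fsg2jsa_sketch is hypothetical, [4] is assumed to map
  to the `q' stroke as in the stroke list, and the input is assumed
  to be bare FSG text (no <...> locators or {...} comments, whose
  capital letters would otherwise be recoded too).

```shell
#!/bin/sh
# Hypothetical stand-in for fsg2jsa; NOT the script used in this notebook.
# Assumes each input line holds bare FSG text only.
fsg2jsa_sketch() {
  sed \
    -e 's/[%!]//g' \
    -e 's/HZ/cqjc/g' -e 's/PZ/cqgc/g' \
    -e 's/DZ/cljc/g' -e 's/FZ/clgc/g' \
    -e 's/A/ci/g'  -e 's/C/c/g'   -e 's/D/lj/g'  -e 's/E/ix/g' \
    -e 's/F/lg/g'  -e 's/G/cy/g'  -e 's/H/qj/g'  -e 's/I/i/g' \
    -e 's/K/ij/g'  -e 's/L/iu/g'  -e 's/M/iiiu/g' -e 's/N/iiu/g' \
    -e 's/O/o/g'   -e 's/P/qg/g'  -e 's/R/is/g'  -e 's/S/csc/g' \
    -e 's/T/cc/g'  -e 's/[VY]/?/g' \
    -e 's/2/cs/g'  -e 's/4/q/g'   -e 's/6/cj/g' \
    -e 's/7/ig/g'  -e 's/8/cg/g'
}

# Tiny demonstration on a made-up FSG word.
printf '4ODCCG\n' | fsg2jsa_sketch    # prints qoljcccy
```

  The two-letter codes (HZ, PZ, DZ, FZ) are rewritten before the
  single letters so that [H], [P], [D], [F] do not consume their
  first half. The \i/ groups need no rules of their own, since e.g.
  [IIR] is just \i/ + \i/ + \is/. Every replacement is lowercase (or
  `?'), so no rule can re-match the output of an earlier one.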
cat bio-m-jsa.txt \ | grep -v '[*]' \ | sed -e 's/^/ /g' \ > bio-m-jsa-trainset.txt cat bio-m-jsa-trainset.txt \ | generate-fix-patterns -vMINOCC=10 \ > bio-m-jsa-fixer.sed lines words bytes file ------ ------- --------- ------------ 1669 3486 106719 bio-m-evt.evt 1530 1530 75404 bio-m-evt.txt 1530 1530 117592 bio-m-jsa.txt 1470 1470 115821 bio-m-jsa-trainset.txt 596 716 9932 bio-m-jsa-fixer.sed Generate consensus: cat bio-m-jsa.evt \ | make-consensus-interlin \ > bio-x-jsa.evt cat bio-x-jsa.evt \ | egrep '^<.*;J> ' \ | sed \ -e 's/{[^}]*}//g' \ > bio-j-jsa.evt extract-words-from-interlin \ -chars "qocilgysxju" \ bio-j-jsa.evt \ bio-j-jsa lines words bytes file ------ ------- --------- ------------ 7054 7054 62690 bio-j-jsa.wds 2132 2132 24925 bio-j-jsa.dic 4661 4661 40897 bio-j-jsa-gut.wds 992 992 9720 bio-j-jsa-gut.dic 840 840 2445 bio-j-jsa-fun.wds 2 2 5 bio-j-jsa-fun.dic 1553 1553 19348 bio-j-jsa-bad.wds 1138 1138 15200 bio-j-jsa-bad.dic Digraph counts: q o c i l g y s x j u TOT ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- . 1398 965 1877 361 60 . . . . . . 4661 q 1 . 1229 18 . 1 154 . . . 700 . 2103 o 21 486 1 63 1087 1071 . . . . . . 2729 c 4 167 176 6137 1209 232 2114 2921 1019 . . . 13979 i 4 1 1 8 1997 2 . . 560 1616 37 457 4683 l . . . . . . 16 . . . 1566 . 1582 g 52 . 74 2150 4 4 . . . . . . 2284 y 2790 26 2 47 13 43 . . . . . . 2921 s 463 1 99 1013 1 2 . . . . . . 1579 x 827 24 105 488 5 167 . . . . . . 1616 j 46 . 76 2175 6 . . . . . . . 2303 u 453 . 1 3 . . . . . . . . 457 ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- TOT 4661 2103 2729 13979 4683 1582 2284 2921 1579 1616 2303 457 40897 Word length statistics: cat bio-j-jsa-gut.wds \ | tr 'a-z0-9' '..........................................................................' \ | sort | uniq -c 2 . 21 .. 177 ... 176 .... 295 ..... 568 ...... 640 ....... 1021 ........ 793 ......... 627 .......... 184 ........... 91 ............ 
40 ............. 12 .............. 11 ............... 1 ................ 2 ................. Ditto, removing limb letters: cat bio-j-jsa-gut.wds \ | tr -d 'gysxju' \ | tr 'a-z0-9' '..........................................................................' \ | sort | uniq -c 19 . 227 .. 334 ... 638 .... 1115 ..... 1290 ...... 685 ....... 263 ........ 72 ......... 13 .......... 3 ........... 2 ............ Treating limbs as end-of-words: cat bio-j-jsa-gut.wds \ | sed -e 's/\([gysxju]\)/\1 /g' \ | tr ' ' '\012' \ | egrep '.' \ | sort | uniq -c \ | sort +0 -1nr \ | pr -4 -t -s' ' \ | expand 2058 cy 27 ciij 4 ciiu 1 ciiiis 973 cs 24 ccccqj 4 cix 1 ciocs 821 qolj 20 qoqg 4 ocqj 1 ciqj 721 cccg 19 ciiis 4 oij 1 coclj 628 oix 16 ccois 4 olg 1 cocqg 454 ccccg 15 ccccs 4 qoclj 1 ij 438 cg 15 cqj 3 c 1 occcciiiu 436 ccg 13 ccciix 3 ccciij 1 occcg 374 ciix 12 ccciis 3 ci 1 occcy 353 cccy 12 o 3 colj 1 ocy 327 ciiiiu 11 ccix 3 qcccg 1 oiis 303 ix 11 clj 2 ccccois 1 oiiu 272 ccy 10 cccqg 2 ccis 1 oqo 269 lj 9 cois 2 ccocg 1 oqoix 245 ciis 9 cqg 2 ccolj 1 oqolg 237 olj 9 ocg 2 ccoqj 1 oqolj 219 oqj 9 oiiiu 2 cicg 1 q 205 qoqj 8 cccciix 2 ciiix 1 qccccg 186 ccccy 8 cciis 2 cilj 1 qcccy 138 ois 8 ccs 2 cis 1 qccy 134 qj 7 cccs 2 clg 1 qci 133 qoix 7 cciix 2 co 1 qcs 108 ciiiu 7 ccqg 2 occccg 1 qcy 101 ccclj 7 lg 2 qclj 1 qlj 86 is 7 qcqj 2 qoccccg 1 qoccccy 71 qg 7 qocg 2 qocqj 1 qocccy 65 cclj 7 qois 1 cc 1 qocclj 55 ccoix 6 cccois 1 ccccciiii 1 qociiiiu 54 cccqj 6 oclj 1 cccccoix 1 qociis 44 cccccy 6 qo 1 cccciij 1 qociix 37 cccclj 6 qocccg 1 ccccoix 1 qocy 37 cccoix 5 cccccs 1 ccccqg 1 qoiiu 36 coix 5 cccciis 1 cciij 1 qolg 35 oqg 5 ocs 1 cclg 1 qooix 32 ccqj 4 cics 1 ciclj 1 qoqolj 30 cccccg 4 ciiiiiu 1 cicqj Obvously, not all "letters" or "syllabes" end with limb strokes; some "cc" groups must be broken, too. (Many are probably the [T] and [S] characters.) 
Here is the table again, sorted by reverse-lex order: cat bio-j-jsa-gut.wds \ | sed -e 's/\([gysxju]\)/\1 /g' \ | tr ' ' '\012' \ | egrep '.' \ | revbytes | sort | revbytes | uniq -c \ | pr -4 -t -s' ' \ | expand 3 c 3 ccciij 2 co 4 ciiiiiu 1 cc 1 cccciij 6 qo 9 oiiiu 438 cg 4 oij 1 oqo 1 oiiu 436 ccg 269 lj 1 q 1 qoiiu 721 cccg 11 clj 973 cs 303 ix 454 ccccg 65 cclj 8 ccs 4 cix 30 cccccg 101 ccclj 7 cccs 11 ccix 2 occccg 37 cccclj 15 ccccs 374 ciix 2 qoccccg 1 qocclj 5 cccccs 7 cciix 1 qccccg 1 ciclj 4 cics 13 ccciix 1 occcg 6 oclj 5 ocs 8 cccciix 6 qocccg 1 coclj 1 ciocs 1 qociix 3 qcccg 4 qoclj 1 qcs 2 ciiix 2 cicg 2 qclj 86 is 628 oix 9 ocg 2 cilj 2 cis 36 coix 2 ccocg 237 olj 2 ccis 55 ccoix 7 qocg 3 colj 245 ciis 37 cccoix 7 lg 2 ccolj 8 cciis 1 ccccoix 2 clg 821 qolj 12 ccciis 1 cccccoi 1 cclg 1 oqolj 5 cccciis 1 qooix 4 olg 1 qoqolj 1 qociis 133 qoix 1 qolg 1 qlj 19 ciiis 1 oqoix 1 oqolg 134 qj 1 ciiiis 2058 cy 71 qg 15 cqj 1 oiis 272 ccy 9 cqg 32 ccqj 138 ois 353 cccy 7 ccqg 54 cccqj 9 cois 186 ccccy 10 cccqg 24 ccccqj 16 ccois 44 cccccy 1 ccccqg 1 cicqj 6 cccois 1 qoccccy 1 cocqg 4 ocqj 2 ccccois 1 occcy 35 oqg 2 qocqj 7 qois 1 qocccy 20 qoqg 7 qcqj 4 ciiu 1 qcccy 3 ci 1 ciqj 108 ciiiu 1 qccy 1 qci 219 oqj 1 occccii 1 ocy 1 ij 2 ccoqj 327 ciiiiu 1 qocy 27 ciij 205 qoqj 1 cccccii 1 qcy 1 cciij 12 o 1 qociiii Let's analyze the family {i,ii,iii,iiii}{u,s,j,x}, specifically: cat bio-j-jsa-gut.wds \ | sed \ -e 's/cs/z/g' \ -e 's/\([^i]i\)/ \1/g' \ -e 's/i\([^iusxj]\)/i\1 \1/g' \ -e 's/\([usjx]\)/\1 \1/g' \ -e 's/z/cs/g' \ | tr ' ' '\012' \ | egrep 'i' \ | revbytes | sort | revbytes \ | uniq -c | expand Results: | 32 ciij | 1 gis | 4 ciiu | 281 ix | 4 cic | 4 oij | 79 is | 109 ciiiu | 14 cix | 1 i | 1 yij | 2 xis | 2 oiiu | 8 yix | 5 ci | | 4 cis | 329 ciiiiu | 3 gix | 3 oi | | 4 yis | 9 oiiiu | 1 csix | 2 cil | | 271 ciis | 4 ciiiiiu | 6 jix | 1 cio | | 178 ois | | 3 xix | 1 ciq | | 19 ciiis | | 403 ciix | 4 cics | | 1 oiis | | 890 oix | | | 1 ciiiis | | 2 
ciiix |

  Observations:

    Note that these suffixes are, by far, most often preceded by \c/
    or \o/. The exceptions are \ix/ and \is/, which are often initial
    and occasionally preceded by \y/, \lj/, \qj/ or another \ix/.

    Except for the \iu/ family, it seems that \ci/ and \o/ are
    statistically equivalent. The peculiarity of the \iu/ family
    could be explained by the fact that FSG has separate codes [M]
    and [N] for its members, whereas the other families are denoted
    [IR], [IIR], [IIIR], etc.

    Identifying \o/ with \ci/, the most frequent members of these
    families are

      \ciij/   (0.97 of the \ij/ family)

      \is/     (0.16)
      \ciis/   (0.80)
      \ciiis/  (0.04)

      \ciiiu/  (0.24)
      \ciiiiu/ (0.74)

      \ix/     (0.19)
      \ciix/   (0.78)

  Let's check the hypothesis that \o/ ~ \ci/, by looking at the
  distribution of adjacent letters. I will exclude the words that
  begin with \qo/ since these appear to be special:

    cat bio-j-jsa-gut.wds \
      | egrep -v '^qo' \
      | sed \
          -e 's/[ql][jg]/p/g' \
          -e 's/ci/a/g' \
      | enum-trigraphs \
      | egrep '.[ao].' \
      > .ao.tri

    cat .ao.tri \
      | egrep '.a.' \
      > .a.tri

    cat .ao.tri \
      | egrep '.o.' \
      > .o.tri

     lines   words     bytes file
    ------ ------- --------- ------------
       855     855      3420 .a.tri
      1460    1460      5840 .o.tri
      2315    2315      9260 .ao.tri

    cat .a.tri \
      | tr ' ' '_' \
      | sed -e 's/\(.\).\(.\)/\1.\2 \1a\2/g' \
      | sort | uniq -c \
      > .a.frq

    cat .o.tri \
      | tr ' ' '_' \
      | sed -e 's/\(.\).\(.\)/\1.\2 \1o\2/g' \
      | sort | uniq -c \
      > .o.frq

    join -t' ' -a1 -a2 -e '000' -j1 2 -j2 2 -o1.3,1.1,2.3,2.1 .a.frq .o.frq \
      | expand \
      | gawk ' { printf " %3s %4.2f %3s %4.2f %4.2f\n", $1, ($2/855), $3, ($4/1460), ($2/855 + $4/1460) } ' \
      | sed -e 's/\b000/.../g' \
      | sort +4 -5nr

    _ai 0.07   _oi 0.31   0.39
    gai 0.35   goi 0.02   0.38
    pai 0.31   poi 0.05   0.36
    _ap 0.00   _op 0.33   0.33
    sai 0.13   soi 0.06   0.19
    cai 0.07   coi 0.10   0.17
    xai 0.03   xoi 0.06   0.09
    _ac 0.00   _oc 0.02   0.02
    ... 0.00   xo_ 0.01   0.01
    cax 0.01   ... 0.00   0.01

  Nothing clear emerges. There may be confusion between \ci/ and \o/,
  but the two do not seem to be equivalent.
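  The enum-trigraphs tool is not listed in this notebook. Judging
  only from the way its output is used above (three characters per
  line, with a blank standing for the word boundary), a minimal
  stand-in could look like the awk sketch below. The function name
  enum_trigraphs_sketch and the one-blank padding on each side are
  assumptions, not the original implementation.

```shell
#!/bin/sh
# Hypothetical stand-in for enum-trigraphs; NOT the original tool.
# Reads one word per line and prints every 3-character window of the
# word padded with one blank on each side (blank = word boundary).
enum_trigraphs_sketch() {
  awk '{
    w = " " $0 " "
    for (i = 1; i + 2 <= length(w); i++)
      print substr(w, i, 3)
  }'
}

# Tiny demonstration; blanks are shown as "_" for readability.
printf 'oix\n' | enum_trigraphs_sketch | tr ' ' '_'
```

  For the word `oix' this emits three windows: the word-initial one,
  the interior one, and the word-final one, which is the shape the
  later `egrep '.[ao].'' and `tr ' ' '_'' steps expect.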
In all, it seems that \ci/ and \o/ are lexically similar but distinct letters. The valid \i/ sequences are \ij/ \is/ \iis/ \iiu/ \iiiu/ \ix/; the others are likely to be scription or transcription errors. So let's replace the \i/ sequences by distinct letters, then \ci/ by `a', and look at what we get. Let's use this ad-hoc encoding: jsa2hoc ------------------------------------------------- #! /n/gnu/bin/sed -f # Recoding superanalytic to ad-hoc encoding: s/ij/f/g s/ix/e/g s/ci/a/g s/iiiu/m/g s/iiu/n/g s/iis/v/g s/is/r/g s/cs/z/g s/cg/8/g s/cy/9/g s/qo/A/g s/qj/H/g s/qg/P/g s/lj/H/g s/lg/P/g ------------------------------------------------- extract-words-from-interlin \ -recode jsa2hoc \ -chars "aocz89AHPermnvf" \ bio-j-jsa.evt \ bio-j-hoc lines words bytes file ------ ------- --------- ------------ 7054 7054 41600 bio-j-hoc.wds 1995 1995 15504 bio-j-hoc.dic 4626 4626 26296 bio-j-hoc-gut.wds 858 858 5513 bio-j-hoc-gut.dic 875 875 2655 bio-j-hoc-fun.wds 33 33 189 bio-j-hoc-fun.dic 1553 1553 12649 bio-j-hoc-bad.wds 1104 1104 9802 bio-j-hoc-bad.dic Digraph counts: a o c z 8 9 A H P e r m n v f TOT ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- . 66 965 621 727 362 95 1215 151 64 282 78 . . . . 4626 a 3 . 1 2 4 2 . . 3 . 402 270 328 108 19 32 1174 o 14 . . 17 6 11 1 4 463 39 758 170 9 1 1 4 1498 c 4 60 176 3509 35 1646 855 . 358 31 14 . . . . . 6688 z 41 68 63 823 9 3 4 . 2 1 1 . . . . . 1015 8 49 308 37 37 31 1 1631 . 4 . 3 1 . . . . 2102 9 2778 1 2 16 20 9 . 2 64 2 8 4 . . . 1 2907 A 7 3 1 17 . 7 1 1 1022 22 134 7 . 1 . . 1223 H 10 583 74 1323 41 3 207 . . . 6 . . . . . 2247 P 2 11 36 97 14 3 6 . . . . . . . . . 169 e 824 32 105 205 115 53 81 1 180 10 3 2 . . . . 1611 r 396 42 36 21 13 2 22 . . . . . . . . . 532 m 334 . . . . . 3 . . . . . . . . . 337 n 109 . 1 . . . . . . . . . . . . . 110 v 19 . . . . . 1 . . . . . . . . . 20 f 36 . 1 . . . . . . . . . . . . . 
37 ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- TOT 4626 1174 1498 6688 1015 2102 2907 1223 2247 169 1611 532 337 110 20 37 26296 Next-symbol probability (× 99): a o c z 8 9 A H P e r m n v f TOT ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- . 1 21 13 16 8 2 26 3 1 6 2 . . . . 99 a . . . . . . . . . . 34 23 28 9 2 3 99 o 1 . . 1 . 1 . . 31 3 50 11 1 . . . 99 A 1 . . 1 . 1 . . 83 2 11 1 . . . . 99 ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- e 51 2 6 13 7 3 5 . 11 1 . . . . . . 99 ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- c . 1 3 52 1 24 13 . 5 . . . . . . . 99 z 4 7 6 80 1 . . . . . . . . . . . 99 8 2 15 2 2 1 . 77 . . . . . . . . . 99 H . 26 3 58 2 . 9 . . . . . . . . . 99 P 1 6 21 57 8 2 4 . . . . . . . . . 99 ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- 9 95 . . 1 1 . . . 2 . . . . . . . 99 r 74 8 7 4 2 . 4 . . . . . . . . . 99 m 98 . . . . . 1 . . . . . . . . . 99 n 98 . 1 . . . . . . . . . . . . . 99 v 94 . . . . . 5 . . . . . . . . . 99 f 96 . 3 . . . . . . . . . . . . . 99 ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- TOT 17 4 6 25 4 8 11 5 8 1 6 2 1 0 0 0 26296 Previous-symbol probability (× 99): a 9 o c z 8 A H P e r m n v f TOT ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- . 6 3 64 9 71 17 98 7 37 17 15 . . . . 17 a . . . . . . . . . . 25 50 96 97 94 86 4 o . . . . . 1 1 . 20 23 47 32 3 1 5 11 6 A . . . . . . . . 45 13 8 1 . 1 . . 5 c . 5 29 12 52 3 78 . 16 18 1 . . . . . 25 z 1 6 . 4 12 1 . . . 1 . . . . . . 4 8 1 26 56 2 1 3 . . . . . . . . . . 8 9 59 . . . . 2 . . 3 1 . 1 . . . 3 11 H . 49 7 5 20 4 . . . . . . . . . . 8 P . 1 . 2 1 1 . . . . . . . . . . 
1 e 18 3 3 7 3 11 2 . 8 6 . . . . . . 6 r 8 4 1 2 . 1 . . . . . . . . . . 2 m 7 . . . . . . . . . . . . . . . 1 n 2 . . . . . . . . . . . . . . . 0 v . . . . . . . . . . . . . . . . 0 f 1 . . . . . . . . . . . . . . . 0 ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- TOT 99 99 99 99 99 99 99 99 99 99 99 99 99 99 99 99 26296 97-07-11 stolfi =============== I took the list of good words and examined the contexts of all "gallows" letters: cat bio-j-jsa.wds \ | jsa2hoc \ | enum-contexts -vCTX=1 -vPAT='[HP]' \ | wfreq4 710 AHc 361 AHa 320 cHc 319 oHc 170 oHa 146 eHc 108 Hc 90 AH9 80 ?Hc 72 cH9 57 9Hc Statistics of \c/ strings: cat bio-j-jsa-gut.wds \ | jsa2hoc \ | enum-contexts -vCTX=0 -vPAT='cc*' \ | wfreq4 860 c 1317 cc 884 ccc 145 cccc 1 ccccc These numbers suggest that \cc/ may be a single letter. Let's look at the frequencies of single \c/: cat bio-j-jsa-gut.wds \ | jsa2hoc \ | enum-contexts -vCTX=00 -vPAT='[^c]c[^c]' \ | wfreq4 399 Hc8 9 qcH 4 zce 1 rca 247 Hc9 8 zca 2 acH 1 qc9 30 zc8 8 zcH 2 Pca 1 Pco 25 zco 7 Hcz 2 Pc9 1 Pc8 23 Hco 6 Hca 2 8c8 1 9cP 13 zc9 6 AcH 1 zcz 1 9cH 10 ocH It seems that single \c/s are all associated with \z/, \H/, or \P/. 
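  The enum-contexts and wfreq4 tools used above are not listed in
  this notebook. As an illustration of the kind of tabulation
  involved, the sketch below prints each leftmost-longest,
  non-overlapping match of a pattern together with a fixed number of
  context characters on either side, truncated at the word edges.
  The function name enum_contexts_sketch, its argument order, and the
  use of `sort | uniq -c | sort -nr' in place of wfreq4 are all
  assumptions.

```shell
#!/bin/sh
# Hypothetical stand-in for enum-contexts; NOT the original tool.
# usage: enum_contexts_sketch PATTERN CONTEXT_WIDTH  < words
enum_contexts_sketch() {
  awk -v pat="$1" -v ctx="$2" '{
    s = $0; off = 0
    # Scan the word left to right; match() gives the leftmost-longest hit.
    while (match(s, pat) && RLENGTH > 0) {
      start = off + RSTART                 # match position in the full word
      lo = start - ctx; if (lo < 1) lo = 1 # clip left context at word start
      hi = start + RLENGTH - 1 + ctx       # substr() clips at word end
      print substr($0, lo, hi - lo + 1)
      off += RSTART + RLENGTH - 1
      s = substr(s, RSTART + RLENGTH)      # continue after this match
    }
  }'
}

# Tiny demonstration on made-up ad-hoc-encoded words.
printf 'AHc89\nHc9\nAHc89\n' \
  | enum_contexts_sketch '[HP]' 1 \
  | sort | uniq -c | sort -nr
```

  A gallows at the start of a word comes out without left context,
  which is consistent with entries such as `Hc' in the first table of
  this entry.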
I will try to remove the cHc and cPc sequences and see what happens: cat bio-j-jsa-gut.wds \ | jsa2hoc \ | sed -e 's/c[HP]c/X/g' \ | enum-contexts -vCTX=00 -vPAT='[^c]c[^c]' \ | wfreq4 380 Hc8 25 zco 4 zce 1 qc9 78 Hc9 19 Hco 2 qcH 1 Xcz 60 zcX 13 zc9 2 Hca 1 Xco 30 zc8 8 zca 2 8c8 1 AcX 30 Xc9 7 Hcz 1 zcz 1 9cP 29 Xc8 5 zcH 1 rca It seems that \c8/ and \c9/ are also major sources of single \c/s cat bio-j-jsa-gut.wds \ | jsa2hoc \ | sed \ -e 's/c[HP]c/X/g' \ -e 's/zc/Z/g' \ -e 's/c8/K/g' \ -e 's/c9/W/g' \ | enum-contexts -vCTX=00 -vPAT='[^c]c[^c]' \ | wfreq4 315 HcK 9 Zca 2 rcW 1 Zcz 166 HcW 6 XcK 2 qcH 1 ZcP 52 ZcK 6 AcK 2 XcW 1 Xcz 48 ZcX 5 PcK 2 Hcz 1 Xco 40 Zco 5 HcZ 2 Hca 1 PcW 34 ZcW 5 8cK 1 rca 1 AcX 25 ZcH 3 qcK 1 qcW 1 AcW 19 Hco 3 KcW 1 ocW 1 9cP 15 ecK 3 8cW 1 ocK 1 9cK 14 ecW Not very illuminating. Let's examine again the \c/ strings in context: cat bio-j-jsa-gut.wds \ | jsa2hoc \ | enum-contexts -vCTX=1 -vPAT='cc*' \ | wfreq4 401 Hc8 11 8ccc8 3 Hcccc9 1 ocP 347 Hcc8 11 ccca 3 8cco 1 eccz 322 zcc8 10 cccz 3 8ccco 1 eccccz 255 Hc9 10 cP 3 8c8 1 ecccco 200 Hcc9 9 zcca 3 ccccz 1 ecccP 192 ccc8 9 qcH 3 ccP 1 ecca 116 zcc9 9 Pcc8 2 zcccP 1 eccP 97 eccc8 9 cce 2 zc 1 eccH 90 cccH 8 zca 2 rcccH 1 ecc 82 zccH 8 zcH 2 rcc9 1 Pco 67 ccc9 8 Pcc9 2 occc8 1 Pcca 59 zcccH 7 Hcz 2 ecce 1 Pc8 53 Pccc8 7 8cc9 2 eccca 1 HccccH 52 zccc8 7 cccP 2 acH 1 Hc 51 ccccH 6 rccc9 2 Pcccc9 1 Accc9 42 eccc9 6 rccc8 2 Pca 1 AccH 40 zcco 6 eccccH 2 Pc9 1 Acc9 37 Hccc8 6 Hca 2 Hccco 1 9ccco 34 zccc9 6 Acc8 2 Hcccc8 1 9ccccz 31 zc8 6 AcH 2 Accc8 1 9cccc9 31 cccc9 6 9ccc8 2 9ccccH 1 9cccc8 26 zco 6 cc9 1 zcz 1 9ccc9 25 Hco 5 eccco 1 zccz 1 9cc8 25 ccco 5 ecccc8 1 zccccH 1 9cP 24 Hccc9 5 8cc8 1 zcccc9 1 9cH 22 cc8 4 zce 1 rccz 1 8cccco 20 cccc8 4 zcccz 1 rcccz 1 8cccc9 17 cco 4 zccP 1 rcccc9 1 8cccc8 17 cH 4 ecccc9 1 rcccc8 1 8cccH 15 ecc8 4 Pccco 1 rccca 1 8ccc9 14 zc9 4 Hccz 1 rca 1 8c9 14 ecc9 4 Hcca 1 qccc8 1 cccco 14 ccH 3 zccco 1 qcc9 1 ccccco 13 cca 3 qcc8 1 qc9 1 cccca 11 ocH 3 
ecco 1 occca 1 ccccP 11 Pccc9 3 ecccH 1 occ9 1 ca 11 Hcco 3 Pcco 1 occ8 Let's look more closely at the \Hc^*8/ strings: cat bio-j-jsa-gut.wds \ | jsa2hoc \ | enum-contexts -vCTX=1 -vPAT='Hcc*8' \ | wfreq4 198 AHc89 14 Hccc89 2 AHcccc89 1 cHc8 184 AHcc89 14 Hcc89 2 9Hcc8a 1 AHccc8 85 oHc89 11 AHccc89 2 8Hc89 1 AHcc8a 59 oHcc89 9 oHc8a 2 Hcc8 1 AHc8z 31 cHcc89 6 oHccc89 1 oHcc8a 1 AHc8o 29 eHc89 5 AHcc8 1 oHc8c 1 AHc8c 27 eHcc89 5 AHc8a 1 oHc8 1 Hccc8a 26 Hc89 3 oHcc8 1 eHcc8 1 Hcc8o 20 cHc89 3 AHc8 1 eHc8c 1 Hcc8a 15 9Hc89 2 eHccc89 1 eHc8 1 Hc8a 14 9Hcc89 2 cHccc8 1 cHcc8c Curiously, so the string \Hc*8/ is almost always followed by \9/. I wonder if that is true in general: cat bio-j-jsa-gut.wds \ | jsa2hoc \ | enum-contexts -vLCTX=0 -vRCTX=2 -vPAT='c8' \ | wfreq4 1526 c89 5 c8oe 2 c8av 1 c89z 45 c8 4 c8cc 1 c8zc 1 c89o 24 c8ar 4 c8af 1 c8c9 1 c89H 22 c8ae 3 c8or 1 c8c8 1 c89A 10 c8am 3 c89e 1 c8an It does seem that \c8/ is almost always followed by \9/, and occasionally by end-of-word. 
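  The "almost always" can also be put as a single proportion. The awk
  sketch below counts, over a word list, how often a given string is
  immediately followed by another one. The function name
  follow_ratio_sketch and its output format (hits, total, ratio) are
  hypothetical; it is meant as a cross-check on the table above, not
  as a tool that was actually used here.

```shell
#!/bin/sh
# Hypothetical helper: how often is PREFIX immediately followed by NEXT?
# usage: follow_ratio_sketch PREFIX NEXT  < words
# prints: <hits> <total occurrences of PREFIX> <hits/total>
follow_ratio_sketch() {
  awk -v pre="$1" -v nxt="$2" '{
    n = length(pre)
    for (i = 1; i + n - 1 <= length($0); i++)
      if (substr($0, i, n) == pre) {
        total++
        # At end of word substr() returns "", which never equals NEXT.
        if (substr($0, i + n, length(nxt)) == nxt) hit++
      }
  }
  END { if (total > 0) printf "%d %d %.2f\n", hit, total, hit / total }'
}

# Tiny demonstration on four made-up words.
printf 'AHc89\noHc8\nc8ar\nzcc89\n' | follow_ratio_sketch c8 9   # prints 2 4 0.50
```

  On the real word list one would run it as
  `cat bio-j-jsa-gut.wds | jsa2hoc | follow_ratio_sketch c8 9'.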
  Let's look at the left context:

    cat bio-j-jsa-gut.wds \
      | jsa2hoc \
      | enum-contexts -vLCTX=1 -vRCTX=0 -vPAT='cc*8' \
      | wfreq4

    401 Hc8      31 zc8       6 9ccc8     1 rcccc8
    347 Hcc8     22 cc8       5 ecccc8    1 qccc8
    322 zcc8     20 cccc8     5 8cc8      1 occ8
    192 ccc8     15 ecc8      3 qcc8      1 Pc8
     97 eccc8    11 8ccc8     2 occc8     1 9cccc8
     53 Pccc8     9 Pcc8      2 Hcccc8    1 9cc8
     52 zccc8     6 rccc8     2 Accc8     1 8cccc8
     37 Hccc8     6 Acc8      2 8c8       1 c8

  Let's look at four \c/s:

    cat bio-j-jsa-gut.wds \
      | jsa2hoc \
      | enum-contexts -vLCTX=0 -vRCTX=1 -vPAT='cccc' \
      | wfreq4

     61 ccccH    30 cccc8     3 cccco     1 cccca
     44 cccc9     5 ccccz     1 ccccc     1 ccccP

    cat bio-j-jsa-gut.wds \
      | jsa2hoc \
      | enum-contexts -vLCTX=1 -vRCTX=0 -vPAT='cccc' \
      | wfreq4

    109 cccc      6 Hcccc     3 8cccc     2 rcccc
     17 ecccc     5 9cccc     2 zcccc     2 Pcccc

  Let's examine the contexts of \zc/:

    cat bio-j-jsa-gut.wds \
      | jsa2hoc \
      | enum-contexts -vLCTX=1 -vRCTX=0 -vPAT='zc' \
      | wfreq4

    580 zc       31 8zc      13 rzc       5 czc
    108 ezc      19 9zc       9 zzc       4 ozc
     41 Hzc      14 Pzc

    cat bio-j-jsa-gut.wds \
      | jsa2hoc \
      | enum-contexts -vLCTX=0 -vRCTX=1 -vPAT='zc' \
      | wfreq4

    730 zcc      14 zc9       8 zcH       2 zc
     31 zc8       8 zca       4 zce       1 zcz
     26 zco

  It is time to change my ad-hoc encoding to reflect the
  consonant/vowel theory (as suggested by Grove).

  We have identified several categories of symbols:

    \iiiu/, \iiu/, \iis/, \ij/

      The ziggies: strictly final, always preceded by \ci/ or, more
      rarely, by \o/.

    \cy/

      Almost always final, but occasionally followed by other
      letters. Preceded by about the same letters as \ci/; indeed,
      it is probably the final form of \ci/.

    \cg/

      May be followed by many letters, most often \cy/ and \ci/.
      Almost always preceded by \c/, or initial; rarely by \ix/ or
      \o/.

    \cs/

      Most often followed by \c/, somewhat less often by \o/, \ci/,
      or word break. Most often initial, but also preceded by \ix/,
      gallows, \c/, \cy/, \cg/, \is/.

    \lg/, \qg/, \lj/, \qj/

      The capitals: very similar to each other, different from the
      rest. Probably to be combined with \c/ on both sides.

    \qo/

      Strictly initial, almost always followed by a capital.
\ix/ Usually initial or preceded by \ci/ or \o/; followed by any letter except ziggies and \qo/, \ix/, \is/ \is/ Similar to \ix/ except that it cannot be followed by capitals or \cg/, either. \ci/ May be followed only by the ziggies, \ix/, or \ir/ only. Often follows a capital, but also \cg/, \cs/, \c/, \ix/, \is/, or word break. \o/ Similar to \a/, but is very often word-initial. With these considerations, I defined a new encoding, "hic": jsa2hic ------------------------------------------------- #! /n/gnu/bin/sed -f # Recoding superanalytic to ad-hoc encoding: s/ij/J/g s/ix/I/g s/ci/S/g s/iiiu/M/g s/iiu/N/g s/iis/L/g s/is/C/g s/csc/Z/g s/cg/U/g s/cy/S/g s/qo/W/g s/qj/A/g s/qg/E/g s/lj/Y/g s/lg/O/g s/o/R/g s/cc/T/g ------------------------------------------------- extract-words-from-interlin \ -recode jsa2hic \ -chars "AEIOUYWSRTMNLc" \ bio-j-jsa.evt \ bio-j-hic lines words bytes file ------ ------- --------- ------------ 7054 7054 41600 bio-j-hic.wds 1934 1934 15117 bio-j-hic.dic 4626 4626 26296 bio-j-hic-gut.wds 800 800 5150 bio-j-hic-gut.dic 875 875 2655 bio-j-hic-fun.wds 33 33 189 bio-j-hic-fun.dic 1553 1553 12649 bio-j-hic-bad.wds 1101 1101 9778 bio-j-hic-bad.dic S T R L N Z H Q A E O U c TOT ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- . 161 965 282 78 215 727 362 1215 . . . . 621 4626 S 2781 1 3 410 274 69 24 11 2 33 19 108 328 18 4081 T 14 1 . 758 170 502 6 11 4 4 1 1 9 17 1498 Q 7 4 1 134 7 1044 . 7 1 . . 1 . 17 1223 R 824 113 105 3 2 190 115 53 1 . . . . 205 1611 L 396 64 36 . . . 13 2 . . . . . 21 532 N 12 807 110 6 . . 55 6 . . . . . 1420 2416 Z 41 72 63 1 . 3 9 3 . . . . . 823 1015 H 49 1939 37 3 1 4 31 1 . . . . . 37 2102 A 36 . 1 . . . . . . . . . . . 37 E 19 1 . . . . . . . . . . . . 20 O 109 . 1 . . . . . . . . . . . 110 U 334 3 . . . . . . . . . . . . 337 c 4 915 176 14 . 389 35 1646 . . . . . 
3509 6688 ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- TOT 4626 4081 1498 1611 532 2416 1015 2102 1223 37 20 110 337 6688 26296 Next-symbol probability (× 99): N R L S T Z Q H A E O U c TOT ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- . 5 6 2 3 21 16 26 8 . . . . 13 99 N . . . . 33 5 2 . . . . . . 58 99 R 51 12 . . 7 6 7 . 3 . . . . 13 99 L 74 . . . 12 7 2 . . . . . . 4 99 S 67 2 10 7 . . 1 . . 1 . 3 8 . 99 T 1 33 50 11 . . . . 1 . . . 1 1 99 Z 4 . . . 7 6 1 . . . . . . 80 99 Q 1 85 11 1 . . . . 1 . . . . 1 99 H 2 . . . 91 2 1 . . . . . . 2 99 A 96 . . . . 3 . . . . . . . . 99 E 94 . . . 5 . . . . . . . . . 99 O 98 . . . . 1 . . . . . . . . 99 U 98 . . . 1 . . . . . . . . . 99 c . 6 . . 14 3 1 . 24 . . . . 52 99 ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- TOT 17 9 6 2 15 6 4 5 8 0 0 0 1 25 26296 Previous-symbol probability (× 99): N R L S T Z Q H A E O U c TOT ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- . 9 17 15 4 64 71 98 17 . . . . 9 17 N . . . . 20 7 5 . . . . . . 21 9 R 18 8 . . 3 7 11 . 2 . . . . 3 6 L 8 . . . 2 2 1 . . . . . . . 2 S 60 3 25 51 . . 2 . 1 88 94 97 96 . 15 T . 21 47 32 . . 1 . 1 11 5 1 3 . 6 Z 1 . . . 2 4 1 . . . . . . 12 4 Q . 43 8 1 . . . . . . . 1 . . 5 H 1 . . . 47 2 3 . . . . . . 1 8 A 1 . . . . . . . . . . . . . 0 E . . . . . . . . . . . . . . 0 O 2 . . . . . . . . . . . . . 0 U 7 . . . . . . . . . . . . . 1 c . 16 1 . 22 12 3 . 78 . . . . 52 25 ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- TOT 99 99 99 99 99 99 99 99 99 99 99 99 99 99 26296 cat bio-j-jsa.wds \ | /n/gnu/bin/awk -f foo.awk \ | sort | uniq -c | sort -nr | expand \ > bio-j-jsa-wpairs.freq 97-07-14 stolfi =============== Over the weekend I have pored over the list of most common words. 
They can be grouped into the following major sets, based on prefixes:

    qoljc*/qoqjc*
    cccc*/csccc*
    oljc*/oqjc*
    cgc*
    ixcccc*/ixcsccc*
    cccljc*/csccljc*/cccqjc*

plus the following short connectives:

    oix qoix ois csoix oixcg ixoix csciiiu csciiiiu ciiiiu

To confirm some hunches about gallows, I tabulated all \lj/ and \qj/
gallows letters and their neighboring \c/ strings:

    cat bio-j-jsa-gut.wds \
      | sed -e 's/cs/c/g' \
      | enum-contexts -vPAT='c*qjc*' -vLCTX=0 -vRCTX=1 \
      | wfreq \
      > .foo

    cat bio-j-jsa-gut.wds \
      | sed -e 's/cs/c/g' \
      | enum-contexts -vPAT='c*ljc*' -vLCTX=0 -vRCTX=1 \
      | wfreq \
      > .bar

    pr -m -s' ' -t -i' '1 .foo .bar > .baz

    \qj/ gallows             \lj/ gallows
    -----------------        -------------------
      129 0.18 qjccg           251 0.16 ljccg
       79 0.11 qjcccg          244 0.16 ljcccg
       44 0.06 qjcy            131 0.08 ljcccy
       38 0.05 cccqjccy        100 0.06 ljcy
       36 0.05 qjcccy           55 0.04 ljccy
       31 0.04 qjo              54 0.03 cccljccy
       29 0.04 ccccqjccy        44 0.03 ccccljccy
       26 0.04 qjccccg          42 0.03 ljo
       24 0.03 qjccy            26 0.02 ljccccg
       14 0.02 qjccccy          25 0.02 ljccccy
       10 0.01 cccqjcy          20 0.01 cccljcy
       10 0.01 ccccqjcy         15 0.01 ccccljcy
        8 0.01 qjco             12 0.01 cccljcccg
        6 0.01 cqjcccy          11 0.01 ljco
        5 0.01 qjcco             9 0.01 cccljcccy
        4 0.01 qjccco            8 0.01 cccljccg
        4 0.01 ccqjcy            7 0.00 cljccy
        4 0.01 cccqjcccy         7 0.00 cljcccy
        3 0.00 qjccci            7 0.00 cccljci
        3 0.00 qj                5 0.00 lji
        3 0.00 cqjco             5 0.00 ljcco
        3 0.00 cqjccy            5 0.00 cljcccg
        3 0.00 cqjccg            5 0.00 ccljci
        3 0.00 cqjcccg           5 0.00 ccccljcccy
        3 0.00 cccqjccg          5 0.00 ccccljcccg
        3 0.00 cccqjcccg         4 0.00 ljcccccy
        3 0.00 ccccqjcccg        4 0.00 ccclj
        2 0.00 qjcg              3 0.00 lj
        2 0.00 cqjcy             3 0.00 ccljcy
        2 0.00 ccqjo             2 0.00 ljcci
        2 0.00 ccqjci            2 0.00 ljccco
        1 0.00 qji               2 0.00 ljcccccg
        1 0.00 qjcccccy          2 0.00 cljci
        1 0.00 qjccc             2 0.00 cljccg
        1 0.00 qjcc              2 0.00 ccljccg
        1 0.00 cqjci             2 0.00 ccccljcci
        1 0.00 cqjcco            2 0.00 ccccljccg
        1 0.00 cqjcci            1 0.00 ljcg
        1 0.00 cqjccccg          1 0.00 ljccci
        1 0.00 cqjccc            1 0.00 ljccccl
        1 0.00 ccqjccg           1 0.00 ccljco
        1 0.00 ccqjcccy          1 0.00 ccljcccy
        1 0.00 cccqjci           1 0.00 ccljcccg
        1 0.00 ccccqjci          1 0.00 cccljco
        1 0.00 ccccqjcccy        1 0.00 cccljcci
        1 0.00 cccccqjccc        1 0.00 cccljccccy
    ----- ---- ----              1 0.00 cccljccccg
      700 1.00 TOT               1 0.00 cccljc
                                 1 0.00 ccccljco
                                 1 0.00 cccccljccy
                             ----- ---- ----
                              1566 1.00 TOT

Now let's check the significance of the \s/ plume on \c/.
First, let's list all initial \c/-strings that have plumes
against all that don't have them:

    cat bio-j-jsa-gut.wds \
      | sed \
          -e 's/cs/z/g' \
          -e 's/^/_/g' \
          -e 's/$/_/g' \
      | enum-contexts -vPAT='_c*z[zc]*[^zc]' -vLCTX=0 -vRCTX=0 \
      | wfreq \
      > .foo

    cat bio-j-jsa-gut.wds \
      | sed \
          -e 's/cs/z/g' \
          -e 's/^/_/g' \
          -e 's/$/_/g' \
      | enum-contexts -vPAT='_c*c[^zc]' -vLCTX=0 -vRCTX=0 \
      | wfreq \
      > .bar

    pr -m -s' ' -t -i' '1 .foo .bar > .baz

    "z"=\cs/ prefixes        "c" prefixes
    ------------------       ------------------
      219 0.30 _zcccg          364 0.32 _cg
       71 0.10 _zcccy          192 0.17 _ccccg
       69 0.09 _zci             95 0.08 _cy
       63 0.08 _zo              67 0.06 _ccccy
       54 0.07 _zccl            66 0.06 _ci
       38 0.05 _zccccg          60 0.05 _cccl
       34 0.05 _zcccl           37 0.03 _cccq
       30 0.04 _zcco            31 0.03 _cccccy
       26 0.04 _zccq            29 0.03 _ccccl
       23 0.03 _zccccy          25 0.02 _ccco
       22 0.03 _zcccq           23 0.02 _ccccq
       19 0.03 _zco             22 0.02 _cccg
        9 0.01 _zccy            20 0.02 _cccccg
        8 0.01 _cccz_           19 0.02 _cq
        7 0.01 _zcci            17 0.01 _cco
        6 0.01 _zccg            13 0.01 _ccci
        6 0.01 _zccci           11 0.01 _cccci
        6 0.01 _z_              10 0.01 _cci
        4 0.01 _zzcccg           9 0.01 _ccl
        3 0.00 _zcq              8 0.01 _cl
        3 0.00 _zcl              8 0.01 _ccq
        3 0.00 _zccco            6 0.01 _cccy
        3 0.00 _ccccz_           1 0.00 _cccco
        2 0.00 _zl               1 0.00 _ccccco
        1 0.00 _zzcl             1 0.00 _ccccci
        1 0.00 _zzcco        ----- ---- ----
        1 0.00 _zzccl         1135 1.00 TOT
        1 0.00 _zzcccy
        1 0.00 _zzcccl
        1 0.00 _zq
        1 0.00 _zi
        1 0.00 _zcy
        1 0.00 _zccz_
        1 0.00 _zcccz_
        1 0.00 _zccccq
        1 0.00 _zc_
        1 0.00 _ccczcy
        1 0.00 _ccczci
    ----- ---- ----
      742 1.00 TOT

Let's do it again with whole words:

    cat bio-j-jsa-gut.wds \
      | sed -e 's/cs/z/g' \
      | egrep '^z' \
      | wfreq \
      > .foo

    cat bio-j-jsa-gut.wds \
      | sed -e 's/cs/z/g' \
      | egrep '^c' \
      | wfreq \
      > .bar

    pr -m -s' ' -t -i' '1 .foo .bar > .baz

"z"=\cs/ words "c" words ------------------ ------------------ 204 0.28 zcccgcy 172 0.15
ccccgcy 69 0.09 zcccy 73 0.06 cgciiiiu 36 0.05 zccccgcy 67 0.06 ccccy 31 0.04 zciiiiu 51 0.04 cgciis 25 0.03 zoix 50 0.04 cgciix 24 0.03 zccljccy 33 0.03 cgcy 23 0.03 zccccy 31 0.03 cccccy 20 0.03 zcccljccy 29 0.03 cccljccy 17 0.02 zccoix 21 0.02 cccqjccy 14 0.02 zciix 20 0.02 ciiiiu 13 0.02 zccqjccy 19 0.02 cccccgcy 11 0.02 zcoix 18 0.02 ccccljccy 11 0.02 zciis 17 0.01 cgzcccgcy 11 0.02 zcccqjccy 17 0.01 cgoix 10 0.01 zois 17 0.01 cccoix 10 0.01 zccljcy 17 0.01 ccccqjccy 9 0.01 zccy 16 0.01 cccgcy 9 0.01 zccois 12 0.01 ciix 7 0.01 zcccqjcy 12 0.01 cgciiiu 6 0.01 zcccgciix 11 0.01 ccoix 6 0.01 z 9 0.01 cyljcccgcy 5 0.01 zccljcccy 9 0.01 cgccccgcy 5 0.01 zccljcccgcy 9 0.01 ccccljcy 5 0.01 zcciis 8 0.01 cyzcccgcy 5 0.01 zccgcy 8 0.01 cccz 5 0.01 zcccljcy 8 0.01 cccljcy 4 0.01 zzcccgcy 7 0.01 cyqjccgcy 4 0.01 zccqjcy 7 0.01 ciis 3 0.00 zoixljcccy 6 0.01 cyljcccy 3 0.00 zoixljcccgcy 6 0.01 cgciixcy 3 0.00 zoiiiu 6 0.01 cccois 3 0.00 zcoixcgcy 5 0.00 cyljccgcy 3 0.00 zciiis 5 0.00 ciiiu 3 0.00 zccqjcccy 5 0.00 cgciij 3 0.00 zcclj 5 0.00 cccy 3 0.00 zcccoix 5 0.00 cccljccgcy 3 0.00 zcccljcccgcy 5 0.00 cccljcccgcy 3 0.00 zccciix 5 0.00 ccccgciiiiu 3 0.00 zccciis 5 0.00 ccccg 3 0.00 zcccgciiiiu 4 0.00 cyzcccy 2 0.00 zoljcccgcy 4 0.00 cyccccgcy 2 0.00 zoixccccy 4 0.00 cgciiscy 2 0.00 zoixccccgcy 4 0.00 cgcccgcy 2 0.00 zcljcy 4 0.00 ccix 2 0.00 zcix 4 0.00 cccqjcy 2 0.00 zccqgcccy 4 0.00 cccljcccy 2 0.00 zccljccgcy 4 0.00 ccciix 2 0.00 zcccqgccy 4 0.00 ccciis 2 0.00 zcccljcciix 4 0.00 cccciix 2 0.00 zcccljccgcy 4 0.00 cccciis 2 0.00 zcccljcccy 3 0.00 cyzccccy 2 0.00 zcccg 3 0.00 cyqjcccy 1 0.00 zzcljcccgcy 3 0.00 cyqjcccgcy 1 0.00 zzccoix 3 0.00 cqjcccy 1 0.00 zzccljcy 3 0.00 cqjcccgcy 1 0.00 zzcccy 3 0.00 cqgcccy 1 0.00 zzcccljccy 3 0.00 ciixcy 1 0.00 zqgcccy 3 0.00 ciij 1 0.00 zoljoix 3 0.00 cgois 1 0.00 zoixzcccy 3 0.00 cgix 1 0.00 zoixqjcccgcy 3 0.00 cgciixcgcy 1 0.00 zoixois 3 0.00 cgcccoix 1 0.00 zoixljcy 3 0.00 cccqjccgcy 1 0.00 zoixljccy 3 0.00 cccgciis 1 0.00 
zoixljccgcy 3 0.00 ccccz 1 0.00 zoixcy 3 0.00 ccccqjcy 1 0.00 zoixcgcy 2 0.00 cyqjciiiiu 1 0.00 zoixccljciix 2 0.00 cyljciiiiu 1 0.00 zoixcccoix 2 0.00 cljccgcy 1 0.00 zocljcccy 2 0.00 ciisoix 1 0.00 zocgciis 2 0.00 ciiiiucy 1 0.00 zljciis 2 0.00 cgzcccy 1 0.00 zljcccy 2 0.00 cgzccccy 1 0.00 zixz 2 0.00 cgljccgcy 1 0.00 zcy 2 0.00 cgcyljcccgcy 1 0.00 zcqjoix 2 0.00 cgciixzccgcy 1 0.00 zcqjcy 2 0.00 cgciixo 1 0.00 zcqjcccy 2 0.00 cgciiis 1 0.00 zcoljzccgcy 2 0.00 cgci 1 0.00 zcoljcy 2 0.00 cgccoix 1 0.00 zcoljciiiiu 2 0.00 cgccgcy 1 0.00 zcocqgcccgcy 2 0.00 cgcccy 1 0.00 zcocljccy 2 0.00 ccqjcy 1 0.00 zcljcoix 2 0.00 ccljccgcy 1 0.00 zcixcgcy 2 0.00 cccqjcccgcy 1 0.00 zcis 2 0.00 cccqgcccy 1 0.00 zciljciiiiu 2 0.00 cccljciis 1 0.00 zciixljcccy 2 0.00 ccciij 1 0.00 zciixcy 2 0.00 cccciixcy 1 0.00 zciixccccg 2 0.00 ccccgoix 1 0.00 zciisciix 2 0.00 ccccgciix 1 0.00 zciiiuo 2 0.00 ccccgciis 1 0.00 zccz 1 0.00 cyzccciixcgcy 1 0.00 zccqjciis 1 0.00 cyzccccgcy 1 0.00 zccqjcccyix 1 0.00 cyzcccccy 1 0.00 zccqjcccgcy 1 0.00 cyqjcy 1 0.00 zccqgccccgcy 1 0.00 cyqjciix 1 0.00 zccoljcy 1 0.00 cyqjciis 1 0.00 zccoixoix 1 0.00 cyqjciiiu 1 0.00 zccoixo 1 0.00 cyqjccy 1 0.00 zccoixcgcy 1 0.00 cyqjcccgciis 1 0.00 zccljciix 1 0.00 cyoljcy 1 0.00 zccljciij 1 0.00 cyljzccoix 1 0.00 zccljciiiiu 1 0.00 cyljzcccgcy 1 0.00 zccljciiiiiu 1 0.00 cyljciix 1 0.00 zccljccccy 1 0.00 cyljciis 1 0.00 zcciix 1 0.00 cyljciiiu 1 0.00 zcciij 1 0.00 cyljcccgciis 1 0.00 zccgciix 1 0.00 cyljcccccy 1 0.00 zcccz 1 0.00 cylgccccy 1 0.00 zcccyljcy 1 0.00 cylgccccgcy 1 0.00 zcccyis 1 0.00 cyixois 1 0.00 zcccqjcccgcy 1 0.00 cyixcccgcy 1 0.00 zcccqjcccgcccy 1 0.00 cyiscy 1 0.00 zcccgoix 1 0.00 cycqgciiiiu 1 0.00 zcccgciixcgcy 1 0.00 cycljcccy 1 0.00 zcccgciis 1 0.00 cyciis 1 0.00 zcccgciij 1 0.00 cycgcy 1 0.00 zccccqjcccy 1 0.00 cycgciiszcccy 1 0.00 zccccgciiis 1 0.00 cycgciisciix 1 0.00 zccccg 1 0.00 cycgciiiiu 1 0.00 zc 1 0.00 cycccoix ----- ---- ---- 1 0.00 cyccccz 729 1.00 TOT 1 0.00 cyccccy 1 0.00 cyccccqjccy 
1 0.00 cyccccljcccy 1 0.00 cyccccgciis 1 0.00 cyccccg 1 0.00 cycccccy 1 0.00 cycccccgcy 1 0.00 cy 1 0.00 cqjcy 1 0.00 cqjcoix 1 0.00 cqjcois 1 0.00 cqjcciix 1 0.00 cqjccgcy 1 0.00 cqgcoix 1 0.00 cqgciixoiis 1 0.00 cqgcciix 1 0.00 cqgcciis 1 0.00 cqgccgois 1 0.00 cljciix 1 0.00 cljccy 1 0.00 cljcccy 1 0.00 cljcccgcy 1 0.00 clgccccy 1 0.00 clgccccgcy 1 0.00 cizcy 1 0.00 ciqjccy 1 0.00 ciixzcccz 1 0.00 ciixoix 1 0.00 ciixoiscy 1 0.00 ciixciixcgcy 1 0.00 ciixcccgcy 1 0.00 ciixccccgcy 1 0.00 ciisois 1 0.00 ciiscy 1 0.00 ciisciis 1 0.00 ciiis 1 0.00 cgzcoixcycg 1 0.00 cgzcoix 1 0.00 cgzcois 1 0.00 cgzccoix 1 0.00 cgzccgcy 1 0.00 cgzcccz 1 0.00 cgzccccgcy 1 0.00 cgzccccgciix 1 0.00 cgoqjcccy 1 0.00 cgoljccgcy 1 0.00 cgoixljccgcy 1 0.00 cgoixlgccccgcy 1 0.00 cgoixccccgcy 1 0.00 cgoixcccccgcy 1 0.00 cgoisciiiiu 1 0.00 cgljzcccy 1 0.00 cgljcccy 1 0.00 cgisoix 1 0.00 cgcyqjccy 1 0.00 cgcyqjccgcy 1 0.00 cgcyljzccy 1 0.00 cgcyljcy 1 0.00 cgcyljciiiiu 1 0.00 cgcyljccgcy 1 0.00 cgcyij 1 0.00 cgciixzcccgcy 1 0.00 cgciixoix 1 0.00 cgciixljcy 1 0.00 cgciixcyiscg 1 0.00 cgciixciix 1 0.00 cgciixciiscy 1 0.00 cgciixciis 1 0.00 cgciixciiiiu 1 0.00 cgciixccccgcy 1 0.00 cgciisoiscy 1 0.00 cgciisoij 1 0.00 cgciiscgcy 1 0.00 cgciisccccy 1 0.00 cgciiscccccgciix 1 0.00 cgciiix 1 0.00 cgciiiscycgcy 1 0.00 cgciiiiiu 1 0.00 cgcicljccy 1 0.00 cgccois 1 0.00 cgcccljcccgcy 1 0.00 cgccccy 1 0.00 cgccccoix 1 0.00 cgcccccy 1 0.00 cgcccccgcy 1 0.00 ccqjoix 1 0.00 ccqjciix 1 0.00 ccqjciiis 1 0.00 ccqjccgcy 1 0.00 ccqgzccccgcy 1 0.00 ccqgccccgcy 1 0.00 ccoqjcy 1 0.00 ccoixo 1 0.00 ccoixljccccy 1 0.00 ccoixcy 1 0.00 ccoixccccy 1 0.00 ccocgciiiiu 1 0.00 ccljcy 1 0.00 ccljciix 1 0.00 ccljciisoix 1 0.00 ccljciis 1 0.00 ccljciiiiu 1 0.00 ccljcccy 1 0.00 cclgciiiiu 1 0.00 ccixz 1 0.00 ccixis 1 0.00 ccixciiiiiu 1 0.00 ccixcgciiiiu 1 0.00 ccixccqgzccccy 1 0.00 ccis 1 0.00 ccczcy 1 0.00 ccczciix 1 0.00 cccyqjcccy 1 0.00 cccqgoix 1 0.00 cccqgcy 1 0.00 cccqgcccgcy 1 0.00 cccqgccccgcy 1 0.00 cccqg 1 0.00 
cccoixcccgcy 1 0.00 cccoixccccy 1 0.00 cccljcois 1 0.00 cccljciij 1 0.00 cccljcciix 1 0.00 cccljccg 1 0.00 cccljccccg 1 0.00 cccljc 1 0.00 ccclj 1 0.00 ccciixoixcy 1 0.00 ccciisois 1 0.00 ccciiscy 1 0.00 cccgoix 1 0.00 cccgciij 1 0.00 cccgciiiu 1 0.00 ccccqjcccy 1 0.00 ccccqjcccgcy 1 0.00 ccccqgcccgcy 1 0.00 ccccois 1 0.00 ccccljco 1 0.00 ccccljcccy 1 0.00 cccciixois 1 0.00 ccccgois 1 0.00 ccccgcyljciis 1 0.00 ccccgcyix 1 0.00 ccccgcccy 1 0.00 cccccoix 1 0.00 ccccciiiiu 1 0.00 cccccgciis ----- ---- ---- 1148 1.00 TOT

Some conclusions:

  * The gallows characters \qj/ and \lj/ appear to be closely
    related: for every common word with \lj/, there appears to be
    a word with \qj/ that occurs with about 1/4 the frequency.

  * The same phenomenon can be noted with respect to prefixes
    containing \cc/ and \csc/: for every word beginning with \cc/,
    there is a word where the first \cc/ is replaced by \csc/,
    with practically the same frequency.

  * There appears to be much confusion between the suffixes
    \iu/ and \iiiu/.

  * There appears to be much confusion between \o/ and \ci/.

Recall also our previous guess that \cy/ is just the final form of \ci/.

Therefore, I have decided to do the following simplifications
before recomputing the consensus file:

  * Ignore, for the time being, the difference between \qj/ and \lj/,
    and between \qg/ and \lg/, replacing them by \H/ and \P/, respectively;

  * omit the \s/ plume after \c/;

  * replace all strings \iiu/, \iiiu/, \iiiiu/, etc. by \m/;

  * replace \cy/ by \ci/;

  * replace \o/ by \ci/.

Needless to say, I don't mean that these differences are meaningless;
it is just that there seems to be structure to be discovered
that does not depend on these features.

    cat bio-m-jsa.evt \
      | jsa2hec \
      | make-consensus-interlin \
      > bio-x-hec.evt

    cat bio-x-hec.evt \
      | egrep '^<.*;J> ' \
      | sed \
          -e 's/{[^}]*}//g' \
      > bio-j-hec.evt

    extract-words-from-interlin \
      -chars "mrcgiAeHP" \
      bio-j-hec.evt \
      bio-j-hec

jsa2hec
-------------------------------------------------
#!
/n/gnu/bin/sed -f # Recoding superanalytic to ad-hoc encoding: /^[^#]/s/ij/f/g /^[^#]/s/ix/e/g /^[^#]/s/cy/X/g /^[^#]/s/ci/X/g /^[^#]/s/iiiiu/m/g /^[^#]/s/iiiu/m/g /^[^#]/s/iiu/m/g /^[^#]/s/iis/v/g /^[^#]/s/is/r/g /^[^#]/s/X/ci/g /^[^#]/s/o/ci/g /^[^#]/s/cs/c/g /^[^#]/s/qci/A/g /^[^#]/s/qj/H/g /^[^#]/s/qg/P/g /^[^#]/s/lj/H/g /^[^#]/s/lg/P/g ------------------------------------------------- lines words bytes file ------ ------- --------- ------------ 7069 7069 51059 bio-j-hec.wds 1495 1495 14517 bio-j-hec.dic 5081 5081 37045 bio-j-hec-gut.wds 627 627 5347 bio-j-hec-gut.dic 929 929 3085 bio-j-hec-fun.wds 63 63 473 bio-j-hec-fun.dic 1059 1059 10929 bio-j-hec-bad.wds 805 805 8697 bio-j-hec-bad.dic Digraph counts: m r c g i A e H P TOT ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- . . 84 3090 . . 1376 288 173 70 5081 m 738 . . 5 . . . . . . 743 r 449 . . 154 . . . . 2 . 605 c 47 . . 7573 2197 6164 . 16 382 33 16412 g 50 . 1 2138 . . . 4 4 . 2197 i 2889 742 509 97 . 2 8 1274 600 45 6166 A 8 1 9 32 . . 1 137 1174 24 1386 e 888 . 2 614 . . 1 3 209 11 1728 H 10 . . 2528 . . . 6 . . 2544 P 2 . . 181 . . . . . . 183 ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- TOT 5081 743 605 16412 2197 6166 1386 1728 2544 183 37045 Next-symbol probability (× 99): m r c g i A e H P TOT ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- . . 2 60 . . 27 6 3 1 99 m 98 . . 1 . . . . . . 99 r 73 . . 25 . . . . . . 99 c . . . 46 13 37 . . 2 . 99 g 2 . . 96 . . . . . . 99 i 46 12 8 2 . . . 20 10 1 99 A 1 . 1 2 . . . 10 84 2 99 e 51 . . 35 . . . . 12 1 99 H . . . 98 . . . . . . 99 P 1 . . 98 . . . . . . 99 ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- TOT 14 2 2 44 6 16 4 5 7 0 37045 Previous-symbol probability (× 99): m r c g i A e H P TOT ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- . . 14 19 . . 98 17 7 38 14 m 14 . . . . . . . . . 2 r 9 . . 1 . . . . . . 2 c 1 . . 46 99 99 . 1 15 18 44 g 1 . . 13 . . 
. . . . 6 i 56 99 83 1 . . 1 73 23 24 16 A . . 1 . . . . 8 46 13 4 e 17 . . 4 . . . . 8 6 5 H . . . 15 . . . . . . 7 P . . . 1 . . . . . . 0 ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- TOT 99 99 99 99 99 99 99 99 99 99 37045 cat bio-j-hec.wds \ | /n/gnu/bin/awk -f foo.awk \ | sort | uniq -c | sort -nr | expand \ > bio-j-hec-wpairs.freq It seems that the gallows letters transected by \c..c/ are distinct letters. To check that, let's look at the distribution of \c/ strings on their own and around the \lj/ and \qj/ gallows, ignoring the \s/ plumes on \c/: cat bio-j-jsa.wds \ | sed -e 's/cs/c/' \ | enum-contexts -vPAT='[clqj]*c[clqj]*' -vLCTX=0 -vRCTX=1 \ | wfreq4 1848 cy 14 qjccccy 3 qcccg 1 qcqjccccg 672 ccccg 14 ccccljcy 3 cs 1 qci 469 ci 12 cccqg 3 cqjco 1 qccy 453 cg 11 ljco 3 cqjcccg 1 qcci 425 ljci 11 cqg 3 ccqg 1 qcccy 251 ljccg 11 cccljcccg 3 cccqjccg 1 qccccg 243 ljcccg 10 cccqjcy 3 cccqjcccg 1 ljcs 227 ccccy 10 ccccqjcy 3 ccccqjcccg 1 ljcg 149 qjci 9 ccs 3 ccccqg 1 ljccci 131 ljcccy 9 cccljcccy 3 cc 1 ljccccljccy 129 qjccg 9 cccc 2 qjcg 1 cqjcy 100 ljcy 8 qjco 2 qcqjccg 1 cqjcco 87 cci 8 cccljccg 2 qcljcccg 1 cqjcci 86 cccg 7 cljccy 2 ljcci 1 cqjccg 80 cccccg 7 cljcccy 2 ljccco 1 cqjccc 79 qjcccg 7 cccljci 2 ljcccccg 1 ccqjccg 75 cccccy 6 cccco 2 cqjccy 1 ccqjcccy 74 ccco 5 qjcco 2 cljci 1 ccljco 64 co 5 ljcco 2 cljccg 1 ccljcccy 55 cccljccy 5 cqjcccy 2 clg 1 ccljcccg 54 ljccy 5 ccy 2 ccqjo 1 cclg 52 cco 5 ccljci 2 ccqjci 1 cccs 52 cccy 5 ccg 2 ccljccg 1 cccqjci 44 qjcy 5 ccccljcccy 2 ccccljcci 1 cccljco 44 ccccljccy 5 ccccljcccg 2 ccccljccg 1 cccljcci 38 cccqjccy 5 ccccc 2 ccccci 1 cccljccccy 36 qjcccy 4 qjccco 2 ccc 1 cccljccccg 29 ccccqjccy 4 ljcccccy 1 qjcccccy 1 cccljc 26 qjccccg 4 cljcccg 1 qjccc 1 ccccs 26 ljccccg 4 ccqjcy 1 qjcc 1 ccccqjci 25 ljccccy 4 ccljcy 1 qcy 1 ccccqjcccy 24 qjccy 4 cccqjcccy 1 qcqjcy 1 ccccljco 24 cccci 4 ccclj 1 qcqjci 1 cccccqjcccy 23 ccci 4 cccccs 1 qcqjccy 1 ccccco 20 cccljcy 3 qjccci 1 qcqjcccy 1 
ccccccy 16 c Separating into categories: No gallows \lj/ gallows \qj/ gallows 1848 cy 425 ljci 149 qjci 469 ci 251 ljccg 129 qjccg 453 cg 243 ljcccg 79 qjcccg 672 ccccg 131 ljcccy 44 qjcy 227 ccccy 100 ljcy 36 qjcccy 87 cci 55 cccljccy 38 cccqjccy 86 cccg 54 ljccy 29 ccccqjccy 80 cccccg 44 ccccljccy 26 qjccccg 75 cccccy 26 ljccccg 24 qjccy 74 ccco 25 ljccccy 14 qjccccy 64 co 20 cccljcy 12 cccqg 52 cco 14 ccccljcy 11 cqg 52 cccy 11 ljco 10 cccqjcy 24 cccci 11 cccljcccg 10 ccccqjcy 23 ccci 8 qjco 16 c 97-07-15 stolfi =============== Created sample texts in English ("A mysterious affair at Styles") and Portuguese (Rober M. Rosi's master thesis), depunctualized and decapitalized. Extracted letter-in-context statistics for "t" and "d", "p" and "b" in those texts. Results posted to the Voynich list, in reply to Jacques Guy. Extracted comparative statistics for \c/ strings beginning with \cs/ and \c/: cat bio-j-jsa-gut.wds \ | sed \ -e 's/cs/S/g' \ -e 's/[ql][gj]/H/g' \ -e 's/cg/8/g' \ -e 's/cy/9/g' \ -e 's/ci/a/g' \ | sed -e 's/^/_/g' -e 's/$/_/g' \ | compare-contexts -rctx 1 '_cc[cS]*' '_Sc[cS]*' > .foo 192 0.33 _ccc8 219 0.38 _Scc8 97 0.17 _cccH 80 0.14 _SccH 67 0.11 _ccc9 71 0.12 _Scc9 52 0.09 _ccccH 56 0.10 _ScccH 31 0.05 _cccc9 38 0.07 _Sccc8 25 0.04 _ccco 30 0.05 _Scco 22 0.04 _cc8 23 0.04 _Sccc9 20 0.03 _cccc8 19 0.03 _Sco 17 0.03 _cco 9 0.02 _Sc9 17 0.03 _ccH 7 0.01 _Sca 13 0.02 _cca 6 0.01 _Scca 11 0.02 _ccca 6 0.01 _ScH 8 0.01 _cccS_ 6 0.01 _Sc8 6 0.01 _cc9 3 0.01 _Sccco 3 0.01 _ccccS_ 1 0.00 _SccccH 1 0.00 _cccco 1 0.00 _ScccS_ 1 0.00 _ccccco 1 0.00 _SccS_ 1 0.00 _cccca 1 0.00 _Sc_ 1 0.00 _cccSa ----- ---- ---- 1 0.00 _cccS9 577 1.00 TOT ----- ---- ---- 586 1.00 TOT Trying yet another variant encoding, "huc": cat bio-m-jsa.evt \ | jsa2huc \ | make-consensus-interlin \ > bio-x-huc.evt cat bio-x-huc.evt \ | egrep '^<.*;J> ' \ | sed \ -e 's/{[^}]*}//g' \ > bio-j-huc.evt extract-words-from-interlin \ -chars "mnrfcgiaoAeHP" \ bio-j-huc.evt \ bio-j-huc jsa2huc 
------------------------------------------------- #! /n/gnu/bin/sed -f # Recoding superanalytic to ad-hoc encoding: /^[^#]/s/ij/f/g /^[^#]/s/ix/e/g /^[^#]/s/cy/X/g /^[^#]/s/ci/X/g /^[^#]/s/iiiiu/m/g /^[^#]/s/iiiu/m/g /^[^#]/s/iiu/n/g /^[^#]/s/iis/v/g /^[^#]/s/is/r/g /^[^#]/s/X/ci/g /^[^#]/s/cs/c/g /^[^#]/s/qo/A/g /^[^#]/s/qj/H/g /^[^#]/s/qg/P/g /^[^#]/s/lj/H/g /^[^#]/s/lg/P/g ------------------------------------------------- lines words bytes file ------ ------- --------- ------------ 7098 7098 48799 bio-j-huc.wds 1705 1705 14931 bio-j-huc.dic 4742 4742 33342 bio-j-huc-gut.wds 737 737 5720 bio-j-huc-gut.dic 892 892 2815 bio-j-huc-fun.wds 42 42 296 bio-j-huc-fun.dic 1464 1464 12642 bio-j-huc-bad.wds 926 926 8915 bio-j-huc-bad.dic Digraph counts: m n r f c g i o A e H P TOT ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- . . . 81 . 1917 . . 991 1243 284 158 68 4742 m 350 . . . . 3 . . . . . . . 353 n 110 . . . . . . . 1 . . . . 111 r 415 . . . . 106 . . 37 . . . . 558 f 36 . . . . . . . 1 . . . . 37 c 47 . . . . 7238 2147 4186 248 . 15 377 32 14290 g 49 . . 1 . 2051 . . 37 . 5 4 . 2147 i 2862 341 109 290 33 52 . 2 3 2 423 69 2 4188 o 14 12 1 177 4 37 . . . 4 765 478 42 1534 A 7 . 1 7 . 30 . . 1 1 133 1047 24 1251 e 840 . . 2 . 489 . . 104 1 4 185 10 1635 H 10 . . . . 2227 . . 75 . 6 . . 2318 P 2 . . . . 140 . . 36 . . . . 178 ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- TOT 4742 353 111 558 37 14290 2147 4188 1534 1251 1635 2318 178 33342 Next-symbol probability (× 99): m n r f c g i o A e H P TOT ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- . . . 2 . 40 . . 21 26 6 3 1 99 m 98 . . . . 1 . . . . . . . 99 n 98 . . . . . . . 1 . . . . 99 r 74 . . . . 19 . . 7 . . . . 99 f 96 . . . . . . . 3 . . . . 99 c . . . . . 50 15 29 2 . . 3 . 99 g 2 . . . . 95 . . 2 . . . . 99 i 68 8 3 7 1 1 . . . . 10 2 . 99 o 1 1 . 11 . 2 . . . . 49 31 3 99 A 1 . . 1 . 2 . . . . 
11 83 2 99 e 51 . . . . 30 . . 6 . . 11 1 99 H . . . . . 95 . . 3 . . . . 99 P 1 . . . . 78 . . 20 . . . . 99 ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- TOT 14 1 0 2 0 42 6 12 5 4 5 7 1 33342 Previous-symbol probability (× 99): m n r f c g i o A e H P TOT ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- . . . 14 . 13 . . 64 98 17 7 38 14 m 7 . . . . . . . . . . . . 1 n 2 . . . . . . . . . . . . 0 r 9 . . . . 1 . . 2 . . . . 2 f 1 . . . . . . . . . . . . 0 c 1 . . . . 50 99 99 16 . 1 16 18 42 g 1 . . . . 14 . . 2 . . . . 6 i 60 96 97 51 88 . . . . . 26 3 1 12 o . 3 1 31 11 . . . . . 46 20 23 5 A . . 1 1 . . . . . . 8 45 13 4 e 18 . . . . 3 . . 7 . . 8 6 5 H . . . . . 15 . . 5 . . . . 7 P . . . . . 1 . . 2 . . . . 1 ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- TOT 99 99 99 99 99 99 99 99 99 99 99 99 99 33342 97-07-16 stolfi =============== I had already remarked that there were two main categories of words, \qoljc/-\qoqjc/ and \ccc/-\cscc/. Comparing the two lists, it seems that the latter can be split into two subclasses, with and without a gallows \lj/ (or \cljc/) in their "suffix". 
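The compare-contexts script used for these tabulations is not reproduced in this notebook. For anyone wanting to repeat a tabulation of this kind with standard tools only, here is a rough sketch under assumptions: the function name tabulate_classes is hypothetical, the input is assumed to hold one word per line, each pattern is assumed to be an extended regular expression matched against the whole word wrapped in "_" delimiters, and the relative-frequency column of the tables is omitted.

```shell
#!/bin/sh
# Sketch only: approximates a per-class word tabulation with standard tools.
# tabulate_classes is a hypothetical helper, NOT the notebook's compare-contexts.
# Usage: tabulate_classes FILE PATTERN...
#   FILE     text file with one word per line
#   PATTERN  extended regex matched against the whole word wrapped in "_"
tabulate_classes() {
  file="$1"
  shift
  for pat in "$@"; do
    # One header line per class, then "count word" lines, most frequent first.
    printf '== %s\n' "$pat"
    sed -e 's/^/_/' -e 's/$/_/' "$file" \
      | grep -E -e "^${pat}\$" \
      | sed -e 's/^_//' -e 's/_$//' \
      | sort | uniq -c | sort -nr
  done
}

# Example call, using the word file and patterns named in the text:
# tabulate_classes bio-j-huc-gut.wds '_AHc.*_' '_ccc[^HP]*_' '_ccc.*[HP].*_'
```

The sketch prints the classes one after another instead of side by side; the per-class outputs could be merged into columns with pr -m, as was done with .foo and .bar above.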
So I tabulated the three classes of words: cat bio-j-huc-gut.wds \ | sed -e 's/^/_/g' -e 's/$/_/g' \ | compare-contexts '_AHc.*_' '_ccc[^HP]*_' '_ccc.*[HP].*_' 201 0.20 AHccgci 387 0.46 ccccgci 90 0.29 cccHcci 191 0.19 AHcccgci 139 0.16 cccci 68 0.22 ccccHcci 116 0.11 AHcie 61 0.07 ccccci 31 0.10 cccHci 94 0.09 AHcim 60 0.07 cccccgci 26 0.08 ccccHci 87 0.09 AHccci 33 0.04 cccoe 15 0.05 cccHcccgci 79 0.08 AHci 21 0.02 cccgci 11 0.04 cccHccci 54 0.05 AHcin 17 0.02 cccor 10 0.03 cccHccgci 50 0.05 AHcir 14 0.02 ccci 5 0.02 ccccHcccgci 43 0.04 AHcci 10 0.01 cccir 5 0.02 cccPccci 19 0.02 AHccccgci 10 0.01 cccc 4 0.01 ccccHccci 12 0.01 AHcccci 9 0.01 ccccgcim 4 0.01 cccH 7 0.01 AHcieci 9 0.01 ccccgcie 3 0.01 cccHcir 5 0.00 AHcccg 8 0.01 ccccir 2 0.01 cccoHci 4 0.00 AHcor 7 0.01 cccie 2 0.01 ccccPcci 4 0.00 AHcif 7 0.01 ccccie 2 0.01 ccccHccie 4 0.00 AHccgcir 7 0.01 ccccg 2 0.01 ccccHccgci 3 0.00 AHciecgci 4 0.00 ccccoe 2 0.01 cccPccccgci 3 0.00 AHcicgci 4 0.00 ccccgcir 2 0.01 cccHcim 3 0.00 AHccg 4 0.00 ccccc 2 0.01 cccHcif 2 0.00 AHcic 3 0.00 cccif 2 0.01 cccHc 2 0.00 AHccgcie 3 0.00 cccgcir 1 0.00 ccciHccci 2 0.00 AHcccgcie 3 0.00 ccccgoe 1 0.00 cccciHci 2 0.00 AHcccccgci 2 0.00 ccccieci 1 0.00 ccccgciHcir 1 0.00 AHcoeci 1 0.00 cccoeoe 1 0.00 cccccHcci 1 0.00 AHcoe 1 0.00 cccoeo 1 0.00 cccccHccci 1 0.00 AHcirci 1 0.00 cccoecgci 1 0.00 ccccPcccgci 1 0.00 AHcircgci 1 0.00 cccoecccgci 1 0.00 ccccHco 1 0.00 AHciie 1 0.00 cccoecccci 1 0.00 ccccHcccgccci 1 0.00 AHcieoe 1 0.00 ccciror 1 0.00 cccPoe 1 0.00 AHciecgcgci 1 0.00 cccirci 1 0.00 cccPci 1 0.00 AHciecccci 1 0.00 cccieoeci 1 0.00 cccPcccgci 1 0.00 AHciec 1 0.00 cccgoe 1 0.00 cccP 1 0.00 AHcicccgci 1 0.00 cccgcin 1 0.00 cccHcor 1 0.00 AHcgci 1 0.00 cccgcif 1 0.00 cccHcie 1 0.00 AHccoe 1 0.00 cccgcie 1 0.00 cccHccie 1 0.00 AHccir 1 0.00 ccccor 1 0.00 cccHccg 1 0.00 AHccie 1 0.00 ccccieor 1 0.00 cccHcccie 1 0.00 AHccgor 1 0.00 ccccgor 1 0.00 cccHcccci 1 0.00 AHccgcim 1 0.00 ccccgcif 1 0.00 cccHccccg 1 0.00 AHccgccgci 1 
0.00 ccccgciecgci ---- ---- ---- 1 0.00 AHccgccccgci 1 0.00 ccccgccci 307 1.00 TOT 1 0.00 AHccccg 1 0.00 cccccoe 1 0.00 AHccccci 1 0.00 cccccim 1 0.00 AHccccHcci 1 0.00 cccccie 1 0.00 AHccc 1 0.00 cccccgcir 1 0.00 AHcc 1 0.00 cccccg ---- ---- ---- 1 0.00 cccccci 1010 1.00 TOT ---- ---- ---- 846 1.00 TOT There are only 34 words that begin with "AH" but not "AHc". Of these, 21 are "AHoe", which may well be misreadings of "AHcie". Among the remaining words, there seems to be four additional major classes: cat bio-j-huc-gut.wds \ | sed -e 's/^/_/g' -e 's/$/_/g' \ | compare-contexts '_oHc.*_' '_cgc.*_' '_oeHc.*_' '_Hc.*_' 85 0.20 oHccgci 72 0.22 cgcim 28 0.21 Hccgci 20 0.19 oeHccgci 57 0.13 oHcccgci 51 0.16 cgcir 18 0.13 Hccccgci 19 0.18 oeHccci 41 0.09 oHcim 50 0.15 cgcie 16 0.12 Hcccgci 15 0.15 oeHcccgci 40 0.09 oHcie 36 0.11 cgci 10 0.07 Hcie 10 0.10 oeHci 36 0.08 oHcir 27 0.08 cgccccgci 10 0.07 Hcccci 7 0.07 oeHcir 35 0.08 oHccci 12 0.04 cgcin 8 0.06 Hcir 7 0.07 oeHcim 28 0.06 oHci 6 0.02 cgcif 7 0.05 Hcim 5 0.05 oeHcin 22 0.05 oHcci 6 0.02 cgcieci 5 0.04 Hccci 5 0.05 oeHcie 16 0.04 oHcin 5 0.02 cgcirci 4 0.03 Hccoe 4 0.04 oeHcci 13 0.03 oHcccci 5 0.02 cgcccgci 3 0.02 Hcoe 3 0.03 oeHcccci 10 0.02 oHccccgci 4 0.01 cgcccoe 3 0.02 Hcccoe 2 0.02 oeHccccgci 6 0.01 oHcoe 3 0.01 cgciecgci 2 0.01 Hcin 1 0.01 oeHcoe 4 0.01 oHcieor 3 0.01 cgccoe 2 0.01 Hci 1 0.01 oeHcif 4 0.01 oHcieci 3 0.01 cgcccci 2 0.01 Hcci 1 0.01 oeHcicgci 4 0.01 oHccgcir 3 0.01 cgccccci 2 0.01 Hcccg 1 0.01 oeHccg 3 0.01 oHcccg 2 0.01 cgcieo 1 0.01 Hcircie 1 0.01 oeHcccir 2 0.00 oHcieoe 2 0.01 cgciecccgci 1 0.01 Hcirci 1 0.01 oeHccccci 2 0.00 oHciecgci 2 0.01 cgcieccccgci 1 0.01 Hcif --- ---- ---- 2 0.00 oHccgcie 2 0.01 cgciHccgci 1 0.01 Hcieci 103 1.00 TOT 1 0.00 oHcoeor 2 0.01 cgciHcccgci 1 0.01 Hcic 1 0.00 oHcoecgci 2 0.01 cgccor 1 0.01 HciAHci 1 0.00 oHciroeof 2 0.01 cgccgci 1 0.01 Hccgcioe 1 0.00 oHcircif 2 0.01 cgccci 1 0.01 Hccgcie 1 0.00 oHcioc 2 0.01 cgcccccgci 1 0.01 HcccoHcccgci 1 0.00 
oHcieccci 1 0.00 cgcirorci 1 0.01 Hcccir 1 0.00 oHciecccgci 1 0.00 cgcirof 1 0.01 Hcccie 1 0.00 oHcieccccgci 1 0.00 cgcircccci 1 0.01 HcccgoeHcgci 1 0.00 oHcieHci 1 0.00 cgcircccccgcie 1 0.01 Hcccgcif 1 0.00 oHcicgci 1 0.00 cgciie 1 0.01 Hccccgcir 1 0.00 oHcgcie 1 0.00 cgcieoe 1 0.01 Hccccci 1 0.00 oHccor 1 0.00 cgciecirci --- ---- ---- 1 0.00 oHccoe 1 0.00 cgciecircg 135 1.00 TOT 1 0.00 oHccocgci 1 0.00 cgciecir 1 0.00 oHccoHcir 1 0.00 cgciecim 1 0.00 oHccgcim 1 0.00 cgciecie 1 0.00 oHccgcif 1 0.00 cgcieHci 1 0.00 oHccgcieor 1 0.00 cgcicHcci 1 0.00 oHccgccci 1 0.00 cgciHcim 1 0.00 oHccg 1 0.00 cgciHci 1 0.00 oHcccir 1 0.00 cgciHcci 1 0.00 oHcccie 1 0.00 cgciHccci 1 0.00 oHcccgcie 1 0.00 cgccoecicg 1 0.00 oHccccic 1 0.00 cgccccoe 1 0.00 oHccccci 1 0.00 cgcccccgcie --- ---- ---- 1 0.00 cgccccc 435 1.00 TOT 1 0.00 cgcccHcccgci --- ---- ---- 326 1.00 TOT The rest, again, seems to contain two major classes of words: cat bio-j-huc-gut.wds \ | sed -e 's/^/_/g' -e 's/$/_/g' \ | compare-contexts '_oec.*_' '_ec.*_' 73 0.39 eccccgci 38 0.29 oeccccgci 20 0.11 ecccci 22 0.17 oeci 8 0.04 ecgci 19 0.15 oecccci 7 0.04 ecccgci 9 0.07 oecgci 6 0.03 eci 6 0.05 oecccgci 6 0.03 eccccci 4 0.03 oeccci 6 0.03 ecccccgci 4 0.03 oeccccci 5 0.03 eccci 3 0.02 oecin 5 0.03 eccccg 3 0.02 oecim 5 0.03 eccccHcci 3 0.02 oecccoe 3 0.02 ecie 3 0.02 oeccccg 3 0.02 eccor 2 0.02 oecie 3 0.02 ecccoe 2 0.02 oecccHcci 3 0.02 ecccHcci 2 0.02 oec 2 0.01 ecir 1 0.01 oecir 2 0.01 ecim 1 0.01 oecimci 2 0.01 ecce 1 0.01 oeccieci 2 0.01 eccccHcccgci 1 0.01 oecccor 2 0.01 ecccHci 1 0.01 oecccgcir 2 0.01 ecc 1 0.01 oeccccif 1 0.01 ecif 1 0.01 oecccccgci 1 0.01 ecgoe 1 0.01 oecccccg 1 0.01 ecgcir 1 0.01 oeccccc 1 0.01 ecgcim 1 0.01 oeccccHcci 1 0.01 ecgcieor 1 0.01 oeccHci 1 0.01 ecg --- ---- ---- 1 0.01 ecco 131 1.00 TOT 1 0.01 ecccir 1 0.01 ecccgciA 1 0.01 eccccor 1 0.01 eccccir 1 0.01 eccccif 1 0.01 eccccgcirci 1 0.01 eccccgcir 1 0.01 eccccgcie 1 0.01 eccccc 1 0.01 eccccHcie 1 0.01 eccccHccci 1 0.01 eccc 1 0.01 
ecccPcccgci --- ---- ---- 185 1.00 TOT Yet some more: cat bio-j-huc-gut.wds \ | sed -e 's/^/_/g' -e 's/$/_/g' \ | compare-contexts '_oPc.*_' '_Pc.*_' '_cic.*_' '_ciHc.*_' 18 0.49 oPccccgci 17 0.40 Pccccgci 12 0.32 ciccccgci 12 0.23 ciHccgci 4 0.11 oPcccci 4 0.10 Pcccoe 5 0.13 cicccci 12 0.23 ciHcccgci 2 0.05 oPcir 4 0.10 Pccccgcir 4 0.11 ciccccci 10 0.19 ciHccci 2 0.05 oPci 3 0.07 Pcccgci 2 0.05 cicccccgci 4 0.08 ciHcim 2 0.05 oPcccgci 3 0.07 Pcccci 1 0.03 cicir 3 0.06 ciHcie 2 0.05 oPccccci 2 0.05 Pccor 1 0.03 cicgcircie 2 0.04 ciHcir 1 0.03 oPcieci 2 0.05 Pccoe 1 0.03 cicgcircccci 2 0.04 ciHcin 1 0.03 oPcieccccgci 1 0.02 Pcir 1 0.03 cicgcim 2 0.04 ciHcci 1 0.03 oPcie 1 0.02 PciHccgci 1 0.03 cicgci 2 0.04 ciHcccgcir 1 0.03 oPcieHcim 1 0.02 Pcgoe 1 0.03 cicccoe 1 0.02 ciHci 1 0.03 oPcccieci 1 0.02 Pcgcieccor 1 0.03 cicccciecgci 1 0.02 ciHcccoe 1 0.03 oPccccgcie 1 0.02 Pcccoecgci 1 0.03 ciccccgcir 1 0.02 ciHccccgci 1 0.03 oPccccg 1 0.02 Pccciroe 1 0.03 ciccccg 1 0.02 ciHccccci --- ---- ---- 1 0.02 Pccccgcie 1 0.03 cicccccci --- ---- ---- 37 1.00 TOT --- ---- ---- 1 0.03 ciccccc 53 1.00 TOT 42 1.00 TOT 1 0.03 ciccccHcci 1 0.03 ciccccHccci 1 0.03 cicPcim 1 0.03 cicHccci --- ---- ---- 38 1.00 TOT Yet some more (yawn): cat bio-j-huc-gut.wds \ | sed -e 's/^/_/g' -e 's/$/_/g' \ | compare-contexts '_APc.*_' '_rc.*_' '_Aec.*_' '_eHc.*_' 13 0.22 rcim 8 0.20 eHccgci 9 0.24 Aecccci 11 0.55 APccccgci 10 0.17 rccccgci 7 0.17 eHcccgci 7 0.19 Aeci 2 0.10 APcccgci 6 0.10 rcie 6 0.15 eHcim 7 0.19 Aeccccgci 2 0.10 APcccci 5 0.08 rcccci 4 0.10 eHcir 3 0.08 Aecccccgci 1 0.05 APci 3 0.05 rcir 3 0.07 eHccci 2 0.05 Aecim 1 0.05 APcgci 3 0.05 rcif 2 0.05 eHcin 2 0.05 Aecie 1 0.05 APccoe 2 0.03 rci 2 0.05 eHci 2 0.05 Aeccci 1 0.05 APcccoe 2 0.03 rccccg 2 0.05 eHccccgci 1 0.03 Aecir 1 0.05 APcccgcie 2 0.03 rcccccgci 1 0.02 eHcoecgci 1 0.03 Aecin --- ---- ---- 2 0.03 rcccHci 1 0.02 eHcifo 1 0.03 Aecgci 20 1.00 TOT 1 0.02 rcirci 1 0.02 eHcieor 1 0.03 Aecccoe 1 0.02 rcin 1 0.02 eHcci 1 0.03 
Aecccgci 1 0.02 rcieci 1 0.02 eHccgcci --- ---- ---- 1 0.02 rciecce 1 0.02 eHcccg 37 1.00 TOT 1 0.02 rcicHcci 1 0.02 eHcccci 1 0.02 rccci --- ---- ---- 1 0.02 rcccgci 41 1.00 TOT 1 0.02 rcccciecg 1 0.02 rccccie 1 0.02 rccccci 1 0.02 rcccc 1 0.02 rccc --- ---- ---- 60 1.00 TOT

On casual inspection, many of these classes seem to consist of two
superimposed classes, differing by a "c/cc" switch. For instance, the
'_ccc.*[HP].*_' class seems to be the union of '_cccHc.*_' and
'_ccccHc.*_'.

Let's try to identify the suffixes:

    /bin/rm -f .title
    /bin/rm -f .table
    /bin/echo "_" > .table
    set noglob
    set ofmt = "0"
    set npat = 1
    foreach pat ( \
        'cc[^H]' 'cc.*H' \
        'ci[^H]' 'ci.*H' \
        'cg[^H]' 'cg.*H' \
        'oe[^H]' 'oe.*H' \
        'Ae[^H]' 'Ae.*H' \
        'e[^H]' 'e.*H' \
        'r[^H]' 'r.*H' \
        'AH' 'oH' \
        'AP' 'oP' \
        'H' 'P' \
      )
      /n/gnu/bin/printf " %7s" "${pat}" >> .title
      /bin/cat bio-j-huc-gut.wds \
        | /n/gnu/bin/sed -e 's/^/_/g' -e 's/$/_/g' \
        | /n/gnu/bin/egrep "_${pat}[^H]*_" \
        | /n/gnu/bin/sed -e "s/_${pat}//g" \
        | /n/gnu/bin/sort | uniq -c \
        | /n/gnu/bin/expand \
        > .suff.frq
      /n/gnu/bin/join -a 1 -a 2 -1 1 -2 2 -o"${ofmt},2.1" -e 0 .table .suff.frq > .tmp
      /bin/mv .tmp .table
      @ npat = ${npat} + 1
      set ofmt = "${ofmt},1.${npat}"
    end
    unset noglob
    /n/gnu/bin/printf "\n" >> .title

    cat .table \
      | /n/gnu/bin/gawk '/./ { s=0; for(i=2;i<=NF;i++) s+=$(i); print s, $0 }' \
      | sort -nr \
      > .tbsort

    cat .title .tbsort \
      | format-suffix-table

Here are the results:

    SUFFIX       TOTAL  cc[^H]   cc.*H  ci[^H]   ci.*H  cg[^H]   cg.*H  oe[^H]   oe.*H
    ------------ ----- ------- ------- ------- ------- ------- ------- ------- -------
    TOTALORUM     4262     966     322     103      58     345      18     158     115
    ------------ ----- ------- ------- ------- ------- ------- ------- ------- -------
    cccgci         503       0      21      14      12      27       3      38      15
    ccgci          455      60      15       0      12       5       6       6      20
    cgci           392     388       0       0       0       2       0       0       0
    ci             353     139      68       7       2       0       2       0      13
    cci            327      61     160       0       4       2       2       4       8
    ccci           254       1      20       5      12       3       3      21      20
    cie            187       7       4       0       3       0       0       0       5
    cim            167       1       5       0       4       0       1       0       7
    cir            125       8       5       1       2       0       0       0       8
    ccccgci        125       1       0       4       1       3       0       5       3
    im              92       0       0
0 0 72 0 3 0 oe 90 33 3 3 0 1 0 0 1 e 90 37 0 0 0 17 0 7 0 i 87 14 0 0 0 36 0 22 0 cin 81 0 0 0 2 0 0 0 5 _ 78 7 4 47 0 4 0 4 0 cccci 71 0 2 5 0 3 1 4 3 ie 70 7 0 0 0 50 0 2 0 ir 69 10 0 1 0 51 0 1 0 r 42 13 0 0 0 3 0 13 0 gci 40 21 0 1 0 0 0 9 0 m 32 30 0 0 0 0 0 0 0 or 31 17 0 2 0 0 0 0 0 ccoe 22 1 0 1 0 4 0 3 0 cccg 22 0 0 1 0 0 0 3 0 coe 19 4 1 0 0 3 0 0 1 in 17 0 0 0 0 12 0 3 0 cieci 16 2 0 0 0 0 0 1 0 c 15 11 2 0 0 0 0 0 0 if 13 3 0 0 0 6 0 0 0 cor 11 1 1 0 0 2 0 0 0 cgcim 11 11 0 0 0 0 0 0 0 cgcie 10 9 0 0 0 0 0 0 0 ccgcir 10 1 0 0 0 0 0 1 0 cccoe 10 0 0 0 1 1 0 0 0 cif 9 0 2 0 0 0 0 0 1 cg 8 7 0 0 0 0 0 0 0 ccccci 8 0 0 1 1 0 0 0 1 irci 7 1 0 0 0 5 0 0 0 ieci 7 0 0 0 0 6 0 0 0 eci 7 2 0 0 0 0 0 0 0 ccg 7 1 1 0 0 0 0 0 1 cc 7 4 0 0 0 0 0 0 0 cieor 6 1 0 0 0 0 0 0 0 ciecgci 6 0 0 1 0 0 0 0 0 cicgci 5 0 0 0 0 0 0 0 1 cgcir 5 4 0 0 0 0 0 0 0 ccie 5 1 3 0 0 0 0 0 0 ccgcie 5 0 0 0 0 0 0 0 0 cccgcie 5 0 0 0 0 0 0 0 0 ccccgcir 5 0 0 0 0 0 0 0 0 Pccci 5 5 0 0 0 0 0 0 0 oecgci 4 1 0 0 0 0 0 0 0 oecccgci 4 1 0 0 0 0 0 0 0 gcir 4 3 0 0 0 0 0 0 0 ecgci 4 3 0 0 0 0 0 0 0 cgoe 4 3 0 0 0 0 0 0 0 ccor 4 0 0 0 0 0 0 1 0 cccir 4 0 0 0 0 0 0 0 1 cccie 4 0 1 0 0 0 0 0 0 cccgcir 4 0 0 1 2 0 0 0 0 ccccg 4 0 1 0 0 0 0 1 0 cccc 4 0 0 1 0 1 0 1 0 om 3 0 0 0 0 0 0 0 1 oeccccgci 3 0 0 0 0 0 0 0 0 iecgci 3 0 0 0 0 3 0 0 0 ecccci 3 1 0 0 0 0 0 1 0 circi 3 0 0 0 0 0 0 0 0 cieoe 3 0 0 0 0 0 0 0 0 cic 3 0 0 0 0 0 0 0 0 ccccgcie 3 0 0 0 0 1 0 0 0 cccccgci 3 1 0 0 0 0 0 0 0 roe 2 0 0 0 0 0 0 1 0 rcim 2 0 0 0 0 1 0 0 0 rcie 2 1 0 0 0 0 0 0 0 oeoe 2 1 0 0 0 0 0 0 0 oeci 2 0 0 0 0 0 0 0 0 oeccci 2 0 0 0 0 0 0 0 0 oecccci 2 1 0 0 0 0 0 0 0 ieo 2 0 0 0 0 2 0 0 0 iecccgci 2 0 0 0 0 2 0 0 0 ieccccgci 2 0 0 0 0 2 0 0 0 goe 2 1 0 0 0 0 0 0 0 gcim 2 0 0 1 0 0 0 0 0 eor 2 0 0 0 0 0 0 0 0 coecgci 2 0 0 0 0 0 0 0 0 co 2 0 1 0 0 0 0 0 0 cieccccgci 2 0 0 0 0 0 0 0 0 ce 2 0 0 0 0 0 0 0 0 ccir 2 0 0 0 0 0 0 0 0 ccgcim 2 0 0 0 0 0 0 0 0 cccif 2 0 0 0 0 0 0 1 0 ccc 2 0 0 0 0 0 0 0 0 cPcci 2 2 0 0 0 0 0 0 0 
cPcccgci 2 2 0 0 0 0 0 0 0 Pccccgci 2 2 0 0 0 0 0 0 0 rcin 1 0 0 0 0 0 0 0 0 oroeccccgci 1 0 0 0 0 0 0 0 0 orci 1 0 0 1 0 0 0 0 0 orccci 1 0 0 0 0 0 0 0 0 of 1 0 0 0 0 0 0 1 0 oeo 1 1 0 0 0 0 0 0 0 oecircir 1 0 0 0 0 0 0 0 0 oecim 1 0 0 0 0 0 0 0 0 oeciecccgci 1 0 0 0 0 0 0 0 0 oecgcie 1 0 0 0 0 0 0 0 0 oecgccccgci 1 0 0 0 0 0 0 0 0 oecccgcie 1 0 0 0 0 0 0 0 0 oecccg 1 0 0 0 0 0 0 0 0 oeccccg 1 0 0 0 0 0 0 0 0 oePci 1 0 0 0 0 0 0 0 0 ocgcie 1 0 0 0 0 0 0 0 0 oPci 1 0 0 0 0 0 0 0 0 no 1 1 0 0 0 0 0 0 0 irorci 1 0 0 0 0 1 0 0 0 iror 1 1 0 0 0 0 0 0 0 irof 1 0 0 0 0 1 0 0 0 ircccci 1 0 0 0 0 1 0 0 0 ircccccgcie 1 0 0 0 0 1 0 0 0 imci 1 0 0 0 0 0 0 1 0 iie 1 0 0 0 0 1 0 0 0 ieoeci 1 1 0 0 0 0 0 0 0 ieoe 1 0 0 0 0 1 0 0 0 iecirci 1 0 0 0 0 1 0 0 0 iecircg 1 0 0 0 0 1 0 0 0 iecir 1 0 0 0 0 1 0 0 0 iecim 1 0 0 0 0 1 0 0 0 iecie 1 0 0 0 0 1 0 0 0 iecce 1 0 0 0 0 0 0 0 0 gcircie 1 0 0 1 0 0 0 0 0 gcircccci 1 0 0 1 0 0 0 0 0 gcin 1 1 0 0 0 0 0 0 0 gcif 1 1 0 0 0 0 0 0 0 gcieor 1 0 0 0 0 0 0 0 0 gcie 1 1 0 0 0 0 0 0 0 g 1 0 0 0 0 0 0 0 0 eof 1 0 0 0 0 0 0 0 0 eoe 1 0 0 0 0 0 0 0 0 eo 1 1 0 0 0 0 0 0 0 eccccgci 1 0 0 0 0 1 0 0 0 eccccg 1 1 0 0 0 0 0 0 0 ecccccgci 1 0 0 0 0 1 0 0 0 ePccccgci 1 0 0 0 0 1 0 0 0 coeor 1 0 0 0 0 0 0 0 0 coecicg 1 0 0 0 0 1 0 0 0 coeci 1 0 0 0 0 0 0 0 0 ciroeof 1 0 0 0 0 0 0 0 0 ciroe 1 0 1 0 0 0 0 0 0 circif 1 0 0 0 0 0 0 0 0 circie 1 0 0 0 0 0 0 0 0 circgci 1 0 0 0 0 0 0 0 0 cioc 1 0 0 0 0 0 0 0 0 ciie 1 0 0 0 0 0 0 0 0 cifo 1 0 0 0 0 0 0 0 0 ciecgcgci 1 0 0 0 0 0 0 0 0 cieccci 1 0 0 0 0 0 0 0 0 ciecccgci 1 0 0 0 0 0 0 0 0 ciecccci 1 0 0 0 0 0 0 0 0 ciec 1 0 0 0 0 0 0 0 0 cicccgci 1 0 0 0 0 0 0 0 0 cgor 1 1 0 0 0 0 0 0 0 cgcif 1 1 0 0 0 0 0 0 0 cgciecgci 1 1 0 0 0 0 0 0 0 cgcieccor 1 0 0 0 0 0 0 0 0 cgccci 1 1 0 0 0 0 0 0 0 ccoeci 1 0 0 0 0 0 0 0 0 ccocgci 1 0 0 0 0 0 0 0 0 ccim 1 1 0 0 0 0 0 0 0 ccgor 1 0 0 0 0 0 0 0 0 ccgcioe 1 0 0 0 0 0 0 0 0 ccgcif 1 0 0 0 0 0 0 0 0 ccgcieor 1 0 0 0 0 0 0 0 0 ccgciA 1 0 0 0 0 0 0 0 0 ccgcci 1 0 0 0 0 0 0 0 0 
ccgccgci 1 0 0 0 0 0 0 0 0 ccgccci 1 0 0 0 0 0 0 0 0 ccgccccgci 1 0 0 0 0 0 0 0 0 cccor 1 0 0 0 0 0 0 0 0 cccoecgci 1 0 0 0 0 0 0 0 0 ccciroe 1 0 0 0 0 0 0 0 0 cccieci 1 0 0 0 0 0 0 0 0 ccciecgci 1 0 0 1 0 0 0 0 0 ccciecg 1 0 0 0 0 0 0 0 0 cccgcirci 1 0 0 0 0 0 0 0 0 cccgcif 1 0 0 0 0 0 0 0 0 cccgccci 1 0 1 0 0 0 0 0 0 ccccic 1 0 0 0 0 0 0 0 0 ccccc 1 0 0 1 0 0 0 0 0 ccPcccgci 1 0 0 0 0 0 0 0 0 ccPccccci 1 1 0 0 0 0 0 0 0 Poe 1 1 0 0 0 0 0 0 0 Pcim 1 0 0 1 0 0 0 0 0 Pci 1 1 0 0 0 0 0 0 0 Pcccgci 1 1 0 0 0 0 0 0 0 P 1 1 0 0 0 0 0 0 0 SUFFIX TOTAL Ae[^H] Ae.*H e[^H] e.*H r[^H] r.*H AH oH AP oP H P ------------ ----- ------ ------ ------ ------ ------ ------ ------ ------ ------ ------ ------ TOTALORUM 4262 38 13 219 61 73 3 1043 451 23 40 150 63 ------------ ----- ------ ------ ------ ------ ------ ------ ------ ------ ------ ------ ------ cccgci 503 7 2 74 9 10 0 191 57 2 2 16 3 ccgci 455 1 0 7 8 1 0 201 85 0 0 28 0 cgci 392 0 0 0 0 0 0 1 0 1 0 0 0 ci 353 0 4 0 4 0 2 79 28 1 2 2 0 cci 327 2 1 5 9 1 1 43 22 0 0 2 0 ccci 254 9 3 21 4 5 0 87 35 0 0 5 0 cie 187 0 0 0 1 0 0 116 40 0 1 10 0 cim 167 0 0 0 7 0 0 94 41 0 0 7 0 cir 125 0 0 0 4 0 0 50 36 0 2 8 1 ccccgci 125 3 0 8 2 2 0 19 10 11 18 18 17 im 92 2 0 2 0 13 0 0 0 0 0 0 0 oe 90 0 0 1 1 0 0 21 10 1 1 6 8 e 90 1 0 16 1 9 0 2 0 0 0 0 0 i 87 7 0 6 0 2 0 0 0 0 0 0 0 cin 81 0 0 0 2 0 0 54 16 0 0 2 0 _ 78 0 0 5 0 0 0 2 3 1 0 1 0 cccci 71 0 1 6 1 1 0 12 13 2 4 10 3 ie 70 2 0 3 0 6 0 0 0 0 0 0 0 ir 69 1 0 2 0 3 0 0 0 0 0 0 0 r 42 0 0 10 0 3 0 0 0 0 0 0 0 gci 40 1 0 8 0 0 0 0 0 0 0 0 0 m 32 0 0 0 0 2 0 0 0 0 0 0 0 or 31 0 0 0 0 0 0 4 2 1 1 4 0 ccoe 22 1 0 3 0 0 0 1 1 1 0 4 2 cccg 22 0 0 5 1 2 0 5 3 0 0 2 0 coe 19 0 0 0 0 0 0 1 6 0 0 3 0 in 17 1 0 0 0 1 0 0 0 0 0 0 0 cieci 16 0 0 0 0 0 0 7 4 0 1 1 0 c 15 0 0 2 0 0 0 0 0 0 0 0 0 if 13 0 0 1 0 3 0 0 0 0 0 0 0 cor 11 0 0 3 0 0 0 4 0 0 0 0 0 cgcim 11 0 0 0 0 0 0 0 0 0 0 0 0 cgcie 10 0 0 0 0 0 0 0 1 0 0 0 0 ccgcir 10 0 0 0 0 0 0 4 4 0 0 0 0 cccoe 10 0 0 0 0 0 0 0 0 1 0 3 4 cif 9 0 
0 0 1 0 0 4 0 0 0 1 0 cg 8 0 0 1 0 0 0 0 0 0 0 0 0 ccccci 8 0 0 0 0 0 0 1 1 0 2 1 0 irci 7 0 0 0 0 1 0 0 0 0 0 0 0 ieci 7 0 0 0 0 1 0 0 0 0 0 0 0 eci 7 0 0 4 0 1 0 0 0 0 0 0 0 ccg 7 0 0 0 0 0 0 3 1 0 0 0 0 cc 7 0 0 1 0 1 0 1 0 0 0 0 0 cieor 6 0 0 0 1 0 0 0 4 0 0 0 0 ciecgci 6 0 0 0 0 0 0 3 2 0 0 0 0 cicgci 5 0 0 0 0 0 0 3 1 0 0 0 0 cgcir 5 0 0 1 0 0 0 0 0 0 0 0 0 ccie 5 0 0 0 0 0 0 1 0 0 0 0 0 ccgcie 5 0 0 0 0 0 0 2 2 0 0 1 0 cccgcie 5 0 0 1 0 0 0 2 1 1 0 0 0 ccccgcir 5 0 0 0 0 0 0 0 0 0 0 1 4 Pccci 5 0 0 0 0 0 0 0 0 0 0 0 0 oecgci 4 0 0 0 0 0 0 1 1 0 0 1 0 oecccgci 4 0 0 0 0 0 0 0 0 0 0 1 2 gcir 4 0 0 1 0 0 0 0 0 0 0 0 0 ecgci 4 0 0 1 0 0 0 0 0 0 0 0 0 cgoe 4 0 0 0 0 0 0 0 0 0 0 0 1 ccor 4 0 0 0 0 0 0 0 1 0 0 0 2 cccir 4 0 0 1 0 0 0 0 1 0 0 1 0 cccie 4 0 0 0 0 1 0 0 1 0 0 1 0 cccgcir 4 0 0 1 0 0 0 0 0 0 0 0 0 ccccg 4 0 0 0 0 0 0 1 0 0 1 0 0 cccc 4 0 0 1 0 0 0 0 0 0 0 0 0 om 3 0 0 0 0 0 0 1 0 0 0 0 1 oeccccgci 3 0 0 0 0 0 0 1 0 0 0 0 2 iecgci 3 0 0 0 0 0 0 0 0 0 0 0 0 ecccci 3 0 0 0 1 0 0 0 0 0 0 0 0 circi 3 0 1 0 0 0 0 1 0 0 0 1 0 cieoe 3 0 0 0 0 0 0 1 2 0 0 0 0 cic 3 0 0 0 0 0 0 2 0 0 0 1 0 ccccgcie 3 0 0 0 0 0 0 0 0 0 1 0 1 cccccgci 3 0 0 0 0 0 0 2 0 0 0 0 0 roe 2 0 0 0 0 1 0 0 0 0 0 0 0 rcim 2 0 0 1 0 0 0 0 0 0 0 0 0 rcie 2 0 0 1 0 0 0 0 0 0 0 0 0 oeoe 2 0 0 0 0 0 0 1 0 0 0 0 0 oeci 2 0 0 0 0 0 0 0 0 0 2 0 0 oeccci 2 0 0 0 0 0 0 0 0 0 0 0 2 oecccci 2 0 0 0 0 0 0 0 0 0 0 0 1 ieo 2 0 0 0 0 0 0 0 0 0 0 0 0 iecccgci 2 0 0 0 0 0 0 0 0 0 0 0 0 ieccccgci 2 0 0 0 0 0 0 0 0 0 0 0 0 goe 2 0 0 1 0 0 0 0 0 0 0 0 0 gcim 2 0 0 1 0 0 0 0 0 0 0 0 0 eor 2 0 0 1 0 0 0 0 1 0 0 0 0 coecgci 2 0 0 0 1 0 0 0 1 0 0 0 0 co 2 0 0 1 0 0 0 0 0 0 0 0 0 cieccccgci 2 0 0 0 0 0 0 0 1 0 1 0 0 ce 2 0 0 2 0 0 0 0 0 0 0 0 0 ccir 2 0 0 1 0 0 0 1 0 0 0 0 0 ccgcim 2 0 0 0 0 0 0 1 1 0 0 0 0 cccif 2 0 0 1 0 0 0 0 0 0 0 0 0 ccc 2 0 0 0 0 1 0 1 0 0 0 0 0 cPcci 2 0 0 0 0 0 0 0 0 0 0 0 0 cPcccgci 2 0 0 0 0 0 0 0 0 0 0 0 0 Pccccgci 2 0 0 0 0 0 0 0 0 0 0 0 0 rcin 1 0 0 1 0 0 0 0 0 0 0 0 0 oroeccccgci 1 0 0 
0 0 0 0 0 0 0 0 1 0 orci 1 0 0 0 0 0 0 0 0 0 0 0 0 orccci 1 0 0 0 0 0 0 0 0 0 0 1 0 of 1 0 0 0 0 0 0 0 0 0 0 0 0 oeo 1 0 0 0 0 0 0 0 0 0 0 0 0 oecircir 1 0 0 0 0 0 0 0 0 0 0 0 1 oecim 1 0 0 0 0 0 0 0 0 0 0 0 1 oeciecccgci 1 0 0 0 0 0 0 0 0 0 0 0 1 oecgcie 1 0 0 0 0 0 0 0 0 0 0 1 0 oecgccccgci 1 0 0 0 0 0 0 0 0 0 0 0 1 oecccgcie 1 0 0 0 0 0 0 0 0 0 0 0 1 oecccg 1 0 0 0 0 0 0 0 0 0 0 1 0 oeccccg 1 0 0 0 0 0 0 0 0 0 0 0 1 oePci 1 0 0 0 0 0 0 0 0 0 0 1 0 ocgcie 1 0 0 0 1 0 0 0 0 0 0 0 0 oPci 1 0 0 0 0 0 0 1 0 0 0 0 0 no 1 0 0 0 0 0 0 0 0 0 0 0 0 irorci 1 0 0 0 0 0 0 0 0 0 0 0 0 iror 1 0 0 0 0 0 0 0 0 0 0 0 0 irof 1 0 0 0 0 0 0 0 0 0 0 0 0 ircccci 1 0 0 0 0 0 0 0 0 0 0 0 0 ircccccgcie 1 0 0 0 0 0 0 0 0 0 0 0 0 imci 1 0 0 0 0 0 0 0 0 0 0 0 0 iie 1 0 0 0 0 0 0 0 0 0 0 0 0 ieoeci 1 0 0 0 0 0 0 0 0 0 0 0 0 ieoe 1 0 0 0 0 0 0 0 0 0 0 0 0 iecirci 1 0 0 0 0 0 0 0 0 0 0 0 0 iecircg 1 0 0 0 0 0 0 0 0 0 0 0 0 iecir 1 0 0 0 0 0 0 0 0 0 0 0 0 iecim 1 0 0 0 0 0 0 0 0 0 0 0 0 iecie 1 0 0 0 0 0 0 0 0 0 0 0 0 iecce 1 0 0 0 0 1 0 0 0 0 0 0 0 gcircie 1 0 0 0 0 0 0 0 0 0 0 0 0 gcircccci 1 0 0 0 0 0 0 0 0 0 0 0 0 gcin 1 0 0 0 0 0 0 0 0 0 0 0 0 gcif 1 0 0 0 0 0 0 0 0 0 0 0 0 gcieor 1 0 0 1 0 0 0 0 0 0 0 0 0 gcie 1 0 0 0 0 0 0 0 0 0 0 0 0 g 1 0 0 1 0 0 0 0 0 0 0 0 0 eof 1 0 0 1 0 0 0 0 0 0 0 0 0 eoe 1 0 0 0 0 0 0 0 1 0 0 0 0 eo 1 0 0 0 0 0 0 0 0 0 0 0 0 eccccgci 1 0 0 0 0 0 0 0 0 0 0 0 0 eccccg 1 0 0 0 0 0 0 0 0 0 0 0 0 ecccccgci 1 0 0 0 0 0 0 0 0 0 0 0 0 ePccccgci 1 0 0 0 0 0 0 0 0 0 0 0 0 coeor 1 0 0 0 0 0 0 0 1 0 0 0 0 coecicg 1 0 0 0 0 0 0 0 0 0 0 0 0 coeci 1 0 0 0 0 0 0 1 0 0 0 0 0 ciroeof 1 0 0 0 0 0 0 0 1 0 0 0 0 ciroe 1 0 0 0 0 0 0 0 0 0 0 0 0 circif 1 0 0 0 0 0 0 0 1 0 0 0 0 circie 1 0 0 0 0 0 0 0 0 0 0 1 0 circgci 1 0 0 0 0 0 0 1 0 0 0 0 0 cioc 1 0 0 0 0 0 0 0 1 0 0 0 0 ciie 1 0 0 0 0 0 0 1 0 0 0 0 0 cifo 1 0 0 0 1 0 0 0 0 0 0 0 0 ciecgcgci 1 0 0 0 0 0 0 1 0 0 0 0 0 cieccci 1 0 0 0 0 0 0 0 1 0 0 0 0 ciecccgci 1 0 0 0 0 0 0 0 1 0 0 0 0 ciecccci 1 0 0 0 0 0 0 1 0 0 0 0 0 ciec 1 0 0 0 
                                                 0      0      0      1      0      0      0      0      0
    cicccgci         1      0      0      0      0      0      0      1      0      0      0      0      0
    cgor             1      0      0      0      0      0      0      0      0      0      0      0      0
    cgcif            1      0      0      0      0      0      0      0      0      0      0      0      0
    cgciecgci        1      0      0      0      0      0      0      0      0      0      0      0      0
    cgcieccor        1      0      0      0      0      0      0      0      0      0      0      0      1
    cgccci           1      0      0      0      0      0      0      0      0      0      0      0      0
    ccoeci           1      0      1      0      0      0      0      0      0      0      0      0      0
    ccocgci          1      0      0      0      0      0      0      0      1      0      0      0      0
    ccim             1      0      0      0      0      0      0      0      0      0      0      0      0
    ccgor            1      0      0      0      0      0      0      1      0      0      0      0      0
    ccgcioe          1      0      0      0      0      0      0      0      0      0      0      1      0
    ccgcif           1      0      0      0      0      0      0      0      1      0      0      0      0
    ccgcieor         1      0      0      0      0      0      0      0      1      0      0      0      0
    ccgciA           1      0      0      1      0      0      0      0      0      0      0      0      0
    ccgcci           1      0      0      0      1      0      0      0      0      0      0      0      0
    ccgccgci         1      0      0      0      0      0      0      1      0      0      0      0      0
    ccgccci          1      0      0      0      0      0      0      0      1      0      0      0      0
    ccgccccgci       1      0      0      0      0      0      0      1      0      0      0      0      0
    cccor            1      0      0      1      0      0      0      0      0      0      0      0      0
    cccoecgci        1      0      0      0      0      0      0      0      0      0      0      0      1
    ccciroe          1      0      0      0      0      0      0      0      0      0      0      0      1
    cccieci          1      0      0      0      0      0      0      0      0      0      1      0      0
    ccciecgci        1      0      0      0      0      0      0      0      0      0      0      0      0
    ccciecg          1      0      0      0      0      1      0      0      0      0      0      0      0
    cccgcirci        1      0      0      1      0      0      0      0      0      0      0      0      0
    cccgcif          1      0      0      0      0      0      0      0      0      0      0      1      0
    cccgccci         1      0      0      0      0      0      0      0      0      0      0      0      0
    ccccic           1      0      0      0      0      0      0      0      1      0      0      0      0
    ccccc            1      0      0      0      0      0      0      0      0      0      0      0      0
    ccPcccgci        1      0      0      1      0      0      0      0      0      0      0      0      0
    ccPccccci        1      0      0      0      0      0      0      0      0      0      0      0      0
    Poe              1      0      0      0      0      0      0      0      0      0      0      0      0
    Pcim             1      0      0      0      0      0      0      0      0      0      0      0      0
    Pci              1      0      0      0      0      0      0      0      0      0      0      0      0
    Pcccgci          1      0      0      0      0      0      0      0      0      0      0      0      0
    P                1      0      0      0      0      0      0      0      0      0      0      0      0

  Beware that there is some ambiguity in the "cc[^H]" column: words
  like "ccccic" could be parsed as "cc" + "ccic", as "c" + "cccic",
  or as "" + "ccccic". Thus, we should probably exclude the "cc"
  prefix while we decide which suffixes are valid. Also, the "[^H]"
  in some prefixes should be removed, since it requires non-empty
  suffixes.
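The two problems noted above can be made concrete with a small sketch. It is written in Python rather than the csh/sed pipeline of this notebook, and the helper names `c_run_splits` and `suffix_after` are hypothetical, introduced only for illustration. It assumes that a prefix pattern is matched at the start of the word and that the suffix is whatever "H"-free text follows it.

```python
import re

def c_run_splits(word, max_prefix=2):
    """List every (prefix, suffix) split whose prefix is 0..max_prefix leading 'c's."""
    splits = []
    for k in range(max_prefix + 1):
        prefix = "c" * k
        if word.startswith(prefix):
            splits.append((prefix, word[k:]))
    return splits

def suffix_after(prefix_pat, word):
    """Return the H-free text that follows prefix_pat in word, or None if no match."""
    m = re.fullmatch("(?:" + prefix_pat + ")([^H]*)", word)
    return None if m is None else m.group(1)

# The three competing parses of "ccccic".
print(c_run_splits("ccccic"))

# A "[^H]" inside the prefix pattern consumes one character of the word,
# so the bare word "e" (empty suffix) can never match.
print(suffix_after("e[^H]", "e"))    # no match
print(suffix_after("e", "e"))        # matches with an empty suffix
print(suffix_after("e[^H]", "eci"))  # the "[^H]" swallows the "c"
```

The last call shows a side effect worth keeping in mind: with "[^H]" in the prefix, the first letter of the real suffix is silently dropped from the tabulated suffix.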
  Let's try again, fixing these bugs, sorting the prefixes by
  importance, and also excluding the "r" prefix:

    /bin/rm -f .title
    /bin/rm -f .table
    # Start the table with a single key, "_", which stands for the empty suffix.
    /bin/echo "_" > .table
    set noglob
    set ofmt = "0"
    set npat = 1
    foreach pat ( \
        'AH'    'oH'  'cg'    'cc.*H' \
        'e'     'oe'  'H'     'oe.*H' \
        'ci'    'P'   'e.*H'  'ci.*H' \
        'oP'    'Ae'  'AP'    'cg.*H' \
        'Ae.*H' \
      )
      /n/gnu/bin/printf " %7s" "${pat}" >> .title
      # Count the H-free suffixes that follow this prefix pattern.
      /bin/cat bio-j-huc-gut.wds \
        | /n/gnu/bin/sed -e 's/^/_/g' -e 's/$/_/g' \
        | /n/gnu/bin/egrep "_${pat}[^H]*_" \
        | /n/gnu/bin/sed -e "s/_${pat}//g" -e '/../s/_$//g' \
        | /n/gnu/bin/sort | uniq -c \
        | /n/gnu/bin/expand \
        > .suff.frq
      # Append the counts as a new column of .table (0 where a suffix is absent).
      /n/gnu/bin/join -a 1 -a 2 -1 1 -2 2 -o"${ofmt},2.1" -e 0 .table .suff.frq > .tmp
      /bin/mv .tmp .table
      @ npat = ${npat} + 1
      set ofmt = "${ofmt},1.${npat}"
    end
    unset noglob
    /n/gnu/bin/printf "\n" >> .title
    # Prepend each row's total and sort by decreasing total.
    cat .table \
      | /n/gnu/bin/gawk '/./ { s=0; for(i=2;i<=NF;i++) s+=$(i); print s, $0 }' \
      | sort -nr \
      > .tbsort
    cat .title .tbsort \
      | format-suffix-table

  Here are the results:

    SUFFIX       TOTAL    AH    oH    cg cc.*H     e    oe     H oe.*H    ci
    ------------ ----- ----- ----- ----- ----- ----- ----- ----- ----- -----
    TOTALORUM     3437  1043   451   345   322   223   289   150   115   104
    ------------ ----- ----- ----- ----- ----- ----- ----- ----- ----- -----
    ccgci          377   201    85     2    15     0     0    28    20     0
    cccgci         352   191    57     5    21     7     6    16    15     0
    ci             276    79    28    36    68     6    22     2    13     0
    ccccgci        256    19    10    27     0    73    38    18     3    12
    cci            251    43    22     0   160     0     0     2     8     0
    cim            245    94    41    72     5     2     3     7     7     0
    cie            237   116    40    50     4     3     2    10     5     0
    _              228     2     3     0     4     4   131     1     0     1
    ccci           202    87    35     2    20     5     4     5    20     0
    cir            172    50    36    51     5     2     1     8     8     1
    cccci          108    12    13     3     2    20    19    10     3     5
    cin             97    54    16    12     0     0     3     2     5     0
    oe              93    21    10    17     3    16     7     6     1     0
    or              38     4     2     3     0    10    13     4     0     0
    ccccci          24     1     1     3     0     6     4     1     1     4
    m               21     0     0     0     0     0     0     0     0    21
    cgci            21     1     0     0     0     8     9     0     0     1
    cccoe           21     0     0     4     0     3     3     3     0     1
    e               19     2     0     4     0     0     0     0     0    12
    cieci           19     7     4     6     0     0     0     1     0     0
    cif             16     4     0     6     2     1     0     1     1     0
    cccccgci        16     2     0     2     0     6     1     0     0     2
    coe             12     1     6     0     1     0     0     3     1     0
    ccoe            12     1     1     3     0     0     0     4     0     0
    ccccg           12     1     0     0     1     5     3     0     0     1
    cccg            11     5     3     0     0     0     0     2     0     0
    r                8     0     0     0     0     1     0     0     0     7
    circi
8 1 0 5 0 0 0 1 0 0 ciecgci 8 3 2 3 0 0 0 0 0 0 ccor 8 0 1 2 0 3 0 0 0 0 ccgcir 8 4 4 0 0 0 0 0 0 0 ccccgcir 7 0 0 0 0 1 0 1 0 1 oeci 6 0 0 0 0 4 0 0 0 0 ccg 6 3 1 0 1 0 0 0 1 0 Pccccgci 6 0 0 0 0 1 4 0 0 1 o 5 0 0 0 0 4 1 0 0 0 cor 5 4 0 0 1 0 0 0 0 0 cieor 5 0 4 0 0 0 0 0 0 0 cicgci 5 3 1 0 0 0 0 0 1 0 ccgcie 5 2 2 0 0 0 0 1 0 0 oecgci 4 1 1 0 0 1 0 1 0 0 oeccccgci 4 1 0 1 0 0 0 0 0 0 n 4 0 0 0 0 0 0 0 0 4 cieoe 4 1 2 1 0 0 0 0 0 0 cieccccgci 4 0 1 2 0 0 0 0 0 0 ccie 4 1 0 0 3 0 0 0 0 0 cccir 4 0 1 0 0 1 0 1 1 0 cccgcie 4 2 1 0 0 0 0 0 0 0 ccccc 4 0 0 1 0 1 1 0 0 1 c 4 0 0 0 2 0 2 0 0 0 roe 3 0 0 1 0 0 0 0 0 2 om 3 1 0 0 0 0 0 0 1 0 oecccgci 3 0 0 0 0 0 0 1 0 0 f 3 0 0 0 0 0 0 0 0 3 eci 3 0 0 0 0 0 0 0 0 3 ciecccgci 3 0 1 2 0 0 0 0 0 0 cic 3 2 0 0 0 0 0 1 0 0 cccie 3 0 1 0 1 0 0 1 0 0 cccgcir 3 0 0 0 0 0 1 0 0 0 ccccgcie 3 0 0 0 0 1 0 0 0 0 cc 3 1 0 0 0 2 0 0 0 0 rci 2 0 0 0 0 0 0 0 0 2 orcim 2 0 0 1 0 1 0 0 0 0 oeccci 2 0 0 0 0 0 0 0 0 0 oecccci 2 0 0 0 0 0 1 0 0 0 mci 2 0 0 0 0 0 0 0 0 2 eor 2 0 1 0 0 0 0 0 0 1 eoe 2 0 1 0 0 0 0 0 0 1 eccci 2 0 0 0 0 0 2 0 0 0 ecccgci 2 0 0 0 0 0 0 0 0 2 eccccgci 2 0 0 1 0 0 0 0 0 1 coecgci 2 0 1 0 0 0 0 0 0 0 ciie 2 1 0 1 0 0 0 0 0 0 cieo 2 0 0 2 0 0 0 0 0 0 cgoe 2 0 0 0 0 1 0 0 0 0 cgcim 2 0 0 0 0 1 0 0 0 1 ccgcim 2 1 1 0 0 0 0 0 0 0 cce 2 0 0 0 0 2 0 0 0 0 ccccif 2 0 0 0 0 1 1 0 0 0 ccc 2 1 0 0 0 1 0 0 0 0 ror 1 0 0 0 0 0 0 0 0 1 rcir 1 0 0 0 0 0 0 0 0 1 oroeccccgci 1 0 0 0 0 0 0 1 0 0 oroe 1 0 0 0 0 0 1 0 0 0 orcin 1 0 0 0 0 1 0 0 0 0 orcie 1 0 0 0 0 1 0 0 0 0 orccci 1 0 0 0 0 0 0 1 0 0 oeor 1 0 0 0 0 1 0 0 0 0 oeof 1 0 0 0 0 1 0 0 0 0 oeoe 1 1 0 0 0 0 0 0 0 0 oecircir 1 0 0 0 0 0 0 0 0 0 oecim 1 0 0 0 0 0 0 0 0 0 oeciecccgci 1 0 0 0 0 0 0 0 0 0 oecgcie 1 0 0 0 0 0 0 1 0 0 oecgccccgci 1 0 0 0 0 0 0 0 0 0 oecccgcie 1 0 0 0 0 0 0 0 0 0 oecccg 1 0 0 0 0 0 0 1 0 0 oeccccg 1 0 0 0 0 0 0 0 0 0 oecccccgci 1 0 0 1 0 0 0 0 0 0 oePci 1 0 0 0 0 0 0 1 0 0 oePccccgci 1 0 0 1 0 0 0 0 0 0 ocgcie 1 0 0 0 0 0 0 0 0 0 ocg 1 0 0 0 0 1 0 0 0 0 
occci 1 0 0 0 0 1 0 0 0 0 occccgci 1 0 0 0 0 1 0 0 0 0 oPci 1 1 0 0 0 0 0 0 0 0 eorci 1 0 0 0 0 0 0 0 0 1 eof 1 0 0 0 0 0 1 0 0 0 eciecgci 1 0 0 0 0 0 0 0 0 1 ecgcir 1 0 0 0 0 1 0 0 0 0 ecccci 1 0 0 0 0 0 0 0 0 0 eccccc 1 0 0 0 0 0 0 0 0 1 coeor 1 0 1 0 0 0 0 0 0 0 coeci 1 1 0 0 0 0 0 0 0 0 co 1 0 0 0 1 0 0 0 0 0 cirorci 1 0 0 1 0 0 0 0 0 0 cirof 1 0 0 1 0 0 0 0 0 0 ciroeof 1 0 1 0 0 0 0 0 0 0 ciroe 1 0 0 0 1 0 0 0 0 0 circif 1 0 1 0 0 0 0 0 0 0 circie 1 0 0 0 0 0 0 1 0 0 circgci 1 1 0 0 0 0 0 0 0 0 circccci 1 0 0 1 0 0 0 0 0 0 circccccgcie 1 0 0 1 0 0 0 0 0 0 cioc 1 0 1 0 0 0 0 0 0 0 cimci 1 0 0 0 0 0 1 0 0 0 cifo 1 0 0 0 0 0 0 0 0 0 ciecirci 1 0 0 1 0 0 0 0 0 0 ciecircg 1 0 0 1 0 0 0 0 0 0 ciecir 1 0 0 1 0 0 0 0 0 0 ciecim 1 0 0 1 0 0 0 0 0 0 ciecie 1 0 0 1 0 0 0 0 0 0 ciecgcgci 1 1 0 0 0 0 0 0 0 0 cieccci 1 0 1 0 0 0 0 0 0 0 ciecccci 1 1 0 0 0 0 0 0 0 0 ciec 1 1 0 0 0 0 0 0 0 0 cicccgci 1 1 0 0 0 0 0 0 0 0 cgcircie 1 0 0 0 0 0 0 0 0 1 cgcircccci 1 0 0 0 0 0 0 0 0 1 cgcir 1 0 0 0 0 1 0 0 0 0 cgcieor 1 0 0 0 0 1 0 0 0 0 cgcieccor 1 0 0 0 0 0 0 0 0 0 cgcie 1 0 1 0 0 0 0 0 0 0 cg 1 0 0 0 0 1 0 0 0 0 ccoecicg 1 0 0 1 0 0 0 0 0 0 ccoeci 1 0 0 0 0 0 0 0 0 0 ccocgci 1 0 1 0 0 0 0 0 0 0 cco 1 0 0 0 0 1 0 0 0 0 ccir 1 1 0 0 0 0 0 0 0 0 ccieci 1 0 0 0 0 0 1 0 0 0 ccgor 1 1 0 0 0 0 0 0 0 0 ccgcioe 1 0 0 0 0 0 0 1 0 0 ccgcif 1 0 1 0 0 0 0 0 0 0 ccgcieor 1 0 1 0 0 0 0 0 0 0 ccgcci 1 0 0 0 0 0 0 0 0 0 ccgccgci 1 1 0 0 0 0 0 0 0 0 ccgccci 1 0 1 0 0 0 0 0 0 0 ccgccccgci 1 1 0 0 0 0 0 0 0 0 cccor 1 0 0 0 0 0 1 0 0 0 cccoecgci 1 0 0 0 0 0 0 0 0 0 ccciroe 1 0 0 0 0 0 0 0 0 0 cccieci 1 0 0 0 0 0 0 0 0 0 cccgcif 1 0 0 0 0 0 0 1 0 0 cccgciA 1 0 0 0 0 1 0 0 0 0 cccgccci 1 0 0 0 1 0 0 0 0 0 ccccor 1 0 0 0 0 1 0 0 0 0 ccccoe 1 0 0 1 0 0 0 0 0 0 ccccir 1 0 0 0 0 1 0 0 0 0 cccciecgci 1 0 0 0 0 0 0 0 0 1 ccccic 1 0 1 0 0 0 0 0 0 0 ccccgcirci 1 0 0 0 0 1 0 0 0 0 cccccgcie 1 0 0 1 0 0 0 0 0 0 cccccg 1 0 0 0 0 0 1 0 0 0 cccccci 1 0 0 0 0 0 0 0 0 1 cccPcccgci 1 0 0 0 0 1 0 0 0 0 cPcim 1 0 0 0 0 
0 0 0 0 1 Poe 1 0 0 0 0 1 0 0 0 0 Pcccgci 1 0 0 0 0 1 0 0 0 0 Pcccci 1 0 0 0 0 0 0 0 0 1 A 1 0 0 0 0 0 1 0 0 0 SUFFIX TOTAL P e.*H ci.*H oP Ae AP cg.*H Ae.*H ------------ ----- ----- ----- ----- ----- ----- ----- ----- ----- TOTALORUM 3437 63 61 58 40 119 23 18 13 ------------ ----- ----- ----- ----- ----- ----- ----- ----- ----- ccgci 377 0 8 12 0 0 0 6 0 cccgci 352 3 9 12 2 1 2 3 2 ci 276 0 4 2 2 7 1 2 4 ccccgci 256 17 2 1 18 7 11 0 0 cci 251 0 9 4 0 0 0 2 1 cim 245 0 7 4 0 2 0 1 0 cie 237 0 1 3 1 2 0 0 0 _ 228 0 0 0 0 81 1 0 0 ccci 202 0 4 12 0 2 0 3 3 cir 172 1 4 2 2 1 0 0 0 cccci 108 3 1 0 4 9 2 1 1 cin 97 0 2 2 0 1 0 0 0 oe 93 8 1 0 1 1 1 0 0 or 38 0 0 0 1 0 1 0 0 ccccci 24 0 0 1 2 0 0 0 0 m 21 0 0 0 0 0 0 0 0 cgci 21 0 0 0 0 1 1 0 0 cccoe 21 4 0 1 0 1 1 0 0 e 19 0 1 0 0 0 0 0 0 cieci 19 0 0 0 1 0 0 0 0 cif 16 0 1 0 0 0 0 0 0 cccccgci 16 0 0 0 0 3 0 0 0 coe 12 0 0 0 0 0 0 0 0 ccoe 12 2 0 0 0 0 1 0 0 ccccg 12 0 0 0 1 0 0 0 0 cccg 11 0 1 0 0 0 0 0 0 r 8 0 0 0 0 0 0 0 0 circi 8 0 0 0 0 0 0 0 1 ciecgci 8 0 0 0 0 0 0 0 0 ccor 8 2 0 0 0 0 0 0 0 ccgcir 8 0 0 0 0 0 0 0 0 ccccgcir 7 4 0 0 0 0 0 0 0 oeci 6 0 0 0 2 0 0 0 0 ccg 6 0 0 0 0 0 0 0 0 Pccccgci 6 0 0 0 0 0 0 0 0 o 5 0 0 0 0 0 0 0 0 cor 5 0 0 0 0 0 0 0 0 cieor 5 0 1 0 0 0 0 0 0 cicgci 5 0 0 0 0 0 0 0 0 ccgcie 5 0 0 0 0 0 0 0 0 oecgci 4 0 0 0 0 0 0 0 0 oeccccgci 4 2 0 0 0 0 0 0 0 n 4 0 0 0 0 0 0 0 0 cieoe 4 0 0 0 0 0 0 0 0 cieccccgci 4 0 0 0 1 0 0 0 0 ccie 4 0 0 0 0 0 0 0 0 cccir 4 0 0 0 0 0 0 0 0 cccgcie 4 0 0 0 0 0 1 0 0 ccccc 4 0 0 0 0 0 0 0 0 c 4 0 0 0 0 0 0 0 0 roe 3 0 0 0 0 0 0 0 0 om 3 1 0 0 0 0 0 0 0 oecccgci 3 2 0 0 0 0 0 0 0 f 3 0 0 0 0 0 0 0 0 eci 3 0 0 0 0 0 0 0 0 ciecccgci 3 0 0 0 0 0 0 0 0 cic 3 0 0 0 0 0 0 0 0 cccie 3 0 0 0 0 0 0 0 0 cccgcir 3 0 0 2 0 0 0 0 0 ccccgcie 3 1 0 0 1 0 0 0 0 cc 3 0 0 0 0 0 0 0 0 rci 2 0 0 0 0 0 0 0 0 orcim 2 0 0 0 0 0 0 0 0 oeccci 2 2 0 0 0 0 0 0 0 oecccci 2 1 0 0 0 0 0 0 0 mci 2 0 0 0 0 0 0 0 0 eor 2 0 0 0 0 0 0 0 0 eoe 2 0 0 0 0 0 0 0 0 eccci 2 0 0 0 0 0 0 0 0 ecccgci 
2 0 0 0 0 0 0 0 0 eccccgci 2 0 0 0 0 0 0 0 0 coecgci 2 0 1 0 0 0 0 0 0 ciie 2 0 0 0 0 0 0 0 0 cieo 2 0 0 0 0 0 0 0 0 cgoe 2 1 0 0 0 0 0 0 0 cgcim 2 0 0 0 0 0 0 0 0 ccgcim 2 0 0 0 0 0 0 0 0 cce 2 0 0 0 0 0 0 0 0 ccccif 2 0 0 0 0 0 0 0 0 ccc 2 0 0 0 0 0 0 0 0 ror 1 0 0 0 0 0 0 0 0 rcir 1 0 0 0 0 0 0 0 0 oroeccccgci 1 0 0 0 0 0 0 0 0 oroe 1 0 0 0 0 0 0 0 0 orcin 1 0 0 0 0 0 0 0 0 orcie 1 0 0 0 0 0 0 0 0 orccci 1 0 0 0 0 0 0 0 0 oeor 1 0 0 0 0 0 0 0 0 oeof 1 0 0 0 0 0 0 0 0 oeoe 1 0 0 0 0 0 0 0 0 oecircir 1 1 0 0 0 0 0 0 0 oecim 1 1 0 0 0 0 0 0 0 oeciecccgci 1 1 0 0 0 0 0 0 0 oecgcie 1 0 0 0 0 0 0 0 0 oecgccccgci 1 1 0 0 0 0 0 0 0 oecccgcie 1 1 0 0 0 0 0 0 0 oecccg 1 0 0 0 0 0 0 0 0 oeccccg 1 1 0 0 0 0 0 0 0 oecccccgci 1 0 0 0 0 0 0 0 0 oePci 1 0 0 0 0 0 0 0 0 oePccccgci 1 0 0 0 0 0 0 0 0 ocgcie 1 0 1 0 0 0 0 0 0 ocg 1 0 0 0 0 0 0 0 0 occci 1 0 0 0 0 0 0 0 0 occccgci 1 0 0 0 0 0 0 0 0 oPci 1 0 0 0 0 0 0 0 0 eorci 1 0 0 0 0 0 0 0 0 eof 1 0 0 0 0 0 0 0 0 eciecgci 1 0 0 0 0 0 0 0 0 ecgcir 1 0 0 0 0 0 0 0 0 ecccci 1 0 1 0 0 0 0 0 0 eccccc 1 0 0 0 0 0 0 0 0 coeor 1 0 0 0 0 0 0 0 0 coeci 1 0 0 0 0 0 0 0 0 co 1 0 0 0 0 0 0 0 0 cirorci 1 0 0 0 0 0 0 0 0 cirof 1 0 0 0 0 0 0 0 0 ciroeof 1 0 0 0 0 0 0 0 0 ciroe 1 0 0 0 0 0 0 0 0 circif 1 0 0 0 0 0 0 0 0 circie 1 0 0 0 0 0 0 0 0 circgci 1 0 0 0 0 0 0 0 0 circccci 1 0 0 0 0 0 0 0 0 circccccgcie 1 0 0 0 0 0 0 0 0 cioc 1 0 0 0 0 0 0 0 0 cimci 1 0 0 0 0 0 0 0 0 cifo 1 0 1 0 0 0 0 0 0 ciecirci 1 0 0 0 0 0 0 0 0 ciecircg 1 0 0 0 0 0 0 0 0 ciecir 1 0 0 0 0 0 0 0 0 ciecim 1 0 0 0 0 0 0 0 0 ciecie 1 0 0 0 0 0 0 0 0 ciecgcgci 1 0 0 0 0 0 0 0 0 cieccci 1 0 0 0 0 0 0 0 0 ciecccci 1 0 0 0 0 0 0 0 0 ciec 1 0 0 0 0 0 0 0 0 cicccgci 1 0 0 0 0 0 0 0 0 cgcircie 1 0 0 0 0 0 0 0 0 cgcircccci 1 0 0 0 0 0 0 0 0 cgcir 1 0 0 0 0 0 0 0 0 cgcieor 1 0 0 0 0 0 0 0 0 cgcieccor 1 1 0 0 0 0 0 0 0 cgcie 1 0 0 0 0 0 0 0 0 cg 1 0 0 0 0 0 0 0 0 ccoecicg 1 0 0 0 0 0 0 0 0 ccoeci 1 0 0 0 0 0 0 0 1 ccocgci 1 0 0 0 0 0 0 0 0 cco 1 0 0 0 0 0 0 0 0 ccir 1 0 0 0 0 0 0 0 0 
    ccieci           1     0     0     0     0     0     0     0     0
    ccgor            1     0     0     0     0     0     0     0     0
    ccgcioe          1     0     0     0     0     0     0     0     0
    ccgcif           1     0     0     0     0     0     0     0     0
    ccgcieor         1     0     0     0     0     0     0     0     0
    ccgcci           1     0     1     0     0     0     0     0     0
    ccgccgci         1     0     0     0     0     0     0     0     0
    ccgccci          1     0     0     0     0     0     0     0     0
    ccgccccgci       1     0     0     0     0     0     0     0     0
    cccor            1     0     0     0     0     0     0     0     0
    cccoecgci        1     1     0     0     0     0     0     0     0
    ccciroe          1     1     0     0     0     0     0     0     0
    cccieci          1     0     0     0     1     0     0     0     0
    cccgcif          1     0     0     0     0     0     0     0     0
    cccgciA          1     0     0     0     0     0     0     0     0
    cccgccci         1     0     0     0     0     0     0     0     0
    ccccor           1     0     0     0     0     0     0     0     0
    ccccoe           1     0     0     0     0     0     0     0     0
    ccccir           1     0     0     0     0     0     0     0     0
    cccciecgci       1     0     0     0     0     0     0     0     0
    ccccic           1     0     0     0     0     0     0     0     0
    ccccgcirci       1     0     0     0     0     0     0     0     0
    cccccgcie        1     0     0     0     0     0     0     0     0
    cccccg           1     0     0     0     0     0     0     0     0
    cccccci          1     0     0     0     0     0     0     0     0
    cccPcccgci       1     0     0     0     0     0     0     0     0
    cPcim            1     0     0     0     0     0     0     0     0
    Poe              1     0     0     0     0     0     0     0     0
    Pcccgci          1     0     0     0     0     0     0     0     0
    Pcccci           1     0     0     0     0     0     0     0     0
    A                1     0     0     0     0     0     0     0     0

  Analysis: The prefixes "e", "oe", "ci" (without "H") do not appear
  to be equivalent to the other prefixes above. However, "ec", "oec",
  and "cic" do appear to fit in.

  The empty string does not appear to be a valid suffix (yay!)
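For readers who want to repeat this kind of tabulation without the csh/sed/join toolchain, here is a minimal Python sketch of the same idea. It is not the script used in this notebook: the function names `suffix_table` and `sorted_rows` are hypothetical, the word list is made up for illustration, and the sketch assumes one word per list entry with repetitions kept. As in the tables above, "_" stands for the empty suffix.

```python
import re
from collections import Counter

def suffix_table(words, prefix_pats):
    """Count, per prefix pattern, the H-free suffixes that follow it.

    Returns {suffix: [count for each pattern]}; "_" stands for the empty suffix.
    """
    table = {}
    for col, pat in enumerate(prefix_pats):
        rx = re.compile("(?:" + pat + ")([^H]*)")
        counts = Counter()
        for w in words:
            m = rx.fullmatch(w)
            if m:
                counts[m.group(1) or "_"] += 1
        for suffix, n in counts.items():
            table.setdefault(suffix, [0] * len(prefix_pats))[col] = n
    return table

def sorted_rows(table):
    """Rows as (suffix, total, counts), most frequent suffix first."""
    rows = [(s, sum(c), c) for s, c in table.items()]
    return sorted(rows, key=lambda r: (-r[1], r[0]))

# Made-up words for illustration only; not taken from the transcription.
words = ["AHccgci", "AHccgci", "oHccgci", "AHci", "e", "eccgci"]
for suffix, total, counts in sorted_rows(suffix_table(words, ["AH", "oH", "e"])):
    print(f"{suffix:<12} {total:5d}", *(f"{n:5d}" for n in counts))
```

Replacing the pattern list with "ec", "oec", "cic" and so on is then a one-line change, which makes it easy to compare candidate prefix sets side by side.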
  Let's redo again:

    /bin/rm -f .title
    /bin/rm -f .table
    /bin/touch .table
    set noglob
    set ofmt = "0"
    set npat = 1
    foreach pat ( \
        'AH'    'oH'  'cg'    'cc.*H' \
        'ec'    'oec' 'H'     'oe.*H' \
        'cic'   'P'   'e.*H'  'ci.*H' \
        'oP'    'Ae'  'AP'    'cg.*H' \
        'Ae.*H' \
      )
      /n/gnu/bin/printf " %7s" "${pat}" >> .title
      /bin/cat bio-j-huc-gut.wds \
        | /n/gnu/bin/sed -e 's/^/_/g' -e 's/$/_/g' \
        | /n/gnu/bin/egrep "_${pat}[^H][^H]*_" \
        | /n/gnu/bin/sed -e "s/_${pat}//g" -e '/../s/_$//g' \
        | /n/gnu/bin/sort | uniq -c \
        | /n/gnu/bin/expand \
        > .suff.frq
      /n/gnu/bin/join -a 1 -a 2 -1 1 -2 2 -o"${ofmt},2.1" -e 0 .table .suff.frq > .tmp
      /bin/mv .tmp .table
      @ npat = ${npat} + 1
      set ofmt = "${ofmt},1.${npat}"
    end
    unset noglob
    /n/gnu/bin/printf "\n" >> .title
    cat .table \
      | /n/gnu/bin/gawk '/./ { s=0; for(i=2;i<=NF;i++) s+=$(i); print s, $0 }' \
      | sort -nr \
      > .tbsort
    cat .title .tbsort \
      | format-suffix-table

    SUFFIX       TOTAL    AH    oH    cg cc.*H    ec   oec     H oe.*H
    ------------ ----- ----- ----- ----- ----- ----- ----- ----- -----
    TOTALORUM     3060  1041   448   345   318   171   125   149   115
    ------------ ----- ----- ----- ----- ----- ----- ----- ----- -----
    cccgci         462   191    57     5    21    73    38    16    15
    ccgci          390   201    85     2    15     7     6    28    20
    cci            260    43    22     0   160     5     4     2     8
    ci             248    79    28    36    68     0     0     2    13
    cim            240    94    41    72     5     0     0     7     7
    ccci           237    87    35     2    20    20    19     5    20
    cie            232   116    40    50     4     0     0    10     5
    cir            168    50    36    51     5     0     0     8     8
    ccccgci        142    19    10    27     0     6     1    18     3
    cin             94    54    16    12     0     0     0     2     5
    cccci           78    12    13     3     2     6     4    10     3
    oe              70    21    10    17     3     0     0     6     1
    i               28     0     0     0     0     6    22     0     0
    cieci           20     7     4     6     0     0     1     1     0
    cccg            20     5     3     0     0     5     3     2     0
    ccoe            19     1     1     3     0     3     3     4     0
    gci             18     0     0     0     0     8     9     0     0
    or              15     4     2     3     0     0     0     4     0
    cif             15     4     0     6     2     0     0     1     1
    cccoe           14     0     0     4     0     0     0     3     0
    coe             12     1     6     0     1     0     0     3     1
    ccccci          11     1     1     3     0     0     0     1     1
    ccgcir           9     4     4     0     0     0     1     0     0
    cor              8     4     0     0     1     3     0     0     0
    circi            8     1     0     5     0     0     0     1     0
    ciecgci          8     3     2     3     0     0     0     0     0
    e                7     2     0     4     0     0     0     0     0
    cccccgci         7     2     0     2     0     0     0     0     0
    ccor             6     0     1     2     0     0     1     0     0
    ccg              6     3     1     0     1     0     0     0     1
    im               5     0     0     0     0     2     3     0     0
    ie               5     0     0     0     0     3     2     0     0
    cieor            5     0     4     0     0     0     0     0     0
    cicgci           5     3     1     0     0     0     0     0     1
    ccgcie           5     2     2     0     0     0     0     1     0
    cccgcie          5
2 1 0 0 1 0 0 0 ccccgcir 5 0 0 0 0 0 0 1 0 oeccccgci 4 1 0 1 0 0 0 0 0 ir 4 0 0 0 0 2 1 0 0 cieoe 4 1 2 1 0 0 0 0 0 cieccccgci 4 0 1 2 0 0 0 0 0 ccie 4 1 0 0 3 0 0 0 0 cccir 4 0 1 0 0 1 0 1 1 cccgcir 4 0 0 0 0 1 0 0 0 ccccg 4 1 0 0 1 0 1 0 0 c 4 0 0 0 2 2 0 0 0 om 3 1 0 0 0 0 0 0 1 oecgci 3 1 1 0 0 0 0 1 0 oecccgci 3 0 0 0 0 0 0 1 0 in 3 0 0 0 0 0 3 0 0 ciecccgci 3 0 1 2 0 0 0 0 0 cic 3 2 0 0 0 0 0 1 0 cgci 3 1 0 0 0 0 0 0 0 cccie 3 0 1 0 1 0 0 1 0 cccc 3 0 0 0 0 1 1 0 0 oeci 2 0 0 0 0 0 0 0 0 oeccci 2 0 0 0 0 0 0 0 0 gcim 2 0 0 0 0 1 0 0 0 coecgci 2 0 1 0 0 0 0 0 0 co 2 0 0 0 1 1 0 0 0 ciie 2 1 0 1 0 0 0 0 0 cieo 2 0 0 2 0 0 0 0 0 ce 2 0 0 0 0 2 0 0 0 ccir 2 1 0 0 0 1 0 0 0 ccgcim 2 1 1 0 0 0 0 0 0 cccif 2 0 0 0 0 1 1 0 0 ccccgcie 2 0 0 0 0 0 0 0 0 cc 2 1 0 0 0 1 0 0 0 roe 1 0 0 1 0 0 0 0 0 oroeccccgci 1 0 0 0 0 0 0 1 0 orcim 1 0 0 1 0 0 0 0 0 orccci 1 0 0 0 0 0 0 1 0 oeoe 1 1 0 0 0 0 0 0 0 oecircir 1 0 0 0 0 0 0 0 0 oecim 1 0 0 0 0 0 0 0 0 oeciecccgci 1 0 0 0 0 0 0 0 0 oecgcie 1 0 0 0 0 0 0 1 0 oecgccccgci 1 0 0 0 0 0 0 0 0 oecccgcie 1 0 0 0 0 0 0 0 0 oecccg 1 0 0 0 0 0 0 1 0 oecccci 1 0 0 0 0 0 0 0 0 oeccccg 1 0 0 0 0 0 0 0 0 oecccccgci 1 0 0 1 0 0 0 0 0 oePci 1 0 0 0 0 0 0 1 0 oePccccgci 1 0 0 1 0 0 0 0 0 ocgcie 1 0 0 0 0 0 0 0 0 oPci 1 1 0 0 0 0 0 0 0 imci 1 0 0 0 0 0 1 0 0 if 1 0 0 0 0 1 0 0 0 goe 1 0 0 0 0 1 0 0 0 gcircie 1 0 0 0 0 0 0 0 0 gcircccci 1 0 0 0 0 0 0 0 0 gcir 1 0 0 0 0 1 0 0 0 gcieor 1 0 0 0 0 1 0 0 0 g 1 0 0 0 0 1 0 0 0 eor 1 0 1 0 0 0 0 0 0 eoe 1 0 1 0 0 0 0 0 0 ecccci 1 0 0 0 0 0 0 0 0 eccccgci 1 0 0 1 0 0 0 0 0 coeor 1 0 1 0 0 0 0 0 0 coeci 1 1 0 0 0 0 0 0 0 cirorci 1 0 0 1 0 0 0 0 0 cirof 1 0 0 1 0 0 0 0 0 ciroeof 1 0 1 0 0 0 0 0 0 ciroe 1 0 0 0 1 0 0 0 0 circif 1 0 1 0 0 0 0 0 0 circie 1 0 0 0 0 0 0 1 0 circgci 1 1 0 0 0 0 0 0 0 circccci 1 0 0 1 0 0 0 0 0 circccccgcie 1 0 0 1 0 0 0 0 0 cioc 1 0 1 0 0 0 0 0 0 cifo 1 0 0 0 0 0 0 0 0 ciecirci 1 0 0 1 0 0 0 0 0 ciecircg 1 0 0 1 0 0 0 0 0 ciecir 1 0 0 1 0 0 0 0 0 ciecim 1 0 0 1 0 0 0 0 0 ciecie 
1 0 0 1 0 0 0 0 0 ciecgcgci 1 1 0 0 0 0 0 0 0 cieccci 1 0 1 0 0 0 0 0 0 ciecccci 1 1 0 0 0 0 0 0 0 ciec 1 1 0 0 0 0 0 0 0 cicccgci 1 1 0 0 0 0 0 0 0 cgoe 1 0 0 0 0 0 0 0 0 cgcieccor 1 0 0 0 0 0 0 0 0 cgcie 1 0 1 0 0 0 0 0 0 ccoecicg 1 0 0 1 0 0 0 0 0 ccoeci 1 0 0 0 0 0 0 0 0 ccocgci 1 0 1 0 0 0 0 0 0 ccgor 1 1 0 0 0 0 0 0 0 ccgcioe 1 0 0 0 0 0 0 1 0 ccgcif 1 0 1 0 0 0 0 0 0 ccgcieor 1 0 1 0 0 0 0 0 0 ccgciA 1 0 0 0 0 1 0 0 0 ccgcci 1 0 0 0 0 0 0 0 0 ccgccgci 1 1 0 0 0 0 0 0 0 ccgccci 1 0 1 0 0 0 0 0 0 ccgccccgci 1 1 0 0 0 0 0 0 0 cccor 1 0 0 0 0 1 0 0 0 cccoecgci 1 0 0 0 0 0 0 0 0 ccciroe 1 0 0 0 0 0 0 0 0 cccieci 1 0 0 0 0 0 0 0 0 ccciecgci 1 0 0 0 0 0 0 0 0 cccgcirci 1 0 0 0 0 1 0 0 0 cccgcif 1 0 0 0 0 0 0 1 0 cccgccci 1 0 0 0 1 0 0 0 0 ccccoe 1 0 0 1 0 0 0 0 0 ccccic 1 0 1 0 0 0 0 0 0 cccccgcie 1 0 0 1 0 0 0 0 0 ccccc 1 0 0 1 0 0 0 0 0 ccc 1 1 0 0 0 0 0 0 0 ccPcccgci 1 0 0 0 0 1 0 0 0 Pcim 1 0 0 0 0 0 0 0 0 SUFFIX TOTAL cic P e.*H ci.*H oP Ae AP cg.*H Ae.*H ------------ ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- TOTALORUM 3060 35 63 61 58 40 38 22 18 13 ------------ ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- cccgci 462 12 3 9 12 2 1 2 3 2 ccgci 390 0 0 8 12 0 0 0 6 0 cci 260 0 0 9 4 0 0 0 2 1 ci 248 0 0 4 2 2 7 1 2 4 cim 240 0 0 7 4 0 2 0 1 0 ccci 237 5 0 4 12 0 2 0 3 3 cie 232 0 0 1 3 1 2 0 0 0 cir 168 0 1 4 2 2 1 0 0 0 ccccgci 142 2 17 2 1 18 7 11 0 0 cin 94 0 0 2 2 0 1 0 0 0 cccci 78 4 3 1 0 4 9 2 1 1 oe 70 0 8 1 0 1 1 1 0 0 i 28 0 0 0 0 0 0 0 0 0 cieci 20 0 0 0 0 1 0 0 0 0 cccg 20 1 0 1 0 0 0 0 0 0 ccoe 19 1 2 0 0 0 0 1 0 0 gci 18 1 0 0 0 0 0 0 0 0 or 15 0 0 0 0 1 0 1 0 0 cif 15 0 0 1 0 0 0 0 0 0 cccoe 14 0 4 0 1 0 1 1 0 0 coe 12 0 0 0 0 0 0 0 0 0 ccccci 11 1 0 0 1 2 0 0 0 0 ccgcir 9 0 0 0 0 0 0 0 0 0 cor 8 0 0 0 0 0 0 0 0 0 circi 8 0 0 0 0 0 0 0 0 1 ciecgci 8 0 0 0 0 0 0 0 0 0 e 7 0 0 1 0 0 0 0 0 0 cccccgci 7 0 0 0 0 0 3 0 0 0 ccor 6 0 2 0 0 0 0 0 0 0 ccg 6 0 0 0 0 0 0 0 0 0 im 5 0 0 0 0 0 0 0 0 0 ie 5 0 0 0 0 0 0 0 0 0 
cieor 5 0 0 1 0 0 0 0 0 0 cicgci 5 0 0 0 0 0 0 0 0 0 ccgcie 5 0 0 0 0 0 0 0 0 0 cccgcie 5 0 0 0 0 0 0 1 0 0 ccccgcir 5 0 4 0 0 0 0 0 0 0 oeccccgci 4 0 2 0 0 0 0 0 0 0 ir 4 1 0 0 0 0 0 0 0 0 cieoe 4 0 0 0 0 0 0 0 0 0 cieccccgci 4 0 0 0 0 1 0 0 0 0 ccie 4 0 0 0 0 0 0 0 0 0 cccir 4 0 0 0 0 0 0 0 0 0 cccgcir 4 1 0 0 2 0 0 0 0 0 ccccg 4 0 0 0 0 1 0 0 0 0 c 4 0 0 0 0 0 0 0 0 0 om 3 0 1 0 0 0 0 0 0 0 oecgci 3 0 0 0 0 0 0 0 0 0 oecccgci 3 0 2 0 0 0 0 0 0 0 in 3 0 0 0 0 0 0 0 0 0 ciecccgci 3 0 0 0 0 0 0 0 0 0 cic 3 0 0 0 0 0 0 0 0 0 cgci 3 0 0 0 0 0 1 1 0 0 cccie 3 0 0 0 0 0 0 0 0 0 cccc 3 1 0 0 0 0 0 0 0 0 oeci 2 0 0 0 0 2 0 0 0 0 oeccci 2 0 2 0 0 0 0 0 0 0 gcim 2 1 0 0 0 0 0 0 0 0 coecgci 2 0 0 1 0 0 0 0 0 0 co 2 0 0 0 0 0 0 0 0 0 ciie 2 0 0 0 0 0 0 0 0 0 cieo 2 0 0 0 0 0 0 0 0 0 ce 2 0 0 0 0 0 0 0 0 0 ccir 2 0 0 0 0 0 0 0 0 0 ccgcim 2 0 0 0 0 0 0 0 0 0 cccif 2 0 0 0 0 0 0 0 0 0 ccccgcie 2 0 1 0 0 1 0 0 0 0 cc 2 0 0 0 0 0 0 0 0 0 roe 1 0 0 0 0 0 0 0 0 0 oroeccccgci 1 0 0 0 0 0 0 0 0 0 orcim 1 0 0 0 0 0 0 0 0 0 orccci 1 0 0 0 0 0 0 0 0 0 oeoe 1 0 0 0 0 0 0 0 0 0 oecircir 1 0 1 0 0 0 0 0 0 0 oecim 1 0 1 0 0 0 0 0 0 0 oeciecccgci 1 0 1 0 0 0 0 0 0 0 oecgcie 1 0 0 0 0 0 0 0 0 0 oecgccccgci 1 0 1 0 0 0 0 0 0 0 oecccgcie 1 0 1 0 0 0 0 0 0 0 oecccg 1 0 0 0 0 0 0 0 0 0 oecccci 1 0 1 0 0 0 0 0 0 0 oeccccg 1 0 1 0 0 0 0 0 0 0 oecccccgci 1 0 0 0 0 0 0 0 0 0 oePci 1 0 0 0 0 0 0 0 0 0 oePccccgci 1 0 0 0 0 0 0 0 0 0 ocgcie 1 0 0 1 0 0 0 0 0 0 oPci 1 0 0 0 0 0 0 0 0 0 imci 1 0 0 0 0 0 0 0 0 0 if 1 0 0 0 0 0 0 0 0 0 goe 1 0 0 0 0 0 0 0 0 0 gcircie 1 1 0 0 0 0 0 0 0 0 gcircccci 1 1 0 0 0 0 0 0 0 0 gcir 1 0 0 0 0 0 0 0 0 0 gcieor 1 0 0 0 0 0 0 0 0 0 g 1 0 0 0 0 0 0 0 0 0 eor 1 0 0 0 0 0 0 0 0 0 eoe 1 0 0 0 0 0 0 0 0 0 ecccci 1 0 0 1 0 0 0 0 0 0 eccccgci 1 0 0 0 0 0 0 0 0 0 coeor 1 0 0 0 0 0 0 0 0 0 coeci 1 0 0 0 0 0 0 0 0 0 cirorci 1 0 0 0 0 0 0 0 0 0 cirof 1 0 0 0 0 0 0 0 0 0 ciroeof 1 0 0 0 0 0 0 0 0 0 ciroe 1 0 0 0 0 0 0 0 0 0 circif 1 0 0 0 0 0 0 0 0 0 circie 1 0 0 0 0 0 0 0 0 0 
    circgci          1     0     0     0     0     0     0     0     0     0
    circccci         1     0     0     0     0     0     0     0     0     0
    circccccgcie     1     0     0     0     0     0     0     0     0     0
    cioc             1     0     0     0     0     0     0     0     0     0
    cifo             1     0     0     1     0     0     0     0     0     0
    ciecirci         1     0     0     0     0     0     0     0     0     0
    ciecircg         1     0     0     0     0     0     0     0     0     0
    ciecir           1     0     0     0     0     0     0     0     0     0
    ciecim           1     0     0     0     0     0     0     0     0     0
    ciecie           1     0     0     0     0     0     0     0     0     0
    ciecgcgci        1     0     0     0     0     0     0     0     0     0
    cieccci          1     0     0     0     0     0     0     0     0     0
    ciecccci         1     0     0     0     0     0     0     0     0     0
    ciec             1     0     0     0     0     0     0     0     0     0
    cicccgci         1     0     0     0     0     0     0     0     0     0
    cgoe             1     0     1     0     0     0     0     0     0     0
    cgcieccor        1     0     1     0     0     0     0     0     0     0
    cgcie            1     0     0     0     0     0     0     0     0     0
    ccoecicg         1     0     0     0     0     0     0     0     0     0
    ccoeci           1     0     0     0     0     0     0     0     0     1
    ccocgci          1     0     0     0     0     0     0     0     0     0
    ccgor            1     0     0     0     0     0     0     0     0     0
    ccgcioe          1     0     0     0     0     0     0     0     0     0
    ccgcif           1     0     0     0     0     0     0     0     0     0
    ccgcieor         1     0     0     0     0     0     0     0     0     0
    ccgciA           1     0     0     0     0     0     0     0     0     0
    ccgcci           1     0     0     1     0     0     0     0     0     0
    ccgccgci         1     0     0     0     0     0     0     0     0     0
    ccgccci          1     0     0     0     0     0     0     0     0     0
    ccgccccgci       1     0     0     0     0     0     0     0     0     0
    cccor            1     0     0     0     0     0     0     0     0     0
    cccoecgci        1     0     1     0     0     0     0     0     0     0
    ccciroe          1     0     1     0     0     0     0     0     0     0
    cccieci          1     0     0     0     0     1     0     0     0     0
    ccciecgci        1     1     0     0     0     0     0     0     0     0
    cccgcirci        1     0     0     0     0     0     0     0     0     0
    cccgcif          1     0     0     0     0     0     0     0     0     0
    cccgccci         1     0     0     0     0     0     0     0     0     0
    ccccoe           1     0     0     0     0     0     0     0     0     0
    ccccic           1     0     0     0     0     0     0     0     0     0
    cccccgcie        1     0     0     0     0     0     0     0     0     0
    ccccc            1     0     0     0     0     0     0     0     0     0
    ccc              1     0     0     0     0     0     0     0     0     0
    ccPcccgci        1     0     0     0     0     0     0     0     0     0
    Pcim             1     1     0     0     0     0     0     0     0     0

  Analysis: The prefix "cg" seems a bit anomalous. It looks as if
  the "cg" were actually the first "c" of the suffix.

  Most valid suffixes apparently begin with "c". Thus we should
  incorporate the "c" into the prefix. The "i" suffix is bogus; it
  entered only because of "eci" and "oeci". Note that "ec", "cic",
  and "oec" incorporate the "c" while the other prefixes don't.

  The prefixes "cc.*H", "oe.*H", etc. seem anomalous, probably
  because some "H"s are actually "cHc"s. We should find out what the
  actual prefixes are, and see whether we can tell the two classes
  apart.

  Most productive suffixes come in pairs that differ by an extra "c"
  at the beginning.
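The last observation, that productive suffixes come in pairs differing by one leading "c", can be checked mechanically. The following Python sketch is only an illustration: the helper name `extra_c_pairs` and the threshold `min_count` are assumptions of mine, and the input is just the TOTAL column for the nine most frequent suffixes of the table above.

```python
def extra_c_pairs(totals, min_count=1):
    """Return (suffix, "c" + suffix) pairs where both occur at least min_count times."""
    pairs = []
    for suffix in sorted(totals):
        longer = "c" + suffix
        if totals[suffix] >= min_count and totals.get(longer, 0) >= min_count:
            pairs.append((suffix, longer))
    return pairs

# TOTAL column for the nine most frequent suffixes of the table above.
totals = {
    "cccgci": 462, "ccgci": 390, "cci": 260, "ci": 248, "cim": 240,
    "ccci": 237, "cie": 232, "cir": 168, "ccccgci": 142,
}
for short, longer in extra_c_pairs(totals, min_count=100):
    print(short, totals[short], longer, totals[longer])
```

On this small sample the pairs found are the "gci" and "ci" families; "cim", "cie", and "cir" have no partner here only because their "cc" variants fall outside the top nine rows.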
  Redoing again with the extra "c"s:

    /bin/rm -f .title
    /bin/rm -f .table
    /bin/touch .table
    set noglob
    set ofmt = "0"
    set npat = 1
    foreach pat ( \
        'AH'    'oH'  'cg'    'cc.*H' \
        'e'     'oe'  'H'     'Ae' \
        'oe.*H' 'ci'  'P'     'e.*H' \
        'ci.*H' 'oP'  'AP'    'cg.*H' \
        'Ae.*H' \
      )
      /n/gnu/bin/printf " %7s" "${pat}" >> .title
      /bin/cat bio-j-huc-gut.wds \
        | /n/gnu/bin/sed -e 's/^/_/g' -e 's/$/_/g' \
        | /n/gnu/bin/egrep "_${pat}c[^HPA]*_" \
        | /n/gnu/bin/sed -e "s/_${pat}//g" -e '/../s/_$//g' \
        | /n/gnu/bin/sort | uniq -c \
        | /n/gnu/bin/expand \
        > .suff.frq
      /n/gnu/bin/join -a 1 -a 2 -1 1 -2 2 -o"${ofmt},2.1" -e 0 .table .suff.frq > .tmp
      /bin/mv .tmp .table
      @ npat = ${npat} + 1
      set ofmt = "${ofmt},1.${npat}"
    end
    unset noglob
    /n/gnu/bin/printf "\n" >> .title
    cat .table \
      | /n/gnu/bin/gawk '/./ { s=0; for(i=2;i<=NF;i++) s+=$(i); print s, $0 }' \
      | sort -nr \
      > .tbsort
    cat .title .tbsort \
      | format-suffix-table

  Results:

    SUFFIX       TOTAL    AH    oH    cg cc.*H     e    oe     H oe.*H
    ------------ ----- ----- ----- ----- ----- ----- ----- ----- -----
    TOTALORUM     2927  1009   433   315   315   169   127   132   113
    ------------ ----- ----- ----- ----- ----- ----- ----- ----- -----
    ccgci          377   201    85     2    15     0     0    28    20
    cccgci         352   191    57     5    21     7     6    16    15
    ci             276    79    28    36    68     6    22     2    13
    ccccgci        256    19    10    27     0    73    38    18     3
    cci            251    43    22     0   160     0     0     2     8
    cim            245    94    41    72     5     2     3     7     7
    cie            237   116    40    50     4     3     2    10     5
    ccci           202    87    35     2    20     5     4     5    20
    cir            172    50    36    51     5     2     1     8     8
    cccci          108    12    13     3     2    20    19    10     3
    cin             97    54    16    12     0     0     3     2     5
    ccccci          24     1     1     3     0     6     4     1     1
    cgci            21     1     0     0     0     8     9     0     0
    cccoe           21     0     0     4     0     3     3     3     0
    cieci           19     7     4     6     0     0     0     1     0
    cif             16     4     0     6     2     1     0     1     1
    cccccgci        16     2     0     2     0     6     1     0     0
    coe             12     1     6     0     1     0     0     3     1
    ccoe            12     1     1     3     0     0     0     4     0
    ccccg           12     1     0     0     1     5     3     0     0
    cccg            11     5     3     0     0     0     0     2     0
    circi            8     1     0     5     0     0     0     1     0
    ciecgci          8     3     2     3     0     0     0     0     0
    ccor             8     0     1     2     0     3     0     0     0
    ccgcir           8     4     4     0     0     0     0     0     0
    ccccgcir         7     0     0     0     0     1     0     1     0
    ccg              6     3     1     0     1     0     0     0     1
    cor              5     4     0     0     1     0     0     0     0
    cieor            5     0     4     0     0     0     0     0     0
    cicgci           5     3     1     0     0     0     0     0     1
    ccgcie           5     2     2     0     0     0     0     1     0
    cieoe            4     1     2     1     0     0     0     0     0
    cieccccgci       4     0     1     2     0     0     0     0     0
    ccie             4     1     0
0 3 0 0 0 0 cccir 4 0 1 0 0 1 0 1 1 cccgcie 4 2 1 0 0 0 0 0 0 ccccc 4 0 0 1 0 1 1 0 0 c 4 0 0 0 2 0 2 0 0 ciecccgci 3 0 1 2 0 0 0 0 0 cic 3 2 0 0 0 0 0 1 0 cccie 3 0 1 0 1 0 0 1 0 cccgcir 3 0 0 0 0 0 1 0 0 ccccgcie 3 0 0 0 0 1 0 0 0 cc 3 1 0 0 0 2 0 0 0 coecgci 2 0 1 0 0 0 0 0 0 ciie 2 1 0 1 0 0 0 0 0 cieo 2 0 0 2 0 0 0 0 0 cgoe 2 0 0 0 0 1 0 0 0 cgcim 2 0 0 0 0 1 0 0 0 ccgcim 2 1 1 0 0 0 0 0 0 cce 2 0 0 0 0 2 0 0 0 ccccif 2 0 0 0 0 1 1 0 0 ccc 2 1 0 0 0 1 0 0 0 coeor 1 0 1 0 0 0 0 0 0 coeci 1 1 0 0 0 0 0 0 0 co 1 0 0 0 1 0 0 0 0 cirorci 1 0 0 1 0 0 0 0 0 cirof 1 0 0 1 0 0 0 0 0 ciroeof 1 0 1 0 0 0 0 0 0 ciroe 1 0 0 0 1 0 0 0 0 circif 1 0 1 0 0 0 0 0 0 circie 1 0 0 0 0 0 0 1 0 circgci 1 1 0 0 0 0 0 0 0 circccci 1 0 0 1 0 0 0 0 0 circccccgcie 1 0 0 1 0 0 0 0 0 cioc 1 0 1 0 0 0 0 0 0 cimci 1 0 0 0 0 0 1 0 0 cifo 1 0 0 0 0 0 0 0 0 ciecirci 1 0 0 1 0 0 0 0 0 ciecircg 1 0 0 1 0 0 0 0 0 ciecir 1 0 0 1 0 0 0 0 0 ciecim 1 0 0 1 0 0 0 0 0 ciecie 1 0 0 1 0 0 0 0 0 ciecgcgci 1 1 0 0 0 0 0 0 0 cieccci 1 0 1 0 0 0 0 0 0 ciecccci 1 1 0 0 0 0 0 0 0 ciec 1 1 0 0 0 0 0 0 0 cicccgci 1 1 0 0 0 0 0 0 0 cgcircie 1 0 0 0 0 0 0 0 0 cgcircccci 1 0 0 0 0 0 0 0 0 cgcir 1 0 0 0 0 1 0 0 0 cgcieor 1 0 0 0 0 1 0 0 0 cgcieccor 1 0 0 0 0 0 0 0 0 cgcie 1 0 1 0 0 0 0 0 0 cg 1 0 0 0 0 1 0 0 0 ccoecicg 1 0 0 1 0 0 0 0 0 ccoeci 1 0 0 0 0 0 0 0 0 ccocgci 1 0 1 0 0 0 0 0 0 cco 1 0 0 0 0 1 0 0 0 ccir 1 1 0 0 0 0 0 0 0 ccieci 1 0 0 0 0 0 1 0 0 ccgor 1 1 0 0 0 0 0 0 0 ccgcioe 1 0 0 0 0 0 0 1 0 ccgcif 1 0 1 0 0 0 0 0 0 ccgcieor 1 0 1 0 0 0 0 0 0 ccgcci 1 0 0 0 0 0 0 0 0 ccgccgci 1 1 0 0 0 0 0 0 0 ccgccci 1 0 1 0 0 0 0 0 0 ccgccccgci 1 1 0 0 0 0 0 0 0 cccor 1 0 0 0 0 0 1 0 0 cccoecgci 1 0 0 0 0 0 0 0 0 ccciroe 1 0 0 0 0 0 0 0 0 cccieci 1 0 0 0 0 0 0 0 0 cccgcif 1 0 0 0 0 0 0 1 0 cccgccci 1 0 0 0 1 0 0 0 0 ccccor 1 0 0 0 0 1 0 0 0 ccccoe 1 0 0 1 0 0 0 0 0 ccccir 1 0 0 0 0 1 0 0 0 cccciecgci 1 0 0 0 0 0 0 0 0 ccccic 1 0 1 0 0 0 0 0 0 ccccgcirci 1 0 0 0 0 1 0 0 0 cccccgcie 1 0 0 1 0 0 0 0 0 cccccg 1 0 0 0 0 0 1 
0 0 cccccci 1 0 0 0 0 0 0 0 0 SUFFIX TOTAL Ae ci P e.*H ci.*H oP AP cg.*H Ae.*H ------------ ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- TOTALORUM 2927 37 34 41 57 58 36 20 18 13 ------------ ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ccgci 377 0 0 0 8 12 0 0 6 0 cccgci 352 1 0 3 9 12 2 2 3 2 ci 276 7 0 0 4 2 2 1 2 4 ccccgci 256 7 12 17 2 1 18 11 0 0 cci 251 0 0 0 9 4 0 0 2 1 cim 245 2 0 0 7 4 0 0 1 0 cie 237 2 0 0 1 3 1 0 0 0 ccci 202 2 0 0 4 12 0 0 3 3 cir 172 1 1 1 4 2 2 0 0 0 cccci 108 9 5 3 1 0 4 2 1 1 cin 97 1 0 0 2 2 0 0 0 0 ccccci 24 0 4 0 0 1 2 0 0 0 cgci 21 1 1 0 0 0 0 1 0 0 cccoe 21 1 1 4 0 1 0 1 0 0 cieci 19 0 0 0 0 0 1 0 0 0 cif 16 0 0 0 1 0 0 0 0 0 cccccgci 16 3 2 0 0 0 0 0 0 0 coe 12 0 0 0 0 0 0 0 0 0 ccoe 12 0 0 2 0 0 0 1 0 0 ccccg 12 0 1 0 0 0 1 0 0 0 cccg 11 0 0 0 1 0 0 0 0 0 circi 8 0 0 0 0 0 0 0 0 1 ciecgci 8 0 0 0 0 0 0 0 0 0 ccor 8 0 0 2 0 0 0 0 0 0 ccgcir 8 0 0 0 0 0 0 0 0 0 ccccgcir 7 0 1 4 0 0 0 0 0 0 ccg 6 0 0 0 0 0 0 0 0 0 cor 5 0 0 0 0 0 0 0 0 0 cieor 5 0 0 0 1 0 0 0 0 0 cicgci 5 0 0 0 0 0 0 0 0 0 ccgcie 5 0 0 0 0 0 0 0 0 0 cieoe 4 0 0 0 0 0 0 0 0 0 cieccccgci 4 0 0 0 0 0 1 0 0 0 ccie 4 0 0 0 0 0 0 0 0 0 cccir 4 0 0 0 0 0 0 0 0 0 cccgcie 4 0 0 0 0 0 0 1 0 0 ccccc 4 0 1 0 0 0 0 0 0 0 c 4 0 0 0 0 0 0 0 0 0 ciecccgci 3 0 0 0 0 0 0 0 0 0 cic 3 0 0 0 0 0 0 0 0 0 cccie 3 0 0 0 0 0 0 0 0 0 cccgcir 3 0 0 0 0 2 0 0 0 0 ccccgcie 3 0 0 1 0 0 1 0 0 0 cc 3 0 0 0 0 0 0 0 0 0 coecgci 2 0 0 0 1 0 0 0 0 0 ciie 2 0 0 0 0 0 0 0 0 0 cieo 2 0 0 0 0 0 0 0 0 0 cgoe 2 0 0 1 0 0 0 0 0 0 cgcim 2 0 1 0 0 0 0 0 0 0 ccgcim 2 0 0 0 0 0 0 0 0 0 cce 2 0 0 0 0 0 0 0 0 0 ccccif 2 0 0 0 0 0 0 0 0 0 ccc 2 0 0 0 0 0 0 0 0 0 coeor 1 0 0 0 0 0 0 0 0 0 coeci 1 0 0 0 0 0 0 0 0 0 co 1 0 0 0 0 0 0 0 0 0 cirorci 1 0 0 0 0 0 0 0 0 0 cirof 1 0 0 0 0 0 0 0 0 0 ciroeof 1 0 0 0 0 0 0 0 0 0 ciroe 1 0 0 0 0 0 0 0 0 0 circif 1 0 0 0 0 0 0 0 0 0 circie 1 0 0 0 0 0 0 0 0 0 circgci 1 0 0 0 0 0 0 0 0 0 circccci 1 0 0 0 0 0 0 0 0 0 circccccgcie 1 0 0 0 0 0 0 0 
0 0 cioc 1 0 0 0 0 0 0 0 0 0 cimci 1 0 0 0 0 0 0 0 0 0 cifo 1 0 0 0 1 0 0 0 0 0 ciecirci 1 0 0 0 0 0 0 0 0 0 ciecircg 1 0 0 0 0 0 0 0 0 0 ciecir 1 0 0 0 0 0 0 0 0 0 ciecim 1 0 0 0 0 0 0 0 0 0 ciecie 1 0 0 0 0 0 0 0 0 0 ciecgcgci 1 0 0 0 0 0 0 0 0 0 cieccci 1 0 0 0 0 0 0 0 0 0 ciecccci 1 0 0 0 0 0 0 0 0 0 ciec 1 0 0 0 0 0 0 0 0 0 cicccgci 1 0 0 0 0 0 0 0 0 0 cgcircie 1 0 1 0 0 0 0 0 0 0 cgcircccci 1 0 1 0 0 0 0 0 0 0 cgcir 1 0 0 0 0 0 0 0 0 0 cgcieor 1 0 0 0 0 0 0 0 0 0 cgcieccor 1 0 0 1 0 0 0 0 0 0 cgcie 1 0 0 0 0 0 0 0 0 0 cg 1 0 0 0 0 0 0 0 0 0 ccoecicg 1 0 0 0 0 0 0 0 0 0 ccoeci 1 0 0 0 0 0 0 0 0 1 ccocgci 1 0 0 0 0 0 0 0 0 0 cco 1 0 0 0 0 0 0 0 0 0 ccir 1 0 0 0 0 0 0 0 0 0 ccieci 1 0 0 0 0 0 0 0 0 0 ccgor 1 0 0 0 0 0 0 0 0 0 ccgcioe 1 0 0 0 0 0 0 0 0 0 ccgcif 1 0 0 0 0 0 0 0 0 0 ccgcieor 1 0 0 0 0 0 0 0 0 0 ccgcci 1 0 0 0 1 0 0 0 0 0 ccgccgci 1 0 0 0 0 0 0 0 0 0 ccgccci 1 0 0 0 0 0 0 0 0 0 ccgccccgci 1 0 0 0 0 0 0 0 0 0 cccor 1 0 0 0 0 0 0 0 0 0 cccoecgci 1 0 0 1 0 0 0 0 0 0 ccciroe 1 0 0 1 0 0 0 0 0 0 cccieci 1 0 0 0 0 0 1 0 0 0 cccgcif 1 0 0 0 0 0 0 0 0 0 cccgccci 1 0 0 0 0 0 0 0 0 0 ccccor 1 0 0 0 0 0 0 0 0 0 ccccoe 1 0 0 0 0 0 0 0 0 0 ccccir 1 0 0 0 0 0 0 0 0 0 cccciecgci 1 0 1 0 0 0 0 0 0 0 ccccic 1 0 0 0 0 0 0 0 0 0 ccccgcirci 1 0 0 0 0 0 0 0 0 0 cccccgcie 1 0 0 0 0 0 0 0 0 0 cccccg 1 0 0 0 0 0 0 0 0 0 cccccci 1 0 1 0 0 0 0 0 0 0 Analysis: to a first approximation, the "ordinary" words are { AH oH cg cc.*H e oe H oe.*H Ae ci P e.*H ci.*H oP AP cg.*H Ae.*H } × { ccgci cccgci ci ccccgci cci cim cie ccci cir cccci cin } Rarer suffixes are { ccccci cgci cccoe cieci cif cccccgci coe ccoe ... } Looking back, we see that the prefixes "cc.*H" are actually "cccH" and "ccccH". There is also a "ciH" prefix. The "ccc[^HP]*" words that we were using before appear to be "c" prefix. 
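As a rough check on the prefix × suffix approximation above, the following sketch counts how many words decompose as one listed prefix followed by one listed suffix. It assumes a word list with one word per line on standard input. The helper name `coverage` is hypothetical (it is not one of the tools used in this notebook), and the two alternations simply transcribe the sets from the analysis, reading entries such as "cc.*H" as regular expressions.

```shell
# Sketch only: `coverage` is a hypothetical helper, not one of the notebook's tools.
# Reads one word per line; prints "<matching words> <total words>".
coverage() {
  awk '
    BEGIN {
      # Prefix and suffix sets transcribed from the analysis above.
      pre = "^(AH|oH|cg|cc.*H|e|oe|H|oe.*H|Ae|ci|P|e.*H|ci.*H|oP|AP|cg.*H|Ae.*H)"
      suf = "(ccgci|cccgci|ci|ccccgci|cci|cim|cie|ccci|cir|cccci|cin)$"
      re = pre suf
    }
    # Count every non-empty word, and those that fit prefix + suffix exactly.
    /./ { total++; if ($0 ~ re) hits++ }
    END { printf "%d %d\n", hits, total }
  '
}

# Toy example: two of the three words fit the pattern.
printf 'AHccgci\noHcim\nzzz\n' | coverage
```

Run on the real word list (for instance `cat bio-j-huc-gut.wds | coverage`), the ratio of the two numbers gives a quick measure of how much of the text the approximation covers.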
Looked for more info on "H"-containing prefixes: cat bio-j-huc-gut.wds \ | sed -e 's/^/_/g' -e 's/$/_/g' \ | compare-contexts '_ci.*H.*_' '_e.*H.*_' '_cg.*H.*_' '_Ae.*H.*_' '_oe.*H.*_' 8 0.13 eHccgci 12 0.21 ciHccgci 7 0.11 eHcccgci 12 0.21 ciHcccgci 6 0.10 eHcim 10 0.17 ciHccci 5 0.08 eccccHcci 4 0.07 ciHcim 4 0.07 eHcir 3 0.05 ciHcie 3 0.05 ecccHcci 2 0.03 ciHcir 3 0.05 eHccci 2 0.03 ciHcin 2 0.03 eccccHcccgci 2 0.03 ciHcci 2 0.03 ecccHci 2 0.03 ciHcccgcir 2 0.03 eHcin 1 0.02 ciroHcci 2 0.03 eHci 1 0.02 cioHci 2 0.03 eHccccgci 1 0.02 ciccccHcci 1 0.02 eoeHcim 1 0.02 ciccccHccci 1 0.02 eoHcif 1 0.02 cicHccci 1 0.02 eccccHcie 1 0.02 ciHci 1 0.02 eccccHccci 1 0.02 ciHcccoe 1 0.02 eHoe 1 0.02 ciHccccgci 1 0.02 eHocgcie 1 0.02 ciHccccci 1 0.02 eHecccci ----- ---- ---- 1 0.02 eHe 58 1.00 TOT 1 0.02 eHcoecgci 1 0.02 eHcifo 1 0.02 eHcieor 1 0.02 eHcci 1 0.02 eHccgcci 1 0.02 eHcccg 1 0.02 eHcccci ----- ---- ---- 61 1.00 TOT 20 0.17 oeHccgci 2 0.11 cgciHccgci 4 0.31 AeHci 19 0.17 oeHccci 2 0.11 cgciHcccgci 3 0.23 AeHccci 15 0.13 oeHcccgci 2 0.11 cgHccgci 2 0.15 AeHcccgci 10 0.09 oeHci 1 0.06 cgoeHccgci 1 0.08 AeHcirci 7 0.06 oeHcir 1 0.06 cgoHccgci 1 0.08 AeHccoeci 7 0.06 oeHcim 1 0.06 cgoHccci 1 0.08 AeHcci 5 0.04 oeHcin 1 0.06 cgcieHci 1 0.08 AeHcccci 5 0.04 oeHcie 1 0.06 cgcicHcci ----- ---- ---- 4 0.03 oeHcci 1 0.06 cgciHcim 13 1.00 TOT 3 0.03 oeHcccci 1 0.06 cgciHci 2 0.02 oeoHci 1 0.06 cgciHcci 2 0.02 oecccHcci 1 0.06 cgciHccci 2 0.02 oeHccccgci 1 0.06 cgcccHcccgci 1 0.01 oeoeHccci 1 0.06 cgHccci 1 0.01 oeoHcir 1 0.06 cgHcccci 1 0.01 oeoHccccgci ----- ---- ---- 1 0.01 oeccccHcci 18 1.00 TOT 1 0.01 oeccHci 1 0.01 oePocHcci 1 0.01 oeHom 1 0.01 oeHoe 1 0.01 oeHcoe 1 0.01 oeHcif 1 0.01 oeHcicgci 1 0.01 oeHccg 1 0.01 oeHcccir 1 0.01 oeHccccci ----- ---- ---- 115 1.00 TOT So it seems we got some new prefixes: { eH eccccH ciH ciccccH oeH cgciH AeH } Now that we know that "ccgci" is the most common suffix, let's look for all its prefixes: cat bio-j-huc-gut.wds \ | egrep 
'ccgci$' \ | sed -e 's/ccgci$//g' \ | wfreq 387 0.25 cc 201 0.13 AH 191 0.12 AHc 85 0.05 oH 73 0.05 ecc 60 0.04 ccc 57 0.04 oHc 38 0.02 oecc 28 0.02 H 27 0.02 cgcc 21 0.01 c 20 0.01 oeH 19 0.01 AHcc 18 0.01 oPcc 18 0.01 Hcc 17 0.01 Pcc 16 0.01 Hc 15 0.01 oeHc 15 0.01 cccHc 12 0.01 cicc 12 0.01 ciHc 12 0.01 ciH 11 0.01 APcc 10 0.01 rcc 10 0.01 oHcc 10 0.01 cccH 8 0.01 eH 7 0.00 ec 7 0.00 eHc 7 0.00 Aecc 6 0.00 oec 6 0.00 eccc 6 0.00 Ac 5 0.00 cgc 5 0.00 ccccHc 4 0.00 oePcc 4 0.00 coeHc 4 0.00 cHc 3 0.00 ccH 3 0.00 cH 3 0.00 Pc 3 0.00 Aeccc 2 0.00 rccc 2 0.00 oeHcc 2 0.00 occ 2 0.00 oPc 2 0.00 eccccHc 2 0.00 eHcc 2 0.00 coecc 2 0.00 coHc 2 0.00 ciec 2 0.00 ciccc 2 0.00 cgciecc 2 0.00 cgciec 2 0.00 cgciHc 2 0.00 cgciH 2 0.00 cgccc 2 0.00 cgH 2 0.00 cg 2 0.00 ccccH 2 0.00 cccPcc 2 0.00 Poecc 2 0.00 Poec 2 0.00 AeHc 2 0.00 Acc 2 0.00 APc 2 0.00 AHccc 1 0.00 rc 1 0.00 orccc 1 0.00 oeoHcc 1 0.00 oeccc 1 0.00 ocgcc 1 0.00 ocHc 1 0.00 oc 1 0.00 oPciecc 1 0.00 oHciecc 1 0.00 oHciec 1 0.00 oAPcc 1 0.00 eocc 1 0.00 ecccPc 1 0.00 ePcc 1 0.00 ePc 1 0.00 coeH 1 0.00 ciecc 1 0.00 ciPcc 1 0.00 ciHcc 1 0.00 cgoeccc 1 0.00 cgoecc 1 0.00 cgoePcc 1 0.00 cgoeH 1 0.00 cgoH 1 0.00 cgecc 1 0.00 cgcccHc 1 0.00 ccocPc 1 0.00 ccoHc 1 0.00 cccoec 1 0.00 ccccPc 1 0.00 cccPc 1 0.00 ccPccc 1 0.00 ccPcc 1 0.00 cPcc 1 0.00 Poeciec 1 0.00 Poecgcc 1 0.00 PciH 1 0.00 Horoecc 1 0.00 Hoec 1 0.00 HcccoHc 1 0.00 Aec 1 0.00 Acgc 1 0.00 AccHc 1 0.00 AcHc 1 0.00 AHoecc 1 0.00 AHcic 1 0.00 AHccgcc 1 0.00 AHccg ----- ---- ---- 1562 1.00 TOT Ditto with "cie": cat bio-j-huc-gut.wds \ | egrep 'cie$' \ | sed -e 's/cie$//g' \ | wfreq 116 0.35 AH 50 0.15 cg 40 0.12 oH 14 0.04 c 12 0.04 10 0.03 H 9 0.03 ccccg 7 0.02 ccc 7 0.02 cc 6 0.02 r 5 0.02 oeH 3 0.01 e 3 0.01 ciH 2 0.01 oe 2 0.01 oHccg 2 0.01 ccccHc 2 0.01 ccH 2 0.01 Ae 2 0.01 AHccg 2 0.01 AHcccg 1 0.00 rccc 1 0.00 or 1 0.00 ocg 1 0.00 oc 1 0.00 oPccccg 1 0.00 oP 1 0.00 oHcg 1 0.00 oHcccg 1 0.00 oHcc 1 0.00 eor 1 0.00 eccccg 1 0.00 eccccH 1 0.00 eHocg 1 0.00 
coeccH 1 0.00 cicgcir 1 0.00 cgcircccccg 1 0.00 cgcie 1 0.00 cgcccccg 1 0.00 ccoH 1 0.00 ccir 1 0.00 cccg 1 0.00 cccc 1 0.00 cccHcc 1 0.00 cccHc 1 0.00 cccH 1 0.00 cPc 1 0.00 cHc 1 0.00 cH 1 0.00 Poecccg 1 0.00 Pccccg 1 0.00 Hoecg 1 0.00 Hcir 1 0.00 Hccg 1 0.00 Hcc 1 0.00 APcccg 1 0.00 AHc 1 0.00 A ----- ---- ---- 333 1.00 TOT Looking for suffixes that do not begin with "c": cat bio-j-huc-gut.wds \ | sed -e 's/^/_/g' -e 's/$/_/g' \ | compare-contexts '_AH.*_' '_cg.*_' '_cccH.*_' 72 0.20 cgcim 201 0.19 AHccgci 90 0.51 cccHcci 51 0.14 cgcir 191 0.18 AHcccgci 31 0.18 cccHci 50 0.14 cgcie 116 0.11 AHcie 15 0.08 cccHcccgci 36 0.10 cgci 94 0.09 AHcim 11 0.06 cccHccci 27 0.07 cgccccgci 87 0.08 AHccci 10 0.06 cccHccgci 17 0.05 cgoe 79 0.08 AHci 4 0.02 cccH 12 0.03 cgcin 54 0.05 AHcin 3 0.02 cccHcir 6 0.02 cgcif 50 0.05 AHcir 2 0.01 cccHcim 6 0.02 cgcieci 43 0.04 AHcci 2 0.01 cccHcif 5 0.01 cgcirci 21 0.02 AHoe 2 0.01 cccHc 5 0.01 cgcccgci 19 0.02 AHccccgci 1 0.01 cccHcor 4 0.01 cge 12 0.01 AHcccci 1 0.01 cccHcie 4 0.01 cgcccoe 7 0.01 AHcieci 1 0.01 cccHccie 3 0.01 cgor 5 0.00 AHcccg 1 0.01 cccHccg 3 0.01 cgciecgci 4 0.00 AHor 1 0.01 cccHcccie 3 0.01 cgccoe 4 0.00 AHcor 1 0.01 cccHcccci 3 0.01 cgcccci 4 0.00 AHcif 1 0.01 cccHccccg 3 0.01 cgccccci 4 0.00 AHccgcir ----- ---- ---- 2 0.01 cgcieo 3 0.00 AHciecgci 177 1.00 TOT 2 0.01 cgciecccgci 3 0.00 AHcicgci 2 0.01 cgcieccccgci 3 0.00 AHccg 2 0.01 cgciHccgci 2 0.00 AHe 2 0.01 cgciHcccgci 2 0.00 AHcic 2 0.01 cgccor 2 0.00 AHccgcie 2 0.01 cgccgci 2 0.00 AHcccgcie 2 0.01 cgccci 2 0.00 AHcccccgci 2 0.01 cgcccccgci 2 0.00 AH 2 0.01 cgHccgci 1 0.00 AHom 1 0.00 cgroe 1 0.00 AHoeoe 1 0.00 cgorcim 1 0.00 AHoecgci 1 0.00 cgoeccccgci 1 0.00 AHoeccccgci 1 0.00 cgoecccccgci 1 0.00 AHoPci 1 0.00 cgoePccccgci 1 0.00 AHcoeci 1 0.00 cgoeHccgci 1 0.00 AHcoe 1 0.00 cgoHccgci 1 0.00 AHcirci 1 0.00 cgoHccci 1 0.00 AHcircgci 1 0.00 cgeccccgci 1 0.00 AHciie 1 0.00 cgcirorci 1 0.00 AHcieoe 1 0.00 cgcirof 1 0.00 AHciecgcgci 1 0.00 cgcircccci 1 0.00 
                                       AHciecccci
    1 0.00 cgcircccccgcie       1 0.00 AHciec
    1 0.00 cgciie               1 0.00 AHcicccgci
    1 0.00 cgcieoe              1 0.00 AHcgci
    1 0.00 cgciecirci           1 0.00 AHccoe
    1 0.00 cgciecircg           1 0.00 AHccir
    1 0.00 cgciecir             1 0.00 AHccie
    1 0.00 cgciecim             1 0.00 AHccgor
    1 0.00 cgciecie             1 0.00 AHccgcim
    1 0.00 cgcieHci             1 0.00 AHccgccgci
    1 0.00 cgcicHcci            1 0.00 AHccgccccgci
    1 0.00 cgciHcim             1 0.00 AHccccg
    1 0.00 cgciHci              1 0.00 AHccccci
    1 0.00 cgciHcci             1 0.00 AHccccHcci
    1 0.00 cgciHccci            1 0.00 AHccc
    1 0.00 cgccoecicg           1 0.00 AHcc
    1 0.00 cgccccoe         ----- ---- ----
    1 0.00 cgcccccgcie       1044 1.00 TOT
    1 0.00 cgccccc
    1 0.00 cgcccHcccgci
    1 0.00 cgHccci
    1 0.00 cgHcccci
----- ---- ----
  363 1.00 TOT

So it seems that { oe or om e } are also valid suffixes.

OK, let's try to parse what we can into prefix:suffix:

split-prefix-suffix
------------------------------------------------
#! /n/gnu/bin/gawk -f
# Attempts to split words into prefix/suffix inserting ":" in between.
BEGIN {
  PREFS = "^(AH|AP|Ae|AeH|H|P|cH|ccH|cccH|ccccH|cg|cgciH|ci|ciH|e|eH|eccccH|oH|oP|oe|oeH|r)"
  SUFFS = "([co][^HP]*)$"
  SPLIT = ( PREFS SUFFS )
}
( $0 ~ SPLIT ) {
  match($0, PREFS)
  k = RLENGTH
  $0 = (substr($0, 1, k) ":" substr($0, k + 1))
  print
  next
}
/./ { print; next }
------------------------------------------------

    cat bio-j-huc-gut.wds \
      | split-prefix-suffix \
      | egrep -v ':' \
      | wfreq

  387 0.24 ccccgci
  139 0.09 cccci
  131 0.08 oe
   81 0.05 Ae
   61 0.04 ccccci
   60 0.04 cccccgci
   41 0.03 or
   33 0.02 cccoe
   30 0.02 ccim
   25 0.02 coe
   23 0.01 ccoe
   21 0.01 cim
   21 0.01 cccgci
   17 0.01 cccor
   14 0.01 ccie
   14 0.01 ccci
   12 0.01 cie
   11 0.01 ccir
   10 0.01 cor
   10 0.01 cccir
   10 0.01 cccc
    9 0.01 ccccgcim
    9 0.01 ccccgcie
    8 0.01 ccccir
    7 0.00 cir
    7 0.00 cccie
    7 0.00 ccccie
    7 0.00 ccccg
    6 0.00 cce
    6 0.00 c
    6 0.00 Ar
    6 0.00 Acccgci
    5 0.00 r
    5 0.00 oroe
    5 0.00 orci
    5 0.00 com
    5 0.00 cccPccci
    4 0.00 oePccccgci
    4 0.00 e
    4 0.00 coeHcccgci
    4 0.00 cin
    4 0.00 cge
    4 0.00 ccccoe
    4 0.00 ccccgcir
    4 0.00 ccccc
    4 0.00 cccH
    4 0.00 cPccci
    4 0.00 AcHccci
    4 0.00 A
    3 0.00 orcim
    3 0.00 ocHcci
    3 0.00
oH 3 0.00 ecccHcci 3 0.00 coecccci 3 0.00 coeHccci 3 0.00 cif 3 0.00 cieci 3 0.00 ccoecgci 3 0.00 cccif 3 0.00 cccgcir 3 0.00 ccccgoe 3 0.00 Acgci 2 0.00 rcccHci 2 0.00 oroeci 2 0.00 om 2 0.00 oeoHci 2 0.00 oeeccci 2 0.00 oecccHcci 2 0.00 ocgci 2 0.00 occcci 2 0.00 occccgci 2 0.00 ocHccci 2 0.00 ecccHci 2 0.00 coeccccgci 2 0.00 coHcccgci 2 0.00 ciroe 2 0.00 circi 2 0.00 cimci 2 0.00 ciecccgci 2 0.00 cgHccgci 2 0.00 ccoHci 2 0.00 cccoHci 2 0.00 ccccieci 2 0.00 ccccPcci 2 0.00 cccPccccgci 2 0.00 HoHoe 2 0.00 Accci 2 0.00 Accccgci 2 0.00 AHe 2 0.00 AH 1 0.00 rcicHcci 1 0.00 orcir 1 0.00 orcin 1 0.00 orcie 1 0.00 orcccci 1 0.00 orcccccgci 1 0.00 on 1 0.00 oeoeHccci 1 0.00 oeoHcir 1 0.00 oeoHccccgci 1 0.00 oeeof 1 0.00 oeccccHcci 1 0.00 oeccHci 1 0.00 oePocHcci 1 0.00 oeA 1 0.00 ocicccci 1 0.00 ocgcir 1 0.00 ocgcim 1 0.00 ocgcie 1 0.00 ocgccccgci 1 0.00 occie 1 0.00 occcgci 1 0.00 occccin 1 0.00 occccci 1 0.00 occcPoec 1 0.00 ocHcor 1 0.00 ocHccoe 1 0.00 ocHcccgci 1 0.00 oPcieHcim 1 0.00 oHeor 1 0.00 oHeoe 1 0.00 oHcieHci 1 0.00 oHccoHcir 1 0.00 oAe 1 0.00 oAPccccgci 1 0.00 oAHci 1 0.00 oA 1 0.00 o 1 0.00 er 1 0.00 eoeHcim 1 0.00 eoHcif 1 0.00 eecgcir 1 0.00 ecccPcccgci 1 0.00 ePoe 1 0.00 ePcccgci 1 0.00 ePccccgci 1 0.00 eHecccci 1 0.00 eHe 1 0.00 coeor 1 0.00 coeci 1 0.00 coecgci 1 0.00 coecccoe 1 0.00 coeccHcie 1 0.00 coeHci 1 0.00 coeHcci 1 0.00 coeHccgci 1 0.00 cocgcir 1 0.00 cocHccci 1 0.00 coHoe 1 0.00 ciror 1 0.00 ciroHcci 1 0.00 circir 1 0.00 cioHci 1 0.00 cieorci 1 0.00 cieor 1 0.00 cieoe 1 0.00 cieciecgci 1 0.00 cieccccgci 1 0.00 cieccccc 1 0.00 ciccccHcci 1 0.00 ciccccHccci 1 0.00 cicPcim 1 0.00 cicHccci 1 0.00 ciPcccci 1 0.00 ciPccccgci 1 0.00 ci 1 0.00 cgroe 1 0.00 cgoePccccgci 1 0.00 cgoeHccgci 1 0.00 cgoHccgci 1 0.00 cgoHccci 1 0.00 cgeccccgci 1 0.00 cgcieHci 1 0.00 cgcicHcci 1 0.00 cgcccHcccgci 1 0.00 cgHccci 1 0.00 cgHcccci 1 0.00 cec 1 0.00 ccor 1 0.00 ccoeo 1 0.00 ccoeci 1 0.00 ccoecccci 1 0.00 ccoeHcccci 1 0.00 ccocgcim 1 0.00 ccocPcccgci 1 0.00 
ccocHcci 1 0.00 ccoHcim 1 0.00 ccoHcie 1 0.00 ccoHcccgci 1 0.00 ccircie 1 0.00 ccino 1 0.00 ccieci 1 0.00 ccieccccg 1 0.00 ccieHccci 1 0.00 cciHcim 1 0.00 cci 1 0.00 ccer 1 0.00 ccecgcim 1 0.00 ccecgci 1 0.00 cceccPccccci 1 0.00 ccec 1 0.00 cccoeoe 1 0.00 cccoeo 1 0.00 cccoecgci 1 0.00 cccoecccgci 1 0.00 cccoecccci 1 0.00 ccciror 1 0.00 cccirci 1 0.00 cccieoeci 1 0.00 ccciHccci 1 0.00 cccgoe 1 0.00 cccgcin 1 0.00 cccgcif 1 0.00 cccgcie 1 0.00 ccccor 1 0.00 ccccieor 1 0.00 cccciHci 1 0.00 ccccgor 1 0.00 ccccgcif 1 0.00 ccccgciecgci 1 0.00 ccccgciHcir 1 0.00 ccccgccci 1 0.00 cccccoe 1 0.00 cccccim 1 0.00 cccccie 1 0.00 cccccgcir 1 0.00 cccccg 1 0.00 cccccci 1 0.00 cccccHcci 1 0.00 cccccHccci 1 0.00 ccccPcccgci 1 0.00 cccPoe 1 0.00 cccPci 1 0.00 cccPcccgci 1 0.00 cccP 1 0.00 ccPcim 1 0.00 ccPccccgci 1 0.00 ccPcccccgci 1 0.00 cc 1 0.00 cPcoe 1 0.00 cPccir 1 0.00 cPccie 1 0.00 cPccgor 1 0.00 cPcccci 1 0.00 cPccccgci 1 0.00 PoecgciHci 1 0.00 PoeHcccoe 1 0.00 PoeHccci 1 0.00 PoHcin 1 0.00 PciHccgci 1 0.00 HoePci 1 0.00 HoeHcci 1 0.00 HocHccci 1 0.00 HoHcieci 1 0.00 HciAHci 1 0.00 HcccoHcccgci 1 0.00 HcccgoeHcgci 1 0.00 H 1 0.00 Arcim 1 0.00 Aoe 1 0.00 An 1 0.00 Acir 1 0.00 Acim 1 0.00 Acie 1 0.00 AciHci 1 0.00 AcgciHcci 1 0.00 Acgccci 1 0.00 Acgcccgci 1 0.00 Acgccccg 1 0.00 Acccci 1 0.00 AccHcccgci 1 0.00 AcHcci 1 0.00 AcHcccgci 1 0.00 AcHccc 1 0.00 AP 1 0.00 AHoPci 1 0.00 AHccccHcci 1 0.00 AAHccci ----- ---- ---- 1585 1.00 TOT I must do something about the "c" prefix.... 
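The tallies in this notebook come from `wfreq`, whose source is not listed here. For readers who want to reproduce listings like the one above, here is a minimal stand-in. It assumes, from the outputs shown, that `wfreq` prints count, relative frequency and word, sorted by decreasing count with ties in reverse lexicographic order, followed by a total line. The name `wfreq_sketch` is hypothetical.

```shell
# Sketch only: a stand-in for the notebook's `wfreq`, whose source is not shown.
# Reads one word per line; prints "count fraction word", most frequent first
# (ties in reverse lexicographic order), then a total line.
wfreq_sketch() {
  LC_ALL=C sort | uniq -c \
    | awk '{ print $1, $2 }' \
    | LC_ALL=C sort -k1,1nr -k2,2r \
    | awk '
        # Buffer all rows so that the grand total is known before printing.
        { count[NR] = $1; word[NR] = $2; total += $1 }
        END {
          for (i = 1; i <= NR; i++)
            printf "%d %.2f %s\n", count[i], count[i] / total, word[i]
          printf "%d %.2f TOT\n", total, 1
        }
      '
}

# Toy example with four words, one of them repeated.
printf 'cim\nccgci\ncim\noe\n' | wfreq_sketch
```

The column widths and the dashed separator of the real tool are not reproduced; only the ordering and the three fields are.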
cat bio-j-huc-gut.wds \ | split-prefix-suffix \ | egrep ':' \ | sed -e 's/^.*://g' \ | wfreq All suffixes recognized by the code: 376 0.12 ccgci 355 0.11 cccgci 267 0.08 ci 265 0.08 ccccgci 255 0.08 cim 243 0.08 cie 240 0.08 cci 201 0.06 ccci 174 0.06 cir 111 0.04 cccci 102 0.03 oe 98 0.03 cin 41 0.01 or 25 0.01 ccccci 21 0.01 cgci 21 0.01 cccoe 20 0.01 cieci 18 0.01 cif 18 0.01 cccccgci 14 0.00 ccccg 13 0.00 coe 12 0.00 ccoe 11 0.00 cccg 9 0.00 circi 8 0.00 ciecgci 8 0.00 ccor 8 0.00 ccgcir 7 0.00 oeci 7 0.00 ccccgcir 6 0.00 cor 6 0.00 ccg 5 0.00 om 5 0.00 o 5 0.00 cieor 5 0.00 cicgci 5 0.00 ccie 5 0.00 ccgcie 4 0.00 oecgci 4 0.00 oeccccgci 4 0.00 cieoe 4 0.00 cieccccgci 4 0.00 cccir 4 0.00 cccgcie 4 0.00 ccccc 4 0.00 c 3 0.00 oecccgci 3 0.00 ciecccgci 3 0.00 cic 3 0.00 cccie 3 0.00 cccgcir 3 0.00 ccccgcie 3 0.00 ccc 3 0.00 cc 2 0.00 oroe 2 0.00 orcim 2 0.00 oeccci 2 0.00 oecccci 2 0.00 coecgci 2 0.00 ciie 2 0.00 cieo 2 0.00 cgoe 2 0.00 cgcim 2 0.00 ccgcim 2 0.00 cce 2 0.00 ccccif 1 0.00 oroeccccgci 1 0.00 orcin 1 0.00 orcie 1 0.00 orccci 1 0.00 oeor 1 0.00 oeof 1 0.00 oeoe 1 0.00 oecircir 1 0.00 oecim 1 0.00 oeciecccgci 1 0.00 oecgcie 1 0.00 oecgccccgci 1 0.00 oecccgcie 1 0.00 oecccg 1 0.00 oeccccg 1 0.00 oecccccgci 1 0.00 ocgcie 1 0.00 ocg 1 0.00 occci 1 0.00 occccgci 1 0.00 coeor 1 0.00 coeci 1 0.00 co 1 0.00 cirorci 1 0.00 cirof 1 0.00 ciroeof 1 0.00 ciroe 1 0.00 circif 1 0.00 circie 1 0.00 circgci 1 0.00 circccci 1 0.00 circccccgcie 1 0.00 cioc 1 0.00 cimci 1 0.00 cifo 1 0.00 ciecirci 1 0.00 ciecircg 1 0.00 ciecir 1 0.00 ciecim 1 0.00 ciecie 1 0.00 ciecgcgci 1 0.00 ciecce 1 0.00 cieccci 1 0.00 ciecccci 1 0.00 ciec 1 0.00 cicccgci 1 0.00 cgcircie 1 0.00 cgcircccci 1 0.00 cgcir 1 0.00 cgcieor 1 0.00 cgcieccor 1 0.00 cgcie 1 0.00 cg 1 0.00 ccoecicg 1 0.00 ccoeci 1 0.00 ccocgci 1 0.00 cco 1 0.00 ccir 1 0.00 ccin 1 0.00 ccieci 1 0.00 ccgor 1 0.00 ccgcioe 1 0.00 ccgcif 1 0.00 ccgcieor 1 0.00 ccgcci 1 0.00 ccgccgci 1 0.00 ccgccci 1 0.00 ccgccccgci 1 0.00 cccor 1 
0.00 cccoecgci 1 0.00 ccciroe 1 0.00 cccieci 1 0.00 cccgcif 1 0.00 cccgciA 1 0.00 cccgccci 1 0.00 ccccor 1 0.00 ccccoe 1 0.00 ccccir 1 0.00 cccciecgci 1 0.00 cccciecg 1 0.00 ccccie 1 0.00 ccccic 1 0.00 ccccgcirci 1 0.00 cccccgcie 1 0.00 cccccg 1 0.00 cccccci 1 0.00 cccc ----- ---- ---- 3157 1.00 TOT In reverse-lex order: 1 0.00 cccgciA 4 0.00 c 3 0.00 cc 3 0.00 ccc 1 0.00 cccc 4 0.00 ccccc 1 0.00 ciec 3 0.00 cic 1 0.00 ccccic 1 0.00 cioc 2 0.00 cce 1 0.00 ciecce 243 0.08 cie 5 0.00 ccie 3 0.00 cccie 1 0.00 ccccie 1 0.00 ciecie 1 0.00 cgcie 5 0.00 ccgcie 4 0.00 cccgcie 3 0.00 ccccgcie 1 0.00 cccccgcie 1 0.00 circccccgcie 1 0.00 oecccgcie 1 0.00 oecgcie 1 0.00 ocgcie 1 0.00 circie 1 0.00 cgcircie 1 0.00 orcie 2 0.00 ciie 102 0.03 oe 13 0.00 coe 12 0.00 ccoe 21 0.01 cccoe 1 0.00 ccccoe 4 0.00 cieoe 1 0.00 oeoe 2 0.00 cgoe 1 0.00 ccgcioe 1 0.00 ciroe 1 0.00 ccciroe 2 0.00 oroe 18 0.01 cif 2 0.00 ccccif 1 0.00 ccgcif 1 0.00 cccgcif 1 0.00 circif 1 0.00 oeof 1 0.00 ciroeof 1 0.00 cirof 1 0.00 cg 6 0.00 ccg 11 0.00 cccg 14 0.00 ccccg 1 0.00 cccccg 1 0.00 oeccccg 1 0.00 oecccg 1 0.00 cccciecg 1 0.00 ccoecicg 1 0.00 ocg 1 0.00 ciecircg 267 0.08 ci 240 0.08 cci 201 0.06 ccci 111 0.04 cccci 25 0.01 ccccci 1 0.00 cccccci 1 0.00 ciecccci 2 0.00 oecccci 1 0.00 circccci 1 0.00 cgcircccci 1 0.00 cieccci 2 0.00 oeccci 1 0.00 ccgccci 1 0.00 cccgccci 1 0.00 occci 1 0.00 orccci 1 0.00 ccgcci 20 0.01 cieci 1 0.00 ccieci 1 0.00 cccieci 7 0.00 oeci 1 0.00 coeci 1 0.00 ccoeci 21 0.01 cgci 376 0.12 ccgci 355 0.11 cccgci 265 0.08 ccccgci 18 0.01 cccccgci 1 0.00 oecccccgci 4 0.00 cieccccgci 4 0.00 oeccccgci 1 0.00 oroeccccgci 1 0.00 ccgccccgci 1 0.00 oecgccccgci 1 0.00 occccgci 3 0.00 ciecccgci 1 0.00 oeciecccgci 3 0.00 oecccgci 1 0.00 cicccgci 1 0.00 ccgccgci 8 0.00 ciecgci 1 0.00 cccciecgci 4 0.00 oecgci 2 0.00 coecgci 1 0.00 cccoecgci 1 0.00 ciecgcgci 5 0.00 cicgci 1 0.00 ccocgci 1 0.00 circgci 1 0.00 cimci 9 0.00 circi 1 0.00 ciecirci 1 0.00 ccccgcirci 1 0.00 cirorci 255 0.08 cim 1 0.00 
           ciecim
    1 0.00 oecim
    2 0.00 cgcim
    2 0.00 ccgcim
    2 0.00 orcim
    5 0.00 om
   98 0.03 cin
    1 0.00 ccin
    1 0.00 orcin
    5 0.00 o
    1 0.00 co
    1 0.00 cco
    2 0.00 cieo
    1 0.00 cifo
  174 0.06 cir
    1 0.00 ccir
    4 0.00 cccir
    1 0.00 ccccir
    1 0.00 ciecir
    1 0.00 cgcir
    8 0.00 ccgcir
    3 0.00 cccgcir
    7 0.00 ccccgcir
    1 0.00 oecircir
   41 0.01 or
    6 0.00 cor
    8 0.00 ccor
    1 0.00 cccor
    1 0.00 ccccor
    1 0.00 cgcieccor
    5 0.00 cieor
    1 0.00 cgcieor
    1 0.00 ccgcieor
    1 0.00 oeor
    1 0.00 coeor
    1 0.00 ccgor

Note the isolated peaks at suffixes that begin with "ci" or "o". These sharp peaks are evidence that the prefixes recognized above do NOT have alternates with a "c" appended. That is, the suffixes

   21 0.01 cgci
  376 0.12 ccgci
  355 0.11 cccgci
  265 0.08 ccccgci
   18 0.01 cccccgci

appear to be different suffixes, and not the same "ccgci" suffix attached to different prefixes.

Here are the most significant suffixes:

  243 0.08 cie
  102 0.03 oe
   13 0.00 coe
   12 0.00 ccoe
   21 0.01 cccoe
   18 0.01 cif
   11 0.00 cccg
   14 0.00 ccccg
  267 0.08 ci
  240 0.08 cci
  201 0.06 ccci
  111 0.04 cccci
   25 0.01 ccccci
   20 0.01 cieci
   21 0.01 cgci
  376 0.12 ccgci
  355 0.11 cccgci
  265 0.08 ccccgci
   18 0.01 cccccgci
  255 0.08 cim
   98 0.03 cin
  174 0.06 cir
   41 0.01 or

Removing the strings of "c":

    cat bio-j-huc-gut.wds \
      | split-prefix-suffix \
      | egrep ':c' \
      | sed -e 's/^.*:c*//g' \
      | wfreq

 1035 0.35 gci
  845 0.29 i
  255 0.09 im
  252 0.09 ie
  180 0.06 ir
   99 0.03 in
   47 0.02 oe
   33 0.01 g
   22 0.01 ieci
   20 0.01 if
   19 0.01 gcir
   16 0.01 or
   15 0.01
   14 0.00 gcie
    9 0.00 irci
    9 0.00 iecgci
    5 0.00 ieor
    5 0.00 icgci
    4 0.00 ieoe
    4 0.00 ieccccgci
    4 0.00 ic
    4 0.00 gcim
    3 0.00 oecgci
    3 0.00 iecccgci
    2 0.00 oeci
    2 0.00 o
    2 0.00 iroe
    2 0.00 iie
    2 0.00 ieo
    2 0.00 goe
    2 0.00 gcif
    2 0.00 gcieor
    2 0.00 gccci
    2 0.00 e
    1 0.00 oeor
    1 0.00 oecicg
    1 0.00 ocgci
    1 0.00 irorci
    1 0.00 irof
    1 0.00 iroeof
    1 0.00 ircif
    1 0.00 ircie
    1 0.00 ircgci
    1 0.00 ircccci
    1 0.00 ircccccgcie
    1 0.00 ioc
    1 0.00 imci
    1 0.00 ifo
    1 0.00 iecirci
    1 0.00 iecircg
    1 0.00 iecir
    1 0.00 iecim
    1 0.00 iecie
    1 0.00 iecgcgci
    1 0.00
           iecg
    1 0.00 iecce
    1 0.00 ieccci
    1 0.00 iecccci
    1 0.00 iec
    1 0.00 icccgci
    1 0.00 gor
    1 0.00 gcircie
    1 0.00 gcirci
    1 0.00 gcircccci
    1 0.00 gcioe
    1 0.00 gcieccor
    1 0.00 gciA
    1 0.00 gcci
    1 0.00 gccgci
    1 0.00 gccccgci
----- ---- ----
 2958 1.00 TOT

Redefined suffixes, and added a few prefixes:

split-prefix-suffix
------------------------------------------------
#! /n/gnu/bin/gawk -f
# Attempts to split words into prefix/suffix inserting ":" in between.
BEGIN {
  PREFS = "^(AH|AP|Ae|AeH|H|P|cH|ccH|cccH|ccccH|cg|cgciH|ci|ciH|e|eH|ecccH|eccccH|oH|oP|oe|oeH|oecccH|coeH|r|c)"
  SUFFS = "([coe][cgiroe]*([mnrfe]|))$"
  SPLIT = ( PREFS SUFFS )
}
( $0 ~ SPLIT ) {
  match($0, PREFS)
  k = RLENGTH
  $0 = (substr($0, 1, k) ":" substr($0, k + 1))
  print
  next
}
/./ { print; next }
------------------------------------------------

Here are the suffixes recognized by this code:

  746 0.18 cccgci
  398 0.09 ccgci
  343 0.08 ccci
  325 0.08 ccccgci
  285 0.07 cim
  271 0.06 ci
  260 0.06 cci
  257 0.06 cie
  185 0.04 cir
  172 0.04 cccci
  127 0.03 oe
   98 0.02 cin
   51 0.01 or
   45 0.01 ccoe
   36 0.01 coe
   26 0.01 ccccci
   25 0.01 ccor
   25 0.01 cccoe
   21 0.00 cieci
   21 0.00 cgci
   19 0.00 e
   18 0.00 cif
   18 0.00 cccg
   18 0.00 cccccgci
   15 0.00 ccccg
   13 0.00 cccgcie
   13 0.00 ccc
   12 0.00 ccie
   12 0.00 cccir
   11 0.00 ccir
   11 0.00 ccgcir
   10 0.00 om
   10 0.00 cccie
    9 0.00 circi
    9 0.00 cccgcim
    8 0.00 oeci
    8 0.00 ciecgci
    8 0.00 ccccgcir
    7 0.00 cor
    7 0.00 cccgcir
    6 0.00 oeccccgci
    6 0.00 ce
    6 0.00 ccgcie
    6 0.00 ccg
    5 0.00 oecgci
    5 0.00 oecccci
    5 0.00 o
    5 0.00 coecgci
    5 0.00 cieor
    5 0.00 cicgci
    5 0.00 cccc
    5 0.00 c
    4 0.00 cieoe
    4 0.00 cieccccgci
    4 0.00 ccccc
    3 0.00 oecccgci
    3 0.00 eci
    3 0.00 ciecccgci
    3 0.00 cic
    3 0.00 ccif
    3 0.00 cccieci
    3 0.00 cccgoe
    3 0.00 ccccgcie
    3 0.00 cc
    2 0.00 oroe
    2 0.00 orcim
    2 0.00 oeor
    2 0.00 oeccci
    2 0.00 eor
    2 0.00 eoe
    2 0.00 eccci
    2 0.00 ecccgci
    2 0.00 eccccgci
    2 0.00 coeci
    2 0.00 circie
    2 0.00 ciie
    2 0.00 cieo
    2 0.00 cgoe
    2 0.00 cgcim
    2 0.00 ccgcim
    2 0.00 ccgcif
    2 0.00 cce
    2 0.00 cccor
    2 0.00 cccgcif
    2 0.00 cccgccci
    2 0.00
ccccoe 2 0.00 ccccif 2 0.00 ccccie 1 0.00 oroeccccgci 1 0.00 orcin 1 0.00 orcie 1 0.00 orccci 1 0.00 oeof 1 0.00 oeoe 1 0.00 oecircir 1 0.00 oecim 1 0.00 oeciecccgci 1 0.00 oecgcie 1 0.00 oecgccccgci 1 0.00 oecccoe 1 0.00 oecccgcie 1 0.00 oecccg 1 0.00 oeccccg 1 0.00 oecccccgci 1 0.00 ocgcir 1 0.00 ocgcie 1 0.00 ocg 1 0.00 occci 1 0.00 occccgci 1 0.00 eorci 1 0.00 eof 1 0.00 eciecgci 1 0.00 ecgcir 1 0.00 ecccci 1 0.00 eccccc 1 0.00 ec 1 0.00 coeor 1 0.00 coeo 1 0.00 coecccci 1 0.00 cocgcim 1 0.00 co 1 0.00 cirorci 1 0.00 cirof 1 0.00 ciroeof 1 0.00 ciroe 1 0.00 circif 1 0.00 circgci 1 0.00 circccci 1 0.00 circccccgcie 1 0.00 cioc 1 0.00 ciecirci 1 0.00 ciecircg 1 0.00 ciecir 1 0.00 ciecim 1 0.00 ciecie 1 0.00 ciecgcgci 1 0.00 ciecce 1 0.00 cieccci 1 0.00 ciecccci 1 0.00 cieccccg 1 0.00 ciec 1 0.00 cicccgci 1 0.00 cgcircie 1 0.00 cgcircccci 1 0.00 cgcir 1 0.00 cgcieor 1 0.00 cgcieccor 1 0.00 cgcie 1 0.00 cg 1 0.00 cer 1 0.00 cecgcim 1 0.00 cecgci 1 0.00 cec 1 0.00 ccoeoe 1 0.00 ccoeo 1 0.00 ccoecicg 1 0.00 ccoeci 1 0.00 ccoecgci 1 0.00 ccoecccgci 1 0.00 ccoecccci 1 0.00 ccocgci 1 0.00 cco 1 0.00 cciror 1 0.00 ccirci 1 0.00 ccin 1 0.00 ccieoeci 1 0.00 ccieci 1 0.00 ccgor 1 0.00 ccgoe 1 0.00 ccgcioe 1 0.00 ccgcin 1 0.00 ccgcieor 1 0.00 ccgcci 1 0.00 ccgccgci 1 0.00 ccgccci 1 0.00 ccgccccgci 1 0.00 cccoecgci 1 0.00 ccciroe 1 0.00 cccieor 1 0.00 cccgor 1 0.00 cccgciecgci 1 0.00 ccccor 1 0.00 ccccir 1 0.00 ccccim 1 0.00 cccciecgci 1 0.00 cccciecg 1 0.00 ccccic 1 0.00 ccccgcirci 1 0.00 cccccgcie 1 0.00 cccccg 1 0.00 cccccci ----- ---- ---- 4207 1.00 TOT In reverse lex order: 5 0.00 c 3 0.00 cc 13 0.00 ccc 5 0.00 cccc 4 0.00 ccccc 1 0.00 eccccc 1 0.00 ec 1 0.00 cec 1 0.00 ciec 3 0.00 cic 1 0.00 ccccic 1 0.00 cioc 19 0.00 e 6 0.00 ce 2 0.00 cce 1 0.00 ciecce 257 0.06 cie 12 0.00 ccie 10 0.00 cccie 2 0.00 ccccie 1 0.00 ciecie 1 0.00 cgcie 6 0.00 ccgcie 13 0.00 cccgcie 3 0.00 ccccgcie 1 0.00 cccccgcie 1 0.00 circccccgcie 1 0.00 oecccgcie 1 0.00 oecgcie 1 0.00 ocgcie 2 0.00 
circie 1 0.00 cgcircie 1 0.00 orcie 2 0.00 ciie 127 0.03 oe 36 0.01 coe 45 0.01 ccoe 25 0.01 cccoe 2 0.00 ccccoe 1 0.00 oecccoe 2 0.00 eoe 4 0.00 cieoe 1 0.00 oeoe 1 0.00 ccoeoe 2 0.00 cgoe 1 0.00 ccgoe 3 0.00 cccgoe 1 0.00 ccgcioe 1 0.00 ciroe 1 0.00 ccciroe 2 0.00 oroe 18 0.00 cif 3 0.00 ccif 2 0.00 ccccif 2 0.00 ccgcif 2 0.00 cccgcif 1 0.00 circif 1 0.00 eof 1 0.00 oeof 1 0.00 ciroeof 1 0.00 cirof 1 0.00 cg 6 0.00 ccg 18 0.00 cccg 15 0.00 ccccg 1 0.00 cccccg 1 0.00 cieccccg 1 0.00 oeccccg 1 0.00 oecccg 1 0.00 cccciecg 1 0.00 ccoecicg 1 0.00 ocg 1 0.00 ciecircg 271 0.06 ci 260 0.06 cci 343 0.08 ccci 172 0.04 cccci 26 0.01 ccccci 1 0.00 cccccci 1 0.00 ecccci 1 0.00 ciecccci 5 0.00 oecccci 1 0.00 coecccci 1 0.00 ccoecccci 1 0.00 circccci 1 0.00 cgcircccci 2 0.00 eccci 1 0.00 cieccci 2 0.00 oeccci 1 0.00 ccgccci 2 0.00 cccgccci 1 0.00 occci 1 0.00 orccci 1 0.00 ccgcci 3 0.00 eci 21 0.00 cieci 1 0.00 ccieci 3 0.00 cccieci 8 0.00 oeci 2 0.00 coeci 1 0.00 ccoeci 1 0.00 ccieoeci 21 0.00 cgci 398 0.09 ccgci 746 0.18 cccgci 325 0.08 ccccgci 18 0.00 cccccgci 1 0.00 oecccccgci 2 0.00 eccccgci 4 0.00 cieccccgci 6 0.00 oeccccgci 1 0.00 oroeccccgci 1 0.00 ccgccccgci 1 0.00 oecgccccgci 1 0.00 occccgci 2 0.00 ecccgci 3 0.00 ciecccgci 1 0.00 oeciecccgci 3 0.00 oecccgci 1 0.00 ccoecccgci 1 0.00 cicccgci 1 0.00 ccgccgci 1 0.00 cecgci 8 0.00 ciecgci 1 0.00 cccciecgci 1 0.00 eciecgci 1 0.00 cccgciecgci 5 0.00 oecgci 5 0.00 coecgci 1 0.00 ccoecgci 1 0.00 cccoecgci 1 0.00 ciecgcgci 5 0.00 cicgci 1 0.00 ccocgci 1 0.00 circgci 9 0.00 circi 1 0.00 ccirci 1 0.00 ciecirci 1 0.00 ccccgcirci 1 0.00 eorci 1 0.00 cirorci 285 0.07 cim 1 0.00 ccccim 1 0.00 ciecim 1 0.00 oecim 2 0.00 cgcim 2 0.00 ccgcim 9 0.00 cccgcim 1 0.00 cecgcim 1 0.00 cocgcim 2 0.00 orcim 10 0.00 om 98 0.02 cin 1 0.00 ccin 1 0.00 ccgcin 1 0.00 orcin 5 0.00 o 1 0.00 co 1 0.00 cco 2 0.00 cieo 1 0.00 coeo 1 0.00 ccoeo 1 0.00 cer 185 0.04 cir 11 0.00 ccir 12 0.00 cccir 1 0.00 ccccir 1 0.00 ciecir 1 0.00 cgcir 11 0.00 ccgcir 7 
0.00 cccgcir 8 0.00 ccccgcir 1 0.00 ecgcir 1 0.00 ocgcir 1 0.00 oecircir 51 0.01 or 7 0.00 cor 25 0.01 ccor 2 0.00 cccor 1 0.00 ccccor 1 0.00 cgcieccor 2 0.00 eor 5 0.00 cieor 1 0.00 cccieor 1 0.00 cgcieor 1 0.00 ccgcieor 2 0.00 oeor 1 0.00 coeor 1 0.00 ccgor 1 0.00 cccgor 1 0.00 cciror These are the significant ones (15 or more occurrences): 19 0.00 e 257 0.06 cie 127 0.03 oe 36 0.01 coe 45 0.01 ccoe 25 0.01 cccoe 18 0.00 cif 18 0.00 cccg 15 0.00 ccccg 271 0.06 ci 260 0.06 cci 343 0.08 ccci 172 0.04 cccci 26 0.01 ccccci 21 0.00 cieci 21 0.00 cgci 398 0.09 ccgci 746 0.18 cccgci 325 0.08 ccccgci 18 0.00 cccccgci 285 0.07 cim 98 0.02 cin 185 0.04 cir 51 0.01 or 25 0.01 ccor Stripping the "[coe]c*" prefix of all suffixes: cat bio-j-huc-gut.wds \ | split-prefix-suffix \ | egrep ':' \ | sed -e 's/^.*:[coe]c*//g' \ | wfreq 1513 0.36 gci 1080 0.26 i 286 0.07 im 281 0.07 ie 209 0.05 ir 135 0.03 e 110 0.03 oe 99 0.02 in 56 0.01 51 0.01 r 42 0.01 g 37 0.01 or 29 0.01 gcir 25 0.01 ieci 25 0.01 gcie 23 0.01 if 13 0.00 gcim 10 0.00 m 10 0.00 irci 10 0.00 iecgci 8 0.00 eci 7 0.00 oecgci 6 0.00 ieor 6 0.00 goe 6 0.00 ecgci 6 0.00 eccccgci 5 0.00 icgci 5 0.00 ecccci 4 0.00 ieoe 4 0.00 ieccccgci 4 0.00 ic 4 0.00 gcif 3 0.00 oeci 3 0.00 iecccgci 3 0.00 gccci 3 0.00 ecccgci 2 0.00 roe 2 0.00 rcim 2 0.00 oeo 2 0.00 oecccci 2 0.00 o 2 0.00 iroe 2 0.00 ircie 2 0.00 iie 2 0.00 ieo 2 0.00 gor 2 0.00 gcieor 2 0.00 eor 2 0.00 eccci 1 0.00 roeccccgci 1 0.00 rcin 1 0.00 rcie 1 0.00 rccci 1 0.00 orci 1 0.00 of 1 0.00 oeor 1 0.00 oeoe 1 0.00 oecicg 1 0.00 oecccgci 1 0.00 ocgcim 1 0.00 ocgci 1 0.00 irorci 1 0.00 iror 1 0.00 irof 1 0.00 iroeof 1 0.00 ircif 1 0.00 ircgci 1 0.00 ircccci 1 0.00 ircccccgcie 1 0.00 ioc 1 0.00 ieoeci 1 0.00 iecirci 1 0.00 iecircg 1 0.00 iecir 1 0.00 iecim 1 0.00 iecie 1 0.00 iecgcgci 1 0.00 iecg 1 0.00 iecce 1 0.00 ieccci 1 0.00 iecccci 1 0.00 ieccccg 1 0.00 iec 1 0.00 icccgci 1 0.00 gcircie 1 0.00 gcirci 1 0.00 gcircccci 1 0.00 gcioe 1 0.00 gcin 1 0.00 gciecgci 1 0.00 
           gcieccor
    1 0.00 gcci
    1 0.00 gccgci
    1 0.00 gccccgci
    1 0.00 er
    1 0.00 eof
    1 0.00 eoe
    1 0.00 ecircir
    1 0.00 ecim
    1 0.00 eciecccgci
    1 0.00 ecgcim
    1 0.00 ecgcie
    1 0.00 ecgccccgci
    1 0.00 ecccoe
    1 0.00 ecccgcie
    1 0.00 ecccg
    1 0.00 eccccg
    1 0.00 ecccccgci
    1 0.00 ec
----- ---- ----
 4207 1.00 TOT

The significant ones (mostly >40 cases):

   56 0.01 _
  135 0.03 e
   42 0.01 g
 1513 0.36 gci
 1080 0.26 i
  281 0.07 ie
   23 0.01 if
  286 0.07 im
   99 0.02 in
  209 0.05 ir
  110 0.03 oe
   37 0.01 or
   51 0.01 r

Redefined split-prefix-suffix accordingly:

    SUFFS = "([coe]c*(|e|g|gci|i|ie|if|im|in|ir|oe|or|r))$"

Let's see what prefixes we get now:

    cat bio-j-huc-gut.wds \
      | split-prefix-suffix \
      | egrep ':' \
      | sed -e 's/:.*$//g' \
      | wfreq

 1000 0.25 AH
  921 0.23 c
  412 0.11 oH
  307 0.08 cg
  194 0.05 e
  173 0.04 cccH
  147 0.04 oe
  134 0.03 H
  107 0.03 ccccH
  103 0.03 oeH
   65 0.02 r
   51 0.01 ciH
   50 0.01 ci
   40 0.01 eH
   40 0.01 P
   38 0.01 Ae
   34 0.01 oP
   21 0.01 cH
   21 0.01 AP
   19 0.00 ccH
   11 0.00 AeH
   10 0.00 coeH
    9 0.00 eccccH
    8 0.00 cgciH
    5 0.00 ecccH
    2 0.00 oecccH
----- ---- ----
 3922 1.00 TOT

Keeping only the significant ones (>20 occurrences):

 1000 0.25 AH
   21 0.01 AP
   38 0.01 Ae
  134 0.03 H
   40 0.01 P
  921 0.23 c
   21 0.01 cH
  173 0.04 cccH
  107 0.03 ccccH
  307 0.08 cg
   50 0.01 ci
   51 0.01 ciH
  194 0.05 e
   40 0.01 eH
  412 0.11 oH
   34 0.01 oP
  147 0.04 oe
  103 0.03 oeH
   65 0.02 r

Fixing split-prefix-suffix:

    PREFS = "^(AH|AP|Ae|H|P|c|cH|cccH|ccccH|cg|ci|ciH|e|eH|oH|oP|oe|oeH|r)"

Listing again the suffixes, without "[coe]c*" prefixes:

    cat bio-j-huc-gut.wds \
      | split-prefix-suffix \
      | egrep ':' \
      | sed -e 's/^.*:[coe]c*//g' \
      | wfreq

 1497 0.39 gci
 1042 0.27 i
  284 0.07 im
  278 0.07 ie
  208 0.05 ir
  132 0.03 e
  109 0.03 oe
   99 0.03 in
   56 0.01
   51 0.01 r
   42 0.01 g
   37 0.01 or
   23 0.01 if
----- ---- ----
 3858 1.00 TOT

The complete suffixes:

  736 0.19 cccgci
  392 0.10 ccgci
  333 0.09 ccci
  325 0.08 ccccgci
  283 0.07 cim
  257 0.07 ci
  254 0.07 cie
  247 0.06 cci
  184 0.05 cir
  171 0.04 cccci
  124 0.03 oe
   98 0.03 cin
   51 0.01 or
   45 0.01 ccoe
   35 0.01 coe
   26 0.01 ccccci
   25 0.01
ccor 25 0.01 cccoe 21 0.01 cgci 19 0.00 e 18 0.00 cif 18 0.00 cccg 18 0.00 cccccgci 15 0.00 ccccg 13 0.00 ccc 12 0.00 ccie 12 0.00 cccir 11 0.00 ccir 10 0.00 cccie 7 0.00 cor 6 0.00 ce 6 0.00 ccg 5 0.00 o 5 0.00 cccc 5 0.00 c 4 0.00 ccccc 3 0.00 eci 3 0.00 ccif 3 0.00 cc 2 0.00 eor 2 0.00 eoe 2 0.00 eccci 2 0.00 ecccgci 2 0.00 eccccgci 2 0.00 cce 2 0.00 cccor 2 0.00 ccccoe 2 0.00 ccccif 2 0.00 ccccie 1 0.00 ocg 1 0.00 occci 1 0.00 occccgci 1 0.00 ecccci 1 0.00 eccccc 1 0.00 ec 1 0.00 cg 1 0.00 ccin 1 0.00 ccccor 1 0.00 ccccir 1 0.00 ccccim 1 0.00 cccccg 1 0.00 cccccci ----- ---- ---- 3858 1.00 TOT Removed suffixes beginning with "e": SUFFS = "([co]c*(|e|g|gci|i|ie|if|im|in|ir|oe|or|r))$" Tabulating prefixes: 998 0.26 AH 920 0.24 c 410 0.11 oH 302 0.08 cg 194 0.05 e 173 0.05 cccH 145 0.04 oe 134 0.04 H 107 0.03 ccccH 103 0.03 oeH 65 0.02 r 51 0.01 ciH 40 0.01 P 38 0.01 eH 38 0.01 Ae 34 0.01 oP 29 0.01 ci 21 0.01 cH 21 0.01 AP ----- ---- ---- 3823 1.00 TOT It looks like "P" is equivalent to "H"... 
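One way to probe whether "P" really behaves like "H" is to put their suffix counts side by side. The following is a minimal sketch, assuming input lines in the `prefix:suffix` form printed by split-prefix-suffix. The helper name `compare_prefixes` is hypothetical, and comparing raw counts by eye is only a first look, not a statistical test.

```shell
# Sketch only: `compare_prefixes` is a hypothetical helper, not one of the notebook's tools.
# Usage: ... | compare_prefixes H P
# Reads "prefix:suffix" lines; prints "suffix countA countB" for the two prefixes,
# sorted by suffix.
compare_prefixes() {
  awk -F: -v a="$1" -v b="$2" '
    # Only lines that were actually split (exactly one ":") are considered.
    NF == 2 && $1 == a { na[$2]++; seen[$2] = 1 }
    NF == 2 && $1 == b { nb[$2]++; seen[$2] = 1 }
    END { for (s in seen) printf "%s %d %d\n", s, na[s], nb[s] }
  ' | LC_ALL=C sort
}

# Toy example: suffix counts after "H" and after "P".
printf 'H:ccgci\nP:ccgci\nH:cim\nAH:ci\nP:ccgci\n' | compare_prefixes H P
```

If the two prefixes are equivalent, the two count columns should be roughly proportional across the common suffixes.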
Recomputing prefix/suffix table:

    /bin/rm -f .title
    /bin/rm -f .table
    /bin/touch .table
    set noglob
    set ofmt = "0"
    set npat = 1
    foreach pat ( \
        'AH' 'c' 'oH' 'cg' \
        'e' 'cccH' 'oe' 'H' \
        'ccccH' 'oeH' 'r' 'ciH' \
        'P' 'eH' 'Ae' 'oP' \
        'ci' 'cH' 'AP' \
      )
      /n/gnu/bin/printf " %7s" "${pat}" >> .title
      /bin/cat bio-j-huc-gut.wds \
        | /n/gnu/bin/sed -e 's/^/_/g' -e 's/$/_/g' \
        | /n/gnu/bin/egrep "_${pat}[co][^HPA]*_" \
        | /n/gnu/bin/sed -e "s/_${pat}//g" -e '/../s/_$//g' \
        | /n/gnu/bin/sort | uniq -c \
        | /n/gnu/bin/expand \
        > .suff.frq
      /n/gnu/bin/join -a 1 -a 2 -1 1 -2 2 -o"${ofmt},2.1" -e 0 .table .suff.frq > .tmp
      /bin/mv .tmp .table
      @ npat = ${npat} + 1
      set ofmt = "${ofmt},1.${npat}"
    end
    unset noglob
    /n/gnu/bin/printf "\n" >> .title

    cat .table \
      | /n/gnu/bin/gawk '/./ { s=0; for(i=2;i<=NF;i++) s+=$(i); print s, $0 }' \
      | sort -nr \
      > .tbsort

    cat .title .tbsort \
      | format-suffix-table

                 TOTAL    AH    oH     c     H   oeH   ciH    cH    eH  cccH ccccH    cg     P     e    oe    Ae     r    oP    ci    AP
                 ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- -----
    SUFFIX \ TOT  4104  1038   446   998   148   105    53    21    43   173   109   338    63   212   150    38    73    40    34    22
    ------------ ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- -----
    ci             257    79    28     1     2    10     1     1     2    31    26    36     0     6    22     7     2     2     0     1
    cci            247    43    22    14     2     4     2     1     1    90    68     0     0     0     0     0     0     0     0     0
    ccci           333    87    35   139     5    19    10     6     3    11     4     2     0     5     4     2     1     0     0     0
    cccci          171    12    13    61    10     3     0     0     1     1     0     3     3    20    19     9     5     4     5     2
    ccccci          26     1     1     1     1     1     1     0     0     0     0     3     0     6     4     0     1     2     4     0
    cccccci          1     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     1     0
    cgci            21     1     0     0     0     0     0     0     0     0     0     0     0     8     9     1     0     0     1     1
    ccgci          392   201    85    21    28    20    12     3     8    10     2     2     0     0     0     0     0     0     0     0
    cccgci         736   191    57   387    16    15    12     4     7    15     5     5     3     7     6     1     1     2     0     2
    ccccgci        325    19    10    60    18     2     1     0     2     0     0    27    17    73    38     7    10    18    12    11
    cccccgci        18     2     0     0     0     0     0     0     0     0     0     2     0     6     1     3     2     0     2     0
    cim            283    94    41    30     7     7     4     0     6     2     0    72     0     2     3     2    13     0     0     0
    ccccim           1     0     0     1     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0
    cie            254   116    40    14    10     5     3     1     0     1     0    50     0     3     2     2     6     1     0     0
    ccie            12     1     0     7     0     0     0     1     0     1     2     0     0     0     0     0     0     0     0     0
    cccie           10     0     1     7     1     0     0     0     0     1     0     0     0     0     0     0     0     0     0     0
    ccccie           2     0     0     1     0     0     0     0     0     0     0     0     0     0     0     0     1     0     0     0
    cir            184    50    36    11     8     7     2     1     4     3     0    51     1     2     1     1     3     2     1     0
    ccir            11     1     0    10     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0
    cccir           12     0     1     8     1     1     0     0     0     0     0     0     0     1     0     0     0     0     0     0
    ccccir           1     0     0     0     0     0     0     0     0     0     0     0     0     1     0     0     0     0     0     0
    cin             98    54    16     0     2     5     2     0     2     0     0    12     0     0     3     1     1     0     0     0
    ccin             1     0     0     0     0     0     0     1     0     0     0     0     0     0     0     0     0     0     0     0
    cif             18     4     0     0     1     1     0     0     0     2     0     6     0     1     0     0     3     0     0     0
    ccif             3     0     0     3     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0
    oe             124    21    10    25     6     1     0     0     1     0     0    17     8    16     7     1     9     1     0     1
    coe             35     1     6    23     3     1     0     1     0     0     0     0     0     0     0     0     0     0     0     0
    ccoe            45     1     1    33     4     0     0     0     0     0     0     3     2     0     0     0     0     0     0     1
    cccoe           25     0     0     4     3     0     1     0     0     0     0     4     4     3     3     1     0     0     1     1
    or              51     4     2    10     4     0     0     0     0     0     0     3     0    10    13     0     3     1     0     1
    cor              7     4     0     1     0     0     0     1     0     1     0     0     0     0     0     0     0     0     0     0
    ccor            25     0     1    17     0     0     0     0     0     0     0     2     2     3     0     0     0     0     0     0
    cccor            2     0     0     1     0     0     0     0     0     0     0     0     0     0     1     0     0     0     0     0
    ccccor           1     0     0     0     0     0     0     0     0     0     0     0     0     1     0     0     0     0     0     0
    om              10     1     0     5     0     1     0     0     0     0     0     0     1     0     0     0     2     0     0     0
    cccg            18     5     3     7     2     0     0     0     1     0     0     0     0     0     0     0     0     0     0     0
    ccccg           15     1     0     1     0     0     0     0     0     1     0     0     0     5     3     0     2     1     1     0
    cieci           21     7     4     1     1     0     0     0     0     0     0     6     0     0     0     0     1     1     0     0
    cccgcie         13     2     1     9     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     1
    ccc             13     1     0    10     0     0     0     0     0     0     0     0     0     1     0     0     1     0     0     0
    ccgcir          11     4     4     3     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0
    cccgcim          9     0     0     9     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0
    oeci             8     0     0     1     0     0     0     0     0     0     0     0     0     4     0     0     1     2     0     0
    circi            8     1     0     0     1     0     0     0     0     0     0     5     0     0     0     0     1     0     0     0
    ciecgci          8     3     2     0     0     0     0     0     0     0     0     3     0     0     0     0     0     0     0     0
    ccccgcir         8     0     0     1     1     0     0     0     0     0     0     0     4     1     0     0     0     0     1     0
    cccgcir          7     0     0     4     0     0     2     0     0     0     0     0     0     0     1     0     0     0     0     0
    oeccccgci        6     1     0     2     0     0     0     0     0     0     0     1     2     0     0     0     0     0     0     0
    ccg              6     3     1     0     0     1     0     0     0     1     0     0     0     0     0     0     0     0     0     0
    ce               6     0     0     6     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0
    ccgcie           6     2     2     1     1     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0
    oecgci           5     1     1     1     1     0     0     0     0     0     0     0     0     1     0     0     0     0     0     0
    oecccci          5     0     0     3     0     0     0     0     0     0     0     0     1     0     1     0     0     0     0     0
    o                5     0     0     0     0     0     0     0     0     0     0     0     0     4     1     0     0     0     0     0
    coecgci          5     0     1     3     0     0     0     0     1     0     0     0     0     0     0     0     0     0     0     0
    cieor            5     0     4     0     0     0     0     0     1     0     0     0     0     0     0     0     0     0     0     0
    cicgci           5     3     1     0     0     1     0     0     0     0     0     0     0     0     0     0     0     0     0     0
    cccc             5     0     0     4     0     0     0     0     0     0     0     0     0     0     0     0     1     0     0     0
    c                5     0     0     1     0     0     0     0     0     2     0     0     0     0     2     0     0     0     0     0
    cieoe            4     1     2     0     0     0     0     0     0     0     0     1     0     0     0     0     0     0     0     0
    cieccccgci       4     0     1     0     0     0     0     0     0     0     0     2     0     0     0     0     0     1     0     0
    ccccc            4     0     0     0     0     0     0     0     0     0     0     1     0     1     1     0     0     0     1     0
    oecccgci         3     0     0     0     1     0     0     0     0     0     0     0     2     0     0     0     0     0     0     0
    ciecccgci        3     0     1     0     0     0     0     0     0     0     0     2     0     0     0     0     0     0     0     0
    cic              3     2     0     0     1     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0
    cccieci          3     0     0     2     0     0     0     0     0     0     0     0     0     0     0     0     0     1     0     0
    cccgoe           3     0     0     3     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0
    ccccgcie         3     0     0     0     0     0     0     0     0     0     0     0     1     1     0     0     0     1     0     0
    cc               3     1     0     0     0     0     0     0     0     0     0     0     0     2     0     0     0     0     0     0
    oroe             2     0     0     0     0     0     0     0     0     0     0     0     0     0     1     0     1     0     0     0
    orcim            2     0     0     0     0     0     0     0     0     0     0     1     0     1     0     0     0     0     0     0
    oeor             2     0     0     1     0     0     0     0     0     0     0     0     0     1     0     0     0     0     0     0
    oeccci           2     0     0     0     0     0     0     0     0     0     0     0     2     0     0     0     0     0     0     0
    coeci            2     1     0     1     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0
    circie           2     0     0     1     1     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0
    ciie             2     1     0     0     0     0     0     0     0     0     0     1     0     0     0     0     0     0     0     0
    cieo             2     0     0     0     0     0     0     0     0     0     0     2     0     0     0     0     0     0     0     0
    cgoe             2     0     0     0     0     0     0     0     0     0     0     0     1     1     0     0     0     0     0     0
    cgcim            2     0     0     0     0     0     0     0     0     0     0     0     0     1     0     0     0     0     1     0
    ccgcim           2     1     1     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0
    ccgcif           2     0     1     1     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0
    cce              2     0     0     0     0     0     0     0     0     0     0     0     0     2     0     0     0     0     0     0
    cccgcif          2     0     0     1     1     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0
    cccgccci         2     0     0     1     0     0     0     0     0     0     1     0     0     0     0     0     0     0     0     0
    ccccoe           2     0     0     1     0     0     0     0     0     0     0     1     0     0     0     0     0     0     0     0
    ccccif           2     0     0     0     0     0     0     0     0     0     0     0     0     1     1     0     0     0     0     0
    oroeccccgci      1     0     0     0     1     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0
    orcin            1     0     0     0     0     0     0     0     0     0     0     0     0     1     0     0     0     0     0     0
    orcie            1     0     0     0     0     0     0     0     0     0     0     0     0     1     0     0     0     0     0     0
    orccci           1     0     0     0     1     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0
    oeof             1     0     0     0     0     0     0     0     0     0     0     0     0     1     0     0     0     0     0     0
    oeoe             1     1     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0
    oecircir         1     0     0     0     0     0     0     0     0     0     0     0     1     0     0     0     0     0     0     0
    oecim            1     0     0     0     0     0     0     0     0     0     0     0     1     0     0     0     0     0     0     0
    oeciecccgci      1     0     0     0     0     0     0     0     0     0     0     0     1     0     0     0     0     0     0     0
    oecgcie          1     0     0     0     1     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0
    oecgccccgci      1     0     0     0     0     0     0     0     0     0     0     0     1     0     0     0     0     0     0     0
    oecccoe          1     0     0     1     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0
    oecccgcie        1     0     0     0     0     0     0     0     0     0     0     0     1     0     0     0     0     0     0     0
    oecccg           1     0     0     0     1     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0
    oeccccg          1     0     0     0     0     0     0     0     0     0     0     0     1     0     0     0     0     0     0     0
    oecccccgci       1     0     0     0     0     0     0     0     0     0     0     1     0     0     0     0     0     0     0     0
    ocgcir           1     0     0     1     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0
    ocgcie           1     0     0     0     0     0     0     0     1     0     0     0     0     0     0     0     0     0     0     0
    ocg              1     0     0     0     0     0     0     0     0     0     0     0     0     1     0     0     0     0     0     0
    occci            1     0     0     0     0     0     0     0     0     0     0     0     0     1     0     0     0     0     0     0
    occccgci         1     0     0     0     0     0     0     0     0     0     0     0     0     1     0     0     0     0     0     0
    coeor            1     0     1     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0
    coeo             1     0     0     1     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0
    coecccci         1     0     0     1     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0
    cocgcim          1     0     0     1     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0
    co               1     0     0     0     0     0     0     0     0     0     1     0     0     0     0     0     0     0     0     0
    cirorci          1     0     0     0     0     0     0     0     0     0     0     1     0     0     0     0     0     0     0     0
    cirof            1     0     0     0     0     0     0     0     0     0     0     1     0     0     0     0     0     0     0     0
    ciroeof          1     0     1     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0
    circif           1     0     1     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0
    circgci          1     1     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0
    circccci         1     0     0     0     0     0     0     0     0     0     0     1     0     0     0     0     0     0     0     0
    circccccgcie     1     0     0     0     0     0     0     0     0     0     0     1     0     0     0     0     0     0     0     0
    cioc             1     0     1     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0
    cino             1     0     0     1     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0
    cimci            1     0     0     0     0     0     0     0     0     0     0     0     0     0     1     0     0     0     0     0
    cifo             1     0     0     0     0     0     0     0     1     0     0     0     0     0     0     0     0     0     0     0
    ciecirci         1     0     0     0     0     0     0     0     0     0     0     1     0     0     0     0     0     0     0     0
    ciecircg         1     0     0     0     0     0     0     0     0     0     0     1     0     0     0     0     0     0     0     0
    ciecir           1     0     0     0     0     0     0     0     0     0     0     1     0     0     0     0     0     0     0     0
    ciecim           1     0     0     0     0     0     0     0     0     0     0     1     0     0     0     0     0     0     0     0
    ciecie           1     0     0     0     0     0     0     0     0     0     0     1     0     0     0     0     0     0     0     0
    ciecgcgci        1     1     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0
    ciecce           1     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     1     0     0     0
    cieccci          1     0     1     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0
    ciecccci         1     1     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0
    cieccccg         1     0     0     1     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0
    ciec             1     1     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0
    cicccgci         1     1     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0
    cgcircie         1     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     1     0
    cgcircccci       1     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     1     0
    cgcir            1     0     0     0     0     0     0     0     0     0     0     0     0     1     0     0     0     0     0     0
    cgcieor          1     0     0     0     0     0     0     0     0     0     0     0     0     1     0     0     0     0     0     0
    cgcieccor        1     0     0     0     0     0     0     0     0     0     0     0     1     0     0     0     0     0     0     0
    cgcie            1     0     1     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0
    cg               1     0     0     0     0     0     0     0     0     0     0     0     0     1     0     0     0     0     0     0
    cer              1     0     0     1     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0
    cecgcim          1     0     0     1     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0
    cecgci           1     0     0     1     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0
    cec              1     0     0     1     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0
    ccoeoe           1     0     0     1     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0
    ccoeo            1     0     0     1     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0
    ccoecicg         1     0     0     0     0     0     0     0     0     0     0     1     0     0     0     0     0     0     0     0
    ccoecgci         1     0     0     1     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0
    ccoecccgci       1     0     0     1     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0
    ccoecccci        1     0     0     1     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0
    ccocgci          1     0     1     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0
    cco              1     0     0     0     0     0     0     0     0     0     0     0     0     1     0     0     0     0     0     0
    cciror           1     0     0     1     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0
    ccirci           1     0     0     1     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0
    ccieoeci         1     0     0     1     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0
    ccieci           1     0     0     0     0     0     0     0     0     0     0     0     0     0     1     0     0     0     0     0
    ccgor            1     1     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0
    ccgoe            1     0     0     1     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0
    ccgcioe          1     0     0     0     1     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0
    ccgcin           1     0     0     1     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0
    ccgcieor         1     0     1     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0
    ccgcci           1     0     0     0     0     0     0     0     1     0     0     0     0     0     0     0     0     0     0     0
    ccgccgci         1     1     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0
    ccgccci          1     0     1     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0
    ccgccccgci       1     1     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0
    cccoecgci        1     0     0     0     0     0     0     0     0     0     0     0     1     0     0     0     0     0     0     0
    ccciroe          1     0     0     0     0     0     0     0     0     0     0     0     1     0     0     0     0     0     0     0
    cccieor          1     0     0     1     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0
    cccgor           1     0     0     1     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0
    cccgciecgci      1     0     0     1     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0
    cccciecgci       1     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     1     0
    cccciecg         1     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     1     0     0     0
    ccccic           1     0     1     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0
    ccccgcirci       1     0     0     0     0     0     0     0     0     0     0     0     0     1     0     0     0     0     0     0
    cccccgcie        1     0     0     0     0     0     0     0     0     0     0     1     0     0     0     0     0     0     0     0
    cccccg           1     0     0     0     0     0     0     0     0     0     0     0     0     0     1     0     0     0     0     0

Redefined split-prefix-suffix with only the significant suffixes:

    SUFFS = "(ccc3ci|cc3ci|ccci|cccc3ci|cim|ci|cix|cci|ci2|cccci|ox|cin|o2|ccox|cox|ccccci|cccox|cco2)$"

97-07-18 stolfi
===============

I got the idea that the \s/
plume may be a stress mark (that shifts when the word is inflected).
So I decided to redo the analysis for Portuguese, comparing
o-accent with o-plain, etc.:

    cat port.wds \
      | sed -e 's/[óô]/ó/g' \
      | compare-contexts -lctx 0 -rctx 0 \
          '[aeiouáéíóúàâêôü]*[o][aeiouáéíóúàâêôü]*' \
          '[aeiouáéíóúàâêôü]*[ó][aeiouáéíóúàâêôü]*'

     6143 0.93 o           125 0.99 ó
      135 0.02 io            1 0.01 eó
      112 0.02 ou        ----- ---- ----
       75 0.01 oi          126 1.00 TOT
       55 0.01 ao
       36 0.01 eo
       32 0.00 oo
       12 0.00 aio
        5 0.00 eio
        4 0.00 oa
        2 0.00 oá
        2 0.00 oe
    ----- ---- ----
     6613 1.00 TOT

    cat port.wds \
      | sed -e 's/[áâ]/á/g' \
      | compare-contexts -lctx 0 -rctx 0 \
          '[aeiouáéíóúàâêôü]*[a][aeiouáéíóúàâêôü]*' \
          '[aeiouáéíóúàâêôü]*[á][aeiouáéíóúàâêôü]*'

     6878 0.87 a           258 0.75 á
      448 0.06 ia           84 0.24 iá
      224 0.03 ua            2 0.01 uá
      134 0.02 ai            2 0.01 oá
       69 0.01 ea        ----- ---- ----
       55 0.01 ao          346 1.00 TOT
       23 0.00 uai
       23 0.00 au
       12 0.00 aio
        8 0.00 iai
        4 0.00 éia
        4 0.00 oa
        3 0.00 ía
        3 0.00 eia
        3 0.00 eai
        2 0.00 aí
        2 0.00 aue
        2 0.00 ae
    ----- ---- ----
     7897 1.00 TOT

    cat port.wds \
      | sed -e 's/[êé]/é/g' \
      | compare-contexts -lctx 0 -rctx 0 \
          '[aeiouáéíóúàâêôü]*[e][aeiouáéíóúàâêôü]*' \
          '[aeiouáéíóúàâêôü]*[é][aeiouáéíóúàâêôü]*'

     7385 0.89 e           651 0.97 é
      420 0.05 ue            8 0.01 üé
      196 0.02 ie            7 0.01 ié
      117 0.01 ei            4 0.01 éia
       69 0.01 ea            1 0.00 éi
       61 0.01 eu        ----- ---- ----
       36 0.00 eo          671 1.00 TOT
        5 0.00 eio
        3 0.00 eia
        3 0.00 eai
        2 0.00 üe
        2 0.00 oe
        2 0.00 aue
        2 0.00 ae
        1 0.00 eó
        1 0.00 eí
        1 0.00 eiú
        1 0.00 ee
    ----- ---- ----
     8307 1.00 TOT

These tables are skewed by the short words "a", "é", "e", "o", etc.
So let's require at least one more letter after the accent:

    cat port.wds \
      | sed \
          -e 's/[áâ]/á/g' \
          -e 's/^/_/g' \
          -e 's/$/_/g' \
      | compare-contexts -lctx 1 -rctx 0 \
          '[a][aeiouáéíóúàâêôü]*[a-záéíóúàâêôü]' \
          '[á][aeiouáéíóúàâêôü]*[a-záéíóúàâêôü]'

      284 0.06 par          73 0.25 ián
      211 0.04 _ar          36 0.12 _án
      151 0.03 _as          29 0.10 _ár
      148 0.03 cad          15 0.05 tán
      144 0.03 lad          12 0.04 rár
      128 0.03 tas          12 0.04 rám
      116 0.02 dad          11 0.04 sár
      111 0.02 tan          10 0.03 ráf
      110 0.02 _ap          10 0.03 cál
      101 0.02 rad           8 0.03 iáv
       95 0.02 das           7 0.02 mát
       77 0.02 lar           7 0.02 lás
       76 0.02 fac           6 0.02 vár
       74 0.02 as            5 0.02 jáv
       72 0.02 _al           5 0.02 fác
       70 0.01 tam           4 0.01 táv
       68 0.01 nal           4 0.01 ráp
       66 0.01 mais          4 0.01 pán
       61 0.01 ual           4 0.01 nál
       58 0.01 tad           3 0.01 máx
       56 0.01 cas           3 0.01 lát
       55 0.01 ram           3 0.01 iám
       48 0.01 car           3 0.01 cáv
       47 0.01 tar           2 0.01 uár
       46 0.01 nas           2 0.01 tár
       46 0.01 mas           2 0.01 rát
       46 0.01 ias           1 0.00 tág
       46 0.01 _ao           1 0.00 ráv
       45 0.01 cal           1 0.00 oáv
       44 0.01 tal           1 0.00 oác
       43 0.01 ian           1 0.00 nár
       41 0.01 am            1 0.00 láv
       40 0.01 nad           1 0.00 háv
       39 0.01 ran           1 0.00 dáv
       37 0.01 sam           1 0.00 cán
       36 0.01 lag           1 0.00 bás
       36 0.01 ial           1 0.00 _át
       35 0.01 _ad       ----- ---- ----
       34 0.01 val         291 1.00 TOT
       34 0.01 ras
      ... .... .....
        1 0.00 ab
    ----- ---- ----
     4785 1.00 TOT

Too many cases, let's reduce them:

    cat port.wds \
      | sed \
          -e 's/[áâ]/á/g' \
          -e 's/^/_/g' \
          -e 's/$/_/g' \
      | compare-contexts -lctx 0 -rctx 0 \
          '_[a][aeiouáéíóúàâêôü]*[bcdfghjklmnpqrstvwxyz]' \
          '_[á][aeiouáéíóúàâêôü]*[bcdfghjklmnpqrstvwxyz]'

      211 0.27 _ar          36 0.55 _án
      151 0.19 _as          29 0.44 _ár
      110 0.14 _ap           1 0.02 _át
       72 0.09 _al       ----- ---- ----
       35 0.05 _ad          66 1.00 TOT
       32 0.04 _at
       31 0.04 _an
       28 0.04 _ac
       26 0.03 _ab
       15 0.02 _am
       13 0.02 _aj
        9 0.01 _aos
        8 0.01 _aut
        7 0.01 _ain
        6 0.01 _av
        6 0.01 _aq
        6 0.01 _af
        5 0.01 _aum
        4 0.01 _ag
        2 0.00 _aux
    ----- ---- ----
      777 1.00 TOT

    cat port.wds \
      | sed \
          -e 's/[óô]/ó/g' \
          -e 's/^/_/g' \
          -e 's/$/_/g' \
      | compare-contexts -lctx 0 -rctx 0 \
          '_[o][aeiouáéíóúàâêôü]*[bcdfghjklmnpqrstvwxyz]' \
          '_[ó][aeiouáéíóúàâêôü]*[bcdfghjklmnpqrstvwxyz]'

      154 0.39 _os           7 0.70 _ót
       68 0.17 _or           2 0.20 _ób
       54 0.14 _ob           1 0.10 _ór
       39 0.10 _out      ----- ---- ----
       26 0.07 _ot          10 1.00 TOT
       26 0.07 _on
       16 0.04 _op
        9 0.02 _oc
        2 0.01 _oit
    ----- ---- ----
      394 1.00 TOT

Not very impressive. However, that is not surprising, considering
that most accents (especially those that shift) are not written at
all, because the default stress rules already imply them.
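A minimal Python sketch of the kind of tabulation done in this entry. The actual compare-contexts script is not reproduced in these notes, so this is an assumption about its behavior: every non-overlapping match of a pattern is counted, widened by lctx characters on the left and rctx on the right, and the counts for the plain and the accented pattern are then printed one after the other. The names fold and context_counts are hypothetical.

```python
import re
from collections import Counter

VOWELS = "aeiouáéíóúàâêôü"  # vowel class used in the patterns above


def fold(word, accented="óô", target="ó"):
    """Map every character of `accented` to `target`, like sed 's/[óô]/ó/g'."""
    return word.translate({ord(ch): target for ch in accented})


def context_counts(words, pattern, lctx=0, rctx=0):
    """Count matches of `pattern`, each widened by lctx/rctx characters."""
    rx = re.compile(pattern)
    counts = Counter()
    for w in words:
        for m in rx.finditer(w):
            lo = max(m.start() - lctx, 0)
            hi = min(m.end() + rctx, len(w))
            counts[w[lo:hi]] += 1
    return counts


if __name__ == "__main__":
    # Toy word list for illustration only, not the port.wds data.
    words = [fold(w) for w in ["como", "coisa", "boi", "só", "avô"]]
    plain = context_counts(words, f"[{VOWELS}]*o[{VOWELS}]*")
    accent = context_counts(words, f"[{VOWELS}]*ó[{VOWELS}]*")
    for table in (plain, accent):
        total = sum(table.values())
        for ctx, n in table.most_common():
            print(f"{n:5d} {n / total:4.2f} {ctx}")
        print(f"{total:5d} 1.00 TOT")
```

To imitate the runs that anchor on word boundaries, each word would first be wrapped as "_word_" (as the sed commands above do) before being passed to context_counts.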