Hacking at the Voynich manuscript - Side notes
030 Combining the p-m-s and OKOKOKO paradigms

Last edited on 1999-02-02 14:48:15 by stolfi

This note attempts to rederive the p-m-s paradigm in light of the
OKOKOKO decomposition.

First, let's get the word lists for each section:

    set sections = ( `cat ../023/text-sections/all.names` )
    set seccom = `echo ${sections} | tr ' ' ','`

    mkdir data
    foreach sec ( $sections )
      echo "${sec}"
      mkdir data/${sec}
      cat ../023/text-sections/${sec}.evt \
        | words-from-evt \
        | tr '*' '?' \
        > data/${sec}/words.wds
    end

    mkdir data/all
    cat data/{${seccom}}/words.wds > data/all/words.wds

    mkdir data/ren
    cat Rene-words.frq \
      | gawk '/./{n=$1;for(i=0;i<n;i++){print $2;}}' \
      > data/ren/words.wds

    dicio-wc data/{${seccom},ren,all}/words.wds

      lines   words     bytes file
    ------- ------- --------- ------------
       2210    2210     13427 data/unk/words.wds
       2211    2210     13467 data/pha/words.wds
      10358   10356     66804 data/str/words.wds
       7584    7584     44680 data/hea/words.wds
       3338    3336     20045 data/heb/words.wds
       6539    6539     37738 data/bio/words.wds
        324     324      2089 data/ast/words.wds
        389     387      2264 data/cos/words.wds
        169     169      1018 data/zod/words.wds
      28939   28939    172850 data/ren/words.wds
      33122   33115    201532 data/all/words.wds

Factor them into elements and separate the unreadable words and
parsing bugs:

    foreach sec ( $sections ren all )
      echo ${sec}
      cat data/${sec}/words.wds \
        | factor-OK \
        > data/${sec}/words.fac
      cat data/${sec}/words.fac \
        | egrep -e '^{[{}a-z?]*}$' \
        > data/${sec}/words-gut.fac
      cat data/${sec}/words.fac \
        | egrep -v -e '^{[{}a-z?]*}$' \
        > data/${sec}/words-bad.fac
      egrep -e '^[^{]' data/${sec}/words-gut.fac | head -5
      egrep -e '[^}]$' data/${sec}/words-gut.fac | head -5
      egrep -e '[}][^{]' data/${sec}/words-gut.fac | head -5
      egrep -e '[^}][{]' data/${sec}/words-gut.fac | head -5
    end

    dicio-wc data/{${seccom},ren,all}/words-{gut,bad}.fac

      lines   words     bytes file
    ------- ------- --------- ------------
       2190    2190     30362 data/unk/words-gut.fac
         20      20       279 data/unk/words-bad.fac
       2112    2112     28599 data/pha/words-gut.fac
         99      98      1439 data/pha/words-bad.fac
      10311   10311    149736 data/str/words-gut.fac
         47      45       678 data/str/words-bad.fac
       7532    7532     98697 data/hea/words-gut.fac
         52      52       721 data/hea/words-bad.fac
       3318    3318     45009 data/heb/words-gut.fac
         20      18       270 data/heb/words-bad.fac
       6530    6530     85290 data/bio/words-gut.fac
          9       9       114 data/bio/words-bad.fac
        312     312      4495 data/ast/words-gut.fac
         12      12       201 data/ast/words-bad.fac
        376     376      4954 data/cos/words-gut.fac
         13      11       174 data/cos/words-bad.fac
        162     162      2240 data/zod/words-gut.fac
          7       7       104 data/zod/words-bad.fac
      28939   28939    389191 data/ren/words-gut.fac
          0       0         0 data/ren/words-bad.fac
      32843   32843    449382 data/all/words-gut.fac
        279     272      3980 data/all/words-bad.fac
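(For the record, here is a rough stand-in for factor-OK -- a sketch
only, assuming elem.dic lists one element per line; the real tool may
resolve ambiguous splits differently.  Save as factor-sketch.gawk and
run "gawk -f factor-sketch.gawk data/bio/words.wds":

    # Greedy longest-match factoring: wrap each element in braces.
    BEGIN {
      # load the element dictionary, remembering the longest entry
      while ((getline e < "elem.dic") > 0) {
        elem[e] = 1;
        if (length(e) > mx) { mx = length(e); }
      }
    }
    /./ {
      w = $0; out = "";
      while (length(w) > 0) {
        hit = 0;
        for (k = mx; k >= 1; k--) {
          p = substr(w, 1, k);
          if (p in elem) { hit = 1; break; }
        }
        if (! hit) { k = 1; p = substr(w, 1, 1); }  # unknown char, pass through
        out = out "{" p "}";
        w = substr(w, k + 1);
      }
      print out;
    }

Under this scheme a word like "qokaiin" comes out as
{q}{o}{k}{a}{iin}.)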
Count the elements:

    foreach sec ( $sections ren all )
      echo "${sec}"
      cat data/${sec}/words-gut.fac \
        | tr '{}' '\012\012' \
        | egrep '.' \
        | sort | uniq -c | expand \
        | sort -b +0 -1nr \
        > data/${sec}/elem.cts
    end

    multicol \
      -v titles="all ren $sections" \
      data/{all,ren,$seccom}/elem.cts \
      > elem.multi

    compare-counts \
      -titles "all ren $sections element" \
      -remFreqs \
      -sort 1 \
      data/{all,ren,$seccom}/elem.cts \
      > elem.cmp

    all ren unk pha str hea heb bio ast cos zod  element
    --- --- --- --- --- --- --- --- --- --- ---  -------
    834 826 846 773 848 800 864 847 828 852 798  o
    712 699 728 676 736 689 719 700 700 725 711  y
    615 602 600 600 622 611 609 622 608 603 572  a
    524 510 512 524 536 532 488 516 523 527 516  d
    452 440 437 435 467 470 437 424 482 435 431  l
    394 380 382 389 402 425 373 362 436 388 394  k
    346 336 336 349 363 333 327 342 379 343 356  ch
    300 291 276 298 319 280 278 308 326 297 296  r
    259 248 246 262 274 253 251 242 311 271 294  q
    225 215 209 228 235 215 217 226 286 234 269  iin
    192 181 168 215 201 171 186 200 253 183 193  t
    161 151 145 184 163 154 155 165 223 159 177  che
    131 119 128 149 119 142 134 131 171 135 123  ee
    113 104 105 136 107 112 119 117 159 106 110  sh
     97  89  92 113  96  89 104 102 123  89  95  s
     82  74  77 101  82  83  90  74 110  74  80  she
     70  61  69  77  71  78  73  58  90  65  72  ke
     60  54  56  71  58  70  62  51  81  57  63  p
     51  41  53  67  47  63  56  33  78  54  55  in
     44  35  43  63  40  54  47  30  72  42  49  m
     37  28  36  57  33  51  37  22  60  34  20  te
     31  23  31  53  31  35  34  18  56  29  19  cth
     26  18  28  44  27  29  27  12  51  26  19  ckh
     22  14  21  41  20  27  23  11  43  21  19  ir
     19  12  19  39  16  25  21  10  39  18  12  eee
     17  11  15  36  14  23  15   9  38  18  12  f
     15  10  13  34  12  20  14   9  31  17   9  oa
     13   9  11  30  10  18  11   7  21  11   6  e?
     11   7  10  23   9  17  10   5  20  10   6  ckhe
     10   6   9  20   8  14   8   5  16   8   4  cthe
      8   5   7  19   7  12   7   4  14   8   4  cph
      7   5   6  16   7  10   6   4  11   7   4  oy
      6   4   6  16   6   7   6   4  11   7   4  n
      6   3   5  15   5   6   5   4   8   6   3  iir
      5   2   5  14   5   5   4   3   8   6   3  iiin
      4   2   5  13   4   5   4   2   6   6   3  i?
      4   2   3  12   3   4   3   2   6   6   3  cphe
      3   2   3  10   3   4   3   2   6   5   3  oo
      3   1   1  10   3   3   2   2   6   5   3  cfh
      3   1   1  10   2   3   2   2   5   5   3  im
      2   1   1   9   2   2   1   2   5   3   3  yo
      2   1   1   8   1   2   1   2   5   3   3  de
      2   1   1   5   1   2   1   2   4   3   3  iiir
      1   1   1   5   1   2   1   1   3   3   3  il
      1   1   1   2   1   2   1   1   0   3   3  j
      1   1   0   2   1   2   0   1   0   3   3  x
      1   1   0   2   0   2   0   1   0   3   3  is
      1   1   0   2   0   2   0   1   0   2   3  ya
      1   1   0   1   0   1   0   1   0   2   3  cfhe
      1   1   0   1   0   1   0   1   0   1   .  ay
      0   1   0   0   0   1   0   1   0   1   .  ao
      0   1   0   0   0   1   0   1   0   1   .  iim
      0   0   0   0   0   1   0   1   0   1   .  g
      0   0   0   0   0   1   0   1   0   1   .  ck
      0   0   0   0   0   0   0   1   0   1   .  iil
      0   0   0   0   0   0   0   1   0   1   .  ct
      0   0   0   0   0   0   .   1   0   1   .  id
      0   0   0   0   0   0   .   1   0   1   .  iid
      0   0   0   0   0   0   .   0   0   1   .  cthh
      0   0   0   0   0   0   .   0   0   1   .  b
      0   0   0   0   0   0   .   0   0   1   .  cphh
      0   0   0   0   0   0   .   0   0   1   .  ikh
      0   0   0   0   0   0   .   0   0   1   .  aa
      0   0   0   0   0   0   .   0   0   1   .  c?
      0   0   0   0   0   0   .   0   0   1   .  iis
      0   0   0   0   0   0   .   0   0   1   .  yoa
      0   0   0   0   0   0   .   0   0   1   .  iiil
      0   0   0   0   0   0   .   0   0   1   .  ikhe
      0   0   0   0   0   0   .   0   0   1   .  aoy
      0   0   0   0   0   0   .   0   0   1   .  cf
      0   0   0   0   0   0   .   0   0   1   .  chh
      0   0   0   0   0   0   .   0   0   1   .  cp
      0   0   0   0   0   0   .   0   0   1   .  h?
      0   0   0   0   .   0   .   0   0   1   .  iiid
      0   0   0   0   .   0   .   0   0   0   .  ij
      0   0   0   0   .   0   .   0   0   0   .  iph
      0   0   0   0   .   0   .   0   0   0   .  ith
      0   0   0   0   .   0   .   0   0   0   .  ithe
      0   0   0   0   .   0   .   .   0   0   .  ithh
      0   0   0   0   .   0   .   .   0   0   .  oao
      0   0   0   0   .   0   .   .   0   0   .  ooa
      0   0   0   0   .   0   .   .   .   0   .  ooooooooo
      0   0   0   0   .   0   .   .   .   0   .  oya
      0   0   0   .   .   0   .   .   .   0   .  pe
      0   0   0   .   .   .   .   .   .   0   .  u
      0   0   .   .   .   .   .   .   .   0   .  yay
      .   0   .   .   .   .   .   .   .   .   .  yy
      .   0   .   .   .   .   .   .   .   .   .  cfhh
      .   0   .   .   .   .   .   .   .   .   .  ckhh
      .   .   .   .   .   .   .   .   .   .   .  kh
      .   .   .   .   .   .   .   .   .   .   .  v
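(An aside on notation: the "sort -b +0 -1nr +1 -2" calls used
throughout these pipelines are in the old sort key syntax; "+0 -1nr"
means "first field, numeric, descending".  On a modern POSIX sort the
equivalent would be "sort -b -k1,1nr -k2,2".)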
Check whether our list of elements is complete:

    cat data/{all,ren}/elem.cts | gawk '/./{print $2}' | sort | uniq > .bar
    cat elem.dic | sort > .foo
    bool 1-2 .bar .foo

    cat elem-to-class.tbl | gawk '/./{print $1}' | sort > .baz
    bool 1-2 .baz .foo

Now let's enumerate all pairs of non-empty elements, consecutive and
non-consecutive, in each word:

    foreach ptpn ( sep.0 con.1 )
      set ptp = "${ptpn:r}"; set pfl = "${ptpn:e}"
      foreach sec ( $sections ren all )
        echo "Enumerating ${ptp} element pairs for ${sec}..."
        cat data/${sec}/words-gut.fac \
          | nice enum-elem-pairs -v consecutive=${pfl} \
          | tr -d '{}' \
          | sort | uniq -c | expand \
          | gawk '/./{printf "%7d %s:%s\n", $1,$2,$3;}' \
          | sort +0 -1nr +1 -2 \
          > data/${sec}/elem-${ptp}-pair.cts
      end
      multicol \
        -v titles="all ren ${sections}" \
        data/{all,ren,$seccom}/elem-${ptp}-pair.cts \
        > elem-${ptp}-pair.multi
      compare-counts \
        -titles "all ren $sections pair" \
        -freqs \
        -sort 1 \
        data/{all,ren,$seccom}/elem-${ptp}-pair.cts \
        > elem-${ptp}-pair.cmp
    end

Tabulate element pairs, collapsing elements into classes:

    foreach ptp ( sep con )
      foreach sec ( ${sections} ren all )
        echo "=== ${ptp} pairs for ${sec} ========================"
        cat data/${sec}/elem-${ptp}-pair.cts \
          | tr ':' ' ' \
          | map-fields \
              -v table=elem-to-class.tbl \
              -v fields="2,3" \
          | gawk '/./{printf "%7d %s:%s\n", $1,$2,$3;}' \
          | combine-counts | sort -b +0 -1nr +1 -2 \
          > data/${sec}/class-${ptp}-pair.cts
        foreach ttpn ( freqs.3 counts.5 )
          set ttp = "${ttpn:r}"; set dig = "${ttpn:e}"
          cat data/${sec}/class-${ptp}-pair.cts \
            | tr ':' ' ' | gawk '/./{print $1,"*",$2,$3;}' \
            | tabulate-triple-counts \
                -v rows=elem-classes.dic \
                -v cols=elem-classes.dic \
                -v ${ttp}=1 -v digits=${dig} \
            > data/${sec}/class-${ptp}-pair.${ttp}
        end
      end
    end

Here is a typical "sep" table, for the "bio" section:

    Pairs with key = *
    Pair probabilities (×999):

           Q    O    S    D    X    H    N    I    W  ETC    TOT
         ---  ---  ---  ---  ---  ---  ---  ---  ---  ---  -----
    Q      .   79   13   14   11   36    .    7    .    .    164
    O      .   76   84   27   25   61    2   36    1    .    317
    S      .   35   11   11   14    8    .    3    .    .     86
    D      .   68    9    2    2    .    .    4    .    .     88
    X      .   83   12   42    8   12    .    .    .    .    161
    H      .   86   20   29   24    .    .   14    .    .    177
    N      .    .    .    .    .    .    .    .    .    .      0
    I      .    .    .    .    .    .    .    .    .    .      0
    W      .    1    .    .    .    .    .    .    .    .      3
    ETC    .    .    .    .    .    .    .    .    .    .      0
         ---  ---  ---  ---  ---  ---  ---  ---  ---  ---  -----
    TOT    .  432  151  128   88  121    5   67    4    .    999

Note that the classes H and X are rarely preceded by D but often
followed by it. I suppose most of these cases are final "dy"s.
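(Similarly, a rough gawk equivalent of enum-elem-pairs -- a sketch,
not the real tool.  It assumes the output format is "{e1} {e2}", one
pair per line, which is what the "tr -d '{}'" step above expects.
Save as pairs-sketch.gawk and run
"gawk -v consecutive=0 -f pairs-sketch.gawk data/bio/words-gut.fac":

    # Enumerate ordered pairs of elements from factored words.
    /./ {
      # split the factored word into its braced elements
      n = split($0, f, /[{}]+/);
      m = 0;
      for (i = 1; i <= n; i++) { if (f[i] != "") { el[++m] = f[i]; } }
      # print each pair in order of occurrence; with consecutive=1,
      # only adjacent pairs are printed
      for (i = 1; i < m; i++) {
        jmax = (consecutive ? i + 1 : m);
        for (j = i + 1; j <= jmax; j++) {
          printf "{%s} {%s}\n", el[i], el[j];
        }
      }
    }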
Now let's extract all subsequences of three non-empty elements from
each word:

    foreach ptpn ( sep.0 )
      set ptp = "${ptpn:r}"; set pfl = "${ptpn:e}"
      foreach sec ( $sections ren all )
        echo "Enumerating ${ptp} element triples for ${sec}..."
        cat data/${sec}/words-gut.fac \
          | nice enum-elem-triples -v consecutive=${pfl} \
          | tr -d '{}' \
          | sort | uniq -c | expand \
          | gawk '/./{printf "%7d %s:%s:%s\n", $1,$2,$3,$4;}' \
          | sort +0 -1nr +1 -2 \
          > data/${sec}/elem-${ptp}-triple.cts
      end
      multicol \
        -v titles="all ren ${sections}" \
        data/{all,ren,$seccom}/elem-${ptp}-triple.cts \
        > elem-${ptp}-triple.multi
      compare-counts \
        -titles "all ren $sections triple" \
        -freqs \
        -sort 1 \
        data/{all,ren,$seccom}/elem-${ptp}-triple.cts \
        > elem-${ptp}-triple.cmp
    end

Tabulate triples sliced by middle element, first collapsing similar
letters:

    foreach ptp ( sep )
      foreach sec ( ${sections} ren all )
        echo "=== ${ptp} triples for ${sec} ========================"
        cat data/${sec}/elem-${ptp}-triple.cts \
          | tr ':' ' ' \
          | map-fields \
              -v table=elem-to-class.tbl \
              -v fields="2,3,4" \
          | gawk '/./{printf "%7d %s:%s:%s\n", $1, $2,$3,$4;}' \
          | combine-counts | sort -b +0 -1nr +1 -2 \
          > data/${sec}/class-${ptp}-triple.cts
        foreach ttpn ( freqs.3 counts.5 )
          set ttp = "${ttpn:r}"; set dig = "${ttpn:e}"
          cat data/${sec}/class-${ptp}-triple.cts \
            | tr ':' ' ' | gawk '/./{print $1,$3,$2,$4;}' \
            | sort -b +1 -2 +0 -1nr \
            | tabulate-triple-counts \
                -v rows=elem-classes.dic \
                -v cols=elem-classes.dic \
                -v ${ttp}=1 -v digits=${dig} \
            > data/${sec}/class-${ptp}-triple.${ttp}
        end
      end
    end

It seems that, if we ignore the O's and Q's, most words have a
"midfix" consisting of D, X, and H elements, with a prefix of S
letters, and a suffix of S and D elements.

Let's add to the factored word tables a second column with the
element classes:

    foreach sec ( ${sections} ren all )
      echo "$sec"
      cat data/${sec}/words-gut.fac \
        | gawk \
            ' /./{ \
                s = $1; \
                gsub(/[{}]/, " ", s); gsub(/  */, " ", s); \
                gsub(/^ */, "", s); gsub(/ *$/, "", s); \
                printf "%s %s\n", $0, s; \
              } ' \
        | map-fields \
            -v table=elem-to-class.tbl \
            -v forgiving=1 \
        | gawk '/./{ \
              e=$1; $1=""; c=$0; \
              gsub(/^ */,":",c); gsub(/ *$/,":",c); \
              gsub(/  */, ":", c); \
              printf "%s %s\n", c,e; \
            } ' \
        | sort | uniq -c | expand \
        | sort -b +0 -1nr +1 -2 \
        > data/${sec}/class-wds.cts
    end

Looking at Rene's words, it seems that most words have at most one H,
and that all X's are consecutive and adjacent to it (except for the
intrusion of "O"s in some languages).
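(To make the format concrete: assuming elem-to-class.tbl maps q to Q,
o and a to O, k to H, and iin to I, a word like qokaiin should appear
in class-wds.cts as a line of the form

      123 :Q:O:H:O:I: {q}{o}{k}{a}{iin}

with 123 replaced by the word's actual count in that section.)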
Let's tabulate the patterns of H and X elements, after removing the O
elements, and any prefix or suffix consisting of elements other than
H and X:

    foreach sec ( ${sections} ren all )
      echo "$sec"
      cat data/${sec}/class-wds.cts \
        | gawk '/./{print $1,$2}' \
        | sed \
            -e 's/[O]://g' \
            -e 's/ :[A-GI-WY-Z:]*/ :/' \
            -e 's/:[A-GI-WY-Z:]*$/:/' \
        | combine-counts | sort -b +0 -1nr +1 -2 \
        > data/${sec}/class-midf.cts
    end

    multicol \
      -v titles="all ren ${sections}" \
      data/{all,ren,${seccom}}/class-midf.cts \
      > class-midf.multi

    compare-counts \
      -titles "all ren $sections midfix" \
      -remFreqs \
      -sort 1 \
      data/{all,ren,$seccom}/class-midf.cts \
      > class-midf.cmp

Now let's tabulate the prefix, suffix, and unifix patterns, omitting
the Q's and O's:

    foreach sec ( ${sections} ren all )
      echo "$sec"
      cat data/${sec}/class-wds.cts \
        | gawk '/./{print $1,$2}' \
        | egrep -v -e '[HX]:' \
        | sed \
            -e 's/[O]://g' \
            -e 's/ :[Q:]*/ :/' \
        | combine-counts | sort -b +0 -1nr +1 -2 \
        > data/${sec}/class-unif.cts
    end

    foreach sec ( ${sections} ren all )
      echo "$sec"
      cat data/${sec}/class-wds.cts \
        | gawk '/./{print $1,$2}' \
        | egrep -e '[HX]:' \
        | sed \
            -e 's/[O]://g' \
            -e 's/ :[Q:]*/ :/' \
            -e 's/[HX]:.*//' \
        | combine-counts | sort -b +0 -1nr +1 -2 \
        > data/${sec}/class-pref.cts
    end

    foreach sec ( ${sections} ren all )
      echo "$sec"
      cat data/${sec}/class-wds.cts \
        | gawk '/./{print $1,$2}' \
        | egrep -e '[HX]:' \
        | sed \
            -e 's/[O]://g' \
            -e 's/ :[Q:]*/ :/' \
            -e 's/ :.*[HX]:/ :/' \
        | combine-counts | sort -b +0 -1nr +1 -2 \
        > data/${sec}/class-suff.cts
    end

    foreach sec ( ${sections} ren all )
      echo "$sec"
      cat data/${sec}/class-wds.cts \
        | gawk '/./{print $1,$2}' \
        | egrep -e '[HX]:' \
        | sed \
            -e 's/[O]://g' \
            -e 's/:[A-PR-Z:]*$/:/' \
        | combine-counts | sort -b +0 -1nr +1 -2 \
        > data/${sec}/class-qhaf.cts
    end

    foreach sec ( ${sections} ren all )
      echo "$sec"
      cat data/${sec}/class-wds.cts \
        | gawk '/./{print $1,$2}' \
        | egrep -v -e '[HX]:' \
        | sed \
            -e 's/[O]://g' \
            -e 's/:[A-PR-Z:]*$/:/' \
        | combine-counts | sort -b +0 -1nr +1 -2 \
        > data/${sec}/class-qsof.cts
    end

Let's make comparative tables:

    foreach fix ( midf unif pref suff qhaf qsof )
      multicol \
        -v titles="all ren ${sections}" \
        data/{all,ren,${seccom}}/class-${fix}.cts \
        > class-${fix}.multi
      compare-counts \
        -titles "all ren ${sections} ${fix}ix-pattern" \
        -freqs \
        -sort 1 \
        data/{all,ren,${seccom}}/class-${fix}.cts \
        > class-${fix}.cmp
    end

THE OKOKOKO AND PMS PARADIGMS COMBINED

To a first approximation, the Voynichese words can be decomposed into
the following "elements":

    Q = { q }
    O = { a o y }                                 ("circles")
    H = { t te cth cthe k ke ckh ckhe
          p pe cph cphe f fe cfh cfhe }           ("gallows")
    X = { ch che sh she ee eee }                  ("tables")
    R = { d l r s }                               ("dealers")
    F = { n m g j in iin ir iir }                 ("finals")
    W = { e i cthh ith kh ct iiim iir is ETC. }   ("weirdos")

The "p" and "f" elements are almost certainly calligraphic variants
of the corresponding "t" and "k" elements.

There are two classes of words: the "hard" ones, which contain Hs
and/or Xs, and the "soft" ones, which don't. Let's ignore the O's for
the moment. The "hard" words have the form

    Q^a R^b X^c H^d X^e R^f F^g

where

    a   = 0 (86%) or 1 (14%)
    b   = 0 (90%) or 1 ( 9%)
    d   = 0 (49%) or 1 (49%)
    c+e = 0 (52%) or 1 (43%) or 2 ( 4%)
    f   = 0 (42%) or 1 (53%) or 2 ( 4%)
    g   = 0 (85%) or 1 (14%)

The "soft" words have the form

    Q^w R^x F^y

where

    w = 0 (95%) or 1 ( 5%)
    x = 0 (12%) or 1 (58%) or 2 (22%) or 3 ( 2%)
    y = 0 (55%) or 1 (40%)

The "soft" schema above can be interpreted as a special case of the
"hard" schema with no X or Hs (i.e. c+d+e = 0), although the
probabilities are somewhat different.
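For example, qokaiin factors as {q}{o}{k}{a}{iin}; ignoring the O's
leaves Q H F, which fits the "hard" schema with a = d = g = 1 and
b = c = e = f = 0. Likewise daiin factors as {d}{a}{iin} and reduces
to R F, a "soft" word with w = 0, x = 1, y = 1.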
Said another way, the typical Voynichese word has a "midfix" (kernel,
root), possibly empty, consisting of at most one gallows surrounded
by at most two tables. To the midfix is attached a prefix having at
most one "q" and at most one dealer; and a suffix with at most two
dealers and at most one final.

Let's now check how many words fit this paradigm:

    foreach sec ( ${sections} ren all )
      echo "$sec"
      cat data/${sec}/class-wds.cts \
        | gawk '/./{print $1,$2}' \
        | sed -e 's/[O]://g' \
        > /tmp/.foo
      cat /tmp/.foo \
        | egrep -e ' :(Q:|)([SD]:|)(X:X:H:|H:X:X:|(X:|)(H:|)(X:|))([SD]:|)([SD]:|)([NI]:|)$' \
        | combine-counts | sort -b +0 -1nr +1 -2 \
        > /tmp/.foogut
      cat /tmp/.foo \
        | egrep -v -e ' :(Q:|)([SD]:|)(X:X:H:|H:X:X:|(X:|)(H:|)(X:|))([SD]:|)([SD]:|)([NI]:|)$' \
        | combine-counts | sort -b +0 -1nr +1 -2 \
        > /tmp/.foobad
      /bin/rm -f data/${sec}/class-fit.cts
      cat /tmp/.foogut | sed -e 's/ :/ +:/' >> data/${sec}/class-fit.cts
      cat /tmp/.foobad | sed -e 's/ :/ -:/' >> data/${sec}/class-fit.cts
    end

This paradigm fits 97% of all words in Rene's list (with
multiplicities) and 94% of all words in the interlinear file. The
remaining 6% of the latter includes the words containing "wild" (W)
elements (3% of all words) and long words that look like two words
joined together.

The most common patterns in the interlinear that do not fit the
paradigm and do not contain a wild element are

     46 :H:H:
     44 :H:X:H:
     38 :H:S:X:
     32 :H:S:X:D:
     26 :H:S:X:S:
     20 :D:I:S:
     19 :D:S:X:D:
     19 :H:H:S:
     19 :X:D:X:
     18 :S:I:S:
     18 :X:X:H:X:
     15 :I:S:
     15 :S:S:X:
     15 :X:S:H:
     14 :H:I:S:
     13 :D:S:X:
     12 :D:I:D:
     12 :S:S:X:D:

where S = { s l r }, D = { d }, I = { in iin ir iir },
N = { n m g j }.

TO MERGE WITH THE ABOVE:

[ 1999-02-02 ]

Word pattern frequencies
------------------------

It is instructive to analyze the frequency of each word pattern,
i.e. the result of collapsing the letters into the classes
{ Q O X I R E } or { Q O K } defined below. For this study we will
use the majority-vote transcription, which includes Takeshi's new
full transcription.

For simplicity, let's discard all data containing weirdos, extra
plumes, unreadable characters, or the rare letters [buxvz]. Let's
also map the upper-case EVA letters [SCIKTPF] to their lower-case
variants, since the capitalization carries no information in those
cases.

    cat ../045/only-m.evt \
      | egrep -e '^<[^<>]*;A>' \
      | tr 'SCIKTPF' 'sciktpf' \
      | tr -d '\!' \
      | sed \
          -e 's/^<[^<>]*> *//g' \
          -e 's/[{][^{}]*[}]//g' \
          -e 's/[&][0-9*?][0-9*?]*[;]\?/*/g' \
          -e 's/[buxvz]/*/g' \
          -e 's/[.,]*-[-.,]*/-/g' \
          -e 's/[,]*[.][,.]*/./g' \
          -e 's/[,][,]*/,/g' \
          -e 's/.['"'"'"]/?/g' \
          -e 's/[^-,./= ]*[%?*][^-,./= ]*/?/g' \
      > base.txt

Let's reduce the alphabet to letter classes as follows:

    O = [aoy]
    I = [i]+
    Q = [q]
    E = unattached [eh]
    R = [djmg] and [rlsn]
    X = ee, [csi][h], [ci][ktpf][h], [c][ktpf], [ktpf]

The following hack should do it:

    cat base.txt \
      | sed \
          -e 's/ee/X/g' \
          -e 's/[csi][h]/X/g' \
          -e 's/[ci][ktpf][h]/X/g' \
          -e 's/[c][ktpf]/X/g' \
          -e 's/[ktpf]/X/g' \
          -e 's/[rlsn]/R/g' \
          -e 's/[mdgj]/R/g' \
          -e 's/[aoy]/O/g' \
          -e 's/[q]/Q/g' \
          -e 's/[i][i]*/I/g' \
          -e 's/[ceh]/E/g' \
      > base.clt

    egrep '[^-.,=/?XEQROI]' base.clt > .bugs
    head -10 .bugs
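To see the hack in action, trace the common word qokeedy through the
rules: qokeedy -> qokXdy ("ee") -> qoXXdy ("k") -> qoXXRy ("d") ->
QOXXRO ([aoy] and [q]), which is indeed one of the high-frequency
patterns in QOIXER.frq below. Note that the rule order matters: "ee"
must be collapsed into X before the final [ceh] -> E rule gets a
chance to eat the e's.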
First, the { Q O X I R E } patterns:

    cat base.clt \
      | tr '., =/-' '\012\012\012\012\012\012' \
      | egrep '.' \
      | egrep -v '[?%*]' \
      | sort | uniq -c | expand \
      | sort +0 -1nr \
      > QOIXER.frq

The result is a long-tailed distribution that begins

     freq  pattern
    -----  ----------
     1832  XOR
     1725  OR
     1649  ROIR
     1413  ROR
     1209  OXOR
     1084  XERO
      940  XXO
      903  XEO
      817  OXOIR
      786  QOXOR
      745  XEOR
      718  OIR
      716  XO
      703  OXXO
      660  QOXOIR
      560  QOXXO
      487  R
      480  QOXXRO
      404  RO
      382  OXXRO
      379  XXOR
      376  QOXERO
      375  OXO
      372  XOIR
      370  OXERO
      325  OROR
      316  OXEOR
      312  OXXOR
      309  XXRO
      307  OROIR
      ...  ...

Let's now collapse the elements { X XE R IR } to a single class K,
and absorb the Q into the following O:

    cat base.clt \
      | tr '., =/-' '\012\012\012\012\012\012' \
      | egrep '.' \
      | egrep -v '[?%*]' \
      | sed \
          -e 's/XE/K/g' \
          -e 's/X/K/g' \
          -e 's/IR/K/g' \
          -e 's/R/K/g' \
          -e 's/QO/O/g' \
      | sort | uniq -c | expand \
      | sort +0 -1nr \
      > QOK.frq

The result is still a relatively long-tailed distribution:

     freq  pattern
    -----  ----------
     6061  KOK
     4690  OKOK
     3075  KKO
     2704  OK
     2646  OKKO
     2023  KO
     1531  OKKKO
     1365  OKO
     1346  KKOK
     1236  KKKO
     1052  KOKO
      951  OKKOK
      861  KOKOK
      578  K
      561  OKOKO
      374  KOKKO
      324  KK
      309  OKOKOK
      265  O
      233  KKKOK
      219  KKOKO
      202  KKKKO
      189  KKK
      177  OKKOKO
      175  OKKK
      169  OKK
      160  OOK
      152  OKOKKO
      142  KOKKOK
      139  OKKKKO
      ...  ...

Conversely, we can analyze the patterns of X and R ignoring the
{ Q E I O } complements:

    cat base.clt \
      | tr '., =/-' '\012\012\012\012\012\012' \
      | egrep '.' \
      | egrep -v '[?%*]' \
      | tr -d 'QEIO' \
      | sort | uniq -c | expand \
      | sort +0 -1nr \
      > XR.frq

The result is still a fairly broad distribution:

     freq  pattern
    -----  ----------
    10441  XR
     4319  RR
     4006  R
     3768  XXR
     3682  XX
     2999  X
     1461  XRR
     1279  RXR
      538  RX
      480  XXX
      463  RRR
      409  XXRR
      366  RXXR
      346  RXX
      302  (empty)
      230  XXXR
      151  XRXR
      132  XRX
      116  XRRR
       90  RXRR
       89  RRXR
       59  RRRR
       56  XXXX
       50  RRX
       31  XXRX
       30  XXRRR
       24  XRXX
       23  RXXRR
       23  RXXX
       22  XRXXR
      ...  ...
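(Closing the loop on a single token: qokeedy maps to QOXXRO in
base.clt; the QOK collapse turns that into OKKKO (X -> K, R -> K,
QO -> O), the seventh line of QOK.frq above; and deleting QEIO
leaves XXR, the fourth line of XR.frq.)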