# Note 017

```Hacking at the Voynich manuscript - Side notes
017 OKOKOKO: The fine structure of Voynichese words

Last edited on 1999-02-02 14:01:55 by stolfi

[ A first version of this note was posted around 1998-03-11, to the
voynich mailing list.  This version was extensively revised between
1998-03-21 and 1998-03-29. The section about word and line breaks

[ If you decide to print this note, be warned that some lines have
almost 120 characters.]

-----------------------------

Let "X" be any set of letters.  We can always break any string
whatsoever into zero or more "X"s, each surrounded by letters which
are not "X"s:

N X N X N X N ... X N

where "X" represents exactly one letter from the set, and
"N" is any string (possibly empty) of non-"X" letters.

Now let's apply this decomposition to the Voynichese words,
using as "X" the set of letters

{ sh ch ee
k ckh ck ikh
t cth ct ith
f cfh cf ifh
p cph cp iph
d
r l s g m n
}

(I am using the basic EVA alphabet, without capitals).

It turns out that, for this choice of "X", the intervening "N"
strings are highly constrained. In fact, most words can be
decomposed as

Q  O  I X E  O  I X E  O  I X E  O...  I X E  O

where

Q is empty or "q";
O is zero or more elements from the set A = { a o y };
I is empty, or one of { i ii iii };
E is empty, or "e".

The QOKOKOKO schema
-------------------

In fact, we can constrain these pieces even more.
With very few exceptions,

"E" may be non-empty only after { sh ch ee k ckh t cth p cph f cfh d }

"I" may be non-empty only before { r l g m n s d }

Note that "d" is exceptional in that it may be accompanied by
either "e" or "i" strings; but the two are mutually exclusive.
(In fact the letter pairs "id" and "de" are both extremely rare.)

That is, we can write the generic word as

Q O K O K O K ... K O K O

where O is as above, and K is one of the "main elements"

{ k    t    p    f
ke   te   pe   fe
ckh  cth  cph  cfh
ckhe cthe cphe cfhe
ikh  ith  iph  ifh
ck   ct   cf   cp

sh  ch  ee
she che eee

de

d    r    l    g    m    n    s
id   ir   il   ig   im   in   is
iid  iir  iil  iig  iim  iin  iis
iiid iiir iiil iiig iiim iiin iiis
}
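
As a rough cross-check (a sketch only, not one of the scripts used
in this note), the schema above can be restated as a single extended
regular expression and fed to egrep.  The sample words are merely
illustrative, and the "-x" option makes egrep match whole lines only.
Note that the pattern also accepts an empty line.

    printf '%s\n' okeeedy qokedy daiin chsey okai ykhey \
      | egrep -x 'q?[aoy]*(((sh|ch|ee|k|t|p|f|ckh|cth|cph|cfh|d)e?|i{0,3}[rlgmns]|i{1,3}d|ikh|ith|iph|ifh|ck|ct|cf|cp)[aoy]*)*'

This should print only "okeeedy", "qokedy", and "daiin"; the other
three words break the schema ("e" after "s", a word-final "i", and
"kh" without a leading "c" or "i").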

Note that

* The letters "p" and "f" are probably ornate versions of other
letters: most likely "k" and "t", but perhaps others.

* Various statistics suggest that "k" and "t" may be the same letter.

* Ditto for "p" and "f".

* Ditto for "y" and "o".

* Ditto for "g" and "m".

* The letter "q" does not seem to be part of the word;
it may be an abbreviation for "and".

* The groups { ikh ith iph ifh } may be equivalent to
{ ckh cth cph cfh }, respectively.

* Instances of { ee eee } may be instances of { ch che }
with missing ligature.

Finally, many of the "K" elements are so rare that they are
probably errors.   If we consider only elements with
frequency 0.1% or higher, and exclude the elements
with "i*h", "p", and "f", we are left with only 25
"significant" elements:

K* = { k    ke   ckh  ckhe
t    te   cth  cthe
ch   che
sh   she
ee   eee
l    m    s    d
n    r
in   ir
iin  iir
iiin
}

Parsing ambiguities
-------------------

Note that the inclusion in "X" of the groups { ikh ith iph ifh }
does not create any ambiguity with the "I" modifiers, since the
presence of "h" after a tall letter forces one to parse the
preceding letter (which must be "i" or "c") as part of the same
element.  Indeed, the elements { ikh ith iph ifh } may be merely
calligraphic variants of { ckh cth cph cfh }, and are the only
instances where the letters { k t p f } may be preceded by "i".

On the other hand, including the string "ee" in the set "X" leads to
an ambiguity in the parsing of words with three or more consecutive
"e"s.  For example, "okeeedy" could be parsed either as

Q   O   I  X  E   O   I  X  E   O   I  X  E   O
-   o   -  k  -   -   -  ee e   -   -  d  -   y

or as

Q   O   I  X  E   O   I  X  E   O   I  X  E   O
-   o   -  k  e   -   -  ee -   -   -  d  -   y

Several Voynichologists (Rene and Dennis, among others) are unhappy
with this ambiguity, and would prefer removing "ee" from "X"
and perhaps allowing "ee" and "eee" as possible "E" modifiers.

But there are reasons for including "ee" in "X".  For one thing,
while an isolated "e" is pretty common within words, it practically
never occurs right after { d r l } or before the first "X"; but "ee"
and "eee" often occur in those positions.  That is, while a single
"e" must always be attached to a preceding "X", the groups "ee" and
"eee" can stand on their own, like the other "X" groups.

(One could argue that the "c" in the elements { ck ct cf cp }, which
may occur before any other "X" group in some words, is in fact an
instance of "e". However, in the few cases I have checked, the "c"
has a noticeable ligature, even though the matching "h" is
missing. So it seems indeed valid to write those combinations with
"c" and not with "e".)

One must keep in mind also that an "ee" group may well be a "ch"
element whose ligature was omitted (by the scribe or the
transcriber).  Similarly, the very rare occurrences of "se" may
well be instances of "sh" with missing ligature.

Conversely, it may be that the `natural' form of the letters
{ ch che sh she } is { ee eee se see }, respectively; and the
ligatures are optional calligraphic devices added
to clarify the parsing, almost as an afterthought.

Parsing the text
----------------

The words that fail this "QOKOKOKO" pattern are quite rare.
Let's count them in the following files:

hea-u.wds  a few herbal-A pages, which I carefully
transcribed from Jacques Guy's images;

hea-f.wds  herbal-A pages in Friedman's transcription;

heb-f.wds  herbal-B pages in Friedman's transcription;

bio-f.wds  biological (language B) pages in
Friedman's transcription;

vdp-z.wds  a list of all words that occur at least twice,
transcribed by the EVMT team.

(The "-f" files were created between 97-11-11 and 98-11-12,
as {hea,heb,bio}-f-gut.wds, from Landini's interlinear
converted to EVA.  The last one was created by expanding
a word frequency list posted by Rene Zandbergen in March 1998;
an entry "N W" in that list generated "N" copies of word "W"
in file "vdp-z.wds".)
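
The expansion step just described can be sketched with a one-line
awk command, assuming the frequency list has one "N W" pair per
line.  The input file name "rene-list.txt" is hypothetical; the
actual name of the posted list is not recorded here.

    awk '{ for (i = 0; i < $1; i++) print $2 }' rene-list.txt > vdp-z.wds

Each input line "N W" produces N output lines, each containing W.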

foreach file ( hea-u hea-f heb-f bio-f vdp-z )
  cat ${file}.wds \
    | egrep -v '[*]' \
    | sed -f factor-OK.sed \
    > ${file}.fac
  cat ${file}.fac \
    | egrep -e '[#@%=]' \
    > ${file}-weird.fac
  dicio-wc ${file}.fac ${file}-weird.fac
end

--- factor-OK.sed ------------------------
# Map "sh", "ch", and "ee" to single letters to simplify the parsing.
# Note that "eee" groups are paired off from left end.
s/ch/C/g
s/sh/S/g
s/ee/E/g
# Map platformed and half-platformed letters to capitals to simplify the parsing:
s/ckh/K/g
s/cth/T/g
s/cfh/F/g
s/cph/P/g
#
s/ikh/G/g
s/ith/H/g
s/ifh/M/g
s/iph/N/g
#
s/ck/U/g
s/ct/V/g
s/cf/X/g
s/cp/Y/g
# Put down scanning head in "@" state
s/$/@/
:x
# If in "@" state, copy "[aoy]" group, and switch to "#" state:
s/\([aoy][aoy]*\)@/#\1/
s/@/#_/
# If in "#" state, copy next main letter and "e" complements,
# insert "}" delimiter, and switch to "%" or "=" state depending on
# whether "i"s are allowed or not:
s/\([CSEktfpKTFPd]e\)#/=\1}/g
s/\([CSEktfpKTFPGHMNUVXY]\)#/=\1}/g
s/\([rlgmnsd]\)#/%\1}/g
# If in "%" state, attach "i" string to group, go to "=" state:
s/\(iii\)%/=\1/
s/\(ii\)%/=\1/
s/\(i\)%/=\1/
s/%/=/
# If in "=" state, insert "{" delimiter, and go back to "@" state:
s/=/@{/
tx
# We should exit the loop only in the "#" state.
s/^[q]#/{q}/
s/^#/{_}/
# Unfold letter folding:
s/U/ck/g
s/V/ct/g
s/X/cf/g
s/Y/cp/g
#
s/G/ikh/g
s/H/ith/g
s/M/ifh/g
s/N/iph/g
#
s/K/ckh/g
s/T/cth/g
s/P/cph/g
s/F/cfh/g
#
s/C/ch/g
s/S/sh/g
s/E/ee/g
------------------------------------------
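
To make the output format concrete, here is a small usage sketch.
It assumes the script above is saved as "factor-OK.sed", and that
the command which puts down the scanning head appends "@" at the
end of the word (an unescaped "$" anchor).

    printf '%s\n' okedy daiin | sed -f factor-OK.sed

Tracing the script by hand, this should print

    {_}o{ke}_{d}y
    {_}_{d}a{iin}_

that is: the "Q" slot first, then alternating "O" strings and
"{K}" elements, with "_" standing for an empty slot.  Words that do
not fit the schema come out with a leftover state marker
("#", "@", "%", or "=") somewhere in the line.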

lines   words     bytes file
------ ------- --------- ------------
803     803     11751 hea-u.fac
0       0         0 hea-u-weird.fac

lines   words     bytes file
------ ------- --------- ------------
7812    7812    113448 hea-f.fac
93      93      1144 hea-f-weird.fac

lines   words     bytes file
------ ------- --------- ------------
3223    3223     47932 heb-f.fac
46      46       564 heb-f-weird.fac

lines   words     bytes file
------ ------- --------- ------------
6182    6182     90650 bio-f.fac
39      39       474 bio-f-weird.fac

lines   words     bytes file
------ ------- --------- ------------
28939   28939    420444 vdp-z.fac
142     142      1339 vdp-z-weird.fac

So, the exceptions to the QOKOKOKO pattern are less than 1.5% in
Friedman's transcription, less than 0.5% in Rene's list, and none
in my own transcription.

(The last result is not that impressive, of course.  Even though I did my

The exceptions in Rene's word list
----------------------------------

Here is a breakdown of the 142 words (counting multiple occurrences)
in Rene's file that did not fit the QOKOKOKO pattern.  (Let's keep in mind
that Rene's file only includes words that occur at least twice.)

It seems that some of these exceptions can be explained as "mutations"
from other letters: scribal errors, calligraphic variations, pen
running out of ink, vellum defects, spots, fading, and of course
poor copy quality.  Some are harder to explain, however, and may
require extending the basic schema.

* Words with groups { ckhh cthh cphh cfhh } (42 cases):

chckhhy(9) cthhy(4) chcthhy(4) shcthhy(3) qcthhy(3) ckhhy(3)
chcphhy(3) chcfhhy(3) shocthhy(2) shcphhy(2) qcphhedy(2)
ockhhy(2) ocfhhy(2)

These exceptions account for 0.15% of all words.  I propose that
these are calligraphic accidents; that is, "ckhh" is a "ckhe"
whose ligature was overextended, and similarly for the other
groups.

* Words with "oe" (41 cases):

qoedy(5) qoedaiin(3) oedy(2)

qoeol(5) qoear(2) qoeor(2)

qoekeey(3) oekaiin(3) qoekol(2) oekeey(2)
qoekedy(2) oekey(2) oekeody(2)

choety(2) choeky(2) sheoeky(2)

These exceptions account for approximately 0.15% of all words.
The cases with "eke" could be explained as instances of "ckh"
with missing ligature.  The others may be true exceptions to the
schema.

Note that the "oe" occurs only at the beginning of the word, or
after the initial "q", or after an initial "ch" or "she" (which,
in language A, seem to behave like "q" to some extent).

* Words beginning with "e" or "qe" (20 cases):

ety(6) qekeey(3) qekchdy(3) qety(2) qekor(2) qekaiin(2)
etaiin(2)

These word-initial "e"s could be explained as partly erased
instances of { a o y }.  Note that if we replace the initial "e"
by "o" or "y" we get fairly common words in all these cases.

* Words with the special letters "x" and "v" (20 cases):

x(10) v(8) xar(2)

Note that these letters (picnic table and caret) occur mostly as
isolated letters. Therefore, they may be non-phonetic symbols, or
abbreviations.

* Words with "e" after "s" (5 cases):

chsey(3) shese(2)

These exceptions could be instances of "sh" without the ligature.

* Isolated "e"s (4 cases):

e(4)

These exceptions could be instances of "s" with missing plume.

* Words with "eeb" (3 cases):

cheeb(3)

I propose that "eeb" is merely a calligraphic variation of
"an" or "iin".

* Words with "ykh" (3 cases):

ykhey(3)

I can't think of a good explanation for these cases.

* Letter "o" before "q" (2 cases):

oqokain(2)

Perhaps the extra "o" is a separate word, or part of the
previous one?

* Letter "i" in word-final position (2 cases):

okai(2)

These exceptions could be truncated "in" or "ir" groups.

Frequencies for "K" elements
----------------------------

Here are the statistics for the "K" groups.

foreach file ( hea-u hea-f heb-f bio-f vdp-z )
  cat ${file}.fac \
    | egrep -v '[@%#=]' \
    | sed \
        -e 's/^[^{}]*{//g' \
        -e 's/}[^{}]*$//g' \
        -e 's/}[^{}]*{/./g' \
    | tr '.' '\012' \
    | egrep -e '.' \
    | sort | uniq -c | expand | sort -b +0 -1nr \
    | compute-freqs | sed -e 's/^  //g' \
    > ${file}-k.frq
  dicio-wc ${file}-k.frq
end

lines file
------ ------------
39 hea-u-k.frq
41 hea-f-k.frq
36 heb-f-k.frq
35 bio-f-k.frq
44 vdp-z-k.frq

multicol {hea-u,hea-f,heb-f,bio-f,vdp-z}-k.frq

hea-u             hea-f             heb-f             bio-f             vdp-z
----------------  ----------------  ----------------  ----------------  ----------------
752 0.304 _      7024 0.297 _      2856 0.284 _      4627 0.242 _     24167 0.273 _
292 0.118 ch     2524 0.107 ch     1459 0.145 d      2512 0.131 d      9928 0.112 d
216 0.087 d      2194 0.093 d       765 0.076 k      2140 0.112 l      7523 0.085 l
183 0.074 l      1702 0.072 l       608 0.060 l      1516 0.079 q      6470 0.073 k
178 0.072 r      1466 0.062 r       600 0.060 r      1422 0.074 k      4855 0.055 r
119 0.048 t      1257 0.053 k       557 0.055 ch      828 0.043 che    4698 0.053 ch
102 0.041 k      1177 0.050 t       424 0.042 iin     804 0.042 r      4630 0.052 q
101 0.041 iin    1090 0.046 iin     366 0.036 che     775 0.041 ee     3641 0.041 t
93 0.038 sh      832 0.035 sh      353 0.035 t       723 0.038 iin    3545 0.040 iin
76 0.031 s       695 0.029 q       321 0.032 q       670 0.035 she    3384 0.038 ee
58 0.023 cth     632 0.027 s       250 0.025 ee      615 0.032 t      3328 0.038 che
51 0.021 che     464 0.020 che     207 0.021 ke      476 0.025 ch     1663 0.019 she
51 0.021 q       453 0.019 cth     176 0.017 s       377 0.020 ke     1644 0.019 sh
36 0.015 m       353 0.015 ee      176 0.017 she     357 0.019 s      1608 0.018 s
23 0.009 ee      253 0.011 m       175 0.017 sh      316 0.017 sh     1428 0.016 in
22 0.009 in      216 0.009 p       123 0.012 p       194 0.010 te     1370 0.015 ke
20 0.008 p       187 0.008 she     113 0.011 m       168 0.009 p       789 0.009 te
19 0.008 she     186 0.008 in      110 0.011 te      142 0.007 ckh     734 0.008 p
17 0.007 ckh     176 0.007 ckh      89 0.009 ckh     113 0.006 in      632 0.007 m
11 0.004 te      130 0.005 ke       74 0.007 f        81 0.004 cth     573 0.006 cth
8 0.003 ke       78 0.003 cph      67 0.007 in       72 0.004 m       511 0.006 ckh
7 0.003 cph      75 0.003 f        51 0.005 ir       42 0.002 ckhe    379 0.004 ir
5 0.002 ir       75 0.003 n        38 0.004 cth      31 0.002 ir      223 0.003 eee
4 0.002 ct       70 0.003 te       24 0.002 cthe     29 0.002 cthe    177 0.002 ckhe
4 0.002 iiin     65 0.003 cthe     19 0.002 ckhe     21 0.001 eee     134 0.002 cthe
4 0.002 n        59 0.002 ir       13 0.001 eee      21 0.001 f       125 0.001 f
3 0.001 cthe     57 0.002 eee      13 0.001 iir      12 0.001 cphe    116 0.001 iiin
3 0.001 de       47 0.002 ckhe      9 0.001 iiin     10 0.001 n        95 0.001 iir
3 0.001 eee      27 0.001 cfh       6 0.001 cphe      7 0.000 cph      82 0.001 cph
3 0.001 f        24 0.001 iir       5 0.000 cfh       7 0.000 iiin     67 0.001 n
3 0.001 iir      21 0.001 cphe      5 0.000 cph       5 0.000 de       43 0.000 cphe
2 0.001 cfh      20 0.001 iiin      5 0.000 de        4 0.000 cfh      26 0.000 g
2 0.001 ck        8 0.000 de        5 0.000 n         3 0.000 il       21 0.000 im
1 0.000 cf        6 0.000 iim       4 0.000 cfhe      2 0.000 iir      18 0.000 cfh
1 0.000 ckhe      3 0.000 cfhe      2 0.000 id        1 0.000 pe       12 0.000 ikh
1 0.000 cphe      3 0.000 iid       1 0.000 iil                        10 0.000 ct
1 0.000 iid       3 0.000 iil                                           8 0.000 ith
1 0.000 iim       2 0.000 id                                            7 0.000 ck
1 0.000 im        2 0.000 iis                                           7 0.000 iid
                  1 0.000 il                                            7 0.000 il
                  1 0.000 is                                            2 0.000 cfhe
                                                                        2 0.000 de
                                                                        2 0.000 iim
                                                                        2 0.000 iis

In these tables, the "_" entry represents the empty "Q" slot.

Let's extract from those tables the elements that are not in the
reduced set "K*" and are not simple uses of the `jokers' "p" and "f":

foreach file ( hea-u hea-f heb-f bio-f vdp-z )
  cat ${file}-k.frq \
    | egrep -v ' (([ktpf]|c[ktpf]h|[cs]h|ee)(e|)|[_qlmsdnr]|i[nr]|ii[nr]|iiin)$' \
    > ${file}-knr.frq
end

multicol {hea-u,hea-f,heb-f,bio-f,vdp-z}-knr.frq

hea-u            hea-f            heb-f            bio-f           vdp-z
---------------  ---------------  ---------------  --------------  ---------------
4 0.002 ct       8 0.000 de       5 0.000 de       5 0.000 de     26 0.000 g
3 0.001 de       6 0.000 iim      2 0.000 id       3 0.000 il     21 0.000 im
2 0.001 ck       3 0.000 iid      1 0.000 iil                     12 0.000 ikh
1 0.000 cf       3 0.000 iil                                      10 0.000 ct
1 0.000 iid      2 0.000 id                                        8 0.000 ith
1 0.000 iim      2 0.000 iis                                       7 0.000 ck
1 0.000 im       1 0.000 il                                        7 0.000 iid
                 1 0.000 is                                        7 0.000 il
                                                                   2 0.000 de
                                                                   2 0.000 iim
                                                                   2 0.000 iis

Recall that strings with three or more "e"s have ambiguous parsing,
which affects the statistics of "ee" and all elements with the "e"
modifier.  The factor-OK script arbitrarily pairs the "e"s from the
left, so that such strings are parsed as zero or more "ee"s
followed by one "ee" or "eee".
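
The left-to-right pairing can be seen by running only the "ee"
folding command of the script on a few sample words (a minimal
illustration; "E" is the script's internal code for "ee"):

    printf '%s\n' okeedy okeeedy okeeeedy | sed -e 's/ee/E/g'

This prints "okEdy", "okEedy", and "okEEdy".  In the second word the
leftover "e" is then attached to the preceding "E", giving the
single element "eee".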

To assess the implications of this ambiguity, let's check how
many ambiguous strings we have in each file:

foreach file ( hea-u hea-f heb-f bio-f vdp-z )
  cat ${file}.wds \
    | egrep -v '[*]' \
    | sed -e 's/[^e]/./g' \
    | tr '.' '\012' \
    | egrep '.' \
    | sort | uniq -c | expand | sort +0 -1nr \
    | compute-freqs | sed -e 's/^  //g' \
    > ${file}-eee.frq
  dicio-wc ${file}-eee.frq
end

multicol {hea-u,hea-f,heb-f,bio-f,vdp-z}-eee.frq

hea-u            hea-f             heb-f            bio-f            vdp-z
---------------  ----------------  ---------------  ---------------  ---------------
97 0.789 e     1069 0.721 e       952 0.782 e     2187 0.732 e     7593 0.677 e
23 0.187 ee     355 0.239 ee      253 0.208 ee     779 0.261 ee    3395 0.303 ee
3 0.024 eee     57 0.038 eee      13 0.011 eee     21 0.007 eee    223 0.020 eee
                 2 0.001 eeee

Note that, surprisingly, there are practically no words with four or more
"e"s in a row.

My factoring script will parse the "eee" strings as one "eee"
element.  In all files, the frequency of the "eee" element is less
than 0.003 (i.e. 0.3% of the total "K" elements).  Therefore, if I had used
the other parsing ("e" + "ee"), the frequencies of "ee" and
all other "e"-modified elements would increase by less than 0.003
in total.

By the way, the low frequency of "eee" probably means that
its ambiguity would be no big problem for the intended readers.

In fact, the absence of "eeee"s could be explained by the following
theory: the letters "ch" and "sh" are officially written "ee" and
"se"; since that would lead to ambiguities, the scribe
routinely (but not invariably) adds ligatures to indicate
the intended grouping.

Frequencies of "K" elements in languages A and B
------------------------------------------------

In the "K" frequency tables above we can already see a marked difference
between languages A and B.  Looking only at the reduced element subset K*,
plus "q" and "_" (meaning no "q"):

foreach file ( hea-f heb-f )
  cat ${file}-k.frq \
    | egrep ' (([kt]|c[kt]h|[cs]h|ee)(e|)|[_qlmsdnr]|i[nr]|ii[nr]|iiin)$' \
    > ${file}-kr.frq
end

multicol {hea-f,heb-f}-kr.frq

hea-f             heb-f
----------------  ----------------
7024 0.297 _      2856 0.284 _
2524 0.107 ch     1459 0.145 d
2194 0.093 d       765 0.076 k
1702 0.072 l       608 0.060 l
1466 0.062 r       600 0.060 r
1257 0.053 k       557 0.055 ch
1177 0.050 t       424 0.042 iin
1090 0.046 iin     366 0.036 che
832 0.035 sh      353 0.035 t
695 0.029 q       321 0.032 q
632 0.027 s       250 0.025 ee
464 0.020 che     207 0.021 ke
453 0.019 cth     176 0.017 s
353 0.015 ee      176 0.017 she
253 0.011 m       175 0.017 sh
187 0.008 she     113 0.011 m
186 0.008 in      110 0.011 te
176 0.007 ckh      89 0.009 ckh
130 0.005 ke       67 0.007 in
75 0.003 n        51 0.005 ir
70 0.003 te       38 0.004 cth
65 0.003 cthe     24 0.002 cthe
59 0.002 ir       19 0.002 ckhe
57 0.002 eee      13 0.001 eee
47 0.002 ckhe     13 0.001 iir
24 0.001 iir       9 0.001 iiin
20 0.001 iiin      5 0.000 n

There is also a less marked but still significant difference between
herbal-B and bio-B:

foreach file ( heb-f bio-f )
  cat ${file}-k.frq \
    | egrep ' (([kt]|c[kt]h|[cs]h|ee)(e|)|[_qlmsdnr]|i[nr]|ii[nr]|iiin)$' \
    > ${file}-kr.frq
end

multicol {heb-f,bio-f}-kr.frq

heb-f             bio-f
----------------  ----------------
2856 0.284 _      4627 0.242 _
1459 0.145 d      2512 0.131 d
765 0.076 k      2140 0.112 l
608 0.060 l      1516 0.079 q
600 0.060 r      1422 0.074 k
557 0.055 ch      828 0.043 che
424 0.042 iin     804 0.042 r
366 0.036 che     775 0.041 ee
353 0.035 t       723 0.038 iin
321 0.032 q       670 0.035 she
250 0.025 ee      615 0.032 t
207 0.021 ke      476 0.025 ch
176 0.017 s       377 0.020 ke
176 0.017 she     357 0.019 s
175 0.017 sh      316 0.017 sh
113 0.011 m       194 0.010 te
110 0.011 te      142 0.007 ckh
89 0.009 ckh     113 0.006 in
67 0.007 in       81 0.004 cth
51 0.005 ir       72 0.004 m
38 0.004 cth      42 0.002 ckhe
24 0.002 cthe     31 0.002 ir
19 0.002 ckhe     29 0.002 cthe
13 0.001 eee      21 0.001 eee
13 0.001 iir      10 0.001 n
9 0.001 iiin      7 0.000 iiin
5 0.000 n         2 0.000 iir

However, most of that difference disappears if we:

(1) identify the letters { k t p f }, which we have
good reasons to believe are the same letter;

(2) omit the letter "q", which is believed to be
a symbol for "and", and hence might be correlated
with subject matter;

(3) identify "ee" with "ch".

foreach file ( hea-u hea-f heb-f bio-f vdp-z )
  cat ${file}.fac \
    | egrep -v '[@%#=]' \
    | sed \
        -e 's/^[^{}]*{//g' \
        -e 's/}[^{}]*$//g' \
        -e 's/}[^{}]*{/./g' \
        -e 's/[ktpf]/k/g' \
        -e 's/ee/ch/g' \
        -e 's/q//g' \
    | tr '.' '\012' \
    | egrep -e '.' \
    | sort | uniq -c | expand | sort -b +0 -1nr \
    | compute-freqs | sed -e 's/^  //g' \
    | egrep ' (([kt]|c[kt]h|[cs]h|ee)(e|)|[_qlmsdnr]|i[nr]|ii[nr]|iiin)$' \
    > ${file}-krr.frq
  dicio-wc ${file}-krr.frq
end

multicol {hea-u,hea-f,heb-f,bio-f,vdp-z}-krr.frq

hea-u             hea-f             heb-f             bio-f             vdp-z
----------------  ----------------  ----------------  ----------------  ----------------
752 0.310 _      7024 0.306 _      2856 0.293 _      4627 0.263 _     24167 0.288 _
315 0.130 ch     2877 0.125 ch     1459 0.150 d      2512 0.143 d     10970 0.131 k
244 0.101 k      2725 0.119 k      1315 0.135 k      2226 0.126 k      9928 0.118 d
216 0.089 d      2194 0.096 d       807 0.083 ch     2140 0.122 l      8082 0.096 ch
183 0.075 l      1702 0.074 l       608 0.062 l      1251 0.071 ch     7523 0.089 l
178 0.073 r      1466 0.064 r       600 0.062 r       849 0.048 che    4855 0.058 r
101 0.042 iin    1090 0.047 iin     424 0.043 iin     804 0.046 r      3551 0.042 che
93 0.038 sh      832 0.036 sh      379 0.039 che     723 0.041 iin    3545 0.042 iin
84 0.035 ckh     734 0.032 ckh     317 0.033 ke      670 0.038 she    2159 0.026 ke
76 0.031 s       632 0.028 s       176 0.018 s       572 0.032 ke     1663 0.020 she
54 0.022 che     521 0.023 che     176 0.018 she     357 0.020 s      1644 0.020 sh
36 0.015 m       253 0.011 m       175 0.018 sh      316 0.018 sh     1608 0.019 s
22 0.009 in      200 0.009 ke      137 0.014 ckh     234 0.013 ckh    1428 0.017 in
19 0.008 ke      187 0.008 she     113 0.012 m       113 0.006 in     1184 0.014 ckh
19 0.008 she     186 0.008 in       67 0.007 in       83 0.005 ckhe    632 0.008 m
5 0.002 ckhe    136 0.006 ckhe     53 0.005 ckhe     72 0.004 m       379 0.005 ir
5 0.002 ir       75 0.003 n        51 0.005 ir       31 0.002 ir      356 0.004 ckhe
4 0.002 iiin     59 0.003 ir       13 0.001 iir      10 0.001 n       116 0.001 iiin
4 0.002 n        24 0.001 iir       9 0.001 iiin      7 0.000 iiin     95 0.001 iir
3 0.001 iir      20 0.001 iiin      5 0.001 n         2 0.000 iir      67 0.001 n

Statistics of "O" strings
-------------------------

Now, what do we do with the "O" strings?  Let's look at their
statistics:

foreach file ( hea-u hea-f heb-f bio-f vdp-z )
  cat ${file}.fac \
    | egrep -v '[@%#=]' \
    | sed -e 's/{[^{}]*}/./g' \
    | tr '.' '\012' \
    | egrep -e '.' \
    | sort | uniq -c | expand | sort -b +0 -1nr \
    | compute-freqs | sed -e 's/^  //g' \
    > ${file}-ooo.frq
  dicio-wc ${file}-ooo.frq
end

lines file
------ ------------
9 hea-u-ooo.frq
15 hea-f-ooo.frq
11 heb-f-ooo.frq
12 bio-f-ooo.frq
11 vdp-z-ooo.frq

multicol {hea-u,hea-f,heb-f,bio-f,vdp-z}-ooo.frq

hea-u           hea-f            heb-f           bio-f            vdp-z
--------------  ---------------  --------------  ---------------  --------------
1364 0.551 _   12782 0.540 _     5371 0.533 _   10295 0.538 _    45585 0.514 _
595 0.240 o    5444 0.230 o     1712 0.170 y    3558 0.186 o    18671 0.211 o
262 0.106 y    3069 0.130 y     1616 0.160 o    3413 0.178 y    13615 0.154 y
234 0.094 a    2188 0.092 a     1325 0.132 a    1835 0.096 a    10544 0.119 a
11 0.004 oa     70 0.003 oa      16 0.002 oa      6 0.000 yo     171 0.002 oa
6 0.002 oy     59 0.002 oy      11 0.001 oy      4 0.000 oy      51 0.001 oy
2 0.001 ao     16 0.001 oo       8 0.001 oo      3 0.000 ay      23 0.000 oo
2 0.001 oo     14 0.001 yo       6 0.001 yo      3 0.000 oa      12 0.000 yo
1 0.000 yo      5 0.000 ay       2 0.000 ay      2 0.000 ao       6 0.000 ay
                4 0.000 ya       1 0.000 ao      2 0.000 ya       6 0.000 ya
                2 0.000 ao       1 0.000 ya      1 0.000 aoy      2 0.000 yy
                2 0.000 yoa                      1 0.000 oaa
                1 0.000 aa
                1 0.000 oao
                1 0.000 yay

Thus, the only common alternatives are empty, "y", "a", and "o".  In
fact, as we know, the alternative "y" is common only in initial and
final positions; and in those positions it seems to be equivalent to
"o".

Note that about half of the "O" slots are filled (i.e. the ratio K:O
is about 2:1).  Therefore, if the "K" elements were randomly mixed
with "O" letters, the "O" slots should be about

67% empty,
22% single-letter,
7% double-letter,  and
2% triple-letter.

However, what we actually observe is about

50% empty,
50% single-letter,
<1% double-letter, and
<0.1% triple-letter.

In fact, triple-letter "O"s are so rare that they can be assumed to
be errors. In Rene's good-quality word list (vdp-z.wds) there are no
triple-letter "O"s at all.
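
The "random mixing" percentages quoted above (67%, 22%, 7%, 2%)
follow from a simple geometric model: if each successive letter is
an "O" letter with probability 1/3 (that is, K:O = 2:1),
independently of its neighbors, then a slot holds exactly n letters
with probability (2/3)(1/3)^n.  A quick check of that arithmetic,
as a sketch under this independence assumption:

    awk 'BEGIN { p = 1/3; for (n = 0; n <= 3; n++) printf "%d letters: %4.1f%%\n", n, 100 * (1 - p) * p^n }'

This gives about 66.7%, 22.2%, 7.4%, and 2.5% for n = 0, 1, 2, 3.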

Statistics of "K" strings
-------------------------

Let's now look at the clusters of "K" elements between consecutive
non-empty "O"s.  To reduce the size of the output, let's map the
letters { k t p f } to "k", and "ch" to "ee":

foreach file ( hea-f heb-f bio-f vdp-z )
  cat ${file}.fac \
    | egrep -v '[@%#=]' \
    | sed \
        -e 's/^{[q_]}//g' \
        -e 's/^_//g' \
        -e 's/_$//g' \
        -e 's/[oay]/./g' \
        -e 's/[{}]//g' \
        -e 's/[ktpf]/k/g' \
        -e 's/ch/ee/g' \
    | tr '.' '\012' \
    | egrep -e '.' \
    | sort | uniq -c | expand | sort -b +0 -1nr \
    | compute-freqs | sed -e 's/^  //g' \
    > ${file}-kkk.frq
  dicio-wc ${file}-kkk.frq
end

lines file
------ ------------
257 hea-f-kkk.frq
213 heb-f-kkk.frq
265 bio-f-kkk.frq
232 vdp-z-kkk.frq

multicol {hea-f,heb-f,bio-f,vdp-z}-kkk.frq > multi-kkk.frq

hea-f                      heb-f                       bio-f                        vdp-z
-------------------------  --------------------------  ---------------------------  -----------------------
1733 0.131 d                663 0.136 k                1424 0.165 l                 5345 0.119 k
1441 0.109 l                568 0.116 r                1148 0.133 k                 5296 0.118 l
1380 0.104 r                550 0.113 d                 729 0.084 r                 4714 0.105 r
1237 0.093 k                419 0.086 iin               720 0.083 iin               4146 0.093 d
1164 0.088 ee               408 0.084 l                 464 0.054 d                 3534 0.079 iin
1078 0.081 iin              172 0.035 ke_d              379 0.044 ke_d              1994 0.045 k_ee
931 0.070 k_ee             161 0.033 k_ee_d            331 0.038 k_ee_d            1861 0.042 ee
592 0.045 ckh              114 0.023 s                 260 0.030 s                 1424 0.032 in
553 0.042 sh               107 0.022 m                 258 0.030 she_d             1232 0.028 s
426 0.032 s                105 0.022 k_ee              230 0.027 eee_d             1068 0.024 k_ee_d
235 0.018 m                 94 0.019 ee                175 0.020 k_ee              1052 0.023 eee
229 0.017 eee               92 0.019 ke                148 0.017 eee               1036 0.023 ke
179 0.014 ke                87 0.018 ee_d              147 0.017 she                868 0.019 ke_d
174 0.013 in                77 0.016 eee_d             112 0.013 in                 813 0.018 sh
149 0.011 k_eee             70 0.014 eee               111 0.013 l_k                643 0.014 she
133 0.010 she               63 0.013 in                104 0.012 ke                 632 0.014 m
114 0.009 d_ee              60 0.012 sh                 99 0.011 l_eee_d            631 0.014 eee_d
112 0.008 ckhe              55 0.011 eee_k              87 0.010 k_eee_d            622 0.014 ckh
110 0.008 k_sh              52 0.011 ee_ckh             67 0.008 m                  459 0.010 she_d
106 0.008 l_d               51 0.010 l_d                66 0.008 l_d                428 0.010 k_eee
64 0.005 n                 50 0.010 ir                 65 0.008 sh                 406 0.009 l_k
58 0.004 ee_ckh            49 0.010 she                63 0.007 ee_ckh             398 0.009 k_eee_d
.... ..... ..........      .... ..... ...............  .... ..... ................  .... ..... .............
1 0.000 ckh_s_ee_s         1 0.000 ee_sh_d             2 0.000 l_l                  6 0.000 she_ke
1 0.000 ckh_sh             1 0.000 eee_ckh_d           2 0.000 l_sh_ee_s            5 0.000 d_sh_ee_d
1 0.000 ckhe_iin           1 0.000 eee_ckhe            2 0.000 l_she_ckh            5 0.000 il
1 0.000 ckhe_k_k_k_l       1 0.000 eee_ckhe_d          2 0.000 l_she_k              5 0.000 sh_ee_k_ee
1 0.000 d_ee_ee_ckhe       1 0.000 eee_ee              2 0.000 r_ee_r               4 0.000 d_ee_ee_d
1 0.000 d_ee_ee_s          1 0.000 eee_eee             2 0.000 r_eee_k              4 0.000 d_sh_d
1 0.000 d_ee_eee           1 0.000 eee_k_ee_ee         2 0.000 r_k                  4 0.000 ee_ee_k_ee
.... ..... ..........      .... ..... ...............  .... ..... ................  .... ..... .............

Obviously, groups of two or more consecutive "K" elements are quite
common.  Here is the frequency for each repeat count:

foreach file ( hea-f heb-f bio-f vdp-z )
  cat ${file}.fac \
    | egrep -v '[@%#=]' \
    | sed \
        -e 's/^{[q_]}//g' \
        -e 's/^_//g' \
        -e 's/_$//g' \
        -e 's/[oay]/./g' \
        -e 's/[{}]//g' \
        -e 's/[a-z][a-z]*/x/g' \
    | tr '.' '\012' \
    | egrep -e '.' \
    | sort | uniq -c | expand | sort -b +0 -1nr \
    | compute-freqs | sed -e 's/^  //g' \
    > ${file}-kn.frq
  dicio-wc ${file}-kn.frq
end

lines file
------ ------------
5 hea-f-kn.frq
6 heb-f-kn.frq
6 bio-f-kn.frq
4 vdp-z-kn.frq

multicol {hea-f,heb-f,bio-f,vdp-z}-kn.frq

hea-f                  heb-f                    bio-f                    vdp-z
---------------------  -----------------------  -----------------------  -------------------
10849 0.819 x           3387 0.694 x             5527 0.640 x            33290 0.744 x
2149 0.162 x_x         1034 0.212 x_x           1966 0.228 x_x           8124 0.181 x_x
229 0.017 x_x_x        416 0.085 x_x_x         1038 0.120 x_x_x         3077 0.069 x_x_x
20 0.002 x_x_x_x       38 0.008 x_x_x_x         99 0.011 x_x_x_x        280 0.006 x_x_x_x
5 0.000 x_x_x_x_x      5 0.001 x_x_x_x_x        1 0.000 x_x_x_x_x
                       2 0.000 x_x_x_x_x_x      1 0.000 x_x_x_x_x_x

So strings of 3 consecutive "K" elements are relatively common,
strings of 4 are rare, and no word that occurs twice has
5 or more "K"s in a row.

Recall that about 50% of the "O" slots are empty, and about 50%
consist of one letter only.  If the "O" slots were filled
or empty at random, then we would expect the following statistics

0.500 x
0.250 x_x
0.125 x_x_x
0.063 x_x_x_x
0.031 x_x_x_x_x
0.016 x_x_x_x_x_x
0.008 x_x_x_x_x_x_x

So the statistics above suggest that in language A the distribution
of "O"s is more uniform than would be expected from chance.
(The case is not clear because the presence of short words
would bias the statistics towards entries with few consecutive "K"s.)

Note the significant difference in K-repeat frequencies for language
A and language B.  The frequencies for language B are closer to the
"random" model.

Analysis of "K" and "O" statistics
----------------------------------

What can we conclude from these numbers?  Let's consider the
alternatives:

(1) The EVA letters { a o y } are different Voynichese letters.

This theory does not look very promising: if they were
different letters, they should belong to the same class
(vowel, consonant, whatever); but then we would expect to see
a fair number of diphthongs (double-letter "O" strings),
which we don't see.

(2) The EVA letters { a o y } are the same Voynichese letter.

This theory could explain why there are so few double-letter
"O" slots: namely, because the Voynichese letter "o/a/y"
cannot occur twice in a row (a common restriction in natural
languages).

(3) Each "O" string is a modifier for (i.e. a part of) the
next "K" element; except for the final "O" string,
which stands on its own.

(4) Each "O" string is a modifier for the preceding "K" element;
except for the initial "O" string, which stands on its own.

(5) Some "K" elements may admit "O" letters as post-modifiers,
others as pre-modifiers, and others in both positions.

After a quick look, I would guess that

{ sh ch ee she che eee }            admit "a/o/y" only as post-modifiers
{ r l m n ir iir in iin iiin }      admit "a/o/y" only as pre-modifiers
{ s d k t cth ckh ke te cthe ckhe } admit "a/o/y" in both positions.

But this hunch needs to be confirmed...

(6) None of the above.

Appendix: A more flexible factoring script
------------------------------------------

The logic of factor-OK.sed was rewritten in AWK as
a "factor-field-OK" script that allows one to factor a
selected field of a multifield file.

Checking consistency of the two scripts:

foreach file ( hea-u hea-f heb-f bio-f vdp-z )
echo ${file}-old
cat ${file}.wds \
| sed -f factor-OK.sed \
> .${file}-old.fac
echo ${file}-new
cat ${file}.wds \
| factor-field-OK \
| gawk '/./{ print $1; }' \
> .${file}-new.fac
dicio-wc .${file}-{old,new}.fac
diff .${file}-{old,new}.fac
/bin/rm .${file}-{old,new}.fac
end

The differences are confined to words that factor-OK can't parse.
The new script will forcibly factor those words into elements
{i+X}, {X[eh]}, or {X} where {X} is a character other than [aoy].

Word and line breaks in the OKOKOKO model
-----------------------------------------

[1999-02-01]

It is instructive to analyze the immediate contexts of definite word
spaces (std), breaks due to figures in the text (fig),
intra-paragraph line breaks (lin), and adjacent letter pairs inside
words, with no break (non), in terms of the K/E/O classification.

For this study we will use the majority-vote transcription, which
includes Takeshi's new full transcription. For simplicity, let's
discard all words that contain unreadable or "weirdo"
characters, or the rare letters [buvxz]. Let's also map the upper
case EVA letters [SCIKTPF] to their lower case variants, since the
capitalization carries no information in those cases.

cat ../045/only-m.evt \
| egrep -e '^<[^<>]*;A>' \
| tr 'SCIKTPF' 'sciktpf' \
| tr -d '\!' \
| sed \
-e 's/^<[^<>]*> *//g' \
-e 's/[{][^{}]*[}]//g' \
-e 's/[&][0-9*?][0-9*?]*[;]\?/*/g' \
-e 's/[buxvz]/*/g' \
-e 's/[.,]*-[-.,]*/-/g' \
-e 's/[,]*[.][,.]*/./g' \
-e 's/[,][,]*/,/g' \
-e 's/.['"'"'"]/?/g' \
-e 's/[^-,./= ]*[%?*][^-,./= ]*/?/g' \
> base.txt

Let's reduce the alphabet to letter classes as follows:

O = [aoy]
I = [i]+
Q = [q]
E = unattached [ceh]
R = [djmg] and [rlsn]
X = <ee>, <ch>, <sh>, <ih>, [ci][ktpf][h], [c][ktpf], [ktpf]

The following hack should do it:

cat base.txt \
| sed \
-e 's/ee/X/g' \
-e 's/[csi][h]/X/g' \
-e 's/[ci][ktpf][h]/X/g' \
-e 's/[c][ktpf]/X/g' \
-e 's/[ktpf]/X/g' \
-e 's/[rlsn]/R/g' \
-e 's/[mdgj]/R/g' \
-e 's/[aoy]/O/g' \
-e 's/[q]/Q/g' \
-e 's/[i][i]*/I/g' \
-e 's/[ceh]/E/g' \
> base.clt

egrep '[^-.,=/?XEQROI]' base.clt > .bugs

Now let's count the pairs:

cat base.clt \
| sed \
-e 's/-\(.\)-/-\1\1-/g' \
-e 's/\(.\)-\(.\)/@\1-\2@/g' \
| tr '@' '\012' \
| egrep -e '^.-.$' \
| egrep -v -e '[?%*]' \
| sort | uniq -c | expand \
| sort +0 -1nr \
| compute-freqs \
> fig.breaks

cat base.clt \
| sed \
-e 's/[.]\(.\)[.]/.\1\1./g' \
-e 's/\(.\)[.]\(.\)/@\1.\2@/g' \
| tr '@' '\012' \
| egrep -e '^.[.].$' \
| egrep -v -e '[?%*]' \
| sort | uniq -c | expand \
| sort +0 -1nr \
| compute-freqs \
> std.breaks

cat base.clt \
| sed \
-e 's/^[-=., ]*\([^-=., ]\)[-=., ]*$/\1\1/g' \
-e 's/^[-=., ]*\([^-=., ]\)/\1@/g' \
-e 's/\([^-=., ]\)[-., ]*$/@\1\//g' \
| tr -d '\012' \
| tr '@' '\012' \
| egrep -e '^.[/].$' \
| egrep -v -e '[?%*]' \
| sort | uniq -c | expand \
| sort +0 -1nr \
| compute-freqs \
> lin.breaks

cat base.clt \
| sed \
-e 's/\([^-=., ]\)/\1@\1\!/g' \
-e 's/[\!:]*[-=:., ][-\!=:., ]*/@/g' \
| tr '@' '\012' \
| egrep -e '^.\!.$' \
| egrep -v -e '[?%*]' \
| sort | uniq -c | expand \
| sort +0 -1nr \
| compute-freqs \
> non.breaks

multicol -v titles="non std fig lin" {non,std,fig,lin}.breaks

Here is the result:

non                 std                 fig                 lin
------------------  ------------------  ------------------  ------------------
18340 0.1577 O!R     6363 0.2281 R.X      168 0.2205 R-O      636 0.2048 R/O
16712 0.1437 X!O     5797 0.2078 R.O      134 0.1759 O-O      594 0.1913 R/R
14897 0.1281 R!O     3703 0.1328 O.X      107 0.1404 O-R      415 0.1337 O/R
12804 0.1101 O!X     3287 0.1178 O.Q      107 0.1404 R-X      377 0.1214 O/O
 9634 0.0829 X!E     2786 0.0999 O.O      100 0.1312 O-X      334 0.1076 R/X
 8495 0.0731 X!X     2748 0.0985 O.R       98 0.1286 R-R      262 0.0844 R/Q
 6322 0.0544 O!I     1882 0.0675 R.R       16 0.0210 O-Q      248 0.0799 O/X
 6298 0.0542 I!R     1068 0.0383 R.Q        9 0.0118 X-X      199 0.0641 O/Q
 5160 0.0444 E!O       83 0.0030 X.X        9 0.0118 R-Q       18 0.0058 X/O
 5048 0.0434 Q!O       57 0.0020 X.O        6 0.0079 X-R        5 0.0016 X/R
 3792 0.0326 E!R       22 0.0008 R.E        4 0.0052 X-O        5 0.0016 O/E
 3071 0.0264 R!X       18 0.0006 E.O        3 0.0039 O-E        2 0.0006 X/X
 2866 0.0247 X!R       18 0.0006 X.R        1 0.0013 E-R        2 0.0006 X/Q
  939 0.0081 E!X       17 0.0006 E.X                            2 0.0006 R/E
  905 0.0078 R!R       11 0.0004 O.E                            1 0.0003 E/O
  531 0.0046 O!O       10 0.0004 X.Q                            1 0.0003 E/R
  152 0.0013 O!E        9 0.0003 E.R                            1 0.0003 I/O
   93 0.0008 R!E        5 0.0002 E.Q                            1 0.0003 I/Q
   52 0.0004 Q!X        2 0.0001 I.O                            1 0.0003 X/E
   37 0.0003 I!X        2 0.0001 I.R                            1 0.0003 Q/R
   33 0.0003 Q!E        2 0.0001 R.I
   28 0.0002 O!Q        1 0.0000 I.X
   15 0.0001 R!I        1 0.0000 X.E
   13 0.0001 X!I
   11 0.0001 I!O
    4 0.0000 E!E
    4 0.0000 Q!R
    3 0.0000 R!Q
    1 0.0000 E!I

So we can say that

(1) Line breaks occur almost only between { R O } and { R O X Q }
(with frequencies ranging from 6% to 20% of all line breaks);
rarely between X and { R O X Q }
(less than 0.9% of all line breaks);
and essentially NEVER after { Q I E } or before { I E }
(about 0.4% of all line breaks).

(2) Ordinary word breaks follow the same pattern:
the pairs between { R O } and { R O X Q }
have frequencies between 3.8% and 23%;
pairs between X and { R O X Q } have total
frequency of 0.6%; and all the remaining pairs
account for only 0.3% of the word breaks.

(3) Figure breaks too follow almost the same pattern:
the pairs between { R O } and { X R O }
have frequencies ranging from 13% to 22%,
but the pairs R-Q and O-Q are much rarer
than around line breaks and ordinary spaces,
about 1--2% each.  Breaks between X and { R O X Q }
are slightly more common (2.5% total) and
all other pairs are almost absent (about 0.5%).

(4) The relative frequencies of { R O X Q }
are approximately 2:2:1:1 after a line break,
and roughly 8:12:9:1 after a figure break
(from the counts in the table above), roughly
independently of the character before the break.

(5) The relative frequencies of { R O X Q }
after ordinary breaks seem to depend on the
preceding letter, but they are still
of the same order of magnitude.

(6) Inside words, the valid pairs are
{ QO, OX, OI, IX, XX, XE, XO, EX, EO }
(where X here stands for either X or R),
with frequencies ranging from 4.1% to 27%.
The remaining pairs have much lower frequencies
(OO accounts for 0.46% of all pairs, and OE
for only 0.13%).

These observations seem to imply that the "word spaces", line
breaks, and figure breaks are fairly similar and very distinct from
random inter-character pairs.

Their similarity, and the relative
independence of the next letter, strongly suggest that they are
indeed word boundaries.  In that case we conclude that
Voynichese words may end in O or R (40-45% and 50-60%, respectively)
or rarely X; and may begin only with X, O, R, or Q.

(A more detailed analysis would show that the O at end of words is
almost always <y>. Also the last R in a line is most often EVA <m>,
which is only rarely seen at the other kinds of word breaks.)

Point (3) shows that figure breaks are more like word breaks
than random inter-character breaks.  The depressed frequencies
of R-Q and O-Q call for an explanation, though.

Point (6) is a partial restatement of the QOKOKOKO paradigm.
Note that the pairs { QO OI IX XE EX EO }, which are fairly
common inside words, are not legal places for word spaces,
line, or figure breaks.

These data do not shed much light on whether each O and E
is attached to the preceding X, the following X, or sometimes
both, or neither.  Unfortunately there are (practically)
no figure breaks adjacent to an E.

Word pattern frequencies
------------------------

It is also instructive to analyze the frequency of each word pattern,
the result of collapsing the letters into the classes { Q O X I R E }
or { Q O K } defined above.

First, the { Q O X I R E } patterns:

cat base.clt \
| tr '., =/-' '\012\012\012\012\012\012' \
| egrep '.' \
| egrep -v '[?%*]' \
| sort | uniq -c | expand \
| sort +0 -1nr \
> QOIXER.frq

The result is a long-tailed distribution that begins

freq pattern
----- ----------
1832 XOR
1725 OR
1649 ROIR
1413 ROR
1209 OXOR
1084 XERO
940 XXO
903 XEO
817 OXOIR
786 QOXOR
745 XEOR
718 OIR
716 XO
703 OXXO
660 QOXOIR
560 QOXXO
487 R
480 QOXXRO
404 RO
382 OXXRO
379 XXOR
376 QOXERO
375 OXO
372 XOIR
370 OXERO
325 OROR
316 OXEOR
312 OXXOR
309 XXRO
307 OROIR
... ...

Let's now collapse the elements { X XE R IR } to a single class K,
and absorb the Q into the following O:

cat base.clt \
| tr '., =/-' '\012\012\012\012\012\012' \
| egrep '.' \
| egrep -v '[?%*]' \
| sed \
-e 's/XE/K/g' \
-e 's/X/K/g' \
-e 's/IR/K/g' \
-e 's/R/K/g' \
-e 's/QO/O/g' \
| sort | uniq -c | expand \
| sort +0 -1nr \
> QOK.frq

The result is still a relatively long-tailed distribution:

freq pattern
----- ----------
6061 KOK
4690 OKOK
3075 KKO
2704 OK
2646 OKKO
2023 KO
1531 OKKKO
1365 OKO
1346 KKOK
1236 KKKO
1052 KOKO
951 OKKOK
861 KOKOK
578 K
561 OKOKO
374 KOKKO
324 KK
309 OKOKOK
265 O
233 KKKOK
219 KKOKO
202 KKKKO
189 KKK
177 OKKOKO
175 OKKK
169 OKK
160 OOK
152 OKOKKO
142 KOKKOK
139 OKKKKO
... ...

Conversely, we can analyze the patterns of X and R ignoring the
{ Q E I O } complements:

cat base.clt \
| tr '., =/-' '\012\012\012\012\012\012' \
| egrep '.' \
| egrep -v '[?%*]' \
| tr -d 'QEIO' \
| sort | uniq -c | expand \
| sort +0 -1nr \
> XR.frq

The result is still a fairly broad distribution:

freq pattern
----- ----------
10441 XR
4319 RR
4006 R
3768 XXR
3682 XX
2999 X
1461 XRR
1279 RXR
538 RX
480 XXX
463 RRR
409 XXRR
366 RXXR
346 RXX
302 (empty: no X or R)
230 XXXR
151 XRXR
132 XRX
116 XRRR
90 RXRR
89 RRXR
59 RRRR
56 XXXX
50 RRX
31 XXRX
30 XXRRR
24 XRXX
23 RXXRR
23 RXXX
22 XRXXR
... ...
```
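
The class reduction and the { Q O K } collapse described in the note are both plain ordered substitutions, so they can be cross-checked outside the original csh/sed environment. The following is a minimal Python sketch, assuming lower-case EVA input words; the function names `reduce_word`, `to_qok` and `pattern_counts` are hypothetical, and the sample words are illustrative inputs rather than data from `base.txt`. It does not reproduce the note's filtering of unreadable words, nor the `compute-freqs` step.

```python
import re
from collections import Counter

# Ordered substitution rules, transcribed from the sed script that builds
# base.clt.  Order matters: digraphs and gallows groups are consumed
# before the remaining single letters are classified.
CLASS_RULES = [
    (r"ee", "X"),
    (r"[csi]h", "X"),
    (r"[ci][ktpf]h", "X"),
    (r"c[ktpf]", "X"),
    (r"[ktpf]", "X"),
    (r"[rlsn]", "R"),
    (r"[mdgj]", "R"),
    (r"[aoy]", "O"),
    (r"q", "Q"),
    (r"i+", "I"),
    (r"[ceh]", "E"),
]

# Collapse rules transcribed from the sed script that builds QOK.frq.
QOK_RULES = [("XE", "K"), ("X", "K"), ("IR", "K"), ("R", "K"), ("QO", "O")]


def reduce_word(word):
    """Map a lower-case EVA word to its { Q O X I R E } pattern."""
    for pattern, cls in CLASS_RULES:
        word = re.sub(pattern, cls, word)
    return word


def to_qok(pattern):
    """Collapse a { Q O X I R E } pattern to the { O K } alphabet."""
    for old, new in QOK_RULES:
        pattern = pattern.replace(old, new)
    return pattern


def pattern_counts(words, collapse=False):
    """Count word patterns; with collapse=True, count { O K } patterns."""
    patterns = (reduce_word(w) for w in words)
    if collapse:
        patterns = (to_qok(p) for p in patterns)
    return Counter(patterns)


if __name__ == "__main__":
    # Illustrative sample only (an assumption, not taken from the corpus).
    sample = ["daiin", "chedy", "qokeedy", "shol", "daiin"]
    for pattern, count in pattern_counts(sample).most_common():
        print(count, pattern)
    for pattern, count in pattern_counts(sample, collapse=True).most_common():
        print(count, pattern)
```

Keeping the rules as ordered lists mirrors the sequential `-e` expressions of the sed scripts, so a change to one pipeline can be ported to the other line by line.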