Hacking at the Voynich manuscript - Side notes
017 OKOKOKO: The fine structure of Voynichese words
Last edited on 1999-02-02 14:01:55 by stolfi
[ A first version of this note was posted around 1998-03-11, to the
voynich mailing list. This version was extensively revised between
1998-03-21 and 1998-03-29. The section about word and line breaks
was added on 1999-02-01. ]
[ If you decide to print this note, be warned that some lines have
almost 120 characters.]
The basic QOIXEOIXEO paradigm
-----------------------------
Let "X" be any set of letters. We can always break any string
whatsoever into zero or more "X"s, each surrounded by letters which
are not "X"s:
N X N X N X N ... X N
where "X" represents exactly one letter from the set, and
"N" is any string (possibly empty) of non-"X" letters.
Now let's apply this decomposition to the Voynichese words,
using as "X" the set of letters
{ sh ch ee
k ckh ck ikh
t cth ct ith
f cfh cf ifh
p cph cp iph
d
r l s g m n
}
(I am using the basic EVA alphabet, without capitals).
It turns out that, for this choice of "X", the intervening "N"
strings are highly constrained. In fact, most words can be
decomposed as
Q O I X E O I X E O I X E O... I X E O
where
Q is empty or "q";
O is zero or more elements from the set A = { a o y };
I is empty, or one of { i ii iii };
E is empty, or "e".
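As a rough illustration (in Python, not the csh/sed used for the actual counts in this note), the paradigm can be written as a single regular expression. The names X_SET, PARADIGM and fits_paradigm are made up for this sketch, and the test deliberately ignores any finer constraints on where "E" and "I" may occur.

```python
import re

# The set "X" listed above (basic EVA, no capitals).
X_SET = [
    "sh", "ch", "ee",
    "k", "ckh", "ck", "ikh",
    "t", "cth", "ct", "ith",
    "f", "cfh", "cf", "ifh",
    "p", "cph", "cp", "iph",
    "d",
    "r", "l", "s", "g", "m", "n",
]

# Longest alternatives first, so that e.g. "ckh" is tried before "ck".
X_ALT = "|".join(sorted(X_SET, key=len, reverse=True))

# Q O (I X E O)*  with  Q = q?,  O = [aoy]*,  I = i{0,3},  E = e?
PARADIGM = re.compile(rf"q?[aoy]*(?:i{{0,3}}(?:{X_ALT})e?[aoy]*)*")


def fits_paradigm(word):
    """True if the whole word can be read as Q O I X E O ... I X E O."""
    return PARADIGM.fullmatch(word) is not None


for w in ["okeeedy", "daiin", "qokeey", "oqokain", "okai"]:
    print(w, fits_paradigm(w))
```

The first three words fit the paradigm; "oqokain" (a "q" that is not word-initial) and "okai" (an "i" string with no following "X") do not.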
The QOKOKOKO schema
-------------------
In fact, we can constrain these pieces even more.
With very few exceptions,
"E" may be non-empty only after { sh ch ee k ckh t cth p cph f cfh d }
"I" may be non-empty only before { r l g m n s d }
Note that "d" is exceptional in that it may be accompanied by
either "e" or "i" strings; but the two are mutually exclusive.
(In fact the letter pairs "id" and "de" are both extremely rare.)
That is, we can write the generic word as
Q O K O K O K ... K O K O
where O is as above, and K is one of the "main elements"
{ k t p f
ke te pe fe
ckh cth cph cfh
ckhe cthe cphe cfhe
ikh ith iph ifh
ck ct cf cp
sh ch ee
she che eee
de
d r l g m n s
id ir il ig im in is
iid iir iil iig iim iin iis
iiid iiir iiil iiig iiim iiin iiis
}
Note that
* The letters "p" and "f" are probably ornate versions of other
letters: most likely "k" and "t", but perhaps others.
* Various statistics suggest that "k" and "t" may be the same letter.
* Ditto for "p" and "f".
* Ditto for "y" and "o".
* Ditto for "g" and "m".
* The letter "q" does not seem to be part of the word;
it may be an abbreviation for "and".
* The groups { ikh ith iph ifh } may be equivalent to
{ ckh cth cph cfh }, respectively.
* Instances of { ee eee } may be instances of { ch che }
with missing ligature.
Finally, many of the "K" elements are so rare that they are
probably errors. If we consider only elements with
frequency 0.1% or higher, and exclude the elements
with "i*h", "p", and "f", we are left with only 25
"significant" elements:
K* = { k ke ckh ckhe
t te cth cthe
ch che
sh she
ee eee
l m s d
n r
in ir
iin iir
iiin
}
Parsing ambiguities
-------------------
Note that the inclusion in "X" of the groups { ikh ith iph ifh }
does not create any ambiguity with the "I" modifiers, since the
presence of "h" after a tall letter forces one to parse the
preceding letter (which must be "i" or "c") as part of the same
element. Indeed, the elements { ikh ith iph ifh } may be merely
calligraphic variants of { ckh cth cph cfh }, and are the only
instances where the letters { k t p f } may be preceded by "i".
On the other hand, including the string "ee" in the set "X" leads to
an ambiguity in the parsing of words with three or more consecutive
"e"s. For example, "okeeedy" could be parsed either as
Q O I X E O I X  E O I X E O
- o - k - - - ee e - - d - y
or as
Q O I X E O I X  E O I X E O
- o - k e - - ee - - - d - y
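The two readings can also be found mechanically. The following Python sketch enumerates every way of splitting a word as Q O (I X E O)*, using the restrictions stated earlier ("E" only after certain "X"s, "I" only before { r l g m n s d }, never both around "d"). The function names are hypothetical, and each "I X E" group is printed as one piece, so that "ee" followed by "e" shows up as "eee".

```python
# Letter sets as given in the text; helper names are made up for this sketch.
X_SET = [
    "sh", "ch", "ee",
    "k", "ckh", "ck", "ikh",
    "t", "cth", "ct", "ith",
    "f", "cfh", "cf", "ifh",
    "p", "cph", "cp", "iph",
    "d", "r", "l", "s", "g", "m", "n",
]
E_OK = {"sh", "ch", "ee", "k", "ckh", "t", "cth", "p", "cph", "f", "cfh", "d"}
I_OK = {"r", "l", "g", "m", "n", "s", "d"}


def o_end(word, pos):
    """Index just past the run of [aoy] letters that starts at pos."""
    while pos < len(word) and word[pos] in "aoy":
        pos += 1
    return pos


def parses(word):
    """List every reading of word as Q O (I X E O)*, as lists of pieces."""
    found = []

    def walk(pos, pieces):
        if pos == len(word):
            found.append(pieces)
            return
        for n_i in range(4):                      # I = "", i, ii or iii
            if word[pos:pos + n_i] != "i" * n_i:
                break
            for x in X_SET:
                if not word.startswith(x, pos + n_i):
                    continue
                if n_i and x not in I_OK:
                    continue
                x_end = pos + n_i + len(x)
                for e in ("", "e"):
                    # "e" needs an X that admits it, and excludes an "i" prefix.
                    if e and (n_i or x not in E_OK):
                        continue
                    if not word.startswith(e, x_end):
                        continue
                    k_end = x_end + len(e)
                    stop = o_end(word, k_end)
                    tail = [word[k_end:stop]] if stop > k_end else []
                    walk(stop, pieces + [word[pos:k_end]] + tail)

    start = 1 if word.startswith("q") else 0
    stop = o_end(word, start)
    head = [word[:start]] if start else []
    if stop > start:
        head.append(word[start:stop])
    walk(stop, head)
    return found


for reading in parses("okeeedy"):
    print(" ".join(reading))
```

For "okeeedy" this prints "o k eee d y" and "o ke ee d y", i.e. the two parsings shown above; a word with only two "e"s, such as "okeedy", has a single reading.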
Several Voynichologists (Rene and Dennis, among others) are unhappy
about this ambiguity; they favor excluding "ee" from the set "X",
and perhaps allowing "ee" and "eee" as possible "E" modifiers.
But there are reasons for including "ee" in "X". For one thing,
while an isolated "e" is pretty common within words, it practically
never occurs right after { d r l } or before the first "X"; but "ee"
and "eee" often occur in those positions. That is, while a single
"e" must always be attached to a preceding "X", the groups "ee" and
"eee" can stand on their own, like the other "X" groups.
(One could argue that the "c" in the elements { ck ct cf cp }, which
may occur before any other "X" group in some words, is in fact an
instance of "e". However, in the few cases I have checked, the "c"
has a noticeable ligature, even though the matching "h" is
missing. So it seems indeed valid to write those combinations with
"c" and not with "e".)
One must keep in mind also that an "ee" group may well be a "ch"
element whose ligature was omitted (by the scribe or the
transcriber). Similarly, the very rare occurrences of "se" may
well be instances of "sh" with missing ligature.
Conversely, it may be that the `natural' form of the letters
{ ch che sh she } is { ee eee se see }, respectively; and the
ligatures are optional calligraphic devices added
to clarify the parsing, almost as an afterthought.
Parsing the text
----------------
The words that fail this "QOKOKOKO" pattern are quite rare.
Let's count them in the following files:
  hea-u.wds   a few herbal-A pages, which I carefully
              transcribed from Jacques Guy's images;
  hea-f.wds   herbal-A pages in Friedman's transcription;
  heb-f.wds   herbal-B pages in Friedman's transcription;
  bio-f.wds   biological (language B) pages in
              Friedman's transcription;
  vdp-z.wds   a list of all words that occur at least twice,
              transcribed by the EVMT team.
(The "-f" files were created between 97-11-11 and 98-11-12,
as {hea,heb,bio}-f-gut.wds, from Landini's interlinear
converted to EVA. The last one was created by expanding
a word frequency list posted by Rene Zandbergen in March 1998;
an entry "N W" in that list generated "N" copies of word "W"
in file "vdp-z.wds".)
foreach file ( hea-u hea-f heb-f bio-f vdp-z )
cat ${file}.wds \
| egrep -v '[*]' \
| sed -f factor-OK.sed \
> ${file}.fac
cat ${file}.fac \
| egrep -e '[#@%=]' \
> ${file}-weird.fac
dicio-wc ${file}.fac ${file}-weird.fac
end
--- factor-OK.sed ------------------------
# Map "sh", "ch", and "ee" to single letters to simplify the parsing.
# Note that "eee" groups are paired off from left end.
s/ch/C/g
s/sh/S/g
s/ee/E/g
# Map platformed and half-platformed letters to capitals to simplify the parsing:
s/ckh/K/g
s/cth/T/g
s/cfh/F/g
s/cph/P/g
#
s/ikh/G/g
s/ith/H/g
s/ifh/M/g
s/iph/N/g
#
s/ck/U/g
s/ct/V/g
s/cf/X/g
s/cp/Y/g
# Put down scanning head in "@" state
s/$/@/
:x
# If in "@" state, copy "[aoy]" group, and switch to "#" state:
s/\([aoy][aoy]*\)@/#\1/
s/@/#_/
# If in "#" state, copy next main letter and "e" complements,
# insert "}" delimiter, and switch to "%" or "=" state depending on
# whether "i"s are allowed or not:
s/\([CSEktfpKTFPd]e\)#/=\1}/g
s/\([CSEktfpKTFPGHMNUVXY]\)#/=\1}/g
s/\([rlgmnsd]\)#/%\1}/g
# If in "%" state, attach "i" string to group, go to "=" state:
s/\(iii\)%/=\1/
s/\(ii\)%/=\1/
s/\(i\)%/=\1/
s/%/=/
# If in "=" state, insert "{" delimiter, and go back to "@" state:
s/=/@{/
tx
# We should exit the loop only in the "#" state.
# Split "q" prefix and discard scanning head if done:
s/^[q]#/{q}/
s/^#/{_}/
# Unfold letter folding:
s/U/ck/g
s/V/ct/g
s/X/cf/g
s/Y/cp/g
#
s/G/ikh/g
s/H/ith/g
s/M/ifh/g
s/N/iph/g
#
s/K/ckh/g
s/T/cth/g
s/P/cph/g
s/F/cfh/g
#
s/C/ch/g
s/S/sh/g
s/E/ee/g
------------------------------------------
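For readers who find the sed state machine hard to follow, here is a Python sketch of what I understand the script to do. It is not the script used for the counts in this note: the names FOLDS, HEAD_RE, TOKEN_RE and factor are invented for the illustration, it scans left to right instead of right to left, and it simply returns None for a word that does not fit, where the sed script leaves one of the marker characters [#@%=] in its output.

```python
import re

# Same letter folding as the sed script, in the same order.
FOLDS = [
    ("ch", "C"), ("sh", "S"), ("ee", "E"),
    ("ckh", "K"), ("cth", "T"), ("cfh", "F"), ("cph", "P"),
    ("ikh", "G"), ("ith", "H"), ("ifh", "M"), ("iph", "N"),
    ("ck", "U"), ("ct", "V"), ("cf", "X"), ("cp", "Y"),
]

# Optional "q", then the first "O" string.
HEAD_RE = re.compile(r"(q?)([aoy]*)")

# One "K" element (main letter with its "e" or "i" modifier), then an "O" string.
TOKEN_RE = re.compile(
    r"([CSEktfpKTFPd]e|[CSEktfpKTFPGHMNUVXY]|i{0,3}[rlgmnsd])([aoy]*)"
)


def factor(word):
    """Return e.g. '{_}o{k}_{eee}_{d}y', or None if the word does not fit."""
    s = word
    for plain, code in FOLDS:
        s = s.replace(plain, code)
    head = HEAD_RE.match(s)
    out = ["{%s}" % (head.group(1) or "_"), head.group(2) or "_"]
    pos = head.end()
    while pos < len(s):
        tok = TOKEN_RE.match(s, pos)
        if tok is None:
            return None
        out.append("{%s}" % tok.group(1))
        out.append(tok.group(2) or "_")      # "_" marks an empty "O" slot
        pos = tok.end()
    text = "".join(out)
    for plain, code in FOLDS:                # undo the letter folding
        text = text.replace(code, plain)
    return text


for w in ["okeeedy", "qokeey", "daiin", "chckhhy"]:
    print(w, factor(w))
```

Because "ee" is folded before anything else, a run "eee" becomes "E" followed by "e" and is therefore reported as the single element "eee", which is the left-to-right pairing mentioned in the script's comments.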
lines words bytes file
------ ------- --------- ------------
803 803 11751 hea-u.fac
0 0 0 hea-u-weird.fac
lines words bytes file
------ ------- --------- ------------
7812 7812 113448 hea-f.fac
93 93 1144 hea-f-weird.fac
lines words bytes file
------ ------- --------- ------------
3223 3223 47932 heb-f.fac
46 46 564 heb-f-weird.fac
lines words bytes file
------ ------- --------- ------------
6182 6182 90650 bio-f.fac
39 39 474 bio-f-weird.fac
lines words bytes file
------ ------- --------- ------------
28939 28939 420444 vdp-z.fac
142 142 1339 vdp-z-weird.fac
So, the exceptions to the QOKOKOKO pattern are less than 1.5% in
Friedman's transcription, less than 0.5% in Rene's list, and none
in my own transcription.
(The last result is not that impressive, of course. Even though I did my
transcription before I had worked out the structure above, I already
had some intuition about it, so my reading was not impartial.)
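The percentages quoted above follow directly from the word counts. A small Python check, with the numbers copied from the dicio-wc output (the function name is just for this example):

```python
# (total words, words that failed the QOKOKOKO pattern), from the tables above.
COUNTS = {
    "hea-u": (803, 0),
    "hea-f": (7812, 93),
    "heb-f": (3223, 46),
    "bio-f": (6182, 39),
    "vdp-z": (28939, 142),
}


def exception_percent(total, weird):
    """Percentage of words that could not be factored."""
    return 100.0 * weird / total


for name, (total, weird) in COUNTS.items():
    print(f"{name}: {weird}/{total} = {exception_percent(total, weird):.2f}%")
```

The worst case is heb-f, at about 1.43%.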
The exceptions in Rene's word list
----------------------------------
Here is a breakdown of the 142 words (counting multiple occurrences)
in Rene's file that did not fit the QOKOKOKO pattern. (Let's keep in mind
that Rene's file only includes words that occur at least twice.)
It seems that some of these exceptions can be explained as "mutations"
from other letters: scribal errors, calligraphic variations, pen
running out of ink, vellum defects, spots, fading, and of course
poor copy quality. Some are harder to explain, however, and may
require extending the basic schema.
* Words with groups { ckhh cthh cphh cfhh } (42 cases):
chckhhy(9) cthhy(4) chcthhy(4) shcthhy(3) qcthhy(3) ckhhy(3)
chcphhy(3) chcfhhy(3) shocthhy(2) shcphhy(2) qcphhedy(2)
ockhhy(2) ocfhhy(2)
These exceptions account for 0.15% of all words. I propose that
these are calligraphic accidents; that is, "ckhh" is a "ckhe"
whose ligature was overextended, and similarly for the other
groups.
* Words with "oe" (41 cases):
qoedy(5) qoedaiin(3) oedy(2)
qoeol(5) qoear(2) qoeor(2)
qoekeey(3) oekaiin(3) qoekol(2) oekeey(2)
qoekedy(2) oekey(2) oekeody(2)
choety(2) choeky(2) sheoeky(2)
These exceptions account for approximately 0.15% of all words.
The cases with "eke" could be explained as instances of "ckh"
with missing ligature. The others may be true exceptions to the
schema.
Note that the "oe" occurs only at the beginning of the word, or
after the initial "q", or after an initial "ch" or "she" (which,
in language A, seem to behave like "q" to some extent).
* Words beginning with "e" or "qe" (20 cases):
ety(6) qekeey(3) qekchdy(3) qety(2) qekor(2) qekaiin(2)
etaiin(2)
These word-initial "e"s could be explained as partly erased
instances of { a o y }. Note that if we replace the initial "e"
by "o" or "y" we get fairly common words in all these cases.
* Words with the special letters "x" and "v" (20 cases):
x(10) v(8) xar(2)
Note that these letters (picnic table and caret) occur mostly as
isolated letters. Therefore, they may be non-phonetic symbols, or
abbreviations.
* Words with "e" after "s" (5 cases):
chsey(3) shese(2)
These exceptions could be instances of "sh" without the ligature.
* Isolated "e"s (4 cases):
e(4)
These exceptions could be instances of "s" with missing plume.
* Words with "eeb" (3 cases):
cheeb(3)
I propose that "eeb" is merely a calligraphic variation of
"an" or "iin".
* Words with "ykh" (3 cases):
ykhey(3)
I can't think of a good explanation for these cases.
* Letter "o" before "q" (2 cases):
oqokain(2)
Perhaps the extra "o" is a separate word, or part of the
previous one?
* Letter "i" in word-final position (2 cases):
okai(2)
These exceptions could be truncated "in" or "ir" groups.
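As a cross-check of the breakdown, here is a Python sketch that sorts the listed words into the categories above using simple patterns, tried in order. The words and counts are copied from the lists; the labels, the patterns and the order in which they are tried are my own choices for this illustration.

```python
import re
from collections import Counter

# Words and occurrence counts copied from the lists above.
EXCEPTIONS = {
    "chckhhy": 9, "cthhy": 4, "chcthhy": 4, "shcthhy": 3, "qcthhy": 3,
    "ckhhy": 3, "chcphhy": 3, "chcfhhy": 3, "shocthhy": 2, "shcphhy": 2,
    "qcphhedy": 2, "ockhhy": 2, "ocfhhy": 2,
    "qoedy": 5, "qoedaiin": 3, "oedy": 2, "qoeol": 5, "qoear": 2,
    "qoeor": 2, "qoekeey": 3, "oekaiin": 3, "qoekol": 2, "oekeey": 2,
    "qoekedy": 2, "oekey": 2, "oekeody": 2, "choety": 2, "choeky": 2,
    "sheoeky": 2,
    "ety": 6, "qekeey": 3, "qekchdy": 3, "qety": 2, "qekor": 2,
    "qekaiin": 2, "etaiin": 2,
    "x": 10, "v": 8, "xar": 2,
    "chsey": 3, "shese": 2,
    "e": 4,
    "cheeb": 3,
    "ykhey": 3,
    "oqokain": 2,
    "okai": 2,
}

# (label, pattern) pairs, tried in order; the first match wins.
RULES = [
    ("c?hh group", r"c[ktpf]hh"),
    ("oe", r"oe"),
    ("isolated e", r"^e$"),
    ("initial e or qe", r"^q?e"),
    ("x or v", r"[xv]"),
    ("e after s", r"se"),
    ("eeb", r"eeb"),
    ("ykh", r"ykh"),
    ("o before q", r"oq"),
    ("final i", r"i$"),
]


def classify(word):
    """Label of the first rule whose pattern occurs in the word."""
    for label, pattern in RULES:
        if re.search(pattern, word):
            return label
    return "unexplained"


totals = Counter()
for word, n in EXCEPTIONS.items():
    totals[classify(word)] += n

for label, _ in RULES:
    print(f"{totals[label]:4d}  {label}")
print(f"{sum(totals.values()):4d}  total")
```

The category totals come out as 42, 41, 4, 20, 20, 5, 3, 3, 2 and 2, which add up to the 142 exceptions.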
Frequencies for "K" elements
----------------------------
Here are the statistics for the "K" groups.
foreach file ( hea-u hea-f heb-f bio-f vdp-z )
cat ${file}.fac \
| egrep -v '[@%#=]' \
| sed \
-e 's/^[^{}]*{//g' \
-e 's/}[^{}]*$//g' \
-e 's/}[^{}]*{/./g' \
| tr '.' '\012' \
| egrep -e '.' \
| sort | uniq -c | expand | sort -b -k1,1nr \
| compute_freqs.gawk | sed -e 's/^ //g' \
> ${file}-k.frq
dicio-wc ${file}-k.frq
end
lines file
------ ------------
39 hea-u-k.frq
41 hea-f-k.frq
36 heb-f-k.frq
35 bio-f-k.frq
44 vdp-z-k.frq
multicol {hea-u,hea-f,heb-f,bio-f,vdp-z}-k.frq
hea-u             hea-f             heb-f             bio-f             vdp-z
----------------  ----------------  ----------------  ----------------  ----------------
  752 0.304 _      7024 0.297 _      2856 0.284 _      4627 0.242 _     24167 0.273 _
  292 0.118 ch     2524 0.107 ch     1459 0.145 d      2512 0.131 d      9928 0.112 d
  216 0.087 d      2194 0.093 d       765 0.076 k      2140 0.112 l      7523 0.085 l
  183 0.074 l      1702 0.072 l       608 0.060 l      1516 0.079 q      6470 0.073 k
  178 0.072 r      1466 0.062 r       600 0.060 r      1422 0.074 k      4855 0.055 r
  119 0.048 t      1257 0.053 k       557 0.055 ch      828 0.043 che    4698 0.053 ch
  102 0.041 k      1177 0.050 t       424 0.042 iin     804 0.042 r      4630 0.052 q
  101 0.041 iin    1090 0.046 iin     366 0.036 che     775 0.041 ee     3641 0.041 t
   93 0.038 sh      832 0.035 sh      353 0.035 t       723 0.038 iin    3545 0.040 iin
   76 0.031 s       695 0.029 q       321 0.032 q       670 0.035 she    3384 0.038 ee
   58 0.023 cth     632 0.027 s       250 0.025 ee      615 0.032 t      3328 0.038 che
   51 0.021 che     464 0.020 che     207 0.021 ke      476 0.025 ch     1663 0.019 she
   51 0.021 q       453 0.019 cth     176 0.017 s       377 0.020 ke     1644 0.019 sh
   36 0.015 m       353 0.015 ee      176 0.017 she     357 0.019 s      1608 0.018 s
   23 0.009 ee      253 0.011 m       175 0.017 sh      316 0.017 sh     1428 0.016 in
   22 0.009 in      216 0.009 p       123 0.012 p       194 0.010 te     1370 0.015 ke
   20 0.008 p       187 0.008 she     113 0.011 m       168 0.009 p       789 0.009 te
   19 0.008 she     186 0.008 in      110 0.011 te      142 0.007 ckh     734 0.008 p
   17 0.007 ckh     176 0.007 ckh      89 0.009 ckh     113 0.006 in      632 0.007 m
   11 0.004 te      130 0.005 ke       74 0.007 f        81 0.004 cth     573 0.006 cth
    8 0.003 ke       78 0.003 cph      67 0.007 in       72 0.004 m       511 0.006 ckh
    7 0.003 cph      75 0.003 f        51 0.005 ir       42 0.002 ckhe    379 0.004 ir
    5 0.002 ir       75 0.003 n        38 0.004 cth      31 0.002 ir      223 0.003 eee
    4 0.002 ct       70 0.003 te       24 0.002 cthe     29 0.002 cthe    177 0.002 ckhe
    4 0.002 iiin     65 0.003 cthe     19 0.002 ckhe     21 0.001 eee     134 0.002 cthe
    4 0.002 n        59 0.002 ir       13 0.001 eee      21 0.001 f       125 0.001 f
    3 0.001 cthe     57 0.002 eee      13 0.001 iir      12 0.001 cphe    116 0.001 iiin
    3 0.001 de       47 0.002 ckhe      9 0.001 iiin     10 0.001 n        95 0.001 iir
    3 0.001 eee      27 0.001 cfh       6 0.001 cphe      7 0.000 cph      82 0.001 cph
    3 0.001 f        24 0.001 iir       5 0.000 cfh       7 0.000 iiin     67 0.001 n
    3 0.001 iir      21 0.001 cphe      5 0.000 cph       5 0.000 de       43 0.000 cphe
    2 0.001 cfh      20 0.001 iiin      5 0.000 de        4 0.000 cfh      26 0.000 g
    2 0.001 ck        8 0.000 de        5 0.000 n         3 0.000 il       21 0.000 im
    1 0.000 cf        6 0.000 iim       4 0.000 cfhe      2 0.000 iir      18 0.000 cfh
    1 0.000 ckhe      3 0.000 cfhe      2 0.000 id        1 0.000 pe       12 0.000 ikh
    1 0.000 cphe      3 0.000 iid       1 0.000 iil                        10 0.000 ct
    1 0.000 iid       3 0.000 iil                                           8 0.000 ith
    1 0.000 iim       2 0.000 id                                            7 0.000 ck
    1 0.000 im        2 0.000 iis                                           7 0.000 iid
                      1 0.000 il                                            7 0.000 il
                      1 0.000 is                                            2 0.000 cfhe
                                                                            2 0.000 de
                                                                            2 0.000 iim
                                                                            2 0.000 iis
In these tables, the "_" entry represents the empty "Q" slot.
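For those without the compute_freqs.gawk helper, here is a Python sketch of the same tally. It assumes the factored format in which the "Q" slot and each "K" element are enclosed in braces, as in "{_}o{k}_{eee}_{d}y"; the function names are invented for the example, and ties are broken alphabetically, which may differ from the sort order used in the tables.

```python
import re
from collections import Counter


def elements(fac):
    """Return the brace-delimited pieces of one factored word.

    The first piece is the "Q" slot ("q" or "_"); the rest are "K" elements.
    """
    return re.findall(r"\{([^{}]*)\}", fac)


def element_frequencies(fac_lines):
    """Rows of (count, relative frequency, element), most frequent first."""
    counts = Counter()
    for line in fac_lines:
        if re.search(r"[@%#=]", line):   # skip words the factoring rejected
            continue
        counts.update(elements(line))
    total = sum(counts.values())
    rows = [(n, n / total, e) for e, n in counts.items()]
    return sorted(rows, key=lambda row: (-row[0], row[2]))


sample = ["{_}o{k}_{eee}_{d}y", "{q}o{k}_{ee}y", "{_}_{d}a{iin}_"]
for count, freq, elem in element_frequencies(sample):
    print(f"{count:5d} {freq:.3f} {elem}")
```

Note that, as in the tables, the frequencies are relative to the total number of pieces, including the "Q" slots.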
Let's extract from those tables the elements that are not in the
reduced set "K*" and are not simple uses of the `jokers' "p" and "f":
foreach file ( hea-u hea-f heb-f bio-f vdp-z )
cat ${file}-k.frq \
| egrep -v ' (([ktpf]|c[ktpf]h|[cs]h|ee)(e|)|[_qlmsdnr]|i[nr]|ii[nr]|iiin)$' \
> ${file}-knr.frq
end
multicol {hea-u,hea-f,heb-f,bio-f,vdp-z}-knr.frq
hea-u          hea-f          heb-f          bio-f          vdp-z
-------------  -------------  -------------  -------------  -------------
  4 0.002 ct     8 0.000 de     5 0.000 de     5 0.000 de    26 0.000 g
  3 0.001 de     6 0.000 iim    2 0.000 id     3 0.000 il    21 0.000 im
  2 0.001 ck     3 0.000 iid    1 0.000 iil                  12 0.000 ikh
  1 0.000 cf     3 0.000 iil                                 10 0.000 ct
  1 0.000 iid    2 0.000 id                                   8 0.000 ith
  1 0.000 iim    2 0.000 iis                                  7 0.000 ck
  1 0.000 im     1 0.000 il                                   7 0.000 iid
                 1 0.000 is                                   7 0.000 il
                                                              2 0.000 de
                                                              2 0.000 iim
                                                              2 0.000 iis
Recall that strings with three or more "e"s have ambiguous parsing,
which affects the statistics of "ee" and all elements with the "e"
modifier. The factor-OK.sed script arbitrarily pairs the "e"s from the
left, so that such strings are parsed as zero or more "ee"s
followed by one "ee" or "eee".
To assess the implications of this ambiguity, let's check how
many ambiguous strings we have in each file:
foreach file ( hea-u hea-f heb-f bio-f vdp-z )
cat ${file}.wds \
| egrep -v '[*]' \
| sed -e 's/[^e]/./g' \
| tr '.' '\012' \
| egrep '.' \
| sort | uniq -c | expand | sort -k1,1nr \
| compute_freqs.gawk | sed -e 's/^ //g' \
> ${file}-eee.frq
dicio-wc ${file}-eee.frq
end
multicol {hea-u,hea-f,heb-f,bio-f,vdp-z}-eee.frq
hea-u            hea-f            heb-f            bio-f            vdp-z
---------------  ---------------  ---------------  ---------------  ---------------
  97 0.789 e     1069 0.721 e      952 0.782 e     2187 0.732 e     7593 0.677 e
  23 0.187 ee     355 0.239 ee     253 0.208 ee     779 0.261 ee    3395 0.303 ee
   3 0.024 eee     57 0.038 eee     13 0.011 eee     21 0.007 eee    223 0.020 eee
                    2 0.001 eeee
Note that, surprisingly, there are practically no words with four or more
"e"s in a row.
My factoring script will parse the "eee" strings as one "eee"
element. In all files, the frequency of the "eee" element is at
most about 0.003 (i.e. 0.3% of the total "K" elements). Therefore, if
I had used the other parsing ("e" + "ee"), the frequency of "ee", and
the combined frequency of the other "e"-modified elements, would each
increase by at most about that much.
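The run-length counts used in this argument are easy to reproduce. A Python sketch of the tally (the function name is made up; as in the pipeline, words containing "*" are skipped):

```python
import re
from collections import Counter


def e_run_counts(words):
    """Count the maximal runs of consecutive "e"s, keyed by the run itself."""
    runs = Counter()
    for word in words:
        if "*" in word:          # unreadable character: skip the word
            continue
        runs.update(re.findall(r"e+", word))
    return runs


sample = ["okeeedy", "chedy", "qokeey", "o*eey"]
for run, n in sorted(e_run_counts(sample).items()):
    print(n, run)
```

Only runs of three or more "e"s are ambiguous under the two parsings discussed above.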
By the way, the low frequency of "eee" probably means that
its ambiguity would be no big problem for the intended readers.
In fact, the absence of "eeee"s could be explained by the following
theory: the letters "ch" and "sh" are officially written "ee" and
"se"; since that would lead to ambiguities, the scribe
routinely (but not invariably) adds ligatures to indicate
the intended grouping.
Frequencies of "K" elements in languages A and B
------------------------------------------------
In the "K" frequency tables above we can already see a marked difference
between languages A and B. Looking only at the reduced element subset K*,
plus "q" and "_" (meaning no "q"):
foreach file ( hea-f heb-f )
cat ${file}-k.frq \
| egrep ' (([kt]|c[kt]h|[cs]h|ee)(e|)|[_qlmsdnr]|i[nr]|ii[nr]|iiin)$' \
> ${file}-kr.frq
end
multicol {hea-f,heb-f}-kr.frq
hea-f heb-f
---------------- ----------------
7024 0.297 _ 2856 0.284 _
2524 0.107 ch 1459 0.145 d
2194 0.093 d 765 0.076 k
1702 0.072 l 608 0.060 l
1466 0.062 r 600 0.060 r
1257 0.053 k 557 0.055 ch
1177 0.050 t 424 0.042 iin
1090 0.046 iin 366 0.036 che
832 0.035 sh 353 0.035 t
695 0.029 q 321 0.032 q
632 0.027 s 250 0.025 ee
464 0.020 che 207 0.021 ke
453 0.019 cth 176 0.017 s
353 0.015 ee 176 0.017 she
253 0.011 m 175 0.017 sh
187 0.008 she 113 0.011 m
186 0.008 in 110 0.011 te
176 0.007 ckh 89 0.009 ckh
130 0.005 ke 67 0.007 in
75 0.003 n 51 0.005 ir
70 0.003 te 38 0.004 cth
65 0.003 cthe 24 0.002 cthe
59 0.002 ir 19 0.002 ckhe
57 0.002 eee 13 0.001 eee
47 0.002 ckhe 13 0.001 iir
24 0.001 iir 9 0.001 iiin
20 0.001 iiin 5 0.000 n
There is also a less marked but still significant difference between
herbal-B and bio-B:
foreach file ( heb-f bio-f )
cat ${file}-k.frq \
| egrep ' (([kt]|c[kt]h|[cs]h|ee)(e|)|[_qlmsdnr]|i[nr]|ii[nr]|iiin)$' \
> ${file}-kr.frq
end
multicol {heb-f,bio-f}-kr.frq
heb-f bio-f
---------------- ----------------
2856 0.284 _ 4627 0.242 _
1459 0.145 d 2512 0.131 d
765 0.076 k 2140 0.112 l
608 0.060 l 1516 0.079 q
600 0.060 r 1422 0.074 k
557 0.055 ch 828 0.043 che
424 0.042 iin 804 0.042 r
366 0.036 che 775 0.041 ee
353 0.035 t 723 0.038 iin
321 0.032 q 670 0.035 she
250 0.025 ee 615 0.032 t
207 0.021 ke 476 0.025 ch
176 0.017 s 377 0.020 ke
176 0.017 she 357 0.019 s
175 0.017 sh 316 0.017 sh
113 0.011 m 194 0.010 te
110 0.011 te 142 0.007 ckh
89 0.009 ckh 113 0.006 in
67 0.007 in 81 0.004 cth
51 0.005 ir 72 0.004 m
38 0.004 cth 42 0.002 ckhe
24 0.002 cthe 31 0.002 ir
19 0.002 ckhe 29 0.002 cthe
13 0.001 eee 21 0.001 eee
13 0.001 iir 10 0.001 n
9 0.001 iiin 7 0.000 iiin
5 0.000 n 2 0.000 iir
However, most of that difference disappears if we:
(1) identify the letters { k t p f}, which we have
good reasons to believe are the same letter;
(2) omit the letter "q", which is believed to be
a symbol for "and", and hence might be correlated
with subject matter;
(3) identify "ee" with "ch".
foreach file ( hea-u hea-f heb-f bio-f vdp-z )
cat ${file}.fac \
| egrep -v '[@%#=]' \
| sed \
-e 's/^[^{}]*{//g' \
-e 's/}[^{}]*$//g' \
-e 's/}[^{}]*{/./g' \
-e 's/[ktpf]/k/g' \
-e 's/ee/ch/g' \
-e 's/q//g' \
| tr '.' '\012' \
| egrep -e '.' \
| sort | uniq -c | expand | sort -b -k1,1nr \
| compute_freqs.gawk | sed -e 's/^ //g' \
| egrep ' (([kt]|c[kt]h|[cs]h|ee)(e|)|[_qlmsdnr]|i[nr]|ii[nr]|iiin)$' \
> ${file}-krr.frq
dicio-wc ${file}-krr.frq
end
multicol {hea-u,hea-f,heb-f,bio-f,vdp-z}-krr.frq
hea-u hea-f heb-f bio-f vdp-z
---------------- ---------------- ---------------- ---------------- ----------------
752 0.310 _ 7024 0.306 _ 2856 0.293 _ 4627 0.263 _ 24167 0.288 _
315 0.130 ch 2877 0.125 ch 1459 0.150 d 2512 0.143 d 10970 0.131 k
244 0.101 k 2725 0.119 k 1315 0.135 k 2226 0.126 k 9928 0.118 d
216 0.089 d 2194 0.096 d 807 0.083 ch 2140 0.122 l 8082 0.096 ch
183 0.075 l 1702 0.074 l 608 0.062 l 1251 0.071 ch 7523 0.089 l
178 0.073 r 1466 0.064 r 600 0.062 r 849 0.048 che 4855 0.058 r
101 0.042 iin 1090 0.047 iin 424 0.043 iin 804 0.046 r 3551 0.042 che
93 0.038 sh 832 0.036 sh 379 0.039 che 723 0.041 iin 3545 0.042 iin
84 0.035 ckh 734 0.032 ckh 317 0.033 ke 670 0.038 she 2159 0.026 ke
76 0.031 s 632 0.028 s 176 0.018 s 572 0.032 ke 1663 0.020 she
54 0.022 che 521 0.023 che 176 0.018 she 357 0.020 s 1644 0.020 sh
36 0.015 m 253 0.011 m 175 0.018 sh 316 0.018 sh 1608 0.019 s
22 0.009 in 200 0.009 ke 137 0.014 ckh 234 0.013 ckh 1428 0.017 in
19 0.008 ke 187 0.008 she 113 0.012 m 113 0.006 in 1184 0.014 ckh
19 0.008 she 186 0.008 in 67 0.007 in 83 0.005 ckhe 632 0.008 m
5 0.002 ckhe 136 0.006 ckhe 53 0.005 ckhe 72 0.004 m 379 0.005 ir
5 0.002 ir 75 0.003 n 51 0.005 ir 31 0.002 ir 356 0.004 ckhe
4 0.002 iiin 59 0.003 ir 13 0.001 iir 10 0.001 n 116 0.001 iiin
4 0.002 n 24 0.001 iir 9 0.001 iiin 7 0.000 iiin 95 0.001 iir
3 0.001 iir 20 0.001 iiin 5 0.001 n 2 0.000 iir 67 0.001 n
Statistics of "O" strings
-------------------------
Now, what do we do with the "O" strings? Let's look at their
statistics:
foreach file ( hea-u hea-f heb-f bio-f vdp-z )
cat ${file}.fac \
| egrep -v '[@%#=]' \
| sed -e 's/{[^{}]*}/./g' \
| tr '.' '\012' \
| egrep -e '.' \
| sort | uniq -c | expand | sort -b -k1,1nr \
| compute_freqs.gawk | sed -e 's/^ //g' \
> ${file}-ooo.frq
dicio-wc ${file}-ooo.frq
end
lines file
------ ------------
9 hea-u-ooo.frq
15 hea-f-ooo.frq
11 heb-f-ooo.frq
12 bio-f-ooo.frq
11 vdp-z-ooo.frq
multicol {hea-u,hea-f,heb-f,bio-f,vdp-z}-ooo.frq
hea-u            hea-f            heb-f            bio-f            vdp-z
---------------  ---------------  ---------------  ---------------  ---------------
 1364 0.551 _    12782 0.540 _     5371 0.533 _    10295 0.538 _    45585 0.514 _
  595 0.240 o     5444 0.230 o     1712 0.170 y     3558 0.186 o    18671 0.211 o
  262 0.106 y     3069 0.130 y     1616 0.160 o     3413 0.178 y    13615 0.154 y
  234 0.094 a     2188 0.092 a     1325 0.132 a     1835 0.096 a    10544 0.119 a
   11 0.004 oa      70 0.003 oa      16 0.002 oa       6 0.000 yo     171 0.002 oa
    6 0.002 oy      59 0.002 oy      11 0.001 oy       4 0.000 oy      51 0.001 oy
    2 0.001 ao      16 0.001 oo       8 0.001 oo       3 0.000 ay      23 0.000 oo
    2 0.001 oo      14 0.001 yo       6 0.001 yo       3 0.000 oa      12 0.000 yo
    1 0.000 yo       5 0.000 ay       2 0.000 ay       2 0.000 ao       6 0.000 ay
                     4 0.000 ya       1 0.000 ao       2 0.000 ya       6 0.000 ya
                     2 0.000 ao       1 0.000 ya       1 0.000 aoy      2 0.000 yy
                     2 0.000 yoa                       1 0.000 oaa
                     1 0.000 aa
                     1 0.000 oao
                     1 0.000 yay
Thus, the only common alternatives are empty, "y", "a", and "o". In
fact, as we know, the alternative "y" is common only in initial and
final positions; and in those positions it seems to be equivalent to
"o".
Note that about half of the "O" slots are filled (i.e. the ratio K:O
is about 2:1). Therefore, if the "K" elements were randomly mixed
with "O" letters, the "O" slots should be about
67% empty,
22% single-letter,
7% double-letter, and
2% triple-letter.
Instead we see about
50% empty,
50% single-letter,
<1% double-letter, and
<0.1% triple-letter.
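The "random mixing" figures come from a geometric distribution: if each symbol is independently an "O" letter with probability 1/3 (ratio K:O = 2:1), then a slot holds exactly n letters with probability (2/3)(1/3)^n. A short Python sketch of that calculation (the function name is mine):

```python
def slot_length_probs(k_to_o_ratio, max_len):
    """P(an "O" slot holds exactly n letters), for n = 0 .. max_len,
    when "K" elements and "O" letters are mixed at random."""
    p_o = 1.0 / (1.0 + k_to_o_ratio)     # chance that the next symbol is an "O"
    return [(1.0 - p_o) * p_o ** n for n in range(max_len + 1)]


for n, p in enumerate(slot_length_probs(2.0, 3)):
    print(f"{n} letters: {100 * p:.0f}%")
```

This prints 67%, 22%, 7% and 2%, the figures of the random model quoted above.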
In fact, triple-letter "O"s are so rare that they can be assumed to
be errors. In Rene's good-quality word list (vdp-z.wds) there are no
triple-letter "O"s at all.
Statistics of "K" strings
-------------------------
Let's now look at the clusters of "K" elements between consecutive
non-empty "O"s. To reduce the size of the output, let's map the
letters { k t p f } to "k", and "ch" to "ee":
foreach file ( hea-f heb-f bio-f vdp-z )
cat ${file}.fac \
| egrep -v '[@%#=]' \
| sed \
-e 's/^{[q_]}//g' \
-e 's/^_//g' \
-e 's/_$//g' \
-e 's/[oay]/./g' \
-e 's/[{}]//g' \
-e 's/[ktpf]/k/g' \
-e 's/ch/ee/g' \
| tr '.' '\012' \
| egrep -e '.' \
| sort | uniq -c | expand | sort -b -k1,1nr \
| compute_freqs.gawk | sed -e 's/^ //g' \
> ${file}-kkk.frq
dicio-wc ${file}-kkk.frq
end
lines file
------ ------------
257 hea-f-kkk.frq
213 heb-f-kkk.frq
265 bio-f-kkk.frq
232 vdp-z-kkk.frq
multicol {hea-f,heb-f,bio-f,vdp-z}-kkk.frq > multi-kkk.frq
hea-f heb-f bio-f vdp-z
------------------------- -------------------------- --------------------------- -----------------------
1733 0.131 d 663 0.136 k 1424 0.165 l 5345 0.119 k
1441 0.109 l 568 0.116 r 1148 0.133 k 5296 0.118 l
1380 0.104 r 550 0.113 d 729 0.084 r 4714 0.105 r
1237 0.093 k 419 0.086 iin 720 0.083 iin 4146 0.093 d
1164 0.088 ee 408 0.084 l 464 0.054 d 3534 0.079 iin
1078 0.081 iin 172 0.035 ke_d 379 0.044 ke_d 1994 0.045 k_ee
931 0.070 k_ee 161 0.033 k_ee_d 331 0.038 k_ee_d 1861 0.042 ee
592 0.045 ckh 114 0.023 s 260 0.030 s 1424 0.032 in
553 0.042 sh 107 0.022 m 258 0.030 she_d 1232 0.028 s
426 0.032 s 105 0.022 k_ee 230 0.027 eee_d 1068 0.024 k_ee_d
235 0.018 m 94 0.019 ee 175 0.020 k_ee 1052 0.023 eee
229 0.017 eee 92 0.019 ke 148 0.017 eee 1036 0.023 ke
179 0.014 ke 87 0.018 ee_d 147 0.017 she 868 0.019 ke_d
174 0.013 in 77 0.016 eee_d 112 0.013 in 813 0.018 sh
149 0.011 k_eee 70 0.014 eee 111 0.013 l_k 643 0.014 she
133 0.010 she 63 0.013 in 104 0.012 ke 632 0.014 m
114 0.009 d_ee 60 0.012 sh 99 0.011 l_eee_d 631 0.014 eee_d
112 0.008 ckhe 55 0.011 eee_k 87 0.010 k_eee_d 622 0.014 ckh
110 0.008 k_sh 52 0.011 ee_ckh 67 0.008 m 459 0.010 she_d
106 0.008 l_d 51 0.010 l_d 66 0.008 l_d 428 0.010 k_eee
64 0.005 n 50 0.010 ir 65 0.008 sh 406 0.009 l_k
58 0.004 ee_ckh 49 0.010 she 63 0.007 ee_ckh 398 0.009 k_eee_d
.... ..... .......... .... ..... ............... .... ..... ................ .... ..... .............
1 0.000 ckh_s_ee_s 1 0.000 ee_sh_d 2 0.000 l_l 6 0.000 she_ke
1 0.000 ckh_sh 1 0.000 eee_ckh_d 2 0.000 l_sh_ee_s 5 0.000 d_sh_ee_d
1 0.000 ckhe_iin 1 0.000 eee_ckhe 2 0.000 l_she_ckh 5 0.000 il
1 0.000 ckhe_k_k_k_l 1 0.000 eee_ckhe_d 2 0.000 l_she_k 5 0.000 sh_ee_k_ee
1 0.000 d_ee_ee_ckhe 1 0.000 eee_ee 2 0.000 r_ee_r 4 0.000 d_ee_ee_d
1 0.000 d_ee_ee_s 1 0.000 eee_eee 2 0.000 r_eee_k 4 0.000 d_sh_d
1 0.000 d_ee_eee 1 0.000 eee_k_ee_ee 2 0.000 r_k 4 0.000 ee_ee_k_ee
.... ..... .......... .... ..... ............... .... ..... ................ .... ..... .............
Obviously, groups of two or more consecutive "K" elements are quite
common. Here is the frequency for each repeat count:
foreach file ( hea-f heb-f bio-f vdp-z )
cat ${file}.fac \
| egrep -v '[@%#=]' \
| sed \
-e 's/^{[q_]}//g' \
-e 's/^_//g' \
-e 's/_$//g' \
-e 's/[oay]/./g' \
-e 's/[{}]//g' \
-e 's/[a-z][a-z]*/x/g' \
| tr '.' '\012' \
| egrep -e '.' \
| sort | uniq -c | expand | sort -b -k1,1nr \
| compute_freqs.gawk | sed -e 's/^ //g' \
> ${file}-kn.frq
dicio-wc ${file}-kn.frq
end
lines file
------ ------------
5 hea-f-kn.frq
6 heb-f-kn.frq
6 bio-f-kn.frq
4 vdp-z-kn.frq
multicol {hea-f,heb-f,bio-f,vdp-z}-kn.frq
hea-f                    heb-f                    bio-f                    vdp-z
-----------------------  -----------------------  -----------------------  -----------------------
10849 0.819 x             3387 0.694 x             5527 0.640 x            33290 0.744 x
 2149 0.162 x_x           1034 0.212 x_x           1966 0.228 x_x           8124 0.181 x_x
  229 0.017 x_x_x          416 0.085 x_x_x         1038 0.120 x_x_x         3077 0.069 x_x_x
   20 0.002 x_x_x_x         38 0.008 x_x_x_x         99 0.011 x_x_x_x        280 0.006 x_x_x_x
    5 0.000 x_x_x_x_x        5 0.001 x_x_x_x_x        1 0.000 x_x_x_x_x
                             2 0.000 x_x_x_x_x_x      1 0.000 x_x_x_x_x_x
So strings of 3 consecutive "K" elements are relatively common,
strings of 4 are rare, and no word that occurs twice has
5 or more "K"s in a row.
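Here is a Python sketch of how the "K" clusters and their lengths can be pulled out of a factored word, following the same steps as the sed commands above. It assumes the brace format of the factored files (e.g. "{q}o{t}_{ch}_{d}y"), and the function names are invented for the example.

```python
import re
from collections import Counter


def k_clusters(fac):
    """Clusters of "K" elements lying between non-empty "O" strings.

    Letters {k t p f} are mapped to "k" and "ch" to "ee", as in the tables.
    """
    s = re.sub(r"^\{[q_]\}", "", fac)    # drop the "Q" slot
    s = re.sub(r"^_", "", s)             # drop an empty initial "O"
    s = re.sub(r"_$", "", s)             # drop an empty final "O"
    s = re.sub(r"[oay]", ".", s)         # non-empty "O"s become separators
    s = re.sub(r"[{}]", "", s)
    s = re.sub(r"[ktpf]", "k", s)
    s = s.replace("ch", "ee")
    return [c for c in s.split(".") if c]


def repeat_counts(fac_lines):
    """How many clusters consist of 1, 2, 3, ... consecutive "K" elements."""
    counts = Counter()
    for line in fac_lines:
        for cluster in k_clusters(line):
            counts[cluster.count("_") + 1] += 1
    return counts


sample = ["{q}o{t}_{ch}_{d}y", "{_}_{d}a{iin}_"]
print([k_clusters(w) for w in sample])
print(sorted(repeat_counts(sample).items()))
```

The first sample word yields the single cluster "k_ee_d" (three elements); the second yields two one-element clusters, "d" and "iin".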
Recall that about 50% of the "O" slots are empty, and about 50%
consist of one letter only. If the "O" slots were filled
or empty at random, then we would expect the following statistics
0.500 x
0.250 x_x
0.125 x_x_x
0.063 x_x_x_x
0.031 x_x_x_x_x
0.016 x_x_x_x_x_x
0.008 x_x_x_x_x_x_x
So the statistics above suggest that in language A the distribution
of "O"s is more uniform than would be expected from chance.
(The case is not clear because the presence of short words
would bias the statistics towards entries with few consecutive "K"s.)
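To put the observed cluster lengths next to this coin-flip model, here is a small Python sketch. The counts are the hea-f and bio-f columns of the repeat-count table; the function names are mine, and the model is simply 0.5^n for a cluster of n elements.

```python
# Cluster counts by length (1, 2, 3, ... "K" elements), from the table above.
HEA_F = [10849, 2149, 229, 20, 5]
BIO_F = [5527, 1966, 1038, 99, 1, 1]


def fractions(counts):
    """Observed share of clusters of each length."""
    total = sum(counts)
    return [n / total for n in counts]


def random_model(n_rows):
    """Expected shares if each "O" slot were empty with probability 1/2."""
    return [0.5 ** n for n in range(1, n_rows + 1)]


for name, counts in [("hea-f", HEA_F), ("bio-f", BIO_F)]:
    print(name)
    for obs, exp in zip(fractions(counts), random_model(len(counts))):
        print(f"  observed {obs:.3f}   random {exp:.3f}")
```

In both files single-element clusters are more common than the model predicts, but the excess is much larger in hea-f (0.819) than in bio-f (0.640).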
Note the significant difference in K-repeat frequencies for language
A and language B. The frequencies for language B are closer to the
"random" model.
Analysis of "K" and "O" statistics
----------------------------------
What can we conclude from these numbers? Let's consider the
alternatives:
(1) The EVA letters { a o y } are different Voynichese letters.
This theory does not look very promising: if they were
different letters, they should belong to the same class
(vowel, consonant, whatever); but then we would expect to see
a fair number of diphthongs (double-letter "O" strings),
which we don't see.
(2) The EVA letters { a o y } are the same Voynichese letter.
This theory could explain why there are so few double-letter
"O" slots: namely, because the Voynichese letter "o/a/y"
cannot occur twice in a row (a common restriction in natural
languages).
(3) Each "O" string is a modifier for (i.e. a part of) the
next "K" element; except for the final "O" string,
which stands on its own.
(4) Each "O" string is a modifier for the preceding "K" element;
except for the initial "O" string, which stands on its own.
(5) Some "K" elements may admit "O" letters as post-modifiers,
some may admit them as pre-modifiers, some may admit both.
After a quick look, I would guess that
{ sh ch ee she che eee } admit "a/o/y" only as post-modifiers
{ r l m n ir iir in iin iiin } admit "a/o/y" only as pre-modifiers
{ s d k t cth ckh ke te cthe ckhe } admit "a/o/y" in both positions.
But this hunch needs to be confirmed...
(6) None of the above.
Appendix: A more flexible factoring script
------------------------------------------
The logic of factor-OK.sed was rewritten in AWK as
a "factor-field-OK" script that allows one to factor a
selected field of a multifield file.
Checking consistency of the two scripts:
foreach file ( hea-u hea-f heb-f bio-f vdp-z )
echo ${file}-old
cat ${file}.wds \
| sed -f factor-OK.sed \
> .${file}-old.fac
echo ${file}-new
cat ${file}.wds \
| factor-field-OK \
| gawk '/./{ print $1; }' \
> .${file}-new.fac
dicio-wc .${file}-{old,new}.fac
diff .${file}-{old,new}.fac
/bin/rm .${file}-{old,new}.fac
end
The differences are confined to words that factor-OK can't parse.
The new script will forcibly factor those words into elements
{i+X}, {X[eh]}, or {X} where {X} is a character other than [aoy].
Word and line breaks in the OKOKOKO model
-----------------------------------------
[1999-02-01]
It is instructive to analyze the immediate contexts of definite word
spaces (std), breaks due to figures in the text (fig),
intra-paragraph line breaks (lin), and adjacent letter pairs inside
words (non), in terms of the K/E/O classification.
For this study we will use the majority-vote transcription, which
includes Takeshi's new full transcription. For simplicity, let's
discard all data containing weirdos, extra plumes, unreadable
characters, or the rare letters [buvxz]. Let's also map the upper
case EVA letters [SCIKTPF] to their lower case variants, since the
capitalization carries no information in those cases.
cat ../045/only-m.evt \
| egrep -e '^<[^<>]*;A>' \
| tr 'SCIKTPF' 'sciktpf' \
| tr -d '\!' \
| sed \
-e 's/^<[^<>]*> *//g' \
-e 's/[{][^{}]*[}]//g' \
-e 's/[&][0-9*?][0-9*?]*[;]\?/*/g' \
-e 's/[buxvz]/*/g' \
-e 's/[.,]*-[-.,]*/-/g' \
-e 's/[,]*[.][,.]*/./g' \
-e 's/[,][,]*/,/g' \
-e 's/.['"'"'"]/?/g' \
-e 's/[^-,./= ]*[%?*][^-,./= ]*/?/g' \
> base.txt
Let's reduce the alphabet to letter classes as follows:
O = [aoy]
I = [i]+
Q = [q]
E = unattached [eh]
R = [djmg] and [rlsn]
X = <ee>, <ch>, <sh>, <ih>, [ci][ktpf][h], [c][ktpf], [ktpf]
The following hack should do it:
cat base.txt \
| sed \
-e 's/ee/X/g' \
-e 's/[csi][h]/X/g' \
-e 's/[ci][ktpf][h]/X/g' \
-e 's/[c][ktpf]/X/g' \
-e 's/[ktpf]/X/g' \
-e 's/[rlsn]/R/g' \
-e 's/[mdgj]/R/g' \
-e 's/[aoy]/O/g' \
-e 's/[q]/Q/g' \
-e 's/[i][i]*/I/g' \
-e 's/[ceh]/E/g' \
> base.clt
egrep '[^-.,=/?XEQROI]' base.clt > .bugs
head -10 .bugs
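As a quick sanity check of the class mapping, here is a minimal sketch
that applies the same substitutions to four common words. The function
name "classify" is made up for this example, and plain sh is assumed
rather than csh.

```shell
# Sketch only: the substitutions listed above, wrapped in a filter.
# "classify" is a hypothetical name, not one of the scripts of this note.
classify() {
  sed \
    -e 's/ee/X/g' \
    -e 's/[csi]h/X/g' \
    -e 's/[ci][ktpf]h/X/g' \
    -e 's/c[ktpf]/X/g' \
    -e 's/[ktpf]/X/g' \
    -e 's/[rlsn]/R/g' \
    -e 's/[mdgj]/R/g' \
    -e 's/[aoy]/O/g' \
    -e 's/q/Q/g' \
    -e 's/ii*/I/g' \
    -e 's/[ceh]/E/g'
}

# One word per line in, one class pattern per line out.
printf '%s\n' daiin chedy qokeedy shol | classify
# ROIR
# XERO
# QOXXRO
# XOR
```

The order of the rules matters: the multi-letter groups must be
replaced before the single letters they contain, and the leftover
<c>, <e>, <h> are swept into E only at the very end.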
Now let's count the pairs:
cat base.clt \
| sed \
-e 's/-\(.\)-/-\1\1-/g' \
-e 's/\(.\)-\(.\)/@\1-\2@/g' \
| tr '@' '\012' \
| egrep -e '^.-.$' \
| egrep -v -e '[?%*]' \
| sort | uniq -c | expand \
| sort -k1,1nr \
| compute_freqs.gawk \
> fig.breaks
cat base.clt \
| sed \
-e 's/[.]\(.\)[.]/.\1\1./g' \
-e 's/\(.\)[.]\(.\)/@\1.\2@/g' \
| tr '@' '\012' \
| egrep -e '^.[.].$' \
| egrep -v -e '[?%*]' \
| sort | uniq -c | expand \
| sort -k1,1nr \
| compute_freqs.gawk \
> std.breaks
cat base.clt \
| sed \
-e 's/^[-=., ]*\([^-=., ]\)[-=., ]*$/\1\1/g' \
-e 's/^[-=., ]*\([^-=., ]\)/\1@/g' \
-e 's/\([^-=., ]\)[-., ]*$/@\1\//g' \
| tr -d '\012' \
| tr '@' '\012' \
| egrep -e '^.[/].$' \
| egrep -v -e '[?%*]' \
| sort | uniq -c | expand \
| sort -k1,1nr \
| compute_freqs.gawk \
> lin.breaks
cat base.clt \
| sed \
-e 's/\([^-=., ]\)/\1@\1\!/g' \
-e 's/[\!:]*[-=:., ][-\!=:., ]*/@/g' \
| tr '@' '\012' \
| egrep -e '^.\!.$' \
| egrep -v -e '[?%*]' \
| sort | uniq -c | expand \
| sort -k1,1nr \
| compute_freqs.gawk \
> non.breaks
multicol -v titles="non std fig lin" {non,std,fig,lin}.breaks
Here is the result:
non                 std                 fig                 lin
------------------  ------------------  ------------------  ------------------
  18340 0.1577 O!R     6363 0.2281 R.X      168 0.2205 R-O      636 0.2048 R/O
  16712 0.1437 X!O     5797 0.2078 R.O      134 0.1759 O-O      594 0.1913 R/R
  14897 0.1281 R!O     3703 0.1328 O.X      107 0.1404 O-R      415 0.1337 O/R
  12804 0.1101 O!X     3287 0.1178 O.Q      107 0.1404 R-X      377 0.1214 O/O
   9634 0.0829 X!E     2786 0.0999 O.O      100 0.1312 O-X      334 0.1076 R/X
   8495 0.0731 X!X     2748 0.0985 O.R       98 0.1286 R-R      262 0.0844 R/Q
   6322 0.0544 O!I     1882 0.0675 R.R       16 0.0210 O-Q      248 0.0799 O/X
   6298 0.0542 I!R     1068 0.0383 R.Q        9 0.0118 X-X      199 0.0641 O/Q
   5160 0.0444 E!O       83 0.0030 X.X        9 0.0118 R-Q       18 0.0058 X/O
   5048 0.0434 Q!O       57 0.0020 X.O        6 0.0079 X-R        5 0.0016 X/R
   3792 0.0326 E!R       22 0.0008 R.E        4 0.0052 X-O        5 0.0016 O/E
   3071 0.0264 R!X       18 0.0006 E.O        3 0.0039 O-E        2 0.0006 X/X
   2866 0.0247 X!R       18 0.0006 X.R        1 0.0013 E-R        2 0.0006 X/Q
    939 0.0081 E!X       17 0.0006 E.X                            2 0.0006 R/E
    905 0.0078 R!R       11 0.0004 O.E                            1 0.0003 E/O
    531 0.0046 O!O       10 0.0004 X.Q                            1 0.0003 E/R
    152 0.0013 O!E        9 0.0003 E.R                            1 0.0003 I/O
     93 0.0008 R!E        5 0.0002 E.Q                            1 0.0003 I/Q
     52 0.0004 Q!X        2 0.0001 I.O                            1 0.0003 X/E
     37 0.0003 I!X        2 0.0001 I.R                            1 0.0003 Q/R
     33 0.0003 Q!E        2 0.0001 R.I
     28 0.0002 O!Q        1 0.0000 I.X
     15 0.0001 R!I        1 0.0000 X.E
     13 0.0001 X!I
     11 0.0001 I!O
      4 0.0000 E!E
      4 0.0000 Q!R
      3 0.0000 R!Q
      1 0.0000 E!I
So we can say that
(1) Line breaks occur almost only between { R O } and { R O X Q }
(with frequencies ranging from 6% to 20% of all line breaks);
rarely between X and { R O X Q }
(less than 0.9% of all line breaks);
and essentially NEVER after { Q I E } or before { I E }
(about 0.4% of all line breaks).
(2) Ordinary word breaks follow the same pattern:
the pairs between { R O } and { R O X Q }
have frequencies between 3.8% and 22%;
pairs between X and { R O X Q } have total
frequency of 0.6%; and all the remaining pairs
account for only 0.3% of the word breaks.
(3) Figure breaks too follow almost the same pattern:
the pairs between { R O } and { X R O }
have frequencies ranging from 12% to 22%,
but the pairs { R-Q and O-Q } are much rarer
than around line breaks and ordinary spaces,
about 1--2% each. Breaks between X and { R O X Q }
are slightly more common (2.5% total) and
all other pairs are almost absent (about 0.5%).
(4) The relative frequencies of { R O X Q }
are approximately 2:2:1:1 after a line break,
and about 8:12:9:1 after a figure break (counts 212, 306, 216, 25
in the table above), roughly
independently of the character before the break.
(5) The relative frequencies of { R O X Q }
after ordinary breaks seem to depend on the
preceding letter, but they are still
of the same order of magnitude.
(6) Inside words, the valid pairs are
{ QO, OX, OI, IX, XX, XE, XO, EX, EO }
(with X standing here for either X or R, their counts added)
with frequencies ranging from 4.1% to 27%.
The remaining pairs have much lower frequencies
(OO accounts for 0.46% of all pairs, and OE
for only 0.13%).
These observations seem to imply that the "word spaces", line
breaks, and figure breaks are fairly similar and very distinct from
random inter-character pairs.
Their similarity, and the relative
independence of the next letter from the preceding one, strongly
suggest that they are
indeed word boundaries. In that case we conclude that
Voynichese words may end in O or R (40-45% and 50-60%, respectively)
or rarely X; and may begin only with X, O, R, or Q.
(A more detailed analysis would show that the O at end of words is
almost always <y>. Also the last R in a line is most often EVA <m>,
which is only rarely seen at the other kinds of word breaks.)
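The word-ending percentages can be read off the pair tables by summing
the counts over the class that precedes the break. A minimal sketch,
using the fig column as input; the function name "class_before_break"
is invented here, and the input layout (count, frequency, pattern) is
assumed to be that of the *.breaks files:

```shell
# Hypothetical helper: total the counts of a "count freq pattern" table
# by the first character of the pattern (the class before the break).
class_before_break() {
  awk '{ c = substr($3, 1, 1); n[c] += $1; tot += $1 }
       END { for (c in n) printf "%s %5d %6.4f\n", c, n[c], n[c] / tot }' |
  sort -k2,2nr
}

# The fig column of the table above, copied as is.
fig_table='168 0.2205 R-O
134 0.1759 O-O
107 0.1404 O-R
107 0.1404 R-X
100 0.1312 O-X
98 0.1286 R-R
16 0.0210 O-Q
9 0.0118 X-X
9 0.0118 R-Q
6 0.0079 X-R
4 0.0052 X-O
3 0.0039 O-E
1 0.0013 E-R'

printf '%s\n' "$fig_table" | class_before_break
# R   382 0.5013
# O   360 0.4724
# X    19 0.0249
# E     1 0.0013
```

Taking the third character of the pattern instead of the first gives
the corresponding totals for the class that follows the break.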
Point (3) shows that figure breaks are more like word breaks
than random inter-character breaks. The depressed frequencies
of R-Q and O-Q call for an explanation, though.
Point (6) is a partial restatement of the QOKOKOKO paradigm.
Note that the pairs { QO OI IX XE EX EO } (X again standing for
either X or R), which are fairly
common inside words, are not legal places for word spaces,
line, or figure breaks.
These data do not shed much light on whether each O and E
is attached to the preceding X, the following X, or sometimes
both, or neither. Unfortunately there are (practically)
no figure breaks adjacent to an E.
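The pair-counting pipelines of this section differ mainly in the break
character. As a compact cross-check, here is a minimal sketch that
scans each line by position and tallies the classes on either side of
one given break character. The names "break_pairs" and "add_freqs" are
hypothetical, and add_freqs is only a guess at what compute_freqs.gawk
prints (count, relative frequency, key). Line-end breaks are not
handled here.

```shell
# Sketch: tally (class before, class after) around one break character.
# usage: break_pairs SEP < classified-text
break_pairs() {
  awk -v sep="$1" '{
    n = length($0)
    for (i = 2; i < n; i++)
      if (substr($0, i, 1) == sep)
        print substr($0, i - 1, 1) sep substr($0, i + 1, 1)
  }' | grep -v '[?%*]' | sort | uniq -c | sort -k1,1nr -k2,2
}

# Assumed stand-in for compute_freqs.gawk: append count/total to each row.
add_freqs() {
  awk '{ cnt[NR] = $1; key[NR] = $2; tot += $1 }
       END { for (i = 1; i <= NR; i++)
               printf "%7d %6.4f %s\n", cnt[i], cnt[i] / tot, key[i] }'
}

# Two made-up lines of class text, with "." as the break character.
printf '%s\n' 'ROIR.XERO.OR.XOR' 'XEO.ROR.R.XXO' | break_pairs . | add_freqs
#       3 0.5000 R.X
#       1 0.1667 O.O
#       1 0.1667 O.R
#       1 0.1667 R.R
```

Because every break position is examined on its own, a one-letter word
such as the lone R in the second line contributes to both of its
neighbouring breaks, without the letter-doubling substitution that the
sed versions need.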
Word pattern frequencies
------------------------
It is also instructive to analyze the frequency of each word pattern,
the result of collapsing the letters into the classes { Q O X I R E }
or { Q O K } defined above.
First, the { Q O X I R E } patterns:
cat base.clt \
| tr '., =/-' '\012\012\012\012\012\012' \
| egrep '.' \
| egrep -v '[?%*]' \
| sort | uniq -c | expand \
| sort -k1,1nr \
> QOIXER.frq
The result is a long-tailed distribution that begins
freq pattern
----- ----------
1832 XOR
1725 OR
1649 ROIR
1413 ROR
1209 OXOR
1084 XERO
940 XXO
903 XEO
817 OXOIR
786 QOXOR
745 XEOR
718 OIR
716 XO
703 OXXO
660 QOXOIR
560 QOXXO
487 R
480 QOXXRO
404 RO
382 OXXRO
379 XXOR
376 QOXERO
375 OXO
372 XOIR
370 OXERO
325 OROR
316 OXEOR
312 OXXOR
309 XXRO
307 OROIR
... ...
Let's now collapse the elements { X XE R IR } to a single class K,
and absorb the Q into the following O:
cat base.clt \
| tr '., =/-' '\012\012\012\012\012\012' \
| egrep '.' \
| egrep -v '[?%*]' \
| sed \
-e 's/XE/K/g' \
-e 's/X/K/g' \
-e 's/IR/K/g' \
-e 's/R/K/g' \
-e 's/QO/O/g' \
| sort | uniq -c | expand \
| sort -k1,1nr \
> QOK.frq
The result is still a relatively long-tailed distribution:
freq pattern
----- ----------
6061 KOK
4690 OKOK
3075 KKO
2704 OK
2646 OKKO
2023 KO
1531 OKKKO
1365 OKO
1346 KKOK
1236 KKKO
1052 KOKO
951 OKKOK
861 KOKOK
578 K
561 OKOKO
374 KOKKO
324 KK
309 OKOKOK
265 O
233 KKKOK
219 KKOKO
202 KKKKO
189 KKK
177 OKKOKO
175 OKKK
169 OKK
160 OOK
152 OKOKKO
142 KOKKOK
139 OKKKKO
... ...
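To see how the { Q O X I R E } patterns feed into the { O K } ones,
the same collapsing rules can be applied to a frequency table and the
counts re-added. A minimal sketch, run on the first five rows of the
QOIXER table; the function name "collapse_counts" is invented for this
example:

```shell
# Sketch: collapse "count pattern" rows to O/K patterns and re-add counts.
# The gsub calls follow the same order as the sed rules above.
collapse_counts() {
  awk '{ p = $2
         gsub(/XE/, "K", p); gsub(/X/, "K", p)
         gsub(/IR/, "K", p); gsub(/R/, "K", p)
         gsub(/QO/, "O", p)
         n[p] += $1 }
       END { for (p in n) printf "%6d %s\n", n[p], p }' |
  sort -k1,1nr
}

printf '%s\n' '1832 XOR' '1725 OR' '1649 ROIR' '1413 ROR' '1209 OXOR' |
  collapse_counts
#   4894 KOK
#   1725 OK
#   1209 OKOK
```

Three of the five input patterns (XOR, ROIR, ROR) merge into KOK, which
already accounts for 4894 of the 6061 occurrences of KOK in the full
table.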
Conversely, we can analyze the patterns of X and R ignoring the
{ Q E I O } complements:
cat base.clt \
| tr '., =/-' '\012\012\012\012\012\012' \
| egrep '.' \
| egrep -v '[?%*]' \
| tr -d 'QEIO' \
| sort | uniq -c | expand \
| sort -k1,1nr \
> XR.frq
The result is still a fairly broad distribution:
freq pattern
----- ----------
10441 XR
4319 RR
4006 R
3768 XXR
3682 XX
2999 X
1461 XRR
1279 RXR
538 RX
480 XXX
463 RRR
409 XXRR
366 RXXR
346 RXX
302 (empty pattern: words with no X or R)
230 XXXR
151 XRXR
132 XRX
116 XRRR
90 RXRR
89 RRXR
59 RRRR
56 XXXX
50 RRX
31 XXRX
30 XXRRR
24 XRXX
23 RXXRR
23 RXXX
22 XRXXR
... ...