Hacking at the Voynich manuscript - Side notes
017 OKOKOKO: The fine structure of Voynichese words

Last edited on 1999-09-23 23:55:55 by stolfi

  [ A first version of this note was posted around 1998-03-11, to the
  voynich mailing list.  This version was extensively revised between
  1998-03-21 and 1998-03-29. The section about word and line breaks 
  was added on 1999-02-01. ]
  
  [ If you decide to print this note, be warned that some lines have
  almost 120 characters.]

The basic QOIXEOIXEO paradigm
-----------------------------

  Let "X" be any set of letters.  We can always break any string
  whatsoever into zero or more "X"s, separated by (possibly empty)
  strings of letters which are not "X"s:
  
     N X N X N X N ... X N
  
  where "X" represents exactly one letter from the set, and
  "N" is any string (possibly empty) of non-"X" letters.
  
  Now let's apply this decomposition to the Voynichese words,
  using as "X" the set of letters
    
      { sh ch ee  
        k ckh ck ikh 
        t cth ct ith 
        f cfh cf ifh 
        p cph cp iph
        d 
        r l s g m n
      }
    
  (I am using the basic EVA alphabet, without capitals). 
  
  It turns out that, for this choice of "X", the intervening "N"
  strings are highly constrained. In fact, most words can be 
  decomposed as 
  
     Q  O  I X E  O  I X E  O  I X E  O...  I X E  O 
     
  where 
  
     Q is empty or "q";
     O is zero or more elements from the set A = { a o y };
     I is empty, or one of { i ii iii };
     E is empty, or "e".
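
  As a rough illustration, the paradigm above can be written as a
  single regular expression.  This is only a sketch: the names X_SET,
  PARADIGM and fits_paradigm are made up for this example, and the
  pattern encodes only the loose Q O (I X E O)* form, with no further
  restrictions on where "I" and "E" may occur.

```python
import re

# The set "X" as listed above (basic EVA, no capitals).
X_SET = ("sh ch ee k ckh ck ikh t cth ct ith f cfh cf ifh "
         "p cph cp iph d r l s g m n").split()

# Longest alternatives first, so that "ckh" is tried before "ck", etc.
X_ALT = "|".join(sorted(X_SET, key=len, reverse=True))

# Q O ( I X E O )*  with Q = q?, O = [aoy]*, I = i{0,3}, E = e?
PARADIGM = re.compile(rf"q?[aoy]*(?:i{{0,3}}(?:{X_ALT})e?[aoy]*)*")

def fits_paradigm(word):
    """True if the whole word can be decomposed as Q O I X E O ... I X E O."""
    return PARADIGM.fullmatch(word) is not None

if __name__ == "__main__":
    for w in ["daiin", "qokeedy", "okeeedy", "ykhey", "oedy"]:
        print(w, fits_paradigm(w))
```

  Under this sketch "daiin" and "qokeedy" fit, while "ykhey" (stray
  "h") and "oedy" ("e" with no preceding "X") do not.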
     
The QOKOKOKO schema
-------------------

  In fact, we can constrain these pieces even more.  
  With very few exceptions,
  
    "E" may be non-empty only after { sh ch ee k ckh t cth p cph f cfh d }
    
    "I" may be non-empty only before { r l g m n s d }
    
  Note that "d" is exceptional in that it may be accompanied by
  either "e" or "i" strings; but the two are mutually exclusive.
  (In fact the letter pairs "id" and "de" are both extremely rare.)
    
  That is, we can write the generic word as
  
    Q O K O K O K ... K O K O
    
  where O is as above, and K is one of the "main elements"
  
    { k    t    p    f    
      ke   te   pe   fe   
      ckh  cth  cph  cfh   
      ckhe cthe cphe cfhe  
      ikh  ith  iph  ifh   
      ck   ct   cf   cp 

      sh  ch  ee 
      she che eee
      
      de

      d    r    l    g    m    n    s 
      id   ir   il   ig   im   in   is
      iid  iir  iil  iig  iim  iin  iis
      iiid iiir iiil iiig iiim iiin iiis 
    }
    
  Note that 
  
    * The letters "p" and "f" are probably ornate versions of other
      letters: most likely "k" and "t", but perhaps others.
    
    * Various statistics suggest that "k" and "t" may be the same letter.
    
    * Ditto for "p" and "f". 
    
    * Ditto for "y" and "o".
    
    * Ditto for "g" and "m". 
    
    * The letter "q" does not seem to be part of the word;
      it may be an abbreviation for "and". 
      
    * The groups { ikh ith iph ifh } may be equivalent to 
      { ckh cth cph cfh }, respectively.
      
    * Instances of { ee eee } may be instances of { ch che }
      with missing ligature. 
  
  Finally, many of the "K" elements are so rare that they are
  probably errors.   If we consider only elements with 
  frequency 0.1% or higher, and exclude the elements
  with "i*h", "p", and "f", we are left with only 25 
  "significant" elements:
  
    K* = { k    ke   ckh  ckhe 
           t    te   cth  cthe
           ch   che 
           sh   she
           ee   eee
           l    m    s    d
           n    r    
           in   ir 
           iin  iir 
           iiin 
         }
    
Parsing ambiguities
-------------------

  Note that the inclusion in "X" of the groups { ikh ith iph ifh }
  does not create any ambiguity with the "I" modifiers, since the
  presence of "h" after a tall letter forces one to parse the
  preceding letter (which must be "i" or "c") as part of the same
  element.  Indeed, the elements { ikh ith iph ifh } may be merely
  calligraphic variants of { ckh cth cph cfh }, and are the only
  instances where the letters { k t p f } may be preceded by "i".

  On the other hand, including the string "ee" in the set "X" leads to
  an ambiguity in the parsing of words with three or more consecutive
  "e"s.  For example, "okeeedy" could be parsed either as
  
      Q   O   I  X  E   O   I  X  E   O   I  X  E   O
      -   o   -  k  -   -   -  ee e   -   -  d  -   y
      
  or as
      
      Q   O   I  X  E   O   I  X  E   O   I  X  E   O
      -   o   -  k  e   -   -  ee -   -   -  d  -   y
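
  The two readings can also be found mechanically.  The sketch below
  enumerates every way to split a word as Q O K O ... K O, using the
  "main elements" listed for the QOKOKOKO schema.  The names K_SET
  and parses are illustrative only, and empty slots are shown as "-".

```python
# Build the set of "main elements" K from the QOKOKOKO schema.
K_SET = set()
for g in "ktpf":
    K_SET |= {g, g + "e", "c" + g + "h", "c" + g + "he", "i" + g + "h", "c" + g}
K_SET |= {"sh", "ch", "ee", "she", "che", "eee", "de"}
for x in "drlgmns":
    for n in range(4):
        K_SET.add("i" * n + x)

def parses(word):
    """Return all decompositions of word as [Q, O, K, O, K, ..., O]."""
    results = []

    def walk(pos, parts):
        # An O group is the maximal run of a/o/y; no K element starts
        # with one of these letters, so taking the whole run loses nothing.
        end = pos
        while end < len(word) and word[end] in "aoy":
            end += 1
        parts = parts + [word[pos:end] or "-"]
        if end == len(word):
            results.append(parts)
            return
        for k in sorted(K_SET):
            if word.startswith(k, end):
                walk(end + len(k), parts + [k])

    start = 1 if word.startswith("q") else 0
    walk(start, ["q" if start else "-"])
    return results

if __name__ == "__main__":
    for p in parses("okeeedy"):
        print(" ".join(p))
```

  For "okeeedy" this yields exactly two decompositions, one with the
  elements "k" + "eee" and one with "ke" + "ee".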
      
  Several Voynichologists (Rene and Dennis, among others) are unhappy
  about this ambiguity; they favor excluding "ee" from the set "X",
  and perhaps allowing "ee" and "eee" as possible "E" modifiers.
  
  But there are reasons for including "ee" in "X".  For one thing,
  while an isolated "e" is pretty common within words, it practically
  never occurs right after { d r l } or before the first "X"; but "ee"
  and "eee" often occur in those positions.  That is, while a single
  "e" must always be attached to a preceding "X", the groups "ee" and
  "eee" can stand on their own, like the other "X" groups.
     
  (One could argue that the "c" in the elements { ck ct cf cp }, which
  may occur before any other "X" group in some words, is in fact an
  instance of "e". However, in the few cases I have checked, the "c"
  has a noticeable ligature, even though the matching "h" is
  missing. So it seems indeed valid to write those combinations with
  "c" and not with "e".)
  
  One must keep in mind also that an "ee" group may well be a "ch"
  element whose ligature was omitted (by the scribe or the 
  transcriber).  Similarly, the very rare occurrences of "se" may
  well be instances of "sh" with missing ligature.
  
  Conversely, it may be that the `natural' form of the letters 
  { ch che sh she } is { ee eee se see }, respectively; and the 
  ligatures are optional calligraphic devices added 
  to clarify the parsing, almost as an afterthought.

Parsing the text
----------------

  Words that do not fit this "QOKOKOKO" pattern are quite rare.
  Let's count them in the following files:
  
     hea-u.wds  a few herbal-A pages, which I carefully
                transcribed from Jacques Guy's images;
               
     hea-f.wds  herbal-A pages in Friedman's transcription;

     heb-f.wds  herbal-B pages in Friedman's transcription;
     
     bio-f.wds  biological (language B) pages in 
                Friedman's transcription;

     vdp-z.wds  a list of all words that occur at least twice,
                transcribed by the EVMT team.
               
  (The "-f" files were created between 97-11-11 and 98-11-12,
  as {hea,heb,bio}-f-gut.wds, from Landini's interlinear 
  converted to EVA.  The last one was created by expanding 
  a word frequency list posted by Rene Zandbergen in March 1998; 
  an entry "N W" in that list generated "N" copies of word "W"
  in file "vdp-z.wds".)
  
    foreach file ( hea-u hea-f heb-f bio-f vdp-z )
      cat ${file}.wds \
        | egrep -v '[*]' \
        | sed -f factor-OK.sed \
        > ${file}.fac
      cat ${file}.fac \
        | egrep -e '[#@%=]' \
        > ${file}-weird.fac
      dicio-wc ${file}.fac ${file}-weird.fac
    end
  
    --- factor-OK.sed ------------------------
    # Map "sh", "ch", and "ee" to single letters to simplify the parsing.
    # Note that "eee" groups are paired off from the left end ("eee" becomes "Ee").
    s/ch/C/g
    s/sh/S/g
    s/ee/E/g
    # Map platformed and half-platformed letters to capitals to simplify the parsing:
    s/ckh/K/g
    s/cth/T/g
    s/cfh/F/g
    s/cph/P/g
    #
    s/ikh/G/g
    s/ith/H/g
    s/ifh/M/g
    s/iph/N/g
    #
    s/ck/U/g
    s/ct/V/g
    s/cf/X/g
    s/cp/Y/g
    # Put down scanning head in "@" state
    s/$/@/
    :x
    # If in "@" state, copy "[aoy]" group, and switch to "#" state:
    s/\([aoy][aoy]*\)@/#\1/
    s/@/#_/
    # If in "#" state, copy next main letter and "e" complements,
    # insert "}" delimiter, and switch to "%" or "=" state depending on
    # whether "i"s are allowed or not:
    s/\([CSEktfpKTFPd]e\)#/=\1}/g
    s/\([CSEktfpKTFPGHMNUVXY]\)#/=\1}/g
    s/\([rlgmnsd]\)#/%\1}/g
    # If in "%" state, attach "i" string to group, go to "=" state: 
    s/\(iii\)%/=\1/
    s/\(ii\)%/=\1/
    s/\(i\)%/=\1/
    s/%/=/
    # If in "=" state, insert "{" delimiter, and go back to "@" state:
    s/=/@{/
    tx
    # We should exit the loop only in the "#" state.
    # Split "q" prefix and discard scanning head if done:
    s/^[q]#/{q}/
    s/^#/{_}/
    # Unfold letter folding:
    s/U/ck/g
    s/V/ct/g
    s/X/cf/g
    s/Y/cp/g
    #
    s/G/ikh/g
    s/H/ith/g
    s/M/ifh/g
    s/N/iph/g
    #
    s/K/ckh/g
    s/T/cth/g
    s/P/cph/g
    s/F/cfh/g
    #
    s/C/ch/g
    s/S/sh/g
    s/E/ee/g
    ------------------------------------------
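
  For readers who find the sed state machine hard to follow, here is
  a rough Python equivalent of the factoring step.  It is not the
  script that produced the counts in this note.  It folds the same
  letter groups, tests the folded word against one regular
  expression, and writes the result in the same "{Q}O{K}O...{K}O"
  layout, with "_" for empty slots.  The function name factor, and
  the choice to return None for a non-conforming word (where the sed
  script leaves a stray state marker in the output), are assumptions
  of this sketch; its agreement with factor-OK.sed has been traced by
  hand on a few words only.

```python
import re

# Same folding as factor-OK.sed, applied in the same order.
FOLD = [("ch", "C"), ("sh", "S"), ("ee", "E"),
        ("ckh", "K"), ("cth", "T"), ("cfh", "F"), ("cph", "P"),
        ("ikh", "G"), ("ith", "H"), ("ifh", "M"), ("iph", "N"),
        ("ck", "U"), ("ct", "V"), ("cf", "X"), ("cp", "Y")]

# One folded main element: letter + "e", or bare letter, or "i"s + letter.
K_RE = r"(?:[CSEktfpKTFPd]e|[CSEktfpKTFPGHMNUVXY]|i{0,3}[rlgmnsd])"
WORD_RE = re.compile(r"(q?)((?:[aoy]*" + K_RE + r")*)([aoy]*)")
STEP_RE = re.compile(r"([aoy]*)(" + K_RE + r")")

def factor(word):
    """Return the factored form of word, or None if it does not fit."""
    s = word
    for plain, folded in FOLD:
        s = s.replace(plain, folded)
    m = WORD_RE.fullmatch(s)
    if m is None:
        return None
    q, body, tail = m.groups()
    out = "{" + (q or "_") + "}"
    for o, k in STEP_RE.findall(body):
        out += (o or "_") + "{" + k + "}"
    out += tail or "_"
    # Undo the folding.
    for plain, folded in FOLD:
        out = out.replace(folded, plain)
    return out

if __name__ == "__main__":
    for w in ["okeeedy", "qokeey", "daiin", "chckhhy", "oqokain"]:
        print(w, factor(w))
```

  With this sketch "okeeedy" becomes "{_}o{k}_{eee}_{d}y", while
  "chckhhy" and "oqokain" are rejected.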

     lines   words     bytes file        
    ------ ------- --------- ------------
       803     803     11751 hea-u.fac
         0       0         0 hea-u-weird.fac

     lines   words     bytes file        
    ------ ------- --------- ------------
      7812    7812    113448 hea-f.fac
        93      93      1144 hea-f-weird.fac

     lines   words     bytes file        
    ------ ------- --------- ------------
      3223    3223     47932 heb-f.fac
        46      46       564 heb-f-weird.fac

     lines   words     bytes file        
    ------ ------- --------- ------------
      6182    6182     90650 bio-f.fac
        39      39       474 bio-f-weird.fac

     lines   words     bytes file        
    ------ ------- --------- ------------
     28939   28939    420444 vdp-z.fac
       142     142      1339 vdp-z-weird.fac


  So, the exceptions to the QOKOKOKO pattern are less than 1.5% in
  Friedman's transcription, less than 0.5% in Rene's list, and none
  in my own transcription.
  
  (The last result is not that impressive, of course.  Even though I did my
  transcription before I had worked out the structure above, I already
  had some intuition about it, so my reading was not impartial.)
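
  The percentages quoted above follow directly from the line counts
  in the tables.  A small check, with the counts copied from those
  tables (the names COUNTS and exception_rates are illustrative):

```python
# (total words, words that failed to parse), from the dicio-wc tables.
COUNTS = {
    "hea-u": (803, 0),
    "hea-f": (7812, 93),
    "heb-f": (3223, 46),
    "bio-f": (6182, 39),
    "vdp-z": (28939, 142),
}

def exception_rates(counts):
    """Percentage of non-conforming words per file."""
    return {name: 100.0 * weird / total
            for name, (total, weird) in counts.items()}

if __name__ == "__main__":
    for name, pct in exception_rates(COUNTS).items():
        print(f"{name}  {pct:5.2f}%")
```

  The largest rate is that of heb-f, about 1.43%.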
  
The exceptions in Rene's word list
----------------------------------

  Here is a breakdown of the 142 words (counting multiple occurrences)
  in Rene's file that did not fit the QOKOKOKO pattern.  (Let's keep in mind
  that Rene's file only includes words that occur at least twice.)
  
  It seems that some of these exceptions can be explained as "mutations"
  from other letters: scribal errors, calligraphic variations, pen
  running out of ink, vellum defects, spots, fading, and of course
  poor copy quality.  Some are harder to explain, however, and may
  require extending the basic schema.
  
    * Words with groups { ckhh cthh cphh cfhh } (42 cases):
        
        chckhhy(9) cthhy(4) chcthhy(4) shcthhy(3) qcthhy(3) ckhhy(3)
        chcphhy(3) chcfhhy(3) shocthhy(2) shcphhy(2) qcphhedy(2)
        ockhhy(2) ocfhhy(2)

      These exceptions account for 0.15% of all words.  I propose that
      these are calligraphic accidents; that is, "ckhh" is a "ckhe"
      whose ligature was overextended, and similarly for the other
      groups.
      
    * Words with "oe" (41 cases):
    
        qoedy(5) qoedaiin(3) oedy(2)

        qoeol(5) qoear(2) qoeor(2)

        qoekeey(3) oekaiin(3) qoekol(2) oekeey(2)
        qoekedy(2) oekey(2) oekeody(2)

        choety(2) choeky(2) sheoeky(2)

      These exceptions account for approximately 0.15% of all words.
      The cases with "eke" could be explained as instances of "ckh"
      with missing ligature.  The others may be true exceptions to the
      schema.

      Note that the "oe" occurs only at the beginning of the word, or
      after the initial "q", or after an initial "ch" or "she" (which,
      in language A, seem to behave like "q" to some extent).

    * Words beginning with "e" or "qe" (20 cases):

        ety(6) qekeey(3) qekchdy(3) qety(2) qekor(2) qekaiin(2)
        etaiin(2)
        
      These word-initial "e"s could be explained as partly erased
      instances of { a o y }.  Note that if we replace the initial "e"
      by "o" or "y" we get fairly common words in all these cases.

    * Words with the special letters "x" and "v" (20 cases):
    
        x(10) v(8) xar(2)

      Note that these letters (picnic table and caret) occur mostly as
      isolated letters. Therefore, they may be non-phonetic symbols, or
      abbreviations.

    * Words with "e" after "s" (5 cases):

        chsey(3) shese(2)

      These exceptions could be instances of "sh" without the ligature.
      
    * Isolated "e"s (4 cases):
    
        e(4)
        
      These exceptions could be instances of "s" with missing plume.
      
    * Words with "eeb" (3 cases): 
    
        cheeb(3)

      I propose that "eeb" is merely a calligraphic variation of
      "an" or "iin".
       
    * Words with "ykh" (3 cases):
      
        ykhey(3)
        
      I can't think of a good explanation for these cases.
              
    * Letter "o" before "q" (2 cases):
    
        oqokain(2)
        
      Perhaps the extra "o" is a separate word, or part of the
      previous one?
        
    * Letter "i" in word-final position (2 cases):

        okai(2)

      These exceptions could be truncated "in" or "ir" groups.
        
Frequencies for "K" elements
----------------------------

  Here are the statistics for the "K" groups.
  
    foreach file ( hea-u hea-f heb-f bio-f vdp-z )
      cat ${file}.fac \
        | egrep -v '[@%#=]' \
        | sed \
            -e 's/^[^{}]*{//g' \
            -e 's/}[^{}]*$//g' \
            -e 's/}[^{}]*{/./g' \
        | tr '.' '\012' \
        | egrep -e '.' \
        | sort | uniq -c | expand | sort -b +0 -1nr \
        | compute-freqs | sed -e 's/^  //g' \
        > ${file}-k.frq
      dicio-wc ${file}-k.frq
    end
  
     lines file        
    ------ ------------
        39 hea-u-k.frq
        41 hea-f-k.frq
        36 heb-f-k.frq
        35 bio-f-k.frq
        44 vdp-z-k.frq

    multicol {hea-u,hea-f,heb-f,bio-f,vdp-z}-k.frq

          hea-u             hea-f             heb-f             bio-f             vdp-z   
    ----------------  ----------------  ----------------  ----------------  ----------------
      752 0.304 _      7024 0.297 _      2856 0.284 _      4627 0.242 _     24167 0.273 _   
      292 0.118 ch     2524 0.107 ch     1459 0.145 d      2512 0.131 d      9928 0.112 d   
      216 0.087 d      2194 0.093 d       765 0.076 k      2140 0.112 l      7523 0.085 l   
      183 0.074 l      1702 0.072 l       608 0.060 l      1516 0.079 q      6470 0.073 k   
      178 0.072 r      1466 0.062 r       600 0.060 r      1422 0.074 k      4855 0.055 r   
      119 0.048 t      1257 0.053 k       557 0.055 ch      828 0.043 che    4698 0.053 ch  
      102 0.041 k      1177 0.050 t       424 0.042 iin     804 0.042 r      4630 0.052 q   
      101 0.041 iin    1090 0.046 iin     366 0.036 che     775 0.041 ee     3641 0.041 t   
       93 0.038 sh      832 0.035 sh      353 0.035 t       723 0.038 iin    3545 0.040 iin 
       76 0.031 s       695 0.029 q       321 0.032 q       670 0.035 she    3384 0.038 ee  
       58 0.023 cth     632 0.027 s       250 0.025 ee      615 0.032 t      3328 0.038 che 
       51 0.021 che     464 0.020 che     207 0.021 ke      476 0.025 ch     1663 0.019 she 
       51 0.021 q       453 0.019 cth     176 0.017 s       377 0.020 ke     1644 0.019 sh  
       36 0.015 m       353 0.015 ee      176 0.017 she     357 0.019 s      1608 0.018 s   
       23 0.009 ee      253 0.011 m       175 0.017 sh      316 0.017 sh     1428 0.016 in  
       22 0.009 in      216 0.009 p       123 0.012 p       194 0.010 te     1370 0.015 ke  
       20 0.008 p       187 0.008 she     113 0.011 m       168 0.009 p       789 0.009 te  
       19 0.008 she     186 0.008 in      110 0.011 te      142 0.007 ckh     734 0.008 p   
       17 0.007 ckh     176 0.007 ckh      89 0.009 ckh     113 0.006 in      632 0.007 m   
       11 0.004 te      130 0.005 ke       74 0.007 f        81 0.004 cth     573 0.006 cth 
        8 0.003 ke       78 0.003 cph      67 0.007 in       72 0.004 m       511 0.006 ckh 
        7 0.003 cph      75 0.003 f        51 0.005 ir       42 0.002 ckhe    379 0.004 ir  
        5 0.002 ir       75 0.003 n        38 0.004 cth      31 0.002 ir      223 0.003 eee 
        4 0.002 ct       70 0.003 te       24 0.002 cthe     29 0.002 cthe    177 0.002 ckhe
        4 0.002 iiin     65 0.003 cthe     19 0.002 ckhe     21 0.001 eee     134 0.002 cthe
        4 0.002 n        59 0.002 ir       13 0.001 eee      21 0.001 f       125 0.001 f   
        3 0.001 cthe     57 0.002 eee      13 0.001 iir      12 0.001 cphe    116 0.001 iiin
        3 0.001 de       47 0.002 ckhe      9 0.001 iiin     10 0.001 n        95 0.001 iir 
        3 0.001 eee      27 0.001 cfh       6 0.001 cphe      7 0.000 cph      82 0.001 cph 
        3 0.001 f        24 0.001 iir       5 0.000 cfh       7 0.000 iiin     67 0.001 n   
        3 0.001 iir      21 0.001 cphe      5 0.000 cph       5 0.000 de       43 0.000 cphe
        2 0.001 cfh      20 0.001 iiin      5 0.000 de        4 0.000 cfh      26 0.000 g   
        2 0.001 ck        8 0.000 de        5 0.000 n         3 0.000 il       21 0.000 im  
        1 0.000 cf        6 0.000 iim       4 0.000 cfhe      2 0.000 iir      18 0.000 cfh 
        1 0.000 ckhe      3 0.000 cfhe      2 0.000 id        1 0.000 pe       12 0.000 ikh 
        1 0.000 cphe      3 0.000 iid       1 0.000 iil                        10 0.000 ct  
        1 0.000 iid       3 0.000 iil                                           8 0.000 ith 
        1 0.000 iim       2 0.000 id                                            7 0.000 ck  
        1 0.000 im        2 0.000 iis                                           7 0.000 iid 
                          1 0.000 il                                            7 0.000 il  
                          1 0.000 is                                            2 0.000 cfhe
                                                                                2 0.000 de  
                                                                                2 0.000 iim 
                                                                                2 0.000 iis 

  In these tables, the "_" entry represents the empty "Q" slot.
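
  The counting step can be sketched in Python as follows.  It assumes
  factored words in the "{Q}O{K}O...{K}O" layout written by
  factor-OK.sed, and it assumes that compute-freqs (not shown in this
  note) simply divides each count by the total; the name
  element_freqs is made up for the example.

```python
import re
from collections import Counter

def element_freqs(factored_words):
    """Count the contents of every {...} slot, including {q} and {_}.

    Returns a list of (element, count, fraction), most frequent first.
    """
    counts = Counter()
    for fac in factored_words:
        counts.update(re.findall(r"\{([^{}]*)\}", fac))
    total = sum(counts.values())
    return [(k, n, n / total) for k, n in counts.most_common()]

if __name__ == "__main__":
    sample = ["{_}o{k}_{eee}_{d}y", "{q}o{k}_{ee}y"]
    for k, n, f in element_freqs(sample):
        print(f"{n:6d} {f:5.3f} {k}")
```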
  
  Let's extract from those tables the elements that are not in the 
  reduced set "K*" and are not simple uses of the `jokers' "p" and "f":
  
    foreach file ( hea-u hea-f heb-f bio-f vdp-z )
      cat ${file}-k.frq \
        | egrep -v ' (([ktpf]|c[ktpf]h|[cs]h|ee)(e|)|[_qlmsdnr]|i[nr]|ii[nr]|iiin)$' \
        > ${file}-knr.frq
    end
  
    multicol {hea-u,hea-f,heb-f,bio-f,vdp-z}-knr.frq

          hea-u            hea-f            heb-f            bio-f           vdp-z   
    ---------------  ---------------  ---------------  --------------  ---------------
        4 0.002 ct       8 0.000 de       5 0.000 de       5 0.000 de     26 0.000 g  
        3 0.001 de       6 0.000 iim      2 0.000 id       3 0.000 il     21 0.000 im 
        2 0.001 ck       3 0.000 iid      1 0.000 iil                     12 0.000 ikh
        1 0.000 cf       3 0.000 iil                                      10 0.000 ct 
        1 0.000 iid      2 0.000 id                                        8 0.000 ith
        1 0.000 iim      2 0.000 iis                                       7 0.000 ck 
        1 0.000 im       1 0.000 il                                        7 0.000 iid
                         1 0.000 is                                        7 0.000 il 
                                                                           2 0.000 de 
                                                                           2 0.000 iim
                                                                           2 0.000 iis
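
  The same filter can be tried out on individual elements in Python.
  The pattern is copied from the egrep command above, anchored to the
  whole element instead of the last field of a line; the names
  REDUCED and unusual are illustrative.

```python
import re

# Elements of K*, plus "q", "_", and plain uses of "p" and "f".
REDUCED = re.compile(
    r"(?:(?:[ktpf]|c[ktpf]h|[cs]h|ee)e?|[_qlmsdnr]|i[nr]|ii[nr]|iiin)")

def unusual(elements):
    """Keep only the elements that the egrep -v filter would let through."""
    return [e for e in elements if not REDUCED.fullmatch(e)]

if __name__ == "__main__":
    print(unusual(["k", "pe", "cphe", "de", "ct", "iin", "g", "il", "_", "q"]))
    # prints ['de', 'ct', 'g', 'il']
```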
                                                                           
  Recall that strings with three or more "e"s have ambiguous parsing,
  which affects the statistics of "ee" and all elements with the "e"
  modifier.  The factor-OK.sed script arbitrarily pairs the "e"s from the
  left, so that such strings are parsed as zero or more "ee"s
  followed by one "ee" or "eee".
  
  To assess the implications of this ambiguity, let's check how
  many ambiguous strings we have in each file:
  
    foreach file ( hea-u hea-f heb-f bio-f vdp-z )
      cat ${file}.wds \
        | egrep -v '[*]' \
        | sed -e 's/[^e]/./g' \
        | tr '.' '\012' \
        | egrep '.' \
        | sort | uniq -c | expand | sort +0 -1nr \
        | compute-freqs | sed -e 's/^  //g' \
        > ${file}-eee.frq
      dicio-wc ${file}-eee.frq
    end

    multicol {hea-u,hea-f,heb-f,bio-f,vdp-z}-eee.frq

          hea-u            hea-f             heb-f            bio-f            vdp-z   
    ---------------  ----------------  ---------------  ---------------  ---------------
       97 0.789 e     1069 0.721 e       952 0.782 e     2187 0.732 e     7593 0.677 e  
       23 0.187 ee     355 0.239 ee      253 0.208 ee     779 0.261 ee    3395 0.303 ee 
        3 0.024 eee     57 0.038 eee      13 0.011 eee     21 0.007 eee    223 0.020 eee
                         2 0.001 eeee                                                   

  Note that, surprisingly, there are practically no words with four or more
  "e"s in a row.  
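
  The tally of "e" runs is easy to reproduce.  A minimal Python
  sketch, assuming a list of EVA words as input (the name
  e_run_counts is made up here); like the pipeline above, it skips
  words containing "*":

```python
import re
from collections import Counter

def e_run_counts(words):
    """Count maximal runs of consecutive "e"s over all words."""
    runs = Counter()
    for w in words:
        if "*" in w:          # unreadable letter: skip the whole word
            continue
        runs.update(re.findall(r"e+", w))
    return runs

if __name__ == "__main__":
    sample = ["okeeedy", "qokeey", "chedy", "ok*ee"]
    for run, n in sorted(e_run_counts(sample).items()):
        print(n, run)
```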
  
  My factoring script will parse the "eee" strings as one "eee"
  element.  In all files, the frequency of the "eee" element is less
  than 0.003 (i.e. 0.3% of the total "K" elements).  Therefore, if I had used
  the other parsing ("e" + "ee"), the frequencies of "ee" and 
  all other "e"-modified elements would increase by less than 0.003
  in total. 
  
  By the way, the low frequency of "eee" probably means that 
  its ambiguity would be no big problem for the intended readers.
  
  In fact, the absence of "eeee"s could be explained by the following
  theory: the letters "ch" and "sh" are officially written "ee" and
  "se"; since that would lead to ambiguities, the scribe
  routinely (but not invariably) adds ligatures to indicate 
  the intended grouping.

Wordlength statistics in the OKOKOKO model
------------------------------------------

  Let's compute statistics on the number of O and K elements in words:
  
    foreach file ( hea-f heb-f bio-f vdp-z )
      cat ${file}.fac \
        | egrep -v '[@%#=]' \
        | sed \
            -e 's/[{}_]/ /g' \
            -e 's/^  *//g' \
            -e 's/  *$//g' \
            -e 's/   */ /g' \
        | egrep -e '.' \
        | count-okokoko-lengths \
        > ${file}-ok.lfr
      dicio-wc ${file}-ok.lfr
    end

    foreach file ( hea-f heb-f bio-f vdp-z )
      cat ${file}.fac \
        | egrep -v '[@%#=]' \
        | sed \
            -e 's/[{}_aoy]/ /g' \
            -e 's/^  *//g' \
            -e 's/  *$//g' \
            -e 's/   */ /g' \
        | egrep -e '.' \
        | count-okokoko-lengths \
        > ${file}-k.lfr
      dicio-wc ${file}-k.lfr
    end

    foreach file ( hea-f heb-f bio-f vdp-z )
      cat ${file}.fac \
        | egrep -v '[@%#=]' \
        | sed \
            -e 's/[{}_]/ /g' \
            -e 's/[^ aoy]/ /g' \
            -e 's/^  *//g' \
            -e 's/  *$//g' \
            -e 's/   */ /g' \
        | egrep -e '.' \
        | count-okokoko-lengths \
        > ${file}-o.lfr
      dicio-wc ${file}-o.lfr
    end

    foreach file ( hea-f heb-f bio-f vdp-z )
      set ff = ( ${file}-{ok,k,o}.lfr )
      multicol -v titles="$ff" $ff
    end

    hea-f-ok.lfr                         hea-f-k.lfr                    hea-f-o.lfr                  
    -----------------------------------  ---------------------------  ---------------------------
    avg length =  3.6                    avg length =  2.2            avg length =  1.5            
                                                                          
    len nwds example                     len nwds example             len nwds example           
    --- ---- ------------------          --- ---- ------------------  --- ---- ------------------
      1  205 sh                            1 1411 r                     1 4340 yay               
      2 1096 a r                           2 4078 f s                   2 2789 y a               
      3 2941 f yay s                       3 1801 sh l d                3  302 o o y             
      4 1737 y k a l                       4  376 sh k ch ee            4   13 o o o y           
      5 1223 sh o l d y                    5   28 t sh d ee s                                      
      6  412 r o l o t y                   6    2 k l s ch ee s                                    
      7   92 t sh o d ee s y               7    0                                                  
      8   10 q o s ch o d a m              8    1 p ch d l ch p ch l                               
      9    1 o p ch o l o t o l                                                                      
     10    1 d a l ch o d o l d y                                                                    
     11    0                                                                                         
     12    1 p ch o d o l ch o p ch a l                                                              

    heb-f-ok.lfr                     heb-f-k.lfr                    heb-f-o.lfr                  
    -------------------------------  -----------------------------  -----------------------------
    avg length =  3.7                avg length =  2.3              avg length =  1.5            

    len nwords example               len nwords example             len nwords example           
    --- ------ ------------------    --- ------ ------------------  --- ------ ------------------
      1     53 l                       1    476 q                     1   1575 oy                
      2    377 q oy                    2   1605 d iir                 2   1342 o y               
      3    999 d che y                 3    857 p she k               3    145 a o y             
      4    932 o d a iir               4    198 q k ee d              4      1 a a o a           
      5    578 p she o k y             5     28 q t ee d r                                       
      6    189 a d ee o d y            6      4 k ee s ch ee s                                   
      7     40 q o t ee d a r                                                                    
      8      6 y k ee d l che d y                                                                
      9      3 o k ee o s ch ee o s                                                              

    bio-f-ok.lfr                     bio-f-k.lfr                    bio-f-o.lfr                  
    -------------------------------  -----------------------------  -----------------------------
    avg length =  3.8                avg length =  2.4              avg length =  1.5            

    len nwords example               len nwords example             len nwords example           
    --- ------ ------------------    --- ------ ------------------  --- ------ ------------------
      1     80 r                       1    828 sh                    1   3252 y                 
      2    692 sh y                    2   2794 k r                   2   2601 a y               
      3   2074 she d y                 3   1996 k che d               3    122 o a y             
      4   1417 k che d y               4    464 q l che d             4      2 o a o o           
      5   1394 q o k a r               5     40 q l k ee l                                       
      6    418 q o l che d y           6      6 q k ee l she d                                   
      7     54 q o k a r d y                                                                     
      8     12 q o l k ee o l y                                                                  
      9      2 q o k ee y l she d y                                                              

    vdp-z-ok.lfr                   vdp-z-k.lfr                    vdp-z-o.lfr                  
    -----------------------------  -----------------------------  -----------------------------
    avg length =  3.7              avg length =  2.3              avg length =  1.5            

    len nwords example             len nwords example             len nwords example           
    --- ------ ------------------  --- ------ ------------------  --- ------ ------------------
      1    614 s                     1   4504 l                     1  14033 a                 
      2   3551 o l                   2  14200 d iin                 2  12928 o y               
      3   8812 d a iin               3   8165 q k ee                3   1052 o o y             
      4   7868 o k a iin             4   1690 q k ee d              4     14 o a o y           
      5   5952 q o k ee y            5     72 q l k ee d                                       
      6   1786 q o k ee d y                                                                    
      7    192 q o k ee o d y                                                                  
      8     22 q o k o l che d y                                                               
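
  The script count-okokoko-lengths is not reproduced in this note.
  Judging from the tables, it tallies words by their number of
  blank-separated tokens.  Under that assumption, the three kinds of
  length can be sketched in Python as follows; the names token_counts
  and length_histogram are illustrative.

```python
import re
from collections import Counter

def token_counts(fac):
    """Return (O+K length, K length, O length) of one factored word.

    Braces and "_" are treated as separators.  A token made only of
    a/o/y is an O group; anything else (including "q") counts as K.
    """
    tokens = re.sub(r"[{}_]", " ", fac).split()
    n_o = sum(1 for t in tokens if set(t) <= set("aoy"))
    return len(tokens), len(tokens) - n_o, n_o

def length_histogram(factored_words, which=0):
    """Histogram of one kind of length (0: O+K, 1: K, 2: O).

    Words of length zero are dropped, as the egrep '.' step does.
    """
    lengths = (token_counts(f)[which] for f in factored_words)
    return Counter(n for n in lengths if n > 0)

if __name__ == "__main__":
    print(token_counts("{_}o{k}_{eee}_{d}y"))   # prints (5, 3, 2)
```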


Frequencies of "K" elements in languages A and B
------------------------------------------------

  In the "K" frequency tables above we can already see a marked difference
  between languages A and B.  Looking only at the reduced element subset K*,
  plus "q" and "_" (meaning no "q"):
  
    foreach file ( hea-f heb-f )
      cat ${file}-k.frq \
        | egrep ' (([kt]|c[kt]h|[cs]h|ee)(e|)|[_qlmsdnr]|i[nr]|ii[nr]|iiin)$' \
        > ${file}-kr.frq
    end
  
    multicol {hea-f,heb-f}-kr.frq
  
          hea-f             heb-f
    ----------------  ----------------
     7024 0.297 _      2856 0.284 _   
     2524 0.107 ch     1459 0.145 d   
     2194 0.093 d       765 0.076 k   
     1702 0.072 l       608 0.060 l   
     1466 0.062 r       600 0.060 r   
     1257 0.053 k       557 0.055 ch  
     1177 0.050 t       424 0.042 iin 
     1090 0.046 iin     366 0.036 che 
      832 0.035 sh      353 0.035 t   
      695 0.029 q       321 0.032 q   
      632 0.027 s       250 0.025 ee  
      464 0.020 che     207 0.021 ke  
      453 0.019 cth     176 0.017 s   
      353 0.015 ee      176 0.017 she 
      253 0.011 m       175 0.017 sh  
      187 0.008 she     113 0.011 m   
      186 0.008 in      110 0.011 te  
      176 0.007 ckh      89 0.009 ckh 
      130 0.005 ke       67 0.007 in  
       75 0.003 n        51 0.005 ir  
       70 0.003 te       38 0.004 cth 
       65 0.003 cthe     24 0.002 cthe
       59 0.002 ir       19 0.002 ckhe
       57 0.002 eee      13 0.001 eee 
       47 0.002 ckhe     13 0.001 iir 
       24 0.001 iir       9 0.001 iiin
       20 0.001 iiin      5 0.000 n   
  
  There is also a less marked but still significant difference between
  herbal-B and bio-B:
  
    foreach file ( heb-f bio-f )
      cat ${file}-k.frq \
        | egrep ' (([kt]|c[kt]h|[cs]h|ee)(e|)|[_qlmsdnr]|i[nr]|ii[nr]|iiin)$' \
        > ${file}-kr.frq
    end
  
    multicol {heb-f,bio-f}-kr.frq
  
          heb-f             bio-f  
    ----------------  ----------------
     2856 0.284 _      4627 0.242 _   
     1459 0.145 d      2512 0.131 d   
      765 0.076 k      2140 0.112 l   
      608 0.060 l      1516 0.079 q   
      600 0.060 r      1422 0.074 k   
      557 0.055 ch      828 0.043 che 
      424 0.042 iin     804 0.042 r   
      366 0.036 che     775 0.041 ee  
      353 0.035 t       723 0.038 iin 
      321 0.032 q       670 0.035 she 
      250 0.025 ee      615 0.032 t   
      207 0.021 ke      476 0.025 ch  
      176 0.017 s       377 0.020 ke  
      176 0.017 she     357 0.019 s   
      175 0.017 sh      316 0.017 sh  
      113 0.011 m       194 0.010 te  
      110 0.011 te      142 0.007 ckh 
       89 0.009 ckh     113 0.006 in  
       67 0.007 in       81 0.004 cth 
       51 0.005 ir       72 0.004 m   
       38 0.004 cth      42 0.002 ckhe
       24 0.002 cthe     31 0.002 ir  
       19 0.002 ckhe     29 0.002 cthe
       13 0.001 eee      21 0.001 eee 
       13 0.001 iir      10 0.001 n   
        9 0.001 iiin      7 0.000 iiin
        5 0.000 n         2 0.000 iir 


  However, most of that difference disappears if we:

    (1) identify the letters { k t p f }, which we have
    good reasons to believe are the same letter;
    
    (2) omit the letter "q", which is believed to be
    a symbol for "and", and hence might be correlated
    with subject matter;
    
    (3) identify "ee" with "ch".
  
    foreach file ( hea-u hea-f heb-f bio-f vdp-z )
      cat ${file}.fac \
        | egrep -v '[@%#=]' \
        | sed \
            -e 's/^[^{}]*{//g' \
            -e 's/}[^{}]*$//g' \
            -e 's/}[^{}]*{/./g' \
            -e 's/[ktpf]/k/g' \
            -e 's/ee/ch/g' \
            -e 's/q//g' \
        | tr '.' '\012' \
        | egrep -e '.' \
        | sort | uniq -c | expand | sort -b +0 -1nr \
        | compute-freqs | sed -e 's/^  //g' \
        | egrep ' (([kt]|c[kt]h|[cs]h|ee)(e|)|[_qlmsdnr]|i[nr]|ii[nr]|iiin)$' \
        > ${file}-krr.frq
      dicio-wc ${file}-krr.frq
    end
  
    multicol {hea-u,hea-f,heb-f,bio-f,vdp-z}-krr.frq

          hea-u             hea-f             heb-f             bio-f             vdp-z   
    ----------------  ----------------  ----------------  ----------------  ----------------
      752 0.310 _      7024 0.306 _      2856 0.293 _      4627 0.263 _     24167 0.288 _   
      315 0.130 ch     2877 0.125 ch     1459 0.150 d      2512 0.143 d     10970 0.131 k   
      244 0.101 k      2725 0.119 k      1315 0.135 k      2226 0.126 k      9928 0.118 d   
      216 0.089 d      2194 0.096 d       807 0.083 ch     2140 0.122 l      8082 0.096 ch  
      183 0.075 l      1702 0.074 l       608 0.062 l      1251 0.071 ch     7523 0.089 l   
      178 0.073 r      1466 0.064 r       600 0.062 r       849 0.048 che    4855 0.058 r   
      101 0.042 iin    1090 0.047 iin     424 0.043 iin     804 0.046 r      3551 0.042 che 
       93 0.038 sh      832 0.036 sh      379 0.039 che     723 0.041 iin    3545 0.042 iin 
       84 0.035 ckh     734 0.032 ckh     317 0.033 ke      670 0.038 she    2159 0.026 ke  
       76 0.031 s       632 0.028 s       176 0.018 s       572 0.032 ke     1663 0.020 she 
       54 0.022 che     521 0.023 che     176 0.018 she     357 0.020 s      1644 0.020 sh  
       36 0.015 m       253 0.011 m       175 0.018 sh      316 0.018 sh     1608 0.019 s   
       22 0.009 in      200 0.009 ke      137 0.014 ckh     234 0.013 ckh    1428 0.017 in  
       19 0.008 ke      187 0.008 she     113 0.012 m       113 0.006 in     1184 0.014 ckh 
       19 0.008 she     186 0.008 in       67 0.007 in       83 0.005 ckhe    632 0.008 m   
        5 0.002 ckhe    136 0.006 ckhe     53 0.005 ckhe     72 0.004 m       379 0.005 ir  
        5 0.002 ir       75 0.003 n        51 0.005 ir       31 0.002 ir      356 0.004 ckhe
        4 0.002 iiin     59 0.003 ir       13 0.001 iir      10 0.001 n       116 0.001 iiin
        4 0.002 n        24 0.001 iir       9 0.001 iiin      7 0.000 iiin     95 0.001 iir 
        3 0.001 iir      20 0.001 iiin      5 0.001 n         2 0.000 iir      67 0.001 n   
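
  For readers who prefer to see the three identifications above as
  plain string operations, here is a minimal Python sketch.  It
  assumes that the "K" elements have already been extracted, one per
  string; the function name and the sample list are made up for
  illustration and are not part of the original scripts or data.

```python
import re
from collections import Counter

def normalize_element(elem):
    """Apply the three identifications to one "K" element (EVA text)."""
    elem = re.sub(r"[ktpf]", "k", elem)   # (1) merge the letters k, t, p, f
    elem = elem.replace("ee", "ch")       # (3) identify "ee" with "ch"
    elem = elem.replace("q", "")          # (2) omit the letter "q"
    return elem

# Hypothetical sample of "K" elements; NOT real corpus counts.
sample = ["k", "t", "cth", "ckh", "ee", "ch", "q", "te", "ke", "d"]

# Elements that become empty (a lone "q") are dropped from the tally.
counts = Counter(e for e in map(normalize_element, sample) if e)
print(counts.most_common())
```

  Note that "eee" becomes "che" under this mapping, just as it does
  with the sed command 's/ee/ch/g' used above.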


Statistics of "O" strings
-------------------------

  Now, what do we do with the "O" strings?  Let's look at their
  statistics:
  
    foreach file ( hea-u hea-f heb-f bio-f vdp-z )
      cat ${file}.fac \
        | egrep -v '[@%#=]' \
        | sed -e 's/{[^{}]*}/./g' \
        | tr '.' '\012' \
        | egrep -e '.' \
        | sort | uniq -c | expand | sort -b +0 -1nr \
        | compute-freqs | sed -e 's/^  //g' \
        > ${file}-ooo.frq
      dicio-wc ${file}-ooo.frq
    end
    
     lines file        
    ------ ------------
         9 hea-u-ooo.frq
        15 hea-f-ooo.frq
        11 heb-f-ooo.frq
        12 bio-f-ooo.frq
        11 vdp-z-ooo.frq

    multicol {hea-u,hea-f,heb-f,bio-f,vdp-z}-ooo.frq

          hea-u           hea-f            heb-f           bio-f            vdp-z   
    --------------  ---------------  --------------  ---------------  --------------
     1364 0.551 _   12782 0.540 _     5371 0.533 _   10295 0.538 _    45585 0.514 _ 
      595 0.240 o    5444 0.230 o     1712 0.170 y    3558 0.186 o    18671 0.211 o 
      262 0.106 y    3069 0.130 y     1616 0.160 o    3413 0.178 y    13615 0.154 y 
      234 0.094 a    2188 0.092 a     1325 0.132 a    1835 0.096 a    10544 0.119 a 
       11 0.004 oa     70 0.003 oa      16 0.002 oa      6 0.000 yo     171 0.002 oa
        6 0.002 oy     59 0.002 oy      11 0.001 oy      4 0.000 oy      51 0.001 oy
        2 0.001 ao     16 0.001 oo       8 0.001 oo      3 0.000 ay      23 0.000 oo
        2 0.001 oo     14 0.001 yo       6 0.001 yo      3 0.000 oa      12 0.000 yo
        1 0.000 yo      5 0.000 ay       2 0.000 ay      2 0.000 ao       6 0.000 ay
                        4 0.000 ya       1 0.000 ao      2 0.000 ya       6 0.000 ya
                        2 0.000 ao       1 0.000 ya      1 0.000 aoy      2 0.000 yy
                        2 0.000 yoa                      1 0.000 oaa                
                        1 0.000 aa                                                  
                        1 0.000 oao                                                 
                        1 0.000 yay                                                 
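
  The "O"-string extraction used for the table above can also be
  sketched in Python.  This is only a sketch: it assumes the .fac
  convention implied by the sed commands (each "K" element, and the
  initial "q" slot, wrapped in braces; "_" marking an empty slot), and
  the three factored words are made up for illustration.

```python
import re
from collections import Counter

def o_slots(factored_word):
    """Return the "O" strings of one factored word.

    Assumed format: {q}o{k}_{ee}_{d}y, where everything outside the
    braces is an "O" string and "_" stands for an empty one.
    """
    pieces = re.sub(r"\{[^{}]*\}", ".", factored_word).split(".")
    return [p for p in pieces if p]   # drop empty pieces, like egrep -e '.'

# Hypothetical factored words; NOT taken from the corpus.
words = ["{q}o{k}_{ee}_{d}y", "{_}_{d}a{iin}_", "{_}o{k}a{l}_"]

freq = Counter(slot for w in words for slot in o_slots(w))
total = sum(freq.values())
for slot, n in freq.most_common():
    print(f"{n:6d} {n / total:.3f} {slot}")
```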

  Thus, the only common alternatives are empty, "y", "a", and "o".  In
  fact, as we know, the alternative "y" is common only in initial and
  final positions; and in those positions it seems to be equivalent to
  "o".
  
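  Note that about half of the "O" slots are filled with a single
  letter; since there is about one "O" slot per "K" element, the
  ratio K:O is about 2:1.  Therefore, if the "K" elements were
  randomly mixed with "O" letters, a slot would hold n letters with
  probability (2/3)(1/3)^n, and the "O" slots should be about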
  
    67% empty, 
    22% single-letter, 
    7% double-letter,  and 
    2% triple-letter. 
    
  Instead we see about 
  
    50% empty, 
    50% single-letter, 
    <1% double-letter, and 
    <0.1% triple-letter.
  
  In fact, triple-letter "O"s are so rare that they can be assumed to
  be errors. In Rene's good-quality word list (vdp-z.wds) there are no
  triple-letter "O"s at all.
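
  The "random mixing" percentages quoted above can be reproduced with
  a few lines of Python.  The sketch assumes the simplest possible
  model, in which every symbol is independently a "K" element or an
  "O" letter in the ratio 2:1; the function name is mine.

```python
def slot_length_probs(k_to_o_ratio, max_len):
    """P(an "O" slot holds n letters), for n = 0..max_len, under random mixing."""
    p_o = 1.0 / (1.0 + k_to_o_ratio)   # chance that a random symbol is an "O" letter
    p_k = 1.0 - p_o                    # chance that it is a "K" element
    # A slot of n letters is n "O" letters in a row, closed by a "K" element.
    return [p_k * p_o ** n for n in range(max_len + 1)]

probs = slot_length_probs(2.0, 3)
for n, p in enumerate(probs):
    print(f"{n} letters: {100 * p:.1f}%")
```

  This gives roughly 66.7%, 22.2%, 7.4% and 2.5% for slots of zero
  to three letters.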
  
Statistics of "K" strings
-------------------------

  Let's now look at the clusters of "K" elements between consecutive
  non-empty "O"s.  To reduce the size of the output, let's map the
  letters { k t p f } to "k", and "ch" to "ee":
  
    foreach file ( hea-f heb-f bio-f vdp-z )
      cat ${file}.fac \
        | egrep -v '[@%#=]' \
        | sed \
            -e 's/^{[q_]}//g' \
            -e 's/^_//g' \
            -e 's/_$//g' \
            -e 's/[oay]/./g' \
            -e 's/[{}]//g' \
            -e 's/[ktpf]/k/g' \
            -e 's/ch/ee/g' \
        | tr '.' '\012' \
        | egrep -e '.' \
        | sort | uniq -c | expand | sort -b +0 -1nr \
        | compute-freqs | sed -e 's/^  //g' \
        > ${file}-kkk.frq
      dicio-wc ${file}-kkk.frq
    end
  
     lines file        
    ------ ------------
       257 hea-f-kkk.frq
       213 heb-f-kkk.frq
       265 bio-f-kkk.frq
       232 vdp-z-kkk.frq

    multicol {hea-f,heb-f,bio-f,vdp-z}-kkk.frq > multi-kkk.frq

          hea-f                      heb-f                       bio-f                        vdp-z   
    -------------------------  --------------------------  ---------------------------  -----------------------
     1733 0.131 d                663 0.136 k                1424 0.165 l                 5345 0.119 k          
     1441 0.109 l                568 0.116 r                1148 0.133 k                 5296 0.118 l          
     1380 0.104 r                550 0.113 d                 729 0.084 r                 4714 0.105 r          
     1237 0.093 k                419 0.086 iin               720 0.083 iin               4146 0.093 d          
     1164 0.088 ee               408 0.084 l                 464 0.054 d                 3534 0.079 iin        
     1078 0.081 iin              172 0.035 ke_d              379 0.044 ke_d              1994 0.045 k_ee       
      931 0.070 k_ee             161 0.033 k_ee_d            331 0.038 k_ee_d            1861 0.042 ee         
      592 0.045 ckh              114 0.023 s                 260 0.030 s                 1424 0.032 in         
      553 0.042 sh               107 0.022 m                 258 0.030 she_d             1232 0.028 s          
      426 0.032 s                105 0.022 k_ee              230 0.027 eee_d             1068 0.024 k_ee_d     
      235 0.018 m                 94 0.019 ee                175 0.020 k_ee              1052 0.023 eee        
      229 0.017 eee               92 0.019 ke                148 0.017 eee               1036 0.023 ke         
      179 0.014 ke                87 0.018 ee_d              147 0.017 she                868 0.019 ke_d       
      174 0.013 in                77 0.016 eee_d             112 0.013 in                 813 0.018 sh         
      149 0.011 k_eee             70 0.014 eee               111 0.013 l_k                643 0.014 she        
      133 0.010 she               63 0.013 in                104 0.012 ke                 632 0.014 m          
      114 0.009 d_ee              60 0.012 sh                 99 0.011 l_eee_d            631 0.014 eee_d      
      112 0.008 ckhe              55 0.011 eee_k              87 0.010 k_eee_d            622 0.014 ckh        
      110 0.008 k_sh              52 0.011 ee_ckh             67 0.008 m                  459 0.010 she_d      
      106 0.008 l_d               51 0.010 l_d                66 0.008 l_d                428 0.010 k_eee      
       64 0.005 n                 50 0.010 ir                 65 0.008 sh                 406 0.009 l_k        
       58 0.004 ee_ckh            49 0.010 she                63 0.007 ee_ckh             398 0.009 k_eee_d    
     .... ..... ..........      .... ..... ...............  .... ..... ................  .... ..... .............  
        1 0.000 ckh_s_ee_s         1 0.000 ee_sh_d             2 0.000 l_l                  6 0.000 she_ke     
        1 0.000 ckh_sh             1 0.000 eee_ckh_d           2 0.000 l_sh_ee_s            5 0.000 d_sh_ee_d  
        1 0.000 ckhe_iin           1 0.000 eee_ckhe            2 0.000 l_she_ckh            5 0.000 il         
        1 0.000 ckhe_k_k_k_l       1 0.000 eee_ckhe_d          2 0.000 l_she_k              5 0.000 sh_ee_k_ee 
        1 0.000 d_ee_ee_ckhe       1 0.000 eee_ee              2 0.000 r_ee_r               4 0.000 d_ee_ee_d  
        1 0.000 d_ee_ee_s          1 0.000 eee_eee             2 0.000 r_eee_k              4 0.000 d_sh_d     
        1 0.000 d_ee_eee           1 0.000 eee_k_ee_ee         2 0.000 r_k                  4 0.000 ee_ee_k_ee 
     .... ..... ..........      .... ..... ...............  .... ..... ................  .... ..... .............  

  Obviously, groups of two or more consecutive "K" elements are quite
  common.  Here is the frequency for each repeat count:
  
    foreach file ( hea-f heb-f bio-f vdp-z )
      cat ${file}.fac \
        | egrep -v '[@%#=]' \
        | sed \
            -e 's/^{[q_]}//g' \
            -e 's/^_//g' \
            -e 's/_$//g' \
            -e 's/[oay]/./g' \
            -e 's/[{}]//g' \
            -e 's/[a-z][a-z]*/x/g' \
        | tr '.' '\012' \
        | egrep -e '.' \
        | sort | uniq -c | expand | sort -b +0 -1nr \
        | compute-freqs | sed -e 's/^  //g' \
        > ${file}-kn.frq
      dicio-wc ${file}-kn.frq
    end
  
     lines file        
    ------ ------------
         5 hea-f-kn.frq
         6 heb-f-kn.frq
         6 bio-f-kn.frq
         4 vdp-z-kn.frq

    multicol {hea-f,heb-f,bio-f,vdp-z}-kn.frq
  
          hea-f                  heb-f                    bio-f                    vdp-z   
    ---------------------  -----------------------  -----------------------  -------------------
    10849 0.819 x           3387 0.694 x             5527 0.640 x            33290 0.744 x      
     2149 0.162 x_x         1034 0.212 x_x           1966 0.228 x_x           8124 0.181 x_x    
      229 0.017 x_x_x        416 0.085 x_x_x         1038 0.120 x_x_x         3077 0.069 x_x_x  
       20 0.002 x_x_x_x       38 0.008 x_x_x_x         99 0.011 x_x_x_x        280 0.006 x_x_x_x
        5 0.000 x_x_x_x_x      5 0.001 x_x_x_x_x        1 0.000 x_x_x_x_x                       
                               2 0.000 x_x_x_x_x_x      1 0.000 x_x_x_x_x_x                     

  So strings of 3 consecutive "K" elements are relatively common, 
  strings of 4 are rare, and no word that occurs twice has 
  5 or more "K"s in a row.  
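
  The repeat counts above can also be obtained without the sed
  pipeline.  Here is a minimal Python sketch that follows the same
  steps; it assumes the same .fac convention as the sed commands
  (braces around each "K" element, "_" for an empty "O" slot), and the
  sample words are made up.

```python
import re
from collections import Counter

def k_runs(factored_word):
    """Lengths of the runs of "K" elements between non-empty "O" strings."""
    s = re.sub(r"^\{[q_]\}", "", factored_word)  # drop the initial "q" slot
    s = re.sub(r"^_|_$", "", s)                  # drop empty initial/final "O" slots
    s = re.sub(r"[oay]", ".", s)                 # every "O" letter becomes a separator
    s = re.sub(r"[{}]", "", s)                   # remove the braces
    # Within a run the elements are joined by "_", e.g. "k_ee_d".
    return [run.count("_") + 1 for run in s.split(".") if run]

# Hypothetical factored words; NOT taken from the corpus.
words = ["{q}o{k}_{ee}_{d}y", "{_}_{d}a{iin}_", "{_}o{k}a{l}_"]

run_freq = Counter(n for w in words for n in k_runs(w))
for n in sorted(run_freq):
    print(n, run_freq[n])
```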

  Recall that about 50% of the "O" slots are empty, and about 50%
  consist of one letter only.  If the "O" slots were filled
  or empty at random, then we would expect the following statistics
  
    0.500 x
    0.250 x_x
    0.125 x_x_x
    0.063 x_x_x_x
    0.031 x_x_x_x_x
    0.016 x_x_x_x_x_x
    0.008 x_x_x_x_x_x_x
    
  So the statistics above suggest that in language A the distribution
  of "O"s is more uniform than would be expected from chance.
  (The evidence is not conclusive, though, because short words
  bias the statistics towards entries with few consecutive "K"s.)
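
  To put the observed repeat counts side by side with the random-fill
  figures, one can use a sketch like the following.  The counts are
  copied from the hea-f and bio-f columns of the table above; the
  "distance" (sum of absolute differences over run lengths 1 to 6) is
  an ad hoc measure of my own choosing, not a proper significance
  test.

```python
def random_run_probs(p_empty, max_len):
    """P(run of n "K" elements) if each inner "O" slot is empty with prob. p_empty."""
    # A run of length n needs n-1 empty slots followed by one filled slot.
    return [(1.0 - p_empty) * p_empty ** (n - 1) for n in range(1, max_len + 1)]

def fractions(counts, length):
    """Relative frequencies, padded with zeros up to the given length."""
    total = sum(counts)
    padded = counts + [0] * (length - len(counts))
    return [c / total for c in padded]

# Counts of runs of 1, 2, 3, ... "K" elements, from the table above.
observed = {
    "hea-f": [10849, 2149, 229, 20, 5],
    "bio-f": [5527, 1966, 1038, 99, 1, 1],
}

expected = random_run_probs(0.5, 6)
distance = {}
for name, counts in observed.items():
    fracs = fractions(counts, 6)
    distance[name] = sum(abs(f - e) for f, e in zip(fracs, expected))
    print(name, " ".join(f"{f:.3f}" for f in fracs), f"distance={distance[name]:.3f}")
print("random", " ".join(f"{e:.3f}" for e in expected))
```

  With these counts the bio-f column comes out closer to the random
  figures than the hea-f column.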
  
  Note the significant difference in K-repeat frequencies for language
  A and language B.  The frequencies for language B are closer to the 
  "random" model.
  
Analysis of "K" and "O" statistics
----------------------------------

  What can we conclude from these numbers?  Let's consider the 
  alternatives:
  
    (1) The EVA letters { a o y } are different Voynichese letters.
    
        This theory does not look very promising: if they were
        different letters, they should belong to the same class
        (vowel, consonant, whatever); but then we would expect to see
        a fair number of diphthongs (double-letter "O" strings),
        which we don't see.
        
    (2) The EVA letters { a o y } are the same Voynichese letter.
    
        This theory could explain why there are so few double-letter
        "O" slots: namely, because the Voynichese letter "o/a/y"
        cannot occur twice in a row (a common restriction in natural
        languages).  
        
    (3) Each "O" string is a modifier for (i.e. a part of) the 
        next "K" element; except for the final "O" string,
        which stands on its own.
    
    (4) Each "O" string is a modifier for the preceding "K" element;
        except for the initial "O" string, which stands on its own.
    
    (5) Some "K" elements may admit "O" letters as post-modifiers, 
        some may admit them as pre-modifiers, some may admit both.
        
        After a quick look, I would guess that

          { sh ch ee she che eee }            admit "a/o/y" only as post-modifiers
          { r l m n ir iir in iin iiin }      admit "a/o/y" only as pre-modifiers
          { s d k t cth ckh ke te cthe ckhe } admit "a/o/y" in both positions.

        But this hunch needs to be confirmed...
    
    (6) None of the above.
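
  One way the hunch in alternative (5) might be checked is to count,
  for each "K" element, how often the "O" slot just before it and the
  slot just after it are non-empty.  The sketch below is only an
  illustration: the .fac format is assumed (braces around "K"
  elements, "_" for an empty slot), the function names are mine, and
  the sample words are made up.

```python
import re
from collections import defaultdict

# A token is either a braced "K" element or an unbraced "O" string.
TOKEN = re.compile(r"\{([^{}]*)\}|([^{}]+)")

def is_filled(token):
    """True if the token is an "O" string with at least one letter."""
    kind, text = token
    return kind == "O" and text.strip("_") != ""

def neighbor_counts(factored_words):
    """For each "K" element: occurrences, and how many have a filled "O" before/after."""
    stats = defaultdict(lambda: {"total": 0, "pre": 0, "post": 0})
    for word in factored_words:
        tokens = [("K", m.group(1)) if m.group(1) is not None else ("O", m.group(2))
                  for m in TOKEN.finditer(word)]
        for i, (kind, text) in enumerate(tokens):
            if kind != "K" or text in ("q", "_", ""):
                continue                      # skip "O" strings and the "q" slot
            entry = stats[text]
            entry["total"] += 1
            if i > 0 and is_filled(tokens[i - 1]):
                entry["pre"] += 1
            if i + 1 < len(tokens) and is_filled(tokens[i + 1]):
                entry["post"] += 1
    return stats

# Hypothetical factored words; NOT taken from the corpus.
words = ["{q}o{k}_{ee}_{d}y", "{_}_{d}a{iin}_", "{_}o{k}a{l}_"]
stats = neighbor_counts(words)
for elem in sorted(stats):
    print(elem, stats[elem])
```

  Note that an "O" string lying between two "K" elements is counted
  both as "post" for the first and as "pre" for the second, so these
  counts by themselves cannot decide which element it belongs to.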
  
Appendix: A more flexible factoring script
------------------------------------------

  The logic of factor-OK.sed was rewritten in AWK as
  a "factor-field-OK" script that allows one to factor a
  selected field of a multifield file.
  
  Checking consistency of the two scripts:
  
    foreach file ( hea-u hea-f heb-f bio-f vdp-z )
      echo ${file}-old
      cat ${file}.wds \
        | sed -f factor-OK.sed \
        > .${file}-old.fac
      echo ${file}-new
      cat ${file}.wds \
        | factor-field-OK \
        | gawk '/./{ print $1; }' \
        > .${file}-new.fac
      dicio-wc .${file}-{old,new}.fac 
      diff .${file}-{old,new}.fac 
      /bin/rm .${file}-{old,new}.fac 
    end
    
  The differences are confined to words that factor-OK.sed cannot parse.
  The new script forcibly factors those words into elements
  {i+X}, {X[eh]}, or {X}, where {X} is a character other than [aoy].