Hacking at the Voynich manuscript - Side notes
030 Combining the p-m-s and OKOKOKO paradigms

Last edited on 1999-02-02 14:48:15 by stolfi

This note attempts to rederive the p-m-s paradigm in light of the
OKOKOKO decomposition.

First, let's get the word lists for each section:

    set sections = ( `cat ../023/text-sections/all.names` )
    set seccom = `echo ${sections} | tr ' ' ','`

    mkdir data
    foreach sec ( $sections )
      echo "${sec}"
      mkdir data/${sec}
      cat ../023/text-sections/${sec}.evt \
        | words-from-evt \
        | tr '*' '?' \
        > data/${sec}/words.wds
    end

    mkdir data/all
    cat data/{${seccom}}/words.wds > data/all/words.wds

    mkdir data/ren
    cat Rene-words.frq \
      | gawk '/./{n=$1;for(i=0;i<n;i++){print $2;}}' \
      > data/ren/words.wds

    dicio-wc data/{${seccom},ren,all}/words.wds

      lines   words     bytes file
    ------- ------- --------- ------------
       2210    2210     13427 data/unk/words.wds
       2211    2210     13467 data/pha/words.wds
      10358   10356     66804 data/str/words.wds
       7584    7584     44680 data/hea/words.wds
       3338    3336     20045 data/heb/words.wds
       6539    6539     37738 data/bio/words.wds
        324     324      2089 data/ast/words.wds
        389     387      2264 data/cos/words.wds
        169     169      1018 data/zod/words.wds
      28939   28939    172850 data/ren/words.wds
      33122   33115    201532 data/all/words.wds

Factor them into elements and separate the unreadable words and
parsing bugs:

    foreach sec ( $sections ren all )
      echo ${sec}
      cat data/${sec}/words.wds \
        | factor-OK \
        > data/${sec}/words.fac
      cat data/${sec}/words.fac \
        | egrep -e '^{[{}a-z?]*}$' \
        > data/${sec}/words-gut.fac
      cat data/${sec}/words.fac \
        | egrep -v -e '^{[{}a-z?]*}$' \
        > data/${sec}/words-bad.fac
      egrep -e '^[^{]' data/${sec}/words-gut.fac | head -5
      egrep -e '[^}]$' data/${sec}/words-gut.fac | head -5
      egrep -e '[}][^{]' data/${sec}/words-gut.fac | head -5
      egrep -e '[^}][{]' data/${sec}/words-gut.fac | head -5
    end

    dicio-wc data/{${seccom},ren,all}/words-{gut,bad}.fac

      lines   words     bytes file
    ------- ------- --------- ------------
       2190    2190     30362 data/unk/words-gut.fac
         20      20       279 data/unk/words-bad.fac
       2112    2112     28599 data/pha/words-gut.fac
         99      98      1439 data/pha/words-bad.fac
      10311   10311    149736 data/str/words-gut.fac
         47      45       678 data/str/words-bad.fac
       7532    7532     98697 data/hea/words-gut.fac
         52      52       721 data/hea/words-bad.fac
       3318    3318     45009 data/heb/words-gut.fac
         20      18       270 data/heb/words-bad.fac
       6530    6530     85290 data/bio/words-gut.fac
          9       9       114 data/bio/words-bad.fac
        312     312      4495 data/ast/words-gut.fac
         12      12       201 data/ast/words-bad.fac
        376     376      4954 data/cos/words-gut.fac
         13      11       174 data/cos/words-bad.fac
        162     162      2240 data/zod/words-gut.fac
          7       7       104 data/zod/words-bad.fac
      28939   28939    389191 data/ren/words-gut.fac
          0       0         0 data/ren/words-bad.fac
      32843   32843    449382 data/all/words-gut.fac
        279     272      3980 data/all/words-bad.fac
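(For the record, here is a rough stand-in for factor-OK -- a sketch
only, assuming elem.dic lists one element per line; the real tool may
resolve ambiguous splits differently.  Save as factor-sketch.gawk and
run "gawk -f factor-sketch.gawk data/bio/words.wds":

    # Greedy longest-match factoring: wrap each element in braces.
    BEGIN {
      # load the element dictionary, remembering the longest entry
      while ((getline e < "elem.dic") > 0) {
        elem[e] = 1;
        if (length(e) > mx) { mx = length(e); }
      }
    }
    /./ {
      w = $0; out = "";
      while (length(w) > 0) {
        hit = 0;
        for (k = mx; k >= 1; k--) {
          p = substr(w, 1, k);
          if (p in elem) { hit = 1; break; }
        }
        if (! hit) { k = 1; p = substr(w, 1, 1); }  # unknown char, pass through
        out = out "{" p "}";
        w = substr(w, k + 1);
      }
      print out;
    }

Under this scheme a word like "qokaiin" comes out as
{q}{o}{k}{a}{iin}.)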
Count the elements:

    foreach sec ( $sections ren all )
      echo "${sec}"
      cat data/${sec}/words-gut.fac \
        | tr '{}' '\012\012' \
        | egrep '.' \
        | sort | uniq -c | expand \
        | sort -b +0 -1nr \
        > data/${sec}/elem.cts
    end

    multicol \
      -v titles="all ren $sections" \
      data/{all,ren,$seccom}/elem.cts \
      > elem.multi

    compare-counts \
      -titles "all ren $sections element" \
      -remFreqs \
      -sort 1 \
      data/{all,ren,$seccom}/elem.cts \
      > elem.cmp

    all ren unk pha str hea heb bio ast cos zod  element
    --- --- --- --- --- --- --- --- --- --- ---  -------
    834 826 846 773 848 800 864 847 828 852 798  o
    712 699 728 676 736 689 719 700 700 725 711  y
    615 602 600 600 622 611 609 622 608 603 572  a
    524 510 512 524 536 532 488 516 523 527 516  d
    452 440 437 435 467 470 437 424 482 435 431  l
    394 380 382 389 402 425 373 362 436 388 394  k
    346 336 336 349 363 333 327 342 379 343 356  ch
    300 291 276 298 319 280 278 308 326 297 296  r
    259 248 246 262 274 253 251 242 311 271 294  q
    225 215 209 228 235 215 217 226 286 234 269  iin
    192 181 168 215 201 171 186 200 253 183 193  t
    161 151 145 184 163 154 155 165 223 159 177  che
    131 119 128 149 119 142 134 131 171 135 123  ee
    113 104 105 136 107 112 119 117 159 106 110  sh
     97  89  92 113  96  89 104 102 123  89  95  s
     82  74  77 101  82  83  90  74 110  74  80  she
     70  61  69  77  71  78  73  58  90  65  72  ke
     60  54  56  71  58  70  62  51  81  57  63  p
     51  41  53  67  47  63  56  33  78  54  55  in
     44  35  43  63  40  54  47  30  72  42  49  m
     37  28  36  57  33  51  37  22  60  34  20  te
     31  23  31  53  31  35  34  18  56  29  19  cth
     26  18  28  44  27  29  27  12  51  26  19  ckh
     22  14  21  41  20  27  23  11  43  21  19  ir
     19  12  19  39  16  25  21  10  39  18  12  eee
     17  11  15  36  14  23  15   9  38  18  12  f
     15  10  13  34  12  20  14   9  31  17   9  oa
     13   9  11  30  10  18  11   7  21  11   6  e?
     11   7  10  23   9  17  10   5  20  10   6  ckhe
     10   6   9  20   8  14   8   5  16   8   4  cthe
      8   5   7  19   7  12   7   4  14   8   4  cph
      7   5   6  16   7  10   6   4  11   7   4  oy
      6   4   6  16   6   7   6   4  11   7   4  n
      6   3   5  15   5   6   5   4   8   6   3  iir
      5   2   5  14   5   5   4   3   8   6   3  iiin
      4   2   5  13   4   5   4   2   6   6   3  i?
      4   2   3  12   3   4   3   2   6   6   3  cphe
      3   2   3  10   3   4   3   2   6   5   3  oo
      3   1   1  10   3   3   2   2   6   5   3  cfh
      3   1   1  10   2   3   2   2   5   5   3  im
      2   1   1   9   2   2   1   2   5   3   3  yo
      2   1   1   8   1   2   1   2   5   3   3  de
      2   1   1   5   1   2   1   2   4   3   3  iiir
      1   1   1   5   1   2   1   1   3   3   3  il
      1   1   1   2   1   2   1   1   0   3   3  j
      1   1   0   2   1   2   0   1   0   3   3  x
      1   1   0   2   0   2   0   1   0   3   3  is
      1   1   0   2   0   2   0   1   0   2   3  ya
      1   1   0   1   0   1   0   1   0   2   3  cfhe
      1   1   0   1   0   1   0   1   0   1   .  ay
      0   1   0   0   0   1   0   1   0   1   .  ao
      0   1   0   0   0   1   0   1   0   1   .  iim
      0   0   0   0   0   1   0   1   0   1   .  g
      0   0   0   0   0   1   0   1   0   1   .  ck
      0   0   0   0   0   0   0   1   0   1   .  iil
      0   0   0   0   0   0   0   1   0   1   .  ct
      0   0   0   0   0   0   .   1   0   1   .  id
      0   0   0   0   0   0   .   1   0   1   .  iid
      0   0   0   0   0   0   .   0   0   1   .  cthh
      0   0   0   0   0   0   .   0   0   1   .  b
      0   0   0   0   0   0   .   0   0   1   .  cphh
      0   0   0   0   0   0   .   0   0   1   .  ikh
      0   0   0   0   0   0   .   0   0   1   .  aa
      0   0   0   0   0   0   .   0   0   1   .  c?
      0   0   0   0   0   0   .   0   0   1   .  iis
      0   0   0   0   0   0   .   0   0   1   .  yoa
      0   0   0   0   0   0   .   0   0   1   .  iiil
      0   0   0   0   0   0   .   0   0   1   .  ikhe
      0   0   0   0   0   0   .   0   0   1   .  aoy
      0   0   0   0   0   0   .   0   0   1   .  cf
      0   0   0   0   0   0   .   0   0   1   .  chh
      0   0   0   0   0   0   .   0   0   1   .  cp
      0   0   0   0   0   0   .   0   0   1   .  h?
      0   0   0   0   .   0   .   0   0   1   .  iiid
      0   0   0   0   .   0   .   0   0   0   .  ij
      0   0   0   0   .   0   .   0   0   0   .  iph
      0   0   0   0   .   0   .   0   0   0   .  ith
      0   0   0   0   .   0   .   0   0   0   .  ithe
      0   0   0   0   .   0   .   .   0   0   .  ithh
      0   0   0   0   .   0   .   .   0   0   .  oao
      0   0   0   0   .   0   .   .   0   0   .  ooa
      0   0   0   0   .   0   .   .   .   0   .  ooooooooo
      0   0   0   0   .   0   .   .   .   0   .  oya
      0   0   0   .   .   0   .   .   .   0   .  pe
      0   0   0   .   .   .   .   .   .   0   .  u
      0   0   .   .   .   .   .   .   .   0   .  yay
      .   0   .   .   .   .   .   .   .   .   .  yy
      .   0   .   .   .   .   .   .   .   .   .  cfhh
      .   0   .   .   .   .   .   .   .   .   .  ckhh
      .   .   .   .   .   .   .   .   .   .   .  kh
      .   .   .   .   .   .   .   .   .   .   .  v
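(An aside on notation: the "sort -b +0 -1nr +1 -2" calls used
throughout these pipelines are in the old sort key syntax; "+0 -1nr"
means "first field, numeric, descending".  On a modern POSIX sort the
equivalent would be "sort -b -k1,1nr -k2,2".)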
Check whether our list of elements is complete:

    cat data/{all,ren}/elem.cts | gawk '/./{print $2}' | sort | uniq > .bar
    cat elem.dic | sort > .foo
    bool 1-2 .bar .foo

    cat elem-to-class.tbl | gawk '/./{print $1}' | sort > .baz
    bool 1-2 .baz .foo

Now let's enumerate all pairs of non-empty elements, consecutive and
non-consecutive, in each word:

    foreach ptpn ( sep.0 con.1 )
      set ptp = "${ptpn:r}"; set pfl = "${ptpn:e}"
      foreach sec ( $sections ren all )
        echo "Enumerating ${ptp} element pairs for ${sec}..."
        cat data/${sec}/words-gut.fac \
          | nice enum-elem-pairs -v consecutive=${pfl} \
          | tr -d '{}' \
          | sort | uniq -c | expand \
          | gawk '/./{printf "%7d %s:%s\n", $1,$2,$3;}' \
          | sort +0 -1nr +1 -2 \
          > data/${sec}/elem-${ptp}-pair.cts
      end
      multicol \
        -v titles="all ren ${sections}" \
        data/{all,ren,$seccom}/elem-${ptp}-pair.cts \
        > elem-${ptp}-pair.multi
      compare-counts \
        -titles "all ren $sections pair" \
        -freqs \
        -sort 1 \
        data/{all,ren,$seccom}/elem-${ptp}-pair.cts \
        > elem-${ptp}-pair.cmp
    end

Tabulate element pairs, collapsing elements into classes:

    foreach ptp ( sep con )
      foreach sec ( ${sections} ren all )
        echo "=== ${ptp} pairs for ${sec} ========================"
        cat data/${sec}/elem-${ptp}-pair.cts \
          | tr ':' ' ' \
          | map-fields \
              -v table=elem-to-class.tbl \
              -v fields="2,3" \
          | gawk '/./{printf "%7d %s:%s\n", $1,$2,$3;}' \
          | combine-counts | sort -b +0 -1nr +1 -2 \
          > data/${sec}/class-${ptp}-pair.cts
        foreach ttpn ( freqs.3 counts.5 )
          set ttp = "${ttpn:r}"; set dig = "${ttpn:e}"
          cat data/${sec}/class-${ptp}-pair.cts \
            | tr ':' ' ' | gawk '/./{print $1,"*",$2,$3;}' \
            | tabulate-triple-counts \
                -v rows=elem-classes.dic \
                -v cols=elem-classes.dic \
                -v ${ttp}=1 -v digits=${dig} \
            > data/${sec}/class-${ptp}-pair.${ttp}
        end
      end
    end

Here is a typical "sep" table, for the "bio" section:

    Pairs with key = *
    Pair probabilities (×999):

           Q    O    S    D    X    H    N    I    W  ETC    TOT
         ---  ---  ---  ---  ---  ---  ---  ---  ---  ---  -----
    Q      .   79   13   14   11   36    .    7    .    .    164
    O      .   76   84   27   25   61    2   36    1    .    317
    S      .   35   11   11   14    8    .    3    .    .     86
    D      .   68    9    2    2    .    .    4    .    .     88
    X      .   83   12   42    8   12    .    .    .    .    161
    H      .   86   20   29   24    .    .   14    .    .    177
    N      .    .    .    .    .    .    .    .    .    .      0
    I      .    .    .    .    .    .    .    .    .    .      0
    W      .    1    .    .    .    .    .    .    .    .      3
    ETC    .    .    .    .    .    .    .    .    .    .      0
         ---  ---  ---  ---  ---  ---  ---  ---  ---  ---  -----
    TOT    .  432  151  128   88  121    5   67    4    .    999

Note that the classes H and X are rarely preceded by D but often
followed by it. I suppose most of these cases are final "dy"s.
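(Similarly, a rough gawk equivalent of enum-elem-pairs -- a sketch,
not the real tool.  It assumes the output format is "{e1} {e2}", one
pair per line, which is what the "tr -d '{}'" step above expects.
Save as pairs-sketch.gawk and run
"gawk -v consecutive=0 -f pairs-sketch.gawk data/bio/words-gut.fac":

    # Enumerate ordered pairs of elements from factored words.
    /./ {
      # split the factored word into its braced elements
      n = split($0, f, /[{}]+/);
      m = 0;
      for (i = 1; i <= n; i++) { if (f[i] != "") { el[++m] = f[i]; } }
      # print each pair in order of occurrence; with consecutive=1,
      # only adjacent pairs are printed
      for (i = 1; i < m; i++) {
        jmax = (consecutive ? i + 1 : m);
        for (j = i + 1; j <= jmax; j++) {
          printf "{%s} {%s}\n", el[i], el[j];
        }
      }
    }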
Now let's extract all subsequences of three non-empty elements from
each word:

    foreach ptpn ( sep.0 )
      set ptp = "${ptpn:r}"; set pfl = "${ptpn:e}"
      foreach sec ( $sections ren all )
        echo "Enumerating ${ptp} element triples for ${sec}..."
        cat data/${sec}/words-gut.fac \
          | nice enum-elem-triples -v consecutive=${pfl} \
          | tr -d '{}' \
          | sort | uniq -c | expand \
          | gawk '/./{printf "%7d %s:%s:%s\n", $1,$2,$3,$4;}' \
          | sort +0 -1nr +1 -2 \
          > data/${sec}/elem-${ptp}-triple.cts
      end
      multicol \
        -v titles="all ren ${sections}" \
        data/{all,ren,$seccom}/elem-${ptp}-triple.cts \
        > elem-${ptp}-triple.multi
      compare-counts \
        -titles "all ren $sections triple" \
        -freqs \
        -sort 1 \
        data/{all,ren,$seccom}/elem-${ptp}-triple.cts \
        > elem-${ptp}-triple.cmp
    end

Tabulate triples sliced by middle element, first collapsing similar
letters:

    foreach ptp ( sep )
      foreach sec ( ${sections} ren all )
        echo "=== ${ptp} triples for ${sec} ========================"
        cat data/${sec}/elem-${ptp}-triple.cts \
          | tr ':' ' ' \
          | map-fields \
              -v table=elem-to-class.tbl \
              -v fields="2,3,4" \
          | gawk '/./{printf "%7d %s:%s:%s\n", $1, $2,$3,$4;}' \
          | combine-counts | sort -b +0 -1nr +1 -2 \
          > data/${sec}/class-${ptp}-triple.cts
        foreach ttpn ( freqs.3 counts.5 )
          set ttp = "${ttpn:r}"; set dig = "${ttpn:e}"
          cat data/${sec}/class-${ptp}-triple.cts \
            | tr ':' ' ' | gawk '/./{print $1,$3,$2,$4;}' \
            | sort -b +1 -2 +0 -1nr \
            | tabulate-triple-counts \
                -v rows=elem-classes.dic \
                -v cols=elem-classes.dic \
                -v ${ttp}=1 -v digits=${dig} \
            > data/${sec}/class-${ptp}-triple.${ttp}
        end
      end
    end

It seems that, if we ignore the O's and Q's, most words have a
"midfix" consisting of D, X, and H elements, with a prefix of S
letters, and a suffix of S and D elements.

Let's add to the factored word tables a second column with the
element classes:

    foreach sec ( ${sections} ren all )
      echo "$sec"
      cat data/${sec}/words-gut.fac \
        | gawk \
            ' /./{ \
                s = $1; \
                gsub(/[{}]/, " ", s); gsub(/  */, " ", s); \
                gsub(/^ */, "", s); gsub(/ *$/, "", s); \
                printf "%s %s\n", $0, s; \
              } ' \
        | map-fields \
            -v table=elem-to-class.tbl \
            -v forgiving=1 \
        | gawk '/./{ \
              e=$1; $1=""; c=$0; \
              gsub(/^ */,":",c); gsub(/ *$/,":",c); \
              gsub(/  */, ":", c); \
              printf "%s %s\n", c,e; \
            } ' \
        | sort | uniq -c | expand \
        | sort -b +0 -1nr +1 -2 \
        > data/${sec}/class-wds.cts
    end

Looking at Rene's words, it seems that most words have at most one H,
and that all X's are consecutive and adjacent to it (except for the
intrusion of "O"s in some languages).
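(To make the format concrete: assuming elem-to-class.tbl maps q to Q,
o and a to O, k to H, and iin to I, a word like qokaiin should appear
in class-wds.cts as a line of the form

      123 :Q:O:H:O:I: {q}{o}{k}{a}{iin}

with 123 replaced by the word's actual count in that section.)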
Let's tabulate the patterns of H and X elements, after removing the O
elements, and any prefix or suffix consisting of elements other than
H and X:

    foreach sec ( ${sections} ren all )
      echo "$sec"
      cat data/${sec}/class-wds.cts \
        | gawk '/./{print $1,$2}' \
        | sed \
            -e 's/[O]://g' \
            -e 's/ :[A-GI-WY-Z:]*/ :/' \
            -e 's/:[A-GI-WY-Z:]*$/:/' \
        | combine-counts | sort -b +0 -1nr +1 -2 \
        > data/${sec}/class-midf.cts
    end

    multicol \
      -v titles="all ren ${sections}" \
      data/{all,ren,${seccom}}/class-midf.cts \
      > class-midf.multi

    compare-counts \
      -titles "all ren $sections midfix" \
      -remFreqs \
      -sort 1 \
      data/{all,ren,$seccom}/class-midf.cts \
      > class-midf.cmp

Now let's tabulate the prefix, suffix, and unifix patterns, omitting
the Q's and O's:

    foreach sec ( ${sections} ren all )
      echo "$sec"
      cat data/${sec}/class-wds.cts \
        | gawk '/./{print $1,$2}' \
        | egrep -v -e '[HX]:' \
        | sed \
            -e 's/[O]://g' \
            -e 's/ :[Q:]*/ :/' \
        | combine-counts | sort -b +0 -1nr +1 -2 \
        > data/${sec}/class-unif.cts
    end

    foreach sec ( ${sections} ren all )
      echo "$sec"
      cat data/${sec}/class-wds.cts \
        | gawk '/./{print $1,$2}' \
        | egrep -e '[HX]:' \
        | sed \
            -e 's/[O]://g' \
            -e 's/ :[Q:]*/ :/' \
            -e 's/[HX]:.*//' \
        | combine-counts | sort -b +0 -1nr +1 -2 \
        > data/${sec}/class-pref.cts
    end

    foreach sec ( ${sections} ren all )
      echo "$sec"
      cat data/${sec}/class-wds.cts \
        | gawk '/./{print $1,$2}' \
        | egrep -e '[HX]:' \
        | sed \
            -e 's/[O]://g' \
            -e 's/ :[Q:]*/ :/' \
            -e 's/ :.*[HX]:/ :/' \
        | combine-counts | sort -b +0 -1nr +1 -2 \
        > data/${sec}/class-suff.cts
    end

    foreach sec ( ${sections} ren all )
      echo "$sec"
      cat data/${sec}/class-wds.cts \
        | gawk '/./{print $1,$2}' \
        | egrep -e '[HX]:' \
        | sed \
            -e 's/[O]://g' \
            -e 's/:[A-PR-Z:]*$/:/' \
        | combine-counts | sort -b +0 -1nr +1 -2 \
        > data/${sec}/class-qhaf.cts
    end

    foreach sec ( ${sections} ren all )
      echo "$sec"
      cat data/${sec}/class-wds.cts \
        | gawk '/./{print $1,$2}' \
        | egrep -v -e '[HX]:' \
        | sed \
            -e 's/[O]://g' \
            -e 's/:[A-PR-Z:]*$/:/' \
        | combine-counts | sort -b +0 -1nr +1 -2 \
        > data/${sec}/class-qsof.cts
    end

Let's make comparative tables:

    foreach fix ( midf unif pref suff qhaf qsof )
      multicol \
        -v titles="all ren ${sections}" \
        data/{all,ren,${seccom}}/class-${fix}.cts \
        > class-${fix}.multi
      compare-counts \
        -titles "all ren ${sections} ${fix}ix-pattern" \
        -freqs \
        -sort 1 \
        data/{all,ren,${seccom}}/class-${fix}.cts \
        > class-${fix}.cmp
    end

THE OKOKOKO AND PMS PARADIGMS COMBINED

To a first approximation, the Voynichese words can be decomposed into
the following "elements":

    Q = { q }
    O = { a o y }                                 ("circles")
    H = { t te cth cthe k ke ckh ckhe
          p pe cph cphe f fe cfh cfhe }           ("gallows")
    X = { ch che sh she ee eee }                  ("tables")
    R = { d l r s }                               ("dealers")
    F = { n m g j in iin ir iir }                 ("finals")
    W = { e i cthh ith kh ct iiim iir is ETC. }   ("weirdos")

The "p" and "f" elements are almost certainly calligraphic variants
of the corresponding "t" and "k" elements.

There are two classes of words: the "hard" ones, which contain Hs
and/or Xs, and the "soft" ones, which don't. Let's ignore the O's for
the moment. The "hard" words have the form

    Q^a R^b X^c H^d X^e R^f F^g

where

    a   = 0 (86%) or 1 (14%)
    b   = 0 (90%) or 1 ( 9%)
    d   = 0 (49%) or 1 (49%)
    c+e = 0 (52%) or 1 (43%) or 2 ( 4%)
    f   = 0 (42%) or 1 (53%) or 2 ( 4%)
    g   = 0 (85%) or 1 (14%)

The "soft" words have the form

    Q^w R^x F^y

where

    w = 0 (95%) or 1 ( 5%)
    x = 0 (12%) or 1 (58%) or 2 (22%) or 3 ( 2%)
    y = 0 (55%) or 1 (40%)

The "soft" schema above can be interpreted as a special case of the
"hard" schema with no X or Hs (i.e. c+d+e = 0), although the
probabilities are somewhat different.
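For example, qokaiin factors as {q}{o}{k}{a}{iin}; ignoring the O's
leaves Q H F, which fits the "hard" schema with a = d = g = 1 and
b = c = e = f = 0. Likewise daiin factors as {d}{a}{iin} and reduces
to R F, a "soft" word with w = 0, x = 1, y = 1.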
Said another way, the typical Voynichese word has a "midfix" (kernel,
root), possibly empty, consisting of at most one gallows surrounded
by at most two tables. To the midfix is attached a prefix having at
most one "q" and at most one dealer; and a suffix with at most two
dealers and at most one final.

Let's now check how many words fit this paradigm:

    foreach sec ( ${sections} ren all )
      echo "$sec"
      cat data/${sec}/class-wds.cts \
        | gawk '/./{print $1,$2}' \
        | sed -e 's/[O]://g' \
        > /tmp/.foo
      cat /tmp/.foo \
        | egrep -e ' :(Q:|)([SD]:|)(X:X:H:|H:X:X:|(X:|)(H:|)(X:|))([SD]:|)([SD]:|)([NI]:|)$' \
        | combine-counts | sort -b +0 -1nr +1 -2 \
        > /tmp/.foogut
      cat /tmp/.foo \
        | egrep -v -e ' :(Q:|)([SD]:|)(X:X:H:|H:X:X:|(X:|)(H:|)(X:|))([SD]:|)([SD]:|)([NI]:|)$' \
        | combine-counts | sort -b +0 -1nr +1 -2 \
        > /tmp/.foobad
      /bin/rm -f data/${sec}/class-fit.cts
      cat /tmp/.foogut | sed -e 's/ :/ +:/' >> data/${sec}/class-fit.cts
      cat /tmp/.foobad | sed -e 's/ :/ -:/' >> data/${sec}/class-fit.cts
    end

This paradigm fits 97% of all words in Rene's list (with
multiplicities) and 94% of all words in the interlinear file. The
remaining 6% of the latter includes the words containing "wild" (W)
elements (3% of all words) and long words that look like two words
joined together.

The most common patterns in the interlinear that do not fit the
paradigm and do not contain a wild element are

     46 :H:H:
     44 :H:X:H:
     38 :H:S:X:
     32 :H:S:X:D:
     26 :H:S:X:S:
     20 :D:I:S:
     19 :D:S:X:D:
     19 :H:H:S:
     19 :X:D:X:
     18 :S:I:S:
     18 :X:X:H:X:
     15 :I:S:
     15 :S:S:X:
     15 :X:S:H:
     14 :H:I:S:
     13 :D:S:X:
     12 :D:I:D:
     12 :S:S:X:D:

where S = { s l r }, D = { d }, I = { in iin ir iir },
N = { n m g j }.

TO MERGE WITH THE ABOVE:

[ 1999-02-02 ]

Word pattern frequencies
------------------------

It is instructive to analyze the frequency of each word pattern,
i.e. the result of collapsing the letters into the classes
{ Q O X I R E } or { Q O K } defined below. For this study we will
use the majority-vote transcription, which includes Takeshi's new
full transcription.

For simplicity, let's discard all data containing weirdos, extra
plumes, unreadable characters, or the rare letters [buxvz]. Let's
also map the upper-case EVA letters [SCIKTPF] to their lower-case
variants, since the capitalization carries no information in those
cases.

    cat ../045/only-m.evt \
      | egrep -e '^<[^<>]*;A>' \
      | tr 'SCIKTPF' 'sciktpf' \
      | tr -d '\!' \
      | sed \
          -e 's/^<[^<>]*> *//g' \
          -e 's/[{][^{}]*[}]//g' \
          -e 's/[&][0-9*?][0-9*?]*[;]\?/*/g' \
          -e 's/[buxvz]/*/g' \
          -e 's/[.,]*-[-.,]*/-/g' \
          -e 's/[,]*[.][,.]*/./g' \
          -e 's/[,][,]*/,/g' \
          -e 's/.['"'"'"]/?/g' \
          -e 's/[^-,./= ]*[%?*][^-,./= ]*/?/g' \
      > base.txt

Let's reduce the alphabet to letter classes as follows:

    O = [aoy]
    I = [i]+
    Q = [q]
    E = unattached [eh]
    R = [djmg] and [rlsn]
    X = ee, [csi][h], [ci][ktpf][h], [c][ktpf], [ktpf]

The following hack should do it:

    cat base.txt \
      | sed \
          -e 's/ee/X/g' \
          -e 's/[csi][h]/X/g' \
          -e 's/[ci][ktpf][h]/X/g' \
          -e 's/[c][ktpf]/X/g' \
          -e 's/[ktpf]/X/g' \
          -e 's/[rlsn]/R/g' \
          -e 's/[mdgj]/R/g' \
          -e 's/[aoy]/O/g' \
          -e 's/[q]/Q/g' \
          -e 's/[i][i]*/I/g' \
          -e 's/[ceh]/E/g' \
      > base.clt

    egrep '[^-.,=/?XEQROI]' base.clt > .bugs
    head -10 .bugs
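To see the hack in action, trace the common word qokeedy through the
rules: qokeedy -> qokXdy ("ee") -> qoXXdy ("k") -> qoXXRy ("d") ->
QOXXRO ([aoy] and [q]), which is indeed one of the high-frequency
patterns in QOIXER.frq below. Note that the rule order matters: "ee"
must be collapsed into X before the final [ceh] -> E rule gets a
chance to eat the e's.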
First, the { Q O X I R E } patterns:

    cat base.clt \
      | tr '., =/-' '\012\012\012\012\012\012' \
      | egrep '.' \
      | egrep -v '[?%*]' \
      | sort | uniq -c | expand \
      | sort +0 -1nr \
      > QOIXER.frq

The result is a long-tailed distribution that begins

     freq  pattern
    -----  ----------
     1832  XOR
     1725  OR
     1649  ROIR
     1413  ROR
     1209  OXOR
     1084  XERO
      940  XXO
      903  XEO
      817  OXOIR
      786  QOXOR
      745  XEOR
      718  OIR
      716  XO
      703  OXXO
      660  QOXOIR
      560  QOXXO
      487  R
      480  QOXXRO
      404  RO
      382  OXXRO
      379  XXOR
      376  QOXERO
      375  OXO
      372  XOIR
      370  OXERO
      325  OROR
      316  OXEOR
      312  OXXOR
      309  XXRO
      307  OROIR
      ...  ...

Let's now collapse the elements { X XE R IR } to a single class K,
and absorb the Q into the following O:

    cat base.clt \
      | tr '., =/-' '\012\012\012\012\012\012' \
      | egrep '.' \
      | egrep -v '[?%*]' \
      | sed \
          -e 's/XE/K/g' \
          -e 's/X/K/g' \
          -e 's/IR/K/g' \
          -e 's/R/K/g' \
          -e 's/QO/O/g' \
      | sort | uniq -c | expand \
      | sort +0 -1nr \
      > QOK.frq

The result is still a relatively long-tailed distribution:

     freq  pattern
    -----  ----------
     6061  KOK
     4690  OKOK
     3075  KKO
     2704  OK
     2646  OKKO
     2023  KO
     1531  OKKKO
     1365  OKO
     1346  KKOK
     1236  KKKO
     1052  KOKO
      951  OKKOK
      861  KOKOK
      578  K
      561  OKOKO
      374  KOKKO
      324  KK
      309  OKOKOK
      265  O
      233  KKKOK
      219  KKOKO
      202  KKKKO
      189  KKK
      177  OKKOKO
      175  OKKK
      169  OKK
      160  OOK
      152  OKOKKO
      142  KOKKOK
      139  OKKKKO
      ...  ...

Conversely, we can analyze the patterns of X and R ignoring the
{ Q E I O } complements:

    cat base.clt \
      | tr '., =/-' '\012\012\012\012\012\012' \
      | egrep '.' \
      | egrep -v '[?%*]' \
      | tr -d 'QEIO' \
      | sort | uniq -c | expand \
      | sort +0 -1nr \
      > XR.frq

The result is still a fairly broad distribution:

     freq  pattern
    -----  ----------
    10441  XR
     4319  RR
     4006  R
     3768  XXR
     3682  XX
     2999  X
     1461  XRR
     1279  RXR
      538  RX
      480  XXX
      463  RRR
      409  XXRR
      366  RXXR
      346  RXX
      302  (empty)
      230  XXXR
      151  XRXR
      132  XRX
      116  XRRR
       90  RXRR
       89  RRXR
       59  RRRR
       56  XXXX
       50  RRX
       31  XXRX
       30  XXRRR
       24  XRXX
       23  RXXRR
       23  RXXX
       22  XRXXR
      ...  ...
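(Closing the loop on a single token: qokeedy maps to QOXXRO in
base.clt; the QOK collapse turns that into OKKKO (X -> K, R -> K,
QO -> O), the seventh line of QOK.frq above; and deleting QEIO
leaves XXR, the fourth line of XR.frq.)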