Hacking at the Voynich manuscript - Side notes
025 Classifying OKOKOKO elements as word-initial, -medial, and -final

Last edited on 1999-02-05 19:33:19 by stolfi

Word and line breaks in the OKOKOKO model
-----------------------------------------

  In Notes/017 we saw that (practically) every Voynichese word 
  can be parsed into the paradigm QOKOKO...KO where 
  Q, O, K are certain sets of letters and letter clusters.

  It is instructive to analyze, in terms of this paradigm, the
  immediate contexts of definite word spaces (std), breaks due to
  figures in the text (fig), intra-paragraph line breaks (lin), and,
  for comparison, pairs of adjacent elements inside words (non).

The source text
---------------

  For this study we will use the majority-vote and consensus
  transcriptions, which include Takeshi's new full transcription. For
  simplicity, let's discard all data containing weirdos, extra plumes,
  unreadable characters, or the rare letters [buvxz]. Let's also map
  the upper case EVA letters [SCIKTPF] to their lower case variants,
  since the capitalization carries no information in those cases.

    foreach vt ( m.A c.Y ) 
      set v = ${vt:r}; set t = ${vt:e}
      cat ../045/only-${v}.evt \
        | egrep -e '^<[^<>]*;'"$t"'>' \
        | tr 'SCIKTPF' 'sciktpf' \
        | tr -d '\!' \
        | sed \
            -e 's/^<[^<>]*> *//g' \
            -e 's/[{][^{}]*[}]//g' \
            -e 's/[&][0-9*?][0-9*?]*[;]\?/*/g' \
            -e 's/[buxvz]/*/g' \
            -e 's/[.,]*-[-.,]*/-/g' \
            -e 's/[,]*[.][,.]*/./g' \
            -e 's/[,][,]*/,/g' \
            -e 's/.['"'"'"]/?/g' \
            -e 's/[^-,./= ]*[%?*][^-,./= ]*/?/g' \
        > base-${v}.txt
    end
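
  The word-level part of this cleanup can be sketched in Python as
  follows. This is only an illustration of three of the steps above
  (case folding, masking of the rare letters, and collapsing any word
  that contains a bad character into a single "?"). The name
  clean_words is made up here, and the handling of locators, comments,
  and separator runs is left out.

```python
import re

# Same mapping as the `tr 'SCIKTPF' 'sciktpf'` step.
FOLD_CASE = str.maketrans("SCIKTPF", "sciktpf")

def clean_words(text):
    """Hypothetical helper mirroring three of the tr/sed steps above."""
    text = text.translate(FOLD_CASE)
    # Rare letters become "*", as in 's/[buxvz]/*/g'.
    text = re.sub(r"[buxvz]", "*", text)
    # Any word containing %, ? or * collapses to a single "?".
    return re.sub(r"[^-,./= ]*[%?*][^-,./= ]*", "?", text)

print(clean_words("daiin.qoKedy.oxar-chol"))  # daiin.qokedy.?-chol
```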

  Let's now separate the text into the OKOKOKO elements.
  We delete empty elements and put {} around the O strings too:

    foreach v ( m c ) 
      cat base-${v}.txt \
        | factor-field-OK \
            -v inField=1 \
            -v erase=1 \
            -v outField=1 \
        | sed \
            -e 's/{_}//g' \
            -e 's/_//g' \
            -e 's/\([aoy][aoy]*\)/{\1}/g' \
        > base-${v}.elt
    end
    
  Besides these elements, we will use two reduced alphabets.
  The coarse set "clt" is:
  
    O = <[aoy]+>
    Q = [q]
    I = [i]+
    R = [djmg] and [rlsn]
    X = <ee>, <ch>, <sh>, <ih>, <se>, [ci][ktpf][h], [ci][ktpf], [ktpf]
    E = [ehc] not included in the X letters

  We consider also the finer set "flt" where the R and X sets get split
  further as follows
  
    S = <ee>, <ch>, <sh>, <ih>, <se>
    H = [ktpf]
    G = [ci][ktpf][h], [c][ktpf]
    L = [rlsn]
    D = [djmg]
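
  To make the two reduced alphabets concrete, here is a rough Python
  sketch that scans a word from left to right and emits one class
  letter per cluster. It is not the elt2clt/elt2flt sed scripts: the
  function names, the order in which the patterns are tried, and the
  choice to treat [ci][ktpf] with an optional h as G are all
  assumptions made for this illustration.

```python
import re

# Fine classes, tried in this order at each position (the order is an assumption).
FINE_RULES = [
    ("G", re.compile(r"[ci][ktpf]h?")),
    ("S", re.compile(r"ee|ch|sh|ih|se")),
    ("O", re.compile(r"[aoy]+")),
    ("Q", re.compile(r"q")),
    ("I", re.compile(r"i+")),
    ("H", re.compile(r"[ktpf]")),
    ("L", re.compile(r"[rlsn]")),
    ("D", re.compile(r"[djmg]")),
    ("E", re.compile(r"[ehc]")),
]

# The coarse set merges S, H, G into X and L, D into R.
COARSE = {"S": "X", "H": "X", "G": "X", "L": "R", "D": "R"}

def fine_classes(word):
    """Return the string of fine class letters for a word (sketch)."""
    out = []
    pos = 0
    while pos < len(word):
        for name, rx in FINE_RULES:
            m = rx.match(word, pos)
            if m:
                out.append(name)
                pos = m.end()
                break
        else:
            out.append("?")  # character not covered by any class
            pos += 1
    return "".join(out)

def coarse_classes(word):
    """Return the string of coarse class letters for a word (sketch)."""
    return "".join(COARSE.get(c, c) for c in fine_classes(word))

print(fine_classes("qokeedy"), coarse_classes("qokeedy"))  # QOHSDO QOXXRO
```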
  
  Finally we should consider the "simplified" elements,
  where possible errors and calligraphic variants are mapped to 
  likely "correct" versions:

    {p}    -> {t}      (also in composites)
    {f}    -> {k}      (also in composites)
    {g}    -> {m}
    {j}    -> {d}
    {iXh}  -> {cXh}
    {cXhh} -> {cXhe}
    {iXhh} -> {cXhe}
    {iid}  -> {ii} {d}
    etc.
    
  Let's call these the "simple" elements (slt).
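
  A minimal Python sketch of these rules is given below. It covers
  only the rules spelled out in the list (whatever the "etc." stands
  for is not reproduced), it reads the X in {iXh} as a gallows letter,
  and the name simplify is invented for the example; the real work is
  done by the elt2slt sed script.

```python
import re

# Single-letter variants from the list above: p->t, f->k, g->m, j->d.
LETTER_MAP = str.maketrans("pfgj", "tkmd")

def simplify(element):
    """Return the list of simplified elements for one element (sketch)."""
    e = element.translate(LETTER_MAP)
    e = re.sub(r"^i([kt]h)", r"c\1", e)    # {iXh}  -> {cXh}
    e = re.sub(r"^(c[kt]h)h$", r"\1e", e)  # {cXhh} -> {cXhe}
    if e == "iid":                         # {iid}  -> {ii} {d}
        return ["ii", "d"]
    return [e]

print(simplify("cph"), simplify("ifhh"), simplify("iid"))
```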

  The conversion is done by the sed scripts elt2clt, elt2flt, elt2slt.
  The script elt2elt, a no-op, is also provided for uniformity.

  Converting:

    foreach v ( m c ) 
      foreach map ( clt flt slt )
        echo "map = ${map}"
        cat base-${v}.elt \
          | elt2${map} \
          > base-${v}.${map}
      end
    end
  
  Checking the completeness of the conversion:

    foreach v ( m c ) 
      foreach ma ( elt.a-z clt.A-Z flt.A-Z slt.a-z )
        set map = "${ma:r}"; set alf = "${ma:e}"
        echo "map = ${map} alf = ${alf}"
        cat base-${v}.${map} \
          | egrep '[{][^{}]*[^{}'"${alf}"'?*][^{}]*[}]' \
          > .bugs-${v}.${map}
        cat base-${v}.${map} \
          | egrep '(^|[}])[^{}]*[^-,.=/ {}][^{}]*([{]|$)' \
          >> .bugs-${v}.${map}
        head -10 .bugs-${v}.${map}
      end
    end
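
  The two egrep tests can be restated in Python roughly as follows.
  The function name has_bug is hypothetical; the two patterns are the
  ones used above, with the allowed alphabet passed in as a
  character-class fragment such as "a-z" or "A-Z".

```python
import re

def has_bug(line, alphabet):
    """True if a line fails either completeness test (sketch)."""
    # Test 1: a braced element containing a character outside the alphabet.
    bad_element = re.compile(r"\{[^{}]*[^{}" + alphabet + r"?*][^{}]*\}")
    # Test 2: something other than a delimiter found outside the braces.
    stray_text = re.compile(r"(^|\})[^{}]*[^-,.=/ {}][^{}]*(\{|$)")
    return bool(bad_element.search(line) or stray_text.search(line))

print(has_bug("{d}{a}{ii}{n}.{ch}{o}{l}", "a-z"))  # False
print(has_bug("{d}x{a}", "a-z"))                   # True
```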

Element frequencies
-------------------

  Computing the element frequencies:
  
    foreach v ( m c ) 
      foreach map ( elt clt flt slt )
        cat base-${v}.${map} \
          | tr '{}' '\012\012' \
          | egrep '[_A-Za-z?*%]' \
          | sort | uniq -c | expand \
          | sort -k1,1nr \
          > ${map}-${v}.frq
        cat ${map}-${v}.frq \
          | gawk '/./{print $2;}' \
          | sort \
          > ${map}-${v}.dic
      end
    end
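
  For readers without the shell tools at hand, the counting step
  amounts to the following Python sketch. The function name and the
  sample lines are invented for the illustration; it simply tallies
  every non-empty {...} element.

```python
import re
from collections import Counter

def element_counts(lines):
    """Count the braced elements in an iterable of text lines (sketch)."""
    counts = Counter()
    for line in lines:
        counts.update(re.findall(r"\{([^{}]+)\}", line))
    return counts

# Made-up sample lines, not taken from the transcription.
sample = ["{d}{a}{ii}{n}.{ch}{o}{l}", "{d}{a}{l}"]
for elem, n in element_counts(sample).most_common():
    print(f"{n:7d} {elem}")
```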
  

  Element frequencies
  
     Majority version:                                    Consensus version:
                                                          
     count clt  count flt  count slt  count elt     count clt   count flt  count slt  count elt  
     ----- ---  ----- ---  ----- ---  ----- -----   ----- ---   ----- ---  ----- ---  ----- -----
     53662 O    53662 O    23747 o    23193 o       45290 O     45290 O    19966 o    19593 o    
     38745 R    25157 L    16878 y    16681 y       32418 R     20716 L    14831 y    14702 y    
     38037 X    19283 S    13568 a    13260 a       32223 X     16632 S    10891 d    10862 d    
      9927 E    16647 H    12490 d    12440 d        8506 E     13888 H    10856 a    10635 a    
      6351 I    13585 D    10463 ch   10019 l        8052 ?     11702 D     9094 ch    8745 l    
      5138 Q     9927 E    10075 l     7683 k        4766 I      8506 E     8771 l     8052 ?    
      2487 ?     6367 I     9919 e     6392 r        4451 Q      8052 ?     8503 e     6314 k    
                 5138 Q     9756 k     6383 ch                   4774 I     8052 ?     5491 ch   
                 2487 ?     7114 r     5138 q                    4451 Q     8042 k     5400 r    
                 2110 G     6891 t     4570 t                    1703 G     5951 r     4451 q    
                            5580 n     4103 ee                              5846 t     3870 t    
                            5138 q     4069 che                             4451 q     3660 iin  
                            4443 ee    4017 iin                             4215 n     3599 che  
                            4385 sh    2487 ?                               3782 ee    3535 ee   
                            4204 ii    2369 s                               3759 sh    1964 sh   
                            2487 ?     2308 sh                              3758 ii    1774 she  
                            2380 s     2029 she                             1776 s     1772 s    
                            2058 i     1690 ke                               981 i     1459 ke   
                            1138 cth   1326 in                               936 cth   1103 p    
                            1095 m     1316 p                                811 m      864 te   
                             972 ckh    996 m                                767 ckh    756 m    
                             106 iii    993 te                                35 iii    630 cth  
                                        730 cth                                         541 ckh  
                                        654 ckh                                         471 ir   
                                        582 ir                                          433 in   
                                        371 f                                           266 f    
                                        340 eee                                         247 eee  
                                        260 oa                                          194 oa   
                                        231 e                                           181 ckhe 
                                        229 ckhe                                        145 cthe 
                                        185 cthe                                        135 e    
                                        145 cph                                         112 cph  
                                        133 n                                            88 oy   
                                        ... ...                                          87 n    
                                                                                        ... ...  
                                       
  Observe that the frequency of {?} increased in the consensus version,
  while all other counts decreased. Note that the counts of {r} and
  {s} decreased even relative to the other letters. Otherwise the
  differences are minimal.


  Plotting the histograms:

    foreach v ( m c ) 
      foreach map ( clt flt slt elt )
        gnuplot <<EOF
        set term x11
        plot "${map}-${v}.frq" using :1 with steps
        pause 120
        quit
EOF
      end
    end

Counting reduced character pairs
--------------------------------

  Now let's count the reduced letter pairs inside words and around 
  all three kinds of word breaks:

    foreach v ( m c )
      foreach map ( clt flt elt slt )

        echo "map = ${map} v = ${v}"

        echo "non-breaks ..."
        cat base-${v}.${map} \
          | sed \
              -e 's/\({[^{}]*}\)/\1@\1\!/g' \
              -e 's/[\!:]*[-=:., ][-\!=:., ]*/@/g' \
          | tr '@' '\012' \
          | egrep -e '^{[^{}]*}\!{[^{}]*}$' \
          | compute-pair-freqs \
          > non-${map}-${v}.frq

        echo "simple word breaks ..."
        cat base-${v}.${map} \
          | sed \
              -e 's/[.]\({[^{}]*}\)[.]/.\1\1./g' \
              -e 's/\({[^{}]*}\)[.]\({[^{}]*}\)/@\1.\2@/g' \
          | tr '@' '\012' \
          | egrep -e '^{[^{}]*}[.]{[^{}]*}$' \
          | compute-pair-freqs \
          > std-${map}-${v}.frq

        echo "figure breaks ..."
        cat base-${v}.${map} \
          | sed \
              -e 's/-\({[^{}]*}\)-/-\1\1-/g' \
              -e 's/\({[^{}]*}\)-\({[^{}]*}\)/@\1-\2@/g' \
          | tr '@' '\012' \
          | egrep -e '^{[^{}]*}-{[^{}]*}$' \
          | compute-pair-freqs \
          > fig-${map}-${v}.frq

        echo "line breaks ..."
        cat base-${v}.${map} \
          | sed \
              -e 's/^[-=., ]*\({[^{}]*}\)[-=., ]*$/\1\1/g' \
              -e 's/^[-=., ]*\({[^{}]*}\)/\1@/g' \
              -e 's/\({[^{}]*}\)[-., ]*$/@\1\//g' \
          | tr -d '\012' \
          | tr '@' '\012' \
          | egrep -e '^{[^{}]*}[/]{[^{}]*}$' \
          | compute-pair-freqs \
          > lin-${map}-${v}.frq

      end
    end
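
  The logic of the four pipelines can be summarized by the Python
  sketch below. It is a simplified restatement, not a replacement for
  compute-pair-freqs: the names are invented, pairs separated by
  anything other than a single "." or "-" are skipped, and a line
  break pair is recorded only when the previous line does not end a
  paragraph (no trailing "=").

```python
import re
from collections import Counter

# An element, an optional "." or "-" break, and (as lookahead) the next element.
PAIR = re.compile(r"\{([^{}]*)\}([-.]?)(?=\{([^{}]*)\})")
FIRST = re.compile(r"^[-=., ]*\{([^{}]*)\}")   # first element of a line
LAST = re.compile(r"\{([^{}]*)\}[-., ]*$")     # last element, no "=" after it
KIND = {"": "non", ".": "std", "-": "fig"}

def count_pairs(lines):
    """Count (kind, left, right) element pairs over consecutive lines."""
    counts = Counter()
    prev_last = None
    for line in lines:
        line = line.rstrip("\n")
        for left, brk, right in PAIR.findall(line):
            counts[(KIND[brk], left, right)] += 1
        first = FIRST.match(line)
        if first and prev_last is not None:
            counts[("lin", prev_last, first.group(1))] += 1
        last = LAST.search(line)
        prev_last = last.group(1) if last else None
    return counts

# Made-up sample: two lines of one paragraph.
sample = ["{d}{a}{l}.{o}{r}-{ch}{y}", "{q}{o}.{k}{y}="]
for key, n in sorted(count_pairs(sample).items()):
    print(n, *key)
```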
    
  The files {clt,flt,slt,elt}.dic were created from the "-m" versions,
  reordered with hindsight into the important classes.

  Format element pair frequencies in matrix form:

    foreach v ( m c )
      foreach map ( clt flt slt )
        foreach brk ( lin fig std non )
          echo "map = ${map}  brk = ${brk}"
          cat ${brk}-${map}-${v}.frq \
            | gawk '/./{s=$3; gsub(/[\!:./]/, " ", s); print $1,s;}' \
            | count-diword-freqs \
                -v rows=${map}.dic -v cols=${map}.dic \
                -v counted=1 -v digits=5 \
            > ${map}-${brk}-${v}.dwtbl
        end
      end
    end
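
  The per-row percentages shown next to the absolute counts in the
  tables of the following sections appear to be truncated rather than
  rounded. A small Python sketch of that computation, using the O row
  of the ordinary word break (std) table as input, is given here; the
  function name is made up, and the truncation is an inference from
  the printed numbers, not something taken from count-diword-freqs.

```python
def row_percentages(row):
    """Truncated percentage of each column within one table row (sketch)."""
    total = sum(row.values())
    if total == 0:
        return {col: 0 for col in row}
    return {col: 100 * n // total for col, n in row.items()}

# O row of the coarse std table: element after the break -> count.
std_after_O = {"Q": 3287, "O": 2786, "R": 2748, "X": 3703, "E": 11, "I": 0}
print(row_percentages(std_after_O))
# {'Q': 26, 'O': 22, 'R': 21, 'X': 29, 'E': 0, 'I': 0}
```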
    
Analysis in terms of the Q O I R X E classes
--------------------------------------------

  Here are the counts for the mapping to the coarse { Q O I R X E } classes:

    Pairs of adjacent characters inside words:

      absolute counts                                   per-row percentages
      --- ------- ----- ----- ----- ----- ----- -----   --- -- -- -- -- -- --
                T                                                            
                O                                                            
                T     Q     O     R     X     E     I        Q  O  R  X  E  I
      --- ------- ----- ----- ----- ----- ----- -----   --- -- -- -- -- -- --
      Q      5137     .  5048     4    52    33     .   Q    . 98  .  1  .  .
      O     37646    28     . 18339 12821   152  6306   O    .  . 48 34  . 16
      R     18953     3 14900   904  3072    59    15   R    . 78  4 16  .  .
      X     37748     . 16730  2859  8517  9629    13   X    . 44  7 22 25  .
      E      9860     .  5139  3770   948     2     1   E    . 52 38  9  .  .
      I      6341     .    11  6301    18     .    11   I    .  . 99  .  .  .
      --- ------- ----- ----- ----- ----- ----- -----   --- -- -- -- -- -- --
      TOT  115685    31 41828 32177 25428  9875  6346   TOT  . 36 27 21  8  5


    Pairs around ordinary word breaks (std):
    
      absolute counts                                   per-row percentages
      --- ------- ----- ----- ----- ----- ----- -----   --- -- -- -- -- -- --
                T                                                            
                O                                                            
                T     Q     O     R     X     E     I        Q  O  R  X  E  I
      --- ------- ----- ----- ----- ----- ----- -----   --- -- -- -- -- -- --
      Q         0     .     .     .     .     .     .   Q    .  .  .  .  .  .
      O     12535  3287  2786  2748  3703    11     .   O   26 22 21 29  .  .
      R     15134  1068  5797  1879  6366    22     2   R    7 38 12 42  .  .
      X       174    10    59    19    85     1     .   X    5 33 10 48  .  .
      E        44     5    16     8    15     .     .   E    .  .  .  .  .  .
      I         5     .     2     2     1     .     .   I    .  .  .  .  .  .
      --- ------- ----- ----- ----- ----- ----- -----   --- -- -- -- -- -- --
      TOT   27892  4370  8660  4656 10170    34     2   TOT 15 31 16 36  .  .

    Pairs around figure breaks (fig):
    
      absolute counts                                   per-row percentages
      --- ------- ----- ----- ----- ----- ----- -----   --- -- -- -- -- -- --
                T                                                            
                O                                                            
                T     Q     O     R     X     E     I        Q  O  R  X  E  I
      --- ------- ----- ----- ----- ----- ----- -----   --- -- -- -- -- -- --
      Q         0     .     .     .     .     .     .   Q    .  .  .  .  .  .
      O       360    16   134   107   100     3     .   O    4 37 29 27  .  .
      R       382     9   168    98   107     .     .   R    2 43 25 28  .  .
      X        19     .     4     6     9     .     .   X    . 21 31 47  .  .
      E         1     .     .     1     .     .     .   E    .  .  .  .  .  .
      I         0     .     .     .     .     .     .   I    .  .  .  .  .  .
      --- ------- ----- ----- ----- ----- ----- -----   --- -- -- -- -- -- --
      TOT     762    25   306   212   216     3     .   TOT  3 40 27 28  .  .

    Pairs around line breaks (lin):

      absolute counts                                   per-row percentages
      --- ------- ----- ----- ----- ----- ----- -----   --- -- -- -- -- -- --
                T                                                            
                O                                                            
                T     Q     O     R     X     E     I        Q  O  R  X  E  I
      --- ------- ----- ----- ----- ----- ----- -----   --- -- -- -- -- -- --
      Q         1     .     .     1     .     .     .   Q    .  .  .  .  .  .
      O      1244   199   377   415   248     5     .   O   15 30 33 19  .  .
      R      1828   262   636   594   334     2     .   R   14 34 32 18  .  .
      X        28     2    18     5     2     1     .   X    7 64 17  7  3  .
      E         2     .     1     1     .     .     .   E    .  .  .  .  .  .
      I         2     1     1     .     .     .     .   I    .  .  .  .  .  .
      --- ------- ----- ----- ----- ----- ----- -----   --- -- -- -- -- -- --
      TOT    3105   464  1033  1016   584     8     .   TOT 14 33 32 18  .  .

  So we can say that 
  
    (1) Line breaks occur almost only between { O R } and { Q O R X }
        (with frequencies ranging from 6% to 20% of all line breaks);
        rarely between X and { Q O R X }
        (less than 0.9% of all line breaks);
        and essentially never after { Q I E } or before { I E }
        (less than 0.4% of all line breaks).
      
    (2) Ordinary word breaks follow the same pattern:
        the pairs between { O R } and { Q O R X }
        have frequencies between 3.8% and 22%;
        pairs between X and { Q O R X } have total 
        frequency of 0.6%; and all the remaining pairs 
        account for only 0.3% of the word breaks.
    
    (3) Figure breaks too follow almost the same pattern:
        the pairs between { O R } and { O R X }
        have frequencies ranging from 12% to 22%,
        but the pairs R-Q and O-Q are much rarer
        than around line breaks and ordinary spaces,
        about 1--2% each.  Breaks between X and { Q O R X }
        are slightly more common (2.5% total) and 
        all other pairs are almost absent (about 0.5%).
        
    (4) The relative frequencies of { Q O R X } 
        are approximately 1:2:2:1 after a line break,
        and 0:3:2:2 after a figure break, roughly 
        independently of the character before the break.
        
    (5) The relative frequencies of { Q O R X } 
        after ordinary word breaks seem to depend on the 
        preceding letter: 1:1:1:1 after O, 1:4:1:4 after R.
        However they are still of the same order of magnitude.
    
    (6) Inside words, the valid pairs are 
        { QO, OX, OI, IX, XX, XE, XO, EX, EO }
        with frequencies ranging from 4.1% to 27%; here X
        stands for the R and X classes of the tables taken together.
        The remaining pairs have much lower frequencies
        (OO accounts for 0.46% of all pairs, and OE 
        for only 0.13%).
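
  The percentages quoted in points (1) and (2) can be rechecked
  directly from the tables. The Python sketch below does this for the
  line break and ordinary word break tables; the counts are copied
  from the tables above, while the function and variable names are
  made up for the example.

```python
# Rows O, R, X and columns Q, O, R, X of the coarse break tables above.
LIN = {
    "O": {"Q": 199, "O": 377, "R": 415, "X": 248},
    "R": {"Q": 262, "O": 636, "R": 594, "X": 334},
    "X": {"Q": 2, "O": 18, "R": 5, "X": 2},
}
STD = {
    "O": {"Q": 3287, "O": 2786, "R": 2748, "X": 3703},
    "R": {"Q": 1068, "O": 5797, "R": 1879, "X": 6366},
    "X": {"Q": 10, "O": 59, "R": 19, "X": 85},
}

def summarize(table, total):
    """Return (min %, max %) over the O and R rows, and the X row total %."""
    main = [n for row in ("O", "R") for n in table[row].values()]
    x_total = sum(table["X"].values())
    pct = lambda n: round(100.0 * n / total, 2)
    return pct(min(main)), pct(max(main)), pct(x_total)

print(summarize(LIN, 3105))   # (6.41, 20.48, 0.87)
print(summarize(STD, 27892))  # (3.83, 22.82, 0.62)
```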
      
  These observations seem to imply that the "word spaces", line
  breaks, and figure breaks are fairly similar when compared to
  all inter-character pairs. 
  
  Their similarity, and the relative independence of the second letter
  from the first, strongly suggest that those breaks are indeed word
  boundaries. In that case we conclude that Voynichese words may end
  in O or R (40-45% and 50-60%, respectively) or rarely X; and may
  begin only with Q, O, R, or X.
  
  (A more detailed analysis would show that the O at end of words is
  almost always <y>. Also the last R in a line is most often EVA <m>,
  which is only rarely seen at the other kinds of word breaks.)
  
  Point (3) shows that figure breaks are more like line and word
  breaks than like random inter-character breaks.
  
  The main difference between line breaks and figure breaks is that
  the probability of finding a Q is much higher after a line break (14%)
  than after a figure break (3%).  
  
  The main difference between line and figure breaks on one side, and
  the ordinary word breaks on the other, is that the probability of
  the letter after an ordinary word break is visibly dependent on the
  letter before the break. The difference can be described as an
  enhancement of O.Q pairs at the expense of O.O pairs; and an
  enhancement of R.O and R.X pairs at the expense of R.R pairs.
  
    Distribution of letters after a break, depending on the 
    previous one, ignoring pairs that end in Q:
    
       after R:             after O:
       
            O  R  X            O  R  X 
           -- -- --           -- -- --  
       lin 40 37 21       lin 36 39 23
       fig 45 26 28       fig 39 31 29
       std 41 13 45       std 30 29 40
       non 78  4 16       non  . 48 34
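
  The lin, fig, and std rows of this small table can be reproduced
  from the coarse tables by keeping only the O, R, and X columns of a
  row and truncating the percentages, as in the Python sketch below.
  The function name is invented, and the truncation rule is inferred
  from the printed values. The "non" rows are instead the per-row
  percentages of the in-word table, copied unchanged.

```python
def orx_split(o, r, x):
    """Truncated percentages of O, R, X among those three columns only."""
    total = o + r + x
    return tuple(100 * n // total for n in (o, r, x))

# Counts (O, R, X) taken from the R and O rows of the coarse tables.
after_R = {"lin": (636, 594, 334), "fig": (168, 98, 107), "std": (5797, 1879, 6366)}
after_O = {"lin": (377, 415, 248), "fig": (134, 107, 100), "std": (2786, 2748, 3703)}

for brk in ("lin", "fig", "std"):
    print(brk, orx_split(*after_R[brk]), orx_split(*after_O[brk]))
```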

  Point (6) is a partial restatement of the QOKOKOKO paradigm.
  Note that the pairs { QO OI IX XE EX EO }, which are fairly
  common inside words, are not legal places for word spaces,
  line, or figure breaks.
     
  Unfortunately this data does not shed much light on whether each O
  (or E) is attached to the preceding X, the following X, or sometimes
  both, or neither. There are (practically) no figure breaks adjacent
  to an E.


Analysis for the Q O I G H S L D E classes
------------------------------------------

  Here are the counts for the mapping to the finer classes { Q O I G H S L D E }:


    Pairs around ordinary word breaks (std):
    
      absolute counts                                                     per-row percentages
      --- ------- ----- ----- ----- ----- ----- ----- ----- ----- -----   --- -- -- -- -- -- -- --
                T                                                                                 
                O                                                                                 
                T     Q     O     L     D     S     H     G     E     I        Q  O  L  D  S  H  G
      --- ------- ----- ----- ----- ----- ----- ----- ----- ----- -----   --- -- -- -- -- -- -- --
      Q         0     .     .     .     .     .     .     .     .     .   Q    .  .  .  .  .  .  .
      O     12535  3287  2786  1493  1255  2359  1050   294    11     .   O   26 22 11 10 18  8  2
      L     14499   957  5570   604  1186  5127   568   465    21     1   L    6 38  4  8 35  3  3
      D       635   111   227    51    38   183    13    10     1     1   D   17 35  8  5 28  2  1
      S        57     5    23     4     5     7    12     1     .     .   S    8 40  7  8 12 21  1
      H       112     5    34     4     4    63     1     .     1     .   H    4 30  3  3 56  .  .
      G         5     .     2     1     1     1     .     .     .     .   G    .  .  .  .  .  .  .
      E        44     5    16     2     6     5     9     1     .     .   E   11 36  4 13 11 20  2
      I         5     .     2     1     1     1     .     .     .     .   I    .  .  .  .  .  .  .
      --- ------- ----- ----- ----- ----- ----- ----- ----- ----- -----   --- -- -- -- -- -- -- --
      TOT   27892  4370  8660  2160  2496  7746  1653   771    34     2   TOT 15 31  7  8 27  5  2
    

    Pairs around figure breaks (fig):
    
      absolute counts                                                     per-row percentages
      --- ------- ----- ----- ----- ----- ----- ----- ----- ----- -----   --- -- -- -- -- -- -- --
                T                                                                                 
                O                                                                                 
                T     Q     O     L     D     S     H     G     E     I        Q  O  L  D  S  H  G
      --- ------- ----- ----- ----- ----- ----- ----- ----- ----- -----   --- -- -- -- -- -- -- --
      Q         0     .     .     .     .     .     .     .     .     .   Q    .  .  .  .  .  .  .
      O       360    16   134    38    69    85     1    14     3     .   O    4 37 10 19 23  .  3
      L       315     9   133    28    53    86     .     6     .     .   L    2 42  8 16 27  .  1
      D        67     .    35     8     9    12     3     .     .     .   D    . 52 11 13 17  4  .
      S         2     .     .     .     .     2     .     .     .     .   S    .  .  .  .  .  .  .
      H         9     .     2     2     2     3     .     .     .     .   H    .  .  .  .  .  .  .
      G         8     .     2     1     1     4     .     .     .     .   G    .  .  .  .  .  .  .
      E         1     .     .     .     1     .     .     .     .     .   E    .  .  .  .  .  .  .
      I         0     .     .     .     .     .     .     .     .     .   I    .  .  .  .  .  .  .
      --- ------- ----- ----- ----- ----- ----- ----- ----- ----- -----   --- -- -- -- -- -- -- --
      TOT     762    25   306    77   135   192     4    20     3     .   TOT  3 40 10 17 25  .  2


    Pairs around line breaks (lin):

      absolute counts                                                     per-row percentages
      --- ------- ----- ----- ----- ----- ----- ----- ----- ----- -----   --- -- -- -- -- -- -- --
                T                                                                                 
                O                                                                                 
                T     Q     O     L     D     S     H     G     E     I        Q  O  L  D  S  H  G
      --- ------- ----- ----- ----- ----- ----- ----- ----- ----- -----   --- -- -- -- -- -- -- --
      Q         1     .     .     1     .     .     .     .     .     .   Q    .  .  .  .  .  .  .
      O      1244   199   377   172   243   124   123     1     5     .   O   15 30 13 19  9  9  .
      L      1160   187   385   165   206   121    89     5     2     .   L   16 33 14 17 10  7  .
      D       668    75   251    99   124    58    59     2     .     .   D   11 37 14 18  8  8  .
      S         5     .     5     .     .     .     .     .     .     .   S    .  .  .  .  .  .  .
      H        20     1    11     3     2     .     2     .     1     .   H    4 54 14  9  .  9  .
      G         3     1     2     .     .     .     .     .     .     .   G    .  .  .  .  .  .  .
      E         2     .     1     1     .     .     .     .     .     .   E    .  .  .  .  .  .  .
      I         2     1     1     .     .     .     .     .     .     .   I    .  .  .  .  .  .  .
      --- ------- ----- ----- ----- ----- ----- ----- ----- ----- -----   --- -- -- -- -- -- -- --
      TOT    3105   464  1033   441   575   303   273     8     8     .   TOT 14 33 14 18  9  8  .


  These numbers can be summarized as follows: line breaks occur only
  between { O L D } and { Q O L D S H }, and practically never after 
  { Q S H G E I } or before { G E I }.  The distribution of the 
  first letter of the line does not depend much on the last
  letter of the previous line.
  
  Ordinary word breaks have almost the same distribution, except that
  they may also occur before G. Figure breaks are even more extreme in
  that they occur before G but hardly ever before H.
  
  Also the distribution of the letter after a word break depends
  significantly on the letter before the break. In particular the
  pairs L.S and D.S are far more common around word breaks than they
  are around line breaks.
  
Analysis for the "corrected" elements
-------------------------------------

  In tabular form:

    Pairs around ordinary word breaks (std):
    
      --- ------- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- -----
                T                                                                                               c     c
                O                                                                 c     s     e                 k     t
                T     q     a     o     y     r     l     s     n     d     m     h     h     e     k     t     h     h
      --- ------- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- -----
      a        29     3     2     5     1     5     3     .     .     4     .     3     .     .     2     .     1     .
      o       836    75    32    80    22    59   108    31     .   120     1    98    42     1    78    37    17    33
      y     11670  3209   103  2250   291   228   726   332     1  1128     2  1519   689     7   501   432    61   182
      r      4520   232   757  1205   188    15    48    58     .   218     3   965   594     8    54    33    39    94
      l      4688   371   150   908   113    56   168   124     1   602     1  1049   600    15   294   118    26    85
      s       763    28   199   204    44     2    13     9     .    16     1   136    69     3     6     6    12    14
      n      4528   326   209  1383   210    11    30    69     .   342     3  1116   571     1    31    26    48   147
      d       371    88    29    81    20     5    27     5     1    19     .    51    32     .     6     2     2     2
      m       264    23    10    76    11     .     2    11     .    19     .    69    31     .     5     .     .     6
      --- ------- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- -----
      TOT   27892  4370  1509  6241   910   383  1131   643     3  2485    11  5054  2656    36   996   657   206   565
                                                                                                                       
                                                                                                                       
    Pairs around figure breaks (fig):                      
                                                                                                                       
      --- ------- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- -----
                T                                                                                               c     c
                O                                                                 c     s     e                 k     t
                T     q     a     o     y     r     l     s     n     d     m     h     h     e     k     t     h     h
      --- ------- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- -----
      a         0     .     .     .     .     .     .     .     .     .     .     .     .     .     .     .     .     .
      o        19     2     .     2     4     1     3     .     .     1     .     2     2     .     .     1     .     1
      y       341    14     4    77    47     .     2    32     .    68     .    58    22     1     .     .     2    11
      r        56     3     2    11    10     .     .     4     .    10     .    12     4     .     .     .     .     .
      l       103     4     3    25    16     2     1     9     .    19     .    12     9     .     .     .     .     3
      s        55     .     5    16     5     .     .     3     .    10     .    11     5     .     .     .     .     .
      n       101     2     .    25    15     .     .     9     .    14     .    21    12     .     .     .     .     3
      d        45     .     2    10     9     2     .     4     .     9     .     4     5     .     .     .     .     .
      m        22     .     .    11     3     .     .     2     .     .     .     2     1     .     2     1     .     .
      --- ------- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- -----
      TOT     762    25    17   179   110     7     6    64     .   133     2   130    61     1     2     2     2    18
                                                                                                                       
                                                                                                                       
    Pairs around line breaks (lin):                        
                                                                                                                       
      --- ------- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- -----
                T                                                                                               c     c
                O                                                                 c     s     e                 k     t
                T     q     a     o     y     r     l     s     n     d     m     h     h     e     k     t     h     h
      --- ------- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- -----
      a         7     4     .     1     1     .     .     .     .     1     .     .     .     .     .     .     .     .
      o        47     8     1     9     6     2     3     6     .     6     .     2     1     .     1     2     .     .
      y      1190   187     2   168   189     2    25   134     .   236     .    59    62     .    12   108     .     1
      r       276    41     1    55    48     3     6    19     .    49     .    16    17     .     2    18     .     1
      l       364    57     .    54    59     2    15    53     .    70     .    10    16     .     6    21     .     1
      s       109    21     .    19    18     1     2    10     .    14     .     3     8     .     .    12     .     .
      n       411    68     1    66    64     1     4    49     .    73     .    16    35     .     5    25     .     3
      d       108    19     1    21    22     1     4     8     .    17     .     6     3     .     1     4     .     1
      m       560    56     3    89   115     1     8    77     .   107     .    17    31     1     2    52     .     1
      --- ------- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- -----
      TOT    3105   464     9   489   535    14    67   360     .   575     .   129   173     1    30   243     .     8


  Percentage distribution of letters after each type of break:

      --- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --
                                                        c  c
                                         c  s  e        k  t
           q  a  o  y  r  l  s  n  d  m  h  h  e  k  t  h  h
      --- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --
      non  . 10 13 12  5  7  5  4  7  .  4  1  3  7  4  .  .
      
      std 15  5 22  3  1  4  2  .  8  . 18  9  .  3  2  .  2
      fig  3  2 23 14  .  .  8  . 17  . 17  8  .  .  .  .  2
      lin 14  . 15 17  .  2 11  . 18  .  4  5  .  .  7  .  .
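
  The "lin" row above can be recomputed from the TOT row of the
  line-break pair table. The following is only a sketch in Python
  (not part of the original scripts); the names AFTER, LIN_TOT and
  percent_row are made up for the illustration, and the use of
  truncation rather than rounding is an assumption, inferred from
  the fact that it reproduces the printed row.

```python
# Column labels of the pair tables, in the order printed in the text.
AFTER = ["q", "a", "o", "y", "r", "l", "s", "n", "d", "m",
         "ch", "sh", "ee", "k", "t", "ckh", "cth"]

# TOT row of the "lin" pair table: grand total, then one count
# per column.  A "." in the table stands for a zero count.
LIN_TOT = "3105 464 9 489 535 14 67 360 . 575 . 129 173 1 30 243 . 8"


def percent_row(tot_row, labels):
    """Return {label: percentage of the grand total, truncated}."""
    fields = [0 if f == "." else int(f) for f in tot_row.split()]
    total, counts = fields[0], fields[1:]
    # Assumption: percentages are truncated (integer division).
    return {lab: 100 * c // total for lab, c in zip(labels, counts)}


lin = percent_row(LIN_TOT, AFTER)

# Print in the same style as the table, with "." for values below 1%.
print(" ".join("." if p == 0 else str(p) for p in lin.values()))
```

  With truncation, 464/3105 (14.9%) prints as 14, which agrees with
  the table; rounding would give 15 instead.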
      

  Percentage distribution of letters before each type of break:

      --- ---- ---- ---- ----     
             n    s    f    l     
             o    t    i    i     
             n    d    g    n     
      --- ---- ---- ---- ----     
      q      4    0    0    0     
      a     11    0    0    0     
      o     19    2    2    1     
      y      1   41   44   38     
      r      1   16    7    8     
      l      3   16   13   11     
      s      1    2    7    3     
      n      0   16   13   13     
      d     10    1    5    3     
      m      0    0    2   18     
      ch     8    0    0    0     
      sh     3    0    0    0     
      ee     3    0    0    0     
      k      8    0    0    0     
      t      5    0    0    0     
      ckh    0    0    0    0     
      cth    0    0    0    0     
      e      8    0    0    0     
      i      1    0    0    0     
      ii     3    0    0    0     
      iii    0    0    0    0     
      --- ---- ---- ---- ----
   
  From the analysis with "flt" classes, we would expect the following pairs to occur:
  
     { a o y  r l s n  d m } and { q  a o y  r l s n  d m  ch sh ee  k t  ckh cth }
     
  Notable absences in all breaks:
  
     */m - the ratio d:m is 10:1 but the ratio */d : */m is over 200:1
     
     */n - rare compared to */r, */l, */s, even considering the element
           frequencies. Of course, that is because <n> occurs only after
           <i>, whereas <r>, <l>, <s> also occur in other contexts.
     
     */ee - the ratio ch:sh:ee is 2:1:1 but */ch : */sh : */ee is 2:1:0
     
     a/*, o/* - while a:o:y is 3:6:4, a/*:o/*:y/* is 0:10:150 in ordinary breaks;
           a/* and o/* are similarly rare before the other kinds of break.
  
  Notable absences in line and figure breaks:
  
     */a - ratio */a:*/o:*/y is 2:8:1 in std, 1:50:50 in lin, while a:o:y is 3:6:4
     
     */r - 1.3% of std, 1% of fig, 0.4% of lin. Also, the
           ratio r:s is 3:1 but */r:*/s is 1:2 in std, 1:25 in lin.
     
     */k - 3.5% of std, 0.3% of fig, 1% of lin
     
     */l - 4% of std, 1% of fig, 2% of lin
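
  The "lin" ratios quoted in this list can be checked against the TOT
  row of the line-break pair table. This is a small Python sketch, not
  part of the original scripts; LIN and ratio are names invented here,
  and the helper assumes that none of the selected counts is zero.

```python
# Counts of each letter right after a line break, copied from the
# TOT row of the "lin" pair table (only the letters used below).
LIN = {"a": 9, "o": 489, "y": 535, "r": 14, "s": 360}


def ratio(counts, labels):
    """Express the counts for `labels` relative to the smallest one."""
    base = min(counts[lab] for lab in labels)  # assumed to be nonzero
    return [round(counts[lab] / base, 1) for lab in labels]


print(ratio(LIN, ["a", "o", "y"]))  # [1.0, 54.3, 59.4]
print(ratio(LIN, ["r", "s"]))       # [1.0, 25.7]
```

  These agree with the rounded figures 1:50:50 and 1:25 given above.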
  
  Notable absences in line breaks:
  
     */ckh, */cth - 2.7% of std, 2.6% of fig, 0.2% of lin
    
  Notable absences in figure breaks:
 
     */q - 15% of standard breaks, 3% of figure breaks, 15% of line breaks

     */k,*/t - 6% of std, 0.5% of fig, 9% of lin (seen in flt)
     
     s/r, s/s, d/r, d/s

  Notable anomalies:
  
    */s - 2% of std, 8% of fig, 12% of lin.

  This said, the significant pairs found around line breaks 
  are mostly between
  
    { o y  r l s n  d m } and { q  o y  l s  d  ch sh  t }
    
  where o/* pairs are actually quite rare.
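
  The text does not say what criterion was used to pick these two
  sets. As an illustration only, the Python sketch below keeps every
  element that accounts for at least 1% of the 3105 line-break pairs;
  the 1% cut-off and the names ROW_TOT, COL_TOT and frequent are
  assumptions of the sketch. With the marginal totals of the "lin"
  table, this cut-off happens to give the same two sets.

```python
GRAND_TOTAL = 3105  # TOT/TOT cell of the "lin" pair table

# Row totals: element before the line break.
ROW_TOT = {"a": 7, "o": 47, "y": 1190, "r": 276, "l": 364,
           "s": 109, "n": 411, "d": 108, "m": 560}

# Column totals: element after the line break ("." entries are 0).
COL_TOT = {"q": 464, "a": 9, "o": 489, "y": 535, "r": 14, "l": 67,
           "s": 360, "n": 0, "d": 575, "m": 0, "ch": 129, "sh": 173,
           "ee": 1, "k": 30, "t": 243, "ckh": 0, "cth": 8}


def frequent(totals, grand=GRAND_TOTAL, min_percent=1):
    """Elements whose total is at least min_percent of all pairs."""
    # Integer arithmetic avoids floating-point edge cases.
    return [e for e, n in totals.items() if 100 * n >= min_percent * grand]


print(frequent(ROW_TOT))  # first elements
print(frequent(COL_TOT))  # second elements
```

  Note that <k> (30 pairs) falls just below the cut-off, so a slightly
  different threshold would change the second set.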
  
  Pairs with second element <t> or <s> are in fact more common around
  line breaks than around word breaks.  Thus it may be that such
  word breaks are preferentially omitted in transcription.
  
  Pairs with first element <m> have the same discrepancy, but there
  the likely explanation is that <m> is an abbreviation or a 
  calligraphic variant that is specifically used at end of line.
  (Since <m> also occurs before figure breaks, the abbreviation
  theory seems more likely.)
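
  The two comparisons above can be read directly off the percentage
  tables. The Python sketch below is only an illustration with
  invented names (LABELS, STD, LIN, parse, M_BEFORE); it uses the
  truncated percentages as printed, so small differences should not
  be taken too seriously.

```python
LABELS = ["q", "a", "o", "y", "r", "l", "s", "n", "d", "m",
          "ch", "sh", "ee", "k", "t", "ckh", "cth"]

# Rows of the "letters after each type of break" table.
STD = "15 5 22 3 1 4 2 . 8 . 18 9 . 3 2 . 2"
LIN = "14 . 15 17 . 2 11 . 18 . 4 5 . . 7 . ."

# Row <m> of the "letters before each type of break" table.
M_BEFORE = {"non": 0, "std": 0, "fig": 2, "lin": 18}


def parse(row):
    """Turn a table row into {letter: percent}, reading "." as 0."""
    values = [0 if f == "." else int(f) for f in row.split()]
    return dict(zip(LABELS, values))


std, lin = parse(STD), parse(LIN)

# Letters that are more common after a line break than after a
# word space, with the difference in percentage points.
rising = {lab: lin[lab] - std[lab] for lab in LABELS if lin[lab] > std[lab]}
print(rising)

# The <m> discrepancy, in percentage points.
print(M_BEFORE["lin"] - M_BEFORE["std"])
```

  With the printed percentages, the letters returned are y, s, d
  and t, and <m> differs by 18 points between lin and std.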