Hacking at the Voynich manuscript - Side notes
022 Analyzing QOKOKOKO element frequencies per section

Last edited on 1999-02-01 16:28:09 by stolfi

1998-06-20 stolfi
=================

  [ Originally part of Notes/023 ]
  [ First version done on 1998-05-04, now redone with fresher data. ]
  [ Split off from Notes/023 as Notes/022 on 1999-01-31. ]

The purpose of this note is to compare the frequencies of the
QOKOKOKO elements (see Notes/017 and Notes/018) among the various
sections of the VMS.

Since I am still not clear on how to group the O's with the K's (with
the following K, with the preceding K, with both, with neither), I
will leave them as separate elements.  Also, for simplicity (without
any conviction at all), I will split every double-letter O into two
elements.

Also, given Grove's observations on anomalous "p" and "t"
distributions at beginning-of-line, and the well-known attraction of
certain elements for end-of-line, it seems advisable to discard the
first few and the last few elements of every line.

I. EXTRACTING AND COUNTING ELEMENTS

We will prepare two sets of statistics, one using raw words ("-r")
and one using word equivalence classes ("-c").

  elem-to-class -describe

    element equivalence:
      map_ee_to_ch
      ignore_gallows_eyes
      join_ei
      equate_aoy
      collapse_ii
      equate_eights
      equate_pt
      erase_word_spaces
      crush_invalid_words
      append_tilde

This mapping will hopefully reduce transcription and sampling noise.
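To make the reduction concrete, here is a rough sh/sed stand-in for a few of the rules. This is my reconstruction, not the actual elem-to-class script: equate_aoy folds "a" and "y" into "o", and I assume the gallows letters "k", "p", "f" are folded into "t" (inferred from the EQV tables further below, which list no separate k/p/f classes), with append_tilde tagging each reduced class.

```shell
#!/bin/sh
# Rough stand-in for part of elem-to-class (hypothetical reconstruction).
# One element per input line:
#   equate_aoy   : fold "a" and "y" into "o"
#   gallows fold : fold "k", "p", "f" into "t" (assumed; the EQV tables
#                  list no separate k/p/f classes)
#   append_tilde : mark every reduced class with a trailing "~"
elem_to_class() {
  sed -e 's/[ay]/o/g' -e 's/[kpf]/t/g' -e 's/$/~/'
}
```

For example, `printf 'y\nke\nckh\n' | elem_to_class` yields "o~", "te~", "cth~", which are classes that do appear in the EQV tables below.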
The source data will be the majority edition derived from interlinear
release 1.6e6, already chopped into pages and sections:

  ln -s ../045/pages-m text-pages
  ln -s ../045/subsecs-m text-subsecs

Creating a combined file of the source text for archiving:

  ( cd text-pages && cat `cat all.names | sed -e 's/$/.evt/'` ) \
    > all.evt

  cat all.evt \
    | validate-new-evt-format \
        -v chars='abcdefghijklmnopqrstuvxyz?*' \
        -v requireUnitHeaders=0 \
        -v requirePageHeaders=0

Selecting the plain text and factoring it into elements:

  foreach utype ( pages subsecs )
    foreach f ( `cat text-${utype}/all.names` )
      set ofile = "/tmp/${utype}-${f}.etx"
      echo ${ofile}
      cat text-${utype}/${f}.evt \
        | select-units \
            -v types='parags,starred-parags,circular-lines,circular-text,radial-lines,titles' \
            -v table=unit-to-type.tbl \
        | lines-from-evt | egrep '.' \
        | factor-line-OK | egrep '.' \
        > ${ofile}
    end
  end

Separating the elements and mapping them to classes:

  foreach ep ( cat.RAW elem-to-class.EQV )
    set etag = ${ep:e}; set ecmd = ${ep:r}
    foreach utype ( pages subsecs )
      foreach f ( `cat text-${utype}/all.names` )
        set ofile = "/tmp/${utype}-${f}-${etag}.els"
        echo ${ofile}
        cat /tmp/${utype}-${f}.etx \
          | trim-three-from-ends \
          | tr '{}._' '\012\012\012\012' \
          | ${ecmd} | egrep '.' \
          > ${ofile}
      end
    end
  end

Counting elements and computing relative frequencies:

  mkdir -p RAW EQV

  foreach ep ( cat.RAW elem-to-class.EQV )
    set etag = ${ep:e}; set ecmd = ${ep:r}
    /bin/rm -rf ${etag}/efreqs
    mkdir -p ${etag}/efreqs
    foreach utype ( pages subsecs )
      set frdir = "${etag}/efreqs/${utype}"
      mkdir -p ${frdir}
      cp -p text-${utype}/all.names ${frdir}/
      foreach f ( `cat text-${utype}/all.names` )
        set ofile = "${frdir}/$f.frq"
        echo ${ofile}
        cat /tmp/${utype}-${f}-${etag}.els \
          | sort | uniq -c | expand \
          | sort -b +0 -1nr \
          | compute-freqs \
          > ${ofile}
      end
    end
  end

  /bin/rm /tmp/{pages,subsecs}-*{,-RAW,-EQV}.{etx,els}

Computing total frequencies:

  foreach ep ( cat.RAW elem-to-class.EQV )
    set etag = ${ep:e}; set ecmd = ${ep:r}
    foreach utype ( pages subsecs )
      set fmt = "${etag}/efreqs/${utype}/%s.frq"
      set frfiles = ( \
        `cat text-${utype}/all.names | gawk '/./{printf "'"${fmt}"'\n",$0;}'` \
      )
      echo ${frfiles}
      cat ${frfiles} \
        | gawk '/./{print $1, $3;}' \
        | combine-counts \
        | sort -b +0 -1nr \
        | compute-freqs \
        > ${etag}/efreqs/${utype}/tot.frq
    end
  end

II. TABULATING ELEMENT FREQUENCIES PER SUBSECTION

  set sectags = ( `cat text-subsecs/all.names` )
  echo $sectags

  foreach etag ( RAW EQV )
    tabulate-frequencies \
      -dir ${etag}/efreqs/subsecs \
      -title "elem" \
      tot ${sectags}
  end

Elements sorted by frequency (× 99), per subsection:

  tot     unk     pha     str     hea     heb     bio     ast     cos     zod
  ------- ------- ------- ------- ------- ------- ------- ------- ------- -------
  17 o    16 o    23 o    15 o    20 o    14 y    15 o    18 o    16 o    20 o
  12 y    12 a     9 y    11 y    11 y    14 o    15 y    13 y    13 y    13 a
   9 a    11 y     8 l    11 a     9 ch   11 d    10 d     9 a    11 a     8 y
   8 d     8 d     6 a     8 d     7 a    10 a     8 l     7 d     9 l     8 l
   7 l     7 l     6 d     6 l     7 d     6 k     7 a     5 ch    6 d     7 t
   5 k     5 r     4 r     6 k     6 l     5 l     6 q     5 ee    4 ch    5 r
   4 ch    5 k     4 k     4 q     5 r     4 ch    6 k     5 r     4 k     5 d
   4 r     4 ch    4 ch    4 ee    4 k     4 r     3 ee    4 k     4 r     5 ee
   4 q     3 t     3 q     4 r     3 t     3 iin   3 che   3 s     4 t     3 ch
   3 ee    3 iin   3 ee    3 iin   3 iin   3 che   3 r     3 l     3 iin   3 k
   3 iin   3 q     3 che   3 ch    2 sh    2 q     2 she   3 t     2 sh    2 te
   3 t     2 che   3 iin   3 che   2 q     2 t     2 t     2 che   2 q     2 iin
   3 che   2 sh    2 ke    3 t     2 s     2 ee    2 ch    2 iin   2 ee    1 s
   1 sh    1 ee    1 s     1 she   1 che   1 ke    1 in    2 ke    2 che   1 che
   1 she   1 she   1 ?     1 sh    1 cth   1 sh    1 ke    1 q     1 s     1 sh
   1 ke    1 p     1 sh    1 ke    1 ee    1 she   1 iin   1 ?     1 she   1 she
   1 s     1 s     1 she   1 in    0 she   1 s     1 sh    1 e?    0 ke    0 p
   0 in    0 ke    1 t     0 p     0 p     0 te    1 s     1 she   0 e?    0 ?
   0 p     0 te    0 ckh   0 s     0 ckh   0 p     0 te    1 te    0 p     0 in
   0 te    0 ir    0 ckhe  0 te    0 in    0 ckh   0 ckh   1 sh    0 te    0 ke
   0 cth   0 cth   0 te    0 ir    0 m     0 f     0 p     0 p     0 ?     0 eee
   0 ckh   0 f     0 p     0 eee   0 ke    0 in    0 cth   0 ir    0 cth   0 e?
   0 ir    0 in    0 e?    0 ckh   0 cph   0 ir    0 ckhe  0 cth   0 in    0 m
   0 ?     0 m     0 iiir  0 cth   0 te    0 m     0 f     0 ckh   0 m     0 cth
   0 eee   0 ckh   0 in    0 f     0 cthe  0 cth   0 eee   0 eee   0 ckh   0 cthe
   0 f     0 cph   0 cth   0 e?    0 f     0 e?    0 e?    0 cthe  0 eee   0 iir
   0 m     0 eee   0 cthe  0 ckhe  0 ?     0 cthe  0 ir    0 in    0 cthe  0 q
   0 e?    0 e?    0 ir    0 ?     0 ckhe  0 ?     0 cthe  0 iir   0 ir    0 ckhe
   0 ?     0 f     0 m     0 e?    0 ckhe  0 iiin  0 i?    0 iir   0 cthe  0 cfh
   0 m     0 iir   0 eee   0 eee   0 h?    0 il    0 cfh   0 cph   0 cthe  0 eee
   0 i?    0 ir    0 iir   0 cph   0 j     0 ckhe  0 iir   0 cphe  0 j     0 cthe
   0 n     0 iiin  0 cphe  0 ckhe  0 ij    0 iiin  0 ckhe  0 cphe  0 iiin  0 cfh
   0 cphe  0 ?     0 cph   0 n     0 iir   0 cph   0 cph   0 cphe  0 cph   0 ck
   0 f     0 i?    0 im    0 iiin  0 il    0 i?    0 cfhe  0 ikh   0 im    0 cphe
   0 n     0 iir   0 n     0 iiin  0 i?    0 il    0 m     0 cfh   0 i?    0 i?
   0 cphe  0 iir   0 im    0 m     0 iiir  0 x     0 cfhe  0 de    0 ct    0 cfh
   0 n     0 de    0 iil   0 de    0 x     0 de    0 n     0 cfh   0 il    0 il
   0 is    0 is    0 im    0 x     0 de    0 im    0 n     0 im    0 cfhe  0 de
   0 i?    0 cfhe  0 id    0 cfhe  0 b     0 id    0 is    0 x     0 pe    0 cfh
   0 ck    0 ith   0 is    0 iil   0 id    0 c?    0 j     0 iid   0 iil   0 iir
   0 h?    0 id    0 cf    0 b     0 ck    0 iiid  0 g     0 cp    0 ct    0 iiil
   0 h?    0 ct    0 iil   0 iim   0 iiil  0 g     0 id    0 iis   0 iiir  0 iil

I have compared these counts with those obtained by removing two,
one, or zero elements from each line end.  The conclusion is that the
ordering of the first six entries in each column is quite stable; it
is probably not an artifact.

Some quick observations: there seem to be three "extremal" samples:
hea ("ch" abundant), bio ("q" important), and zod ("t" important).

There are too many "e?" elements; I must check where they come from,
and perhaps modify the set of elements to account for them.

  [ It seems that many came from groups of the form "e[ktpf]e" and
    "e[ktpf]ee", which could be "c[ktpf]h" and "c[ktpf]he" without
    ligatures.  Most of the remaining ones come from Friedman's
    transcription; there are practically none in the more careful
    transcriptions. ]

All valid elements that occur at least 10 times in the text:

  o y a q n in iin iiin r ir iir iiir d s is l il m im j de k t ke te
  p f cth ckh cthe ckhe cph cfh cphe cfhe ch che sh she ee eee x

Valid elements that occur less than 10 times in the whole text:

  iil ij pe ct ck id

Created a file "RAW/plots/vald/keys.dic" with all the valid elements.
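The .frq files read back above have a three-column format (count, relative frequency, element), since the totals step extracts fields $1 and $3. A minimal stand-in for the compute-freqs filter, assuming that format (this is my sketch, not Stolfi's actual script), could be:

```shell
#!/bin/sh
# Hypothetical stand-in for compute-freqs: read "count element" pairs
# (as produced by "sort | uniq -c"), append each element's relative
# frequency over the grand total, and emit "count freq element" lines.
compute_freqs() {
  awk '{ n[NR] = $1; e[NR] = $2; tot += $1 }
       END { for (i = 1; i <= NR; i++)
               printf "%7d %7.5f %s\n", n[i], n[i]/tot, e[i] }'
}
```

For example, `printf '3 o\n1 y\n' | compute_freqs` assigns "o" a relative frequency of 0.75 and "y" 0.25.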
Equiv-reduced elements sorted by frequency (× 99), per subsection:

  tot      unk      pha      str      hea      heb      bio      ast      cos      zod
  -------- -------- -------- -------- -------- -------- -------- -------- -------- --------
  38 o~    40 o~    40 o~    38 o~    39 o~    40 o~    37 o~    40 o~    41 o~    42 o~
  10 t~    10 t~     8 l~    11 t~    11 ch~   12 d~    10 d~    10 ch~    9 t~    11 t~
   8 d~     8 d~     7 ch~    8 d~     9 t~    10 t~     9 t~     8 t~     9 l~     8 ch~
   8 ch~    7 l~     7 d~     8 ch~    7 d~     6 ch~    8 l~     7 d~     7 ch~    8 l~
   7 l~     6 ch~    6 t~     6 l~     6 l~     5 l~     6 q~     5 r~     6 d~     5 d~
   4 r~     5 r~     4 r~     4 q~     5 r~     4 r~     5 ch~    3 s~     4 r~     5 r~
   4 q~     3 in~    3 q~     4 in~    4 in~    3 in~    3 che~   3 l~     3 in~    3 te~
   4 in~    3 q~     3 in~    4 r~     2 sh~    3 che~   3 in~    3 te~    2 sh~    3 in~
   3 che~   2 che~   3 che~   4 che~   2 cth~   2 q~     3 r~     3 che~   2 q~     2 che~
   1 te~    2 sh~    2 te~    1 te~    2 q~     2 te~    2 she~   2 in~    2 che~   1 s~
   1 sh~    1 te~    1 s~     1 she~   2 s~     1 sh~    2 te~    1 q~     1 te~    1 she~
   1 she~   1 she~   1 ?~     1 sh~    1 che~   1 she~   1 sh~    1 ?~     1 s~     1 sh~
   1 cth~   1 cth~   1 cth~   0 s~     0 te~    1 s~     1 cth~   1 e?~    1 she~   0 ?~
   1 s~     1 s~     1 sh~    0 cth~   0 she~   1 cth~   1 s~     1 she~   0 cth~   0 e?~
   0 ir~    0 ir~    1 she~   0 ir~    0 cthe~  0 ir~    0 cthe~  1 cth~   0 e?~    0 cthe~
   0 cthe~  0 cthe~  1 cthe~  0 cthe~  0 ir~    0 cthe~  0 e?~    1 sh~    0 ?~     0 cth~
   0 ?~     0 e?~    0 ir~    0 e?~    0 ?~     0 e?~    0 ir~    0 ir~    0 ir~    0 ir~
   0 e?~    0 ?~     0 e?~    0 ?~     0 e?~    0 ?~     0 h?~    0 cthe~  0 cthe~  0 q~
   0 n~     0 id~    0 i?~    0 i?~    0 n~     0 id~    0 ith~   0 i?~    0 id~    0 i?~
   0 n~     0 de~    0 il~    0 i?~    0 i?~    0 ct~    0 il~    0 il~    0 i?~    0 is~
   0 n~     0 ct~    0 n~     0 ?~     0 id~    0 id~    0 il~    0 n~     0 de~    0 id~
   0 x~     0 il~    0 de~    0 x~     0 id~    0 id~    0 de~    0 de~    0 n~     0 ct~
   0 x~     0 il~    0 de~    0 is~    0 is~    0 b~     0 i?~    0 x~     0 is~    0 is~
   0 h?~    0 h?~    0 c?~    0 ith~   0 b~     0 b~     0 c?~

There are 23 valid elements with frequency > 20 under the equivalence:

  o t te cth cthe ch che sh she d de id l r q s m n in ir im il

Valid elements with frequency below 20:

  ct is g b x

Created a file "EQV/plots/vald/keys.dic" with all the valid elements,
collapsed by the above equivalence.

IV. "ED"'S STORY

Rene observed that the EVA digraph "ed" is a marker for the A/B
language split.
He produced some plots where the horizontal axis is page number, with
subsections distinguished by colors.

Let's count the word frequencies per page:

  zcat ../037/vms-17-ok.soc.gz \
    | tr '/' '-' \
    | gawk \
        ' \
          (($2 ~ /[A]/) && ($6 \!~ /[-=., ]/)){ \
            gsub(/[.].*$/,"",$1); print $9, substr($10,2), $1, $6; \
          } \
        ' \
    | sort | uniq -c | expand \
    | sort -b +1 -2 +2 -3 +0 -1nr \
    > .all.fpw

  cat .all.fpw \
    | list-page-champs -v maxChamps=4 \
    > .all.chpw

Let's count the total word occurrences per page:

  cat .all.fpw \
    | gawk \
        ' /./{ k = ($2 " " $3 " " $4); ct[k] += $1; } \
          END { for(w in ct) { print ct[w], w; } } \
        ' \
    | sort -b +2 -3 +0 -1nr \
    > .all.tpw

Let's now count the "ed"-containing words per page:

  cat .all.fpw \
    | gawk \
        ' ($5 ~ /ed/){ print; } ' \
    > .ed.fpw

  cat .ed.fpw \
    | list-page-champs -v maxChamps=6 \
    > .ed.chpw

  cat .ed.fpw \
    | gawk '//{ print $1,$2,$3,$4; }' \
    | combine-counts \
    | sort -b +2 -3 \
    > .ed.tpw

Let's plot the ratio of "ed"-words to total words per page:

  plot-freqs .ed.tpw .all.tpw

The plots of the "ed"-ratio R show that "hea" and "pha" are virtually
"ed"-free (R < 0.03, below the error level); "cos-1" (the part before
the "zod") and "zod" begin with slightly higher ratios than
"hea"/"pha" (R ~ 0.04); R then increases sharply along "zod", from
R ~ 0.03 to R ~ 0.11, and jumps to R ~ 0.20 in "cos-2" (the part
after "zod").  "heb-2" (after "zod") has R ~ 0.17, just below that of
"cos-2".  "heb-1" (before "zod") has widely variable R, with mean
R ~ 0.20.  "str" has R ~ 0.20 like "heb", but more uniform (except
for the two pages before "zod", which have R ~ 0.02).  "bio" has
R ~ 0.20 in the middle, and R ~ 0.32 at both ends.

So, based only on these plots, the writing sequence would be

  hea + pha (no obvious order)
  cos-1 + zod
  heb-2
  str + heb-1 + cos-2
  bio

V. ABOUT "ED" AND LADY "DY"

It seems that most of the "ed" words in language B are actually words
that end with "dy".  In fact there seems to be a very small number of
words involved.
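The R values quoted in these sections are ratios of two ".tpw" tables, each holding one "count key1 key2 key3" line per page. As a rough, hypothetical stand-in for just the arithmetic done by plot-freqs (the real script also draws the plot; the `ratio` helper and its field naming are my assumptions):

```shell
#!/bin/sh
# Hypothetical helper mirroring the ratio part of plot-freqs.
# Both input files hold "count sec subsec page" lines; for each page,
# print R = count(subset)/count(all).
ratio() {
  awk 'NR == FNR { tot[$2" "$3" "$4] = $1; next }
       tot[$2" "$3" "$4] > 0 {
         k = $2" "$3" "$4
         printf "%s %6.4f\n", k, $1/tot[k]
       }' "$1" "$2"
}
# usage sketch:  ratio .all.tpw .ed.tpw
```

So a page with 100 words, 20 of them containing "ed", gets R = 0.2000.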
Let's plot the per-page frequencies of the "dy" ending:

  cat .all.fpw \
    | gawk ' ($5 ~ /dy$/){ print $1,$2,$3,$4; } ' \
    | combine-counts \
    | sort -b +2 -3 \
    > .dy.tpw

  plot-freqs .dy.tpw .all.tpw

This plot shows the same trends as the "ed" frequency, except that
the data for language A is noisier, and the distinction between
languages A and B is less marked (because the counts for language A
are no longer zero).  Here "cos-1" and "zod" are practically equal.

Curiously, pharma has a slightly higher R than herbal-A, and R
actually decreases as we go down herbal-A.  This decrease is strange,
since the trend in the zodiac pages establishes that "dy" increases
from older to newer, hence language A should be earlier than
language B.

Let's try again with the "edy" ending proper:

  cat .all.fpw \
    | gawk ' ($5 ~ /edy$/){ print $1,$2,$3,$4; } ' \
    | combine-counts \
    | sort -b +2 -3 \
    > .edy.tpw

  plot-freqs .edy.tpw .all.tpw

These plots are like the "dy" plots, but cleaner.  The "bio"
subsection has R ~ 0.25, with a dip in the middle.  Subsections
"str-2" and "heb-1" have almost the same R ~ 0.15.  Subsections
"cos-2" and "heb-2" have R ~ 0.10.  Subsections "cos-1" and "zod"
have R ~ 0.03 (barely significant), and the trend in "zod" is not so
clear.  Finally, subsections "hea-1", "hea-2", "pha", and "str-1"
have hardly any "edy".

VI. THE "EDY" WORDS

Let's compute the overall frequency of each word per subsection,
removing the "q" prefix and mapping [ktpf] to "k":

  cat .all.fpw \
    | gawk \
        ' /./{ \
            gsub(/^q/,"",$5); gsub(/[ktpf]/,"k",$5); \
            print $1,$2,"000","000",$5 \
          } \
        ' \
    | combine-counts \
    | sort -b +1 -2 +0 -1nr \
    > .all.ftw

  cat .all.ftw \
    | gawk '/./{print $1,$2,$3,$4 } ' \
    | combine-counts \
    | sort -b +2 -3 \
    > .all.ttw

Now let's look at the "edy" words specifically:

  cat .all.ftw \
    | gawk '($5 ~ /edy$/){ print; }' \
    > .edy.ftw

  cat .edy.ftw \
    | list-page-champs -v maxChamps=6 \
    > .edy.chtw

Here are the six most common "edy" words in each subsection, manually
sorted:

  sec  totwd  champions
  ---  -----  ----------------------------------------------------------------------
  str  10783  okedy(180) okeedy(271) chedy(193) shedy(119) lchedy(56) okchedy(131)
  bio   6716  okedy(310) okeedy(252) chedy(218) shedy(252) lchedy(59) okchedy(44)
  heb   3337  okedy(101) okeedy(31) chedy(67) shedy(36) kedy(26) ykedy(25)
  cos   2590  okeedy(11) chedy(12) okedy(23) shedy(11) okchedy(10) kchedy(5)
  zod    997  okeedy(5) chedy(4) okedy(3) shedy(5) okshedy(2) eeedy(2)
  hea   7553  chedy(1) ykchedy(1) okeedy(2) esedy(1)
  pha   2401  chedy(1) ockhedy(1) cheedy(2) cholkeedy(1) ckhedy(1) okedy(1)
  unk   1847  okedy(21) chedy(19) okchedy(14) shedy(14) okeedy(7) olkeedy(7)
  ---  -----  ----------------------------------------------------------------------

As can be seen, the "edy" words (there are many of them!) are
characteristic of language B ("bio", "heb", "str"), and also a bit of
"cos" and "zod".  The frequency of "okedy" (and its k/t/q variants)
is

  1:22   in "bio"
  1:33   in "heb"
  1:55   in "str"
  1:220  in "cos"
  1:200  in "zod"

and practically nil in "hea" and "pha".
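The word normalization used for these counts can be checked in isolation. The following sh function reproduces the two gsub calls of the gawk step with plain sed (the wrapper is mine; only the substitutions come from the pipeline above):

```shell
#!/bin/sh
# Reproduce the word normalization used for the .ftw counts: strip a
# leading "q" and fold the four gallows letters [ktpf] into "k", so
# that "qokedy", "otedy", "okedy", etc. all land in the same bucket.
normalize() {
  sed -e 's/^q//' -e 's/[ktpf]/k/g'
}
```

For example, `printf 'qokedy\notedy\nokedy\n' | normalize` prints "okedy" three times, which is why the table counts "okedy" together with its k/t/q variants.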
Let's look at the words that DON'T end in "edy":

  cat .all.ftw \
    | gawk ' ($5 \!~ /edy$/){ print; } ' \
    > .not-edy.ftw

  cat .not-edy.ftw \
    | list-page-champs -v maxChamps=6 \
    > .not-edy.chtw

These are the six most common non-"edy" words in each subsection,
also manually sorted:

  sec  totwd  champions
  ---  -----  ----------------------------------------------------------------------
  str  10783  okaiin(350) okal(198) okeey(341) aiin(199) okain(173) okar(184)
  bio   6716  okaiin(145) okal(185) okeey(128) okain(240) ol(363) oky(124)
  heb   3337  okaiin(67) okal(56) aiin(68) okar(92) daiin(79) or(68)
  cos   2590  aiin(44) ar(57) okeey(45) or(43) dar(40) daiin(39)
  zod    997  aiin(29) ar(28) okeey(24) al(30) okaiin(21) okal(17)
  hea   7553  daiin(393) chol(215) chor(144) okchy(142) ckhy(138) oky(131)
  pha   2401  daiin(105) chol(47) okeol(62) okol(52) okeey(51) ol(41)
  unk   1847  okar(58) daiin(42) okaiin(40) okal(32) aiin(31) or(31)
  ---  -----  ----------------------------------------------------------------------

Note that "daiin" is the most common word in herbal-A and pharma, but
it shows up also in the other subsections, at 1/2 to 1/4 the
frequency:

  "hea"  1:18
  "pha"  1:24
  "heb"  1:40
  "str"  1:75
  "bio"  1:80
  "cos"  1:60
  "zod"  1:80

So perhaps "daiin" is a function word that got less and less used as
the author's vocabulary expanded.

The most popular non-"edy" words in language B are

  okaiin okal okeey aiin okain okar

They are fairly uniform across subsections, except perhaps "okar",
which is more concentrated in herbal-B.  It is hard to draw any
conclusion from these lists (other than `it strongly suggests
Chinese' 8-).

Let's try the words "chedy"/"shedy":

  cat .all.fpw \
    | gawk \
        ' ($5 ~ /^[cs]hedy$/){ \
            print $1,$2,$3,$4 \
          } \
        ' \
    | combine-counts \
    | sort -b +2 -3 \
    > .chedy.tpw

  plot-freqs .chedy.tpw .all.tpw

Predictably, the R values are smaller overall, and only "str-2",
"heb-1", and "bio" are significantly greater than 0.  The "bio" pages
still show the dip in the middle.
Let's try "daiin"/"dain", which should show the reverse trend:

  cat .all.fpw \
    | gawk \
        ' ($5 ~ /^d[ao]i+n$/){ \
            print $1,$2,$3,$4 \
          } \
        ' \
    | combine-counts \
    | sort -b +2 -3 \
    > .dain.tpw

  plot-freqs .dain.tpw .all.tpw

Predictably again, these pages show the opposite trends.  In "hea", R
is large and decreasing ("hea-1" has R ~ 0.07, "hea-2" has R ~ 0.04).
Next is "pha" at R ~ 0.04, then "heb-2" and "heb-1" at R ~ 0.03, then
"str", "cos", "zod", and "bio", all at R ~ 0.02.

The "unk" pages f1r and f49v have R ~ 0.08, which is right in the
middle of the herbal-A range.  The others have lower R, which is
consistent with language-B material.

Let's compare the frequencies of "Ke" elements relative to total
non-[aoy] (mostly "K" and "Ke") elements:

  cat .all.fpw \
    | ../017/factor-field-OK \
        -v inField=5 \
        -v outField=6 \
    | gawk \
        ' /./{ \
            f = $6; \
            gsub(/^[^{}]*/,"",f); gsub(/[^{}]*$/,"",f); \
            gsub(/}[^{}]*{/,"} {",f); n = split(f, ff); \
            for(i=1;i<=n;i++){ print $1,$2,$3,$4,ff[i]; } \
          } \
        ' \
    | grep -v '{_}' \
    | combine-counts \
    | sort -b +1 -2 +2 -3 +0 -1nr \
    > .all.fpe

  dicio-wc .all.fpw .all.fpe

      lines   words     bytes  file
    ------- ------- --------- ------------
      24921  124605    688935  .all.fpw
       5632   28160    147899  .all.fpe

And let's compute the total non-[aoy] elements per page:

  cat .all.fpe \
    | gawk \
        ' /./{ k = ($2 " " $3 " " $4); ct[k] += $1; } \
          END { for(w in ct) { print ct[w], w; } } \
        ' \
    | sort -b +2 -3 +0 -1nr \
    > .all.tpe

  dicio-wc .all.tpw .all.tpe

      lines   words     bytes  file
    ------- ------- --------- ------------
        227     908      3798  .all.tpw
        227     908      3924  .all.tpe

Let's now count the "Ke" elements only:

  cat .all.fpe \
    | gawk \
        ' ($5 ~ /{([ice][ktpf]?[he]|[ktpf])e}/){ print; } ' \
    > .Ke.fpe

  cat .Ke.fpe \
    | gawk '//{print $1,$2,$3,$4; }' \
    | combine-counts \
    | sort -b +2 -3 \
    > .Ke.tpe

  plot-freqs .Ke.tpe .all.tpe

Strangely, the plots show little change from language A to
language B, less than the variation within the same subsection.
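The element extraction above can be seen on a single word. The gsubs come straight from the pipeline; the input "o{k}e{ch}y" is a made-up example of the factored format, where braces mark the non-[aoy] elements:

```shell
#!/bin/sh
# Reproduce the element-extraction gsubs of the .fpe pipeline on one
# factored word (hypothetical example of the factored format).
extract() {
  awk '{ f = $0
         gsub(/^[^{}]*/,"",f); gsub(/[^{}]*$/,"",f)   # drop text outside braces
         gsub(/}[^{}]*{/,"} {",f)                     # drop [aoy] between elements
         n = split(f, ff)
         for (i = 1; i <= n; i++) print ff[i] }'
}
```

For example, `printf 'o{k}e{ch}y\n' | extract` prints "{k}" and "{ch}", i.e. the two non-[aoy] elements of the word.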
The ratio for "hea-1" is lowest (R ~ 0.03) and is minimum around page
p025.  Curiously, it seems to oscillate with a period of 1-2 pages.
The ratios for all other subsections are about the same, around 0.10.
The "zod" pages show again a sharp increasing trend, except for the
first couple of pages.

Observations:

  If languages A and B are indeed different languages, it is hard to
  explain why some letter-group statistics are so uniform, and why
  some are so variable.

  If languages A and B are different spellings of the same language,
  then the spelling change must not have affected the use of "Ke"
  elements relative to the "K" elements.

  If the difference between languages A and B is merely due to
  vocabulary (including tense/person/etc.), then the difference again
  must not favor "Ke" words over "K" words.

Let's try the gallows elements only:

  cat .all.fpe \
    | gawk \
        ' ($5 ~ /{.*[ktpf].*}/){ print; } ' \
    > .ktpf.fpe

  cat .ktpf.fpe \
    | gawk '//{print $1,$2,$3,$4; }' \
    | combine-counts \
    | sort -b +2 -3 \
    > .ktpf.tpe

  plot-freqs .ktpf.tpe .all.tpe

These plots are even more uniform than the previous ones.  The ratio
of gallows elements to non-gallows elements is amazingly constant
(R ~ 0.22) for all subsections and languages.  One cannot even see
the "zod" trend.

Let's look at the "skeletons" of the words, obtained by deleting the
[aoy] inserts and the [i] and [e] modifiers:

  cat .all.fpw \
    | ../017/factor-field-OK \
        -v inField=5 \
        -v outField=6 \
    | gawk \
        ' /./{ \
            f = $6; \
            gsub(/^[^{}]*/,"",f); gsub(/[^{}]*$/,"",f); \
            if (match(f,/{([ice][ktpf]?[eh]|[ktpf])e}/)) \
              { gsub(/e}/,"}",f); } \
            gsub(/{i+/,"{",f); gsub(/{_}/,"",f); \
            gsub(/}[^{}]*{/,"",f); \
            print $1,$2,$3,$4,f; \
          } \
        ' \
    | combine-counts \
    | sort -b +1 -2 +2 -3 +0 -1nr \
    > .all.fps

And let's compute the total non-[aoy] elements per page:

  cat .all.fps \
    | gawk \
        ' /./{ k = ($2 " " $3 " " $4); ct[k] += $1; } \
          END { for(w in ct) { print ct[w], w; } } \
        ' \
    | sort -b +2 -3 +0 -1nr \
    > .all.tps

  dicio-wc .all.tpw .all.tps
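The skeleton computation can likewise be traced on one word. The gsubs are taken verbatim from the pipeline above; "o{k}e{che}y" is again a made-up example of the factored format (an [aoy]-framed word with a "che" element carrying an "e" modifier):

```shell
#!/bin/sh
# Reproduce the skeleton gsubs of the .fps pipeline on one factored
# word: strip the [aoy] inserts, drop the "e" modifier, and fuse the
# remaining element cores into a single braced skeleton.
skeleton() {
  awk '{ f = $0
         gsub(/^[^{}]*/,"",f); gsub(/[^{}]*$/,"",f)
         if (match(f, /{([ice][ktpf]?[eh]|[ktpf])e}/)) gsub(/e}/,"}",f)
         gsub(/{i+/,"{",f); gsub(/{_}/,"",f)
         gsub(/}[^{}]*{/,"",f)
         print f }'
}
```

For example, `printf 'o{k}e{che}y\n' | skeleton` prints "{kch}": the "o", "e", "y" inserts and the final "e" of "che" are gone, and the cores "k" and "ch" are fused into one skeleton.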