# Hacking at the Voynich manuscript - Side notes
# 093 Parsing words into elements and counting them

Last edited on 2026-01-17 23:35:56 by stolfi

This note is a remake of note 622. Its purpose is to parse the
transcription into "elements" according to my new word paradigm, and to
compute statistics thereof.

The main elements comprise the single glyphs @[aoyqdlrs]; the benches
@Ch, @Sh, @Ih, the "topless bench" @'ee', and the platform gallows
(@CTh, @CPHh, @IKh, etc.), both optionally modified by a single @e; and
the codas @n, @m, @in, @im, @ir, ... @iiin, @iiim, @iiir. Other rare
combinations will be flagged and handled appropriately later.

SETUP

  ln -s ../.. work

  ln -s work/compute_freqs.gawk
  ln -s work/combine_counts.gawk
  ln -s work/error_funcs.gawk
  ln -s work/validate_25e1_ivt_format.gawk
  ln -s work/error_funcs.py
  ln -s work/process_funcs.py
  ln -s work/factor_field_general.gawk
  ln -s work/factor_text_25e1_eva_to_elems.gawk

SOURCE TRANSCRIPTION

The source transcription will be my own 2025 transcription (code ";U")
completed with Rene's IVT (code ";Z"), as prepared in Note 074 and split
by section and type as per Note 092:

  ln -s ../092/st_words

The "*.eva" files in this folder should have all weirdos &NNN mapped to
"?". All comments "" and alignment markers "<%>" "<$>" [«=»] should have
been removed. The files "*.wff" should have fractional counts that take
into account the dubious spaces ','.

Also, all ligature braces '{...}' should have been removed, and all EVA
characters should have been lowercased. This loses information about
weird ligatures like {Cto} or {Qy}. We will fix this detail later.

We consider only "parags" type text.

SAVING

  now="$( yyyy-mm-dd-hhmmss )"
  echo now=${now} 1>&2
  # now=2026-01-15-110038

  mkdir -p SAVE/${now}
  cp -av \
    Note-093.txt \
    do_093*.sh \
    elem_parse_funcs.gawk \
    parse_ivt_file_into_elements.gawk \
    parse_wff_file_into_elements.gawk \
    parse_ivt_files_into_elements.sh \
    parse_weff_file_into_okoko_pats.gawk \
    parse_oko_file_into_cmc_pats.gawk \
    SAVE/${now}

LEVEL 0 - EVA CHARS

Let's first count how many words are composed of the standard EVA chars,
as opposed to rare chars (@b @g @j @u @v @x), weirdos @&NNN, and
unreadable glyphs @?.

  do_093_char_stats.sh

           all           gud           bad  % gud  sec-type
  ------------  ------------  ------------  -----  ----------
   6210.000000   6172.000000     38.000000  99.39  bio-parags
   1008.500000    984.500000     24.000000  97.62  cos-parags
   7749.000000   7569.000000    180.000000  97.68  hea-parags
   3364.500000   3310.500000     54.000000  98.40  heb-parags
   2291.500000   2194.500000     97.000000  95.77  pha-parags
  10714.000000  10581.750000    132.250000  98.77  str-parags
   3001.500000   2944.500000     57.000000  98.10  unk-parags
  34339.000000  33756.750000    582.250000  98.30  tot-parags
  34339.000000  33616.750000    722.250000  97.90  tot-parags

So at least 97% of the words in the main sections ("str", "bio", "hea",
"heb") use only the "main" chars. The vast majority of the "bad" words
have "?". The largest counts of words that have no "?" or weirdos and
are rejected only because of rare chars are 51 in "hea-parags" and 40.25
in "str-parags".
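Just to document the test concretely: a minimal gawk sketch of the
char-level check (this is NOT do_093_char_stats.sh, which also handles
the fractional word counts; the file name "words.txt" is a placeholder,
and the "main" char set below is simply the 18 letters that occur in the
element set):

  # char_sketch.gawk -- classify one word per line as "gud" or "bad".
  # "gud" = only standard EVA chars; "bad" = anything else (rare chars,
  # weirdos already mapped to "?", or unreadable "?").
  /^[acdefhiklmnopqrsty]+$/ { gud++; next }
                            { bad++ }
  END {
    if (gud + bad > 0) {
      printf "gud = %d  bad = %d  (%.2f%% gud)\n", gud, bad, 100*gud/(gud+bad);
    }
  }

It would be run as, say, "gawk -f char_sketch.gawk words.txt"; the real
script works per section and type and weights the dubious-space tokens.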
From the words with valid chars only, per section and type, we extracted
the lexicons, consisting of the words with a certain minimum number of
occurrences that depends on the size of the subset:

LEXICON SIZES AMONG WORDS WITH VALID CHARS

  bio-parags lexicon size =  300  last = 3.000000 yteey
  cos-parags lexicon size =   57  last = 3.000000 shodaiin
  hea-parags lexicon size =  415  last = 3.000000 ytoldy
  heb-parags lexicon size =  208  last = 3.000000 ypchdy
  pha-parags lexicon size =  152  last = 3.000000 ykeor
  str-parags lexicon size =  554  last = 3.000000 ykedy
  unk-parags lexicon size =  206  last = 3.000000 ytody
  tot-parags lexicon size = 1376  last = 3.000000 ytodaiin

Note that the "tot-parags" lexicon is larger than the union of the other
lexicons, because it includes some words that occur sufficiently often
in two or more sections combined, but not often enough in any single one
of them.

LEVEL 1 - ELEMENTS

We now parse the words into the elements of the word paradigm. Valid
elements are surrounded by "{}". Glyphs that cannot be parsed as valid
elements (including "?" and rare chars) are surrounded by "[]". We
exclude words that have the rare chars above or the invalid glyph '?'.

The element set is

  {q} {o} {a} {y} {d} {r} {l} {s}
  {ch} {che} {sh} {she} {ee} {eee}
  {k} {ke} {t} {te} {p} {pe} {f} {fe}
  {ckh} {ckhe} {cth} {cthe} {cph} {cphe} {cfh} {cfhe}
  {n} {in} {iin} {iiin} {m} {im} {iim} {ir} {iir}

For a justification of those elements, see the file "Note-093-extra.txt".

  do_093_elem_stats.sh

           all           gud           bad  % gud  sec-type
  ------------  ------------  ------------  -----  ----------
   6172.000000   6123.500000     48.500000  99.21  bio-parags
    984.500000    949.500000     35.000000  96.44  cos-parags
   7569.000000   7434.500000    134.500000  98.22  hea-parags
   3310.500000   3245.500000     65.000000  98.04  heb-parags
   2194.500000   2137.500000     57.000000  97.40  pha-parags
  10581.750000  10416.750000    165.000000  98.44  str-parags
   2944.500000   2907.500000     37.000000  98.74  unk-parags
  33756.750000  33214.750000    542.000000  98.39  tot-parags

So at least 98% of all words in the main sections that have only the
"valid" chars can be parsed into valid elements of the model. Here are
the element frequencies:

  12564.750000 0.09922 {a}
     19.000000 0.00015 {cfhe}
     49.000000 0.00039 {cfh}
   3919.250000 0.03095 {che}
   5980.500000 0.04722 {ch}
    212.500000 0.00168 {ckhe}
    634.000000 0.00501 {ckh}
     57.000000 0.00045 {cphe}
    129.500000 0.00102 {cph}
    169.000000 0.00133 {cthe}
    709.500000 0.00560 {cth}
  11548.250000 0.09119 {d}
    322.000000 0.00254 {eee}
   3864.750000 0.03052 {ee}
    324.000000 0.00256 {f}
    158.000000 0.00125 {iiin}
     16.000000 0.00013 {iim}
   3780.000000 0.02985 {iin}
    131.500000 0.00104 {iir}
     41.000000 0.00032 {im}
   1674.000000 0.01322 {in}
    490.250000 0.00387 {ir}
   1526.500000 0.01205 {ke}
   7320.250000 0.05780 {k}
   9278.500000 0.07327 {l}
    875.000000 0.00691 {m}
    115.500000 0.00091 {n}
  21616.500000 0.17069 {o}
   1204.750000 0.00951 {p}
   5186.000000 0.04095 {q}
   5820.250000 0.04596 {r}
   1962.875000 0.01550 {she}
   2251.000000 0.01777 {sh}
   2093.000000 0.01653 {s}
    787.500000 0.00622 {te}
   4215.250000 0.03328 {t}
  15594.875000 0.12314 {y}
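For concreteness, here is a rough sketch of this factorization as a
small standalone gawk program (this is NOT the actual
elem_parse_funcs.gawk / parse_*_into_elements.gawk scripts; it just
picks, for each input word, a factorization into the elements above that
leaves the fewest bracketed glyphs, breaking ties arbitrarily; it would
be run as, say, "gawk -f elem_sketch.gawk words.txt", both names being
placeholders):

  # elem_sketch.gawk -- factor each word (one per line) into elements.
  BEGIN {
    nel = split("q o a y d r l s k t p f n m " \
                "ch che sh she ee eee ke te pe fe " \
                "cth cthe ckh ckhe cph cphe cfh cfhe " \
                "in iin iiin im iim ir iir", el, " ");
  }
  {
    w = $1; n = length(w);
    # best[i] = minimum number of unparsable glyphs in the first i chars
    best[0] = 0;
    for (i = 1; i <= n; i++) {
      # default: treat glyph i as unparsable, wrapped in "[]"
      best[i] = best[i-1] + 1; how[i] = "[" substr(w,i,1) "]"; back[i] = i-1;
      # or end a valid element at position i, if one matches there
      for (j = 1; j <= nel; j++) {
        e = el[j]; len = length(e);
        if (len <= i && substr(w, i-len+1, len) == e && best[i-len] <= best[i]) {
          best[i] = best[i-len]; how[i] = "{" e "}"; back[i] = i-len;
        }
      }
    }
    # rebuild the factorization from the back pointers
    out = "";
    for (i = n; i > 0; i = back[i]) { out = how[i] out; }
    print $1, out, (best[n] == 0 ? "ELEM-OK" : "ELEM-BAD");
  }

For example, "okeedy" comes out as "{o}{k}{ee}{d}{y} ELEM-OK", while
"cthhy" comes out as "{cth}[h]{y} ELEM-BAD" (compare the "cth!hy"
entries in the list of rejects below).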
Most of the lexemes that could not be parsed into elements occur only
once or twice in all the parags text. Here are those that occur more
than twice (with '!' showing where the parsing into elements failed):

  9.000000 0.016610 chckh!hy
  9.000000 0.016610 cth!hy
  7.000000 0.012920 da!i!idy
  5.000000 0.009230 chcth!hy
  5.000000 0.009230 ckh!hy
  4.500000 0.008300 a!il
  4.000000 0.007380 !ety
  4.000000 0.007380 qo!edy
  4.000000 0.007380 qo!eol
  3.000000 0.005540 chs!ey
  3.000000 0.005540 !cty
  3.000000 0.005540 q!ekchdy
  3.000000 0.005540 qo!edaiin
  3.000000 0.005540 shcph!hy
  3.000000 0.005540 shcth!hy
  2.500000 0.004610 a!is
  2.500000 0.004610 o!edy

As discussed in "Note-093-extra.txt", it may be worth "fixing" the
transcription by mapping all @'hh' to @'he', instead of rejecting those
words in the ELEM model.

A relatively common pattern in the rejected words is an initial @'qoe'
(78 tokens) or @'qe' (66 tokens). Maybe we should include those two
strings as elements. There are 63 tokens that start with @'oe' and 63
that start with @e; but maybe those are parts of words after dubious or
wrong spaces.

There are 90 tokens where the @c (@e with lig) is used in combinations
other than @'ch' or the platform gallows, like @'cty' (14 tokens),
@'cky' (12 tokens), @'cke' (6 tokens), etc. There are a few more like
those but with @i instead of @c. Those may be mistakes, like platform
gallows with a missed, miswritten, or misread @h.

There are a couple hundred tokens with one or more @'i' not followed by
[nmr]. The most common letters after those rejected @i runs are gallows
that are not followed by @h or @hh.

LEVEL 2 - OKOKO MODEL

We now consider the words that consist only of valid elements. We map
the elements to "O" = { @a, @o, @y } and "K" = all the others, and parse
the resulting string as a sequence of zero or more "K" with at most one
"O" after each "K" and an optional "O" prefix, with at most three "O"s
in total:

  do_093_okoko_stats.sh

           all           gud           bad  % gud  sec-type
  ------------  ------------  ------------  -----  ----------
   6123.500000   6104.750000     18.750000  99.69  bio-parags
    949.500000    925.500000     24.000000  97.47  cos-parags
   7434.500000   7265.125000    169.375000  97.72  hea-parags
   3245.500000   3206.875000     38.625000  98.81  heb-parags
   2137.500000   2070.000000     67.500000  96.84  pha-parags
  10416.750000  10242.156250    174.593750  98.32  str-parags
   2907.500000   2851.125000     56.375000  98.06  unk-parags
  33214.750000  32665.531250    549.218750  98.35  tot-parags

Thus, in the main sections, at least 97.7% of all words that consist of
valid elements also fit the OKOKO model. Many of the bad words are
rejected because of two or more "O" elements in a row.
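As a cross-check, the OKOKO test itself is tiny; here is a sketch in the
same style as above (again my own illustration, not do_093_okoko_stats.sh
or parse_weff_file_into_okoko_pats.gawk; it expects the braced
factorization of each word in field 2, as produced by the earlier
sketch):

  # okoko_sketch.gawk -- test factored words against the OKOKO model.
  function okoko_ok(fact,   n, i, parts, cls, s, ocount) {
    gsub(/^\{|\}$/, "", fact); n = split(fact, parts, /\}\{/);
    s = ""; ocount = 0;
    for (i = 1; i <= n; i++) {
      # "O" = a, o, y; "K" = every other element
      cls = (parts[i] ~ /^[aoy]$/) ? "O" : "K";
      if (cls == "O") { ocount++; }
      s = s cls;
    }
    # optional "O" prefix, then each "K" followed by at most one "O",
    # with at most three "O" in total
    return (s ~ /^O?(KO?)*$/) && (ocount <= 3);
  }
  { print $0, (okoko_ok($2) ? "OKOKO-OK" : "OKOKO-BAD"); }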
If we allow up to two consecutive "O" (but still no more than three
overall), the acceptance becomes almost total:

           all           gud           bad  % gud  sec-type
  ------------  ------------  ------------  -----  ----------
   6123.500000   6116.750000      6.750000  99.89  bio-parags
    949.500000    944.500000      5.000000  99.47  cos-parags
   7434.500000   7403.250000     31.250000  99.58  hea-parags
   3245.500000   3237.125000      8.375000  99.74  heb-parags
   2137.500000   2121.750000     15.750000  99.26  pha-parags
  10416.750000  10379.156250     37.593750  99.64  str-parags
   2907.500000   2886.375000     21.125000  99.27  unk-parags
  33214.750000  33088.906250    125.843750  99.62  tot-parags

Most of those 125 tokens (248 lexemes) with valid elements that still
fail the OKOKO model now look like two or more words stuck together:

  1.500000 0.011920 OKOKOK!O:okalod!y
  1.000000 0.007950 OKOKO!O:araro!y
  1.000000 0.007950 KOKOKOK!OK:chalykor!ain
  1.000000 0.007950 KOKOOK!O:cthodoal!y
  1.000000 0.007950 KOKOKKOK!OK:dolarshyd!or
  1.000000 0.007950 KOKOKOK!OK:folorar!om
  1.000000 0.007950 KOKOKKOK!OKO:kodamchocth!ody
  1.000000 0.007950 OOKKOK!O:oaldar!y
  1.000000 0.007950 OO!OKOK:oa!orar
  1.000000 0.007950 OKOKOK!O:octhodal!y
  1.000000 0.007950 OKOKOK!O:odalal!y
  1.000000 0.007950 OKOKOK!O:okairod!y
  1.000000 0.007950 OKOKKOKK!O:okalchold!y
  1.000000 0.007950 OKOKKOK!O:okoldal!y

LEVEL 2 - CORE-MANTLE-CRUST MODEL

In this model we ignore all "O" elements and map the others to the
specific classes

  "Q" = { @q },
  "D" = { @d, @l, @r, @s } (the /dealers/),
  "X" = { @ch, @sh, @ee } with an optional @e suffix (the /benches/),
  "G" = { @k, @t, @p, @f } with an optional @e suffix (the /simple gallows/),
  "H" = { @ckh, @cth, @cph, @cfh } with an optional @e or @h suffix
        (the /platform gallows/),
  "N" = { @n, @m } after zero or more @i, or @r after one or more @i
        (the /codas/).

In my original nomenclature, the "D" elements are called the /crust/,
the "X" elements are the /mantle/, and the "G" and "H" elements are the
/core/. The "Q" and "N" elements then could be the /seas/.

The CMC model says that a word is valid if it fits the pattern

  Q^q D^d X^x G^g H^h X^y D^e N^n

where q,n may be 0 or 1; g+h may be 0 or 1; q+d+e+n may be at most 3;
and x+h+y may be at most 2. That is, there can be at most one gallows
(G or H), at most three of Q, D, and N, and at most two benches,
counting a platform gallows as one implicit bench. With these rules we
get:

  do_093_cmc_stats.sh

           all           gud           bad  % gud  sec-type
  ------------  ------------  ------------  -----  ----------
   6116.750000   5970.187500    146.562500  97.60  bio-parags
    944.500000    913.750000     30.750000  96.74  cos-parags
   7403.250000   7156.375000    246.875000  96.67  hea-parags
   3237.125000   3106.875000    130.250000  95.98  heb-parags
   2121.750000   2052.625000     69.125000  96.74  pha-parags
  10379.156250   9952.687500    426.468750  95.89  str-parags
   2886.375000   2771.125000    115.250000  96.01  unk-parags
  33088.906250  31923.625000   1165.281250  96.48  tot-parags

There are ~1165 tokens (1555 lexemes) that satisfy the OKOKO model but
fail the CMC model.
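The CMC test can be sketched the same way (my own illustration, not the
actual parse_oko_file_into_cmc_pats.gawk; same assumed input format as
the sketches above):

  # cmc_sketch.gawk -- test factored words against the CMC model.
  function cmc_class(e) {
    if (e ~ /^[aoy]$/)                 return "";   # "O" elements are ignored
    if (e == "q")                      return "Q";
    if (e ~ /^[dlrs]$/)                return "D";
    if (e ~ /^(ch|sh|ee)e?$/)          return "X";
    if (e ~ /^[ktpf]e?$/)              return "G";
    if (e ~ /^c[ktpf]he?$/)            return "H";
    if (e ~ /^i*[nm]$/ || e ~ /^i+r$/) return "N";
    return "?";  # should not happen for words made of valid elements
  }
  function cmc_ok(fact,   n, i, parts, s, t) {
    gsub(/^\{|\}$/, "", fact); n = split(fact, parts, /\}\{/);
    s = "";
    for (i = 1; i <= n; i++) { s = s cmc_class(parts[i]); }
    if (s !~ /^Q?D*X*[GH]?X*D*N?$/) return 0;       # wrong overall shape
    t = s; if (gsub(/[QDN]/, "", t) > 3) return 0;  # at most 3 of Q, D, N
    t = s; if (gsub(/[XH]/,  "", t) > 2) return 0;  # at most 2 benches (H counts as one)
    return 1;
  }
  { print $0, (cmc_ok($2) ? "CMC-OK" : "CMC-BAD"); }

For example, "chol" maps to the class string "XD" and passes, while
"cholkaiin" maps to "XDGN" and fails the shape test, as in the list of
rejects below.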
Here are the OKOKO-but-not-CMC words that occur more than twice in all
the parags text:

  7.000000 0.006010 GD!XD:pol!chedy
  5.500000 0.004720 XD!G:chol!ky
  4.500000 0.003860 XD!GN:chol!kaiin
  4.000000 0.003430 XXG!X:cheet!eey
  4.000000 0.003430 XD!X:chod!chy
  4.000000 0.003430 XD!GD:chol!kar
  4.000000 0.003430 XD!G!XD:chol!k!eedy
  4.000000 0.003430 DN!D:daira!l
  4.000000 0.003430 DN!N:dair!in
  3.000000 0.002570 N!D:aiina!l
  3.000000 0.002570 N!D:airo!dy
  3.000000 0.002570 N!D:airo!l
  3.000000 0.002570 XD!GN:cheol!kain
  3.000000 0.002570 XD!XD:chol!chedy
  3.000000 0.002570 DN!D:dairo!dy
  3.000000 0.002570 DDD!N:dalda!iin
  3.000000 0.002570 GX!H:pcho!cthy
  3.000000 0.002570 GD!X:pol!shy
  3.000000 0.002570 XXG!X:sheek!chy
  3.000000 0.002570 G!H:to!ckhy
  2.750000 0.002360 N!N:aira!m
  2.500000 0.002150 XD!X:chol!chey
  2.500000 0.002150 GX!GX:kcho!kchy
  2.500000 0.002150 GD!X:kor!chy
  2.500000 0.002150 GD!XD:okal!chedy
  2.500000 0.002150 GD!X:okal!chy
  2.500000 0.002150 GD!GN:opal!kaiin
  2.375000 0.002040 N!D:aira!l
  2.250000 0.001930 GD!XD:tol!chedy

Most of them seem to be two or more fairly common words run together.

To measure the CMC compliance among lexemes (as opposed to tokens), we
extracted all words with valid EVA chars and at least 3 occurrences, and
then verified how many of those passed through all levels and had a
valid CMC structure. The results are tabulated below.

LEXICON SIZES WITH VALID CMC STRUCTURE

  sec-type     vlex   vgud   vbad    %gud  least common
  ----------  -----  -----  -----  ------  --------------------
  bio-parags    300    300      0  100.00  3.0000 yteey
  cos-parags     57     57      0  100.00  3.0000 shodaiin
  hea-parags    415    409      6   98.55  3.0000 ytoldy
  heb-parags    208    206      2   99.04  3.0000 ypchdy
  pha-parags    152    151      1   99.34  3.0000 ykeor
  str-parags    554    548      6   98.92  3.0000 ykedy
  unk-parags    206    206      0  100.00  3.0000 ytody
  tot-parags   1376   1341     35   97.46  3.0000 ytodaiin

>>> REVISE <<<

II. TABULATING ELEMENT FREQUENCIES PER SUBSECTION

  set sectags = ( `cat text-subsecs/all.names` )
  echo $sectags

  foreach etag ( RAW EQV )
    tabulate-frequencies \
      -dir ${etag}/efreqs/subsecs \
      -title "elem" \
      tot ${sectags}
  end

Elements sorted by frequency (× 99), per subsection:

  tot      unk      pha      str      hea      heb      bio      ast      cos      zod
  -------  -------  -------  -------  -------  -------  -------  -------  -------  -------
  17 o 16 o 23 o 15 o 20 o 14 y 15 o 18 o 16 o 20 o 12 y 12 a 9 y 11 y 11 y 14 o 15 y 13 y 13 y 13 a 9 a 11 y 8 l 11 a 9 ch 11 d 10 d 9 a 11 a 8 y 8 d 8 d 6 a 8 d 7 a 10 a 8 l 7 d 9 l 8 l 7 l 7 l 6 d 6 l 7 d 6 k 7 a 5 ch 6 d 7 t 5 k 5 r 4 r 6 k 6 l 5 l 6 q 5 ee 4 ch 5 r 4 ch 5 k 4 k 4 q 5 r 4 ch 6 k 5 r 4 k 5 d 4 r 4 ch 4 ch 4 ee 4 k 4 r 3 ee 4 k 4 r 5 ee 4 q 3 t 3 q 4 r 3 t 3 iin 3 che 3 s 4 t 3 ch 3 ee 3 iin 3 ee 3 iin 3 iin 3 che 3 r 3 l 3 iin 3 k 3 iin 3 q 3 che 3 ch 2 sh 2 q 2 she 3 t 2 sh 2 te 3 t 2 che 3 iin 3 che 2 q 2 t 2 t 2 che 2 q 2 iin 3 che 2 sh 2 ke 3 t 2 s 2 ee 2 ch 2 iin 2 ee 1 s 1 sh 1 ee 1 s 1 she 1 che 1 ke 1 in 2 ke 2 che 1 che 1 she 1 she 1 ? 1 sh 1 cth 1 sh 1 ke 1 q 1 s 1 sh 1 ke 1 p 1 sh 1 ke 1 ee 1 she 1 iin 1 ? 1 she 1 she 1 s 1 s 1 she 1 in 0 she 1 s 1 sh 1 e? 0 ke 0 p 0 in 0 ke 1 t 0 p 0 p 0 te 1 s 1 she 0 e? 0 ? 0 p 0 te 0 ckh 0 s 0 ckh 0 p 0 te 1 te 0 p 0 in 0 te 0 ir 0 ckhe 0 te 0 in 0 ckh 0 ckh 1 sh 0 te 0 ke 0 cth 0 cth 0 te 0 ir 0 m 0 f 0 p 0 p 0 ? 0 eee 0 ckh 0 f 0 p 0 eee 0 ke 0 in 0 cth 0 ir 0 cth 0 e? 0 ir 0 in 0 e? 0 ckh 0 cph 0 ir 0 ckhe 0 cth 0 in 0 m 0 ? 0 m 0 iiir 0 cth 0 te 0 m 0 f 0 ckh 0 m 0 cth 0 eee 0 ckh 0 in 0 f 0 cthe 0 cth 0 eee 0 eee 0 ckh 0 cthe 0 f 0 cph 0 cth 0 e? 0 f 0 e? 0 e? 0 cthe 0 eee 0 iir 0 m 0 eee 0 cthe 0 ckhe 0 ? 0 cthe 0 ir 0 in 0 cthe 0 q 0 e? 0 e? 0 ir 0 ?
  0 ckhe 0 ? 0 cthe 0 iir 0 ir 0 ckhe 0 ? 0 f 0 m 0 e? 0 ckhe 0 iiin 0 i? 0 iir 0 cthe 0 cfh 0 m 0 iir 0 eee 0 eee 0 h? 0 il 0 cfh 0 cph 0 cthe 0 eee 0 i? 0 ir 0 iir 0 cph 0 j 0 ckhe 0 iir 0 cphe 0 j 0 cthe 0 n 0 iiin 0 cphe 0 ckhe 0 ij 0 iiin 0 ckhe 0 cphe 0 iiin 0 cfh 0 cphe 0 ? 0 cph 0 n 0 iir 0 cph 0 cph 0 cphe 0 cph 0 ck 0 f 0 i? 0 im 0 iiin 0 il 0 i? 0 cfhe 0 ikh 0 im 0 cphe 0 n 0 iir 0 n 0 iiin 0 i? 0 il 0 m 0 cfh 0 i? 0 i? 0 cphe 0 iir 0 im 0 m 0 iiir 0 x 0 cfhe 0 de 0 ct 0 cfh 0 n 0 de 0 iil 0 de 0 x 0 de 0 n 0 cfh 0 il 0 il 0 is 0 is 0 im 0 x 0 de 0 im 0 n 0 im 0 cfhe 0 de 0 i? 0 cfhe 0 id 0 cfhe 0 b 0 id 0 is 0 x 0 pe 0 cfh 0 ck 0 ith 0 is 0 iil 0 id 0 c? 0 j 0 iid 0 iil 0 iir 0 h? 0 id 0 cf 0 b 0 ck 0 iiid 0 g 0 cp 0 ct 0 iiil 0 h? 0 ct 0 iil 0 iim 0 iiil 0 g 0 id 0 iis 0 iiir 0 iil

I have compared these counts with those obtained by removing two, one,
or zero elements from each line end. The conclusion is that the ordering
of the first six entries in each column is quite stable; it is probably
not an artifact.

Some quick observations: there seem to be three "extremal" samples: hea
("ch" abundant), bio ("q" important), and zod ("t" important).

There are too many "e?" elements; I must check where they come from and
perhaps modify the set of elements to account for them.

[ It seems that many came from groups of the form "e[ktpf]e",
"e[ktpf]ee", which could be "c[ktpf]h" and "c[ktpf]he" without ligatures
(see the small remapping illustration at the end of this subsection).
Most of the rest come from Friedman's transcription; there are
practically none in the more careful transcriptions. ]

All valid elements that occur at least 10 times in the text:

  o y a q n in iin iiin r ir iir iiir d s is l il m im j de
  k t ke te p f cth ckh cthe ckhe cph cfh cphe cfhe ch che sh she ee eee x

Valid elements that occur less than 10 times in the whole text:

  iil ij pe ct ck id

Created a file "RAW/plots/vald/keys.dic" with all the valid elements.
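About the "e?" elements mentioned above: if those groups really are
platform gallows that lost their ligature, the fix-up would be a simple
remapping. A throwaway gawk illustration (mine only, not applied
anywhere yet; "words.txt" is a placeholder):

  # fixup_sketch.gawk -- illustration only: remap "e[ktpf]ee" to
  # "c[ktpf]he" and "e[ktpf]e" to "c[ktpf]h" (longer pattern first).
  { w = $1;
    w = gensub(/e([ktpf])ee/, "c\\1he", "g", w);
    w = gensub(/e([ktpf])e/,  "c\\1h",  "g", w);
    print $1, w;
  }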
Equiv-reduced elements sorted by frequency (× 99), per subsection:

  tot      unk      pha      str      hea      heb      bio      ast      cos      zod
  -------- -------- -------- -------- -------- -------- -------- -------- -------- --------
  38 o~ 40 o~ 40 o~ 38 o~ 39 o~ 40 o~ 37 o~ 40 o~ 41 o~ 42 o~ 10 t~ 10 t~ 8 l~ 11 t~ 11 ch~ 12 d~ 10 d~ 10 ch~ 9 t~ 11 t~ 8 d~ 8 d~ 7 ch~ 8 d~ 9 t~ 10 t~ 9 t~ 8 t~ 9 l~ 8 ch~ 8 ch~ 7 l~ 7 d~ 8 ch~ 7 d~ 6 ch~ 8 l~ 7 d~ 7 ch~ 8 l~ 7 l~ 6 ch~ 6 t~ 6 l~ 6 l~ 5 l~ 6 q~ 5 r~ 6 d~ 5 d~ 4 r~ 5 r~ 4 r~ 4 q~ 5 r~ 4 r~ 5 ch~ 3 s~ 4 r~ 5 r~ 4 q~ 3 in~ 3 q~ 4 in~ 4 in~ 3 in~ 3 che~ 3 l~ 3 in~ 3 te~ 4 in~ 3 q~ 3 in~ 4 r~ 2 sh~ 3 che~ 3 in~ 3 te~ 2 sh~ 3 in~ 3 che~ 2 che~ 3 che~ 4 che~ 2 cth~ 2 q~ 3 r~ 3 che~ 2 q~ 2 che~ 1 te~ 2 sh~ 2 te~ 1 te~ 2 q~ 2 te~ 2 she~ 2 in~ 2 che~ 1 s~ 1 sh~ 1 te~ 1 s~ 1 she~ 2 s~ 1 sh~ 2 te~ 1 q~ 1 te~ 1 she~ 1 she~ 1 she~ 1 ?~ 1 sh~ 1 che~ 1 she~ 1 sh~ 1 ?~ 1 s~ 1 sh~ 1 cth~ 1 cth~ 1 cth~ 0 s~ 0 te~ 1 s~ 1 cth~ 1 e?~ 1 she~ 0 ?~ 1 s~ 1 s~ 1 sh~ 0 cth~ 0 she~ 1 cth~ 1 s~ 1 she~ 0 cth~ 0 e?~ 0 ir~ 0 ir~ 1 she~ 0 ir~ 0 cthe~ 0 ir~ 0 cthe~ 1 cth~ 0 e?~ 0 cthe~ 0 cthe~ 0 cthe~ 1 cthe~ 0 cthe~ 0 ir~ 0 cthe~ 0 e?~ 1 sh~ 0 ?~ 0 cth~ 0 ?~ 0 e?~ 0 ir~ 0 e?~ 0 ?~ 0 e?~ 0 ir~ 0 ir~ 0 ir~ 0 ir~ 0 e?~ 0 ?~ 0 e?~ 0 ?~ 0 e?~ 0 ?~ 0 h?~ 0 cthe~ 0 cthe~ 0 q~ 0 n~ 0 id~ 0 i?~ 0 i?~ 0 n~ 0 id~ 0 ith~ 0 i?~ 0 id~ 0 i?~ 0 n~ 0 de~ 0 il~ 0 i?~ 0 i?~ 0 ct~ 0 il~ 0 il~ 0 i?~ 0 is~ 0 n~ 0 ct~ 0 n~ 0 ?~ 0 id~ 0 id~ 0 il~ 0 n~ 0 de~ 0 id~ 0 x~ 0 il~ 0 de~ 0 x~ 0 id~ 0 id~ 0 de~ 0 de~ 0 n~ 0 ct~ 0 x~ 0 il~ 0 de~ 0 is~ 0 is~ 0 b~ 0 i?~ 0 x~ 0 is~ 0 is~ 0 h?~ 0 h?~ 0 c?~ 0 ith~ 0 b~ 0 b~ 0 c?~

There are 23 valid elements with frequency > 20 under the equivalence:

  o t te cth cthe ch che sh she d de id l r q s m n in ir im il

Valid elements with frequency below 20:

  ct is g b x

Created a file "EQV/plots/vald/keys.dic" with all the valid elements,
collapsed by the above equivalence.

IV. "ED"'S STORY

Rene observed that the EVA digraph "ed" is a marker for the A/B language
split. He produced some plots where the horizontal axis is the page
number, with subsections distinguished by colors.

Let's count the word frequencies per page:

  zcat ../037/vms-17-ok.soc.gz \
    | tr '/' '-' \
    | gawk \
        ' \
          (($2 ~ /[A]/) && ($6 \!~ /[-=., ]/)){ \
            gsub(/[.].*$/,"",$1); print $9, substr($10,2), $1, $6; \
          } \
        ' \
    | sort | uniq -c | expand \
    | sort -b +1 -2 +2 -3 +0 -1nr \
    > .all.fpw

  cat .all.fpw \
    | list-page-champs -v maxChamps=4 \
    > .all.chpw

Let's count the total word occurrences per page:

  cat .all.fpw \
    | gawk \
        ' /./{ k = ($2 " " $3 " " $4); ct[k] += $1; } \
          END { for(w in ct) { print ct[w], w; } } \
        ' \
    | sort -b +2 -3 +0 -1nr \
    > .all.tpw

Let's now count the "ed"-containing words per page:

  cat .all.fpw \
    | gawk \
        ' ($5 ~ /ed/){ print; } ' \
    > .ed.fpw

  cat .ed.fpw \
    | list-page-champs -v maxChamps=6 \
    > .ed.chpw

  cat .ed.fpw \
    | gawk '//{ print $1,$2,$3,$4; }' \
    | combine-counts \
    | sort -b +2 -3 \
    > .ed.tpw

Let's plot the ratio of "ed"-words to total words per page:

  plot-freqs .ed.tpw .all.tpw

The plots of the "ed"-ratio R show that "hea" and "pha" are virtually
"ed"-free (R < 0.03, below the error level); "cos-1" (the part before
the "zod") and "zod" begin with slightly higher ratios than "hea"/"pha"
(R ~ 0.04); R then increases sharply along "zod", from R ~ 0.03 to
R ~ 0.11, and jumps to R ~ 0.20 in "cos-2" (the part after "zod").
"heb-2" (after "zod") has R ~ 0.17, just below that of "cos-2". "heb-1"
(before "zod") has widely variable R, with mean R ~ 0.20.
"str" has R ~ 0.20 like "heb", but more uniform (except for the two pages before "zod", which have R ~ 0.02). "bio" has R ~ 0.20 in the middle, R ~ 0.32 at both ends So, based only on these plots, the writing sequence would be hea + pha (no obvious order) cos-1 + zod heb-2 str + heb-1 + cos-2 bio V. ABOUT "ED" AND LADY "DY" It seems that most of the words in language B are actually words that end with . In fact there seems to be a very small number of words involved. Let's plot the per-page frequencies of the ending: cat .all.fpw \ | gawk ' ($5 ~ /dy$/){ print $1,$2,$3,$4; } ' \ | combine-counts \ | sort -b +2 -3 \ > .dy.tpw plot-freqs .dy.tpw .all.tpw This plot shows the same trends as the frequency, except that the data for language-A is noisier and the distinction between languages A and B is less marked (because the counts for language A are no longer zero). Here "cos-1" and "zod" are practically equal. Curiously, pharma has a slightly higher R than herbal-A; and R actually decreases as we go down herbal-A. This decrease is strange since the trend in the zodiac pages establishes that increases frm older to newer, hence language A should be earlier than language B. Let's try again with the ending proper: cat .all.fpw \ | gawk ' ($5 ~ /edy$/){ print $1,$2,$3,$4; } ' \ | combine-counts \ | sort -b +2 -3 \ > .edy.tpw plot-freqs .edy.tpw .all.tpw These plots are like the plots but cleaner. The "bio" subsection has R ~ 0.25 with a dip in the middle. Subsections "str-2" and "heb-1" have almost the same R ~ 0.15. Subsections "cos-2" and "heb-2" have R ~ 0.10. Subsections "cos-1" and "zod" have R ~ 0.03 (barely significant) and the trend in "zod" is not so clear. Finally subsections "hea-1", "hea-2", "pha", and "str-1" have hardly any "edy". VI. THE "EDY" WORDS Let's compute the overall frequency of each word per subsection, removing the prefix and mapping [ktpf] to : cat .all.fpw \ | gawk \ ' /./{ \ gsub(/^q/,"",$5); gsub(/[ktpf]/,"k",$5); \ print $1,$2,"000","000",$5 \ } \ ' \ | combine-counts \ | sort -b +1 -2 +0 -1nr \ > .all.ftw cat .all.ftw \ | gawk '/./{print $1,$2,$3,$4 } ' \ | combine-counts \ | sort -b +2 -3 \ > .all.ttw Now let's look at the words specifically: cat .all.ftw \ | gawk '($5 ~ /edy$/){ print; }' \ > .edy.ftw cat .edy.ftw \ | list-page-champs -v maxChamps=6 \ > .edy.chtw Here are the six most common words in each subsection, manually sorted: sec totwd champions --- ----- ---------------------------------------------------------------------- str 10783 okedy(180) okeedy(271) chedy(193) shedy(119) lchedy(56) okchedy(131) bio 6716 okedy(310) okeedy(252) chedy(218) shedy(252) lchedy(59) okchedy(44) heb 3337 okedy(101) okeedy(31) chedy(67) shedy(36) kedy(26) ykedy(25) cos 2590 okeedy(11) chedy(12) okedy(23) shedy(11) okchedy(10) kchedy(5) zod 997 okeedy(5) chedy(4) okedy(3) shedy(5) okshedy(2) eeedy(2) hea 7553 chedy(1) ykchedy(1) okeedy(2) esedy(1) pha 2401 chedy(1) ockhedy(1) cheedy(2) cholkeedy(1) ckhedy(1) okedy(1) unk 1847 okedy(21) chedy(19) okchedy(14) shedy(14) okeedy(7) olkeedy(7) --- ----- ---------------------------------------------------------------------- As it can be seen, the words (there are many of them!) are characteristic of language B ("bio", "heb", "str"), and also a bit of "cos" and "zod". The frequency of "okedy" (and its k/t/q variants) is 1: 22 "bio" 1: 33 "heb" 1: 55 "str" 1: 220 "cos" 1: 200 "zod" and practically nil in "hea", "pha". 
Let's look at the words that DON'T end in "edy":

  cat .all.ftw \
    | gawk ' ($5 \!~ /edy$/){ print; } ' \
    > .not-edy.ftw

  cat .not-edy.ftw \
    | list-page-champs -v maxChamps=6 \
    > .not-edy.chtw

These are the six most common non-"edy" words in each subsection, also
manually sorted:

  sec  totwd  champions
  ---  -----  ----------------------------------------------------------------------
  str  10783  okaiin(350) okal(198) okeey(341) aiin(199) okain(173) okar(184)
  bio   6716  okaiin(145) okal(185) okeey(128) okain(240) ol(363) oky(124)
  heb   3337  okaiin(67) okal(56) aiin(68) okar(92) daiin(79) or(68)
  cos   2590  aiin(44) ar(57) okeey(45) or(43) dar(40) daiin(39)
  zod    997  aiin(29) ar(28) okeey(24) al(30) okaiin(21) okal(17)
  hea   7553  daiin(393) chol(215) chor(144) okchy(142) ckhy(138) oky(131)
  pha   2401  daiin(105) chol(47) okeol(62) okol(52) okeey(51) ol(41)
  unk   1847  okar(58) daiin(42) okaiin(40) okal(32) aiin(31) or(31)
  ---  -----  ----------------------------------------------------------------------

Note that "daiin" is the most common word in herbal-A and pharma, but it
shows up also in the other subsections, at 1/2 to 1/4 the frequency:

  "hea" 1: 18
  "pha" 1: 24
  "heb" 1: 40
  "str" 1: 75
  "bio" 1: 80
  "cos" 1: 60
  "zod" 1: 80

So perhaps "daiin" is a function word that got less and less used as the
author's vocabulary expanded.

The most popular non-"edy" words in language B are

  okaiin okal okeey aiin okain okar

They are fairly uniform across subsections, except perhaps "okar", which
is more concentrated in herbal-B. It is hard to get any conclusion from
these lists (other than "it strongly suggests Chinese" 8-).

Let's try with the words "chedy"/"shedy":

  cat .all.fpw \
    | gawk \
        ' ($5 ~ /^[cs]hedy$/){ \
            print $1,$2,$3,$4 \
          } \
        ' \
    | combine-counts \
    | sort -b +2 -3 \
    > .chedy.tpw

  plot-freqs .chedy.tpw .all.tpw

Predictably, the R values are smaller overall, and only those of
"str-2", "heb-1", and "bio" are significantly greater than 0. The "bio"
pages still show the dip in the middle.

Let's try "dain"/"daiin", which should show the reverse trend:

  cat .all.fpw \
    | gawk \
        ' ($5 ~ /^d[ao]i+n$/){ \
            print $1,$2,$3,$4 \
          } \
        ' \
    | combine-counts \
    | sort -b +2 -3 \
    > .dain.tpw

  plot-freqs .dain.tpw .all.tpw

Predictably again, these plots show the opposite trends. In "hea" R is
large and decreasing ("hea-1" has R ~ 0.07, "hea-2" has R ~ 0.04). Next
is "pha" at R ~ 0.04, then "heb-2" and "heb-1" at R ~ 0.03, then "str",
"cos", "zod", and "bio" all at R ~ 0.02. The "unk" pages f1r and f49v
have R ~ 0.08, which is right in the middle of the herbal-A range; the
others have lower R, which is consistent with language B material.

Let's compare the frequencies of "Ke" elements relative to the total of
non-[aoy] elements (mostly "K" and "Ke").
  cat .all.fpw \
    | ../017/factor-field-OK \
        -v inField=5 \
        -v outField=6 \
    | gawk \
        ' /./{ \
            f = $6; \
            gsub(/^[^{}]*/,"",f); gsub(/[^{}]*$/,"",f); \
            gsub(/}[^{}]*{/,"} {",f); n = split(f, ff); \
            for(i=1;i<=n;i++){ print $1,$2,$3,$4,ff[i]; } \
          } \
        ' \
    | grep -v '{_}' \
    | combine-counts \
    | sort -b +1 -2 +2 -3 +0 -1nr \
    > .all.fpe

  dicio-wc .all.fpw .all.fpe

      lines    words      bytes  file
    -------  -------  ---------  ------------
      24921   124605     688935  .all.fpw
       5632    28160     147899  .all.fpe

And let's compute the total non-[aoy] elements per page:

  cat .all.fpe \
    | gawk \
        ' /./{ k = ($2 " " $3 " " $4); ct[k] += $1; } \
          END { for(w in ct) { print ct[w], w; } } \
        ' \
    | sort -b +2 -3 +0 -1nr \
    > .all.tpe

  dicio-wc .all.tpw .all.tpe

      lines    words      bytes  file
    -------  -------  ---------  ------------
        227      908       3798  .all.tpw
        227      908       3924  .all.tpe

Let's now count the "Ke" elements only:

  cat .all.fpe \
    | gawk \
        ' ($5 ~ /{([ice][ktpf]?[he]|[ktpf])e}/){ print; } ' \
    > .Ke.fpe

  cat .Ke.fpe \
    | gawk '//{print $1,$2,$3,$4; }' \
    | combine-counts \
    | sort -b +2 -3 \
    > .Ke.tpe

  plot-freqs .Ke.tpe .all.tpe

Strangely, the plots show little change from language A to language B,
less than the variation within the same subsection. The ratio for
"hea-1" is lowest (R ~ 0.03) and reaches its minimum around page p025;
curiously, it seems to oscillate with a period of 1-2 pages. The ratios
for all other subsections are about the same, around 0.10. The "zod"
pages again show a sharply increasing trend, except for the first couple
of pages.

Observations:

  If languages A and B are indeed different languages, it is hard to
  explain why some letter-group statistics are so uniform, and why some
  are so variable.

  If languages A and B are different spellings of the same language,
  then the spelling change must not have affected the use of "Ke"
  elements relative to the "K" elements.

  If the difference between languages A and B is merely due to
  vocabulary (including tense/person/etc.), then the difference again
  must not favor "Ke" words over "K" words.

Let's try the gallows elements only:

  cat .all.fpe \
    | gawk \
        ' ($5 ~ /{.*[ktpf].*}/){ print; } ' \
    > .ktpf.fpe

  cat .ktpf.fpe \
    | gawk '//{print $1,$2,$3,$4; }' \
    | combine-counts \
    | sort -b +2 -3 \
    > .ktpf.tpe

  plot-freqs .ktpf.tpe .all.tpe

These plots are even more uniform than the previous ones. The ratio of
gallows elements to non-gallows elements is amazingly constant
(R ~ 0.22) for all subsections and languages. One cannot even see the
"zod" trend.

Let's look at the "skeletons" of the words, obtained by deleting the
[aoy] inserts and the [i] and [e] modifiers:

  cat .all.fpw \
    | ../017/factor-field-OK \
        -v inField=5 \
        -v outField=6 \
    | gawk \
        ' /./{ \
            f = $6; \
            gsub(/^[^{}]*/,"",f); gsub(/[^{}]*$/,"",f); \
            if (match(f,/{([ice][ktpf]?[eh]|[ktpf])e}/)) \
              { gsub(/e}/,"}",f); } \
            gsub(/{i+/,"{",f); gsub(/{_}/,"",f); \
            gsub(/}[^{}]*{/,"",f); \
            print $1,$2,$3,$4,f; \
          } \
        ' \
    | combine-counts \
    | sort -b +1 -2 +2 -3 +0 -1nr \
    > .all.fps

And let's compute the total skeleton occurrences per page:

  cat .all.fps \
    | gawk \
        ' /./{ k = ($2 " " $3 " " $4); ct[k] += $1; } \
          END { for(w in ct) { print ct[w], w; } } \
        ' \
    | sort -b +2 -3 +0 -1nr \
    > .all.tps

  dicio-wc .all.tpw .all.tps
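For the record, the R values quoted in sections IV-VI are just the
per-page ratios of the two ".tp*" files given to plot-freqs (which is a
separate script that also produces the actual plots). A minimal gawk
sketch of the ratio computation alone, assuming only that each ".tp*"
line is a count followed by a page key, as produced by the pipelines
above:

  gawk \
    ' FNR == NR { \
        key = $2; for (i = 3; i <= NF; i++) { key = key " " $i; } \
        num[key] = $1; next; \
      } \
      { key = $2; for (i = 3; i <= NF; i++) { key = key " " $i; } \
        if ($1 > 0) { printf "%8.4f  %s\n", num[key]/$1, key; } \
      } \
    ' .ed.tpw .all.tpw

This prints one "R page-key" line per page; pages absent from the first
file come out as R = 0.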