Note 023

Hacking at the Voynich manuscript - Side notes
023 Characterizing Voynichese sub-languages by QOKOKOKO element freqs

Last edited on 1998-07-04 10:25:19 by stolfi

1998-06-20 stolfi
=================

[ First version done on 1998-05-04, now redone with fresher data. ]

In Note 021, I tried to classify pages according to the
frequencies of certain keywords.  John Grove pointed out that the
transcription which I used (Friedman's) has inconsistencies which may
masquerade as language differences, e.g. "dain" in place of "daiin" or
vice-versa.  Also, it seems that spacing (word division) is quite
inconsistent.

So, in an attempt to avoid those problems, I thought of using, instead
of words, the "elements" of the QOKOKOKO paradigm. See Notes 017 and 018.

Since I am still not clear on how to group the O's with the K's (with
the following K, with the preceding K, with both, or with neither), I
will leave them as separate elements. Also, for simplicity (without
any conviction at all) I will split every double-letter O into two
elements.
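The double-O rule can be sketched as a stream edit (assuming elements
are written as space-separated tokens; the real factoring is done by
the factor-OK filter below, whose internals may differ):

```shell
# Split every doubled "o" into two separate "o" elements
# (separator assumed to be a space; a sketch, not the real tool).
echo 'oo ko oo' | sed 's/oo/o o/g'
# o o ko o o
```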

Also, given Grove's observations on anomalous "p" and "t" 
distributions at beginning-of-line, and the well-known attraction
of certain elements for end-of-line, it seems advisable to discard
the first few and last few elements of every line.
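One way to implement that trimming, and a guess at what the
trim-three-from-ends filter used below actually does (the element
format is an assumption; lines with six or fewer elements vanish):

```shell
# Drop the first three and last three elements of each line,
# assuming one manuscript line per text line with elements
# separated by spaces (hypothetical sketch of trim-three-from-ends).
echo 'a b c d e f g h' \
  | awk '{ for (i = 4; i <= NF - 3; i++)
             printf "%s%s", $i, (i < NF - 3 ? " " : "\n") }'
# d e
```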

I. EXTRACTING AND COUNTING ELEMENTS

We will prepare two sets of statistics, one using raw words ("-r")
and one using word equivalence classes ("-c").

  elem-to-class -describe

    element equivalence:
      map_ee_to_ch
      ignore_gallows_eyes
      join_ei
      equate_aoy
      collapse_ii
      equate_eights
      equate_pt
      erase_word_spaces
      append_tilde

This mapping will hopefully reduce transcription and sampling noise.
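Two of the rules above are simple enough to sketch as stream edits:
equate_pt folding "p" into "t", and append_tilde tagging each reduced
element with a trailing "~" (visible in the EQV tables below). This is
only a guess at their effect; the real elem-to-class does much more:

```shell
# equate_pt:    treat "p" as "t"
# append_tilde: mark each reduced element with a trailing "~"
# (one element per line assumed)
printf 'p\nte\ncth\n' | sed -e 's/p/t/g' -e 's/$/~/'
# t~
# te~
# cth~
```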

Factoring the text into elements:
  
  mkdir -p RAW EQV
  
  /bin/rm -rf {RAW,EQV}/efreqs
  mkdir -p {RAW,EQV}/efreqs
  foreach utype ( pages sections )
    foreach f ( `cat text-${utype}/all.names` )
      cat text-${utype}/${f}.evt \
        | lines-from-evt | egrep '.' \
        | factor-OK | egrep '.' \
        > /tmp/${utype}-${f}.els
    end
  end
  
Counting elements and computing relative frequencies:
  
  foreach ep ( cat.RAW elem-to-class.EQV )
    set etag = ${ep:e}; set ecmd = ${ep:r}
    /bin/rm -rf ${etag}/efreqs
    mkdir -p ${etag}/efreqs
    foreach utype ( pages sections )
      set frdir = "${etag}/efreqs/${utype}"
      mkdir -p ${frdir}
      cp -p text-${utype}/all.names ${frdir}/
      foreach f ( `cat text-${utype}/all.names` )
        echo ${frdir}/$f.frq
        cat /tmp/${utype}-${f}.els \
          | trim-three-from-ends \
          | tr '{}' '\012\012' \
          | ${ecmd} | egrep '.' \
          | sort | uniq -c | expand \
          | sort -b +0 -1nr \
          | compute-freqs \
          > ${frdir}/${f}.frq
      end
    end
  end
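The compute-freqs step presumably turns the "count element" pairs
coming out of "uniq -c" into relative frequencies. A sketch of that
computation, with the output format assumed to be "element freq count"
(consistent with the "print $1, $3" extraction used further down):

```shell
# Read "count elem" pairs, emit "elem freq count" with
# freq = count / total (output format is an assumption).
printf '6 o\n3 y\n1 d\n' \
  | awk '{ cnt[NR] = $1; elem[NR] = $2; tot += $1 }
         END { for (i = 1; i <= NR; i++)
                 printf "%s %.4f %d\n", elem[i], cnt[i]/tot, cnt[i] }'
# o 0.6000 6
# y 0.3000 3
# d 0.1000 1
```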
  
Computing total frequencies:

  foreach ep ( cat.RAW elem-to-class.EQV )
    set etag = ${ep:e}; set ecmd = ${ep:r}
    foreach utype ( pages sections )
      set fmt = "${etag}/efreqs/${utype}/%s.frq"
      set frfiles = ( \
        `cat text-${utype}/all.names | gawk '/./{printf "'"${fmt}"'\n",$0;}'` \
      )
      echo ${frfiles}
      cat ${frfiles} \
        | gawk '/./{print $1, $3;}' \
        | combine-counts \
        | sort -b +0 -1nr \
        | compute-freqs \
        > ${etag}/efreqs/${utype}/tot.frq
    end
  end  
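The combine-counts filter presumably merges the per-file "key count"
pairs by summing counts key by key; a minimal sketch under that
assumption (option handling omitted):

```shell
# Sum counts per key across all input pairs
# (a guess at what combine-counts does).
printf 'o 6\ny 3\no 2\n' \
  | awk '{ n[$1] += $2 } END { for (k in n) print k, n[k] }' \
  | sort
# o 8
# y 3
```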
  
II. TABULATING ELEMENT FREQUENCIES PER SECTION

  set sectags = ( `cat text-sections/all.names` )
  echo $sectags

  foreach etag ( RAW EQV )
    tabulate-frequencies \
      -dir ${etag}/efreqs/sections \
      -title "elem" \
      tot ${sectags}
  end

Elements sorted by frequency (× 99), per section:

    tot     unk     pha     str     hea     heb     bio     ast     cos     zod    
    ------- ------- ------- ------- ------- ------- ------- ------- ------- -------
    17 o    16 o    23 o    15 o    20 o    14 y    15 o    18 o    16 o    20 o   
    12 y    12 a     9 y    11 y    11 y    14 o    15 y    13 y    13 y    13 a   
     9 a    11 y     8 l    11 a     9 ch   11 d    10 d     9 a    11 a     8 y   
     8 d     8 d     6 a     8 d     7 a    10 a     8 l     7 d     9 l     8 l   
     7 l     7 l     6 d     6 l     7 d     6 k     7 a     5 ch    6 d     7 t   
     5 k     5 r     4 r     6 k     6 l     5 l     6 q     5 ee    4 ch    5 r   
     4 ch    5 k     4 k     4 q     5 r     4 ch    6 k     5 r     4 k     5 d   
     4 r     4 ch    4 ch    4 ee    4 k     4 r     3 ee    4 k     4 r     5 ee  
     4 q     3 t     3 q     4 r     3 t     3 iin   3 che   3 s     4 t     3 ch  
     3 ee    3 iin   3 ee    3 iin   3 iin   3 che   3 r     3 l     3 iin   3 k   
     3 iin   3 q     3 che   3 ch    2 sh    2 q     2 she   3 t     2 sh    2 te  
     3 t     2 che   3 iin   3 che   2 q     2 t     2 t     2 che   2 q     2 iin 
     3 che   2 sh    2 ke    3 t     2 s     2 ee    2 ch    2 iin   2 ee    1 s   
     1 sh    1 ee    1 s     1 she   1 che   1 ke    1 in    2 ke    2 che   1 che 
     1 she   1 she   1 ?     1 sh    1 cth   1 sh    1 ke    1 q     1 s     1 sh  
     1 ke    1 p     1 sh    1 ke    1 ee    1 she   1 iin   1 ?     1 she   1 she 
     1 s     1 s     1 she   1 in    0 she   1 s     1 sh    1 e?    0 ke    0 p   
     0 in    0 ke    1 t     0 p     0 p     0 te    1 s     1 she   0 e?    0 ?   
     0 p     0 te    0 ckh   0 s     0 ckh   0 p     0 te    1 te    0 p     0 in  
     0 te    0 ir    0 ckhe  0 te    0 in    0 ckh   0 ckh   1 sh    0 te    0 ke  
     0 cth   0 cth   0 te    0 ir    0 m     0 f     0 p     0 p     0 ?     0 eee 
     0 ckh   0 f     0 p     0 eee   0 ke    0 in    0 cth   0 ir    0 cth   0 e?  
     0 ir    0 in    0 e?    0 ckh   0 cph   0 ir    0 ckhe  0 cth   0 in    0 m   
     0 ?     0 m     0 iiir  0 cth   0 te    0 m     0 f     0 ckh   0 m     0 cth 
     0 eee   0 ckh   0 in    0 f     0 cthe  0 cth   0 eee   0 eee   0 ckh   0 cthe
     0 f     0 cph   0 cth   0 e?    0 f     0 e?    0 e?    0 cthe  0 eee   0 iir 
     0 m     0 eee   0 cthe  0 ckhe  0 ?     0 cthe  0 ir    0 in    0 cthe  0 q   
     0 e?    0 e?    0 ir    0 ?     0 ckhe  0 ?     0 cthe  0 iir   0 ir          
     0 ckhe  0 ?     0 f     0 m     0 e?    0 ckhe  0 iiin  0 i?    0 iir         
     0 cthe  0 cfh   0 m     0 iir   0 eee   0 eee   0 h?    0 il    0 cfh         
     0 cph   0 cthe  0 eee   0 i?    0 ir    0 iir   0 cph   0 j     0 ckhe        
     0 iir   0 cphe  0 j     0 cthe  0 n     0 iiin  0 cphe  0 ckhe  0 ij          
     0 iiin  0 ckhe  0 cphe  0 iiin  0 cfh   0 cphe  0 ?     0 cph                 
     0 n     0 iir   0 cph   0 cph   0 cphe  0 cph   0 ck    0 f                   
     0 i?    0 im    0 iiin  0 il    0 i?    0 cfhe  0 ikh   0 im                  
     0 cphe  0 n     0 iir   0 n     0 iiin  0 i?    0 il    0 m                   
     0 cfh   0 i?    0 i?    0 cphe  0 iir   0 im    0 m                           
     0 iiir  0 x     0 cfhe  0 de    0 ct    0 cfh   0 n                           
     0 de    0 iil   0 de    0 x     0 de    0 n     0 cfh                         
     0 il    0 il    0 is    0 is    0 im    0 x     0 de                          
     0 im            0 n     0 im    0 cfhe  0 de    0 i?                          
     0 cfhe          0 id    0 cfhe  0 b     0 id    0 is                          
     0 x             0 pe    0 cfh   0 ck            0 ith                         
     0 is                    0 iil   0 id            0 c?                          
     0 j                     0 iid   0 iil           0 iir                         
     0 h?                    0 id    0 cf            0 b                           
     0 ck                    0 iiid  0 g             0 cp                          
     0 ct                    0 iiil  0 h?            0 ct                          
     0 iil                   0 iim   0 iiil          0 g                           
     0 id                    0 iis   0 iiir          0 iil                         

I have compared these counts with those obtained by removing two, one, or zero
elements from each line end.  The conclusion is that the ordering of the first
six entries in each column is quite stable; it is probably not an artifact.

Some quick observations: there seem to be three "extremal" samples:
hea ("ch" abundant), bio ("q" important), and zod ("t" important).

There are too many "e?" elements; I must check where they come from
and perhaps modify the set of elements to account for them.

  [ It seems that many came from groups of the form "e[ktpf]e",
    "e[ktpf]ee", which could be "c[ktpf]h" and "c[ktpf]he" without
    ligatures.  Most of the remaining come from Friedman's
    transcription; there are practically none in the more careful
    transcriptions. ]

All valid elements that occur at least 10 times in the text:

  o y a 
  q 
  n in iin iiin
  r ir iir iiir
  d
  s is
  l il
  m im
  j
  de
  k t ke te
  p f
  cth ckh cthe ckhe
  cph cfh cphe cfhe
  ch che
  sh she
  ee eee 
  x
  
Valid elements that occur less than 10 times in the whole text:
  
  iil
  ij
  pe
  ct ck
  id

Created a file "RAW/plots/vald/keys.dic" with all the valid elements.


Equiv-reduced elements sorted by frequency (× 99), per section:

  tot      unk      pha      str      hea      heb      bio      ast      cos      zod     
  -------- -------- -------- -------- -------- -------- -------- -------- -------- --------
  38 o~    40 o~    40 o~    38 o~    39 o~    40 o~    37 o~    40 o~    41 o~    42 o~   
  10 t~    10 t~     8 l~    11 t~    11 ch~   12 d~    10 d~    10 ch~    9 t~    11 t~   
   8 d~     8 d~     7 ch~    8 d~     9 t~    10 t~     9 t~     8 t~     9 l~     8 ch~  
   8 ch~    7 l~     7 d~     8 ch~    7 d~     6 ch~    8 l~     7 d~     7 ch~    8 l~   
   7 l~     6 ch~    6 t~     6 l~     6 l~     5 l~     6 q~     5 r~     6 d~     5 d~   
   4 r~     5 r~     4 r~     4 q~     5 r~     4 r~     5 ch~    3 s~     4 r~     5 r~   
   4 q~     3 in~    3 q~     4 in~    4 in~    3 in~    3 che~   3 l~     3 in~    3 te~  
   4 in~    3 q~     3 in~    4 r~     2 sh~    3 che~   3 in~    3 te~    2 sh~    3 in~  
   3 che~   2 che~   3 che~   4 che~   2 cth~   2 q~     3 r~     3 che~   2 q~     2 che~ 
   1 te~    2 sh~    2 te~    1 te~    2 q~     2 te~    2 she~   2 in~    2 che~   1 s~   
   1 sh~    1 te~    1 s~     1 she~   2 s~     1 sh~    2 te~    1 q~     1 te~    1 she~ 
   1 she~   1 she~   1 ?~     1 sh~    1 che~   1 she~   1 sh~    1 ?~     1 s~     1 sh~  
   1 cth~   1 cth~   1 cth~   0 s~     0 te~    1 s~     1 cth~   1 e?~    1 she~   0 ?~   
   1 s~     1 s~     1 sh~    0 cth~   0 she~   1 cth~   1 s~     1 she~   0 cth~   0 e?~  
   0 ir~    0 ir~    1 she~   0 ir~    0 cthe~  0 ir~    0 cthe~  1 cth~   0 e?~    0 cthe~
   0 cthe~  0 cthe~  1 cthe~  0 cthe~  0 ir~    0 cthe~  0 e?~    1 sh~    0 ?~     0 cth~ 
   0 ?~     0 e?~    0 ir~    0 e?~    0 ?~     0 e?~    0 ir~    0 ir~    0 ir~    0 ir~  
   0 e?~    0 ?~     0 e?~    0 ?~     0 e?~    0 ?~     0 h?~    0 cthe~  0 cthe~  0 q~   
   0 n~     0 id~    0 i?~    0 i?~    0 n~     0 id~    0 ith~   0 i?~    0 id~           
   0 i?~    0 n~     0 de~    0 il~    0 i?~    0 i?~    0 ct~    0 il~                    
   0 il~    0 i?~    0 is~    0 n~     0 ct~    0 n~     0 ?~     0 id~                    
   0 id~    0 il~    0 n~     0 de~    0 id~    0 x~     0 il~                             
   0 de~    0 x~     0 id~    0 id~    0 de~    0 de~    0 n~                              
   0 ct~                      0 x~     0 il~             0 de~                             
   0 is~                      0 is~    0 b~              0 i?~                             
   0 x~                                0 is~             0 is~                             
   0 h?~                               0 h?~             0 c?~                             
   0 ith~                                                0 b~                              
   0 b~                                                                                    
   0 c?~                                                                                   

There are 23 valid elements with frequency > 20 under the equivalence:

  o 
  t   te 
  cth cthe 
  ch  che 
  sh  she 
  d   de
  id
  l r q s m n
  in ir im il
  
Valid elements with frequency below 20:

  ct is g b x

Created a file "EQV/plots/vald/keys.dic" with all the valid elements, collapsed by the
above equivalence.

III. PAGE SCATTER-PLOTS

See Notes/021 for an explanation of these plots.

Let's now compute the frequencies of these elements in each page and section:

  foreach dic ( vald )
    foreach etag ( RAW EQV )
      foreach utype ( pages sections )
        set frdir = "${etag}/efreqs/${utype}"
        set ptdir = "${etag}/plots/${dic}/${utype}"
        echo "${frdir}" "${ptdir}"
        /bin/rm -rf ${ptdir}
        mkdir -p ${ptdir}
        cp -p ${frdir}/all.names ${ptdir}
        foreach fnum ( tot `cat ${frdir}/all.names` )
          printf "%30s/%-7s " "${ptdir}" "${fnum}:"
          cat ${frdir}/${fnum}.frq \
            | gawk '/./{print $1, $3;}' \
            | est-dic-probs -v dic=${etag}/plots/${dic}/keys.dic \
            > ${ptdir}/${fnum}.pos
        end
      end
    end
  end
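A rough guess at what est-dic-probs computes: restrict the "element
count" pairs to the keys listed in the dictionary file and renormalize
the surviving counts into probabilities. The tool's name and its
"-v dic=" option appear above; everything else here is assumption:

```shell
# Keep only elements present in the dictionary file and turn
# their counts into probabilities (a hypothetical sketch).
printf 'o\ny\n' > /tmp/keys.dic
printf 'o 6\ny 3\nxx 1\n' \
  | awk 'NR == FNR { dic[$1] = 1; next }
         dic[$1]   { n[$1] = $2; tot += $2 }
         END { for (k in n) printf "%s %.4f\n", k, n[k]/tot }' \
        /tmp/keys.dic - \
  | sort
# o 0.6667
# y 0.3333
```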

IV. SCATTER PLOTS

Let's plot them:

  set sys = "tot-hea"
  foreach dic ( vald )
    foreach etag ( RAW EQV )
      set ptdir = "${etag}/plots/${dic}/pages"
      set scdir = "${etag}/plots/${dic}/sections"
      set fgdir = "${etag}/plots/${dic}/${sys}"
      /bin/rm -rf ${fgdir}
      mkdir -p ${fgdir}
      cp -p ${ptdir}/all.names ${fgdir}/all.names
      make-3d-scatter-plots \
        ${ptdir} \
        ${fgdir} \
        ${scdir}/{tot,hea,heb,bio}.pos
    end      
  end

Again, trying to separate Herbal-A from Pharma:

  set sys = "hea-pha"
  foreach dic ( vald )
    foreach etag ( RAW EQV )
      set ptdir = "${etag}/plots/${dic}/pages"
      set scdir = "${etag}/plots/${dic}/sections"
      set fgdir = "${etag}/plots/${dic}/${sys}"
      /bin/rm -rf ${fgdir}
      mkdir -p ${fgdir}
      cp -p ${ptdir}/all.names ${fgdir}/all.names
      make-3d-scatter-plots \
        ${ptdir} \
        ${fgdir} \
        ${scdir}/{hea,pha,heb,bio}.pos
    end      
  end

The scatter plots made with collapsed letters still show the main
sections as separate clusters, but touching each other.