Hacking at the Voynich manuscript - Side notes
009 A prefix-midfix-suffix factorization of the A and B herbal pages in EVA encoding

Last edited on 1997-11-24 16:54:07 by stolfi

  This is partly a remake of work from Notebook-1.txt, originally done around
  97-07-05.

  Summary of previous relevant tasks:

    I obtained Landini's interlinear transcription of the VMs, version
    1.6 (landini-interln16.evt) from
    http://sun1.bham.ac.uk/G.Landini/evmt/intrln16.zip. [Notebook-1.txt]
    
    Around 97-11-01 I split landini-interln16.evt into many files, with one
    text unit per page. [Notebook-12.txt]
    
    On 97-11-05 I mapped those files from FSG and other ad-hoc
    alphabets to EVA.  [Notebook-12.txt] The files are
    L16-eva/fNNxx.YY, and a machine-readable description of their
    contents and logical order is in L16-eva/INDEX.
    
    Then I went back and started redoing some of the previous tasks
    using the new encoding.
    
    I extracted the Currier (;C>) and Friedman-I (;F>)
    versions of the "bio" section, in EVA alphabet, as files
    bio-{c,f}-eva.evt. I also built the associated text files and word lists
    bio-{c,f}-eva{-{gut,fun,bad},{wds,dic,frq},.txt}. [Note-001.txt]
    
    I then constructed a factorization of the words in the bio
    section (Friedman version) into prefix-midfix-suffix, where prefix
    and suffix are maximal strings of letters in [aoyidlsrmnq] (not
    counting the "s" in "sh").  This factorization revealed about 12
    significant prefixes, 20 significant suffixes, and 40 or more
    significant midfixes.  The prefixes and suffixes seemed to be
    composed of a small number of letter groups, such as { dy al ol ar
    or ain oin aiin oiin ... }.  The midfixes were a single group of
    letters from the set { k t p f ckh cth cph cfh ch sh e }, where
    the "e"s appeared to be modifiers of the preceding
    letter. [Note-008.txt]
    
97-11-12 stolfi
===============

  I posted the Note-008 factorization to the Voynich list, and
  Rene asked whether the same pattern held in the Herbal sections.
  So let's try it. 
  
    foreach lang ( a b )

      set ufile = "he${lang}.units"
      set mfile = "he${lang}-m-eva.evt"
      set ffile = "he${lang}-f-eva.evt"
      set cfile = "he${lang}-c-eva.evt"
      echo '=== '${ufile}
      
      cat L16-eva/INDEX \
        | egrep -i -e ':herbal:'"${lang}"':.*:parags:' \
        | sed -e 's/:.*$//g' \
        > ${ufile}

      cat `cat ${ufile} | sed -e 's/^/L16-eva\//g'` \
        | egrep -v '^#' \
        > ${mfile}
        
      cat ${mfile} \
        | egrep '^<[^>]*;C>' \
        > ${cfile}

      cat ${mfile} \
        | egrep '^<[^>]*;F>' \
        > ${ffile}

      dicio-wc he${lang}-{m,f,c}-eva.evt
      
    end

     lines   words     bytes file        
    ------ ------- --------- ------------
      2477    4957    143185 hea-m-eva.evt
      1216    2432     70439 hea-f-eva.evt
      1119    2241     63768 hea-c-eva.evt

     lines   words     bytes file        
    ------ ------- --------- ------------
       727    1463     53929 heb-m-eva.evt
       364     735     26820 heb-f-eva.evt
       291     584     21307 heb-c-eva.evt
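
  For readers without the csh environment, the transcriber selection
  above can be sketched in Python.  This is only an illustration of the
  two egrep filters (drop '#' comment lines, keep lines whose leading
  <...> locator ends in ";C>" or ";F>").  The function name and the
  sample lines are made up for the example and are not taken from the
  L16-eva files.

```python
import re

def select_transcriber(lines, code):
    """Keep non-comment lines whose leading <...> locator ends in ';code>'."""
    # Same pattern as the egrep calls: '^<[^>]*;C>' or '^<[^>]*;F>'.
    locator = re.compile(r"^<[^>]*;" + re.escape(code) + r">")
    return [ln for ln in lines
            if not ln.startswith("#") and locator.match(ln)]

# Made-up sample lines, only to exercise the filter (hypothetical data).
sample = [
    "# a comment line",
    "<f1r.1;C>  fachys.ykal.ar",
    "<f1r.1;F>  fachys.ykal.ar",
    "<f1r.2;C>  sory.ckhar",
]
currier = select_transcriber(sample, "C")
friedman = select_transcriber(sample, "F")
print(len(currier), len(friedman))
```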

    foreach lang ( a b )
      foreach guy ( f c )
        extract-words-from-interlin \
            -chars "aoeilmnrchtpkfsqjdvxyg" \
            he${lang}-${guy}-eva.evt \
            he${lang}-${guy}-eva \
          > he${lang}-${guy}.stats
        
        dicio-wc he${lang}-${guy}-eva{,-gut,-fun,-bad}.*
      end
    end

     lines   words     bytes file        
    ------ ------- --------- ------------
      2497    2497     17089 hea-f-eva.dic
      1216    2432     70439 hea-f-eva.evt
      2497    4994     37065 hea-f-eva.frq
      1216    8058     46503 hea-f-eva.txt
      9250    9250     48887 hea-f-eva.wds
      2450    2450     16826 hea-f-eva-gut.dic
      2450    4900     36426 hea-f-eva-gut.frq
      7812    7812     45838 hea-f-eva-gut.wds
         3       3         6 hea-f-eva-fun.dic
         3       6        30 hea-f-eva-fun.frq
      1387    1387      2774 hea-f-eva-fun.wds
        44      44       257 hea-f-eva-bad.dic
        44      88       609 hea-f-eva-bad.frq
        51      51       275 hea-f-eva-bad.wds

     lines   words     bytes file        
    ------ ------- --------- ------------
      2270    2270     15622 hea-c-eva.dic
      1119    2241     63768 hea-c-eva.evt
      2270    4540     33782 hea-c-eva.frq
      1114    7285     41710 hea-c-eva.txt
      8387    8387     43914 hea-c-eva.wds
      2173    2173     15073 hea-c-eva-gut.dic
      2173    4346     32457 hea-c-eva-gut.frq
      6990    6990     40727 hea-c-eva-gut.wds
         3       3         6 hea-c-eva-fun.dic
         3       6        30 hea-c-eva-fun.frq
      1278    1278      2556 hea-c-eva-fun.wds
        94      94       543 hea-c-eva-bad.dic
        94     188      1295 hea-c-eva-bad.frq
       119     119       631 hea-c-eva-bad.wds

     lines   words     bytes file        
    ------ ------- --------- ------------
      1330    1330      8857 heb-f-eva.dic
       364     735     26820 heb-f-eva.evt
      1330    2660     19497 heb-f-eva.frq
       362    3304     19565 heb-f-eva.txt
      3706    3706     20369 heb-f-eva.wds
      1310    1310      8745 heb-f-eva-gut.dic
      1310    2620     19225 heb-f-eva-gut.frq
      3223    3223     19326 heb-f-eva-gut.wds
         3       3         6 heb-f-eva-fun.dic
         3       6        30 heb-f-eva-fun.frq
       465     465       930 heb-f-eva-fun.wds
        17      17       106 heb-f-eva-bad.dic
        17      34       242 heb-f-eva-bad.frq
        18      18       113 heb-f-eva-bad.wds

     lines   words     bytes file        
    ------ ------- --------- ------------
      1070    1070      7226 heb-c-eva.dic
       291     584     21307 heb-c-eva.evt
      1070    2140     15786 heb-c-eva.frq
       288    2631     15548 heb-c-eva.txt
      2978    2978     16242 heb-c-eva.wds
      1067    1067      7220 heb-c-eva-gut.dic
      1067    2134     15756 heb-c-eva-gut.frq
      2587    2587     15460 heb-c-eva-gut.wds
         3       3         6 heb-c-eva-fun.dic
         3       6        30 heb-c-eva-fun.frq
       391     391       782 heb-c-eva-fun.wds
         0       0         0 heb-c-eva-bad.dic
         0       0         0 heb-c-eva-bad.frq
         0       0         0 heb-c-eva-bad.wds

  Before we go on, it may be interesting to compare the 
  vocabularies of the two "languages":
  
    foreach guy ( f c )
      bool 1.2 he{a,b}-${guy}-eva-gut.dic > ${guy}-common.dic
      bool 1-2 he{a,b}-${guy}-eva-gut.dic > ${guy}-a-only.dic
      bool 2-1 he{a,b}-${guy}-eva-gut.dic > ${guy}-b-only.dic
      dicio-wc  ${guy}-{common,a-only,b-only}.dic
    end
   
     lines   words     bytes file        
    ------ ------- --------- ------------
       456     456      2619 f-common.dic
      1994    1994     14207 f-a-only.dic
       854     854      6126 f-b-only.dic

     lines   words     bytes file        
    ------ ------- --------- ------------
       367     367      2097 c-common.dic
      1806    1806     12976 c-a-only.dic
       700     700      5123 c-b-only.dic
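
  The three "bool" calls appear to compute the intersection and the two
  set differences of the A and B word lists.  That reading is an
  assumption based on the output file names, but the counts agree with
  it: for Friedman, 456 + 1994 = 2450 and 456 + 854 = 1310, which are
  the sizes of hea-f-eva-gut.dic and heb-f-eva-gut.dic.  A minimal
  Python sketch of the same comparison, with a made-up function name
  and made-up word lists:

```python
def compare_vocabularies(a_words, b_words):
    """Return (common, a_only, b_only) as sorted lists of distinct words."""
    a, b = set(a_words), set(b_words)
    return sorted(a & b), sorted(a - b), sorted(b - a)

# Hypothetical word lists, standing in for the two -gut.dic files.
a_dic = ["daiin", "chol", "shol", "chor"]
b_dic = ["daiin", "chedy", "qokeedy", "chol"]

common, a_only, b_only = compare_vocabularies(a_dic, b_dic)

# The same bookkeeping identity that holds for the tables above.
assert len(common) + len(a_only) == len(set(a_dic))
assert len(common) + len(b_only) == len(set(b_dic))
print(common, a_only, b_only)
```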

  Ok, now let's do the factoring:

    foreach lang ( a b )
      foreach guy ( f c )
        cat he${lang}-${guy}-eva-gut.wds \
          | sed \
              -e 's/sh/X/g' \
              -e 's/$/}/' \
              -e 's/^/{/' \
              -e ':a' \
              -e 's/{\(qo\)/.\1{/' \
              -e 'ta' \
              -e 's/{\([aoydlrs]\)/.\1{/' \
              -e 'ta' \
              -e ':x' \
              -e 's/\([aoydsm]\|i*[lnr]\)}/}\1./' \
              -e 'tx' \
              -e 's/\([oaydirslmn][oaydirslmn]*\)}/}\1:/' \
              -e 's/{\([oaydirslmn][oaydirslmn]*\)/:\1{/' \
              -e 's/X/sh/g' \
              -e 's/{}/\./' \
              -e 's/\.//g' \
              -e 's/{/- -/' \
              -e 's/}/- -/' \
          > he${lang}-${guy}.factored

        cat he${lang}-${guy}.factored \
          | grep ':' \
          > he${lang}-${guy}.funny-prefs-suffs

        cat he${lang}-${guy}.factored \
          | grep -v -e '- -' \
          | sort | uniq -c | expand | sort -k1,1nr \
          > he${lang}-${guy}-unifs-all.frq

        cat he${lang}-${guy}.factored \
          | grep -e '- -' \
          | gawk '/./ {print $1}' \
          | sort | uniq -c | expand | sort -k1,1nr \
          > he${lang}-${guy}-prefs-all.frq

        cat he${lang}-${guy}.factored \
          | grep -e '- -' \
          | gawk '/./ {print $2}' \
          | sort | uniq -c | expand | sort -k1,1nr \
          > he${lang}-${guy}-midfs-all.frq

        cat he${lang}-${guy}.factored \
          | grep -e '- -' \
          | gawk '/./ {print $3}' \
          | sort | uniq -c | expand | sort -k1,1nr \
          > he${lang}-${guy}-suffs-all.frq

        cat he${lang}-${guy}.factored \
          | grep -e '- -' \
          | gawk '/./ {print ($2 $3)}' \
          | sed -e 's/--//g' \
          | sort | uniq -c | expand | sort -k1,1nr \
          > he${lang}-${guy}-tails-all.frq

        cat he${lang}-${guy}.factored \
          | gawk '/./ {print ($1 $2 $3)}' \
          | sed -e 's/--//g' \
          | sort | uniq -c | expand | sort -k1,1nr \
          > he${lang}-${guy}-words-all.frq

      end
    end
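
  The sed script above is dense, so here is a Python sketch of what it
  does to a single word, as far as I can tell from reading it: protect
  "sh", peel a maximal prefix made of "qo" and the letters [aoydlrs],
  peel a maximal suffix made of [aoydsm] and i*[lnr] groups, and flag
  any leftover prefix/suffix-type letters with ":".  The names factor
  and render are invented for this sketch, and the test words are
  illustrative EVA words, not output copied from the files.

```python
import re

PREFIX_RE = re.compile(r"(?:qo|[aoydlrs])*")        # leading "qo" or single letters
SUFFIX_RE = re.compile(r"(?:[aoydsm]|i*[lnr])*$")   # trailing letters or i*[lnr] groups
STRAY_TAIL_RE = re.compile(r"[oaydirslmn]+$")       # leftovers flagged with ':'
STRAY_HEAD_RE = re.compile(r"[oaydirslmn]+")

def factor(word):
    """Split an EVA word into (prefix, midfix, suffix)."""
    w = word.replace("sh", "X")                  # protect the "s" of "sh"
    prefix = PREFIX_RE.match(w).group(0)
    rest = w[len(prefix):]
    suffix = SUFFIX_RE.search(rest).group(0)     # longest tail made of suffix groups
    mid = rest[:len(rest) - len(suffix)]
    m = STRAY_TAIL_RE.search(mid)                # e.g. an "i" not followed by l, n or r
    if m:
        suffix = m.group(0) + ":" + suffix
        mid = mid[:m.start()]
    m = STRAY_HEAD_RE.match(mid)
    if m:
        prefix = prefix + ":" + m.group(0)
        mid = mid[m.end():]
    return prefix, mid.replace("X", "sh"), suffix

def render(word):
    """Format like the .factored files: 'pre- -mid- -suf', or the bare word."""
    p, m, s = factor(word)
    return p + s if m == "" else f"{p}- -{m}- -{s}"

for w in ["qokeedy", "daiin", "chol", "shedy"]:
    print(w, "->", render(w))
```

  Words with an empty midfix come out without the "- -" separators,
  which is what the later grep steps use to separate the "unifs" from
  the factored words.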
    
  Check the length of the longest element of each kind:
  
    foreach elem ( prefs midfs suffs unifs tails words )
      foreach lang ( a b )
        foreach guy ( f c )
          /usr/ucb/echo -n "herbal ${lang} vers ${guy}: max ${elem} = "
          cat he${lang}-${guy}-${elem}-all.frq \
            | gawk '/./ {m=length($2); if(m>mx)mx=m}; END {printf "%2d\n",mx}'
        end
      end
    end

    herbal a vers c: max prefs =  8
    herbal a vers f: max prefs =  8
    herbal b vers c: max prefs =  8
    herbal b vers f: max prefs =  5

    herbal a vers c: max midfs = 15
    herbal a vers f: max midfs = 15
    herbal b vers c: max midfs = 16
    herbal b vers f: max midfs = 13

    herbal a vers c: max suffs = 10
    herbal a vers f: max suffs =  9
    herbal b vers c: max suffs =  9
    herbal b vers f: max suffs =  9

    herbal a vers c: max unifs = 10
    herbal a vers f: max unifs =  8
    herbal b vers c: max unifs =  9
    herbal b vers f: max unifs =  8

    herbal a vers f: max tails = 16
    herbal a vers c: max tails = 16
    herbal b vers f: max tails = 14
    herbal b vers c: max tails = 17

    herbal a vers f: max words = 15
    herbal a vers c: max words = 15
    herbal b vers f: max words = 13
    herbal b vers c: max words = 16

  Let's now format the files and reduce the absolute counts to percentages 
  relative to the total factored and unfactored words:

    foreach guy ( Friedman.f Currier.c )
      foreach elem ( pref midf suff unif tail word )
        foreach lang ( A.a B.b )
          set file = "he${lang:e}-${guy:e}-${elem}s-all"
          echo "${file}.frq -> ${file}.fmt"
          cat ${file}.frq \
            | compute-freqs \
            | gawk '\
                  BEGIN {\
                    printf "by '"${guy:r}"'\nlanguage '"${lang:r}"'\n"; \
                    printf "freq pc '"${elem}"'ix\n---- -- ------------------\n";} \
                  /./   {printf "%4d %2d %s\n",$1,int($2*100+0.5),$3; t+=$1;} \
                  END   {printf "---- -- ------------------\n%4d 99 TOTAL\n",t;} \
                ' \
            > ${file}.fmt
        end
      end
    end
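
  A sketch of the formatting step in Python, for reference.  The
  compute-freqs tool is local and not shown here, so this sketch
  assumes it emits each count as a fraction of the sum of the counts in
  the file it reads; that denominator is an assumption.  The rounding
  (int(x*100 + 0.5)), the column layout, and the literal "99" in the
  TOTAL row follow the gawk program above.  The function name and the
  sample counts are made up.

```python
RULE = "---- -- ------------------"

def format_table(rows, who, lang, elem):
    """rows: list of (count, item), already sorted by decreasing count."""
    total = sum(count for count, _ in rows)
    out = [f"by {who}", f"language {lang}", f"freq pc {elem}ix", RULE]
    for count, item in rows:
        # Assumed denominator: the sum of all counts in this table.
        pc = int(100.0 * count / total + 0.5)
        out.append(f"{count:4d} {pc:2d} {item}")
    out.append(RULE)
    # The gawk script prints a fixed "99" here, so the two-digit
    # percentage column keeps its width.
    out.append(f"{total:4d} 99 TOTAL")
    return out

# Hypothetical counts, only to show the layout.
table = format_table([(3, "qo-"), (1, "o-")], "Friedman", "A", "pref")
print("\n".join(table))
```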

  Now let's print them side-by-side:

    foreach elem ( pref midf suff unif tail word )
      set tfiles = ( )
      foreach guy ( f c )
        foreach lang ( a b )
          set file = "he${lang}-${guy}-${elem}s-all"
          set tfiles = ( ${tfiles} ${file}.fmt )
        end
      end
      pr -m -t -i' '1 -w 108  ${tfiles} \
        | expand \
        > herbal-${elem}-cmp.txt
    end

  Inspired by Gabriel Landini's paper, let's prepare a graph 
  of A-freq × B-freq for each segment:
  
    foreach guy ( Friedman.f )
      foreach elem ( pref midf suff unif tail )
        set pfile = "herbal-${guy:e}-${elem}s-all"
        set afile = "hea-${guy:e}-${elem}s-all"
        set bfile = "heb-${guy:e}-${elem}s-all"
        echo "${afile}.frq, ${bfile}.frq -> ${pfile}.plt"
        cat ${afile}.frq | sort -b -k2,2 > .a
        cat ${bfile}.frq | sort -b -k2,2 > .b
        /n/gnu/bin/join \
            -a 1 -a 2 -e 0.5 \
            -j1 2 -j2 2 \
            -o1.1,2.1,0 \
            .a .b \
          > ${pfile}.plt
        plot-lang-diffs ${guy:r} ${elem} ${pfile}.plt
      end
    end
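
  The join above is a full outer join on the element (field 2 of each
  .frq file), writing "A-count B-count element" and substituting 0.5
  for a missing count.  The 0.5 presumably lets elements that occur in
  only one language still be placed if plot-lang-diffs uses logarithmic
  axes, but that is a guess; the script is not shown here.  A minimal
  Python sketch of the same join, with invented names and counts:

```python
def outer_join(a_counts, b_counts, fill=0.5):
    """Return (a_count, b_count, key) rows for every key seen in either table."""
    keys = sorted(set(a_counts) | set(b_counts))
    return [(a_counts.get(k, fill), b_counts.get(k, fill), k) for k in keys]

# Hypothetical counts standing in for two -all.frq files.
a = {"qo-": 120, "o-": 80}
b = {"qo-": 300, "y-": 5}

rows = outer_join(a, b)
for a_count, b_count, key in rows:
    print(a_count, b_count, key)
```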