Hacking at the Voynich manuscript - Side notes
021 Plotting word frequencies per page

Last edited on 1999-07-17 11:51:26 by stolfi

[ First version done on 1998-04-28.
  Redone 1998-06-20 with fresher data.
  Redone 1998-07-02 with different dictionaries and axes.
  Redone 1999-01-30 with 1.6e6 majority transcription (Notes/045).
  Analysis part split off to Notes/019 on 1999-01-31. ]

1998-07-02 stolfi
=================

  The goal of this note is to compare the word distributions 
  among the various sections of the Voynich manuscript.

I. EXTRACTING AND COUNTING WORDS

  The source file will be the majority version (Notes/045), with weirdos 
  mapped to "*" and other basic EVA chars, and chopped into 
  pages and subsections:

    ln -s ../045/subsecs-m text-subsecs
    ln -s ../045/pages-m text-pages

  We will use the word and word class frequencies per page and per
  section that were derived from this data in Notes/019.
  
    mkdir -p RAW EQV
    
    ( cd RAW && ln -s ../../019/RAW/wfreqs )
    ( cd EQV && ln -s ../../019/EQV/wfreqs )
    
    ln -s ../019/fnum-to-subsec.tbl

  Create a combined file of the source text, for archiving:
  
    ( cd text-pages && cat `cat all.names | sed -e 's/$/.evt/'` ) \
      > all.evt

II. CONCEPT OF PAGE SCATTER-PLOTS

  Let's see if we can characterize the Voynichese sub-languages by
  mapping the pages into some space of moderate dimension, and looking
  for geometric clustering among that cloud of points.

  We start by fixing a list D of key words.  Then we can look at each
  language sample S as a point in the unit probability simplex, whose
  ith coordinate S_i is the estimated frequency freq(w_i,S) of the ith
  key word w_i in the sample's language.

  To handle low-frequency words properly, I guess we should use a
  Bayesian estimate of freq(w_i,S), rather than merely
  count(w_i,S)/size(S).  But for that we need to postulate an a priori
  probability for words that do not occur in that sample, or even in any
  sample. We could use some exponential distribution in length(w), but
  let's leave that for later. Let's instead assume an a priori
  distribution for the frequencies that is uniform over the unit simplex,
  for all words w in some selected dictionary D.  The Bayesian estimate
  of freq(w,S) is then 

    freq(w,S) = (count(w,S) + 1)/(count(D,S) + n)

  where count(D,S) is the total number of occurrences of D-words in the
  sample S, and n is the number of distinct words in D.  Words of S that
  are not in D are ignored.  Note that these estimated frequencies,
  summed over all words of D, still add to 1.
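
  As a check on the formula above, here is a minimal Python sketch of
  the add-one estimate.  It is not the est-dic-probs script used later
  in this note, whose internals are not shown here; the function name
  estimate_freqs and the toy sample are made up for illustration.

```python
from collections import Counter

def estimate_freqs(words, dictionary):
    """Add-one (uniform prior) estimate of freq(w,S) for each w in D.

    words      -- the sample S, as a list of word tokens
    dictionary -- the key words D; tokens outside D are ignored
    """
    keyset = set(dictionary)
    counts = Counter(w for w in words if w in keyset)
    total = sum(counts.values())   # count(D,S)
    n = len(keyset)                # number of distinct words in D
    return {w: (counts[w] + 1) / (total + n) for w in sorted(keyset)}

# Toy sample (hypothetical data, not taken from the manuscript files).
sample = ["daiin", "ol", "daiin", "chedy", "xyz"]
D = ["daiin", "ol", "chedy", "qokeey"]
freqs = estimate_freqs(sample, D)
for word, f in freqs.items():
    print(word, f)
```

  In this toy sample "daiin" occurs 2 times among 4 D-words, with n = 4,
  so it gets (2+1)/(4+4) = 0.375; the unseen word "qokeey" gets 1/8
  rather than 0, and the four estimates add to 1.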

  As the dictionary D, let's use the 50 most frequent words (or word
  classes, in the EQV case) of Rene's list:

    set sizeD = 50

    foreach ep ( cat.RAW word-ct-to-class-ct.EQV )
      set etag = ${ep:e}; set ecmd = ${ep:r}
      mkdir -p ${etag}/plots/rene
      cat Rene-words.frq \
        | ${ecmd} \
        | combine-counts \
        | sort -b -k1,1nr \
        | head -${sizeD} \
        | gawk '//{print $2;}' \
        | sort | uniq \
        > ${etag}/plots/rene/keys.dic
      list -filter 'fmt -w60' ${etag}/plots/rene/keys.dic
    end

    dicio-wc {RAW,EQV}/plots/rene/keys.dic

    --- RAW/plots/rene/keys.dic ------------------------
    aiin al ar chckhy chdy chedy cheey cheol chey chol chor
    chy daiin dain dal dar dy lchedy okaiin okain okal okar
    okedy okeedy okeey ol or otaiin otal otar otedy oteedy
    oteey oty qokaiin qokain qokal qokar qokedy qokeedy qokeey
    qokey qoky qol s saiin shedy sheey shey shol
    ----------------------------------------------------

    --- EQV/plots/rene/keys.dic ------------------------
    chctho~ chdo~ chectho~ chedo~ cheedo~ cheeo~ cheodo~ cheol~
    cheor~ cheo~ cheto~ chodo~ chol~ chor~ choto~ cho~ ctho~
    doin~ doir~ dol~ dor~ do~ lchedo~ odoin~ oin~ ol~ oroin~
    or~ otchdo~ otchedo~ otcheo~ otchol~ otcho~ otedo~ oteedo~
    oteeo~ oteodo~ oteol~ oteo~ otod~ otoin~ otol~ otor~ oto~
    o~ soin~ sor~ s~ toin~ tor~
    ----------------------------------------------------
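
  The key selection above can be sketched in Python as follows.  This
  is only an illustration under two assumptions that the note does not
  confirm: that each line of the frequency list holds a count in field
  1 and a word in field 2, and that combine-counts adds up the counts
  of repeated words.  The name top_keys and the demo lines are made up,
  and ties are broken alphabetically, which the shell pipeline does not
  guarantee.

```python
from collections import Counter

def top_keys(lines, size):
    """Return the `size` words with the largest total count, sorted by name."""
    totals = Counter()
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue                      # skip blank or malformed lines
        totals[fields[1]] += int(fields[0])
    # Rank by decreasing count; ties go to the alphabetically first word.
    ranked = sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))
    return sorted(word for word, _ in ranked[:size])

# Hypothetical input lines in "count word" form.
demo = ["10 daiin", "7 ol", "3 chedy", "5 ol", "1 s"]
keys = top_keys(demo, 3)
print(" ".join(keys))
```

  Here the two "ol" lines are merged into a total of 12 before ranking,
  so the three keys kept are chedy, daiin and ol.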

  Then we compute the frequencies of these keywords in each page and section:

    foreach dic ( rene )
      foreach etag ( RAW EQV )
        echo "${dic}" "${etag}"
        foreach utype ( pages subsecs )
          set frdir = "${etag}/wfreqs/${utype}"
          set ptdir = "${etag}/plots/${dic}/${utype}"
          echo "${frdir}" "${ptdir}"
          /bin/rm -rf ${ptdir}
          mkdir -p ${ptdir}
          cp -p ${frdir}/all.names ${ptdir}
          foreach fnum ( tot `cat ${frdir}/all.names` )
            printf "%30s/%-7s " "${ptdir}" "${fnum}:"
            cat ${frdir}/${fnum}.frq \
              | gawk '/./{print $1, $3;}' \
              | est-dic-probs -v dic="${etag}/plots/${dic}/keys.dic" \
              > ${ptdir}/${fnum}.pos
          end
        end
      end
    end

  For plotting we will use a special grouping of subsections into
  "plot sections" defined by the following table: 
  
    set plotgrp = ( \
      pharma:pha.1,pha.2 \
      her-a-1:hea.1 \
      her-a-2:hea.2 \
      cosmo-1:cos.2 \
      cosmo-2:cos.3 \
      zodiac:zod.1 \
      stars:str.2 \
      her-b-1:heb.1 \
      her-b-2:heb.2 \
      bio:bio.1 \
      f58rv:str.1 \
      f1r:unk.1 \
      f57v:cos.1 \
      f49v:unk.2 \
      f65rv:unk.3 \
      f66r:unk.4 \
      f85r1:unk.5 \
      f86v6:unk.6 \
      f86v5:unk.7 \
      f116v:unk.8 \
      xxx:xxx.0,xxx.1,xxx.2,xxx.3,xxx.4,xxx.5 \
    )

    echo $plotgrp \
      | tr ' :,' '\012  ' \
      | gawk '/./{for(i=2;i<=NF;i++){print $(i),$1;}}' \
      | sort -k1,1 \
      > subsec-to-plot.tbl
  
    set plotsecs = ( `echo $plotgrp | tr ' ' '\012' | sed -e 's/[:].*$//g' | grep -v xxx` )
    echo ${plotsecs}
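
  For readers less used to tr and gawk, the inversion of the grouping
  table can be sketched in Python as below.  The function name
  invert_groups is made up, and only three entries of the table are
  reproduced, to keep the example short.

```python
def invert_groups(plotgrp):
    """Turn entries 'plotsec:sub1,sub2' into a {subsection: plotsec} table."""
    table = {}
    for entry in plotgrp:
        plot_name, _, members = entry.partition(":")
        for subsec in members.split(","):
            if subsec:
                table[subsec] = plot_name
    # Sort by subsection name, like the final "sort" on field 1.
    return dict(sorted(table.items()))

# A short excerpt of the grouping table, in the same format.
plotgrp = ["pharma:pha.1,pha.2", "her-a-1:hea.1", "xxx:xxx.0,xxx.1"]
subsec_to_plot = invert_groups(plotgrp)

# Plot section names, dropping any that contain "xxx" (as grep -v xxx does).
plotsecs = [e.split(":")[0] for e in plotgrp if "xxx" not in e.split(":")[0]]

for subsec, plot_name in subsec_to_plot.items():
    print(subsec, plot_name)
print(plotsecs)
```

  Each output line pairs one subsection with its plot section, which is
  the layout written to subsec-to-plot.tbl.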
  
  Remember to edit the "plot-page-data" script to match the plot sections
  above.
    
  Make a table that maps f-numbers directly to plot sections:
  
    cat fnum-to-subsec.tbl \
      | map-field \
          -v inField=2 -v outField=3 \
          -v table=subsec-to-plot.tbl \
      | gawk '//{print $1,$3}' \
      > fnum-to-plot.tbl
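
  The step above is just the composition of two lookup tables.  A
  minimal Python sketch follows, assuming that map-field looks up the
  value of the input field in the given table and writes the result to
  the output field; its handling of keys missing from the table is not
  documented here, so the "???" placeholder is an arbitrary choice.
  The name compose_tables and the sample entries are illustrative only.

```python
def compose_tables(fnum_to_subsec, subsec_to_plot, missing="???"):
    """Map each f-number to a plot section via its subsection."""
    return {
        fnum: subsec_to_plot.get(subsec, missing)
        for fnum, subsec in fnum_to_subsec.items()
    }

# Hypothetical table contents, for illustration only.
fnum_to_subsec = {"fA": "hea.1", "fB": "pha.2", "fC": "zzz.9"}
subsec_to_plot = {"hea.1": "her-a-1", "pha.1": "pharma", "pha.2": "pharma"}

fnum_to_plot = compose_tables(fnum_to_subsec, subsec_to_plot)
for fnum, plot_name in fnum_to_plot.items():
    print(fnum, plot_name)
```

  The result has one line per f-number, with the subsection column
  dropped, as done by the final gawk command.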

  Format the plot sections table for documentation:

    /bin/rm -f plot-summary.txt
    foreach plsc ( ${plotsecs} )
      echo "${plsc}"
      echo "subsection ${plsc}" \
        >> plot-summary.txt
      cat fnum-to-plot.tbl \
        | gawk -v plsc=${plsc} '($2 == plsc){print $1}' \
        | map-field \
            -v inField=1 -v outField=2 \
            -v table=fnum-to-pnum.tbl \
        | sort -k2,2 \
        | gawk '/./{print $1}' \
        | fmt -w 50 \
        | sed -e 's/^/  /' \
        >> plot-summary.txt
      echo " " \
        >> plot-summary.txt
    end

  OK, let's plot the data:

    set sys = "tot-hea"
    foreach dic ( rene )
      foreach etag ( RAW EQV )
        set ptdir = "${etag}/plots/${dic}/pages"
        set scdir = "${etag}/plots/${dic}/subsecs"
        set fgdir = "${etag}/plots/${dic}/${sys}"
        /bin/rm -rf ${fgdir}
        mkdir -p ${fgdir}
        cp -p ${ptdir}/all.names ${fgdir}/all.names
        make-3d-scatter-plots \
          ${ptdir} \
          ${fgdir} \
          ${scdir}/{tot,tot,hea.1,heb.1,str.2}.pos
      end
    end

  Let's make another set of plots, using Herbal-A and Pharma as 
  references:

    set sys = "hea-pha"
    foreach dic ( rene )
      foreach etag ( RAW EQV )
        set ptdir = "${etag}/plots/${dic}/pages"
        set scdir = "${etag}/plots/${dic}/subsecs"
        set fgdir = "${etag}/plots/${dic}/${sys}"
        /bin/rm -rf ${fgdir}
        mkdir -p ${fgdir}
        cp -p ${ptdir}/all.names ${fgdir}/all.names
        make-3d-scatter-plots \
          ${ptdir} \
          ${fgdir} \
          ${scdir}/{tot,hea.1,pha.2,heb.1,bio.1}.pos
      end
    end