Hacking at the Voynich manuscript - Side notes
051 Detailed label occurrence map in the Biological section

Last edited on 1999-07-21 05:06:09 by stolfi

INTRODUCTION

  In this note we try to build a detailed label occurrence map
  for the biological section labels, with paragraphs as the 
  unit of text.

PREPARING THE PARAGRAPH DATABASE 

  First, we split the bio pages into paragraphs. The file name format
  will be NNN-FFFF.XXX where NNN is a sequential paragraph number within
  the bio section, FFFF is the page's f-number, and XXX is an extension
  that distinguishes labels (".lbl") from text (".par").

    set biopages = ( \
      f75r f75v \
        f76r f76v \
          f77r f77v \
            f79r f79v \
              f78r f78v \
              f81r f81v \
            f80r f80v \
          f82r f82v \
        f83r f83v \
      f84r f84v \
    )

    echo $biopages \
      | tr ' ' '\012' \
      | sed -e 's/.*/<&/' \
      > .foo

    cat inter-cm.evt \
      | egrep -v '^[#]' \
      | egrep '[;]A>' \
      | fgrep -f .foo \
      > bio.evt
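
  The selection above can also be sketched in Python. This is only an
  illustration, not part of the original pipeline: the locator shape is
  modelled on "<f84r.P.24;A>" as quoted in these notes, the sample lines
  are made up, and select_lines is a hypothetical name. Unlike the fgrep
  step, it matches the page name exactly rather than as a substring.

```python
import re

# Assumed locator shape, modelled on "<f84r.P.24;A>": <page.unit.line;transcriber>
LOCATOR = re.compile(r"^<(f\d+[rv]\d?)\.[^;>]*;(\w)>")

def select_lines(lines, pages, transcriber="A"):
    """Keep non-comment lines that belong to the given pages and transcriber."""
    wanted = set(pages)
    kept = []
    for line in lines:
        if line.startswith("#"):          # same role as egrep -v '^[#]'
            continue
        m = LOCATOR.match(line)
        if m and m.group(1) in wanted and m.group(2) == transcriber:
            kept.append(line)
    return kept

# Made-up sample lines, for illustration only.
sample = [
    "# a comment line",
    "<f75r.P.1;A> ol.aiin",
    "<f75r.P.1;H> ol.aiin",
    "<f1r.P.1;A> ady",
]
print(select_lines(sample, ["f75r", "f75v"]))
```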

  I then edited bio.evt by hand, producing bio-edit.evt, as follows.
  Inserted separators of the form === f83v par ====* before each
  paragraph, and === f83v lbl ====* before each label block.
  Rearranged the paragraphs and label blocks into approximate reading
  order.  Split the line <f84r.P.24;A> into <f84r.P.24a;A> (probably
  labels) and <f84r.P.24b;A> (probably text).  Rearranged the quire by
  moving bifolio f78+f81 to the center, inside f79+f80.  Filled in some
  of the "*"s of the majority version with the letter from one of
  the other versions.

  Next we split bio-edit.evt into one file per paragraph or label
  block, and create the tables (parag-to-fnum.tbl, parag-to-type.tbl)
  that map paragraph number to page f-number and paragraph type:

    mkdir evt-parags

    /bin/rm -f evt-parags/*

    cat bio-edit.evt \
      | split-parags
      
    dicio-wc parag-to*.tbl
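
  The split-parags script itself is not reproduced in this note. The
  following Python sketch shows one way such a split could work, given
  the separator lines described above. The function name, the in-memory
  return values, and the 3-digit numbering are assumptions made for
  illustration.

```python
import re

# Separator shape from the notes: "=== f83v par ====..." or "=== f83v lbl ====..."
SEPARATOR = re.compile(r"^=== (\S+) (par|lbl) =+\s*$")

def split_parags(lines):
    """Return (blocks, fnum_table, type_table), all keyed by paragraph number."""
    blocks, fnum, kind = {}, {}, {}
    gnum = None
    for line in lines:
        m = SEPARATOR.match(line)
        if m:
            gnum = "%03d" % (len(blocks) + 1)   # numbering width is an assumption
            blocks[gnum] = []
            fnum[gnum], kind[gnum] = m.group(1), m.group(2)
        elif gnum is not None:
            blocks[gnum].append(line)           # text before any separator is dropped
    return blocks, fnum, kind

# Made-up sample input, for illustration only.
sample = [
    "=== f75r par ======",
    "<f75r.P.1;A> ol.aiin",
    "<f75r.P.2;A> ady",
    "=== f75r lbl ======",
    "<f75r.L.1;A> al",
]
blocks, fnum, kind = split_parags(sample)
print(fnum, kind)
```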

PARAGRAPH CONCORDANCE FOR THE BIO SECTION

  Next we create a file containing one record per word occurrence, in the format

    WORD PATT1 PATT2 GNUM FNUM BNUM TYPE
    1    2     3     4    5    6    7

  where

    WORD is a word occurring in the bio section,

    PATT1 is its equivalence class representative (tight similarity),

    PATT2 is its equivalence class representative (loose similarity),

    GNUM is the number of the paragraph where the word occurs,

    FNUM is the page's f-number,

    BNUM is the bifolio's b-number,

    TYPE is "par" for occurrence in text, "lbl" for occurrence as label.

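  The loop below first builds bio.roc, which has only the first four
  fields (WORD PATT1 PATT2 GNUM); the map-field pipeline after it adds
  FNUM, BNUM, and TYPE to produce bio.woc: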

    /bin/rm -f bio.roc
    foreach gnum ( `cd evt-parags && ls [0-9][0-9][0-9][0-9] | sort` )
      echo $gnum
      cat evt-parags/$gnum \
        | words-from-evt \
        | tr ' ' '\012' \
        | add-loose-pattern \
        | add-tight-pattern \
        | /n/gnu/bin/gawk \
            -v gnum="${gnum}" \
            '/./ { print $0, gnum; }' \
       >> bio.roc
    end

    dicio-wc bio.roc

        lines   words     bytes file        
      ------- ------- --------- ------------
         6973   27892    155956 bio.roc

    cat bio.roc \
      | sort -k1,1 -k4,4 \
      | map-field \
          -v inField=4 \
          -v outField=5 \
          -v table=parag-to-fnum.tbl \
      | map-field \
          -v inField=5 \
          -v outField=6 \
          -v table=fnum-to-bifolio.tbl \
      | map-field \
          -v inField=4 \
          -v outField=7 \
          -v table=parag-to-type.tbl \
      > bio.woc
    
    dicio-wc bio.woc

        lines   words     bytes file        
      ------- ------- --------- ------------
         6973   48811    246605 bio.woc
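
  The map-field script is not shown in this note. Judging from its use
  above, it looks up the value of field inField in a two-column table
  and stores the result as field outField. Here is a hedged Python
  sketch of that behaviour. The "?" placeholder for missing keys, the
  b-number "b1", and the sample record are all assumptions.

```python
def map_field(records, in_field, out_field, table, missing="?"):
    """Insert table[record[in_field]] as field number out_field (1-based)."""
    out = []
    for rec in records:
        fields = rec.split()
        value = table.get(fields[in_field - 1], missing)
        fields.insert(out_field - 1, value)   # appends when out_field is one past the end
        out.append(" ".join(fields))
    return out

# Made-up record in the bio.roc layout: WORD PATT1 PATT2 GNUM
roc = ["aiin oiin oino 001"]
step1 = map_field(roc, 4, 5, {"001": "f75r"})    # add FNUM
step2 = map_field(step1, 5, 6, {"f75r": "b1"})   # add BNUM (hypothetical value)
woc = map_field(step2, 4, 7, {"001": "par"})     # add TYPE
print(woc[0])
```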

  Let's count the total number of word occurrences per paragraph:
  
    cat bio.woc \
      | gawk '/./{ p = $4; nw[p]++; } END{ for(p in nw){ print p, nw[p];} }' \
      | sort \
      > parag-to-size.tbl 
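
  For comparison, the same per-paragraph count can be written in Python
  as below. It assumes records in the seven-field bio.woc layout; the
  sample records are made up.

```python
from collections import Counter

def parag_sizes(records):
    """Count word occurrences per paragraph (field 4), sorted by paragraph number."""
    counts = Counter(rec.split()[3] for rec in records if rec.strip())
    return sorted(counts.items())

# Made-up records: WORD PATT1 PATT2 GNUM FNUM BNUM TYPE
woc_sample = [
    "aiin oiin oino 002 f75r b1 par",
    "ol ol ol 001 f75r b1 par",
    "ady odo odo 002 f75r b1 par",
    "",
]
print(parag_sizes(woc_sample))
```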
    
PARAGRAPH INDEX FILE

  Let's join all data about each paragraph in a single file:
  
    cat parag-to-fnum.tbl \
      | gawk '/./{ fn = $2; gsub(/[rv]/, "", fn); print $1, $2, fn; }' \
      | map-field \
          -v inField=2 \
          -v outField=4 \
          -v table=fnum-to-bifolio.tbl \
      | map-field \
          -v inField=1 \
          -v outField=5 \
          -v table=parag-to-type.tbl \
      > parag.data

PREPARING THE OCCURRENCE MAP

  First, let's find the maximum length of words and patterns: 

    foreach fld ( 1 2 3 )
      echo "field $fld"
      cat bio.woc \
        | /n/gnu/bin/gawk -v fld=$fld \
            '/./{print $(fld);}' \
        | count-word-lengths
    end

      field 1
      len nwords example           
      --- ------ ------------------
        1     74 ?
        2    487 al
        3    790 ady
        4   1044 aiin
        5   1885 aiiin
        6   1566 acthey
        7    840 alchedy
        8    218 cfhdarol
        9     56 checphedy
       10     11 darcheedal
       11      2 chlchpsheey

      field 2
      len nwords example           
      --- ------ ------------------
        1     85 ?
        2    610 ol
        3    779 odo
        4   1262 oiin
        5   2117 oiiin
        6   1404 octheo
        7    499 olchedo
        8    161 cthdorol
        9     46 checthedo
       10      8 dorcheedol
       11      2 chlchtsheeo

      field 3
      len nwords example           
      --- ------ ------------------
        1     85 ?
        2    610 ol
        3    835 odo
        4   1394 oino
        5   2126 oloeo
        6   1237 oeteeo
        7    489 oleeedo
        8    145 etedoeol
        9     42 eeeeteedo
       10      8 doeeeeedol
       11      2 eeleeteeeeo
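
  The count-word-lengths script is not shown here. A minimal Python
  sketch that produces the same three columns (length, number of
  occurrences, one example) could look like this. Choosing the
  lexicographically smallest word as the example is an assumption; the
  original script may pick its example differently.

```python
def count_word_lengths(words):
    """Return (length, occurrence count, example word) rows, sorted by length."""
    table = {}
    for w in words:
        n, example = table.get(len(w), (0, w))
        table[len(w)] = (n + 1, min(example, w))   # example choice is an assumption
    return [(length, n, ex) for length, (n, ex) in sorted(table.items())]

for length, n, example in count_word_lengths(["ol", "al", "aiin", "ady", "al"]):
    print("%3d %6d %s" % (length, n, example))
```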

  Next we tabulate the occurrences of each word per paragraph,
  considering only text occurrences:

    foreach patt ( exact.1 tight.2 loose.3 )
      set xpt = ${patt:r}
      set fpt = ${patt:e}
      set ofile = "bio-par-${xpt}.wmap"
      echo "${ofile}"
      cat bio.woc \
        | /n/gnu/bin/gawk \
            -v fpt=${fpt} \
            '($7 == "par"){ print $(fpt), $1, ($7 == "par" ? "-" : "+"), $4; }' \
        | sort -b -k1,3 -k4,4 \
        | make-word-parag-map \
            -v nParags=112 \
            -v omitSingles=1 \
        > ${ofile}
    end

    dicio-wc bio-par-{exact,tight,loose}.wmap

      lines   words     bytes file        
    ------- ------- --------- ------------
       1015   58928    182041 bio-par-exact.wmap
       1549  133632    412631 bio-par-tight.wmap
       1657  156600    483474 bio-par-loose.wmap
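
  The make-word-parag-map script is not shown, so its exact output
  format is unknown. The underlying tabulation can be sketched in Python
  as below: one row of per-paragraph counts for each key (word or
  pattern). Reading omitSingles=1 as "drop keys that occur only once" is
  an assumption, as are the function name and the sample data.

```python
from collections import defaultdict

def word_parag_map(records, n_parags, omit_singles=True):
    """records: (key, paragraph_number) pairs, paragraphs numbered 1..n_parags."""
    rows = defaultdict(lambda: [0] * n_parags)
    for key, gnum in records:
        rows[key][int(gnum) - 1] += 1
    # Assumed meaning of omitSingles: skip keys with fewer than two occurrences.
    return {k: r for k, r in sorted(rows.items())
            if not (omit_singles and sum(r) < 2)}

# Made-up occurrences, for illustration only.
occ = [("ol", "001"), ("ol", "003"), ("aiin", "002")]
print(word_parag_map(occ, 3))
```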

  The table "bio-fnum-to-order.tbl", created by hand, maps original
  page f-numbers to the conjectured page reading order (a 2-digit
  number).  The order was settled after a couple of iterations
  of the steps that follow.
  
  Now create tables that map text paragraphs to blocks
  (page, folio, bifolio) in the conjectured order, with 
  blank lines at the proper places:
  
    cat parag.data \
      | map-field \
          -v inField=2 \
          -v outField=6 \
          -v table=bio-fnum-to-order.tbl \
      | sort -k6,6 -k1,1 \
      > parag-ordered.data

    foreach block ( parag.1 page.2 folio.3 bifolio.4 )
      set xbl = ${block:r}
      set fbl = ${block:e}
      cat parag-ordered.data \
        | gawk -v fbl="${fbl}" \
            ' ($5 != "par"){ next; } \
              /./{ \
                if((FNR>1)&&(fbl<=2)&&($(fbl+1)!=f)){ printf "\n"; } \
                if((FNR>1)&&(fbl<=3)&&($4!=b)){ printf "\n"; } \
                print; f=$(fbl+1); b=$4 \
              } \
            ' \
        > .temp 
      set tfile = "bio-parag-to-${xbl}-ordered.tbl"
      echo "${tfile}"
      cat .temp \
        | gawk -v fbl="${fbl}" \
            ' /^ *$/ { print; next; } // { print $1, $(fbl); }' \
        > ${tfile}
      set hfile = "bio-parag-to-${xbl}-ordered.hdr"
      echo "${hfile}"
      cat .temp \
        | gawk -v fbl="${fbl}" \
            ' /^ *$/ { print; next; } \
              // { if ($(fbl) \!= p) \
                     { if (fbl<=3) { printf "%s ", $4; } \
                       if (fbl==1) { printf "%s ", $2; } \
                       printf "%s\n", $(fbl); \
                     } \
                   p = $(fbl); \
                 } \
            ' \
        | rotate-labels \
        > ${hfile}
    end
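
  The blank-line logic of the first gawk pass in the loop above may be
  easier to follow in Python. The sketch below mirrors that pass for
  rows in the parag-ordered.data layout (PARAG PAGE FOLIO BIFOLIO TYPE
  ORDER). It is an illustration only, and it differs in one respect: it
  tracks the first printed row instead of testing FNR > 1, so a leading
  label block does not produce a stray blank line. The sample rows and
  b-numbers are made up.

```python
def with_block_breaks(rows, fbl):
    """rows: [parag, page, folio, bifolio, type, order] lists, already sorted.
    fbl: 1=parag, 2=page, 3=folio, 4=bifolio, as in the csh loop."""
    out, prev_f, prev_b, first = [], None, None, True
    for row in rows:
        if row[4] != "par":                  # keep text paragraphs only
            continue
        # awk field $(fbl+1) is index fbl here: the next coarser block level
        if not first and fbl <= 2 and row[fbl] != prev_f:
            out.append("")
        if not first and fbl <= 3 and row[3] != prev_b:
            out.append("")                   # extra break between bifolios
        out.append(" ".join(row))
        prev_f, prev_b, first = row[fbl], row[3], False
    return out

# Made-up rows, for illustration only.
rows = [
    ["001", "f75r", "f75", "b1", "par", "01"],
    ["002", "f75r", "f75", "b1", "lbl", "01"],
    ["003", "f75r", "f75", "b1", "par", "01"],
    ["004", "f75v", "f75", "b1", "par", "02"],
    ["005", "f76r", "f76", "b2", "par", "03"],
]
print("\n".join(with_block_breaks(rows, 1)))
```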
  
  Now format the occurrence map, tabulating data by paragraph,
  page, folio, and bifolio:

    foreach xpt ( exact tight loose )
      set ifile = "bio-par-${xpt}.wmap"
      foreach xbl ( parag page folio bifolio )
        set tfile = "bio-parag-to-${xbl}-ordered.tbl"
        set hfile = "bio-parag-to-${xbl}-ordered.hdr"
        set ofile = "bio-par-${xpt}-${xbl}.fmap"
        echo "${ofile}"
        cat ${ifile} \
          | format-word-parag-map \
              -v title="bio word/${xbl} map - ${xpt} comparison" \
              -v blockTable="${tfile}" \
              -v blockHeadings="${hfile}" \
              -v nParags=112 \
              -v maxlen=11 \
              -v countWidth=1 \
              -v countSeparator=' ' \
              -v html=0 \
              -v totOnly=0 \
              -v mostBlocks=0.999 \
              -v showPattern=0 \
              -v showLineNumber=0 \
              -v showAbsCounts=1 \
              -v showRelCounts=0 \
              -v showAvgPos=1 \
          > ${ofile}
      end
    end