Hacking at the Voynich manuscript - Side notes
519 Creating per-page and per-section "best pick" text files.

Last edited on 1999-01-31 05:58:17 by stolfi

OBSOLETE - SEE Notes/045
 
1998-06-19 stolfi
=================

[ First attempt on 1998-05-04, now redone with more care. ]

This note describes the preparation of a "best pick" transcription
of the VMs in EVA, split by page and section.  

The "best pick" version consists of running text only, extracted from
Landini's interlinear file (as expanded and converted to EVA by Stolfi).
The text contains line numbers and spaces as in the EVT format.

For each text line, only one transcription (the most "trusted" one) is
retained.  The degree of trust is rather subjective: "U" (that is, Stolfi) is
considered best, "V" (John Grove) next to best, and so on; "F" (Friedman/First
Study Group) is almost last, and "C" (Currier) is even worse.  The full
ranking is the "trcodes" string passed to best-pick below.

The resulting text is organized as one file per page (pages/FNUM.evt,
where FNUM is the page's f-number) and also as one file per section
(sections/TAG.evt, where TAG is a three-letter section tag).  

Other useful files produced here:

  pages/all.names     the f-numbers of all existing pages, 
                      in natural reading order.
                      
  sections/all.names  the tags of all existing sections,
                      in some nice order
  
  sections/TAG.fnums  the f-numbers of all existing pages in 
                      section TAG, in natural reading order

Gathering the text: "best pick" transcription, excluding labels,
titles, radial lines, etc.

  cat L16-eva/UNITS \
    | gawk -v FS=':' \
        '($6 ~ /^(parags|starred-parags|circular-lines|itemized-lines)/){print $2;}' \
    > all.units

  cat `cat all.units | sed -e 's@^@L16-eva/@g'` \
    | best-pick \
        -v trcodes="UVZABENOPRSWXYKQLMRJITFGCD" \
    > all.evt
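
The best-pick script itself is not reproduced in this note.  As a rough
illustration of the selection rule, here is a minimal sketch in POSIX sh
and awk (the commands in this note are csh and gawk).  It is not the
original script.  The function name best_pick_sketch is hypothetical, and
the sketch assumes that every text line begins with a locator of the form
<PAGE.UNIT.LINE;X>, where X is a one-letter transcriber code; lines that
do not match are simply dropped.

```shell
# Hypothetical stand-in for "best-pick"; NOT the original script.
# Assumption: text lines start with "<PAGE.UNIT.LINE;X>", X = transcriber code.
# Usage: best_pick_sketch TRCODES < interlinear.evt
best_pick_sketch() {
  awk -v trcodes="$1" '
    /^<[^;>]*;[A-Z]>/ {
      semi = index($0, ";")
      loc  = substr($0, 2, semi - 2)     # locator without the code
      code = substr($0, semi + 1, 1)     # transcriber letter
      rank = index(trcodes, code)        # 1 = most trusted, 0 = not listed
      if (rank == 0) next
      if (!(loc in best)) order[++n] = loc   # remember first-seen order
      if (!(loc in best) || rank < best[loc]) {
        best[loc] = rank
        line[loc] = $0
      }
    }
    END { for (i = 1; i <= n; i++) print line[order[i]] }
  '
}
```

Given two versions of the same line, one by "F" and one by "U", and the
ranking "UVFC", only the "U" version survives; lines keep the order in
which their locators first appear.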

Splitting the text into pages:

  mkdir pages

  /bin/rm -f pages/all.names pages/*.evt .foo
  cat all.evt \
    | split-pages \
       -v outdir=pages \
    > pages/all.names
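
The split-pages script is not shown here either.  Judging from how it is
called, it writes one file per page into the "outdir" directory and prints
the f-numbers to standard output in order of appearance.  A minimal sketch
of that behavior, in POSIX sh and awk, follows.  The name
split_pages_sketch is hypothetical, and the sketch assumes that the
f-number is the part of the locator before its first period.

```shell
# Hypothetical stand-in for "split-pages"; NOT the original script.
# Assumption: lines start with "<FNUM.", e.g. "<f1r.P.1;U>".
# Usage: split_pages_sketch OUTDIR < all.evt > OUTDIR/all.names
split_pages_sketch() {
  awk -v outdir="$1" '
    /^<[^.;>]+\./ {
      fnum = substr($0, 2, index($0, ".") - 2)
      if (fnum != cur) {
        if (cur != "") close(out)        # keep at most one file open
        cur = fnum
        out = outdir "/" fnum ".evt"
        if (!(fnum in seen)) {
          seen[fnum] = 1
          print fnum                     # page list goes to stdout
          printf "" > out                # create or truncate the page file
          close(out)
        }
      }
      print >> out                       # append the line to its page file
    }
  '
}
```

Closing each file when the page changes keeps the sketch within the
open-file limits of awk implementations other than gawk.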

Collecting the list of pages in each section:
  
  mkdir sections
  
  /bin/rm -f sections/all.names sections/*.fnums sections/.foo
  set sectags = ( )
  foreach f ( \
      unknown.unk pharma.pha stars.str \
      herbal-A.hea herbal-B.heb bio.bio \
      astro.ast cosmo.cos zodiac.zod \
    )
    set tag = "${f:e}"
    set sec = "${f:r}"
    echo "${tag} = ${sec}"
    cat fnum-to-section.tbl \
      | grep -w ${sec}  \
      | gawk '/./{print $1;}' \
      > sections/${tag}.fnums
    cat `cat sections/${tag}.fnums | sed -e 's@^\(.*\)$@pages/\1.evt@g'` \
      > sections/${tag}.evt
    echo ${tag} >> sections/all.names
    set sectags = ( ${sectags} ${tag} )
  end 
  echo "sectags = ( ${sectags} )"
  dicio-wc sections/*.evt

      lines   words     bytes file        
    ------- ------- --------- ------------
         31      62      2720 sections/ast.evt
        717    1434     52758 sections/bio.evt
         44      88      3172 sections/cos.evt
       1173    2353     68800 sections/hea.evt
        370     747     27558 sections/heb.evt
        225     450     18008 sections/pha.evt
       1077    2155     88583 sections/str.evt
        242     520     18643 sections/unk.evt
          6      13      1152 sections/zod.evt
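
In the loop above, each list entry packs a section name and its tag as
NAME.TAG; the csh modifiers ${f:r} and ${f:e} extract the root and the
extension.  For readers more used to Bourne-style shells, the same split
can be sketched as below.  The helper section_fnums_sketch is
hypothetical; it assumes that fnum-to-section.tbl has the f-number in
field 1 and the section name in field 2, and it uses an exact field match
in place of "grep -w".

```shell
# Sketch only; assumes table rows of the form "FNUM SECTION".
# Usage: section_fnums_sketch TABLE SECTION
section_fnums_sketch() {
  awk -v sec="$2" '$2 == sec { print $1 }' "$1"
}

# sh equivalents of the csh modifiers ${f:r} and ${f:e}
for f in unknown.unk herbal-A.hea; do
  sec=${f%.*}      # root:      "herbal-A"
  tag=${f##*.}     # extension: "hea"
  echo "${tag} = ${sec}"
done
```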

Let's list the pages in each section, showing in parentheses the pages
that have no text file:

  ( cd pages && ls f*.evt ) \
    | sed -e 's/\.evt/ +/' \
    > /tmp/present.tbl

  /bin/rm -f pages-summary.txt
  foreach sec ( `cat sections/all.names` )
    echo "section ${sec}" \
      >> pages-summary.txt
    cat sections/${sec}.fnums \
      | map-field \
          -v table=/tmp/present.tbl \
          -v default='-' \
      | sed -e 's/[+] //' -e 's/- \(f[0-9vr]*\)/(\1)/' \
      | fmt -w 50 \
      | sed -e 's/^/  /' \
      >> pages-summary.txt
    echo " " \
      >> pages-summary.txt
  end
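
The map-field script is not included in this note.  From the sed commands
that follow it, it appears to look up the first field of each input line
in the given table and to prepend the value found (or the default) as a
new first field.  A minimal sketch under that assumption, in POSIX sh and
awk, with the hypothetical name map_field_sketch:

```shell
# Hypothetical stand-in for "map-field"; NOT the original script.
# Assumption: TABLE rows are "KEY VALUE"; output is "VALUE original-line".
# Usage: map_field_sketch TABLE DEFAULT < keys
map_field_sketch() {
  awk -v table="$1" -v dflt="$2" '
    BEGIN {
      # Load the lookup table into memory.
      while ((getline row < table) > 0) {
        split(row, f, " ")
        map[f[1]] = f[2]
      }
      close(table)
    }
    { print (($1 in map) ? map[$1] : dflt), $0 }
  '
}
```

With a table that marks existing pages by "+" and a default of "-", a
page present in the table comes out as "+ FNUM" and a missing one as
"- FNUM", which the sed step then turns into "FNUM" and "(FNUM)".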
  
  

Let's tabulate which transcribers were used in which sections:

  foreach utype ( pages sections )
    /bin/rm -f ${utype}/trcodes.tbl
    foreach f ( `cat ${utype}/all.names` )
      echo $f
      cat ${utype}/$f.evt \
        | words-and-trcodes-from-evt \
        | totalize-trcodes -v name=${f:r} \
        >> ${utype}/trcodes.tbl
    end
  end
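
The two filters in this loop are not listed here.  The formatting code
further on reads each row of trcodes.tbl as a name followed by 26
percentages, one per transcriber code "A" to "Z".  A minimal sketch of a
tally that produces such a row, in POSIX sh and awk, is given below.  The
name totalize_trcodes_sketch is hypothetical, and the input layout (one
"WORD CODE" pair per line) is only a guess at what
words-and-trcodes-from-evt emits.

```shell
# Hypothetical stand-in for "totalize-trcodes"; NOT the original script.
# Assumption: input lines are "WORD CODE", CODE being one letter A-Z.
# Usage: totalize_trcodes_sketch NAME < pairs
totalize_trcodes_sketch() {
  awk -v name="$1" '
    NF >= 2 && $2 ~ /^[A-Z]$/ { count[$2]++; total++ }
    END {
      letters = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
      printf "%s", name
      for (i = 1; i <= 26; i++) {
        c = substr(letters, i, 1)
        # Truncated integer percentage of words credited to code c.
        printf " %d", (total > 0 ? int(100 * count[c] / total) : 0)
      }
      printf "\n"
    }
  '
}
```

This sketch truncates percentages and can print 100, so it does not
reproduce whatever rule makes the original tables top out at 99.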

Let's format the tables.  (Note that "99" probably means 100%.)

First, find which transcribers are significant:

  cat sections/trcodes.tbl \
    | gawk \
        ' BEGIN{ split("",m); } \
          /./ { \
            for(i=0;i<=25;i++) { if($(i+2) > 0) { m[i] = 1; } } \
          } \
          END { \
            for(i=0;i<=25;i++) { if(i in m) { printf " %02d", i } } \
            printf "\n"; \
          } \
        ' \
    > .foo
  set trnums = `cat .foo`
  echo $trnums

  cat sections/trcodes.tbl \
    | gawk -v trnums="${trnums}" \
        ' BEGIN{ \
             split("",m); split(trnums,tr); \
             for(j in tr) { m[tr[j]+0] = 1; } \
          } \
          /./ { \
            printf "%-6s", $1; \
            for(i=0;i<=25;i++) \
              { if(i in m) \
                  { f = sprintf("%c(%02d%%)", i+65, $(i+2)); \
                    printf " %6s", ($(i+2) > 0 ? f : "") \
                  } \
              } \
            printf "\n"; \
          } \
        ' 

    ---------------------------------------------------------------
    unk    C(01%) F(97%)                             U(01%)       
    pha           F(60%)        L(39%)                            
    str           F(47%) J(01%)               T(51%)              
    hea           F(71%)        L(04%)               U(24%)       
    heb           F(96%)                             U(03%)       
    bio                                              U(05%) V(94%)
    ast    C(41%) F(43%)               R(14%)               V(01%)
    cos    C(18%) F(81%)                                          
    zod    C(99%)                                                 
    ---------------------------------------------------------------

JUNK

  cat fnum-to-section.tbl \
    | sed -e 's/[?][?][?]/unknown/' \
    | map-field \
        -v table=fnum-to-seq.tbl \
        -v inField=1 -v outField=3 \
    | sort -k3,3n \
    | gawk '//{printf "%-7s %s\n", $1, $2}' \
    > .foo
    
  set sectags = ( )
  foreach f ( \
      unknown.unk pharma.pha stars.str \
      herbal-A.hea herbal-B.heb bio.bio \
      astro.ast cosmo.cos zodiac.zod \
    )
    set tag = "${f:e}"
    set sec = "${f:r}"
    echo "${tag} = ${sec}"
    cat fnum-to-section.tbl \
      | grep -w ${sec}  \
      | gawk '/./{print $1;}' \
      | sort \
      > .foo
    diff sections/${tag}.fnums .foo
    set sectags = ( ${sectags} ${tag} )
  end