Hacking at the Voynich manuscript - Side notes
051 Detailed label occurrence map in the Biological section

Last edited on 1999-07-21 05:06:09 by stolfi

INTRODUCTION

  In this note we try to build a detailed label occurrence map for
  the labels of the biological section, with paragraphs as the unit
  of text.

PREPARING THE PARAGRAPH DATABASE

  First, we split the bio pages into paragraphs. The file name format
  will be NNN-FFFF.XXX, where NNN is a sequential paragraph number
  within the bio section, FFFF is the page's f-number, and XXX is an
  extension that distinguishes labels (".lbl") from text (".par").

    set biopages = ( \
      f75r f75v \
      f76r f76v \
      f77r f77v \
      f79r f79v \
      f78r f78v \
      f81r f81v \
      f80r f80v \
      f82r f82v \
      f83r f83v \
      f84r f84v \
    )

    echo $biopages \
      | tr ' ' '\012' \
      | sed -e 's/.*/<&/' \
      > .foo

    cat inter-cm.evt \
      | egrep -v '^[#]' \
      | egrep '[;]A>' \
      | fgrep -f .foo \
      > bio.evt

  Manually edited bio.evt, producing bio-edit.evt:

    Inserted separators of the form
      === f83v par ====*
    before each paragraph, and
      === f83v lbl ====*
    before each label block.

    Rearranged the paragraphs and label blocks in approximate
    reading order.

    Split the line into (probably labels) and (probably text).

    Rearranged the quire by moving bifolio f78+f81 to the center,
    inside f79+f80.

    Completed some of the "*"s of the majority version with a letter
    from one of the other versions.

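  The separator lines described above make the edited file easy to
  split mechanically. The split-parags script used in the next step is
  not listed in this note, so what follows is only a minimal sketch of
  one way such a splitter could work, not the actual script. It is
  written for a POSIX shell with awk rather than csh. The function name
  split_units, the output directory argument, and the choice to discard
  any lines before the first separator are all assumptions made for
  illustration; the output names follow the NNN-FFFF.XXX format
  described earlier.

```shell
# Hypothetical stand-in for a paragraph splitter; NOT the actual split-parags.
# Assumes separator lines start with "=== " followed by the page f-number
# and the unit type ("par" or "lbl"), e.g. "=== f83v par ====".
split_units() {
  # $1 = edited EVT file, $2 = output directory (both hypothetical arguments)
  mkdir -p "$2"
  awk -v dir="$2" '
    /^=== / {
      # A separator starts a new unit: close the previous output file
      # and build the next name as NNN-FFFF.XXX.
      if (out != "") close(out)
      n++
      out = sprintf("%s/%03d-%s.%s", dir, n, $2, $3)
      next
    }
    # Ordinary lines go to the current unit; lines that appear before
    # the first separator are dropped (an assumption of this sketch).
    out != "" { print > out }
  ' "$1"
}
```

  For example, "split_units bio-edit.evt evt-parags" would write one
  file per paragraph or label block into evt-parags, numbered in the
  order in which the separators occur.
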
  Splitting the paragraphs, and creating tables that map paragraph
  number to page f-number and paragraph type:

    mkdir evt-parags
    /bin/rm -f evt-parags/*
    cat bio-edit.evt \
      | split-parags

    dicio-wc parag-to*.tbl

PARAGRAPH CONCORDANCE FOR THE BIO SECTION

  Next we create a file containing one record per word occurrence,
  in the format

    WORD PATT1 PATT2 GNUM FNUM BNUM TYPE
    1    2     3     4    5    6    7

  where

    WORD   is a word occurring in the bio section,
    PATT1  is its equivalence class representative (tight similarity),
    PATT2  is its equivalence class representative (loose similarity),
    GNUM   is the number of the paragraph where the word occurs,
    FNUM   is the page's f-number,
    BNUM   is the bifolio's b-number,
    TYPE   is "par" for occurrence in text, "lbl" for occurrence as label.

  Here it goes:

    /bin/rm -f bio.roc
    foreach gnum ( `cd evt-parags && ls [0-9][0-9][0-9][0-9] | sort` )
      echo $gnum
      cat evt-parags/$gnum \
        | words-from-evt \
        | tr ' ' '\012' \
        | add-loose-pattern \
        | add-tight-pattern \
        | /n/gnu/bin/gawk \
            -v gnum="${gnum}" \
            '/./ { print $0, gnum; }' \
        >> bio.roc
    end

    dicio-wc bio.roc

      lines   words     bytes file
    ------- ------- --------- ------------
       6973   27892    155956 bio.roc

    cat bio.roc \
      | sort +0 -1 +3 -4 \
      | map-field \
          -v inField=4 \
          -v outField=5 \
          -v table=parag-to-fnum.tbl \
      | map-field \
          -v inField=5 \
          -v outField=6 \
          -v table=fnum-to-bifolio.tbl \
      | map-field \
          -v inField=4 \
          -v outField=7 \
          -v table=parag-to-type.tbl \
      > bio.woc

    dicio-wc bio.woc

      lines   words     bytes file
    ------- ------- --------- ------------
       6973   48811    246605 bio.woc

  Let's count the total number of word occurrences per paragraph:

    cat bio.woc \
      | gawk '/./{ p = $4; nw[p]++; } END{ for(p in nw){ print p, nw[p];} }' \
      | sort \
      > parag-to-size.tbl

PARAGRAPH INDEX FILE

  Let's join all data about each paragraph in a single file:

    cat parag-to-fnum.tbl \
      | gawk '/./{ fn = $2; gsub(/[rv]/, "", fn); print $1, $2, fn; }' \
      | map-field \
          -v inField=2 \
          -v outField=4 \
          -v table=fnum-to-bifolio.tbl \
      | map-field \
          -v inField=1 \
          -v outField=5 \
          -v table=parag-to-type.tbl \
      > parag.data

PREPARING THE OCCURRENCE MAP

  First, let's find the maximum length of words and patterns:

    foreach fld ( 1 2 3 )
      echo "field $fld"
      cat bio.woc \
        | /n/gnu/bin/gawk -v fld=$fld \
            '/./{print $(fld);}' \
        | count-word-lengths
    end

    field 1
    len nwords example
    --- ------ ------------------
      1     74 ?
      2    487 al
      3    790 ady
      4   1044 aiin
      5   1885 aiiin
      6   1566 acthey
      7    840 alchedy
      8    218 cfhdarol
      9     56 checphedy
     10     11 darcheedal
     11      2 chlchpsheey

    field 2
    len nwords example
    --- ------ ------------------
      1     85 ?
      2    610 ol
      3    779 odo
      4   1262 oiin
      5   2117 oiiin
      6   1404 octheo
      7    499 olchedo
      8    161 cthdorol
      9     46 checthedo
     10      8 dorcheedol
     11      2 chlchtsheeo

    field 3
    len nwords example
    --- ------ ------------------
      1     85 ?
      2    610 ol
      3    835 odo
      4   1394 oino
      5   2126 oloeo
      6   1237 oeteeo
      7    489 oleeedo
      8    145 etedoeol
      9     42 eeeeteedo
     10      8 doeeeeedol
     11      2 eeleeteeeeo

  Next we tabulate the occurrences of each word per paragraph,
  considering only text occurrences:

    foreach patt ( exact.1 tight.2 loose.3 )
      set xpt = ${patt:r}
      set fpt = ${patt:e}
      set ofile = "bio-par-${xpt}.wmap"
      echo "${ofile}"
      cat bio.woc \
        | /n/gnu/bin/gawk \
            -v fpt=${fpt} \
            '($7 == "par"){ print $(fpt), $1, ($7 == "par" ? "-" : "+"), $4; }' \
        | sort -b +0 -3 +3 -4 \
        | make-word-parag-map \
            -v nParags=112 \
            -v omitSingles=1 \
        > ${ofile}
    end

    dicio-wc bio-par-{exact,tight,loose}.wmap

      lines   words     bytes file
    ------- ------- --------- ------------
       1015   58928    182041 bio-par-exact.wmap
       1549  133632    412631 bio-par-tight.wmap
       1657  156600    483474 bio-par-loose.wmap

  Created by hand a table "bio-fnum-to-order.tbl" that maps original
  page f-numbers to conjectured page reading order (2 digits). The
  order was chosen after a couple of iterations of what follows.

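  The map-field filter is used throughout this note with tables such
  as "bio-fnum-to-order.tbl", but its source is not listed here. As a
  rough illustration only, the sketch below shows one plausible
  behavior inferred from the way it is called: read a table of
  "KEY VALUE" lines, look up field inField of each record, and store
  the result in field outField. It is written for a POSIX shell with
  awk rather than csh. The name map_field_sketch, the positional
  arguments, the two-column table layout, and the "?" written for keys
  missing from the table are all assumptions, not documented behavior
  of the real script.

```shell
# Hypothetical sketch of a map-field-like filter; NOT the actual map-field.
# Assumes the table file holds one "KEY VALUE" pair per line.
map_field_sketch() {
  # $1 = inField, $2 = outField, $3 = table file
  awk -v inField="$1" -v outField="$2" -v table="$3" '
    BEGIN {
      # Load the whole lookup table into memory before reading records.
      while ((getline line < table) > 0) {
        n = split(line, kv, " ")
        if (n >= 2) map[kv[1]] = kv[2]
      }
      close(table)
    }
    /./ {
      key = $(inField)
      # Unknown keys are marked with "?" (an assumption of this sketch).
      $(outField) = (key in map) ? map[key] : "?"
      print
    }
  '
}
```

  With a table containing the line "f78r 02", the record
  "001 f78r 78 b1 par" piped through "map_field_sketch 2 6 TABLE"
  comes out as "001 f78r 78 b1 par 02", which is the shape of record
  that a sort on the sixth field expects.
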
  Now create tables that map text paragraphs to blocks (page, folio,
  bifolio) in the conjectured order, with blank lines at the proper
  places:

    cat parag.data \
      | map-field \
          -v inField=2 \
          -v outField=6 \
          -v table=bio-fnum-to-order.tbl \
      | sort +5 -6 +0 -1 \
      > parag-ordered.data

    foreach block ( parag.1 page.2 folio.3 bifolio.4 )
      set xbl = ${block:r}
      set fbl = ${block:e}
      cat parag-ordered.data \
        | gawk -v fbl="${fbl}" \
            ' ($5 != "par"){ next; } \
              /./{ \
                if((FNR>1)&&(fbl<=2)&&($(fbl+1)!=f)){ printf "\n"; } \
                if((FNR>1)&&(fbl<=3)&&($4!=b)){ printf "\n"; } \
                print; f=$(fbl+1); b=$4 \
              } \
            ' \
        > .temp

      set tfile = "bio-parag-to-${xbl}-ordered.tbl"
      echo "${tfile}"
      cat .temp \
        | gawk -v fbl="${fbl}" \
            ' /^ *$/ { print; next; } // { print $1, $(fbl); }' \
        > ${tfile}

      set hfile = "bio-parag-to-${xbl}-ordered.hdr"
      echo "${hfile}"
      cat .temp \
        | gawk -v fbl="${fbl}" \
            ' /^ *$/ { print; next; } \
              // { if ($(fbl) \!= p) \
                     { if (fbl<=3) { printf "%s ", $4; } \
                       if (fbl==1) { printf "%s ", $2; } \
                       printf "%s\n", $(fbl); \
                     } \
                   p = $(fbl); \
                 } \
            ' \
        | rotate-labels \
        > ${hfile}
    end

  Now format the occurrence map, tabulating data by paragraph, page,
  folio, and bifolio:

    foreach xpt ( exact tight loose )
      set ifile = "bio-par-${xpt}.wmap"
      foreach xbl ( parag page folio bifolio )
        set tfile = "bio-parag-to-${xbl}-ordered.tbl"
        set hfile = "bio-parag-to-${xbl}-ordered.hdr"
        set ofile = "bio-par-${xpt}-${xbl}.fmap"
        echo "${ofile}"
        cat bio-par-${xpt}.wmap \
          | format-word-parag-map \
              -v title="bio word/${xbl} map - ${xpt} comparison" \
              -v blockTable="${tfile}" \
              -v blockHeadings="${hfile}" \
              -v nParags=112 \
              -v maxlen=11 \
              -v countWidth=1 \
              -v countSeparator=' ' \
              -v html=0 \
              -v totOnly=0 \
              -v mostBlocks=0.999 \
              -v showPattern=0 \
              -v showLineNumber=0 \
              -v showAbsCounts=1 \
              -v showRelCounts=0 \
              -v showAvgPos=1 \
          > ${ofile}
      end
    end