Hacking at the Voynich manuscript - Side notes
519 Creating per-page and per-section "best pick" text files.

Last edited on 1999-01-31 05:58:17 by stolfi

OBSOLETE - SEE Notes/045

1998-06-19 stolfi
=================

  [ First attempt on 1998-05-04, now redone with more care. ]

  This note describes the preparation of a "best pick" transcription
  of the VMs in EVA, split by page and section.

  The "best pick" version consists of running text only, extracted
  from Landini's interlinear file (as expanded and converted to EVA
  by Stolfi). The text contains line numbers and spaces as in the
  EVT format.

  For each text line, only one transcription (the most "trusted" one)
  is retained. The degree of trust is rather subjective: "U" (that
  is, Stolfi) is considered best, "V" (John Grove) next to best, ...,
  "F" (Friedman/First Study Group) is almost last, and "C" (Currier)
  is even worse.

  The resulting text is organized as one file per page
  (pages/FNUM.evt, where FNUM is the page's f-number) and also as one
  file per section (sections/TAG.evt, where TAG is a three-letter
  section tag).

  Other useful files produced here:

    pages/all.names      the f-numbers of all existing pages,
                         in natural reading order.

    sections/all.names   the tags of all existing sections,
                         in some nice order.

    sections/TAG.fnums   the f-numbers of all existing pages
                         in section TAG, in natural reading order.

  Gathering the text: "best pick" transcription, excluding labels,
  titles, radial lines, etc.

    cat L16-eva/UNITS \
      | gawk -v FS=':' \
          '($6 ~ /^(parags|starred-parags|circular-lines|itemized-lines)/){print $2;}' \
      > all.units

    cat `cat all.units | sed -e 's@^@L16-eva/@g'` \
      | best-pick \
          -v trcodes="UVZABENOPRSWXYKQLMRJITFGCD" \
      > all.evt

  Splitting the text into pages:

    mkdir pages
    /bin/rm -f pages/all.names pages/*.evt .foo
    cat all.evt \
      | split-pages \
          -v outdir=pages \
      > pages/all.names

  Collecting the list of pages in each section:

    mkdir sections
    /bin/rm -f sections/all.names sections/*.fnums sections/.foo
    set sectags = ( )
    foreach f ( \
        unknown.unk  pharma.pha   stars.str \
        herbal-A.hea herbal-B.heb bio.bio \
        astro.ast    cosmo.cos    zodiac.zod \
      )
      set tag = "${f:e}"
      set sec = "${f:r}"
      echo "${tag} = ${sec}"
      cat fnum-to-section.tbl \
        | grep -w ${sec} \
        | gawk '/./{print $1;}' \
        > sections/${tag}.fnums
      cat `cat sections/${tag}.fnums | sed -e 's@^\(.*\)$@pages/\1.evt@g'` \
        > sections/${tag}.evt
      echo ${tag} >> sections/all.names
      set sectags = ( ${sectags} ${tag} )
    end
    echo "sectags = ( ${sectags} )"

    dicio-wc sections/*.evt

      lines   words     bytes file
    ------- ------- --------- ------------
         31      62      2720 sections/ast.evt
        717    1434     52758 sections/bio.evt
         44      88      3172 sections/cos.evt
       1173    2353     68800 sections/hea.evt
        370     747     27558 sections/heb.evt
        225     450     18008 sections/pha.evt
       1077    2155     88583 sections/str.evt
        242     520     18643 sections/unk.evt
          6      13      1152 sections/zod.evt

  Let's list the pages in each section:

    ( cd pages && ls f*.evt ) \
      | sed -e 's/\.evt/ +/' \
      > /tmp/present.tbl

    /bin/rm -f pages-summary.txt
    foreach sec ( `cat sections/all.names` )
      echo "section ${sec}" \
        >> pages-summary.txt
      cat sections/${sec}.fnums \
        | map-field \
            -v table=/tmp/present.tbl \
            -v default='-' \
        | sed -e 's/[+] //' -e 's/- \(f[0-9vr]*\)/(\1)/' \
        | fmt -w 50 \
        | sed -e 's/^/ /' \
        >> pages-summary.txt
      echo " " \
        >> pages-summary.txt
    end

  Let's tabulate which transcribers were used in which sections:

    foreach utype ( pages sections )
      /bin/rm -f ${utype}/trcodes.tbl
      foreach f ( `cat ${utype}/all.names` )
        echo $f
        cat ${utype}/$f.evt \
          | words-and-trcodes-from-evt \
          | totalize-trcodes -v name=${f:r} \
          >> ${utype}/trcodes.tbl
      end
    end

  Let's format the tables. (Note that "99" probably means 100%.)
  First, find which transcribers are significant:

    cat sections/trcodes.tbl \
      | gawk \
          ' BEGIN{ split("",m); } \
            /./ { \
              for(i=0;i<=25;i++) { if($(i+2) > 0) { m[i] = 1; } } \
            } \
            END { \
              for(i=0;i<=25;i++) { if(i in m) { printf " %02d", i } } \
              printf "\n"; \
            } \
          ' \
      > .foo
    set trnums = `cat .foo`
    echo $trnums

    cat sections/trcodes.tbl \
      | gawk -v trnums="${trnums}" \
          ' BEGIN{ \
              split("",m); split(trnums,tr); \
              for(j in tr) { m[tr[j]+0] = 1; } \
            } \
            /./ { \
              printf "%-6s", $1; \
              for(i=0;i<=25;i++) \
                { if(i in m) \
                    { f = sprintf("%c(%02d%%)", i+65, $(i+2)); \
                      printf " %6s", ($(i+2) > 0 ? f : "") \
                    } \
                } \
              printf "\n"; \
            } \
          '

    ---------------------------------------------------------------
    unk    C(01%) F(97%)                             U(01%)
    pha           F(60%)        L(39%)
    str           F(47%) J(01%)               T(51%)
    hea           F(71%)        L(04%)               U(24%)
    heb           F(96%)                             U(03%)
    bio                                              U(05%) V(94%)
    ast    C(41%) F(43%)               R(14%)               V(01%)
    cos    C(18%) F(81%)
    zod    C(99%)
    ---------------------------------------------------------------

JUNK

    cat fnum-to-section.tbl \
      | sed -e 's/[?][?][?]/unknown/' \
      | map-field \
          -v table=fnum-to-seq.tbl \
          -v inField=1 -v outField=3 \
      | sort +2 -3n \
      | gawk '//{printf "%-7s %s\n", $1, $2}' \
      > .foo

    set sectags = ( )
    foreach f ( \
        unknown.unk  pharma.pha   stars.str \
        herbal-A.hea herbal-B.heb bio.bio \
        astro.ast    cosmo.cos    zodiac.zod \
      )
      set tag = "${f:e}"
      set sec = "${f:r}"
      echo "${tag} = ${sec}"
      cat fnum-to-section.tbl \
        | grep -w ${sec} \
        | gawk '/./{print $1;}' \
        | sort \
        > .foo
      diff sections/${tag}.fnums .foo
      set sectags = ( ${sectags} ${tag} )
    end
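
  For readers without the best-pick script, the selection rule stated
  at the top of this note (keep, for each text line, only the version
  from the most trusted transcriber) can be sketched as below. This is
  only an illustration in Python, not the script used here. It assumes
  that each input line starts with a locator of the form
  <LOCUS;T>, where T is a one-letter transcriber code. The function
  name, the regular expression, and the handling of the duplicated "R"
  in the trcodes string (first occurrence wins) are all assumptions.

```python
import re

# Priority string copied from the best-pick call in this note;
# a code that appears earlier is trusted more.
TRCODES = "UVZABENOPRSWXYKQLMRJITFGCD"

# ASSUMPTION: a text line looks like "<LOCUS;T> text", with T a single
# uppercase transcriber code. Lines that do not match are skipped.
LOCATOR = re.compile(r"^<([^;>]+);([A-Z])>\s*(.*)$")


def best_pick(lines, trcodes=TRCODES):
    """Keep one version per locus: the one whose code ranks first in trcodes."""
    rank = {}
    for position, code in enumerate(trcodes):
        # ASSUMPTION: if a code is listed twice, its first position counts.
        rank.setdefault(code, position)

    best = {}  # locus -> (rank, code, text), in order of first appearance
    for line in lines:
        match = LOCATOR.match(line)
        if match is None:
            continue
        locus, code, text = match.groups()
        if code not in rank:
            continue
        if locus not in best or rank[code] < best[locus][0]:
            best[locus] = (rank[code], code, text)

    return [f"<{locus};{code}> {text}" for locus, (_, code, text) in best.items()]


if __name__ == "__main__":
    # Invented sample lines, for illustration only.
    sample = [
        "<f1r.P1.1;C> aaa.bbb",
        "<f1r.P1.1;F> aaa.bbb.ccc",
        "<f1r.P1.2;F> ddd.eee",
        "<f1r.P1.2;U> ddd.eee.fff",
    ]
    for picked in best_pick(sample):
        print(picked)
```

  With the invented sample, the "F" version is kept for the first line
  (since "F" precedes "C" in trcodes) and the "U" version for the
  second.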