Hacking at the Voynich manuscript - Side notes
060 Complete and strict label concordance

Last edited on 2004-07-15 17:32:39 by stolfi

INTRODUCTION

  This note attempts to create a concordance for all figure labels
  that is strict (i.e. without any letter-mapping) and complete
  (i.e. looks for all labels in the whole text).

SETTING UP THE ENVIRONMENT

    ln -s ../../L16+H-eva
    ln -s ../037/vms-17.roc.gz
    ln -s /home/staff/stolfi/voynich/work
    ln -s /home/staff/stolfi/projects/langbank
    ln -s work/basify-weirdos

OBTAINING THE LABEL UNITS LIST

  Extracting the list of units that contain labels, from the
  current (internal) version of the interlinear:

    cat L16+H-eva/INDEX \
      | grep ':labels:' \
      > label-units.idx

  Now we prepare a file in the following format:

    UNUM PNUM STAG FNUM UTAG LINE VTAG LABEL UCMT
    1    2    3    4    5    6    7    8     9

  where

    UNUM  is the sequential number of the text unit, e.g. "0352",
          as in the INDEX file.

    PNUM  is the reading-order number of the logical page,
          e.g. "p123", as in the INDEX file.

    STAG  is the major section tag, e.g. "hea" or "pha".

    FNUM  is the standard name of the text unit, e.g. "f69v".

    UTAG  is a tag identifying the text unit in the page, e.g. "L".

    LINE  is the line number, e.g. "20" or "0a".

    VTAG  is a letter identifying the text version (transcriber),
          e.g. "F".

    LABEL is the label, with "." between words, free from fillers,
          line/paragraph delimiters, and inline comments, and with
          weirdos replaced by "*". Each line of a multiline label
          is treated as a separate label. Variant spaces "," and
          "-" are mapped to ".".

    UCMT  is the comment field that applies to this entire text
          unit, taken from the INDEX file.

  The fields FNUM, UTAG, VTAG, and LINE use the same convention as
  in the interlinear file.
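  The LABEL clean-up rules listed above can be sketched as a small
  awk filter. This is only an illustration, not the actual
  extract-lines-as-labels script, which is not shown in this note.
  The function name clean_label is hypothetical, and the sketch
  assumes (without confirmation from this note) that fillers are
  "!" and "%", that "=" is the paragraph delimiter, and that inline
  comments are enclosed in "{...}". It is written for a POSIX shell
  rather than csh. Weirdos are assumed to have been turned into "*"
  already by basify-weirdos.

```shell
# Hypothetical sketch of the label clean-up; reads raw label text on stdin.
clean_label() {
  awk '{
    s = $0
    gsub(/[{][^}]*[}]/, "", s)   # drop inline comments (assumed "{...}")
    gsub(/[!%=]/, "", s)         # drop fillers and paragraph delimiter (assumed)
    gsub(/[,-]/, ".", s)         # map variant spaces "," and "-" to "."
    gsub(/[.]+/, ".", s)         # collapse runs of word separators
    sub(/^[.]/, "", s)           # trim a leading separator
    sub(/[.]$/, "", s)           # trim a trailing separator
    print s
  }'
}
```

  For example, feeding the made-up text "okeey,dar-!!{note}=" to
  clean_label yields "okeey.dar".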
    /bin/rm -f all.lbs
    foreach f ( `cat label-units.idx | tr ' ' '_'` )
      set fld = ( `echo "$f" | tr ':' ' '` )
      set file = "L16+H-eva/${fld[2]}"
      echo "${file}"
      cat ${file} \
        | basify-weirdos \
        | extract-lines-as-labels \
            -v unum="${fld[1]}" \
            -v unit="${fld[2]}" \
            -v pnum="${fld[7]}" \
            -v ucmt="${fld[8]}" \
        | map-field \
            -v table=L16+H-eva/fnum-to-sectag.tbl \
            -v inField=3 -v outField=3 \
            -v default='???' \
        >> all.lbs
    end

  Make a table that maps all labels to "+":

    cat all.lbs \
      | gawk '//{ print $8, "+"; }' \
      | egrep -v -e '[*]' \
      | sort \
      | uniq \
      > mark-labels.tbl

  Extract the label occurrences from the previously built
  concordance, even though it is slightly out of date. This must be
  redone when the new interlinear is released.

    set maxlen = 17
    zcat vms-${maxlen}.roc.gz \
      | sort -b +5 -6 \
      | map-field \
          -v table=mark-labels.tbl \
          -v inField=6 -v outField=8 \
          -v default="-" \
      | gawk '($8 == "+"){ $8 = ""; print; }' \
      > labels.ocs

  The format of this file is

    LOC VTAGS START LENGTH LCTX STRING RCTX
    1   2     3     4      5    6      7

  We now add some information, creating a file in the following
  format:

    FNUM UTAG LINE VTAGS START LENGTH LCTX STRING RCTX DEF UNUM UCMT PNUM STAG
    1    2    3    4     5     6      7    8      9    10  11   12   13   14

  where

    VTAGS  is a lump of version (transcriber) codes.

    START and LENGTH
           are the position of the occurrence in the line.

    LCTX and RCTX
           are the left and right contexts (from the majority
           version).

    DEF    is "+" if this occurrence is a label, "-" otherwise.
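  The map-field script is used throughout this note but is not
  listed here. Judging only from how it is called, it appears to
  look up field inField of each record in a two-column table and
  store the result in field outField, falling back to a default
  value when the key is absent. A minimal sketch of that assumed
  behaviour, for a POSIX shell with plain awk, follows. The function
  name map_field and the variable name dflt are hypothetical; the
  latter is used because current gawk versions may reject "default"
  as a variable name.

```shell
# Hypothetical stand-in for map-field (assumed behaviour, not the real script).
# Usage: map_field TABLE INFIELD OUTFIELD DEFAULT < records
map_field() {
  awk -v table="$1" -v inField="$2" -v outField="$3" -v dflt="$4" '
    BEGIN {
      # Load the two-column table: key in column 1, value in column 2.
      while ((getline line < table) > 0) {
        split(line, kv, " ")
        map[kv[1]] = kv[2]
      }
      close(table)
    }
    {
      # Look up the input field and write the result to the output field.
      key = $inField
      $outField = (key in map) ? map[key] : dflt
      print
    }
  '
}
```

  With a table containing the single line "okal +", the made-up
  record "a b okal" passed through "map_field TABLE 3 4 -" gains a
  fourth field "+", while "c d qokedy" gains "-".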
  For that we need a table that maps the names of label-containing
  units to "+":

    cat label-units.idx \
      | gawk -v FS=':' '// { print $2, "+"; }' \
      | sort \
      > mark-label-units.tbl

  We also need a table that maps units to their comments (with
  blanks mapped to "_"):

    cat L16+H-eva/INDEX \
      | tr ' ' '_' \
      | gawk -v FS=':' '//{print $2, $8; }' \
      > unit-to-ucmt.tbl

  Now we expand the label occurrence file:

    cat labels.ocs \
      | gawk '//{ $1 = gensub(/[.]([A-Za-z0-9]*)$/, " \\1", "g", $1); print; }' \
      | map-field \
          -v table=mark-label-units.tbl \
          -v inField=1 -v outField=9 -v default='-' \
      | map-field \
          -v table=L16+H-eva/unit-to-useq.tbl \
          -v inField=1 -v outField=10 -v default='????' \
      | map-field \
          -v table=unit-to-ucmt.tbl \
          -v inField=1 -v outField=11 -v default='_' \
      | gawk '//{ $1 = gensub(/[.]([A-Za-z0-9]*)$/, " \\1", "g", $1); print; }' \
      | map-field \
          -v table=L16+H-eva/fnum-to-pnum.tbl \
          -v inField=1 -v outField=13 -v default='p???' \
      | map-field \
          -v table=L16+H-eva/fnum-to-sectag.tbl \
          -v inField=1 -v outField=14 -v default='???' \
      > labels.xoc

  Sort by string, unit number, and line, and format the concordance:

    cat labels.xoc \
      | sort -b +7 -8 +10 -11 +2 -3n +4 -5n +5 -6n +3 -4 \
      | format-label-concordance \
      > labels-conc.html
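  The sort keys in the last command use the obsolete zero-based
  "+POS -POS" form, which not every sort implementation accepts. As
  a sketch, the same ordering can be written with one-based POSIX
  "-k" keys: STRING (field 8), UNUM (field 11), then LINE, START,
  and LENGTH (fields 3, 5, 6) numerically, then VTAGS (field 4). The
  function name sort_concordance is hypothetical, and the function
  is written for a POSIX shell rather than csh.

```shell
# Hypothetical wrapper: same key order as "+7 -8 +10 -11 +2 -3n +4 -5n +5 -6n +3 -4".
# Reads 14-field labels.xoc records on stdin.
sort_concordance() {
  sort -b \
    -k8,8 \
    -k11,11 \
    -k3,3n \
    -k5,5n \
    -k6,6n \
    -k4,4
}
```

  It would replace the sort stage of the pipeline, e.g.
  "cat labels.xoc | sort_concordance | format-label-concordance".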