Hacking at the Voynich manuscript - Side notes
014 Merging John Grove's list of labels into the interlinear file.

Last edited on 1999-07-28 01:51:27 by stolfi

Got from John Grove's site the big label table in HTML format,
docs/JohnGroveLabels.html. Cleaned it up and partially converted it
to my own label database format. See labels-grove.idx.

Used "V" as the transcriber code for John Grove. Also added the "C"
category for "containers", and extended category "P" to include
"roots" (both in the pharma section).

Many of the labels in John's list had only a page location, not a
label location. I was able to locate many of them by comparing the
labels against my old list. For the rest, I used temporary text unit
codes ".X", ".Y", ".Z".

I created a new index file L16-eva/INDEX.N with the new units. I also
inserted a new "reading order" field in front of all other fields,
specifying the position of the text unit in the presumed reading
order. This field can be used for sorting the index file. The unit
files themselves (L16-eva/f*) haven't been updated yet.

At some point I merged Grove's labels with my previous label and
title file (labtit-old.idx), producing labtit-new.idx. I changed the
format of the latter to make it easier to process by computer:
specifically, I split the location code into four fields (page
f-number, text unit, line, and transcriber code). I also added an
alternate spelling field (for Grove's transcriptions in extended EVA)
and moved the section field to before the page field.

I also added the reading order field in front of the record:

    cat labtit-new.idx \
      | egrep '.' \
      | tr ' |' '_ ' \
      | gawk '/./{$1=($3 "." $4);print;}' \
      | map-field \
          -v inField=1 \
          -v outField=12 \
          -v table="../../L16-eva/unit-to-useq.tbl" \
      | gawk '/./{$1=$12;NF=11;print;}' \
      | tr '_ ' ' |' \
      > labtit-new.idx@

Checking for blunders:

    cat labtit-new.idx \
      | gawk -v FS='|' -v OFS='|' '/./{$1="";print;}' \
      > .foo
    cat labtit-new.idx@ \
      | gawk -v FS='|' -v OFS='|' '/./{$1="";print;}' \
      > .bar
    diff .foo .bar

If OK:

    mv labtit-new.idx labtit-new.idx~~
    mv labtit-new.idx@ labtit-new.idx

Checking the label categories:

    cat labtit-new.idx \
      | gawk -v FS="|" '/./{print ($9 ":" $10);}' \
      | sort | uniq -c | expand

        138 ?:?
         75 A:?
         14 A:planet?
         92 A:star
          4 A:star?
          5 A:title?
         18 B:duct?
          5 B:flow?
        125 B:nymph?
         12 B:organ?
         45 P:container
          3 P:container?
        274 P:plant
         58 P:plant?
         47 P:root
          3 P:root?
         30 T:item?
         66 T:title?
        467 Z:day?

Tabulating the transcriber codes:

    cat labtit-new.idx \
      | gawk -v FS="|" '/./{print $6;}' \
      | sort | uniq -c | expand

          1 B
        166 C
         56 F
          2 G
        211 K
         82 L
          1 P
          3 Q
         25 R
          1 T
         31 U
        895 V
          7 Z

Checking the longest label and longest title:

    cat labtit-new.idx \
      | gawk -v FS="|" '($10 ~ /title/) { print; }' \
      | longest-field -v FS='|' -v field=7

        35 *qor.cheol.**eol.cholaiin.chol.qkar

    cat labtit-new.idx \
      | gawk -v FS="|" '($10 !~ /title/) { print; }' \
      | longest-field -v FS='|' -v field=7

        23 oteoeeyd*.otal.okeal.ar

Checking the longest definition:

    cat labtit-new.idx | longest-field -v FS='|' -v field=10

        10 container?

I also wrote a script labtit-idx-to-html that converts labels and
titles from my "idx" format to HTML.

    cat labtit-new.idx \
      | gawk -v FS="|" '($10 ~ /title/) { print; }' \
      > .titles.idx

    cat labtit-new.idx \
      | gawk -v FS="|" '($10 !~ /title/) { print; }' \
      > .labels.idx

    foreach f ( titles labels )
      set maxlen = "`cat .${f}.idx | longest-field -v FS='|' -v field=7 | cut -d' ' -f 1`"
      echo "${f}: maxlen = ${maxlen}"
      cat .${f}.idx \
        | labtit-idx-to-html \
            -maxlen ${maxlen} \
            -title "Collected ${f}, text order" \
        > ${f}-t.html
    end

    foreach f ( titles labels )
      set maxlen = "`cat .${f}.idx | longest-field -v FS='|' -v field=7 | cut -d' ' -f 1`"
      echo "${f}: maxlen = ${maxlen}"
      cat .${f}.idx \
        | sort -t'|' +6 -7 +0 -1n +4 -5n \
        | labtit-idx-to-html \
            -maxlen ${maxlen} \
            -title "Collected ${f}, alphabetic order" \
        > ${f}-a.html
    end

98-02-02 stolfi
===============

Started adding Grove's labels to the L16-eva/* files:

    mkdir labtit-evt

    cat labtit-new.idx \
      | gawk -v FS='|' \
          ' BEGIN { ouni = ""; } \
            /./ { \
              uni = ($3 "." $4); lin = $5; trn = $6; \
              loc = ("<" uni "." lin ";" trn ">"); \
              cmt = $10; \
              if ($11 != "-") { cmt = ($10 " - " $11); } \
              if (uni != ouni ) \
                { if (ouni != "" ) close(ufile); \
                  ufile = ("labtit-evt/" uni); ouni = uni; \
                } \
              printf "# %s\n", cmt > ufile; \
              printf "%-19s%s=\n", loc, $7 > ufile; \
            } \
          ' \
      > .foo
    cat .foo

Manually merged those files into L16-eva/f*.

Note that there are several units (with names f*.[XYZ]*) that were
stolen from Grove's page, and need to be relabeled and have their
lines renumbered.

Also there are some zodiac units (some of those named f*.S) that
should be split into two or three units (inner, outer, not in circle)
for consistency with other zodiac pages.

1998-05-03 stolfi
=================

John Grove sent me a transcription of f15r with labels. I placed the
labels in a new unit, f75r.L.

Someday I must create updated versions of the label index and tables
above.

1998-05-05 stolfi
=================

John Grove posted his updated transcriptions of bio pages f75r--f79r,
including some label sets.
I merged them into my L16-eva/f* files, fixing a few of Friedman's
line breaks in the process. The line numbers in f77r.X and f78r.X
have changed.

Updated labtit-new.idx to mirror the changes in L16-eva/f*.
Regenerated the ".html" lists.

1998-07-20 stolfi
=================

John Grove sent me a new transcription of f57v. I edited it into
L16-eva/f57v.*, updated the file labtit-new.idx, and regenerated the
".html" file.

1998-07-23 stolfi
=================

[ redone 1998-08-20 stolfi ]

Let's create a "best pick" list of labels:

    cat labtit-new.idx \
      | pick-best-label \
      > labtit-best.idx

Now a list of labels only:

    cat labtit-best.idx \
      | gawk -v FS='|' -v OFS='|' '($10 !~ /^title/){print $7;}' \
      > labels.lst

Make a counted list of labels:

    cat labels.lst \
      | sort | uniq -c | expand \
      | sort +0 -1nr +1 -2 \
      > labels.cts

Now a list of words that occur in labels:

    cat labels.lst \
      | tr '.' '\012' \
      | sort | uniq -c | expand \
      | sort +0 -1nr +1 -2 \
      > label-wds.cts

Preparing word lists for Italian and Latin:

    foreach f ( ital-mnz latn-bel )
      cat ../../${f}.txt \
        | sed -e '/^#/d' \
        | tr 'A-Z' 'a-z' \
        | tr -c 'a-zèìòùàéíóúá' '\012' \
        | sed -e '/^ *$/d' \
        | sort | uniq -c | expand \
        | sort +0 -1nr +1 -2 \
        > .${f}.cts
    end

Comparing the length distribution of words in text and labels:

    foreach f ( Rene-words.frq label-wds.cts .ital-mnz.cts .latn-bel.cts )
      echo " "; echo "$f"
      cat $f \
        | gawk \
            ' /./ { n=length($2); t[n]+=$1; tt+=$1; next; } \
              END { \
                for(i=1;i<25;i++) \
                  {printf "%7d %7.5f %2d\n", t[i],t[i]/tt,i; } \
              } \
            ' \
        > .$f.stats
    end

Statistics on initial and final digraphs of words and labels:

    cat Rene-words.frq \
      | sed -e 's/ q/ /' \
      | sort +1 -2 \
      | combine-counts \
      | sort -b +0 -1nr +1 -2 \
      > Rene-words-noq.frq

    foreach f ( Rene-words.frq Rene-words-noq.frq labels.cts )
      echo " "; echo "$f"
      cat $f \
        | gawk \
            ' /./ { b=substr(($2 "__"),1,2); e=substr(("__" $2),length($2)+1,2); \
                    nb[b]+=$1; ne[e]+=$1; tt+=$1; next; \
                  } \
              END { \
                printf "initial digraphs:\n"; \
                for(b in nb) {printf "%7d %7.5f %s\n", nb[b],nb[b]/tt,b; } \
                printf "final digraphs:\n"; \
                for(e in ne) {printf "%7d %7.5f %s\n", ne[e],ne[e]/tt,e; } \
              } \
            '
    end

Again, collapsing some "equivalent" letters:

    foreach f ( Rene-words.frq Rene-words-noq.frq labels.cts )
      echo " "; echo "$f"
      cat $f \
        | sed \
            -e 's/[ktpf]/t/g' \
            -e 's/[aoy]/o/g' \
            -e 's/[mgjd]/d/g' \
        | gawk \
            ' /./ { b=substr(($2 "__"),1,2); e=substr(("__" $2),length($2)+1,2); \
                    nb[b]+=$1; ne[e]+=$1; tt+=$1; next; \
                  } \
              END { \
                printf "initial digraphs:\n"; \
                for(b in nb) {printf "%7d %7.5f %s\n", nb[b],nb[b]/tt,b; } \
                printf "final digraphs:\n"; \
                for(e in ne) {printf "%7d %7.5f %s\n", ne[e],ne[e]/tt,e; } \
              } \
            '
    end

1998-08-20 stolfi
=================

Fixed the labels on page f102v2 and its neighbors. Recreated
labtit.idx, labtit-best.idx, labels.lst, etc.

1999-01-24 stolfi
=================

A few days ago John Grove sent me a list of all labels comparing the
H, U, and V transcriptions, and asked me to re-check them. I edited
and sorted the list; see grove-label2-edit.txt.

I transcribed all the labels that had only H and V transcriptions, in
stolfi-addtl-readings.txt.

Now checking the labels where we three disagree. See
stolfi-re-readings.txt. (Finished 1999-01-29)
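
Side illustration (not part of the original log): the digraph
statistics in the 1998-07-23 entry rely on padding each word with "_"
so that one-letter words still yield a two-character key. Below is a
minimal sketch of that tally on toy data. The function name
digraph_tally, the "I"/"F" line prefixes, and the sample counts are
made up for the example; it uses plain awk under sh rather than gawk
under csh, and assumes "COUNT WORD" input lines as in labels.cts.

```shell
#!/bin/sh
# Toy version of the initial/final digraph tally.
# Input: lines of the form "COUNT WORD" (assumed format, as in labels.cts).
# Output: one line per digraph: "I|F DIGRAPH COUNT FRACTION".

digraph_tally() {
  awk '
    /./ {
      # Pad with "_" so that words shorter than two letters still work.
      b = substr(($2 "__"), 1, 2);               # first two characters
      e = substr(("__" $2), length($2) + 1, 2);  # last two characters
      nb[b] += $1; ne[e] += $1; tt += $1;
    }
    END {
      for (b in nb) printf "I %s %d %.5f\n", b, nb[b], nb[b]/tt;
      for (e in ne) printf "F %s %d %.5f\n", e, ne[e], ne[e]/tt;
    }
  ' | LC_ALL=C sort   # awk array order is unspecified, so sort the result
}

# Invented sample: three label words with made-up counts.
printf '%s\n' '3 otal' '2 okeal' '1 y' | digraph_tally
```

In this sample, "otal" and "okeal" both contribute to the final
digraph "al", while the one-letter word "y" is counted under "y_"
(initial) and "_y" (final).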