Hacking at the Voynich manuscript - Side notes
044 Adding line numbers to Takahashi's transcription

Last edited on 2006-11-27 22:50:11 by stolfi

  The goal of this note is to merge Takeshi Takahashi's full
  transcription of the VMS into the interlinear file.

FETCHING TAKESHI'S FILES

  Takeshi Takahashi announced to the Voynich list a new full
  transcription of the VMS in HTML format, without line numbers.
  I fetched his single-file version with HTTP on 1998-11-25.

  Some time later (1998-12-28) I fetched again the version formatted
  as separate pages, which Takeshi says are more up-to-date than the
  full file.

    ln -s ../../../docs/Takahashi

    wget \
      --non-verbose \
      --input-file=Takahashi/pages/all.urls \
      --directory-prefix=Takahashi/pages/ \
      --output-file=Takahashi/pages/wget.log

  Had to manually edit f66r.htm because it was formatted as a .

INITIAL CONVERSION FROM HTML TO EVT FORMAT

  Removing irrelevant HTML formatting:

    cat Takahashi/full.html \
      | cleanup-takeshi-html-1 \
      > tak-1.iso

    ( cd Takahashi/pages/ && cat `cat all.fnums | sed -e 's/$/.htm/'` ) \
      | cleanup-takeshi-html-1 \
      | sed \
          -e 's/([^()]*)//g' \
          -e '/^ *$/d' \
      > kat-1.iso

  Replaced newline and " " by ".",
  <BR> by "-" and newline, <P> by "=" and newline; then cleaned up
  the spaces.

    foreach f ( tak kat )
      echo "$f"
      cat ${f}-1.iso \
        | sed \
            -e '/^[##]/s/\(f[0-9a-z]*\)/<\1> {'"${f}"'}@/' \
            -e '/^[#]/b' \
            -e 's/<BR>/-@/' \
            -e 's/<P>/=@/' \
            -e 's@<[/]*B>@@g' \
            -e 'y/ /./' \
        | tr '\012@' '.\012' \
        | sed \
            -e 's/[.][.]*_/__/g' \
            -e 's/_[.][.]*/__/g' \
            -e 's/^[-._=]*//g' \
            -e 's/[-._]*[=][-._=]*/=/g' \
            -e 's/[.]*[-_][-._]*/-/g' \
            -e 's/[.][.][.]*/./g' \
        > ${f}-2.iso
    end

INSERTING PRELIMINARY LOCATION CODES IN TAKESHI'S FILE

  The next step is to insert preliminary locators into Takeshi's
  file. This processing was done on the single-file version
  (tak-2.iso); later the differences between tak-2.iso and kat-2.iso
  were checked and fixed manually.

  Using transcriber "H" = Takahashi.

  The preliminary locators have empty unit tag and 2-digit line
  numbers that increase sequentially per page (i.e. , , etc.).

    cat tak-2.iso \
      | insert-loc-codes \
      > tak-3.evt

BUILDING THE TAKESHI-TO-STOLFI LOCATOR DICTIONARY

  Then we have to produce a dictionary that maps the preliminary
  locators to those used in the interlinear.

  For that purpose we take a "best pick" version of the interlinear,
  chopping the text to 20 bytes, and equalizing all pages to 170
  lines:

    ln -s ../../L16-eva
    cat L16-eva/UNITS \
      | gawk -v FS=':' '/./{print $2;}' \
      > .all.units
    set units = ( `cat .all.units` )

    ( cd L16-eva && cat $units ) \
      | best-pick \
      > sto-3.evt

    cat sto-3.evt \
      | sync-clip-evt -v pageSize=170 \
      > sto.clp

  Then we do the same to Takahashi's file (chop the text to 20 bytes
  and equalize the page sizes).
  We also fix some page numbering bugs:

    cat tak-3.evt \
      | sed \
          -e 's//d' \
          -e '//i\' \
          -e '## {}\' \
          -e ' {not transcribed}' \
      | sync-clip-evt -v pageSize=170 \
      > tak.clp

  Let's check whether we have the same set of pages, in the same
  order, with the same number of lines:

    dicio-wc sto.clp tak.clp

    grep '##' sto.clp > sto.pages
    grep '##' tak.clp > tak.pages
    diff sto.pages tak.pages

  OK, now we paste these two files side-by-side:

    /n/gnu/bin/paste -d' ' tak.clp sto.clp > tak-sto.clp

  We then edit tak-sto.clp manually, shifting and permuting the right
  half of each page (locators included) until the two truncated texts
  on each line are two versions of the same VMS line. Unmatched
  half-lines are discarded (they will require fixing the line breaks
  in the file anyway).

  Then we delete the truncated text columns, leaving only the
  locators (preliminary and interlinear). The resulting file is saved
  as tak-sto-locs.tbl.

MAPPING PRELIMINARY LOCATORS TO STOLFI'S

  Next, we create a preliminary file tak-4.evt with Takeshi's text
  and (mostly) Stolfi's locators:

    cat tak-3.evt \
      | map-locations \
          -v table=tak-sto-locs.tbl \
      > tak-4.evt

  The script map-locations will leave untouched any locations that
  are not listed in the dictionary. (These are identified by their
  empty unit tag, i.e. locators of the form ""). We edit those
  locators by hand.

  We make a copy of the file

    cp -p tak-4.evt tak-5.evt

  and edit tak-5.evt with emacs.

  [ 1998-12-02 ] Eventually I finished editing all location numbers
  in tak-5.evt.
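
  The behaviour described above for map-locations (replace a locator
  when the table lists it, leave the line untouched otherwise) can be
  sketched as follows. This is only an illustration, not the actual
  script: the function name map_locations, the two-column layout of
  the table (preliminary locator, then interlinear locator), and the
  locator spellings used in the example are all assumptions. It is
  written as a POSIX sh function rather than csh.

```shell
# Hypothetical sketch of a locator-mapping filter (not the real map-locations).
# $1 = table file with lines "OLD-LOCATOR NEW-LOCATOR"; EVT text comes from stdin.
map_locations() {
  awk -v table="$1" '
    BEGIN {
      # Load the dictionary: first column is the key, second the replacement.
      while ((getline line < table) > 0) {
        n = split(line, fld, " ")
        if (n >= 2) map[fld[1]] = fld[2]
      }
      close(table)
    }
    /^</ {
      # The locator is everything up to and including the first ">".
      end = index($0, ">")
      loc = substr($0, 1, end)
      if (end > 0 && (loc in map)) {
        print map[loc] substr($0, end + 1)
        next
      }
    }
    # Comments, page headers and unlisted locators pass through unchanged.
    { print }
  '
}
```

  Under these assumptions it would be called as
  "map_locations tak-sto-locs.tbl < tak-3.evt > tak-4.evt"; the lines
  whose locators were not in the table are the ones left for hand
  editing.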

MERGING TAKESHI'S VERSION INTO THE INTERLINEAR

  Collecting the units making up the interlinear file (see note 024):

    ln -s ../../L16-eva
    /bin/rm -f .all.units
    cat L16-eva/UNITS \
      | gawk -v FS=':' '/^[^#]/{printf "L16-eva/%s\n", $2;}' \
      > .all.units

  Merging tak-5.evt into the current interlinear:

    cat `cat .all.units` \
      > inter.evt

    merge-version-into-interlin \
        -v sourceFile=tak-5.evt \
        -v trashFile=tak-5-unmatched.evt \
        -v transCodes='H' \
      < inter.evt \
      > inter+tak.evt

  Splitting back into separate files:

    mkdir ../../L16+H-eva
    ln -s ../../L16+H-eva
    cat inter+tak.evt \
      | split-evt-into-units \
          -v outdir=L16+H-eva \
      > .new.units

EDITING AND SYNCHRONIZING THE MERGED FILE

  At this point we have created a new interlinear, partitioned one
  unit per file. We must still edit the units L16+H-eva/f* by hand,
  adding "-{plant}" breaks and synchronizing "!"s.

  This step was done on the individual text unit files in
  L16+H-eva/*

REMOVING NEEDLESS CAPITALIZATION

  In his transcription, Takeshi used capitalized EVA for proper
  display with Gabriel's fonts (i.e. Sh/Ch/cTh etc. instead of
  sh/ch/cth etc.). To simplify comparison and statistical analysis,
  it is better to convert those codes to lower-case EVA. After all,
  they can be re-capitalized with Rene's VTT.
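
  As an illustration of the kind of rewriting involved, here is a
  minimal sketch that lower-cases only the three digraphs named
  above (Sh, Ch, cTh) in the text part of locator lines. It is not
  the actual remove-needless-capitalization script: the function
  name decapitalize_eva is made up, the locator in the example is
  hypothetical, any other "aesthetic" capitals covered by the "etc."
  are not handled, and {...} comments inside the text are not
  protected. Capitals that carry meaning in full EVA (such as a
  ligated "O") are deliberately left alone.

```shell
# Hypothetical sketch: lower-case the "aesthetic" capitals Sh, Ch and cTh.
# Reads EVT text on stdin; only lines that start with a "<...>" locator change.
decapitalize_eva() {
  awk '
    /^</ {
      end = index($0, ">")
      loc = substr($0, 1, end)        # locator, kept verbatim
      txt = substr($0, end + 1)       # transcribed text
      gsub(/cTh/, "cth", txt)
      gsub(/Sh/, "sh", txt)
      gsub(/Ch/, "ch", txt)
      print loc txt
      next
    }
    # Comment lines and page headers are copied unchanged.
    { print }
  '
}
```

  For example, a line "<f1r.P1.1;H> Shol.Chor.cThy.Otaiin" would
  come out as "<f1r.P1.1;H> shol.chor.cthy.Otaiin".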

  So let's get a list of all unit files, in reading order:

    cat L16+H-eva/UNITS \
      | gawk -v FS=':' '/./{print $2;}' \
      > .all.units
    set units = ( `cat .all.units` )

  Safety check:

    ( cd L16+H-eva && ls f[0-9]* | egrep -v '[~]$' ) | sort > .foo
    cat .all.units | sort > .bar
    diff .foo .bar

  Concatenating all units:

    ( cd L16+H-eva && cat ${units} ) \
      > inter+tak-ed.evt

    cat inter+tak-ed.evt \
      | remove-needless-capitalization \
      > inter+tak-ed-noc.evt

    diff inter+tak-ed.evt inter+tak-ed-noc.evt | head -3000 > .foo

  Let's see whether we lost any information, compared to Takahashi's
  version:

    cat inter+tak-ed-noc.evt \
      | vtt -b1 -l4 -c3 -tH -o0 -s0 -f1 \
      | sed -e 's/[-]\(.\)/.\1/g' \
      > .bar

    diff tak-2.iso .bar \
      | prettify-diff-output \
      > .foo

GETTING THE WEIRDOS OUT OF THE WAY

  Besides the "aesthetic" capitals, Takeshi used significant
  characters from the full EVA character set, including plumes ['"],
  ligated capital letters such as "I" and "O", and 8-bit characters
  above decimal 127 for "weirdos".

  All my statistical scripts were written for basic EVA, and it seems
  silly to modify and re-debug them all just for the sake of a few
  characters that occur only a few times in the document. Moreover
  the weirdos would only contribute distracting noise to the
  statistics, and hog the tables with useless entries.

  We could keep the full EVA codes in the file, and use VTT or some
  AWK script to convert the file to basic EVA before each processing
  step where they would be a nuisance. However I anticipate that most
  of my uses of the file will fall in this category. So I think it is
  better to map all weirdos to "*" (or to a similar basic-EVA
  letter), and retain the correct full-EVA code as a stylized
  comment; and only convert these groups to full EVA when needed.

  Specifically, non-basic EVA characters will be denoted by a
  construct of the form "C{&XXX}" where "C" is a lower-case basic-EVA
  letter or "*" (to be used in "normal" processing), and "XXX" is
  either the 3-digit decimal code of the weirdo, or the extended EVA
  notation for that character.

  Here are some examples, with the precise meaning and how they will
  be handled in statistics and indexing:

    *{&252} = weirdo with decimal code 252 ("V" underbar);
              treat as "*".

    k{&K}   = a "k" with stem crossed by a ligature; treat as "k".

    o{&o'}  = "o" with plume; treat as "o".

    s{&S}   = an "s" with a ligature, or half a "sh"; treat as "s".

    q{&q"}  = a "q" with a plume above the connector, not touching
              it; treat as "q".

  I have temporarily used some non-EVA codes after the "&", for
  weirdos that I could not decide how to encode in EVA. For example
  on I used "r{&r}" for an "r" glued to the preceding character (an
  "e"); and on and I used "*{&^}" for a character that looks like the
  upper half of a "y". These non-EVA codes are defined in #-comments
  at the top of each unit file, and will hopefully be replaced by
  official EVA in the near future.
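
  To make the "C{&XXX}" convention concrete, here is a small sketch
  of the two operations a statistics script would need: dropping the
  stylized comments to get plain basic EVA, and listing the codes
  that were dropped. The function names to_basic_eva and
  list_weirdo_codes are invented for this illustration; they are not
  the basify-weirdos or unbasify-weirdos scripts used in this note,
  and they do not convert 3-digit decimal codes back into 8-bit
  characters.

```shell
# Hypothetical helpers for the C{&XXX} notation; input is read from stdin.

# Reduce every C{&XXX} group to its basic-EVA stand-in C.
to_basic_eva() {
  sed -e 's/\([a-z*]\)[{]&[^{}]*[}]/\1/g'
}

# Print the XXX part of every C{&XXX} group, one per output line.
list_weirdo_codes() {
  awk '{
    s = $0
    while (match(s, /[a-z*][{]&[^{}]*[}]/)) {
      # Skip the three characters "C{&" and drop the closing "}".
      print substr(s, RSTART + 3, RLENGTH - 4)
      s = substr(s, RSTART + RLENGTH)
    }
  }'
}
```

  With this sketch, the text "qo{&o'}k{&K}eedy*{&252}" reduces to
  "qokeedy*", and the codes listed are o', K and 252. Piping
  list_weirdo_codes through "sort | uniq -c" would give a frequency
  table of the weirdos in a file.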

  So let's do the conversion on the file (after having already
  removed the "aesthetic" capitalizations):

    cat inter+tak-ed-noc.evt \
      | basify-weirdos \
      > inter+tak-ed-basic.evt

    diff inter+tak-ed-noc.evt inter+tak-ed-basic.evt | head -3000 > .foo

    cat inter+tak-ed-basic.evt \
      | validate-new-evt-format \
          -v checkTerminators=1 \
          -v checkLineLengths=1 \
      >& inter+tak-ed-basic.bugs

  Splitting back into separate files:

    mkdir ../../L16+H-b-eva
    ln -s ../../L16+H-b-eva
    cat inter+tak-ed-basic.evt \
      | split-evt-into-units \
          -v outdir=L16+H-b-eva \
      > .new.units

  Checking the result:

    cat .new.units \
      | sed -e 's@L16[+]H-b-eva/@@g' \
      > .foo
    diff .all.units .foo

    cat `cat .new.units` \
      > .bar
    diff inter+tak-ed-basic.evt .bar \
      | prettify-diff-output

  All seems OK, we can replace the interlinear directory by the new
  one:

    mv ../../L16+H-eva ../../L16+H-eva-junk
    mv ../../L16+H-b-eva ../../L16+H-eva
    rm -i L16+H-b-eva

  The auxiliary files (UNITS, scripts, X-

  .fnums, etc.) were moved by hand from L16+H-eva-junk to the new
  L16+H-eva.

  Checking whether we have lost any lines:

    ( cd L16+H-eva && cat ${units} ) \
      | gawk '/^<.*;[A-GI-Z]/{print $1;}' \
      > .new.locs

    ( cd L16-eva && cat `cat UNITS | gawk -v FS=':' '/./{print $2;}'` ) \
      | gawk '/^<.*;[A-GI-Z]/{print $1;}' \
      > .old.locs

    diff .old.locs .new.locs \
      | prettify-diff-output \
      > .diff

ADDING LATE CHANGES

  Takeshi says that the separate HTML pages are more reliable than
  the full file, and he also added last-minute corrections to the
  separate pages only. So I compared

    diff tak-2.iso kat-2.iso \
      | prettify-diff-output \
      > .diff2

  Processed all differences by hand and added them to the
  interlinear.

RECREATING TAKESHI'S VERSION

  Let's see how closely we can recreate Takeshi's version from the
  interlinear.

  Let's get again a list of all unit files, in reading order:

    cat L16+H-eva/UNITS \
      | gawk -v FS=':' '/./{print $2;}' \
      > .all.units
    set units = ( `cat .all.units` )

  Safety check:

    ( cd L16+H-eva && ls f[0-9]* | egrep -v '[~]$' ) | sort > .foo
    cat .all.units | sort > .bar
    diff .foo .bar

  Concatenating all units:

    ( cd L16+H-eva && cat ${units} ) \
      | unbasify-weirdos \
      > inter+tak.evt

    cat inter+tak.evt \
      | egrep -v '^#($|[ ])' \
      | gawk \
          ' \
            /^<[^;<>]*[;]/{ \
              match($0,/<[^;<>]*[;]/); \
              s=substr($0,2,RLENGTH-2); \
              if(s!=ps){print "#"; ps=s;} \
            } \
            //{print;} \
          ' \
      > inter+tak-noc.evt

    dicio-wc inter+tak.evt

      lines   words     bytes file
    ------- ------- --------- ------------
      39428  129301   1681652 inter+tak.evt

    cat inter+tak.evt \
      | validate-new-evt-format \
          -v checkTerminators=1 \
          -v checkLineLengths=1 \
      >& inter+tak.bugs

  Re-extracting Takahashi's version with Rene's VTT:

    ln -s /proj/dicio/staff/stolfi/programs/c/vtt-rene/vtt

    cat inter+tak.evt \
      | sed -e '/^## *<[^;.<>]*>/s/## *//' \
      | gawk \
          ' /^##.*<[^.]*>.*/{print substr($0,index($0,"<"));} \
            //{print;} \
          ' \
      | vtt -b1 -c2 -f1 -l4 -o0 -s0 -tH \
      | sed \
          -e 's/[-]\(.\)/.\1/g' \
          -e '/^#$/d' \
          -e '/^# /d' \
      > inter+tak.iso

      39685 lines read in
      12143 lines de-selected
          0 hash comment lines suppressed
        265 empty lines suppressed
      27277 lines written to output

  Checking for processing errors:

    foreach f ( kat-2 inter+tak )
      echo $f
      cat $f.iso \
        | basify-weirdos \
        | unbasify-weirdos \
        | sed \
            -e '/^##/s/## *<\(.*\)>.*$/<'"$f"':\1>@/;t' \
            -e '/^#/d' \
            -e 's/[%\!]//g' \
            -e 's/[-=]/./g' \
            -e 's/[.][.][.]*/./g' \
            -e 's/^[. ]*//g' \
            -e 's/[. ]*$//g' \
            -e 's/\([=.]\)/@/g' \
        | tr '@' '\012' \
        | egrep -v '^ *$' \
        > $f.xxx
    end

    diff {kat-2,inter+tak}.xxx \
      | prettify-diff-output \
      > .foo

  Converting to HTML, quick and dirty:

    cat inter+tak.evt \
      | vtt -a1 -b1 -c2 -f0 -l4 -o0 -s0 -tH \
      | sed \
          -e '/^<[^;]*>/d' \
          -e 's/[-]\(.\)/.\1/g' \
          -e '/^#$/d' \
          -e '/^# /d' \
      > tak-rebuilt.evt

      39685 lines read in
      12143 lines de-selected
          0 hash comment lines suppressed
          8 empty lines suppressed
      27534 lines written to output

    cat tak-rebuilt.evt \
      | egrep -v '## *' \
      | numbered-eva-to-html \
          'VMS transcription by Takeshi Takahashi' \
      > tak-rebuilt.html

  [1998-12-28 stolfi] Deleted L16-eva; the official directory is
  L16+H-eva from now on.

[2000-05-08 stolfi]
CHECKING FOR ANY CHANGES TO TAKAHASHI'S VERSION

    mkdir Takahashi/pages-2

  Fetching again the full file:

    wget \
      'http://www3.justnet.ne.jp/~ttakahashike/voynich/pages/PagesH.txt' \
      --non-verbose \
      --output-document 'Takahashi/pages-2/full-2.evt'

  Fetching again the page files for comparison:

    wget \
      'http://www3.justnet.ne.jp/~ttakahashike/voynich/pages/index.htm' \
      --non-verbose \
      --output-document 'Takahashi/pages-2/index.htm'

  Manually extracted the links, saved them to files
  "Takahashi/pages-2/all.fnums" (f-numbers only) and
  "Takahashi/pages-2/all.urls" (complete URLs).

  Fetching the pages:

    wget \
      --non-verbose \
      --input-file=Takahashi/pages-2/all.urls \
      --directory-prefix=Takahashi/pages-2/ \
      --output-file=Takahashi/pages-2/wget.log

  Checking for completeness:

    ( cd Takahashi/pages-2/ && grep -L -i -e '' *.htm )

  Removing spurious breaks:

    ( cd Takahashi/pages-2/ && cat `cat all.fnums | sed -e 's/$/.htm/'` ) \
      | tr -d '\015' \
      | unsplit-fontified-lines \
      > kat-new-1.html

  Checking for unexpected format:

    cat kat-new-1.html \
      | remove-junk-html \
      | grep -v '' \
      | egrep -v '^[#][#]' \
      | sort | uniq \
      > .trash

  Now do it for real:

    ( cd Takahashi/pages-2/ && cat `cat all.fnums | sed -e 's/$/.htm/'` ) \
      | cleanup-takeshi-html-2 \
      > kat-new-1.iso

  Insert locator codes:

    cat kat-new-1.iso \
      | insert-loc-codes \
      | egrep '^(<.*;H>|[#][#])' \
      > kat-new-3.evt

  Map locator codes for maximum compatibility, and reduce character
  set:

    cat kat-new-3.evt \
      | map-locations \
          -v table=tak-sto-locs-new.tbl \
      | egrep '^<' \
      | remove-needless-capitalization \
      | basify-weirdos \
      | sed \
          -e 's/{[^{}&][^{}]*}//g' \
          -e 's/{}//g' \
          -e 's/[-=,]/./g' \
          -e 's/[.] *$/-/g' \
      | sort \
      > kat-new-4.evt

  Compare to previous version:

    ln -s ../../L16+H-eva
    cat L16+H-eva/INDEX \
      | gawk -v FS=':' '/./{print $2;}' \
      > .all.units
    set units = ( `cat .all.units` )

    ( cd L16+H-eva && cat $units ) \
      | egrep '^<.*;H>' \
      | tr -d '%\!' \
      | sed \
          -e 's/{[^{}&][^{}]*}//g' \
          -e 's/{}//g' \
          -e 's/[-=,]/./g' \
          -e 's/[.] *$/-/g' \
      | sort \
      > kat-old-4.evt

    diff kat-old-4.evt kat-new-4.evt \
      | prettify-diff-output \
      > .diffs2

  Checked .diffs2 manually, folded Takeshi's updates into
  L16+H-eva/*.

  Format check:

    ( cd L16+H-eva && cat $units ) \
      | validate-new-evt-format \
          -v checkTerminators=1 \
          -v checkLineLengths=1 \
      >& .bugs
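
  The comparison of kat-old-4.evt with kat-new-4.evt above relies on
  both files being sorted and on a line-based diff. As a possible
  alternative, here is a sketch that keys the comparison on the
  locator itself, so that each difference is reported as an added,
  changed or removed line. It assumes that every line has the form
  "LOCATOR TEXT" with the locator as the first blank-separated
  field; the function name compare_by_locator and the locators in
  the example are hypothetical, and this is not what was actually
  run for this note.

```shell
# Hypothetical sketch: compare two normalized EVT files by locator.
# $1 = old file, $2 = new file; each line is "<locator> text".
compare_by_locator() {
  awk '
    # First file: remember each full line under its locator.
    NR == FNR { old[$1] = $0; next }
    # Second file: classify each locator against the old table.
    {
      seen[$1] = 1
      if (!($1 in old))        print "added   " $0
      else if (old[$1] != $0)  print "changed " $0
    }
    # Locators that never showed up in the new file.
    END {
      for (k in old) if (!(k in seen)) print "removed " old[k]
    }
  ' "$1" "$2"
}
```

  It would be invoked as "compare_by_locator kat-old-4.evt
  kat-new-4.evt > .diffs2". The order in which removed lines are
  listed is whatever order awk uses to traverse its array, so the
  output may need a final sort.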