Hacking at the Voynich manuscript - Side notes
000 Converting the Interlinear to EVA

Last edited on 1998-12-27 19:47:27 by stolfi

This is a remake of work from Notebook-1.txt, originally done on 97-07-05, using newer scripts.

Summary of previous relevant tasks:

  I obtained Landini's interlinear transcription of the VMs, version 1.6 (landini-interln16.evt) from http://sun1.bham.ac.uk/G.Landini/evmt/intrln16.zip. [Notebook-1.txt]

  Around 97-11-01 I split landini-interln16.evt into many files, with one text unit per page. [Notebook-12.txt]

  On 97-11-05 I mapped those files from FSG and other ad-hoc alphabets to EVA. [Notebook-12.txt] The files are L16-eva/fNNxx.YY, and a machine-readable description of their contents and logical order is in L16-eva/INDEX.

Then I went back and started redoing some of the previous tasks using the new encoding.

On 98-12-05 I redid the unfolding and the conversion to EVA, in order to build a table mapping the old locators (Landini's) to the new ones (Stolfi's).

UNFOLDING THE ALTERNATIVE CONSTRUCTS

I decided to unfold the "[|]" groups into separate lines. This unfolding should make the consensus more consistent. Note that a group like "[A.|.A]" or "[A|O]" may be considered a consensus, whereas "[A|P]" may not be, depending on the definition of consensus. Hence it seems sensible to unfold the alternatives before computing the consensus. (In previous attempts at computing the consensus, I would just take the first choice out of every alternation.)

(I had tried doing the unfolding after mapping to EVA, but there are some half-character choices like P[Z|] which would require editing afterwards. Besides, it seems better to have an unfolded version of the FSG encoding, preserving the "%" and "!" alignment markers.)

For the unfolding, I wrote a filter "unfold-alternatives" to be used with "filter-files". New transcriber codes were introduced for the variant lines; see "f0.U" for details.
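As an illustration only, the unfolding of one text string can be sketched as below. This is not the actual "unfold-alternatives" filter, whose source is not reproduced in these notes. The sketch assumes that the k-th variant line takes the k-th branch of every "[|]" group, falling back to the last branch when a group has fewer branches. The function names are hypothetical, and locators, transcriber codes and '{}' comments are ignored.

```python
import re

# A "[x|y|...]" group: brackets around two or more "|"-separated branches.
# Branches may be empty, as in the half-character choice "P[Z|]".
GROUP = re.compile(r"\[([^\[\]|]*(?:\|[^\[\]|]*)+)\]")


def pick_branch(branches, k):
    """Return branch k, or the last branch if the group has fewer than k+1."""
    return branches[min(k, len(branches) - 1)]


def unfold(text):
    """Unfold the alternation groups of one text string into variant strings.

    Assumption of this sketch: variant k uses branch k of every group, so a
    string with n two-way groups yields 2 variants, not 2^n.
    """
    groups = [m.group(1).split("|") for m in GROUP.finditer(text)]
    if not groups:
        return [text]
    n_variants = max(len(g) for g in groups)
    variants = []
    for k in range(n_variants):
        # Replace every group by its k-th branch.
        variant = GROUP.sub(
            lambda m, k=k: pick_branch(m.group(1).split("|"), k), text
        )
        variants.append(variant)
    return variants


if __name__ == "__main__":
    for v in unfold("A[P|F]ETR[II|O]G[A|P]E"):
        print(v)
    # prints:
    # APETRIIGAE
    # AFETROGPE
```

The "%" and "!" alignment markers pass through untouched, since only the bracketed groups are rewritten.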
Note also that a line with n groups, like "A[P|F]ETR[II|O]G[A|P]E", need only generate two lines, "APETRIIGAE" and "AFETROGPE", and not 2^n, since the consensus should not be affected too much by crossovers. (Perhaps this is true only if the alternations are well-separated?) Besides, this interpretation is generally closer to the way the '[|]' constructs are used in the file: each branch represents one specific version of the transcription.

    cat L16/INDEX \
      | sed -e 's/:.*$//g' \
      > .units.dir

    mkdir L16-unf
    foreach f ( `cat .units.dir ` )
      echo $f
      cat L16/$f \
        | unfold-alternatives \
        > L16-unf/$f
    end

    /bin/rm -f .diff
    foreach f ( `cat .units.dir ` )
      echo $f
      echo ' ' >> .diff
      echo '=== '$f' ===' >> .diff
      echo ' ' >> .diff
      diff L16/$f L16-unf/$f \
        | prettify-diff-output \
        >> .diff
    end

Expanded and complemented Landini's initial comments, producing L16/f0.{A,I,J,E,S,U}. Included comments about my unfoldings and edits.

    cp L16/INDEX L16-unf/
    tar cvf - L16 | gzip > L16.tgz

      -rw-r--r-- 1 stolfi staff 170606 Nov 5 19:08 L16.tgz

    rm -rf L16

Also added a new unit L16/f77v.L (and L16-eva/f77v.L), with the labels on the figures of page f77v. (I should ask the folks in the mailing list to check the labels...)

MAPPING THE INTERLINEAR TO EVA

Then I converted these files to the new EVA encoding. I plan to work as much as possible with that encoding, since it is "the way of the future".

    mkdir L16-eva
    foreach f ( L16-unf/f[0-9]* )
      echo "$f -> L16-eva/${f:t}"
      cat ${f} \
        | fsg2eva \
        > L16-eva/${f:t}
    end

    /bin/rm .bugs
    foreach f ( L16-eva/f[0-9]* )
      echo "checking $f"
      cat ${f} \
        | validate-new-evt-format \
            -v chars='aoeilmnrchtpkfsqgjdvxy' \
        >>& .bugs
    end

Manually edited some occurrences of FSG and Currier codes within '{}' comments. Also fixed a few dozen bugs (bad letters, leading ".", missing lines). The file "f0.V" describes the recoding and the fixes.

[Oops, made a mistake in fsg2eva (mapped 'T' to 'th' instead of 'ch').
So now I am trying to redo the mapping without losing the manual edits:

    mv L16-eva L16-eva-th
    (recreate L16-eva mechanically as above)

    mkdir L16-eva-xx
    foreach f ( L16-eva/f[0-9]* )
      set fxx = "L16-eva-xx/${f:t}"
      echo "$f -> $fxx"
      cat ${f} \
        | sed -e '/^ ${fxx}
    end

    diff -r L16-eva-th L16-eva-xx \
      | prettify-diff-output \
      > .diff
    (check differences and edit as appropriate)

    cp -p L16-eva{-th,}/INDEX
    cp -p L16-eva{-th,}/f0.A
    cp -p L16-eva{-th,}/f0.V
    (fix fsg2eva code in f0.V)

OK, let's redo everything we were doing...

[ Oops, on 31 Mar 1998 I found another bug in the fsg2eva script: "IK" was mapped to "ik" instead of "im". Fortunately the other "I*K" strings were converted correctly. This affected 78 lines of the interlinear file. Those errors were corrected by hand. ]

ADDING MISSING PAGES

    ln -s ../../landini-intrln16.evt
    cat landini-intrln16.evt \
      | fsg2eva \
      > intrln16-eva.evt

Edited manually intrln16-eva.evt to add missing pages: