Hacking at the Voynich manuscript - Side notes
062 Occurrences of Grove words

Last edited on 2002-03-05 03:43:38 by stolfi

INTRODUCTION

John Grove observed that many text lines begin with what he called a
"detachable gallows": a gallows that, if removed, leaves a valid word
that occurs elsewhere in the text.  Let's call the initial word of such
lines a "Grove word".

It has also been observed that the first word on each herbal page is
usually unique, at least in the herbal section; that the first word of
a paragraph often begins with a "p" or "f" gallows; and that the rare
words with two gallows are usually Grove words.

This note is motivated by the hunch that all those special words may be
names of important things --- plants, remedies, illnesses, whatever.
Here we analyze whether such words occur elsewhere in the text, exactly
or with simple changes (e.g. removal of the detachable gallows).

SETTING UP THE ENVIRONMENT

Links:

  ln -s ../../L16+H-eva
  ln -s ../../basify-weirdos
  ln -s ~/PUB/bin/map-field

COLLECTING THE TEXT

  cat L16+H-eva/INDEX \
    | gawk -v FS=':' \
        ' (($6 == "parags") || ($6 == "starred-parags")){ \
            printf "%s\n", $2; \
          } ' \
    > .files

  set files = ( `cat .files` )

  ( cd L16+H-eva && cat $files ) \
    | basify-weirdos \
    | reformat-evt-file \
    > main.evr

EXTRACTING THE LOCATED TOKEN LIST

Create a file with one record per token, in the format

  SEC USEQ FNUM UNIT LINE TRAN FPOS RPOS PFRST PLAST WORD
   1   2    3    4    5    6    7    8     9    10    11

where WORD is the word in question; SEC is a book section code; USEQ is
the nominal position index of the UNIT in the text; FNUM, UNIT, LINE,
and TRAN are fields of the line locator (the page f-number, the text
unit, the line number, and the transcriber code); FPOS is the
sequential number of the word in the line; RPOS is the same, counting
backwards from the end of the line; PFRST is a boolean (0 or 1)
identifying the first token of a paragraph; and PLAST is analogous for
the last token.
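As an aside (not part of the original pipeline), the record format above
can be exercised with a throwaway awk one-liner that picks out
paragraph-initial tokens by testing the PFRST field; the two sample
records below are fabricated for illustration only.

```shell
# Sketch only: select paragraph-initial tokens (PFRST, field 9, equal
# to 1) from records in the 11-field .lts format described above.
# Field 11 is the word itself.  The sample records are made up.
printf '%s\n' \
    'her 1 f1v f1v.P 1 H 1 9 1 0 fachys' \
    'her 1 f1v f1v.P 1 H 2 8 0 0 daiin' \
  | awk '$9 == 1 { print $11 }'
```

Only the first sample record has PFRST = 1, so this prints just its
WORD field, `fachys`.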
The SEC field is a three-letter code for the section ("bio", "pha",
etc.), except that "hea" and "heb" are collapsed into "her".

  cat main.evr \
    | extract-tokens \
    | map-field \
        -v inField=1 -v outField=1 \
        -v table=L16+H-eva/fnum-to-sectag.tbl \
    | gawk \
        ' /./{ gsub(/he[ab]/, "her", $1); $3 = ($2 "." $3); print; } ' \
    | map-field \
        -v inField=3 -v outField=2 \
        -v table=L16+H-eva/unit-to-useq.tbl \
    | gawk \
        ' /./{ gsub(/^f.*[.]/, "", $4); print; } ' \
    | sort -b +0 -1 +1 -2n +4 -5g +5 -6 +6 -7n \
    > main.lts

  dicio-wc main.lts

      lines   words     bytes  file
    ------- ------- --------- ------------
     114329 1257619   4037393 main.lts

COLLECTING THE SPECIAL TOKENS

We define a word to be "special" if it is a line-initial word of
paragraph or starred-paragraph text (in any transcription), has at
most one "*", has at least four letters, and either

  1. is the first word of the paragraph and starts with any gallows, or
  2. contains a "p" or "f" gallows, or
  3. contains two or more gallows.

Note that this definition does not include any Grove word that begins
with a "t" or "k" gallows, is not paragraph-initial, and contains no
other gallows.  Such a word may be an ordinary word that just happened
to resemble a Grove word and to fall in line-initial position by
accident.

  cat main.lts \
    | select-special-tokens \
    | sort -b +10 -11 +0 -1 +1 -2n +4 -5g +6 -7n +7 -8n +5 -6 \
    > main.spw

  dicio-wc main.spw

      lines   words     bytes  file
    ------- ------- --------- ------------
       2256   24816     83569 main.spw

COMPUTE REPORT HEADINGS

Extract the "headings" for the output report.  The headings are
pairwise disjoint sets of words, each as small as possible, that
satisfy the following rules:

  1. every word that occurs as a special token belongs to some
     heading;

  2. two words that occur at the start of the same line belong to the
     same heading.

  cat main.spw \
    | gawk ' /./{ print ($3 "." $4 "." $5), $11; } ' \
    | sort -b +0 -1n +1 -2 \
    | uniq \
    | collect-headings \
    | sort -b +1 -2n \
    > main.hds

  dicio-wc main.hds

      lines   words     bytes  file
    ------- ------- --------- ------------
        772    1544     12164 main.hds

BUILDING THE REPORT

Now read the token list again, selecting all occurrences, strong and
weak, of every special word `w'.  (A weak occurrence is an occurrence
of a word `w1' obtained from `w' by deleting an initial gallows.)
Output each occurrence with two extra fields: HEAD ($12), which is the
heading of `w'; and TAG ($13), which is 2 for a strong occurrence in
column 1, 1 for a strong occurrence elsewhere, and 0 for a weak
occurrence.

  cat main.lts \
    | assign-headings -v table=main.hds \
    | sort -b +11 -12 +12 -13r +0 -1 +1 -2n +4 -5g +5 -6 \
    | collapse-versions-in-soc \
    > main.soc

  dicio-wc main.soc

Format the result:

  cat main.soc \
    | sort -b +11 -12 +12 -13r +0 -1 +1 -2n +4 -5g +5 -6 \
    | format-soc \
        -v title='Occurrences of special words' \
        -v showWords=1 \
        -v showWeak=1 \
    > main.html
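As a final aside, the "weak occurrence" rule (deleting an initial
gallows) can be sketched as a tiny shell function.  This is not one of
the scripts used above; the name strip_gallows is hypothetical, and it
assumes plain EVA with single-letter gallows t, k, p, f only (platform
gallows such as "cth" are not handled).

```shell
# Sketch only: derive the "weak" form of a word by deleting an initial
# gallows, taken here to be one of the single EVA gallows letters
# t, k, p, f.  Platform gallows like "cth" are deliberately ignored.
strip_gallows () {
  case "$1" in
    [tkpf]?*) printf '%s\n' "${1#?}" ;;   # drop the leading gallows
    *)        printf '%s\n' "$1"     ;;   # no detachable gallows: unchanged
  esac
}

strip_gallows pchedy   # prints "chedy"
strip_gallows daiin    # prints "daiin", unchanged
```

A word consisting of a lone gallows letter is left alone, since
removing it would leave nothing to look up in the token list.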