# Word occurrence map for the Biological section

The tables below show the approximate distribution of occurrences of each word along the Biological section of the Voynich manuscript.

The only reasonably sure conclusion I can draw from these maps is that the word distribution is not entirely random. (But even that may be an artifact of the recoding and error filtering steps...) Many words have their occurrences concentrated in one or two spots. Also, similar words tend to have similar distributions.

## Table format

Each line of the table corresponds to a distinct "good" word occurring somewhere in the Biological section. The format is:

    TOTAL   AVG   DEV WORD             ABSOLUTE FREQUENCY BY BLOCK                                  RELATIVE FREQUENCY BY BLOCK
    ----- ----- ----- ---------------- -----------------------------------------------------------  -----------------------------------------------------------
        1  22.5  16.7 8aHa             ......................1....................................  ......................9....................................
        3  33.8   7.8 8aHam            .............................11..........1.................  .............................33..........3.................
        2  28.5  26.4 8aHc8a           ...1.................................................1.....  ...5.................................................5.....
        1  12.5  16.7 8aHca            ............1..............................................  ............9..............................................

The three numbers on the left are the word's total frequency in the text, its average block index (counting from 1, with each block taken at its midpoint, so a single occurrence in block 23 gives 22.5), and a measure of its spread. The latter is essentially the standard deviation of the block index, except that the formula was modified to inflate the spread of words that occur only a few times. In particular, a word that occurs only once gets a spread value of 16.7, similar to that of a frequent and uniformly distributed word (like `zcc8a', spread=18.3).
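The modified spread formula is not spelled out on this page. As a rough sketch only, the hypothetical shell function `word_stats` below reads lines of the form `BLOCK WORD` (an assumed input format) and prints the three numbers for each word. It takes block k at its midpoint k - 0.5 and adds a term (D0/n)^2 to the variance, where D0 = (NBLOCKS - 1)/sqrt(12) and n is the word's total count. That correction is a guess which happens to reproduce the four sample rows above; it is not necessarily the formula actually used.

```sh
# word_stats NBLOCKS: read "BLOCK WORD" lines on stdin (assumed format),
# print "TOTAL AVG DEV WORD" for each word, sorted by word.
word_stats() {
  awk -v NB="${1:-59}" '
    {
      x = $1 - 0.5                 # assumed: block k counted at its midpoint
      n[$2]++; s[$2] += x; ss[$2] += x * x
    }
    END {
      d0 = (NB - 1) / sqrt(12)     # guessed spread for a single occurrence
      for (w in n) {
        avg = s[w] / n[w]
        var = ss[w] / n[w] - avg * avg
        if (var < 0) var = 0       # guard against rounding noise
        dev = sqrt(var + (d0 / n[w]) ^ 2)
        printf "%5d %5.1f %5.1f %s\n", n[w], avg, dev, w
      }
    }' | LC_ALL=C sort -k4,4
}

# Demo: block positions read off the four sample rows above.
printf '%s\n' '23 8aHa' '30 8aHam' '31 8aHam' '42 8aHam' \
              '4 8aHc8a' '54 8aHc8a' '13 8aHca' | word_stats 59
```

With this guess, the inflation term shrinks as 1/n^2, so it matters only for words with a handful of occurrences and leaves frequent words close to their plain standard deviation.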

The rest of the table is divided into two sections, each having one column for each block of 100 consecutive "good" words.

In the "ABSOLUTE" section, each entry is a digit that tells how many times that word occurred in that block ("." means 0, "9" means nine or more times).

In the "RELATIVE" section, the same counts are expressed as fractions of the total count for that word, scaled to the range [0..9] ("." = zero, "9" = 100%).
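The two digit strings can be sketched as follows. The function name `map_rows` and the `BLOCK WORD` input format are assumptions for illustration, and so is the rounding rule for the relative digits (nearest integer to 9 times the fraction, halves rounded up), which was chosen only because it agrees with the sample rows (1/3 gives 3, 1/2 gives 5, 1/1 gives 9).

```sh
# map_rows NBLOCKS: read "BLOCK WORD" lines on stdin (assumed format),
# print "WORD ABSOLUTE RELATIVE" with one character per block.
map_rows() {
  awk -v NB="${1:-59}" '
    { cnt[$2, $1 + 0]++; tot[$2]++ }
    END {
      for (w in tot) {
        a = ""; r = ""
        for (b = 1; b <= NB; b++) {
          c = cnt[w, b] + 0
          if (c == 0) { a = a "."; r = r "."; continue }
          a = a (c > 9 ? 9 : c)    # "9" stands for nine or more
          # assumed rounding: nearest integer, halves rounded up;
          # a nonzero count that rounds down to zero prints as "0"
          r = r int(9 * c / tot[w] + 0.5)
        }
        printf "%-16s %s  %s\n", w, a, r
      }
    }' | LC_ALL=C sort
}

# Demo on a tiny 4-block text.
printf '%s\n' '1 aa' '1 aa' '3 aa' '2 bb' | map_rows 4
```

Keeping "." for empty blocks in both halves makes the concentration of a word visible at a glance, while the relative digits make rare and frequent words comparable.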

## Source text

The counts were obtained from the entire Biological section of the VMs (f75r--f84v), which is in Currier's "Language B". The version used was a mechanical "consensus" of the Currier and FSG (Friedman) transcriptions (7670 words, including 765 end-of-line marks `//' and 75 end-of-paragraph marks `=').

After discarding words with invalid characters, and words that were transcribed differently by Currier and Friedman, only 5894 words were left. The word sequence was split into 100-word blocks (59 blocks, the last one holding only 94 words), and word counts were collected for each block.
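The block-splitting step amounts to numbering the words and dividing by the block size. The function below is a minimal stand-in written for illustration; the name `enum_blocks` and the `BLOCK WORD` output format are assumptions, and it is not the actual enum-words-in-blocks script.

```sh
# enum_blocks WPB: read one word per line on stdin and print "BLOCK WORD",
# where BLOCK is the 1-based index of the WPB-word block holding the word.
# Empty lines are skipped and do not count as words.
enum_blocks() {
  awk -v WPB="${1:-100}" '
    NF > 0 { k++; printf "%d %s\n", int((k - 1) / WPB) + 1, $1 }'
}

# Demo with blocks of two words.
printf '%s\n' zcc8a 8aHam zcc8a | enum_blocks 2
```

Piping the output through `sort | uniq -c` then gives the number of occurrences of each word in each block.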

## Character encoding

For this table I used an ad-hoc encoding of the VMs script (yes, ANOTHER one), which I label HOP. First, the Currier and Friedman versions were converted from modified FSG to a "super-analytic" encoding, where each VMs symbol is broken down into separate pen strokes, and each stroke is encoded as a single letter. Then certain stroke sequences were collapsed back into single characters (not necessarily honoring the original character boundaries).

The HOP encoding merges some VMs symbols that are often confused by the transcribers, such as [D] with [H] and [M] with [N]. It also merges pairs which (I believe) are meaningless calligraphic variations of the same letter, such as [A] and [G], or [A] and [CI].

The HOP encoding is basically the Frogguy alphabet, with the following changes:

## Unix recipe

To build the map, I used the Unix commands

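    # Sort by the second field, then numerically by the first.
    # (Originally written in the obsolete syntax "sort +1 -2 +0 -1n".)
    cat text-good.wds \
      | enum-words-in-blocks -v WPB=100 \
      | sort -k2,2 -k1,1n \
      | make-word-location-map -v CTWD=1 -v PERCENT=1 -v NBLOCKS=59 \
      > text.map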

The file "text-good.wds" was the input text: recoded as above, one word per line, in the original order, with "bad" words omitted. enum-words-in-blocks and make-word-location-map are AWK scripts that do most of the work.

## Page revision history

97-08-06 (approx): First version posted.

97-08-08: Added the position average and deviation, and the relative frequencies (right half of each table). Removed the bad words before splitting the file into blocks, so as to keep the block sizes more uniform. Added pointers to similar tables for English and Portuguese.

97-08-14: Prepared a cleaner version of the consensus text, using the HOP encoding (which now merges `m' and `n') before the dynamic programming merge, and accepting a blank in either version as a blank in the consensus.

97-12-10: Rearranged the text in these pages, without substantial changes.

Last edited on 97-10-12 by stolfi