Last edited on 1997-12-30 04:23:26 by stolfi

Word occurrence maps in EVA

Introduction

Here you will find some tables or "maps" showing where certain Voynichese words occur along the text.

These tables attempt to improve on the label occurrence maps and word occurrence maps that I posted earlier this year. Compared to those earlier attempts, this version uses a (presumably) better word matching criterion, and takes word spaces partly into account. It also uses standard EVA, rather than an ad-hoc encoding, and gives more data about what was matched where.

Contents

These maps are rather wide; for convenient viewing, set your browser's font size to 10 pt or less.

If you want to know all the boring details, you can browse the notebook file with the Unix recipes I used to make these maps, and the working directory mentioned therein.

The input text

To build this map, I used the "Friedman" transcription, originally published by Jim Reeds. I extracted the text from G. Landini's interlinear file (interln16), and converted it to the EVA encoding.

The Friedman transcription is incomplete, and contains many errors. I considered plugging some of the gaps with the other transcriptions present in Landini's file. However, since different people make different kinds of errors, doing so would have added another source of non-uniformity and made the maps harder to interpret.

I also considered building a mechanical "consensus" of the multiple transcriptions, as I did in my previous analyses of the biological section. However, a usable consensus can be obtained only after mapping all versions to some "robust" encoding, discarding character details such as ligatures, eyes, and plumes. Doing so would reduce the usefulness and readability of the final maps. Moreover, the consensus-building algorithm will tend to eliminate certain easy-to-misread words (such as those ending with -i*n) in sections where there are two versions, and keep them where there is only one version, which only adds more noise to an already noisy map.

The following tables give the number of "good" words (without * characters) in the input text, classified by section and language:

      words section.lang        words lang 
      ----- ------------       ------ ---- 
        687 ?.A                   343 ?    
       1462 ?.B                 10429 A    
        173 astro.?             22254 B    
       6690 bio.B
        170 cosmo.?
        139 cosmo.B
       7571 herbal.A
       3336 herbal.B
       2171 pharma.A
      10627 stars.B
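
As a quick consistency check, the per-language totals in the right-hand table should equal the sums of the section counts in the left-hand one. A minimal Python sketch of that check follows; the dictionary layout and the function name are illustrative assumptions, not part of the original recipes.

```python
from collections import defaultdict

# Good-word counts by "section.lang", copied from the left-hand table.
COUNTS = {
    "?.A": 687, "?.B": 1462, "astro.?": 173, "bio.B": 6690,
    "cosmo.?": 170, "cosmo.B": 139, "herbal.A": 7571,
    "herbal.B": 3336, "pharma.A": 2171, "stars.B": 10627,
}

def totals_by_language(counts):
    """Sum the word counts over all sections for each language tag."""
    totals = defaultdict(int)
    for key, n in counts.items():
        # The language tag is whatever follows the last dot.
        _section, lang = key.rsplit(".", 1)
        totals[lang] += n
    return dict(totals)

print(totals_by_language(COUNTS))
```

The result agrees with the right-hand table: 10429 words for A, 22254 for B, and 343 for the unknown language.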

In the process of translating the Friedman transcription to EVA, I fixed a couple dozen spaces and line breaks that were obviously wrong (as determined from page images or by comparison with other transcriptions). I also separated John Grove's "titles" from the surrounding plain text, and regularized the location codes. For more details, and the complete EVA interlinear, see the description of the interln16e2 package.

Word patterns, similar words, and matching rules

My previous hunts for label occurrences convinced me that most labels can be found in the main text, in plausible locations, but only by using approximate matching criteria that ignore certain details of the character shapes.

For these maps, I considered two strings of EVA characters to be similar if the following operations reduced both to the same string:

  1. delete all {}-comments, blanks, and newlines;
  2. delete all EVA fillers [%!];
  3. map {sh ch cth ckh cph cfh} to {ee ee ete eke epe efe};
  4. map s and r to e;
  5. map k to t and f to p;
  6. replace every occurrence of ei by a;
  7. map a and y to o;
  8. collapse any string of two or more i's to a single i;
  9. map {j g m} to d;
  10. delete any q followed by any letter in [oayeclktp];
  11. delete all word-space characters [-=.,].

The result of applying these operations to an EVA string is the pattern of that string. Thus, for example, the patterns of qoteedy, yksh.dy and yted.o are oteedo, oteedo, and otedo, respectively. Therefore qoteedy is similar to yksh.dy, and neither is similar to yted.o.
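
For illustration, the eleven reduction steps could be sketched in Python roughly as follows. This is not the original Unix recipe: the names pattern and similar are hypothetical, and the sketch assumes that the ligature groups of step 3 are replaced in a single left-to-right pass.

```python
import re

# Step 3: ligature groups and their replacements (assumed simultaneous).
_LIGATURES = {"sh": "ee", "ch": "ee", "cth": "ete",
              "ckh": "eke", "cph": "epe", "cfh": "efe"}
_LIGATURE_RE = re.compile(r"cth|ckh|cph|cfh|sh|ch")

def pattern(s):
    """Reduce an EVA string to its pattern, applying steps 1-11 in order."""
    s = re.sub(r"\{[^}]*\}", "", s)          # 1. delete {}-comments,
    s = re.sub(r"\s+", "", s)                #    blanks, and newlines
    s = re.sub(r"[%!]", "", s)               # 2. delete fillers
    s = _LIGATURE_RE.sub(lambda m: _LIGATURES[m.group()], s)  # 3.
    s = s.translate(str.maketrans("sr", "ee"))    # 4. s, r -> e
    s = s.translate(str.maketrans("kf", "tp"))    # 5. k -> t, f -> p
    s = s.replace("ei", "a")                      # 6. ei -> a
    s = s.translate(str.maketrans("ay", "oo"))    # 7. a, y -> o
    s = re.sub(r"i{2,}", "i", s)                  # 8. ii... -> i
    s = s.translate(str.maketrans("jgm", "ddd"))  # 9. j, g, m -> d
    s = re.sub(r"q(?=[oayeclktp])", "", s)        # 10. drop q before a letter
    s = re.sub(r"[-=.,]", "", s)                  # 11. drop word spaces
    return s

def similar(a, b):
    """Two EVA strings are similar when they reduce to the same pattern."""
    return pattern(a) == pattern(b)

print(pattern("qoteedy"), pattern("yksh.dy"), pattern("yted.o"))
```

Under these assumptions the sketch reproduces the three patterns quoted above: oteedo, oteedo, and otedo.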

In the tables, the notation xxx~ stands for the set of all strings that are similar to string xxx. Thus, qoteedy~ stands for qoteedy, oteedy, otchdy, okchmo, and all other words that have the same pattern (oteedo) as qoteedy.

Roughly speaking, this definition amounts to ignoring ligatures, plumes, the left "eyes" in gallows characters, q prefixes, repeated i's, the straightness of strokes in r, a, m, and the shape and length of the "tail" in y, m, j, g.

Some of these rules attempt to compensate for known sources of transcription errors: for instance, the ligatures are often too faint to see, and so is the difference between a and o. Other rules erase distinctions that, in the light of previous statistics, do not seem to affect the meaning of the word: that is the case of the q prefix, and of the difference between t and k.

Note that the order of the reduction steps is important in some cases: for example, in eiin the ei must be reduced to a before the ii gets reduced to i, so that we get ain and not an (which step 7 then turns into oin rather than on).

When looking for a string in the text, I ignored embedded EVA word spaces, but limited the comparison to strings consisting of one or more whole words. Thus the string qo.kshedo was considered similar to the strings .qotch.dy. and .qoteedy. in the text, but not .qokshed.okee..
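
The whole-word restriction can be sketched as follows. This is only an illustration under stated assumptions: the function names are hypothetical, and strip_spaces is a deliberately simplified stand-in that only deletes word spaces; any implementation of the eleven reduction steps could be passed as key instead.

```python
import re

def strip_spaces(s):
    """Stand-in for the full pattern reduction: only drops word spaces."""
    return re.sub(r"[-=.,]", "", s)

def find_whole_word_matches(query, line, key=strip_spaces):
    """Return (first, last) word-index pairs for every run of consecutive
    whole words in `line` whose reduced form equals that of `query`."""
    target = key(query)
    words = [w for w in re.split(r"[-=.,]+", line) if w]
    hits = []
    for i in range(len(words)):
        for j in range(i, len(words)):
            # Rejoin the run with a word space so that `key` sees the
            # same kind of input it would see in the running text.
            if key(".".join(words[i:j + 1])) == target:
                hits.append((i, j))
    return hits

# Matches words 1..2 ("qo" + "keedy"), spanning an embedded word space.
print(find_whole_word_matches("qokeedy", "daiin.qo.keedy.chol"))
# No match: the query would end in the middle of the word "ykee".
print(find_whole_word_matches("qokeedy", "qokeed.ykee"))
```

Trying every run of words is quadratic in the line length, which seems acceptable for lines of a dozen words or so; a faster scan would be needed only for much longer units of text.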

Text blocks

The word occurrence maps are essentially tables that give the number of times that each word (associated with a row of the table) appears in each of several blocks of the text (associated with columns of the table).
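
In code, such a table amounts to one counter per block. The sketch below is a simplified illustration, not the recipe actually used: the names are hypothetical, the sample words are made up, and it counts single words only, without the multi-word matching described earlier. The key argument is where a pattern reduction would be plugged in; by default words are compared verbatim.

```python
from collections import Counter

def occurrence_map(blocks, targets, key=lambda w: w):
    """Return {target: [count in block 0, count in block 1, ...]}.

    `blocks` is a list of word lists, one per text block; `key` reduces
    a word to the form used for comparison (identity by default).
    """
    tallies = [Counter(key(w) for w in block) for block in blocks]
    # A Counter returns 0 for missing keys, so absent words need no
    # special handling.
    return {t: [tally[key(t)] for tally in tallies] for t in targets}

# Two tiny made-up blocks, just to show the shape of the result.
blocks = [["daiin", "chol", "daiin"], ["chol", "shedy"]]
print(occurrence_map(blocks, ["daiin", "shedy"]))
```

Each row of the result corresponds to one row of a map, and each list position to one column.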

A natural definition for a block could be one page of the Vms. A problem with this choice is that there are too many pages; the maps would be far too wide to display effectively. Another problem is that the amount of text varies quite a lot from page to page; this variation would mask any variation in the word counts due to changes of subject matter.

Ideally, each block should contain the same number of good words: about 1100, if we want to have about 30 blocks. On the other hand, ideally each block should be homogeneous as to language and section, and adjacent blocks should map to adjacent columns. It is practically impossible to meet all three ideals, since the two languages are interleaved at the folio level in the herbal section.

So I settled for a compromise solution: I divided the pages into three groups according to their language ("A", "B", or "unknown"), and then split the text in each group into blocks containing an equal number of words.

Specifically, I assigned the "A" pages to blocks 0 to 9 (with 1043 words each), the "unknown" pages to block 10 (343 words), and the "B" pages to blocks 11 through 31 (1060 words each). Within each of these three major sections, the block order agrees with the Vms page order. In each map, the page where each block begins is printed vertically at the top of the corresponding column.
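
The block assignment can be sketched as below. This is an assumption-laden illustration rather than the actual procedure: the function names are hypothetical, and the split is made at arbitrary word positions, whereas the real blocks may have been aligned differently.

```python
def split_evenly(words, nblocks):
    """Split `words` into `nblocks` consecutive blocks whose sizes differ
    by at most one word."""
    n = len(words)
    bounds = [(i * n) // nblocks for i in range(nblocks + 1)]
    return [words[bounds[i]:bounds[i + 1]] for i in range(nblocks)]

def assign_blocks(words_a, words_unknown, words_b):
    """Blocks 0-9 hold the A text, block 10 the unknown-language text,
    and blocks 11-31 the B text, each group kept in page order."""
    blocks = split_evenly(list(words_a), 10)
    blocks.append(list(words_unknown))
    blocks += split_evenly(list(words_b), 21)
    return blocks

# Dummy word lists with the sizes given in the word-count table.
blocks = assign_blocks(range(10429), range(343), range(22254))
print([len(b) for b in blocks])
```

With the word counts given earlier, the A blocks come out at 1042 or 1043 words and the B blocks at 1059 or 1060, consistent with the figures quoted above.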

Analysis and conclusions

Duhh....

men at work