Last edited on 2001-03-29 09:26:37 by stolfi

A Concordance of the VMS

This page describes a complete word index (concordance) of the VMS, which was announced to the Voynich list last year and has now been rebuilt with some improvements for release 1.6e6 of the EVA interlinear file.

Format

The VMS concordance is merely a sorted list of every word of the VMS (and many short phrases), with context, in the following format:


-------------------------------------------- doineeoeeeoe ----------------------

pha f89v2.P1.3   H        /yche* okeey qoeol daiin-chor chor cheos qol **eey 
hea f35v.P.21    ACFH      shy dchy ckhy dan/doiin chor chor=

-------------------------------------------- doineeoeeo ------------------------

hea f47v.P.11    ACH          key chyky-dchy daiin chy/cho chokeesy chy chy 
hea f47v.P.11    F            key chyky-dchy daiin chy/chy chokeesy chy chy 
hea f32r.P.18    ACF         /chokeol dchoty/doiin shoshy/dol dchol dan=
hea f32r.P.18    H           /chokeol dchoty/doiin sho shy/dol dchol dan=

Each entry lists the section, the page and line, the transcriber codes, some left context, the indexed word/phrase, and some right context.

Ordering of entries

The entries are sorted and grouped on the basis of their "pattern", a string derived from the indexed phrase by discarding some easily confused details (such as spaces, plumes, ligatures, gallows eyes, minor shape details, etc.) and the q prefix, if any. For instance the phrases daiiinchy ckhy and daiin/sho cthy yield the same pattern doineeoeteo, and are thus listed together.
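The grouping step can be sketched roughly as below. Note that `toy_pattern` is a hypothetical stand-in: it only removes separators and a leading q, and does not reproduce the glyph equivalences (plumes, ligatures, gallows eyes, etc.) that the actual pattern function applies. The function names are illustrative only.

```python
from collections import defaultdict


def toy_pattern(phrase):
    """Hypothetical stand-in for the real pattern function.

    Only drops spaces, EVA separators and a leading 'q'. The real
    mapping also merges similar-looking glyphs; that is NOT done here.
    """
    bare = "".join(ch for ch in phrase if ch not in " .,/-")
    return bare[1:] if bare.startswith("q") else bare


def group_by_pattern(phrases, pattern=toy_pattern):
    """Return {pattern: [phrases...]}, with the patterns in sorted order."""
    groups = defaultdict(list)
    for phrase in phrases:
        groups[pattern(phrase)].append(phrase)
    return dict(sorted(groups.items()))
```

Any other pattern function can be passed as the `pattern` argument; the grouping and sorting logic stays the same.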

Coverage

This version of the concordance covers all transcriptions in the 1.6e6 interlinear file, not just a "best pick" as in the previous version. In particular it includes Takeshi Takahashi's new complete transcription (code H). It also includes an artificial version (code A, in brighter color) derived from all the other versions by "majority vote", character by character.
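A character-by-character majority vote of this kind could look like the sketch below. It assumes the input versions are already aligned to the same length, and it marks ties with "?"; the choice between "*" and "?" for ties is an assumption, as is the function name.

```python
from collections import Counter

TIE = "?"  # assumed tie marker; "*" may be the one actually used


def majority_vote(versions):
    """Character-by-character majority over aligned, equal-length strings."""
    if len({len(v) for v in versions}) > 1:
        raise ValueError("versions must be aligned to the same length")
    out = []
    for column in zip(*versions):
        ranked = Counter(column).most_common()
        # A tie for first place means there is no majority character.
        if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
            out.append(TIE)
        else:
            out.append(ranked[0][0])
    return "".join(out)
```

For example, three readings "cho", "cho" and "chy" would give "cho", while the two readings "cho" and "chy" alone would give "ch?".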

I believe that all existing VMS text is now completely covered by the interlinear, except perhaps for some text in the rightmost 1/3 of the nine-rosette diagram (f85v/f86r).

This concordance includes all single words (text and labels) from the interlinear. It also includes all short phrases (up to 17 characters) that yield the same pattern as one of those words. Finally it includes every short phrase whose pattern occurs in two or more distinct locations.

On the other hand, the concordance does not list words and phrases that contain "unreadable" characters ("*" or "?"). Note that these characters also denote ties in the majority version; as a result, many words are listed only in the individual versions.

For clarity, the EVA word spaces "." and "," are printed as " ". Indexed phrases may span line breaks ("/") and gaps due to figures or vellum defects ("-"), but not paragraph breaks or page boundaries ("="). For this purpose, each label is considered a single paragraph.
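The rules above for candidate phrases can be illustrated with the following sketch. It enumerates every run of consecutive words inside one paragraph, up to the length limit; it does not apply the further selection by pattern (matching a single word, or occurring in two or more locations). It assumes that only the letters of the words count towards the 17-character limit, and all names are illustrative.

```python
import re

MAX_LEN = 17  # maximum phrase length, per the text


def split_words(paragraph):
    """Return [(separator_before, word), ...] for one paragraph."""
    parts = re.split(r"([-/.,]+)", paragraph)
    words, sep = [], ""
    for i, part in enumerate(parts):
        if i % 2 == 1:      # odd positions hold the separators
            sep = part
        elif part:          # even positions hold the words
            words.append((sep, part))
    return words


def display(sep):
    """Word spaces '.' and ',' print as ' '; '/' and '-' are kept."""
    for mark in "/-":
        if mark in sep:
            return mark
    return " "


def phrases(text, max_len=MAX_LEN):
    """List single words and short phrases, never crossing '='."""
    result = []
    for paragraph in text.split("="):
        words = split_words(paragraph)
        for i in range(len(words)):
            phrase, size = "", 0
            for j in range(i, len(words)):
                sep, word = words[j]
                if "*" in word or "?" in word:
                    break   # unreadable characters are not indexed
                size += len(word)
                if j > i and size > max_len:
                    break   # single words are kept whatever their length
                phrase = word if j == i else phrase + display(sep) + word
                result.append(phrase)
    return result
```

With the input `daiin.chy/cho=dol`, the phrase `daiin chy/cho` is produced, but nothing joins `cho` and `dol`, because "=" separates them.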

For technical reasons, the context phrases are always taken from the majority version, even when the indexed string comes from a "minority" one.

Control experiments

As a kind of "control experiment", mainly to illustrate the effect of "pattern sorting", this site also includes full concordances, built on the same principles, for the following texts:

The concordances

You can browse through the HTML concordances starting at these pages:

Each page has about 1000 lines and 200-400 KB.

To find a given word or phrase, use the alphabetic index:

These indices are about 1.6 MB each. (I can produce a split index if people have trouble downloading the whole thing.)

Compressed archives

The total size of the VMS concordance is 16 MB (1.6MB for the index file and 100-400 KB for each of the 70 sections); and the other concordances aren't much smaller than that. So, if you wish to fetch more than a couple of pages of some concordance, you are advised to download the compressed archive (about 2 MB), and install it on your local disk.

Machine-readable concordances

You can also fetch the concordances in machine-readable format, without the HTML formatting:

Each file is about 1.5 MB compressed, expanding to 4-6 MB. Files ending in ".gz" are compressed with Unix's "gzip"; files ending in ".zip" are in Windows-compatible archive format.

These files have one record for each entry (without the pattern headers), in the format

    LOC TRANS START LENGTH LCTX STRING RCTX PATT STAG PNUM
    1   2     3     4      5    6      7    8    9    10  
where

In STRING, LCTX, and RCTX the symbol "/" is used instead of "-" to denote line breaks, in order to distinguish them from embedded "-" denoting gaps, vellum defects, intruding figures, etc.

The STRING is a single and whole VMS word, as delimited by EVA word separators [-/=,.]; or a sequence of two or more consecutive words, up to a certain maximum length (17 non-space characters in this edition).

The delimiters that surrounded the STRING in the original text are not included in the string itself, but are included in the context strings.
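A minimal reader for these files might look like the sketch below. It assumes that the ten fields are separated by whitespace and that no field contains embedded blanks, which should be checked against the actual files. All values are kept as strings, and the function names are illustrative.

```python
import gzip

# Field names in the order given in the record format above.
FIELDS = ("LOC", "TRANS", "START", "LENGTH", "LCTX",
          "STRING", "RCTX", "PATT", "STAG", "PNUM")


def parse_record(line):
    """Split one record into a {field: value} dict (values kept as text)."""
    values = line.split()
    if len(values) != len(FIELDS):
        raise ValueError(
            "expected %d fields, got %d" % (len(FIELDS), len(values)))
    return dict(zip(FIELDS, values))


def read_records(path):
    """Yield records from a plain or gzip-compressed concordance file."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt", encoding="ascii", errors="replace") as f:
        for line in f:
            if line.strip():  # skip blank lines, if any
                yield parse_record(line)
```

Records can then be regrouped by their PATT field to rebuild the pattern headers of the HTML version.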

The input texts

Finally, you can fetch the texts that were used to create these files:

The VMS text is the interlinear 1.6e6, minus the #-comments. The other texts were recast in the EVMT format, and this process may have introduced errors and loss of text. Please refer to the original sources if you plan to use them for purposes unrelated to this concordance.