Last edited on 2001-03-29 09:26:37 by stolfi

A Concordance of the VMS

This page describes a complete word index (concordance) of the VMS, which was announced to the Voynich list last year and has now been rebuilt with some improvements for release 1.6e6 of the EVA interlinear file.

Format

The VMS concordance is merely a sorted list of every word of the VMS (and many short phrases), with context, in the following format:

-------------------------------------------- doineeoeeeoe ----------------------
pha f89v2.P1.3 H /yche* okeey qoeol daiin-chor chor cheos qol **eey
hea f35v.P.21 ACFH shy dchy ckhy dan/doiin chor chor=
-------------------------------------------- doineeoeeo ------------------------
hea f47v.P.11 ACH key chyky-dchy daiin chy/cho chokeesy chy chy
hea f47v.P.11 F key chyky-dchy daiin chy/chy chokeesy chy chy
hea f32r.P.18 ACF /chokeol dchoty/doiin shoshy/dol dchol dan=
hea f32r.P.18 H /chokeol dchoty/doiin sho shy/dol dchol dan=

Each entry lists the section, the page and line, the transcriber codes, some left context, the indexed word/phrase, and some right context.

Ordering of entries

The entries are sorted and grouped on the basis of their "pattern", a string derived from the indexed phrase by discarding some easily confused details (such as spaces, plumes, ligatures, gallows eyes, minor shape details, etc.) and the q prefix, if any. For instance the phrases daiiinchy ckhy and daiin/sho cthy yield the same pattern doineeoeteo, and are thus listed together.
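The grouping step can be sketched roughly as follows. This is only an illustration, not the program used to build the concordance: the hypothetical `rough_pattern` below discards just the separators and a leading q, and makes no attempt at the other reductions (plumes, ligatures, gallows eyes, minor shape details), so it will not reproduce the patterns shown on this page. Stripping the q only at the start of the whole phrase is also an assumption.

```python
from collections import defaultdict

def rough_pattern(phrase):
    """Partial stand-in for the real pattern reduction (an assumption)."""
    # Discard word spaces, line breaks and gaps.
    squeezed = "".join(ch for ch in phrase if ch not in " .,/-")
    # Discard the q prefix, if any (assumed: only at the start of the phrase).
    if squeezed.startswith("q"):
        squeezed = squeezed[1:]
    return squeezed

def group_by_pattern(phrases):
    """Return {pattern: [phrases...]} with the patterns in sorted order."""
    groups = defaultdict(list)
    for phrase in phrases:
        groups[rough_pattern(phrase)].append(phrase)
    return dict(sorted(groups.items()))

groups = group_by_pattern(["qokeey", "okeey", "daiin chor", "daiin/chor", "dol"])
for pattern, members in groups.items():
    print(f"----- {pattern} -----")
    for member in members:
        print("   ", member)
```

Entries that differ only in the discarded details end up under one header, which is the whole point of sorting by pattern rather than by the literal string.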

Coverage

This version of the concordance covers all transcriptions in the 1.6e6 interlinear file, not just a "best pick" as in the previous version. In particular it includes Takeshi Takahashi's new complete transcription (code H). It also includes an artificial version (code A, in brighter color) derived from all the other versions by "majority vote", character by character.

I believe that all existing VMS text is now completely covered by the interlinear, except perhaps for some text in the rightmost 1/3 of the nine-rosette diagram (f85v/f86r).

This concordance includes all single words (text and labels) from the interlinear. It also includes all short phrases (up to 17 characters) that yield the same pattern as one of those words. Finally it includes every short phrase whose pattern occurs in two or more distinct locations.

On the other hand, the concordance does not list words and phrases that contain "unreadable" characters ("*" or "?"). Note that these characters also denote ties in the majority version, so many words are listed only in the individual versions.
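A character-by-character vote with ties can be sketched as below. This is a minimal illustration under several assumptions that are not stated on this page: the readings are already aligned to equal length, the most frequent character wins (a plurality rather than a strict majority), and a tie is written as "?". The names `majority_version` and `is_indexable` are hypothetical.

```python
from collections import Counter

TIE = "?"  # assumption: which symbol marks a tie is not specified here

def majority_version(readings):
    """Vote column by column over aligned readings of the same text."""
    if len({len(r) for r in readings}) != 1:
        raise ValueError("readings must be aligned to equal length")
    out = []
    for column in zip(*readings):
        ranked = Counter(column).most_common()
        top_char, top_count = ranked[0]
        # A tie occurs when the runner-up has as many votes as the winner.
        tied = len(ranked) > 1 and ranked[1][1] == top_count
        out.append(TIE if tied else top_char)
    return "".join(out)

def is_indexable(word):
    """Words with unreadable characters are left out of the concordance."""
    return "*" not in word and "?" not in word

print(majority_version(["chy", "chy", "cho"]))
print(majority_version(["chy", "cho"]))
```

With only two disagreeing readings the vote yields a tie character, so the word drops out of the majority version and appears only under the individual transcriber codes.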

For clarity, the EVA word spaces "." and "," are printed as " ". Indexed phrases may span line breaks ("/") and gaps due to figures or vellum defects ("-"), but not paragraph breaks or page boundaries ("="). For this purpose, each label is considered a single paragraph.

For technical reasons, the context phrases are always taken from the majority version, even when the indexed string comes from a "minority" one.

Control experiments

As a kind of "control experiment", mainly to illustrate the effect of "pattern sorting", this site also includes full concordances, built on the same principles, for the following texts:

wow: H. G. Wells's War of the Worlds, from a Project Gutenberg electronic edition.

For this concordance, the sorting patterns were obtained by collapsing letters that have similar shape, e.g. "b" and "h", "f" and "t", etc., so that "bore" and "hare" and "dorc" all get listed together.
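A minimal sketch of such shape collapsing is given below. Only the pairs "b"/"h" and "f"/"t" are stated above; the remaining classes ("d" with "b" and "h", "o" with "a", "e" with "c") are inferred from the bore/hare/dorc example and are assumptions, as are the names `SHAPE_CLASSES` and `shape_pattern`. The actual table used for the concordance is surely larger.

```python
# Each string is one class of look-alike letters; the first letter of
# the class serves as its representative in the pattern.
SHAPE_CLASSES = ["bhd", "ft", "oa", "ec"]  # partly assumed, see above

REPRESENTATIVE = {ch: cls[0] for cls in SHAPE_CLASSES for ch in cls}

def shape_pattern(word):
    """Replace every letter by the representative of its shape class."""
    return "".join(REPRESENTATIVE.get(ch, ch) for ch in word.lower())

for word in ["bore", "hare", "dorc"]:
    print(word, "->", shape_pattern(word))
```

All three words map to the same pattern, so a sort on the pattern places them in one group.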

The War of the Worlds concordance has been compressed to save disk space. Get the whole thing compressed in Unix tar+gzip format or DOS/Windows zip format.

lac: The Lewis and Clark expedition journals, by five distinct persons, all with verrie orijinoll spellung, obtained from a site devoted to the expedition prepared by Florentine Films.

Here the sorting patterns were extracted on the basis of sound similarity and a bit of grammatical reduction, so that plurals and verb tenses usually come together (as do their four or five spellings of "buffalo").

eno: The book "1 Enoch" in Geez (classical Ethiopian), from the Electronic Ethiopian Bookshelf maintained by Michal Jerabek of Charles University in Prague.

This book was for some time part of the Bible (fragments of it were found among the Qumran cave manuscripts, dated from shortly before the Christian era). It was excluded from the Jewish and Christian canons at a very early date, and was practically forgotten in Europe, but was retained in the Coptic (Ethiopian) Church canon.

Geez is a Semitic language, related to Arabic and Hebrew. It is normally written with a large syllabic alphabet of its own, the feedel or fidel. The source text I used employs a standard encoding (SERA) of the script, where each syllable is denoted by (usually) a pair of ASCII characters, consonant and vowel. Although SERA was designed to record the script and not the sound, the result is still claimed to be a rough approximation to the actual phonetics, just as the literal-minded Japanese "mitubisi" is not very far from the phonetic-minded "Mitsubishi".

The sorting patterns were defined with the hope of gathering similar-sounding words; however I don't know an iota of Geez, so I have taken the above claim quite literally...

The concordances

You can browse through the HTML concordances starting at these pages:

Each page has about 1000 lines and 200-400 KB.

To find a given word or phrase, use the alphabetic index:

These indices are about 1.6MB each. (I can produce a split index, if people have trouble downloading the whole thing).

Compressed archives

The total size of the VMS concordance is 16 MB (1.6MB for the index file and 100-400 KB for each of the 70 sections); and the other concordances aren't much smaller than that. So, if you wish to fetch more than a couple of pages of some concordance, you are advised to download the compressed archive (about 2 MB), and install it on your local disk.

Machine-readable concordances

You can also fetch the concordances in machine-readable format, without the HTML formatting:

Each file is about 1.5 MB compressed, expanding to 4--6 MB. Files ending in ".gz" are compressed with Unix's "gzip"; files ending in ".zip" are in Windows-compatible archive format.

These files have one record for each entry (without the pattern headers), in the format

LOC TRANS START LENGTH LCTX STRING RCTX PATT STAG PNUM
1   2     3     4      5    6      7    8    9    10

where

LOC is a line locator, like "f1r.11", "f86v2.R1.12a" etc.
TRANS is one or more letters identifying the transcribers, e.g. "AFCD" means the phrase occurs in the majority version ("A"), the FSG transcription ("F"), the Currier/D'Imperio transcription ("C") and its variant copy ("D").
START is the index of the first byte of the occurrence of STRING in the text line (counting from 1). Note that START=1 is actually column 20 of EVMT-formatted text.
LENGTH is the original length of the occurrence in the text, including all fillers, comments, spaces, etc.
LCTX is the left context of STRING (zero or more words; at least one delimiter).
STRING is the non-empty word or phrase in question, without any fillers, comments, non-significant spaces, etc.
RCTX is the right context of STRING (zero or more words; at least one delimiter).
PATT is the sorting pattern, derived from STRING; see above.
STAG is a tag identifying a section of the VMS, e.g. "hea" or "bio".
PNUM is the page's p-number (sequential, hence handy for sorting).
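Reading one such record might look like the sketch below. It assumes that the ten fields are separated by blanks and that no field contains an embedded blank (for example because word spaces inside the context fields are written as "."); neither point is stated above, so check a real file first. The sample record is invented for illustration: its layout follows the field list, but the START, LENGTH and PNUM values are made up. The names `Entry`, `parse_entry` and `evmt_column` are hypothetical.

```python
from typing import NamedTuple

class Entry(NamedTuple):
    loc: str      # line locator, e.g. "f1r.11"
    trans: str    # transcriber codes, e.g. "AFCD"
    start: int    # 1-based index of the first byte of STRING in the line
    length: int   # original length of the occurrence
    lctx: str     # left context
    string: str   # indexed word or phrase
    rctx: str     # right context
    patt: str     # sorting pattern
    stag: str     # section tag, e.g. "hea"
    pnum: int     # sequential page number

def parse_entry(line):
    """Split one record into its ten fields (assumes blank-separated fields)."""
    parts = line.split()
    if len(parts) != 10:
        raise ValueError(f"expected 10 fields, got {len(parts)}")
    return Entry(parts[0], parts[1], int(parts[2]), int(parts[3]),
                 *parts[4:9], int(parts[9]))

def evmt_column(start):
    """START=1 corresponds to column 20 of EVMT-formatted text."""
    return start + 19

# Invented sample record; the numeric values are not from the real files.
SAMPLE = ("f47v.P.11 ACH 16 13 key.chyky-dchy. daiin.chy/cho "
          ".chokeesy.chy.chy doineeoeeo hea 94")

entry = parse_entry(SAMPLE)
print(entry.loc, entry.string, entry.patt, evmt_column(entry.start))
```

Sorting the parsed entries by `(patt, pnum, loc)` would give an order similar in spirit to the HTML listing, although the exact secondary sort keys used there are not documented on this page.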

In STRING, LCTX, and RCTX the symbol "/" is used instead of "-" to denote line breaks, in order to distinguish them from embedded "-" denoting gaps, vellum defects, intruding figures, etc.

The STRING is a single and whole VMS word, as delimited by EVA word separators [-/=,.]; or a sequence of two or more consecutive words, up to a certain maximum length (17 non-space characters in this edition).

The delimiters that surrounded the STRING in the original text are not included in the string itself, but are included in the context strings.
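The rules for STRING can be sketched as a small enumeration of candidate words and phrases. This is an illustration under stated assumptions, not the actual extraction code: "non-space characters" is taken to mean everything except the word spaces "." and ",", the 17-character cap is applied only to multi-word phrases, and a phrase stops at any "=" or at any word containing "*" or "?". The later filtering of phrases by pattern frequency is not attempted. The names `candidate_strings` and `readable` are hypothetical.

```python
import re

MAX_CHARS = 17                        # cap for multi-word phrases
SEPARATORS = re.compile(r"([-/=,.]+)")

def readable(word):
    """A usable word is non-empty and has no unreadable characters."""
    return bool(word) and "*" not in word and "?" not in word

def candidate_strings(line):
    """List single words and consecutive-word phrases found in one line."""
    parts = SEPARATORS.split(line)
    words = parts[0::2]   # may contain "" at either end of the line
    seps = parts[1::2]    # seps[k] lies between words[k] and words[k + 1]
    results = []
    for i, first in enumerate(words):
        if not readable(first):
            continue
        phrase = first
        results.append(phrase)            # single words are always kept
        for j in range(i + 1, len(words)):
            if "=" in seps[j - 1] or not readable(words[j]):
                break                     # no paragraph crossing, no "*"/"?"
            phrase += seps[j - 1] + words[j]
            if sum(ch not in ".," for ch in phrase) > MAX_CHARS:
                break
            results.append(phrase)
    return results

print(candidate_strings("daiin.chor=dol"))
print(candidate_strings("yche*.okeey"))
```

The delimiters inside a multi-word phrase are kept as found, while the delimiters on either side of it are left out, matching the description of STRING above.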

The input texts

Finally, you can fetch the texts that were used to create these files:

The VMS text is the interlinear 1.6e6, minus the #-comments. The other texts were recast in the EVMT format, and this process may have introduced errors and loss of text. Please refer to the original sources if you plan to use them for purposes unrelated to this concordance.