Last edited on 2000-06-14 05:31:00 by stolfi.

A Grammar for Voynichese Words

Jorge Stolfi

Introduction
The paradigm
Normal and abnormal words
The three-layer model
- The letters a o y
- The letter e
The crust layer
The mantle layer
The core layer
Distribution of the circles
Abnormal words
Sectional variation
Discussion and conjectures
References
Data files
Unix scripts

Introduction

The text of the Voynich manuscript (VMS) is clearly divided into word-like symbol groups by fairly distinct spaces. It has long been known that those "Voynichese words" have a non-trivial internal structure, manifested by constraints on the sequence and position of different symbols within each word.

Several structural models or "paradigms" for the Voynichese lexicon (or subsets thereof) have been proposed over the last 80 years, e.g. by Tiltman [1], M. Roe [2], R. Firth [3], and even the undersigned [4,5]. This note describes a new paradigm that seems to be significantly more detailed and comprehensive than those previous models.

The new paradigm incorporates and refines our previous models, the OKOKOKO paradigm [4] and the crust-core-mantle decomposition [5]. Moreover, it attempts not only to reproduce the set of valid words, but also to display the presumed combinatorial structure of those words, and the frequency distribution of various word classes.

The paradigm also provides strong support for John Grove's theory that many ordinary-looking words occur prefixed with a spurious "gallows" letter (k t p f in the EVA alphabet).

The nature and complexity of the paradigm, and its fairly uniform fit over all sections of the manuscript (including the labels on illustrations), are further evidence that the text has significant contents of some sort. Moreover, the paradigm imposes severe constraints on possible "decipherment" theories. In particular, it seems highly unlikely that the text is a Vigenère-style cipher, or was generated by a random process, or is a simple transliteration of an Indo-European language. On the other hand, the paradigm seems compatible with a codebook-based cipher (like Kircher's universal language), an invented language with systematic lexicon (like Dalgarno's), or a non-European language with largely monosyllabic words.

The paradigm

Our word paradigm is expressed as a context-free grammar, whose terminal strings are word-like strings in the EVA encoding:

The Voynichese word grammar.

Grammar notation

The notation should be fairly straightforward. There is one section in the grammar for each non-terminal symbol, in the format

  NTSYMB:
     COUNT₁ FREQ₁ CUMFREQ₁ DEF₁
     COUNT₂ FREQ₂ CUMFREQ₂ DEF₂
     ...
     COUNT_m FREQ_m CUMFREQ_m DEF_m

where NTSYMB is the non-terminal symbol being defined, and each DEF_i is an alternative replacement for it. In conventional notation, the rule above would be written

  NTSYMB -> DEF₁ | DEF₂ | ... | DEF_m

In the rewrite strings DEF_i, the terminal symbols are Voynichese letters in the basic EVA encoding, always in lowercase; while non-terminal symbols always begin with a capital letter. The period "." here denotes the empty string, and is also used as a symbol separator or concatenation operator. The comments in green are not part of the model.

The fields to the left of each rewrite rule define its frequency of use. The first field COUNT_i is the number of times the rule gets used when parsing the VMS text. The second field FREQ_i is the relative frequency of the alternative, that is, the ratio of COUNT_i relative to the total COUNT of all alternatives of NTSYMB. The third field CUMFREQ_i is the sum of all FREQ_j in the section, up to and including FREQ_i.

The fields COUNT_i, FREQ_i, and CUMFREQ_i take into account the word frequencies in the text. Thus, for example, 100 occurrences (tokens) of the word darar would count as 100 uses of the rule R -> d, and 200 uses of the rule R -> r.
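As an illustration of how the three fields relate, here is a minimal Python sketch that recomputes FREQ and CUMFREQ from a list of COUNT values. The function name rule_frequencies is ours, introduced only for this example; the counts are those of the three alternatives of the IN symbol quoted elsewhere in this note.

```python
def rule_frequencies(counts):
    """Return (FREQ, CUMFREQ) lists for the alternatives of one non-terminal."""
    total = sum(counts)
    freqs = [c / total for c in counts]
    cumfreqs = []
    running = 0.0
    for f in freqs:
        running += f              # CUMFREQ_i includes FREQ_i itself
        cumfreqs.append(running)
    return freqs, cumfreqs

# COUNT fields of the alternatives i.N, ii.N and iii.N of the IN symbol.
in_counts = [1770, 4019, 98]
freqs, cumfreqs = rule_frequencies(in_counts)
for count, f, cf in zip(in_counts, freqs, cumfreqs):
    print(f"{count:7d} {f:.5f} {cf:.5f}")
```

The printed lines should agree with the COUNT, FREQ and CUMFREQ columns of that rule, up to rounding.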

Why the frequencies?

The primary purpose of the COUNT and FREQ fields is to express the relative "normalness" of each word pattern. We think that, at the present state of knowledge, this kind of statistical information is essential in any useful word paradigm.

The text is contaminated by sampling, transcription, and possibly scribal errors, amounting to a few percent of the text tokens --- which is probably the rate of many rare but valid word patterns. Thus, a purely qualitative model would have to either exclude too many valid patterns, or allow too many bogus ones. By listing the rule frequencies, we can be more liberal in the grammar, and list many patterns that are only marginally attested in the data, while clearly marking them as such.

Predicting word frequencies

Apart from their primary purpose, the FREQ fields also allow us to assign a predicted frequency to each word, which is obtained by multiplying the FREQ fields in all rules used in the word's derivation, and adding these numbers for all possible derivations. (Actually there is at most one, since the grammar happens to be unambiguous.)

It would be nice if the predicted word frequencies matched the frequencies observed in the Voynich manuscript. Unfortunately this is not quite the case, at least for the highly condensed grammar given here.

The mismatch between observed and predicted frequencies is largely due to dependencies between the various choices that are made during the derivation. For instance, suppose the grammar contained the following rules:

  Word:
        100 1.00 1.00  Y.Y
        

  Y:
        100 0.50 0.50  y
        100 0.50 1.00  o

This grammar generates the words oo, oy, yo and yy, and assigns to them the same predicted frequency (0.25). However, the rule counts and frequencies are equally consistent with a text where oo and yy occur 50 times each, while oy and yo do not occur at all --- or vice-versa. In other words, the grammar does not say whether the choice of the first Y affects the choice of the second Y.

These dependencies are actually quite common in Voynichese (and in all natural languages). In English text one will find plenty of can, cannot, and man, but hardly any mannot. In Voynichese daiin, qokeedy and qokaiin are all very popular (866, 305, 266 occurrences, respectively), while deedy is essentially nonexistent (3 occurrences). Our paradigm fails to notice this asymmetry, since it allows independent choices between d- and qok-, and between -aiin and -eedy.
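The point can be checked mechanically. The following Python sketch (the two sample texts and all names are hypothetical, made up for this illustration) computes the predicted frequencies of the toy grammar, and verifies that an "independent" text and a fully "correlated" text produce the same rule counts.

```python
from collections import Counter
from itertools import product

# FREQ fields of the toy rule Y -> y | o.
Y_FREQ = {"y": 0.50, "o": 0.50}

# Word -> Y.Y: the predicted frequency is the product of the FREQs used.
predicted = {a + b: Y_FREQ[a] * Y_FREQ[b] for a, b in product(Y_FREQ, repeat=2)}

def rule_counts(text):
    """Count the uses of each alternative of Y when parsing a {word: tokens} text."""
    uses = Counter()
    for word, tokens in text.items():
        for letter in word:
            uses[letter] += tokens
    return uses

independent = {"oo": 25, "oy": 25, "yo": 25, "yy": 25}
correlated = {"oo": 50, "yy": 50}   # oy and yo never occur

print(predicted)
print(rule_counts(independent) == rule_counts(correlated))
```

Both texts use each alternative of Y exactly 100 times, so the COUNT and FREQ fields alone cannot tell them apart.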

Why a grammar?

Although our paradigm is formulated as a context-free grammar, it actually defines a regular (or rational) stochastic language. Therefore, the grammar could be replaced, in principle, by an equivalent probabilistic finite-state automaton (i.e., a Markov-style model).

However, we believe that the grammar notation is more convenient and readable than the equivalent automaton, for several reasons. For one thing, it is more succinct: a single grammar rule with N symbols on the right-hand side would normally translate into N or more states in the automaton. Moreover, although our grammar is unambiguous, it is not left-to-right deterministic; therefore the equivalent automaton would be either non-deterministic, or would have a very large number of "still undecided" states.

(In fact, our grammar is not recursive, and thus generates a large but finite set of words. We could have simplified some rules by making them recursive (e.g. CrS), but then the rule probabilities would be much harder to interpret.)

Implied word structure

The grammar not only specifies the valid words, but also defines a parse tree for each word, which in turn implies a nested division of the same into smaller parts.

Some of this "model-imposed" structural information may be significant; for example, we believe that our parsing of each word into three nested layers must correspond to a major feature of the VMS encoding or of its underlying plaintext.

However, the reader should be warned that the overriding design goals for the grammar were to reproduce the observed set of words as accurately as possible, while ensuring unambiguous parsing. Therefore, one should not give too much weight to the finer divisions and associations implied by our parse trees. For example, our grammar arbitrarily associates each o letter to the letter at its right, although the evidence for such association is ambiguous at best.

Said another way, there are many grammars that would generate the same set of words, even the same word distributions, but with radically different parsings. Further study is needed to decide which details of the word decomposition are "real" (necessary to match the data), and which are arbitrary.

Coverage versus simplicity

When designing the grammar, we tried to strike a useful balance between a simple and informative model and one that would cover as much of the corpus as possible. In particular, we generally omitted rules that were used by only one or two tokens from the corpus, since those could be abbreviations, split words, or transcription errors. However, some of those rules seemed quite natural in light of the overall structure of the paradigm. It may be worth restoring some of those low frequency rules, for the sake of making the grammar more logical.

For example, the present grammar defines

IN:
   1770 0.30066 0.30066 i.N
   4019 0.68269 0.98335 ii.N
     98 0.01665 1.00000 iii.N

N:
   5246 0.89112 0.89112 n
    554 0.09411 0.98522 r
     24 0.00408 0.98930 l
     54 0.00917 0.99847 m
      9 0.00153 1.00000 s

These rules do not accommodate words containing iiii, ix, or id --- like oiiiin, rokaix, or daid (1 occurrence each). Yet iiii with a count of 1 would be a logical extrapolation of the i series; and, in other contexts, d and x clearly belong to the same class as r, l, s.
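Since these two rules are not recursive, the strings they generate can simply be listed. The Python sketch below (variable names are ours) expands IN into its 15 possible strings, assigns each the product of the two FREQ fields, and confirms that the endings of the three rare words are not among them.

```python
from itertools import product

# (FREQ, DEF) pairs copied from the IN and N rules above.
IN_RULES = [(0.30066, "i"), (0.68269, "ii"), (0.01665, "iii")]
N_RULES = [(0.89112, "n"), (0.09411, "r"), (0.00408, "l"),
           (0.00917, "m"), (0.00153, "s")]

# Every IN group is one choice of i-strokes followed by one choice of N.
in_groups = {
    strokes + final: f_in * f_n
    for (f_in, strokes), (f_n, final) in product(IN_RULES, N_RULES)
}

# The three groups with the highest predicted frequency.
for group, freq in sorted(in_groups.items(), key=lambda kv: -kv[1])[:3]:
    print(f"{group:5s} {freq:.4f}")

# Endings of oiiiin, rokaix and daid: none of them is generated.
print([g for g in ("iiiin", "ix", "id") if g in in_groups])
```

Note that these are frequencies predicted under the independence assumption discussed earlier, not observed counts.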

Normal and abnormal words

The grammar's starting non-terminal symbol (the axiom or root) is Word. For convenience, the grammar actually generates all the words that occur in the VMS transcription. Our paradigm proper consists of the sub-grammar rooted at the symbol NormalWord. The exceptions --- VMS words that do not follow our paradigm --- are listed as derivations of the symbol AbnormalWord.

It should be noted that normal words account for over 88% of all label tokens, and over 96.5% of all the tokens (word instances) in the text. The exceptions (fewer than 4 in every 100 text words) can be ascribed to several causes, including physical "noise" and transcription errors. (Different people transcribing the same page often disagree on their reading, with roughly that same frequency.) Indeed, most "abnormal" words are still quite similar to normal words, as discussed in a later section.

The three-layer model

As in our previous model [5], the normal words are parsed into three major nested "layers" --- crust, mantle, and core --- each composed from a specific subset of the Voynichese alphabet:

core:	`t` `p` `k` `f` `cth` `cph` `ckh` `cfh`
mantle:	`ch` `sh` `ee`
crust:	`d` `l` `r` `s` `n` `x` `i` `m` `g`

Although each of these layers can be empty, the three-layer structure is definitely non-trivial: it rules out, for example, words with two core letters bracketing a mantle or crust letter. More generally, suppose we assign "densities" 1, 2, and 3 to crust, mantle, and core letters, respectively, and ignore the remaining letters. The paradigm then says that the density profile of a normal word is a single unimodal hill, without any internal minimum.

In other words, as we move away from any maximum-density letter in the word, in either direction, the density can only decrease (or remain constant). The possible density profiles (ignoring repeated digits) are

  1 2 3  
  12 21 13 31 23 32 
  121 123 131 132 231 232 321
  1231 1232 1321 2321
  12321

Note that these are a proper subset of the possible profiles with 3 or more letters. In particular, the profiles 212 213 312 313 323 are excluded by our paradigm.
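The list of profiles can be reproduced by brute force. Here is a small Python sketch (the function names are ours) that enumerates all digit strings over 1, 2, 3 without repeated adjacent digits, and keeps those that never rise again after they have started to fall.

```python
from itertools import product

def is_unimodal(profile):
    """True if the densities never rise again once they have started to fall."""
    falling = False
    for prev, cur in zip(profile, profile[1:]):
        if cur < prev:
            falling = True
        elif cur > prev and falling:
            return False
    return True

def profiles(max_len=5):
    """All profiles over densities 1..3 with no repeated adjacent digit."""
    for n in range(1, max_len + 1):
        for p in product("123", repeat=n):
            if all(a != b for a, b in zip(p, p[1:])):
                yield "".join(p)

allowed = [p for p in profiles() if is_unimodal(p)]
excluded3 = [p for p in profiles(3) if len(p) == 3 and not is_unimodal(p)]
print(len(allowed), excluded3)
```

A unimodal profile without repeated digits cannot be longer than 12321, so the length bound of 5 loses nothing.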

Among the EVA letters not listed above, most are so rare that it seems pointless to include them in the "normal word" paradigm. Only the letters { e a o y } are frequent enough to merit special attention.

The letters `a o y`

The distribution of the "circles", the EVA letters { a o y }, is rather complex. They may occur anywhere within the three main layers, as we discuss later on. It is still an open question whether the circles are independent letters, or modifiers for adjacent letters, or both.

We have arbitrarily chosen to parse each circle as if it were a modifier of the next non-circle letter; except that a circle at the end of the word (usually a y) is parsed as a letter by itself. Thus olkchody is parsed as ol-k-ch-od-y. We have no convincing excuse for this choice, except that the circles behave quite differently from the more numerous non-circles, so placing both at the same level in the grammar would obscure the structure of the non-circles.
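The convention just described can be sketched in a few lines of Python. This is only an illustration, not the grammar itself: the list of multi-character EVA units is an assumption of the sketch, the function name is ours, and the special treatment of the letter e is not attempted.

```python
import re

# EVA units treated as single letters here (an assumption of this sketch).
UNIT = re.compile(r"cth|ckh|cph|cfh|ch|sh|[a-z]")
CIRCLES = set("aoy")

def attach_circles(word):
    """Group each run of circles with the next non-circle unit."""
    groups, pending = [], ""
    for unit in UNIT.findall(word):
        if unit in CIRCLES:
            pending += unit
        else:
            groups.append(pending + unit)
            pending = ""
    if pending:                 # circles left over at the end of the word
        groups.append(pending)
    return "-".join(groups)

print(attach_circles("olkchody"))
```

For olkchody the sketch yields the same division as the one given in the text.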

The letter `e`

In normal words, the letter e, when not part of an ee group, is almost always located in or next to the core and mantle layers, almost always after a non-empty core or mantle. In fact, only 2% of the isolated e occur at the beginning of the mantle and core layers, while over 90% of them occur at the end of those layers. Therefore, we have chosen to parse isolated e letters as part of the preceding mantle or core letter (while allowing for an occasional o insertion between the two). Thus, for example, okoecheody is parsed as (o(k)oe)((ch)e)(o(d))(y).

Very rarely --- about 70 occurrences --- e occurs alone, surrounded by crust letters; in which case we parse it as the only letter in the mantle layer.

The crust layer

The crust layer is the part of the word consisting of the letters q d l r s x i m g, with their a o y pre-modifiers and final y or o, if any. In normal words, the crust comprises either the whole word (almost exactly 25% of the normal tokens), or a prefix and a suffix thereof (75%).

There are 459 tokens (1.3%) where the crust is not the outermost layer --- that is, where bits of crust are bracketed by non-crust letters on both sides. Most of these exceptions are actually Grove words.

Initial and final groups

The crust is not homogeneous, actually. The letter q only occurs at the beginning of a normal word, although in a few instances (less than 0.4% of all qs) it is preceded by o or y. The letter y occurs almost exclusively in word-initial or word-final position. The letter m occurs almost exclusively in word-final position. The same is true of the IN clusters, which consist of one or more i letters, usually followed by n, r, or (more rarely) by l, m, or s. Conversely, a substantial fraction of the words end with y, o, m or IN.

The letter q rarely occurs at the beginning of paragraphs or in labels, which may mean that it is a grammatical particle (article, preposition, etc.). Indeed, its inclusion in the crust layer of our model is arbitrary --- it might as well be considered a separate layer.

About `m` and `g`

It seems that the letter m is inordinately common at the end of lines, and before interruptions in the text due to intruding figures. The letter m, like the IN groups, is almost always preceded by a or o (862 tokens in 950, 91%). We note also that dam and am are the most common -am words, just as daiin and aiin are the most common -aiin words. Perhaps m is an abbreviation for iin (and/or other IN groups), used where space is tight.

On the other hand, the truth may not be that simple. Of the 950 tokens that contain m, in 56 (5.8%) the m is preceded by ai or aii rather than a alone.

The rare letter g, like m, occurs almost exclusively at the end of words (24 tokens out of 27); however, unlike m, it is not preceded by a. We note that g looks like an m, except that the leftmost stroke is rounded like that of an a. Perhaps g is an abbreviation of am?

There are 32 tokens that end in m, but not as am, om, or im. It is possible that these tokens are actually instances of g that were mistakenly transcribed as m --- a fairly common mistake.

The letters `d l r s`

The bulk of the crust consists of a variable number of "dealers", the letters d l r s.

Almost exactly 1/4 of the normal words have no core or mantle, only the crust layer. Here is the distribution of the number of dealers in these words, tabulated separately for words with and without the initial q letter:

        without q              with q
    --------------------  --------------------
      221 0.02662 .          38 0.08482 q   
     3565 0.42941 r         299 0.66741 qr  
     4066 0.48976 rr        109 0.24330 qrr 
      413 0.04975 rrr         2 0.00446 qrrr
       36 0.00434 rrrr   
        1 0.00012 rrrrr

As we can see, crust-only words without q have between 0 and 3 dealers (most often 1 or 2, 1.57 on the average). Those with q have between 0 and 2 dealers (most often 1 or 2, 1.17 on the average), not counting the q. We could say that the q counts as 0.4 of a dealer.
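The averages quoted above are token-weighted means of the two columns of the table. A minimal Python sketch of that arithmetic (the dictionary and function names are ours):

```python
# Token counts by number of dealers, copied from the table above.
without_q = {0: 221, 1: 3565, 2: 4066, 3: 413, 4: 36, 5: 1}
with_q = {0: 38, 1: 299, 2: 109, 3: 2}

def mean_dealers(counts):
    """Token-weighted average number of dealers."""
    tokens = sum(counts.values())
    return sum(n * c for n, c in counts.items()) / tokens

avg_plain = mean_dealers(without_q)
avg_q = mean_dealers(with_q)
print(f"{avg_plain:.2f} {avg_q:.2f} {avg_plain - avg_q:.2f}")
```

The results should land within rounding of the 1.57 and 1.17 quoted in the text; their difference is the "0.4 of a dealer" attributed to q.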

In words that have a split crust (non-empty core and/or mantle), the dealers are mostly located in the crust suffix. Here are the counts for various patterns of dealers, in words with and without q-letters. (The "-" denotes the core and/or mantle component.)

        without q           with q (as affix)    with q (as dealer)
    --------------------  --------------------  --------------------
     5130 0.25594 -         1277 0.27713 q-   
                            
    10572 0.52744 -r        3100 0.67274 q-r     
      820 0.04091 r-          45 0.00977 qr-     1277 0.27713 q-           
                            
     1565 0.07808 -rr        144 0.03125 q-rr    
     1579 0.07878 r-r         38 0.00825 qr-r    3100 0.67274 q-r  
       59 0.00294 rr-          0 0.00000 qrr-      45 0.00977 qr-  
                            
      112 0.00559 -rrr         2 0.00043 q-rrr   
      103 0.00514 r-rr         1 0.00022 qr-rr    144 0.03125 q-rr 
       94 0.00469 rr-r         1 0.00022 qrr-r     38 0.00825 qr-r 
        1 0.00005 rrr-         0 0.00000 qrrr-      0 0.00000 qrr- 
                            
        2 0.00010 -rrrr                          
        0 0.00000 r-rrr                             2 0.00043 q-rrr
        3 0.00015 rr-rr                             1 0.00022 qr-rr
        3 0.00015 rrr-r                             1 0.00022 qrr-r
        0 0.00000 rrrr-                             0 0.00000 qrrr-

        1 0.00005 rr-rrr

If we view the q letter as an independent affix (second column), the distribution of dealer patterns in q-words seems similar to that of words without q (first column), except for a noticeable bias in the former towards shorter words. Note in particular that -r and q-r are the most popular patterns in the two classes. On the other hand, if we try to view q as a dealer (third column), the distributions don't match at all. Thus the first interpretation seems to be the more correct of the two.

Here is again the same data, in a different format:

                             suffix length
                             -------------
                 0        1       2       3       4      avg
              +-------+-------+-------+-------+-------+-------+
prefix      0 |  5130 | 10572 |  1565 |   112 |     2 |  0.81 |
length        +-------+-------+-------+-------+-------+-------+
            1 |   820 |  1579 |   103 |       |       |  0.71 |
              +-------+-------+-------+-------+-------+-------+
            2 |    59 |    94 |     3 |     2 |       |  0.67 |
              +-------+-------+-------+-------+-------+-------+
            3 |     1 |     3 |       |       |       |  0.75 |
              +-------+-------+-------+-------+-------+-------+
          avg |  0.16 |  0.15 |  0.07 |  0.00 |  0.00 |       |
              +-------+-------+-------+-------+-------+-------+

There seems to be a slight negative dependence between the average length (number of dealers) of the crust prefix and suffix; but that effect may well be the result of transcribers inserting bogus word breaks in longer words.

In any case, the average lengths are 0.14 dealers in the prefix, 0.80 in the suffix, and 0.94 in the whole word. Note that this number is substantially less than the average length of crust-only words.
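These overall averages follow directly from the prefix/suffix table. The Python sketch below redoes the computation; the matrix layout and variable names are ours, and blank cells of the table are taken to be zero.

```python
# counts[prefix_len][suffix_len], copied from the table above;
# blank cells are assumed to be 0.
counts = [
    [5130, 10572, 1565, 112, 2],
    [820, 1579, 103, 0, 0],
    [59, 94, 3, 2, 0],
    [1, 3, 0, 0, 0],
]

total = sum(sum(row) for row in counts)
prefix_avg = sum(p * c for p, row in enumerate(counts) for c in row) / total
suffix_avg = sum(s * c for row in counts for s, c in enumerate(row)) / total
print(f"{prefix_avg:.2f} {suffix_avg:.2f} {prefix_avg + suffix_avg:.2f}")
```

The three numbers should agree, up to rounding, with the 0.14, 0.80 and 0.94 given in the text.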

The letter `x`

The letter x is very rare (24 tokens, 1/500 the frequency of d in the same context), and is confined to a couple of sections of the book. We could have excluded it from the model, with little loss. However, its occurrences are all consistent with it being a crust letter, in the same league as d, l, r, and s.

The mantle layer

The mantle layer consists primarily of the "bench" letters: ch and sh, and the ee group, which, in its n-gram statistics, seems to be a variant of those two. As explained above, we include in the mantle also isolated e letters, except those that follow a core letter; and any o letters prefixed to the above.

Almost exactly 1/4 of the normal tokens have a non-empty mantle, but no core. In those words, the mantle typically consists of one or two benches, combined of course with single e letters and circles. If we ignore the latter, and replace sh by ch, the most common combinations in normal words are:

     68 0.00799 e       3292 0.38661 ch        
    185 0.02173 ee      3851 0.45226 che      
     90 0.01057 eee      917 0.10769 chee
      2 0.00023 eeee      24 0.00282 cheee

      3 0.00035 ech         42 0.00493 chch     17 0.00200 chech 
      2 0.00023 eche         7 0.00082 chche     2 0.00023 cheche
                             2 0.00023 chchee
      5 0.00059 eech
      2 0.00023 eeche

In words that have gallows letters, the mantle is normally split into two contiguous segments, a prefix and a suffix, either or both of which may be empty. Again, after ignoring circles, mapping sh to ch, and mapping all gallows to #, the most common core/mantle combinations in this class are

    without platform                   with platform              
  -------------------------------  -------------------------------
   5820 0.38477 .......#.........    737 0.37335 ......c#h........
   2160 0.14280 .......#e........    295 0.14944 ......c#he.......
   2339 0.15463 .......#ee.......     44 0.02229 ......c#hee......
    189 0.01250 .......#eee......      2 0.00101 ......c#heee.....
      4 0.00026 .......#eeee.....                                 
                                                                  
   1611 0.10651 .......#ch.......      8 0.00405 ......c#hch......
   1102 0.07285 .......#che......                                 
    101 0.00668 .......#chee.....                                 
      2 0.00013 .......#cheee....                                 
                                                                  
     88 0.00582 .......#ech......                                 
     40 0.00264 .......#eche.....                                 
      2 0.00013 .......#echee....                                 
                                                                  
     27 0.00179 .......#eech.....                                 
      6 0.00040 .......#eeche....                                 
                                                                  
     11 0.00073 .......#chch.....                                 
      1 0.00007 .......#chche....                                 
                                                                  
      6 0.00040 .......#chech....                                 
                                                                  
    502 0.03319 .....ch#.........    514 0.26039 ....chc#h........
     94 0.00621 .....ch#e........    126 0.06383 ....chc#he.......
     64 0.00423 .....ch#ee.......      2 0.00101 ....chc#hee......
      6 0.00040 .....ch#eee......                                 
                                                                  
    144 0.00952 .....ch#ch.......      1 0.00051 ....chc#hch......
     36 0.00238 .....ch#che......                                 
      5 0.00033 .....ch#chee.....                                 
                                                                  
      3 0.00020 .....ch#ech......                                 
                                                                  
      2 0.00013 .....ch#chch.....                                 
                                                                  
    355 0.02347 ....che#.........    183 0.09271 ...chec#h........
     69 0.00456 ....che#e........     45 0.02280 ...chec#he.......
     35 0.00231 ....che#ee.......      1 0.00051 ...chec#hee......
      2 0.00013 ....che#eee......                                 
                                                                  
     51 0.00337 ....che#ch.......                                 
     18 0.00119 ....che#che......                                 
      2 0.00013 ....che#chee.....                                 
                                                                  
     88 0.00582 ...chee#.........      4 0.00203 ..cheec#h........
     12 0.00079 ...chee#e........      3 0.00152 ..cheec#he.......
     11 0.00073 ...chee#ee.......      1 0.00051 ..cheec#hee......
                                                                  
      5 0.00033 ...chee#ch.......                                 
      2 0.00013 ...chee#che......                                 
                                                                  
                                                                  
     49 0.00324 ......e#.........      3 0.00152 .....ec#h........
     15 0.00099 ......e#e........                                 
     14 0.00093 ......e#ee.......                                 
                                                                  
     12 0.00079 ......e#ch.......                                 
      4 0.00026 ......e#che......                                 
                                                                  
      3 0.00020 .....ee#.........      3 0.00152 ....eec#h........
      0 0.00000 .....ee#e.......       2 0.00101 ....eec#he.......
      2 0.00013 .....ee#ee.......                                 
                                                                  
      2 0.00013 ....eee#.........                                 
                                                                  
      4 0.00026 ..cheee#.........                                 
                                                                  
      2 0.00013 ..cheee#ch.......                                 
                                                                  
      2 0.00013 ...chch#.........
      2 0.00013 ..chche#.........

Note that we have sorted this table as if the isolated e following the core was part of the mantle suffix. As the table shows, prefixes are generally shorter than suffixes, and, for a given prefix or suffix, the frequency generally decreases as the other affix gets more complicated.

The implied structure of the mantle is probably the weakest part of our paradigm. Actually, we still do not know whether the isolated e after the core is indeed a modifier for the gallows letter (as the grammar implies); or whether the pedestal of a platform gallows is to be counted as part of the mantle; or whether the eee groups ought to be parsed as e.ee, ee.e, or neither; and so on. These dilemmas are illustrated in the following pages, which show the same distribution of split core-mantles above in different formats:

Allowing for both e and ee in the mantle could make the grammar ambiguous. Fortunately, it turns out that the only ambiguous string that is common enough to matter is eee. (The string eeee occurs only 4 times in the whole manuscript.) Our grammar parses eee as e followed by ee.

The core layer

The core layer of a normal word, by definition, consists of the "gallows" letters { t p k f } or their "pedestal" variants { cth cph ckh cfh }; each possibly prefixed by one or more round letters, and followed by an isolated e or oe. Alternative platforms such as ikh and ckhh, and incomplete platforms such as ck, are extremely rare (about 30 occurrences), and are classified as AbnormalWord by the grammar.

A string of two or more e letters following a gallows letter is parsed from right to left, into zero or more ee pairs, which are assigned to the mantle, and possibly a single e, which is interpreted as part of the core. Thus kee is parsed as k.ee and keee as ke.ee. We have no strong arguments for this rule, except that it avoids ambiguity.
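The right-to-left pairing rule amounts to a parity test on the length of the e run. Here is a short Python sketch of it (function names are ours; the "." separator follows the notation used in the examples above):

```python
def split_e_run(n):
    """Split a run of n e's after a gallows into (core_e, mantle_pairs).

    Pairing from the right leaves a single e for the core only when n is odd.
    """
    return "e" * (n % 2), ["ee"] * (n // 2)

def parse_gallows_e(gallows, n):
    """Render the parse of a gallows followed by n e's, e.g. ke.ee."""
    core_e, pairs = split_e_run(n)
    return ".".join([gallows + core_e] + pairs)

print(parse_gallows_e("k", 2), parse_gallows_e("k", 3))
```

With this rule every run of e letters has exactly one parse, which is the property the grammar needs.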

Almost exactly half of the normal words have an empty core, while the other half has a core that consists of a single gallows letter, possibly with platform. There are 326 words with two or more gallows. Here is a breakdown of the normal gallows by type:

   7084 0.39876 k       633 0.03563 ckh 
   4162 0.23428 t       701 0.03946 cth 
    299 0.01683 f        42 0.00236 cfh 
   1159 0.06524 p       129 0.00726 cph 
                                        
   1749 0.09845 ke      223 0.01255 ckhe
    966 0.05438 te      180 0.01013 cthe
      3 0.00017 fe       15 0.00084 cfhe
      3 0.00017 pe       58 0.00326 cphe

Note the almost absolute lack of e after p and f. The anomaly of these counts can be appreciated by comparing the ratio pe/te with p/t, cph/cth, and cphe/cthe.
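For concreteness, the ratios can be computed from the table as in the Python sketch below. We read the last ratio as cphe/cthe, the platform forms that actually appear in the table; that reading, like the variable names, is an assumption of this sketch.

```python
# Token counts copied from the gallows table above.
count = {"p": 1159, "t": 4162, "pe": 3, "te": 966,
         "cph": 129, "cth": 701, "cphe": 58, "cthe": 180}

pairs = [("p", "t"), ("pe", "te"), ("cph", "cth"), ("cphe", "cthe")]
ratios = {f"{a}/{b}": count[a] / count[b] for a, b in pairs}
for name, value in ratios.items():
    print(f"{name:10s} {value:.3f}")
```

Three of the four ratios fall roughly between 0.2 and 0.3, while pe/te is lower by about two orders of magnitude.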

Distribution of the circles

Up to now we have ignored the presence of the "circle" letters a o y. These are usually inserted between the other letters, as in qokeedy or okedalor. The insertion is strongly context-dependent, of course. As several people have observed, two circles in consecutive positions occur with abnormally low frequency --- much less than implied by the frequencies of individual letters. Our decision to attach the circles in the crust to adjacent letters (see the OR symbol) was dictated by this observation.

Actually, the rules about which circles may appear in each position seem to be fairly complex, and are still being sorted out. Chiefly for that reason, the grammar is quite permissive on this point, and may predict a significant frequency for many words that in fact have a forbidden circle pattern.

For instance, it is well-known that y (with very few exceptions) only occurs in word-initial or word-final position. Yet the grammar indifferently allows either y, o or a at any slot within the crust layer, and either y or o within the core and mantle layers. We considered distinguishing initial from medial circle slots in the grammar, but that would have required the duplication of several rules.

Our grammar also fails to record the unequal distribution of the circles next to different "dealers", which can be inferred from the digraph and trigraph statistics:

        21 dd         6 dad        18 dod
       394 ld         1 lad        44 lod
        27 rd         2 rad        63 rod
        21 sd         1 sad        23 sod
        75 dl       730 dal       199 dol
        30 ll        72 lal       152 lol
        12 rl       126 ral       103 rol
         4 sl        95 sal       133 sol
        11 dr       803 dar       127 dor
        35 lr        69 lar       156 lor
         1 rr       107 rar        61 ror
         2 sr       121 sar        68 sor
       179 ds         7 das         4 dos
       396 ls         2 las        17 los
        45 rs         2 ras         7 ros
        28 ss         1 sas        16 sos

Generally speaking, the letters o and a seem to be attracted to the slots before r and l, and seem to avoid slots before d and s. To record these preferences in the grammar, it would be necessary to split the R symbol into separate symbols R -> r|l and D -> d|s, and similarly for OR.
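The attraction can be quantified by pooling the table over the first dealer. The Python sketch below (names are ours) computes, for each second dealer, the share of dealer pairs in which an a or o sits between the two letters.

```python
# (bare, with a, with o) counts per dealer pair, copied from the table above.
rows = """
dd 21 6 18    ld 394 1 44    rd 27 2 63    sd 21 1 23
dl 75 730 199 ll 30 72 152   rl 12 126 103 sl 4 95 133
dr 11 803 127 lr 35 69 156   rr 1 107 61   sr 2 121 68
ds 179 7 4    ls 396 2 17    rs 45 2 7     ss 28 1 16
""".split()

totals = {}  # second dealer -> [bare, with a, with o]
for i in range(0, len(rows), 4):
    pair, counts = rows[i], map(int, rows[i + 1:i + 4])
    acc = totals.setdefault(pair[1], [0, 0, 0])
    for k, c in enumerate(counts):
        acc[k] += c

circle_share = {d: (a + o) / (bare + a + o) for d, (bare, a, o) in totals.items()}
for d in "rlds":
    print(d, f"{circle_share[d]:.2f}")
```

On these counts the share is above 0.9 before r and l, and well below one half before d and s, which is the contrast that a split into R and D symbols would capture.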

Circles are less common within the core and mantle layers, but fairly common at the boundaries of those two layers. Again, the present version of the grammar doesn't try to capture these nuances: it allows an optional circle before every core or mantle letter.

On the other hand, the grammar does impose some restrictions on the circle slots: just before an IN group only a and o are allowed; before e and ee only o is allowed; before other core or mantle letters only y and o are allowed; and the same holds for the slot at the very end of the word.

Abnormal words

The words that do not fit into our paradigm are collected in the grammar under the symbol AbnormalWord. These words comprise 1295 tokens (3.7%) in the main text, and 127 tokens (12.4%) in the labels. The vast majority are rare words that occur only once in the whole manuscript. They were manually sorted into a few major classes, according to their main "defect" as we perceived it:

Multiple: words that do not have a properly nested layer structure, and seem to be two or more normal words joined together (716 tokens, 55% of the abnormal words). These can be subdivided into:
- MultiCore: words with two or more gallows (208 tokens). The most common is oteotey (3 occurrences).
- MultiCoreMantle: words with crust letters surrounded by core or mantle letters (278 tokens). The most common are chodchy and cholky (4 occurrences each).
- EmbeddedAIN: words which contain the A.IN groups in non-final position (206 tokens). The most common are daiidy and dairal (5 occurrences each).
- EmbeddedYQ: abnormal words which contain the y letter in non-final, non-initial position; or the letter q in non-initial position (24 tokens). The most common is oykeey (2 occurrences).
GroveWord: this class was defined by John Grove, who noticed that the rare words often found at the beginning of lines, such as polchedy, could be interpreted as normal words prefixed with a spurious gallows letter. Of the abnormal tokens in the text, 213 (16%) fit this description.
Weird: the remaining 366 abnormal tokens (28%) are not easily interpreted as joined words or Grove's gallows-prefixed words. We have sorted them into:
- WeirdM: words that have one of the letters m or g not preceded by a circle (57 tokens). Apart from the letter m by itself (13 occurrences), the most common is dm (4 occurrences).
- WeirdI: words that contain letter i in any context other than an IN group (68 tokens). The most common is dairin (2 occurrences).
- WeirdSE: abnormal words that contain isolated e after an s (28 tokens). The most common is shese (3 tokens).
- WeirdOther: abnormal words that did not seem to fit in any of the above categories (213 tokens). Apart from isolated letters like v (7 tokens) and c (4 tokens) --- mainly in the circular text on page f57v --- the most common are da (6 tokens), ackhy, sa, and sha (3 tokens each). Note that the latter are probably the result of misreading y as a in otherwise normal (and common) words.
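As a consistency check, the token counts quoted in the list above can be added up and converted to percentages. The sketch below uses only the numbers given in the text; the dictionary layout and function names are illustrative assumptions.

```python
ABNORMAL_TOTAL = 1295  # abnormal tokens in the main text, as quoted above

# Token counts per subclass, copied from the list above.
CLASSES = {
    "Multiple": {"MultiCore": 208, "MultiCoreMantle": 278,
                 "EmbeddedAIN": 206, "EmbeddedYQ": 24},
    "GroveWord": {"GroveWord": 213},
    "Weird": {"WeirdM": 57, "WeirdI": 68, "WeirdSE": 28, "WeirdOther": 213},
}

def class_totals(classes):
    """Sum the subclass counts of each major class."""
    return {name: sum(sub.values()) for name, sub in classes.items()}

def class_shares(totals, grand_total):
    """Percentage of the grand total held by each class, rounded to an integer."""
    return {name: round(100 * n / grand_total) for name, n in totals.items()}

totals = class_totals(CLASSES)
shares = class_shares(totals, ABNORMAL_TOTAL)
```

With these figures the three major classes add up to the 1295 abnormal tokens, and the rounded shares agree with the 55%, 16% and 28% quoted in the text.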

It is quite possible that, when the VMS is deciphered, we will discover that some of these abnormal words are in fact quite "normal". Indeed, although most "abnormal" words occur only once, some classes of abnormal words may be sufficiently frequent and well defined to deserve recognition in the grammar. One such candidate, for example, is EmbeddedAIN, the set of words that have A.IN groups in non-final position.

Conversely, the grammar is probably too permissive in many points, so that many words that it classifies as normal are in fact errors or non-word constructs; see the section about circle letters, for example. In particular, there must be many apparently "normal" tokens which are in fact "Grove words". These could result from prepending a spurious gallows letter to a crust-only normal word (e.g. p + olarar = polarar), or prepending a spurious non-gallows letter to a suitable normal word (e.g. d + chey = dchey). Indeed, it is quite possible that most of the normal-looking line-initial words are in fact such "crypto-Grove" words.
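A first screening for such candidates could look like the sketch below. It is only an illustration under stated assumptions: split_spurious_prefix is a hypothetical helper, and normal_words stands for some set of words already accepted as normal (for instance, the words that the grammar parses), which must be supplied by the caller.

```python
GALLOWS = "ktpf"  # the EVA gallows letters

def split_spurious_prefix(word, normal_words, prefixes=GALLOWS):
    """Return (prefix, remainder) if word is one prefix letter plus a normal word.

    Returns None otherwise. Membership in normal_words is the only test
    applied, so this finds candidates rather than confirmed Grove words.
    """
    if len(word) > 1 and word[0] in prefixes and word[1:] in normal_words:
        return word[0], word[1:]
    return None
```

For example, with normal_words = {"olarar", "chey"} the call split_spurious_prefix("polarar", normal_words) yields ("p", "olarar"), while "dchey" is only split when d is added to prefixes. Restricting the search to line-initial tokens would be a natural refinement, given the remark above.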

Sectional variation

The rule frequencies vary somewhat from section to section, as shown in the following pages:

The pages included in each section are listed here. The special section txt.n is the whole text of the manuscript, as used in the main grammar page. For each of those sections, we considered only paragraph, circular, radial, and "signature" text, excluding labels and key-like sequences. The special section lab.n consists of all labels.

It is not surprising to find variations from section to section. What is surprising is that the variations are modest; the basic paradigm seems to hold for the whole text, and the alternatives of each rule generally have similar relative frequencies.

In fact, even those modest differences may not be significant. It has been established that the Voynichese word distribution, like that of natural languages, is highly non-uniform (Zipf-like), largely unconnected to word structure, and highly variable from section to section. Therefore, the rule frequencies in any given section are likely to be dominated by the few most common words in that section --- just as the frequency of the digraph "th" in English is largely determined by the frequency of the words "the" and "that".
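The effect can be made concrete by asking how much of a digraph's total count comes from each word. The following is a minimal sketch with a hypothetical function name; the same computation could be applied to the rule counts of the grammar instead of digraphs.

```python
def digraph_contributions(word_counts, digraph):
    """Share of the token-weighted count of `digraph` contributed by each word.

    word_counts maps each word to its number of tokens. str.count finds
    non-overlapping occurrences, which is adequate for a digraph made of
    two different letters.
    """
    weighted = {w: w.count(digraph) * n for w, n in word_counts.items()}
    total = sum(weighted.values())
    if total == 0:
        return {}
    return {w: c / total for w, c in weighted.items() if c > 0}
```

With an invented toy list such as {"the": 6, "that": 2, "bath": 2, "cat": 5}, the word "the" alone accounts for 60% of the occurrences of "th", so any change in its frequency from one section to another would shift the digraph statistics accordingly.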

Discussion and conjectures

Perhaps the most important feature of the paradigm is its existence. The non-trivial word structure, especially the three-layer division, poses severe constraints on cryptological explanations. In particular, simple Vigenère-style ciphers, such as the codes considered by Strong and Brumbaugh, seem to be out of the question, as they would hardly generate the observed word structure.

In fact, the existence of a non-trivial word structure strongly suggests that the Voynichese "code" operates on isolated words, rather than on the text as a whole. (This conclusion is supported also by statistical studies of Voynichese word frequencies, and by the existence of labels and other non-linear text.)

The complexity of the paradigm also discredits the claims that the VMS is nonsense gibberish. It seems unlikely that a 15th century author would invent a random pseudo-language with such a complex, unnatural structure --- and stick to it for 240+ pages, some of them quite boring --- only to impress clients, defraud a gullible collector, embarrass a rival scholar, or just for the fun of it.

The paradigm has implications also for theories that assume a straightforward (non-encrypted) encoding of some obscure language. The layered word structure does not obviously match the word structure of Indo-European languages. Semitic languages such as Arabic, Hebrew, or Ethiopian could perhaps be transliterated into Voynichese, but not by any straightforward mapping.

In fact, if the VMS is not encrypted, the layered structure suggests that the "words" are single syllables (a conclusion that is also supported by the comparatively narrow range of "word" lengths). However, the number of different "words" is far too large compared to the number of syllables in Indo-European languages. So either the script allows multiple spellings for the same syllable, or we must look for languages with a large syllable inventory --- e.g. East Asian languages such as Cantonese, Vietnamese, or Tibetan. [6]

Another possibility is that the VMS "words" are isolated stems and affixes of an agglutinative language, such as Turkish, Hungarian, or several Amerind languages. (Indeed, there is evidence of a strong correlation between certain features of consecutive Voynichese words, reminiscent of the Turkish/Hungarian "vowel harmony" rule. [7])

References

[1] Brig. J. Tiltman (1951), reproduced in D'Imperio, Fig. 27. [local copy in EVA]

[2] Mike Roe, in message to the Voynich mailing list (<1997?). [local copy]

[3] Robert Firth, Notes on the Voynich manuscript, item 24 (1995).

[4] J. Stolfi, OKOKOKO: The fine structure of Voynichese words (1998).

[5] J. Stolfi, transparencies from a talk presented at the Brazilian Mathematics Colloquium (July 1997); page 24. [relevant slides]

[6] J. Stolfi, The Generalized Chinese Theory (1997).

[7] J. Stolfi, Messages to the Voynich mailing list (13 June 2000). [local copy]

Data files

Here are the main data files which were used in the construction of this model:

The comprehensive interlinear transcription of the VMS.
The majority-vote transcription extracted from the same: [plain] [zip] [gzip]
Observed word frequencies per section.
The grammar file without the HTML formatting.

Unix scripts

Here are some of the Unix scripts I used (based mostly on GNU awk, gnuplot, csh, and the pbmplus tools). They may require some tuning to work at your site. If you need help, please do ask.

parse-and-tally reads a grammar and a word frequency list, parses all words, and outputs a copy of the grammar with rule counts and frequencies recomputed from the data.
enum-derivations reads a probabilistic grammar from stdin, enumerates the language, and outputs the predicted word frequencies.
plot-joint-probs reads a file with fields PROB1 PROB2 WORD, and plots PROB1 against PROB2 in log scale, fudging zeros.
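For readers who want to see the idea behind enumerating a probabilistic grammar without digging into the awk code, here is a minimal Python sketch. It is not a port of enum-derivations: the grammar representation, the function name, and the toy grammar in the usage note are all assumptions made for illustration, and the code handles only non-recursive grammars.

```python
from collections import defaultdict

def enumerate_language(grammar, start):
    """Return {word: probability} for a non-recursive probabilistic grammar.

    grammar maps each nonterminal to a list of (probability, rhs) pairs,
    where rhs is a list of nonterminals and terminal strings. Any symbol
    that is not a key of grammar is treated as a terminal.
    """
    def expand(symbols):
        # Yield (text, probability) for each derivation of the symbol list.
        if not symbols:
            yield "", 1.0
            return
        head, rest = symbols[0], symbols[1:]
        if head in grammar:
            for p, rhs in grammar[head]:
                for text, q in expand(list(rhs) + rest):
                    yield text, p * q
        else:
            for text, q in expand(rest):
                yield head + text, q

    probs = defaultdict(float)
    for text, p in expand([start]):
        probs[text] += p  # merge distinct derivations of the same word
    return dict(probs)
```

With the invented toy grammar {"W": [(0.5, ["X", "y"]), (0.5, ["X"])], "X": [(0.75, ["d"]), (0.25, ["o", "d"])]} and start symbol "W", the predicted probabilities are 0.375 for d and dy, and 0.125 for od and ody.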

There are also several other scripts that provide friendlier interfaces to the ones above. See my lab notebook for hints on how to use them. (Warning: no warranty! Read and steal anything you want, but at your own risk...)