Last edited on 1997-12-10 09:31:24 by stolfi

A prefix-midfix-suffix decomposition of Voynichese words

Overview

This note describes an intriguing decomposition of Voynichese words into three parts --- prefix, midfix and suffix.

The decomposition is based on a partition of the Voynichese (specifically, EVA) alphabet into two sets,

With these definitions, we find that almost every Voynichese word can be decomposed into a prefix and suffix composed entirely of soft letters, and a midfix or kernel composed entirely of hard letters. For instance, the popular word qoteedy is decomposed into prefix qo-, midfix -tee-, and suffix -dy.

Any of these three elements may be empty. When the midfix is empty (i.e. the word consists entirely of soft letters), the division into prefix and suffix is ambiguous; in that case I will call the whole word a unifix.
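
The parsing rule just described can be sketched in Python. This is only an illustration under stated assumptions: the list of hard glyphs (HARD) is inferred from the midfixes shown in this note (the gallows k t p f, the benches ch sh, the combinations cth ckh cph cfh, and e), and every other letter is treated as soft. The names Parts, tokenize and decompose are hypothetical.

```python
import re
from collections import namedtuple

# ASSUMPTION: hard glyphs inferred from the midfix examples; all other letters are soft.
HARD = ("cth", "ckh", "cph", "cfh", "ch", "sh", "k", "t", "p", "f", "e")
# Longer glyphs come first, so "sh" is one hard glyph while a lone "s" stays soft.
# A stray "c" or "h" outside these glyphs falls through to "." and counts as soft.
TOKEN = re.compile("|".join(HARD) + "|.")

Parts = namedtuple("Parts", "prefix midfix suffix unifix")

def tokenize(word):
    """Split an EVA word into glyphs, keeping multi-letter hard glyphs whole."""
    return TOKEN.findall(word)

def decompose(word):
    """Midfix = span from the first to the last hard glyph; the rest is prefix/suffix."""
    glyphs = tokenize(word)
    hard_at = [i for i, g in enumerate(glyphs) if g in HARD]
    if not hard_at:
        # No hard glyph at all: the prefix/suffix split is ambiguous, so report a unifix.
        return Parts("", "", "", word)
    first, last = hard_at[0], hard_at[-1]
    return Parts("".join(glyphs[:first]),
                 "".join(glyphs[first:last + 1]),
                 "".join(glyphs[last + 1:]),
                 "")

print(decompose("qoteedy"))  # prefix qo, midfix tee, suffix dy
print(decompose("daiin"))    # all soft letters: unifix daiin
```

Note that a midfix found this way may still contain soft letters sandwiched between hard ones; such cases are the anomalies discussed further on.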

I have analyzed under this paradigm the words in the "biological" section, f75r--f84v. Below are the most common components of each class, and their counts. (The dots are not word spaces, but marks to highlight the fine structure discussed further on):

    freq prefix     freq midfix     freq suffix     freq unifix
    ---- --------   ---- --------   ---- --------   ---- --------
     1859 -          824 -k-        1728 -dy         186 ol
     1296 qo-        588 -che-      1239 -y          126 qol
      607 o-         514 -she-       422 -aiin       106 daiin
      255 ol-        387 -kee-       254 -al          71 dal
      209 l-         354 -t-         245 -ol          64 dar
      108 y-         347 -ke-        157 -ar          56 saiin
       75 d-         179 -te-         86 -ain         55 or
       45 r-         121 -ch-         66 -or          50 sol
       36 qol-       113 -tee-        51 -d           48 dy
       29 s-         105 -shee-       36 -s           36 aiin
       23 q-          95 -chee-       28 -dar         28 dol
       21 sol-        83 -sh-         25 -dal         25 oly
       12 dy-         58 -pche-       21 -am          21 lol
        8 sal-        49 -chckh-      20 -            21 sal
        7 so-         38 -kch-        20 -aly         18 ar
        6 dal-        38 -p-          16 -a           18 iin
        6 olo-        33 -tche-       16 -l           18 raiin
        5 a-          31 -sheckh-     13 -oldy        17 sor
        5 dol-        28 -tch-        12 -daiin       15 al
        4 al-         27 -kche-       10 -air         15 sar
        4 lo-         26 -chcth-      10 -ary         14 s
        4 or-         25 -checkh-     10 -r           13 olor
        3 oqo-        25 -shckh-       9 -aldy        12 olol
        3 qod-        24 -shek-        7 -as          12 rol
        2 dl-         22 -kshe-        6 -ady         11 m
        2 do-         20 -ee-          6 -alor        11 ral
        2 lol-        17 -checth-      6 -dol         11 y
        2 olol-       17 -chek-        6 -dor         10 lor
        2 qoqo-       17 -tshe-        6 -oiin        10 oldy
        2 qor-        16 -pch-         6 -sdy         10 r
        2 rol-        15 -cth-         6 -sy           9 dain
        1 alo-        14 -cthe-        5 -o            9 olaiin
        1 aro-        14 -fche-        5 -oly          8 dam
        1 dar-        12 -chckhe-      4 -alol         8 ldy
        1 dor-        12 -shcth-       4 -an           8 ly
        1 ld-         11 -ckhe-        4 -dam          8 ory
        1 od-         11 -keee-        4 -m            8 qor
        1 odd-        10 -shckhe-      4 -ody          7 l
        1 oll-        10 -shecth-      3 -ay           7 orol
        1 oro-         9 -cheek-       3 -ydy          7 qoly
     ... ...         ... ...         ... ....        ... ...
    ---- --------   ---- --------   ---- --------   ---- --------
    4666 TOTAL      4666 TOTAL      4666 TOTAL      1516 TOTAL

You can get my notebook file with the detailed procedures I used, and the data files mentioned therein. In particular, you can get a file containing all good words of the biological section, already factored as above (63 KB).
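
For readers who want to reproduce a table of this shape, here is a minimal sketch of the tallying step. It is not the procedure from the notebook; the function name tally and the input format (a sequence of prefix, midfix, suffix triples) are assumptions made for illustration. Words with an empty midfix are counted whole as unifixes, which is why the three component columns share one total and the unifix column has its own.

```python
from collections import Counter

def tally(parts):
    """Count prefixes, midfixes, suffixes and unifixes separately.

    `parts` is an iterable of (prefix, midfix, suffix) string triples.
    A triple with an empty midfix is treated as a unifix and counted whole.
    """
    prefixes, midfixes, suffixes, unifixes = Counter(), Counter(), Counter(), Counter()
    for prefix, midfix, suffix in parts:
        if not midfix:
            unifixes[prefix + suffix] += 1
            continue
        prefixes[prefix] += 1
        midfixes[midfix] += 1
        suffixes[suffix] += 1
    return prefixes, midfixes, suffixes, unifixes

# Toy input, NOT the real data: four hand-factored words.
sample = [("qo", "tee", "dy"), ("qo", "k", "dy"), ("", "che", "dy"), ("ol", "", "")]
pre, mid, suf, uni = tally(sample)
for name, table in (("prefix", pre), ("midfix", mid), ("suffix", suf), ("unifix", uni)):
    print(name, table.most_common())
```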

Components are few

An unexpected feature of this decomposition is the surprisingly small number of prefixes and suffixes that occur with significant frequency. As can be seen from the table above, the frequency of each kind of component falls off quite abruptly.

Midfixes are hard

Another non-trivial feature of this decomposition is that virtually all words have midfixes that consist entirely of hard letters. The exceptions are quite rare (see below). If the words were random strings, we would expect a substantial number of words with hard-soft-hard sequences.
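
A midfix can be tested for this property with a single regular expression. The sketch below assumes a hard-glyph inventory inferred from the midfixes shown in this note (k t p f e, ch sh, cth ckh cph cfh); the name is_anomalous is hypothetical.

```python
import re

# ASSUMPTION: inferred hard-glyph inventory; anything else counts as a soft letter.
ALL_HARD = re.compile(r"(?:cth|ckh|cph|cfh|ch|sh|[ktpfe])+")

def is_anomalous(midfix):
    """True if a non-empty midfix contains at least one soft letter."""
    return bool(midfix) and ALL_HARD.fullmatch(midfix) is None

for m in ("tee", "chckh", "polche", "shok", "psche"):
    print(m, is_anomalous(m))
```

An s counts as hard only when it is part of sh, so a midfix such as psche is flagged because of its lone s.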

There are 74 distinct anomalous (soft-containing) midfixes, or 88 if we count repeated occurrences. Here they are:

    4 -polche-          1 -eat-             1 -palk-            1 -shocphe-
    4 -shok-            1 -eedee-           1 -palshe-          1 -shoe-
    3 -kede-            1 -eese-            1 -pdalsh-          1 -shoksh-
    3 -polsh-           1 -kalch-           1 -pockh-           1 -shot-
    2 -chedche-         1 -keedyqok-        1 -pok-             1 -talshe-
    2 -chok-            1 -keeylshe-        1 -poldak-          1 -tchdolt-
    2 -polch-           1 -keeyshe-         1 -poldshe-         1 -tchot-
    2 -talsh-           1 -keylch-          1 -polk-            1 -teae-
    1 -cheak-           1 -kok-             1 -polkee-          1 -tedee-
    1 -chedch-          1 -kolch-           1 -polshe-          1 -teyte-
    1 -chedyk-          1 -kolche-          1 -poltesh-         1 -tocthe-
    1 -cheok-           1 -kolk-            1 -porshe-          1 -tok-
    1 -cheolch-         1 -kolsh-           1 -psche-           1 -tolke-
    1 -chlchpshee-      1 -kop-             1 -pyke-            1 -torolsh-
    1 -cholche-         1 -korch-           1 -shecthedch-      1 -tot-
    1 -cholkeee-        1 -kot-             1 -sheok-           1 -tsheokee-
    1 -chop-            1 -kych-            1 -sheyk-           1 -tyot-
    1 -chot-            1 -kylk-            1 -shockh-          1 -tyqok-
    1 -chytee-          1 -palch-

Note that the 24 root occurrences that begin with p are listed here only because I assumed that p was always a "hard" letter. But we have conjectured before that p is a sort of "joker", probably an "ornate capital" that can be used for several distinct letters, much like the "gallows" in Cappelli's illustration.

So the ps above may well be soft letters, perhaps ds or qs. In that case they should have been parsed as part of the prefix--leaving a kosher hard-only midfix.

So we are left with 64 occurrences of truly anomalous words. That is only 1% of the sample words, and seems well within the range of transcription errors.

In particular, note that several of them contain embedded qs and ys, which are notoriously word-initial and word-final. Therefore, those exceptions may be the result of lost word breaks. For instance, the -chedyk- root comes from the word "chedykar" which may well be a "chedy" and a "kar" (two fairly common words) run together.
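
The lost-word-break idea can be tried mechanically: break a word after every embedded y and before every embedded q. This is only a heuristic sketch (the name propose_split is hypothetical), and it proposes candidate breaks without checking that the resulting pieces are attested words.

```python
import re

# Zero-width positions: just after a non-final y, or just before a non-initial q.
BREAK = re.compile(r"(?<=y)(?=.)|(?<=.)(?=q)")

def propose_split(word):
    """Return the pieces obtained by breaking after internal y and before internal q."""
    return BREAK.sub(" ", word).split()

print(propose_split("chedykar"))  # ['chedy', 'kar']
print(propose_split("keedyqok"))  # ['keedy', 'qok']
print(propose_split("qoteedy"))   # ['qoteedy'] (initial q and final y are left alone)
```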

The fine structure

It seems that prefixes, suffixes, and unifixes can be further decomposed into a sequence of EVA letter groups, which themselves are drawn from a limited repertoire. Some common soft-letter groups are

  am ar al om or ol ain aiin oin oiin 
and there seem to be fairly strong restrictions as to how these groups and other soft letters can be concatenated. For example,

As we all know, q is almost always word-initial, and followed by o.

Similarly, y, m, and n are almost always word-final.

There are several pairs of soft letters that, like qo, behave almost as single letters, e.g. { ar am air dy ol or ... }.

The midfixes too seem to be composed out of a small number of building blocks, where each block is any of the letters

  k t sh ch cth ckh 
followed by zero, one, or two e characters.

Midfixes with three or more consecutive es, or beginning with e, can be explained as mis-transcriptions of other characters, chiefly ch. In fact, such errors may be the source for many of the ee groups seen in the midfix. (Note that -ee- is rare but -e- is rarer still.)
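
The building-block rule for midfixes translates directly into a regular expression. The sketch below uses exactly the six letters listed above, so midfixes containing p or f (such as -pche-) are reported as irregular by this check; the name is_regular is hypothetical.

```python
import re

# One block: any of k t sh ch cth ckh, followed by zero, one, or two e's.
BLOCK = r"(?:cth|ckh|ch|sh|k|t)e{0,2}"
REGULAR = re.compile("(?:" + BLOCK + ")+")

def is_regular(midfix):
    """True if the midfix is a concatenation of one or more building blocks."""
    return REGULAR.fullmatch(midfix) is not None

for m in ("kee", "chckh", "sheckh", "cheek", "keee", "ee", "pche"):
    print(m, is_regular(m))
```

Under this rule -keee- fails because of its three consecutive e's and -ee- fails because it begins with e, which matches the two exception classes noted above.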

Possible interpretations

Here are some possible interpretations of this data.
  1. The VMs is written in cypher. I will leave this hypothesis to the crypto experts.

  2. The Voynich "words" are syllables; the two classes of letters defined above are basically the vowels and consonants.

    Which class is which? Note that there are many words made entirely of soft letters, but no words made entirely of hard letters. Also, the empty prefix occurs very often, while the empty midfix and suffix are rare. Thus it seems that the soft letters are the vowels and the hard letters are the consonants.

    Note that the soft letters may include sounds like "y", "w", "s", "l", "n", "m" which may work as vowel modifiers rather than consonants proper.

    Keeping this in mind, the statistics for syllables of each type are:

        type  freq  perc
        ----  ----  ----
        V     1516  24 %
        CV    1849  30 %
        VCV   2797  45 %
        VC      20   0 %
        C       10   0 %

    Here V stands for one or more vowels, C for one or more consonants.

    Note that there are about 10-12 significant prefixes, and about 20 significant suffixes; which seems right for many languages, including English (12 vowel sounds, a couple dozen vowel clusters).

    The number of consonants seems a bit too high: around 20 "simple" consonants, plus a long tail of consonant pairs.

    A problem with this theory is why the author would choose to mark off syllable breaks instead of word breaks.

  3. Voynichese is a tonal language like Chinese or Vietnamese. This is a variant of the "syllable" theory above; the difference is mainly that some of the letters (perhaps the prefix) would have to indicate the tones.

    This alternative has the merit that, in Chinese, the syllables are indeed the natural unit of text.

  4. Voynichese is an agglutinative language like Turkish, Nahuatl, Quechua, etc.: the "hard" letters are the stem of the word, and the soft letters are modifying affixes.

  5. Voynichese is a Semitic language like Arabic or Hebrew; the prefix, midfix, and suffix correspond to the three basic consonants, and attached vowels.
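
The syllable-type percentages given under interpretation 2 can be rechecked with a few lines of Python. The sketch assumes the reading in which soft-letter components stand for V and the hard midfix for C (consistent with the V count equalling the unifix total of 1516); the counts are copied from the table, and the names syllable_type and percentages are hypothetical.

```python
def syllable_type(prefix, midfix, suffix):
    """Map a factored word to V, CV, VCV, VC or C (ASSUMPTION: soft = V, hard = C)."""
    if not midfix:
        # All-soft word (unifix): a single run of vowels, however it is split.
        return "V" if (prefix or suffix) else ""
    return ("V" if prefix else "") + "C" + ("V" if suffix else "")

def percentages(counts):
    """Rounded percentage of each type relative to the grand total."""
    total = sum(counts.values())
    return {kind: round(100 * n / total) for kind, n in counts.items()}

# Counts copied from the table under interpretation 2.
counts = {"V": 1516, "CV": 1849, "VCV": 2797, "VC": 20, "C": 10}
print(sum(counts.values()))   # 6192
print(percentages(counts))    # {'V': 24, 'CV': 30, 'VCV': 45, 'VC': 0, 'C': 0}
print(syllable_type("qo", "tee", "dy"))  # VCV
```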