Counts of consecutive word pairs in Voynichese and other languages

The tables below show the number of times each pair of words occurs in consecutive positions in the Biological section of the Voynich manuscript.

As usual, the data offers lots of tantalizing patterns but no definite conclusions. At the very least, the patterns strongly suggest that the text is actual natural language, and not random garbage.

One obvious pattern in the table is that words with similar structure seem to have similar neighbor distributions.

Also, some words are unexpectedly common at the end of lines (just before the "//") or at the beginning (just after the "//"), while other words seem to avoid those positions. I suspect this effect arises because many line ends are also paragraph ends, and hence sentence ends. Unfortunately, many paragraph breaks seem to have been omitted in the Currier and FSG transcripts, so this data is particularly noisy.

Guesses, anyone?


Source text

The counts were obtained from the entire Biological section of the VMs (f75r--f84v), which is in Currier's "Language B". The version used was a mechanical stroke-level "consensus" of the Currier and FSG transcriptions.

The text had 7054 words, including end-of-line marks "//" and end-of-paragraph marks "=". Words with invalid characters, and words that were transcribed differently in the Currier and FSG (Friedman) transcriptions, were mapped to the special word "???".
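The "???" rule can be illustrated with a small sketch. This is not the procedure actually used for the consensus (which worked at the stroke level); it assumes, for illustration only, that the two transcriptions are already aligned as one word per line, and the function name consensus and the caller-supplied list of allowed characters are hypothetical.

```shell
# Hypothetical word-level consensus of two aligned transcriptions.
# $1, $2: files with one word per line, assumed aligned line by line.
# $3: the characters regarded as valid (an assumption supplied by the caller).
consensus() {
  paste "$1" "$2" | awk -v ok="^[$3]+\$" '
    # Appending "" forces a string comparison even for digit-only words.
    NF == 2 && ($1 "") == ($2 "") && $1 ~ ok { print $1; next }
    # Disagreements, invalid characters and missing words all collapse here.
    { print "???" }
  '
}
```

For example, `consensus currier.wds fsg.wds 'acemoqrz8'` (both file names hypothetical) would print one consensus word per line.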

Character encoding

The text was encoded with an ad-hoc stroke-level encoding (yes, ANOTHER one), which merges some easily confused letters and ignores differences that (I believe) are just calligraphic variations.

The encoding is basically the Frogguy alphabet, with the following changes:

    Frogguy      9    2    4    x    a    s    e    e'   t    iiiv iiv  iv
    -----------  ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ----
    This table   a    r    q    e    a    z    c    z    c    m    m    n

    Frogguy      ig   ir   qp   lp   dj   fj   eQPt eLPt eDJt eFJt cg   &
    -----------  ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ----
    This table   k    w    H    H    P    P    cHc  cHc  cPc  cPc  cj   ig

Here is the rough correspondence between my encoding and the original FSG encoding:

    Table  FSG  
    -----  ---  
     8     8       
     H     H, D       
     P     P, F       
     a     A, G, CI      
     am    AM, AIN, CIIIL
     an    AN, CM
     ar    AR, CIR
     aw    AIR, CIIR
     c     C        
     cHc   HZ, DZ      
     cPc   PZ, FZ      
     ca    CA, CG, CCI, TI
     cc    T, CC        
     cj    6        
     e     E       
     i     I        
     ig    7        
     iu    L        
     k     K       
     m     M     
     n     N      
     o     O        
     q     4        
     r     R       
     w     IR     
     z     2       
     za    2A, 2G, 2CI, SI
     zc    S, 2C      

Since the correspondence is (on purpose) ambiguous, I cannot easily map the table back to the FSG encoding.

Table structure

The whole word-pair frequency table would be huge (about 850 by 850) but quite sparse (fewer than 7000 non-zero entries). To keep the output small but readable, I partitioned the vocabulary into a small set of frequently occurring key words and a large set of non-keys.

The word-pair frequency table was then split into four sections: key × key, key × non-key, non-key × key, and non-key × non-key. Only the first three sections were computed and printed; the first has about 25 × 25 entries, and the other two have about 25 × 830 entries each.
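As a rough illustration of this four-way split, the sketch below tags each consecutive word pair with the section it would fall in. The function name classify_pairs and the labels KK, KN, NK and NN are made up for this example; it assumes a key-word file with one word per line and pairs arriving as "first second" on standard input.

```shell
# Hypothetical helper: tag each "first second" pair with its table section.
# $1: file of key words, one per line; the pairs are read from stdin.
# The labels KK, KN, NK, NN are invented for this sketch.
classify_pairs() {
  awk -v keyfile="$1" '
    BEGIN {
      # Load the key set from the file named by keyfile.
      while ((getline line < keyfile) > 0)
        if (split(line, f, " ") > 0) key[f[1]] = 1
      close(keyfile)
    }
    NF == 2 {
      a = ($1 in key) ? "K" : "N"
      b = ($2 in key) ? "K" : "N"
      print a b, $1, $2
    }
  '
}
```

Dropping the lines tagged NN (for instance with `grep -v '^NN '`) would leave only the three sections that were printed.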

Here is the list of key words used for this run:
      //                qoe               zcccHca           oHam
      zcc8a             eccc8a            zccca             oHar
      ccc8a             oHcc8a            cccca             qoHan
      oe                zccc8a            zam               qoHam
      oHc8a             ccca              8am               qoHar
      qoHcc8a           zcca              8ar               qoHcca
      qoHc8a            cccHca            8ae               oHcca
      qoHa              zccHca            oHae              or
      qoHae             ccccHca
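How exactly the key words were picked is not recorded here beyond "frequently occurring". One plausible way to get a candidate list, shown only as a sketch, is to take the n most frequent words of the one-word-per-line input; the function name top_words, the cutoff n, and the tie-breaking order are all assumptions.

```shell
# Hypothetical: print the n most frequent words of a one-word-per-line stream.
# $1: how many words to keep. Ties are broken by byte order of the word.
top_words() {
  awk 'NF { print $1 }' \
    | LC_ALL=C sort \
    | uniq -c \
    | LC_ALL=C sort -k1,1nr -k2,2 \
    | awk -v n="$1" 'NR <= n { print $2 }'
}
```

A call such as `top_words 25 < infile.wds > keys.dic` would then give a starting list that could be trimmed or extended by hand.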

Unix recipe

To build the tables, I used the following Unix commands:

    cat infile.wds \
      | gawk \
          ' BEGIN { want = "="; } \
            /./ { print want, $0; want = $0; next; } \
          ' \
      | count-diword-freqs \
          -v rows=nonkeys.dic \
          -v cols=keys.dic

The file infile.wds contains the input text, one word per line.

The command count-diword-freqs is another gawk script that counts the occurrences of each pair, and prints the formatted table to stdout.
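The source of count-diword-freqs is not reproduced on this page. As a minimal sketch of the counting step only, assuming nothing about the real script's options or table layout, the two functions below pair each word with its predecessor (as the gawk step in the recipe does) and then tally the pairs. The names make_pairs and count_pairs and the "count first second" output format are hypothetical, and plain awk is used instead of gawk.

```shell
# Pair each word with the previous one; the text is assumed to start
# right after a paragraph mark "=", as in the recipe.
make_pairs() {
  awk 'BEGIN { want = "=" } /./ { print want, $0; want = $0 }'
}

# Hypothetical stand-in for the counting step; NOT the real count-diword-freqs.
# Reads "first second" pairs and prints "count first second",
# most frequent pairs first.
count_pairs() {
  awk '
    NF == 2 { n[$1 " " $2]++ }
    END { for (p in n) print n[p], p }
  ' | LC_ALL=C sort -k1,1nr -k2,2 -k3,3
}
```

With these definitions, `make_pairs < infile.wds | count_pairs` lists every consecutive pair with its count; formatting the result as a rows-by-columns table is left out of the sketch.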

The auxiliary files keys.dic and nonkeys.dic contain the two word sets, one word per line.

Editing history for these pages

97-08-08: Computed the tables and created these HTML pages.

97-12-10: Reorganized the HTML text, moving each table to a separate file. No substantial changes.

Last edited on 97-12-10 by stolfi