Notes on the Voynich Manuscript - Part 2        [1991 December 16]
----------------------------------------

This is the report of a very brief initial computer analysis of the
ASCII text. While I intend to go over to the Frogguy rules of
transcription as soon as possible, this analysis was done in a couple
of lunch hours some time ago, so uses the Currier rules.

Separate A and B
----------------

First, I separated the A and B sections, creating simple text files VA
and VB. The reason for this is pretty obvious: if the two sections are
in different languages or use different cyphers, I've made my task far
harder by conflating them; but if they are in fact the same, I've not
made my task much harder by dividing them, since any method that can
crack a 100-page document can surely crack a 50-page half document.

Letter Counts
-------------

Secondly, the obvious first step is to count the letter frequencies.
Here is the list for the VA and VB texts, including spaces (sp) and new
lines (nl):

VA
--
    sp: 5861    O: 5152    9: 2878    S: 2852    A: 2050    8: 2011
     E: 1523    C: 1500    R: 1320    F: 1296    P: 1166   nl: 1263
     M: 1009    Z:  903    4:  631    2:  482    Q:  432    B:  200
     J:  198    N:  193    X:  163    W:   89    D:   76    V:   60
     T:   56    I:   47    3:   34    6:   34    U:   22    Y:   20
     K:   11    7:    8    0:    5    G:    4    H:    4    L:    4
     ,:    3

VB
--
    sp: 7977    C: 5688    9: 5308    O: 5283    8: 3990    A: 3065
     E: 2839    F: 2794    S: 2206    4: 1933    Z: 1367    R: 1366
     P: 1228   nl: 1165    M:  708    N:  556    2:  523    B:  303
     X:  300    Q:  171    J:  152    V:   96    T:   78    3:   48
     U:   34    W:   33    D:   22    6:   19    Y:   13    G:    9
     I:    8    7:    5    K:    5    H:    3    L:    3    0:    2
     5:    2

And, for comparison, here are the most common letters in a sample of
English text (E) and Latin text (L).

E
-
    sp: 5182    e: 2730    o: 1717    a: 1683    t: 1681    n: 1504
     r: 1469    s: 1409    h: 1401    i: 1315    l: 1010    d:  901
     f:  743    u:  558   nl:  548    c:  519    w:  469    m:  449
     b:  413    y:  344    g:  333    p:  260    v:  221    k:   84
     j:   45    x:   36    q:   17    z:    8

L
-
    sp:  132    e:  105    i:   89    t:   89    u:   71    m:   62
     a:   57    r:   56    s:   56    o:   55    n:   54    c:   41
    nl:   39    p:   28    d:   24    l:   23    f:   12    v:   12
     x:    7    b:    6    g:    6    q:    6    h:    5    j:    2

[Note: this Latin text was far too short to be a good sample, and I
never found an ASCII version of a longer one.]

Well, the patterns are similar. The Voynich A text seems to have about
20 to 25 common symbols, with frequencies that follow the Zipf law. The
average word length is 5 symbols, but I now find the Currier
transcription uses single letters for what seem in the original to be
compound symbols. If those compounds are written in full, the word
length is nearer 7 than 5.

Note also that VA and VB differ considerably. Look at the frequency of
'O' in VA and VB, and compare the frequency of 'o' in E and L. However,
this doesn't prove they are different languages or use different cypher
keys. At that time, there were many dialects of our modern languages,
and few consistent schemes of spelling. To take an example almost
contemporary with the VM, consider 'The Romaunt of the Rose'. This is
all in English, but the two major parts are by different authors, use
very different dialects, and employ different spelling rules.
(Incidentally, nobody seems to have pointed out that one of those
authors, Geoffrey Chaucer, was pretty interested in alchemy, medicine,
and astronomy...)

Word Counts
-----------

The next step would be word counts. However, are we sure the text is
divided into words? My answer is yes, I am sure.

[Note: I was sure then; later, I became uncertain; I am now sure the
physical groups are *not* words. We live and learn.]

If the spaces were inserted at random, or by some rule intended to
conceal cypher groups, then the initial and final letter frequencies
should be close to the medial frequency.
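
The test just described can be sketched in a few lines of Python. This
is only an illustration, not the program used for the counts in these
notes: the function names are made up here, the sample line is a toy
string assembled from groups quoted in these notes rather than a real
line of transcription, and skipping one-symbol groups is an assumption.

```python
from collections import Counter

def positional_counts(text):
    """Count symbols by position inside each space-delimited group."""
    initial, medial, final = Counter(), Counter(), Counter()
    for group in text.split():
        if len(group) < 2:
            continue  # assumption: a one-symbol group is ambiguous, skip it
        initial[group[0]] += 1
        final[group[-1]] += 1
        medial.update(group[1:-1])  # every symbol strictly inside the group
    return initial, medial, final

def proportions(counter):
    """Turn raw counts into relative frequencies."""
    total = sum(counter.values())
    return {sym: n / total for sym, n in counter.items()} if total else {}

def total_variation(p, q):
    """Half the L1 distance: 0.0 for identical, 1.0 for disjoint."""
    symbols = set(p) | set(q)
    return 0.5 * sum(abs(p.get(s, 0.0) - q.get(s, 0.0)) for s in symbols)

if __name__ == "__main__":
    # Toy input only; substitute the contents of the VA or VB file.
    sample = "4OFAM SC89 OFAN 4OFAR ESC89 OPAR 4OFAN OESC89"
    ini, med, fin = (proportions(c) for c in positional_counts(sample))
    print("initial vs medial:", round(total_variation(ini, med), 3))
    print("final vs medial:  ", round(total_variation(fin, med), 3))
```

If the breaks carried no information, the initial, medial and final
distributions would be expected to look much alike, and both distances
would be small.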
This is definitely not the case; there is a clear pattern of preferred
initial and final letters, just as in natural language.

This being so, there are some significant data in these counts.
Consider this word, from VB, where I put the count in brackets:

    SC89 [250]

This is a pretty common word, like, perhaps, "SOME" in English. Now,
many cypher schemes might encode "SOME" as "SC89". But anything more
complex than a Gold Bug cypher would not also encode "winsome",
"frolicsome", "meddlesome" &c so that they also ended in "-SC89". Now
look at

    4OESC89 [12]    8SC89  [15]    BSC89  [18]
    ESC89   [57]    OBSC89 [17]    OESC89 [30]

That's roughly another 150 occurrences of the same symbol group, which
I consider a pretty clear refutation of the notion that this is some
immensely intricate cypher. This stuff feels like language.

Now, do we have here the equivalent of "domina femina carmina candida",
or rather the equivalent of "eeny meeny miney mo"? For nonsense, too,
shows the morphemic regularities of language. I don't know; but for the
present I'm pretty sure hypothesis C - cypher text - is a very distant
third horse in this race.

[Note: the above analysis was clearly based on an idée fixe that the
symbol groups were words, and that these words were made up of initial
stems and final inflections. But there are several other hypotheses
that would explain the patterns. One obvious one is that the groups
represent syllables in a syllabic language, such as Japanese or
Swahili. Another is that the seeming stems and inflections are letters,
from two parallel Trithemian cypher alphabets. So my third horse was
not as distant as I thought.]

Word Analysis
-------------

Well, let's run for a while with hypothesis P - we have here plain text
that happens to be written in an unknown script, transcribing an
unknown language. What do we do?
We do what Michael Ventris did when faced with this exact problem in
the form of the Cretan Linear B script: look for regular patterns
suggestive of first phonemes, and then grammar.

How is this language encoded? Over 95% of the text is encoded using
20-odd symbols. Now, it is just possible that the underlying language
has three vowels and seven consonants, for a 24-symbol syllabary (21
consonant-vowel signs plus the three bare vowels), but I don't believe
it. This is an alphabet, and each symbol is a consonant, vowel, or
possible consonant cluster or diphthong.

[Note: circular reasoning, of course. It rests on the assumption that
each physical symbol is also a *sign*, a symbol with semantics. But
that is not yet proven. A similar analysis of the Chinese script could
also conclude it was alphabetic, with each *stroke* a letter of the
presumed alphabet. So I may be right, but the reasoning is flawed.]

Does it show grammatical regularity? You bet. We've already seen that
"SC89" is a common word termination; there are also:

    FAM   [13]     FAN   [16]     FAR   [17]
    OEFAM [12]     OEFAN [24]
    OFAM  [56]     OFAN  [50]     OFAR  [44]
    OPAM  [27]     OPAN  [18]     OPAR  [38]
    ORAM  [10]     ORAN  [10]
    4OFAM [96]     4OFAN [154]    4OFAR [66]
    4OPAM [15]     4OPAN [22]     4OPAR [17]

from the same VB text. Look, please, at the pattern of those word
counts. Those frequencies can be explained as a combination of root
frequency ("4OF-" is much more common than "OP-") and ending frequency
("-AM" is slightly less common than "-AN"). This is precisely what one
would expect of a technical document written in a language with end
inflections.

[Again, an unscientific procedure. At this point I should have framed
*all* hypotheses that explain the phenomena, and looked for some test
that would distinguish among them. Indeed, my thesis rapidly begins to
break down under the pressure of ugly fact.]

But there are some peculiarities. One is that the inflection scheme is
very, very consistent, far more so than Latin. There seems to be only
one major declension or conjugation.
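
The root-and-ending reading of the VB counts given earlier can be
written down as a simple independence model: the expected count of a
group is (root total x ending total) / grand total. The sketch below is
an illustration only. The counts are the ones quoted in these notes,
but cutting each group into a "root" and an "ending" at this point, the
choice of these four roots, and the function name are all assumptions
made here.

```python
# Counts quoted in these notes for the VB text. Splitting each group into
# "root" + "ending" like this is an assumption made for illustration.
counts = {
    "4OF": {"AM": 96, "AN": 154, "AR": 66},
    "OF":  {"AM": 56, "AN": 50,  "AR": 44},
    "OP":  {"AM": 27, "AN": 18,  "AR": 38},
    "4OP": {"AM": 15, "AN": 22,  "AR": 17},
}

def expected_under_independence(table):
    """Expected count per cell if root and ending were chosen independently."""
    root_totals = {root: sum(row.values()) for root, row in table.items()}
    ending_totals = {}
    for row in table.values():
        for ending, n in row.items():
            ending_totals[ending] = ending_totals.get(ending, 0) + n
    grand_total = sum(root_totals.values())
    return {
        root: {
            ending: root_totals[root] * ending_totals[ending] / grand_total
            for ending in ending_totals
        }
        for root in table
    }

if __name__ == "__main__":
    expected = expected_under_independence(counts)
    for root, row in counts.items():
        for ending, observed in row.items():
            print(f"{root + ending:6s} observed {observed:4d}  "
                  f"expected {expected[root][ending]:6.1f}")
```

The sketch assumes nothing about how well the model fits. Large gaps
between the observed and expected columns would count against the
simple root-times-ending picture.
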
Another is that what seem to be grammatical endings are also words in
their own right:

    AM [70]    AN [24]    AR [58]

as if Latin had words "us", "um", "os" to match its endings. That may
have been the case in proto-Indo-European, but it's not the case with
Greek, Latin or most modern European languages. It would also be
surprising in a synthetic language devised by a European; such
languages tend to follow the same patterns as the natural languages
known to the deviser: as examples I cite Volapük and Esperanto.

A third point, and an unpleasant one, is that the VA and VB texts seem
to have similar beginning bits (roots?) but different ending bits
(inflections?), and that shouldn't happen either. If two dialects
diverge that much in respect of grammar, there should be at least a
vowel shift in the roots as well - compare German and Dutch to see an
example.

Finally, if those bits are roots, they are very short. You would expect
things like "ferr-", "cupr-", "chalc-", "aur-", "argent-", and such,
not roots of one or two letters. Maybe they are encoded in some compact
notation, as if one would write

    Misce feum & cuum in Xbile

abbreviating "ferrum", "cuprum" and "crucibile" in the obvious way. One
could do the same with plant and star names, of course.

At which point, I ran out of time and ideas. In the past couple of
weeks, I've had more ideas, but no time to test them. If you find any
of this useful, I'd be happy to hear your ideas.

Robert Firth