Notes on the Voynich Manuscript - Part 24    [1995 February 6]
-----------------------------------------

Well, my previous conjectures seem to have evoked widespread
incredulity.  This leads me to one of two conclusions:

    (a) You're all wrong

    (b) There may, just possibly, be some minor flaws in my lucid
        and scintillating reasoning (which is cypher text for
        "I'm all wrong").

In pursuance of the second alternative, then:

The Voynich Manuscript as a Trithemian Cypher
---------------------------------------------

Incidentally, this 'Trithemius' isn't somebody with three themes;
his vernacular name is Trittheim, i.e. tritt-heim.  Language is funny.

Anyway, I'm using the term loosely, to mean any sparse cypher that
is locally decodable.  Sparse means that a small piece of plain text
becomes a substantially larger piece of cypher text.  Locally
decodable means that each piece of cypher text is independently
decodable: we don't need to look at the surrounding context or
propagate state forwards during the decode.  That's a big assumption,
but I think defensible given the probable date and manner of
production of the VMS.

Well, the next step is simple in principle: what are the units of
cypher text, and what units of plain text do they encode?  That gives
us the mapping in terms of sets, e.g. from the set of letters to the
set of "words"; the rest is a matter of insanely tedious detail.

Cypher Units
------------

The most obvious choice for the cypher unit is the "group": the
sequence of symbols separated by spaces.  That doesn't mean it's
right; it does mean that I'd be foolish to look first at the less
obvious.  So let's run with that, and I advise the reader that this
thread does indeed run to the end of the note.

The next question is, what does each group encode?  And the simplest
way to get some insight into that question is to ask, how many
different groups are there?
For, clearly, if a cypher text contains 26 unique components, be they
letters, groups, or words, it's pretty obvious what each component
encodes.

In the entire Voynich A corpus, there are about 2300 unique groups.
But over half of them occur once only, and many of those are pretty
strange in other ways, such as looking like two normal groups run
together.  If we exclude those, we have about 1000 groups.  That's too
few for natural language - Ogden's "Basic English" contains 1600 words
and it's a fright.  So this is either not language or some highly
stylised or synthetic language, as has often been suspected.

But, you know, the reduced list *still* doesn't look right, because
about half of it consists of groups that occur only twice.  That's
again pretty strange.  Eventually, I decided to set the cutoff at four
occurrences: any group that occurs 4 or more times is probably genuine.
This removes about 20% of the text, but it removes over 85% of the
unique groups, and most of the remainder look plausible.

What is Encoded?
----------------

So, we have some 280 groups in the Voynich A, that occur 4 or more
times, with the record being 355 for '8AM'.  If we assume (pace
Brumbaugh) that every group has a single decode, then that sets an
upper bound at 280 for the number of different plaintext units.  So
they're not words.

Do we have a lower bound?  Well, we mustn't assume that every plain
text has a unique encoding; there may be multiple encodings.  So,
whereas 24 unique cypher symbols imply an encoded alphabet, 24,000
*might* imply an encoded alphabet with 1000 alternatives for each
letter.

And that, I think, is one way to reach a lower bound: through
arguments of practicality.  A 24,000 symbol alphabet seems
impracticable - is the scribe really going to throw 3d10 for every
letter?  No, the alternatives should really be sufficiently few that
one can hold them in memory and cycle through them pretty well, which
suggests to me about 5.
That gives me a lower bound of about 56 plaintext symbols (280 groups
at 5 alternatives each); reasoning the other way, if each Voynich
group is a Trithemian letter, we have about 12 alternatives for each
(280 groups over a 24-letter alphabet), which seems too many.

The next obvious alternative is that each group is a syllable, which
would imply from about 60 to about 200 symbols depending on language.
But, again, that seems impractical.  Indo-European languages aren't
syllabic, and all attempts to write them in syllabic script lead to
serious problems, like the Japanese writing "futobaru" for "football",
and the Mycenaeans writing "iqo" for "hippos".

[Note: the author again falls into the trap of taking as true that
which is only assumed, namely, that the underlying language of the
MS is Indo-European.  Remove that assumption, and it indeed becomes
plausible that each Voynich group is a syllable - and the most
plausible underlying language, on that assumption, seems to me to
be Chinese.]

The Conjecture
--------------

Well before this point, I made the conjecture, but I've written it up
as linear argument, in the best revisionist scientific tradition.
Lying abed, brooding, I asked myself: "Robert, if you set out to
create the VMS - if you wanted to generate a cypher text with the
superficial regularities we observe - how would you do it?"

And the answer was pretty clear, though of course it may be quite
wrong.  I would start from the plaintext alphabet, and create two
alternative encodings, one for the odd letters and one for the even.
A pair of letters would be a "group", but the spaces around groups are
for the convenience of the scribes; they add no information.

Further, I would create the encodings so that the odd set looked like
typical roots, and the even set looked like typical inflections, in a
language such as Latin or Italian, much simplified.
Each letter pair would then appear to be a word, and from my niche in
the Empyrean I would laugh myself silly at future generations of
would-be decipherers who exclaimed at "the statistical regularities in
the text".

The Possible Voynich Alphabets
------------------------------

Do the Voynich groups break up in that way?  Again, from the A corpus,
taking only groups that occur 4 or more times, here is my conjecture:

    Odd Letters     Even Letters

    2               89
    4O              8AE
    4OF             8AM
    4OP             AE
    8               AJ
    9F              AM
    9P              AN
    F               AR
    O               C9
    OF              CC9
    OP              COE
    P               OE
    Q               OM
    S               OR
    SF              S9
    SP              SC9
    SQ              SO
    SW              SOE
    SX              SOR
    W               Z9
    X               9 (maybe)
    Z               ZO

This is a rough guess, and will surely have errors - but that's two
alphabets of 21 and 23 symbols, and with the exception of that silly
letter '9' almost any combination of symbols is locally decodable.
(Something's wrong with 8 or AM or 8AM; otherwise, it's rigorous.)

[Note: and also with S/OM and S/OR.  But - and as a former compiler
writer I should really have spotted this - the lexical scansion is
unambiguous if you also keep track of odd and even.  '8' in state
"odd" must be a letter; '8' in state "even" must be the start of
'89' or '8AE' or '8AM'.

And who in the middle ages would have known to do that?  One name
springs to mind immediately: Ramon Lull, the author of the 'Ars
magna'.  Whose dates are ca 1235 to 1316, and whose place of birth
and permanent residence was the Isle of Majorca.  Oh, yes: and whose
writings were later condemned by the See of Rome, which ordered all
copies of them to be impounded.  And some member of the Jesuit
College, sifting through the loot... ?  Of the making of many
hypotheses there is no end.]

Comments, anyone?

Robert Firth

[Note: the exact numbers vary depending on the romanisation method,
but the official pinyin Chinese syllabary has 21 initials and 38
finals, so that doesn't fit too well.  But who knows what those
hypothetical companions of Marco Polo might have come up with?]
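
The "keep track of odd and even" scan mentioned in the note on the
alphabets can be sketched in a few lines of Python.  This is only an
illustration under stated assumptions, not the author's procedure.
The names ODD, EVEN and scan are hypothetical.  The two lists are
read off the conjectured table row by row, which gives 22 entries per
column rather than the 21 and 23 quoted in the text, so one entry may
sit in the wrong column.  The scanner strips spaces (on the
conjecture that they carry no information), alternates between the
two lists, and takes the longest match at each step.

```python
# Conjectured alphabets, transcribed from the table in the note.
# Assumption: entries alternate odd/even as they appear in the table.
ODD = ["2", "4O", "4OF", "4OP", "8", "9F", "9P", "F", "O", "OF", "OP",
       "P", "Q", "S", "SF", "SP", "SQ", "SW", "SX", "W", "X", "Z"]
EVEN = ["89", "8AE", "8AM", "AE", "AJ", "AM", "AN", "AR", "C9", "CC9",
        "COE", "OE", "OM", "OR", "S9", "SC9", "SO", "SOE", "SOR", "Z9",
        "9",   # marked "(maybe)" in the table
        "ZO"]


def scan(text):
    """Split a symbol stream into letters, alternating odd and even.

    Spaces are discarded first.  At each position the longest entry of
    the alphabet for the current state is taken (hypothetical choice:
    longest match).  Raises ValueError if nothing matches.
    """
    stream = text.replace(" ", "")
    tokens = []
    pos = 0
    while pos < len(stream):
        # 1st, 3rd, 5th ... letters come from ODD; 2nd, 4th ... from EVEN.
        state_is_odd = len(tokens) % 2 == 0
        alphabet = ODD if state_is_odd else EVEN
        candidates = [t for t in alphabet if stream.startswith(t, pos)]
        if not candidates:
            state = "odd" if state_is_odd else "even"
            raise ValueError(
                f"no {state} letter matches at position {pos}: "
                f"{stream[pos:pos + 4]!r}")
        best = max(candidates, key=len)
        tokens.append(best)
        pos += len(best)
    return tokens


if __name__ == "__main__":
    for sample in ["8AM", "28AM", "SOR", "4OFSOR"]:
        print(sample, "->", scan(sample))
```

With these lists, '8AM' at the start of a stream scans as 8 + AM,
while the same three symbols after an odd letter (as in '28AM') scan
as the single even letter 8AM; likewise 'SOR' is S + OR in state
"odd" and SOR in state "even".  Longest match looks safe here because
the symbols that extend an odd letter (F, P, Q, W, X) never begin an
even one, and those that extend an even letter (E, R) never begin an
odd one; that observation holds only for the lists as transcribed
above.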