Notes on the Voynich Manuscript - Part 24	[1995 February 6]
-----------------------------------------

Well, my previous conjectures seem to have evoked widespread
incredulity.  This leads me to one of two conclusions:

(a) You're all wrong

(b) There may, just possibly, be some minor flaws in my
    lucid and scintillating reasoning (which is cypher
    text for "I'm all wrong").

In pursuance of the second alternative, then:

	The Voynich Manuscript as a Trithemian Cypher
	---------------------------------------------

Incidentally, this 'Trithemius' isn't somebody with three themes;
his vernacular name is Trittheim, ie tritt-heim.  Language is funny.

Anyway, I'm using the term loosely, to mean any sparse cypher that is
locally decodable.  Sparse means that a small piece of plain text becomes
a substantially larger piece of cypher text; locally decodable means
that each piece of cypher text can be decoded on its own, without
looking at the surrounding context or propagating state forwards
during the decode.  That's a big assumption, but I think defensible
given the probable date and manner of production of the VMS.
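
To make the two properties concrete, here is a minimal sketch of a toy
cypher of this kind.  The table and its words are invented purely for
illustration (they are not taken from Trithemius or from the VMS), and
the names TABLE, decode and expansion are my own.

```python
# Hypothetical table: several cypher words stand for each plaintext letter.
# The words are invented for illustration only.
TABLE = {
    "ave": "a", "alma": "a",
    "bona": "b", "beata": "b",
    "clara": "c", "casta": "c",
}

def decode(cypher_text):
    """Decode each space-separated unit by itself, with no carried state."""
    return "".join(TABLE[unit] for unit in cypher_text.split())

def expansion(cypher_text):
    """Cypher length over plaintext length; 'sparse' means well above 1."""
    return len(cypher_text) / len(decode(cypher_text))

message = "bona ave casta alma"
print(decode(message), expansion(message))
```

Each unit is looked up in isolation, so any fragment of the cypher text
can be decoded without the rest; that is all "locally decodable" is
meant to convey here.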

Well, the next step is simple in principle:  what are the units of
cypher text, and what units of plain text do they encode?  That
gives us the mapping in terms of sets, eg from the set of letters
to the set of "words"; the rest is a matter of insanely tedious
detail.

Cypher Units
------------

The most obvious choice for the cypher unit is the "group": the sequence
of symbols separated by spaces.  That doesn't mean it's right; it does
mean that I'd be foolish to look first at the less obvious.  So let's
run with that, and I advise the reader that this thread does indeed run
to the end of the note.

The next question is, what does each group encode?  And the simplest way
to get some insight into that question is to ask, how many different
groups are there?  For, clearly, if a cypher text contains 26 unique
components, be they letters, groups, or words, it's pretty obvious what
each component encodes.

In the entire Voynich A corpus, there are about 2300 unique groups.  But
over half of them occur once only, and many of those are pretty strange
in other ways, such as looking like two normal groups run together.  If
we exclude those, we have about 1000 groups.  That's too few for natural
language - Ogden's "Basic English" contains 1600 words and it's a fright.
So this is either not language or some highly stylised or synthetic
language, as has often been suspected.

But, you know, the reduced list *still* doesn't look right, because
about half of it consists of groups that occur only twice.  That's
again pretty strange.  Eventually, I decided to set the cutoff at
four occurrences: any group that occurs 4 or more times is probably
genuine.  This removes about 20% of the text, but it removes over
85% of the unique groups, and most of the remainder look plausible.
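
The cutoff can be sketched as follows.  This is only an illustration,
assuming the transcription is one string of space-separated groups; the
function name apply_cutoff and the sample line are hypothetical, and the
sample is toy data rather than a real line of the VMS.

```python
from collections import Counter

def apply_cutoff(text, min_count=4):
    """Keep groups seen at least min_count times; report what was dropped.

    Assumes text holds at least one space-separated group.
    """
    counts = Counter(text.split())
    kept = {g: n for g, n in counts.items() if n >= min_count}
    groups_dropped = 1 - len(kept) / len(counts)                  # unique groups
    text_dropped = 1 - sum(kept.values()) / sum(counts.values())  # running text
    return kept, groups_dropped, text_dropped

sample = "8AM 8AM 8AM 8AM SOE SOE 4OFAM"   # toy data, not a real VMS line
kept, groups_dropped, text_dropped = apply_cutoff(sample)
print(kept, round(groups_dropped, 2), round(text_dropped, 2))
```

The two fractions it returns correspond to the two figures quoted above:
the share of unique groups removed and the share of running text removed.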

What is Encoded?
----------------

So, we have some 280 groups in the Voynich A corpus that occur 4 or more
times, with the record being 355 occurrences for '8AM'.  If we assume
(pace Brumbaugh) that every group has a single decode, then that sets an
upper bound of 280 on the number of different plaintext units.  So
they're not words.

Do we have a lower bound?  Well, we mustn't assume that every plaintext
unit has a unique encoding; there may be multiple encodings.  So, whereas
24 unique cypher symbols imply an encoded alphabet, 24,000 *might* imply
an encoded alphabet with 1000 alternatives for each letter.

And that, I think, is one way to reach a lower bound: through arguments
of practicality.  A 24,000 symbol alphabet seems impracticable - is the
scribe really going to throw 3d10 for every letter?  No, the alternatives
should really be sufficiently few that one can hold them in memory and
cycle through them pretty well, which suggests to me about 5.  That gives
me a lower bound of about 56 plaintext symbols; reasoning the other way,
if each Voynich group is a Trithemian letter, we have about 12 alternatives
for each, which seems too many.
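
The arithmetic behind those two figures, as a small sketch.  The 280
groups and the 24-letter alphabet are the numbers used above; the
function names are mine.

```python
GROUPS = 280     # groups occurring 4+ times in Voynich A (figure from above)
ALPHABET = 24    # size of the plaintext alphabet assumed above

def plaintext_symbols(groups, alternatives):
    """Plaintext units implied if each has this many alternative encodings."""
    return groups / alternatives

def alternatives_per_letter(groups, alphabet):
    """Alternative encodings per letter if every group encodes one letter."""
    return groups / alphabet

lower_bound = plaintext_symbols(GROUPS, 5)              # 5 alternatives each
per_letter = alternatives_per_letter(GROUPS, ALPHABET)  # one group per letter
print(lower_bound, round(per_letter, 1))
```

So 5 alternatives per unit gives 280 / 5 = 56 units, and a 24-letter
alphabet gives 280 / 24, or a little under 12, alternatives per letter.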

The next obvious alternative is that each group is a syllable, which would
imply from about 60 to about 200 symbols depending on language.  But, again,
that seems impractical.  Indo-European languages aren't syllabic, and all
attempts to write them in syllabic script lead to serious problems, like
the Japanese writing "futtoboru" for "football", and the Mycenaeans writing
"iqo" for "hippos".

[Note: the author again falls into the trap of taking as true that which
is only assumed, namely, that the underlying language of the MS is
Indo-European.  Remove that assumption, and it indeed becomes plausible that
each Voynich group is a syllable - and the most plausible underlying
language, on that assumption, seems to me to be Chinese.]

The Conjecture
--------------

Well before this point, I made the conjecture, but I've written it up as
a linear argument, in the best revisionist scientific tradition.  Lying
abed, brooding, I asked myself:  "Robert, if you set out to create the
VMS - if you wanted to generate a cypher text with the superficial
regularities we observe - how would you do it?"

And the answer was pretty clear, though of course it may be quite wrong.

I would start from the plaintext alphabet, and create two alternative
encodings, one for letters in odd positions and one for letters in even
positions.  A pair of letters would be a "group", but the spaces around
groups are for the convenience of the scribes; they add no information.

Further, I would create the encodings so that the odd set looked like
typical roots, and the even set looked like typical inflections, in
a language such as Latin or Italian, much simplified.  Each letter
pair would then appear to be a word, and from my niche in the Empyrean
I would laugh myself silly at future generations of would-be decipherers
who exclaimed at "the statistical regularities in the text".
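
A minimal sketch of such an encoder, to show how little machinery the
scheme needs.  The two tables are entirely invented: ROOTS and ENDINGS
are hypothetical names, and the Latin-looking strings in them are
placeholders, not a proposed reading of any Voynich symbol.

```python
# Hypothetical tables, invented for illustration only.
ROOTS = {"a": "dom", "b": "port", "c": "vid"}    # letters in odd positions
ENDINGS = {"a": "us", "b": "are", "c": "et"}     # letters in even positions

def encode(plain):
    """Encode letters pairwise: root for the 1st, 3rd, ...; ending for the rest."""
    groups = []
    for i in range(0, len(plain), 2):
        group = ROOTS[plain[i]]
        if i + 1 < len(plain):           # odd-length text ends on a bare root
            group += ENDINGS[plain[i + 1]]
        groups.append(group)
    return " ".join(groups)              # spaces carry no information

print(encode("abca"))
```

Every output group looks like a stem plus an inflection, although each
one encodes nothing more than two plaintext letters.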

The Possible Voynich Alphabets
------------------------------

Do the Voynich groups break up in that way?  Again, from the A corpus,
taking only groups that occur 4 or more times, here is my conjecture:

        Odd Letters     Even Letters

        2               89
        4O              8AE
        4OF             8AM
        4OP             AE
        8               AJ
        9F              AM
        9P              AN
        F               AR
        O               C9
        OF              CC9
        OP              COE
        P               OE
        Q               OM
        S               OR
        SF              S9
        SP              SC9
        SQ              SO
        SW              SOE
        SX              SOR
        W               Z9
        X               9 (maybe)
        Z
        ZO

This is a rough guess, and will surely have errors - but that's two
alphabets of 23 and 21 symbols (odd and even respectively), and with the
exception of that silly letter '9' almost any combination of symbols is
locally decodable.  (Something's wrong with 8 or AM or 8AM; otherwise,
it's rigorous.)

[Note: and also with S/OM and S/OR.  But - and as a former compiler
writer I should really have spotted this - the lexical scanning is
unambiguous if you also keep track of odd and even.  '8' in state
"odd" must be a letter; '8' in state "even" must be the start of
'89' or '8AE' or '8AM'.  And who in the Middle Ages would have known
to do that?  One name springs to mind immediately: Ramon Lull, the
author of the 'Ars magna', whose dates are ca 1235 to 1316, and
whose place of birth and permanent residence was the Isle of Majorca.

Oh, yes: and whose writings were later condemned by the See of Rome,
which ordered all copies of them to be impounded.  And some member
of the Jesuit College, sifting through the loot... ?  Of the making
of many hypotheses there is no end.]
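
The odd/even scanning described in the note can be sketched as below.
The two sets are the conjectured alphabets from the table above
(including the doubtful '9'); the function name parses and the choice to
return every possible split are my own, made so that any remaining
ambiguity would show up as more than one result.

```python
# Conjectured alphabets, copied from the table in this note.
ODD = {"2", "4O", "4OF", "4OP", "8", "9F", "9P", "F", "O", "OF", "OP", "P",
       "Q", "S", "SF", "SP", "SQ", "SW", "SX", "W", "X", "Z", "ZO"}
EVEN = {"89", "8AE", "8AM", "AE", "AJ", "AM", "AN", "AR", "C9", "CC9", "COE",
        "OE", "OM", "OR", "S9", "SC9", "SO", "SOE", "SOR", "Z9", "9"}

def parses(text, state="odd"):
    """Return every split of text into symbols alternating odd/even."""
    if not text:
        return [[]]
    table, next_state = (ODD, "even") if state == "odd" else (EVEN, "odd")
    results = []
    for n in range(1, min(3, len(text)) + 1):    # symbols are 1 to 3 long
        head = text[:n]
        if head in table:
            for rest in parses(text[n:], next_state):
                results.append([head] + rest)
    return results

for group in ("8AM", "SOR", "4OFAM", "ZOE"):
    print(group, parses(group))
```

Starting in state "odd", '8AM' can only be read as '8' followed by 'AM';
starting in state "even", the same string can only be the single symbol
'8AM'.  The state, not the spelling, decides the reading.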

Comments, anyone?

Robert Firth

[Note: the exact numbers vary depending on the romanisation method, but
the official pinyin Chinese syllabary has 21 initials and 38 finals, so
that doesn't fit too well.  But who knows what those hypothetical
companions of Marco Polo might have come up with?]