Notes on the Voynich Manuscript - Part 2	[1991 December 16]
----------------------------------------

This is the report of a very brief initial computer analysis of
the ASCII text.  While I intend to go over to the Frogguy rules
of transcription as soon as possible, this analysis was done in
a couple of lunch hours some time ago, so it uses the Currier rules.

Separate A and B
----------------

First, I separated the A and B sections, creating simple text
files VA and VB.  The reason for this is pretty obvious: if the
two sections are in different languages or use different cyphers,
I've made my task far harder by conflating them; but if they
are in fact the same, I've not made my task much harder by
dividing them, since any method that can crack a 100-page document
can surely crack a 50-page half document.

Letter Counts
-------------

Next, the obvious first step is to count the letter frequencies.
Here is the list for the VA and VB texts, including spaces (sp) and
new lines (nl):

VA
--

sp:  5861
O:   5152
9:   2878
S:   2852
A:   2050
8:   2011
E:   1523
C:   1500
R:   1320
F:   1296
nl:  1263
P:   1166
M:   1009
Z:    903
4:    631
2:    482
Q:    432
B:    200
J:    198
N:    193
X:    163
W:     89
D:     76
V:     60
T:     56
I:     47
3:     34
6:     34
U:     22
Y:     20
K:     11
7:      8
0:      5
G:      4
H:      4
L:      4
,:      3

VB
--

sp:  7977
C:   5688
9:   5308
O:   5283
8:   3990
A:   3065
E:   2839
F:   2794
S:   2206
4:   1933
Z:   1367
R:   1366
P:   1228
nl:  1165
M:    708
N:    556
2:    523
B:    303
X:    300
Q:    171
J:    152
V:     96
T:     78
3:     48
U:     34
W:     33
D:     22
6:     19
Y:     13
G:      9
I:      8
7:      5
K:      5
H:      3
L:      3
0:      2
5:      2
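
For anyone wanting to repeat the count, here is a minimal Python sketch
of such a tally.  It is not the program used for the tables above; the
names letter_counts and format_counts and the sample string are
assumptions made purely for illustration.

```python
from collections import Counter

def letter_counts(text):
    """Count every character, reporting space as 'sp' and newline as 'nl'."""
    names = {" ": "sp", "\n": "nl"}
    counts = Counter(names.get(ch, ch) for ch in text)
    # Sort by descending count, then by symbol, so ties come out in a stable order.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))

def format_counts(pairs):
    """Lay the counts out one per line, in the style of the tables above."""
    lines = []
    for sym, n in pairs:
        label = sym + ":"
        lines.append(f"{label:<4}{n:>5}")
    return "\n".join(lines)

# Made-up sample in Currier letters, standing in for the VA or VB file.
sample = "4OFAN SC89 OFAM\nSC89 4OFAR\n"
print(format_counts(letter_counts(sample)))
```

To run it on a real transcription, read the file into a string and pass
that string to letter_counts in place of the sample.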

And, for comparison, here are the most common letters in a sample
of English text (E) and Latin text (L).

E
-

sp:  5182
e:   2730
o:   1717
a:   1683
t:   1681
n:   1504
r:   1469
s:   1409
h:   1401
i:   1315
l:   1010
d:    901
f:    743
u:    558
nl:   548
c:    519
w:    469
m:    449
b:    413
y:    344
g:    333
p:    260
v:    221
k:     84
j:     45
x:     36
q:     17
z:      8

L
-

sp:   132
e:    105
i:     89
t:     89
u:     71
m:     62
a:     57
r:     56
s:     56
o:     55
n:     54
c:     41
nl:    39
p:     28
d:     24
l:     23
f:     12
v:     12
x:      7
b:      6
g:      6
q:      6
h:      5
j:      2

[Note: this Latin text was far too short to be a good sample, and
I never found an ASCII version of a longer one.]

Well, the patterns are similar.  The Voynich A text seems to have about
20 to 25 common symbols, with frequencies that follow Zipf's law.  The
average word length is 5 symbols, but I now find the Currier transcription
uses single letters for what seem in the original to be compound symbols.
If those compounds are written in full, the word length is nearer 7 than 5.

Note also that VA and VB differ considerably.  Look at the frequency of 'O'
in VA and VB, and compare the frequency of 'o' in E and L.  However, this
doesn't prove they are different languages or use different cypher keys.
At that time, there were many dialects of our modern languages, and few
consistent schemes of spelling.  To take an example almost contemporary
with the VM, consider 'The Romaunt of the Rose'.  This is all in English,
but the two major parts are by different authors, use very different
dialects, and employ different spelling rules.  (Incidentally, nobody
seems to have pointed out that one of those authors, Geoffrey Chaucer,
was pretty interested in alchemy, medicine, and astronomy...)
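
The comparison is easier to see as shares than as raw counts.  The
sketch below is an illustration only: it copies the four tables above
(leaving out sp and nl), and the helper names parse and share are
hypothetical.

```python
def parse(table):
    """Turn 'symbol count symbol count ...' into a dict of integer counts."""
    items = table.split()
    return {sym: int(n) for sym, n in zip(items[::2], items[1::2])}

def share(counts, sym):
    """Fraction of all counted symbols that are `sym`."""
    return counts[sym] / sum(counts.values())

# Letter counts copied from the tables above, without sp and nl.
VA = parse("O 5152 9 2878 S 2852 A 2050 8 2011 E 1523 C 1500 R 1320 F 1296 "
           "P 1166 M 1009 Z 903 4 631 2 482 Q 432 B 200 J 198 N 193 X 163 "
           "W 89 D 76 V 60 T 56 I 47 3 34 6 34 U 22 Y 20 K 11 7 8 0 5 G 4 "
           "H 4 L 4 , 3")
VB = parse("C 5688 9 5308 O 5283 8 3990 A 3065 E 2839 F 2794 S 2206 4 1933 "
           "Z 1367 R 1366 P 1228 M 708 N 556 2 523 B 303 X 300 Q 171 J 152 "
           "V 96 T 78 3 48 U 34 W 33 D 22 6 19 Y 13 G 9 I 8 7 5 K 5 H 3 L 3 "
           "0 2 5 2")
ENGLISH = parse("e 2730 o 1717 a 1683 t 1681 n 1504 r 1469 s 1409 h 1401 "
                "i 1315 l 1010 d 901 f 743 u 558 c 519 w 469 m 449 b 413 "
                "y 344 g 333 p 260 v 221 k 84 j 45 x 36 q 17 z 8")
LATIN = parse("e 105 i 89 t 89 u 71 m 62 a 57 r 56 s 56 o 55 n 54 c 41 p 28 "
              "d 24 l 23 f 12 v 12 x 7 b 6 g 6 q 6 h 5 j 2")

for name, counts, sym in [("VA", VA, "O"), ("VB", VB, "O"),
                          ("E", ENGLISH, "o"), ("L", LATIN, "o")]:
    print(f"{name}: '{sym}' is {100 * share(counts, sym):.1f}% of symbols")
```

On these tables 'O' comes to roughly 19% of the VA symbols and 13% of
the VB symbols, against roughly 8% and 6% for 'o' in the English and
Latin samples.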

Word Counts
-----------

The next step would be word counts.  However, are we sure the text is
divided into words?  My answer is yes, I am sure.  [Note: I was sure
then; later, I became uncertain; I am now sure the physical groups are
*not* words.  We live and learn.]  If the spaces were
inserted at random, or by some rule intended to conceal cypher groups,
then the initial and final letter frequencies should be close to the
medial frequency.  This is definitely not the case; there is a clear
pattern of preferred initial and final letters, just as in natural
language.  This being so, there are some significant data in these
counts.  Consider this word, from VB, where I put the count in brackets:

	SC89 [250]

This is a pretty common word, like, perhaps "SOME" in English.  Now,
many cypher schemes might encode "SOME" as "SC89".  But anything more
complex than a Gold Bug cypher would not also encode "winsome",
"frolicsome", "meddlesome" &c so they also ended in "-SC89".  Now
look at

	4OESC89 [12]
	8SC89 [15]
	BSC89 [18]
	ESC89 [57]
	OBSC89 [17]
	OESC89 [30]

That's another 149 occurrences of the same symbol group, which I
consider a pretty clear refutation of the notion that this is some
immensely intricate cypher.  This stuff feels like language.  Now,
do we have here the equivalent of "domina femina carmina candida",
or rather the equivalent of "eeny meeny miney mo"?  For nonsense,
too, shows the morphemic regularities of language.  I don't know;
but for the present I'm pretty sure hypothesis C - cypher text -
is a very distant third horse in this race.
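
The tally of groups ending in "-SC89" can be reproduced from the counts
quoted above.  This is only a sketch: the function name suffix_tally is
an assumption, and the dictionary holds just the seven groups listed
here, not the whole VB vocabulary.

```python
# Group counts from the VB text, as quoted above.  '4OESC89' and 'OBSC89'
# are written with the letter O; the digit 0 is almost absent from VB.
groups = {
    "SC89": 250,
    "4OESC89": 12,
    "8SC89": 15,
    "BSC89": 18,
    "ESC89": 57,
    "OBSC89": 17,
    "OESC89": 30,
}

def suffix_tally(counts, suffix):
    """Return (count of the bare group, total count of longer groups ending in it)."""
    whole = counts.get(suffix, 0)
    longer = sum(n for group, n in counts.items()
                 if group.endswith(suffix) and group != suffix)
    return whole, longer

whole, longer = suffix_tally(groups, "SC89")
print(f"SC89 alone: {whole}; longer groups ending in SC89: {longer}")
```

For the groups listed this gives 250 for the bare group and 149 for the
longer groups that end in it.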

[Note: the above analysis was clearly based on an idée fixe that
the symbol groups were words, and that these words were made up
of initial stems and final inflections.  But there are several
other hypotheses that would explain the patterns.  One obvious
one is that the groups represent syllables in a syllabic language,
such as Japanese or Swahili.  Another is that the seeming stems
and inflections are letters, from two parallel Trithemian cypher
alphabets.  So my third horse was not as distant as I thought.]
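
Whichever hypothesis holds, the positional test described at the start
of this section applies to the symbol groups themselves.  A minimal
sketch, assuming groups are separated by white space; the function
names and the sample line are made up for illustration.

```python
from collections import Counter

def positional_counts(text):
    """Count symbols separately in initial, medial and final position."""
    initial, medial, final = Counter(), Counter(), Counter()
    for group in text.split():
        if len(group) < 2:
            # A one-symbol group is both initial and final; leave it out.
            continue
        initial[group[0]] += 1
        final[group[-1]] += 1
        medial.update(group[1:-1])
    return initial, medial, final

def share(counter, sym):
    """Fraction of the counted positions occupied by `sym`."""
    total = sum(counter.values())
    return counter[sym] / total if total else 0.0

# Made-up sample line in Currier letters.
sample = "4OFAN SC89 OFAM SC89 4OFAR ESC89 OFAN"
initial, medial, final = positional_counts(sample)
for sym in "49OF":
    print(sym, round(share(initial, sym), 2),
          round(share(medial, sym), 2), round(share(final, sym), 2))
```

If the spaces were placed at random, the three shares for each symbol
should come out close to one another on a large text; a symbol that is
common in one position and rare in the others points to real group
boundaries.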

Word Analysis
-------------

Well, let's run for a while with hypothesis P - we have here plain
text that happens to be written in an unknown script, transcribing
an unknown language.  What do we do?  We do what Michael Ventris
did when faced with this exact problem in the form of the Cretan
Linear B script: look for regular patterns suggestive of first
phonemes, and then grammar.

How is this language encoded?  Over 95% of the text is encoded using
20-odd symbols.  Now, it is just possible that the underlying language
has three vowels and seven consonants, for a 24-symbol syllabary (21
consonant-vowel signs plus the three bare vowels), but I don't believe
it.  This is an alphabet, and each symbol is a consonant, vowel, or
possibly a consonant cluster or diphthong.

[Note: circular reasoning, of course.  It rests on the assumption that
each physical symbol is also a *sign*, a symbol with semantics.  But
that is not yet proven.  A similar analysis of the Chinese script could
also conclude it was alphabetic, with each *stroke* a letter of the
presumed alphabet.  So I may be right, but the reasoning is flawed.]
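
The 95% figure can be checked against the VA letter table given earlier.
The sketch below is an illustration, not the original calculation: the
counts are copied from that table without sp and nl, and the names
coverage and symbols_needed are assumptions.

```python
def parse(table):
    """Turn 'symbol count symbol count ...' into a dict of integer counts."""
    items = table.split()
    return {sym: int(n) for sym, n in zip(items[::2], items[1::2])}

# VA letter counts from the table earlier in these notes, without sp and nl.
VA = parse("O 5152 9 2878 S 2852 A 2050 8 2011 E 1523 C 1500 R 1320 F 1296 "
           "P 1166 M 1009 Z 903 4 631 2 482 Q 432 B 200 J 198 N 193 X 163 "
           "W 89 D 76 V 60 T 56 I 47 3 34 6 34 U 22 Y 20 K 11 7 8 0 5 G 4 "
           "H 4 L 4 , 3")

def coverage(counts, k):
    """Fraction of the text written with the k most frequent symbols."""
    ordered = sorted(counts.values(), reverse=True)
    return sum(ordered[:k]) / sum(ordered)

def symbols_needed(counts, fraction):
    """Smallest number of top symbols whose counts reach `fraction` of the text."""
    total = sum(counts.values())
    running = 0
    for k, n in enumerate(sorted(counts.values(), reverse=True), start=1):
        running += n
        if running >= fraction * total:
            return k
    return len(counts)

print(f"top 20 symbols cover {100 * coverage(VA, 20):.1f}% of VA")
print("symbols needed for 95%:", symbols_needed(VA, 0.95))
```

On the VA table the 15 most frequent symbols already pass 95%, so
"20-odd" is if anything a generous estimate.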

Does it show grammatical regularity?  You bet.  We've already seen
that "SC89" is a common word termination; there are also:

	FAM	[13]	FAN	[16]	FAR	[17]
	OEFAM	[12]	OEFAN	[24]
	OFAM	[56]	OFAN	[50]	OFAR	[44]
	OPAM	[27]	OPAN	[18]	OPAR	[38]
	ORAM	[10]	ORAN	[10]
	4OFAM	[96]	4OFAN	[154]	4OFAR	[66]
	4OPAM	[15]	4OPAN	[22]	4OPAR	[17]

from the same VB text.  Look, please, at the pattern of those word
counts.  Those frequencies can be explained as a combination of root
frequency ("4OF-" is much more common than "OP-") and ending frequency
("-AM" is slightly less common than "-AN").  This is precisely what
one would expect of a technical document written in a language with
end inflections.

[Again, an unscientific procedure.  At this point I should have framed
*all* hypotheses that explain the phenomena, and looked for some test
that would distinguish among them.  Indeed, my thesis rapidly begins
to break down under the pressure of ugly fact.]
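
One way to make the "root frequency times ending frequency" reading
testable is to compare the observed counts with what independent roots
and endings would predict.  This is a sketch under that assumption, not
a test I ran at the time; it uses only the five roots quoted above with
all three endings, and the name expected_counts is hypothetical.

```python
# Observed counts from the VB list above: root -> ending -> count.
observed = {
    "F":   {"AM": 13, "AN": 16,  "AR": 17},
    "OF":  {"AM": 56, "AN": 50,  "AR": 44},
    "OP":  {"AM": 27, "AN": 18,  "AR": 38},
    "4OF": {"AM": 96, "AN": 154, "AR": 66},
    "4OP": {"AM": 15, "AN": 22,  "AR": 17},
}

def expected_counts(table):
    """Counts predicted if root and ending were chosen independently."""
    root_totals = {root: sum(row.values()) for root, row in table.items()}
    endings = sorted({e for row in table.values() for e in row})
    ending_totals = {e: sum(row.get(e, 0) for row in table.values())
                     for e in endings}
    grand = sum(root_totals.values())
    return {root: {e: root_totals[root] * ending_totals[e] / grand
                   for e in endings}
            for root in table}

expected = expected_counts(observed)
chi_square = 0.0
for root, row in observed.items():
    for ending, seen in row.items():
        predicted = expected[root][ending]
        chi_square += (seen - predicted) ** 2 / predicted
        print(f"{root + '-' + ending:<8} observed {seen:>4}  expected {predicted:6.1f}")
print(f"chi-square statistic: {chi_square:.1f}")
```

Large gaps between the observed and expected columns would count
against the simple root-plus-ending model, which is the sort of
distinguishing test the note above asks for.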

But there are some peculiarities.  One is that the inflection scheme
is very, very consistent, far more so than Latin.  There seems to be
only one major declension or conjugation.  Another is that what seem
to be grammatical endings are also words in their own right:

	AM	[70]	AN	[24]	AR	[58]

as if Latin had words "us", "um", "os" to match its endings.  That
may have been the case in proto-Indo-European, but it's not the
case with Greek, Latin or most modern European languages.  It would
also be surprising in a synthetic language devised by a European;
such languages tend to follow the same patterns as the natural
languages known to the devisor: as examples I cite Volapük and
Esperanto.

A third point, and an unpleasant one, is that the VA and VB texts
seem to have similar beginning bits (roots?) but different ending
bits (inflections?), and that shouldn't happen either.  If two
dialects diverge that much in respect of grammar, there should be
at least a vowel shift in the roots as well - compare German and
Dutch to see an example.

Finally, if those bits are roots, they are very short.  You would
expect things like "ferr-", "cupr-", "chalc-", "aur-", "argent-",
and such, not roots of one or two letters.  Maybe they are encoded
in some compact notation, as if one would write

	Misce feum & cuum in Xbile

abbreviating "ferrum", "cuprum" and "crucibile" in the obvious way.
One could do the same with plant and star names, of course.

At which point, I ran out of time and ideas.  In the past couple
of weeks, I've had more ideas, but no time to test them.  If you
find any of this useful, I'd be happy to hear your ideas.

Robert Firth