Let's compute the distribution of letters at the start, middle, and end of
the pseudo-"words" (delimited by spaces).  I will ignore line-start and
line-end positions, as they may be special.

  cat .voyn.fsg \
    | tr -d '/=' \
    | sed \
        -e 's/^ *//g' \
        -e 's/ *$//g' \
    | enum-ngraphs -v n=2 \
    > .voyn.dig

  cat .voyn.fsg \
    | tr -d '/=' \
    | sed \
        -e 's/^ *//g' \
        -e 's/ *$//g' \
    | enum-ngraphs -v n=3 \
    > .voyn.trg

  cat .voyn.dig \
    | egrep '^ .$' \
    | sed -e 's/^.\(.\)$/\1/g' \
    | sort | uniq -c | expand \
    > .voyn-ws.frq

  cat .voyn.dig \
    | egrep '^. $' \
    | sed -e 's/^\(.\).$/\1/g' \
    | sort | uniq -c | expand \
    > .voyn-we.frq

  cat .voyn.trg \
    | egrep '^[^ ][^ ][^ ]$' \
    | sed -e 's/^.\(.\).$/\1/' \
    | sort | uniq -c | expand \
    > .voyn-wm.frq

  join \
    -a 1 -a 2 -e 0 -j1 2 -j2 2 -o '0,1.1,2.1' \
    .voyn-wm.frq \
    .voyn-we.frq \
    > .tmp

  join \
    -a 1 -a 2 -e 0 -j1 2 -j2 1 -o '0,1.1,2.2,2.3' \
    .voyn-ws.frq \
    .tmp \
    | gawk '{printf "%s %5d %5d %5d\n", $1, $2, $3, $4}' \
    > .voyn-wsme.frq

  let   ini   mid   fin
  --- ----- ----- -----
    4  1456    21     5
    O  1317  2558    27
    S   670   383     4
    T   746   689     1
    8   412  2154    61
    D   102  2071    14
    H   101   816     4
    P    28   112     4
    F     6    27     1
    A   126  1826     0
    C    23  4235     4
    I     1    71     0
    Z     0   343     1
    G    79   120  3126
    K     0     2     5
    L     2     3     3
    M     0    10   395
    N     0    16   458
    6     1     1     3
    *    11    15    10
    E   340   909   947
    2   140    28    57
    R   130   176   561

Note that "E", "2", and "R" are the only letters that occur in significant
numbers at all three positions.  Note also that "2" and "R" are easily
confused with each other, so the numbers are consistent with "2" being
exclusively word-initial, "R" being exclusively word-final, and there being
substantial misreadings in both directions (10% of the "R"s misread as
"2"s, 40% of the "2"s misread as "R"s).

Here is an attempt to recreate the blanks in the VMs according to simple
rules.  First, prepare a file where every two characters are separated by
" " or "-".
Then replace all blanks by "-", and replace some "-" by " ": before "[42]"
and after "[GKLMN6R]".

  cat .voyn.fsg \
    | tr -d '/=' \
    | sed -e 's/^ *//g' -e 's/ *$//g' \
    | sed \
        -e 's/\(.\)/\1:/g' \
        -e 's/: :/ /g' \
        -e 's/:$//g' \
    > .voyn-sp-org.fsg

  cat .voyn.fsg \
    | tr -d '/= ' \
    | sed \
        -e 's/\(.\)/\1:/g' \
        -e 's/:$//g' \
        -e 's/:\([42]\)/ \1/g' \
        -e 's/\([GKLMN6R]\):/\1 /g' \
    > .voyn-sp-syn.fsg

  compare-spaces \
    .voyn-sp-syn.fsg \
    .voyn-sp-org.fsg \
    | tr -d ':' \
    > .voyn-sp.cmp

                :
      ----- -----
    |  4707   676
  : |   984 22311

  cat .voyn-sp.cmp \
    | tr -dc '+\- ' \
    | sed -e 's/\(.\)/\1@/g' \
    | tr '@ ' '\012_' \
    | egrep '.' \
    | sort | uniq -c | expand \
    > .voyn-sp-o-s.frq

     676 +
     984 -
    4707 _

  cat .voyn-sp.cmp \
    | tr ' ' '_' \
    | enum-ngraphs -v n=3 \
    | egrep '.[-_+].' \
    | sort | uniq -c | expand \
    | sort -b +1.0 -1.2 +0 -1nr \
    > .foo

It seems that many of the errors made by these space-prediction rules are
due to confusion between "2" and "R" by the transcriber.  Let's try to
"correct" these mistakes by changing, in the original, word-initial "R" to
"2" and non-word-initial "2" to "R":

  cat .voyn.fsg \
    | tr -d '/=' \
    | sed \
        -e 's/^ *//g' \
        -e 's/ *$//g' \
        -e 's/ R/ 2/g' \
        -e 's/\([^ ]\)2/\1R/g' \
    | tr -d ' ' \
    | sed \
        -e 's/\(.\)/\1:/g' \
        -e 's/:$//g' \
        -e 's/:\([42]\)/ \1/g' \
        -e 's/\([GKLMN6R]\):/\1 /g' \
    > .voyn-sp-fix.fsg

  compare-spaces \
    .voyn-sp-fix.fsg \
    .voyn-sp-org.fsg \
    | tr -d ':' \
    > .voyn-sp-fix.cmp

          R     2           :
      ----- ----- ----- -----
  R |   792    87     0     0
  2 |   130   278     0     0
    |     0     0  4759   507
  : |     0     0   932 22480

  cat .voyn-sp-fix.cmp \
    | tr -dc '+\- ' \
    | sed -e 's/\(.\)/\1@/g' \
    | tr '@ ' '\012_' \
    | egrep '.' \
    | sort | uniq -c | expand \
    > .voyn-sp-o-s-fix.frq

     507 +
     932 -
    4759 _

  cat .voyn-sp-fix.cmp \
    | tr ' ' '_' \
    | enum-ngraphs -v n=3 \
    | egrep '.[-_+].' \
    | sort | uniq -c | expand \
    | sort -b +1.0 -1.2 +0 -1nr \
    > .foo

Let's compute what would be the initial/medial/final statistics with these
R/2 changes but with the original spaces:

  cat .voyn-sp-fix.cmp \
    | tr -d '+' \
    | tr '\-' ' ' \
    > .voyn-sp-fixr2.fsg

  cat .voyn-sp-fixr2.fsg \
    | tr -d '/=' \
    | sed \
        -e 's/^ *//g' \
        -e 's/ *$//g' \
    | egrep ' .* ' \
    | sed \
        -e 's/^[^ ][^ ]* //g' \
        -e 's/ [^ ][^ ]*$//g' \
    | tr ' ' '\012' \
    | egrep '.' \
    > .voyn-sp-fixr2-nonend.wds

  cat .voyn-sp-fixr2-nonend.wds \
    | sed -e 's/^\(.\).*$/\1/g' \
    | sort | uniq -c | expand \
    > .voyn-sp-fixr2-ws.frq

  cat .voyn-sp-fixr2-nonend.wds \
    | sed -e 's/^.*\(.\)$/\1/g' \
    | sort | uniq -c | expand \
    > .voyn-sp-fixr2-we.frq

  cat .voyn-sp-fixr2-nonend.wds \
    | egrep '...' \
    | sed \
        -e 's/^.\(.*\).$/\1/' \
        -e 's/\(.\)/\1@/g' \
    | tr '@' '\012' \
    | egrep '.' \
    | sort | uniq -c | expand \
    > .voyn-sp-fixr2-wm.frq

  join \
    -a 1 -a 2 -e 0 -j1 2 -j2 2 -o '0,1.1,2.1' \
    .voyn-sp-fixr2-wm.frq \
    .voyn-sp-fixr2-we.frq \
    > .tmp

  join \
    -a 1 -a 2 -e 0 -j1 2 -j2 1 -o '0,1.1,2.2,2.3' \
    .voyn-sp-fixr2-ws.frq \
    .tmp \
    | gawk '{printf "%s %5d %5d %5d\n", $1, $2, $3, $4}' \
    > .voyn-sp-fixr2-wsme.frq

  let   ini   mid   fin
  --- ----- ----- -----
    2   189     8     2
    R    15    79   524

============= PARTIALLY CLEANED-UP REPORT FOLLOWS =============

Overview

This is a summary of my attempts to discover the "true" character,
syllable, and word boundaries in the VMs text.

Basic assumptions

I will assume a priori that the text is mostly prose in some natural
language, using a peculiar invented alphabet; possibly with abbreviations
and calligraphic embellishments, but without any "hard" encryption.  In
other words, I assume that a person who spoke the language could learn to
read the VMs in "real time", with no more effort than it takes to learn
any other phonetic alphabet.

One thing that must be kept in mind is that the text is contaminated by
transcription errors.
It is hard to estimate their frequency, because their distribution is
strongly non-uniform.  The Friedman and Currier transcriptions often
differ in the grouping of strokes: what is "CI" for one is "A" for the
other.  There are also differences in the counting of "I" strokes, so that
"M"s turn into "N"s and vice-versa.  These kinds of errors probably affect
a few percent of certain letters.

To further complicate matters, there is strong evidence that the Voynich
manuscript we have is a copy made by two or more persons who did not
understand the original.  If this hypothesis is true, then the manuscript
itself is probably full of copying errors.  It is quite possible that the
copyists had to make their own guesses as to alphabet, word spacing, etc.

Source material

To minimize distracting artifacts from scribal and topic variation, all
analysis will be done on the "language B" part of the Biological section.
Specifically, I will use Currier's transcription, converted to FSG
notation, as extracted from Landini's EVMT 1.6 interlinear edition.

Frequency anomalies around line breaks

The key observation is that the character and n-gram statistics adjacent
to line boundaries are very different from those within the line.  These
differences were already observed by Currier, who interpreted them as
evidence that the line was a functional unit.  The difference is real, but
the way Currier put it is rather misleading.  A better way to say it is
"the context of line breaks is statistically different from that of word
spaces".  Below I argue that character counts near line breaks seem
anomalous because the latter were made mostly at "true" word boundaries;
and line breaks seem different from spaces because the latter are *not*
true word boundaries.

The hypothesis that the VMs blanks are not word spaces was explored by
Robert Firth in his note #??.  He specifically assumed that the blanks
marked phonetic units, such as stressed syllables.
There are other plausible explanations, however, so it seems more prudent
to avoid making unnecessary assumptions about the causes of the
difference.

Unfavorable line breaking contexts

One specific peculiarity of line breaks is that there are combinations of
FSG characters that occur very often in the text, but rarely or never
occur straddling a line break.  Here are some extreme examples:

      tot        occurs at newline   pattern
  -----------    -----------------   -------
   1899 0.064         1 0.001          C:8
   1347 0.046         0 0.000          O:E
   1436 0.049         0 0.000          O:D
   1629 0.055         0 0.000          4:O
   2056 0.070         0 0.000          8:G

The first entry of this table says that the FSG character pair "C8" occurs
1899 times in the sample text (6.4% of all characters), ignoring all
spaces and line breaks; but there is only one line that ends with "C" and
is followed by a line that begins with "8" (about 0.1% of all line
breaks).  Note that the average line contains 40 characters; so, if lines
were broken completely at random, we would expect "C8" to occur about
1899/40 = 47 times across a line break.

Similar anomalies can be seen in the tetragram frequencies:

      tot        occurs at newline   pattern
  -----------    -----------------   -------
    275 0.009         0 0.000         DC:C8
    162 0.006         0 0.000         OE:SC

The first line says that the group "DCC8" occurs 275 times in the text
(0.9% of all tetragrams), if we delete all line and word spaces; but not
once do we find a line that ends with "DC" followed by a line that begins
with "C8".  If lines were broken at random, we would expect about
275/40 = 7 line breaks between a "DC" and a "C8".

Favorable line breaking contexts

Conversely, certain character combinations are far more common around line
breaks than within lines, as these examples show:

      tot        occurs at newline   pattern
  -----------    -----------------   -------
     18 0.001        17 0.022          K:2
     15 0.001        11 0.014          R:P
     62 0.002        43 0.057          G:P

The first line says that the digraph "K2" occurs 18 times in the sample,
and 17 of those occurrences are split across a line break.
If line breaks were random, the expected number of occurrences at line
breaks would be 0.5 or so.

      tot        occurs at newline   pattern
  -----------    -----------------   -------
     74 0.003        20 0.026         OE:4O
     30 0.001        16 0.021         8G:2O
     15 0.001        11 0.015         8G:HT
     16 0.001         6 0.008         8G:PT
     12 0.000        11 0.015         8G:PO
      9 0.000         5 0.007         8G:HO
     11 0.000         5 0.007         8G:GH
     10 0.000         6 0.008         RG:4O
     29 0.001        14 0.018         AR:4O
     54 0.002        18 0.024         AE:4O
     23 0.001         7 0.009         AM:4O
     12 0.000         9 0.012         AE:2O
      7 0.000         6 0.008         EG:2A

The favorable and unfavorable line-breaking contexts are strong evidence
that the lines were not broken at random, but only in certain "allowed"
contexts.  Presumably, the favorable contexts correspond to boundaries
between natural linguistic segments---characters, syllables, or words.

Other factors

Of course, there are several other possible causes for those statistical
anomalies.  For instance, if lines are broken at word boundaries by the
trivial "greedy" algorithm (break before the first word that would not fit
in the line), then line breaks will be more likely before long words than
before short ones.  Thus, line break statistics will be biased towards
contexts of the form (end of short word):(start of long word).  On the
other hand, a human scribe will probably avoid (consciously or
unconsciously) breaking a line between an article or preposition and the
following noun.  In that case, the line break statistics will be biased
towards contexts of the form (end of long word):(start of short word).

The "words" of the Voynich manuscript

Text in the VMs is clearly broken by spaces into groups of characters that
superficially look like words of the language.
Here is a sample (in FSG notation):

  FTC8GDARG ODCCG 4ODAR SGDTC8G 4ODAR SC8G 8AN SCG EG 2SCOE
  4OETC8G TC8GDAR TCDCC8G RAR 4ODAN TAK ODTCG 4CG DAN SCCDG
  EHAN OEDAR OR 8ODZG EDAKO GDCCG ESCG DAE 8G SCG OD SCG
  4ODCC8G SCGDAR SCG DZCG R AN OE OESCC8G 4ORCCG 4ODG PTC8G
  4ODS8G GHAN TC8G 4ODAR TG EOE TC8G 4ODG 2ARTCG 4OHAR8G
  8SCDZG 4ODAN TDZG ESC8G ODCC8G 4ODT8G TCHCG EO 4ODC8G
  4ODAN TCCDCG 4ODAP OETC8G 2AE 8SOR 4OHAR T8G SCG 4ODAN
  C*DZG8G OHCG HC8G ETC8G 4ODC*8G 4ODAIR OEG 4ODCC8G 8G
  4ODAE ODAR SC8G 8OR TCDAK 2SCDZG 4ODAE OEG SCG R OE TCCG
  SCG 8G ESC8G 4ODG PTC8G DCC8G 4ODC8G 4ODC8G 4ODC8G 4ODC8G
  4ODAN OESC8G

Are the "words" really words?

Unfortunately, there is evidence that the groups of letters delimited by
spaces are not words in the ordinary linguistic sense.  For one thing, the
average length of those groups is rather short, particularly if we
consider that some letters of the "true" Voynich alphabet must be
combinations of two or more FSG codes.  (For instance, it is very likely
that the combinations "SC", "SCC", "TC", and "TCC" are single letters.)

Moreover, the sequence of "words" is highly repetitive: we often see the
same word repeated two or more times over a few lines, as exemplified by
"4ODAN" in the above sample.  Sometimes the same word is repeated several
times in consecutive positions, as "4ODC8G" in the last line.

Finally, as noted by Robert Firth, the "words" have a rather rigid
structure; or, said another way, the spaces are strongly correlated with
the adjacent letters.  If we consider the letter frequencies at the
beginning, middle, and end of words, we get:

  FSG   ini   med   fin
  --- ----- ----- -----
    4  1328    16     2
    O  1107  1817    24
    S   639   242     3
    T   678   442     1
    8   325  1685    50
    A    89  1350     0
    C    22  3442     4
    D    91  1694    11
    H    91   674     3
    F     5    24     1
    P    26    96     4
    I     1    47     0
    Z     0   306     1
    G    65    66  2756
    M     0     1   330
    N     0     3   390
    6     1     0     3
    K     0     1     5
    L     1     2     3
    2   112    17    47
    R    92    70   479
    E   245   560   801

Except for "E", "2", and "R", we can hardly find a letter that occurs in
significant numbers at all three positions.
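Positional tallies of this kind can also be computed without the
enum-ngraphs/join machinery used above.  Here is a minimal sketch in plain
awk; the input words are a made-up toy fragment, not the tallied sample,
and (unlike the real pipeline) it does not exclude line-start and line-end
words:

```shell
# Tally letter frequencies at word-initial, word-medial, and word-final
# positions, printing one row per letter in the same layout as the
# "let/FSG ini mid fin" tables.  Toy input; counts differ from the report.
freqs=$(
  printf '%s\n' 'FTC8G ODCCG 4ODAR' '4ODAN SC8G' \
  | awk '
      { for (i = 1; i <= NF; i++) {
          n = length($i)
          ini[substr($i, 1, 1)]++          # first letter of the word
          fin[substr($i, n, 1)]++          # last letter of the word
          for (j = 2; j < n; j++)          # everything in between
            mid[substr($i, j, 1)]++
        } }
      END {
        for (c in ini) seen[c]; for (c in mid) seen[c]; for (c in fin) seen[c]
        for (c in seen)
          printf "%s %5d %5d %5d\n", c, ini[c], mid[c], fin[c]
      }' \
  | sort
)
printf '%s\n' "$freqs"
```

For the toy input, the word-initial column records "4" twice (from "4ODAR"
and "4ODAN") and the word-final column records "G" three times, which is
the same ini/fin asymmetry the real tables exhibit on a large scale.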
In fact, it looks like the blanks have been inserted into a continuous
text according to very simple rules: after every "6", "G", "K", "L", "M",
"N", or "R", and before every "2" or "4".  These trivial rules correctly
reproduce 83% of the original spaces in the sample text, and insert only
12% additional ones.  Said another way, if we try to predict whether there
is a blank between two consecutive non-blank letters, these rules will
give the right answer 94% of the time.  (Incidentally, these scores could
be improved to 84%, 9%, and 95%, respectively, if we assumed that all 92
"R"s preceded by space are actually "2"s, and all 64 "2"s not preceded by
space are actually "R"s.  Moreover, it must be kept in mind that many
spaces were omitted or inserted during the transcription.)

Here are the first few lines of the text, showing the effect of these
rules: " " is a correctly predicted space, "+" is a predicted but
non-existing space, "-" an existing but non-predicted space.

  FTC8G+DAR+G ODCCG 4ODAR SG+DTC8G 4ODAR SC8G 8AN SCG EG 2SCOE
  4OETC8G TC8G+DAR TCDCC8G R+AR 4ODAN TAK ODTCG 4CG DAN SCCDG
  EHAN OEDAR OR 8ODZG EDAK+O-G+DCCG ESCG DAE-8G SCG OD-SCG
  4ODCC8G SCG+DAR SCG DZCG R AN OE-OESCC8G 4OR+CCG 4ODG
  PTCG+DCCAR OEDG 8AR ODCG 4ODAN THZG 4ODCC8G 4ODG PTC8G
  4ODS8G G+HAN TC8G 4ODAR TG EOE-TC8G 4ODG 2AR+TCG 4OHAR+8G
  8SCDZG 4ODAN TDZG ESC8G ODCC8G 4ODT8G TCHCG EO 4ODC8G
  4ODAN TCCDCG 4ODAP-OETC8G 2AE 8SOR 4OHAR T8G SCG 4ODAN
  C*DZG+8G OHCG HC8G ETC8G 4ODC*8G 4ODAIR OEG 4ODCC8G 8G
  4ODAE-ODAR SC8G 8OR TCDAK 2SCDZG 4ODAE-OEG SCG R OE-TCCG
  SCG 8G ESC8G 4ODG PTC8G DCC8G 4ODC8G 4ODC8G 4ODC8G 4ODC8G
  4ODAN OESC8G

If the space-delimited groups are words of the language, it follows that
either the phonetics of the language is uncommonly constrained, or the
writing system uses different symbols for the same phonetic element
depending on its position in the word.  Another possible explanation is
that the letter groups are syllables, or other linguistic elements smaller
than a word.
This alternative would explain not only the predictability of spaces, but
also the shortness of the words and the repetitiveness of the text.

Word spaces are unlike line breaks

As further evidence that the spaces do not define words, we note that
there are combinations of FSG characters that often occur adjacent to or
across space characters, but not across line breaks; and vice-versa.  For
example:

  at NL or SP     at NL only    pattern
  -----------    -----------    -------
    196 0.030        2 0.003      E:T
     63 0.010        0 0.000      R:A
    111 0.017        0 0.000      N:T
    110 0.017        1 0.001      M:T

The first line here says that there are 196 occurrences of "ET" straddling
a blank (a space and/or a line break), amounting to 3% of all blanks; but
only 2 of those occurrences straddle a line break, which is 0.3% of all
line breaks.  Note that there are 6420 blanks in the sample, of which 765
are line breaks; i.e. 8.4 gaps per line on the average.  So, if lines were
broken randomly at the same places where blanks are inserted, we would
expect 196/8.4 = 23 occurrences of "ET" straddling a line break.

There are many possible explanations for these anomalies.  For instance,
some characters may be written differently at line-start or line-end than
in the middle of a line (this is probably the case for some of the "P" and
"F" gallows).  Some characters may be abbreviations that are only used at
line-end, for lack of space (FSG "6" and "K" may belong to this class).
Certain characters and character groups may be used only at sentence-start
or sentence-end, which tend to coincide with line breaks.

The "word/syllable" theory

A particularly simple and likely explanation for the space vs. newline
anomalies is that the latter occur mostly at "true" word boundaries,
whereas the spaces define smaller units, such as syllables.  However, this
"word/syllable" explanation is not entirely satisfactory.
Assuming that the VMs author chose to break the text into syllables rather
than words, why would he bother to break lines only between words, rather
than between syllables?  Moreover, if we accept that explanation, we must
also assume that a fair fraction of the spaces would be unbreakable.  In
that case, it seems that the right margin should look more ragged than it
is.  Specifically, if "|" represents the ideal margin, we should see line
breaks of the form

  | xxx xxx xxxxxx xxx xxxx xxxxxx xxxxxxx |
  | xxxxx xxxx xxxxxx xxxx xxxxxx xxx      |
  | yyy xxx xxxxxx xxx xxxx xxxxxx xxxxxxx |

where a short word "yyy" that would have fit on one line got pushed to the
next line, because the space following it was not a word boundary.  In
fact, such "premature breaks" seem nonexistent, at least in the few pages
I have seen.

Positive line breaking anomalies

A stronger argument against the word/syllable theory is the existence of
patterns that are *more* likely around line breaks than could be explained
by it.  Consider for example:

  at NL or SP     at NL only    pattern
  -----------    -----------    -------
     53 0.008       28 0.037      E:2
     20 0.003       12 0.016      E:G
     70 0.011       36 0.047      G:G

The first line says that there are 53 occurrences of "E" and "2" separated
by a space or newline; and 28 of them are actually line breaks.  If line
breaks were just a random sample of the blanks, we would expect the second
count to be 53/8.4 = 6.3.  But even if the line breaks were word
boundaries, and the spaces were syllable boundaries, to get 28 line breaks
between "E" and "2" we would have to assume an average of four syllables
per word, or about two words per line.  Thus, we seem forced to assume
that the true word boundaries are not necessarily a subset of the spaces.

The "bogus spaces" theory

If the spaces are neither word boundaries, nor a superset thereof, what
could they be?
Robert Firth proposed in one of his notes that the spaces mark phonetic
units that may span true word boundaries; specifically, units defined by
stress or quantity patterns, like the "poetic syllables", or the "feet" of
metric poetry.  A variant of this theory is that the spaces have some
purely phonetic function: they may indicate glottal stops, vowel
lengthening, stress marks, etc.

As we observed, the spaces can be accurately predicted from the adjacent
letters.  In fact, their occurrences seem too regular even for syllable
boundaries.  Thus, we must consider the hypothesis that the spaces are
meaningless; specifically, that they were inserted mechanically, without
regard for word boundaries, according to simple local rules like the ones
we described above.

Why would the writer choose to do so?  One obvious possibility is to make
the "cypher" harder to break.  Another possibility is that the VMs may be
a copy made by someone who did not understand the text, from an original
where the word spacing was absent or rather subtle.  Such a copyist could
easily have mistaken the visually wider gap after certain characters (such
as FSG "M") for word breaks.  Or he may have identified (consciously or
unconsciously) certain characters such as "G" and "L" with similar-looking
medieval Latin abbreviations, and hence assumed that they marked the end
of words.

Because of these observations, it seems that we should not take the blank
characters of the VMs as word separators in the standard sense.  If we
want to do semantics-oriented analysis (word location maps, label hunting,
word correlations), we should either discard the spaces and work with
n-grams in arbitrary position, or try to discover the "true" word
boundaries by looking at those patterns that are common at line breaks.

Recovering the true word structure from line breaks

Unfortunately, this second path is not very promising.
In most natural languages, the words and morphemes are usually defined by
a lexicon---essentially, a large table of valid words, numbering in the
thousands.  We cannot expect to find a reasonably small set of letter
patterns that can recognize the boundaries of words.

What we *can* get out of the line break statistics, with reasonable
confidence, is the structure of the syllables of the language---assuming,
of course, that the line breaks are mostly true word boundaries, and that
these are a subset of the syllable boundaries.

Here is the basic idea.  Let's define a `split pattern' as an expression
like

  .G:[24]O

where "." is a wildcard and ":" is the place where we are considering
inserting a word break (it matches an empty string).

Let's assume that the line breaks in the VMs are mostly a random sample of
the "true" word breaks; that is, each "true" word break in the text has
basically the same chance of becoming a line break.  Let's assume
furthermore that line breaks occur only at word breaks, i.e. a word is
almost never split across lines.  (The paragraph structure skews this
distribution somewhat, but since the average paragraph seems to have five
lines or more, this particular bias cannot change the line-breaking
statistics by more than 20%.)

STOPPED HERE

Finally, let's assume that each VMs line contains mw "true" words, on the
average.  Suppose we take the VMs and remove all spaces, newlines, and
paragraph marks.  Let NT be the total number of occurrences of the split
pattern in that compressed text; and let NL be the number of occurrences
where the ":" falls where a line break existed in the VMs.

In that case, if the split pattern describes a mandatory word break
(i.e., if the pattern only matches the compressed VMs in such a way that
the ":" matches a "true" word boundary), then we expect the ratio NL/NT to
be about 1/mw.  On the other hand, if the split pattern can occur also
inside words, then the ratio NL/NT will be less than 1/mw.
If the pattern can occur at any position in the word, indifferently, then
NL/NT should be close to 1/mc, where mc is the average number of
characters per line (~40 in the Biological section).

Finally, if the pattern never occurs at a true word boundary, then its
frequency at line boundaries should be close to zero.  The only
occurrences should be "accidents" - words split across lines,
scription/transcription errors, non-text uses, etc.  If the probability
that a line break occurs inside a word is less than 0.05, then such a
"forbidden" pattern should not occur at line boundaries with

Thus, we can classify a split pattern into one of four classes:

  * mandatory word breaks:     those for which NL > 1 and NL/NT > 1/mw
  * mandatory word NON-breaks: those with NT > mc and NL/NT <
  * possible word breaks:      those with NL > 1, NT > 2mw, NL/NT < ???
  * uncertain:                 those with NT < 2mc and NL < 2, or with NT < 2mw

Global digram frequencies, ignoring line breaks and word spaces: