Recomputed word pair tables with the new cont-diword-freqs:

    # Normalize uncertain readings: a final "c?m" becomes "am";
    # any word still containing "?" is replaced by "???".
    cat .wds \
      | sed -e 's/c?m$/am/g' \
      | sed -e '/?/s/^.*$/???/g' \
      > .fix.wds

    # Dictionary of distinct words.
    cat .fix.wds \
      | sort | uniq \
      > .fix.dic

    # Word pairs in reading order.
    cat .fix.wds \
      | enum-word-pairs \
      | count-diword-freqs -v rows=.fix.dic -v cols=.ckeys \
      > .baz

    # Word pairs in reversed order, after deleting lines that contain "=".
    tac .fix.wds \
      | sed -e '/=/d' \
      | enum-word-pairs \
      | count-diword-freqs -v rows=.fix.dic -v cols=.rkeys \
      > .bar

I took all words with 8 or more occurrences, and looked at the
probabilities "in" and "fn" of the word occurring at beginning-of-line
and end-of-line, respectively. I added the two probabilities, and got
the probability "ex" of each word occurring at an extremal position in
the line. All probabilities are given as percentages.

Here are the words, sorted by the probability "in" of being line-initial:

  word         freq in fn ex
  ------------ ---- -- -- --
  Poe             8 99  0 99
  8zcc8a         17 88  0 88
  zor            10 79  0 79
  azcc8a          8 74  0 74
  8ccc8a          9 66  0 66
  zoe            25 59  0 59
  Hccc8a         14 49  0 49
  zam            52 44  5 49
  Pccc8a         14 42  7 49
  aHcc8a         12 41  8 49
  8an            12 33  0 33
  zae            14 28 14 42
  zar            11 27 27 54
  aHc8a          12 24  8 32
  eoe            17 23 58 81
  zccor           9 22  0 22
  8am           100 19  8 27
  qoHccc8a       11 18  9 27
  zcoe           11 18  0 18
  8oe            17 17  0 17
  ???          1294 17 18 35
  qoHcca         81 16  3 19
  qoHcc8a       183 15  2 17
  oeHcc8a        14 14  0 14
  qoHoe          21 14  0 14
  qoHca          43 13  4 17
  qoPccc8a        8 12 12 24
  qoe            81 12 11 23
  qoeccca         8 12  0 12
  cccoe          17 11  0 11
  qoHam         200 11  2 13
  zccc8a         36 11  0 11
  oeHcca         19 10  0 10
  ccoe           11  9  0  9
  eor            10  9 39 48
  ezcc8a         21  9 14 23
  qoHan          54  9  1 10
  oeccca         12  8  8 16
  8ar            51  7  9 16
  oezcc8a        14  7 14 21
  qoHae         113  7  7 14
  qoHc8a        198  7  3 10
  Ham            16  6  0  6
  oHan           16  6 12 18
  8ae            50  5 15 20
  eccc8a         52  5 17 22
  oHcca          34  5  2  7
  zccoe          17  5  0  5
  oHc8a          83  4  3  7
  oeccc8a        23  4 26 30
  qoHar          48  4  2  6
  zcca           69  4  4  8
  zccca          23  4  0  4
  cccca          31  3  0  3
  ccc8a         172  2  6  8
  oHae           39  2 10 12
  or             40  2  7  9
  qoHa           79  2 25 27
  ccca           67  1  5  6
  8a             35  0 31 31
  Hae            10  0  9  9
  Hc8a           25  0  7  7
  Hcc8a          14  0  0  0
  aHcca           9  0  0  0
  ae             12  0 24 24
  am             20  0  9  9
  cc8a           16  0 12 12
  cccHa          12  0  0  0
  cccHc8a         8  0  0  0
  cccHca         50  0  7  7
  cccc8a         19  0 10 10
  ccccHa         12  0  0  0
  ccccHca        35  0  0  0
  cccz            8  0  0  0
  e8a             8  0 74 74
  eHam            8  0  0  0
  eHc8a           8  0  0  0
  eccca          15  0 19 19
  oHa            25  0 27 27
  oHam           76  0  3  3
  oHar           35  0  2  2
  oHca           21  0  4  4
  oHcc8a         56  0  8  8
  oHoe           11  0  0  0
  oPccc8a        14  0  7  7
  oe            127  0 12 12
  oe8a            9  0 77 77
  oeHa           10  0  0  0
  oeHam          22  0  4  4
  oeHc8a         19  0 21 21
  oea            23  0 78 78
  oeoe            8  0 49 49
  oeor           13  0 23 23
  oezcca          8  0 12 12
  qoHccca         8  0  0  0
  ram            14  0 21 21
  roe             9  0 33 33
  zca             9  0 11 11
  zcc8a         204  0  4  4
  zccHa          14  0  7  7
  zccHca         37  0  0  0
  zccHcca         8  0  0  0
  zcccHa         12  0  8  8
  zcccHca        31  0  0  0

By the probability "fn" of being line-final:

  word         freq in fn ex
  ------------ ---- -- -- --
  oea            23  0 78 78
  oe8a            9  0 77 77
  e8a             8  0 74 74
  eoe            17 23 58 81
  oeoe            8  0 49 49
  eor            10  9 39 48
  roe             9  0 33 33
  8a             35  0 31 31
  oHa            25  0 27 27
  zar            11 27 27 54
  oeccc8a        23  4 26 30
  qoHa           79  2 25 27
  ae             12  0 24 24
  oeor           13  0 23 23
  oeHc8a         19  0 21 21
  ram            14  0 21 21
  eccca          15  0 19 19
  ???          1294 17 18 35
  eccc8a         52  5 17 22
  8ae            50  5 15 20
  ezcc8a         21  9 14 23
  oezcc8a        14  7 14 21
  zae            14 28 14 42
  cc8a           16  0 12 12
  oHan           16  6 12 18
  oe            127  0 12 12
  oezcca          8  0 12 12
  qoPccc8a        8 12 12 24
  qoe            81 12 11 23
  zca             9  0 11 11
  cccc8a         19  0 10 10
  oHae           39  2 10 12
  8ar            51  7  9 16
  Hae            10  0  9  9
  am             20  0  9  9
  qoHccc8a       11 18  9 27
  8am           100 19  8 27
  aHc8a          12 24  8 32
  aHcc8a         12 41  8 49
  oHcc8a         56  0  8  8
  oeccca         12  8  8 16
  zcccHa         12  0  8  8
  Hc8a           25  0  7  7
  Pccc8a         14 42  7 49
  cccHca         50  0  7  7
  oPccc8a        14  0  7  7
  or             40  2  7  9
  qoHae         113  7  7 14
  zccHa          14  0  7  7
  ccc8a         172  2  6  8
  ccca           67  1  5  6
  zam            52 44  5 49
  oHca           21  0  4  4
  oeHam          22  0  4  4
  qoHca          43 13  4 17
  zcc8a         204  0  4  4
  zcca           69  4  4  8
  oHam           76  0  3  3
  oHc8a          83  4  3  7
  qoHc8a        198  7  3 10
  qoHcca         81 16  3 19
  oHar           35  0  2  2
  oHcca          34  5  2  7
  qoHam         200 11  2 13
  qoHar          48  4  2  6
  qoHcc8a       183 15  2 17
  qoHan          54  9  1 10
  8an            12 33  0 33
  8ccc8a          9 66  0 66
  8oe            17 17  0 17
  8zcc8a         17 88  0 88
  Ham            16  6  0  6
  Hcc8a          14  0  0  0
  Hccc8a         14 49  0 49
  Poe             8 99  0 99
  aHcca           9  0  0  0
  azcc8a          8 74  0 74
  cccHa          12  0  0  0
  cccHc8a         8  0  0  0
  ccccHa         12  0  0  0
  ccccHca        35  0  0  0
  cccca          31  3  0  3
  cccoe          17 11  0 11
  cccz            8  0  0  0
  ccoe           11  9  0  9
  eHam            8  0  0  0
  eHc8a           8  0  0  0
  oHoe           11  0  0  0
  oeHa           10  0  0  0
  oeHcc8a        14 14  0 14
  oeHcca         19 10  0 10
  qoHccca         8  0  0  0
  qoHoe          21 14  0 14
  qoeccca         8 12  0 12
  zccHca         37  0  0  0
  zccHcca         8  0  0  0
  zccc8a         36 11  0 11
  zcccHca        31  0  0  0
  zccca          23  4  0  4
  zccoe          17  5  0  5
  zccor           9 22  0 22
  zcoe           11 18  0 18
  zoe            25 59  0 59
  zor            10 79  0 79

By the probability "ex" of being line-extreme:

  word         freq in fn ex
  ------------ ---- -- -- --
  Poe             8 99  0 99
  8zcc8a         17 88  0 88
  eoe            17 23 58 81
  zor            10 79  0 79
  oea            23  0 78 78
  oe8a            9  0 77 77
  azcc8a          8 74  0 74
  e8a             8  0 74 74
  8ccc8a          9 66  0 66
  zoe            25 59  0 59
  zar            11 27 27 54
  Hccc8a         14 49  0 49
  Pccc8a         14 42  7 49
  aHcc8a         12 41  8 49
  oeoe            8  0 49 49
  zam            52 44  5 49
  eor            10  9 39 48
  zae            14 28 14 42
  ???          1294 17 18 35
  8an            12 33  0 33
  roe             9  0 33 33
  aHc8a          12 24  8 32
  8a             35  0 31 31
  oeccc8a        23  4 26 30
  8am           100 19  8 27
  oHa            25  0 27 27
  qoHa           79  2 25 27
  qoHccc8a       11 18  9 27
  ae             12  0 24 24
  qoPccc8a        8 12 12 24
  ezcc8a         21  9 14 23
  oeor           13  0 23 23
  qoe            81 12 11 23
  eccc8a         52  5 17 22
  zccor           9 22  0 22
  oeHc8a         19  0 21 21
  oezcc8a        14  7 14 21
  ram            14  0 21 21
  8ae            50  5 15 20
  eccca          15  0 19 19
  qoHcca         81 16  3 19
  oHan           16  6 12 18
  zcoe           11 18  0 18
  8oe            17 17  0 17
  qoHca          43 13  4 17
  qoHcc8a       183 15  2 17
  8ar            51  7  9 16
  oeccca         12  8  8 16
  oeHcc8a        14 14  0 14
  qoHae         113  7  7 14
  qoHoe          21 14  0 14
  qoHam         200 11  2 13
  cc8a           16  0 12 12
  oHae           39  2 10 12
  oe            127  0 12 12
  oezcca          8  0 12 12
  qoeccca         8 12  0 12
  cccoe          17 11  0 11
  zca             9  0 11 11
  zccc8a         36 11  0 11
  cccc8a         19  0 10 10
  oeHcca         19 10  0 10
  qoHan          54  9  1 10
  qoHc8a        198  7  3 10
  Hae            10  0  9  9
  am             20  0  9  9
  ccoe           11  9  0  9
  or             40  2  7  9
  ccc8a         172  2  6  8
  oHcc8a         56  0  8  8
  zcca           69  4  4  8
  zcccHa         12  0  8  8
  Hc8a           25  0  7  7
  cccHca         50  0  7  7
  oHc8a          83  4  3  7
  oHcca          34  5  2  7
  oPccc8a        14  0  7  7
  zccHa          14  0  7  7
  Ham            16  6  0  6
  ccca           67  1  5  6
  qoHar          48  4  2  6
  zccoe          17  5  0  5
  oHca           21  0  4  4
  oeHam          22  0  4  4
  zcc8a         204  0  4  4
  zccca          23  4  0  4
  cccca          31  3  0  3
  oHam           76  0  3  3
  oHar           35  0  2  2
  Hcc8a          14  0  0  0
  aHcca           9  0  0  0
  cccHa          12  0  0  0
  cccHc8a         8  0  0  0
  ccccHa         12  0  0  0
  ccccHca        35  0  0  0
  cccz            8  0  0  0
  eHam            8  0  0  0
  eHc8a           8  0  0  0
  oHoe           11  0  0  0
  oeHa           10  0  0  0
  qoHccca         8  0  0  0
  zccHca         37  0  0  0
  zccHcca         8  0  0  0
  zcccHca        31  0  0  0

Since there are 765 occurrences of "//" in about 6900 words, the
expected probability of a word occurring at a specific end of a line
is about 12%, and about 24% of occurring at either end. Taking 12% as
the split point for "in" and "fn", we get the following tentative
categories:

Extremists (both "in" and "fn" at or above the split):

  word         freq in fn ex
  ------------ ---- -- -- --
  ???          1294 17 18 35
  eoe            17 23 58 81
  qoPccc8a        8 12 12 24
  zae            14 28 14 42
  zar            11 27 27 54

Finalists (only "fn" at or above the split):

  word         freq in fn ex
  ------------ ---- -- -- --
  8a             35  0 31 31
  8ae            50  5 15 20
  ae             12  0 24 24
  cc8a           16  0 12 12
  e8a             8  0 74 74
  eccc8a         52  5 17 22
  eccca          15  0 19 19
  eor            10  9 39 48
  ezcc8a         21  9 14 23
  oHa            25  0 27 27
  oHan           16  6 12 18
  oe            127  0 12 12
  oe8a            9  0 77 77
  oeHc8a         19  0 21 21
  oea            23  0 78 78
  oeccc8a        23  4 26 30
  oeoe            8  0 49 49
  oeor           13  0 23 23
  oezcc8a        14  7 14 21
  oezcca          8  0 12 12
  qoHa           79  2 25 27
  ram            14  0 21 21
  roe             9  0 33 33

Initialists (only "in" at or above the split):

  word         freq in fn ex
  ------------ ---- -- -- --
  8am           100 19  8 27
  8an            12 33  0 33
  8ccc8a          9 66  0 66
  8oe            17 17  0 17
  8zcc8a         17 88  0 88
  Hccc8a         14 49  0 49
  Pccc8a         14 42  7 49
  Poe             8 99  0 99
  aHc8a          12 24  8 32
  aHcc8a         12 41  8 49
  azcc8a          8 74  0 74
  oeHcc8a        14 14  0 14
  qoHca          43 13  4 17
  qoHcc8a       183 15  2 17
  qoHcca         81 16  3 19
  qoHccc8a       11 18  9 27
  qoHoe          21 14  0 14
  qoe            81 12 11 23
  qoeccca         8 12  0 12
  zam            52 44  5 49
  zccor           9 22  0 22
  zcoe           11 18  0 18
  zoe            25 59  0 59
  zor            10 79  0 79

Medialists (both "in" and "fn" below the split):

  word         freq in fn ex
  ------------ ---- -- -- --
  cccoe          17 11  0 11
  qoHam         200 11  2 13
  zccc8a         36 11  0 11
  oeHcca         19 10  0 10
  ccoe           11  9  0  9
  qoHan          54  9  1 10
  oeccca         12  8  8 16
  8ar            51  7  9 16
  qoHae         113  7  7 14
  qoHc8a        198  7  3 10
  Ham            16  6  0  6
  oHcca          34  5  2  7
  zccoe          17  5  0  5
  oHc8a          83  4  3  7
  qoHar          48  4  2  6
  zcca           69  4  4  8
  zccca          23  4  0  4
  cccca          31  3  0  3
  ccc8a         172  2  6  8
  oHae           39  2 10 12
  or             40  2  7  9
  ccca           67  1  5  6
  Hae            10  0  9  9
  Hc8a           25  0  7  7
  Hcc8a          14  0  0  0
  aHcca           9  0  0  0
  am             20  0  9  9
  cccHa          12  0  0  0
  cccHc8a         8  0  0  0
  cccHca         50  0  7  7
  cccc8a         19  0 10 10
  ccccHa         12  0  0  0
  ccccHca        35  0  0  0
  cccz            8  0  0  0
  eHam            8  0  0  0
  eHc8a           8  0  0  0
  oHam           76  0  3  3
  oHar           35  0  2  2
  oHca           21  0  4  4
  oHcc8a         56  0  8  8
  oHoe           11  0  0  0
  oPccc8a        14  0  7  7
  oeHa           10  0  0  0
  oeHam          22  0  4  4
  qoHccca         8  0  0  0
  zca             9  0 11 11
  zcc8a         204  0  4  4
  zccHa          14  0  7  7
  zccHca         37  0  0  0
  zccHcca         8  0  0  0
  zcccHa         12  0  8  8
  zcccHca        31  0  0  0

Note that the average line has about 10 words.
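The four tentative categories follow mechanically from the 12% split point. The rule can be sketched as below, in Python rather than in the scripts actually used. The function name `classify` is hypothetical, and the inclusive comparison (`>=`) is an assumption inferred from rows such as `qoPccc8a' (in = fn = 12, listed as an extremist) and `qoe' (in = 12, fn = 11, listed as an initialist).

```python
def classify(in_pct, fn_pct, split=12):
    """Assign a tentative category from the line-initial ("in") and
    line-final ("fn") percentages of a word.

    A percentage at or above `split` is taken as a preference for that
    end of the line (inclusive threshold: an assumption, see lead-in).
    """
    initial = in_pct >= split
    final = fn_pct >= split
    if initial and final:
        return "extremist"
    if final:
        return "finalist"
    if initial:
        return "initialist"
    return "medialist"


# A few rows copied from the tables: word -> (in, fn)
samples = {
    "eoe": (23, 58),
    "oea": (0, 78),
    "Poe": (99, 0),
    "qoHam": (11, 2),
    "qoPccc8a": (12, 12),
}

for word, (in_pct, fn_pct) in samples.items():
    print(f"{word:12s} {classify(in_pct, fn_pct)}")
```

Changing `split` shows how sensitive the grouping is to the choice of 12%: several words sit exactly on the boundary.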
The average number of lines per paragraph is at most 10 (but an
unknown number of paragraph breaks may have been lost in the
transcription).

Here are three explanations I can think of for a word w to have a
marked preference for, or aversion to, these extremal positions:

(1) Grammar: If w occurs preferably at the end of a sentence, it
    will often be found at the end of paragraphs, which are a
    significant fraction (10% or more) of all end-of-lines.

    This effect can only boost the end-of-line probability up to the
    fraction F of sentences that end at end-of-line. The extreme
    cases are `e8a' and `oe8a' (around 75%). Explaining these numbers
    by cause (1) would require at least 3/4 of all sentences to end
    at end-of-line.

    Conversely, if w has a preference for beginning-of-sentence, it
    will be found at end-of-line only if a paragraph contains two or
    more sentences. An extreme case is `qoHam', which occurs 200
    times, but only 2% of those occurrences are at end-of-line. We
    tentatively conclude that at most 2% of the sentences begin one
    word before end-of-line. If the second and subsequent sentences
    of a paragraph begin at random positions of the line, then, with
    about 10 words per line, only about 1 in 10 of them would begin
    at that last position; so such sentences are less than
    2% x 10 = 20% of all sentences, and hence roughly 80% of all
    paragraphs contain only one sentence.

(2) Word splitting: In the VMs, words may have been split across
    line breaks without obvious markings. The left halves of split
    words would then show up as end-loving, begin-loathing; and
    symmetrically for the right halves.

    This explanation cannot account for the many end-loving words
    ending in `8a', like `eccc8a', because `8a' rarely occurs in the
    middle of a word: it is almost always final, and a few times
    initial. Likewise, it cannot account for end-loathing words that
    begin with `qo', which appears to be strictly word-initial.

    Also, this effect cannot explain words that avoid both ends of
    the line, like

      word         freq
      ------------ ----
      Hcc8a          14
      aHcca           9
      cccHa          12
      cccHc8a         8
      ccccHa         12
      ccccHca        35
      cccz            8
      eHam            8
      eHc8a           8
      oHoe           11
      oeHa           10
      qoHccca         8
      zccHca         37
      zccHcca         8
      zcccHca        31

(3) False word breaks: In a sense the opposite of (2). Suppose w is
    part of a longer word x, but the letter spacing is such that x is
    often transcribed as two or three separate words, one of them
    being w. Then w will seem to avoid end-of-line, begin-of-line, or
    both, depending on the position of w in x.

    This effect can only explain end-avoidance, not end-attraction.
    Also, it seems unlikely to be due to bad judgement by the
    transcribers; the word spaces in the VMs are usually pretty
    clear, and anyway I only considered word breaks where both
    Friedman and Currier agreed. So, this explanation only flies if
    the word spaces are bogus by design.

Conclusion: the most likely explanation for most anomalous words
seems to be (1).

Posted an improved version of these comments to the "voynich" list.
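The back-of-envelope bound in explanation (1) can be checked with a few lines of Python. This is only a sketch under the assumptions stated in the text: about 10 words per line, and second or later sentences of a paragraph starting at a uniformly random position in the line. The function name is hypothetical.

```python
def max_midline_sentence_fraction(end_of_line_rate, words_per_line):
    """Upper bound on the fraction of sentences that start at a random
    position inside a line.

    If such sentences start uniformly at any of `words_per_line`
    positions, only 1/words_per_line of them start at the last position
    of the line. So their total share is at most
    end_of_line_rate * words_per_line (capped at 1).
    """
    return min(1.0, end_of_line_rate * words_per_line)


# `qoHam': 2% of its 200 occurrences are line-final; about 10 words per line.
bound = max_midline_sentence_fraction(0.02, 10)
print(f"sentences starting mid-line: at most {bound:.0%} of all sentences")
print(f"sentences starting a paragraph: at least {1 - bound:.0%}")
```

The bound scales linearly with the assumed line length, so a longer average line would weaken the conclusion that most paragraphs hold a single sentence.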