Hacking at the Voynich manuscript - Side notes
008 A prefix-midfix-suffix factorization of the bio section in EVA encoding

Last edited on 1998-01-28 02:07:27 by stolfi

This is partly a remake of work from Notebook-1.txt, originally
done around 97-07-05.

Summary of previous relevant tasks:

  I obtained Landini's interlinear transcription of the VMs, version 1.6
  (landini-interln16.evt) from
  http://sun1.bham.ac.uk/G.Landini/evmt/intrln16.zip. [Notebook-1.txt]

  Around 97-11-01 I split landini-interln16.evt into many files, with
  one text unit per page. [Notebook-12.txt]

  On 97-11-05 I mapped those files from FSG and other ad-hoc alphabets
  to EVA. [Notebook-12.txt] The files are L16-eva/fNNxx.YY, and a
  machine-readable description of their contents and logical order is
  in L16-eva/INDEX.

  Then I started redoing some of the previous tasks using the new
  encoding.

  I extracted the Currier (;C>) and Friedman-I (;F>) versions of the
  "bio" section, in EVA alphabet, as files bio-{c,f}-eva.evt. I also
  built the associated text files and word lists
  bio-{c,f}-eva{-{gut,fun,bad},{wds,dic,frq},.txt}. [Note-001.txt]

  After mapping the data to a reduced alphabet (ERA) [Note-003.txt], I
  identified a paradigm which consists of 267 prefixes combined with
  147 suffixes; the latter are maximal strings of the form [doirlmn]*.
  If we use just the 91 most common prefixes (>= 3 occurrences) and
  the 24 most common suffixes (>= 20 occurrences), we can reproduce
  5867 of the 6166 original words (95.1%), counting repetitions. If we
  look only at distinct words, we get 534 out of 763 with 98 prefixes
  and 22 suffixes. [Note-006.txt]

  Later I did a similar factorization of the words in the full EVA
  alphabet, defining suffixes as [daoirsl]*[rlmnyd]. This factorization
  had 526 prefixes and 206 suffixes, and left out 88 words (55
  distinct). The top 160 prefixes and 30 suffixes generate 5176 out of
  6166 words (84%).
  Ignoring word frequencies, with 175 prefixes and 33 suffixes we can
  generate 712 out of 1342 words (53%). [Note-007.txt]

97-11-11 stolfi
===============

  The previous factorizations seemed unsatisfactory; I thought perhaps
  I was being too greedy and including in the suffix things that
  belonged to the prefix.

  Looking at the suffix list of Note-007.txt, I concluded that a suffix
  is made up of these components:

    [aoydsm]  i*n  i*r  i*l

  So this will be the basis of my next factorization attempt.

  There is strong indication that these letters are to be parsed in
  pairs { al ar aiin dy ol ... }. But let's leave that for later.

  Also, each word seems to have a (generally short) prefix made with
  letters { qo o a y l d r s } (but beware not to eat the "s" in "sh").

  So I will try a tripartition of the words:

    cat bio-f-eva-gut.wds \
      | sed \
          -e 's/sh/X/g' \
          -e 's/$/}/' \
          -e 's/^/{/' \
          -e 's/{\([qoaydirslmn][qoaydirslmn]*\)/\1{/' \
          -e 's/\([qoaydirslmn][qoaydirslmn]*\)}/}\1/' \
          -e 's/X/sh/g' \
          -e 's/{}/\./' \
          -e 's/\.//g' \
          -e 's/{/- -/' \
          -e 's/}/- -/' \
      > factored

  [ I redid all this processing on 97-12-10, after fixing a bug in the
    prefix-parsing "sed" script. In its previous form, the script had
    left some "q"s attached to the midfix. ]

  Each line in the file factored is either a single word consisting
  entirely of prefix/suffix letters (a "unifix"), or a Voynich word
  broken into three Unix words ("prefix", "midfix", "suffix").
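  As a rough check of what the script above does to individual words,
  the sketch below wraps the same sed expressions in a shell function
  and runs it on four sample words. The function name factor_words is
  hypothetical, and the sample words are chosen only for illustration
  (they are not read from bio-f-eva-gut.wds). It assumes a POSIX shell
  and a sed that treats unescaped braces as literal characters in basic
  regular expressions, as GNU sed does.

```shell
#!/bin/sh
# Hypothetical helper (not part of the original notes): applies the
# tripartition sed script to the words arriving on standard input.
factor_words() {
  sed \
    -e 's/sh/X/g' \
    -e 's/$/}/' \
    -e 's/^/{/' \
    -e 's/{\([qoaydirslmn][qoaydirslmn]*\)/\1{/' \
    -e 's/\([qoaydirslmn][qoaydirslmn]*\)}/}\1/' \
    -e 's/X/sh/g' \
    -e 's/{}/\./' \
    -e 's/\.//g' \
    -e 's/{/- -/' \
    -e 's/}/- -/'
}

# Illustrative input: one word per line.
printf '%s\n' qokedy shedy qoke daiin | factor_words
# Prints:
#   qo- -ke- -dy
#   - -she- -dy
#   qo- -ke- -
#   daiin
```

  The first three lines come out as prefix, midfix and suffix (with an
  empty prefix in the second and an empty suffix in the third); the
  last one is a unifix, since every letter of daiin belongs to the
  prefix/suffix class.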
    cat factored \
      | grep -v -e '- -' \
      > unifs-all.wds

    cat factored \
      | grep -e '- -' \
      | gawk '/./ {print $1}' \
      > prefs-all.wds

    cat factored \
      | grep -e '- -' \
      | gawk '/./ {print $2}' \
      > midfs-all.wds

    cat factored \
      | grep -e '- -' \
      | gawk '/./ {print $3}' \
      > suffs-all.wds

    dicio-wc {prefs,midfs,suffs,unifs}-all.wds

     lines   words     bytes file
    ------ ------- --------- ------------
      4666    4666     13963 prefs-all.wds
      4666    4666     26318 midfs-all.wds
      4666    4666     18576 suffs-all.wds
      1516    1516      6418 unifs-all.wds

    foreach f ( prefs midfs suffs unifs )
      cat ${f}-all.wds \
        | sort | uniq -c | expand | sort +0 -1nr \
        > ${f}-all.frq
    end

    dicio-wc {prefs,midfs,suffs,unifs}-all.frq

     lines   words     bytes file
    ------ ------- --------- ------------
        53     106       657 prefs-all.frq
       209     418      3295 midfs-all.frq
       113     226      1513 suffs-all.frq
       230     460      3013 unifs-all.frq

    pr -m -w 100 -e -t \
        {prefs,midfs,suffs,unifs}-all.frq \
      | expand \
      > joint-all.frq

    freq prefix    freq midfix    freq suffix    freq unifix
    ---- --------  ---- --------  ---- --------  ---- --------
    1859 -          824 -k-       1728 -dy        186 ol
    1296 qo-        588 -che-     1239 -y         126 qol
     607 o-         514 -she-      422 -aiin      106 daiin
     255 ol-        387 -kee-      254 -al         71 dal
     209 l-         354 -t-        245 -ol         64 dar
     108 y-         347 -ke-       157 -ar         56 saiin
      75 d-         179 -te-        86 -ain        55 or
      45 r-         121 -ch-        66 -or         50 sol
      36 qol-       113 -tee-       51 -d          48 dy
      29 s-         105 -shee-      36 -s          36 aiin
      23 q-          95 -chee-      28 -dar        28 dol
      21 sol-        83 -sh-        25 -dal        25 oly
      12 dy-         58 -pche-      21 -am         21 lol
       8 sal-        49 -chckh-     20 -           21 sal
       7 so-         38 -kch-       20 -aly        18 ar
       6 dal-        38 -p-         16 -a          18 iin
       6 olo-        33 -tche-      16 -l          18 raiin
       5 a-          31 -sheckh-    13 -oldy       17 sor
       5 dol-        28 -tch-       12 -daiin      15 al
       4 al-         27 -kche-      10 -air        15 sar
       4 lo-         26 -chcth-     10 -ary        14 s
       4 or-         25 -checkh-    10 -r          13 olor
       3 oqo-        25 -shckh-      9 -aldy       12 olol
       3 qod-        24 -shek-       7 -as         12 rol
       2 dl-         22 -kshe-       6 -ady        11 m
       2 do-         20 -ee-         6 -alor       11 ral
       2 lol-        17 -checth-     6 -dol        11 y
       2 olol-       17 -chek-       6 -dor        10 lor
       2 qoqo-       17 -tshe-       6 -oiin       10 oldy
       2 qor-        16 -pch-        6 -sdy        10 r
       2 rol-        15 -cth-        6 -sy          9 dain
       1 alo-        14 -cthe-       5 -o           9 olaiin
       1 aro-        14 -fche-       5 -oly         8 dam
       1 dar-        12 -chckhe-     4 -alol        8 ldy
       1 dor-        12 -shcth-      4 -an          8 ly
       1 ld-         11 -ckhe-       4 -dam         8 ory
       1 od-         11 -keee-       4 -m           8 qor
       1 odd-        10 -shckhe-     4 -ody         7 l
       1 oll-        10 -shecth-     3 -ay          7 orol
       1 oro-         9 -cheek-      3 -ydy         7 qoly
     ... ...        ... ...        ... ....       ... ...
    ---- --------  ---- --------  ---- --------  ---- --------
    4666 TOTAL     4666 TOTAL     4666 TOTAL     1516 TOTAL

  Let's count the words with empty prefix and suffix:

    cat factored | egrep '^-.*[^-]$' | wc
    cat factored | egrep '[^-]-.*-$' | wc
    cat factored | egrep '^-.*-$' | wc
    cat factored | egrep '^[^-].*-.*[^-]$' | wc

      empty prefix   =  1849
      empty suffix   =    20
      both empty     =    10
      both non-empty =  2797

  [ The first count excludes the words with both affixes empty, whereas
    the second pattern is not anchored at the start of the line and so
    includes them. Hence 1849 + 10 = 1859 empty prefixes and 20 empty
    suffixes, as in the table above, and 1849 + 20 + 2797 = 4666. ]

  Listing the non-hard midfixes:

    cat midfs-all.frq \
      | sed -e 's/sh/X/g' \
      | egrep '[^- 0-9Xchtkpfe]' \
      | sed -e 's/X/sh/g' \
      > midfs-anomalous.frq
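  To see what the filter above keeps, here is a minimal sketch that
  applies the same three steps to a handful of frequency lines. The
  function name anomalous_midfixes is hypothetical, and the midfixes
  -chdk- and -ksk- (with their counts) are invented for illustration;
  they are not taken from midfs-all.frq. It uses grep -E, the POSIX
  spelling of egrep.

```shell
#!/bin/sh
# Hypothetical helper (not part of the original notes): keeps only the
# frequency lines whose midfix contains a character other than the
# hard letters { sh c h t k p f e }.
anomalous_midfixes() {
  sed -e 's/sh/X/g' \
    | grep -E '[^- 0-9Xchtkpfe]' \
    | sed -e 's/X/sh/g'
}

# Sample lines in "count midfix" format; the last two are invented.
printf '%s\n' \
  '    824 -k-' \
  '    514 -she-' \
  '      2 -chdk-' \
  '      1 -ksk-' \
  | anomalous_midfixes
# Prints only the last two lines: "d" and a lone "s" are not hard
# letters, while the "sh" of -she- is protected by the temporary X.
```

  Replacing "sh" by X before the test is what lets a lone "s" be
  flagged without also flagging every midfix that contains "sh".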