Hacking at the Voynich manuscript - Side notes
505 A complete factorization of words in ERA alphabet

Last edited on 1999-01-31 06:20:52 by stolfi

OBSOLETE

This is partly a remake of work from Notebook-1.txt, originally done
around 97-07-05.

Summary of previous relevant tasks:

  I obtained Landini's interlinear transcription of the VMs, version
  1.6 (landini-interln16.evt) from
  http://sun1.bham.ac.uk/G.Landini/evmt/intrln16.zip. [Notebook-1.txt]

  Around 97-11-01 I split landini-interln16.evt into many files, with
  one text unit per page. [Notebook-12.txt]

  On 97-11-05 I mapped those files from FSG and other ad-hoc alphabets
  to EVA. [Notebook-12.txt] The files are L16-eva/fNNxx.YY, and a
  machine-readable description of their contents and logical order is
  in L16-eva/INDEX.

  Then I went back to redoing some of the previous tasks using the
  new encoding.

  I extracted the Currier (;C>) and Friedman-I (;F>) versions of the
  "bio" section, in EVA alphabet, as files bio-{c,f}-eva.evt. I also
  built the associated text files and word lists
  bio-{c,f}-eva{-{gut,fun,bad},{wds,dic,frq},.txt}. [Note-001.txt]

  Eventually I decided that it was necessary to map the data to a
  reduced alphabet (ERA), identifying similar letters, both to reduce
  transcription and sampling noise and to make the results more
  manageable. Accordingly, I created files
  bio-{c,f}-era-gut.{wds,dic}. [Note-003.txt]

  After some ad-hoc hacking, I tentatively identified a paradigm
  which consists of a couple hundred prefixes combined with suffixes
  of the form -e*d and -e*d?y (the EDY class) and -e*d?[ao]i*[rlnm]
  (the AIR class).
  The prefixes and their statistics were saved to file
  Note-004/pref-suff-class.txt [Note-004.txt]

97-11-10 stolfi
===============

  Let's see if there are any other interesting suffixes besides the
  EDY and AIR classes:

    cat bio-f-era-gut.wds \
      | egrep -v '[do]$' \
      | egrep -v 'oi*[rlmn]$' \
      > Note-005/.fun-suffs.wds

    dicio-wc Note-005/.fun-suffs.wds

     lines   words     bytes file
    ------ ------- --------- ------------
       174     174       717 Note-005/.fun-suffs.wds

  It seems we have pretty much got them all.
  Let's look at this remainder:

    cat Note-005/.fun-suffs.wds \
      | sort | uniq -c | expand \
      | revbytes | sort | revbytes \
      > Note-005/.fun-suffs.frq

  The only significant suffixes in this list are { -l -e*r }:

    l(7) chl(6) lchl(2) dolchl(1) rolchl(1) kl(1) lkl(1) okl(4)

    r(24)

    cheer(6) dcheer(1) lcheer(1) olcheer(2) ocheer(2) rcheer(1)
    ekeer(1) oekeer(1) okeer(5)

    cher(11) lcher(1) rcher(2) cheker(1) oker(2)

    chr(4) lchr(1) rchr(3) chekr(1)

  There are also 11 isolated "m"s, and 18 isolated "in"s, and
  some funny suffixes:

    -ede(1) -ee(3) -e(2)
    -dl(5) -odl(1) -ll(1) -nl(1) -oinl(1) -eerl(1)
    -em(4) -oinm(1)
    -een(2)
    -edr(1) -lr(6) -olr(6) -rlr(1)

  Looking at the prefixes, it seems that many of those that end in
  'ol-' are followed by a very restricted class of suffixes.
  Let's see:

    cat bio-f-era-gut.wds \
      | egrep 'ol[edoirlmn][edoirlmn]*$' \
      | sed \
          -e 's/^/ /g' \
          -e 's/^.*[^edoirlmn]\([edoirlmn]*\)$/\1/' \
      | sort | uniq -c | expand \
      | sort +0 -1nr \
      > Note-005/.olsuffs.frq

  The significant suffixes found are

    olo(63) oldo(29) olor(23) olol(21) oloin(14) dolo(10)

  And the next, not so significant, are

    eoldo(7) doldo(5) eolo(5) lolo(5) olr(5) orolo(5) rolo(5) olom(4)

  OK, so let's collect all the individual suffixes (with EVA -> ERA
  collapse), including those special ones.

  We now have a simple definition of "suffix": a maximal terminal
  string consisting only of /[edoirlmn]/ characters.
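  The suffix definition can be tried on a few words before running it
  over the whole word list. What follows is only a minimal sketch: it
  reuses the sed expression from the factoring command in this note,
  but the function name `factor` and the sample words are assumptions
  made for illustration and are not part of the original scripts.

```shell
# factor: hypothetical helper (not in the original notes).
# Splits each input word into "prefix- -suffix", where the suffix is
# the maximal terminal run of [edoirlmn] characters (possibly empty).
factor() {
  sed -e 's:\([edoirlmn]*\)$:- -\1:'
}

# Sample ERA words, chosen for illustration only.
printf '%s\n' okedo cheekoin olo chk | factor
# prints:
#   ok- -edo
#   cheek- -oin
#   - -olo
#   chk- -
```

  A word made up entirely of suffix letters (such as "olo") gets the
  empty prefix "-", and a word with no terminal suffix letters (such
  as "chk") gets the empty suffix "-".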
    cat bio-f-era-gut.wds \
      | sed -e 's:\([edoirlmn]*\)$:- -\1:' \
      > Note-005/.factored

    dicio-wc bio-f-era-gut.wds Note-005/.factored

     lines   words     bytes file
    ------ ------- --------- ------------
      6166    6166     34893 bio-f-era-gut.wds
      6166   12332     53391 Note-005/.factored

  Now, let's collect the prefixes and suffixes:

    cat Note-005/.factored \
      | gawk '/./ {print $1}' \
      | revbytes | sort | uniq | revbytes \
      > Note-005/.prefs-all.dic

    cat Note-005/.factored \
      | gawk '/./ {print $2}' \
      | sort | uniq \
      > Note-005/.suffs-all.dic

    dicio-wc Note-005/.prefs-all.dic Note-005/.suffs-all.dic

     lines   words     bytes file
    ------ ------- --------- ------------
       156     156      1023 Note-005/.prefs-all.dic
       219     219      1369 Note-005/.suffs-all.dic

  Great. Now let's count their occurrences and list the most important:

    cat Note-005/.factored \
      | gawk '/./ {print $1}' \
      | revbytes | sort | revbytes | uniq -c | expand \
      | sort +0 -1nr \
      > Note-005/.prefs-all.frq

  The 53 most important prefixes (at least 3 occurrences),
  accounting for 6040 words:

    ok-(1730) -(1544) ch-(1036) k-(229) chek-(199) olk-(156)
    lch-(148) olch-(115) cheek-(110) okch-(106) dch-(61) rch-(57)
    kch-(54) och-(49) pch-(46) opch-(41) lk-(38) ek-(35) chk-(29)
    p-(25) oek-(18) rolk-(17) op-(14) rolch-(14) dok-(12)
    okech-(12) polch-(11) ep-(10) lcheek-(9) ofch-(9) chok-(8)
    olkch-(8) dolch-(7) orch-(6) rchek-(6) rok-(6) chep-(5)
    chepch-(5) olok-(5) rk-(5) cheep-(4) cheok-(4) dk-(4) fch-(4)
    lchek-(4) okolch-(4) chedch-(3) dolk-(3) kok-(3) kolch-(3)
    lkch-(3) lpch-(3) olkech-(3)

  The other 103 prefixes (less than 3 occurrences), accounting
  for 126 words total:

    chech-(2) cheeek-(2) chf-(2) dlch-(2) dorch-(2) ef-(2)
    epch-(2) kech-(2) lchep-(2) lok-(2) lolk-(2) ocheek-(2)
    odch-(2) of-(2) okolk-(2) olchek-(2) olfch-(2) olpch-(2)
    ook-(2) opolk-(2) orok-(2) pok-(2) roek-(2) chch-(1)
    chedok-(1) cheekch-(1) cheekeedch-(1) cheolch-(1) chkch-(1)
    chlchpch-(1) choek-(1) choep-(1) chokch-(1) cholch-(1)
    cholk-(1) chop-(1) chpch-(1) dcheek-(1) dchek-(1) dchok-(1)
    dkch-(1) dokch-(1) dolfch-(1) efch-(1)
    ekch-(1) kchdolk-(1) kchek-(1) kcheok-(1) keok-(1) koek-(1)
    kolk-(1) korch-(1) korolch-(1) ldch-(1) lf-(1) loch-(1)
    ochch-(1) ochek-(1) ochep-(1) oddch-(1) odok-(1) odorch-(1)
    oeek-(1) oep-(1) okchok-(1) okechek-(1) okeech-(1)
    okeeolch-(1) okeolch-(1) okoch-(1) okok-(1) okook-(1)
    okop-(1) olcheek-(1) olchk-(1) olek-(1) olkeeoch-(1)
    ollch-(1) oloef-(1) olokch-(1) ololch-(1) ololk-(1)
    olpoek-(1) ooch-(1) opolch-(1) ork-(1) pchef-(1) pdolch-(1)
    poldch-(1) poldok-(1) polk-(1) polkech-(1) porch-(1) prch-(1)
    rcheek-(1) rchkch-(1) rek-(1) reok-(1) rkch-(1) rokch-(1)
    rolchk-(1) rolkch-(1) rpch-(1)

  Now for the suffixes:

    cat Note-005/.factored \
      | gawk '/./ {print $2}' \
      | sort | uniq -c | expand \
      | sort +0 -1nr \
      > Note-005/.suffs-all.frq

  The 20 most significant suffixes (at least 30 occurrences),
  accounting for 5344 words:

    -edo(1161) -ol(690) -eo(608) -oin(561) -eedo(427) -o(348)
    -eeo(299) -or(267) -do(167) -eol(123) -doin(117) -dol(106)
    -rol(94) -roin(86) -dor(74) -olo(63) -ror(47) -eor(43)
    -r(33) -ed(30)

  The 62 intermediate-frequency ones (less than 30, at least 3
  occurrences), accounting for another 659 words:

    -oldo(29) -edor(27) -lol(26) -oro(26) -l(23) -olor(23)
    -om(23) -edol(21) -olol(21) -eer(20) -orol(19) -ro(19)
    -in(18) -er(17) -odo(16) -eeol(15) -lo(15) -(14) -eed(14)
    -lor(14) -oloin(14) -oir(13) -edoin(12) -eeeo(11) -m(11)
    -oroin(11) -dolo(10) -d(9) -dom(9) -ldo(9) -oror(9)
    -eeedo(7) -eoldo(7) -oeedo(7) -doir(6) -doro(6) -lr(6)
    -dl(5) -doldo(5) -eeor(5) -eolo(5) -lolo(5) -odoin(5)
    -olr(5) -orolo(5) -rdo(5) -rolo(5) -rom(5) -eedol(4) -em(4)
    -loin(4) -olom(4) -on(4) -edom(3) -ee(3) -eedor(3) -ero(3)
    -oo(3) -ool(3) -oor(3) -roir(3) -rorol(3)

  The 137 least significant ones (less than 3 occurrences),
  accounting for only 163 words:

    -dedo(2) -deedo(2) -dolol(2) -dolor(2) -dorodo(2)
    -dororo(2) -e(2) -edoir(2) -een(2) -eero(2) -eodo(2)
    -eoin(2) -eolol(2) -eom(2) -eoo(2) -ldoin(2) -ldol(2)
    -lom(2) -loroin(2) -odol(2) -odor(2) -oeeedo(2) -oil(2)
    -ololo(2) -olorol(2)
    -orom(2) -deeedo(1) -deeo(1) -doedo(1) -doil(1) -doindo(1)
    -doinl(1) -doirodo(1) -doirol(1) -doiroldo(1) -dolord(1)
    -doloro(1) -dool(1) -dordo(1) -doroin(1) -dorom(1) -doror(1)
    -dororom(1) -drol(1) -ede(1) -ededo(1) -edeeo(1) -edeo(1)
    -edoldo(1) -edolo(1) -edool(1) -edoor(1) -edoro(1)
    -edorol(1) -edr(1) -eedeeo(1) -eedoldo(1) -eedom(1)
    -eeerol(1) -eeodo(1) -eeoin(1) -eeolo(1) -eeoolor(1)
    -eerl(1) -eerol(1) -eod(1) -eodoin(1) -eoeoloin(1) -eold(1)
    -eolor(1) -eoro(1) -eorol(1) -erdo(1) -ino(1) -ld(1)
    -lddo(1) -ldolor(1) -ldor(1) -ll(1) -lldor(1) -lod(1)
    -loeeo(1) -loinm(1) -loldo(1) -lolom(1) -lolor(1) -lorol(1)
    -lroiror(1) -lron(1) -lror(1) -n(1) -nl(1) -odeedo(1)
    -odeeo(1) -odoirol(1) -odorol(1) -oeeo(1) -oeo(1)
    -oinolo(1) -old(1) -olddo(1) -oldoir(1) -oldol(1)
    -oleedo(1) -oleeedo(1) -oleereo(1) -ollom(1) -olod(1)
    -oloino(1) -oloir(1) -ololdo(1) -olordo(1) -oloro(1)
    -oloroin(1) -oloror(1) -olro(1) -olrolo(1) -ooin(1) -ooo(1)
    -ooon(1) -ordo(1) -orodl(1) -orodo(1) -oroir(1) -orolom(1)
    -orolr(1) -ororo(1) -ororor(1) -rlr(1) -rodor(1) -roino(1)
    -roirol(1) -roldo(1) -rolor(1) -roro(1) -roroin(1)
    -roror(1)

  Perhaps we should have left the "e*" part in the prefix...
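  One possible reading of that remark is to factor the words as
  before and then move any leading run of "e"s from the suffix back
  onto the prefix. The sketch below does this with a second sed
  expression. This interpretation, the function name `factor_keep_e`,
  and the sample words are all assumptions made for illustration;
  nothing of the sort was actually run for this note.

```shell
# factor_keep_e: hypothetical variant of the factoring step.
# Step 1 splits off the maximal terminal run of [edoirlmn], as in
# the original command; step 2 moves a leading run of e's from the
# suffix back to the end of the prefix.
factor_keep_e() {
  sed -e 's:\([edoirlmn]*\)$:- -\1:' \
      -e 's:- -\(ee*\):\1- -:'
}

# Sample ERA words, chosen for illustration only.
printf '%s\n' okedo cheedo okeo olo | factor_keep_e
# prints:
#   oke- -do
#   chee- -do
#   oke- -o
#   - -olo
```

  Under this variant, suffixes such as -edo and -eedo would both
  collapse to -do, with the "e" count carried by the prefix instead.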
  Here are all the 219 suffixes sorted by reverse suffix, and grouped
  by last letters:

        14 -

         9 -d
        30 -ed
        14 -eed
         1 -ld
         1 -old
         1 -eold
         1 -eod
         1 -lod
         1 -olod
         1 -dolord

         2 -e
         1 -ede
         3 -ee

        23 -l
         5 -dl
         1 -orodl
         2 -oil
         1 -doil
         1 -ll
         1 -nl
         1 -doinl
       690 -ol
       106 -dol
        21 -edol
         4 -eedol
         2 -ldol
         1 -oldol
         2 -odol
       123 -eol
        15 -eeol
        26 -lol
        21 -olol
         2 -dolol
         2 -eolol
         3 -ool
         1 -dool
         1 -edool
        94 -rol
         1 -drol
         1 -eerol
         1 -eeerol
         1 -doirol
         1 -odoirol
         1 -roirol
        19 -orol
         1 -edorol
         1 -odorol
         1 -eorol
         1 -lorol
         2 -olorol
         3 -rorol
         1 -eerl

        11 -m
         4 -em
         1 -loinm
        23 -om
         9 -dom
         3 -edom
         1 -eedom
         2 -eom
         2 -lom
         1 -ollom
         4 -olom
         1 -lolom
         1 -orolom
         5 -rom
         2 -orom
         1 -dorom
         1 -dororom

         1 -n
         2 -een
        18 -in
       561 -oin
       117 -doin
        12 -edoin
         2 -ldoin
         5 -odoin
         1 -eodoin
         2 -eoin
         1 -eeoin
         4 -loin
        14 -oloin
         1 -eoeoloin
         1 -ooin
        86 -roin
        11 -oroin
         1 -doroin
         2 -loroin
         1 -oloroin
         1 -roroin
         4 -on
         1 -ooon
         1 -lron

       348 -o
       167 -do
         1 -lddo
         1 -olddo
      1161 -edo
         2 -dedo
         1 -ededo
       427 -eedo
         2 -deedo
         1 -odeedo
         7 -eeedo
         1 -deeedo
         1 -oleeedo
         2 -oeeedo
         1 -oleedo
         7 -oeedo
         1 -doedo
         9 -ldo
        29 -oldo
         5 -doldo
         1 -edoldo
         1 -eedoldo
         7 -eoldo
         1 -loldo
         1 -ololdo
         1 -roldo
         1 -doiroldo
         1 -doindo
        16 -odo
         2 -eodo
         1 -eeodo
         1 -doirodo
         1 -orodo
         2 -dorodo
         5 -rdo
         1 -erdo
         1 -ordo
         1 -dordo
         1 -olordo
       608 -eo
         1 -edeo
       299 -eeo
         1 -deeo
         1 -edeeo
         1 -eedeeo
         1 -odeeo
        11 -eeeo
         1 -oeeo
         1 -loeeo
         1 -oeo
         1 -oleereo
        15 -lo
        63 -olo
        10 -dolo
         1 -edolo
         5 -eolo
         1 -eeolo
         5 -lolo
         2 -ololo
         1 -oinolo
         5 -rolo
         1 -olrolo
         5 -orolo
         1 -ino
         1 -oloino
         1 -roino
         3 -oo
         2 -eoo
         1 -ooo
        19 -ro
         3 -ero
         2 -eero
         1 -olro
        26 -oro
         6 -doro
         1 -edoro
         1 -eoro
         1 -oloro
         1 -doloro
         1 -roro
         1 -ororo
         2 -dororo

        33 -r
         1 -edr
        17 -er
        20 -eer
        13 -oir
         6 -doir
         2 -edoir
         1 -oldoir
         1 -oloir
         3 -roir
         1 -oroir
         6 -lr
         5 -olr
         1 -orolr
         1 -rlr
       267 -or
        74 -dor
        27 -edor
         3 -eedor
         1 -ldor
         1 -lldor
         2 -odor
         1 -rodor
        43 -eor
         5 -eeor
        14 -lor
        23 -olor
         2 -dolor
         1 -ldolor
         1 -eolor
         1 -lolor
         1 -eeoolor
         1 -rolor
         3 -oor
         1 -edoor
        47 -ror
         1 -lroiror
         1 -lror
         9 -oror
         1 -doror
         1 -oloror
         1 -roror
         1 -ororor
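
  The "sorted by reverse suffix" ordering used for this table comes
  from the reverse / sort / reverse idiom seen in the pipelines of
  this note. The sketch below shows the idea on a handful of
  suffixes. The `revbytes` tool is local to these notes, so the
  sketch swaps in a small awk function as a stand-in; the names
  `revline` and `sort_by_ending`, and the use of the C locale to get
  plain byte ordering, are assumptions made for illustration.

```shell
# revline: hypothetical stand-in for the notes' revbytes tool;
# reverses the characters of each input line.
revline() {
  awk '{ out = ""
         for (i = length($0); i > 0; i--) out = out substr($0, i, 1)
         print out }'
}

# sort_by_ending: sorts lines by their reversed spelling, so that
# strings sharing the same final letters end up next to each other.
# LC_ALL=C forces byte-wise comparison (an assumption of this sketch).
sort_by_ending() {
  revline | LC_ALL=C sort | revline
}

# A few suffixes from the table, in frequency order.
printf '%s\n' -edo -ol -do -dol -o -rol | sort_by_ending
# prints:
#   -ol
#   -dol
#   -rol
#   -o
#   -do
#   -edo
```

  The result follows the same relative order as the corresponding
  rows of the table: the -l endings first, then the -o endings, with
  longer suffixes listed after the shorter ones they extend.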