Hacking at the Voynich manuscript - Side notes
506 An alternative complete factorization of words in ERA alphabet

Last edited on 1999-01-31 06:21:20 by stolfi

OBSOLETE

This is partly a remake of work from Notebook-1.txt and Notebook-2.txt,
originally done between 97-07-05 and 97-07-16.

Summary of previous relevant tasks:

I obtained Landini's interlinear transcription of the VMs, version 1.6
(landini-interln16.evt) from
http://sun1.bham.ac.uk/G.Landini/evmt/intrln16.zip. [Notebook-1.txt]

Around 97-11-01 I split landini-interln16.evt into many files, with
one text unit per page. [Notebook-12.txt]

On 97-11-05 I mapped those files from FSG and other ad-hoc alphabets
to EVA. [Notebook-12.txt] The files are L16-eva/fNNxx.YY, and a
machine-readable description of their contents and logical order is
in L16-eva/INDEX.

Then I went back and started redoing some of the previous tasks using
the new encoding.

I extracted the Currier (;C>) and Friedman-I (;F>) versions of the
"bio" section, in EVA alphabet, as files bio-{c,f}-eva.evt. I also
built the associated text files and word lists
bio-{c,f}-eva{-{gut,fun,bad},{wds,dic,frq},.txt}. [Note-001.txt]

Eventually I decided that it was necessary to map the data to a
reduced alphabet (ERA) that identifies similar letters, both to
reduce transcription and sampling noise and to make the results more
manageable. Accordingly, I created files bio-{c,f}-era-gut.{wds,dic}.
[Note-003.txt]

After some ad-hoc hacking, I tentatively identified a paradigm that
consists of 156 prefixes combined with 219 suffixes; the latter are
maximal strings of the form [edoirlmn]*. [Note-005.txt]

97-11-11 stolfi
===============

It seems that we have too many suffixes and too few prefixes.
Probably, attaching the "e"s to the suffix is a mistake. For one
thing, some "e"s come from 'c[pftk]h' gallows, which should be in the
prefix. Also, the "e"s are always at the left end of the suffixes.

So let's redo Note-005.txt, with the "e"s in the prefix. I.e.
a suffix is now a maximal terminal string of the form /[doirlmn]*/:

    cat bio-f-era-gut.wds \
      | sed -e 's:\([doirlmn]*\)$:- -\1:' \
      > Note-006/.factored

    dicio-wc bio-f-era-gut.wds Note-006/.factored

     lines   words     bytes file
    ------ ------- --------- ------------
      6166    6166     34893 bio-f-era-gut.wds
      6166   12332     53391 Note-006/.factored

Now, let's collect the prefixes and suffixes:

    cat Note-006/.factored \
      | gawk '/./ {print $1}' \
      | revbytes | sort | uniq | revbytes \
      > Note-006/.prefs-all.dic

    cat Note-006/.factored \
      | gawk '/./ {print $2}' \
      | sort | uniq \
      > Note-006/.suffs-all.dic

    dicio-wc Note-006/.prefs-all.dic Note-006/.suffs-all.dic

     lines   words     bytes file
    ------ ------- --------- ------------
       267     267      1858 Note-006/.prefs-all.dic
       147     147       901 Note-006/.suffs-all.dic

Great. Now let's count their occurrences and list the most important:

    cat Note-006/.factored \
      | gawk '/./ {print $1}' \
      | revbytes | sort | revbytes | uniq -c | expand \
      | sort +0 -1nr \
      > Note-006/.prefs-all.frq

The 27 most important prefixes (at least 30 occurrences), accounting
for 5449 words:

    -(1510) ok-(923) che-(748) oke-(414) okee-(383) chee-(147) ch-(139)
    k-(123) lche-(116) cheke-(113) olche-(88) olk-(80) cheeke-(77)
    ke-(62) okch-(54) chek-(53) okche-(52) olkee-(44) dche-(43)
    rche-(42) kee-(41) kche-(37) oche-(35) chekee-(32) pche-(32)
    opche-(31) olke-(30)

The next 64 prefixes (less than 30 but at least 3 occurrences),
accounting for another 507 words:

    cheek-(24) p-(24) chk-(22) lk-(20) lch-(19) kch-(16) olch-(16)
    eke-(15) ekee-(14) op-(14) lchee-(13) ochee-(13) pch-(13)
    rolche-(12) dch-(11) olchee-(11) lkee-(10) oekee-(10) rolkee-(10)
    cheekee-(9) rch-(9) oee-(8) oeke-(8) ofche-(8) okech-(8) opch-(8)
    chok-(7) dchee-(7) lke-(7) okeee-(7) chke-(6) ek-(6) polch-(6)
    rchee-(6) dok-(5) ee-(5) epee-(5) lcheeke-(5) polche-(5) rolk-(5)
    cheok-(4) doke-(4) epe-(4) okeche-(4) olkch-(4) olkche-(4)
    rchek-(4) rk-(4) chepche-(3) chepee-(3) dee-(3) dokee-(3)
    dolch-(3) dolche-(3) eee-(3) fche-(3) kok-(3) kolch-(3)
    lcheekee-(3) lkche-(3) olok-(3) orche-(3) rok-(3) rokee-(3)

The other 176 prefixes (less than 3 occurrences), accounting for
another 210 words:

    chech-(2) chedche-(2) cheepe-(2) cheepee-(2) chep-(2) chepch-(2)
    de-(2) dke-(2) dlche-(2) e-(2) keee-(2) lchek-(2) lcheke-(2)
    lpche-(2) odche-(2) odee-(2) oeee-(2) of-(2) okede-(2) okolch-(2)
    okolche-(2) okolk-(2) olcheke-(2) olfche-(2) olkech-(2) olkeee-(2)
    olpche-(2) ook-(2) opchee-(2) opolk-(2) orch-(2) rcheke-(2)
    rolch-(2) rolke-(2) chch-(1) chedch-(1) chedok-(1) cheee-(1)
    cheeeke-(1) cheeekee-(1) cheekch-(1) cheekeedch-(1) chekeee-(1)
    cheolch-(1) chf-(1) chfee-(1) chkch-(1) chkee-(1) chlchpchee-(1)
    choe-(1) choeke-(1) choepee-(1) chokch-(1) chokee-(1) cholche-(1)
    cholkeee-(1) chop-(1) chpche-(1) dcheeke-(1) dchekee-(1) dchok-(1)
    deee-(1) dk-(1) dkche-(1) dkee-(1) doe-(1) dokch-(1) dolchee-(1)
    dolfche-(1) dolk-(1) dolke-(1) dolkeee-(1) dorch-(1) dorchee-(1)
    eedee-(1) efche-(1) efe-(1) efeee-(1) ekch-(1) ep-(1) epch-(1)
    epche-(1) fch-(1) kchdolk-(1) kchee-(1) kchek-(1) kcheokee-(1)
    kech-(1) keche-(1) keoe-(1) keoke-(1) koekee-(1) kolke-(1)
    korch-(1) korolch-(1) lcheek-(1) lchepe-(1) lchepee-(1) ldche-(1)
    lf-(1) lkede-(1) loche-(1) loee-(1) lok-(1) lokee-(1) lolk-(1)
    lolke-(1) lpch-(1) och-(1) ochche-(1) ocheeke-(1) ocheekee-(1)
    ocheke-(1) ochep-(1) oddche-(1) odoke-(1) odorche-(1) oeekee-(1)
    oep-(1) ofch-(1) okchok-(1) okecheke-(1) okedee-(1) okeech-(1)
    okeeolche-(1) okeolch-(1) okoch-(1) okok-(1) okook-(1) okop-(1)
    olcheeke-(1) olchk-(1) olee-(1) oleee-(1) oleere-(1) oleke-(1)
    olkeche-(1) olkeeoche-(1) ollch-(1) oloefe-(1) olokche-(1)
    oloke-(1) olokee-(1) ololche-(1) ololkee-(1) olpoeke-(1) ooche-(1)
    opolche-(1) orchee-(1) ork-(1) orok-(1) oroke-(1) pchee-(1)
    pchefe-(1) pdolch-(1) pe-(1) pok-(1) poke-(1) poldche-(1)
    poldok-(1) polkech-(1) polkee-(1) porche-(1) prche-(1) rcheeke-(1)
    rchkch-(1) rekee-(1) reok-(1) rkch-(1) rkee-(1) roeke-(1)
    roekee-(1) rokchee-(1) rolchk-(1) rolkche-(1) rpch-(1)

Now for the suffixes:

    cat Note-006/.factored \
      | gawk '/./ {print $2}' \
      | sort | uniq -c | expand \
      | sort +0 -1nr \
      > Note-006/.suffs-all.frq

The 24 most significant suffixes (at least 20 occurrences),
accounting for 5796 words:

    -do(1781) -o(1275) -ol(828) -oin(564) -or(315) -dol(131)
    -doin(129) -dor(104) -rol(96) -roin(86) -r(70) -olo(69) -d(53)
    -ror(47) -oldo(36) -oro(27) -lol(26) -om(25) -olor(24) -ro(24)
    -l(23) -olol(23) -(20) -orol(20)

The 33 intermediate-frequency ones (less than 20, at least 3
occurrences), accounting for another 264 words:

    -odo(19) -in(18) -lo(15) -m(15) -oloin(15) -lor(14) -dom(13)
    -oir(13) -dolo(11) -oroin(11) -ldo(9) -oror(9) -doir(8) -doldo(7)
    -doro(7) -lr(6) -odoin(6) -rdo(6) -dl(5) -lolo(5) -olr(5) -oo(5)
    -orolo(5) -rolo(5) -rom(5) -loin(4) -olom(4) -on(4) -n(3) -ool(3)
    -oor(3) -roir(3) -rorol(3)

The 90 least significant ones (less than 3 occurrences), accounting
for only 106 words:

    -dolol(2) -dolor(2) -dool(2) -dorodo(2) -dororo(2) -ldoin(2)
    -ldol(2) -lom(2) -loroin(2) -odol(2) -odor(2) -oil(2) -old(2)
    -ololo(2) -olorol(2) -orom(2) -doil(1) -doindo(1) -doinl(1)
    -doirodo(1) -doirol(1) -doiroldo(1) -dolord(1) -doloro(1) -door(1)
    -dordo(1) -doroin(1) -dorol(1) -dorom(1) -doror(1) -dororom(1)
    -dr(1) -drol(1) -ino(1) -ld(1) -lddo(1) -ldolor(1) -ldor(1) -ll(1)
    -lldor(1) -lod(1) -loinm(1) -loldo(1) -lolom(1) -lolor(1)
    -lorol(1) -lroiror(1) -lron(1) -lror(1) -nl(1) -od(1) -odoirol(1)
    -odorol(1) -oinolo(1) -olddo(1) -oldoir(1) -oldol(1) -ollom(1)
    -olod(1) -oloino(1) -oloir(1) -ololdo(1) -olordo(1) -oloro(1)
    -oloroin(1) -oloror(1) -olro(1) -olrolo(1) -ooin(1) -oolor(1)
    -ooo(1) -ooon(1) -ordo(1) -orodl(1) -orodo(1) -oroir(1)
    -orolom(1) -orolr(1) -ororo(1) -ororor(1) -rl(1) -rlr(1)
    -rodor(1) -roino(1) -roirol(1) -roldo(1) -rolor(1) -roro(1)
    -roroin(1) -roror(1)

Let's compare the suffixes of a few common prefixes:

    set tfiles = ( )
    set totw = 0
    set sufw = 7
    set digs = 3
    echo " "
    foreach f ( k ok ke oke che kee okee chee lche ch chek )
      set g = "Note-006/.suffs-${f}.frq"
      echo "$f-"
      /bin/rm -f ${g}
      echo "frq" "$f-" \
        | gawk '/./ {printf "%'"${digs}"'s %-'"${sufw}"'s\n", $1, $2}' \
        >> ${g}
      echo "--------------" "--------------" \
        | gawk '/./ {printf "%.'"${digs}"'s %.'"${sufw}"'s\n", $1, $2}' \
        >> ${g}
      cat Note-006/.factored \
        | egrep '^'"${f}"'-' \
        | gawk '/./ {print $2}' \
        | revbytes | sort | revbytes | uniq -c \
        | gawk '/./ {printf "%'"${digs}"'d %s\n", $1, $2}' \
        | sort +0 -1nr \
        >> ${g}
      @ totw = ${totw} + ${digs} + 1 + ${sufw} + 1
      set tfiles = ( ${tfiles} ${g} )
    end

    pr -m -s' ' -t -i' '1 -w ${totw} ${tfiles} \
      | expand \
      > Note-006/prefs-cmp.txt

It seems that "k-" and "ok-" have a somewhat different "conjugation"
than most other prefixes:

    frq k-      frq ok-
    --- ------- --- -------
     37 -oin    387 -oin
     35 -ol     225 -ol
     19 -or     118 -o
      8 -o      111 -or
      4 -om      13 -olo
      4 -oro      8 -oldo
      3 -oldo     6 -oir
      2 -odo      6 -om
    ... ...     ... ...

    frq ke-     frq oke-    frq che-    frq lche-   frq kee-    frq okee-
    --- ------- --- ------- --- ------- --- ------- --- ------- --- -------
     47 -do     305 -do     422 -do      72 -do      25 -do     242 -do
      7 -ol      69 -o      169 -o       27 -o        9 -o      118 -o
      3 -o        9 -dor     65 -ol       5 -ol       3 -d        7 -d
      1 -dol      8 -ol      28 -or       4 -d        3 -ol       5 -r
      1 -dool     6 -or      13 -dol      1 -dol      1 -dom      3 -ol
      1 -dor      3 -d       11 -r        1 -dom                  2 -dor
      1 -olo      3 -dol     10 -doin     1 -dor                  2 -ro
      1 -or       2 -doin     8 -d        1 -doro                 1 -dol
    ... ...     ... ...     ... ...     ... ...     ... ...     ... ...

    frq chee-   frq ch-     frq chek-
    --- ------- --- ------- --- -------
     66 -o       41 -ol      35 -o
     62 -do      20 -do       5 -
      6 -ol      20 -o        5 -or
      6 -r       16 -or       4 -oin
      2 -n        6 -l        2 -om
      1 -d        4 -dor      1 -ol
      1 -dor      4 -r        1 -r
      1 -oin      3 -oldo
    ... ...     ... ...     ... ...
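The per-prefix tabulation done by the csh loop above can also be sketched in plain sh and awk, without the revbytes and pr helpers. This is only an illustration under stated assumptions: the function names factor_words and suffix_profile are made up for this sketch, the input words are a toy sample rather than the contents of bio-f-era-gut.wds, and ties are broken alphabetically by suffix instead of by reversed suffix.

```shell
#!/bin/sh
# Split each word into "prefix- -suffix", where the suffix is the maximal
# terminal run of the letters d, o, i, r, l, m, n (same sed rule as above).
factor_words() {
  sed -e 's:\([doirlmn]*\)$:- -\1:'
}

# Tabulate the suffixes that follow one given prefix, most frequent first.
# Reads factored lines on stdin; $1 is the prefix without its trailing "-".
# (Hypothetical helper, not part of the original tool set.)
suffix_profile() {
  awk -v p="$1-" '
    $1 == p { n[$2]++ }
    END     { for (s in n) printf "%3d %s\n", n[s], s }
  ' | sort -k1,1nr -k2,2
}

# Toy input (made-up sample, not taken from the real word list).
printf '%s\n' okoin okol okoin chedo okedo ol \
  | factor_words \
  | suffix_profile ok
```

For the toy input, the profile of "ok-" lists -oin (2 words) ahead of -ol (1 word). Matching the prefix field exactly in awk plays the role of the anchored egrep pattern '^'"${f}"'-' in the loop.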
OK, let's generate a table with the main prefixes and suffixes:

    cat Note-006/.prefs-all.frq \
      | sort +0 -1nr \
      | gawk '($1 >= 3) {print $2}' \
      | revbytes | sort | revbytes \
      > Note-006/.prefs-top.dic

    cat Note-006/.suffs-all.frq \
      | sort +0 -1nr \
      | gawk '($1 >= 20) {print $2}' \
      > Note-006/.suffs-top.dic

    dicio-wc Note-006/.prefs-top.dic Note-006/.suffs-top.dic

     lines   words     bytes file
    ------ ------- --------- ------------
        91      91       558 Note-006/.prefs-top.dic
        24      24       110 Note-006/.suffs-top.dic

    cat Note-006/.factored \
      | count-diword-freqs \
          -v rows=Note-006/.prefs-top.dic \
          -v cols=Note-006/.suffs-top.dic \
          -v digits=3 \
      > Note-006/pref-suff-wds-table.txt

    cat Note-006/.factored \
      | fgrep -w -f Note-006/.prefs-top.dic \
      | fgrep -w -f Note-006/.suffs-top.dic \
      | wc

This set of prefixes and suffixes covers 5867 of the 6166 original
words (95.2%)!

Let's compute the corresponding numbers without taking word
repetitions into account:

    cat bio-f-era-gut.dic \
      | sed -e 's:\([doirlmn]*\)$:- -\1:' \
      | egrep -e '- -' \
      > Note-006/.dic-factored

    dicio-wc bio-f-era-gut.dic Note-006/.dic-factored

     lines   words     bytes file
    ------ ------- --------- ------------
       763     763      5164 bio-f-era-gut.dic
       763    1526      7453 Note-006/.dic-factored

Let's get the prefixes and suffixes again (they should not change):

    cat Note-006/.dic-factored \
      | gawk '/./ {print $1}' \
      | revbytes | sort | uniq | revbytes \
      > Note-006/.prefs-all.dic

    cat Note-006/.dic-factored \
      | gawk '/./ {print $2}' \
      | sort | uniq \
      > Note-006/.suffs-all.dic

    dicio-wc Note-006/.prefs-all.dic Note-006/.suffs-all.dic

     lines   words     bytes file
    ------ ------- --------- ------------
       267     267      1858 Note-006/.prefs-all.dic
       147     147       901 Note-006/.suffs-all.dic

    cat Note-006/.dic-factored \
      | gawk '/./ {print $1}' \
      | revbytes | sort | revbytes | uniq -c | expand \
      | sort +0 -1nr \
      > Note-006/.prefs-all.frq

    cat Note-006/.dic-factored \
      | gawk '/./ {print $2}' \
      | sort | uniq -c | expand \
      | sort +0 -1nr \
      > Note-006/.suffs-all.frq

Here are the prefixes, with number
of *distinct* words using them:

The 98 prefixes that occur in at least two distinct words, accounting
for 594 distinct words:

    -(123) ok-(32) ch-(25) che-(22) oke-(16) k-(15) lche-(12)
    okee-(11) chee-(10) olk-(10) p-(10) cheke-(9) olche-(9) ke-(8)
    lk-(8) oche-(8) olkee-(8) rche-(8) chek-(7) chk-(6) eke-(6)
    lch-(6) okch-(6) op-(6) pche-(6) dch-(5) dchee-(5) kch-(5) kee-(5)
    lchee-(5) opch-(5) opche-(5) pch-(5) rch-(5) dche-(4) kche-(4)
    ochee-(4) oekee-(4) okech-(4) olke-(4) polch-(4) rchee-(4)
    rolche-(4) cheeke-(3) chekee-(3) chok-(3) ee-(3) ek-(3) ekee-(3)
    epe-(3) kolch-(3) lkee-(3) oeke-(3) ofche-(3) okche-(3) olchee-(3)
    polche-(3) rk-(3) rolk-(3) chech-(2) chedche-(2) cheekee-(2)
    chep-(2) chepche-(2) chke-(2) dee-(2) dlche-(2) dok-(2) doke-(2)
    dokee-(2) dolch-(2) dolche-(2) e-(2) eee-(2) epee-(2) kok-(2)
    lcheekee-(2) lkche-(2) lke-(2) lpche-(2) odche-(2) odee-(2) oee-(2)
    of-(2) okeche-(2) okede-(2) okeee-(2) okolch-(2) okolk-(2)
    olch-(2) olkch-(2) olkche-(2) olkech-(2) olok-(2) rok-(2)
    rolch-(2) rolke-(2) rolkee-(2)

The 169 prefixes that account for only one word each:

    chch-(1) chedch-(1) chedok-(1) cheee-(1) cheeeke-(1) cheeekee-(1)
    cheek-(1) cheekch-(1) cheekeedch-(1) cheepe-(1) cheepee-(1)
    chekeee-(1) cheok-(1) cheolch-(1) chepch-(1) chepee-(1) chf-(1)
    chfee-(1) chkch-(1) chkee-(1) chlchpchee-(1) choe-(1) choeke-(1)
    choepee-(1) chokch-(1) chokee-(1) cholche-(1) cholkeee-(1)
    chop-(1) chpche-(1) dcheeke-(1) dchekee-(1) dchok-(1) de-(1)
    deee-(1) dk-(1) dkche-(1) dke-(1) dkee-(1) doe-(1) dokch-(1)
    dolchee-(1) dolfche-(1) dolk-(1) dolke-(1) dolkeee-(1) dorch-(1)
    dorchee-(1) eedee-(1) efche-(1) efe-(1) efeee-(1) ekch-(1) ep-(1)
    epch-(1) epche-(1) fch-(1) fche-(1) kchdolk-(1) kchee-(1)
    kchek-(1) kcheokee-(1) kech-(1) keche-(1) keee-(1) keoe-(1)
    keoke-(1) koekee-(1) kolke-(1) korch-(1) korolch-(1) lcheek-(1)
    lcheeke-(1) lchek-(1) lcheke-(1) lchepe-(1) lchepee-(1) ldche-(1)
    lf-(1) lkede-(1) loche-(1) loee-(1) lok-(1) lokee-(1) lolk-(1)
    lolke-(1) lpch-(1) och-(1) ochche-(1) ocheeke-(1)
    ocheekee-(1) ocheke-(1) ochep-(1) oddche-(1) odoke-(1) odorche-(1)
    oeee-(1) oeekee-(1) oep-(1) ofch-(1) okchok-(1) okecheke-(1)
    okedee-(1) okeech-(1) okeeolche-(1) okeolch-(1) okoch-(1) okok-(1)
    okolche-(1) okook-(1) okop-(1) olcheeke-(1) olcheke-(1) olchk-(1)
    olee-(1) oleee-(1) oleere-(1) oleke-(1) olfche-(1) olkeche-(1)
    olkeee-(1) olkeeoche-(1) ollch-(1) oloefe-(1) olokche-(1)
    oloke-(1) olokee-(1) ololche-(1) ololkee-(1) olpche-(1)
    olpoeke-(1) ooche-(1) ook-(1) opchee-(1) opolche-(1) opolk-(1)
    orch-(1) orche-(1) orchee-(1) ork-(1) orok-(1) oroke-(1) pchee-(1)
    pchefe-(1) pdolch-(1) pe-(1) pok-(1) poke-(1) poldche-(1)
    poldok-(1) polkech-(1) polkee-(1) porche-(1) prche-(1) rcheeke-(1)
    rchek-(1) rcheke-(1) rchkch-(1) rekee-(1) reok-(1) rkch-(1)
    rkee-(1) roeke-(1) roekee-(1) rokchee-(1) rokee-(1) rolchk-(1)
    rolkche-(1) rpch-(1)

Now, the top 22 suffixes, accounting for 595 distinct words:

    -o(165) -do(130) -ol(56) -or(39) -oin(27) -d(26) -r(19) -dol(17)
    -dor(15) -(14) -olo(12) -om(11) -oldo(10) -l(8) -odo(8) -oro(7)
    -ro(6) -dom(5) -m(5) -oir(5) -olor(5) -orol(5)

The 30 suffixes with less than 5 but at least 2 distinct prefixes,
accounting for 73 distinct words:

    -doin(4) -oo(4) -rdo(4) -doir(3) -doldo(3) -oloin(3) -olol(3)
    -olr(3) -oroin(3) -rol(3) -dolo(2) -dool(2) -doro(2) -ldo(2)
    -ldoin(2) -lo(2) -loin(2) -lol(2) -lor(2) -lr(2) -n(2) -odoin(2)
    -odol(2) -old(2) -ololo(2) -olom(2) -oor(2) -orolo(2) -orom(2)
    -oror(2)

The 95 suffixes that occur with only one prefix:

    -dl(1) -doil(1) -doindo(1) -doinl(1) -doirodo(1) -doirol(1)
    -doiroldo(1) -dolol(1) -dolor(1) -dolord(1) -doloro(1) -door(1)
    -dordo(1) -dorodo(1) -doroin(1) -dorol(1) -dorom(1) -doror(1)
    -dororo(1) -dororom(1) -dr(1) -drol(1) -in(1) -ino(1) -ld(1)
    -lddo(1) -ldol(1) -ldolor(1) -ldor(1) -ll(1) -lldor(1) -lod(1)
    -loinm(1) -loldo(1) -lolo(1) -lolom(1) -lolor(1) -lom(1)
    -loroin(1) -lorol(1) -lroiror(1) -lron(1) -lror(1) -nl(1) -od(1)
    -odoirol(1) -odor(1) -odorol(1) -oil(1) -oinolo(1) -olddo(1)
    -oldoir(1) -oldol(1)
    -ollom(1) -olod(1) -oloino(1) -oloir(1) -ololdo(1) -olordo(1)
    -oloro(1) -oloroin(1) -olorol(1) -oloror(1) -olro(1) -olrolo(1)
    -on(1) -ooin(1) -ool(1) -oolor(1) -ooo(1) -ooon(1) -ordo(1)
    -orodl(1) -orodo(1) -oroir(1) -orolom(1) -orolr(1) -ororo(1)
    -ororor(1) -rl(1) -rlr(1) -rodor(1) -roin(1) -roino(1) -roir(1)
    -roirol(1) -roldo(1) -rolo(1) -rolor(1) -rom(1) -ror(1) -roro(1)
    -roroin(1) -rorol(1) -roror(1)

    cat Note-006/.prefs-all.frq \
      | sort +0 -1nr \
      | gawk '($1 >= 2) {print $2}' \
      | revbytes | sort | revbytes \
      > Note-006/.prefs-top.dic

    cat Note-006/.suffs-all.frq \
      | sort +0 -1nr \
      | gawk '($1 >= 5) {print $2}' \
      > Note-006/.suffs-top.dic

    dicio-wc Note-006/.prefs-top.dic Note-006/.suffs-top.dic

     lines   words     bytes file
    ------ ------- --------- ------------
        98      98       600 Note-006/.prefs-top.dic
        22      22        95 Note-006/.suffs-top.dic

    cat Note-006/.dic-factored \
      | count-diword-freqs \
          -v rows=Note-006/.prefs-top.dic \
          -v cols=Note-006/.suffs-top.dic \
          -v digits=3 \
      > Note-006/pref-suff-dic-table.txt

    cat Note-006/.dic-factored \
      | fgrep -w -f Note-006/.prefs-top.dic \
      | fgrep -w -f Note-006/.suffs-top.dic \
      | wc

These prefixes and suffixes account for 534 out of the original 763
words (70%). Not terribly impressive, but still...

Perhaps we are being too greedy, and including in the suffix things
that belong in the prefix.
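A possible caveat on the coverage counts: as far as the documented behaviour of grep -w goes, each fgrep above accepts a line when any listed pattern matches as a whole word anywhere on it, so the bare "-" entries (empty prefix, empty suffix) may also match in the other field. The sketch below checks each field against its own list instead. It is an assumption-laden illustration, not the procedure used for the figures above: the function name coverage is invented here, the list files are assumed to hold one entry per line with no stray blanks, and the demo data is made up.

```shell
#!/bin/sh
# coverage PREFS SUFFS: read factored lines ("prefix- -suffix") on stdin and
# report how many have field 1 listed in file PREFS and field 2 in file SUFFS.
# (Hypothetical helper; the file names passed in are up to the caller.)
coverage() {
  awk -v pf="$1" -v sf="$2" '
    BEGIN {
      while ((getline line < pf) > 0) pref[line] = 1
      while ((getline line < sf) > 0) suff[line] = 1
    }
    { total++; if (($1 in pref) && ($2 in suff)) hit++ }
    END {
      printf "%d of %d (%.1f%%)\n", hit, total, (total ? 100 * hit / total : 0)
    }
  '
}

# Toy demo with made-up lists; "rpch- -" has an unlisted prefix and an empty
# suffix, which is the kind of line a whole-line "-" match could let through.
tmp=$(mktemp -d)
printf '%s\n' 'ok-' 'che-' '-' > "$tmp/prefs"
printf '%s\n' '-oin' '-do' '-' > "$tmp/suffs"
printf '%s\n' 'ok- -oin' 'che- -do' 'rpch- -' 'ok- -lron' \
  | coverage "$tmp/prefs" "$tmp/suffs"    # prints "2 of 4 (50.0%)"
rm -rf "$tmp"
```

With the real files this would be run as coverage Note-006/.prefs-top.dic Note-006/.suffs-top.dic on Note-006/.factored or Note-006/.dic-factored; whether the result differs from the fgrep counts depends on how many words have an empty prefix or suffix paired with a rare partner.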