Hacking at the Voynich manuscript - Side notes
009 A prefix-midfix-suffix factorization of the A and B herbal pages in EVA encoding

Last edited on 1997-11-24 16:54:07 by stolfi

This is partly a remake of work from Notebook-1.txt, originally done
around 97-07-05.

Summary of previous relevant tasks:

  I obtained Landini's interlinear transcription of the VMs, version 1.6
  (landini-interln16.evt) from
  http://sun1.bham.ac.uk/G.Landini/evmt/intrln16.zip. [Notebook-1.txt]

  Around 97-11-01 I split landini-interln16.evt into many files, with
  one text unit per page. [Notebook-12.txt]

  On 97-11-05 I mapped those files from FSG and other ad-hoc alphabets
  to EVA. [Notebook-12.txt] The files are L16-eva/fNNxx.YY, and a
  machine-readable description of their contents and logical order is in
  L16-eva/INDEX.

  Then I went back and started redoing some of the previous tasks using
  the new encoding.

  I extracted the Currier (;C>) and Friedman-I (;F>) versions of the
  "bio" section, in EVA alphabet, as files bio-{c,f}-eva.evt. I also
  built the associated text files and word lists
  bio-{c,f}-eva{-{gut,fun,bad},{wds,dic,frq},.txt}. [Note-001.txt]

  I then constructed a factorization of the words in the bio section
  (Friedman version) into prefix-midfix-suffix, where prefix and suffix
  are maximal strings of letters in [aoyidlsrmnq] (not counting the "s"
  in "sh"). This factorization revealed about 12 significant prefixes,
  20 significant suffixes, and 40 or more significant midfixes. The
  prefixes and suffixes seemed to be composed of a small number of
  letter groups, such as { dy al ol ar or ain oin aiin oiin ... }. Each
  midfix was a single group of letters from the set
  { k t p f ckh cth cph cfh ch sh e }, where the "e"s appeared to be
  modifiers of the preceding letter. [Note-008.txt]

97-11-12 stolfi
===============

I posted the Note-008 factorization to the Voynich list, and Rene asked
whether the same pattern held in the Herbal sections. So let's try it.
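
For reference, the Note-008 rule recalled in the summary above (prefix and suffix are maximal runs of letters in [aoyidlsrmnq], with the "s" of "sh" excluded) can be sketched in Python. This is only an illustration of that verbal definition, not the sed script used in these notes; the function name `factor` and the placeholder letter "X" for "sh" are my own choices, and the sample words are picked only for illustration.

```python
import re

# Letters allowed in prefixes and suffixes, per the Note-008 summary.
AFFIX = "aoyidlsrmnq"

# Greedy prefix, lazy midfix, greedy suffix: with a lazy midfix the
# prefix and suffix both come out as maximal runs of affix letters.
_PATTERN = re.compile(rf"^([{AFFIX}]*)(.*?)([{AFFIX}]*)$")


def factor(word):
    """Split an EVA word into (prefix, midfix, suffix).

    The digraph "sh" is masked as "X" first, so that its "s" is not
    counted as an affix letter (assumption: "X" never occurs in the input).
    A word made only of affix letters ends up entirely in the prefix slot.
    """
    masked = word.replace("sh", "X")
    parts = _PATTERN.match(masked).groups()
    return tuple(p.replace("X", "sh") for p in parts)


if __name__ == "__main__":
    for w in ("qokeedy", "shol", "chedy", "daiin"):
        print(w, factor(w))
    # qokeedy ('qo', 'kee', 'dy')
    # shol ('', 'sh', 'ol')
    # chedy ('', 'che', 'dy')
    # daiin ('daiin', '', '')
```

Words with an empty midfix, such as the last one, correspond to what the scripts below call "unifs".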

    foreach lang ( a b )
      set ufile = "he${lang}.units"
      set mfile = "he${lang}-m-eva.evt"
      set ffile = "he${lang}-f-eva.evt"
      set cfile = "he${lang}-c-eva.evt"
      echo '=== '${ufile}
      cat L16-eva/INDEX \
        | egrep -i -e ':herbal:'"${lang}"':.*:parags:' \
        | sed -e 's/:.*$//g' \
        > ${ufile}
      cat `cat ${ufile} | sed -e 's/^/L16-eva\//g'` \
        | egrep -v '^#' \
        > ${mfile}
      cat ${mfile} \
        | egrep '^<[^>]*;C>' \
        > ${cfile}
      cat ${mfile} \
        | egrep '^<[^>]*;F>' \
        > ${ffile}
      dicio-wc he${lang}-{m,f,c}-eva.evt
    end

     lines   words     bytes file
    ------ ------- --------- ------------
      2477    4957    143185 hea-m-eva.evt
      1216    2432     70439 hea-f-eva.evt
      1119    2241     63768 hea-c-eva.evt

     lines   words     bytes file
    ------ ------- --------- ------------
       727    1463     53929 heb-m-eva.evt
       364     735     26820 heb-f-eva.evt
       291     584     21307 heb-c-eva.evt

    foreach lang ( a b )
      foreach guy ( f c )
        extract-words-from-interlin \
          -chars "aoeilmnrchtpkfsqjdvxyg" \
          he${lang}-${guy}-eva.evt \
          he${lang}-${guy}-eva \
          > he${lang}-${guy}.stats
        dicio-wc he${lang}-${guy}-eva{,-gut,-fun,-bad}.*
      end
    end

     lines   words     bytes file
    ------ ------- --------- ------------
      2497    2497     17089 hea-f-eva.dic
      1216    2432     70439 hea-f-eva.evt
      2497    4994     37065 hea-f-eva.frq
      1216    8058     46503 hea-f-eva.txt
      9250    9250     48887 hea-f-eva.wds
      2450    2450     16826 hea-f-eva-gut.dic
      2450    4900     36426 hea-f-eva-gut.frq
      7812    7812     45838 hea-f-eva-gut.wds
         3       3         6 hea-f-eva-fun.dic
         3       6        30 hea-f-eva-fun.frq
      1387    1387      2774 hea-f-eva-fun.wds
        44      44       257 hea-f-eva-bad.dic
        44      88       609 hea-f-eva-bad.frq
        51      51       275 hea-f-eva-bad.wds

     lines   words     bytes file
    ------ ------- --------- ------------
      2270    2270     15622 hea-c-eva.dic
      1119    2241     63768 hea-c-eva.evt
      2270    4540     33782 hea-c-eva.frq
      1114    7285     41710 hea-c-eva.txt
      8387    8387     43914 hea-c-eva.wds
      2173    2173     15073 hea-c-eva-gut.dic
      2173    4346     32457 hea-c-eva-gut.frq
      6990    6990     40727 hea-c-eva-gut.wds
         3       3         6 hea-c-eva-fun.dic
         3       6        30 hea-c-eva-fun.frq
      1278    1278      2556 hea-c-eva-fun.wds
        94      94       543 hea-c-eva-bad.dic
        94     188      1295 hea-c-eva-bad.frq
       119     119       631 hea-c-eva-bad.wds

     lines   words     bytes file
    ------ ------- --------- ------------
      1330    1330      8857 heb-f-eva.dic
       364     735     26820 heb-f-eva.evt
      1330    2660     19497 heb-f-eva.frq
       362    3304     19565 heb-f-eva.txt
      3706    3706     20369 heb-f-eva.wds
      1310    1310      8745 heb-f-eva-gut.dic
      1310    2620     19225 heb-f-eva-gut.frq
      3223    3223     19326 heb-f-eva-gut.wds
         3       3         6 heb-f-eva-fun.dic
         3       6        30 heb-f-eva-fun.frq
       465     465       930 heb-f-eva-fun.wds
        17      17       106 heb-f-eva-bad.dic
        17      34       242 heb-f-eva-bad.frq
        18      18       113 heb-f-eva-bad.wds

     lines   words     bytes file
    ------ ------- --------- ------------
      1070    1070      7226 heb-c-eva.dic
       291     584     21307 heb-c-eva.evt
      1070    2140     15786 heb-c-eva.frq
       288    2631     15548 heb-c-eva.txt
      2978    2978     16242 heb-c-eva.wds
      1067    1067      7220 heb-c-eva-gut.dic
      1067    2134     15756 heb-c-eva-gut.frq
      2587    2587     15460 heb-c-eva-gut.wds
         3       3         6 heb-c-eva-fun.dic
         3       6        30 heb-c-eva-fun.frq
       391     391       782 heb-c-eva-fun.wds
         0       0         0 heb-c-eva-bad.dic
         0       0         0 heb-c-eva-bad.frq
         0       0         0 heb-c-eva-bad.wds

Before we go on, it may be interesting to compare the vocabularies
of the two "languages":

    foreach guy ( f c )
      bool 1.2 he{a,b}-${guy}-eva-gut.dic > ${guy}-common.dic
      bool 1-2 he{a,b}-${guy}-eva-gut.dic > ${guy}-a-only.dic
      bool 2-1 he{a,b}-${guy}-eva-gut.dic > ${guy}-b-only.dic
      dicio-wc ${guy}-{common,a-only,b-only}.dic
    end

     lines   words     bytes file
    ------ ------- --------- ------------
       456     456      2619 f-common.dic
      1994    1994     14207 f-a-only.dic
       854     854      6126 f-b-only.dic

     lines   words     bytes file
    ------ ------- --------- ------------
       367     367      2097 c-common.dic
      1806    1806     12976 c-a-only.dic
       700     700      5123 c-b-only.dic

Ok, now let's do the factoring:

    foreach lang ( a b )
      foreach guy ( f c )
        cat he${lang}-${guy}-eva-gut.wds \
          | sed \
              -e 's/sh/X/g' \
              -e 's/$/}/' \
              -e 's/^/{/' \
              -e ':a' \
              -e 's/{\(qo\)/.\1{/' \
              -e 'ta' \
              -e 's/{\([aoydlrs]\)/.\1{/' \
              -e 'ta' \
              -e ':x' \
              -e 's/\([aoydsm]\|i*[lnr]\)}/}\1./' \
              -e 'tx' \
              -e 's/\([oaydirslmn][oaydirslmn]*\)}/}\1:/' \
              -e 's/{\([oaydirslmn][oaydirslmn]*\)/:\1{/' \
              -e 's/X/sh/g' \
              -e 's/{}/\./' \
              -e 's/\.//g' \
              -e 's/{/- -/' \
              -e 's/}/- -/' \
          > he${lang}-${guy}.factored

        cat he${lang}-${guy}.factored \
          | grep ':' \
          > he${lang}-${guy}.funny-prefs-suffs

        cat he${lang}-${guy}.factored \
          | grep -v -e '- -' \
          | sort | uniq -c | expand | sort +0 -1nr \
          > he${lang}-${guy}-unifs-all.frq

        cat he${lang}-${guy}.factored \
          | grep -e '- -' \
          | gawk '/./ {print $1}' \
          | sort | uniq -c | expand | sort +0 -1nr \
          > he${lang}-${guy}-prefs-all.frq

        cat he${lang}-${guy}.factored \
          | grep -e '- -' \
          | gawk '/./ {print $2}' \
          | sort | uniq -c | expand | sort +0 -1nr \
          > he${lang}-${guy}-midfs-all.frq

        cat he${lang}-${guy}.factored \
          | grep -e '- -' \
          | gawk '/./ {print $3}' \
          | sort | uniq -c | expand | sort +0 -1nr \
          > he${lang}-${guy}-suffs-all.frq

        cat he${lang}-${guy}.factored \
          | grep -e '- -' \
          | gawk '/./ {print ($2 $3)}' \
          | sed -e 's/--//g' \
          | sort | uniq -c | expand | sort +0 -1nr \
          > he${lang}-${guy}-tails-all.frq

        cat he${lang}-${guy}.factored \
          | gawk '/./ {print ($1 $2 $3)}' \
          | sed -e 's/--//g' \
          | sort | uniq -c | expand | sort +0 -1nr \
          > he${lang}-${guy}-words-all.frq
      end
    end

Check length of longest element:

    foreach elem ( prefs midfs suffs unifs tails words )
      foreach lang ( a b )
        foreach guy ( f c )
          /usr/ucb/echo -n "herbal ${lang} vers ${guy}: max ${elem} = "
          cat he${lang}-${guy}-${elem}-all.frq \
            | gawk '/./ {m=length($2); if(m>mx)mx=m}; END {printf "%2d\n",mx}'
        end
      end
    end

    herbal a vers c: max prefs =  8
    herbal a vers f: max prefs =  8
    herbal b vers c: max prefs =  8
    herbal b vers f: max prefs =  5

    herbal a vers c: max midfs = 15
    herbal a vers f: max midfs = 15
    herbal b vers c: max midfs = 16
    herbal b vers f: max midfs = 13

    herbal a vers c: max suffs = 10
    herbal a vers f: max suffs =  9
    herbal b vers c: max suffs =  9
    herbal b vers f: max suffs =  9

    herbal a vers c: max unifs = 10
    herbal a vers f: max unifs =  8
    herbal b vers c: max unifs =  9
    herbal b vers f: max unifs =  8

    herbal a vers f: max tails = 16
    herbal a vers c: max tails = 16
    herbal b vers f: max tails = 14
    herbal b vers c: max tails = 17

    herbal a vers f: max words = 15
    herbal a vers c: max words = 15
    herbal b vers f: max words = 13
    herbal b vers c: max words = 16

Let's now format the files and reduce the absolute counts to
percentages relative to the total factored and unfactored words:

    foreach guy ( Friedman.f Currier.c )
      foreach elem ( pref midf suff unif tail word )
        foreach lang ( A.a B.b )
          set file = "he${lang:e}-${guy:e}-${elem}s-all"
          echo "${file}.frq -> ${file}.fmt"
          cat ${file}.frq \
            | compute-freqs \
            | gawk '\
                BEGIN {\
                  printf "by '"${guy:r}"'\nlanguage '"${lang:r}"'\n"; \
                  printf "freq pc '"${elem}"'ix\n---- -- ------------------\n";} \
                /./ {printf "%4d %2d %s\n",$1,int($2*100+0.5),$3; t+=$1;} \
                END {printf "---- -- ------------------\n%4d 99 TOTAL\n",t;} \
              ' \
            > ${file}.fmt
        end
      end
    end

Now let's print them side-by-side:

    foreach elem ( pref midf suff unif tail word )
      set tfiles = ( )
      foreach guy ( f c )
        foreach lang ( a b )
          set file = "he${lang}-${guy}-${elem}s-all"
          set tfiles = ( ${tfiles} ${file}.fmt )
        end
      end
      pr -m -t -i' '1 -w 108 ${tfiles} \
        | expand \
        > herbal-${elem}-cmp.txt
    end

Inspired by Gabriel Landini's paper, let's prepare a graph of
A-freq × B-freq for each segment:

    foreach guy ( Friedman.f )
      foreach elem ( pref midf suff unif tail )
        set pfile = "herbal-${guy:e}-${elem}s-all"
        set afile = "hea-${guy:e}-${elem}s-all"
        set bfile = "heb-${guy:e}-${elem}s-all"
        echo "${afile}.frq, ${bfile}.frq -> ${pfile}.plt"
        cat ${afile}.frq | sort -b +1 -2 > .a
        cat ${bfile}.frq | sort -b +1 -2 > .b
        /n/gnu/bin/join \
          -a 1 -a 2 -e 0.5 \
          -j1 2 -j2 2 \
          -o1.1,2.1,0 \
          .a .b \
          > ${pfile}.plt
        plot-lang-diffs ${guy:r} ${elem} ${pfile}.plt
      end
    end
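
The join step in the last loop is a full outer join of the A and B frequency tables on the element string: an element missing from one language gets the count 0.5 (the -e option), and each output record is A-count, B-count, element (the -o option). A minimal Python sketch of that step follows. The function name `join_counts` and the toy counts are hypothetical, and the reason for choosing 0.5 is my guess: presumably it lets one-sided elements still be placed on the plot, for example on logarithmic axes.

```python
def join_counts(a_counts, b_counts, missing=0.5):
    """Full outer join of two {element: count} tables.

    Returns a list of (a_count, b_count, element) tuples sorted by
    element, mirroring the field order of `join -o1.1,2.1,0`.
    Elements absent from one table get the `missing` value, as with
    `join -a 1 -a 2 -e 0.5`.
    """
    elements = sorted(set(a_counts) | set(b_counts))
    return [
        (a_counts.get(e, missing), b_counts.get(e, missing), e)
        for e in elements
    ]


if __name__ == "__main__":
    # Toy counts, invented for illustration only.
    a = {"dy": 10, "ol": 3}
    b = {"dy": 4, "ar": 2}
    for row in join_counts(a, b):
        print(*row)
    # 0.5 2 ar
    # 10 4 dy
    # 3 0.5 ol
```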