Hacking at the Voynich manuscript - Side notes
021 Plotting word frequencies per page

Last edited on 1999-07-17 11:51:26 by stolfi

[ First version done on 1998-04-28.
  Redone 1998-06-20 with fresher data.
  Redone 1998-07-02 with different dictionaries and axes.
  Redone 1999-01-30 with 1.6e6 majority transcription (Notes/045).
  Analysis part split off to Notes/019 on 1999-01-31. ]

1998-07-02 stolfi
=================

The goal of this note is to compare the word distributions among
the various sections of the Voynich manuscript.

I. EXTRACTING AND COUNTING WORDS

The source file will be the majority version (Notes/045), with
weirdos mapped to "*" and other basic EVA chars, and chopped into
pages and subsections:

    ln -s ../045/subsecs-m text-subsecs
    ln -s ../045/pages-m text-pages

We will use the word and word class frequencies per page and per
section derived from this data in Notes/019.

    mkdir -p RAW EQV
    ( cd RAW && ln -s ../../019/RAW/wfreqs )
    ( cd EQV && ln -s ../../019/EQV/wfreqs )
    ln -s ../019/fnum-to-subsec.tbl

Creating a combined file of the source text for archiving:

    ( cd text-pages && cat `cat all.names | sed -e 's/$/.evt/'` ) \
      > all.evt

II. CONCEPT OF PAGE SCATTER-PLOTS

Let's see if we can characterize the Voynichese sub-languages by
mapping the pages into some space of moderate dimension, and
looking for geometric clustering among that cloud of points.

We start by fixing a list D of key words. Then we can look at each
language sample S as a point in the unit probability simplex,
whose ith coordinate S_i is the estimated frequency freq(w_i,S) of
the ith key word w_i in the sample's language.

To properly handle low-frequency words, I guess we should use a
Bayesian estimate of freq(w_i,S), rather than merely
count(w_i,S)/size(S). But for that we need to postulate an
'a priori' probability for words that do not occur in that sample,
or even in any sample. We could use some exponential distribution
in length(w), but let's leave that for later.

Let's instead use an a priori distribution for the frequencies
that is uniform in the unit simplex, for all words w in some
selected dictionary D. The Bayesian estimate of freq(w,S) is then

    freq(w,S) = (count(w,S) + 1)/(count(D,S) + n)

where count(D,S) is the total number of occurrences of D-words in
the sample S, and n is the number of distinct words in D. Note
that these estimated frequencies still add to 1.

As a dictionary D, let's use the top 50 words of Rene's list:

    set sizeD = 50
    foreach ep ( cat.RAW word-ct-to-class-ct.EQV )
      set etag = ${ep:e}; set ecmd = ${ep:r}
      mkdir -p ${etag}/plots/rene
      cat Rene-words.frq \
        | ${ecmd} \
        | combine-counts \
        | sort -b +0 -1nr \
        | head -${sizeD} \
        | gawk '//{print $2;}' \
        | sort | uniq \
        > ${etag}/plots/rene/keys.dic
      list -filter 'fmt -w60' ${etag}/plots/rene/keys.dic
    end

    dicio-wc {RAW,EQV}/plots/rene/keys.dic

    --- RAW/plots/rene/keys.dic ------------------------
    aiin al ar chckhy chdy chedy cheey cheol chey chol
    chor chy daiin dain dal dar dy lchedy okaiin okain
    okal okar okedy okeedy okeey ol or otaiin otal otar
    otedy oteedy oteey oty qokaiin qokain qokal qokar
    qokedy qokeedy qokeey qokey qoky qol s saiin shedy
    sheey shey shol
    ----------------------------------------------------

    --- EQV/plots/rene/keys.dic ------------------------
    chctho~ chdo~ chectho~ chedo~ cheedo~ cheeo~ cheodo~
    cheol~ cheor~ cheo~ cheto~ chodo~ chol~ chor~ choto~
    cho~ ctho~ doin~ doir~ dol~ dor~ do~ lchedo~ odoin~
    oin~ ol~ oroin~ or~ otchdo~ otchedo~ otcheo~ otchol~
    otcho~ otedo~ oteedo~ oteeo~ oteodo~ oteol~ oteo~
    otod~ otoin~ otol~ otor~ oto~ o~ soin~ sor~ s~ toin~
    tor~
    ----------------------------------------------------

Then we compute the frequencies of these keywords in each page and
section:

    foreach dic ( rene )
      foreach etag ( RAW EQV )
        echo "${dic}" "${etag}"
        foreach utype ( pages subsecs )
          set frdir = "${etag}/wfreqs/${utype}"
          set ptdir = "${etag}/plots/${dic}/${utype}"
          echo "${frdir}" "${ptdir}"
          /bin/rm -rf ${ptdir}
          mkdir -p ${ptdir}
          cp -p ${frdir}/all.names ${ptdir}
          foreach fnum ( tot `cat ${frdir}/all.names` )
            printf "%30s/%-7s " "${ptdir}" "${fnum}:"
            cat ${frdir}/${fnum}.frq \
              | gawk '/./{print $1, $3;}' \
              | est-dic-probs -v dic="${etag}/plots/${dic}/keys.dic" \
              > ${ptdir}/${fnum}.pos
          end
        end
      end
    end

For plotting we will use a special grouping of subsections into
"plot sections" defined by the following table:

    set plotgrp = ( \
      pharma:pha.1,pha.2 \
      her-a-1:hea.1 \
      her-a-2:hea.2 \
      cosmo-1:cos.2 \
      cosmo-2:cos.3 \
      zodiac:zod.1 \
      stars:str.2 \
      her-b-1:heb.1 \
      her-b-2:heb.2 \
      bio:bio.1 \
      f58rv:str.1 \
      f1r:unk.1 \
      f57v:cos.1 \
      f49v:unk.2 \
      f65rv:unk.3 \
      f66r:unk.4 \
      f85r1:unk.5 \
      f86v6:unk.6 \
      f86v5:unk.7 \
      f116v:unk.8 \
      xxx:xxx.0,xxx.1,xxx.2,xxx.3,xxx.4,xxx.5 \
    )

    echo $plotgrp \
      | tr ' :,' '\012 ' \
      | gawk '/./{for(i=2;i<=NF;i++){print $(i),$1;}}' \
      | sort +0 -1 \
      > subsec-to-plot.tbl

    set plotsecs = ( `echo $plotgrp | tr ' ' '\012' | sed -e 's/[:].*$//g' | grep -v xxx` )
    echo ${plotsecs}

Remember to edit the "plot-page-data" script to match the plot
sections above.

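The pipeline that builds subsec-to-plot.tbl turns each "plotsec:sub1,sub2" entry into one "subsec plotsec" line per subsection. Here is a minimal sketch of the same expansion, written for a POSIX shell with plain awk rather than the csh used in this note. The function name expand_plotgrp is hypothetical, and the three entries are only a shortened sample of the table above.

```sh
# Expand "plotsec:sub1,sub2" entries into "subsec plotsec" lines,
# sorted by subsection name.
# expand_plotgrp is an illustrative name, not a tool from this note.
expand_plotgrp() {
  printf '%s\n' "$@" \
    | awk -F'[:,]' 'NF >= 2 { for (i = 2; i <= NF; i++) print $i, $1 }' \
    | LC_ALL=C sort -k1,1
}

# Shortened sample of the plot-group table.
expand_plotgrp pharma:pha.1,pha.2 her-a-1:hea.1 f1r:unk.1
```

Each subsection belongs to exactly one plot section, so the first field is a unique key and the result can serve as a lookup table, which is how map-field appears to use it.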
Make a table that maps f-numbers directly to plot sections:

    cat fnum-to-subsec.tbl \
      | map-field \
          -v inField=2 -v outField=3 \
          -v table=subsec-to-plot.tbl \
      | gawk '//{print $1,$3}' \
      > fnum-to-plot.tbl

Format the plot sections table for documentation:

    /bin/rm -f plot-summary.txt
    foreach plsc ( ${plotsecs} )
      echo "${plsc}"
      echo "subsection ${plsc}" \
        >> plot-summary.txt
      cat fnum-to-plot.tbl \
        | gawk -v plsc=${plsc} '($2 == plsc){print $1}' \
        | map-field \
            -v inField=1 -v outField=2 \
            -v table=fnum-to-pnum.tbl \
        | sort +1 -2 \
        | gawk '/./{print $1}' \
        | fmt -w 50 \
        | sed -e 's/^/ /' \
        >> plot-summary.txt
      echo " " \
        >> plot-summary.txt
    end

OK, let's plot the data, using the whole text (tot) and Herbal-A
as references:

    set sys = "tot-hea"
    foreach dic ( rene )
      foreach etag ( RAW EQV )
        set ptdir = "${etag}/plots/${dic}/pages"
        set scdir = "${etag}/plots/${dic}/subsecs"
        set fgdir = "${etag}/plots/${dic}/${sys}"
        /bin/rm -rf ${fgdir}
        mkdir -p ${fgdir}
        cp -p ${ptdir}/all.names ${fgdir}/all.names
        make-3d-scatter-plots \
          ${ptdir} \
          ${fgdir} \
          ${scdir}/{tot,tot,hea.1,heb.1,str.2}.pos
      end
    end

Let's make another set of plots, using Herbal-A and Pharma as
references:

    set sys = "hea-pha"
    foreach dic ( rene )
      foreach etag ( RAW EQV )
        set ptdir = "${etag}/plots/${dic}/pages"
        set scdir = "${etag}/plots/${dic}/subsecs"
        set fgdir = "${etag}/plots/${dic}/${sys}"
        /bin/rm -rf ${fgdir}
        mkdir -p ${fgdir}
        cp -p ${ptdir}/all.names ${fgdir}/all.names
        make-3d-scatter-plots \
          ${ptdir} \
          ${fgdir} \
          ${scdir}/{tot,hea.1,pha.2,heb.1,bio.1}.pos
      end
    end
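
Returning to the estimate of section II, the formula (count(w,S) + 1)/(count(D,S) + n) can be checked on toy data with a few lines of awk. This is only a sketch: est_probs_sketch is a hypothetical stand-in, not the est-dic-probs tool used above. Its input format ("COUNT WORD" lines on stdin, dictionary given as a file with one word per line) and its output format ("WORD PROB") are assumptions, and the word counts are made up.

```sh
# Sketch of the add-one estimate (count(w,S) + 1)/(count(D,S) + n).
# est_probs_sketch is a hypothetical stand-in, NOT est-dic-probs;
# the input and output formats here are assumptions.
# Usage: est_probs_sketch DICFILE < lines of "COUNT WORD"
est_probs_sketch() {
  awk -v dic="$1" '
    BEGIN {
      # Read the dictionary D, one word per line, keeping its order.
      while ((getline w < dic) > 0) {
        if (w != "") { n++; word[n] = w; ct[w] = 0 }
      }
      close(dic)
    }
    # Accumulate counts of dictionary words only; others are ignored.
    ($2 in ct) { ct[$2] += $1; tot += $1 }
    END {
      for (i = 1; i <= n; i++)
        printf "%s %.4f\n", word[i], (ct[word[i]] + 1) / (tot + n)
    }
  '
}

# Toy data (made-up counts): 4-word dictionary; "xyz" is not in D.
dicfile=$(mktemp)
printf '%s\n' daiin ol chedy qokeey > "$dicfile"
printf '%s\n' '5 daiin' '2 ol' '1 chedy' '7 xyz' \
  | est_probs_sketch "$dicfile"
rm -f "$dicfile"
```

Here count(D,S) = 8 and n = 4, so the four estimates are 6/12, 3/12, 2/12 and 1/12. They add to 1, and qokeey gets a nonzero frequency although it never occurs in the sample.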