Hacking at the Voynich manuscript - Side notes
015 Colorized version of the whole text

# Last edited on 2000-07-11 10:46:43 by stolfi

Looking at the first few pages of the herbal section, I observed
that in most paragraphs there is at least one word that occurs twice
in that paragraph, but not in most other paragraphs. It seems
worthwhile to produce a colorized edition of the text, with those
words highlighted.

The first version of this site was built in 1998 with a best-pick
from the then-available versions. Let's rebuild it from the
majority version instead.

Let's collect the VMs text and split it into one file per page:

    ln -s ../045/only-m.evt majority.evt

    mkdir pages-evt
    /bin/rm -f pages-evt/*.evt nonempty.fnums
    cat majority.evt \
      | split-pages \
          -v trcode='A' \
          -v outdir=pages-evt \
      > nonempty.fnums

As a control, let's do the same for a piece of English text:

    cat engl-wow.evt \
      | split-pages \
          -v trcode='W' \
          -v outdir=pages-evt \
      > .wow.fnums
    cat .wow.fnums >> nonempty.fnums

Next, let's make a table that maps a page's F-number to its section.
Be sure to split the herbal section by language:

    cat L16+H-eva/INDEX \
      | gawk -v FS=':' \
          ' //{ gsub(/[.].*$/, "", $2); \
                if($3 == "her") {$3 = ($3 "-" $4);} \
                if($3 ~ /[?]/) {$3 = "unk";} \
                print $2,$3; \
              } ' \
      | sed -e 's/her-A/hea/' -e 's/her-B/heb/' \
      | uniq \
      > fnum-to-section.tbl

    cat .wow.fnums \
      | gawk '/./{print $1, "wow"; }' \
      >> fnum-to-section.tbl

Consistency checks (no F-number should be listed twice; then count
the pages in each section):

    cat fnum-to-section.tbl \
      | gawk '//{print $1;}' \
      | sort \
      | uniq -d

    cat fnum-to-section.tbl \
      | gawk '//{print $2;}' \
      | sort \
      | uniq -c

Let's make a list of pages for each section:

    set secs = ( unk pha str hea heb bio cos zod wow )
    set secsx = `echo ${secs} | tr ' ' ','`
    echo ${secsx}

    mkdir sections-fnums
    /bin/rm -f sections-fnums/???.fnums .foo
    foreach sec ( ${secs} )
      echo "${sec}"
      cat fnum-to-section.tbl \
        | grep -w ${sec} \
        | gawk '/./{print $1;}' \
        | fgrep -x -f nonempty.fnums \
        > sections-fnums/${sec}.fnums
    end

Checking if we got all pages:

    cat nonempty.fnums \
      | sort \
      > .foo
    cat sections-fnums/{${secsx}}.fnums \
      | sort \
      > .bar
    bool 1-2 .foo .bar
    bool 2-1 .foo .bar

Let's also make a table that maps a page's F-number to its K-number
(position in section):

    /bin/rm -f fnum-to-knum.tbl
    foreach sec ( ${secs} )
      echo "${sec}"
      cat sections-fnums/${sec}.fnums \
        | gawk '/./{printf "%s %02d\n", $1, k; k++; }' \
        >> fnum-to-knum.tbl
    end

    cat fnum-to-knum.tbl \
      | gawk '//{print $1;}' \
      | sort \
      | uniq -d

Let's make a table of word pattern occurrences per page:

    set equivopts = ( \
      -v erase_ligatures=0 \
      -v erase_plumes=0 \
      -v ignore_gallows_eyes=1 \
      -v join_ei=1 \
      -v equate_aoy=1 \
      -v collapse_ii=0 \
      -v equate_eights=0 \
      -v equate_pt=1 \
      -v erase_q=1 \
      -v erase_word_spaces=1 \
    )

    mkdir pages-pwct
    /bin/rm -f pages-pwct/*.pwct .foo
    foreach fnum ( `cat nonempty.fnums` )
      echo "${fnum}"
      cat pages-evt/${fnum}.evt \
        | basify-weirdos \
        | enum-text-phrases -f eva2erg.gawk \
            -v maxLength=0 \
        | egrep -v '[*?]' \
        | add-match-key -f eva2erg.gawk \
            -v inField=5 \
            -v outField=6 \
            ${equivopts} \
        | gawk '/./{print ($5 ":" $6);}' \
        | sort | uniq -c | expand \
        | sort -b +0 -1nr \
        > pages-pwct/${fnum}.pwct
    end

STOPPED HERE

Combine them by section (appending, so that each section file
accumulates all of its pages):

    mkdir sections-pwfct
    /bin/rm -f sections-pwfct/???.pwfct .foo
    @ knum = 0;
    foreach sec ( ${secs} )
      echo "${sec}"
      foreach fnum ( `cat sections-fnums/${sec}.fnums` )
        cat pages-pwct/${fnum}.pwct \
          | gawk \
              -v fnum=${fnum} -v knum=${knum} \
              ' /./{ \
                  gsub(/[:]/, " ", $2); \
                  print $1, fnum, sprintf("%02d",knum), $2, $3; \
                } \
              ' \
          | sort -b +2 -3n +0 -1nr \
          >> sections-pwfct/${sec}.pwfct
        @ knum = ${knum} + 1
      end
    end

Now compute the tables of pattern lumpiness (per section) and the
page strangeness (per page and section):

    mkdir sections-pwfstr
    /bin/rm -f sections-pwfstr/*.pwfstr .foo
    foreach sec ( ${secs} )
      echo "=== section ${sec} ==="
      cat sections-pwfct/${sec}.pwfct \
        | compute-strangeness \
        | sort -b +4 -5gr +0 -1 \
        > sections-pwfstr/${sec}.pwfstr
    end

Merge them by section:

    mkdir sections-strf
    /bin/rm -f sections-strf/???.strf .foo
    foreach sec ( ${secs} )
      echo "=== section ${sec} ==="
      /bin/rm -f .tmp
      foreach fnum ( `cat sections-fnums/${sec}.fnums` )
        echo "${fnum}"
        cat pages-str/${fnum}.str \
          | gawk -v fnum=${fnum} '/./{ print $1,$2,$3,$4,$5,fnum,$6,$7; }' \
          | map-field \
              -v table=fnum-to-knum.tbl \
              -v inField=6 \
              -v outField=7 \
          >> sections-strf/${sec}.strf
      end
    end

Now let's choose interesting words to color for each section, and
assign them hues based on mean page position:

    mkdir sections-select
    set minStrangeness = "3.5"
    /bin/rm -f sections-select/???.whue .foo
    foreach sec ( ${secs} )
      echo "=== section ${sec} ==="
      cat sections-strf/${sec}.strf \
        | choose-peculiar-words \
            -v maxPatterns=25 \
            -v minStrangeness=${minStrangeness} \
            -v maxDensity=0.25 \
        > sections-select/${sec}.whue
    end
    dicio-wc sections-select/{${secsx}}.whue

     lines   words     bytes file
    ------ ------- --------- ------------
        17      34       204 sections-select/unk.whue
        15      30       192 sections-select/pha.whue
        25      50       329 sections-select/str.whue
        25      50       302 sections-select/hea.whue
        25      50       313 sections-select/heb.whue
        20      40       246 sections-select/bio.whue
        24      48       297 sections-select/cos.whue
         2       4        23 sections-select/zod.whue
        11      22       120 sections-select/wow.whue

List of words to color in each section:

    foreach sec ( ${secs} )
      echo "=== section ${sec} ==="
      cat sections-select/${sec}.whue \
        | gawk '/./{print $2;}' \
        | fmt -w 60
    end

    === section unk ===
    chotcho otchdo cthor doin otedo otoiin oiin cho o chol
    ctheo e t otor otom chor ctho
    === section pha ===
    dol otol oteodo chol oteor cthodo doiin sheo cheodo dom
    sho cheoctho oteeor cheom doir
    === section str ===
    otoin otedo otol oteedo oteeo lteedo ltoin sheo shedo
    otolo ltchdo otoir ltchedo shctho otolor dol otoiin
    otchedo doir ltcheo otcho otor ches cheeeo toir
    === section hea ===
    shol do chom oto doiin dom don tchol otcho cheo otol s
    otchol tcheo ol teo shom otoiin otod dsho toiin ocheor
    chodo tcheeo dol
    === section heb ===
    do dol otedo shee s cheodo tol oteodo chdo otsho shdo
    otedol otcho otoiin oldor shol otor otchdo todo otoldo
    oteeo otol or chectho tchedo
    === section bio ===
    ol oltoin otoiin oteedo otedo tedo oloin tol otoin oin
    shectho lchedo do dor cheor cheol otol soin r shdo
    === section cos ===
    t r o otoiin shes oteos otedo otor ol otcheo oteod otodo
    sor otchedo odoin or otom chotol oteo oteodo oteeo cheeo
    otooiin oteeos
    === section zod ===
    oteodo s
    === section wow ===
    its is we night sun ot men os i little but

Let's make a list of the word patterns that occur once in each section:

    mkdir sections-unique
    /bin/rm -f sections-unique/???.dic .foo
    foreach sec ( ${secs} )
      echo "=== section ${sec} ==="
      cat sections-pwfct/${sec}.pwct \
        | gawk '/./{ gsub(/[:]/, " ", $0); print $1, $3; }' \
        | combine-counts \
        | gawk '($1 == 1){print $2}' \
        | sort \
        > sections-unique/${sec}.dic
    end
    dicio-wc sections-unique/{${secsx}}.dic

     lines   words     bytes file
    ------ ------- --------- ------------
       374     374      2691 sections-unique/unk.dic
       476     476      3472 sections-unique/pha.dic
      1216    1216      9318 sections-unique/str.dic
       863     863      6467 sections-unique/hea.dic
       525     525      3798 sections-unique/heb.dic
       523     523      3789 sections-unique/bio.dic
       622     622      4653 sections-unique/cos.dic
       311     311      2250 sections-unique/zod.dic
       591     591      4586 sections-unique/wow.dic

Now let's create a color table pages-clr/PAGE.clr for each page.

    mkdir pages-clr

    set bgColor = "000000"
    set textColor = "ffffff"
    set linkColor = "00ff99"
    set vlinkColor = "009900"
    set alinkColor = "eeff99"
    set uniqueColor = "ffffff"
    set defaultColor = "bbbbbb"

We assign white to patterns that occur once in the whole book.
We assign a different hue to each pattern, on a per-section basis.
The intensity is assigned per page, based on the strangeness
of the word in that page.

    /bin/rm -f pages-clr/*.clr pages-clr/*.spw .foo
    foreach sec ( ${secs} )
      echo "=== ${sec} ==="
      cat sections-strf/${sec}.strf \
        | sort -b +5 -6 +4 -5gr \
        | create-color-tables \
            -v colorPatterns=sections-select/${sec}.whue \
            -v outDir=pages-clr \
            -v uniqueColor=${uniqueColor} \
            -v minStrangeness=${minStrangeness} \
            -v minLum=0.6 \
            -v maxLum=0.8 \
        > /dev/null
    end

Now let's colorize the pages:

    mkdir pages-html
    /bin/rm -f pages-html/*.html .foo
    foreach sec ( ${secs} )
      echo "=== section ${sec} ==="
      colorize-pages \
        -v section=${sec} \
        -v indent=2 \
        \
        -v bgColor="${bgColor}" \
        -v textColor="${textColor}" \
        -v linkColor="${linkColor}" \
        -v vlinkColor="${vlinkColor}" \
        -v alinkColor="${alinkColor}" \
        -v uniqueColor="${uniqueColor}" \
        -v defaultColor="${defaultColor}" \
        `cat sections-fnums/${sec}.fnums`
    end

    make-page-index \
      -v indent=2 \
      \
      -v bgColor="${bgColor}" \
      -v textColor="${textColor}" \
      -v linkColor="${linkColor}" \
      -v vlinkColor="${vlinkColor}" \
      -v alinkColor="${alinkColor}" \
      -v uniqueColor="${uniqueColor}" \
      -v defaultColor="${defaultColor}" \
      ${equivopts} \
      ${secs}
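
As a rough, self-contained illustration of the observation that
motivated these notes (words that repeat within one page but are
absent elsewhere), here is a minimal sketch in plain sh and awk
rather than csh. It is not one of the scripts used above: the
function name repeated_local_words and the two-column "PAGE WORD"
input format are assumptions made for the example, and the test
"at least twice on one page, on no other page" is only a crude
stand-in for the strangeness score, whose definition is not given
in these notes.

```shell
#!/bin/sh
# Hypothetical helper, not part of the tool set used in these notes.
# Input:  one "PAGE WORD" pair per line.
# Output: "COUNT PAGE WORD" for every word that occurs at least twice
#         on some page and on no other page, sorted by page and word.
repeated_local_words() {
  awk '
    /./ {
      key = $1 " " $2
      # First time we see this (page, word) pair: count one more page
      # for the word and remember the two parts of the key.
      if (!(key in ct)) { npages[$2]++; page[key] = $1; word[key] = $2 }
      ct[key]++
    }
    END {
      for (key in ct) {
        if (ct[key] >= 2 && npages[word[key]] == 1) {
          print ct[key], page[key], word[key]
        }
      }
    }
  ' | sort -k2,2 -k3,3
}

# Toy input: "daiin" appears on both pages, so it is not reported.
printf '%s\n' \
  'f1r daiin' 'f1r chol' 'f1r chol' \
  'f1v daiin' 'f1v shol' 'f1v shol' 'f1v daiin' \
  | repeated_local_words
# prints:
#   2 f1r chol
#   2 f1v shol
```

A real run would feed it one line per token, extracted from the
per-page files; the threshold of two occurrences is arbitrary and
would probably need tuning on pages of very different lengths.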