Hacking at the Voynich manuscript - Side notes 053 Reordering the pages by word and element frequencies Last edited on 1999-07-22 17:21:57 by stolfi INTRODUCTION The VMS has apparently ben bound by someone who did not know the correct order of the pages. For one thing, the biological bifolio f78+f81 has a double-page figure that cannot be seen in its entirey as bound. Also the mixing of languages in the herbal section is highly suspicious. In this note we consider the problem of finding the correct order of the pages by looking at the distribution of words or other linguistic elements. (Rene Zandbergen and Gabriel Landini did the same by looking at the distribution of digraphs; see the Teddington meeting report.) THE BASIC IDEA For each word w, let F[p][w] be its estimated frequency in the language of page p. The vector F[p] is then a representation of page p in a space of large dimension. Ideally the frequency of usage of a word in a text depends on the topic at hand. In a single prose-like text that spans multiple pages (as seems to be the case with the biological section), we expect th same topic to extend across page boundaries; therefore, the frequencies of use of a given word on two adjacent pages should be more similar than the frequencies on two non-adjacent pages. Therefore, we should look for orderings of the VMS pages that make the word distributions change gradually from each page to the next. For a single word, we want a permutation p[0..P-1] of the pages that minimizes the sum D2(w,p) = Sum { (F[p[k]][w] - F[p[k-1]][w])^2 : k = 1..P-1 } For a single word, the solution is to sort the pages by frequency. In general, we want to minimize the sum of D2(w,p) over all words w. That is we want to minimize Sum { Sum { (F[p[k]][w] - F[p[k-1]][w])^2 : k = 1..P-1 } : w in W } which is the same as Sum { Sum { (F[p[k]][w] - F[p[k-1]][w])^2 : w in W } : k = 1..P-1 } which is Sum { dist(F[p[k]] - F[p[k-1]])^2 : k = 1..P-1 } In other words we want a path connecting the points F[p] that minimizes the *square* of the Euclidean edge lengths, d2(p,q) = dist(p,q)^2. This is a special case of the general traveling salesman problem. Note that d2(a,b) does not satisfy the usual triangle inequality since d2(a,b) can be twice d2(a,c)+d2(b,c). But the squared distance should be nicer to work with than the plain distance. Note that in order to prepare the data we must compute a matrix of size P^2, each entry of it being a a sum of |W| terms. Since P>200, it seems unfeasible at the moment to compute the whole matrix. We should be content with computing partial matrices for small sections, e.g. bio or str. In fact we only have hope of finding the best solution (by exhaustive search) for sections with 5 bifolios or less. OBTAINING THE DATA The word counts per page (majority version) were obtained in Note 019 "Analyzing word frequencies per section". mkdir -p RAW EQV ( cd RAW && ln -s ../../019/RAW/wfreqs ) ( cd EQV && ln -s ../../019/EQV/wfreqs ) ln -s ../019/fnum-to-subsec.tbl As a warm-up exercise, let's compute the distance matrix for the *folios* of the biological section. Let's first extract a list of the f-numbers of pages in the bio section: For realistic test: set subsec = "bio.1"; set ddir = RAW/wfreqs For toy test: set subsec = "tst.1"; set ddir = "TST/wfreqs"; In any case: mkdir subsecs set sdir = "subsecs/${subsec}"; echo ${sdir} mkdir ${sdir} cat fnum-to-subsec.tbl \ | gawk -v subsec="$subsec" '($2 == subsec){ print $1; }' \ > ${sdir}/all.pages dicio-wc ${sdir}/all.pages set pages = ( `cat ${sdir}/all.pages` ); echo $pages Now let's assign letter codes A-Z to the bifolios of the bio section, to simplify formatting of matrices and permutations: Likewise let's assign letters A-Za-z to the *folios* (not pages) of the bio section. Namely, the lowest-numbered half of each bifolio gets the same (uppercase) letter as the bifolio, and the other half gets the corresponding lowercase letter. foreach utype ( bifolio folio ) cat fnum-to-${utype}.tbl \ | fgrep -f ${sdir}/all.pages \ | assign-${utype}-letters \ > ${sdir}/fnum-to-${utype}.tbl end dicio-wc ${sdir}/fnum-to-*.tbl lines words bytes file ------- ------- --------- ------------ 8 16 64 subsecs/tst.1/fnum-to-bifolio.tbl 8 16 64 subsecs/tst.1/fnum-to-folio.tbl lines words bytes file ------- ------- --------- ------------ 20 40 140 subsecs/bio.1/fnum-to-folio.tbl 20 40 140 subsecs/bio.1/fnum-to-bifolio.tbl Now let's combine the counts for folios and bifolios: mkdir ${sdir}/{folio,bifolio}-cts mkdir ${tdir}/{folio,bifolio}-cts Create lists of folios and bifolios in their current order: foreach utype ( bifolio folio ) cat ${sdir}/fnum-to-${utype}.tbl \ | gawk '(\!($2 in p)){print $2; p[$2]=1;}' \ > ${sdir}/all.${utype}s end set folios = ( `cat ${sdir}/all.folios` ); echo $folios set bifolios = ( `cat ${sdir}/all.bifolios` ); echo $bifolios KEY WORD SELECTION Select the subset of "key words" to consider in the distance matrix computation. Keyword set 0: take all words that occur in the section. set wset = 0 cat ${ddir}/subsecs/${subsec}.frq \ | egrep -v '[?*]' \ | gawk '/./{ print $3; }' \ > ${sdir}/key-words-${wset}.dic (The distances between units were too small with this keyword set.) Keyword set 1: the words that occur at least twice in the section: set wset = 1 cat ${ddir}/subsecs/${subsec}.frq \ | egrep -v '[?*]' \ | gawk '($1 >= 2){ print $3; }' \ > ${sdir}/key-words-${wset}.dic Keyword set 2: words that occur at least 4 times. set wset = 2 cat ${ddir}/subsecs/${subsec}.frq \ | egrep -v '[?*]' \ | gawk '($1 >= 4){ print $3; }' \ > ${sdir}/key-words-${wset}.dic Keyword set 3: words that occur at least 5 and at most 20 times. set wset = 3 cat ${ddir}/subsecs/${subsec}.frq \ | egrep -v '[?*]' \ | gawk '(($1 >= 5)&&($1 <= 20)){ print $3; }' \ > ${sdir}/key-words-${wset}.dic Keyword set 4: words that occur at at least 21 times. set wset = 4 cat ${ddir}/subsecs/${subsec}.frq \ | egrep -v '[?*]' \ | gawk '($1 >= 21){ print $3; }' \ > ${sdir}/key-words-${wset}.dic Keyword set 5: words that occur 4 to 6 times. set wset = 5 cat ${ddir}/subsecs/${subsec}.frq \ | egrep -v '[?*]' \ | gawk '(($1 >= 4)&&($1 <= 6)){ print $3; }' \ > ${sdir}/key-words-${wset}.dic Keyword set 6: words that occur more than 40 times. set wset = 6 cat ${ddir}/subsecs/${subsec}.frq \ | egrep -v '[?*]' \ | gawk '($1 >= 41){ print $3; }' \ > ${sdir}/key-words-${wset}.dic Keyword set 7: words that occur more than 100 times. set wset = 7 cat ${ddir}/subsecs/${subsec}.frq \ | egrep -v '[?*]' \ | gawk '($1 >= 101){ print $3; }' \ > ${sdir}/key-words-${wset}.dic Keyword set 8: words that occur between 41 and 100 times. set wset = 8 cat ${ddir}/subsecs/${subsec}.frq \ | egrep -v '[?*]' \ | gawk '(($1 >= 41)&&($1 <= 100)){ print $3; }' \ > ${sdir}/key-words-${wset}.dic dicio-wc ${sdir}/key-words-*.dic lines words bytes file ------- ------- --------- ------------ 1247 3741 26754 subsecs/bio.1/key-words-0.dic 478 478 2827 subsecs/bio.1/key-words-1.dic 261 261 1465 subsecs/bio.1/key-words-2.dic 149 149 854 subsecs/bio.1/key-words-3.dic 72 72 383 subsecs/bio.1/key-words-4.dic 98 98 577 subsecs/bio.1/key-words-5.dic 36 36 180 subsecs/bio.1/key-words-6.dic 10 10 52 subsecs/bio.1/key-words-7.dic 26 26 128 subsecs/bio.1/key-words-8.dic DISTANCE MATRICES Collect the keyword frequencies per folio and bifolio: echo "frequencies computed for keyword set ${wset}" set nwords = "`cat ${sdir}/key-words-${wset}.dic | wc -l`"; echo $nwords foreach utype ( bifolio folio ) echo "${wset}" > ${sdir}/${utype}-cts/.current-wset set units = ( `cat ${sdir}/all.${utype}s` ) foreach unit ( ${units} ) echo "$unit" set files = ( \ ` cat ${sdir}/fnum-to-${utype}.tbl | gawk -v unit="${unit}" '($2 == unit){ printf "%s.frq\n", $1;}' ` \ ) echo "${utype} ${unit} = ${files}" ( cd ${ddir}/pages && cat ${files} ) \ | gawk '/./{print $1, $3}' \ | combine-counts \ | fgrep -w -f ${sdir}/key-words-${wset}.dic \ | estimate-freqs -v nItems="${nwords}" \ > ${sdir}/${utype}-cts/${unit}.frq end end Now we compute the distance matrices: foreach utype ( bifolio folio ) set wset = "`cat ${sdir}/${utype}-cts/.current-wset`" echo "wset = ${wset}" set nwords = "`cat ${sdir}/key-words-${wset}.dic | wc -l`"; echo $nwords set units = ( `cat ${sdir}/all.${utype}s` ) compute-dist-matrix \ -v nItems="${nwords}" \ -v countDir="${sdir}/${utype}-cts/" \ -v units="${units}" \ > ${sdir}/dist-${utype}-${wset}.matrix echo " " echo "With keyword set ${wset}:" echo " " cat ${sdir}/dist-${utype}-${wset}.matrix end With keyword set 0: A B C D E -------- -------- -------- -------- -------- A 0.000000 0.001940 0.002088 0.001521 0.002563 B 0.001940 0.000000 0.001486 0.002373 0.003180 C 0.002088 0.001486 0.000000 0.002334 0.003441 D 0.001521 0.002373 0.002334 0.000000 0.003062 E 0.002563 0.003180 0.003441 0.003062 0.000000 A B C D E e d c b a -------- -------- -------- -------- -------- -------- -------- -------- -------- -------- A 0.000000 0.001841 0.002170 0.001716 0.001767 0.002128 0.001779 0.001786 0.001960 0.001816 B 0.001841 0.000000 0.001914 0.002059 0.002257 0.002641 0.002032 0.001656 0.001552 0.001901 C 0.002170 0.001914 0.000000 0.001969 0.002895 0.003462 0.002114 0.001707 0.001642 0.002096 D 0.001716 0.002059 0.001969 0.000000 0.002237 0.002950 0.001542 0.002054 0.002011 0.001633 E 0.001767 0.002257 0.002895 0.002237 0.000000 0.001574 0.001939 0.002001 0.002437 0.002309 e 0.002128 0.002641 0.003462 0.002950 0.001574 0.000000 0.002565 0.002253 0.002682 0.002841 d 0.001779 0.002032 0.002114 0.001542 0.001939 0.002565 0.000000 0.001947 0.002006 0.001757 c 0.001786 0.001656 0.001707 0.002054 0.002001 0.002253 0.001947 0.000000 0.001424 0.002053 b 0.001960 0.001552 0.001642 0.002011 0.002437 0.002682 0.002006 0.001424 0.000000 0.002009 a 0.001816 0.001901 0.002096 0.001633 0.002309 0.002841 0.001757 0.002053 0.002009 0.000000 With keyword set 1: A B C D E -------- -------- -------- -------- -------- A 0.000000 0.003681 0.004026 0.002760 0.004920 B 0.003681 0.000000 0.002776 0.004960 0.006266 C 0.004026 0.002776 0.000000 0.004910 0.006868 D 0.002760 0.004960 0.004910 0.000000 0.006076 E 0.004920 0.006266 0.006868 0.006076 0.000000 A B C D E e d c b a -------- -------- -------- -------- -------- -------- -------- -------- -------- -------- A 0.000000 0.003989 0.005066 0.003678 0.003827 0.004672 0.003828 0.003828 0.004370 0.003986 B 0.003989 0.000000 0.004341 0.004758 0.005240 0.006074 0.004681 0.003477 0.003176 0.004258 C 0.005066 0.004341 0.000000 0.004704 0.007331 0.008557 0.005230 0.003846 0.003649 0.004982 D 0.003678 0.004758 0.004704 0.000000 0.005312 0.007026 0.003356 0.004978 0.004888 0.003534 E 0.003827 0.005240 0.007331 0.005312 0.000000 0.003159 0.004361 0.004538 0.005902 0.005482 e 0.004672 0.006074 0.008557 0.007026 0.003159 0.000000 0.005827 0.004903 0.006167 0.006715 d 0.003828 0.004681 0.005230 0.003356 0.004361 0.005827 0.000000 0.004722 0.004977 0.003903 c 0.003828 0.003477 0.003846 0.004978 0.004538 0.004903 0.004722 0.000000 0.002962 0.004804 b 0.004370 0.003176 0.003649 0.004888 0.005902 0.006167 0.004977 0.002962 0.000000 0.004694 a 0.003986 0.004258 0.004982 0.003534 0.005482 0.006715 0.003903 0.004804 0.004694 0.000000 With keyword set 2: A B C D E -------- -------- -------- -------- -------- A 0.000000 0.004968 0.005525 0.003686 0.006683 B 0.004968 0.000000 0.003745 0.007070 0.008611 C 0.005525 0.003745 0.000000 0.007072 0.009575 D 0.003686 0.007070 0.007072 0.000000 0.008451 E 0.006683 0.008611 0.009575 0.008451 0.000000 A B C D E e d c b a -------- -------- -------- -------- -------- -------- -------- -------- -------- -------- A 0.000000 0.005774 0.007721 0.005387 0.005573 0.006838 0.005611 0.005545 0.006485 0.005889 B 0.005774 0.000000 0.006536 0.007282 0.007884 0.009061 0.007128 0.004979 0.004509 0.006366 C 0.007721 0.006536 0.000000 0.007349 0.011615 0.013381 0.008306 0.005798 0.005459 0.007720 D 0.005387 0.007282 0.007349 0.000000 0.008221 0.010820 0.005019 0.007952 0.007775 0.005231 E 0.005573 0.007884 0.011615 0.008221 0.000000 0.004400 0.006529 0.006809 0.009171 0.008419 e 0.006838 0.009061 0.013381 0.010820 0.004400 0.000000 0.008749 0.007114 0.009276 0.010249 d 0.005611 0.007128 0.008306 0.005019 0.006529 0.008749 0.000000 0.007503 0.007948 0.005864 c 0.005545 0.004979 0.005798 0.007952 0.006809 0.007114 0.007503 0.000000 0.004299 0.007418 b 0.006485 0.004509 0.005459 0.007775 0.009171 0.009276 0.007948 0.004299 0.000000 0.007223 a 0.005889 0.006366 0.007720 0.005231 0.008419 0.010249 0.005864 0.007418 0.007223 0.000000 With keyword set 3: A B C D E -------- -------- -------- -------- -------- A 0.000000 0.008596 0.009378 0.008075 0.008009 B 0.008596 0.000000 0.008759 0.009005 0.008243 C 0.009378 0.008759 0.000000 0.009327 0.009156 D 0.008075 0.009005 0.009327 0.000000 0.009229 E 0.008009 0.008243 0.009156 0.009229 0.000000 A B C D E e d c b a -------- -------- -------- -------- -------- -------- -------- -------- -------- -------- A 0.000000 0.010003 0.011975 0.010437 0.010274 0.011059 0.010005 0.010812 0.011028 0.011051 B 0.010003 0.000000 0.011561 0.010772 0.010249 0.011223 0.010639 0.010630 0.011077 0.011661 C 0.011975 0.011561 0.000000 0.012168 0.012396 0.012247 0.011954 0.011850 0.011979 0.011900 D 0.010437 0.010772 0.012168 0.000000 0.011007 0.012675 0.011028 0.011394 0.012156 0.011482 E 0.010274 0.010249 0.012396 0.011007 0.000000 0.011205 0.010552 0.010752 0.011707 0.011044 e 0.011059 0.011223 0.012247 0.012675 0.011205 0.000000 0.011803 0.011403 0.011294 0.011352 d 0.010005 0.010639 0.011954 0.011028 0.010552 0.011803 0.000000 0.010654 0.011822 0.011166 c 0.010812 0.010630 0.011850 0.011394 0.010752 0.011403 0.010654 0.000000 0.011303 0.011848 b 0.011028 0.011077 0.011979 0.012156 0.011707 0.011294 0.011822 0.011303 0.000000 0.012095 a 0.011051 0.011661 0.011900 0.011482 0.011044 0.011352 0.011166 0.011848 0.012095 0.000000 With keyword set 4: A B C D E -------- -------- -------- -------- -------- A 0.000000 0.008980 0.010107 0.006642 0.013395 B 0.008980 0.000000 0.006549 0.014435 0.017926 C 0.010107 0.006549 0.000000 0.014241 0.019875 D 0.006642 0.014435 0.014241 0.000000 0.017491 E 0.013395 0.017926 0.019875 0.017491 0.000000 A B C D E e d c b a -------- -------- -------- -------- -------- -------- -------- -------- -------- -------- A 0.000000 0.011570 0.016294 0.010815 0.011492 0.014477 0.011866 0.010841 0.013405 0.011531 B 0.011570 0.000000 0.013552 0.016007 0.017792 0.020305 0.016454 0.009723 0.008449 0.012631 C 0.016294 0.013552 0.000000 0.015768 0.027317 0.031561 0.019072 0.011600 0.010617 0.016251 D 0.010815 0.016007 0.015768 0.000000 0.018689 0.024852 0.010022 0.017925 0.017325 0.010038 E 0.011492 0.017792 0.027317 0.018689 0.000000 0.007961 0.014376 0.014815 0.021252 0.018602 e 0.014477 0.020305 0.031561 0.024852 0.007961 0.000000 0.019778 0.015033 0.021080 0.023070 d 0.011866 0.016454 0.019072 0.010022 0.014376 0.019778 0.000000 0.017850 0.018800 0.011940 c 0.010841 0.009723 0.011600 0.017925 0.014815 0.015033 0.017850 0.000000 0.007854 0.015405 b 0.013405 0.008449 0.010617 0.017325 0.021252 0.021080 0.018800 0.007854 0.000000 0.014961 a 0.011531 0.012631 0.016251 0.010038 0.018602 0.023070 0.011940 0.015405 0.014961 0.000000 With keyword set 5: A B C D E -------- -------- -------- -------- -------- A 0.000000 0.015171 0.016168 0.015575 0.014795 B 0.015171 0.000000 0.016379 0.017969 0.016674 C 0.016168 0.016379 0.000000 0.018949 0.016653 D 0.015575 0.017969 0.018949 0.000000 0.018493 E 0.014795 0.016674 0.016653 0.018493 0.000000 A B C D E e d c b a -------- -------- -------- -------- -------- -------- -------- -------- -------- -------- A 0.000000 0.017364 0.019513 0.019282 0.018359 0.018517 0.018728 0.018984 0.019835 0.019621 B 0.017364 0.000000 0.018675 0.019837 0.018863 0.018509 0.018413 0.018638 0.018085 0.017973 C 0.019513 0.018675 0.000000 0.020098 0.018858 0.019433 0.019346 0.019391 0.019297 0.018965 D 0.019282 0.019837 0.020098 0.000000 0.020348 0.019997 0.018991 0.020518 0.019628 0.018151 E 0.018359 0.018863 0.018858 0.020348 0.000000 0.018856 0.019700 0.020379 0.019835 0.018750 e 0.018517 0.018509 0.019433 0.019997 0.018856 0.000000 0.018748 0.018377 0.018793 0.018165 d 0.018728 0.018413 0.019346 0.018991 0.019700 0.018748 0.000000 0.019889 0.019097 0.018054 c 0.018984 0.018638 0.019391 0.020518 0.020379 0.018377 0.019889 0.000000 0.019047 0.019023 b 0.019835 0.018085 0.019297 0.019628 0.019835 0.018793 0.019097 0.019047 0.000000 0.018554 a 0.019621 0.017973 0.018965 0.018151 0.018750 0.018165 0.018054 0.019023 0.018554 0.000000 With keyword set 6: A B C D E -------- -------- -------- -------- -------- A 0.000000 0.012940 0.013690 0.010134 0.020768 B 0.012940 0.000000 0.008796 0.022844 0.029096 C 0.013690 0.008796 0.000000 0.021595 0.031861 D 0.010134 0.022844 0.021595 0.000000 0.028200 E 0.020768 0.029096 0.031861 0.028200 0.000000 A B C D E e d c b a -------- -------- -------- -------- -------- -------- -------- -------- -------- -------- A 0.000000 0.016976 0.022803 0.015724 0.016672 0.022572 0.018364 0.014523 0.018987 0.016886 B 0.016976 0.000000 0.019258 0.024475 0.027819 0.033093 0.027238 0.012924 0.010811 0.018447 C 0.022803 0.019258 0.000000 0.022286 0.042816 0.051345 0.029645 0.015924 0.014177 0.023027 D 0.015724 0.024475 0.022286 0.000000 0.029168 0.040905 0.014113 0.027564 0.026453 0.014939 E 0.016672 0.027819 0.042816 0.029168 0.000000 0.010641 0.022294 0.022562 0.033872 0.028841 e 0.022572 0.033093 0.051345 0.040905 0.010641 0.000000 0.032962 0.024785 0.034788 0.037844 d 0.018364 0.027238 0.029645 0.014113 0.022294 0.032962 0.000000 0.029105 0.030201 0.018716 c 0.014523 0.012924 0.015924 0.027564 0.022562 0.024785 0.029105 0.000000 0.010611 0.023144 b 0.018987 0.010811 0.014177 0.026453 0.033872 0.034788 0.030201 0.010611 0.000000 0.021925 a 0.016886 0.018447 0.023027 0.014939 0.028841 0.037844 0.018716 0.023144 0.021925 0.000000 With keyword set 7: A B C D E -------- -------- -------- -------- -------- A 0.000000 0.024697 0.021902 0.022253 0.047342 B 0.024697 0.000000 0.016077 0.067946 0.074634 C 0.021902 0.016077 0.000000 0.054178 0.080114 D 0.022253 0.067946 0.054178 0.000000 0.072549 E 0.047342 0.074634 0.080114 0.072549 0.000000 A B C D E e d c b a -------- -------- -------- -------- -------- -------- -------- -------- -------- -------- A 0.000000 0.033450 0.043694 0.037565 0.030930 0.047851 0.036868 0.024454 0.032704 0.031666 B 0.033450 0.000000 0.044591 0.075141 0.074467 0.077829 0.078277 0.023577 0.017538 0.039497 C 0.043694 0.044591 0.000000 0.054296 0.110691 0.132447 0.070696 0.039988 0.026165 0.040654 D 0.037565 0.075141 0.054296 0.000000 0.072766 0.107786 0.026442 0.073902 0.067858 0.026341 E 0.030930 0.074467 0.110691 0.072766 0.000000 0.016514 0.049695 0.054983 0.083623 0.068331 e 0.047851 0.077829 0.132447 0.107786 0.016514 0.000000 0.080771 0.057655 0.085966 0.086974 d 0.036868 0.078277 0.070696 0.026442 0.049695 0.080771 0.000000 0.073186 0.078391 0.040357 c 0.024454 0.023577 0.039988 0.073902 0.054983 0.057655 0.073186 0.000000 0.019658 0.047499 b 0.032704 0.017538 0.026165 0.067858 0.083623 0.085966 0.078391 0.019658 0.000000 0.038134 a 0.031666 0.039497 0.040654 0.026341 0.068331 0.086974 0.040357 0.047499 0.038134 0.000000 According to this matrix, the two halves of each bifolio are more similar to each other than to other folios. The only exception is bifolio A, whose two leaves A-a are about as far apart as A-b, A-B, A-D, A-d, and farther apart than A-c or A-E. We should compute the matrix for pages... With keyword set 8: A B C D E -------- -------- -------- -------- -------- A 0.000000 0.025477 0.032770 0.014678 0.026348 B 0.025477 0.000000 0.017877 0.017311 0.031356 C 0.032770 0.017877 0.000000 0.028090 0.035054 D 0.014678 0.017311 0.028090 0.000000 0.031945 E 0.026348 0.031356 0.035054 0.031945 0.000000 A B C D E e d c b a -------- -------- -------- -------- -------- -------- -------- -------- -------- -------- A 0.000000 0.033273 0.046463 0.024895 0.031992 0.028727 0.029565 0.032668 0.040787 0.035185 B 0.033273 0.000000 0.031515 0.022852 0.032934 0.042998 0.022895 0.027224 0.025879 0.033263 C 0.046463 0.031515 0.000000 0.034602 0.055337 0.054171 0.040839 0.022761 0.028957 0.050304 D 0.024895 0.022852 0.034602 0.000000 0.041506 0.042909 0.026610 0.036616 0.036847 0.031611 E 0.031992 0.032934 0.055337 0.041506 0.000000 0.023721 0.035939 0.031024 0.046460 0.043361 e 0.028727 0.042998 0.054171 0.042909 0.023721 0.000000 0.040227 0.027364 0.038896 0.051080 d 0.029565 0.022895 0.040839 0.026610 0.035939 0.040227 0.000000 0.034880 0.034214 0.027948 c 0.032668 0.027224 0.022761 0.036616 0.031024 0.027364 0.034880 0.000000 0.020169 0.044718 b 0.040787 0.025879 0.028957 0.036847 0.046460 0.038896 0.034214 0.020169 0.000000 0.048712 a 0.035185 0.033263 0.050304 0.031611 0.043361 0.051080 0.027948 0.044718 0.048712 0.000000 BEST ORDERING OF BIFOLIOS Now we look for the mest ordering of bifolios. echo "wset = ${wset}" enum-perms \ -v elems="`echo ${bifolios}|tr -d ' '`" \ | find-cheapest-path \ -v matrixFile=${sdir}/dist-bifolio-${wset}.matrix Result with the bifolio distance matrix 0 (all words): best score = 0.007729 BCDAE EADCB With bifolio distance matrix 1 (words that occur at least twice): best score = 0.015293 CBADE EDABC With bifolio distance matrix 2 (words that occur at least 4 times): natural score = 0.024236 natural perm = ABCDE best score = 0.020850 best perms = CBADE EDABC With bifolio matrix 3 (words with count in [5..20]): natural score = 0.035911 natural perm = ABCDE best score = 0.033086 best perms = CBEAD DAEBC BEST PHYSICALLY VALID ORDERING OF FOLIOS let's look for optimal physically valid orderings of the folios. Note that the "elems" list must be an UNNESTED permutation. I will retain only the solutions that have D before d. echo "wset = ${wset}" nice enum-folio-perms \ -v elems="AaBbCcDdEe" \ | nice find-cheapest-path \ -v matrixFile=${sdir}/dist-folio-${wset}.matrix Result with the folio distance matrix 0 (all words) natural score = 0.017480 natural perm = ABCDEedcba best score = 0.014879 best perms = CcbBaDdAEe Note that the bifolios are all unnested except for A wrapped around D. Also D is central as it must. The permutation BbcCDdaAEe, all unnested, has comparable score (0.015108). With folio distance matrix 1 (words that occur at least twice): natural score = 0.039710 natural perm = ABCDEedcba best score = 0.031946 best perms = CcbBaDdAEe ()()(())() Note that it is the same as the best solution for matrix 0. With folio distance matrix 2 (words that occur at least 4 times): natural score = 0.060054 natural perm = ABCDEedcba best score = 0.046806 best perms = CcbBaDdAEe ()()(())() Again the same solution, with A nested around D. The unnested solution BbcCDdaAEe has score 0.048700 in this matrix. With folio matrix 3 (words with count in [5..20]): natural score = 0.101799 natural perm = ABCDEedcba best score = 0.098164 best perms = beaDdAEBcC (((())))() This solution has everything nested except D (good!) and C. But the score is only a trifle lower than the current solution. With folio matrix 4 (the 72 words with count > 20): natural score = 0.127983 natural perm = ABCDEedcba best score = 0.091913 best perms = CcbBaDdAEe ()()(())() Again the same solution as with matrices 0-2, with A nested around D. With folio matrix 5 (words with count in [4..6]): natural score = 0.171579 natural perm = ABCDEedcba best score = 0.165447 best perms = ceaDdbBAEC (((()()))) At least bifolio D is OK. With folio matrix 6 (the 36 words with count > 40): natural score = 0.192932 natural perm = ABCDEedcba best score = 0.130247 best perms = eEAcCbBaDd ()(()())() This one is different... but D is still OK. With folio matrix 7 (the 10 words with count > 100): natural score = 0.433366 natural perm = ABCDEedcba best score = 0.232615 best perms = eEAcBbCaDd ()((()))() This one is only a bit different from the previous one: C and B got nested. Note that the cos difference is substantial, but still the points are far from being a unidimensional cluster... With folio matrix 8 (the 26 words that occur between 41 and 100 times). natural score = 0.308605 natural perm = ABCDEedcba best score = 0.231597 best perms = EeADCcbBda ()((()())) This one is no good because it has C and B nested in D.