Hacking at the Voynich manuscript - Side notes
077 Further comparisons of Recipes to the Shennong Bencao

Last edited on 2026-07-01 12:04:01 by stolfi

INTRODUCTION

  This note makes further analyses comparing the Starred Paragraphs
  Section (SPS, aka Recipes Section) of the VMS to the Chinese medical
  classic Shennong Bencaojing (SBJ).

SETUP

  ln -s ${HOME}/ttf
  ln -s ../work
  ln -s work/langbank
  ln -s ${HOME}/bin/gawkf.sh
  ln -s work/faf.sh
  ln -s work/chinese_funcs.py
  ln -s work/error_funcs.gawk
  ln -s work/error_funcs.py
  ln -s work/make_funcs.py
  ln -s work/process_funcs.py
  ln -s work/vms_linear_gray_image_funcs.py
  ln -s ../076/common_funcs_076.gawk
  ln -s work/compute_freqs_from_counts.py
  ln -s work/make_histogram.gawk
  ln -s work/turn_histogram_into_polygonal_line.gawk

CLEANED AND REGULARIZED TRANSCRIPTIONS

  The first step is to produce transcriptions of the SBJ
  ("res/bencao-*.ivt") and of the SPS ("res/starps-*.ivt") that are
  (1) cleaned of complicated artifacts like weirdo codes, inline
  comments, and other markup, and (2) in compatible formats that can be
  processed by the same scripts.

  Transcription line format

    These files all have the same format: each line has fields
    "<{LOC}> {TEXT}" where {LOC} is a locus identifier in the original
    book, and {TEXT} is a chunk of transcribed text.

    For the SBJ the field {LOC} has the form "{SSEC}.{SSUB}.{SLIN}"
    where {SSEC} is "b1" to "b3", {SSUB} is one digit 1..6, and {SLIN}
    is a 3-digit line number starting from "001" that is unique within
    each {SSEC}. (The {SLIN} was supposed to be sequential, but some
    lines of the original file had to be split and thus got
    out-of-sequence {SLIN}s.) An example is "b1.2.023".

    For the SPS the field {LOC} is the standard VMS locus ID, namely
    "f{PAGE}.{LSEQ}" where {PAGE} is a 3-digit number 103..116 followed
    by "r" or "v", and {LSEQ} is a line number in the page, from 1. An
    example is "f105v.32". Four pages of the SPS, "f109r" to "f110v",
    are missing in the physical manuscript.
    One line "f105r.10" is omitted because in my transcription of the
    SPS its {TEXT} was merged into the next line. Three other lines are
    omitted because they are believed to be titles. The last page is
    "f116r" (not "f116v") and only lines 1 to 30 are assumed to be part
    of the SPS.

  Transcription file names

    The basic naming scheme for all transcription files is
    "res/{ivt_name}.ivt" where {ivt_name} is
    "{book}-{bsub}-{utype}-{ltype}", and

      {book} is the source book, "bencao" (SBJ) or "starps" (SPS).

      {bsub} is a subset of the book. For "bencao" the only {bsub} of
      interest is "fu", the full book. For "starps" there are two
      subsets: "fu", the full book, and "gd", a subset that consists of
      the lines which seem to be correctly grouped into parags --
      excluding those blocks of lines that seem to be two or more
      parags run together.

      {utype} specifies the nature of the text and what constitutes a
      unit for statistical purposes. The {utype} can be only "ch" for
      the SBJ, and "ec", "wp", or "wc" for the SPS. See
      {cleanup_starps_raw_text} {cleanup_raw_bencao_text} and {}

      {ltype} specifies the amount of text that is in each line of the
      file. For "bencao" files it must be "par", meaning each line of
      the file is one parag ("recipe", entry) of the SBJ. For "starps"
      files it may be "par", meaning that each line of the file is a
      parag of the SPS, or "pag", meaning that each line of the file is
      a whole page. The latter only makes sense for the "fu" subset.

UNITS OF COUNTING

  The {utype} field in file names specifies the nature of the raw and
  cleaned transcription {TEXT} and the unit used to measure parag sizes
  and word positions. See {} in {size_position_funcs.py} for a detailed
  explanation.

TWO-FILE COMPARATIVE HISTOGRAMS

  The procedures below produce various ".upp" files for different
  source files (SBJ and SPS), different source formats (EVA letters,
  Voynichese words, Chinese characters, etc.) and different filtering
  policies (all parags or only some, etc.).
  Each line of a ".upp" file contains the location {loc[ip]} of a
  parag {ip} and its size {W[ip]}, defined as the number of /units/
  (which may be syllables, characters, words, etc.).

  With each ".upp" file we make a histogram, where each bin {kb} has a
  range {[wlo[kb] _ whi[kb]]} and a count {P[kb]} of parags {ip} which
  have {wlo[kb] < W[ip] <= whi[kb]}.

  To meaningfully compare two ".upp" files "res/{file0}.upp" and
  "res/{file1}.upp", we draw their histograms on the same plot, scaling
  the number of words {W[ip]} of each parag in dataset {file1} by a
  factor {W1_mag = W0_avg/W1_avg}, where {W0_avg} and {W1_avg} are the
  averages of {W[ip]} in the two datasets. Then we scale the parag
  counts {P[kb]} in each bin of histogram 1 by
  {P1_mag = P0_num/P1_num}, where {P0_num} and {P1_num} are the numbers
  of parags in the two datasets. These adjustments make the average
  word count and the total parag count of {file1} the same as those of
  {file0}.

VOYNICH STARRED PARAGS FILE

  For the SPS of the Voynich manuscript, I used the transcription that
  I created in note 076 as the source. It has been moved to note 074
  and included in the joint file "../074/join25e1.ivt". Actually I used
  the extracted file "../074/st_files/str-parags.ivt", which is similar
  but cleaner -- with no '{}', weirdo codes, inline comments, page
  headers, titles, etc. See "report/src/0210_starred_parags_src.py" for
  more information.

  From the file "../074/st_files/str-parags.ivt" I created files

    res/starps-fu-ec-par.ivt
    res/starps-fu-wp-par.ivt
    res/starps-fu-wc-par.ivt

    res/starps-gd-ec-par.ivt
    res/starps-gd-wp-par.ivt
    res/starps-gd-wc-par.ivt

  In the "ec" versions all word or line separators [-,.] are deleted.
  In the "wc" versions all commas are mapped to period '.'. In the "wp"
  versions the commas are deleted but the periods '.' are retained. In
  both the "wc" and "wp" versions the line separator '-' is mapped to
  '.' too. See "convert_starps_lin_ivt_to_par_ivt.py" and the rules in
  "make_rules_starps.py".
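  The three separator policies can be illustrated with a minimal Python
  sketch. This is not the code of "convert_starps_lin_ivt_to_par_ivt.py";
  the function names {convert_text} and {count_units} and the sample
  string are hypothetical, and the sketch assumes that in the "ec"
  version every remaining character counts as one EVA letter and that
  words are the non-empty pieces between periods.

```python
# Sketch only: hypothetical helpers, not the author's actual scripts.

def convert_text(text: str, utype: str) -> str:
    """Apply the separator policy for the given unit type to raw EVA text."""
    if utype == "ec":
        # Delete all word and line separators [-,.].
        return text.translate(str.maketrans("", "", "-,."))
    if utype == "wc":
        # Commas and line separators both become periods.
        return text.translate(str.maketrans({",": ".", "-": "."}))
    if utype == "wp":
        # Commas are deleted; line separators become periods.
        return text.translate(str.maketrans({",": None, "-": "."}))
    raise ValueError(f"unknown utype {utype!r}")


def count_units(text: str, utype: str) -> int:
    """Count units in converted text (assumption: 1 char = 1 EVA letter)."""
    if utype == "ec":
        return len(text)
    return len([w for w in text.split(".") if w])


if __name__ == "__main__":
    sample = "qokeey,dy.chedy-otar"  # made-up text, for illustration only
    for utype in ("ec", "wc", "wp"):
        converted = convert_text(sample, utype)
        print(utype, converted, count_units(converted, utype))
    # ec qokeeydychedyotar 17
    # wc qokeey.dy.chedy.otar 4
    # wp qokeeydy.chedy.otar 3
```

  Note how the same raw text yields 4 words under "wc" but only 3 under
  "wp", since a comma marks a dubious word break that "wp" ignores.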
COUNTING ENTRIES AND WORDS IN THE STARRED PARAGS SECTION

  From each file "res/starps-{fu,gd}-{utype}-par.ivt", where {utype} is
  "ec", "wp", or "wc", we counted the corresponding units, creating
  files "res/starps-{fu,gd}-{utype}.upp". For "ec" the units are EVA
  letters; for "wp" and "wc" they are words. See the script
  "count_units_per_line.py" and the make rules for
  "res/starps-fu-{utype}.upp" and "res/starps-gd-{utype}.upp" where
  {utype} is "ec", "wc", or "wp".

BENCAO FILE

  For the SBJ, I created a file "in/bencao-4.chu" by merging and
  reordering two versions of the Shennong Bencao Jing obtained from the
  net, and reformatted the result to look sort of like an EVT file. The
  text is in Chinese characters (mixed traditional and simplified), in
  the Unicode UTF-8 encoding. See the comments in the file for details
  of the cleanup.

  ??? REVISE FROM NOW ON

  Excluding the section titles, the shortest remaining entry seemed to
  be normal:

    鼯鼠:主墮胎,令易產。
    wú shǔ zhǔ duò tāi, lìng yì chǎn.
    Flying squirrel: causes abortion, makes childbirth easier.

  (If that can be called "normal"...)

COUNTING ENTRIES AND WORDS IN THE SHENNONG BENCAO JING

  The program that counts the words per recipe in the "original" SBJ
  (Unicode Chinese chars) excludes all blank lines and #-comments
  (including the subsection titles) and the introduction (part "s0").
  It also excludes the Chinese ideographic punctuation marks [。,:]. It
  counts each Chinese character (syllable) as one word. See the
  Makefile for "res/bencao-fu-ch-par.upp"

      435 total lines
      364 recipes
     3230 ignored chars of type 'punct'
      361 ignored chars of type 'blank'
    13144 total syllables
    36.11 avg syllables/recipe

  The histogram has a notable minimum at ~24-25 hanzi per parag.

  The unique shortest SBJ parag is (the "Flying Squirrel" one, "wú shǔ
  : zhǔ duò tāi, lìng yì chǎn"). It has 8??? syllables. The second
  shortest, also unique, is with 12??? syllables.

  The unique longest parag is (the "Red Rooster") with 92???
  syllables, "dān xióng jī wèi gān wēi wēn. zhǔ nǚzǐ bēng...". The
  second and third longest, unique but close, are with 73??? syllables
  and with 72???. Then there are two other parags with 70??? syllables,
  then a gap to 64??? syllables.

COMPARATIVE SBJ-SPS HISTOGRAMS

  We created two-histogram plots comparing various combinations of SBJ
  and SPS ".upp" files.

  The plots of "bencao-fu-ch" and "starps-fu-wc" (scaled) were
  surprisingly similar. (The scale factors for the latter were
  {W1_mag} = 1.1462, {P1_mag} = 1.0826.) Both had a big hump from
  {W=26} to {W=48}, with a small but significant drop around {W=35}.
  Both had similar shortest parags (with word counts {W0=8} and scaled
  {W1=7}), and similar second-shortest ones ({W0=12} and scaled
  {W1=13}). Both had somewhat similar longest parags, with {W0=92} and
  scaled {W1=85}.

  We also did the same comparison with "starps-gd-wc" instead of
  "starps-fu-wc", but the plots were almost identical, even though the
  scaling factors were different ({W1_mag} = 1.1484,
  {P1_mag} = 1.1569).

  Notable differences were an excess of SPS parags with scaled counts
  {W1} below 25, notably 15..19, 21, 22, and 24. Unscaled, those seem
  to be 14..19, and 21, respectively, with 18, 19, and 21 being the
  worst. Also the second, third, and fourth largest parags in the SPS,
  with unscaled {W} 72, 70, and 64, do not seem to have a match in the
  SBJ.

  We isolated from "res/starps-fu-wc.upp" the parags with counts 18,
  19, and 21, and checked their pages:

    list_pages_with_parags_of_given_sizes.sh "18,19,21" \
      < res/starps-fu-wc.upp
      > .badpages-18

  Those offending parags were distributed all over the SPS, but there
  were 4 each on f106r and f106v. Also 2 each on f105r and f105v, and 2
  each on f108r and f108v. They may have been pages where writing was
  more cramped and so more words were run together, resulting in fewer
  words per parag. Must revise the commas in those two pages.
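  The script "list_pages_with_parags_of_given_sizes.sh" is not
  reproduced in this note. The Python sketch below is only a guess at
  the kind of tally it performs. It assumes that each ".upp" line holds
  a locus ID (possibly wrapped in "<>") followed by the parag size,
  separated by blanks; the function name {pages_with_sizes} and the
  sample lines are made up for illustration.

```python
from collections import Counter

# Sketch only: hypothetical stand-in for the shell script named above.
# Assumed ".upp" line format: "<f106r.1> 18" or "f106r.1 18".

def pages_with_sizes(upp_lines, sizes):
    """Return (page, count) pairs for parags whose size is in `sizes`,
    sorted by decreasing count and then by page name."""
    wanted = set(sizes)
    tally = Counter()
    for line in upp_lines:
        fields = line.split("#", 1)[0].split()  # drop any #-comment
        if len(fields) < 2:
            continue  # blank or malformed line
        loc = fields[0].strip("<>")
        if int(fields[1]) in wanted:
            page = loc.split(".")[0]  # "f106r.1" -> "f106r"
            tally[page] += 1
    return sorted(tally.items(), key=lambda kv: (-kv[1], kv[0]))


if __name__ == "__main__":
    # Invented sample data, not taken from res/starps-fu-wc.upp.
    sample = [
        "<f106r.1> 18",
        "<f106r.9> 21",
        "<f106v.4> 19",
        "<f105v.32> 72",
        "<f103r.2> 33",
    ]
    print(pages_with_sizes(sample, [18, 19, 21]))
    # [('f106r', 2), ('f106v', 1)]
```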
  Adding the smallest batch we get

    list_pages_with_parags_of_given_sizes.sh "14,15,16,17,18,19,21" \
      < res/starps-fu-wc.upp \
      > .badpages-14

  The worst pages are still f106r and f106v with five anomalous parags
  each. Then f103r with 4, then f107v, f108r, f112v, f113r with 3 each.

  Let's list the pages with the three anomalous big parags:

    list_pages_with_parags_of_given_sizes.sh "64,70,72" \
      < res/starps-fu-wc.upp \
      > .badpages-64

  Two of these are on f105v, one on f115r.

STATISTICS

  ??? REVISE

  The basic statistics for the number of words per parag (upp) of the
  two files are:

    statistic   | bencao  | starps
    ------------+---------+--------
    parags      |     364 |    327
    words       |   13144 |  10699
    min upp     |       8 |      6
    max upp     |      92 |     77
    avg upp     |    37.1 |   32.7
    dev upp     |     9.7 |   12.0

LOOKING FOR SIMILAR SEQUENCES OF PARAGS

  We create a grayscale image where the pixel in row i and column j is
  proportional to the similarity of the word counts of parag i from the
  SPS and recipe j from the SBJ.

    ./make_similarity_image.sh ???NAME1 UNIT! NAME2 UNIT2

======================================================================

Here are some advances in the comparison between the Starred Parags
(SPS) section and the Shennong Bencao Jing (SBJ).

Recall that the files are:

[quote="Jorge_Stolfi" pid='67750' dateline='1750041874']

starps.eva
The Starred Paragraphs section (SPS) from Takeshi's transcription in
the 1.6e6 interlinear file, from page f102r to line 30 of f116r. With
one parag per line, in the EVA encoding, with all alignment fillers and
comments removed, all weirdos and missing chars mapped to '*', one "="
at start and end of each line (= parag). The file is in UTF-8 encoding.

Again, if you just click on those links you will see gibberish, because
the server at my Univ expects plain text files to be in ISO-Latin-1 and
thus messes up the formatted HTML that it sends to your browser. You
will have to download the files and look at them with any text editor
or viewer that understands UTF-8.
Here is the histogram of the word counts upp:

At first sight the histograms are different, but there are some
intriguing similarities. Note that both files have 23 entries with 27
words (the most common entry length in both files), six entries with 23
words, 8 entries with 37 words, 2 entries with 47 words, one entry with
53 words, one entry with 59 words, and one entry with 62 words. In both
files, there are anomalously few entries with 23, 37, and 43 words.

Considering the missing bifolio in the SPS quire, we have 6 surprising
near coincidences: number of entries, and the mode, min, max, average,
and deviation of the number of words per paragraph (npw). (The total
number of words is not an extra coincidence since it is the average npw
times the number of entries.)

Compared to the SBJ, the SPS has a somewhat broader npw histogram, as
implied by the standard deviation. It has more entries with 10-20 words
and 35-70 words, and fewer with 21-34 words. In particular, the SBJ has
a second mode: 23 parags of 34 words, whereas the SPS has only 11.

These discrepancies could be the result of some word spaces being
incorrectly inserted or omitted in the SPS as it was digitized;
somewhat at random, with almost the same probability. Alternatively,
some parag breaks in the SPS may be wrong, causing, for example, two
consecutive parags that should have 22 and 32 words to become parags of
16 and 38 words; and two parags that should have 7 and 76 words to
become parags with 13 and 70 words. Both kinds of errors would have
little effect on the average npw, but would increase its standard
deviation, as observed.

There is also the bonus coincidence of both files having originally
subsection titles with ~4 words each, although the number of such
titles is vastly different. More on that later.

Now for the bad news. As @oshfdk observed, there are hundreds of
multiword sequences that occur many times in the SBJ.
In particular, there is a 10-word phrase that occurs six times, on six
consecutive lines:

[code]
久食輕身不老,延年神仙。一名
jiǔ.shí.qīng.shēn.bùlǎo.yán.nián.shénxiān.yī.míng
jiǔ.shí.qīng.shēn.bùlǎo.yán.nián.shénxiān.yī.míng
jiǔ.shí.qīng.shēn.bùlǎo.yán.nián.shénxiān.yī.míng
jiǔ.shí.qīng.shēn.bùlǎo.yán.nián.shénxiān.yī.míng
jiǔ.shí.qīng.shēn.bùlǎo.yán.nián.shénxiān.yī.míng
jiǔ.shí.qīng.shēn.bùlǎo.yán.nián.shénxiān.yī.míng
Eating it for a long time will make you light and immortal. It is also called
[/code]

On the other hand, in the SPS the longest phrases that occur more than
once have only 3 words; and the most common occurs only three times:

[code]
chedy.qokeey.qokeey
chedy.qokeey.qokeey
chedy.qokeey.qokeey
[/code]

I will discuss the implications of this difference in another post.

======================================================================

>> REDO THIS >>
>> REDO THIS >>

THE DATA

  Raw data files, with comments prefixed with "#", recipe numbers in
  the form S-NNN prefixed with "##", and each kanji surrounded by ASCII
  spaces:

    ln -s ~/IMPORT/texts/chinese/ShennongBencao/text.big5 bencao-raw.big5
    ln -s ~/IMPORT/texts/chinese/ShennongBencao/text.jis bencao-raw.jis

  Data files without punctuation:

    cat bencao-raw.big5 \
      | gawk \
          ' /^#/ {print; next;} \
            // { \
              gsub(/[ ]+[{}][ ]+/, " ", $0); \
              gsub(/[ ]+[\241][\264]/, "", $0); \
              gsub(/[ ]+[\241][D]/, "", $0); \
              print; \
            } ' \
      > bencao.big5

    cat bencao-raw.jis \
      | gawk \
          ' /^#/ {print; next;} \
            // { \
              gsub(/[ ]+[{}][ ]+/, " ", $0); \
              gsub(/[ ]+[\201][\234]/, "", $0); \
              gsub(/[ ]+[\201][D]/, "", $0); \
              print; \
            } ' \
      > bencao.jis

    dicio-wc bencao{-raw,}.{big5,jis} vstars.eva

      lines   words     bytes file
    ------- ------- --------- ------------
       2008   17532     57611 bencao-raw.big5
       1510   19003     70177 bencao-raw.jis
       2008   13705     46183 bencao.big5
       1510   15229     57427 bencao.jis
       1742   13642     86751 vstars.eva *
       1734   13354     85052 vstars.eva **

     * = as of sometime before 2004-05-30.
    ** = as of 2004-05-30. Unknown why it changed since the original run.
  Extracted the Voynichese "stars" section from the Majority version,
  reformatted to be comparable to the Bencao (line numbers as
  NNNV-U-LL, recipe numbers as "## S-NNN", all words surrounded by
  ASCII space). Fixed many errors by hand, against KHE's images (also
  in the interlinear file).

BASIC STATISTICS

  Checking whether each VMS page has been split into the correct number
  of recipes:

    cat vstars.eva \
      | count-recipes-per-page \
      > vstars.rpp
    diff true.rpp vstars.rpp

      total 328 recipes

  Note that total has changed. This is because, during the 05/2004
  round of edits some long recipes were split at paragraph breaks, even
  though there were no stars there. This is not too unreasonable,
  because the stars seem to have been placed without much care, as if
  the scribe did not understand that they were associated with the
  paragraphs.

  Basic statistics - total tokens, words, recipes:

    foreach f ( bencao.big5 vstars.eva )
      printf "\n%-10s" "${f:r}:"
      cat $f \
        | print-tk-wd-counts \
        > ${f:r}.twct
      cat ${f:r}.twct \
        | sort -b -k3nr -k1n \
        | egrep -v '^000 ' \
        > ${f:r}.twsr
    end

      bencao: total 357 recipes, 12826 tokens ( 0 bad), 35.93 tokens/recipe, 1113 good words
      vstars: total 328 recipes, 10491 tokens ( 38 bad), 31.98 tokens/recipe, 2996 good words

  Note that these counts have changed since 02/2002. They used to be

      vstars: total 323 recipes, 10542 tokens, ( 595 bad), 32.64 tokens/recipe, 2767 good words

  During the 05/2004 round of edits, many tokens became joined with
  their neighbors, because the spaces were entered as faithfully as
  possible. However, if we believe the word structure paradigm, then
  many of those joined words should have been kept separate. Also note
  that over 550 "bad" tokens were fixed by those edits.
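  The tools "count-recipes-per-page" and "print-tk-wd-counts" are not
  shown in this note. As a rough illustration of the per-recipe token
  count, here is a minimal Python sketch. It assumes only the file
  format described above ("#" comments, "## S-NNN" recipe headers, a
  leading line number, blank-separated tokens); the function name
  {tokens_per_recipe} and the sample lines are hypothetical, and the
  sketch makes no attempt to classify tokens as "good" or "bad".

```python
import re

# Sketch only: hypothetical re-implementation, not the original tools.
RECIPE_RE = re.compile(r"^##\s*(\S+)")             # "## S-001" -> "S-001"
LINENUM_RE = re.compile(r"^[0-9][-A-Za-z0-9]*\s")  # leading "NNNV-U-LL "

def tokens_per_recipe(lines):
    """Map each recipe ID to its number of blank-separated tokens."""
    counts = {}
    current = None
    for raw in lines:
        line = raw.rstrip("\n")
        header = RECIPE_RE.match(line)
        if header:
            current = header.group(1)
            counts[current] = 0
            continue
        line = line.split("#", 1)[0]     # strip #-comments
        line = LINENUM_RE.sub(" ", line)  # strip the line number
        tokens = line.split()
        if tokens and current is not None:
            counts[current] += len(tokens)
    return counts


if __name__ == "__main__":
    # Invented sample in the assumed format, for illustration only.
    sample = [
        "# a comment line",
        "## S-001",
        "103r-1-01 daiin chedy qokeey",
        "103r-1-02 ar # inline comment",
        "## S-002",
        "103r-1-03 qokeedy ar",
    ]
    counts = tokens_per_recipe(sample)
    total = sum(counts.values())
    print(counts)                        # {'S-001': 4, 'S-002': 2}
    print(len(counts), "recipes,", total, "tokens,",
          f"{total / len(counts):.2f} tokens/recipe")
```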
RECIPE LENGTH HISTOGRAMS

  Plotting the recipe length histograms:

    foreach tw ( tk.3 wd.4 )
      foreach f ( bencao vstars )
        printf "\n%s (%s): " "${f}" "${tw:r}"
        cat ${f}.twct \
          | gawk -v fld="${tw:e}" '/./{ print $(fld); }' \
          | compute-tk-wd-histogram -v quantum=5 \
          > ${f}.${tw:r}h
      end
      foreach fmt ( png )
        plot-twhi -format ${fmt} \
            bencao.${tw:r}h Bencao 1 \
            vstars.${tw:r}h Voynich 2 \
          > recipe-${tw:r}-hist.${fmt}
      end
    end

RECIPE LENGTH PLOTS

  Plotting the recipe lengths as function of position in text:

    foreach fmt ( png )
      foreach f ( bencao.Bencao vstars.Voynich )
        plot-recipe-attr \
            -format ${fmt} \
            ${f:r}.twct "${f:e} (tk)" 3 1 1.0 \
          > ${f:r}-tk-counts.${fmt}
      end
    end

  Ditto, smoothed:

    foreach width ( 09 )
      foreach fmt ( png )
        foreach f ( bencao.Bencao vstars.Voynich )
          foreach type ( avg dif )
            cat ${f:r}.twct \
              | gawk '/./{ print $1, $2, $3; }' \
              | filter-recipe-data -v ${type}=1 -v width=${width} \
              > ${f:r}-${type}${width}.tct
          end
          plot-recipe-attr \
              -format ${fmt} \
              ${f:r}-avg${width}.tct "${f:e} avg${width}" 3 1 1.0 \
              ${f:r}-dif${width}.tct "${f:e} dif${width}" 3 2 60.0 \
            > ${f:r}-tk-counts-dif${width}.${fmt}
        end
      end
    end

COINCIDENCE IMAGES

  Computing the coincidence image:

    foreach width ( 09 )
      foreach et ( 0.5/0.05/avg 0.01/0.01/dif )
        set err = "${et:h}"
        set type = "${et:t}"
        compute-coincidence-image \
            -v absErr=${err:h} -v relErr=${err:t} \
            -v xFile=bencao-${type}${width}.tct -v xField=3 \
            -v yFile=vstars-${type}${width}.tct -v yField=3 \
          | pgmnorm | pnmdepth 255 \
          > recipe-tk-counts-${type}${width}.pgm
        display recipe-tk-counts-${type}${width}.pgm
      end
    end

REPEATED WORDS

  Checking for repeats:

    foreach f ( bencao.big5 vstars.eva )
      printf "\n%s: " "${f:r}"
      cat ${f} \
        | list-repeats \
        > ${f:r}.reps
      cat ${f:r}.reps | wc -l
      cat ${f:r}.reps \
        | gawk '/./{ print $2; }' \
        | sort | uniq -c | expand \
        | map-field \
            -v table=big5-to-html.tbl \
            -v inField=2 -v outField=3 \
            -v forgiving=1 \
        | map-field \
            -v table=html-to-py.tbl \
            -v inField=3 -v outField=4 \
            -v forgiving=1 \
        | gawk '//{ print $1, ($3 "=" $4); }' \
        | sort -b -k1nr -k2 \
        > ${f:r}.rtop
      head -3 ${f:r}.rtop
    end

      bencao: 41
         8 洗=(xi3,xian3)
         6 血=(xue4,xie3)
         5 寒=(han2)

      vstars: 81
        10 qokeedy=qokeedy
        10 qokeey=qokeey
         7 ar=ar

  Build word-paragraph occurrence map.

    foreach f ( bencao.big5 vstars.eva )
      cat ${f} \
        | sed \
            -e 's/^[#][#] */@/' \
            -e 's/[#].*$//' \
            -e 's/^[0-9][-A-Za-z0-9]*[ ]/ /' \
            -e '/^[ ]*$/d' \
        | tr ' ' '\012' \
        | gawk \
            ' BEGIN{ \
                split("", map); \
                split("", wd); nwd=0; split("", wdct); \
                split ("", pg); npg = 0; p = "???"; \
              } \
              /^[@]/ { \
                p = $1; gsub(/[@]/, "", p); \
                pg[npg] = p; npg++; next; \
              } \
              /./ { \
                w = $1; \
                if (! (w in wdct)) \
                  { wd[nwd] = w; nwd++; wdct[w] = 0; } \
                wdct[w]++; map[p,w]++; \
              } \
              END { \
                for (w in wdct) \
                  { printf "%-20s %5d ", w, wdct[w]; \
                    for (i = 0; i < npg; i++) \
                      { p = pg[i]; \
                        if ((p,w) in map) \
                          { printf "%d", map[p,w]; } \
                        else \
                          { printf "."; } \
                      } \
                    printf "\n"; \
                  } \
              } \
            ' \
        | sort -b -k2nr -k1 \
        > ${f:r}.wpm
    end

LULZ

  According to Google Translate, the subsection titles are:

    1.1.001  玉石部上品  yùshí bù shàngpǐn      Top grade jade
    1.2.019  玉石部中品  yùshí bù zhōng pǐn     Jade department middle grade
    1.3.033  玉石部下品  yùshí bùxià pǐn        jade subordinate product
    1.4.044  草部上品    cǎo bù shàngpǐn        Top grade grass
    1.5.102  草部中品    cǎo bù zhōng pǐn       Kusanabe middle grade
    1.6.162  草部下品    cǎo bùxià pǐn          The lowest grade of grass
    1.7.219  木部上品    mù bù shàngpǐn         Top grade wood
    1.8.234  木部中品    mù bù zhōng pǐn        Kibe middle grade
    1.9.253  木部下品    mù bùxià pǐn           Kibe inferior grade
    2.1.001  蟲獸部上品  chóng shòu bù shàngpǐn Top quality insects and beasts
    2.2.017  蟲獸部中品  chóng shòu bù zhōng pǐn Insect and animal department medium quality
    2.3.042  蟲獸部下品  chóng shòu bùxià pǐn   Insect Beast Subordinates
    2.4.069  果菜部上品  guǒcài bù shàngpǐn     Top quality fruits and vegetables department
    2.5.080  果菜部中品  guǒcài bù zhōng pǐn    Medium range of fruits and vegetables
    2.6.087  果菜部下品  guǒcài bùxià pǐn       Fruit and vegetable products
    2.7.091  米穀部上品  mǐgǔ bù shàngpǐn       Top grade rice cereals
    2.8.094  米穀部中品  mǐgǔ bù zhōng pǐn      Mid-grade rice
    2.9.098  米穀部下品  mǐgǔ bùxià pǐn         The inferior product of Rice Valley