Hacking at the Voynich manuscript - Side notes
037 A concordance of the VMs

Last edited on 2001-03-29 09:02:23 by stolfi

The goal of this note is to prepare a full concordance for the VMs,
showing every word or short phrase in its context.

98-09-14 stolfi (redone in part 98-12-31)
===============

I. STRUCTURE OF THE CONCORDANCE

The concordance will be produced in two formats: a single
machine-readable file, and a set of HTML pages (one per word
pattern).

The main intermediate file for a location map is a "raw concordance".
Each line of this file represents one occurrence of some string in
the text, in the format

  LOC TRANS START LENGTH LCTX STRING RCTX PATT STAG PNUM HNUM
   1    2     3     4     5     6      7    8    9    10   11

where

  LOC     is a line locator, like "f1r.11", "f86v2.R1.12a", etc.
  TRANS   is a letter identifying a transcriber, e.g. "F" for Friedman.
  START   is the index of the first byte of the occurrence in the
          text line (counting from 1).
  LENGTH  is the original length of the occurrence in the text,
          including fillers, comments, spaces, etc.
  LCTX    is a word or word sequence, the left context of STRING.
  STRING  is the non-empty string in question, without any fillers,
          comments, non-significant spaces, line breaks, etc.
  RCTX    is a word or word sequence, the right context of STRING.
  PATT    is a sorting pattern derived from STRING; see below.
  STAG    is a tag identifying a section of the VMS, e.g. "hea" or "bio".
  PNUM    is the page's p-number, which is sequential and therefore
          better for sorting.
  HNUM    is an HTML page number; see below.

For EVA-format text, START is relative to column 20. In that case
LENGTH does not include columns 1-19 or "#"-comments; it does include
"{}"-comments, "!" and "%" fillers, and any ASCII blanks beyond
column 20.

In this file the symbol "/" will be used instead of "-" to denote
line breaks, in order to distinguish them from embedded "-" denoting
gaps, vellum defects, intruding figures, etc.

The STRING is a single and whole VMS word, as delimited by the EVA
word separators [-/=,.], or a sequence of two or more consecutive
words, up to a certain maximum length.

The LCTX and RCTX strings consist of zero or more words, including
their word delimiters, that surrounded this occurrence of STRING.
Each context is extended until it includes a specified number of
non-delimiter characters, or hits the boundary of the enclosing unit
or paragraph. At least one word delimiter is always present. The
delimiters that surrounded the STRING in the original text are *not*
included in the string itself, but are included in the context.

The STRING, LCTX, and RCTX fields may extend across comments,
fillers, gaps and intrusions ("-"), and ordinary line breaks ("/");
but not across paragraph breaks ("=") or changes of textual unit.
When the string extends across a line break, the NEWLINE character,
the locators in columns 1-19, and any intervening #-comments are not
included in the string, and not counted in the LENGTH. However, the
EVA codes for word and line break [-/.,] are included in STRING and
counted in LENGTH.
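As an illustration of the word-delimiter rules above, here is a
minimal gawk sketch (only a sketch; the real splitting is done by
words-from-evt and enum-text-phrases) that strips the comments and
fillers from the text part of an EVA line and breaks it into words:

  gawk \
    ' \
      /^[#]/ { next; } \
      /./ { \
        txt = substr($0, 20);            # EVA text starts at column 20 \
        gsub(/[{][^}]*[}]/, "", txt);    # drop "{}"-comments \
        gsub(/[\!%]/, "", txt);          # drop filler characters \
        gsub(/ /, "", txt);              # drop non-significant blanks \
        n = split(txt, w, /[-\/=,.]+/);  # cut at EVA word separators \
        for (i = 1; i <= n; i++) { if (w[i] \!= "") { print w[i]; } } \
      } \
    '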
II. THE INPUT TEXTS

II.1 THE VMS TEXT

The VMS concordance is to be built from a majority version of the EVA
interlinear (derived from release interln1.6e6.evt; see note 045) and
from all major individual versions. Since the enum-text-phrases
script can process only one version at a time, we must run it several
times. We must also turn bad weirdos into "*" (each four-character
weirdo code containing a "*" is replaced by "*" plus three "!"
fillers, preserving line length):

  ln -s ../045/inter-cm.evt

  cat inter-cm.evt \
    | egrep -e '^(|## *)<[^;<>]*(|;[ACDFGHKLTUV])>' \
    | sed \
        -e 's/[&][*][^{}][^{}]/*\!\!\!/g' \
        -e 's/[&][^{}][*][^{}]/*\!\!\!/g' \
        -e 's/[&][^{}][^{}][*]/*\!\!\!/g' \
    | basify-weirdos \
    > vms.evt

II.2 WAR OF THE WORLDS

Creating a control text (Wells's "The War of the Worlds"). I took
chapters 1-16 (about 34000 words) and hacked the text into an
EVMT-like format, as follows.

I added an "=" at the end of each paragraph, with emacs. Then I
regularized the spaces with the following script:

  gawk \
    ' \
      /^[#]/ { print; next; } \
      /^ *$/ { print; next; } \
      /^ *CHAPTER/ { print; next; } \
      /./ { \
        str = tolower($0); gsub(/[0-9]/, "n", str); \
        gsub(/[.;:\!?]/, ".", str); gsub(/[-,/"'"'"'()]/, " ", str); \
        gsub(/[ ]*$/, " ", str); \
        gsub(/[ ]*[.][. ]*/, "-", str); gsub(/[ ]+/, ".", str); \
        gsub(/[-.]*[=][-.]*/, "=", str); gsub(/[.]*[-][.]*/, "-", str); \
        gsub(/^[- .]*/, "", str); print str; \
      } \
    '

Then I added location codes, with the following script:

  gawk \
    ' BEGIN { pg=0; sc=0; } \
      /^[#]/ { print; next; } \
      /^ *$/ { print "#"; \
               if (ln > 0) { un++; if (un > 9) { pg++; un=1; } ln = 0; } \
               next; } \
      /^ *CHAPTER/ { gsub(/^ */, "# ", $0); print; pg++; sc++; un=1; ln=0; next; } \
      /./ { \
        ln++; \
        if ((ln == 1) && (un == 1)) \
          { printf "f%03dr c%02d\n", pg, sc > "wow-fnum-to-sectag.tbl"; } \
        loc = sprintf("<f%03dr.P%d.%02d;W> ", pg, un, ln); \
        gsub(/^[- .,=]*/, loc, $0); print; next; \
      } \
    '

II.3 LEWIS AND CLARK JOURNALS

Creating another control text, with more erratic spelling: the
journals of the Lewis and Clark expedition, by various participants,
abridged and merged by Florentine Films. With emacs, I deleted all
editorial notes, prefixed dates with "@", author names with "%", and
text lines with "|". The result is lac.txt.

Still with emacs, I added an "=" at the end of each paragraph. Then I
converted the text to lowercase, and the spaces to [-=.,], with this
script:

  cat lac.txt \
  | gawk \
      ' \
        /^[#@%]/ { print; next; } \
        /^ *$/ { print; next; } \
        /^[|]/ { \
          str = tolower($0); gsub(/^[|][ ]*/, "", str); gsub(/[0-9]/, "n", str); \
          gsub(/[.;:\!?]/, ".", str); gsub(/[-,/"'"'"'()]/, " ", str); \
          gsub(/[ ]*$/, " ", str); \
          gsub(/[ ]*[.][. ]*/, "-", str); gsub(/[ ]+/, ".", str); \
          gsub(/[-.]*[=][-=.]*/, "=", str); gsub(/[.]*[-][-.]*/, "-", str); \
          gsub(/^[- .]*/, "", str); printf "| %s\n", str; \
        } \
      '

Then I made each entry into a page with the following script:

  gawk \
    ' BEGIN { pg=0; } \
      /^[#]/ { print; next; } \
      /^[@]/ { gsub(/^[@]/, "#", $0); print; next; } \
      /^[%]/ { \
        pg++; ln = 0; tr = substr($0,3); \
        gsub(/^[%]/, "#", $0); print; next; \
      } \
      /^ *$/ { print "#"; next; } \
      /^[|]/ { \
        ln++; gsub(/^[|][ ]*/, "", $0); \
        printf "<f%03d.P.%02d;%s> %s\n", pg, ln, tr, $0; \
      } \
    ' \
  | sed \
      -e 's/;Charles Floyd, Jr.>/;F>/g' \
      -e 's/;John Ordway>/;O>/g' \
      -e 's/;Joseph Whitehouse>/;W>/g' \
      -e 's/;Meriwether Lewis>/;L>/g' \
      -e 's/;Patrick Gass>/;G>/g' \
      -e 's/;William Clark>/;C>/g'

Let's define the sections to be the transcribers:

  cat lac.evt \
  | gawk \
      ' BEGIN { tb["G"]="gas"; tb["L"]="lew"; tb["C"]="cla"; \
                tb["O"]="ord"; tb["F"]="flo"; tb["W"]="whi"; \
        } \
        /[<]/ { \
          gsub(/[>].*/, "", $0); gsub(/[<]/, "", $0); \
          gsub(/[.].*[;]/, " ", $0); if ($2 in tb) { $2 = tb[$2]; } \
          print; next; \
        } \
      ' \
  | sort | uniq \
  > lac-fnum-to-sectag.tbl
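As a quick sanity check (my addition, not part of the original
workflow; the locator shape is assumed from the scripts above), every
non-comment line of lac.evt should now carry a well-formed locator:

  cat lac.evt \
    | egrep -v '^#' \
    | egrep -c -v '^<f[0-9]+[.][^;<>]*;[A-Z]>'

  (should print 0)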
by "-" (no line break), then added line numbers with emacs, and the script in ../../Texts/renumber-evt-lines, and more emacs. Handy variables: ${samples} is the list of samples to process, ${samvers} includes the versions to consider, ${samtitmajs} includes the titles and majority version (if any). set samples = ( vms wow lac eno ) set samcomm = "`echo ${samples} | tr ' ' ','`" set vmsvers = "ACDFGHKLTUV" set wowvers = "W" set lacvers = "CLFGOW" set enovers = "J" set samvers = ( vms.${vmsvers} wow.${wowvers} lac.${lacvers} eno.${enovers} ) set samtitmajs = ( \ vms.VMS/A \ wow.War_of_the_Worlds/W \ lac.Lewis_and_Clark/ \ eno.1_Enoch/J \ ) III VALIDATION AND STATISTICS Checking all versions: foreach sam ( ${samples} ) echo ""; echo "sample = ${sam}" cat ${sam}.evt \ | validate-new-evt-format \ -v checkLineLenths=1 \ -v chars="`cat ${sam}.chars`" \ -v checkLineTerminators=1 \ -v requireUnitHeaders=0 \ >& ${sam}.bugs tail -3 ${sam}.bugs end Word length statistics (to select a suitable phrase length): foreach sam ( ${samples} ) echo ""; echo "sample = ${sam}" cat ${sam}.evt \ | words-from-evt \ | egrep -v '[*?]' \ | count-word-lengths \ > ${sam}.wst end multicol -v titles="${samples}" `echo {${samcomm}}.wst` > .all.wst vms wow lac eno ----------------------------- ----------------------------- ----------------------------- ----------------------------- len nwords example len nwords example len nwords example len nwords example --- ------ ------------------ --- ------ ------------------ --- ------ ------------------ --- ------ ------------------ 1 3379 y 1 1632 a 1 2036 a 1 5 k 2 9436 ol 2 5266 my 2 5936 an 2 471 `1 3 15059 ary 3 7873 had 3 7470 the 3 822 Sdq 4 28359 oror 4 5928 come 4 6452 city 4 2022 lomu 5 40135 sheey 5 3997 which 5 4672 ruins 5 3132 'Inze 6 31165 chckhy 6 2965 before 6 2583 appear 6 2810 teSHfe 7 18422 okeolan 7 2614 brother 7 2214 antient 7 3422 mewa`Il 8 7071 chedkaly 8 1588 stopping 8 1168 numerous 8 2272 we'azman 9 2525 lcheylchy 9 1003 direction 9 691 regularly 9 1775 weyeHewru 10 730 pchallarar 10 547 beginnings 10 401 supporting 10 867 bema`hdere 11 183 darailchedy 11 225 elphinstone 11 166 recollected 11 507 ytwe`hew`hu 12 93 lshedyoraiin 12 116 astonishment 12 84 extraodanary 12 256 weyrE'Iywomu 13 30 cheopolteeedy 13 54 indescribable 13 24 circumstances 13 120 weytwe`hew`hu 14 8 cheoltchedaiin 14 16 disintegrating 14 6 asstonishingly 14 64 'Im'Istnfasomu 15 4 ypchocpheosaiin 15 3 notwithstanding 15 8 notwithstanding 15 15 we'a`I`Smtihomu 16 1 chepchefyshdchdy 16 2 incomprehensible 16 3 counterballanced 16 16 we'itrE'Iywomunu 17 1 chckhoekeckhesshy 17 1 instancetaniously 17 4 webe`slTanatihomu 18 2 weyastebeqWu`Iwomu Thus the longest word in the VMS is 17 characters. Let's adopt that as the maximum phrase width. set maxlen = 17 (Oops, the "eno" sample has an 18-character word. Must remember to allow for that in the formatting...) III. THE BASIC CONCORDANCE Creating the basic concordance, fields 1-7. We can discard all entries that have any "*"s in the STRING field (but "*"s in the context field are OK.). 
echo "maxlen = ${maxlen}" foreach samver ( ${samvers} ) set sam = "${samver:r}" set ver = ( ` echo ${samver:e} | sed -e 's/./& /g'` ) rm -f ${sam}-${maxlen}.roc foreach v ( ${ver} ) echo " "; echo "sample = ${sam} version = ${v}" cat ${sam}.evt \ | egrep '^(|## *)<[^;]*(|;'"${v}"')>' \ | enum-text-phrases -f eva2erg.gawk \ -v maxLength=${maxlen} \ -v leftContext=15 \ -v rightContext=15 \ | gawk '($6 !~ /[*]/){print;}' \ | gzip \ > ${sam}-${maxlen}-${v}.roc.gz echo "kept `zcat ${sam}-${maxlen}-${v}.roc.gz | wc -l` good phrases" end end sample = vms version = A read 38659 words wrote 114954 phrases kept 101945 good phrases sample = vms version = C read 18102 words wrote 54103 phrases kept 52849 good phrases sample = vms version = D read 284 words wrote 803 phrases kept 765 good phrases sample = vms version = F read 33316 words wrote 97933 phrases kept 96487 good phrases sample = vms version = G read 4177 words wrote 12097 phrases kept 11908 good phrases sample = vms version = H read 37919 words wrote 112114 phrases kept 110479 good phrases sample = vms version = K read 258 words wrote 319 phrases kept 317 good phrases sample = vms version = L read 1231 words wrote 3554 phrases kept 3169 good phrases sample = vms version = T read 5360 words wrote 14333 phrases kept 14323 good phrases sample = vms version = U read 11386 words wrote 35057 phrases kept 31739 good phrases sample = vms version = V read 9654 words wrote 30022 phrases kept 28869 good phrases sample = wow version = W read 33829 words wrote 119530 phrases kept 119530 good phrases sample = lac version = C read 7831 words wrote 28859 phrases kept 28859 good phrases sample = lac version = L read 5695 words wrote 20175 phrases kept 20175 good phrases sample = lac version = F read 1398 words wrote 5561 phrases kept 5561 good phrases sample = lac version = G read 4891 words wrote 17548 phrases kept 17548 good phrases sample = lac version = O read 9293 words wrote 36540 phrases kept 36540 good phrases sample = lac version = W read 4807 words wrote 18837 phrases kept 18837 good phrases sample = eno version = J read 18582 words wrote 39799 phrases kept 39799 good phrases Merging the VMS concordances. We sort by location, position and length, then transcriber code, for the sake of the context-replacement step below. zcat vms-${maxlen}-[$vmsvers].roc.gz \ | sort +0 -1 +2 -4 +1 -2 \ | gzip \ > vms-${maxlen}-temp.roc.gz Removing non-redundant entries from the VMS concordance. We replace the context strings by those of the majority version, if available, in order to reduce the number or entries in later steps.. Then we condense all entries that have the same location and contents, and differ only in transcriber code. 
  zcat vms-${maxlen}-temp.roc.gz \
    | gawk \
        ' \
          ($2 == "A") { oloc=$1; opos=$3; olen=$4; lc=$5; rc=$7; print; next; } \
          (($1==oloc) && ($3==opos) && ($4==olen)) { $5=lc; $7=rc; print; } \
        ' \
    | sort +0 -1 +2 -7 +1 -2 \
    | remove-redundant-roc-entries \
    | gzip \
    > vms-${maxlen}.roc.gz

    404905 records read
    267200 records ignored
    137705 records written

Pretending to do the same for the other samples:

  zcat wow-${maxlen}-[$wowvers].roc.gz \
    | sort +0 -1 +2 -7 +1 -2 \
    | gzip \
    > wow-${maxlen}.roc.gz

  zcat lac-${maxlen}-[$lacvers].roc.gz \
    | sort +0 -1 +2 -7 +1 -2 \
    | gzip \
    > lac-${maxlen}.roc.gz

  zcat eno-${maxlen}-[$enovers].roc.gz \
    | sort +0 -1 +2 -7 +1 -2 \
    | gzip \
    > eno-${maxlen}.roc.gz

Checking the word length distribution, to make sure that we didn't
lose anything:

  foreach sam ( ${samples} )
    echo " "; echo "sample = ${sam}"
    zcat ${sam}-${maxlen}.roc.gz \
      | gawk '($6 \!~ /[-/.,=]/) { print $6; }' \
      | count-word-lengths \
      > .${sam}.wst
  end

  multicol -v titles="${samples}" `echo .{${samcomm}}.wst` > .new.wst

The next step is to add to each accepted line of the concordance an
8th field, PATT, which is obtained from STRING by identifying similar
characters, removing spaces and q's, etc.:

  foreach sam ( ${samples} )
    echo " "; echo "sam = ${sam}"
    zcat ${sam}-${maxlen}.roc.gz \
      | add-${sam}-match-key -f eva2erg.gawk \
          -v inField=6 \
          -v outField=8 \
      | gzip \
      > ${sam}-${maxlen}.poc.gz
    echo "`zcat ${sam}-${maxlen}.poc.gz | wc` ${sam}-${maxlen}.poc"
  end

      lines    words     bytes  file
    -------  -------  --------  ------------
     137705  1101640  10955849  vms-17.poc
     119530   956240   9499018  wow-17.poc
     127520  1020160   9551708  lac-17.poc
      39799   318392   3107113  eno-17.poc

Checking for empty patterns:

  foreach sam ( ${samples} )
    echo " "; echo "${sam}"
    zcat ${sam}-${maxlen}.poc.gz \
      | gawk '(NF \!= 8){ print; }'
  end

  (empty output)

Let's collect the patterns and count their frequencies:

  foreach sam ( ${samples} )
    echo " "; echo "${sam}"
    zcat ${sam}-${maxlen}.poc.gz \
      | gawk '/./{ p=($1 ":" $3 ":" $4); if (p \!= op) { print $8; } op=p; }' \
      | sort | uniq -c | expand \
      | sort +0 -1nr +1 -2 \
      > ${sam}-${maxlen}.pfr
    cat ${sam}-${maxlen}.pfr \
      | gawk '($1 > 1){ print; }' \
      > ${sam}-${maxlen}-nonu.pfr
    dicio-wc ${sam}-${maxlen}{,-nonu}.pfr
  end

      lines    words     bytes  file
    -------  -------  --------  ----------------
      48539    97078   1028787  vms-17.pfr
       6778    13556    123522  vms-17-nonu.pfr
      73545   147090   1530446  wow-17.pfr
       8121    16242    140188  wow-17-nonu.pfr
      68984   137968   1171549  lac-17.pfr
       9978    19956    150836  lac-17-nonu.pfr
      23328    46656    464196  eno-17.pfr
       3602     7204     61830  eno-17-nonu.pfr
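The PATT construction rules live in eva2erg.gawk and the
add-*-match-key scripts, which are not reproduced in this note. The
following gawk function is only a sketch of the kind of normalization
involved; the specific identifications are illustrative, not the real
table:

  function match_key(s)
  {
    gsub(/[-\/=,.]/, "", s);    # remove word and line delimiters
    gsub(/q/, "", s);           # remove q's
    gsub(/ckh/, "kh", s);       # identify similar characters...
    gsub(/[ij]/, "i", s);       # ...(illustrative choices only)
    return s;
  }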
Then we append the FNUM temporarily as fields 9 and 10, and use the
first copy to insert the section tag STAG (hea, heb, bio, etc.) as
the final field 9, and the second copy to insert the page's p-number
PNUM as field 10:

  foreach sam ( ${samples} )
    echo " "; echo "${sam}"
    zcat ${sam}-${maxlen}.poc.gz \
      | gawk '/./{ f=$1; gsub(/[.].*$/,"",f); print $0, f, f; }' \
      | map-field \
          -v inField=9 \
          -v outField=9 \
          -v table=${sam}-fnum-to-sectag.tbl \
      | map-field \
          -v inField=10 \
          -v outField=10 \
          -v table=${sam}-fnum-to-pnum.tbl \
      | gawk '/./{ NF = 10; print; }' \
      | gzip \
      > ${sam}-${maxlen}.soc.gz
    echo "`zcat ${sam}-${maxlen}.soc.gz | wc` ${sam}-${maxlen}.soc"
  end

      lines    words     bytes  file
    -------  -------  --------  ------------
     137705  1377050  12195194  vms-17.soc
     119530  1195300  10574788  wow-17.soc
     127520  1275200  10699388  lac-17.soc
      39799   397990   3465304  eno-17.soc

Next we sort the concordance by pattern, actual string (minus
blanks), section, and page. For this purpose we add a temporary 11th
field, which is just STRING, RCTX, and LCTX concatenated without
blanks; this field is removed after the sort. (The "+m -n" pairs
below are the old zero-based sort key syntax: "+7 -8" selects field 8
(PATT), "+10 -11" field 11 (the temporary key), "+8 -9" field 9
(STAG), and "+9 -10" field 10 (PNUM).)

  foreach sam ( ${samples} )
    echo " "; echo "${sam}"
    zcat ${sam}-${maxlen}.soc.gz \
      | gawk \
          ' /./{ \
              key = gensub(/[-/=,. ]/,"","g",($6 $7 $5)); \
              print $0, key; } \
          ' \
      | sort +7 -8 +10 -11 +8 -9 +9 -10 \
      | gawk '/./{ NF = 10; print; }' \
      | gzip \
      > ${sam}-${maxlen}-srt.soc.gz
    echo "`zcat ${sam}-${maxlen}-srt.soc.gz | wc` ${sam}-${maxlen}-srt.soc"
  end

      lines    words     bytes  file
    -------  -------  --------  ----------------
     137705  1377050  12195194  vms-17-srt.soc
     119530  1195300  10574788  wow-17-srt.soc
     127520  1275200  10699388  lac-17-srt.soc
      39799   397990   3465304  eno-17-srt.soc

Let's count the number of entries according to the initial letter of
the pattern:

  foreach sam ( ${samples} )
    echo " "; echo "${sam}"; echo " "
    zcat ${sam}-${maxlen}-srt.soc.gz \
      | gawk '//{ print substr($8, 1, 1); }' \
      | sort | uniq -c | expand \
      > .${sam}.lstat
  end

  multicol -v titles="${samples}" `echo .{${samcomm}}.lstat` > .all.lstat

      vms          wow          lac          eno
    ---------    ---------    ---------    ---------
    14454 d      15973 d       2089 &       1528 0
    44363 e       6278 e       5941 b          2 a
       48 i      11239 i       3208 d       2722 b
     4921 l      18062 n       8500 e        723 d
       36 n      24588 o       5601 f       6871 e
    62935 o       5941 p       2155 g       4009 j
       15 q       2590 r       6323 h       4900 k
    10766 t       8958 s        758 j       2559 l
       90 v      25892 t       6133 k       2917 m
       77 x          9 z       2616 l        504 n
                               9537 n        573 r
                              37502 o       2097 s
                               3762 p       1179 t
                               2620 r         46 u
                               7148 s       9169 v
                              22227 t       1400 x
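As a cross-check (not in the original run), the per-letter counts in
each .lstat file should add up to the record counts of the
corresponding -srt.soc files (137705, 119530, 127520, 39799):

  foreach sam ( ${samples} )
    echo -n "${sam}  "
    cat .${sam}.lstat \
      | gawk '{ n += $1; } END { print n; }'
  end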
There are too many strings to build a single HTML file with the
concordance. So let's extract only those patterns that have at least
two entries, or whose STRING is a single word:

  foreach sam ( ${samples} )
    echo " "; echo "${sam}"
    zcat ${sam}-${maxlen}-srt.soc.gz \
      | mark-interesting-patterns \
      | gawk '($11 == "+"){ NF = 10; print; }' \
      | gzip \
      > ${sam}-${maxlen}-ok.soc.gz
    echo "`zcat ${sam}-${maxlen}-ok.soc.gz | wc` ${sam}-${maxlen}-ok.soc"
  end

    vms
      77322 records marked "+"
      60383 records marked "-"
      77322   773220  6356963 vms-17-ok.soc

    wow
      56292 records marked "+"
      63238 records marked "-"
      56292   562920  4506071 wow-17-ok.soc

    lac
      69402 records marked "+"
      58118 records marked "-"
      69402   694020  5401726 lac-17-ok.soc

    eno
      23628 records marked "+"
      16171 records marked "-"
      23628   236280  1922173 eno-17-ok.soc

Let's compute the maximum size of each field, with dots and all:

  foreach field ( 1 2 5 6 7 8 9 )
    echo " "
    echo "=== field ${field} ==="
    foreach sam ( ${samples} )
      echo " "; echo "${sam}"
      zcat ${sam}-${maxlen}-ok.soc.gz \
        | gawk -v n=${field} '/./ { print $(n); }' \
        | count-word-lengths \
        > .${sam}-${field}.szs
    end
    echo "field ${field} ===" > .all-${field}.szs
    echo ' ' >> .all-${field}.szs
    multicol -v titles="${samples}" `echo .{${samcomm}}-${field}.szs` \
      >> .all-${field}.szs
  end

    field 1 - max 12
    field 2 - max 6
    field 5 - max 33, can be clipped at 20 or less.
    field 6 - max 24
    field 7 - max 32, can be clipped at 20 or less.
    field 8 - max 17

I recorded these numbers as shell variables:

  set maxlft = 32
  set maxstr = 24
  set maxrht = 33

V. THE HTML CONCORDANCE

Let's split the HTML-formatted concordance into pages of about 1000
lines each. We add the 11th field, HNUM, which is the HTML page
number (counting from 0):

  set pgsize = 1000; echo "pgsize = $pgsize"
  foreach sam ( ${samples} )
    echo " "; echo "${sam}"
    zcat ${sam}-${maxlen}-ok.soc.gz \
      | gawk -v pgsize=${pgsize} \
          ' /./{ \
              pat = $8; \
              if ((pat \!= ppat) && (NR-ndone > pgsize)) \
                { pg++; ndone = NR-1; } \
              printf "%s %03d\n", $0, pg; \
              ppat = pat; \
            } \
          ' \
      | gzip \
      > ${sam}-${maxlen}-ok.hoc.gz
    echo "`zcat ${sam}-${maxlen}-ok.hoc.gz | wc` ${sam}-${maxlen}-ok.hoc"
  end

      lines    words    bytes  file
    -------  -------  -------  ------------
      77322   850542  6666251  vms-17-ok.hoc
      56292   619212  4731239  wow-17-ok.hoc
      69402   763422  5679334  lac-17-ok.hoc
      23628   259908  2016685  eno-17-ok.hoc

Checking the number of strings per page:

  foreach sam ( ${samples} )
    echo " "; echo "${sam}"
    zcat ${sam}-${maxlen}-ok.hoc.gz \
      | gawk '/./{ print $11; }' \
      | sort | uniq -c | expand \
      > .${sam}.pgsizes
  end

  multicol -v titles="${samples}" `echo .{${samcomm}}.pgsizes`

       vms           wow           lac           eno
    ----------    ----------    ----------    ----------
    1295 000      1014 000      1106 000      1037 000
    1002 001      1000 001      1000 001      1002 001
    1002 002      1075 002      1004 002      1010 002
    1244 003      1250 003      1052 003      1041 003
    1004 004      1022 004      1001 004      1006 004
    1003 005      1038 005      1009 005      1000 005
    1002 006      1001 006      1167 006      1000 006
    1001 007      1000 007      1092 007      1001 007
    1300 008      1001 008      1107 008      1000 008
    1001 009      1001 009      1123 009      1000 009
    1001 010      1378 010      1158 010      1003 010
    1006 011      1001 011      1013 011      1093 011
    1061 012      1000 012      1064 012      1007 012
    1146 013      1003 013      1000 013      1001 013
    1037 014      1033 014      1002 014      1002 014
    1008 015      1002 015      1004 015      1001 015
    1012 016      1215 016      1006 016      1001 016
    1024 017      1013 017      1052 017      1056 017
    1002 018      1021 018      1003 018      1000 018
    1058 019      1128 019      1036 019      1000 019
    1018 020      1303 020      1000 020      1000 020
    1000 021      1749 021      1001 021      1000 021
    1033 022      1027 022      1049 022      1000 022
    1019 023      2079 023      1014 023       367 023
    1005 024      1003 024      1008 024
    1000 025      1001 025      1024 025
    1000 026      1000 026      2714 026
    1004 027      1133 027      2280 027
    1002 028      2003 028      1008 028
    1089 029      1479 029      1001 029
    1002 030      1001 030      1002 030
    1010 031      1009 031      1121 031
    1001 032      1003 032      2477 032
    1016 033      1020 033      1003 033
    1576 034      1001 034      1005 034
    1001 035      1108 035      1032 035
    1019 036      1000 036      1004 036
    1003 037      3720 037      1618 037
    1002 038      1003 038      1000 038
    1003 039      1000 039      1144 039
    1588 040      1147 040      1000 040
    1001 041      1001 041      1002 041
    1000 042      1170 042      1016 042
    1003 043      1061 043      1000 043
    1007 044      1400 044      1001 044
    1686 045      1001 045      1029 045
    1018 046      1026 046      1050 046
    1002 047       648 047      1001 047
    1000 048                    3479 048
    1021 049                    1063 049
    1001 050                    1015 050
    1330 051                    1102 051
    1002 052                    1000 052
    1000 053                    1011 053
    1295 054                    1314 054
    1135 055                    1005 055
    1134 056                    1000 056
    1595 057                    1001 057
    1001 058                    1020 058
    2680 059                     789 059
    1010 060
    1000 061
    1067 062
    1013 063
    1003 064
    1003 065
    1150 066
    1002 067
    1000 068
    1005 069
     558 070
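Just as a sanity check (again my addition, not in the original run),
the smallest and largest page sizes per sample can be pulled from the
.pgsizes files:

  foreach sam ( ${samples} )
    echo -n "${sam}  "
    cat .${sam}.pgsizes \
      | gawk 'BEGIN { mn = 999999; mx = 0; } \
              { if ($1+0 < mn) { mn = $1+0; } \
                if ($1+0 > mx) { mx = $1+0; } } \
              END { print "min =", mn, "max =", mx; }'
  end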
Now let's split the file into pages:

  foreach sam ( ${samples} )
    echo " "; echo "${sam}"
    set dir = "${sam}-pages-new"
    mkdir ${dir}
    rm -f /tmp/${sam}-???.hoc
    zcat ${sam}-${maxlen}-ok.hoc.gz \
      | gawk -v sam=${sam} \
          ' BEGIN { ppg = ""; } \
            /./{ \
              pg = $11; \
              if (pg \!= ppg) \
                { if (ppg \!= "") { close(wr); } \
                  wr = ("/tmp/" sam "-" pg ".hoc"); ppg = pg; \
                  printf "%s\n", pg; \
                } \
              print > wr; \
            } \
            END { if (ppg \!= "") { close(wr); } } \
          ' \
      > ${sam}-pages-new/all.nums
    cat ${sam}-pages-new/all.nums
  end

Let's format the pages. First a single-page test:

  cat /tmp/vms-005.hoc \
    | format-new-concordance \
        -v title="Test Concordance - Section 005" \
        -v prevPage=004 -v nextPage=006 \
        -v maxLeft=20 -v maxString=${maxstr} -v maxRight=20 \
        -v majTran=A \
    > test.html

Then the full run:

  foreach samtitmaj ( ${samtitmajs} )
    set samtit = "${samtitmaj:h}"
    set maj = "${samtitmaj:t}"
    set sam = "${samtit:r}"
    set tit = "`echo ${samtit:e} | tr '_' ' '`"
    echo " "; echo "${sam} (${tit}) maj = '${maj}'"
    rm -f ${sam}-pages-new/*.html
    set pgs = ( `cat ${sam}-pages-new/all.nums` )
    set nxs = ( $pgs[2-] )
    set prev = ""
    foreach page ( $pgs )
      if ( $#nxs > 0 ) then
        set next = "$nxs[1]"; shift nxs
      else
        set next = ""
      endif
      echo "${page} prev = $prev next = $next"
      cat /tmp/${sam}-${page}.hoc \
        | format-new-concordance \
            -v title="${tit} Concordance - Section ${page}" \
            -v prevPage="$prev" -v nextPage="$next" \
            -v maxLeft=20 -v maxString=${maxstr} -v maxRight=20 \
            -v majTran="${maj}" \
        > ${sam}-pages-new/${page}.html
      set prev = "$page"
    end
  end

Let's create the alphabetic index file:

  foreach samtitmaj ( ${samtitmajs} )
    set samtit = "${samtitmaj:h}"
    set sam = "${samtit:r}"
    set tit = "`echo ${samtit:e} | tr '_' ' '`"
    echo " "; echo "${sam} (${tit})"
    zcat ${sam}-${maxlen}-ok.hoc.gz \
      | gawk '/./{ print $6,$11,$8,gensub(/[-/=,.]/, "", "g", $6); }' \
      | sort +3 -4 +1 -2n \
      | uniq \
      | gawk '/./{ NF=3; print; }' \
      | format-string-index \
          -title "${tit} Concordance - Word and Phrase index" \
      > ${sam}-pages-new/index.html
  end

VI. PUBLISHING

Preparing a compressed archive:

  foreach sam ( ${samples} )
    echo " "; echo "${sam}"
    rm -f ${sam}-pages-new/all.zip
    ( cd ${sam}-pages-new && zip -klv all index.html ???.html )
  end

Preparing a zip-compressed version of the machine concordance, for
the benefit of Windows/DOS users:

  zcat vms-${maxlen}-ok.soc.gz | zip vms-${maxlen}-ok.zip -

Installing the pages:

  foreach sam ( ${samples} )
    echo " "; echo "${sam}"
    mv ${sam}-pages ${sam}-pages-old
    mv ${sam}-pages-new ${sam}-pages
    touch ${sam}-pages/.www_browsable
  end

See Analysis-dam-draft.txt