Created word file for kwic-index: extract-words-from-interlin \ -chars 'aoeilmnrchtpkfsqjdvxyg' \ Note-010/vtx-f-eva.evt \ Note-010/vtx-f-eva lines words bytes file ------ ------- --------- ------------ 3890 33945 203820 Note-010/vtx-f-eva.txt 37462 37462 210854 Note-010/vtx-f-eva.wds 6711 6711 48108 Note-010/vtx-f-eva.dic 33026 33026 201154 Note-010/vtx-f-eva-gut.wds 6526 6526 46995 Note-010/vtx-f-eva-gut.dic 4206 4206 8412 Note-010/vtx-f-eva-fun.wds 3 3 6 Note-010/vtx-f-eva-fun.dic 230 230 1288 Note-010/vtx-f-eva-bad.wds 182 182 1107 Note-010/vtx-f-eva-bad.dic Sample from Note-010/vtx-f-eva.txt: fyays ykal ar ytaiin shol shory ?k?res ykor sholdy sory ckhar ory kair chtaiin shor ar cthar cthar dana syaiir sheky or ykaiin shod cthoary cthes daraiin sy ?oiin oteey otear roloty cthaar daiin okaiin or okan sairy chear cthaiin cphar cfhaiin = odar shy shol cphoy oydar sh s cfhoaiin shodary yshey shody okcho y otchol chocthy oschy dain chor kos daiin shor cfhol shody = Digraph counts: TT . / = a o e i l m n r c h t p k f s q j d x y g ? - ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- . 29062 . . . 1436 6098 87 6 1184 4 4 449 6069 . 604 172 1131 76 3508 4639 . 2605 3 916 4 67 . / 3213 . 11 1 15 475 5 31 66 29 . 17 138 . 199 49 26 5 557 460 1 572 1 532 . 23 . = 688 . . . . 20 1 . 2 2 . . 14 . 164 319 77 27 15 24 . 12 . 10 . 1 . a 12668 83 59 3 8 11 7 6031 2585 686 86 2820 42 . 19 9 49 4 41 3 . 68 1 20 . 26 7 o 21135 931 50 5 249 67 282 191 4886 153 10 2356 522 . 3114 486 5322 123 377 22 . 1827 6 117 . 34 5 e 17145 87 6 1 436 2693 4267 14 6 16 15 22 371 . 150 63 355 31 336 . 1 4669 . 3591 2 13 . i 11602 6 . . 1 3 4 5274 28 12 5566 607 10 . 2 2 49 . 20 . . 12 1 2 . 3 . l 9070 4856 385 52 325 482 43 2 37 10 . 30 648 . 87 36 908 26 363 3 . 374 3 361 . 3 36 m 951 298 546 67 7 6 . . . 2 . . 2 . . . . . 1 . . 6 . 1 . 2 13 n 5684 4936 416 159 9 14 . . 10 7 1 14 9 . 
1 . 7 . 8 1 . 21 . 26 . 2 43 r 6387 4430 281 60 750 318 14 22 14 3 . 3 121 . 2 2 7 1 46 . . 39 3 245 . 4 22 c 12069 . . . . . . . . . . . . 9893 954 219 929 74 . . . . . . . . . h 16238 57 8 1 752 3661 7467 7 67 5 . 23 626 . 96 25 196 7 106 1 . 979 4 2130 . 16 4 t 5897 39 3 1 1434 554 1377 1 11 . . . 906 954 1 1 1 . 166 1 . 17 . 420 2 7 1 p 1468 28 1 . 159 207 6 1 . . . . 685 219 1 . 1 . 69 . . 28 . 61 . 2 . k 9688 69 23 1 2777 570 3392 9 32 1 . 2 988 929 1 1 . 1 208 . . 10 . 663 . 8 3 f 392 15 3 . 52 43 3 . . 1 . . 154 74 . . . . 15 . . 6 . 25 . 1 . s 6231 707 100 16 596 330 35 1 5 2 . 1 82 4169 1 4 13 2 33 . . 16 . 97 1 3 17 q 5180 2 . . 6 5052 39 . 2 . 1 . 32 . 7 2 24 . 2 . . 1 . 8 . 2 . j 2 1 . . . . . . . . . . . . . . . . 1 . . . . . . . . d 11515 420 108 7 3608 389 95 4 70 9 . 10 306 . 7 7 22 . 167 . . 23 . 6231 . 11 21 x 22 6 . . 5 8 . . . . . . . . . . . . . . . . . 3 . . . y 15521 12018 1193 311 20 49 8 . 49 7 . 19 272 . 480 70 562 15 139 12 . 161 . 2 . 1 133 g 10 . . . . . . . 5 . . . . . 3 . . . . . . . . . . 2 . ? 311 73 19 4 16 18 13 5 7 2 . 11 17 . 3 1 6 . 9 . . 10 . 16 1 80 . - 305 . . . 7 67 . 3 4 . 1 3 55 . 1 . 3 . 44 14 . 59 . 44 . . . ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- TOT 202454 29062 3213 688 12668 21135 17145 11602 9070 951 5684 6387 12069 16238 5897 1468 9688 392 6231 5180 2 11515 22 15521 10 311 305 Next-symbol probability (× 99): . / = a o e i l m n r c h t p k f s q j d x y g ? - -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- . . . . 5 21 . . 4 . . 2 21 . 2 1 4 . 12 16 . 9 . 3 . . . / . . . . 15 . 1 2 1 . 1 4 . 6 2 1 . 17 14 . 18 . 16 . 1 . = . . . . 3 . . . . . . 2 . 24 46 11 4 2 3 . 2 . 1 . . . a 1 . . . . . 47 20 5 1 22 . . . . . . . . . 1 . . . . . o 4 . . 1 . 1 1 23 1 . 11 2 . 15 2 25 1 2 . . 9 . 1 . . . e 1 . . 3 16 25 . . . . . 2 . 1 . 2 . 2 . . 27 . 21 . . . i . . . . . . 
45 . . 47 5 . . . . . . . . . . . . . . . l 53 4 1 4 5 . . . . . . 7 . 1 . 10 . 4 . . 4 . 4 . . . m 31 57 7 1 1 . . . . . . . . . . . . . . . 1 . . . . 1 n 86 7 3 . . . . . . . . . . . . . . . . . . . . . . 1 r 69 4 1 12 5 . . . . . . 2 . . . . . 1 . . 1 . 4 . . . c . . . . . . . . . . . . 81 8 2 8 1 . . . . . . . . . h . . . 5 22 46 . . . . . 4 . 1 . 1 . 1 . . 6 . 13 . . . t 1 . . 24 9 23 . . . . . 15 16 . . . . 3 . . . . 7 . . . p 2 . . 11 14 . . . . . . 46 15 . . . . 5 . . 2 . 4 . . . k 1 . . 28 6 35 . . . . . 10 9 . . . . 2 . . . . 7 . . . f 4 1 . 13 11 1 . . . . . 39 19 . . . . 4 . . 2 . 6 . . . s 11 2 . 9 5 1 . . . . . 1 66 . . . . 1 . . . . 2 . . . q . . . . 97 1 . . . . . 1 . . . . . . . . . . . . . . j 50 . . . . . . . . . . . . . . . . 50 . . . . . . . . d 4 1 . 31 3 1 . 1 . . . 3 . . . . . 1 . . . . 54 . . . x 27 . . 23 36 . . . . . . . . . . . . . . . . . 13 . . . y 77 8 2 . . . . . . . . 2 . 3 . 4 . 1 . . 1 . . . . 1 g . . . . . . . 50 . . . . . 30 . . . . . . . . . . 20 . ? 23 6 1 5 6 4 2 2 1 . 4 5 . 1 . 2 . 3 . . 3 . 5 . 25 . - . . . 2 22 . 1 1 . . 1 18 . . . 1 . 14 5 . 19 . 14 . . . -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- TOT 14 2 0 6 10 8 6 4 0 3 3 6 8 3 1 5 0 3 3 0 6 0 8 0 0 0 Previous-symbol probability (× 99): TT . / = a o e i l m n r c h t p k f s q j d x y g ? - -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- . 14 . . . 11 29 1 . 13 . . 7 50 . 10 12 12 19 56 89 . 22 13 6 40 21 . / 2 . . . . 2 . . 1 3 . . 1 . 3 3 . 1 9 9 50 5 5 3 . 7 . = 0 . . . . . . . . . . . . . 3 22 1 7 . . . . . . . . . a 6 . 2 . . . . 51 28 71 1 44 . . . 1 1 1 1 . . 1 5 . . 8 2 o 10 3 2 1 2 . 2 2 53 16 . 37 4 . 52 33 54 31 6 . . 16 27 1 . 11 2 e 8 . . . 3 13 25 . . 2 . . 3 . 3 4 4 8 5 . 50 40 . 23 20 4 . i 6 . . . . . . 45 . 1 97 9 . . . . 1 . . . . . 5 . . 1 . l 4 17 12 7 3 2 . . . 1 . . 5 . 1 2 9 7 6 . . 3 13 2 . 1 12 m 0 1 17 10 . . . . . . . . . . . . . . . . . . . . . 1 4 n 3 17 13 23 . . . . . 
1 . . . . . . . . . . . . . . . 1 14
r 3 15 9 9 6 1 . . . . . . 1 . . . . . 1 . . . 13 2 . 1 7
c 6 . . . . . . . . . . . . 60 16 15 9 19 . . . . . . . . .
h 8 . . . 6 17 43 . 1 1 . . 5 . 2 2 2 2 2 . . 8 18 14 . 5 1
t 3 . . . 11 3 8 . . . . . 7 6 . . . . 3 . . . . 3 20 2 .
p 1 . . . 1 1 . . . . . . 6 1 . . . . 1 . . . . . . 1 .
k 5 . 1 . 22 3 20 . . . . . 8 6 . . . . 3 . . . . 4 . 3 1
f 0 . . . . . . . . . . . 1 . . . . . . . . . . . . . .
s 3 2 3 2 5 2 . . . . . . 1 25 . . . 1 1 . . . . 1 10 1 6
q 3 . . . . 24 . . . . . . . . . . . . . . . . . . . 1 .
j 0 . . . . . . . . . . . . . . . . . . . . . . . . . .
d 6 1 3 1 28 2 1 . 1 1 . . 3 . . . . . 3 . . . . 40 . 4 7
x 0 . . . . . . . . . . . . . . . . . . . . . . . . . .
y 8 41 37 45 . . . . 1 1 . . 2 . 8 5 6 4 2 . . 1 . . . . 43
g 0 . . . . . . . . . . . . . . . . . . . . . . . . 1 .
? 0 . 1 1 . . . . . . . . . . . . . . . . . . . . 10 25 .
- 0 . . . . . . . . . . . . . . . . . 1 . . 1 . . . . .
-- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --

Symbol entropy: 3.994
Next-symbol entropy: 2.197

STOPPED HERE

      | sed -e 's/<\(f[^>]*\.[^>]*\)\.[^>]*>/\1/' \
      | map-field \
          -v field=1 \
          -v table=Note-010/unit-to-division.tbl \
      | sort +0 -1 +2 -3n \
      | gawk \
          -v blocksize=${blocksize} \
          ' /./ { \
              divn=$1; t++; \
              if ((divn != odivn)||(t > blocksize)) \
                {b++; t=1; odivn=divn; } \
              print $4, b; \
            } \
          ' \
      > Note-010/vtx-f-eva-

      | sort | uniq -c | expand

===break here 97-11-17 stolfi ===============

[ This part was later redone in a different manner, see above ]

I wrote a script that reads a list of words and an ".evt" interlinear
file, looks up the former in the latter, and for each occurrence found
writes one line. Each output line begins with the location of the text
line, followed by the fields

    OFFSET POS WORD OCC

where

    OFFSET is the byte offset of the occurrence, from col 20.
    POS is the amount of text preceding the occurrence.
    WORD is the target word that was found to occur there.
    OCC is the actual variant of WORD that is there.

The script ignores fillers and comments in the comparison. It has
options to use an error-tolerant recoding before the comparison and/or
to consider only matches of whole words.

The POS field is a running count of "matching occasions", normally EVA
characters, from the beginning of the file to this occurrence,
exclusive. If the error-tolerant option is used, the count refers to
characters of the reduced coding. If matching as words, it counts whole
words.

Note that multiple versions (Currier, Friedman, etc.) of the same line
are counted as distinct lines in POS. Thus the input had better have a
single transcription of each line.

Here is a quick test:

    /bin/rm -f Note-010/test-w?-f?.occ
    foreach w ( 0 1 )
      foreach f ( 0 1 )
        echo "=== aswords=$w forgiving=$f ==="
        cat Note-010/vtx-f-eva.evt \
          | head -10 \
          | find-occurrences -f eva2erg.gawk \
              -v show=1 \
              -v aswords=${w} \
              -v ignoreq=1 \
              -v forgiving=${f} \
              -v wordfile=Note-010/test.dic \
          >& Note-010/test-w${w}-f${f}.shocc
      end
    end
    dicio-wc Note-010/test-w{0,1}-f{0,1}.shocc

    === aswords=0 forgiving=0 ===
    === aswords=0 forgiving=1 ===
    === aswords=1 forgiving=0 ===
    === aswords=1 forgiving=1 ===

     lines   words     bytes file
    ------ ------- --------- ------------
        39      83      2331 Note-010/test-w0-f0.shocc
        53     126      3134 Note-010/test-w0-f1.shocc
        35      75      2215 Note-010/test-w1-f0.shocc
        36      92      2236 Note-010/test-w1-f1.shocc

A few words account for a large proportion of the occurrences,
especially when aswords=0. For this kind of search we should use a
reduced word list.

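To make the difference between the two matching modes concrete, here is a minimal Python sketch of how the POS count and the matches might behave. It is not the find-occurrences script itself: the function name `find_occurrences_sketch`, the use of "." as the only word separator, and the sample line are assumptions made for illustration, and the sketch leaves out the error-tolerant recoding, the "q" stripping, and the handling of fillers and comments.

```python
def find_occurrences_sketch(lines, words, aswords=False):
    """Return (pos, word) pairs; pos counts the matching sites before the match.

    Hypothetical stand-in for find-occurrences: no recoding, no "q" stripping,
    no filler or comment handling; words are assumed to be separated by ".".
    """
    targets = sorted(set(words))
    pos = 0          # running count of matching sites seen so far
    found = []
    for line in lines:
        for token in filter(None, line.split(".")):
            if aswords:
                # One matching site per word; only exact whole-word matches count.
                if token in targets:
                    found.append((pos, token))
                pos += 1
            else:
                # One matching site per character; a target may start anywhere.
                for i in range(len(token)):
                    for w in targets:
                        if token.startswith(w, i):
                            found.append((pos + i, w))
                pos += len(token)
    return found


sample = ["daiin.okaiin.dy", "qokal.ar"]
targets = ["aiin", "ar", "dy"]
print(find_occurrences_sketch(sample, targets, aswords=True))
print(find_occurrences_sketch(sample, targets, aswords=False))
```

In substring mode the short target "aiin" is found inside both "daiin" and "okaiin", which is the kind of inflation that short, common words cause when aswords=0.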
Let's count the occurrences in ~100 lines from Note-010/vtx-f-eva.evt:

    foreach kind ( labels titles )
      foreach w ( 0 1 )
        foreach f ( 0 1 )
          echo " "
          echo "=== kind=${kind} aswords=$w forgiving=$f ==="
          cat Note-010/vtx-f-eva.evt \
            | grep -v '^#' \
            | gawk '((NR % 39) == 17) {print;}' \
            | find-occurrences -f eva2erg.gawk \
                -v aswords=${w} \
                -v ignoreq=1 \
                -v forgiving=${f} \
                -v wordfile=Note-010/${kind}.dic \
            > Note-010/${kind}-test-w${w}-f${f}.occ
        end
      end
      dicio-wc Note-010/${kind}-test-w{0,1}-f{0,1}.occ
    end

    === kind=labels aswords=0 forgiving=0 ===
    loaded 427 words
    tested 4322 potential matching sites
    found 2007 occurrences

    === kind=labels aswords=0 forgiving=1 ===
    loaded 427 words
    tested 4182 potential matching sites
    found 9919 occurrences

    === kind=labels aswords=1 forgiving=0 ===
    loaded 427 words
    tested 824 potential matching sites
    found 267 occurrences

    === kind=labels aswords=1 forgiving=1 ===
    loaded 427 words
    tested 791 potential matching sites
    found 1081 occurrences

     lines   words     bytes file
    ------ ------- --------- ------------
      2007   10035     56635 Note-010/labels-test-w0-f0.occ
      9919   49595    292167 Note-010/labels-test-w0-f1.occ
       267    1335      8419 Note-010/labels-test-w1-f0.occ
      1081    5405     34510 Note-010/labels-test-w1-f1.occ

    === kind=titles aswords=0 forgiving=0 ===
    loaded 120 words
    tested 4322 potential matching sites
    found 1326 occurrences

    === kind=titles aswords=0 forgiving=1 ===
    loaded 120 words
    tested 4182 potential matching sites
    found 6124 occurrences

    === kind=titles aswords=1 forgiving=0 ===
    loaded 120 words
    tested 824 potential matching sites
    found 193 occurrences

    === kind=titles aswords=1 forgiving=1 ===
    loaded 120 words
    tested 791 potential matching sites
    found 647 occurrences

     lines   words     bytes file
    ------ ------- --------- ------------
      1326    6630     37574 Note-010/titles-test-w0-f0.occ
      6124   30620    176199 Note-010/titles-test-w0-f1.occ
       193     965      6105 Note-010/titles-test-w1-f0.occ
       647    3235     20982 Note-010/titles-test-w1-f1.occ

Now let's list the words that occur too often:

    @ maxocc = 30
    foreach kind ( labels titles )
      echo " "
      /usr/ucb/echo "words from ${kind} with count > ${maxocc} in subsample"
      /usr/ucb/echo "w f noccs words"
      /usr/ucb/echo "- - ------ ------------"
      foreach w ( 0 1 )
        foreach f ( 0 1 )
          echo " "
          set n = `cat Note-010/${kind}-test-w${w}-f${f}.occ | wc -l`
          cat Note-010/${kind}-test-w${w}-f${f}.occ \
            | gawk '/./ {print $4}' \
            | sort | uniq -c | expand \
            | sort +0 -1nr \
            > Note-010/${kind}-test-w${w}-f${f}.frq
          cat Note-010/${kind}-test-w${w}-f${f}.frq \
            | gawk '($1>='"${maxocc}"'){print $2}' \
            > Note-010/${kind}-common-w${w}-f${f}.dic
          set s = ( `cat Note-010/${kind}-common-w${w}-f${f}.dic` )
          /n/gnu/bin/printf "${w} ${f} %6d %s\n" ${n} "${s}"
        end
      end
    end

    words from labels with count > 30 in subsample
    w f noccs words
    - - ------ ------------

    0 0   2007 o y dy aiin al ar or qor daiin ody ys

    0 1   9919 o y j shy dy gy ar or qor ys al ay oky oty aiin oin sar aj am sal ches ary asy ody char chas chos shar shor shedy tey aly chol shol qkol tol daiin dar okaiin okain otaiin otain okeey oteey qotesy rary sary shchy tar akol okal okol otal otol qokal cham rals sals chody dal cheys saiin

    1 0    267

    1 1   1081 okaiin okain otaiin otain

    words from titles with count > 30 in subsample
    w f noccs words
    - - ------ ------------

    0 0   1326 y r dy qok ol daiin chedy

    0 1   6124 r y chy dy qok ol yy sar chdy eedy char chedy cheg sam chol shol daiiin daiin dain dar okain ytaiin ytain okeey qoteey kar qkar sheey otol qokal saiin

    1 0    193

    1 1    647 okain ytaiin ytain

Now let's build a personalized word list for each option combination.
When aswords=0, we delete the popular words above. When aswords=1, we
keep everything.
In either case, when forgiving=1, we eliminate all but one
representative of each equivalence class:

    foreach kind (labels titles)
      foreach w ( 0 1 )
        foreach f ( 0 1 )
          echo " "; echo "=== kind=${kind} aswords=$w forgiving=$f ==="
          if ( $w == 1 ) then
            set wfilter = ( cat )
          else
            set wfilter = ( fgrep -v -w -f Note-010/${kind}-common-w0-f${f}.dic )
          endif
          cat Note-010/${kind}.dic \
            | remove-variant-words -f eva2erg.gawk \
                -v forgiving=${f} \
                -v ignoreq=1 \
            | ${wfilter} \
            > Note-010/${kind}-w${w}-f${f}.dic
        end
      end
      dicio-wc Note-010/${kind}-w{0,1}-f{0,1}.dic
    end

     lines   words     bytes file
    ------ ------- --------- ------------
       415     415      2926 Note-010/labels-w0-f0.dic
       277     277      2096 Note-010/labels-w0-f1.dic
       425     425      2960 Note-010/labels-w1-f0.dic
       312     312      2253 Note-010/labels-w1-f1.dic

     lines   words     bytes file
    ------ ------- --------- ------------
       112     112       764 Note-010/titles-w0-f0.dic
        76      76       559 Note-010/titles-w0-f1.dic
       119     119       790 Note-010/titles-w1-f0.dic
        98      98       658 Note-010/titles-w1-f1.dic

The following label words were eliminated from one or more of these
lists because of duplication:

    am = od = aj
    asy = oeo = ary
    chas = eeoe = char
    chos = eeoe = char
    chpaly = eepolo = chfaly
    dolaram = doloeod = dolaraj
    dolary = doloeo = dalary
    dolory = doloeo = dalary
    doly = dolo = daly
    dyly = dolo = daly
    gan = don = dan
    gy = do = dy
    oin = oin = aiin
    okain = otoin = okaiin
    okal = otol = akol
    okaly = otolo = okala
    okalyd = otolod = okalam
    okaram = otoeod = okaraj
    okarchaj = otoeeeod = okarchag
    okeeos = oteeoe = okchor
    okeol = oteol = okeal
    okody = otodo = okagy
    okol = otol = akol
    okolar = otoloe = okalar
    okoly = otolo = okala
    okolyd = otolod = okalam
    okor = otoe = okar
    okorad = otoeod = okaraj
    okyd = otod = okam
    okydy = otodo = okagy
    olcphy = olepeo = alcphy
    oldam = oldod = oldaj
    opcharoiin = opeeoeoin = opcharaiin
    oporain = opoeoin = oforain
    opyrkydal = opoetodol = ofyskydal
    or = oe = ar
    orar = oeoe = arar
    oroj = oeod = oram
    orol = oeol = oral
    osaro = oeoeo = oraro
    oshodody = oeeododo = oshodady
otaiin = otoin = okaiin otain = otoin = okaiin otair = otoie = okair otal = otol = akol otala = otolo = okala otalaj = otolod = okalam otalal = otolol = okalal otalar = otoloe = okalar otalchy = otoleeo = okolshy otaldy = otoldo = okoldy otalgar = otoldoe = otaldar otalshy = otoleeo = okolshy otalsy = otoleo = atalsy otaly = otolo = okala otam = otod = okam otar = otoe = okar otaraldy = otoeoldo = okoraldy otaralgy = otoeoldo = okoraldy otaram = otoeod = okaraj otchar = oteeoe = okchor otchos = oteeoe = okchor oteey = oteeo = okeey oteol = oteol = okeal oteolar = oteoloe = okealar otey = oteo = oteo otody = otodo = okagy otokol = ototol = atakal otol = otol = akol otolam = otolod = okalam otold = otold = otald otoldy = otoldo = okoldy otolor = otoloe = okalar otols = otole = okols otoly = otolo = okala otor = otoe = okar otora = otoeo = okary otorad = otoeod = okaraj otorain = otoeoin = okaraiin otoram = otoeod = okaraj otosey = otoeeo = otoeeo otra = oteo = oteo otral = oteol = okeal oty = oto = oky otyda = otodo = okagy otyld = otold = otald otysam = otoeod = okaraj qokal = okal = okal qokal = otol = akol qor = oe = ar qor = or = or qotesy = oteeo = okeey sals = eole = rals sary = eoeo = rary shar = eeoe = char shol = eeol = chol sholshgy = eeoleedo = sholshdy shor = eeoe = char siiir = oie = oir solal = eolol = salal soraly = eoeolo = sorala sydarary = eodoeoeo = rydarary syly = eolo = soly sysam = eoeod = sosam tol = tol = qkol y = o = o ykaly = otolo = okala ykary = otoeo = okary ykas = otoe = okar ykchdy = oteedo = otchdy ykeeody = oteeodo = otchody ykeol = oteol = okeal ykolaiin = otoloin = otalaiin ykyd = otod = okam ys = oe = ar ytalshdy = otoleedo = otolchdy ytoaiin = otooin = qokoaiin The following title words were eliminated from one or more of these lists because of duplication: chkor = eetoe = chkar daiin = doin = daiiin dain = doin = daiiin eedy = eedo = chdy lshy = leeo = lchy okor = otoe = okar otair = otoie = otaiir otar = otoe = okar otcheo = 
oteeeo = okchey otchol = oteeol = okchol otshey = oteeeo = okchey qkar = kar = kar qkar = toe = kar qokal = otol = otol qoteey = oteeo = okeey schol = eeeol = cheol shol = eeol = chol teerodal = teeeodol = keerodal ydaraishy = odoeoieeo = ydaraisho ykar = otoe = okar ytaiin = otoin = okain ytain = otoin = okain ytchas = oteeoe = ytchar Just to make sure, I ran a test with the personalized label sets but a 200-line subsample of the sample text: foreach kind ( labels titles ) foreach w ( 0 1 ) foreach f ( 0 1 ) echo " " echo "=== kind=${kind} aswords=$w forgiving=$f ===" cat Note-010/vtx-f-eva.evt \ | grep -v '^#' \ | gawk '((NR % 19) == 7) {print;}' \ | find-occurrences -f eva2erg.gawk \ -v show=1 \ -v aswords=${w} \ -v ignoreq=1 \ -v forgiving=${f} \ -v wordfile=Note-010/${kind}-w${w}-f${f}.dic \ > Note-010/${kind}-test-w${w}-f${f}.shocc end end dicio-wc Note-010/${kind}-test-w{0,1}-f{0,1}.shocc end === kind=labels aswords=0 forgiving=0 === loaded 415 words tested 8842 potential matching sites found 793 occurrences === kind=labels aswords=0 forgiving=1 === loaded 277 words tested 8549 potential matching sites found 1721 occurrences === kind=labels aswords=1 forgiving=0 === loaded 425 words tested 1704 potential matching sites found 520 occurrences === kind=labels aswords=1 forgiving=1 === loaded 312 words tested 1633 potential matching sites found 940 occurrences lines words bytes file ------ ------- --------- ------------ 1411 3101 93398 Note-010/labels-test-w0-f0.shocc 2339 5248 149853 Note-010/labels-test-w0-f1.shocc 1138 2555 77771 Note-010/labels-test-w1-f0.shocc 1558 3686 102995 Note-010/labels-test-w1-f1.shocc === kind=titles aswords=0 forgiving=0 === loaded 112 words tested 8842 potential matching sites found 620 occurrences === kind=titles aswords=0 forgiving=1 === loaded 76 words tested 8549 potential matching sites found 994 occurrences === kind=titles aswords=1 forgiving=0 === loaded 119 words tested 1704 potential matching sites found 408 occurrences 
=== kind=titles aswords=1 forgiving=1 === loaded 98 words tested 1633 potential matching sites found 794 occurrences lines words bytes file ------ ------- --------- ------------ 1238 2755 81583 Note-010/titles-test-w0-f0.shocc 1612 3794 103508 Note-010/titles-test-w0-f1.shocc 1026 2331 70370 Note-010/titles-test-w1-f0.shocc 1412 3394 93581 Note-010/titles-test-w1-f1.shocc OK, let's run it for real. foreach kind ( labels titles ) foreach f ( 0 1 ) foreach w ( 1 0 ) echo " "; echo "=== kind=${kind} aswords=$w forgiving=$f ===" cat Note-010/vtx-f-eva.evt \ | find-occurrences -f eva2erg.gawk \ -v aswords=${w} \ -v ignoreq=1 \ -v forgiving=${f} \ -v wordfile=Note-010/${kind}-w${w}-f${f}.dic \ > Note-010/${kind}-vtx-w${w}-f${f}.occ end end dicio-wc Note-010/${kind}-vtx-{w1-f0,w0-f0,w1-f1,w0-f1}.occ end === kind=labels aswords=1 forgiving=0 === loaded 425 words tested 31733 potential matching sites found 9992 occurrences words not found: adairchdy ainam airar aj akol alamchy alcphy alet alif araly ararchodaiin araydy aroshol atakal atalsy ay chalsain chdaiir chekair cheosdy chetal chfaly chockhoy choeesy chofany chofora choity chokaro choram chosaroshol chpaly ctharal daiind daij daj dakocth daliir dalsy daramgal dararaiin darargi dariiir darshody dcheeor dcheoldy dokor dolaj dolaraj dolaram dolary dolchsody dolory dydariin dykchal dyly dytolg dytoly eeolchee ekeeey fary gan gy iirody ilkeeepol j koeeorain korain korainy oalcheg oarcheos ochepalain ocholsharam ochory odiiir oeepoaly oeolales oeoldain ofakal ofaldo ofaralar ofchdagy ofcheody oforain ofyskydal okagy okaiindan okala okalar okaldal okalyd okaraj okarchag okarchaj okchshy okealar okeechor okeoaly oklairgy okldam okolar okolinj okolshy okolyd okorad okoraldy okshdchas okydseoj olaran olaras olcheeeey olcheom olchiom oldaj oleoedin olkao olkchdal onary opaepom opaladj opalar opaldiiir opaloiiry opalorar opalrar opchaday opcharaiin opcharoiin opcholdy opchosam opeealdm opocphor oporain opyrkydal opysaj orald 
oraraly orchedal oroj osano osaro oshodady oshodody otainy otakaikan otala otalaj otalalg otalaly otalchy otaldar otalef otaleky otalgar otalody otalshy otaraldy otaralgy otarer otcheodar otchodals otchoshy otdorgy otdrdy oteeary oteeeiir oteoaldy oteolar oteosal oteosarar oteoys otodol otoeeo otokol otolam otolarol otolchdy otolchtey otold otoloaram otolor otols otooeey otora otorad otorain otoram otorchety otosey otoshos otra otral otshshdy otyda otydary otyld otysam oydchy qolsa qotesy rakar rals rydarary salal salf saloiin saloiinsheol salols seeyar sheoraj sholshdy sholshgy siiir siiircthr socharcfh sochorcfhy sofal soity solal soleesos solsy sorala soraly sororal sorory sosainr sosam sydarary syly sysam taol tasdaiin tolchd ydcphiiirdy ydcphody yfain ykary ykas ykchochdy ykocfhy ykoecfhy ykyd yorain ypsharal ys yshesas ytaem ytalshdy ytarem ytoaiin ytodaiir yypchy === kind=labels aswords=0 forgiving=0 === loaded 415 words tested 164044 potential matching sites found 14741 occurrences words not found: adairchdy ainam aj alamchy alcphy alet alif ararchodaiin araydy aroshol atakal atalsy chalsain chdaiir chekair cheosdy chfaly chockhoy chofany chofora choity chosaroshol chpaly ctharal daij daj dakocth daliir daramgal darargi dariiir dcheeor dcheoldy dokor dolaj dolaraj dolaram dolchsody dolory dykchal dytolg dytoly fary gan gy ilkeeepol koeeorain korain korainy oalcheg ochepalain ocholsharam ochory odiiir oeepoaly oeolales oeoldain ofakal ofaldo ofaralar ofchdagy oforain ofyskydal okagy okaiindan okaraj okarchag okarchaj okchshy okeechor okeoaly oklairgy okldam okolar okolinj okolshy okorad okoraldy okshdchas okydseoj olaran olaras olcheeeey olcheom olchiom oldaj oleoedin olkao olkchdal onary opaepom opaladj opalar opaldiiir opaloiiry opalorar opchaday opcharaiin opcharoiin opcholdy opchosam opeealdm opocphor oporain opyrkydal opysaj oraraly oroj osano oshodady oshodody otakaikan otalaj otalalg otalaly otalchy otalef otaleky otalgar otalody otaralgy otarer 
otcheodar otchodals otchoshy otdorgy otdrdy oteeary oteeeiir oteoaldy oteosal oteosarar oteoys otoeeo otokol otolam otolarol otolchdy otolchtey otoloaram otolor otooeey otorad otorain otoram otorchety otosey otoshos otra otral otshshdy otydary otyld otysam oydchy qotesy rydarary salf saloiin saloiinsheol seeyar sheoraj sholshdy sholshgy siiir siiircthr socharcfh sochorcfhy sofal soity solal soleesos solsy sorala soraly sororal sorory sosainr sosam sydarary syly sysam taol tasdaiin ydcphiiirdy ydcphody yfain ykchochdy ykocfhy ykoecfhy ypsharal yshesas ytaem ytarem ytoaiin ytodaiir yypchy === kind=labels aswords=1 forgiving=1 === loaded 312 words tested 30298 potential matching sites found 17920 occurrences words not found: adairchdy ainam airar alamchy alet alif ararchodaiin araydy aroshol chalsain cheosdy chfaly chofany chofora choity chokaro chosaroshol ctharal dakocth daliir dalsy daramgal dararaiin darargi darshody dcheoldy dolaraj dolchsody dydariin dykchal dytolg eeolchee fary iirody ilkeeepol koeeorain oalcheg oarcheos ochepalain ocholsharam odiiir oeepoaly oeolales oeoldain ofakal ofaralar ofchdagy ofyskydal okaiindan okarchag okeoaly oklairgy okldam okolinj okoraldy okshdchas okydseoj olaran olaras olcheeeey olchiom oleoedin olkao onary opaepom opaladj opaldiiir opaloiiry opalorar opalrar opchaday opcharaiin opchosam opeealdm opocphor oraraly osano oshodady otakaikan otalalg otalaly otalef otaleky otchodals otchoshy otdorgy otdrdy oteeeiir oteoaldy oteosal oteosarar otolarol otolchtey otoloaram otooeey otorchety otydary oydchy rydarary salf saloiinsheol salols sholshdy siiircthr socharcfh sochorcfhy sofal soity soleesos solsy sorala sororal sorory sosainr tasdaiin ydcphiiirdy ydcphody ykchochdy ykoecfhy ypsharal ytaem ytarem ytodaiir === kind=labels aswords=0 forgiving=1 === loaded 277 words tested 158752 potential matching sites found 32651 occurrences words not found: adairchdy alif chofany chosaroshol daliir daramgal darargi dcheoldy ilkeeepol ochepalain 
ocholsharam oeepoaly ofakal ofyskydal okaiindan oklairgy okldam okolinj okshdchas olaran olchiom oleoedin opaepom opaladj opaldiiir opaloiiry opalorar opeealdm opocphor otakaikan otaleky otdorgy otdrdy oteeeiir otolchtey otoloaram saloiinsheol ydcphiiirdy ydcphody lines words bytes file ------ ------- --------- ------------ 9992 49960 333228 Note-010/labels-vtx-w1-f0.occ 14741 73705 499217 Note-010/labels-vtx-w0-f0.occ 17920 89600 600753 Note-010/labels-vtx-w1-f1.occ 32651 163255 1192563 Note-010/labels-vtx-w0-f1.occ === kind=titles aswords=1 forgiving=0 === loaded 119 words tested 31733 potential matching sites found 7555 occurrences words not found: aiicthy alak cheg cholkal chorly dainod dak daldalol dodaiin dorain dytchdy ekchey ilnirsireik iriilil keerodal kolschees oiilireo okokchod okokchodg olchariiirfin olcheir olpchdy otchodeey otcholcheaiin otosy saiinchy shekealy teerodal ychealod ydaraisho ydaraishy ykdy ytchas yy === kind=titles aswords=0 forgiving=0 === loaded 112 words tested 164044 potential matching sites found 11367 occurrences words not found: aiicthy cheg chorly ilnirsireik iriilil keerodal kolschees oiilireo okokchod okokchodg olchariiirfin olcheir olpchdy otchodeey otcholcheaiin otosy shekealy teerodal ychealod ydaraisho ydaraishy ykdy ytchas === kind=titles aswords=1 forgiving=1 === loaded 98 words tested 30298 potential matching sites found 14850 occurrences words not found: alak cholkal chorly dainod dak daldalol ilnirsireik iriilil kolschees oiilireo okokchod okokchodg olchariiirfin otchodeey otcholcheaiin saiinchy ychealod ydaraisho === kind=titles aswords=0 forgiving=1 === loaded 76 words tested 158752 potential matching sites found 19778 occurrences words not found: ilnirsireik iriilil oiilireo okokchodg olchariiirfin otcholcheaiin ydaraisho lines words bytes file ------ ------- --------- ------------ 11367 56835 386511 Note-010/titles-vtx-w0-f0.occ 19778 98890 716561 Note-010/titles-vtx-w0-f1.occ 7555 37775 253910 
    Note-010/titles-vtx-w1-f0.occ
     14850   74250    505514 Note-010/titles-vtx-w1-f1.occ

Checking the maximum word length:

    cat Note-010/{labels,titles}-vtx-{w1-f0,w0-f0,w1-f1,w0-f1}.occ \
      | gawk '/./ {k=length($4); if(k>m){m=k} k=length($5); if(k>n){n=k}} END {print m, n;}'

    12 16

I saved this parameter in a shell variable:

    set maxlen = 17

Finally, I built the block-based maps. I printed the counts with wide
formats in order to use sort-distr. Three digits + space should be
enough for the maximum occurrences of a word in a block.

The "ss" list below gives all the w/f combinations (the "directory"
part) and the number of potential matching sites (the "name" part) for
each combination.

    set ss = ( 1.0/31733 0.0/164044 1.1/30298 0.1/158752 )

The following variable defines the number of blocks to use:

    set nb = 20

So here it is:

    foreach kind ( labels titles )
      foreach s ( $ss )
        set sites = ${s:t}
        set wf = ${s:h}
        set f = ${wf:e}
        set w = ${wf:r}
        echo " "; echo "=== kind=${kind} aswords=$w forgiving=$f sites=${sites} ==="
        cat Note-010/${kind}-vtx-w${w}-f${f}.occ \
          | sort -b +3 -5 \
          | gawk ' /./ {print int(('"${nb}"'*$3)/'"${sites}"') + 1, $4, $5;}' \
          | make-word-location-map \
              -v showAvgPos=0 \
              -v maxlen=${maxlen} \
              -v ctwd=4 \
              -v wdefs=Note-010/${kind}.wdl \
              -v nblocks=${nb} \
          > Note-010/${kind}-by-block-w${w}-f${f}-a.map
        grep '999' Note-010/${kind}-by-block-w${w}-f${f}-a.map
      end
      dicio-wc Note-010/${kind}-by-block-{w1-f0,w0-f0,w1-f1,w0-f1}-a.map
    end

    === kind=labels aswords=1 forgiving=0 sites=31733 ===

    === kind=labels aswords=0 forgiving=0 sites=164044 ===

    === kind=labels aswords=1 forgiving=1 sites=30298 ===

    === kind=labels aswords=0 forgiving=1 sites=158752 ===

     lines   words     bytes file
    ------ ------- --------- ------------
       247    6422     35665 Note-010/labels-by-block-w1-f0-a.map
       568   14768     81987 Note-010/labels-by-block-w0-f0-a.map
      1196   31096    172722 Note-010/labels-by-block-w1-f1-a.map
      7361  191386   1062781 Note-010/labels-by-block-w0-f1-a.map

    === kind=titles aswords=1 forgiving=0 sites=31733 ===

    === kind=titles aswords=0 forgiving=0 sites=164044 ===

    === kind=titles aswords=1 forgiving=1 sites=30298 ===

    === kind=titles aswords=0 forgiving=1 sites=158752 ===

     lines   words     bytes file
    ------ ------- --------- ------------
       118    3068     17057 Note-010/titles-by-block-w1-f0-a.map
       225    5850     32515 Note-010/titles-by-block-w0-f0-a.map
       724   18824    104652 Note-010/titles-by-block-w1-f1-a.map
      2608   67808    376332 Note-010/titles-by-block-w0-f1-a.map

It is useful to know which locations correspond to which blocks:

    foreach kind ( labels titles )
      foreach s ( $ss )
        set sites = ${s:t}
        set wf = ${s:h}
        set f = ${wf:e}
        set w = ${wf:r}
        echo " "; echo "=== kind=${kind} aswords=$w forgiving=$f sites=${sites} ==="
        cat Note-010/${kind}-vtx-w${w}-f${f}.occ \
          | sort -b +2 -3n \
          | gawk \
              ' /./ { \
                  bl = int(('"${nb}"'*$3)/'"${sites}"') + 1; \
                  if (bl \!= ba) { printf "%02d %s\n", bl, $1; ba=bl;} \
                } \
              ' \
          > Note-010/${kind}-by-block-w${w}-f${f}.blpages
      end
      dicio-wc Note-010/${kind}-by-block-{w1-f0,w0-f0,w1-f1,w0-f1}.blpages
    end

With some editing and

    rotate-labels -v width=4

we get a useful header for the maps:

    --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- ---
    01 02 03 04 05 06 07 08 09 10 11 12 13 14 15 16 17 18 19 20
    --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- ---
    f f f f f f f f f f f f f f f f f f f f
    1 9 2 3 4 4 5 7 7 8 8 8 8 1 1 1 1 1 1 1
    r v 1 1 0 8 8 5 8 0 3 6 9 0 0 0 0 1 1 1
    v v v v r v r r r v v 0 3 5 7 1 2 4 6
    1 r v v v r v v
    --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- ---

I also prepared similar maps where all variants of the same word are
added together. Note that this makes a difference even in the case
aswords=1 forgiving=0, because of the "q"-elimination.

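The block assignment used in the gawk one-liners of this section, block = int(nb*POS/sites) + 1, and the effect of adding variants together can be sketched in Python as follows. This is only an illustration under stated assumptions: the names `block_of` and `block_counts` are hypothetical, each occurrence is reduced to a (POS, WORD, OCC) triple, and the three sample occurrences are made up rather than taken from the .occ files.

```python
from collections import defaultdict


def block_of(pos, sites, nblocks=20):
    """Block number in 1..nblocks for a position in 0..sites-1."""
    return (nblocks * pos) // sites + 1


def block_counts(occurrences, sites, nblocks=20, merge_variants=False):
    """Count occurrences per block, one row per (word, variant) pair.

    With merge_variants=True every variant is filed under (word, "*"),
    which mirrors printing "*" in place of the OCC field.
    """
    rows = defaultdict(lambda: [0] * nblocks)
    for pos, word, occ in occurrences:
        key = (word, "*") if merge_variants else (word, occ)
        rows[key][block_of(pos, sites, nblocks) - 1] += 1
    return dict(rows)


# Made-up occurrences: (POS, WORD, OCC); "qokal" stands for a q-prefixed variant.
occs = [(0, "okal", "okal"), (1600, "okal", "qokal"), (31732, "okal", "okal")]
by_variant = block_counts(occs, sites=31733)
totals = block_counts(occs, sites=31733, merge_variants=True)
print(sorted(by_variant))
print(totals[("okal", "*")])
```

With variants kept apart the sample yields two rows, one for "okal" and one for "qokal"; with merging it yields a single row with counts in blocks 1, 2 and 20.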
    foreach kind ( labels titles )
      foreach s ( $ss )
        set sites = ${s:t}
        set wf = ${s:h}
        set f = ${wf:e}
        set w = ${wf:r}
        echo " "; echo "=== kind=${kind} aswords=$w forgiving=$f sites=${sites} ==="
        cat Note-010/${kind}-vtx-w${w}-f${f}.occ \
          | gawk ' /./ {print int(('"${nb}"'*$3)/'"${sites}"') + 1, $4, "*";}' \
          | sort -b +1 -2 \
          | make-word-location-map \
              -v showAvgPos=0 \
              -v maxlen=${maxlen} \
              -v ctwd=4 \
              -v wdefs=Note-010/${kind}.wdl \
              -v nblocks=${nb} \
          > Note-010/${kind}-by-block-w${w}-f${f}-t.map
        grep '999' Note-010/${kind}-by-block-w${w}-f${f}-t.map
      end
      dicio-wc Note-010/${kind}-by-block-{w1-f0,w0-f0,w1-f1,w0-f1}-t.map
    end

    === kind=labels aswords=1 forgiving=0 sites=31733 ===

    === kind=labels aswords=0 forgiving=0 sites=164044 ===

    === kind=labels aswords=1 forgiving=1 sites=30298 ===

    === kind=labels aswords=0 forgiving=1 sites=158752 ===

     lines   words     bytes file
    ------ ------- --------- ------------
       174    4524     25114 Note-010/labels-by-block-w1-f0-t.map
       218    5668     31456 Note-010/labels-by-block-w0-f0-t.map
       189    4914     27280 Note-010/labels-by-block-w1-f1-t.map
       238    6188     34333 Note-010/labels-by-block-w0-f1-t.map

    === kind=titles aswords=1 forgiving=0 sites=31733 ===

    === kind=titles aswords=0 forgiving=0 sites=164044 ===

    === kind=titles aswords=1 forgiving=1 sites=30298 ===

    === kind=titles aswords=0 forgiving=1 sites=158752 ===

     lines   words     bytes file
    ------ ------- --------- ------------
        85    2210     12282 Note-010/titles-by-block-w1-f0-t.map
        89    2314     12853 Note-010/titles-by-block-w0-f0-t.map
        80    2080     11553 Note-010/titles-by-block-w1-f1-t.map
        69    1794      9956 Note-010/titles-by-block-w0-f1-t.map

Some quick observations that can be drawn from these maps:

On "k" versus "t":

For all [kt]-containing label words that appear in enough numbers, the
"k" and "t" variants have very similar distributions along the text.
For instance:

              -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --
              01 02 03 04 05 06 07 08 09 10 11 12 13 14 15 16 17 18 19 20
              -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --
               f  f  f  f  f  f  f  f  f  f  f  f  f  f  f  f  f  f  f  f
               1  9  2  3  4  4  5  7  7  8  8  8  8  1  1  1  1  1  1  1
               r  v  1  1  0  8  8  5  8  0  3  6  9  0  0  0  0  1  1  1
                     v  v  v  v  r  v  r  r  r  v  v  0  3  5  7  1  2  4
                                                6  1  r  v  v  v  r  v  v
              -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --
kal        43  .  .  .  1  3  2  8  2  3  1  2  1  1  .  1  8  5  2  2  1
tal        27  .  .  .  1  .  1  5  2  3  5  2  3  1  .  1  2  1  .  .  .
kar        56  1  .  .  5  5  8  3  .  2  1  1  3  2  2  3  7  5  4  4  .
tar        46  1  .  .  3  1  3  1  .  4  4  6  3  5  1  2  2  3  5  1  1
key        17  1  .  .  2  2  2  2  .  .  1  .  1  1  1  .  .  1  3  .  .
tey        12  1  1  .  1  .  .  1  .  .  .  1  2  1  1  .  .  1  1  .  1
kol        39  5  5  1  1  3  4  1  .  1  .  1  4  2  4  .  2  2  1  .  2
tol        53  .  3  3  1  4  6  3  4  2  5  3  3  5  2  1  1  4  .  1  2
kor        20  2  .  2  5  1  2  .  .  .  .  1  .  2  4  1  .  .  .  .  .
tor        22  .  3  1  2  1  3  .  .  1  2  1  2  4  1  .  1  .  .  .  .
okaiin    299  3  7  7 20  5  9  8  9 21 13 15  5 13 12 11 34 31 28 34 14
otaiin    218  8  6  4  7  2  5  3  3 20 11  7 11  7  7 14 30 12 23 27 11
ykaiin     48  2  2  4  5  4  2  .  .  .  1  5 11  2  .  3  3  1  1  1  1
ytaiin     49  3  2  2  2  2  2  1  1  1  1  2 15  .  .  4  8  .  .  2  1
qokaiin   459  2  3  5 11  2  6 16 37 60 59 27 13  9 15 29 38 27 42 24 34
qotaiin   128  3  1  2  4  .  1  . 10 12  9  5  6  2  4 14 11  5 10 13 16
okain      32  .  1  2  4  2  5  .  6  2  3  .  .  1  1  1  2  .  .  1  1
otain      14  1  1  .  1  2  .  1  4  .  2  .  1  .  .  .  .  .  1  .  .
okal      127  3  5  1  7  5 10  7  6  5  7  7  3 12  2  5 11 12  6  8  5
otal      129  1  2  3  4  5  3  8  2 10  7  7 11  6  2  5 14 16  6  8  9
qokal     187  .  .  .  2  1  4 18 28 20 34 20  4  8 12  6  7 11  8  2  2
qotal      69  .  .  .  1  1  2  4  7  7 14  3  6  .  4  6  1  3  4  2  4
okeey     148  2  1  1  4  4  3  3  4  7  6  .  5  7 24  5 10 24 23  5 10
oteey      97  2  .  1  1  2  .  2  3  3  5  3  1  9 12  6  7 18 11  5  6
okol       50  7  3  5  1  3  3  1  .  3  .  3  3  8  4  1  2  2  .  1  .
otol       62  3  6  6  1  4  7  1  2  3  1  3  8  3  2  2  2  2  .  6  .
qokol      87  2  7  5  3  4  6  3  6  6  6  5  7 11  3  4  1  4  .  1  3
qotol      42  2  5  3  5  5  1  1  2  2  4  1  2  2  .  4  .  .  .  2  1
okchol     11  1  1  1  .  2  2  1  .  .  .  .  .  2  1  .  .  .  .  .  .
otchol     26  6  3  4  2  5  2  1  .  .  .  1  1  1  .  .  .  .  .  .  .
okeeol     16  .  1  .  .  .  .  .  .  .  .  1  1  3  4  2  .  1  2  1  .
oteeol      8  .  .  .  1  .  .  .  .  .  .  .  .  1  .  1  2  .  2  .  1
okeody     23  1  .  1  .  2  3  4  .  .  .  1  .  1  5  1  .  1  2  1  .
oteody     19  .  .  .  .  1  1  1  .  .  .  2  3  1  1  2  1  .  4  1  1
qokeody    31  .  .  1  .  3  1  4  .  .  .  1  8  3  6  .  .  2  1  1  .
qoteody    11  .  .  .  .  .  .  .  .  .  .  .  3  1  1  2  .  1  .  1  2

So either "k" and "t" are the same letter, or they are very close variants, like two cases of the same noun. (Singular and plural wouldn't do, nor would different verb tenses or persons: such variants ought to show more independent distributions.)

On "o" versus "qo":

Looking at the aswords=1 forgiving=0 maps, which list every word with and without the "q" prefix, we can notice that the two alternatives generally have very similar distributions. For instance:

              -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --
              01 02 03 04 05 06 07 08 09 10 11 12 13 14 15 16 17 18 19 20
              -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --
               f  f  f  f  f  f  f  f  f  f  f  f  f  f  f  f  f  f  f  f
               1  9  2  3  4  4  5  7  7  8  8  8  8  1  1  1  1  1  1  1
               r  v  1  1  0  8  8  5  8  0  3  6  9  0  0  0  0  1  1  1
                     v  v  v  v  r  v  r  r  r  v  v  0  3  5  7  1  2  4
                                                6  1  r  v  v  v  r  v  v
              -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --
oKal      256  4  7  4 11 10 13 15  8 15 14 14 14 18  4 10 25 28 12 16 14
qoKal     256  .  .  .  3  2  6 22 35 27 48 23 10  8 16 12  8 14 12  4  6
oKol      112 10  9 11  2  7 10  2  2  6  1  6 11 11  6  3  4  4  .  7  .
qoKol     129  4 12  8  8  9  7  4  8  8 10  6  9 13  3  8  1  4  .  3  4
oly        53  1  1  1  4  1  1  8  5 10  5  4  1  2  3  2  .  1  .  .  3
qoly        7  .  .  .  .  .  .  1  1  2  3  .  .  .  .  .  .  .  .  .  .
oaiin      36  1  1  2  3  1  7  2  .  1  .  .  1  .  3  .  1  1  6  2  4
qoaiin     28  3  1  2  .  2  1  .  1  .  .  .  5  .  1  2  1  1  3  4  1
ocKhey      9  .  .  .  1  1  .  .  1  .  1  .  1  1  1  1  1  .  .  .  .
qocKhey    23  .  1  .  .  .  4  .  1  1  1  1  1  3  4  2  2  1  1  .  .
odaiin     66  6  5  5  .  1  7  2  .  1  1  3  6  4  4  7  4  4  1  1  4
qodaiin    47  4  .  1  .  1  7  1  .  .  .  2  2  3  . 11  4  1  1  5  4
odar       19  1  1  1  1  .  2  1  .  .  .  .  1  2  .  5  1  2  .  1  .
qodar      11  1  1  1  .  .  .  .  .  1  .  1  1  4  1  .  .  .  .  .  .
oKain      46  1  2  2  5  4  5  1 10  2  5  .  1  1  1  1  2  .  1  1  1
qoKain     74  .  .  1  1  1  1 16 19 15  6  3  .  1  1  1  2  2  1  3  .
oKair      40  .  .  .  2  1  2  1  .  .  3  .  2  1  .  6 14  1  1  4  2
qoKair     19  .  .  .  .  .  .  1  .  2  .  .  1  1  .  6  .  4  1  3  .
okaly      12  .  .  .  2  2  .  1  .  2  .  2  .  .  1  .  1  .  .  .  1
qokaly     14  .  .  .  .  .  1  6  1  1  1  1  .  .  3  .  .  .  .  .  .
oKam       65  1  1  4 10  4  7  3  .  .  .  2  7  2  2  6  2  5  2  4  3
qoKam      36  1  1  .  1  1  2  3  1  1  .  .  1  2  2  3  3  3  2  4  5
oKar      231  2  2  2 17  7 17 16 11 13  7 15 14 18  2 19 10 13 12 24 10
qoKar     208  .  .  1 13 11  5 10  8 19 13 21 16 11  6 10 12 10  6 15 21
oKchor     34  6 12  3  5  1  2  1  .  .  .  .  .  1  .  1  .  .  1  .  1
qoKchor    23  1 10  3  4  2  1  .  .  .  .  .  .  2  .  .  .  .  .  .  .
oKeol      84  4  3  2  4  5  .  .  1  1  2  2  4 12 21  3  3  9  2  5  1
qoKeol     57  1  .  1  .  3  1  1  .  .  1  1  8  8 19  1  2  2  5  1  2
oKeey     245  4  1  2  5  6  3  5  7 10 11  3  6 16 36 11 17 42 34 10 16
qoKeey    320  2  5  3  4  5  4  6 20 19 26 17 13 10 41 17  9 50 42  9 18
oKy       181  7 15 12 20 13 19  6 10 10  8  2  9 13  7  5  7  3  4  6  5
qoKy      225  6 19 13  6  6  7 19 15 20 30 13  9 11 13  7  4  5  6  5 11
oKeedy    186  .  .  6  2  6  2  7 17 20  7  7  1  5  9 17  9 24 24 15  8
qoKeedy   369  .  .  4  2  5  1 17 66 41 38 26  3  1 14 15 15 56 33 10 22
oKeeody    19  .  .  .  .  .  1  2  .  .  .  1  1  2  3  2  2  3  .  2  .
qoKeeody   18  .  1  .  .  .  2  4  .  .  .  .  1  1  1  1  2  .  1  2  2
oKeor      25  .  4  .  .  1  1  .  .  .  .  1  2  3  5  .  1  1  5  .  1
qoKeor     25  .  .  1  .  .  1  1  3  .  1  1  2  2  8  1  .  2  1  .  1
oKey       90  1  .  2  3  6  1  9  6  8  5  6  3  1 17  6  1  4  5  .  6
qoKey     137  3  2  2  .  3  3  9 16 10 11 13  2  4 15  5  . 19  7  4  9

The o/qo parallels are analogous to the k/t parallels, but apparently not as strong. The divergences between the o and qo forms seem compatible with different syntactic roles (e.g. different noun declensions). Alternatively, "q" may be an article or the "and" conjunction.
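One way to put a rough number on "similar distributions" is the cosine of the angle between two rows of block counts. This is not how the comparisons in these notes were made (they were judged by eye); it is only a sketch, with a hypothetical function name, that reads two rows in the table layout used here (word, total, then the per-block counts, with "." standing for zero).

```shell
# Sketch only: cosine similarity between the block counts of two rows.
# Assumed input: exactly two lines of the form  word tot c1 c2 ... cN,
# where "." means zero. Prints a value between 0 and 1.
row_cosine() {
  awk '
    NF >= 3 {
      n++
      for (i = 3; i <= NF; i++) v[n, i] = ($i == "." ? 0 : $i)
      if (NF > nf) nf = NF
    }
    END {
      for (i = 3; i <= nf; i++) {
        dot += v[1, i] * v[2, i]
        a   += v[1, i] ^ 2
        b   += v[2, i] ^ 2
      }
      if (a == 0 || b == 0) { print "undefined"; exit }
      printf "%.3f\n", dot / sqrt(a * b)
    }'
}

# Toy rows: identical profiles give 1.000, disjoint ones give 0.000.
printf 'x 6 1 2 3\nx 6 1 2 3\n' | row_cosine
printf 'a 1 1 . .\nb 1 . 1 .\n' | row_cosine
```

Feeding it a k/t or o/qo pair of rows copied from the tables would give a single figure per pair. With counts this small the figure is noisy, so it should be read as a rough guide rather than a test.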
On reliability in general:

The following label words cannot be found at all with aswords=1 forgiving=0, even with ignoreq=1, but can be found with aswords=1 forgiving=1:

                                           -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --
                                           01 02 03 04 05 06 07 08 09 10 11 12 13 14 15 16 17 18 19 20
                                           -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --
                                            f  f  f  f  f  f  f  f  f  f  f  f  f  f  f  f  f  f  f  f  f
                                            1  9  2  3  4  4  5  7  7  8  8  8  8  1  1  1  1  1  1  1  1
                                            r  v  1  1  0  8  8  5  8  0  3  6  9  0  0  0  0  1  1  1  1
                                                  v  v  v  v  r  v  r  r  r  v  v  0  3  5  7  1  2  4  6
word       T defined on     found as   tot                                   6  1  r  v  v  v  r  v  v  v
---------- - -------------- ---------- --- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --
aj         A f67v2.C1.1;C   *
aj         Z f70v2.S2.2;C   *
aj                          am          61  3  2  2  1  1  2  2  .  1  1  4  6  4  2  3  6  5 13  2  1
aj                          od           7  1  .  .  1  2  1  .  .  .  .  .  .  .  .  1  .  .  .  .  1
aj                          om          12  .  1  1  1  1  1  .  .  .  .  1  3  .  .  .  1  1  .  .  1
aj                          qod          7  .  .  3  1  .  1  .  .  .  .  .  .  .  .  1  .  .  1  .  .
aj                          yd           1  .  .  .  1  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .
alcphy     Z f71v.S1.5;K    *
alcphy                      alcfhy       1  .  .  .  .  .  .  .  .  .  .  .  .  .  1  .  .  .  .  .  .
alcphy                      olcphy       1  .  1  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .
atakal     P f89r2.b.2;K    *
atakal                      otykal       1  .  .  .  .  .  1  .  .  .  .  .  .  .  .  .  .  .  .  .  .
atakal                      qokokal      1  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  1  .  .  .  .
atalsy     P f89r2.b.3;L    *
atalsy                      otalsy       1  .  .  .  .  .  .  .  .  .  .  .  .  .  1  .  .  .  .  .  .
chdaiir    Z f72r1.S.7;K    *
chdaiir                     chdair       2  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  1  .  1  .
chdaiir                     shdair       2  .  .  .  .  .  .  .  .  .  .  1  .  .  .  .  1  .  .  .  .
chekair    A f67v2.L2.1;C   *
chekair                     shekair      1  .  .  .  .  .  .  1  .  .  .  .  .  .  .  .  .  .  .  .  .
chockhoy   P f100r.b.2;C    *
chockhoy                    chocthoy     1  1  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .
daij       A f67v2.C2.1;C   *
daij                        daid         1  .  .  1  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .
daij                        daiim        4  1  .  .  .  .  1  .  .  .  .  .  1  .  .  .  1  .  .  .  .
daj        A f67v2.C2.1;C   *
daj                         dam         92  6  2  7  6 13 14  4  2  1  1  9 10  5  1  5  .  1  1  2  2
daj                         dod          1  .  .  .  .  1  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .
daj                         dom          7  1  3  1  1  .  .  .  .  .  .  .  .  1  .  .  .  .  .  .  .
daj                         dym          1  .  .  .  .  .  .  .  .  .  1  .  .  .  .  .  .  .  .  .  .
dariiir    A f68v2.R.12;C   *
dariiir                     dyair        1  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  1  .  .  .
dcheeor    A f67v2.L4.1;C   *
dcheeor                     dcheeos      1  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  1
dcheeor                     dsheeos      1  .  .  .  .  .  .  .  .  .  .  .  .  1  .  .  .  .  .  .  .
dokor      P f100v.T.1;C    *
dokor      P f101v1.R1.1;C  *
dokor                       dakar        1  .  .  .  .  .  .  1  .  .  .  .  .  .  .  .  .  .  .  .  .
dolaj      A f67v2.S1.1;C   *
dolaj                       dalam        5  .  .  .  .  .  .  2  .  .  .  1  .  .  .  .  .  2  .  .  .
dolaj                       dalom        1  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  1
dytoly     P f101v1.R2.1;C  *
dytoly                      dykaly       1  .  .  .  .  1  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .
korain     P f89r2.m2.1;K   *
korain                      koraiin      3  2  .  .  .  .  .  .  .  .  .  .  1  .  .  .  .  .  .  .  .
korain                      koroiin      1  .  .  .  .  .  .  .  .  1  .  .  .  .  .  .  .  .  .  .  .
korain                      taraiin      1  .  .  .  1  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .
korain                      toraiin      2  .  .  .  .  .  .  .  .  1  .  .  .  .  .  1  .  .  .  .  .
korainy    P f89r2.m2.1;L   *
korainy                     karaiiny     1  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  1  .  .
ochory     S f68r2.S.4;R    *
ochory                      yshyey       1  .  .  .  .  1  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .
ofaldo     P f88r.b.3;K     *
ofaldo                      opaldy       1  .  .  .  .  1  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .
ofaldo                      yfoldy       1  .  .  .  1  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .
oforain    P f89r2.m2.5;K   *
oforain                     oparaiin     1  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  1  .
oforain                     oporaiin     1  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  1  .  .  .  .
okagy      Z f70v2.S1.7;K   *
okagy                       okady        2  .  .  .  .  .  .  .  .  .  .  1  .  .  .  .  .  .  1  .  .
okagy                       okody        6  1  .  .  1  .  .  .  .  .  .  .  .  .  3  .  .  .  .  1  .
okagy                       okydy        1  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  1
okagy                       otady        1  .  .  .  .  .  .  .  .  .  .  1  .  .  .  .  .  .  .  .  .
okagy                       otody        9  .  1  1  1  .  .  .  .  .  .  .  .  3  .  1  .  .  1  .  1
okagy                       otydy        2  .  1  .  .  .  1  .  .  .  .  .  .  .  .  .  .  .  .  .  .
okagy                       qokady       1  .  .  .  .  .  .  .  .  1  .  .  .  .  .  .  .  .  .  .  .
okagy                       qokody       9  1  .  1  .  1  3  .  .  .  .  .  1  .  1  .  .  .  .  1  .
okagy                       qokydy       2  .  .  .  .  .  .  1  .  .  .  1  .  .  .  .  .  .  .  .  .
okagy                       qotady       1  .  .  .  .  .  .  .  .  .  .  .  .  .  .  1  .  .  .  .  .
okagy                       qotody      10  .  .  .  .  1  1  .  .  .  .  1  2  1  .  .  .  .  .  2  2
okagy                       qotydy       1  .  .  .  .  1  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .
okagy                       ykody        2  1  .  .  .  .  1  .  .  .  .  .  .  .  .  .  .  .  .  .  .
okagy                       ytady        1  .  .  .  .  .  1  .  .  .  .  .  .  .  .  .  .  .  .  .  .
okagy                       ytoda        1  .  .  .  .  1  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .
okagy                       ytody        6  1  .  .  .  .  .  1  .  .  .  .  4  .  .  .  .  .  .  .  .
okagy                       ytydy        2  .  .  .  1  1  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .
okaraj     Z f70v2.S2.7;C   *
okaraj                      okaram       2  .  .  .  1  .  .  .  .  .  .  .  .  .  .  .  .  1  .  .  .
okaraj                      otaram       3  .  .  .  .  .  .  .  .  1  .  .  .  .  .  .  2  .  .  .  .
okaraj                      qokaram      2  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  1  1
okaraj                      ykaryd       1  .  .  .  .  1  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .
okaraj                      ytoryd       1  .  .  .  1  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .
okchshy    P f89r1.t.1;K    *
okchshy    P f89r1.t.1;L    *
okchshy                     okcheeo      1  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  1  .  .  .  .
okchshy                     okcheey      6  .  .  1  1  .  .  .  .  .  .  .  .  .  .  1  1  .  2  .  .
okchshy                     okchesy      1  .  .  1  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .
okchshy                     okechey      2  .  .  .  .  .  .  .  .  .  .  .  .  .  .  1  .  .  .  1  .
okchshy                     okeechy      2  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  2  .  .
okchshy                     okeeshy      2  .  .  .  .  .  .  .  .  .  .  .  .  .  .  1  .  .  1  .  .
okchshy                     okeshey      1  .  .  .  .  .  .  .  1  .  .  .  .  .  .  .  .  .  .  .  .
okchshy                     oksheey      1  .  .  .  .  .  .  .  .  .  .  .  .  .  .  1  .  .  .  .  .
okchshy                     otcheeo      1  .  .  .  .  .  .  .  .  .  .  .  .  .  .  1  .  .  .  .  .
okchshy                     otcheey      4  .  .  .  .  1  .  .  .  .  .  .  .  .  1  1  1  .  .  .  .
okchshy                     oteechy      1  .  .  .  .  .  .  .  .  .  .  .  .  .  .  1  .  .  .  .  .
okchshy                     qokcheey     1  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  1  .
okchshy                     qokechey     1  .  .  .  .  .  .  .  .  1  .  .  .  .  .  .  .  .  .  .  .
okchshy                     qokeechy     4  .  .  .  .  .  .  .  .  .  .  .  .  .  .  1  1  .  2  .  .
okchshy                     qokeeesy     1  .  .  1  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .
okchshy                     qokeeshy     1  .  .  .  .  .  .  .  .  .  .  .  .  .  1  .  .  .  .  .  .
okchshy                     qokeshey     1  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  1  .  .
okchshy                     qotcheey     1  .  .  .  .  1  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .
okchshy                     ykechey      1  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  1
okchshy                     ykeechy      3  .  .  1  .  .  .  1  1  .  .  .  .  .  .  .  .  .  .  .  .
okchshy                     ykeesey      1  .  .  .  .  .  .  1  .  .  .  .  .  .  .  .  .  .  .  .  .
okchshy                     ykeeshy      1  .  .  .  .  .  .  .  .  .  .  .  1  .  .  .  .  .  .  .  .
okchshy                     yksheey      1  .  .  .  .  .  .  .  .  .  .  .  .  .  1  .  .  .  .  .  .
okchshy                     ytchchy      1  .  .  .  .  .  .  .  .  .  .  .  .  1  .  .  .  .  .  .  .
okchshy                     ytcheey      1  1  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .
okeechor   S f68r2.S.24;R   *
okeechor                    otshchor     1  .  .  .  1  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .
okeechor                    qokecheos    1  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  1
okeechor                    qotecheor    1  .  .  .  .  .  .  .  .  .  .  .  .  1  .  .  .  .  .  .  .
okeechor                    ytcheear     1  1  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .
okolshy    Z f70v1.S.13;K   *
okolshy                     okalchy      1  .  .  .  .  .  .  .  .  .  1  .  .  .  .  .  .  .  .  .  .
okolshy                     okolchy      2  .  .  .  1  .  .  1  .  .  .  .  .  .  .  .  .  .  .  .  .
okolshy                     qotolchy     1  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  1  .  .  .  .
olcheom    P f100r.b.4;K    *
olcheom                     olsheam      1  .  .  .  .  .  .  .  .  .  .  1  .  .  .  .  .  .  .  .  .
olcheom                     olsheod      1  .  .  .  .  .  .  .  .  .  .  .  .  1  .  .  .  .  .  .  .
oldaj      P f89r1.t.3;L    *
oldaj                       aldam        2  .  .  .  .  .  1  1  .  .  .  .  .  .  .  .  .  .  .  .  .
oldaj                       oldam        5  .  .  1  2  .  1  .  .  .  .  .  .  .  1  .  .  .  .  .  .
olkchdal   B f77v.L.2;U     *
olkchdal                    olkeedyl     1  .  .  .  .  .  .  .  .  .  .  1  .  .  .  .  .  .  .  .  .
opalar     Z f71v.S1.10;K   *
opalar     Z f71v.S1.10;K   *
opalar     Z f71v.S1.9;C    *
opalar     Z f71v.S1.9;K    *
opalar                      opalor       1  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  1  .
opcholdy   N f67r2.L;Z      *
opcholdy                    opchaldy     1  .  .  .  .  1  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .
opysaj     Z f70v2.S1.8;C   *
opysaj                      ofaram       1  .  .  .  .  .  .  1  .  .  .  .  .  .  .  .  .  .  .  .  .
opysaj                      oforam       1  .  .  .  .  .  1  .  .  .  .  .  .  .  .  .  .  .  .  .  .
opysaj                      oparam       1  .  .  .  .  .  .  1  .  .  .  .  .  .  .  .  .  .  .  .  .
opysaj                      qoforom      1  .  .  .  1  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .
otalody    Z f71v.S2.2;C    *
otalody    Z f71v.S2.2;K    *
otalody                     okalody      1  .  .  .  .  .  .  1  .  .  .  .  .  .  .  .  .  .  .  .  .
otalody                     ykolody      1  .  .  .  .  1  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .
otalody                     ytalody      1  .  .  .  .  .  .  1  .  .  .  .  .  .  .  .  .  .  .  .  .
otarer     Z f72r2.S.20;K   *
otarer                      qotoees      1  1  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .
otcheodar  S f68r2.S.17;R   *
otcheodar                   okeesodar    1  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  1  .  .  .  .
oteeary    Z f72r2.S.16;K   *
oteeary                     oteeosy      1  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  1  .
oteeary                     qokchory     1  .  1  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .
oteeary                     yteeoey      1  .  .  .  .  .  .  1  .  .  .  .  .  .  .  .  .  .  .  .  .
oteoys     A f68v2.R.5;C    *
oteoys                      qoteoar      1  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  1  .
otoeeo     S f68r2.S.15;R   *
otoeeo                      okoeey       1  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  1  .
otoeeo                      otyshy       1  .  1  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .
otoeeo                      qokochy      1  .  .  1  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .
otoeeo                      qokoeey      1  .  .  .  .  .  .  .  .  .  .  .  .  .  .  1  .  .  .  .  .
otoeeo                      qotoeey      2  1  1  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .
otoeeo                      ykychy       1  .  1  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .
otoeeo                      ytashy       1  .  .  .  .  .  .  1  .  .  .  .  .  .  .  .  .  .  .  .  .
otoeeo                      ytychy       1  .  1  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .
otolchdy   Z f71r.S.1;K     *
otolchdy                    okalchdy     1  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  1  .
otolchdy                    okalshdy     1  .  .  .  .  1  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .
otolchdy                    otalshdy     2  .  .  .  .  .  .  .  .  1  .  .  .  .  .  .  .  .  .  .  1
otolchdy                    qokalchdy    1  .  .  .  1  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .
otoshos    S f68r2.S.21;R   *
otoshos                     otochor      1  1  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .
otshshdy   Z f70v1.S.14;K   *
otshshdy                    okechedy     3  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  3  .  .
otshshdy                    okeshedy     1  .  .  .  .  .  .  .  .  .  .  .  .  .  1  .  .  .  .  .  .
otshshdy                    otcheedy     1  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  1  .  .  .  .
otshshdy                    qokcheedy    2  .  .  .  .  .  .  .  .  .  .  .  .  .  .  1  .  .  .  .  1
otshshdy                    qokechedy    4  .  .  .  .  .  .  .  .  .  2  .  .  .  .  .  .  1  1  .  .
otshshdy                    qokeshedy    1  .  .  .  .  .  .  .  .  .  .  .  .  .  .  1  .  .  .  .  .
otshshdy                    qotcheedy    4  .  .  .  .  1  .  .  .  .  .  .  .  .  .  1  .  .  .  1  1
saloiin    P f89r2.t.4;L    *
saloiin                     solaiin      2  .  .  .  1  .  .  .  .  .  .  .  .  .  .  .  .  1  .  .  .
saloiin                     soloiin      1  .  1  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .
seeyar     A f67r1.S.3;C    *
seeyar                      cheoar       5  .  .  .  .  .  1  .  .  .  .  .  .  .  .  .  1  .  .  3  .
seeyar                      cheoor       1  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  1
seeyar                      cheyor       1  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  1  .
seeyar                      sheoar       1  .  .  .  .  .  .  .  .  .  .  .  .  .  1  .  .  .  .  .  .
seeyar                      sheoas       1  .  .  .  .  .  .  .  .  .  .  .  .  1  .  .  .  .  .  .  .
sheoraj    A f67r1.S.8;C    *
sheoraj                     chearam      1  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  1  .
sosam      P f100r.m.4;B    *
sosam                       raram        3  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  1  .  1  1
sosam                       saram        1  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  1  .  .  .  .
sosam                       soeom        1  1  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .
taol       A f67v2.C2.1;C   *
taol                        kaol         1  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  1  .  .  .
yfain      N f67r2.L;Z      *
yfain                       ofaiin       4  .  .  .  .  1  .  .  .  .  .  .  1  1  .  .  .  .  .  .  1
yfain                       opaiin       8  .  .  .  .  .  .  1  .  .  .  .  .  .  1  1  .  1  .  3  1
yfain                       qofaiin      2  .  .  .  .  .  .  .  .  .  .  .  .  2  .  .  .  .  .  .  .
yfain                       qofoiin      1  .  .  .  .  .  .  .  .  .  .  .  1  .  .  .  .  .  .  .  .
yfain                       qopaiin      4  .  .  .  .  .  .  1  .  .  .  1  .  .  .  .  1  .  .  .  1
yfain                       qopoiin      1  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  1  .
yfain                       ypaiin       1  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  1  .  .  .  .
ykocfhy    P f89r1.b.1;K    *
ykocfhy                     okacfhy      1  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  1  .  .  .  .
ykocfhy                     otacphy      1  .  .  1  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .
yshesas    A f68v2.R.3;C    *
yshesas                     orcheos      1  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  1  .  .  .  .
yshesas                     oscheor      1  .  1  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .
yshesas                     ycheeas      1  .  .  .  .  .  .  .  .  1  .  .  .  .  .  .  .  .  .  .  .
yshesas                     ycheeor      1  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  1  .  .  .
yypchy     A f68v2.R.11;C   *
yypchy                      yqopchy      1  .  .  .  .  .  .  .  .  .  .  .  .  .  .  1  .  .  .  .  .

If a reasonable fraction of these identifications is correct, there seem to be definite patterns of reference:

  * A labels from page f68v2 and thereabouts are usually mentioned in the "stars" section;

  * labels

The lines in the k/t and o/qo comparison tables above were added with

    gawk \
      ' BEGIN { \
          split("",b); t=0 \
        } \
        /./ { \
          t+=$2; for(i=3;i<=22;i++) b[i] +=$(i); \
          print; \
        } \

STOPPED HERE

Let's make a table that gives the range of POS for each panel. Each line has the physical panel location (e.g. f77v2), the first offset, and the last offset plus one.
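As a toy illustration of such a range table, the sketch below accumulates per-panel sizes into half-open [first, last+1) offset ranges. It assumes a three-column input (sequence number, panel name, panel size); that layout, the function name, and the sample sizes are hypothetical and may not match the real .occ files.

```shell
# Sketch only: turn per-panel sizes into cumulative offset ranges.
# Assumed input columns: sequence number, panel name, size.
# Output columns: sequence number, panel name, first offset, last offset + 1.
panel_ranges() {
  sort -k1,1n \
    | awk 'BEGIN { a = 0 } /./ { b = a + $3; print $1, $2, a, b; a = b }'
}

# Toy run with made-up sizes; the input need not be pre-sorted.
printf '2 f1v 80\n1 f1r 120\n3 f2r 50\n' | panel_ranges
```

Each range starts where the previous one ends, so the last column of one line always equals the third column of the next.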

    foreach w ( 0 1 )
      foreach f ( 0 1 )
        cat .vtx-${w}-${f}.occ \
          | sort -k1,1n \
          | gawk 'BEGIN {a=0} /./ {b = a+$3; print $1, $2, a, b; a=b}' \
          > .panels.chrange
      end
    end

Let's make tables that map block index to panel number and vice versa:

    echo 'block size = '$BLOCKSZ
    cat .panels.chrange \
      | tr -d '<>' \
      | gawk '/./ {printf "s/<%s>/<%03d>/g\n", $2, 1+int($3/'"$BLOCKSZ"')}' \
      > panel-to-block
    chmod a+x panel-to-block

    echo 'block size = '$BLOCKSZ
    cat .panels.chrange \
      | grep -v '' \
      | tr -d '<>' \
      | gawk '/./ {printf "%03d %s\n", 1+int($3/'"$BLOCKSZ"'), $2}' \
      | gawk 'BEGIN {n=0} /./ {while($1>n){n++;printf "s/<%03d>/<%s>/g\n", n,$2}}' \
      > block-to-panel
    chmod a+x block-to-panel

Formatting block-to-panel as a header:

    cat block-to-panel \
      | tr '<>/' '   ' \
      | gawk '/./ {print $2, $3}' \
      | sed -e 's/ f\([0-9][0-9]*\)/ f\1 /g' \
      | format-block-map-header \
      > .block-map-header

I then built a table that maps each label word to the list of locations where the label is defined:

    cat Note-010/${kind}.tmp \
      | sort \
      | gawk \
          ' BEGIN {w = ""; } \
            function dmp() \
              { \
                if (w != "") { print w, loc; } \
              } \
            ($1 != "") { \
              if ($1 != w) \
                { dmp(); w = $1; loc = ("<" $2 ">"); } \
              else \
                { loc = ( loc ",<" $2 ">"); } \
              next; \
            } \
            END { dmp(); } \
          ' \
      > Note-010/${kind}.def

    cat L16-eva/INDEX \
      | egrep -v '^[^:]*:[^:]*:[^:]*:[^:]*:(labels|letters|words|titles|-|\?):' \
      | sed -e 's/:.*$//g' \
      > Note-010/vtx.units
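The grouping step that builds the label-definition table can be tried on a few lines of toy input. This is a minimal sketch with a hypothetical function name and made-up locations, assuming the input has the word in field 1 and the location in field 2. The last group is only printed if awk's END rule (in upper case) runs, since awk treats a lower-case `end` as an ordinary variable pattern that is never true.

```shell
# Sketch only: collapse "word location" pairs into one line per word,
# listing all of its locations as <loc>,<loc>,...
# Assumed input: word in field 1, location in field 2.
group_defs() {
  LC_ALL=C sort | awk '
    function dmp() { if (w != "") print w, loc }
    $1 != "" {
      if ($1 != w) { dmp(); w = $1; loc = "<" $2 ">" }
      else         { loc = loc ",<" $2 ">" }
    }
    END { dmp() }'       # flush the last word
}

# Toy run with made-up locations p1..p3:
printf 'okal p3\naj p2\naj p1\n' | group_defs
```

Sorting first brings all lines for a word together, so a single pass with one remembered word is enough.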