Last edited on 1999-01-16 00:54:08 by stolfi

98-01-25 stolfi
===============

  I will start from the raw concordance I created before for the word
  occurrence map (Notes/010). Recall that its format was

    PNUM FNUM UNIT LINE TRANS START LENGTH POS STRING OBS
    1    2    3    4    5     6     7      8   9      10

  where

    PNUM    is a sequential page number, "001" to "234".

    FNUM    is the corresponding folio-based page number,
            "f1r" thru "f86v5" to "f106v".

    UNIT    is the code of a text unit within the page, e.g. "P" or "R1".

    LINE    is the code of a line within that unit, e.g. "27" or "10a".

    TRANS   is a letter identifying a transcriber, e.g. "F" for Friedman.

    START   is the index of the first byte of the occurrence
            in the text line (counting from 1).

    LENGTH  is the original length of the occurrence in the text,
            including fillers, comments, spaces, etc.

    POS     is a number giving the approximate position of the
            occurrence within the whole text; used for sorting, etc.

    STRING  is the non-empty string in question, without any fillers,
            comments, non-significant spaces, line breaks, etc.

    OBS     is an arbitrary non-empty string, without embedded blanks.

  First, let's make a list of the herbal text units and pages:

    cat L16-eva/INDEX \
      | gawk -v FS=':' '($2=="herbal" && $5=="parags"){print}' \
      > her.index

    cat her.index \
      | gawk -v FS=':' '/./{print $1;}' \
      > her.units

    cat her.index \
      | gawk -v FS=':' '/./{print substr($6,2,3);}' \
      | sort | uniq \
      > her.pages

  Next, we select the records from herbal pages, and do some cleanup
  on the STRINGs:

    remove every word-initial "q"
    change word-initial "y" to "o"
    change word-final "y" to "o"
    change "eeee" to "chch"
    change "eee" to "che"
    change "ee" to "ch"
    (perhaps) change "t" into "k"
    (perhaps) change "f" into "p"
    delete any interword spaces.
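
  To make the effect of these rules concrete, here is a minimal awk
  sketch of them. It is NOT the actual word-equiv.gawk (whose contents
  are not shown in these notes); the function name normalize_words and
  its two positional flags are hypothetical, and it assumes that words
  inside a STRING are separated by blanks. The "e" runs are rewritten
  longest first, so that "eeee" is not consumed by the "ee" rule.

```shell
#!/bin/sh
# Hypothetical stand-in for the cleanup rules listed above.
# Reads one STRING per line on stdin, writes the cleaned STRING.
#   $1 = equatekt (1: map "t" to "k"), $2 = equatepf (1: map "f" to "p")
normalize_words() {
  awk -v equatekt="${1:-1}" -v equatepf="${2:-1}" '
    {
      n = split($0, part, " ")         # assumed: blanks separate words
      out = ""
      for (i = 1; i <= n; i++) {
        w = part[i]
        sub(/^q/, "", w)               # remove word-initial "q"
        sub(/^y/, "o", w)              # word-initial "y" -> "o"
        sub(/y$/, "o", w)              # word-final "y" -> "o"
        gsub(/eeee/, "chch", w)        # longest "e" runs first
        gsub(/eee/, "che", w)
        gsub(/ee/, "ch", w)
        if (equatekt + 0) gsub(/t/, "k", w)   # optional t/k equivalence
        if (equatepf + 0) gsub(/f/, "p", w)   # optional f/p equivalence
        out = out w                    # joining drops interword spaces
      }
      print out
    }'
}

# Example: three sample words, both equivalences enabled.
printf '%s\n' qokeedy ytaiin cfhy | normalize_words 1 1
# prints:
#   okchdo
#   okaiin
#   cpho
```

  With both flags set to 0 the "t" and "f" letters are left alone, so
  "ytaiin" would come out as "otaiin" instead.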

  Here is the command that does this:

    set equatekt = 1
    set equatepf = 1
    cat ../010/vtx-f-eva-15.roc \
      | gawk '/./ { print $1, $2, $9; }' \
      | select-herbal-pages \
      | fix-words -f word-equiv.gawk \
          -v field=3 \
          -v stripq=1 \
          -v equatekt=${equatekt} \
          -v equatepf=${equatepf} \
      > her-f-eva-15.roc

    dicio-wc her-f-eva-15.roc

      lines   words     bytes file
      ------ ------- --------- ------------
       29506   88518    547377 her-f-eva-15.roc

  Next we reduce the data to occurrence counts per page:

    FREQ PNUM FNUM STRING
    1    2    3    4

  where FREQ is the count of occurrences of STRING on page PNUM/FNUM.

    cat her-f-eva-15.roc \
      | sort +2 -3 +0 -1 | uniq -c | expand \
      > her-f-eva-15.frq

    dicio-wc her-f-eva-15.frq

      lines   words     bytes file
      ------ ------- --------- ------------
       26397  105588    713624 her-f-eva-15.frq
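
  The "sort +2 -3 +0 -1" key syntax is obsolete and is rejected by
  many current sort implementations. The sketch below shows what I
  take to be the equivalent with POSIX "-k" keys, on a few made-up
  records. The function name count_per_page and the sample data are
  hypothetical; unlike the command above, it uses awk rather than
  expand to tidy the output, so the count is not right-aligned.

```shell
#!/bin/sh
# Reads "PNUM FNUM STRING" records on stdin and writes
# "FREQ PNUM FNUM STRING", ordered by STRING and then by PNUM.
count_per_page() {
  LC_ALL=C sort -k3,3 -k1,1 |            # same keys as "+2 -3 +0 -1"
    uniq -c |                            # count identical adjacent records
    awk '{ print $1, $2, $3, $4 }'       # single blanks between fields
}

# Example with made-up records (not taken from the real concordance).
printf '%s\n' \
  '001 f1r okchdo' \
  '002 f1v daiin' \
  '001 f1r daiin' \
  '001 f1r okchdo' \
  '001 f1r daiin' \
  '002 f1v okchdo' | count_per_page
# prints:
#   2 001 f1r daiin
#   1 002 f1v daiin
#   2 001 f1r okchdo
#   1 002 f1v okchdo
```

  As a quick consistency check on the dicio-wc figures: 88518 words is
  exactly 3 x 29506 lines for the three-field .roc file, and 105588
  words is exactly 4 x 26397 lines for the four-field .frq file.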