Let's make a joint file with all three counts:

    foreach f ( '' -rare -rarest )
      cat .label${f}-refs-by-panel.rfrq \
        | sort +0 -1 \
        > .foo${f}
    end

    join \
        -a 1 -a 2 -e 00 \
        -j1 1 -j2 1 \
        -o0,1.2,1.4,2.2,2.4 \
        .foo .foo-rare \
      > .bar

    join \
        -a 1 -a 2 -e 00 \
        -j1 1 -j2 1 \
        -o0,1.2,1.3,1.4,1.5,2.2,2.4 \
        .bar .foo-rarest \
      > .baz

    cat .baz \
      | sed -e 's/<\(.*\)>/<\1> {\1}/g' \
      | panel-to-page \
      | tr -d '{}<>' \
      | sort +0 -1n \
      | gawk '/./ {printf "%03d %-6s %5d %5d %5d %5d %5d %5d\n", $1,$2,$3,$4,$5,$6,$7,$8}' \
      > .label-refs-by-panel.jfrq

    --- .label-refs-by-panel.jfrq ------------------------
    001 f1r      116   122    12    13     2     2
    002 f1v       78   197    14    35     5    13
    003 f2r       67   157    12    28     3     7
    004 f2v       28   103     5    18     1     4
    005 f3r       90   159    12    21     5     9
    006 f3v       62   160     6    15     2     5
    007 f4r       37   123     4    13     2     7
    008 f4v       35    87     1     2     0     0
    009 f5r       22    85     5    19     0     0
    010 f5v       38   176     5    23     2     9
    ... ......   ...   ...   ...   ...   ...   ...
    262 f114v    289   147    24    12     5     3
    263 f115r    445   188    60    25    15     6
    264 f115v    342   157    37    17     1     0
    265 f116r    465   193    51    21    10     4
    ------------------------------------------------------

Let's make histograms of those ratios, sorted by page position.

    cat .label-refs-by-panel.jfrq \
      | make-label-ref-graphs \
          -v MAX1=348 -v MAX2=58 -v MAX3=20 \
      > .label-refs-by-panel.jhis

    --- .label-refs-by-panel.jhis ------------------------
    001 f1r      116   122 ooo......    12    13 oo.......     2     2 .........
    002 f1v       78   197 ooooo....    14    35 ooooo....     5    13 ooooo....
    003 f2r       67   157 oooo.....    12    28 oooo.....     3     7 ooo......
    004 f2v       28   103 oo.......     5    18 oo.......     1     4 o........
    005 f3r       90   159 oooo.....    12    21 ooo......     5     9 oooo.....
    ... ...      ...   ... ...         ...   ... ...         ...   ... ...
    261 f114r    382   170 oooo.....    35    16 oo.......     6     3 o........
    262 f114v    289   147 ooo......    24    12 o........     5     3 o........
    263 f115r    445   188 oooo.....    60    25 ooo......    15     6 oo.......
    264 f115v    342   157 oooo.....    37    17 oo.......     1     0 .........
    265 f116r    465   193 oooo.....    51    21 ooo......    10     4 o........
    ------------------------------------------------------

The most label-rich and label-poor pages are

    ALL LABELS
      146 0.005   348
      546 0.020   294
      178 0.006   293
       30 0.001    54
       11 0.000    49
       12 0.000    39

    UNDER 100
       35 0.011    58
       24 0.008    57
       93 0.030    50
       10 0.003    50
       20 0.006    47
       12 0.004    47
       14 0.004    44
       16 0.005    43
       14 0.004    42
       20 0.006    42
       11 0.003    41
        1 0.000     2
        1 0.000     2
        1 0.000     2

    UNDER 25
        5 0.007    20
       12 0.017    20
        7 0.010    19
        3 0.004    18
        5 0.007    16
        8 0.012    15
        5 0.007    15
       28 0.041    15
        3 0.004    15
        3 0.004    15
        1 0.001     1
        1 0.001     1
        2 0.003     1
        1 0.001     1
        1 0.001     1
        1 0.001     1
        2 0.003     1
        1 0.001     0
        1 0.001     0

The UNDER 100 class is distributed fairly uniformly among the most labelliferous pages. The UNDER 25 class has a steeper distribution.

The "starred paragraph" pages (including f58r) are not exceptionally labelliferous in relative terms; their absolute counts are high only because they contain a lot of paragraphical text. Page f40r has many labels of the ALL and UNDER 100 classes, but only 2 of the UNDER 25 class. On the other hand, f99v is label-rich in all three classes. The same can be said of page f58r, except for a modest drop in the UNDER 25 class.

Let's find WHICH labels were mentioned on pages f58r and f99v. I had previously written a gawk script "show-occurrences" to show the occurrences of a bunch of words in a text. Let's run it, just as a check:

    foreach f ( f58r f99v )
      cat .units-parags.dir \
        | egrep "^${f}[:.]" \
        | sed \
            -e 's/:.*$//g' \
            -e 's:^:L16-ecc-x/:g' \
        > .tmp
      cat `cat .tmp` \
        | find-and-show-occurrences .labels-rarest.dic \
        > .label-rarest-occs2-$f
    end

Let's now generate the same files from the occurrence lists. We will add to the latter the label's definition code.

    foreach f ( '' '-rare' '-rarest' )
      cat .label${f}-occurrences.idx \
        | sort -b +3 -4 \
        > .occ
      cat .labels-first.def \
        | sort -b +1 -2 \
        | tr '<>' '{}' \
        > .def
      join \
          -a1 -e '{???}' \
          -j1 4 -j2 2 \
          -o1.1,1.2,1.3,0,2.3 \
          .occ .def \
        | sort -b +2 -3n +3 -4 \
        > .label${f}-occ-def.idx
    end

    foreach f ( f58r f99v )
      cat .units-parags.dir \
        | egrep "^${f}[:.]" \
        | sed \
            -e 's/:.*$//g' \
            -e 's:^:L16-ecc-x/:g' \
        > .tmp
      foreach g ( '' '-rare' '-rarest' )
        cat .label${g}-occ-def.idx \
          | egrep "<${f}[.]" \
          > .occ
        cat `cat .tmp` \
          | show-occurrences .occ \
          > .label${g}-occs-$f
      end
    end

Now let's prepare a similar file in FSG format. We must use the FSG text, and we must replace each ECC label in the occurrence file by its corresponding FSG label.

For the second part, let's first extract the labels in FSG notation, just as we did for ECC. Note that we must remove the "."s from the text, but keep the "-"s and "="s:

    cat .units-labels.dir \
      | sed \
          -e 's/:.*$//g' \
          -e 's:^:L16/:g' \
      > .tmp
    cat `cat .tmp` \
      > .labels-m-fsg.evt

    cat .units-parags.dir \
      | sed \
          -e 's/:.*$//g' \
          -e 's:^:L16/:g' \
      > .tmp
    cat `cat .tmp` \
      > .parags-m-fsg.evt

    /bin/rm -f .labels-fsg.def

    cat .labels-m-fsg.evt \
      | remove-comments-from-evt \
      | sed \
          -e 's/ *//g' \
          -e 's/;[A-Z]>/>/g' \
          -e 's/[-=]//g' \
          -e 's/>\(.*\)[.]/>\1/g' \
          -e 's/>\(.*\)[.]/>\1/g' \
          -e 's/>\(.*\)[.]/>\1/g' \
          -e 's/>\(.*\)[.]/>\1/g' \
          -e 's/>\(.*\)[.]/>\1/g' \
          -e 's/>/> /g' \
      | egrep ' .' \
      > .labels1.def

    cat .labels-m-fsg.evt \
      | remove-comments-from-evt \
      | /n/gnu/bin/sed \
          -e 's/ *//g' \
          -e 's/;[A-Z]>/>/g' \
          -e 's/[=-]$//g' \
          -e 's/^/@/g' \
          -e 's/>\(.*\)[.]/>\1/g' \
          -e 's/>\(.*\)[.]/>\1/g' \
          -e 's/>\(.*\)[.]/>\1/g' \
          -e 's/>\(.*\)[.]/>\1/g' \
          -e 's/>\(.*\)[.]/>\1/g' \
          -e 's/@\(<[^>]*>\)\([^ @-][^ @-]*\)[ -][ -]*/@\1\2@\1/g' \
          -e 's/@\(<[^>]*>\)\([^ @-][^ @-]*\)[ -][ -]*/@\1\2@\1/g' \
          -e 's/@\(<[^>]*>\)\([^ @-][^ @-]*\)[ -][ -]*/@\1\2@\1/g' \
          -e 's/@\(<[^>]*>\)\([^ @-][^ @-]*\)[ -][ -]*/@\1\2@\1/g' \
          -e 's/@\(<[^>]*>\)\([^ @-][^ @-]*\)[ -][ -]*/@\1\2@\1/g' \
          -e 's/>/> /g' \
      | tr '@' '\012' \
      | egrep ' .' \
      > .labels2.def

    cat .labels1.def .labels2.def \
      | sort | uniq \
      | sed \
          -e 's/<\(.*\)> \(.*\)/<\1> \2 {\1}/g' \
          -e 's/\.[^>]*> */> /g' \
      | panel-to-page \
      | tr '{}' '<>' \
      > .labels-fsg.def

Now make a file that lists each FSG label with the equivalent ECC label:

    cat .labels-fsg.def \
      | tr '<>' '{}' \
      | gawk 'BEGIN{n=0} /./ {n++; printf "%s %s %s\n",n,$2,$3}' \
      > .foo-fsg
    cat .foo-fsg \
      | fsg2ecc \
      > .foo-ecc
    join \
        -j1 1 -j2 1 \
        -o1.2,2.2,2.3 \
        .foo-ecc .foo-fsg \
      | sort | uniq \
      > .label-ecc-fsg.map

Now let's use that table to "translate" the label occurrence index from ECC to FSG.

    foreach g ( '' '-rare' '-rarest' )
      cat .label${g}-occ-def.idx \
        | translate-occurrences-ecc-to-fsg \
            .label-ecc-fsg.map \
        > .label-fsg${g}-occ-def.idx
    end

Finally, let's list the occurrences on pages f58r and f99v (the loops below also cover f95v2, f69r, and f9v). We must "fake" entries with transcription code ";S" in order for show-occurrences to work.

    foreach f ( f58r f99v f95v2 f69r f9v )
      cat .units-parags.dir \
        | egrep "^${f}[:.]" \
        | sed \
            -e 's/:.*$//g' \
            -e 's:^:L16/:g' \
        > .tmp-$f
    end

    foreach f ( f58r f99v f95v2 f69r f9v )
      foreach g ( '' '-rare' '-rarest' )
        cat .label-fsg${g}-occ-def.idx \
          | egrep "<${f}[.]" \
          > .occ
        cat `cat .tmp-$f` \
          | fake-S-transcription-codes \
          | show-occurrences .occ \
          | egrep -v '^<[^>]*;S> *$' \
          > .label-fsg${g}-occs-$f
      end
    end

Curiously, label f89v1.t.4 is identical to f89v1.b.4.
There is a comment on the latter:

    #looks like 3 plants and 4 names. none of the other "vases" seem to be named.
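
Coincidences like this one could also be looked for mechanically. What follows is only a minimal sketch, not one of the scripts used above. It assumes that each line of a definition file such as .labels-fsg.def has three blank-separated fields (a page key, the label text, and the full locator). That layout is my reading of the pipeline that builds the file, not something checked against the file itself. The function name find_duplicate_labels and the sample entries are hypothetical, and the sketch is written for a Bourne-type shell with plain awk rather than csh and gawk.

```shell
# Sketch only: list label texts that are defined under more than one locator.
# ASSUMPTION: input lines look like "<page> text <locator>" (three fields).
find_duplicate_labels() {
  awk '
    NF >= 3 {
      text = $2; loc = $3
      if (text in locs) {
        # Seen before: append this locator and remember the text as a duplicate.
        locs[text] = locs[text] " " loc
        dup[text] = 1
      } else {
        locs[text] = loc
      }
    }
    END {
      # One output line per duplicated text: the text, then all its locators.
      for (t in dup) print t, locs[t]
    }
  ' | sort
}

# Made-up sample entries (not taken from the real files).
printf '%s\n' \
  '<p1> aaaa <p1.t.4>' \
  '<p1> aaaa <p1.b.4>' \
  '<p2> bbbb <p2.t.1>' \
  | find_duplicate_labels
```

On the made-up sample this prints the single line "aaaa <p1.t.4> <p1.b.4>". On the real data one would run something like "cat .labels-fsg.def | find_duplicate_labels", provided the three-field assumption holds.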