Hacking at the Voynich manuscript - Side notes
101 Preparing clean samples of various other languages

Last edited on 2023-05-10 23:44:30 by stolfi

SUMMARY

  Here we prepare text samples in English, Latin, and other languages,
  comparable in size to the Voynichese reference sample, for the
  statistical analyses that will go into the "word structure"
  technical report.

SETTING UP THE ENVIRONMENT

  Links:

    ln -s ../tr-stats/dat
    ln -s ../tr-stats/exp
    ln -s ../../../work
    ln -s ../../../langbank

    ln -s work/wds-to-tlw
    wds-to-tlw.sh
    ln -s work/update-paper-include
    update-paper-include.sh
    ln -s work/format-words-filled
    format-words-filled.sh
    ln -s work/compute-freqs
    compute-freqs.sh

NAMING THE SAMPLE TEXTS

  A sample is identified by a pair {smp = "{LANG}/{BOOK}"} where
  {LANG} is the general language and writing system ("engl" for
  English in standard spelling, "chip" for Mandarin Chinese in pinyin,
  etc.) and {BOOK} is the source document ("wow" for War of the
  Worlds, "ptt" for the Pentateuch, "voa" for Voice of America
  broadcasts, etc.).

  A sample may be divided into sections or sub-samples for the purpose
  of studying statistical variations within the same document. A
  section is identified by a name {sec = "{TAG}.{N}"} where {TAG} is a
  descriptive string ("gen" for Genesis, "exo" for Exodus, "hea" for
  Herbal-A) and {N} is a serial number in case the section is split
  into separate segments (as the VMS Herbal-A is split into "hea.1",
  "hea.2", etc.).

  Every sample must have a "tot.1" section (the whole sample), which
  must be processed only after any other sections.

  The samples, sections, and their attributes are specified in the
  table "sample-sections.tbl".

FORMAT OF THE SAMPLE TEXTS

  For each sample {smp="{LANG}/{BOOK}"} and each section
  {sec="{TAG}.{N}"} (including "tot.1"), we produce a file called
  "dat/{smp}/{sec}/whole.tlw" that contains the corresponding "raw"
  tokens, suitably encoded, filtered, and tagged as "good" or "bad"
  for linguistic analysis.
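  The naming scheme above can be illustrated with a small shell
  sketch. The helper names {split_smpsec} and {tlw_path} are
  hypothetical and are not among the scripts linked above; they only
  show how a combined identifier such as "latn/ptt/gen.1" decomposes
  into {LANG}, {BOOK}, {TAG} and {N}, and how the corresponding file
  name is formed.

```shell
# Hypothetical helper (not part of the notebook's scripts):
# split "{LANG}/{BOOK}/{TAG}.{N}" into its four components.
split_smpsec() {
  local smpsec="$1"
  local sec="${smpsec##*/}"   # part after the last "/":  "gen.1"
  local smp="${smpsec%/*}"    # part before the last "/": "latn/ptt"
  local lang="${smp%%/*}"     # {LANG}
  local book="${smp#*/}"      # {BOOK}
  local tag="${sec%.*}"       # {TAG}
  local n="${sec##*.}"        # {N}
  printf 'lang=%s book=%s tag=%s n=%s\n' "$lang" "$book" "$tag" "$n"
}

# Hypothetical helper: file name "dat/{smp}/{sec}/{sizeopt}.tlw".
tlw_path() {
  local smp="$1" sec="$2" sizeopt="$3"
  printf 'dat/%s/%s/%s.tlw\n' "$smp" "$sec" "$sizeopt"
}
```

  For instance, "split_smpsec latn/ptt/gen.1" separates the language
  "latn", the book "ptt", the tag "gen" and the serial number "1",
  and "tlw_path latn/ptt gen.1 whole" gives the name of the file
  holding the whole Genesis section of that sample.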
  The file "whole.tlw" is derived from a reference file
  "LANG/BUK/main.wds" from my Linguistic Sample Bank
  ("/home/staff/stolfi/projects/langbank/"). This file is also linked
  as "dat/{smp}/org/main.wds". The "LANG/BUK" of the source may be
  different from the "LANG/BUK" of the sample {smp}; for example, the
  Langbank source "engl/wow" is used to produce the samples "engl/wow"
  (full text in lowercase), "engl/wnm" (proper names only), and
  "envg/wow" (Vigenère-coded text).

  A truncated version "dat/{smp}/{sec}/trunc.tlw" of the "whole.tlw"
  file is also created, containing a specified maximum number of
  "good" words. The roughly uniform sizes of these files make them
  more suitable for certain comparative analyses, such as Zipf law
  plots.

  For compatibility with the VMS samples, each ".tlw" file must
  EXCLUDE any blanks, embedded comments, alignment fillers, or
  punctuation (including line and paragraph breaks), as well as any
  other tokens that are semantically equivalent to them. On the other
  hand, the ".tlw" files must INCLUDE any symbols that stand for words
  in the text, such as numbers, abbreviations, and "*"-surrogates for
  illegible or omitted words.

  In particular, each ".tlw" file must EXCLUDE any "null-like
  intrusions", i.e. undesirable sub-sections of the selected section
  that are syntactically equivalent to a null or blank space -- such
  as margin notes, footnotes and footnote marks, etc. On the other
  hand, it must INCLUDE any "symbol-like intrusions", i.e. undesirable
  sub-sections that play a syntactic role in the text -- such as
  formulas, tables, poems, foreign phrases, etc. However, the tokens
  of a symbol-like intrusion should be replaced by "*" symbols to
  clearly distinguish them from valid words; and, if the intrusion has
  two or more consecutive tokens, only the first and last one should
  be kept.
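  The rule for symbol-like intrusions can be sketched as a small
  filter. This is only an illustration, not the code of
  {wds-to-tlw}: the function name {squeeze_intrusions} is
  hypothetical, the input is assumed to have one token per line in
  the form "{TYPE} {WORD}" with intrusion tokens already marked as
  type "x", and plain "awk" is used instead of "gawk" for
  portability.

```shell
# Hypothetical sketch: replace each token of type "x" by "*", and
# squeeze every run of consecutive "x" tokens to its first and last.
# Input and output: one token per line, "{TYPE} {WORD}".
squeeze_intrusions() {
  awk '
    # Close a pending run of "x" tokens. The first token of a run is
    # printed on entry; a run of two or more still owes its last one.
    function end_run() { if (run >= 2) print "x", "*"; run = 0 }

    $1 == "x" { run++; if (run == 1) print "x", "*"; next }

    # Any other token ends the current run and is copied unchanged.
    { end_run(); print $1, $2 }

    END { end_run() }
  '
}
```

  With this sketch, an intrusion of three tokens such as "3 + 4"
  between two ordinary words becomes two "*" entries, while an
  isolated one-token intrusion becomes a single "*".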
  The entries of a ".tlw" file have the format "{TYPE} {LOC} {WORD}",
  where:

    {TYPE} is either "a" or "s";

    {LOC} is the full location of the token (section ID plus line
    number) with components delimited by braces (e.g.
    "{b1}{c3}{sA}{tx}{73}"); and

    {WORD} is the word or symbol in question.

  The type "a" denotes plain ("gud") words of the language that are
  suitable for letter-level linguistic analysis, such as letter
  frequencies and correlations, word length distribution, etc. The
  type "s" denotes anomalous ("bad") words that ought to be excluded
  from such analyses, such as numerals, abbreviations, symbols,
  foreign-language intrusions, unreadable words, etc. Note that the
  "bad" words cannot be discarded, because they are relevant for some
  investigations, such as word-pair correlations, concordances, etc.

  The conversion of the source "main.wds" to the sample files
  "whole.tlw" and "trunc.tlw" is done by {wds-to-tlw} and consists of
  the following steps:

    (1) Read the tokens from "main.wds", which must have been roughly
    classified as comments (type="#"), alpha words ("a"), symbol words
    ("s"), punctuation chars ("p"), section starts ("$") and line
    starts ("@").

    (2) Test each input token with a sample-specific procedure
    {smp_reclassify_word} from the library {smp}/sample-fns.gawk. This
    procedure should re-assign type "x" to any tokens that lie outside
    section ${SEC} or are symbol-like intrusions, and type "n" to any
    null-like intrusions.

    (4) Based on the reclassified type, discard all entries except
    "a", "s" and "x".

    (5) To the remaining words, apply a sample-specific global word
    transformation, e.g. map upper to lower case, map Chinese
    ideograms to pinyin, change the alphabet, delete vowels or
    diacritics, split hyphenated words, etc. This is also a chance to
    delete any undesired words that could not be discarded by
    {smp_reclassify_word}.
    This step uses a sample-specific word substitution table
    {smp}/word-map.tbl followed by a sample-specific function
    {smp_fix_word} from the library {smp}/sample-fns.gawk.

    (6) Insert the {LOC} field, prepend "*" to any "x" words (to avoid
    confusion), and squeeze any long runs of the latter.

    (7) Re-classify each "a" and "s" word as "gud" or "bad", using the
    predicate {smp_is_good_word} from the same library; "x" words are
    automatically "bad". Replace the type tag by "a" for the "gud"
    words, "s" for the "bad" words. Write both types to the "all.tlw"
    file.

  The "whole.tlw" file is then truncated after a specified number of
  "gud" records, producing the "trunc.tlw" file.

  The file "trunc.tlw" for each sample and section is copied to
  "raw.tlw" for compatibility with other Notes workbooks, and then
  split into

    dat/{smp}/{sec}/gud.tlw - the "gud" subset.
    dat/{smp}/{sec}/bad.tlw - the "bad" subset.

  From these files are also created the derived files

    dat/{smp}/{sec}/raw.wfr - word counts in "raw.tlw".
    dat/{smp}/{sec}/gud.wfr - word counts in "gud.tlw".
    dat/{smp}/{sec}/bad.wfr - word counts in "bad.tlw".

    dat/{smp}/{sec}/raw.wdf - tokens from "raw.tlw", w/o locations, line-filled.
    dat/{smp}/{sec}/gud.wdf - tokens from "gud.tlw", w/o locations, line-filled.
    dat/{smp}/{sec}/bad.wdf - tokens from "bad.tlw", w/o locations, line-filled.

    dat/{smp}/{sec}/raw-wds-summary.tex - TeX include file for tech report
    dat/{smp}/{sec}/gud-wds-summary.tex - TeX include file for tech report
    dat/{smp}/{sec}/bad-wds-summary.tex - TeX include file for tech report

  The files {raw,gud,bad}.{tlw,wdf,wfr} are temporarily created also
  for the full sample "{smp}/{sec}/whole.tlw", but are then
  overwritten for the "trunc" version.
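  The truncation and the gud/bad split described above can be
  sketched as follows. The function names {truncate_tlw} and
  {split_tlw} are hypothetical, and the sketch assumes that
  truncation stops immediately after the last allowed "gud" record
  (the notes do not say whether trailing "bad" records are kept).
  Plain "awk" is used instead of "gawk" for portability.

```shell
# Hypothetical sketch: copy ".tlw" records ("{TYPE} {LOC} {WORD}")
# from stdin to stdout, stopping right after the record that brings
# the number of "gud" (type "a") records up to the given maximum.
truncate_tlw() {
  awk -v max="$1" '
    good >= max { exit }       # quota already reached: stop reading
    { print }                  # copy the record, gud or bad
    $1 == "a" { good++ }       # count only the gud records
  '
}

# Hypothetical sketch: split {DIR}/raw.tlw into {DIR}/gud.tlw
# (type "a") and {DIR}/bad.tlw (type "s").
split_tlw() {
  # Create both outputs first, so that they exist even when empty.
  : > "$1/gud.tlw"
  : > "$1/bad.tlw"
  awk -v dir="$1" '
    $1 == "a" { print > (dir "/gud.tlw"); next }
    $1 == "s" { print > (dir "/bad.tlw") }
  ' "$1/raw.tlw"
}
```

  For example, "truncate_tlw 35027 < whole.tlw > trunc.tlw" would
  cut a sample at the size of the Voynichese "prs" reference sample,
  keeping any "bad" records that precede the last "gud" one.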
GETTING THE SAMPLE SIZES FOR VOYNICHESE

  Get the number of good tokens in the Voynichese reference sample
  (plain prose and labels):

    vvers=( prs lab maj )
    for book in "${vvers[@]}" ; do
      cat dat/voyn/${book}/tot.1/gud.wfr \
        | gawk '/./{s+=$1} END{print s}' \
        > .tmp
      printf "${book} = %8d\n" `cat .tmp` 1>&2
    done

    prs =    35027
    lab =     1003
    maj =    36030

CHOOSING THE WORD SAMPLES FROM LANGUAGES OTHER THAN VOYNICHESE

  Gather the list ${smpsecs} of samples and sections, and the list
  ${smps} of samples without sections:

    cat sample-sections.tbl \
      | gawk '/^ *([#]|$)/ { next; } // { print $1; }' \
      > .tmp
    smpsecs=( `cat .tmp` )
    echo "=== samples and sections =="
    echo ${smpsecs[@]} | tr ' ' '\012'

    echo ${smpsecs[@]} \
      | tr ' ' '\012' \
      | sed -e 's:[/][^/]*$::' \
      | uniq \
      > .tmp
    smps=( `cat .tmp` )
    echo "=== samples =="
    echo ${smps[@]} | tr ' ' '\012'

  Create files dat/${smp}/sections.tags and dat/${smp}/sections-ok.tags
  containing the list of sections (other than "tot.1") for each
  ${smp}, create the directories for each sample and section, and
  remove derived files:

    for smp in ${smps[@]} ; do
      echo " "
      if [[ ! ( -d dat/${smp} ) ]]; then mkdir -p dat/${smp}; fi
      if [[ ! ( -d exp/${smp} ) ]]; then mkdir -p exp/${smp}; fi
      sfile="dat/${smp}/sections.tags"
      sokfile="dat/${smp}/sections-ok.tags"
      cat sample-sections.tbl \
        | egrep -e '^ *'"${smp}/" \
        | gawk '// { s = $1; sub(/^.*[\/]/, "", s); print s; }' \
        | egrep -v -e '^tot[.]1$' \
        > ${sfile}
      echo "${smp} = " `cat ${sfile}`
      cp -p ${sfile} ${sokfile}
      echo "${smp} ok = " `cat ${sokfile}`
      for sec in `cat ${sokfile}` tot.1 ; do
        smpsec="${smp}/${sec}"
        if [[ ! ( -d dat/${smpsec} ) ]]; then mkdir -p dat/${smpsec}; fi
        if [[ ! ( -d exp/${smpsec} ) ]]; then mkdir -p exp/${smpsec}; fi
        rm -v dat/${smpsec}/*{-wds-summary.tex,-words.tex,.tlw,.wdf,.wfr}
        rm -v exp/${smpsec}/*{-wds-summary.tex,-words.tex}
      done
    done

CREATING THE SAMPLE FILES FOR OTHER LANGUAGES

  We do two passes on the original text file, with {sizeopt} equal to
  "whole" and "trunc", respectively.
  In each pass, for each sample and each section {smpsec = smp/sec}
  (including the pseudo-section "tot.1"), we create a word list
  "dat/{smpsec}/{sizeopt}.tlw" with the words from the text, one per
  line. The "whole" version uses the full source text, while the
  "trunc" version truncates the text to a prescribed number of "good"
  words (see below).

    for sizeopt in whole trunc ; do
      echo "### getting main word files dat/*/*/*/${sizeopt}.tlw ###" 1>&2
      for smp in ${smps[@]} ; do
        echo " "
        get-sample-tlw-file.sh ${smp} ${sizeopt}
      done
    done

  Next, for each {sizeopt} ("whole" then "trunc"), the file
  "dat/{smpsec}/{sizeopt}.tlw" is copied to "dat/{smpsec}/raw.tlw".
  Then we split the "raw.tlw" word list into "gud.tlw" and "bad.tlw"
  files as explained above. For each {kind} in "raw", "gud", "bad", we
  also generate the files

    "dat/{smpsec}/{kind}.wdf" - words formatted as running text.
    "dat/{smpsec}/{kind}.wfr" - word counts and frequencies.

  Note that these files are overwritten in the second pass, so in the
  end they correspond only to the truncated versions.

  In each pass we also create the TeX parameter file
  "dat/{smpsec}/{sizeopt}-{kind}-wds-summary.tex" and copy it to
  "exp/{smpsec}/{sizeopt}-{kind}-wds-summary.tex".

  Finally, for each {sizeopt} and {kind} we create a global table
  "summary-{sizeopt}-{kind}.txt" with basic data of all samples and
  sections. Note that the table must be created inside the loop over
  {sizeopt}, since the files {raw,gud,bad}.tlw are overwritten.
    for sizeopt in whole trunc ; do
      echo "### creating {raw,gud,bad}.tlw files from ${sizeopt}.tlw ###"
      for smp in ${smps[@]} ; do
        echo " "
        get-sample-raw-gud-bad-files.sh ${smp} ${sizeopt}
        echo " "
      done
      summarize-counts.sh ${sizeopt} ${smps[@]}
    done

  Print summaries:

    for kind in raw gud bad ; do
      printf "\n"
      paste summary-{whole,trunc}-${kind}.txt | expand | sed -e 's:^: :g'
    done

    # Counts for raw text (whole)              # Counts for raw text (trunc)
    # sample/sec      tokens   words  unique   # sample/sec      tokens   words  unique
    # -------------- ------- ------- -------   # -------------- ------- ------- -------
      hebr/tav/gen.1   18744    7213    5100     hebr/tav/gen.1   18744    7213    5100
      hebr/tav/exo.1   15079    5712    3882     hebr/tav/exo.1   15079    5712    3882
      hebr/tav/num.1   14862    5307    3697     hebr/tav/num.1   14862    5307    3697
      hebr/tav/lev.1   10509    3861    2609     hebr/tav/lev.1   10509    3861    2609
      hebr/tav/deu.1   12962    5456    3972     hebr/tav/deu.1   12962    5456    3972
      hebr/tav/tot.1   72156   20977   14023     hebr/tav/tot.1   38077   12488    8498
      hebr/tad/tot.1   72156   19557   12807     hebr/tad/tot.1   38112   11857    7842
      geez/gok/tot.1   34788   12356    8385     geez/gok/tot.1   34788   12356    8385
      geez/eno/tot.1   18215    6356    4228     geez/eno/tot.1   18215    6356    4228
      engl/wow/tot.1   61191    6799    3252     engl/wow/tot.1   35606    4878    2472
      engl/wnm/tot.1     831     194     100     engl/wnm/tot.1     831     194     100
      engl/cul/pre.1    2824     799     495     engl/cul/pre.1    2824     799     495
      engl/cul/her.1  116329    5855    2495     engl/cul/her.1   36193    3489    1613
      engl/cul/rec.1    7084    1260     642     engl/cul/rec.1    7084    1260     642
      engl/cul/tot.1  126237    6379    2721     engl/cul/tot.1   36201    3637    1728
      engl/cpn/tot.1     544     402     323     engl/cpn/tot.1     544     402     323
      engl/twp/tot.1   95816    6848    3500     engl/twp/tot.1   41419    4222    2242
      latn/ptt/gen.1   26748    5714    3485     latn/ptt/gen.1   26748    5714    3485
      latn/ptt/exo.1   21271    4702    2790     latn/ptt/exo.1   21271    4702    2790
      latn/ptt/num.1   20604    4341    2595     latn/ptt/num.1   20604    4341    2595
      latn/ptt/lev.1   14633    3234    1909     latn/ptt/lev.1   14633    3234    1909
      latn/ptt/deu.1   19461    4467    2815     latn/ptt/deu.1   19461    4467    2815
      latn/ptt/tot.1  102717   13947    7568     latn/ptt/tot.1   37104    6634    3875
      latn/nwt/mat.1   17502    3914    2280     latn/nwt/mat.1   17502    3914    2280
      latn/nwt/mrk.1   10959    2916    1812     latn/nwt/mrk.1   10959    2916    1812
      latn/nwt/luk.1   19155    4407    2743     latn/nwt/luk.1   19155    4407    2743
      latn/nwt/joh.1   14905    2524    1377     latn/nwt/joh.1   14905    2524    1377
      latn/nwt/tot.1   62521    7994    3948     latn/nwt/tot.1   37253    5741    2948
      latn/ock/tot.1   37637    5828    3017     latn/ock/tot.1   35389    5643    2947
      grek/nwt/mat.1   19816    3959    2350     grek/nwt/mat.1   19816    3959    2350
      grek/nwt/mrk.1   12310    2899    1842     grek/nwt/mrk.1   12310    2899    1842
      grek/nwt/luk.1   21037    4610    3015     grek/nwt/luk.1   21037    4610    3015
      grek/nwt/joh.1   16798    2587    1422     grek/nwt/joh.1   16798    2587    1422
      grek/nwt/tot.1   69961    8302    4163     grek/nwt/tot.1   37003    5437    2824
      span/qvi/one.1  179274   14289    7493     span/qvi/one.1   35549    5467    3248
      span/qvi/two.1  190831   16084    8585     span/qvi/two.1   35625    5715    3568
      span/qvi/tot.1  370105   22563   11235     span/qvi/tot.1   35605    5600    3409
      ital/psp/tot.1  219894   19053    9728     ital/psp/tot.1   35621    6655    4106
      fran/tal/tot.1   55551    8242    4648     fran/tal/tot.1   36012    6344    3785
      port/csm/tot.1   64691    9079    5116     port/csm/tot.1   35056    6278    3778
      germ/sim/tot.1  185396   18657   10099     germ/sim/tot.1   35274    6879    4265
      russ/pic/tot.1   47369   11837    7940     russ/pic/tot.1   36263    9767    6663
      russ/ptt/gen.1   28445    4899    2704     russ/ptt/gen.1   28445    4899    2704
      russ/ptt/exo.1   22960    4084    2112     russ/ptt/exo.1   22960    4084    2112
      russ/ptt/num.1   22530    3952    2142     russ/ptt/num.1   22530    3952    2142
      russ/ptt/lev.1   16901    2659    1305     russ/ptt/lev.1   16901    2659    1305
      russ/ptt/deu.1   20988    3913    2238     russ/ptt/deu.1   20988    3913    2238
      russ/ptt/tot.1  111824   12034    5926     russ/ptt/tot.1   35027    5521    2911
      arab/quf/tot.1   83724   19921   12968     arab/quf/tot.1   37054   10983    7392
      arab/quv/tot.1   83724   19586   12642     arab/quv/tot.1   37040   10800    7219
      arab/qud/tot.1   83717   15325    9115     arab/qud/tot.1   37001    8536    5247
      arab/qph/tot.1   84081   17381   10742     arab/qph/tot.1   36980    9435    6044
      arab/qcs/tot.1   80448   15874    9603     arab/qcs/tot.1   37102    9026    5649
      viet/ptt/gen.1   43448    1796     432     viet/ptt/gen.1   36162    1693     423
      viet/ptt/exo.1   34775    1652     370     viet/ptt/exo.1   34775    1652     370
      viet/ptt/num.1   38067    1488     365     viet/ptt/num.1   35949    1462     369
      viet/ptt/lev.1   25831    1210     341     viet/ptt/lev.1   25831    1210     341
      viet/ptt/deu.1   32092    1617     441     viet/ptt/deu.1   32092    1617     441
      viet/ptt/tot.1  174213    2687     489     viet/ptt/tot.1   36022    1634     397
      viet/nwt/mat.1   26411    1821     566     viet/nwt/mat.1   26411    1821     566
      viet/nwt/mrk.1   16326    1575     558     viet/nwt/mrk.1   16326    1575     558
      viet/nwt/luk.1   28276    2118     750     viet/nwt/luk.1   28276    2118     750
      viet/nwt/jhn.1   22428    1290     428     viet/nwt/jhn.1   22428    1290     428
      viet/nwt/tot.1   93441    2739     686     viet/nwt/tot.1   36005    2012     570
      chin/ptt/gen.1   46397    1504     301     chin/ptt/gen.1   36068    1377     276
      chin/ptt/exo.1   36263    1425     275     chin/ptt/exo.1   36028    1425     277
      chin/ptt/num.1   37906    1304     312     chin/ptt/num.1   36034    1292     310
      chin/ptt/lev.1   26404    1096     261     chin/ptt/lev.1   26404    1096     261
      chin/ptt/deu.1   32282    1434     336     chin/ptt/deu.1   32282    1434     336
      chin/ptt/tot.1  179252    2178     278     chin/ptt/tot.1   36056    1393     280
      chin/ptn/gen.1   50279    1556     317     chin/ptn/gen.1   35736    1381     312
      chin/ptn/exo.1   41000    1451     305     chin/ptn/exo.1   35725    1440     321
      chin/ptn/num.1   40542    1309     294     chin/ptn/num.1   35657    1255     288
      chin/ptn/lev.1   29292    1170     274     chin/ptn/lev.1   29292    1170     274
      chin/ptn/deu.1   35979    1464     368     chin/ptn/deu.1   35627    1458     367
      chin/ptn/tot.1  197092    2267     318     chin/ptn/tot.1   35720    1406     291
      chin/red/tot.1  710905    4273     585     chin/red/tot.1   35263    2421     663
      chin/voa/tot.1   59835    1954     412     chin/voa/tot.1   35691    1674     381
      chip/voa/tot.1   60002     933     114     chip/voa/tot.1   35342     832      98
      tibe/vim/tot.1   53356    1473     391     tibe/vim/tot.1   35077    1304     372
      tibe/ccv/tot.1   88669    1166     300     tibe/ccv/tot.1   35049     855     203
      tibe/pmi/tot.1  143331    2946     674     tibe/pmi/tot.1   35034    1968     518
      chrc/red/tot.1  710905    4273     585     chrc/red/tot.1   35263    2421     663
      enrc/wow/tot.1   61191    6799    3252     enrc/wow/tot.1   35606    4878    2472
      envt/wow/tot.1   70119    2692     458     envt/wow/tot.1   58343    2591     467
      envg/wow/tot.1   61191   19130   13043     envg/wow/tot.1   35606   12920    9134
      voyp/grs/tot.1    1950     635     365     voyp/grs/tot.1    1950     635     365
      voyp/grm/tot.1     726     313     208     voyp/grm/tot.1     726     313     208
      viep/grs/tot.1   31200    7760    3216     viep/grs/tot.1   31200    7760    3216
      viep/mky/tot.1   40398    3472    1161     viep/mky/tot.1   36013    3342    1174

    # Counts for gud text (whole)              # Counts for gud text (trunc)
    # sample/sec      tokens   words  unique   # sample/sec      tokens   words  unique
    # -------------- ------- ------- -------   # -------------- ------- ------- -------
      hebr/tav/gen.1   17211    7212    5100     hebr/tav/gen.1   17211    7212    5100
      hebr/tav/exo.1   13870    5711    3882     hebr/tav/exo.1   13870    5711    3882
      hebr/tav/num.1   13573    5306    3697     hebr/tav/num.1   13573    5306    3697
      hebr/tav/lev.1    9650    3860    2609     hebr/tav/lev.1    9650    3860    2609
      hebr/tav/deu.1   12007    5455    3972     hebr/tav/deu.1   12007    5455    3972
      hebr/tav/tot.1   66311   20976   14023     hebr/tav/tot.1   35027   12487    8498
      hebr/tad/tot.1   66311   19556   12807     hebr/tad/tot.1   35027   11856    7842
      geez/gok/tot.1   34291   12272    8344     geez/gok/tot.1   34291   12272    8344
      geez/eno/tot.1   17736    6274    4193     geez/eno/tot.1   17736    6274    4193
      engl/wow/tot.1   60293    6789    3244     engl/wow/tot.1   35027    4869    2465
      engl/wnm/tot.1     831     194     100     engl/wnm/tot.1     831     194     100
      engl/cul/pre.1    2763     778     480     engl/cul/pre.1    2763     778     480
      engl/cul/her.1  112695    5685    2402     engl/cul/her.1   35027    3399    1551
      engl/cul/rec.1    6771    1240     635     engl/cul/rec.1    6771    1240     635
      engl/cul/tot.1  122229    6193    2620     engl/cul/tot.1   35027    3544    1667
      engl/cpn/tot.1     541     400     322     engl/cpn/tot.1     541     400     322
      engl/twp/tot.1   81498    6799    3465     engl/twp/tot.1   35027    4202    2225
      latn/ptt/gen.1   25217    5713    3485     latn/ptt/gen.1   25217    5713    3485
      latn/ptt/exo.1   20060    4701    2790     latn/ptt/exo.1   20060    4701    2790
      latn/ptt/num.1   19316    4340    2595     latn/ptt/num.1   19316    4340    2595
      latn/ptt/lev.1   13775    3233    1909     latn/ptt/lev.1   13775    3233    1909
      latn/ptt/deu.1   18502    4466    2815     latn/ptt/deu.1   18502    4466    2815
      latn/ptt/tot.1   96870   13946    7568     latn/ptt/tot.1   35027    6633    3875
      latn/nwt/mat.1   16431    3911    2278     latn/nwt/mat.1   16431    3911    2278
      latn/nwt/mrk.1   10280    2913    1810     latn/nwt/mrk.1   10280    2913    1810
      latn/nwt/luk.1   18004    4406    2743     latn/nwt/luk.1   18004    4406    2743
      latn/nwt/joh.1   14026    2523    1377     latn/nwt/joh.1   14026    2523    1377
      latn/nwt/tot.1   58741    7990    3946     latn/nwt/tot.1   35027    5740    2948
      latn/ock/tot.1   37263    5774    2996     latn/ock/tot.1   35027    5589    2926
      grek/nwt/mat.1   18745    3958    2350     grek/nwt/mat.1   18745    3958    2350
      grek/nwt/mrk.1   11632    2898    1842     grek/nwt/mrk.1   11632    2898    1842
      grek/nwt/luk.1   19887    4609    3015     grek/nwt/luk.1   19887    4609    3015
      grek/nwt/joh.1   15919    2586    1422     grek/nwt/joh.1   15919    2586    1422
      grek/nwt/tot.1   66183    8301    4163     grek/nwt/tot.1   35027    5436    2824
      span/qvi/one.1  177061   14247    7466     span/qvi/one.1   35027    5452    3237
      span/qvi/two.1  187776   16023    8543     span/qvi/two.1   35027    5698    3558
      span/qvi/tot.1  364837   22475   11175     span/qvi/tot.1   35027    5582    3395
      ital/psp/tot.1  216969   18965    9671     ital/psp/tot.1   35027    6623    4085
      fran/tal/tot.1   54061    8102    4555     fran/tal/tot.1   35027    6223    3698
      port/csm/tot.1   64602    9032    5081     port/csm/tot.1   35027    6267    3772
      germ/sim/tot.1  184498   18556   10020     germ/sim/tot.1   35027    6826    4223
      russ/pic/tot.1   45915   11831    7936     russ/pic/tot.1   35027    9761    6659
      russ/ptt/gen.1   28445    4899    2704     russ/ptt/gen.1   28445    4899    2704
      russ/ptt/exo.1   22960    4084    2112     russ/ptt/exo.1   22960    4084    2112
      russ/ptt/num.1   22530    3952    2142     russ/ptt/num.1   22530    3952    2142
      russ/ptt/lev.1   16901    2659    1305     russ/ptt/lev.1   16901    2659    1305
      russ/ptt/deu.1   20988    3913    2238     russ/ptt/deu.1   20988    3913    2238
      russ/ptt/tot.1  111824   12034    5926     russ/ptt/tot.1   35027    5521    2911
      arab/quf/tot.1   77394   19852   12911     arab/quf/tot.1   35027   10935    7353
      arab/quv/tot.1   77411   19530   12595     arab/quv/tot.1   35027   10762    7187
      arab/qud/tot.1   77455   15314    9109     arab/qud/tot.1   35027    8531    5245
      arab/qph/tot.1   77845   17380   10742     arab/qph/tot.1   35027    9434    6044
      arab/qcs/tot.1   74212   15873    9603     arab/qcs/tot.1   35027    9025    5649
      viet/ptt/gen.1   42099    1793     430     viet/ptt/gen.1   35027    1690     421
      viet/ptt/exo.1   33760    1649     368     viet/ptt/exo.1   33760    1649     368
      viet/ptt/num.1   37097    1485     363     viet/ptt/num.1   35027    1459     367
      viet/ptt/lev.1   25163    1207     339     viet/ptt/lev.1   25163    1207     339
      viet/ptt/deu.1   31361    1614     439     viet/ptt/deu.1   31361    1614     439
      viet/ptt/tot.1  169480    2684     489     viet/ptt/tot.1   35027    1631     397
      viet/nwt/mat.1   25615    1818     564     viet/nwt/mat.1   25615    1818     564
      viet/nwt/mrk.1   15895    1572     556     viet/nwt/mrk.1   15895    1572     556
      viet/nwt/luk.1   27637    2117     750     viet/nwt/luk.1   27637    2117     750
      viet/nwt/jhn.1   21872    1289     428     viet/nwt/jhn.1   21872    1289     428
      viet/nwt/tot.1   91019    2735     684     viet/nwt/tot.1   35027    2011     570
      chin/ptt/gen.1   45081    1503     301     chin/ptt/gen.1   35027    1376     276
      chin/ptt/exo.1   35252    1424     275     chin/ptt/exo.1   35027    1424     277
      chin/ptt/num.1   36843    1303     312     chin/ptt/num.1   35027    1291     310
      chin/ptt/lev.1   25694    1095     261     chin/ptt/lev.1   25694    1095     261
      chin/ptt/deu.1   31494    1433     336     chin/ptt/deu.1   31494    1433     336
      chin/ptt/tot.1  174364    2177     278     chin/ptt/tot.1   35027    1392     280
      chin/ptn/gen.1   49305    1555     317     chin/ptn/gen.1   35027    1380     312
      chin/ptn/exo.1   40159    1450     305     chin/ptn/exo.1   35027    1439     321
      chin/ptn/num.1   39792    1308     294     chin/ptn/num.1   35027    1254     288
      chin/ptn/lev.1   28693    1169     274     chin/ptn/lev.1   28693    1169     274
      chin/ptn/deu.1   35370    1463     368     chin/ptn/deu.1   35027    1457     367
      chin/ptn/tot.1  193319    2266     318     chin/ptn/tot.1   35027    1405     291
      chin/red/tot.1  706889    4271     585     chin/red/tot.1   35027    2420     663
      chin/voa/tot.1   58813    1886     376     chin/voa/tot.1   35027    1616     348
      chip/voa/tot.1   59476     930     114     chip/voa/tot.1   35027     830      98
      tibe/vim/tot.1   53287    1469     389     tibe/vim/tot.1   35027    1300     370
      tibe/ccv/tot.1   88620    1155     292     tibe/ccv/tot.1   35027     846     196
      tibe/pmi/tot.1  143289    2932     666     tibe/pmi/tot.1   35027    1963     515
      chrc/red/tot.1  706889    4271     585     chrc/red/tot.1   35027    2420     663
      enrc/wow/tot.1   60293    6789    3244     enrc/wow/tot.1   35027    4869    2465
      envt/wow/tot.1   42098    1709     286     envt/wow/tot.1   35027    1650     291
      envg/wow/tot.1   60293   19120   13035     envg/wow/tot.1   35027   12911    9127
      voyp/grs/tot.1    1950     635     365     voyp/grs/tot.1    1950     635     365
      voyp/grm/tot.1     708     307     204     voyp/grm/tot.1     708     307     204
      viep/grs/tot.1   31200    7760    3216     viep/grs/tot.1   31200    7760    3216
      viep/mky/tot.1   39293    3471    1161     viep/mky/tot.1   35027    3341    1174

    # Counts for bad text (whole)              # Counts for bad text (trunc)
    # sample/sec      tokens   words  unique   # sample/sec      tokens   words  unique
    # -------------- ------- ------- -------   # -------------- ------- ------- -------
      hebr/tav/gen.1    1533       1       0     hebr/tav/gen.1    1533       1       0
      hebr/tav/exo.1    1209       1       0     hebr/tav/exo.1    1209       1       0
      hebr/tav/num.1    1289       1       0     hebr/tav/num.1    1289       1       0
      hebr/tav/lev.1     859       1       0     hebr/tav/lev.1     859       1       0
      hebr/tav/deu.1     955       1       0     hebr/tav/deu.1     955       1       0
      hebr/tav/tot.1    5845       1       0     hebr/tav/tot.1    3050       1       0
      hebr/tad/tot.1    5845       1       0     hebr/tad/tot.1    3085       1       0
      geez/gok/tot.1     497      84      41     geez/gok/tot.1     497      84      41
      geez/eno/tot.1     479      82      35     geez/eno/tot.1     479      82      35
      engl/wow/tot.1     898      10       8     engl/wow/tot.1     579       9       7
      engl/wnm/tot.1       0       0       0     engl/wnm/tot.1       0       0       0
      engl/cul/pre.1      61      21      15     engl/cul/pre.1      61      21      15
      engl/cul/her.1    3634     170      93     engl/cul/her.1    1166      90      62
      engl/cul/rec.1     313      20       7     engl/cul/rec.1     313      20       7
      engl/cul/tot.1    4008     186     101     engl/cul/tot.1    1174      93      61
      engl/cpn/tot.1       3       2       1     engl/cpn/tot.1       3       2       1
      engl/twp/tot.1   14318      49      35     engl/twp/tot.1    6392      20      17
      latn/ptt/gen.1    1531       1       0     latn/ptt/gen.1    1531       1       0
      latn/ptt/exo.1    1211       1       0     latn/ptt/exo.1    1211       1       0
      latn/ptt/num.1    1288       1       0     latn/ptt/num.1    1288       1       0
      latn/ptt/lev.1     858       1       0     latn/ptt/lev.1     858       1       0
      latn/ptt/deu.1     959       1       0     latn/ptt/deu.1     959       1       0
      latn/ptt/tot.1    5847       1       0     latn/ptt/tot.1    2077       1       0
      latn/nwt/mat.1    1071       3       2     latn/nwt/mat.1    1071       3       2
      latn/nwt/mrk.1     679       3       2     latn/nwt/mrk.1     679       3       2
      latn/nwt/luk.1    1151       1       0     latn/nwt/luk.1    1151       1       0
      latn/nwt/joh.1     879       1       0     latn/nwt/joh.1     879       1       0
      latn/nwt/tot.1    3780       4       2     latn/nwt/tot.1    2226       1       0
      latn/ock/tot.1     374      54      21     latn/ock/tot.1     362      54      21
      grek/nwt/mat.1    1071       1       0     grek/nwt/mat.1    1071       1       0
      grek/nwt/mrk.1     678       1       0     grek/nwt/mrk.1     678       1       0
      grek/nwt/luk.1    1150       1       0     grek/nwt/luk.1    1150       1       0
      grek/nwt/joh.1     879       1       0     grek/nwt/joh.1     879       1       0
      grek/nwt/tot.1    3778       1       0     grek/nwt/tot.1    1976       1       0
      span/qvi/one.1    2213      42      27     span/qvi/one.1     522      15      11
      span/qvi/two.1    3055      61      42     span/qvi/two.1     598      17      10
      span/qvi/tot.1    5268      88      60     span/qvi/tot.1     578      18      14
      ital/psp/tot.1    2925      88      57     ital/psp/tot.1     594      32      21
      fran/tal/tot.1    1490     140      93     fran/tal/tot.1     985     121      87
      port/csm/tot.1      89      47      35     port/csm/tot.1      29      11       6
      germ/sim/tot.1     898     101      79     germ/sim/tot.1     247      53      42
      russ/pic/tot.1    1454       8       5     russ/pic/tot.1    1236       8       5
      russ/ptt/gen.1       0       0       0     russ/ptt/gen.1       0       0       0
      russ/ptt/exo.1       0       0       0     russ/ptt/exo.1       0       0       0
      russ/ptt/num.1       0       0       0     russ/ptt/num.1       0       0       0
      russ/ptt/lev.1       0       0       0     russ/ptt/lev.1       0       0       0
      russ/ptt/deu.1       0       0       0     russ/ptt/deu.1       0       0       0
      russ/ptt/tot.1       0       0       0     russ/ptt/tot.1       0       0       0
      arab/quf/tot.1    6330      69      57     arab/quf/tot.1    2027      48      39
      arab/quv/tot.1    6313      56      47     arab/quv/tot.1    2013      38      32
      arab/qud/tot.1    6262      11       6     arab/qud/tot.1    1974       5       2
      arab/qph/tot.1    6236       1       0     arab/qph/tot.1    1953       1       0
      arab/qcs/tot.1    6236       1       0     arab/qcs/tot.1    2075       1       0
      viet/ptt/gen.1    1349       3       2     viet/ptt/gen.1    1135       3       2
      viet/ptt/exo.1    1015       3       2     viet/ptt/exo.1    1015       3       2
      viet/ptt/num.1     970       3       2     viet/ptt/num.1     922       3       2
      viet/ptt/lev.1     668       3       2     viet/ptt/lev.1     668       3       2
      viet/ptt/deu.1     731       3       2     viet/ptt/deu.1     731       3       2
      viet/ptt/tot.1    4733       3       0     viet/ptt/tot.1     995       3       0
      viet/nwt/mat.1     796       3       2     viet/nwt/mat.1     796       3       2
      viet/nwt/mrk.1     431       3       2     viet/nwt/mrk.1     431       3       2
      viet/nwt/luk.1     639       1       0     viet/nwt/luk.1     639       1       0
      viet/nwt/jhn.1     556       1       0     viet/nwt/jhn.1     556       1       0
      viet/nwt/tot.1    2422       4       2     viet/nwt/tot.1     978       1       0
      chin/ptt/gen.1    1316       1       0     chin/ptt/gen.1    1041       1       0
      chin/ptt/exo.1    1011       1       0     chin/ptt/exo.1    1001       1       0
      chin/ptt/num.1    1063       1       0     chin/ptt/num.1    1007       1       0
      chin/ptt/lev.1     710       1       0     chin/ptt/lev.1     710       1       0
      chin/ptt/deu.1     788       1       0     chin/ptt/deu.1     788       1       0
      chin/ptt/tot.1    4888       1       0     chin/ptt/tot.1    1029       1       0
      chin/ptn/gen.1     974       1       0     chin/ptn/gen.1     709       1       0
      chin/ptn/exo.1     841       1       0     chin/ptn/exo.1     698       1       0
      chin/ptn/num.1     750       1       0     chin/ptn/num.1     630       1       0
      chin/ptn/lev.1     599       1       0     chin/ptn/lev.1     599       1       0
      chin/ptn/deu.1     609       1       0     chin/ptn/deu.1     600       1       0
      chin/ptn/tot.1    3773       1       0     chin/ptn/tot.1     693       1       0
      chin/red/tot.1    4016       2       0     chin/red/tot.1     236       1       0
      chin/voa/tot.1    1022      68      36     chin/voa/tot.1     664      58      33
      chip/voa/tot.1     526       3       0     chip/voa/tot.1     315       2       0
      tibe/vim/tot.1      69       4       2     tibe/vim/tot.1      50       4       2
      tibe/ccv/tot.1      49      11       8     tibe/ccv/tot.1      22       9       7
      tibe/pmi/tot.1      42      14       8     tibe/pmi/tot.1       7       5       3
      chrc/red/tot.1    4016       2       0     chrc/red/tot.1     236       1       0
      enrc/wow/tot.1     898      10       8     enrc/wow/tot.1     579       9       7
      envt/wow/tot.1   28021     983     172     envt/wow/tot.1   23316     941     176
      envg/wow/tot.1     898      10       8     envg/wow/tot.1     579       9       7
      voyp/grs/tot.1       0       0       0     voyp/grs/tot.1       0       0       0
      voyp/grm/tot.1      18       6       4     voyp/grm/tot.1      18       6       4
      viep/grs/tot.1       0       0       0     viep/grs/tot.1       0       0       0
      viep/mky/tot.1    1105       1       0     viep/mky/tot.1     986       1       0

# END