Hacking at the Voynich manuscript - Side notes
101 Preparing clean samples of various other languages 

Last edited on 2023-05-10 23:44:30 by stolfi

SUMMARY

  Here we prepare text samples in English, Latin, and other languages,
  comparable in size to the Voynichese reference sample, for the
  statistical analyses that will go into the "word structure"
  technical report.

SETTING UP THE ENVIRONMENT

  Links:

    ln -s ../tr-stats/dat
    ln -s ../tr-stats/exp

    ln -s ../../../work 
    ln -s ../../../langbank

    ln -s work/wds-to-tlw wds-to-tlw.sh
    ln -s work/update-paper-include update-paper-include.sh
    ln -s work/format-words-filled format-words-filled.sh
    ln -s work/compute-freqs compute-freqs.sh

NAMING THE SAMPLE TEXTS
  
  A sample is identified by a pair {smp = "{LANG}/{BOOK}"} where
  {LANG} is the general language and writing system ("engl" for
  English in standard spelling, "chip" for Mandarin Chinese in pinyin,
  etc.) and {BOOK} is the source document ("wow" for War of the
  Worlds, "ptt" for the Pentateuch, "voa" for Voice of America
  broadcasts, etc.).
  
  A sample may be divided into sections or sub-samples for the purpose
  of studying statistical variations within the same document. A
  section is identified by a name {sec = "{TAG}.{N}"} where {TAG} is a
  descriptive string ("gen" for Genesis, "exo" for Exodus, "hea" for
  Herbal-A) and {N} is a serial number in case the section is split
  into separate segments (as the VMS Herbal-A is split into "hea.1",
  "hea.2", etc.). Every sample must have a "tot.1" section (the whole
  sample), which must be processed only after any other sections.

  The samples, sections, and their attributes are specified in the
  table "sample-sections.tbl". 
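
  As an illustration, a combined identifier "{LANG}/{BOOK}/{TAG}.{N}"
  can be unpacked with plain shell parameter expansion. This is only a
  sketch: the function {split_smpsec} is hypothetical and is not one of
  the scripts used in these notes.

```shell
# Hypothetical helper (not part of the original scripts): split an
# identifier "LANG/BOOK/TAG.N" into its four components.
split_smpsec() {
  local smpsec="$1"
  local smp="${smpsec%/*}"     # "LANG/BOOK"  (drop the last "/..." part)
  local sec="${smpsec##*/}"    # "TAG.N"      (keep only the last part)
  local lang="${smp%%/*}"      # "LANG"
  local book="${smp#*/}"       # "BOOK"
  local tag="${sec%.*}"        # "TAG"
  local num="${sec##*.}"       # "N"
  printf '%s %s %s %s\n' "$lang" "$book" "$tag" "$num"
}

# Example: prints the language, book, tag and serial number.
split_smpsec "latn/ptt/gen.1"
```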

FORMAT OF THE SAMPLE TEXTS

  For each sample {smp="{LANG}/{BOOK}"} and each section
  {sec="{TAG}.{N}"} (including "tot.1"), we produce a file called
  "dat/{smp}/{sec}/whole.tlw" that contains the corresponding "raw" tokens,
  suitably encoded, filtered, and tagged as "good" or "bad" for linguistic
  analysis.
  
  The file "whole.tlw" is derived from a reference file
  "LANG/BUK/main.wds" from my Linguistic Sample Bank
  ("/home/staff/stolfi/projects/langbank/"). This file
  is also linked as "dat/{smp}/org/main.wds".
  
  The "LANG/BUK" of the source may be different from the "LANG/BUK" of
  the sample {smp}; for example, the Langbank source "engl/wow" is used
  to produce the samples "engl/wow" (full text in lowercase), "engl/wnm"
  (proper names only), and "envg/wow" (Vigenère-coded text).
  
  A truncated version "dat/{smp}/{sec}/trunc.tlw" of the "whole.tlw" file is
  also created, containing a specified maximum number of "good" words.
  The roughly uniform sizes of these files make them more suitable for
  certain comparative analyses, such as Zipf law plots.
  
  For compatibility with the VMS samples, each ".tlw" file must 
  EXCLUDE any blanks, embedded comments, alignment fillers, or
  punctuation (including line and paragraph breaks); or any other
  tokens that are semantically equivalent to them. On the other hand,
  the ".tlw" files must INCLUDE any symbols that stand for words in
  the text, such as numbers, abbreviations, and "*"-surrogates for
  illegible or omitted words.
  
  In particular, each ".tlw" file must EXCLUDE any "null-like
  intrusions", i.e. undesirable sub-sections of the selected section that
  are syntactically equivalent to a null or blank space -- such as margin
  notes, footnotes and footnote marks, etc. On the other hand, it must
  INCLUDE any "symbol-like intrusions", i.e. undesirable sub-sections that
  play a syntactic role in the text -- such as formulas, tables,
  poems, foreign phrases, etc. However, the tokens of a symbol-like
  intrusion should be replaced by "*" symbols to clearly distinguish them
  from valid words; and, if the intrusion has two or more consecutive
  tokens, only the first and last one should be kept.
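
  The squeezing rule for symbol-like intrusions can be sketched with
  awk as follows. The sketch assumes a simplified two-field input
  "{TYPE} {WORD}" in which the intrusion tokens already carry the type
  "x"; the function {squeeze_x_runs} is hypothetical and only
  illustrates the rule, it is not the code actually used by
  {wds-to-tlw}.

```shell
# Hypothetical sketch: replace every "x" token by "*" and keep at most
# two of them (the first and the last) for each run of consecutive "x"s.
squeeze_x_runs() {
  awk '
    function emit_run() {
      if (run >= 1) print "x *"    # stands for the first token of the run
      if (run >= 2) print "x *"    # stands for the last token of the run
      run = 0
    }
    $1 == "x" { run++; next }      # inside an intrusion: just count
    { emit_run(); print }          # ordinary token: close any pending run
    END { emit_run() }
  '
}

# Example: a three-token formula between two ordinary words.
printf '%s\n' 'a the' 'x E' 'x =' 'x mc2' 'a holds' | squeeze_x_runs
```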
  
  The entries of a ".tlw" file have the format "{TYPE} {LOC} {WORD}",
  where: {TYPE} is either "a" or "s"; {LOC} is the full location of
  the token (section ID plus line number) with components delimited by
  braces (e.g. "{b1}{c3}{sA}{tx}{73}"); and {WORD} is the word or
  symbol in question.
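
  A minimal sketch of how such entries could be taken apart with awk.
  It assumes that the three fields are separated by blanks and that the
  last braced component of {LOC} is the line number, as the example
  location suggests; the function {tlw_fields} is hypothetical.

```shell
# Hypothetical sketch: print the type, the last {LOC} component
# (assumed to be the line number), and the word of each ".tlw" entry.
tlw_fields() {
  awk 'NF == 3 {
    loc = $2
    gsub(/^[{]|[}]$/, "", loc)        # strip the outermost braces
    n = split(loc, part, /[}][{]/)    # components were delimited by "}{"
    print $1, part[n], $3
  }'
}

# Example with two made-up entries.
printf '%s\n' 'a {b1}{c3}{sA}{tx}{73} dragon' 's {b1}{c3}{sA}{tx}{74} *' \
  | tlw_fields
```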
  
  The type "a" denotes plain ("gud") words of the language that are
  suitable for letter-level linguistic analysis, such as letter
  frequencies and correlations, word length distribution, etc. The
  type "s" denotes anomalous ("bad") words that ought to be excluded
  from such analyses, such as numerals, abbreviations, symbols,
  foreign-language intrusions, unreadable words, etc. Note that the
  "bad" words cannot be discarded, because they are relevant for some
  investigations, such as word-pair correlations, concordances, etc.
  
  The conversion of the source "main.wds" to the sample files "whole.tlw"
  and "trunc.tlw" is done by {wds-to-tlw} and consists of the following steps:
    
    (1) Read the tokens from "main.wds", which must have been roughly
    classified as comments (type="#"), alpha words ("a"), symbol words
    ("s"), punctuation chars ("p"), section starts ("$") and line
    starts ("@").
    
    (2) Test each input token with a sample-specific procedure
    {smp_reclassify_word} from the library {smp}/sample-fns.gawk. This
    procedure should re-assign type "x" to any tokens that lie
    outside section ${SEC} or are symbol-like intrusions, and
    type "n" to any null-like intrusions.
    
    (3) Based on the reclassified type, discard all entries except
    "a", "s" and "x".
    
    (4) To the remaining words, apply a sample-specific global word
    transformation, e.g. map upper to lower case, map Chinese
    ideograms to pinyin, change the alphabet, delete vowels or
    diacritics, split hyphenated words, etc. This is also a chance to
    delete any undesired words that could not be discarded by
    {smp_reclassify_word}. This step uses a sample-specific word
    substitution table {smp}/word-map.tbl followed by a sample-specific
    function {smp_fix_word} from the library {smp}/sample-fns.gawk.
    
    (5) Insert the {LOC} field, prepend "*" to any "x" words (to 
    avoid confusion), and squeeze any long runs of the latter.
    
    (6) Re-classify each "a" and "s" word as "gud" or "bad", using the
    predicate {smp_is_good_word} from the same library; "x" words are
    automatically "bad". Replace the type tag by "a" for the "gud"
    words and by "s" for the "bad" words. Write both types to the
    "whole.tlw" file.
    
  The "whole.tlw" file is then truncated after a specified number of "gud" 
  records, producing the "trunc.tlw" file.
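
  The truncation can be sketched with awk as below. The sketch assumes
  that the output ends exactly at the last "gud" record that is kept
  (so any "bad" records following it are dropped); the function
  {truncate_tlw} is hypothetical and may differ from what {wds-to-tlw}
  actually does.

```shell
# Hypothetical sketch: copy ".tlw" records from stdin until MAXGOOD
# records of type "a" have been written.
# Usage: truncate_tlw MAXGOOD < whole.tlw > trunc.tlw
truncate_tlw() {
  awk -v max="$1" '
    BEGIN { if (max <= 0) exit }               # nothing to keep
    { print }                                  # copy good and bad records
    $1 == "a" { if (++ngood >= max) exit }     # stop after the max-th good one
  '
}

# Example: keep two good words out of three.
printf '%s\n' 'a {1} one' 's {1} 12' 'a {2} two' 's {2} 13' 'a {3} three' \
  | truncate_tlw 2
```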
  
  The file "trunc.tlw" for each sample and section is copied to "raw.tlw"
  for compatibility with other Notes workbooks, and then split into
  
    dat/{smp}/{sec}/gud.tlw - the "gud" subset.
    dat/{smp}/{sec}/bad.tlw - the "bad" subset.
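
  The split itself only needs the type tag in the first field. A
  minimal sketch, assuming the file names above inside a single
  directory; the function {split_tlw} is hypothetical and is not the
  script used in these notes.

```shell
# Hypothetical sketch: split DIR/raw.tlw into DIR/gud.tlw (type "a")
# and DIR/bad.tlw (type "s").
split_tlw() {
  local dir="$1"
  awk -v gud="$dir/gud.tlw" -v bad="$dir/bad.tlw" '
    BEGIN { printf "" > gud; printf "" > bad }   # create both, even if empty
    $1 == "a" { print > gud; next }
    $1 == "s" { print > bad; next }
    { print "unexpected type: " $0 > "/dev/stderr" }
  ' "$dir/raw.tlw"
}

# Demo on a throwaway directory with made-up entries.
tmp=$(mktemp -d)
printf '%s\n' 'a {b1}{1} arma' 's {b1}{1} *' 'a {b1}{2} virum' > "$tmp/raw.tlw"
split_tlw "$tmp"
wc -l "$tmp/gud.tlw" "$tmp/bad.tlw"
rm -r "$tmp"
```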
    
  From these files are also created the derived files
    
    dat/{smp}/{sec}/raw.wfr - word counts in "raw.tlw".
    dat/{smp}/{sec}/gud.wfr - word counts in "gud.tlw".
    dat/{smp}/{sec}/bad.wfr - word counts in "bad.tlw".
    
    dat/{smp}/{sec}/raw.wdf - tokens from "raw.tlw", w/o locations, line-filled.
    dat/{smp}/{sec}/gud.wdf - tokens from "gud.tlw", w/o locations, line-filled.
    dat/{smp}/{sec}/bad.wdf - tokens from "bad.tlw", w/o locations, line-filled.
    
    dat/{smp}/{sec}/raw-wds-summary.tex - TeX include file for tech report
    dat/{smp}/{sec}/gud-wds-summary.tex - TeX include file for tech report
    dat/{smp}/{sec}/bad-wds-summary.tex - TeX include file for tech report
    
  The files {raw,gud,bad}.{tlw,wdf,wfr} are temporarily created also for
  the full sample "{smp}/{sec}/whole.tlw", but are then overwritten for
  the "trunc" version.
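
  For illustration, a word-count file could be derived from a ".tlw"
  file roughly as follows. The column layout "{COUNT} {FREQ} {WORD}" is
  an assumption (these notes only rely on the count being the first
  field), and the function {tlw_to_wfr} is hypothetical; the actual
  files are produced by {compute-freqs}.

```shell
# Hypothetical sketch: count the occurrences of each word (third field
# of a ".tlw" entry) and print "COUNT FREQ WORD", most frequent first.
tlw_to_wfr() {
  awk '
    { ct[$3]++; n++ }
    END { for (w in ct) printf "%7d %8.6f %s\n", ct[w], ct[w]/n, w }
  ' | sort -k1,1nr -k3,3
}

# Example with three made-up entries.
printf '%s\n' 'a {1} the' 'a {1} cat' 'a {2} the' | tlw_to_wfr
```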

GETTING THE SAMPLE SIZES FOR VOYNICHESE

  Get the number of good tokens in the Voynichese reference sample
  (plain prose and labels):

    vvers=( prs lab maj )
    for book in "${vvers[@]}" ; do
      cat dat/voyn/${book}/tot.1/gud.wfr \
        | gawk '/./{s+=$1} END{print s}' \
        > .tmp
      printf "${book} = %8d\n" `cat .tmp` 1>&2
    done
    
      prs =    35027
      lab =     1003
      maj =    36030
  
CHOOSING THE WORD SAMPLES FROM LANGUAGES OTHER THAN VOYNICHESE

  Gather the list ${smpsecs} of samples and sections,
  and the list ${smps} of samples without sections: 

    cat sample-sections.tbl \
      | gawk '/^ *([#]|$)/ { next; } // { print $1; }' \
      > .tmp
    smpsecs=( `cat .tmp` )
    echo "=== samples and sections =="
    echo ${smpsecs[@]} | tr ' ' '\012'
    echo ${smpsecs[@]} \
      | tr ' ' '\012' \
      | sed -e 's:[/][^/]*$::' \
      | uniq \
      > .tmp
    smps=( `cat .tmp` )
    echo "=== samples =="
    echo ${smps[@]} | tr ' ' '\012'

  Create files dat/${smp}/sections.tags and dat/${smp}/sections-ok.tags
  containing the list of sections (other than "tot.1") for each ${smp},
  create links to the original Langbank file "main.wds", and remove
  derived files:

    for smp in ${smps[@]} ; do
      echo " "
      if [[ ! ( -d dat/${smp} ) ]]; then mkdir -p dat/${smp}; fi
      if [[ ! ( -d exp/${smp} ) ]]; then mkdir -p exp/${smp}; fi
      sfile="dat/${smp}/sections.tags"
      sokfile="dat/${smp}/sections-ok.tags"
      cat sample-sections.tbl \
        | egrep -e '^ *'"${smp}/" \
        | gawk '// { s = $1; sub(/^.*[\/]/, "", s); print s; }' \
        | egrep -v -e '^tot[.]1$' \
        > ${sfile}
      echo "${smp} =    " `cat ${sfile}`
      cp -p ${sfile} ${sokfile}
      echo "${smp} ok = " `cat ${sokfile}`
      for sec in  `cat ${sokfile}` tot.1 ; do
        smpsec="${smp}/${sec}"
        if [[ ! ( -d dat/${smpsec} ) ]]; then mkdir -p dat/${smpsec}; fi
        if [[ ! ( -d exp/${smpsec} ) ]]; then mkdir -p exp/${smpsec}; fi
        rm -fv dat/${smpsec}/*{-wds-summary.tex,-words.tex,.tlw,.wdf,.wfr}
        rm -fv exp/${smpsec}/*{-wds-summary.tex,-words.tex}
      done
    done

CREATING THE SAMPLE FILES FOR OTHER LANGUAGES

  We do two passes on the original text file, with {sizeopt} equal to
  "whole" and "trunc", respectively.
  
  In each pass, for each sample and each section {smpsec = smp/sec}
  (including the pseudo-section "tot.1"), we create a word list
  "dat/{smpsec}/{sizeopt}.tlw" with the words from the text, one per
  line. The "whole" version uses the full source text, while the "trunc"
  version truncates the text to a prescribed number of "good" words (see
  below).

    for sizeopt in  whole trunc ; do
      echo "### getting main word files dat/*/*/*/${sizeopt}.tlw ###" 1>&2
      for smp in ${smps[@]} ; do
        echo " "
        get-sample-tlw-file.sh ${smp} ${sizeopt}
      done
    done
  
  Next, for each {sizeopt} ("whole" then "trunc"), the file
  "dat/{smpsec}/{sizeopt}.tlw" is copied to "dat/{smpsec}/raw.tlw". Then
  we split the "raw.tlw" word list into "gud.tlw" and "bad.tlw" files
  as explained above.
  
  For each {kind} in "raw", "gud", "bad", we also generate the files
    "dat/{smpsec}/{kind}.wdf" - words formatted as running text.
    "dat/{smpsec}/{kind}.wfr" - word counts and frequencies.
  Note that these files are overwritten in the second pass,
  so in the end they correspond only to the truncated versions.
  
  In each pass we also create the TeX parameter file 
    "dat/{smpsec}/{sizeopt}-{kind}-wds-summary.tex"
  and copy it to "exp/{smpsec}/{sizeopt}-{kind}-wds-summary.tex".
  
  Finally, for each {sizeopt} and {kind} we create a global table 
  "summary-{sizeopt}-{kind}.txt" with basic data of all samples
  and sections. Note that the table must be created inside the loop
  of {sizeopt} since the files {raw,gud,bad}.tlw are overwritten.

    for sizeopt in  whole trunc ; do
      echo "### creating {raw,gud,bad}.tlw files from ${sizeopt}.tlw ###"
      for smp in ${smps[@]} ; do
        echo " "
        get-sample-raw-gud-bad-files.sh ${smp} ${sizeopt}
        echo " "
      done
      summarize-counts.sh ${sizeopt} ${smps[@]}
    done
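
  As a rough sketch of what one line of such a summary table contains,
  the three counts can be computed from a ".wfr" file as below. This
  assumes that the count is the first field of each ".wfr" line and
  that "unique" means words occurring exactly once, which is what the
  tables below suggest; the function {wfr_summary} is hypothetical and
  is not the code of {summarize-counts.sh}.

```shell
# Hypothetical sketch: print "NAME TOKENS WORDS UNIQUE" for one ".wfr"
# file read from stdin (count assumed to be in the first field).
wfr_summary() {
  awk -v name="$1" '
    NF > 0 {
      tokens += $1                 # total occurrences
      words++                      # distinct words
      if ($1 == 1) unique++        # words seen exactly once (assumption)
    }
    END { printf "%-14s %7d %7d %7d\n", name, tokens, words, unique }
  '
}

# Example with three made-up ".wfr" lines.
printf '%s\n' '2 0.500000 the' '1 0.250000 cat' '1 0.250000 sat' \
  | wfr_summary demo/txt/tot.1
```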

  Print summaries:

    for kind in raw gud bad ; do 
      printf "\n"
      paste summary-{whole,trunc}-${kind}.txt | expand | sed -e 's:^:    :g' 
    done


      # Counts for raw text (whole)                   # Counts for raw text (trunc)           
      # sample/sec      tokens   words  unique        # sample/sec      tokens   words  unique
      # -------------- ------- ------- -------        # -------------- ------- ------- -------
        hebr/tav/gen.1   18744    7213    5100          hebr/tav/gen.1   18744    7213    5100
        hebr/tav/exo.1   15079    5712    3882          hebr/tav/exo.1   15079    5712    3882
        hebr/tav/num.1   14862    5307    3697          hebr/tav/num.1   14862    5307    3697
        hebr/tav/lev.1   10509    3861    2609          hebr/tav/lev.1   10509    3861    2609
        hebr/tav/deu.1   12962    5456    3972          hebr/tav/deu.1   12962    5456    3972
        hebr/tav/tot.1   72156   20977   14023          hebr/tav/tot.1   38077   12488    8498
        hebr/tad/tot.1   72156   19557   12807          hebr/tad/tot.1   38112   11857    7842
        geez/gok/tot.1   34788   12356    8385          geez/gok/tot.1   34788   12356    8385
        geez/eno/tot.1   18215    6356    4228          geez/eno/tot.1   18215    6356    4228
        engl/wow/tot.1   61191    6799    3252          engl/wow/tot.1   35606    4878    2472
        engl/wnm/tot.1     831     194     100          engl/wnm/tot.1     831     194     100
        engl/cul/pre.1    2824     799     495          engl/cul/pre.1    2824     799     495
        engl/cul/her.1  116329    5855    2495          engl/cul/her.1   36193    3489    1613
        engl/cul/rec.1    7084    1260     642          engl/cul/rec.1    7084    1260     642
        engl/cul/tot.1  126237    6379    2721          engl/cul/tot.1   36201    3637    1728
        engl/cpn/tot.1     544     402     323          engl/cpn/tot.1     544     402     323
        engl/twp/tot.1   95816    6848    3500          engl/twp/tot.1   41419    4222    2242
        latn/ptt/gen.1   26748    5714    3485          latn/ptt/gen.1   26748    5714    3485
        latn/ptt/exo.1   21271    4702    2790          latn/ptt/exo.1   21271    4702    2790
        latn/ptt/num.1   20604    4341    2595          latn/ptt/num.1   20604    4341    2595
        latn/ptt/lev.1   14633    3234    1909          latn/ptt/lev.1   14633    3234    1909
        latn/ptt/deu.1   19461    4467    2815          latn/ptt/deu.1   19461    4467    2815
        latn/ptt/tot.1  102717   13947    7568          latn/ptt/tot.1   37104    6634    3875
        latn/nwt/mat.1   17502    3914    2280          latn/nwt/mat.1   17502    3914    2280
        latn/nwt/mrk.1   10959    2916    1812          latn/nwt/mrk.1   10959    2916    1812
        latn/nwt/luk.1   19155    4407    2743          latn/nwt/luk.1   19155    4407    2743
        latn/nwt/joh.1   14905    2524    1377          latn/nwt/joh.1   14905    2524    1377
        latn/nwt/tot.1   62521    7994    3948          latn/nwt/tot.1   37253    5741    2948
        latn/ock/tot.1   37637    5828    3017          latn/ock/tot.1   35389    5643    2947
        grek/nwt/mat.1   19816    3959    2350          grek/nwt/mat.1   19816    3959    2350
        grek/nwt/mrk.1   12310    2899    1842          grek/nwt/mrk.1   12310    2899    1842
        grek/nwt/luk.1   21037    4610    3015          grek/nwt/luk.1   21037    4610    3015
        grek/nwt/joh.1   16798    2587    1422          grek/nwt/joh.1   16798    2587    1422
        grek/nwt/tot.1   69961    8302    4163          grek/nwt/tot.1   37003    5437    2824
        span/qvi/one.1  179274   14289    7493          span/qvi/one.1   35549    5467    3248
        span/qvi/two.1  190831   16084    8585          span/qvi/two.1   35625    5715    3568
        span/qvi/tot.1  370105   22563   11235          span/qvi/tot.1   35605    5600    3409
        ital/psp/tot.1  219894   19053    9728          ital/psp/tot.1   35621    6655    4106
        fran/tal/tot.1   55551    8242    4648          fran/tal/tot.1   36012    6344    3785
        port/csm/tot.1   64691    9079    5116          port/csm/tot.1   35056    6278    3778
        germ/sim/tot.1  185396   18657   10099          germ/sim/tot.1   35274    6879    4265
        russ/pic/tot.1   47369   11837    7940          russ/pic/tot.1   36263    9767    6663
        russ/ptt/gen.1   28445    4899    2704          russ/ptt/gen.1   28445    4899    2704
        russ/ptt/exo.1   22960    4084    2112          russ/ptt/exo.1   22960    4084    2112
        russ/ptt/num.1   22530    3952    2142          russ/ptt/num.1   22530    3952    2142
        russ/ptt/lev.1   16901    2659    1305          russ/ptt/lev.1   16901    2659    1305
        russ/ptt/deu.1   20988    3913    2238          russ/ptt/deu.1   20988    3913    2238
        russ/ptt/tot.1  111824   12034    5926          russ/ptt/tot.1   35027    5521    2911
        arab/quf/tot.1   83724   19921   12968          arab/quf/tot.1   37054   10983    7392
        arab/quv/tot.1   83724   19586   12642          arab/quv/tot.1   37040   10800    7219
        arab/qud/tot.1   83717   15325    9115          arab/qud/tot.1   37001    8536    5247
        arab/qph/tot.1   84081   17381   10742          arab/qph/tot.1   36980    9435    6044
        arab/qcs/tot.1   80448   15874    9603          arab/qcs/tot.1   37102    9026    5649
        viet/ptt/gen.1   43448    1796     432          viet/ptt/gen.1   36162    1693     423
        viet/ptt/exo.1   34775    1652     370          viet/ptt/exo.1   34775    1652     370
        viet/ptt/num.1   38067    1488     365          viet/ptt/num.1   35949    1462     369
        viet/ptt/lev.1   25831    1210     341          viet/ptt/lev.1   25831    1210     341
        viet/ptt/deu.1   32092    1617     441          viet/ptt/deu.1   32092    1617     441
        viet/ptt/tot.1  174213    2687     489          viet/ptt/tot.1   36022    1634     397
        viet/nwt/mat.1   26411    1821     566          viet/nwt/mat.1   26411    1821     566
        viet/nwt/mrk.1   16326    1575     558          viet/nwt/mrk.1   16326    1575     558
        viet/nwt/luk.1   28276    2118     750          viet/nwt/luk.1   28276    2118     750
        viet/nwt/jhn.1   22428    1290     428          viet/nwt/jhn.1   22428    1290     428
        viet/nwt/tot.1   93441    2739     686          viet/nwt/tot.1   36005    2012     570
        chin/ptt/gen.1   46397    1504     301          chin/ptt/gen.1   36068    1377     276
        chin/ptt/exo.1   36263    1425     275          chin/ptt/exo.1   36028    1425     277
        chin/ptt/num.1   37906    1304     312          chin/ptt/num.1   36034    1292     310
        chin/ptt/lev.1   26404    1096     261          chin/ptt/lev.1   26404    1096     261
        chin/ptt/deu.1   32282    1434     336          chin/ptt/deu.1   32282    1434     336
        chin/ptt/tot.1  179252    2178     278          chin/ptt/tot.1   36056    1393     280
        chin/ptn/gen.1   50279    1556     317          chin/ptn/gen.1   35736    1381     312
        chin/ptn/exo.1   41000    1451     305          chin/ptn/exo.1   35725    1440     321
        chin/ptn/num.1   40542    1309     294          chin/ptn/num.1   35657    1255     288
        chin/ptn/lev.1   29292    1170     274          chin/ptn/lev.1   29292    1170     274
        chin/ptn/deu.1   35979    1464     368          chin/ptn/deu.1   35627    1458     367
        chin/ptn/tot.1  197092    2267     318          chin/ptn/tot.1   35720    1406     291
        chin/red/tot.1  710905    4273     585          chin/red/tot.1   35263    2421     663
        chin/voa/tot.1   59835    1954     412          chin/voa/tot.1   35691    1674     381
        chip/voa/tot.1   60002     933     114          chip/voa/tot.1   35342     832      98
        tibe/vim/tot.1   53356    1473     391          tibe/vim/tot.1   35077    1304     372
        tibe/ccv/tot.1   88669    1166     300          tibe/ccv/tot.1   35049     855     203
        tibe/pmi/tot.1  143331    2946     674          tibe/pmi/tot.1   35034    1968     518
        chrc/red/tot.1  710905    4273     585          chrc/red/tot.1   35263    2421     663
        enrc/wow/tot.1   61191    6799    3252          enrc/wow/tot.1   35606    4878    2472
        envt/wow/tot.1   70119    2692     458          envt/wow/tot.1   58343    2591     467
        envg/wow/tot.1   61191   19130   13043          envg/wow/tot.1   35606   12920    9134
        voyp/grs/tot.1    1950     635     365          voyp/grs/tot.1    1950     635     365
        voyp/grm/tot.1     726     313     208          voyp/grm/tot.1     726     313     208
        viep/grs/tot.1   31200    7760    3216          viep/grs/tot.1   31200    7760    3216
        viep/mky/tot.1   40398    3472    1161          viep/mky/tot.1   36013    3342    1174

      # Counts for gud text (whole)                   # Counts for gud text (trunc)           
      # sample/sec      tokens   words  unique        # sample/sec      tokens   words  unique
      # -------------- ------- ------- -------        # -------------- ------- ------- -------
        hebr/tav/gen.1   17211    7212    5100          hebr/tav/gen.1   17211    7212    5100
        hebr/tav/exo.1   13870    5711    3882          hebr/tav/exo.1   13870    5711    3882
        hebr/tav/num.1   13573    5306    3697          hebr/tav/num.1   13573    5306    3697
        hebr/tav/lev.1    9650    3860    2609          hebr/tav/lev.1    9650    3860    2609
        hebr/tav/deu.1   12007    5455    3972          hebr/tav/deu.1   12007    5455    3972
        hebr/tav/tot.1   66311   20976   14023          hebr/tav/tot.1   35027   12487    8498
        hebr/tad/tot.1   66311   19556   12807          hebr/tad/tot.1   35027   11856    7842
        geez/gok/tot.1   34291   12272    8344          geez/gok/tot.1   34291   12272    8344
        geez/eno/tot.1   17736    6274    4193          geez/eno/tot.1   17736    6274    4193
        engl/wow/tot.1   60293    6789    3244          engl/wow/tot.1   35027    4869    2465
        engl/wnm/tot.1     831     194     100          engl/wnm/tot.1     831     194     100
        engl/cul/pre.1    2763     778     480          engl/cul/pre.1    2763     778     480
        engl/cul/her.1  112695    5685    2402          engl/cul/her.1   35027    3399    1551
        engl/cul/rec.1    6771    1240     635          engl/cul/rec.1    6771    1240     635
        engl/cul/tot.1  122229    6193    2620          engl/cul/tot.1   35027    3544    1667
        engl/cpn/tot.1     541     400     322          engl/cpn/tot.1     541     400     322
        engl/twp/tot.1   81498    6799    3465          engl/twp/tot.1   35027    4202    2225
        latn/ptt/gen.1   25217    5713    3485          latn/ptt/gen.1   25217    5713    3485
        latn/ptt/exo.1   20060    4701    2790          latn/ptt/exo.1   20060    4701    2790
        latn/ptt/num.1   19316    4340    2595          latn/ptt/num.1   19316    4340    2595
        latn/ptt/lev.1   13775    3233    1909          latn/ptt/lev.1   13775    3233    1909
        latn/ptt/deu.1   18502    4466    2815          latn/ptt/deu.1   18502    4466    2815
        latn/ptt/tot.1   96870   13946    7568          latn/ptt/tot.1   35027    6633    3875
        latn/nwt/mat.1   16431    3911    2278          latn/nwt/mat.1   16431    3911    2278
        latn/nwt/mrk.1   10280    2913    1810          latn/nwt/mrk.1   10280    2913    1810
        latn/nwt/luk.1   18004    4406    2743          latn/nwt/luk.1   18004    4406    2743
        latn/nwt/joh.1   14026    2523    1377          latn/nwt/joh.1   14026    2523    1377
        latn/nwt/tot.1   58741    7990    3946          latn/nwt/tot.1   35027    5740    2948
        latn/ock/tot.1   37263    5774    2996          latn/ock/tot.1   35027    5589    2926
        grek/nwt/mat.1   18745    3958    2350          grek/nwt/mat.1   18745    3958    2350
        grek/nwt/mrk.1   11632    2898    1842          grek/nwt/mrk.1   11632    2898    1842
        grek/nwt/luk.1   19887    4609    3015          grek/nwt/luk.1   19887    4609    3015
        grek/nwt/joh.1   15919    2586    1422          grek/nwt/joh.1   15919    2586    1422
        grek/nwt/tot.1   66183    8301    4163          grek/nwt/tot.1   35027    5436    2824
        span/qvi/one.1  177061   14247    7466          span/qvi/one.1   35027    5452    3237
        span/qvi/two.1  187776   16023    8543          span/qvi/two.1   35027    5698    3558
        span/qvi/tot.1  364837   22475   11175          span/qvi/tot.1   35027    5582    3395
        ital/psp/tot.1  216969   18965    9671          ital/psp/tot.1   35027    6623    4085
        fran/tal/tot.1   54061    8102    4555          fran/tal/tot.1   35027    6223    3698
        port/csm/tot.1   64602    9032    5081          port/csm/tot.1   35027    6267    3772
        germ/sim/tot.1  184498   18556   10020          germ/sim/tot.1   35027    6826    4223
        russ/pic/tot.1   45915   11831    7936          russ/pic/tot.1   35027    9761    6659
        russ/ptt/gen.1   28445    4899    2704          russ/ptt/gen.1   28445    4899    2704
        russ/ptt/exo.1   22960    4084    2112          russ/ptt/exo.1   22960    4084    2112
        russ/ptt/num.1   22530    3952    2142          russ/ptt/num.1   22530    3952    2142
        russ/ptt/lev.1   16901    2659    1305          russ/ptt/lev.1   16901    2659    1305
        russ/ptt/deu.1   20988    3913    2238          russ/ptt/deu.1   20988    3913    2238
        russ/ptt/tot.1  111824   12034    5926          russ/ptt/tot.1   35027    5521    2911
        arab/quf/tot.1   77394   19852   12911          arab/quf/tot.1   35027   10935    7353
        arab/quv/tot.1   77411   19530   12595          arab/quv/tot.1   35027   10762    7187
        arab/qud/tot.1   77455   15314    9109          arab/qud/tot.1   35027    8531    5245
        arab/qph/tot.1   77845   17380   10742          arab/qph/tot.1   35027    9434    6044
        arab/qcs/tot.1   74212   15873    9603          arab/qcs/tot.1   35027    9025    5649
        viet/ptt/gen.1   42099    1793     430          viet/ptt/gen.1   35027    1690     421
        viet/ptt/exo.1   33760    1649     368          viet/ptt/exo.1   33760    1649     368
        viet/ptt/num.1   37097    1485     363          viet/ptt/num.1   35027    1459     367
        viet/ptt/lev.1   25163    1207     339          viet/ptt/lev.1   25163    1207     339
        viet/ptt/deu.1   31361    1614     439          viet/ptt/deu.1   31361    1614     439
        viet/ptt/tot.1  169480    2684     489          viet/ptt/tot.1   35027    1631     397
        viet/nwt/mat.1   25615    1818     564          viet/nwt/mat.1   25615    1818     564
        viet/nwt/mrk.1   15895    1572     556          viet/nwt/mrk.1   15895    1572     556
        viet/nwt/luk.1   27637    2117     750          viet/nwt/luk.1   27637    2117     750
        viet/nwt/jhn.1   21872    1289     428          viet/nwt/jhn.1   21872    1289     428
        viet/nwt/tot.1   91019    2735     684          viet/nwt/tot.1   35027    2011     570
        chin/ptt/gen.1   45081    1503     301          chin/ptt/gen.1   35027    1376     276
        chin/ptt/exo.1   35252    1424     275          chin/ptt/exo.1   35027    1424     277
        chin/ptt/num.1   36843    1303     312          chin/ptt/num.1   35027    1291     310
        chin/ptt/lev.1   25694    1095     261          chin/ptt/lev.1   25694    1095     261
        chin/ptt/deu.1   31494    1433     336          chin/ptt/deu.1   31494    1433     336
        chin/ptt/tot.1  174364    2177     278          chin/ptt/tot.1   35027    1392     280
        chin/ptn/gen.1   49305    1555     317          chin/ptn/gen.1   35027    1380     312
        chin/ptn/exo.1   40159    1450     305          chin/ptn/exo.1   35027    1439     321
        chin/ptn/num.1   39792    1308     294          chin/ptn/num.1   35027    1254     288
        chin/ptn/lev.1   28693    1169     274          chin/ptn/lev.1   28693    1169     274
        chin/ptn/deu.1   35370    1463     368          chin/ptn/deu.1   35027    1457     367
        chin/ptn/tot.1  193319    2266     318          chin/ptn/tot.1   35027    1405     291
        chin/red/tot.1  706889    4271     585          chin/red/tot.1   35027    2420     663
        chin/voa/tot.1   58813    1886     376          chin/voa/tot.1   35027    1616     348
        chip/voa/tot.1   59476     930     114          chip/voa/tot.1   35027     830      98
        tibe/vim/tot.1   53287    1469     389          tibe/vim/tot.1   35027    1300     370
        tibe/ccv/tot.1   88620    1155     292          tibe/ccv/tot.1   35027     846     196
        tibe/pmi/tot.1  143289    2932     666          tibe/pmi/tot.1   35027    1963     515
        chrc/red/tot.1  706889    4271     585          chrc/red/tot.1   35027    2420     663
        enrc/wow/tot.1   60293    6789    3244          enrc/wow/tot.1   35027    4869    2465
        envt/wow/tot.1   42098    1709     286          envt/wow/tot.1   35027    1650     291
        envg/wow/tot.1   60293   19120   13035          envg/wow/tot.1   35027   12911    9127
        voyp/grs/tot.1    1950     635     365          voyp/grs/tot.1    1950     635     365
        voyp/grm/tot.1     708     307     204          voyp/grm/tot.1     708     307     204
        viep/grs/tot.1   31200    7760    3216          viep/grs/tot.1   31200    7760    3216
        viep/mky/tot.1   39293    3471    1161          viep/mky/tot.1   35027    3341    1174

      # Counts for bad text (whole)                   # Counts for bad text (trunc)           
      # sample/sec      tokens   words  unique        # sample/sec      tokens   words  unique
      # -------------- ------- ------- -------        # -------------- ------- ------- -------
        hebr/tav/gen.1    1533       1       0          hebr/tav/gen.1    1533       1       0
        hebr/tav/exo.1    1209       1       0          hebr/tav/exo.1    1209       1       0
        hebr/tav/num.1    1289       1       0          hebr/tav/num.1    1289       1       0
        hebr/tav/lev.1     859       1       0          hebr/tav/lev.1     859       1       0
        hebr/tav/deu.1     955       1       0          hebr/tav/deu.1     955       1       0
        hebr/tav/tot.1    5845       1       0          hebr/tav/tot.1    3050       1       0
        hebr/tad/tot.1    5845       1       0          hebr/tad/tot.1    3085       1       0
        geez/gok/tot.1     497      84      41          geez/gok/tot.1     497      84      41
        geez/eno/tot.1     479      82      35          geez/eno/tot.1     479      82      35
        engl/wow/tot.1     898      10       8          engl/wow/tot.1     579       9       7
        engl/wnm/tot.1       0       0       0          engl/wnm/tot.1       0       0       0
        engl/cul/pre.1      61      21      15          engl/cul/pre.1      61      21      15
        engl/cul/her.1    3634     170      93          engl/cul/her.1    1166      90      62
        engl/cul/rec.1     313      20       7          engl/cul/rec.1     313      20       7
        engl/cul/tot.1    4008     186     101          engl/cul/tot.1    1174      93      61
        engl/cpn/tot.1       3       2       1          engl/cpn/tot.1       3       2       1
        engl/twp/tot.1   14318      49      35          engl/twp/tot.1    6392      20      17
        latn/ptt/gen.1    1531       1       0          latn/ptt/gen.1    1531       1       0
        latn/ptt/exo.1    1211       1       0          latn/ptt/exo.1    1211       1       0
        latn/ptt/num.1    1288       1       0          latn/ptt/num.1    1288       1       0
        latn/ptt/lev.1     858       1       0          latn/ptt/lev.1     858       1       0
        latn/ptt/deu.1     959       1       0          latn/ptt/deu.1     959       1       0
        latn/ptt/tot.1    5847       1       0          latn/ptt/tot.1    2077       1       0
        latn/nwt/mat.1    1071       3       2          latn/nwt/mat.1    1071       3       2
        latn/nwt/mrk.1     679       3       2          latn/nwt/mrk.1     679       3       2
        latn/nwt/luk.1    1151       1       0          latn/nwt/luk.1    1151       1       0
        latn/nwt/joh.1     879       1       0          latn/nwt/joh.1     879       1       0
        latn/nwt/tot.1    3780       4       2          latn/nwt/tot.1    2226       1       0
        latn/ock/tot.1     374      54      21          latn/ock/tot.1     362      54      21
        grek/nwt/mat.1    1071       1       0          grek/nwt/mat.1    1071       1       0
        grek/nwt/mrk.1     678       1       0          grek/nwt/mrk.1     678       1       0
        grek/nwt/luk.1    1150       1       0          grek/nwt/luk.1    1150       1       0
        grek/nwt/joh.1     879       1       0          grek/nwt/joh.1     879       1       0
        grek/nwt/tot.1    3778       1       0          grek/nwt/tot.1    1976       1       0
        span/qvi/one.1    2213      42      27          span/qvi/one.1     522      15      11
        span/qvi/two.1    3055      61      42          span/qvi/two.1     598      17      10
        span/qvi/tot.1    5268      88      60          span/qvi/tot.1     578      18      14
        ital/psp/tot.1    2925      88      57          ital/psp/tot.1     594      32      21
        fran/tal/tot.1    1490     140      93          fran/tal/tot.1     985     121      87
        port/csm/tot.1      89      47      35          port/csm/tot.1      29      11       6
        germ/sim/tot.1     898     101      79          germ/sim/tot.1     247      53      42
        russ/pic/tot.1    1454       8       5          russ/pic/tot.1    1236       8       5
        russ/ptt/gen.1       0       0       0          russ/ptt/gen.1       0       0       0
        russ/ptt/exo.1       0       0       0          russ/ptt/exo.1       0       0       0
        russ/ptt/num.1       0       0       0          russ/ptt/num.1       0       0       0
        russ/ptt/lev.1       0       0       0          russ/ptt/lev.1       0       0       0
        russ/ptt/deu.1       0       0       0          russ/ptt/deu.1       0       0       0
        russ/ptt/tot.1       0       0       0          russ/ptt/tot.1       0       0       0
        arab/quf/tot.1    6330      69      57          arab/quf/tot.1    2027      48      39
        arab/quv/tot.1    6313      56      47          arab/quv/tot.1    2013      38      32
        arab/qud/tot.1    6262      11       6          arab/qud/tot.1    1974       5       2
        arab/qph/tot.1    6236       1       0          arab/qph/tot.1    1953       1       0
        arab/qcs/tot.1    6236       1       0          arab/qcs/tot.1    2075       1       0
        viet/ptt/gen.1    1349       3       2          viet/ptt/gen.1    1135       3       2
        viet/ptt/exo.1    1015       3       2          viet/ptt/exo.1    1015       3       2
        viet/ptt/num.1     970       3       2          viet/ptt/num.1     922       3       2
        viet/ptt/lev.1     668       3       2          viet/ptt/lev.1     668       3       2
        viet/ptt/deu.1     731       3       2          viet/ptt/deu.1     731       3       2
        viet/ptt/tot.1    4733       3       0          viet/ptt/tot.1     995       3       0
        viet/nwt/mat.1     796       3       2          viet/nwt/mat.1     796       3       2
        viet/nwt/mrk.1     431       3       2          viet/nwt/mrk.1     431       3       2
        viet/nwt/luk.1     639       1       0          viet/nwt/luk.1     639       1       0
        viet/nwt/jhn.1     556       1       0          viet/nwt/jhn.1     556       1       0
        viet/nwt/tot.1    2422       4       2          viet/nwt/tot.1     978       1       0
        chin/ptt/gen.1    1316       1       0          chin/ptt/gen.1    1041       1       0
        chin/ptt/exo.1    1011       1       0          chin/ptt/exo.1    1001       1       0
        chin/ptt/num.1    1063       1       0          chin/ptt/num.1    1007       1       0
        chin/ptt/lev.1     710       1       0          chin/ptt/lev.1     710       1       0
        chin/ptt/deu.1     788       1       0          chin/ptt/deu.1     788       1       0
        chin/ptt/tot.1    4888       1       0          chin/ptt/tot.1    1029       1       0
        chin/ptn/gen.1     974       1       0          chin/ptn/gen.1     709       1       0
        chin/ptn/exo.1     841       1       0          chin/ptn/exo.1     698       1       0
        chin/ptn/num.1     750       1       0          chin/ptn/num.1     630       1       0
        chin/ptn/lev.1     599       1       0          chin/ptn/lev.1     599       1       0
        chin/ptn/deu.1     609       1       0          chin/ptn/deu.1     600       1       0
        chin/ptn/tot.1    3773       1       0          chin/ptn/tot.1     693       1       0
        chin/red/tot.1    4016       2       0          chin/red/tot.1     236       1       0
        chin/voa/tot.1    1022      68      36          chin/voa/tot.1     664      58      33
        chip/voa/tot.1     526       3       0          chip/voa/tot.1     315       2       0
        tibe/vim/tot.1      69       4       2          tibe/vim/tot.1      50       4       2
        tibe/ccv/tot.1      49      11       8          tibe/ccv/tot.1      22       9       7
        tibe/pmi/tot.1      42      14       8          tibe/pmi/tot.1       7       5       3
        chrc/red/tot.1    4016       2       0          chrc/red/tot.1     236       1       0
        enrc/wow/tot.1     898      10       8          enrc/wow/tot.1     579       9       7
        envt/wow/tot.1   28021     983     172          envt/wow/tot.1   23316     941     176
        envg/wow/tot.1     898      10       8          envg/wow/tot.1     579       9       7
        voyp/grs/tot.1       0       0       0          voyp/grs/tot.1       0       0       0
        voyp/grm/tot.1      18       6       4          voyp/grm/tot.1      18       6       4
        viep/grs/tot.1       0       0       0          viep/grs/tot.1       0       0       0
        viep/mky/tot.1    1105       1       0          viep/mky/tot.1     986       1       0
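
  Most rows above are identical in the two halves, so the differences
  are easy to miss by eye. The following is a minimal Python sketch,
  not part of the original scripts, that splits each side-by-side row
  and lists only the sections whose counts differ. The function names
  {parse_side_by_side} and {differing_rows} are hypothetical, and the
  sketch assumes only that the two halves are two versions of the same
  statistics, with the fields as labelled in the header line.

```python
import re

# One half-row: "sample/sec  tokens  words  unique" (fields as labelled in the table header).
ROW = re.compile(r"(\S+/\S+/\S+)\s+(\d+)\s+(\d+)\s+(\d+)")

def parse_side_by_side(text):
    """Return a list of (left, right) pairs; each side is (name, tokens, words, unique)."""
    pairs = []
    for line in text.splitlines():
        if line.lstrip().startswith("#"):
            continue  # header and ruler lines
        halves = ROW.findall(line)
        if len(halves) != 2:
            continue  # blank or malformed line
        left, right = [(h[0],) + tuple(int(x) for x in h[1:]) for h in halves]
        pairs.append((left, right))
    return pairs

def differing_rows(pairs):
    """Keep only the pairs whose two halves disagree in any count."""
    return [(l, r) for l, r in pairs if l[1:] != r[1:]]

# A few rows copied from the table above, used as sample input.
sample = """\
  # sample/sec      tokens   words  unique        # sample/sec      tokens   words  unique
  # -------------- ------- ------- -------        # -------------- ------- ------- -------
    hebr/tav/gen.1    1533       1       0          hebr/tav/gen.1    1533       1       0
    hebr/tav/tot.1    5845       1       0          hebr/tav/tot.1    3050       1       0
    engl/wow/tot.1     898      10       8          engl/wow/tot.1     579       9       7
"""

for left, right in differing_rows(parse_side_by_side(sample)):
    print(left[0], left[1:], "->", right[1:])
```

  Run on the whole table, this would report mainly the "tot.1"
  sections and a few others, since most sub-sections have the same
  counts in both halves.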
      

# END