# Last edited on 2004-02-07 10:25:49 by stolfi

# A collection of sample texts in several languages,
# formatted for statistical analysis.

CURRENT DIRECTORY LAYOUT

  Directories are named LLLL/DDD where LLLL is the language
  (engl, latn, span, ital) and DDD is the document.

  Within each directory there MAY be files with the following
  names and meanings:

    remove-html-junk-xx = filter that extracts usable text from
      the original HTML files.

    main.htm = the original source for edit.txt, as obtained from
      the net, with minimal clean-up, in the original encoding.

    main.raw = intermediate file with some mechanical code
      conversion and editing.

    main.org = final source file, with all manual edits and
      transformations, with standardized @-directives as expected
      by make-evt-from-org.

    main.evt = the main text in EVT format, generated mechanically
      from main.org.

    main.wds = list of words from main.evt.

    main.cts = frequency counts for main.wds.

    raw-to-org = converts main.raw into a first draft of main.org.

    preprocess-org = optional mechanical transformations of
      main.org before feeding it to org-to-evt.

    raw-to-wds = converts main.raw into a word list, for
      consistency checking.

    cook-words-for-diff = preprocesses the words extracted from
      main.evt for a consistency check against the words produced
      by raw-to-wds.

    Makefile = generates main.evt and the other files from main.org.
      If the Makefile is not present, the main.evt file was
      generated by hand-editing.

  The master texts are stored in an EVT-like format, extended with
  language-specific charset declarations (see wds-from-evt).

  Capitalization should be retained when it is significant in the
  original text. E.g., yes for English, Vietnamese, and modern Latin;
  no for Greek, Arabic, Tibetan, and some ancient Latin texts.

  The line locators should be built from the fields CCC, PP, and NNN,
  where CCC is a chapter/section identifier, PP is a text unit ID, and
  NNN is a line number (increasing within each CCC.PP combination).
  Each text unit must have a homogeneous nature (normal text, title,
  label, quotation, etc.).

  Comments start with "#" in column 1. There should be a
  para-comment "## " at the beginning of each new chapter.

TO DO

  This structure is inconsistent and too baroque. It is being
  converted to a simpler scheme in /home/staff/stolfi/projects/langbank/

  Old sub-directories are in ~/voynich/work/Texts-Old/
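
EXAMPLE: CHECKING A DOCUMENT DIRECTORY

  The layout described under CURRENT DIRECTORY LAYOUT can be checked
  mechanically. The following is only a minimal sketch in Python, not
  one of the scripts of this collection. The function name inventory
  and the names of the dictionary keys are hypothetical, and it assumes
  that the "xx" in remove-html-junk-xx stands for an arbitrary suffix.
  It reports which of the optional files are present in one LLLL/DDD
  directory, and applies the rule that a main.evt without a Makefile
  was generated by hand-editing.

```python
from pathlib import Path

# File names taken from the layout description; every one is optional.
KNOWN_FILES = [
    "main.htm", "main.raw", "main.org", "main.evt", "main.wds", "main.cts",
    "raw-to-org", "preprocess-org", "raw-to-wds", "cook-words-for-diff",
    "Makefile",
]


def inventory(doc_dir):
    """Summarize which known files exist in one LLLL/DDD directory."""
    doc_dir = Path(doc_dir)
    present = [name for name in KNOWN_FILES if (doc_dir / name).is_file()]
    # Assumption: "xx" is a variable suffix, so match the filter by prefix.
    filters = sorted(
        p.name for p in doc_dir.glob("remove-html-junk-*") if p.is_file()
    )
    return {
        "language": doc_dir.parent.name,   # the LLLL component
        "document": doc_dir.name,          # the DDD component
        "present": present,
        "html_filters": filters,
        # A main.evt with no Makefile is taken to be hand-edited.
        "evt_hand_edited": "main.evt" in present and "Makefile" not in present,
    }
```

  For instance, inventory("engl/xyz") (where xyz is a made-up document
  name) returns a dictionary whose "present" entry lists the known
  files found in that directory, in the order given above.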