# Last edited on 2004-02-17 01:35:05 by stolfi
# A collection of sample texts in several languages,
# formatted for statistical analysis.

CURRENT DIRECTORY LAYOUT

  Directories are named {LLLL}/{DDD} where {LLLL} is the language
  (engl, latn, span, ital) and {DDD} is the document.

  "Language" is defined here from the broad computational/statistical
  viewpoint. Minor differences in punctuation, null characters, and
  the like (see "file encoding" below) do not make a different
  language. Also acceptable are moderate differences in
  transliteration and spelling, such as different Romanization
  systems (e.g. Japanese "mitubisi" vs. "mitsubishi"), or classical
  vs. modern Latin.

  On the other hand, Chinese ideograms and Chinese pinyin are
  considered to be two different languages. Ditto for plain English
  and English in Vigenère or codebook cipher.

  Documents that differ by the systematic omission of certain
  symbols, such as Arabic/Hebrew with and without vowel marks, should
  be classed as different languages. However, if the remaining
  letters are encoded in the same way, we make an exception and use
  the same directory {LLLL}, for editing convenience.

DIRECTORY CONTENTS

  Within each directory, there should be files with the following
  names and meanings:

  * "main.src" (mandatory): the main source file, with all the manual
    cleanup, reformatting, error fixes, etc. See the detailed
    description below.

  * "main.raw" (optional): original files from external source, for
    documentation purposes only, with *purely mechanical* cleanup
    such as:

      * delete irrelevant HTML headers, markup and line breaks.

      * replace significant HTML tags (e.g. start of chapter or
        verse) by ad-hoc "@set{{VAR}}{[=|+]{VALUE}}" directives.

      * mark mechanically recognizable special text (e.g. English
        chapter titles, verse numbers, editorial notes) with textual
        type @-directives (see below).

    This file must be in the original encoding, with no hand-edits
    and no hand comments.
    The commands used to produce this file and other notes should be
    in Notebook.txt.

  * "Notebook.txt": technical description of the document's
    processing. Note that information that matters to users of the
    document should be placed as comments in the "main.src" file
    itself.

FORMAT OF THE MAIN SOURCE FILE

  The file "main.src" must have the following features:

  * identification of any significant structural divisions of the
    text (volumes, books, parts, chapters, sections and subsections,
    verses) with @-directives. See tools/src-to-wds for details.

  * segregation of linguistically homogeneous textual types (prose,
    poetry, foreign-language inserts, verse and chapter numbers and
    titles, lists and tables, figure labels and captions, etc.) into
    separate textual units, marked with textual type @-directives.

  * markup of significant paragraph and text-line breaks by
    distinctive characters. Note that a paragraph or a line may span
    several textual units, e.g. foreign-language phrases or tables.

  * regularization of spelling (such as removal of non-significant
    capitalization, capitalization of proper names, adding or
    deleting hyphens, fixes to punctuation, numbers and symbols,
    etc.)

  * replacement of linguistically significant font markup by
    punctuation marks or symbols, e.g. "/.../" for italic.

  * preamble comments with document description, external sources
    and credits.

  * comments describing all edits and transformations made to the
    original document after it was obtained from the external source.

  * a short table-of-contents summarizing the textual divisions and
    their nature.

  * directives "@include{FILENAME}" to include auxiliary files
    (encoding tables, parts of a multi-file document, etc.)

  This file may use a language-specific lossless encoding into
  printable ISO-8859-1 characters that is suitable for hand-editing.
  The main features of the encoding should be specified by the
  directives "@null{...}", "@alpha{...}", "@symbol{...}",
  "@punct{...}", "@blank{...}", "@parag{...}", and "@break{...}".
  These character classes must be disjoint and must not include the
  characters "@#{}*".

  In this file, words should be separated by blank spaces and/or
  newlines (implicitly included in the "@blank{}" directive). Note
  therefore that there is no relationship between file-lines (ASCII
  NL) and text-lines (marked by encoding-specific characters).

  Except for ASCII SP and NL, non-printable characters -- including
  ASCII HT, VT, CR, or ISO-8859-1 non-breaking space and soft hyphen
  -- are strictly forbidden.
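
  The constraints above (disjoint character classes, reserved
  characters "@#{}*", and no non-printable bytes other than SP and NL)
  lend themselves to a mechanical check. The following is only a
  rough sketch in Python, not part of the existing tools: the names
  check_classes and forbidden_bytes are hypothetical, and the sketch
  assumes that the contents of each class have already been extracted
  from the directives into plain strings, since the exact argument
  syntax of the directives is not described here.

```python
# Hypothetical helpers, not part of tools/src-to-wds.

RESERVED = set("@#{}*")  # characters that no class may contain


def check_classes(classes):
    """Check that character classes are disjoint and avoid RESERVED.

    `classes` maps a class name (e.g. "alpha", "punct") to a string
    holding the characters of that class.  This representation is an
    assumption made for the sketch.  Returns a list of problem messages.
    """
    problems = []
    owner = {}  # character -> name of the first class that listed it
    for name, chars in classes.items():
        for ch in chars:
            if ch in RESERVED:
                problems.append(f"class {name} contains reserved character {ch!r}")
            if ch in owner and owner[ch] != name:
                problems.append(f"character {ch!r} is in both {owner[ch]} and {name}")
            else:
                owner[ch] = name
    return problems


def forbidden_bytes(data):
    """Return (offset, byte) pairs for bytes not allowed in "main.src".

    `data` is the raw ISO-8859-1 content.  Allowed bytes are ASCII SP,
    ASCII NL, and the printable ranges 0x21-0x7E and 0xA1-0xFF, except
    the soft hyphen 0xAD.  This rejects HT, VT, CR, DEL, the C1 control
    range and the non-breaking space 0xA0.
    """
    bad = []
    for offset, b in enumerate(data):
        if b in (0x20, 0x0A):
            continue
        printable = (0x21 <= b <= 0x7E) or (0xA1 <= b <= 0xFF)
        if not printable or b == 0xAD:
            bad.append((offset, b))
    return bad
```

  For instance, forbidden_bytes(open("main.src", "rb").read()) would
  list the offsets of any offending bytes in a file, and an empty
  result from both functions would mean that these particular rules
  are satisfied. Other requirements of the format are not checked.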