# Last edited on 2025-11-03 23:30:48 by stolfi

# Note 085
Generating two texts with the same language but different spelling

  The goal of this note is to generate texts that could serve as
  controls for investigations into the cause of the "language A /
  language B" split.

  The idea is to use Machado de Assis's Dom Casmurro in Portuguese
  with three different spellings, and its Spanish translation.

SETUP

    ln -s ../.. work
    ln -s work/projects/langbank bank
    ln -s work/projects/dom-casmurro casm
    ln -s work/error_funcs.py

EXTRACTING THE FULL TEXTS

  Create files "out/{lang}/{buk}/full.txt" where {lang} is "port" or
  "span" and {buk} is "cso" (1899 spelling), "csc" (1999 spelling),
  "csp" (2099 phonetic spelling), or "cas" (modern Spanish spelling).

    create_full_files.sh

CREATING THE PARALLEL SAMPLES

    create_parallel_samples.sh

>>>===REDO===<<<

EXTRACTING THE DISJOINT SAMPLES

  Files

    file     ln  spelling  chapters      source
    -------  --  --------  ------------  --------------------------------------------------
    DC1.txt  PT  1899      3,13,23,..73  ~/projects/dom-casmurro/1899/orig.txt
    DC2.txt  PT  1999      4,14,24,..74  ~/projects/dom-casmurro/1999/orig.txt
    DC3.txt  PT  phonetic  5,15,25,..75  ~/projects/dom-casmurro/2099/orig.txt
    DC4.txt  ES  modern    6,16,26,..76  https://livros01.livrosgratis.com.br/al000069.pdf

  Conventions used in these files:

    Lines starting with "#" are comments and should be ignored.

    Chapter numbers and titles are on lines marked "# \chapt".

    Text in italics (sometimes in a foreign language) is marked
    "\emph{...}".

    Ellipses are "...", and the period of an abbreviation is ".~",
    without a following space. Otherwise "." is always a sentence
    period.

    Lines that start with "---" are utterances in dialogs.
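
  The file conventions described above can be sketched as a small
  Python reader. This is only an illustration, not one of the scripts
  used in this note: the names "sample_chapters", "split_sentences"
  and "parse_sample" are hypothetical, and the sketch assumes that
  whatever follows "\chapt" on a chapter line is the chapter number
  and title, that "\emph{...}" groups are not nested, and that only
  "." ends a sentence.

```python
import re

# Assumption: \emph{...} groups are not nested.
EMPH = re.compile(r"\\emph\{([^{}]*)\}")
# Assumption: the rest of a "# \chapt" line holds the chapter number and title.
CHAPT = re.compile(r"^#\s*\\chapt\b\s*(.*)$")

ELLIPSIS_MARK = "\x00"  # temporary stand-in for "..."
ABBREV_MARK = "\x01"    # temporary stand-in for ".~"


def sample_chapters(first, last):
    """Chapters first, first+10, ..., last (e.g. 3,13,23,..73)."""
    return list(range(first, last + 1, 10))


def split_sentences(text):
    """Split on sentence periods, leaving "..." and ".~" intact."""
    protected = text.replace("...", ELLIPSIS_MARK).replace(".~", ABBREV_MARK)
    pieces = re.findall(r"[^.]+\.?", protected)
    restored = (
        p.replace(ELLIPSIS_MARK, "...").replace(ABBREV_MARK, ".~").strip()
        for p in pieces
    )
    return [s for s in restored if s]


def parse_sample(lines):
    """Return a list of chapters, each with its title and text items."""
    chapters = []
    current = None
    for raw in lines:
        line = raw.rstrip("\n")
        m = CHAPT.match(line)
        if m:
            # Chapter marker: start a new chapter record.
            current = {"title": m.group(1).strip(), "items": []}
            chapters.append(current)
            continue
        if line.startswith("#") or not line.strip():
            continue  # ordinary comment or blank line
        if current is None:
            # Text before any chapter marker goes into an untitled chapter.
            current = {"title": "", "items": []}
            chapters.append(current)
        is_dialog = line.startswith("---")
        body = line[3:] if is_dialog else line
        body = EMPH.sub(r"\1", body).strip()  # drop the italics markup
        current["items"].append(
            {"dialog": is_dialog, "sentences": split_sentences(body)}
        )
    return chapters
```

  For example, parse_sample(open("DC1.txt", encoding="utf-8")) would
  return one record per "# \chapt" line, and sample_chapters(3, 73)
  gives the chapter numbers listed for DC1.txt in the table. The
  italics markup is dropped here rather than kept, which is a choice
  of this sketch; an analysis that wants to exclude foreign-language
  phrases would need to keep track of the "\emph{...}" spans instead.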