Hacking at the Voynich manuscript - Side notes
027 Applying Jim Reeds's iterative digraph compression

Last edited on 1998-07-15 03:00:25 by stolfi

Jim Reeds suggested on 1998-07-14:

  > Here is a primitive cut at an "anti dain daiin" filter:
  > Read a text and tabulate all digraphs. Create a new symbol
  > and replace all instances of the most frequent digraph with that
  > new symbol. The new text will be somewhat shorter, will have
  > a character set with one extra symbol, and will have somewhat
  > higher entropy.
  >
  > The resulting text can be run through the filter again.
  > And again. And again...

Let's try it. I can reuse the "extract-signif-chars" and
"gather-tuples" scripts that I wrote for the entropy-colored
pages (Note 026).

    cat voyn-bio/full.txt \
      | extract-signif-chars \
          -v normal="-'" \
          -v errors='_' \
      > bio-00.sig

I wrote another simple gawk script "replace-signif-digraph" that
does the replacement (skipping over decoration and breaks), and a
shell script "reeds-compress" that does the outer loop and manages
the file names.

    reeds-compress bio 00 10  A B C D E F G H I J
    reeds-compress bio 10 20  K L M N O P Q R S T

    # A = dy ( 1908)
    # B = he ( 1846)
    # C = qo ( 1533)
    # D = ol ( 1087)
    # E = Ck ( 1025)
    # F = cB (  955)
    # G = ai (  860)
    # H = eA (  852)
    # I = sB (  808)
    # J = al (  524)
    # K = Gn (  429)
    # L = in (  425)
    # M = FA (  415)
    # N = ey (  394)
    # O = ar (  378)
    # P = GL (  374)
    # Q = ch (  364)
    # R = eH (  352)
    # S = IA (  334)
    # T = ok (  307)

Trying again, with word breaks treated as letters (but not line
breaks - handling the decoration would be too messy).

First, we create a new version of the ".sig" file where

  * word break (class 1) entries whose external rep is " " are
    turned into significant chars (class 3) with rep = "-"

  * word breaks with other external reps are turned into
    significant chars with rep "-", followed by a deco (class 0)
    with the same rep (minus a leading blank, if any).

  * a small set of "-"s is inserted before and after each
    paragraph break.

    cat bio-00.sig \
      | sed \
          -e 's/^1[ ]$/3-/' \
          -e 's/^1[ ]/3- 0/' \
          -e 's/^1/3- 0/' \
          -e '2,$s/^\(2.*\)/3- 3- 3- \1/' \
      | tr '\013' '\012' \
      > bio-sp-00.sig

    cat bio-sp-00.sig \
      | sed -e 's/^.//' \
      | tr -d '\012' \
      | tr '\015' '\012' \
      > bio-sp-00.txt

Now let's compress again:

    reeds-compress bio-sp 00 10  A B C D E F G H I J
    reeds-compress bio-sp 10 20  K L M N O P Q R S T

    # A = y. ( 3358)
    # B = dA ( 1886)
    # C = he ( 1846)
    # D = qo ( 1533)
    # E = l. ( 1270)
    # F = Dk ( 1025)
    # G = cC (  955)
    # H = ai (  860)
    # I = n. (  860)
    # J = eB (  851)
    # K = sC (  808)
    # L = r. (  661)
    # M = oE (  652)
    # N = ol (  435)
    # O = HI (  426)
    # P = iI (  423)
    # Q = aE (  416)
    # R = GB (  412)
    # S = eA (  391)
    # T = HP (  373)

Let's try the same with Herbal-A and Herbal-B:

    cat ../026/voyn-hea/full.txt > hea-00.txt
    cat ../026/voyn-heb/full.txt > heb-00.txt

    foreach sec ( hea heb )
      cat ${sec}-00.txt \
        | extract-signif-chars \
            -v normal="-'" \
            -v errors='_' \
        > ${sec}-00.sig
      reeds-compress ${sec} 00 10  A B C D E F G H I J
      reeds-compress ${sec} 10 20  K L M N O P Q R S T
    end

    cat hea-*.dic

    # A = ch ( 2982)
    # B = Ao ( 1433)
    # C = ai ( 1265)
    # D = in ( 1076)
    # E = sh (  981)
    # F = CD (  973)
    # G = ol (  886)
    # H = qo (  697)
    # I = or (  604)
    # J = Ae (  558)
    # K = dF (  544)
    # L = Ay (  519)
    # M = ct (  513)
    # N = Mh (  506)
    # O = dy (  504)
    # P = Bl (  429)
    # Q = ar (  385)
    # R = ot (  383)
    # S = ey (  376)
    # T = Br (  372)

Now with English:

    cat ../026/engl-wow/full.txt > wow-00.txt

    foreach sec ( wow )
      cat ${sec}-00.txt \
        | extract-signif-chars \
            -v normal="-'" \
            -v errors='_' \
        > ${sec}-00.sig
      reeds-compress ${sec} 00 10  A B C D E F G H I J
      reeds-compress ${sec} 10 20  K L M N O P Q R S T
    end
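
For reference, the filter that Jim Reeds describes at the top of this note can be sketched in a few lines of Python. This is only an illustration, not the gawk "replace-signif-digraph" script or the "reeds-compress" driver used above: the function names are hypothetical, the input is assumed to be a plain string of significant characters (decoration and breaks already removed), and the tie-breaking rule is my own choice, since the note does not say how ties are resolved.

```python
from collections import Counter


def most_frequent_digraph(symbols):
    """Return ((first, second), count) for the most frequent adjacent pair.

    Returns None if the sequence has fewer than two symbols.
    Overlapping occurrences are all counted, so "aaa" counts "aa" twice.
    """
    counts = Counter(zip(symbols, symbols[1:]))
    if not counts:
        return None
    # Assumption: ties go to the lexicographically smallest pair,
    # which keeps the result deterministic.
    return min(counts.items(), key=lambda item: (-item[1], item[0]))


def replace_digraph(symbols, digraph, new_symbol):
    """Replace non-overlapping occurrences of digraph, left to right."""
    out = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == digraph:
            out.append(new_symbol)
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return out


def reeds_compress(text, new_symbols):
    """Run one filter pass per entry of new_symbols.

    new_symbols must not already occur in text (the runs above use
    capital letters for this reason).
    Returns the compressed text and a list of (symbol, digraph, count).
    """
    symbols = list(text)
    dictionary = []
    for new_symbol in new_symbols:
        best = most_frequent_digraph(symbols)
        if best is None:
            break
        digraph, count = best
        dictionary.append((new_symbol, "".join(digraph), count))
        symbols = replace_digraph(symbols, digraph, new_symbol)
    return "".join(symbols), dictionary


if __name__ == "__main__":
    compressed, dictionary = reeds_compress("daiindaiindain", "AB")
    for symbol, digraph, count in dictionary:
        # Same layout as the ".dic" listings above.
        print("# %s = %s (%5d)" % (symbol, digraph, count))
    print(compressed)
```

As in the listings above, a digraph chosen in a later pass may itself contain symbols created by earlier passes, which is how longer groups get built up step by step. Treating word breaks as letters would amount to keeping a "-" symbol in the input string instead of stripping it.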