Last edited on 1998-07-16 00:40:19 by stolfi

East Asian languages

Vietnamese
Text fetched from WWW [ full ] [ page ]

    Vietnamese has a large alphabet, Latin-based but with many diacritics. This text uses some standard computer encoding, which seems to be almost a letter-for-letter transcription of the official Vietnamese script (except that some characters will look very wrong in an ISO Latin-1 browser). The entropy is lower than that of English, probably because of the monosyllabic word structure. Note, in particular, that the word breaks are fairly predictable. Another source of low entropy is the use of digraphs (nh, ng, ph, th, kh, eå, uù, etc.) to represent some relatively frequent phonemes.
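
    To make the entropy figures more concrete, here is a minimal sketch of a conditional entropy estimate. It assumes that l and r in the tables on this page are the numbers of context characters taken to the left and to the right of the character being predicted; that reading is a guess, and the function name conditional_entropy is hypothetical. The estimate uses raw relative frequencies only, so it is not necessarily the formula behind the linked pages.

```python
import math
from collections import Counter

def conditional_entropy(text, l=1, r=0):
    """Frequency-based estimate, in bits per character, of the entropy of a
    character given the l characters before it and the r characters after it.
    (Assumed meaning of l and r; not taken from the original page.)"""
    joint = Counter()    # counts of (context, character) pairs
    context = Counter()  # counts of contexts alone
    for i in range(l, len(text) - r):
        ctx = (text[i - l:i], text[i + 1:i + 1 + r])
        joint[(ctx, text[i])] += 1
        context[ctx] += 1
    total = sum(context.values())
    if total == 0:
        return 0.0
    h = 0.0
    for (ctx, _char), n in joint.items():
        # p(ctx, char) * log2 p(char | ctx)
        h -= (n / total) * math.log2(n / context[ctx])
    return h

if __name__ == "__main__":
    # A perfectly alternating string is fully predictable from one
    # character of left context.
    print(conditional_entropy("abababab", l=1, r=0))
```

    Under this reading, a text whose word breaks are highly predictable should show a very small contribution from the space character once l is 1 or more, because the preceding letters nearly determine whether a space follows.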

l = 1    r = 0 [ colorized page ] [ bits per tuple ]
l = 2    r = 0 [ colorized page ] [ bits per tuple ]
l = 3    r = 0 [ colorized page ] [ bits per tuple ]
l = 1    r = 1 [ colorized page ] [ bits per tuple ]

Phonetic Mandarin (numeric tones)
Lao-Tsu's Tao Te King [ full ] [ page ]

    The low entropy of this text is obvious. The Chinese scholars who designed the modern phonetic spelling (pinyin) had the dual goal of accurately recording the Chinese phonemes while preserving the occidental letter values as much as possible. Therefore many common consonant sounds came to be represented by digraphs, e.g. zh, ch, sh, t'.
    Moreover, each Chinese non-compound word is a single syllable, which must begin with a single consonant sound or semivowel, continue with a very limited choice of vowels and diphthongs, and end with r, n, ng, or nothing. This rigid (hence predictable) word structure contributes to lowering the entropy. Finally, this particular variant of pinyin uses postfix digits to denote the four tones, which means each digit carries at most log₂ 4 = 2 bits of information, and less than that when the tones are not equally frequent.
    Note that most of the information is carried by the initial consonant of each syllable; on the other hand, the spaces in this encoding have almost zero information content.
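
    The bound on the tone digits can be checked with a short sketch. The sample string below is an illustrative numeric-tone rendering of the opening line of the Tao Te King, typed in by hand; it is an assumption about the spelling and is not copied from the text analyzed here. The function name tone_digit_entropy is likewise hypothetical.

```python
import math
import re
from collections import Counter

# Illustrative sample only (assumed spelling, not the analyzed file).
SAMPLE = "dao4 ke3 dao4 fei1 chang2 dao4"

def tone_digit_entropy(text):
    """Entropy, in bits per digit, of the tone digits 1-4 found in text,
    estimated from their relative frequencies."""
    digits = re.findall(r"[1-4]", text)
    total = len(digits)
    if total == 0:
        return 0.0
    counts = Counter(digits)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

if __name__ == "__main__":
    # Four equally likely tones give exactly 2 bits; any skew gives less.
    print(tone_digit_entropy("a1 a2 a3 a4"))
    print(tone_digit_entropy(SAMPLE))
```

    The second figure comes out below 2 bits because tone 4 accounts for half of the digits in the sample, which is the effect described above.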

l = 1    r = 0 [ colorized page ] [ bits per tuple ]
l = 2    r = 0 [ colorized page ] [ bits per tuple ]
l = 3    r = 0 [ colorized page ] [ bits per tuple ]
l = 1    r = 1 [ colorized page ] [ bits per tuple ]

Phonetic Mandarin (diacritic tones)
Lao-Tsu's Tao Te King [ full ] [ page ]

    This is the same text as in the previous example, but with a more compact encoding. The numeric tones have been replaced by diacritics over the main vowel, as in official pinyin, except that the umlaut `¨' is used in place of the macron `¯', and the circumflex in place of the háček, due to font limitations. Also, the common digraphs ng, zh, ch, sh have been arbitrarily replaced by the single characters ñ, ð, þ, ç.
    These changes raise the mean h₃ entropy of this Chinese text to levels comparable to the Western texts above (2.7 bits/letter by the Bayesian formula, 2.0 by the frequentist one). The main source of inefficiency still left is the word break, which is almost entirely predictable due to the rigid word structure. Indeed, in the official pinyin script, compound words (which are the majority of the entries in a dictionary) are written without intervening spaces, as in zhöngguó (= China, lit. middle-country) and huôxïng (= Mars, lit. fire-planet).
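
    The recoding described in this section can be sketched as follows. This is not the procedure actually used to prepare the text, only an illustration under stated assumptions: syllables are lowercase ASCII followed by a digit 1-4, the diacritic goes on a or e if present, else on the o of ou, else on the last vowel (the usual pinyin placement rule as understood here), and the letter ü is not handled. The names TONE_MARKS, DIGRAPHS, add_tone_mark and recode are hypothetical.

```python
import re
import unicodedata

# Combining marks per tone, following the substitutions described above:
# 1 -> diaeresis (stands in for the macron), 2 -> acute,
# 3 -> circumflex (stands in for the hacek), 4 -> grave.
TONE_MARKS = {"1": "\u0308", "2": "\u0301", "3": "\u0302", "4": "\u0300"}
DIGRAPHS = {"ng": "ñ", "zh": "ð", "ch": "þ", "sh": "ç"}

def main_vowel_index(syllable):
    """Index of the vowel that receives the tone mark, or -1 if none."""
    for v in "ae":
        if v in syllable:
            return syllable.index(v)
    if "ou" in syllable:
        return syllable.index("o")
    for i in range(len(syllable) - 1, -1, -1):
        if syllable[i] in "iou":
            return i
    return -1

def add_tone_mark(match):
    """Replace a trailing tone digit by a diacritic on the main vowel."""
    syllable, tone = match.group(1), match.group(2)
    i = main_vowel_index(syllable)
    if i < 0:
        return syllable  # no vowel found: just drop the digit
    marked = unicodedata.normalize("NFC", syllable[i] + TONE_MARKS[tone])
    return syllable[:i] + marked + syllable[i + 1:]

def recode(text):
    """Numeric-tone spelling -> diacritic tones plus single-letter digraphs."""
    text = re.sub(r"([a-z]+)([1-4])", add_tone_mark, text)
    for digraph, single in DIGRAPHS.items():
        text = text.replace(digraph, single)
    return text

if __name__ == "__main__":
    print(recode("zhong1 guo2"))
    print(recode("huo3 xing1"))
```

    On the two example words, the tone marks come out as in the spellings zhöngguó and huôxïng given above, with zh and ng further collapsed to ð and ñ. Each syllable loses its tone digit and each digraph shrinks to one character, so the same information is spread over fewer symbols, which is why the per-letter entropy rises.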

l = 1    r = 0 [ colorized page ] [ bits per tuple ]
l = 2    r = 0 [ colorized page ] [ bits per tuple ]
l = 3    r = 0 [ colorized page ] [ bits per tuple ]
l = 1    r = 1 [ colorized page ] [ bits per tuple ]