InterBase Collation Kit (DRAFT)
November 1999
David Brookstone Schnepper
dave@BrookstoneSystems.com
1
Introduction
This paper describes how to create a custom collation or
character set driver for InterBase. An
example using InterBase 6.0 is included.
This document uses international characters for
illustration. The encoding for this
document is Character set Windows 1252 (a.k.a Latin 1, or Western European).
2
Contents
Note: The examples are not available for this draft. They will be available with
the final release of the Collation Kit.
|
CollationKit2.doc
|
This document
|
|
CollationKit2.html
|
This document in HTML
|
|
CollationKit2.txt
|
This document as a text file
|
|
makefile.bc
|
Makefile for Borland C
|
|
makefile.so
|
Makefile for Solaris
|
|
makefile.hp
|
Makefile for HP (not completed)
|
|
ld2.c
|
Example linkage function between engine and international
module
|
|
cs_example.c
|
Character Set example
|
|
cv_example.c
|
Text conversion example
|
|
lc_example.c
|
Collation example
|
|
cs_latin1.h
|
Mapping tables for Character Set example
|
|
country_codes.h
|
Defines numeric codes for countries
|
|
charsets.h
|
InterBase numeric codes for character sets
|
|
cs_example.h
|
Generic include file for the example
|
|
tx865_lat1.h
|
Translation tables between DOS865 and ISO Latin-1
character sets
|
|
intlobj.h
|
Interface definition between InterBase Engine and
International modules.
|
|
cs_example.sql
|
SQL script defining & testing example character set
& collation.
|
|
cs_util.sql
|
Utility procedures for creating character sets and
collations.
|
3
History
Prior to InterBase v3.3, InterBase defined CHARACTER(n) data
as simple 8-bit bytes. It was up to
the application programmer to know what character set was used to store the
data, and how to convert between different character sets. Only binary collation order was supported
for CHARACTER(n) data.
In June 1992, InterBase v3.3 was released, which included
the first support for multiple character sets in a database and alternate
collations for text. Only single-byte
character sets and European-style collations were supported at that time. Version 3.3 also supported the ability to
extend the International functionality by adding additional character set or
collation modules.
Separately, a version 3.2J (Japanese) was released in
1992. The 3.2J version supported
Japanese data stored in EUC and SJIS formats.
It only supported codepoint order collation. There was no ability to extend the functionality of the
international parts of the product.
In 1993, InterBase 4.0 was released. InterBase 4.0 built on the foundation
established by v3.3 and v3.2J and combined both single-byte and multi-byte
character set support into a single product.
Unlike the 3.3 architecture, InterBase 4.0 used the
then-standard SQL-92 Internationalization architecture.
This document describes how to create drivers in the
InterBase Internationalization architecture.
Examples are provided for InterBase 6.0. These examples should also work for any InterBase 4.x or 5.x
version - as the architecture has not changed since originally implemented for
InterBase 4.0.
Many of the International collations supported in InterBase
were originally designed for Borland's Paradox and/or dBASE product lines. The collations were re-implemented in
InterBase and validated against the original product. An important design decision was that InterBase collations must
match the corresponding dBASE/Paradox collation PRECISELY, to make them
compatible. InterBase duplicated known
bugs in dBASE/Paradox collations to preserve compatibility. InterBase defines several collations that
are "the same, but different". For
instance, there are German collations for dBASE, for Paradox, and an InterBase
specific collation "DE_de". The general
rule is dBASE & Paradox collations perform identically to dBASE &
Paradox while InterBase specific collations follow the national standard for
the language, if one exists.
An InterBase v3.3 Collation Kit document was written in
1993, which documents creating an InterBase 3.3 collation driver. This document covers InterBase 4.x through
6.x - using the architecture developed for V4.x.
4
SQL Internationalization Definitions
Internationalization is full of ambiguous terminology: the
statements are the same but all the definitions are different. Here are the definitions from the SQL
approach to Internationalization. These
definitions are painfully precise in the SQL 92 definition - this documents
takes some liberties with that precision in the interest of clarity. (See the SQL 92 standard for the original
definitions). The SQL definitions are
different from the "intuitive" definitions many computer programmers use.
4.1
Character Repertoire
A character repertoire is a set of characters used for a
particular culture. For instance,
English uses A, B, C ... Z. French uses
A, Á, À, B, C, Ç, D ... Z.
4.2
Character Set
A character set is a collection of characters that includes
at least one character repertoire. For
instance, ISO Latin 1 is a character set that includes the English repertoire
and the French repertoire.
4.3
Form of Use
A form of use is the mapping of a character set to a
particular computer based representation.
It includes the numerical value assigned to a particular character, and
the way to encode the value into bits for computer use. Examples of encoding include single byte,
multi-byte sequence, 16-bit value, etc.
4.4
Collation
A collation is a method of ordering sequences of characters
in a culturally acceptable fashion.
4.5
Character
A character is a particular member of a character
repertoire. Characters can change "form
of use" (e.g.: encoding) but never meaning.
5
InterBase Internationalization Definitions
This section defines the terminology used in InterBase's
internationalization implementation.
5.1
Narrow Character - NCHAR
A narrow character set represents text as a sequence of
bytes, each byte representing a unique character.
typedef unsigned char NCHAR;
ISO Latin1, DOS437, and similar character sets are
represented as NCHAR.
5.2
Multi-byte Character - MBCHAR
A multi-byte character set represents text as a sequence of
bytes, with some byte-values indicating that one or more of the following bytes
determine the next character.
typedef unsigned char MBCHAR;
SJIS, EUC-J, BIG-5, and similar character sets are
represented as MBCHAR.
5.3
Wide Character - WCHAR
A method of representing a character set as a sequence of
integers, in which the integer is larger than an 8-bit byte. Typically, a platform specific WCHAR
implementation matches the integer size for the platform.
InterBase defines WCHAR as an unsigned short - a 16-bit
value.
typedef unsigned short WCHAR;
InterBase does not fully support character sets in WCHAR format.
It does not swap bytes between platforms of different integer architectures
(e.g.: little-end-ian and big-end-ian platforms). This limitation can create problems when the client and server
have different integer architectures.
InterBase uses WCHAR as an internal processing format -
InterBase will typically convert a MBCHAR string to WCHAR for text matching
functions.
UNICODE, in WCHAR format, is used internally for charset
conversion functions.
5.4
Character Set
An InterBase character set consists of a SQL character set
and a SQL Form of Use. For instance,
the InterBase character Set ISO-8859-1 is precisely called: "The set of characters defined by ISO
Standard 8859-1, encoded by the values defined in ISO standard 8859-1, having
each value represented by a single 8-bit byte".
InterBase character sets are generally defined by national
standards or industry standards.
Examples: the International Standards Organization defines ISO-8859-1,
and Microsoft defines Windows1252.
InterBase is precise in following defined standards. Windows 1252 is a superset of ISO-8859-1
(commonly known as Latin-1). In
InterBase, WIN1252 and LATIN_1 are two separate character sets because Windows
1252 defines characters in positions that Latin-1 specifically states are "not a
character".
5.5
Text Type
An InterBase text type could also be termed a "locale". It consists of all culturally significant
information about processing text.
This includes: character set, representation (form of use), collation
method, comparison method, and uppercasing method.
Typical "locale" information also includes such methods as
conversion of numeric values to string values, time representation, and date
representation. These methods and
formats are NOT included in the InterBase Text Type.
5.6
Transliteration
A transliteration occurs when characters in one InterBase
character set are converted to another character set. Note that this process occurs when changing from one SQL "form of
use" to another. InterBase terms the
conversion from Unicode in the multi-byte FSS format to Unicode in WCHAR format
a transliteration. The conversion from DOS437 to ISO-8859-1 is also a
transliteration.
InterBase transliterations are "precise." InterBase will
report a transliteration error if the character in the input set does not have
an exact representation in the output set.
None of the existing set of InterBase transliterations will
transliterate a "missing character" to another character.
This restriction provides the ability to perform "round-trip
conversions". A conversion is
"round-trip" if it converts from character set A to character set B, then back
to A, producing the original sequence of characters in the original
representation.
InterBase transliterations preserve "character fidelity".
Custom transliterations can perform alternate conversions
for "missing characters." For instance,
a custom transliteration could convert the character <ae> to the string
"ae" when the character <ae> does not exist in the destination character
set.
6
Complexity of Collation
InterBase provides collations of varying complexity. The most sophisticated collations provide
full dictionary sorting appropriate to the language being compared. Many
InterBase collations provide a 4-weighted collation, where characters are
compared by:
-
Primary shape (A vs. B vs. C)
-
Accents (A vs. Á vs. Ä)
-
Case (A vs. a)
-
Punctuation weight (punctuation is ignored in the first three
comparison).
In a 4-weight collation Primary differences take precedence
over Accent differences, which take precedence over Case differences, which
take precedence over punctuation differences.
Most languages use left-to-right comparisons for all 4
weightings. French as used in France
requires right-to-left for Accent distinctions and left-to-right for all other
distinctions.
The ordering of Primary Shapes, Accents, Case, and
punctuation is culturally specific:
|
Collation
|
Primary Shapes
|
Accents
|
Case
|
Punctuation
|
|
English
|
Digits
0 1 2 ... 9 ...
before letters
A B C ... Z
|
|
Uppercase before lowercase
|
comma
dash
period
|
|
Danish
|
Letters
A B C ... Z Å ...
before digits
0 1 2 ... 9
|
grave
acute
circumflex
|
lowercase before uppercase
|
period
dash
comma
|
|
French
|
0 1 2 ... 9 ...
A B C Ç ... Z
|
circumflex
acute
grave
|
|
|
Note: The
examples used in this section are for illustration purposes only, and not meant
to be definitive of any collation sequence.
The "Primary Shape" of a character is culturally specific:
|
Collation
|
Character
|
Treatment
|
|
English
|
å
|
Same primary shape as a
|
|
Danish
|
å
|
Unique primary shape, sort after Z
|
|
English
|
ç
|
Same primary shape as c
|
|
French
|
ç
|
Unique primary shape, sort between c and d
|
|
Danish
|
Æ
|
Unique primary shape, sort after Z
|
|
German
|
Æ
|
Sort as if written "AE"
|
Each InterBase collation is specific to a character
set. All characters in the character
set must be sorted. If a character is
not used in a particular culture it can be collated to any location, but it
must still be collated, as it is valid input.
Icelandic Thorn (Þ) and eth (Ð) are not used in English - and the EN_UK
driver sorts them near T and D.
Some languages expand characters to an equivalent format:
|
Collation
|
Character
|
Expansion
|
Special rule
|
|
German
|
ß
|
ss
|
Sort after "ss" as an "accent difference"
|
|
Dutch
|
ÿ
|
ij
|
Sort before "ij" as an "accent difference"
|
Some languages contract sequences of characters:
|
Collation
|
Sequence
|
Sorting Rule
|
|
Spanish
|
CH
|
Primary shape between C and D
|
|
Spanish
|
LL
|
Primary shape between L and M
|
|
Danish
|
AA
|
Primary shape as Å, but treat as an accented difference
after Å
|
The complexities of culturally correct sorting are beyond
the scope of this document. Designing
a collation order requires skill and insight; find a good text book on the
subject and be sure you understand the issues before you begin.
7
InterBase Internationalization Classes
In JAVA object-oriented terminology; every character set
supported by InterBase is represented by a class that implements the InterBase
character set Interface.
Every InterBase text type is represented by a class that
implements the texttype interface.
Every transliteration is represented by a class that implements
the conversion interface.
InterBase does not implement international support in an
object-oriented language. It uses an
object-oriented design implemented in standard C.
The methods in each internationalization interface are not
very well divided. In particular, many
methods defined in texttype belong in charset (mb_to_wc in particular).
7.1
Charset
A class implementing the InterBase charset interface must
provide an internal name for the character set and:
-
The value for the SPACE character
-
The length of the SPACE character, in bytes
-
Max & min bytes per character
-
A method to convert from the character set to UNICODE in WCHAR
-
A method to convert from UNICODE in WCHAR to the character set
Charsets are defined by the structure charset in intlobj.h. The structures defined in intlobj.h are
explained starting in section 12.
7.2
Texttype
A class implementing the InterBase texttype interface must
provide an internal name for the texttype and functions that:
-
Calculate the key length for a given string length
-
Calculate the collation key for a given string
-
Compare two strings
-
Convert a string to uppercase
-
Convert a string to lowercase
-
Convert a string to WCHAR format
-
Fetch the next character in a string
Texttypes are defined by the structure texttype in
intlobj.h.
7.3
Conversion
-
A class implementing the InterBase conversion interface must
provide an internal name for the conversion and a function to convert FROM a
charset TO another charset.
Conversions are defined by the structure csconvert in
intlobj.h
8
Data Types for International Object Identifiers
Character sets and text types are identified by numeric
values. Once a numeric value is
established for a character set or text type, the value will never be
re-designated by InterBase. Preserving
those numbers protects backup/restore and compatibility between servers.
Some character set and collation identifiers are reserved
for user implementations. If you are
creating a text type for your own application, you should use values in the
range of 250 to 255 for character set identifiers and 250 to 249 for collation
identifiers. If you are creating a text
type that you expect to share or sell, you should use a value below 250 for
each identifier, and register your identifiers with InterBase support.
8.1
Character Set
Identifiers
Character set identifiers range from 0 to 255. The range from 250 to 255 is reserved for
user implementations. All numbers
below 250 must be registered with InterBase support to prevent conflicts with
other vendors of InterBase international objects.
typedef unsigned char CHARSETID;
CHARSET ID 127 is reserved for internal InterBase use.
8.2
Texttype identifiers.
A texttype is identified by a 16-bit value. The most-significant byte is the collation
identifier (COLLATIONID). The least
significant byte is the character set identifier (CHARSETID).
Collation IDs range from 0 to 254. Collation identifiers from 250 to 254 are reserved for
user-defined collations. Value 255 is not supported. If you use a
collation identifier below 250, you must register it with InterBase support to
avoid conflicts with other vendors of InterBase international objects. By convention Collation ID 0 is the default
collation for a given character set.
The default collation for a character set is binary collation, where
characters are ordered by the binary value specified by their form of use.
typedef unsigned char COLLATIONID;
typedef unsigned short TEXTTYPEID;
#define GET_CHARSETID (t) ((t) & 0x255)
#define GET_COLLATIONID(t) (((t)
>> 8) & 0x255)
Collation identifiers are unique within a character
set. Texttype ID's are unique across
the database.
9
Linking Numeric Identifiers and Symbolic Names
InterBase searches internal system metadata tables for the
symbolic names to attach to charset and texttype objects.
9.1
Symbolic Names for Character Set ID's
The following query gives the correspondence between charset
ID's and charset names.
SELECT RDB$CHARACTER_SET_ID, RDB$CHARACTER_SET_NAME
FROM RDB$CHARACTER_SETS;
This name is the "official" name for a character set. However, naming conventions for character
sets vary. InterBase character sets
have aliases to conform to the naming conventions on different platforms.
SELECT RDB$TYPE, RDB$TYPE_NAME
FROM RDB$TYPES
WHERE RDB$FIELD_NAME = "RDB$CHARACTER_SET_NAME";
RDB$TYPES in the internal table used to define symbolic
constants for various internal values.
9.2
Symbolic Names for Collation ID's
Collation names are also defined in a system table. Retrieving them requires a more complicated
query
SELECT
RDB$COLLATION_NAME, RDB$COLLATION_ID,
RDB$CHARACTER_SET_ID,
RDB$COLLATION_ID * 256
+ RDB$CHARACTER_SET_ID
AS TEXTTYPEID
FROM RDB$COLLATIONS;
RDB$COLLATIONS stores the collation name, the collation
identifier, and the character set identifier for each supported collation. The TEXTTYPEID is calculated from the
COLLATION_ID and the CHARACTER_SET_ID.
COLLATION_NAME must be unique in a database. Each texttype has a unique collation
name. Aliases are not supported for
collation names.
9.3
Naming Conventions
Collation names often follow the convention XX_yy.
- XX is
the two-letter Language code, per ISO Standard 639
(ftp://ftp.std.com/obi/Standards/ISO/ISO_639).
- yy is
the two-letter Country code, per ISO Standard 3166
(http://www.niso.org/3166.html).
Thus, DE_DE is the collation name for German as used in
Germany. FR_FR is French as used in
France, and FR_CA is French as used in Canada.
InterBase collations that emulate Paradox or dBASE
collations have names like "PDOXxxxx", "PXWxxx" or "DBxxxx". The names should match the names used in
Paradox and dBASE.
A special collation exists for each character set. This collation has the same name as the
character set and implements binary collation for the character set. Binary collation sorts the character set by
the numeric codes used to represent the characters.
Character set names follow the standard that defines them
(for example: WIN1252, DOS437).
Character set alias names support the differences in these names between
platforms (for example: WIN1252, WIN_1252).
10
Searching for an Object
InterBase searches for an international implementation
object when it loads its metadata. It
searches these three locations in this order:
-
The engine itself
-
The international library - gdsintl
-
The supplemental international library - gdsintl2
Objects for character sets ASCII, NONE, OCTETS, and
UNICODE_FSS are built into the engine.
Without these character set definitions the engine cannot load a
database. The default collations for
these character sets are also implemented within the engine.
All other InterBase international object implementations are
in the gdsintl library.
The gdsintl2 library is available for user-defined
international objects.
10.1 GDSINTL Libraries
Gdsintl and gdsintl2 are platform-dependant shared-object
libraries. The suffix for the filename
is platform dependant.
|
Platform
|
Suffix
|
|
Windows
|
DLL
|
|
Solaris
|
SO
|
|
HP
|
SL
|
The implementation of the library is
platform-dependant. Libraries must be
recompiled when moved to a new platform.
Some platforms do not support shared-object libraries. Novell NLM is an example. For those platforms, the entire gdsintl
library is linked with the engine.
User-defined international objects cannot be added to InterBase on
platforms that do not support shared-object libraries.
The InterBase installation procedure places gdsintl in the
intl, bin or lib directory in the InterBase root directory. gdsintl2 must be
stored in the same directory as gdsintl.
The location of the gdsintl library is dependant on the platform and
InterBase version.
|
Platform
|
Version
|
Location
|
|
Windows
|
4.x & 5.x
|
bin
|
|
All others
|
4.x & 5.x
|
lib
|
|
All
|
6.x
|
intl
|
10.2 Single Entry Point
Each international library has a single entry point, which
is LD_lookup for gdsintl, and LD_lookup2 for gdsintl. Aside from the difference in entry point names, the libraries are
functionally identical.
Historical note: Early
versions of IBM's AIX platform supported only one entry point in shared-object
libraries. That limitation is the reason for limiting the lookup function to
the single entry point for InterBase international libraries.
Historical
note: The Apollo platform required
that identifiers be unique across all shared-object libraries loaded by a
program. That limitation is the reason
for requiring the different entry point names in the two libraries.
10.3 The LD2_lookup Function
The lookup function for the gdsintl2 library is passed the
following:
-
Type of object to lookup
-
Numerical identifier to lookup
It returns
-
1 if object was not found
-
0 if object was found and a pointer to a function used to
initialize the object
USHORT LD2_lookup (
USHORT objtype,
FPTR_SHORT *fun,
SSHORT parm1,
SSHORT parm2)
Objtype is
one of type_texttype, type_charset, or type_csconvert. Any other input is an error.
|
Objtype
|
Parm1
|
Parm2
|
|
Type_texttype
|
TEXTTYPEID
|
0
|
|
Type_charset
|
CHARSETID
|
0
|
|
Type_csconvert
|
CHARSETID
(from)
|
CHARSETID
(to)
|
|
Any
other value
|
-- ERROR --
|
An example LD2_lookup function is provided in ld2.c
11
Initialization Functions
When an international implementation is found, the call to
LD2_lookup returns a pointer to the class initialization function. The engine calls this initialization
function with a pointer to a 0-initialized data structure, which the function
will fill. This data structure defines
the object to the engine.
The initialization is part of database startup, not server
startup! If your server supports more
than one database, the initialization function will be called for every
database that shares a set of international objects. Even in a single-database server, the initialization function
will be called each time the engine shuts down and reloads a database that was
temporarily inactive.
The initialized object is not released until the database is
unloaded. The engine releases the
object as a normal part of database shutdown.
Here is a sample initialization function for a texttype:
unsigned short
LC_win1252_example(
TEXTTYPE cache,
SSHORT parm1,
SSHORT dummy)
{
cache->texttype_version =
IB_LANGDRV_VERSION;
/* " Other initialization code "
*/
}
Each charset, each texttype, and each conversion supported
in a gdsintl library must have an initialization function.
Texttype objects implement an InterBase texttype. They
provide collation and other text manipulation functions. This is the definition of texttype in
intlobj.h:
typedef
struct texttype {
struct blk texttype_blk;
USHORT texttype_version; /* version ID of object */
USHORT texttype_flags; /* miscellaneous flags */
TTYPE_ID texttype_type; /* Interpretation ID */
ASCII *texttype_name;
CHARSET_ID texttype_character_set; /* ID of base character set */
SSHORT texttype_country; /* ID of base country values */
BYTE texttype_bytes_per_char; /* max bytes per character */
ULONG texttype_license_mask; /* required bits for license */
/* MUST BE ALIGNED */
FPTR_SHORT texttype_fn_init;
FPTR_SHORT texttype_fn_key_length;
FPTR_SHORT texttype_fn_string_to_key;
FPTR_short texttype_fn_compare;
FPTR_SHORT texttype_fn_to_upper; /* convert one ch to uppercase */
FPTR_SHORT texttype_fn_to_lower; /* One ch to lowercase */
FPTR_short texttype_fn_str_to_upper; /* Convert string to uppercase */
FPTR_SHORT texttype_fn_to_wc; /* convert string to wc */
FPTR_SHORT texttype_fn_contains; /* s1 contains s2? */
FPTR_SHORT texttype_fn_like; /* s1 like s2? */
FPTR_SHORT texttype_fn_matches; /* s1 matches s2 */
FPTR_SHORT texttype_fn_sleuth_check; /* s1 sleuth s2 */
FPTR_SHORT texttype_fn_sleuth_merge; /* aux function for sleuth */
FPTR_short texttype_fn_mbtowc; /* get next character */
BYTE *texttype_collation_table;
BYTE *texttype_toupper_table;
BYTE *texttype_tolower_table;
BYTE *texttype_expand_table;
BYTE *texttype_compress_table;
BYTE *texttype_misc; /* Used by some drivers */
ULONG *texttype_unused
[4]; /* spare space */
}
*TEXTTYPE;
Many of the fields in this structure are for use only by the
engine, some are for use by the object itself, and some form the interface
between the InterBase engine and the texttype implementation.
|
Field Name
|
Description
|
|
texttype_blk
|
Used for the engine's memory management - do NOT touch!
|
|
texttype_version
|
Set to IB_LANGDRV_VERSION
|
|
texttype_flags
|
For engine use
|
|
texttype_name
|
Pointer to ASCII internal name for texttype
|
|
texttype_character_set
|
CHARSET_ID of the character set this object is implemented
for.
|
|
texttype_country
|
Country ID of locale (not required)
|
|
texttype_bytes_per_char
|
Max bytes per char (duplicate information from charset
object)
|
|
texttype_license_mask
|
No longer used.
|
|
texttype_fn_init
|
Pointer to function used to initialize the data structure
|
|
texttype_fn_key_length
|
Calculates key length for a input string length.
|
|
texttype_fn_string_to_key
|
Converts a string to a collation key
|
|
texttype_fn_compare
|
Compares two strings of unlimited length.
|
|
texttype_fn_to_lower
|
Converts to lowercase
|
|
texttype_fn_to_upper, texttype_fn_str_to_upper
|
Converts to uppercase
|
|
texttype_fn_to_wc, texttype_fn_mbtowc
|
Converts string to WCHAR (belongs with charset object!)
|
|
texttype_fn_contains, texttype_fn_like,
texttype_fn_matches, texttype_fn_sleuth_check, texttype_fn_sleuth_merge
|
The engine will initialize these pointers with functions
used to implement LIKE, CONTAINS, and the GDML matching language. Do not set them in the initialization
function.
|
|
texttype_collation_table, texttype_toupper_table,
texttype_tolower_table, texttype_expand_table, texttype_compress_table,
texttype_misc, texttype_unused
|
These fields provide pointers to data tables that the
implementation uses to provide its functionality. The engine never refers to them directly; they are for the
object's use.
|
12.1 Functions to Perform Operations on Text
The first parameter of all functions implemented within the
texttype is a pointer to the data structure initialized by the initialization
function. This pointer provides access
to the data tables: texttype_collation_table, texttype_toupper_table, etc. Functions can be shared among different
texttypes implemented in the same gdsintl2 shared library. The result of the function is determined by
the data tables, which are specific to a texttype.
The comments in lc_example.c provide more detail on each
function.
12.2 Collation Keys
A collation key maps a string with a non-default collation
to a set of bytes that can be compared with unsigned-byte comparison preserving
the culturally correct collation order.
InterBase stores collation keys in indexes. Index keys are always compared with unsigned byte comparisons,
starting with the high order byte and continuing until a difference or the end
of the key is found. When creating an
index key, InterBase transforms values that do not produce correct ordering
using unsigned byte comparisons to values that do.
The mapping from a string to a collation key can be
irreversible. The original string is never recreated from the collation key.
When writing a textype_fn_string_to_key function, the
resulting key must consist of unsigned bytes.
In a binary collation, the collation key is identical to the
string.
Here are examples of collation keys created using different
collations:
|
Collation
|
Input
|
Collation Key
|
|
Binary
|
ÁbC
|
ÁbC
|
|
Case Insensitive binary
|
ÁbC
|
ÁBC
|
|
Accent Insensitive binary
|
ÁbC
|
AbC
|
|
Case and Accent Insensitive binary
|
ÁbC
|
ABC
|
|
English Book Title
|
The Practice
|
Practice, The
|
|
Telephone Book (USA)
|
McDonald
|
MacDonald
|
|
Telephone Book (USA)
|
1 Way
|
ONE WAY
|
|
Lexical Ordering (German)
|
ÁbCê
|
ABCE' ^AbCe
|
|
Lexical Ordering (French)
|
ÁbCê
|
ABCE^ 'AbCe
|
|
Lexical Ordering (German)
|
Cheß
|
CHESS 1chess
|
|
Lexical Ordering (Spanish)
|
Cheß
|
XHEß
Xheß
where X represents a special value between C and D.
|
12.3 Collation Keys versus Comparison
A texttype must provide functions for string comparison and
creating collation keys. String comparison is used for equality and inequality
Boolean evaluations. Collation keys are used in indexes. Comparing strings and comparing collation
keys must produce identical results.
The two functions have a significant difference. Index keys have a fixed maximum length,
currently 254 bytes. String comparison must be able to handle strings of
unlimited length.
12.4 Ignore Trailing Spaces
SQL requires that trailing spaces be ignored when comparing
two strings of different lengths. It is
important that both the key creation function and the compare function ignore
trailing spaces on strings.
Note: The space
character is determined by the character set.
12.5 Monocase operations
A texttype must contain functions to implement the SQL
UPPER(x) and LOWER(x) functions. The result of UPPER & LOWER is culturally
specific, and the rules are determined by the collation sequence defined with
the column.
The to_upper and to_lower functions can be implemented using
data tables or directly in code.
InterBase does not currently support the SQL LOWER(x)
function, however it is necessary to implement it in each defined texttype.
12.6 Form of Use Changes
A texttype contains functions for changing the "form of use"
of a character string from NCHAR or MBCHAR to WCHAR. These functions, strictly speaking, belong with the charset
object. A design decision in the
Internationalization architecture placed them with texttype.
12.7 Text Matching Functions
Entries exist in texttype for custom implementation of SQL
LIKE, CONTAINS, and GDML MATCHING functions.
The InterBase engine will fill these entries with internal function
pointers after the texttype initialization function is called. Your driver should not initialize these
function pointers.
12.8 Debugging a Collation Driver
Debugging a collation driver should be done before making it
into a GdsIntl2 driver. While the collation driver is a standalone entity, it
can print the collation keys it creates
to check for correctness.
After installing the driver, sort and index several tables,
compare indexed and non-indexed strings and perform UPPER() operations to
ensure the driver is working properly.
The example test file is cs_example.sql.
12.9 Example
The example texttype driver lc_example.c implements a simple version of English Title
Sort Order. When titles are sorted,
articles such as "The", "An", or "A" at the beginning of the title are
ignored. Thus "The Wind and the Lion"
sorts as if it were written "Wind and the Lion".
In this driver, the Collation Key created is simply the
input string, minus any articles at the beginning of the string. This collation does not change the default
ordering of accented characters or upper and lower case characters.
13
Conversion Objects
InterBase uses conversion objects to transform one character set to another. A conversion object is defined by the data
structure csconvert in intlobj.h
typedef
struct csconvert {
struct blk csconvert_blk;
USHORT csconvert_version;
USHORT csconvert_flags;
SSHORT csconvert_id;
ASCII *csconvert_name;
CHARSET_ID csconvert_from;
CHARSET_ID csconvert_to;
FPTR_SHORT csconvert_convert;
BYTE *csconvert_datatable;
BYTE *csconvert_misc;
ULONG *csconvert_unused
[2];
}
*CSCONVERT;
|
Field Name
|
Definition
|
|
csconvert_blk
|
Used by the engine's memory management. Do not modify.
|
|
csconvert_flags
|
Engine use - do not modify
|
|
csconvert_id
|
No longer used
|
|
csconvert_name
|
Internal name - can be ignored.
|
|
csconvert_from
|
CharsetID being converted FROM
|
|
csconvert_to
|
CharsetID being converted TO
|
|
csconvert_convert
|
Function to perform the conversion
|
|
csconvert_datatable, csconvert_misc, csconvert_unused
|
Available for object implementation to use for pointer
storage, etc. Not directly accessed
by the engine.
|
13.1 Conversion API
The conversion object has a single method, the function that
converts from one character set to another.
static USHORT conversion_function (
CSCONVERT obj,
BYTE *dest_ptr,
USHORT dest_len,
BYTE *src_ptr,
USHORT src_len,
SSHORT *err_code,
USHORT *err_position);
Input:
obj The conversion
object
dest_ptr Location for the converted string, or NULL
when an estimate of output length is requested.
dest_len Length
of the dest_ptr buffer, in bytes
str_ptr The
input string
src_len Length
of input string, in bytes.
*err_code See below
*err_position See below
Returns:
The length
of the converted string, in bytes.
If dest_ptr is NULL, then the routine returns a maximum
estimate of the space required for the output string if an input string of
src_len bytes is converted.
13.2 Error Returns from Conversion Function
The conversion function can return errors to the
engine. Generally the engine will
report these to the user as a :transliteration error". The error code is returned via *err_code. The position in the input string where the error
was detected is returned via *err_position.
|
Result in *err_code
|
Meaning
|
|
0
|
No Error
|
|
CS_CONVERT_ERROR
|
Input character does not exist in output character set.
|
|
CS_BAD_INPUT
|
Input string was not properly formed per rules of the
input character set.
|
|
CS_TRUNCATION_ERROR
|
Not enough space available in the output buffer.
|
13.3 Addressed by FROM charsetid & TO
charsetid
Unlike character sets and collations, conversions have no
symbolic names. A conversion is
identified by the charsetID is it converting from and the charsetID it is
converting to. An InterBase aficionado
will recognize the similarity to blob filters.
If no direct conversion object has been created to convert
from charset A to charset B, the engine converts charset A to UNICODE and then
UNICODE to charset B.
13.4 Example
In the sample code
provided, cv_example.c is an example of a conversion object. Conversion functions are generally table
driven. . The example contains several
different functions. The functions are code, and produce a specific conversion
when the the data table for that conversion is given.
For conversion tables, check files at www.unicode.org.
14
Charset object
InterBase uses a charset object to represent information for
a character set. Much of the
information inside a charset object is for engine use only.
charsets are defined by the structure charset in intlobj.h
typedef struct charset {
struct blk charset_blk;
USHORT charset_version;
USHORT charset_flags;
CHARSET_ID charset_id;
ASCII *charset_name;
BYTE charset_min_bytes_per_char;
BYTE charset_max_bytes_per_char;
BYTE charset_space_length;
BYTE *charset_space_character;
/* Must be
aligned */
FPTR_SHORT charset_well_formed;
struct
csconvert charset_to_unicode;
struct
csconvert charset_from_unicode;
VEC charset_converters;
VEC charset_collations;
ULONG *charset_unused [2];
} *CHARSET;
|
Field
|
Definition
|
|
charset_blk
|
Used by engine's memory manager. Do not modify.
|
|
charset_version
|
Version ID of InterBase international interface.
|
|
charset_id
|
The CHARSETID for this character set
|
|
charset_name
|
Internal name for charset
|
|
charset_min_bytes_per_char
|
Minimal number of bytes per character.
|
|
charset_max_bytes_per_char
|
Maximum bytes per character. InterBase will allocate this number of bytes for each character
in a SQL CHAR(n) definition.
|
|
charset_space_length
|
Length, in bytes, of the space character
|
|
charset_space_character
|
Pointer to the definition of the space character
|
|
charset_well_formed
|
Not used. This was
intended to be a pointer to a function that would validate that a string was
well formed by the rules of a character set.
|
|
charset_to_Unicode
|
A conversion object that transforms the character set TO
Unicode (WCHAR).
|
|
charset_from_Unicode
|
A conversion object that transforms UNICODE into the
character set.
|
|
charset_converters
|
For engine use only.
(The Engine keeps a list of converters to/from this character set
here).
|
|
charset_collations
|
For engine use only.
(The Engine keeps a list of collations defined for this character set
here).
|
|
charset_unused
|
May be used by object implementation.
|
14.1 Two Conversion Objects
Every charset must contain two conversion definitions: a conversion from Unicode to the charset and
a conversion to Unicode from the charset.
All InterBase character sets must be able to convert themselves to and
from Unicode.
When the engine cannot find a direct conversion between two
character sets, it uses the converters here. Unicode is the intermediate
conversion step.
14.2 One Texttype Object
The SQL standard requires that all character sets have a
default collation. By convention in
InterBase, the default collation for a character set has the same name as the
character set. It performs a collation
in binary, or codepoint order.
The character set default texttype object is loaded when the
character set is loaded. It must be defined separately in the gdsintl2 library.
14.3 The SPACE character
The SQL standard requires that all character sets define a
"space character". This character is
used to pad strings assigned to CHAR(n) columns to the length of the
column. Trailing space characters are
ignored in all string comparison operations.
The overwhelming majority of modern character sets use 0x20
or 0x0020 for the space character.
InterBase character sets follow that rule, except for character set
OCTETS. OCTETS instantiates binary byte
data, rather than character data, and uses 0x00 for the space character.
14.4 Example
The example character set in cs_example.c is identical to
the InterBase provided ISO_LATIN_1 character set.
The critical structures in a character set definition are
the tables that map the character set to and from Unicode. InterBase's mapping tables were derived from
the tables provided by Unicode, Inc. at their website - www.unicode.org.
15
Building gdsintl2
The build process for gdsintl2 is platform specific. In general terms, gdsintl2 must be a shared
library with minimal requirements for other libraries and a single
entrypoints. Building a gdsintl2
library is very similar to building an InterBase UDF library.
The platform specific makefiles provided in the collation
kit files are examples of building gdsintl2.
16
Special Notes
16.1 Backup / Restore
InterBase will backup and restore user defined character
sets and collations, and data in columns defined on them without problems.
If you restore a backup file with that includes user defined
international objects on a server other than the one on which it was backed up,
you must ensure that the character sets and collations referenced in the backup
file exist on the destination server. The CHARSETIDs and COLLATIONIDs must be
identical on the two servers.
16.2 Testing a GDSINTL2
When you test a new GDSINTL2, you must be aware that the
engine does not release a shared library until the engine itself is
unloaded. Each time you update your
gdsintl2, you must shut down the server, install the new gdsintl2, and reload
the server.
16.3Additional Shared Libraries.
The gdsintl2 shared-image can not require any shared
libraries that are not already loaded by InterBase. In particular, a gdsintl2 shared-library built with Borland
C++Builder will require VCL libraries that InterBase does not load. You can build the library using C++Builder
command line tools; just don't build the DLL within the visual environment.
If the gdsintl2 shared library references shared libraries
that were not loaded during server startup, the engine will fail to load the
gdsintl2 shared-library. The error
message for this is cryptic: "Implementation of text subtype %d not
located." Because the gdsintl2 library
is optional, the engine does not consider its absence an error. The error occurs when the engine attempts to
interpret data stored in a format defined in gdsintl2.
16.4 Platform Provided Internationalization
functions
Many platforms provide internationalization libraries that
perform the functions needed in gdsintl2.
For instance, JAVA provides the Collator class; Windows provides CompareString
and similar functions.
InterBase recommends against using a platform
provided Internationalization library as it may:
- be
platform specific.
Moving the database to another platform will require re-implementation of
the gdsintl2 library and may result in a different behavior if the two
platform implementations are not precisely identical.
- be OS
version specific.
Updating your OS may install a new version of the library - which may
perform differently. As the key
values are stored in InterBase indices your database will not operate
correctly until all indices are rebuilt.
- limit
input string size.
InterBase requires that comparison of Collation Keys and String Comparison
return identical results - String Comparison would not be able to use the
Collation Key function if it has a limit on input string size.
- not
be thread safe.
Many platforms use the process-wide locale setting for the international
library. The texttype
implementation would then have to set the process locale before each text
operation. This may result in
other, unintended, differences in InterBase behavior.
- have
poor performance.
An OS collation key function that loads data tables from disk each time
will greatly reduce database performance.
- require
a library not loaded by the InterBase server
If the library isn't already loaded by the InterBase server, your gdsintl2
library will not be able to access the OS provided functions.
16.5 Signed and Unsigned values
A CHARSETID and TEXTTYPEID are unsigned values. However, the LD_lookup interface between the
engine and the gdsintl library is defined as a signed short. You must use a C cast to refer to the value
as an unsigned short inside of your gdsintl2 module.
The error given when the engine is unable to locate a
texttype is "Implementation of text subtype %d not located." - the TEXTTYPEID /
CHARSETID value is printed as a signed value instead of unsigned.
16.6 InterBase Source Files
The files provided in the InterBase collation kit were
derived from InterBase source files.
InterBase source files are best viewed with tab-stops set to 4.
InterBase was written in Kernighan &Richie C, and
gradually moved toward ANSI C. Some inconsistencies remain; for example, some
function definitions have prototypes, some do not.
Building
an International Object, Step-by-Step
This section is a cookbook for creating a new InterBase
texttype.
First, build and install a gdsintl2 shared library from the
examples that accompany this document.
- Install
the example files.
- Copy
the platform specific makefile.platform to makefile.
- Make
the example gdsintl2 module ("make gdsintl2")
- Install
the example gdsintl2 module with "make install"
- Test
the module using cs_example.sql
Now that you are comfortable with the mechanics of building and using a new
stored library, begin working on your character set.
- Define
a collation object that implements binary collation.
- Create
your character set mapping tables.
Obtain a mapping table to and from Unicode from Unicode.com or
another reliable source.
- Define
a symbol for your charsetid in cs_example.h
- Modify
cs_example in cs_example.c to use your mapping tables instead of the example
ones.
- Modify
the initialization function cs_example in cs_example.c setting the
parameters to reflect your character set.
- Pick
character set and collation identifiers. If you intend to share or sell
your texttypes, you should pick a value below 250 for the character set
and collation identifiers and register them with InterBase support. If your new texttypes are for use only
in your controlled environment, pick a value between 250 and 255 for the
character set and between 250 and 254 for the collation. Those values do not require
registration.
- Modify
the linkage function in ld2.c to reflect your picked parameters.
- Test
the character set using a standalone test method.
- Make
and install the character set as a gdsintl2 driver.
- Add
the character set definition procedures defined in cs_util.sql to a newly
created test database for the character set.
- Declare
the character set to the database.
execute
procedure create_character_set (
"<NEW_CHARSET>",
<your id>);
- Declare
alias names for the character set.
execute procedure
create_character_set_alias (
"new name", /* for */
"existing name");
- Test
the newly created character set.
Create a column with all characters from the character set. Map the column to a Unicode column and
back again. Sort the column, index
it.
- Modify
the collation function in lc_example.c to support your collation.
- Rebuild
and re-install gdsintl2. You will
need to restart the InterBase server for it to load the new gdsintl2
module (as the prior version is already loaded).
- Declare
the collation to the database.
execute procedure
create_collation ("collation name",
<collation
id>, "charset name");
- Test
the collation.
Appendix
Existing Character Sets
SELECT RDB$CHARACTER_SET_NAME, RDB$CHARACTER_SET_ID
FROM
RDB$CHARACTER_SETS
ORDER BY
RDB$CHARACTER_SET_ID;
|
RDB$CHARACTER_SET_NAME
|
RDB$CHARACTER_SET_ID
|
|
NONE
|
0
|
|
OCTETS
|
1
|
|
ASCII
|
2
|
|
UNICODE_FSS
|
3
|
|
SJIS_0208
|
5
|
|
EUCJ_0208
|
6
|
|
DOS437
|
10
|
|
DOS850
|
11
|
|
DOS865
|
12
|
|
DOS860
|
13
|
|
DOS863
|
14
|
|
NEXT
|
19
|
|
ISO8859_1
|
21
|
|
KSC_5601
|
44
|
|
DOS852
|
45
|
|
DOS857
|
46
|
|
DOS861
|
47
|
|
CYRL
|
50
|
|
WIN1250
|
51
|
|
WIN1251
|
52
|
|
WIN1252
|
53
|
|
WIN1253
|
54
|
|
WIN1254
|
55
|
|
BIG_5
|
56
|
|
GB_2312
|
57
|
|
|
|
Alias
Names for InterBase character sets
SELECT RDB$TYPE, RDB$TYPE_NAME FROM RDB$TYPES
WHERE RDB$FIELD_NAME = "RDB$CHARACTER_SET_NAME"
ORDER BY RDB$TYPE, RDB$TYPE_NAME;