InterBase Collation Kit (DRAFT)
November 1999
David Brookstone Schnepper
dave@BrookstoneSystems.com

1 Introduction

This paper describes how to create a custom collation or character set driver for InterBase. An example using InterBase 6.0 is included.

This document uses international characters for illustration. The encoding for this document is Character set Windows 1252 (a.k.a Latin 1, or Western European).

2 Contents

Note: The examples are not available for this draft. They will be available with the final release of the Collation Kit.

CollationKit2.doc	This document
CollationKit2.html	This document in HTML
CollationKit2.txt	This document as a text file
makefile.bc	Makefile for Borland C
makefile.so	Makefile for Solaris
makefile.hp	Makefile for HP (not completed)
ld2.c	Example linkage function between engine and international module
cs_example.c	Character Set example
cv_example.c	Text conversion example
lc_example.c	Collation example
cs_latin1.h	Mapping tables for Character Set example
country_codes.h	Defines numeric codes for countries
charsets.h	InterBase numeric codes for character sets
cs_example.h	Generic include file for the example
tx865_lat1.h	Translation tables between DOS865 and ISO Latin-1 character sets
intlobj.h	Interface definition between InterBase Engine and International modules.
cs_example.sql	SQL script defining & testing example character set & collation.
cs_util.sql	Utility procedures for creating character sets and collations.

3 History

Prior to InterBase v3.3, InterBase defined CHARACTER(n) data as simple 8-bit bytes. It was up to the application programmer to know what character set was used to store the data, and how to convert between different character sets. Only binary collation order was supported for CHARACTER(n) data.

In June 1992, InterBase v3.3 was released, which included the first support for multiple character sets in a database and alternate collations for text. Only single-byte character sets and European-style collations were supported at that time. Version 3.3 also supported the ability to extend the International functionality by adding additional character set or collation modules.

Separately, a version 3.2J (Japanese) was released in 1992. The 3.2J version supported Japanese data stored in EUC and SJIS formats. It only supported codepoint order collation. There was no ability to extend the functionality of the international parts of the product.

In 1993, InterBase 4.0 was released. InterBase 4.0 built on the foundation established by v3.3 and v3.2J and combined both single-byte and multi-byte character set support into a single product.

Unlike the 3.3 architecture, InterBase 4.0 used the then-standard SQL-92 Internationalization architecture.

This document describes how to create drivers in the InterBase Internationalization architecture. Examples are provided for InterBase 6.0. These examples should also work for any InterBase 4.x or 5.x version - as the architecture has not changed since originally implemented for InterBase 4.0.

Many of the International collations supported in InterBase were originally designed for Borland's Paradox and/or dBASE product lines. The collations were re-implemented in InterBase and validated against the original product. An important design decision was that InterBase collations must match the corresponding dBASE/Paradox collation PRECISELY, to make them compatible. InterBase duplicated known bugs in dBASE/Paradox collations to preserve compatibility. InterBase defines several collations that are "the same, but different". For instance, there are German collations for dBASE, for Paradox, and an InterBase specific collation "DE_de". The general rule is dBASE & Paradox collations perform identically to dBASE & Paradox while InterBase specific collations follow the national standard for the language, if one exists.

An InterBase v3.3 Collation Kit document was written in 1993, which documents creating an InterBase 3.3 collation driver. This document covers InterBase 4.x through 6.x - using the architecture developed for V4.x.

4 SQL Internationalization Definitions

Internationalization is full of ambiguous terminology: the statements are the same but all the definitions are different. Here are the definitions from the SQL approach to Internationalization. These definitions are painfully precise in the SQL 92 definition - this documents takes some liberties with that precision in the interest of clarity. (See the SQL 92 standard for the original definitions). The SQL definitions are different from the "intuitive" definitions many computer programmers use.

4.1 Character Repertoire

A character repertoire is a set of characters used for a particular culture. For instance, English uses A, B, C ... Z. French uses A, Á, À, B, C, Ç, D ... Z.

4.2 Character Set

A character set is a collection of characters that includes at least one character repertoire. For instance, ISO Latin 1 is a character set that includes the English repertoire and the French repertoire.

4.3 Form of Use

A form of use is the mapping of a character set to a particular computer based representation. It includes the numerical value assigned to a particular character, and the way to encode the value into bits for computer use. Examples of encoding include single byte, multi-byte sequence, 16-bit value, etc.

4.4 Collation

A collation is a method of ordering sequences of characters in a culturally acceptable fashion.

4.5 Character

A character is a particular member of a character repertoire. Characters can change "form of use" (e.g.: encoding) but never meaning.

5 InterBase Internationalization Definitions

This section defines the terminology used in InterBase's internationalization implementation.

5.1 Narrow Character - NCHAR

A narrow character set represents text as a sequence of bytes, each byte representing a unique character.

typedef unsigned char NCHAR;

ISO Latin1, DOS437, and similar character sets are represented as NCHAR.

5.2 Multi-byte Character - MBCHAR

A multi-byte character set represents text as a sequence of bytes, with some byte-values indicating that one or more of the following bytes determine the next character.

typedef unsigned char MBCHAR;

SJIS, EUC-J, BIG-5, and similar character sets are represented as MBCHAR.

5.3 Wide Character - WCHAR

A method of representing a character set as a sequence of integers, in which the integer is larger than an 8-bit byte. Typically, a platform specific WCHAR implementation matches the integer size for the platform.

InterBase defines WCHAR as an unsigned short - a 16-bit value.

typedef unsigned short WCHAR;

InterBase does not fully support character sets in WCHAR format. It does not swap bytes between platforms of different integer architectures (e.g.: little-end-ian and big-end-ian platforms). This limitation can create problems when the client and server have different integer architectures.

InterBase uses WCHAR as an internal processing format - InterBase will typically convert a MBCHAR string to WCHAR for text matching functions.

UNICODE, in WCHAR format, is used internally for charset conversion functions.

5.4 Character Set

An InterBase character set consists of a SQL character set and a SQL Form of Use. For instance, the InterBase character Set ISO-8859-1 is precisely called: "The set of characters defined by ISO Standard 8859-1, encoded by the values defined in ISO standard 8859-1, having each value represented by a single 8-bit byte".

InterBase character sets are generally defined by national standards or industry standards. Examples: the International Standards Organization defines ISO-8859-1, and Microsoft defines Windows1252.

InterBase is precise in following defined standards. Windows 1252 is a superset of ISO-8859-1 (commonly known as Latin-1). In InterBase, WIN1252 and LATIN_1 are two separate character sets because Windows 1252 defines characters in positions that Latin-1 specifically states are "not a character".

5.5 Text Type

An InterBase text type could also be termed a "locale". It consists of all culturally significant information about processing text. This includes: character set, representation (form of use), collation method, comparison method, and uppercasing method.

Typical "locale" information also includes such methods as conversion of numeric values to string values, time representation, and date representation. These methods and formats are NOT included in the InterBase Text Type.

5.6 Transliteration

A transliteration occurs when characters in one InterBase character set are converted to another character set. Note that this process occurs when changing from one SQL "form of use" to another. InterBase terms the conversion from Unicode in the multi-byte FSS format to Unicode in WCHAR format a transliteration. The conversion from DOS437 to ISO-8859-1 is also a transliteration.

InterBase transliterations are "precise." InterBase will report a transliteration error if the character in the input set does not have an exact representation in the output set. None of the existing set of InterBase transliterations will transliterate a "missing character" to another character.

This restriction provides the ability to perform "round-trip conversions". A conversion is "round-trip" if it converts from character set A to character set B, then back to A, producing the original sequence of characters in the original representation.

InterBase transliterations preserve "character fidelity".

Custom transliterations can perform alternate conversions for "missing characters." For instance, a custom transliteration could convert the character <ae> to the string "ae" when the character <ae> does not exist in the destination character set.

6 Complexity of Collation

InterBase provides collations of varying complexity. The most sophisticated collations provide full dictionary sorting appropriate to the language being compared. Many InterBase collations provide a 4-weighted collation, where characters are compared by:

- Primary shape (A vs. B vs. C)

- Accents (A vs. Á vs. Ä)

- Case (A vs. a)

- Punctuation weight (punctuation is ignored in the first three comparison).

In a 4-weight collation Primary differences take precedence over Accent differences, which take precedence over Case differences, which take precedence over punctuation differences.

Most languages use left-to-right comparisons for all 4 weightings. French as used in France requires right-to-left for Accent distinctions and left-to-right for all other distinctions.

The ordering of Primary Shapes, Accents, Case, and punctuation is culturally specific:

Collation	Primary Shapes	Accents	Case	Punctuation
English	Digits 0 1 2 ... 9 ... before letters A B C ... Z		Uppercase before lowercase	comma dash period
Danish	Letters A B C ... Z Å ... before digits 0 1 2 ... 9	grave acute circumflex	lowercase before uppercase	period dash comma
French	0 1 2 ... 9 ... A B C Ç ... Z	circumflex acute grave

Note: The examples used in this section are for illustration purposes only, and not meant to be definitive of any collation sequence.

The "Primary Shape" of a character is culturally specific:

Collation	Character	Treatment
English	å	Same primary shape as a
Danish	å	Unique primary shape, sort after Z
English	ç	Same primary shape as c
French	ç	Unique primary shape, sort between c and d
Danish	Æ	Unique primary shape, sort after Z
German	Æ	Sort as if written "AE"

Each InterBase collation is specific to a character set. All characters in the character set must be sorted. If a character is not used in a particular culture it can be collated to any location, but it must still be collated, as it is valid input. Icelandic Thorn (Þ) and eth (Ð) are not used in English - and the EN_UK driver sorts them near T and D.

Some languages expand characters to an equivalent format:

Collation	Character	Expansion	Special rule
German	ß	ss	Sort after "ss" as an "accent difference"
Dutch	ÿ	ij	Sort before "ij" as an "accent difference"

Some languages contract sequences of characters:

Collation	Sequence	Sorting Rule
Spanish	CH	Primary shape between C and D
Spanish	LL	Primary shape between L and M
Danish	AA	Primary shape as Å, but treat as an accented difference after Å

The complexities of culturally correct sorting are beyond the scope of this document. Designing a collation order requires skill and insight; find a good text book on the subject and be sure you understand the issues before you begin.

7 InterBase Internationalization Classes

In JAVA object-oriented terminology; every character set supported by InterBase is represented by a class that implements the InterBase character set Interface.

Every InterBase text type is represented by a class that implements the texttype interface.

Every transliteration is represented by a class that implements the conversion interface.

InterBase does not implement international support in an object-oriented language. It uses an object-oriented design implemented in standard C.

The methods in each internationalization interface are not very well divided. In particular, many methods defined in texttype belong in charset (mb_to_wc in particular).

7.1 Charset

A class implementing the InterBase charset interface must provide an internal name for the character set and:

- The value for the SPACE character

- The length of the SPACE character, in bytes

- Max & min bytes per character

- A method to convert from the character set to UNICODE in WCHAR

- A method to convert from UNICODE in WCHAR to the character set

Charsets are defined by the structure charset in intlobj.h. The structures defined in intlobj.h are explained starting in section 12.

7.2 Texttype

A class implementing the InterBase texttype interface must provide an internal name for the texttype and functions that:

- Calculate the key length for a given string length

- Calculate the collation key for a given string

- Compare two strings

- Convert a string to uppercase

- Convert a string to lowercase

- Convert a string to WCHAR format

- Fetch the next character in a string

Texttypes are defined by the structure texttype in intlobj.h.

7.3 Conversion

- A class implementing the InterBase conversion interface must provide an internal name for the conversion and a function to convert FROM a charset TO another charset.

Conversions are defined by the structure csconvert in intlobj.h

8 Data Types for International Object Identifiers

Character sets and text types are identified by numeric values. Once a numeric value is established for a character set or text type, the value will never be re-designated by InterBase. Preserving those numbers protects backup/restore and compatibility between servers.

Some character set and collation identifiers are reserved for user implementations. If you are creating a text type for your own application, you should use values in the range of 250 to 255 for character set identifiers and 250 to 249 for collation identifiers. If you are creating a text type that you expect to share or sell, you should use a value below 250 for each identifier, and register your identifiers with InterBase support.

8.1 Character Set Identifiers

Character set identifiers range from 0 to 255. The range from 250 to 255 is reserved for user implementations. All numbers below 250 must be registered with InterBase support to prevent conflicts with other vendors of InterBase international objects.

typedef unsigned char CHARSETID;

CHARSET ID 127 is reserved for internal InterBase use.

8.2 Texttype identifiers.

A texttype is identified by a 16-bit value. The most-significant byte is the collation identifier (COLLATIONID). The least significant byte is the character set identifier (CHARSETID).

Collation IDs range from 0 to 254. Collation identifiers from 250 to 254 are reserved for user-defined collations. Value 255 is not supported. If you use a collation identifier below 250, you must register it with InterBase support to avoid conflicts with other vendors of InterBase international objects. By convention Collation ID 0 is the default collation for a given character set. The default collation for a character set is binary collation, where characters are ordered by the binary value specified by their form of use.

typedef unsigned char COLLATIONID;

typedef unsigned short TEXTTYPEID;

#define GET_CHARSETID (t) ((t) & 0x255)

#define GET_COLLATIONID(t) (((t) >> 8) & 0x255)

Collation identifiers are unique within a character set. Texttype ID's are unique across the database.

9 Linking Numeric Identifiers and Symbolic Names

InterBase searches internal system metadata tables for the symbolic names to attach to charset and texttype objects.

9.1 Symbolic Names for Character Set ID's

The following query gives the correspondence between charset ID's and charset names.

SELECT RDB$CHARACTER_SET_ID, RDB$CHARACTER_SET_NAME

FROM RDB$CHARACTER_SETS;

This name is the "official" name for a character set. However, naming conventions for character sets vary. InterBase character sets have aliases to conform to the naming conventions on different platforms.

SELECT RDB$TYPE, RDB$TYPE_NAME

FROM RDB$TYPES

WHERE RDB$FIELD_NAME = "RDB$CHARACTER_SET_NAME";

RDB$TYPES in the internal table used to define symbolic constants for various internal values.

9.2 Symbolic Names for Collation ID's

Collation names are also defined in a system table. Retrieving them requires a more complicated query

SELECT RDB$COLLATION_NAME, RDB$COLLATION_ID,

RDB$CHARACTER_SET_ID,

RDB$COLLATION_ID * 256 + RDB$CHARACTER_SET_ID

AS TEXTTYPEID

FROM RDB$COLLATIONS;

RDB$COLLATIONS stores the collation name, the collation identifier, and the character set identifier for each supported collation. The TEXTTYPEID is calculated from the COLLATION_ID and the CHARACTER_SET_ID.

COLLATION_NAME must be unique in a database. Each texttype has a unique collation name. Aliases are not supported for collation names.

9.3 Naming Conventions

Collation names often follow the convention XX_yy.

XX is the two-letter Language code, per ISO Standard 639 (ftp://ftp.std.com/obi/Standards/ISO/ISO_639).

yy is the two-letter Country code, per ISO Standard 3166 (http://www.niso.org/3166.html).

Thus, DE_DE is the collation name for German as used in Germany. FR_FR is French as used in France, and FR_CA is French as used in Canada.

InterBase collations that emulate Paradox or dBASE collations have names like "PDOXxxxx", "PXWxxx" or "DBxxxx". The names should match the names used in Paradox and dBASE.

A special collation exists for each character set. This collation has the same name as the character set and implements binary collation for the character set. Binary collation sorts the character set by the numeric codes used to represent the characters.

Character set names follow the standard that defines them (for example: WIN1252, DOS437). Character set alias names support the differences in these names between platforms (for example: WIN1252, WIN_1252).

10 Searching for an Object

InterBase searches for an international implementation object when it loads its metadata. It searches these three locations in this order:

- The engine itself

- The international library - gdsintl

- The supplemental international library - gdsintl2

Objects for character sets ASCII, NONE, OCTETS, and UNICODE_FSS are built into the engine. Without these character set definitions the engine cannot load a database. The default collations for these character sets are also implemented within the engine.

All other InterBase international object implementations are in the gdsintl library.

The gdsintl2 library is available for user-defined international objects.

10.1 GDSINTL Libraries

Gdsintl and gdsintl2 are platform-dependant shared-object libraries. The suffix for the filename is platform dependant.

Platform	Suffix
Windows	DLL
Solaris	SO
HP	SL

The implementation of the library is platform-dependant. Libraries must be recompiled when moved to a new platform.

Some platforms do not support shared-object libraries. Novell NLM is an example. For those platforms, the entire gdsintl library is linked with the engine. User-defined international objects cannot be added to InterBase on platforms that do not support shared-object libraries.

The InterBase installation procedure places gdsintl in the intl, bin or lib directory in the InterBase root directory. gdsintl2 must be stored in the same directory as gdsintl. The location of the gdsintl library is dependant on the platform and InterBase version.

Platform	Version	Location
Windows	4.x & 5.x	bin
All others	4.x & 5.x	lib
All	6.x	intl

10.2 Single Entry Point

Each international library has a single entry point, which is LD_lookup for gdsintl, and LD_lookup2 for gdsintl. Aside from the difference in entry point names, the libraries are functionally identical.

Historical note: Early versions of IBM's AIX platform supported only one entry point in shared-object libraries. That limitation is the reason for limiting the lookup function to the single entry point for InterBase international libraries.

Historical note: The Apollo platform required that identifiers be unique across all shared-object libraries loaded by a program. That limitation is the reason for requiring the different entry point names in the two libraries.

10.3 The LD2_lookup Function

The lookup function for the gdsintl2 library is passed the following:

- Type of object to lookup

- Numerical identifier to lookup

It returns

- 1 if object was not found

- 0 if object was found and a pointer to a function used to initialize the object

USHORT LD2_lookup (

USHORT objtype,

FPTR_SHORT *fun,

SSHORT parm1,

SSHORT parm2)

Objtype is one of type_texttype, type_charset, or type_csconvert. Any other input is an error.

Objtype	Parm1	Parm2
Type_texttype	TEXTTYPEID	0
Type_charset	CHARSETID	0
Type_csconvert	CHARSETID (from)	CHARSETID (to)
Any other value	-- ERROR --

An example LD2_lookup function is provided in ld2.c

11 Initialization Functions

When an international implementation is found, the call to LD2_lookup returns a pointer to the class initialization function. The engine calls this initialization function with a pointer to a 0-initialized data structure, which the function will fill. This data structure defines the object to the engine.

The initialization is part of database startup, not server startup! If your server supports more than one database, the initialization function will be called for every database that shares a set of international objects. Even in a single-database server, the initialization function will be called each time the engine shuts down and reloads a database that was temporarily inactive.

The initialized object is not released until the database is unloaded. The engine releases the object as a normal part of database shutdown.

Here is a sample initialization function for a texttype:

unsigned short LC_win1252_example(

TEXTTYPE cache,

SSHORT parm1,

SSHORT dummy)

{

cache->texttype_version = IB_LANGDRV_VERSION;

/* " Other initialization code " */

}

Each charset, each texttype, and each conversion supported in a gdsintl library must have an initialization function.

12 Texttype Objects

Texttype objects implement an InterBase texttype. They provide collation and other text manipulation functions. This is the definition of texttype in intlobj.h:

typedef struct texttype {

struct blk texttype_blk;

USHORT texttype_version; /* version ID of object */

USHORT texttype_flags; /* miscellaneous flags */

TTYPE_ID texttype_type; /* Interpretation ID */

ASCII *texttype_name;

CHARSET_ID texttype_character_set; /* ID of base character set */

SSHORT texttype_country; /* ID of base country values */

BYTE texttype_bytes_per_char; /* max bytes per character */

ULONG texttype_license_mask; /* required bits for license */

/* MUST BE ALIGNED */

FPTR_SHORT texttype_fn_init;

FPTR_SHORT texttype_fn_key_length;

FPTR_SHORT texttype_fn_string_to_key;

FPTR_short texttype_fn_compare;

FPTR_SHORT texttype_fn_to_upper; /* convert one ch to uppercase */

FPTR_SHORT texttype_fn_to_lower; /* One ch to lowercase */

FPTR_short texttype_fn_str_to_upper; /* Convert string to uppercase */

FPTR_SHORT texttype_fn_to_wc; /* convert string to wc */

FPTR_SHORT texttype_fn_contains; /* s1 contains s2? */

FPTR_SHORT texttype_fn_like; /* s1 like s2? */

FPTR_SHORT texttype_fn_matches; /* s1 matches s2 */

FPTR_SHORT texttype_fn_sleuth_check; /* s1 sleuth s2 */

FPTR_SHORT texttype_fn_sleuth_merge; /* aux function for sleuth */

FPTR_short texttype_fn_mbtowc; /* get next character */

BYTE *texttype_collation_table;

BYTE *texttype_toupper_table;

BYTE *texttype_tolower_table;

BYTE *texttype_expand_table;

BYTE *texttype_compress_table;

BYTE *texttype_misc; /* Used by some drivers */

ULONG *texttype_unused [4]; /* spare space */

} *TEXTTYPE;

Many of the fields in this structure are for use only by the engine, some are for use by the object itself, and some form the interface between the InterBase engine and the texttype implementation.

Field Name	Description
texttype_blk	Used for the engine's memory management - do NOT touch!
texttype_version	Set to IB_LANGDRV_VERSION
texttype_flags	For engine use
texttype_name	Pointer to ASCII internal name for texttype
texttype_character_set	CHARSET_ID of the character set this object is implemented for.
texttype_country	Country ID of locale (not required)
texttype_bytes_per_char	Max bytes per char (duplicate information from charset object)
texttype_license_mask	No longer used.
texttype_fn_init	Pointer to function used to initialize the data structure
texttype_fn_key_length	Calculates key length for a input string length.
texttype_fn_string_to_key	Converts a string to a collation key
texttype_fn_compare	Compares two strings of unlimited length.
texttype_fn_to_lower	Converts to lowercase
texttype_fn_to_upper, texttype_fn_str_to_upper	Converts to uppercase
texttype_fn_to_wc, texttype_fn_mbtowc	Converts string to WCHAR (belongs with charset object!)
texttype_fn_contains, texttype_fn_like, texttype_fn_matches, texttype_fn_sleuth_check, texttype_fn_sleuth_merge	The engine will initialize these pointers with functions used to implement LIKE, CONTAINS, and the GDML matching language. Do not set them in the initialization function.
texttype_collation_table, texttype_toupper_table, texttype_tolower_table, texttype_expand_table, texttype_compress_table, texttype_misc, texttype_unused	These fields provide pointers to data tables that the implementation uses to provide its functionality. The engine never refers to them directly; they are for the object's use.

12.1 Functions to Perform Operations on Text

The first parameter of all functions implemented within the texttype is a pointer to the data structure initialized by the initialization function. This pointer provides access to the data tables: texttype_collation_table, texttype_toupper_table, etc. Functions can be shared among different texttypes implemented in the same gdsintl2 shared library. The result of the function is determined by the data tables, which are specific to a texttype.

The comments in lc_example.c provide more detail on each function.

12.2 Collation Keys

A collation key maps a string with a non-default collation to a set of bytes that can be compared with unsigned-byte comparison preserving the culturally correct collation order. InterBase stores collation keys in indexes. Index keys are always compared with unsigned byte comparisons, starting with the high order byte and continuing until a difference or the end of the key is found. When creating an index key, InterBase transforms values that do not produce correct ordering using unsigned byte comparisons to values that do.

The mapping from a string to a collation key can be irreversible. The original string is never recreated from the collation key.

When writing a textype_fn_string_to_key function, the resulting key must consist of unsigned bytes.

In a binary collation, the collation key is identical to the string.

Here are examples of collation keys created using different collations:

Collation	Input	Collation Key
Binary	ÁbC	ÁbC
Case Insensitive binary	ÁbC	ÁBC
Accent Insensitive binary	ÁbC	AbC
Case and Accent Insensitive binary	ÁbC	ABC
English Book Title	The Practice	Practice, The
Telephone Book (USA)	McDonald	MacDonald
Telephone Book (USA)	1 Way	ONE WAY
Lexical Ordering (German)	ÁbCê	ABCE' ^AbCe
Lexical Ordering (French)	ÁbCê	ABCE^ 'AbCe
Lexical Ordering (German)	Cheß	CHESS 1chess
Lexical Ordering (Spanish)	Cheß	XHEß Xheß where X represents a special value between C and D.

12.3 Collation Keys versus Comparison

A texttype must provide functions for string comparison and creating collation keys. String comparison is used for equality and inequality Boolean evaluations. Collation keys are used in indexes. Comparing strings and comparing collation keys must produce identical results.

The two functions have a significant difference. Index keys have a fixed maximum length, currently 254 bytes. String comparison must be able to handle strings of unlimited length.

12.4 Ignore Trailing Spaces

SQL requires that trailing spaces be ignored when comparing two strings of different lengths. It is important that both the key creation function and the compare function ignore trailing spaces on strings.

Note: The space character is determined by the character set.

12.5 Monocase operations

A texttype must contain functions to implement the SQL UPPER(x) and LOWER(x) functions. The result of UPPER & LOWER is culturally specific, and the rules are determined by the collation sequence defined with the column.

The to_upper and to_lower functions can be implemented using data tables or directly in code.

InterBase does not currently support the SQL LOWER(x) function, however it is necessary to implement it in each defined texttype.

12.6 Form of Use Changes

A texttype contains functions for changing the "form of use" of a character string from NCHAR or MBCHAR to WCHAR. These functions, strictly speaking, belong with the charset object. A design decision in the Internationalization architecture placed them with texttype.

12.7 Text Matching Functions

Entries exist in texttype for custom implementation of SQL LIKE, CONTAINS, and GDML MATCHING functions. The InterBase engine will fill these entries with internal function pointers after the texttype initialization function is called. Your driver should not initialize these function pointers.

12.8 Debugging a Collation Driver

Debugging a collation driver should be done before making it into a GdsIntl2 driver. While the collation driver is a standalone entity, it can print the collation keys it creates to check for correctness.

After installing the driver, sort and index several tables, compare indexed and non-indexed strings and perform UPPER() operations to ensure the driver is working properly.

The example test file is cs_example.sql.

12.9 Example

The example texttype driver lc_example.c implements a simple version of English Title Sort Order. When titles are sorted, articles such as "The", "An", or "A" at the beginning of the title are ignored. Thus "The Wind and the Lion" sorts as if it were written "Wind and the Lion".

In this driver, the Collation Key created is simply the input string, minus any articles at the beginning of the string. This collation does not change the default ordering of accented characters or upper and lower case characters.

13 Conversion Objects

InterBase uses conversion objects to transform one character set to another. A conversion object is defined by the data structure csconvert in intlobj.h

typedef struct csconvert {

struct blk csconvert_blk;

USHORT csconvert_version;

USHORT csconvert_flags;

SSHORT csconvert_id;

ASCII *csconvert_name;

CHARSET_ID csconvert_from;

CHARSET_ID csconvert_to;

FPTR_SHORT csconvert_convert;

BYTE *csconvert_datatable;

BYTE *csconvert_misc;

ULONG *csconvert_unused [2];

} *CSCONVERT;

Field Name	Definition
csconvert_blk	Used by the engine's memory management. Do not modify.
csconvert_flags	Engine use - do not modify
csconvert_id	No longer used
csconvert_name	Internal name - can be ignored.
csconvert_from	CharsetID being converted FROM
csconvert_to	CharsetID being converted TO
csconvert_convert	Function to perform the conversion
csconvert_datatable, csconvert_misc, csconvert_unused	Available for object implementation to use for pointer storage, etc. Not directly accessed by the engine.

13.1 Conversion API

The conversion object has a single method, the function that converts from one character set to another.

static USHORT conversion_function (

CSCONVERT obj,

BYTE *dest_ptr,

USHORT dest_len,

BYTE *src_ptr,

USHORT src_len,

SSHORT *err_code,

USHORT *err_position);

Input:

obj The conversion object

dest_ptr Location for the converted string, or NULL when an estimate of output length is requested.

dest_len Length of the dest_ptr buffer, in bytes

str_ptr The input string

src_len Length of input string, in bytes.

*err_code See below

*err_position See below

Returns:

The length of the converted string, in bytes.

If dest_ptr is NULL, then the routine returns a maximum estimate of the space required for the output string if an input string of src_len bytes is converted.

13.2 Error Returns from Conversion Function

The conversion function can return errors to the engine. Generally the engine will report these to the user as a :transliteration error". The error code is returned via *err_code. The position in the input string where the error was detected is returned via *err_position.

Result in *err_code	Meaning
0	No Error
CS_CONVERT_ERROR	Input character does not exist in output character set.
CS_BAD_INPUT	Input string was not properly formed per rules of the input character set.
CS_TRUNCATION_ERROR	Not enough space available in the output buffer.

13.3 Addressed by FROM charsetid & TO charsetid

Unlike character sets and collations, conversions have no symbolic names. A conversion is identified by the charsetID is it converting from and the charsetID it is converting to. An InterBase aficionado will recognize the similarity to blob filters.

If no direct conversion object has been created to convert from charset A to charset B, the engine converts charset A to UNICODE and then UNICODE to charset B.

13.4 Example

In the sample code provided, cv_example.c is an example of a conversion object. Conversion functions are generally table driven. . The example contains several different functions. The functions are code, and produce a specific conversion when the the data table for that conversion is given.

For conversion tables, check files at www.unicode.org.

14 Charset object

InterBase uses a charset object to represent information for a character set. Much of the information inside a charset object is for engine use only.

charsets are defined by the structure charset in intlobj.h

typedef struct charset {

struct blk charset_blk;

USHORT charset_version;

USHORT charset_flags;

CHARSET_ID charset_id;

ASCII *charset_name;

BYTE charset_min_bytes_per_char;

BYTE charset_max_bytes_per_char;

BYTE charset_space_length;

BYTE *charset_space_character;

/* Must be aligned */

FPTR_SHORT charset_well_formed;

struct csconvert charset_to_unicode;

struct csconvert charset_from_unicode;

VEC charset_converters;

VEC charset_collations;

ULONG *charset_unused [2];

} *CHARSET;

Field	Definition
charset_blk	Used by engine's memory manager. Do not modify.
charset_version	Version ID of InterBase international interface.
charset_id	The CHARSETID for this character set
charset_name	Internal name for charset
charset_min_bytes_per_char	Minimal number of bytes per character.
charset_max_bytes_per_char	Maximum bytes per character. InterBase will allocate this number of bytes for each character in a SQL CHAR(n) definition.
charset_space_length	Length, in bytes, of the space character
charset_space_character	Pointer to the definition of the space character
charset_well_formed	Not used. This was intended to be a pointer to a function that would validate that a string was well formed by the rules of a character set.
charset_to_Unicode	A conversion object that transforms the character set TO Unicode (WCHAR).
charset_from_Unicode	A conversion object that transforms UNICODE into the character set.
charset_converters	For engine use only. (The Engine keeps a list of converters to/from this character set here).
charset_collations	For engine use only. (The Engine keeps a list of collations defined for this character set here).
charset_unused	May be used by object implementation.

14.1 Two Conversion Objects

Every charset must contain two conversion definitions: a conversion from Unicode to the charset and a conversion to Unicode from the charset. All InterBase character sets must be able to convert themselves to and from Unicode.

When the engine cannot find a direct conversion between two character sets, it uses the converters here. Unicode is the intermediate conversion step.

14.2 One Texttype Object

The SQL standard requires that all character sets have a default collation. By convention in InterBase, the default collation for a character set has the same name as the character set. It performs a collation in binary, or codepoint order.

The character set default texttype object is loaded when the character set is loaded. It must be defined separately in the gdsintl2 library.

14.3 The SPACE character

The SQL standard requires that all character sets define a "space character". This character is used to pad strings assigned to CHAR(n) columns to the length of the column. Trailing space characters are ignored in all string comparison operations.

The overwhelming majority of modern character sets use 0x20 or 0x0020 for the space character. InterBase character sets follow that rule, except for character set OCTETS. OCTETS instantiates binary byte data, rather than character data, and uses 0x00 for the space character.

14.4 Example

The example character set in cs_example.c is identical to the InterBase provided ISO_LATIN_1 character set.

The critical structures in a character set definition are the tables that map the character set to and from Unicode. InterBase's mapping tables were derived from the tables provided by Unicode, Inc. at their website - www.unicode.org.

15 Building gdsintl2

The build process for gdsintl2 is platform specific. In general terms, gdsintl2 must be a shared library with minimal requirements for other libraries and a single entrypoints. Building a gdsintl2 library is very similar to building an InterBase UDF library.

The platform specific makefiles provided in the collation kit files are examples of building gdsintl2.

16 Special Notes

16.1 Backup / Restore

InterBase will backup and restore user defined character sets and collations, and data in columns defined on them without problems.

If you restore a backup file with that includes user defined international objects on a server other than the one on which it was backed up, you must ensure that the character sets and collations referenced in the backup file exist on the destination server. The CHARSETIDs and COLLATIONIDs must be identical on the two servers.

16.2 Testing a GDSINTL2

When you test a new GDSINTL2, you must be aware that the engine does not release a shared library until the engine itself is unloaded. Each time you update your gdsintl2, you must shut down the server, install the new gdsintl2, and reload the server.

16.3Additional Shared Libraries.

The gdsintl2 shared-image can not require any shared libraries that are not already loaded by InterBase. In particular, a gdsintl2 shared-library built with Borland C++Builder will require VCL libraries that InterBase does not load. You can build the library using C++Builder command line tools; just don't build the DLL within the visual environment.

If the gdsintl2 shared library references shared libraries that were not loaded during server startup, the engine will fail to load the gdsintl2 shared-library. The error message for this is cryptic: "Implementation of text subtype %d not located." Because the gdsintl2 library is optional, the engine does not consider its absence an error. The error occurs when the engine attempts to interpret data stored in a format defined in gdsintl2.

16.4 Platform Provided Internationalization functions

Many platforms provide internationalization libraries that perform the functions needed in gdsintl2. For instance, JAVA provides the Collator class; Windows provides CompareString and similar functions.

InterBase recommends against using a platform provided Internationalization library as it may:

be platform specific.
Moving the database to another platform will require re-implementation of the gdsintl2 library and may result in a different behavior if the two platform implementations are not precisely identical.
be OS version specific.
Updating your OS may install a new version of the library - which may perform differently. As the key values are stored in InterBase indices your database will not operate correctly until all indices are rebuilt.
limit input string size.
InterBase requires that comparison of Collation Keys and String Comparison return identical results - String Comparison would not be able to use the Collation Key function if it has a limit on input string size.
not be thread safe.
Many platforms use the process-wide locale setting for the international library. The texttype implementation would then have to set the process locale before each text operation. This may result in other, unintended, differences in InterBase behavior.
have poor performance.
An OS collation key function that loads data tables from disk each time will greatly reduce database performance.
require a library not loaded by the InterBase server
If the library isn't already loaded by the InterBase server, your gdsintl2 library will not be able to access the OS provided functions.

16.5 Signed and Unsigned values

A CHARSETID and TEXTTYPEID are unsigned values. However, the LD_lookup interface between the engine and the gdsintl library is defined as a signed short. You must use a C cast to refer to the value as an unsigned short inside of your gdsintl2 module.

The error given when the engine is unable to locate a texttype is "Implementation of text subtype %d not located." - the TEXTTYPEID / CHARSETID value is printed as a signed value instead of unsigned.

16.6 InterBase Source Files

The files provided in the InterBase collation kit were derived from InterBase source files. InterBase source files are best viewed with tab-stops set to 4.

InterBase was written in Kernighan &Richie C, and gradually moved toward ANSI C. Some inconsistencies remain; for example, some function definitions have prototypes, some do not.

Building an International Object, Step-by-Step

This section is a cookbook for creating a new InterBase texttype.

First, build and install a gdsintl2 shared library from the examples that accompany this document.

Install the example files.
Copy the platform specific makefile.platform to makefile.

Make the example gdsintl2 module ("make gdsintl2")

Install the example gdsintl2 module with "make install"

Test the module using cs_example.sql

Now that you are comfortable with the mechanics of building and using a new stored library, begin working on your character set.

Define a collation object that implements binary collation.

Create your character set mapping tables. Obtain a mapping table to and from Unicode from Unicode.com or another reliable source.

Define a symbol for your charsetid in cs_example.h

Modify cs_example in cs_example.c to use your mapping tables instead of the example ones.

Modify the initialization function cs_example in cs_example.c setting the parameters to reflect your character set.

Pick character set and collation identifiers. If you intend to share or sell your texttypes, you should pick a value below 250 for the character set and collation identifiers and register them with InterBase support. If your new texttypes are for use only in your controlled environment, pick a value between 250 and 255 for the character set and between 250 and 254 for the collation. Those values do not require registration.

Modify the linkage function in ld2.c to reflect your picked parameters.

Test the character set using a standalone test method.

Make and install the character set as a gdsintl2 driver.

Add the character set definition procedures defined in cs_util.sql to a newly created test database for the character set.

Declare the character set to the database.

execute procedure create_character_set (
"<NEW_CHARSET>", <your id>);

Declare alias names for the character set.

execute procedure create_character_set_alias (
"new name", /* for */ "existing name");

Test the newly created character set. Create a column with all characters from the character set. Map the column to a Unicode column and back again. Sort the column, index it.

Modify the collation function in lc_example.c to support your collation.

Rebuild and re-install gdsintl2. You will need to restart the InterBase server for it to load the new gdsintl2 module (as the prior version is already loaded).

Declare the collation to the database.
execute procedure create_collation ("collation name",
<collation id>, "charset name");

Test the collation.

Appendix

Existing Character Sets

SELECT RDB$CHARACTER_SET_NAME, RDB$CHARACTER_SET_ID

FROM RDB$CHARACTER_SETS

ORDER BY RDB$CHARACTER_SET_ID;

RDB$CHARACTER_SET_NAME	RDB$CHARACTER_SET_ID
NONE	0
OCTETS	1
ASCII	2
UNICODE_FSS	3
SJIS_0208	5
EUCJ_0208	6
DOS437	10
DOS850	11
DOS865	12
DOS860	13
DOS863	14
NEXT	19
ISO8859_1	21
KSC_5601	44
DOS852	45
DOS857	46
DOS861	47
CYRL	50
WIN1250	51
WIN1251	52
WIN1252	53
WIN1253	54
WIN1254	55
BIG_5	56
GB_2312	57

Alias Names for InterBase character sets

SELECT RDB$TYPE, RDB$TYPE_NAME FROM RDB$TYPES

WHERE RDB$FIELD_NAME = "RDB$CHARACTER_SET_NAME"

ORDER BY RDB$TYPE, RDB$TYPE_NAME;

Existing TextTypes

SELECT RDB$COLLATION_NAME,

RDB$CHARACTER_SET_ID,

RDB$COLLATION_ID,

RDB$COLLATION_ID * 256 + RDB$CHARACTER_SET_ID

AS TEXTTYPEID

FROM RDB$COLLATIONS

ORDER BY RDB$CHARACTER_SET_ID, RDB$COLLATION_ID;

RDB$ COLLATION_NAME	RDB$ CHARACTER_SET_ID	RDB$ COLLATION_ID	TEXTTYPE
NONE	0	0	0
OCTETS	1	0	1
ASCII	2	0	2
UNICODE_FSS	3	0	3
SJIS_0208	5	0	5
EUCJ_0208	6	0	6
DOS437	10	0	10
PDOX_ASCII	10	1	266
PDOX_INTL	10	2	522
PDOX_SWEDFIN	10	3	778
DB_DEU437	10	4	1034
DB_ESP437	10	5	1290
DB_FIN437	10	6	1546
DB_FRA437	10	7	1802
DB_ITA437	10	8	2058
DB_NLD437	10	9	2314
DB_SVE437	10	10	2570
DB_UK437	10	11	2826
DB_US437	10	12	3082
DOS850	11	0	11
DB_FRC850	11	1	267
DB_DEU850	11	2	523
DB_ESP850	11	3	779
DB_FRA850	11	4	1035
DB_ITA850	11	5	1291
DB_NLD850	11	6	1547
DB_PTB850	11	7	1803
DB_SVE850	11	8	2059
DB_UK850	11	9	2315
DB_US850	11	10	2571
DOS865	12	0	12
PDOX_NORDAN4	12	1	268
DB_DAN865	12	2	524
DB_NOR865	12	3	780
DOS860	13	0	13
DB_PTG860	13	1	269
DOS863	14	0	14
DB_FRC863	14	1	270
NEXT	19	0	19
NXT_US	19	1	275
NXT_DEU	19	2	531
NXT_FRA	19	3	787
NXT_ITA	19	4	1043
NXT_ESP	19	5	1299
ISO8859_1	21	0	21
DA_DA	21	1	277
DU_NL	21	2	533
FI_FI	21	3	789
FR_FR	21	4	1045
FR_CA	21	5	1301
DE_DE	21	6	1557
IS_IS	21	7	1813
IT_IT	21	8	2069
NO_NO	21	9	2325
ES_ES	21	10	2581
SV_SV	21	11	2837
EN_UK	21	12	3093
EN_US	21	14	3605
PT_PT	21	15	3861
KSC_5601	44	0	44
KSC_DICTIONARY	44	1	300
DOS852	45	0	45
DB_CSY	45	1	301
DB_PLK	45	2	557
DB_SLO	45	4	1069
PDOX_CSY	45	5	1325
PDOX_PLK	45	6	1581
PDOX_HUN	45	7	1837
PDOX_SLO	45	8	2093
DOS857	46	0	46
DB_TRK	46	1	302
DOS861	47	0	47
PDOX_ISL	47	1	303
CYRL	50	0	50
DB_RUS	50	1	306
PDOX_CYRL	50	2	562
WIN1250	51	0	51
PXW_CSY	51	1	307
PXW_HUNDC	51	2	563
PXW_PLK	51	3	819
PXW_SLOV	51	4	1075
WIN1251	52	0	52
PXW_CYRL	52	1	308
WIN1252	53	0	53
PXW_INTL	53	1	309
PXW_INTL850	53	2	565
PXW_NORDAN4	53	3	821
PXW_SPAN	53	4	1077
PXW_SWEDFIN	53	5	1333
WIN1253	54	0	54
PXW_GREEK	54	1	310
WIN1254	55	0	55
PXW_TURK	55	1	311
BIG_5	56	0	56
GB_2312	57	0	57