Languages Around The World

Normalization

Overview

Normalization is used to convert text to a unique, equivalent form. Systems can normalize Unicode-encoded text to one particular sequence, such as normalizing composite character sequences into pre-composed characters.

Normalizer allows for easier sorting and searching of text. It supports the standard normalization forms, which are described in detail in Unicode Technical Report #15 (Unicode Normalization Forms) and Section 5.7 of the Unicode Standard.

Usage

Normalizer transforms text into the canonical composed and decomposed forms. In addition, you can have it perform compatibility decompositions so that you can treat compatibility characters the same as their equivalents.
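As a rough illustration of the canonical and compatibility forms, the sketch below uses the JDK's java.text.Normalizer. This is an assumption made for the sake of a self-contained example: it is a separate class from the ICU Normalizer described in this chapter, and the class name NormalizationFormsSketch and its helper methods are hypothetical.

```java
import java.text.Normalizer;

// Hypothetical demo class. It relies on the JDK's java.text.Normalizer,
// not on the ICU Normalizer class described in this chapter.
class NormalizationFormsSketch {

    // Canonical composition (NFC).
    static String compose(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    // Canonical decomposition (NFD).
    static String decompose(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD);
    }

    // Compatibility decomposition (NFKD).
    static String compatDecompose(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFKD);
    }

    // Render a string as space-separated UTF-16 code units.
    static String hex(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%04X", (int) s.charAt(i)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String precomposed = "\u00E1"; // a with acute accent, U+00E1
        String ligature = "\uFB01";    // fi ligature, a compatibility character

        System.out.println(hex(decompose(precomposed)));         // 0061 0301
        System.out.println(hex(compose("a\u0301")));             // 00E1
        System.out.println(hex(decompose(ligature)));            // FB01 (canonical forms leave it alone)
        System.out.println(hex(compatDecompose(ligature)));      // 0066 0069
    }
}
```

The last two lines show why compatibility decomposition is a separate choice: only it maps the ligature to the plain letters "fi".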

Normalizer adds one optional behavior, IGNORE_HANGUL, that differs from the standard Unicode Normalization Forms in that it does not normalize Korean syllables. This option can be passed to the Normalizer constructors and to the static compose and decompose methods. It is turned off by default.
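To make concrete what the standard forms do to Korean syllables, the sketch below decomposes and recomposes one Hangul syllable. It assumes the JDK's java.text.Normalizer, which has no counterpart to IGNORE_HANGUL, so it shows only the default behavior that the option would suppress. The class name HangulSketch is hypothetical.

```java
import java.text.Normalizer;

// Hypothetical demo class built on the JDK's java.text.Normalizer.
// It shows standard Hangul handling only; IGNORE_HANGUL is not available here.
class HangulSketch {

    static String decompose(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD);
    }

    static String compose(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        String syllable = "\uAC00";        // the precomposed syllable U+AC00
        String jamo = decompose(syllable); // U+1100 followed by U+1161

        System.out.println(jamo.length());                  // 2
        System.out.println(compose(jamo).equals(syllable)); // true
    }
}
```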

There are three common usage models for Normalizer:

  1. You can use normalize() to process an entire input string at once.

    • For example, before converting a Unicode string to the Latin-1 character set (ISO-8859-1), you can normalize it so that "a´bc" becomes "ábc".

  2. You can create a Normalizer object and use it to iterate through the normalized form of a string by calling first() and next().

    • For example, when comparing two strings, you want to stop the comparison as soon as a significant difference is found. This way, you avoid the overhead of converting an entire string if only the first characters matter.

  3. You can use setIndex() and getIndex() to perform a random-access iteration.

    • For example, when you want to do fast language-sensitive searching, such as with Boyer-Moore.
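The first usage model, normalizing a whole string before converting it to Latin-1, might look like the sketch below. It assumes the JDK's java.text.Normalizer rather than the ICU Normalizer, and it reads the accent in the example above as the combining acute accent U+0301. The class name Latin1Sketch and its helper methods are hypothetical.

```java
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;
import java.text.Normalizer;

// Hypothetical demo class built on the JDK's java.text.Normalizer,
// not on the ICU Normalizer class described in this chapter.
class Latin1Sketch {

    // True if every character of s has a representation in ISO-8859-1.
    static boolean fitsLatin1(String s) {
        CharsetEncoder encoder = StandardCharsets.ISO_8859_1.newEncoder();
        return encoder.canEncode(s);
    }

    // Compose the whole string at once (NFC), then encode it.
    static byte[] toLatin1(String s) {
        String composed = Normalizer.normalize(s, Normalizer.Form.NFC);
        if (!fitsLatin1(composed)) {
            throw new IllegalArgumentException("not representable in ISO-8859-1");
        }
        return composed.getBytes(StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String decomposed = "a\u0301bc"; // a, combining acute accent, b, c

        // The combining accent has no Latin-1 code point of its own.
        System.out.println(fitsLatin1(decomposed));      // false

        // After composition the text is three Latin-1 bytes: E1 62 63.
        System.out.println(toLatin1(decomposed).length); // 3
    }
}
```

The JDK class only offers whole-string normalization, so the iterating and random-access models have no direct equivalent in this sketch.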

Transformation Methods

Movement Methods

Note: Normalizer objects behave like iterators and have methods such as setIndex(), next(), and previous(). While setIndex() and getIndex() refer to indices in the underlying Unicode input text, next() and previous() iterate through characters in the normalized output. There is therefore not necessarily a one-to-one correspondence between the characters returned by next() and previous() and the indices passed to and returned from setIndex() and getIndex(). For this reason, Normalizer does not implement the CharacterIterator interface.
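The mismatch between input indices and normalized output can be seen even without an iterator. The sketch below assumes the JDK's java.text.Normalizer and normalizes whole strings, so it does not reproduce setIndex() or next(). It only shows that the same character sits at different offsets before and after normalization. The class name IndexMismatchSketch and its helper methods are hypothetical.

```java
import java.text.Normalizer;

// Hypothetical demo class built on the JDK's java.text.Normalizer.
// It compares offsets in the input text with offsets in the normalized text.
class IndexMismatchSketch {

    // Offset of target in the composed (NFC) form of input.
    static int indexAfterCompose(String input, char target) {
        return Normalizer.normalize(input, Normalizer.Form.NFC).indexOf(target);
    }

    // Offset of target in the decomposed (NFD) form of input.
    static int indexAfterDecompose(String input, char target) {
        return Normalizer.normalize(input, Normalizer.Form.NFD).indexOf(target);
    }

    public static void main(String[] args) {
        String decomposed = "e\u0301x"; // e, combining acute accent, x
        String precomposed = "\u00E9x"; // e with acute accent, x

        // Composition shrinks the text: x moves from offset 2 to offset 1.
        System.out.println(decomposed.indexOf('x'));               // 2
        System.out.println(indexAfterCompose(decomposed, 'x'));    // 1

        // Decomposition grows it: x moves from offset 1 to offset 2.
        System.out.println(precomposed.indexOf('x'));              // 1
        System.out.println(indexAfterDecompose(precomposed, 'x')); // 2
    }
}
```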

Programming Examples in C and C++

Programming example for normalizing a string.



Copyright (c) 2000 - 2005 IBM and Others - Feedback: http://icu.sourceforge.net/contacts.html

User Guide for ICU v3.4 Generated 2005-07-27.