RuleBasedTransliterator is a transliterator that reads a set of rules in order to determine how to perform translations.
#include <rbt.h>
Rule sets are stored in resource bundles indexed by name. Rules within a rule set are separated by semicolons (';'). To include a literal semicolon, prefix it with a backslash ('\'). Whitespace, as defined by Character.isWhitespace(), is ignored. If the first non-blank character on a line is '#', the entire line is ignored as a comment.
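The statement-splitting conventions above can be sketched in a few lines of C++. This is a toy model under stated assumptions, not ICU's parser: the function name splitRules is hypothetical, std::isspace stands in for Character.isWhitespace(), and quoting is not handled.

```cpp
#include <cctype>
#include <sstream>
#include <string>
#include <vector>

// Split rule text into statements (hypothetical helper, not part of ICU).
// - lines whose first non-blank character is '#' are skipped as comments
// - whitespace is dropped (std::isspace is an assumption standing in for
//   Character.isWhitespace())
// - an unescaped ';' ends a statement; "\;" is kept as a literal pair
std::vector<std::string> splitRules(const std::string& text) {
    std::vector<std::string> statements;
    std::string current;
    std::istringstream lines(text);
    std::string line;
    while (std::getline(lines, line)) {
        // Locate the first non-blank character of the line.
        std::size_t i = 0;
        while (i < line.size() &&
               std::isspace(static_cast<unsigned char>(line[i]))) {
            ++i;
        }
        if (i < line.size() && line[i] == '#') continue;  // comment line
        for (; i < line.size(); ++i) {
            char c = line[i];
            if (c == '\\' && i + 1 < line.size()) {
                // Keep the escape pair intact so "\;" is not a separator.
                current += c;
                current += line[++i];
            } else if (c == ';') {
                statements.push_back(current);
                current.clear();
            } else if (!std::isspace(static_cast<unsigned char>(c))) {
                current += c;
            }
        }
    }
    // A final statement without a terminating ';' is still returned.
    if (!current.empty()) statements.push_back(current);
    return statements;
}
```

For instance, a comment line followed by the lines "$a = x ;" and " a\;b > c;" yields two statements, $a=x and a\;b>c. A statement may span several lines in this sketch, because text accumulates until an unescaped semicolon is seen.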
Each set of rules consists of two groups, one forward, and one reverse. This is a convention that is not enforced; rules for one direction may be omitted, with the result that translations in that direction will not modify the source text. In addition, bidirectional forward-reverse rules may be specified for symmetrical transformations.
Rule syntax
Rule statements take one of the following forms:
$alefmadda=\u0622;
Variable definition. After this statement, instances of the variable "$alefmadda" will be replaced by the Unicode character U+0622. Variable names must begin with a letter and consist only of letters, digits, and underscores. Case is significant. Duplicate names cause an exception to be thrown; that is, variables cannot be redefined. The right hand side may contain well-formed text of any length, including no text at all ("$empty=;"). The right hand side may contain embedded UnicodeSet patterns, for example, "$softvowel=[eiyEIY]".
ai>$alefmadda;
Forward translation rule.
ai<$alefmadda;
Reverse translation rule.
ai<>$alefmadda;
Bidirectional translation rule.
Translation rules consist of a match pattern and an output string. The match pattern consists of literal characters, optionally preceded by context, and optionally followed by context. Context characters, like literal pattern characters, must be matched in the text being transliterated. However, unlike literal pattern characters, they are not replaced by the output text. For example, the pattern "abc{def}" indicates the characters "def" must be preceded by "abc" for a successful match. If there is a successful match, "def" will be replaced, but not "abc". The final '}' is optional, so "abc{def" is equivalent to "abc{def}". Another example is "{123}456" (or "123}456") in which the literal pattern "123" must be followed by "456".
The output string of a forward or reverse rule consists of characters to replace the literal pattern characters. If the output string contains the character '|', this is taken to indicate the location of the cursor after replacement. The cursor is the point in the text at which the next replacement, if any, will be applied. The cursor is usually placed within the replacement text; however, it can actually be placed into the preceding or following context by using the special character '@'. Examples:
a {foo} z > | @ bar; # foo -> bar, move cursor before a
{foo} xyz > bar @|; # foo -> bar, cursor between y and z
UnicodeSet patterns may appear anywhere that makes sense. They may appear in variable definitions. Contrariwise, UnicodeSet patterns may themselves contain variable references, such as "$a=[a-z];$not_a=[^$a]", or "$range=a-z;$ll=[$range]".
UnicodeSet patterns may also be embedded directly into rule strings. Thus, the following two rules are equivalent:
$vowel=[aeiou]; $vowel>'*'; # One way to do this
[aeiou]>'*'; # Another way
See UnicodeSet for more documentation and examples.
Segments
Segments of the input string can be matched and copied to the output string. This makes certain sets of rules simpler and more general, and makes reordering possible. For example:
([a-z]) > $1 $1; # double lowercase letters
([:Lu:]) ([:Ll:]) > $2 $1; # reverse order of Lu-Ll pairs
The segment of the input string to be copied is delimited by "(" and ")". Up to nine segments may be defined. Segments may not overlap. In the output string, "$1" through "$9" represent the input string segments, in left-to-right order of definition.
Example
The following example rules illustrate many of the features of the rule language.
Rule 1. abc{def}>x|y
Rule 2. xyz>r
Rule 3. yz>q
Applying these rules to the string "adefabcdefz" yields the following results:
|adefabcdefz
Initial state, no rules match. Advance cursor.
a|defabcdefz
Still no match. Rule 1 does not match because the preceding context is not present.
ad|efabcdefz
Still no match. Keep advancing until there is a match...
ade|fabcdefz
...
adef|abcdefz
...
adefa|bcdefz
...
adefab|cdefz
...
adefabc|defz
Rule 1 matches; replace "def" with "xy" and back up the cursor to before the 'y'.
adefabcx|yz
Although "xyz" is present, rule 2 does not match because the cursor is before the 'y', not before the 'x'. Rule 3 does match. Replace "yz" with "q".
adefabcxq|
The cursor is at the end; transliteration is complete.
The order of rules is significant. If multiple rules may match at some point, the first matching rule is applied.
Forward and reverse rules may have an empty output string. Otherwise, an empty left or right hand side of any statement is a syntax error.
Single quotes are used to quote any character other than a digit or letter. To specify a single quote itself, inside or outside of quotes, use two single quotes in a row. For example, the rule "'>'>o''clock" changes the string ">" to the string "o'clock".
Notes
While a RuleBasedTransliterator is being built, it checks that the rules are added in proper order. For example, if the rule "a>x" is followed by the rule "ab>y", then the second rule will throw an exception. The reason is that the second rule can never be triggered, since the first rule always matches anything it matches. In other words, the first rule masks the second rule.
Definition at line 254 of file rbt.h.
Constructs a new transliterator from the given rules.
Constructs a new transliterator from the given rules.
Convenience constructor with no filter and FORWARD direction.
Convenience constructor.
Copy constructor.
Enumeration Value Documentation
RuleBasedTransliterator::PARSE_ERROR_BASE
RuleBasedTransliterator::BAD_VARIABLE_DEFINITION
RuleBasedTransliterator::MALFORMED_RULE
RuleBasedTransliterator::MALFORMED_SET
RuleBasedTransliterator::MALFORMED_SYMBOL_REFERENCE
RuleBasedTransliterator::MALFORMED_UNICODE_ESCAPE
RuleBasedTransliterator::MALFORMED_VARIABLE_DEFINITION
RuleBasedTransliterator::MALFORMED_VARIABLE_REFERENCE
RuleBasedTransliterator::MISMATCHED_SEGMENT_DELIMITERS
RuleBasedTransliterator::MISPLACED_CURSOR_OFFSET
RuleBasedTransliterator::MISSING_OPERATOR
RuleBasedTransliterator::MISSING_SEGMENT_CLOSE
RuleBasedTransliterator::MULTIPLE_ANTE_CONTEXTS
RuleBasedTransliterator::MULTIPLE_CURSORS
RuleBasedTransliterator::MULTIPLE_POST_CONTEXTS
RuleBasedTransliterator::TRAILING_BACKSLASH
RuleBasedTransliterator::UNDEFINED_SEGMENT_REFERENCE
RuleBasedTransliterator::UNDEFINED_VARIABLE
RuleBasedTransliterator::UNQUOTED_SPECIAL
RuleBasedTransliterator::UNTERMINATED_QUOTE
Member Function Documentation
RuleBasedTransliterator::RuleBasedTransliterator (const UnicodeString & ID, const UnicodeString & rules, Direction direction, UnicodeFilter * adoptedFilter, ParseError & parseError, UErrorCode & status) [inline]
Parameters:
rules: rules, separated by ';'
direction: either FORWARD or REVERSE.
Exceptions:
IllegalArgumentException: if rules are malformed or direction is invalid.
RuleBasedTransliterator::RuleBasedTransliterator (const UnicodeString & ID, const UnicodeString & rules, Direction direction, UnicodeFilter * adoptedFilter, UErrorCode & status) [inline]
Parameters:
rules: rules, separated by ';'
direction: either FORWARD or REVERSE.
Exceptions:
IllegalArgumentException: if rules are malformed or direction is invalid.
RuleBasedTransliterator::RuleBasedTransliterator (const UnicodeString & ID, const UnicodeString & rules, Direction direction, UErrorCode & status) [inline]
RuleBasedTransliterator::RuleBasedTransliterator (const UnicodeString & ID, const UnicodeString & rules, UErrorCode & status) [inline]
RuleBasedTransliterator::RuleBasedTransliterator (const UnicodeString & ID, const UnicodeString & rules, UnicodeFilter * adoptedFilter, UErrorCode & status) [inline]
RuleBasedTransliterator::RuleBasedTransliterator (const UnicodeString & ID, const TransliterationRuleData * theData, UnicodeFilter * adoptedFilter = 0)
RuleBasedTransliterator::RuleBasedTransliterator (const RuleBasedTransliterator &)
virtual RuleBasedTransliterator::~RuleBasedTransliterator () [virtual]
Transliterator * RuleBasedTransliterator::clone (void) const [virtual]
virtual void RuleBasedTransliterator::handleTransliterate (Replaceable & text, Position & offsets, UBool isIncremental) const [virtual]
The documentation for this class was generated from the following file:
Generated at Mon Jun 5 12:53:21 2000 for ICU1.5 by doxygen 1.0.0 written by Dimitri van Heesch, © 1997-1999