
Text Element Boundary Analysis
Overview of Text Boundary Analysis
Text boundary analysis is the process of locating linguistic boundaries while formatting and handling text. Examples of this process include:
Locating appropriate points to word-wrap text to fit within specific margins while displaying or printing.
Locating the beginning of a word that the user has selected.
Counting characters, words, sentences, or paragraphs.
Determining how far to move the text cursor when the user hits an arrow key (Some characters require more than one position in the text store and some characters in the text store do not display at all).
Making a list of the unique words in a document.
Figuring out if a given range of text contains only whole words.
Capitalizing the first letter of each word.
Locating a particular unit of the text (For example, finding the third word in the document).
The BreakIterator classes were designed to support these kinds of tasks. The BreakIterator objects maintain a location between two characters in the text. This location will always be a text boundary. Clients can move the location forward to the next boundary or backward to the previous boundary. Clients can also check if a particular location within a source text is on a boundary or find the boundary which is before or after a particular location.
Four Types of BreakIterator
ICU BreakIterators can be used to locate the following kinds of text boundaries:
Character Boundary
Word Boundary
Line-break Boundary
Sentence Boundary
Character Boundary
The character-boundary iterator locates the boundaries between "characters", where "character" is what the end user of an application would usually expect. For example, the Ä letter can be represented in Unicode either with a single code-point value or with two code-point values (one representing the A and another representing the umlaut). The character-boundary iterator will treat the Ä as a single character regardless of whether or not it is stored using one code point or two. In short, the character-boundary iterator is used to identify sequences that should be treated as single characters from a user's perspective.
End-user characters, as described above, are also called grapheme clusters, in an attempt to limit the confusion caused by multiple meanings for the word "character".
Word Boundary
The word-boundary iterator locates the boundaries of words, for purposes such as double click selection or "Find whole words" operations in an editor.
Here's an example of a sentence, showing the boundary locations that will be identified by a word break iterator:
Word boundary locations are found according to these general principles:
Words themselves are kept together
Numbers are kept together, including any commas, points or currency symbols.
Apostrophes or hyphens within a word are kept with the word. They are not broken out separately like most other punctuation
Punctuation, spaces and other characters that are not part of a word or number, are broken out separately, with a boundary before and after each character.
The rules used for locating word breaks take into account the alphabets and conventions used for different languages.
Locating word breaks for Thai text presents a special challenge, because there are no spaces or other identifiable characters separating the words. To solve the problem of word-breaking Thai text, ICU provides a special dictionary-based break iterator.
Line-break Boundary
The line-break iterator locates positions within the text that would be appropriate points for a text editor to break lines when displaying the text. Line breaks differ from word breaks in that adjoining punctuation and trailing white space are kept with the words instead of being treated as separate "words" on their own (for example, do not wrap a line before a space).
This example shows the differences in the break locations produced by word and line break iterators
Sentence Boundary
A sentence-break iterator locates sentence boundaries.
The exact rules used for locating each type of boundary are described in a pair of documents from the Unicode Consortium. Unicode Standard Annex 14 ( http://www.unicode.org/unicode/reports/tr14/ ) gives the rules for locating line boundaries, while technical report 29 "http://www.unicode.org/unicode/reports/tr29/" ) describe character, word and sentence boundaries.
Usage
To locate boundaries in a document, create a BreakIterator using the BreakIterator::create***Instance family of methods in C++, or the ubrk_open() function (C). "***" is Character, Word, Line or Sentence, depending on the type of iterator wanted. These factory methods also take a parameter that specifies the locale for the language of the text to be processed.
When creating a BreakIterator, a locale is also specified, and the behavior of the BreakIterator obtained may be specialized in some way for that locale. For ICU 2.6, Break Iterators for the Thai locale will make use of a Thai dictionary for finding word and line boundaries; all other locales will use the default boundary rules.
Applications also may register customized BreakIterators for use in specific locales. Once such a break iterator has been registered, any requests for break iterators for the locale will return copies of the registered break iterator
In the general-usage-model, applications will use the following basic steps to analyze a piece of text for boundaries:
Create a BreakIterator with the desired behavior
Use the setText() or adoptText() methods to set the iterator to analyze a particular piece of text. Since BreakIterator uses a CharacterIterator to access the text, it can be stored in any form as long as you provide an appropriate CharacterIterator. There is a convenience method for analyzing a UnicodeString, but the user also can analyze part of a UnicodeString by creating a StringCharacterIterator directly.
Locate the desired boundaries using the appropriate combination of first(), last(), next(), previous(), preceding(), and following() methods.
The setText() or the adoptText() method can be called more than once, allowing a single BreakIterator to be reused to analyze different pieces of text. Because the creation of a BreakIterator can be relatively time-consuming, it makes good sense to cache and reuse BreakIterators within an application.
Set the text to be searched using the following:
adoptText(CharacterIterator) sets the BreakIterator to analyze a new piece of text. The new piece of text is specified with a CharacterIterator, which allows BreakIterator to analyze the text for boundaries no matter how it happens to be stored [it always accesses the text through the CharacterIterator]. The BreakIterator takes ownership of the CharacterIterator and will delete it when the process is completed.
setText(UnicodeString) is a shortcut for the adoptText() method. If the text is a UnicodeString, the user can call setText and pass it the string, rather than creating a StringCharacterIterator and passing it to the adoptText() method. This method will create the StringCharacterIterator. To analyze only part of a UnicodeString, on the other hand, create the StringCharacterIterator manually, specify the substring, and then pass it to the adoptText() method.
getText() method returns a const reference to the CharacterIterator that the BreakIterator is using to access the text.
createText() method returns a clone of the CharacterIterator that the BreakIterator is using to access the text. Ownership of the clone is transferred to the caller. (The caller can seek the returned CharacterIterator without affecting the BreakIterator, but if the actual text underlying the iterator is changed, the adoptText() method must be called again to make sure the BreakIterator does not malfunction.)
The iterator always points to a boundary position between two characters. The numerical value of the position, as returned by current() is the zero-based index of the character following the boundary. Thus a position of zero represents a boundary preceding the first character of the text, and a position of one represents a boundary between the first and second characters.
The first() and last() methods reset the iterator's current position to the beginning or end of the text (the beginning and the end are always considered boundaries). The next() and previous() methods advance the iterator one boundary forward or backward from the current position. If the next() or previous() methods run off the beginning or end of the text, it returns DONE. Thecurrent() method returns the current position.
The following() and preceding() methods are used for random access or to reposition the iterator to some arbitrary spot in the middle of the text. Since a BreakIterator always points to a boundary position, the following() and preceding() methods will never set the iterator to point to the position specified by the caller (even if it is, in fact, a boundary position). BreakIterator will, however, set the iterator to the nearest boundary position before or after the specified position. The isBoundary() method returns true or false, based on whether or not the specified position is a boundary position. It does this by calling the preceding() and next() methods, so it also repositions the iterator either at the specified position or the first boundary position after it. If any of these functions is passed an out-or-range offset, it returns DONE and repositions the iterator to the beginning or end of the text.
Reuse
It is slow and wasteful to repeatedly create and destroy a BreakIterator when it is not necessary. For example, do not create a separate BreakIterator for each line in a document that is being word-wrapped. Keep around a single instance of a line BreakIterator and use it whenever a line break iterator is needed.
Accuracy
ICU's break iterators implement the default boundary rules described in the Unicode Consortium Technical Reports 14 and 29 . These are relatively simple boundary rules that can be implemented efficiently, and are sufficient for many purposes and languages. However, some languages and applications will require a more sophisticated linguistic analysis of the text in order to find boundaries with good accuracy. Such an analysis is not directly available from ICU at this time.
Break Iterators based on custom, user-supplied boundary rules can be created and used by applications with requirements that are not met by the standard default boundary rules.
BreakIterator Boundary Analysis Examples
Print out all the word-boundary positions in a UnicodeString:
In C++,
void listWordBoundaries(const UnicodeString& s) { UErrorCode status = U_ZERO_ERROR; BreakIterator* bi = BreakIterator::createWordInstance(Locale::getUS(), status); bi->setText(s); int32_t p = bi->first(); while (p != BreakIterator::DONE) { printf("Boundary at position %d\n", p); p = bi->next(); } delete bi; } |
In C:
void listWordBoundaries(const UChar* s, int32_t len) { UBreakIterator* bi; int32_t p; UErrorCode err = U_ZERO_ERROR; bi = ubrk_open(UBRK_WORD, 0, s, len, &err); if (U_FAILURE(err)) return; p = ubrk_first(bi); while (p != UBRK_DONE) { printf("Boundary at position %d\n", p); p = ubrk_next(bi); } ubrk_close(bi); } |
Get the boundaries of the word that contains a double-click position:
In C++:
void wordContaining(BreakIterator& wordBrk, int32_t idx, const UnicodeString& s, int32_t& start, int32_t& end) { // this function is written to assume that we have an // appropriate BreakIterator stored in an object or a // global variable somewhere-- When possible, programmers // should avoid having the create() and delete calls in // a function of this nature. if (s.isEmpty()) return; wordBrk.setText(s); start = wordBrk.preceding(idx + 1); end = wordBrk.next(); // NOTE: for this and similar operations, use preceding() and next() // as shown here, not following() and previous(). preceding() is // faster than following() and next() is faster than previous() // NOTE: By using preceding(idx + 1) above, we're adopting the convention // that if the double-click comes right on top of a word boundary, it // selects the word that _begins_ on that boundary (preceding(idx) would // instead select the word that _ends_ on that boundary). } |
In C:
void wordContaining(UBreakIterator* wordBrk, int32_t idx, const UChar* s, int32_t sLen, int32_t* start, int32_t* end, UErrorCode* err) { if (wordBrk == NULL || s == NULL || start == NULL || end == NULL) { *err = U_ILLEGAL_ARGUMENT_ERROR; return; } ubrk_setText(wordBrk, s, sLen, err); if (U_SUCCESS(*err)) { *start = ubrk_preceding(wordBrk, idx + 1); *end = ubrk_next(wordBrk); } } |
Check for Whole Words
Use the following to check if a range of text is a "whole word":
In C++:
UBool isWholeWord(BreakIterator& wordBrk, const UnicodeString& s, int32_t start, int32_t end) { if (s.isEmpty()) return FALSE; wordBrk.setText(s); if (!wordBrk.isBoundary(start)) return FALSE; return wordBrk.isBoundary(end);} |
In C:
UBool isWholeWord(UBreakIterator* wordBrk, const UChar* s, int32_t sLen, int32_t start, int32_t end, UErrorCode* err) { UBool result = FALSE; if (wordBrk == NULL || s == NULL) { *err = U_ILLEGAL_ARGUMENT_ERROR; return FALSE; } ubrk_setText(wordBrk, s, sLen, err); if (U_SUCCESS(*err)) { result = ubrk_isBoundary(wordBrk, start) >> ubrk_isBoundary(wordBrk, end); } return result; } |
Although users can check for "whole words" using these methods, it is possible to get better performance (in most cases) with the following algorithm:
bool isWholeWord(BreakIterator *wordBrk, const UnicodeString& s, int32_t start, int32_t end) { wordBrk->setText(s); if (!wordBrk->isBoundary(start)) return false; UTextOffset p = wordBrk->current(); while (p < end) p = wordBrk->next(); return p == end; } |
This algorithm is faster because the next() method is the fastest boundary-detection method in BreakIterator. The following() and isBoundary() method [while it calls following()] is the slowest. Two calls to the isBoundary() method is faster only when the selection range is long and comprises more than roughly four words.
Count the words in a document (C++ only):
int32_t containsLetters(RuleBasedBreakIterator& bi, const UnicodeString& s, int32_t start) { bi.setText(s); int32_t count = 0; while (start != BreakIterator::DONE) { int breakType = bi.getRuleStatus(); if (breakType != UBRK_WORD_NONE) { // Exclude spaces, punctuation, and the like. ++count; } start = bi.next(); } return count; } |
The function getRuleStatus() returns an enum giving additional information on the text preceding the last break position found. Using this value, it is possible to distinguish between numbers, words, words containing kana characters, words containing ideographic characters, and non-word characters, such as spaces or punctuation. The sample uses the break status value to filter out, and not count, boundaries associated with non-word characters.
Word-wrap a document (C++ only):
The sample function below wraps a paragraph so that each line is less than or equal to 72 characters. The function fills in an array passed in by the caller with the starting offsets of each line in the document. Also, it fills in a second array to track how many trailing white space characters there are in the line. For simplicity, it is assumed that an outside process has already broken the document into paragraphs. For example, it is assumed that every string the function is passed has a single newline at the end only.
int32_t wrapParagraph(const UnicodeString& s, const Locale& locale, int32_t lineStarts[], int32_t trailingwhitespace[], int32_t maxLines, UErrorCode &status) { int32_t numLines = 0; int32_t p, q; const int32_t MAX_CHARS_PER_LINE = 72; UChar c; BreakIterator *bi = BreakIterator::createLineInstance(locale, status); if (U_FAILURE(status)) { delete bi; return 0; } bi->setText(s); p = 0; while (p < s.length()) { // jump ahead in the paragraph by the maximum number of // characters that will fit q = p + MAX_CHARS_PER_LINE; // if this puts us on a white space character, a control character // (which includes newlines), or a non-spacing mark, seek forward // and stop on the next character that is not any of these things // since none of these characters will be visible at the end of a // line, we can ignore them for the purposes of figuring out how // many characters will fit on the line) if (q < s.length()) { c = s[q]; while (q < s.length() && (u_isspace(c) || u_charType(c) == U_CONTROL_CHAR || u_charType(c) == U_NON_SPACING_MARK )) { ++q; c = s[q]; } } // then locate the last legal line-break decision at or before // the current position ("at or before" is what causes the "+ 1") q = bi->preceding(q + 1); // if this causes us to wind back to where we started, then the // line has no legal line-break positions. Break the line at // the maximum number of characters if (q == p) { p += MAX_CHARS_PER_LINE; lineStarts[numLines] = p; trailingwhitespace[numLines] = 0; ++numLines; } // otherwise, we got a good line-break position. Record the start of this // line (p) and then seek back from the end of this line (q) until you find // a non-white space character (same criteria as above) and // record the number of white space characters at the end of the // line in the other results array else { lineStarts[numLines] = p; int32_t nextLineStart = q; for (q--; q > p; q--) { c = s[q]; if (!(u_isspace(c) || u_charType(c) == U_CONTROL_CHAR || u_charType(c) == U_NON_SPACING_MARK)) { break; } } trailingwhitespace[numLines] = nextLineStart - q -1; p = nextLineStart; ++numLines; } if (numLines >= maxLines) { break; } } delete bi; return numLines; } |
Most text editors would not break lines based on the number of characters on a line. Even with a monospaced font, there are still many Unicode characters that are not displayed and therefore should be filtered out of the calculation. With a proportional font, character widths are added up until a maximum line width is exceeded or an end of the paragraph marker is reached.
Trailing white space does not need to be counted in the line-width measurement because it does not need to be displayed at the end of a line. The sample code above returns an array of trailing white space values because an external rendering process needs to be able to measure the length of the line (without the trailing white space) to justify the lines. For example, if the text is right-justified, the invisible white space would be drawn outside the margin. The line would actually end with the last visible character.
In either case, the basic principle is to jump ahead in the text to the location where the line would break (without taking word breaks into account). Then, move backwards using the preceding() method to find the last legal breaking position before that location. Iterating straight through the text with next() method will generally be slower.
ICU BreakIterator Data Files
The source code for the ICU break rules for the standard boundary types is located in the directory icu/source/data/brkitr. These files will be built, and the corresponding binary state tables incorporated into ICU's data, by the standard ICU4C build process. Unlike older (version 2.0 and before) versions of ICU, no special Java tool based build of the break data files is required.
Beginning with ICU 3.0, the same break rule source files and compiled state tables are used for both ICU4C and ICU4J. The state tables are built using ICU4C, and the binary tables are incorporated into ICU4J.
RBBI Rule Syntax
ICU locates boundary positions within text by means of rules, which take the form of regular expressions. A rule matches a section of text - a word or sentence or whatever - that should remain together, with boundaries occuring between the ranges of matched text. A set of rules consists of a series of regular expressions separated by semicolons; at each point, the expression with the longest match determines the next boundary.
A set of break rules may also define and use variables, which are convenient when subexpressions reappear more than once, or to simplify complex expressions by allowing parts to be separately defined and named. Use of variables within a set of rules has no effect on the efficiency of the resulting break iterator.
Here is the syntax for the boundary rules.
Rule Name | Rule Values | Notes |
---|---|---|
rules | statement+ | |
statement | assignment | rule | |
assignment | variable '=' expr ';' | 5 |
rule | '!'? expr ('{'number'}')? ';' | |
number | [0-9]+ | 1 |
break-point | '/' | |
expr | expr-q | expr '|' expr | expr expr | 3 |
expr-q | term | term '*' | term '?' | term '+' | |
term | rule-char | unicode-set | variable | quoted-sequence | '(' expr ')' | break-point | |
rule-special | any printing ascii character except letters or numbers | white-space | |
rule-char | any non-escaped character that is not rule-special | '.' | any escaped character except '\p' or '\P' | |
variable | '$' name-start-char name-char* | 7 |
name-start-char | '_' | \p{L} | |
name-char | name-start-char | \p{N} | |
quoted-sequence | ''' (any char except single quote or line terminator or two adjacent single quotes)+ ''' | |
escaped-char | See “Character Quoting and Escaping ” | |
Unicode set | See UnicodeSet | 4 |
comment | unescaped '#' [any char except new-line]* new-line | 2 |
s | unescaped \p{Z}, tab, LF, FF, CR, NEL | 6 |
new-line | LF, CR, NEL | 2 |
Notes:
The number associated with a rule that actually determined a break position is available to the application after the break has been returned.
Comments are recognized and removed separately from otherwise parsing the rules. They may appear wherever a space would be allowed (and ignored.)
The implicit concatenation of adjacent terms has higher precedence than the '|' operation. "ab|cd" is interpretted as "(ab)|(cd)", not as "a(b|c)d" or "(((ab)|c)d)"
The syntax for UnicodeSet is defined (and parsed) by the UnicodeSet class. It is not repeated here.
For $variables that will be referenced from inside of a UnicodeSet, the definition must consist only of a Unicode Set. For example, when variable $a is used in a rule like this [$a$b$c], then this definition of $a is ok $a=[:Lu:]; while this one is $a=abcd; would cause an error when the $variable was used.
Spaces are allowed nearly anywhere, and are not significant unless escaped. Exceptions to this are noted.
No spaces are allowed within a variable name. The variable name $_dictionary_ is special. If defined, it must be a Unicode Set, the characters of which will be handled by the word dictionary of a Dictionary based Break Iterator.
In ICU4C 2.0 and earlier, and ICU4J 2.4 and earlier, RBBI break rules, while similar, had a slightly different syntax. Here is a summary of the changes with ICU 2.0.
The right hand side of an assignment is a little looser in what it will accept than is the equivalent in the old RBBI rule syntax - the expression doesn't need to be in parens or brackets. Spaces are allowed around the '='.
Spaces are allowed anywhere that it makes sense. Spaces that need to be recognized as such within a rule or set of characters must be escaped or quoted.
The ICU standard conventions for quoting and escaping witin rules are followed.
Text within single quotes is grouped, for example, 'abc'* is equivalent to (abc)*
Because the old style rules appeared only as literal strings within Java source code, adding a '\' escape within a rule required a '\\' in the source, to get a single '\' into the string. Moving the rules out to a text file required removal of the extra '\'s.
# Comments added.
{nnn} syntax added for specifying a value to be returned from the break iterator when the expression matches.
Non greedy *? quantifier was removed.
$ignore characters don't get special handling. Explicit rules were added for dealing with the combining marks that were previously handled including them in $ignore.
EBNF Syntax used for the RBBI rules syntax description
a? | zero or one instance of a |
---|---|
a+ | one or more instances of a |
a* | zero or more instances of a |
a | b | either a or b, but not both |
'a' "a" | the literal string between the quotes |
Additional Sample Code
C/C++: See icu/source/samples/break/ in the ICU source distribution for code samples showing the use of ICU boundary analysis.
Copyright (c) 2000 - 2004 IBM and Others - PDF Version - Feedback: icu-issues@oss.software.ibm.com
User Guide for ICU v3.2 Generated 2004-11-22.