22.6. Character set.

22.6.1. Support for UTF-8 and single-byte character sets.

Zend_Search_Lucene works with the UTF-8 character set internally. Index files store Unicode data in Java's "modified UTF-8 encoding". The Zend_Search_Lucene core fully supports this encoding, with one exception. [10]

The actual encoding of the input data may be specified through the Zend_Search_Lucene API. The data is then converted to UTF-8 automatically.
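The conversion step can be pictured with PHP's iconv() function. This is only a sketch of the idea under stated assumptions, not the library's internal code; the helper name sketchToUtf8() is hypothetical.

```php
<?php
// Hypothetical helper, not part of Zend_Search_Lucene: converts text from a
// declared source encoding to UTF-8 using the iconv extension.
function sketchToUtf8($text, $sourceEncoding)
{
    $converted = iconv($sourceEncoding, 'UTF-8', $text);
    if ($converted === false) {
        // iconv() returns false when the input is not valid in $sourceEncoding.
        throw new InvalidArgumentException('Cannot convert input from ' . $sourceEncoding);
    }
    return $converted;
}

// "ä" is the single byte 0xE4 in ISO-8859-1 and the two bytes 0xC3 0xA4 in UTF-8.
$utf8 = sketchToUtf8("\xE4", 'ISO-8859-1');
echo bin2hex($utf8), "\n"; // prints "c3a4"
```

Declaring the source encoding explicitly matters because the bytes alone do not reveal whether 0xE4 is meant as ISO-8859-1 text or as part of some other encoding.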

22.6.2. Default text analyzer.

However, the default text analyzer (which is also used by the query parser) uses ctype_alpha() to tokenize text and queries.

ctype_alpha() is not UTF-8 compatible, so the analyzer converts text to the 'ASCII//TRANSLIT' encoding before indexing. The same conversion is applied during query parsing, so it is transparent to the caller. [11]
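The approach described above can be approximated in a few lines of plain PHP. The function below is a hypothetical sketch, not the analyzer's actual source: it transliterates to ASCII with iconv() and then collects runs of characters accepted by ctype_alpha(). Lowercasing the tokens is an assumption made here for illustration.

```php
<?php
// Hypothetical sketch of ctype_alpha()-based tokenizing; the function name
// and the lowercasing step are assumptions, not the library's code.
function sketchAsciiTokenize($text, $encoding = 'UTF-8')
{
    // Transliterate to ASCII so that byte-wise ctype_alpha() is safe to use.
    // The result for non-ASCII input depends on locale and OS.
    $ascii = @iconv($encoding, 'ASCII//TRANSLIT', $text);
    if ($ascii === false) {
        $ascii = $text; // fall back to the raw bytes if conversion fails
    }

    $tokens  = array();
    $current = '';
    $length  = strlen($ascii);
    for ($i = 0; $i < $length; $i++) {
        $char = $ascii[$i];
        if (ctype_alpha($char)) {
            $current .= $char;               // still inside a word
        } elseif ($current !== '') {
            $tokens[] = strtolower($current); // a non-letter ends the word
            $current  = '';
        }
    }
    if ($current !== '') {
        $tokens[] = strtolower($current);
    }
    return $tokens;
}

print_r(sketchAsciiTokenize('Hello, World 42!'));
```

For pure ASCII input such as the call above, the tokens are "hello" and "world"; digits and punctuation act as separators. For accented input the tokens depend on how the platform's iconv transliterates each character.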

22.6.3. UTF-8 compatible text analyzer.

Zend_Search_Lucene also contains a UTF-8 analyzer with limited functionality. It can be enabled with the following code:

<?php
Zend_Search_Lucene_Analysis_Analyzer::setDefault(
    new Zend_Search_Lucene_Analysis_Analyzer_Common_Utf8());
?>

It tokenizes data for indexing in UTF-8 mode and handles any valid UTF-8 character.

It has two limitations:

  • it treats all non-ASCII characters as letters (which is not always correct);

  • it is case-sensitive.

Because of these limitations it is not the default analyzer, but it may be useful in some cases.
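Both limitations can be seen in a small stand-alone sketch. The regular expression below is an assumption chosen to mimic the described behaviour (ASCII letters plus every non-ASCII character count as letters); it is not taken from the analyzer's source, and the function name is hypothetical.

```php
<?php
// Hypothetical sketch: treats ASCII letters and ALL non-ASCII characters as
// letters and leaves case untouched, mirroring the two limitations above.
function sketchUtf8Tokenize($text)
{
    preg_match_all('/(?:[a-zA-Z]|[^\x00-\x7F])+/u', $text, $matches);
    return $matches[0];
}

// "\xC3\xA9" is "é"; "\xE2\x80\x93" is an en dash, which is punctuation
// but is still glued into a token because it is non-ASCII.
$tokens = sketchUtf8Tokenize("caf\xC3\xA9 Caf\xC3\xA9 x\xE2\x80\x93y 42");
print_r($tokens);
```

The call yields three tokens: "café" and "Café" stay distinct because case is preserved, and "x–y" remains a single token because the dash is treated as a letter.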

Case insensitivity may be emulated with strtolower() conversion:

<?php
setlocale(LC_CTYPE, 'de_DE.iso-8859-1');

...

Zend_Search_Lucene_Analysis_Analyzer::setDefault(
    new Zend_Search_Lucene_Analysis_Analyzer_Common_Utf8());

...

$doc = new Zend_Search_Lucene_Document();

$doc->addField(Zend_Search_Lucene_Field::UnStored('contents', strtolower($contents)));

// Title field for searching (indexed, unstored)
$doc->addField(Zend_Search_Lucene_Field::UnStored('title', strtolower($title)));

// Title field for retrieving (unindexed, stored)
$doc->addField(Zend_Search_Lucene_Field::UnIndexed('_title', $title));

?>

The same conversion has to be applied to the query string:

<?php
setlocale(LC_CTYPE, 'de_DE.iso-8859-1');

...

Zend_Search_Lucene_Analysis_Analyzer::setDefault(
    new Zend_Search_Lucene_Analysis_Analyzer_Common_Utf8());

...

$hits = $index->find(strtolower($query));
?>
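Note that strtolower() works byte by byte, so its effect on multi-byte UTF-8 sequences depends on the PHP version and locale. A possible alternative, which is an assumption of this sketch and not part of the original example, is mb_strtolower() from the mbstring extension. The helper name sketchNormalizeCase() is hypothetical; the point is that the same function is applied both to indexed text and to queries.

```php
<?php
// Hypothetical helper: lowercases UTF-8 text in a multi-byte-safe way.
// Requires the mbstring extension.
function sketchNormalizeCase($text)
{
    return mb_strtolower($text, 'UTF-8');
}

// "\xC3\x84" is "Ä" and "\xC3\xA4" is "ä" in UTF-8.
$indexed = sketchNormalizeCase("\xC3\x84RGER"); // text as it would be indexed
$queried = sketchNormalizeCase("\xC3\xA4rger"); // text as a user might type it

var_dump($indexed === $queried); // bool(true)
```

Because both sides pass through the same normalization, a query in any letter case matches the indexed form.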



[10] Zend_Search_Lucene supports only Basic Multilingual Plane (BMP) characters (from 0x0000 to 0xFFFF) and does not support "supplementary characters" (characters whose code points are greater than 0xFFFF).

Java 2 represents these characters as a pair of 16-bit char values, the first from the high-surrogates range (0xD800-0xDBFF), the second from the low-surrogates range (0xDC00-0xDFFF). They are then encoded as ordinary UTF-8 characters, in six bytes. The standard UTF-8 representation uses four bytes for supplementary characters.

[11] The result of conversion to 'ASCII//TRANSLIT' may depend on the current locale and OS.
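The six-byte form can be worked through for one supplementary character, U+1D11E (musical G clef). The sketch below is illustrative arithmetic only; the function names are hypothetical and nothing here is part of Zend_Search_Lucene.

```php
<?php
// Splits a supplementary code point (> 0xFFFF) into a UTF-16 surrogate pair.
function sketchSurrogatePair($codePoint)
{
    $offset = $codePoint - 0x10000;
    return array(
        0xD800 + ($offset >> 10),   // high surrogate: top 10 bits
        0xDC00 + ($offset & 0x3FF), // low surrogate: bottom 10 bits
    );
}

// Encodes one 16-bit value in the 0x0800-0xFFFF range as three UTF-8 style bytes.
function sketchEncodeThreeBytes($unit)
{
    return chr(0xE0 | ($unit >> 12))
         . chr(0x80 | (($unit >> 6) & 0x3F))
         . chr(0x80 | ($unit & 0x3F));
}

// Java-style encoding: each surrogate is encoded separately, 3 + 3 bytes.
function sketchModifiedUtf8($codePoint)
{
    list($high, $low) = sketchSurrogatePair($codePoint);
    return sketchEncodeThreeBytes($high) . sketchEncodeThreeBytes($low);
}

echo bin2hex(sketchModifiedUtf8(0x1D11E)), "\n"; // prints "eda0b4edb49e"
echo strlen("\xF0\x9D\x84\x9E"), "\n";           // standard UTF-8 form: prints "4"
```

The surrogate pair for U+1D11E is 0xD834 and 0xDD1E, which gives the six bytes ED A0 B4 ED B4 9E, whereas standard UTF-8 encodes the same character as the four bytes F0 9D 84 9E.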