
Data Management
Overview
ICU makes use of a wide variety of data tables to provide many of its services. Examples include converter mapping tables, collation rules, transliteration rules, break iterator rules and dictionaries, and other locale data. Additional data can be provided by users, either as customizations of ICU's data or as new data altogether.
This section describes how ICU data is stored and located at run time. It also describes how ICU data can be customized to suit the needs of a particular application.
For simple use of ICU's predefined data, this section on data management can safely be skipped. The data is built into a library that is loaded along with the rest of ICU. No specific action or setup is required of either the application program or the execution environment.
ICU Data Directory
The ICU data directory is the default location for all ICU data. Any requests for data items that do not include an explicit directory path will be resolved to files located in the ICU data directory.
The ICU data directory is determined as follows:
If the application has called the function u_setDataDirectory(), use the directory specified there, otherwise:
If the environment variable ICU_DATA is set, use that, otherwise:
If the C preprocessor variable ICU_DATA_DIR was set at the time ICU was built, use its compiled-in value.
Otherwise, the ICU data directory is an empty string. This is the default behavior for ICU using a shared library for its data and provides the highest data loading performance.
Note: u_setDataDirectory() is not thread-safe. Call it before calling ICU APIs from multiple threads. If you use both u_setDataDirectory() and u_init(), then use u_setDataDirectory() first.
Earlier versions of ICU supported two additional schemes: setting a data directory relative to the location of the ICU shared libraries, and on Windows, taking a location from the registry. These have both been removed to make the behavior more predictable and easier to understand.
The ICU data directory does not need to be set in order to reference the standard built-in ICU data. Applications that just use standard ICU capabilities (converters, locales, collation, etc.) but do not build and reference their own data do not need to specify an ICU data directory.
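The precedence rules above can be modeled in a few lines of C. This is only a sketch of the decision order, not ICU's implementation: resolve_data_directory() and check_precedence() are hypothetical names, and each argument stands for one of the three sources named in the text (NULL when that source is absent).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper (not an ICU API): models the documented precedence.
 *   api_dir      - value passed to u_setDataDirectory(), or NULL if not called
 *   env_dir      - value of the ICU_DATA environment variable, or NULL if unset
 *   compiled_dir - value of ICU_DATA_DIR at ICU build time, or NULL if undefined
 */
static const char *resolve_data_directory(const char *api_dir,
                                          const char *env_dir,
                                          const char *compiled_dir) {
    if (api_dir != NULL) {
        return api_dir;
    }
    if (env_dir != NULL) {
        return env_dir;
    }
    if (compiled_dir != NULL) {
        return compiled_dir;
    }
    return ""; /* empty string: the default for data in a shared library */
}

/* Usage sketch: an explicit setting wins; with no source the result is "". */
static void check_precedence(void) {
    assert(strcmp(resolve_data_directory("/app/data", "/env/data", NULL),
                  "/app/data") == 0);
    assert(strcmp(resolve_data_directory(NULL, NULL, NULL), "") == 0);
}
```

In a real application only the first source is under program control, through u_setDataDirectory(); the other two are fixed by the environment and by the ICU build.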
Multiple-Item ICU Data Directory Values
The ICU data directory string can contain multiple directories as well as .dat path/filenames. They must be separated by the platform's path separator, for example a semicolon (;) on Windows or a colon (:) on Unix-style platforms. Data files are searched for in all directories and .dat package files, in the order in which they appear in the directory string. For details, see the example below.
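As an illustration of such a multiple-item value, the following sketch walks a Windows-style directory string entry by entry. The helper names, the SKETCH_PATH_SEP macro, and the sample paths are all hypothetical. Treating an entry that ends in ".dat" as a package file is an assumption of this sketch, not a statement about ICU's internal rules.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Local stand-in for the platform path separator (the Windows value). */
#define SKETCH_PATH_SEP ';'

/*
 * Hypothetical helper (not an ICU API): finds entry number `index`
 * (0-based) in a data directory string. Stores the start of the entry in
 * *start and returns its length, or -1 if there are fewer entries.
 */
static int nth_entry(const char *dirs, int index, const char **start) {
    const char *p = dirs;
    assert(dirs != NULL && start != NULL && index >= 0);
    for (;;) {
        const char *end = strchr(p, SKETCH_PATH_SEP);
        size_t len = (end != NULL) ? (size_t)(end - p) : strlen(p);
        if (index == 0) {
            *start = p;
            return (int)len;
        }
        if (end == NULL) {
            return -1; /* ran out of entries */
        }
        p = end + 1;
        --index;
    }
}

/* Assumption of this sketch: an entry ending in ".dat" is a package file,
 * anything else is a directory. */
static int is_dat_entry(const char *entry, int len) {
    return len >= 4 && strncmp(entry + len - 4, ".dat", 4) == 0;
}
```

For the string "c:\app\data;c:\app\pkg\myapp.dat;d:\shared", the entries are visited in exactly that order, and only the second one is classified as a package file.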
Default ICU Data
The default ICU data consists of the data needed for the converters, collators, locales, etc. that are provided with ICU. Default data must be present in order for ICU to function.
The default data is most commonly built into a shared library that is installed with the other ICU libraries. Nothing is required of the application for this mechanism to work. ICU provides additional options for loading the default data if more flexibility is required.
Here are the steps followed by ICU to locate its default data. This procedure happens only once per process, at the time an ICU data item is first requested.
If the application has called the function udata_setCommonData(), use the data that was provided. The application specifies the address in memory of an image of an ICU common format data file (either in shared-library format or .dat package file format).
Examine the contents of the default ICU data shared library. If it contains data, use that data. If the data library is empty (that is, it is a stub library), proceed to the next step. (A data shared library must always be present in order for ICU to link and load successfully. A stub data library is used when the actual ICU common data is to be provided from another source.)
Dynamically load (memory map, typically) a common format (.dat) file containing the default ICU data. Loading is described in the section How Data Loading Works. The path to the data is of the form "icudt<version><flag>", where <version> is the two-digit ICU version number, and <flag> is a letter indicating the internal format of the file (see Sharing ICU Data Between Platforms).
Once the default ICU data has been located, loading of individual data items proceeds as described in the section How Data Loading Works.
Application Data
ICU-based applications can ship and use their own data for localized strings, custom conversion tables, etc. Each data item file must have a package name as a prefix, and this package name must match the basename of a .dat package file, if one is used. The package name must be used in ICU APIs, for example in udata_setAppData() (instead of udata_setCommonData() which is only used for ICU's own data) and in the pathname argument of ures_open().
The only real difference from ICU's own data is that application data cannot be loaded simply by specifying a NULL value for the path arguments of ICU APIs, and application data will not be used by APIs that do not have path/package name arguments at all.
The most important APIs that allow application data to be used are for Resource Bundles, which are most often used for localized strings and other data. There are also functions like ucnv_openPackage() that allow application data to be specified, and the udata.h API can be used to load any data with minimum requirements on the binary format, and without ICU interpreting the contents of the data.
Flexibility vs. Installation vs. Performance
There are choices that affect ICU data loading and depend on application requirements.
Data in Shared Libraries/DLLs vs. .dat package files
Building ICU data into shared libraries is the most convenient packaging method because shared libraries (DLLs) are easily found if they are in the same directory as the application libraries, or if they are on the system library path. The application installer usually just copies the ICU shared libraries into the same place. On the other hand, shared libraries are not portable.
Packaging data into .dat files allows them to be shared across platforms, but they must either be loaded by the application and set with udata_setCommonData() or udata_setAppData(), or they must be in a known location that is included in the ICU data directory string. This requires the application installer, or the application itself at runtime, to locate the ICU and/or application data by setting the ICU data directory (see ICU Data Directory above) or by loading the data and providing it to one of the udata_setXYZData() functions.
Unlike shared libraries, .dat package files can be taken apart into separate data item files with the decmn ICU tool. This allows post-installation modification of a package file. The gencmn and pkgdata ICU tools can then be used to reassemble the .dat package file.
For more information about .dat package files see the section Sharing ICU Data Between Platforms below.
Data Overriding vs. Loading Performance
If the ICU data directory string is empty, then ICU will not attempt to load data from the file system. It is then only possible to load data from the linked-in shared library or via udata_setCommonData() and udata_setAppData(). This is inflexible but provides the highest performance.
If the ICU data directory string is not empty, then data items are searched for in all of the directories and matching .dat files that it lists before the already-loaded package files are checked. This allows packaged data items to be overridden with single files after installation, but costs some time for file system accesses. The search is usually done only once per data item; see User Data Caching below.
Single Data Files vs. Packages
Single data files are easy to replace and can override items inside data packages. However, it is usually desirable to reduce the number of files during installation, and package files use less disk space than many small files.
How Data Loading Works
ICU data items are referenced by three names - a path, a name and a type. The following are some examples:
path | name | type |
---|---|---|
(none) | cnvalias | icu |
(none) | cp1252 | cnv |
(none) | en | res |
(none) | uprops | icu |
c:\some\path\dataLibName | test | dat |
Items with no path specified are loaded from the default ICU data.
Application data items include a path, and will be loaded from user data files, not from the ICU default data. For application data, the path argument need not contain an actual directory, but must contain the application data's package name after the last directory separator character (or by itself if there is no directory). If the path argument contains a directory, then it is logically prepended to the ICU data directory string and searched first for data. The path argument can contain at most one directory. (Path separators like semicolon (;) are not handled here.)
Note: The ICU data directory string itself may contain multiple directories and path/filenames to .dat package files. See ICU Data Directory.
It is recommended to not include the directory in the path argument but to make sure via setting the application data or the ICU data directory string that the data can be located. This simplifies program maintenance and improves robustness.
See the API descriptions for the functions udata_open() and udata_openChoice() for additional information on opening ICU data from within an application.
Data items can exist as individual files, or a number of them can be packaged together in a single file for greater efficiency in loading and convenience of distribution. The combined files are called Common Files.
Based on the supplied path and name, ICU searches several possible locations when opening data. To make things more concrete in the following descriptions, the following values of path, name and type are used:
path = "c:\some\path\dataLibName"
name = "test"
type = "res"
In this case, "dataLibName" is the "package name" part of the path argument, and "c:\some\path\" is the directory part of it.
The search sequence for the data for "test.res" is as follows (the first successful loading attempt wins):
Try to load the file "dataLibName_test.res" from c:\some\path\.
Try to load the file "dataLibName_test.res" from each of the directories in the ICU data directory string.
Try to locate the data package for the package name "dataLibName":
Try to locate the data package in the internal cache.
Try to load the package file "dataLibName.dat" from c:\some\path\.
Try to load the package file "dataLibName.dat" from each of the directories in the ICU data directory string.
The first steps, loading the data item from an individual file, are omitted if no directory is specified in either the path argument or the ICU data directory string.
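The file names tried by this search sequence can be generated mechanically. The sketch below is a model of the documented order only: build_search_order(), check_search_order(), and the sample directory "d:\icu\data\" are hypothetical, and the package cache lookup is not modeled. Every directory passed in is assumed to end in a directory separator.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define MAX_CANDIDATES 16
#define MAX_LEN 256

/*
 * Hypothetical helper (not an ICU API): fills `out` with the file names to
 * try, in order, and returns how many were written.
 *   path_dir  - directory part of the path argument, or NULL if none
 *   data_dirs - directories from the ICU data directory string
 */
static int build_search_order(char out[][MAX_LEN], const char *path_dir,
                              const char *const *data_dirs, int n_dirs,
                              const char *pkg, const char *name,
                              const char *type) {
    int n = 0;
    int i;
    assert(pkg != NULL && name != NULL && type != NULL);

    /* Individual item files: the path argument's directory comes first. */
    if (path_dir != NULL && n < MAX_CANDIDATES) {
        snprintf(out[n++], MAX_LEN, "%s%s_%s.%s", path_dir, pkg, name, type);
    }
    for (i = 0; i < n_dirs && n < MAX_CANDIDATES; ++i) {
        snprintf(out[n++], MAX_LEN, "%s%s_%s.%s", data_dirs[i], pkg, name, type);
    }

    /* Package files, in the same directory order. */
    if (path_dir != NULL && n < MAX_CANDIDATES) {
        snprintf(out[n++], MAX_LEN, "%s%s.dat", path_dir, pkg);
    }
    for (i = 0; i < n_dirs && n < MAX_CANDIDATES; ++i) {
        snprintf(out[n++], MAX_LEN, "%s%s.dat", data_dirs[i], pkg);
    }
    return n;
}

/* Usage sketch with the values from the text and one data directory. */
static void check_search_order(void) {
    char c[MAX_CANDIDATES][MAX_LEN];
    const char *dirs[] = { "d:\\icu\\data\\" };
    int n = build_search_order(c, "c:\\some\\path\\", dirs, 1,
                               "dataLibName", "test", "res");
    assert(n == 4);
    assert(strcmp(c[0], "c:\\some\\path\\dataLibName_test.res") == 0);
    assert(strcmp(c[3], "d:\\icu\\data\\dataLibName.dat") == 0);
}
```

With no directory in the path argument and an empty data directory string, the function returns no candidates at all, which matches the case where only linked-in or explicitly set data can be used.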
Package files are loaded at most once and then cached. They are identified only by their package name. Whenever a data item is requested from a package and that package has been loaded before, then the cached package is used immediately instead of searching through the filesystem.
Note: ICU versions before 2.2 always searched data packages before looking for individual files, which made it impossible to override packaged data items. See the ICU 2.2 download page and the readme for more information about the changes.
User Data Caching
Once loaded, data package files are cached, and stay loaded for the duration of the process. Any requests for data items from an already loaded data package file are routed directly to the cached data. No additional search for loadable files is made.
The user data cache is keyed by the base file name portion of the requested path, with any directory portion stripped off and ignored. Using the previous example, for the path name "c:\some\path\dataLibName", the cache key is "dataLibName". After this is cached, a subsequent request for "dataLibName", no matter what directory path is specified, will resolve to the cached data.
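The derivation of the cache key can be sketched as a base-name extraction. package_cache_key() and check_cache_key() are hypothetical names, and the sketch accepts both '/' and '\' as separators for simplicity, which is an assumption rather than documented ICU behavior.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper (not an ICU API): returns the part of `path` after
 * the last directory separator, which the text describes as the cache key.
 */
static const char *package_cache_key(const char *path) {
    const char *slash = strrchr(path, '/');
    const char *backslash = strrchr(path, '\\');
    const char *last = slash;
    if (backslash != NULL && (last == NULL || backslash > last)) {
        last = backslash;
    }
    return (last != NULL) ? last + 1 : path;
}

/* Usage sketch: different directories, same package name, same key. */
static void check_cache_key(void) {
    assert(strcmp(package_cache_key("c:\\some\\path\\dataLibName"),
                  "dataLibName") == 0);
    assert(strcmp(package_cache_key("d:\\other\\dataLibName"),
                  "dataLibName") == 0);
    assert(strcmp(package_cache_key("dataLibName"), "dataLibName") == 0);
}
```

Because the directory is dropped from the key, two different package files with the same base name cannot both be loaded in one process; the first one cached is used for all later requests.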
Data can be explicitly added to the cache of common format data by means of the udata_setAppData() function. This function takes as input the path (name) and a pointer to a memory image of a .dat file. The data is added to the cache, causing any subsequent requests for data items from that file name to be routed to the cache.
Only data package files are cached. Separate data files that contain just a single data item are not cached; for these, multiple requests to ICU to open the data will result in multiple requests to the operating system to open the underlying file.
However, most ICU services (Resource Bundles, conversion, etc.) themselves cache loaded data, so that data is usually loaded only once until the end of the process (or until u_cleanup() or ucnv_flushCache() or similar are called.)
There is no mechanism for removing or updating cached data files.
Directory Separator Characters
If a directory separator (generally '/' or '\') is needed in a path parameter, use the form that is native to the platform. The ICU header "putil.h" defines U_FILE_SEP_CHAR appropriately for the platform.
Note: On Windows, the directory separator must be '\' for any paths passed to ICU APIs. This is different from native Windows APIs, which generally allow either '/' or '\'.
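An application that accepts paths with forward slashes could normalize them before handing them to ICU. The following is a minimal sketch under that assumption. SKETCH_FILE_SEP_CHAR is a local stand-in for U_FILE_SEP_CHAR, fixed here to the Windows value so that the example is self-contained, and to_native_separators() is a hypothetical helper.

```c
#include <assert.h>
#include <string.h>

/* Local stand-in for U_FILE_SEP_CHAR from putil.h (Windows value). */
#define SKETCH_FILE_SEP_CHAR '\\'

/* Hypothetical helper (not an ICU API): rewrites every '/' in `path` to
 * the native separator, in place. */
static void to_native_separators(char *path) {
    assert(path != NULL);
    for (; *path != '\0'; ++path) {
        if (*path == '/') {
            *path = SKETCH_FILE_SEP_CHAR;
        }
    }
}

/* Usage sketch: the result uses only backslashes. */
static void check_separators(void) {
    char path[] = "c:/some/path/dataLibName";
    to_native_separators(path);
    assert(strcmp(path, "c:\\some\\path\\dataLibName") == 0);
}
```

Real code would use U_FILE_SEP_CHAR itself so that the same source does the right thing on every platform.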
Sharing ICU Data Between Platforms
ICU's default data is (at the time of this writing) about 8 MB in size. Because it is normally built as a shared library, the file format is specific to each platform (operating system). The data libraries cannot be shared between platforms even though the actual data contents are identical.
By distributing the default data in the form of common format .dat files rather than as shared libraries, a single data file can be shared among multiple platforms. This is beneficial if a single distribution of the application (a CD, for example) includes binaries for many platforms, and the size requirements for replicating the ICU data for each platform are a problem.
ICU common format data files are not completely interchangeable between platforms. The format depends on these properties of the platform:
Byte Ordering (little endian vs. big endian)
Base character set - ASCII or EBCDIC
This means, for example, that ICU data files are interchangeable between Windows and Linux on X86 (both are ASCII little endian), or between Macintosh and Solaris on SPARC (both are ASCII big endian), but not between Solaris on SPARC and Solaris on X86 (different byte ordering).
The single letter following the version number in the file name of the default ICU data file encodes the properties of the file as follows:
icudt19l.dat Little Endian, ASCII
icudt19b.dat Big Endian, ASCII
icudt19e.dat Big Endian, EBCDIC
(There are no little endian EBCDIC systems. All non-EBCDIC encodings include an invariant subset of ASCII that is sufficient to enable these files to interoperate.)
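The naming scheme above can be expressed as a small function. This sketch only reproduces the three letters listed in the text; format_letter(), default_data_file_name(), and check_file_names() are hypothetical names, not ICU APIs.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: maps platform properties to the format letter.
 * Returns '\0' for little endian EBCDIC, for which no letter is listed. */
static char format_letter(int big_endian, int ebcdic) {
    if (ebcdic) {
        return big_endian ? 'e' : '\0';
    }
    return big_endian ? 'b' : 'l';
}

/* Hypothetical helper: writes e.g. "icudt19l.dat" into `out`.
 * `version` is the two-digit ICU version number.
 * Returns 0 on success, -1 on an unsupported combination or a short buffer. */
static int default_data_file_name(char *out, size_t cap, int version,
                                  int big_endian, int ebcdic) {
    char letter = format_letter(big_endian, ebcdic);
    int n;
    if (letter == '\0') {
        return -1;
    }
    n = snprintf(out, cap, "icudt%02d%c.dat", version, letter);
    return (n > 0 && (size_t)n < cap) ? 0 : -1;
}

/* Usage sketch: reproduces the three file names listed in the text. */
static void check_file_names(void) {
    char name[32];
    assert(default_data_file_name(name, sizeof name, 19, 0, 0) == 0);
    assert(strcmp(name, "icudt19l.dat") == 0);
    assert(default_data_file_name(name, sizeof name, 19, 1, 0) == 0);
    assert(strcmp(name, "icudt19b.dat") == 0);
    assert(default_data_file_name(name, sizeof name, 19, 1, 1) == 0);
    assert(strcmp(name, "icudt19e.dat") == 0);
}
```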
The packaging of the default ICU data as a .dat file rather than as a shared library is requested by using an option in the configure script at build time. Nothing is required at run time; ICU finds and uses whatever form of the data is available.
Note: When the ICU data is built in the form of shared libraries, the library names have platform-specific prefixes and suffixes. On Unix-style platforms, all the libraries have the "lib" prefix and one of the usual (".dll", ".so", ".sl", etc.) suffixes. Other than these prefixes and suffixes, the library names are the same as the above .dat files.
Customizing ICU's Data Library
ICU includes a standard library of data that is about 8 MB in size. Most of this consists of conversion tables and locale information. The data itself is normally placed into a single shared library.
The ICU data library can be easily customized, either by adding additional converters or locales, or by removing some of the standard ones for the purpose of saving space.
ICU can load data from individual data files as well as from its default library, so building a customized library when adding additional data is not strictly necessary. Adding to ICU's library can simplify application installation by eliminating the need to include separate files with an application distribution, and the need to tell ICU where they are installed.
Reducing the size of ICU's data by eliminating unneeded resources can make sense on small systems with limited or no disk, but for desktop or server systems there is no real advantage to trimming. ICU's data is memory mapped into an application's address space, and only those portions of the data actually being used are ever paged in, so there are no significant RAM savings. As for disk space, with the large size of today's hard drives, saving a few MB is not worth the bother.
By default, ICU builds with a large set of converters and with all available locales. This means that any extra items added must be provided by the application developer. There is no extra ICU-supplied data that could be specified.
Details
The lists of converters and resources that ICU builds are in these configuration files:
File | Description |
---|---|
icu/source/data/locales/resfiles.mk | The standard set of locale data resource bundles |
icu/source/data/locales/reslocal.mk | User-provided file with additional resource bundles |
icu/source/data/translit/trnsfiles.mk | The standard set of transliterator resource files |
icu/source/data/translit/trnslocal.mk | User-provided file with a set of additional transliterator resource files |
icu/source/data/mappings/ucmcore.mk | Core set of conversion tables for MIME/Unix/Windows |
icu/source/data/mappings/ucmfiles.mk | Additional, large set of conversion tables for a wide range of uses |
icu/source/data/mappings/ucmebcdic.mk | Large set of EBCDIC conversion tables |
icu/source/data/mappings/ucmlocal.mk | User-provided file with additional conversion tables |
icu/source/data/misc/miscfiles.mk | Miscellaneous data, like timezone information |
These files function identically for both Windows and UNIX builds of ICU. ICU will automatically update the list of installed locales returned by uloc_getAvailable() whenever resfiles.mk or reslocal.mk are updated and the ICU data library is rebuilt. These files are only needed while building ICU. If any of these files are removed or renamed, the size of the ICU data library will be reduced.
The optional files reslocal.mk and ucmlocal.mk are not included as part of a standard ICU distribution. Thus these customization files do not need to be merged or updated when updating versions of ICU.
Both reslocal.mk and ucmlocal.mk are makefile includes, so the usual rules for makefiles apply. A line can be continued by ending it with a backslash. Lines beginning with a # are comments. See ucmfiles.mk and resfiles.mk for additional information.
Reducing the Size of ICU's Data: Conversion Tables
The size of the ICU data file in the standard build configuration is about 8 MB. The majority of this is used for conversion tables. ICU comes with so many conversion tables because many ICU users need to support many encodings from many platforms. There are conversion tables for EBCDIC and DOS codepages, for ISO 2022 variants, and for small variations of popular encodings.
Important: ICU provides full internationalization functionality without any conversion table data. The common library contains code to handle several important encodings algorithmically: US-ASCII, ISO-8859-1, UTF-7/8/16/32, SCSU, BOCU-1, CESU-8, and IMAP-mailbox-name (i.e., US-ASCII, ISO-8859-1, and all Unicode charsets; see source/data/mappings/convrtrs.txt for the current list).
Therefore, the easiest way to reduce the size of ICU's data by a lot (without limitation of I18N support) is to reduce the number of conversion tables that are built into the data file.
The conversion tables are listed for the build process in several makefiles icu/source/data/mappings/ucm*.mk, roughly grouped by how commonly they are used. If you remove or rename any of these files, then the ICU build will exclude the conversion tables that are listed in that file. Beginning with ICU 2.0, all of these makefiles including the main one are optional. If you remove all of them, then ICU will include only very few conversion tables for "fallback" encodings (see note below).
If you remove or rename all ucm*.mk files, then ICU's data is reduced to about 3.6 MB. If you remove all these files except for ucmcore.mk, then ICU's data is reduced to about 4.7 MB, while keeping support for a core set of common MIME/Unix/Windows encodings.
Note: If you remove the conversion table for an encoding that could be a default encoding on one of your platforms, then ICU will not be able to instantiate a default converter. In this case, ICU 2.0 and up will automatically fall back to a "lowest common denominator" and load a converter for US-ASCII (or, on EBCDIC platforms, for codepages 37 or 1047). This will be good enough for converting strings that contain only "ASCII" characters (see the comment about "invariant characters" in utypes.h).
When ICU is built with a reduced set of conversion tables, then some tests will fail that test the behavior of the converters based on known features of some encodings. Also, building the testdata will fail if you remove some conversion tables that are necessary for that (to test non-ASCII/Unicode resource bundle source files, for example). You can ignore these failures. Build with the standard set of conversion tables, if you want to run the tests.
Reducing the Size of ICU's Data: Locale Data
If you need to reduce the size of ICU's data even further, then you need to remove other files or parts of files from the build as well.
The largest part of the data besides conversion tables is in collation for East Asian languages. You can remove the collation data for those languages by removing the CollationElements entries from those icu/source/data/locales/*.txt files. When you do that, the collation for those languages will become the same as the Unicode Collation Algorithm.
You can remove data for entire locales by removing their files from icu/source/data/locales/resfiles.mk. ICU will then use the data of the parent locale instead, which is root.txt. If you remove all resource bundles for a given language and its country/region/variant sublocales, do not remove root.txt! Also, do not remove a parent locale if child locales exist. For example, do not remove "en" while retaining "en_US".
Adding Converters to ICU
The first step is to obtain or create a .ucm (source) mapping data file for the desired converter. A large archive of converter data is maintained by the ICU team at http://oss.software.ibm.com/cvs/icu/charset/data/ucm/
We will use solaris-eucJP-2.7.ucm, available from the repository mentioned above, as an example.
Build the Converter
Converter source files are compiled into binary converter files (.cnv files) by using the ICU tool makeconv. For the example, you can use this command:
makeconv -v solaris-eucJP-2.7.ucm
Some of the .ucm files from the repository will need additional header information before they can be built. Use the error messages from the makeconv tool, .ucm files for similar converters, and the ICU user guide documentation of .ucm files as a guide when making changes. For the solaris-eucJP-2.7.ucm example, we will borrow the missing header fields from icu/source/data/mappings/ibm-33722_P12A-2000.ucm, which is the standard ICU eucJP converter data.
The .ucm file format is described in the Conversion Data chapter.
After adjustment, the header of the solaris-eucJP-2.7.ucm file contains these items:
<code_set_name> "solaris-eucJP-2.7"
<subchar> \x3F
<uconv_class> "MBCS"
<mb_cur_max> 3
<mb_cur_min> 1
<icu:state> 0-8d, 8e:2, 8f:3, 90-9f, a1-fe:1
<icu:state> a1-fe
<icu:state> a1-e4
<icu:state> a1-fe:1, a1:4, a3-af:4, b6:4, d6:4, da-db:4, ed-f2:4
<icu:state> a1-fe
The binary converter file produced by the makeconv tool is solaris-eucJP-2.7.cnv
Installation
Copy the new .cnv file to the desired location for use. Set the environment variable ICU_DATA to the directory containing the data, or, alternatively, from within an application, tell ICU the location of the new data with the function u_setDataDirectory() before using the new converter.
If ICU is already obtaining data from files rather than a shared library, install the new file in the same location as the existing ICU data file(s), and don't change/set the environment variable or data directory.
If you do not want to add a converter to ICU's base data, you can also generate a conversion table with makeconv, use pkgdata to generate your own package, and use ucnv_openPackage() to open a converter with that conversion table from the generated package.
Building the new converter into ICU
The need to install a separate file and inform ICU of the data directory can be avoided by building the new converter into ICU's standard data library. Here is the procedure for doing so:
Move the .ucm file(s) for the converter(s) to be added ( solaris-eucJP-2.7.ucm for our example) into the directory icu/source/data/mappings/
Create, or edit, if it already exists, the file icu/source/data/mappings/ucmlocal.mk. Add this line:
UCM_SOURCE_LOCAL = solaris-eucJP-2.7.ucm
Any number of converters can be listed. Extend the list to new lines with a backslash at the end of the line. The ucmlocal.mk file is described in more detail in icu/source/data/mappings/ucmfiles.mk. (Even though they use very different build systems, ucmlocal.mk is used for both the Windows and UNIX builds.)
Add the converter name and aliases to icu/source/data/mappings/convrtrs.txt. This will allow your converter to be shown in the list of available converters when you call the ucnv_getAvailableName() function. The file syntax is described within the file.
Rebuild the ICU data.
For Windows, from MSVC choose the makedata project from the GUI, then build the project.
For UNIX, "cd icu/source/data; gmake"
When opening an ICU converter (ucnv_open()), the converter name cannot be qualified with a path that indicates the directory or common data file containing the corresponding converter data. The required data must be present either in the main ICU data library or as a separate .cnv file located in the ICU data directory. This is different from opening resources or other types of ICU data, which do allow a path.
Adding Locale Data to ICU's Data
If you have data for a locale that is not included in ICU's standard build, then you can add it to the build in a very similar way as with conversion tables above. The ICU project provides a large number of additional locales in its locale repository on the web.
You need to write a resource bundle file for it with a structure like the existing locale resource bundles (e.g. icu/source/data/locales/ja.txt, ru_RU.txt, kok_IN.txt) and add it by writing a file icu/source/data/locales/reslocal.mk just like above. In this file, define the list of additional resource bundles as
GENRB_SOURCE_LOCAL=myLocale.txt other.txt ...
Starting in ICU 2.2, these added locales are automatically listed by uloc_getAvailable().
ICU Data File Formats
ICU uses several kinds of data files with specific source (plain text) and binary data formats. The following table provides links to descriptions of those formats.
Each ICU data object begins with a header before the actual, specific data. The header consists of a 16-bit header length value, the two "magic" bytes DA 27 and a UDataInfo structure which specifies the data object's endianness, charset family, format, data version, etc.
Files | Source format | Binary format | Generator tool |
---|---|---|---|
ICU .dat package files | (list of files on the gencmn tool command line) | .dat: icu/source/tools/gencmn/gencmn.c | gencmn |
Resource bundles | .txt: icuhtml/design/bnf_rb.txt; .xml: icuhtml/design/resourceBundle.dtd | .res: icu/source/common/uresdata.h | genrb |
Unicode conversion mapping tables | .ucm: Conversion Data chapter | .cnv: icu/source/common/ucnvmbcs.h | makeconv |
Conversion (charset) aliases | icu/source/data/mappings/convrtrs.txt: contains format description. The command "uconv -l --canon" will also generate the alias table from the currently used copy of ICU. | cnvalias.icu: icu/source/common/ucnv_io.c | gencnval |
Unicode Character Data (Properties) | icu/source/data/unidata/*.txt: Unicode Character Database | uprops.icu: icu/source/tools/genprops/store.c | genprops |
Unicode Character Data (Case mappings) | icu/source/data/unidata/*.txt: Unicode Character Database | ucase.icu: icu/source/tools/gencase/store.c | gencase |
Unicode Character Data (Normalization) | icu/source/data/unidata/*.txt: Unicode Character Database | unorm.icu: icu/source/common/unormimp.h | gennorm |
Unicode Character Data (Character names) | icu/source/data/unidata/UnicodeData.txt: Unicode Character Database | unames.icu: icu/source/tools/gennames/gennames.c | gennames |
Unicode Character Data (Property [value] aliases) | icu/source/data/unidata/Property*Aliases.txt: Unicode Character Database | pnames.icu: icu/source/common/propname.h | genpname |
Collation data (UCA, code points to weights) | Original data from allkeys.txt in UTS #10 Unicode Collation Algorithm, processed into icu/source/data/unidata/FractionalUCA.txt by icu4j/unicodetools/com/ibm/text/UCA/ (call the Main class with option writeFractionalUCA) | ucadata.icu: (icu/source/i18n/ucol_imp.h) | genuca |
Collation data (Inverse UCA, weights->code points) | Processed from FractionalUCA.txt like ucadata.icu | invuca.icu: (icu/source/i18n/ucol_imp.h) | genuca |
Collation data (Tailorings, code points->weights) | Source tailorings (text rules) in resource bundles: Collation Services Customization chapter | Binary tailorings in resource bundles: same format as ucadata.icu (icu/source/i18n/ucol_imp.h) | genrb |
Rule-based break iterator data | .txt: Boundary Analysis chapter | .brk: TBD (icu/source/common/rbbirb.h) | genbrk |
Rule-based transform (transliterator) data | .txt (in resource bundles): Transform Rule Tutorial chapter | Uses genrb to make binary format | Does not apply |
Time zone data | icu/source/data/misc/zoneinfo.txt: ftp://elsie.nci.nih.gov/pub/tzdata<year> | zoneinfo.res (generated by genrb and source/tools/tzcode/tz.pl) | Does not apply |
StringPrep profile data | icu/source/data/misc/NamePrepProfile.txt | .spp: icu/source/tools/gensprep/store.c | gensprep |
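The common header described before the table can be inspected with a few byte reads. The sketch below checks the magic bytes DA 27 and decodes the header length. The byte offsets used for the endianness and charset-family fields are an assumption based on the layout of the UDataInfo struct in udata.h (a 16-bit size, a 16-bit reserved word, then one byte each for isBigEndian and charsetFamily). HeaderSummary and read_header_summary() are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Summary of the fields this sketch reads from the header. */
typedef struct {
    int      is_big_endian;   /* assumed offset 8: UDataInfo.isBigEndian */
    int      charset_family;  /* assumed offset 9: 0 = ASCII, 1 = EBCDIC */
    uint16_t header_size;     /* bytes 0-1, decoded in the file's byte order */
} HeaderSummary;

/*
 * Hypothetical helper (not an ICU API). Returns 0 on success, or -1 if the
 * buffer is too short or the magic bytes DA 27 are not at bytes 2 and 3.
 */
static int read_header_summary(const uint8_t *data, size_t length,
                               HeaderSummary *out) {
    assert(data != NULL && out != NULL);
    if (length < 10 || data[2] != 0xDA || data[3] != 0x27) {
        return -1;
    }
    out->is_big_endian = data[8];
    out->charset_family = data[9];
    /* The header length is stored in the byte order of the data itself. */
    out->header_size = out->is_big_endian
        ? (uint16_t)((data[0] << 8) | data[1])
        : (uint16_t)((data[1] << 8) | data[0]);
    return 0;
}
```

Because the magic bytes sit at fixed positions regardless of byte order, they can be checked before anything is known about the platform that produced the file; the endianness flag is then needed to interpret every multi-byte field, including the header length itself.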
Copyright (c) 2000 - 2004 IBM and Others - Feedback: icu-issues@oss.software.ibm.com
User Guide for ICU v3.2 Generated 2004-11-22.