Bidi CCSID support in OS/400
In Arabic and Hebrew, the general flow of text proceeds horizontally from right to left, but numbers are written from left to right, just as in English. Additionally, text in English (or another left-to-right language) embedded within Hebrew or Arabic is also written from left to right. For this reason, these languages are called Bidirectional.
In this document, the following abbreviations are used:
Bidi: Bidirectional
LTR: Left-to-Right
RTL: Right-to-Left
In the examples presented throughout this document, Latin upper case letters represent Arabic or Hebrew text, while Latin lower case letters represent English text.
Example 1: Arabic text
THIS SIMULATES ARABIC TEXT
Example 2: English text
this represents english text
When a program sends a string of characters to a presentation device (like a CRT or a printer), those characters are generally displayed adjacent to one another within one or more lines, proceeding from left to right and top to bottom. This arrangement is natural for LTR languages, but raises some concerns for Bidi languages.
Let us consider the Hebrew or Arabic sentence "THE NUMBER IS 1234." The natural way to enter it using a keyboard is to type "T", then "H", then "E", and so on until "4" and ".". However, this sentence must be displayed as:
.1234 SI REBMUN EHT
since Arabic or Hebrew text is read from right to left.
The order in which Arabic or Hebrew text is entered is called "Logical" (also known as "Implicit"). The order on the presentation device is called "Visual". Here are a few examples of text in Logical and in Visual order.
Logical Order                 | Visual Order
MY NAME IS "MOSES".           | ."SESOM" SI EMAN YM
I ATE 12 COOKIES.             | .SEIKOOC 12 ETA I
I LOVE new-york.              | .new-york EVOL I
you said: "I ATE 12 COOKIES". | you said: "SEIKOOC 12 ETA I".
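The Logical-to-Visual examples above can be reproduced with a deliberately simplified sketch. It relies on this document's convention (upper case stands for Arabic or Hebrew, lower case for English) and is not the full Bidi layout algorithm; the function name `to_visual` and the run-detection patterns are illustrative assumptions, and mixed cases such as a number adjacent to English text inside an RTL line are not handled.

```python
import re

# Simplified model using the document's convention:
# upper case = RTL letters, lower case = LTR letters, digits = numbers.
NUMBER = r"[0-9]+(?:[.,][0-9]+)*"
# An LTR run: lower-case text (possibly with embedded neutrals), or a number.
LTR_RUN = re.compile(r"[a-z](?:[^A-Z]*[a-z])?|" + NUMBER)
# An RTL run: upper-case text, possibly with embedded neutrals and numbers.
RTL_RUN = re.compile(r"[A-Z](?:[^a-z]*[A-Z])?")


def _flip(match):
    """Reverse the matched run."""
    return match.group(0)[::-1]


def to_visual(logical, rtl_base):
    """Return a visual-order rendering of a logical-order string."""
    if rtl_base:
        # Mirror the whole line, then restore LTR runs and numbers.
        return LTR_RUN.sub(_flip, logical[::-1])
    # LTR base: mirror each RTL run in place, keeping its numbers readable.
    return RTL_RUN.sub(
        lambda m: re.sub(NUMBER, _flip, m.group(0)[::-1]), logical
    )
```

The `rtl_base` flag plays the role of the overall direction of the line: the first three rows of the table are RTL sentences, while the last row is an English sentence with an embedded RTL quotation.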
For Arabic, text as it is stored may differ from text as it must be presented, because each Arabic letter may be drawn in more than one shape. The shape to use depends on the connection capability of the letter itself, on its position in the word, and on the neighboring characters on both sides. When text is entered, it is easier for the user to pick a canonical form of the letter that stands for any of its shapes; this form is called the "base" shape. The base (or canonical) shape represents the character with no special shaping; it is used for implicit data storage. For presentation, base shapes must be replaced by the proper shapes depending on the context. This operation is called "shaping". Data may be stored in base shapes, with shaping taking place only at presentation time, or it may be shaped at its inception and stored with ready-to-present shapes.
The Arabic script is cursive: adjacent characters in a word are connected to each other, as in handwriting. Some letters can only connect to the letter on their right. This is the only way in which Arabic script is used, whether in books, newspapers, signs, or workstation displays.
An Arabic character can have up to four different shapes, depending on its position in the word:
- Isolated: The character is not linked to either the preceding or the following character.
- Final: The character is linked to the preceding character but not to the following one.
- Initial: The character is linked to the following character but not to the preceding one.
- Middle: The character is linked to both the preceding and following characters.
The Arabic character is shaped according to its position in the word. Some Arabic characters do not have all four shapes.
Example: the Arabic character ALEF WITH HAMZA ABOVE has 2 shapes only:
- Isolated
- Final
Example: the Arabic character DAL has 2 shapes only:
- Isolated
- Final
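The shape selection described above can be sketched as follows. This is a minimal illustration, assuming the input is a single word made up only of joining Arabic letters in logical order; the set `RIGHT_JOINING_ONLY` is a partial list chosen for the example, and the function name `shape_positions` is hypothetical.

```python
# Letters that connect only to the preceding letter (the letter on their right),
# so they have just the Isolated and Final shapes. Partial, illustrative list.
RIGHT_JOINING_ONLY = {
    "\u0623",  # ALEF WITH HAMZA ABOVE
    "\u0627",  # ALEF
    "\u062F",  # DAL
    "\u0630",  # THAL
    "\u0631",  # REH
    "\u0632",  # ZAIN
    "\u0648",  # WAW
}


def shape_positions(word):
    """Return the shape name to use for each letter of a logical-order word."""
    shapes = []
    for i, letter in enumerate(word):
        # The previous letter links forward only if it can connect on both sides.
        linked_to_prev = i > 0 and word[i - 1] not in RIGHT_JOINING_ONLY
        # This letter links forward only if it can connect on both sides.
        linked_to_next = i < len(word) - 1 and letter not in RIGHT_JOINING_ONLY
        if linked_to_prev and linked_to_next:
            shapes.append("Middle")
        elif linked_to_prev:
            shapes.append("Final")
        elif linked_to_next:
            shapes.append("Initial")
        else:
            shapes.append("Isolated")
    return shapes
```

For a word made of KAF, TEH, BEH the sketch yields Initial, Middle, Final; for DAL, ALEF, REH every letter stays Isolated, because none of them connects to the following letter.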
A character is the smallest semantic unit in a writing system. A glyph is the visual presentation of one or more characters, and is often dependent on adjacent characters. There is not always a one-to-one correspondence between the characters of text stored for processing and the glyphs of the presented text: sometimes two or more characters are represented by a single glyph occupying one presentation cell.
The <Lam-Alef> ligature is coded as a single character in CCSID 420, while it is stored as 2 characters <Lam> and <Alef> in implicit code pages like 1256 and 1089. Special processing handles the <Lam-Alef> conversion:
· From Visual to Implicit storage: expand <Lam-Alef> to <Lam> and <Alef>.
<Lam-Alef> is expanded by consuming the spaces at the buffer extremity towards the end of the Arabic text. If the number of spaces at the buffer extremity is greater than or equal to the number of <Lam-Alef> characters in the buffer, they are all expanded to <Lam> and <Alef>. If the number of spaces is less than the number of <Lam-Alef> characters, every <Lam-Alef> occurrence exceeding the number of spaces is converted to a Substitution control character.
· From Implicit to Visual: compress <Lam> and <Alef> to <Lam-Alef>.
<Lam> and <Alef> are compressed to <Lam-Alef>, and one blank is added towards the buffer extremity.
The Bidi layout engine handles the expansion and compression automatically, so that the user does not need to intervene in the process.
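The expansion and compression rules can be illustrated with a small sketch that keeps the buffer length constant. It covers only the ligature step, not reordering or shaping. Several points are assumptions made for the example: the buffer extremity is taken to be the end of the Python string, Unicode code points stand in for the code page values, and U+001A is used as the Substitution control character.

```python
LAM = "\u0644"
ALEF = "\u0627"
LAM_ALEF = "\ufefb"  # Unicode presentation form standing in for <Lam-Alef>
SUB = "\x1a"         # assumed Substitution control character


def expand_lam_alef(buf):
    """Expand each <Lam-Alef> to <Lam><Alef>, consuming one trailing space each."""
    text = buf.rstrip(" ")
    spare = len(buf) - len(text)  # spaces available at the buffer extremity
    out = []
    for ch in text:
        if ch != LAM_ALEF:
            out.append(ch)
        elif spare > 0:
            out.append(LAM + ALEF)
            spare -= 1
        else:
            # No space left to absorb the extra character.
            out.append(SUB)
    return "".join(out) + " " * spare


def compress_lam_alef(buf):
    """Compress each <Lam><Alef> pair to <Lam-Alef>, padding with one blank each."""
    pairs = buf.count(LAM + ALEF)
    return buf.replace(LAM + ALEF, LAM_ALEF) + " " * pairs
```

In both directions the result has the same length as the input, which is why the number of available spaces limits how many ligatures can be expanded.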
When Arabic and Hebrew support was originally added to the iSeries, the system was used in a stand-alone environment. The designers of this support decided to store the Bidi data in visual order, with shaped letters (for Arabic). This had the advantage that no special processing was needed to format the data for presentation, since it was already in presentation form. Since the data only existed on the iSeries, it did not matter what form was used.
When Arabic and Hebrew support was added to workstations (PC systems, Unix systems), the designers of this support decided to store the Bidi data in logical order, with base shapes (for Arabic). This format is more convenient for processing like sorting and searching. It also has the advantage that Bidi data can be processed like non-Bidi data in many cases. However, the system needs to format the data for presentation.
Over time, customers began to interchange data back and forth between different systems, including between iSeries and workstations. They then discovered that, even though the same characters were used, the data was not the same. The data needed to be transformed from the format used on the source system to the format used on the target system.
For several releases, customers have asked for this process to be made more transparent to applications. The first step was to create a classification of the different formats of Bidi data. This is done by means of Coded Character Set Identifiers (CCSIDs) (more on that below).
Beginning in V4R4, PTF(s) were made available to allow automatic transformation of Bidi data flowing through OS/400, according to the format appropriate in each phase, as specified by a CCSID. In V5R1, this processing is part of the base system.
The specific characteristics of Bidi data can be described using five attributes:
· Orientation
· Text Type
· Symmetrical Swapping
· Text Shaping
· Numeric Shaping
The orientation of a piece of text specifies whether this text flows mainly from left to right (for instance, an English sentence with possible Arabic or Hebrew embedded terms) or from right to left (for instance, an Arabic or Hebrew sentence with possible embedded numbers or English terms).
The different possible values for Orientation are:
LTR: Appropriate for logically ordered text whose main language is LTR (like the European languages), or visually ordered text ready for presentation on a device where character progression is LTR.
RTL: Appropriate for logically ordered text whose main language is RTL (like Arabic or Hebrew), or visually ordered text ready for presentation on a device where character progression is RTL.
Contextual: The orientation is induced from the first "strong" character (character with a "strong" orientation) in the text. Letters from the Hebrew or Arabic alphabets are considered to have a strong RTL orientation. Letters from European (and most other) alphabets are considered to have a strong LTR orientation. Characters such as spaces, punctuation, and even digits, which can be used within either an LTR or an RTL context, have no strong orientation. If a string is specified to have contextual orientation but contains no strong character, the orientation defaults to LTR (if specified as Contextual Left) or to RTL (if specified as Contextual Right).
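The "first strong character" rule can be sketched with the Unicode bidirectional classes exposed by Python's standard `unicodedata` module. This is an illustration only: the function name `contextual_orientation` is hypothetical, and the `default` argument models the Contextual Left / Contextual Right fallback.

```python
import unicodedata


def contextual_orientation(text, default="LTR"):
    """Return "LTR" or "RTL" based on the first strong character of text.

    default is used when text has no strong character:
    "LTR" models Contextual Left, "RTL" models Contextual Right.
    """
    for ch in text:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class == "L":            # strong left-to-right letter
            return "LTR"
        if bidi_class in ("R", "AL"):    # strong right-to-left (Hebrew, Arabic)
            return "RTL"
        # Spaces, punctuation, and digits are not strong: keep looking.
    return default
```

Leading digits and punctuation are skipped, so a string such as "12 (abc)" is LTR, while the same prefix followed by a Hebrew or Arabic letter is RTL.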
The Text Type can be Logical (alias Implicit) or Visual according to the sequence in which the characters of the text are ordered.
Example:
Visual order shaped text
Logical order unshaped text
Within RTL text, some characters may be displayed as if they were a symmetrical character. For instance, a left parenthesis will be displayed as a right parenthesis and vice-versa. This allows keeping a semantic of "open parenthesis" to the code allocated for left parenthesis, and "close parenthesis" to the code of right parenthesis, within either a LTR or RTL context. The same swapping applies to other pairs of characters with symmetrical shapes and roles, like square brackets, curly brackets, and even less-than and greater-than signs.
The possible values for this attribute are "ON" and "OFF".
If symmetrical swapping is not taken into consideration, the open and close brackets (for example) keep their original shape when the text switches between Arabic and English. This results in text misinterpretation, as in the following example.
On a right-to-left window of the screen, the expression:
b < a
is read as "a is greater than b". In storage, the first character will be a, followed by < and then b. So we end up having in storage:
a < b
which is, of course, incorrect. In this case, in order to preserve the correct meaning of the expression, the < character must be exchanged in storage with >.
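The exchange of symmetrical characters can be sketched as a simple character translation. The pairs listed are the ones named in the text (parentheses, square brackets, curly brackets, less-than and greater-than); the function name `swap_symmetric` is hypothetical, and the sketch deliberately ignores reordering.

```python
# Each character of a symmetrical pair maps to its counterpart.
SYMMETRIC_PAIRS = str.maketrans("()[]{}<>", ")(][}{><")


def swap_symmetric(text, swapping_on=True):
    """Exchange symmetrical characters when the attribute is ON."""
    return text.translate(SYMMETRIC_PAIRS) if swapping_on else text
```

Applied to the stored string from the example, `swap_symmetric("a < b")` gives `"a > b"`, which keeps the meaning "a is greater than b".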
As explained above (see "Arabic Shaping"), Arabic text may be encoded as base shapes, or with the contextual shapes (Initial, Middle, Final, Isolated) used for presentation. Accordingly, the values for this attribute may be "Unshaped" (base shapes) or "Shaped".
Some Arabic countries use different shapes for digits, the "Hindi" digits. These digits may have specific encodings, different from the regular ("Arabic") digits. However, it is often convenient, or even mandatory, to keep the digits encoded in their regular codes, for instance to facilitate their handling by software and hardware without special treatment for Bidi environments. In this case, the conversion of the digits to Hindi shapes must be made at presentation. The Numeric Shaping attribute specifies how the different kinds of digits are encoded. Its possible values are:
Passthrough: Arabic digits and Hindi digits are encoded separately; no conversion is needed for presentation. During conversion to a string type with numeric shaping set to passthrough, the Hindi or Arabic numerics are preserved in their original state without any conversion.
Arabic: All digits are encoded as Arabic digits; the presentation mechanism must select Hindi digit shapes for numbers appearing in an Arabic context. If the code page has a different encoding for Arabic and Hindi digits, during conversion to a string type with numeric shaping set to Arabic the Hindi digits are converted to Arabic digits.
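A minimal sketch of the two behaviours follows. It assumes the Unicode ARABIC-INDIC DIGIT code points (U+0660 to U+0669) as stand-ins for the "Hindi" digit encodings; on a real system the values depend on the code page. The function name `apply_numeric_shaping` is hypothetical.

```python
# "Hindi" digit shapes, represented here by Unicode Arabic-Indic digits 0-9.
HINDI_DIGITS = "\u0660\u0661\u0662\u0663\u0664\u0665\u0666\u0667\u0668\u0669"
HINDI_TO_ARABIC = str.maketrans(HINDI_DIGITS, "0123456789")


def apply_numeric_shaping(text, numeric_shaping):
    """Convert digits for a target string type's Numeric Shaping value."""
    if numeric_shaping == "passthrough":
        # Both kinds of digits are preserved as they are.
        return text
    if numeric_shaping == "Arabic":
        # All digits end up encoded as regular ("Arabic") digits.
        return text.translate(HINDI_TO_ARABIC)
    raise ValueError("unknown numeric shaping: " + repr(numeric_shaping))
```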
Certain combinations of Bidi attributes are more frequently used in applications. In order to simplify their specification, each such combination has been equated to a "String Type". The string types used for Bidi data are defined in the following table.
String Type | Text Type | Numeric Shaping | Orientation      | Text Shaping | Symmetrical Swapping
4           | Visual    | passthrough     | LTR              | Shaped       | Off
5           | Implicit  | Arabic          | LTR              | Unshaped     | On
6           | Implicit  | Arabic          | RTL              | Unshaped     | On
7(*)        | Visual    | passthrough     | Contextual*      | Unshaped-Lig | Off
8           | Visual    | passthrough     | RTL              | Shaped       | Off
9           | Visual    | passthrough     | RTL              | Shaped       | On
10          | Implicit  |                 | Contextual Left  |              | On
11          | Implicit  |                 | Contextual Right |              | On
12          | Implicit  | Arabic          | RTL              | Shaped       | Off
Note: (*) String Orientation is LTR when the first alphabetic character is a Latin one, and RTL when it is an Arabic or Hebrew (RTL) character; characters are unshaped, but LamAlef ligatures are kept, and not broken into constituents.
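The string type table lends itself to a small lookup structure. The sketch below copies the table values as given; cells that are blank in the table are represented by `None`. The class and function names are hypothetical, and `differing_attributes` simply lists which attributes change between two string types, which hints at the transformations a conversion between them would involve.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass(frozen=True)
class StringType:
    text_type: str
    numeric_shaping: Optional[str]   # None where the table cell is blank
    orientation: str
    text_shaping: Optional[str]      # None where the table cell is blank
    symmetric_swapping: str


STRING_TYPES = {
    4: StringType("Visual", "passthrough", "LTR", "Shaped", "Off"),
    5: StringType("Implicit", "Arabic", "LTR", "Unshaped", "On"),
    6: StringType("Implicit", "Arabic", "RTL", "Unshaped", "On"),
    7: StringType("Visual", "passthrough", "Contextual", "Unshaped-Lig", "Off"),
    8: StringType("Visual", "passthrough", "RTL", "Shaped", "Off"),
    9: StringType("Visual", "passthrough", "RTL", "Shaped", "On"),
    10: StringType("Implicit", None, "Contextual Left", None, "On"),
    11: StringType("Implicit", None, "Contextual Right", None, "On"),
    12: StringType("Implicit", "Arabic", "RTL", "Shaped", "Off"),
}


def differing_attributes(source, target):
    """List the attribute names whose values differ between two string types."""
    a, b = STRING_TYPES[source], STRING_TYPES[target]
    return [
        f.name for f in fields(StringType)
        if getattr(a, f.name) != getattr(b, f.name)
    ]
```

For example, string types 4 and 5 share only their orientation, whereas 8 and 9 differ only in symmetrical swapping.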
"CCSID" is an acronym for "Coded Character Set IDentifier". This is a number in the range 0-65535 (0-FFFF in hexadecimal notation). A CCSID represents a combination of the following elements:
· Character Set: a repertoire of characters
· Encoding Scheme: a set of rules by which characters are given a specific encoded value. Examples of encoding schemes are EBCDIC or ASCII.
· Code Page: a set of numeric values for all characters in the Character Set
· String Type: a number representing a set of attributes applying to data encoded with this CCSID.
The following CCSID numbers have special meanings.
65535: This CCSID is used to show that the associated data should not be processed as text. In other words, data associated with this CCSID must not be converted to another CCSID.
0: This CCSID is used to show that a CCSID value for data at this level of processing is not relevant, and CCSID values should be obtained from data elements at a lower level in the defined hierarchy. For example, if a message file is tagged with this CCSID, processing will be based on CCSIDs assigned to each individual message ID. If a file is tagged with this CCSID, processing will be based on CCSIDs assigned to each individual field.
In environments with multiple languages and multiple encodings for data used by the various components of the system, CCSIDs are the means used by OS/400 to maintain data integrity. In practice, this means that CCSIDs can be specified at various levels: for data, programs, jobs, and so on. Whenever data flows from a source to a destination, the system functions involved compare the CCSID of the data supplied at the source with the CCSID of the data expected by the destination, and perform appropriate conversions if needed. For Bidi data, this may include both code page conversion and Bidi attribute transformation.
For instance, when a program reads data from a database and the CCSIDs of the job and of the database are different, the data is converted to the CCSID of the job. For example, if the database file is tagged with CCSID 8612 (implicit data) and the interactive job is associated with CCSID 420 (visual display), the data is converted from implicit to visual.
Note: if one of the CCSIDs is 65535, no conversion is done on the data.
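The comparison of source and destination CCSIDs can be sketched as follows. This is an illustration of the decision only, not of how OS/400 implements it: the function name `plan_conversion` is hypothetical, the small `CCSID_INFO` table holds the code page and string type of three Arabic CCSIDs as listed in this document, and the order of the returned steps carries no meaning.

```python
NO_CONVERSION = 65535

# CCSID -> (code page, string type), for a few Arabic CCSIDs only.
CCSID_INFO = {
    420: (420, 4),    # EBCDIC, visual
    8612: (420, 5),   # EBCDIC, implicit
    1256: (1256, 5),  # MS Windows, implicit
}


def plan_conversion(source_ccsid, target_ccsid):
    """Return the kinds of processing needed between two CCSIDs."""
    if NO_CONVERSION in (source_ccsid, target_ccsid):
        return []  # 65535 on either side: data is left untouched
    if source_ccsid == target_ccsid:
        return []
    src_page, src_type = CCSID_INFO[source_ccsid]
    dst_page, dst_type = CCSID_INFO[target_ccsid]
    steps = []
    if src_type != dst_type:
        steps.append("bidi attribute transformation")
    if src_page != dst_page:
        steps.append("code page conversion")
    return steps
```

With these values, reading a file tagged 8612 in a job running with CCSID 420 calls only for a Bidi attribute transformation, since both CCSIDs share code page 420.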
The next sections will relate to CCSIDs at different levels in the system.
The system value QCCSID is the default CCSID for all jobs running on the system. QCCSID can be set or changed with the CHGSYSVAL and WRKSYSVAL commands. The system is shipped with a default CCSID of 65535, which prevents data conversion. However, the Arabic NLV (National Language Version) has a default CCSID of 00420, and the Hebrew NLV has a default CCSID of 00424.
A CCSID specified in a user profile is assigned by default to all jobs run under that user profile. The CCSID can be set or changed with the CRTUSRPRF and CHGUSRPRF commands.
The job CCSID for an interactive job is set initially to the CCSID of the user profile. For batch jobs, the CCSID of the current job is used as the default CCSID for the submitted job, unless a CCSID is specifically entered on the BCHJOB command. A job CCSID can be changed with the CHGJOB command.
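The defaulting rules for job CCSIDs can be summarized in a short sketch. It models only the precedence described above; the function names are hypothetical, and representing "no CCSID of its own" as `None` (so that a user profile falls back to QCCSID) is an assumption made for the example.

```python
def interactive_job_ccsid(qccsid, user_profile_ccsid=None):
    """Initial CCSID of an interactive job.

    The user profile CCSID is used; None models a profile that simply
    takes the system value QCCSID (assumption).
    """
    return qccsid if user_profile_ccsid is None else user_profile_ccsid


def batch_job_ccsid(submitting_job_ccsid, bchjob_ccsid=None):
    """CCSID of a submitted batch job.

    The CCSID of the current job is the default unless one is
    specifically entered on the BCHJOB command.
    """
    return submitting_job_ccsid if bchjob_ccsid is None else bchjob_ccsid
```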
The CCSID of a file can be specified when creating it. By default, the file receives the CCSID of the job creating it. The CCSID of a physical file can be changed with the CHGPF command, if it is not explicitly defined in the Data Description Specification (DDS) source description of the file.
Figure 1: Change Physical File (CHGPF) command prompt
The CCSID of stream files can be changed by using the EDTF command and pressing <F15> to change the associated CCSID.
Figure 2: EDTF Options Screen
Note: Changing the CCSID associated with the file in the above manner does not convert the data to the target CCSID. The CCSID is a tag with no effect on the data stored. Refer to Data conversion across different CCSID.
The CCSID can be specified at field level in databases.
The field CCSID is specified when the file is created using one of the following:
· DDS source, in the CCSID field level keyword. See CCSID (Coded Character Set Identifier) keyword for physical and logical files
· SQL statement CREATE TABLE, in the CCSID parameter.
OS/400 supports the following CCSIDs for Arabic.
CCSID | String Type | Code Page | Description
420   | 4           | 420       | EBCDIC (Original CCSID for Arabic data)
425   | 5           | 425       | EBCDIC with POSIX chars, like [ ] { } etc.
864   | 5           | 864       | PC Data
1046  | 5           | 1046      | Arabic AIX data all presentation shapes
1089  | 5           | 1089      | ISO 8859-6
1256  | 5           | 1256      | MS Windows
8612  | 5           | 420       | EBCDIC
12708 | 8           | 420       | EBCDIC
62218 | 4           | 864       | PC Data
62224 | 6           | 420       | EBCDIC
62228 | 6           | 1256      | MS Windows
62251 | 6           | 425       | EBCDIC with POSIX chars, like [ ] { } etc.
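Because several CCSIDs share a code page and differ only in string type, it can be handy to look a CCSID up by those two properties. The sketch below copies the Arabic table into a dictionary; the function name `find_ccsids` is hypothetical.

```python
# Arabic CCSID -> (string type, code page), as listed in the table.
ARABIC_CCSIDS = {
    420: (4, 420),
    425: (5, 425),
    864: (5, 864),
    1046: (5, 1046),
    1089: (5, 1089),
    1256: (5, 1256),
    8612: (5, 420),
    12708: (8, 420),
    62218: (4, 864),
    62224: (6, 420),
    62228: (6, 1256),
    62251: (6, 425),
}


def find_ccsids(code_page, string_type=None):
    """Return the Arabic CCSIDs for a code page, optionally for one string type."""
    return sorted(
        ccsid
        for ccsid, (st, cp) in ARABIC_CCSIDS.items()
        if cp == code_page and (string_type is None or st == string_type)
    )
```

For instance, code page 420 is shared by CCSIDs 420, 8612, 12708, and 62224, which differ only in their string types (4, 5, 8, and 6 respectively).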
CCSID 420 can be mapped to:
00037, 00256, 00500, 00720, 00737, 00775, 00819, 00850, 00864, 00937, 01008,
01046, 01089, 01112, 01122, 01208, 01256, 04960, 08612, 09030, 09056, 12708,
13488, 28709, 61952, 62218, 62224, 62228.
CCSID 425 can be mapped to:
00037, 00500, 00819, 00864, 01046, 01089, 01252, 01256, 08612, 13488, 61952,
62224, 62228.
For a table of all supported CCSID conversions, see iSeries Bidirectional CCSID Mapping Information.
OS/400 supports the following CCSIDs for Hebrew.
CCSID | String Type | Code Page | Description
424   | 4           | 424       | EBCDIC (Original CCSID for Hebrew data)
916   | 5           | 916       | ISO 8859-8
1255  | 5           | 1255      | MS Windows
62210 | 4           | 916       | ISO 8859-8
62211 | 5           | 424       | EBCDIC
62215 | 4           | 1255      | MS Windows
62222 | 6           | 916       | ISO 8859-8
62223 | 6           | 1255      | MS Windows
62235 | 6           | 424       | EBCDIC
62238 | 10          | 916       | ISO 8859-8
62239 | 10          | 1255      | MS Windows
62245 | 10          | 424       | EBCDIC
CCSID 424 can be mapped to:
00037, 00256, 00500, 00737, 00775, 00819, 00850, 00862, 00916, 00937, 01112,
01122, 01208, 01255, 04952, 09030, 13488, 28709, 61952, 62210, 62211, 62215,
62222, 62223, 62235, 62238, 62239, 62245.
For a table of all supported CCSID conversions, see iSeries Bidirectional CCSID Mapping Information.
The CCSID support in OS/400 affects many components of the system. Evaluating what conversions will take place based on CCSID values specified for various elements can be a challenge. More information can be found in the following IBM publications:
· AS/400 series - National Language Support, SC41-5101
· AS/400 series - International Application Development, SC41-5603
Many strategies are possible, depending on the level of heterogeneity of the data formats involved. We will present only two archetypal ones.
Strategy 1: With this strategy, all data conversions done automatically by the system are avoided. This can be achieved by specifying CCSIDs as 65535 wherever possible. This strategy avoids unwanted conversions by the system. On the other hand, whenever conversions are needed, the burden lies with the applications.
Strategy 2: With this strategy, all data conversions are left to the system, which will know exactly what to do based on scrupulous specification of CCSIDs for all involved elements. This strategy requires a good understanding of the system capabilities, but it relieves the applications of the burden of data conversions.
In practical situations, users will probably want to use a mix of strategies 1 and 2, starting closer to strategy 1 (for compatibility with existing procedures) and moving closer to strategy 2 as the system is better understood and data interchange between heterogeneous platforms becomes more prevalent.
Traditionally, Bidi data under OS/400 has been stored in Visual LTR format. This corresponds to CCSID 420 for Arabic and CCSID 424 for Hebrew. For compatibility with all this "legacy" data, we suggest making 420 and 424 (for Arabic and Hebrew respectively) the value of QCCSID, and the CCSID value for most user profiles.
Other data formats are used for various purposes. The jobs that create these kinds of data should have a corresponding CCSID.
· EBCDIC Bidi data could be stored in Logical format, which is useful for sorting, searching, and data interchange with systems using implicit code pages. This corresponds to CCSID 8612 or 62224 for Arabic and CCSID 62211 or 62235 for Hebrew.
· Windows platforms, and most Unix platforms, expect data in LTR Logical format. This corresponds to CCSIDs 1255 and 1256 for Hebrew and Arabic respectively.
· Web pages should be in Logical format (for Hebrew, Visual format is also used, but this usage is deprecated). The corresponding CCSIDs are 916 or 1255 for Hebrew, 1089 or 1256 for Arabic.