Bidi CCSID support in OS/400


Introduction

Bidirectional Languages

In Arabic and Hebrew, the general flow of text proceeds horizontally from right to left, but numbers are written from left to right, just as in English. Text in English (or another left-to-right language) embedded within Hebrew or Arabic is also written from left to right. For this reason, these languages are called bidirectional.

Abbreviations and Conventions

In this document, the following abbreviations are used:

Bidi: Bidirectional

LTR: Left-to-Right

RTL: Right-to-Left

In the examples presented throughout this document, Latin upper case letters represent Arabic or Hebrew text, while Latin lower case letters represent English text.

Example 1: Arabic text

THIS SIMULATES ARABIC TEXT

 

Example 2: English text

this represents english text

Visual Order versus Logical Order

When a program sends a string of characters to a presentation device (like a CRT or a printer), those characters are generally displayed adjacent to one another within one or more lines, proceeding from left to right and top to bottom. This arrangement is natural for LTR languages, but raises some concerns for Bidi languages.

Let us consider the Hebrew or Arabic sentence "THE NUMBER IS 1234." The natural way to enter it using a keyboard is to type "T", then "H", then "E", and so on, ending with "4" and ".". However, this sentence must be displayed as:

.1234 SI REBMUN EHT

since Arabic or Hebrew text is read from right to left.

The order in which Arabic or Hebrew text is entered is called "Logical" (also known as "Implicit"). The order on the presentation device is called "Visual". Here are a few examples of text in Logical and in Visual order.

Logical Order                    | Visual Order
MY NAME IS "MOSES".              | ."SESOM" SI EMAN YM
I ATE 12 COOKIES.                | .SEIKOOC 12 ETA I
I LOVE new-york.                 | .new-york EVOL I
you said: "I ATE 12 COOKIES".    | you said: "SEIKOOC 12 ETA I".

So we can consider that the logical order is the sequence in which the characters are pronounced, while the visual order is the sequence in which the characters are juxtaposed on a view port.
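The relationship between the two orders can be illustrated with a small sketch that uses this document's convention (upper case for Arabic or Hebrew, lower case for English). It is a deliberately simplified model, not the algorithm used by OS/400 or the Unicode Bidirectional Algorithm: it only recognizes runs of upper case letters, lower case letters, and digits. The function name logical_to_visual and its rtl parameter are hypothetical.

```python
import re

# Runs of lower case (LTR) words, possibly joined by neutral characters.
LATIN_RUN = re.compile(r"[a-z]+(?:[^A-Z0-9]+[a-z]+)*")
# Numbers, possibly containing a decimal point or comma.
DIGIT_RUN = re.compile(r"[0-9]+(?:[.,][0-9]+)*")
# Runs of upper case (RTL) words, possibly joined by neutrals or digits.
RTL_RUN = re.compile(r"[A-Z]+(?:[^a-z]+[A-Z]+)*")


def _flip(match):
    """Reverse the characters of one matched run."""
    return match.group(0)[::-1]


def logical_to_visual(text, rtl=True):
    """Reorder simulated Bidi text from logical to visual order.

    rtl=True models text whose orientation is RTL; rtl=False models
    LTR text that contains embedded RTL segments.
    """
    if rtl:
        # Reverse everything, then restore the LTR runs and the numbers.
        out = text[::-1]
        out = LATIN_RUN.sub(_flip, out)
        return DIGIT_RUN.sub(_flip, out)
    # LTR orientation: reverse only the RTL runs, keeping numbers readable.
    return RTL_RUN.sub(lambda m: DIGIT_RUN.sub(_flip, m.group(0)[::-1]), text)


print(logical_to_visual("I ATE 12 COOKIES."))
print(logical_to_visual('you said: "I ATE 12 COOKIES".', rtl=False))
```

The sketch ignores shaping, symmetrical swapping, and nested embeddings, which a real layout engine must also handle.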

Arabic Shaping

For Arabic, there is a difference between text as it may be stored and text as it must be presented, because each Arabic letter may be pictured in more than one shape. The shape to use depends on the connection capability of the letter itself, its position in the word, and the characters which are its neighbors on both sides.

When text is entered, it is easier for the user to pick a canonical form of the letter which stands for any shape; this form is called the "base" shape. The base (or canonical) shape represents the character with no special shaping; it is used for implicit data storage. For presentation, base shapes must be replaced by the proper shapes depending on the context. This operation is called "shaping". Data may be stored in base shapes with the shaping process taking place only at presentation time, or data may be shaped at its inception and stored with ready-to-present shapes.

The Arabic script is cursive: adjacent characters in a word are connected to each other, as in handwriting. Some letters can only connect to the letter on their right. This is the only way in which Arabic script is written, whether in books, newspapers, signs, or workstation displays.

An Arabic character can have up to four different shapes, depending on its position in the word:

-        Isolated: The character is not linked to either the preceding or the following character.           

-        Final: The character is linked to the preceding character but not to the following one.           

-        Initial: The character is linked to the following character but not to the preceding one.           

-        Middle: The character is linked to both the preceding and following characters.                  

The Arabic character is shaped according to its position in the word. Some Arabic characters do not have all four shapes.

Example: the Arabic character ALEF WITH HAMZA ABOVE has 2 shapes only:

-        Isolated          

-        Final               

Example: the Arabic character DAL has 2 shapes only:

-        Isolated          

-        Final               
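The selection of a shape from the position and neighbors of a letter can be sketched as follows. This is a minimal illustration, not the shaping logic of OS/400: the two letter sets are a small hand-picked subset of the Arabic alphabet, written as Unicode escapes, and the function name positional_forms is hypothetical.

```python
# Letters that connect only to the preceding letter (isolated and final only):
# ALEF WITH HAMZA ABOVE, ALEF, DAL, THAL, REH, ZAIN, WAW.
RIGHT_JOINING = frozenset("\u0623\u0627\u062F\u0630\u0631\u0632\u0648")
# Letters that connect on both sides (all four shapes):
# BEH, TEH, HAH, SEEN, AIN, KAF, LAM, MEEM, NOON, HEH, YEH.
DUAL_JOINING = frozenset(
    "\u0628\u062A\u062D\u0633\u0639\u0643\u0644\u0645\u0646\u0647\u064A"
)
LETTERS = RIGHT_JOINING | DUAL_JOINING


def positional_forms(word):
    """Return the shape name for each character of a logically ordered word."""
    forms = []
    for i, ch in enumerate(word):
        prev = word[i - 1] if i > 0 else ""
        nxt = word[i + 1] if i + 1 < len(word) else ""
        # A letter links backwards only if the previous letter can link forwards.
        linked_to_prev = ch in LETTERS and prev in DUAL_JOINING
        # A letter links forwards only if it is dual-joining and a letter follows.
        linked_to_next = ch in DUAL_JOINING and nxt in LETTERS
        if linked_to_prev and linked_to_next:
            forms.append("middle")
        elif linked_to_prev:
            forms.append("final")
        elif linked_to_next:
            forms.append("initial")
        else:
            forms.append("isolated")
    return forms


# KAF, TEH, ALEF, BEH: the ALEF cannot link forwards, so BEH stays isolated.
print(positional_forms("\u0643\u062A\u0627\u0628"))
```

Characters outside the two sets (spaces, digits, Latin letters) are reported as isolated and break the chain of connections.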

 

Lam-Alef Ligature

A character is the smallest semantic unit in a writing system. A glyph is the visual presentation of one or more characters, and is often dependent on adjacent characters. There is not always a one-to-one correspondence between the number of characters stored for processing and the number of glyphs in the presented text: sometimes two or more characters are represented by a single glyph occupying one presentation cell.

The <Lam-Alef> ligature is coded as a single character in CCSID 420, while it is stored as 2 characters <Lam> and <Alef> in implicit code pages like 1256 and 1089. Special processing handles the <Lam-Alef> conversion:

·       From Visual to Implicit storage: expand <Lam-Alef> to <Lam> and <Alef>.

<Lam-Alef> is expanded by consuming the spaces at the buffer extremity in the direction of the end of the Arabic text. If the number of spaces at the buffer extremity is greater than or equal to the number of <Lam-Alef> characters in the buffer, they are all expanded to <Lam> and <Alef>. If the number of spaces at the buffer extremity is less than the number of <Lam-Alef> characters, then every <Lam-Alef> occurrence exceeding the number of spaces is converted to a Substitution control character.

·       From Implicit to Visual: compress <Lam> and <Alef> to <Lam-Alef>.

<Lam> and <Alef> are compressed to <Lam-Alef>, and one blank space is added towards the buffer extremity.

The Bidi layout engine handles the expansion and compression automatically, so that the user does not need to intervene in the process.
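The expansion and compression rules described above can be sketched on a fixed-length buffer. This is only an illustration under several assumptions: the buffer is a Python string, the "buffer extremity" is modelled as the end of the string, the Unicode isolated Lam-Alef ligature (U+FEFB) stands in for the single CCSID 420 code point, and the Substitution control character is represented by U+001A. The function names are hypothetical.

```python
LAM = "\u0644"
ALEF = "\u0627"
LAM_ALEF = "\uFEFB"  # assumption: stand-in for the one-character ligature
SUB = "\x1a"         # assumption: stand-in for the Substitution control character


def expand_lam_alef(buf):
    """Visual to implicit: expand ligatures while spare trailing spaces remain."""
    body = buf.rstrip(" ")
    spare = len(buf) - len(body)  # spaces available at the buffer extremity
    out = []
    for ch in body:
        if ch != LAM_ALEF:
            out.append(ch)
        elif spare > 0:
            out.append(LAM + ALEF)  # one extra character, paid for by one space
            spare -= 1
        else:
            out.append(SUB)         # no room left: substitute instead of expanding
    return "".join(out) + " " * spare


def compress_lam_alef(buf):
    """Implicit to visual: compress each pair and pad with one space per pair."""
    pairs = buf.count(LAM + ALEF)
    return buf.replace(LAM + ALEF, LAM_ALEF) + " " * pairs


visual = LAM_ALEF + "x" + LAM_ALEF + " "  # two ligatures, one spare space
print(repr(expand_lam_alef(visual)))
```

Both functions preserve the buffer length, which is why expansion can run out of room and fall back to the Substitution control character.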

 

 

A bit of history

When Arabic and Hebrew support was originally added to the iSeries, the system was used in a stand-alone environment. The designers of this support decided to store the Bidi data in visual order, with shaped letters (for Arabic). This had the advantage that no special processing was needed to format the data for presentation, since it was already in presentation form. Since the data only existed on the iSeries, it did not matter what form was used.

When Arabic and Hebrew support was added to workstations (PC systems, Unix systems), the designers of this support decided to store the Bidi data in logical order, with base shapes (for Arabic). This format is more convenient for processing like sorting and searching. It also has the advantage that Bidi data can be processed like non-Bidi data in many cases. However, the system needs to format the data for presentation.

Over time, customers began to interchange data back and forth between different systems, including between iSeries and workstations. They then discovered that, even though the same characters were used, the data was not the same. The data needed to be transformed from the format used on the source system to the format used on the target system.

For several releases, customers have asked that this process be made more transparent to applications. The first step was to create a classification of the different formats of Bidi data. This is done by means of Coded Character Set Identifiers (CCSIDs) (more on that below).

Beginning in V4R4, PTFs were made available to allow automatic transformation of Bidi data flowing through OS/400, according to the format appropriate in each phase, as specified by a CCSID. In V5R1, this processing is part of the base system.

Bidi Attributes, String Types and CCSIDs

Bidi Attributes

The specific characteristics of Bidi data can be described using five attributes:

·        Orientation

·        Text Type

·        Symmetrical Swapping

·        Text Shaping

·        Numeric Shaping

Orientation

The orientation of a piece of text specifies whether this text flows mainly from left to right (for instance, an English sentence with possible Arabic or Hebrew embedded terms) or from right to left (for instance, an Arabic or Hebrew sentence with possible embedded numbers or English terms).

The different possible values for Orientation are:

LTR

Appropriate for logically ordered text whose main language is LTR (like the European languages), or visually ordered text ready for presentation on a device where character progression is LTR.

RTL

Appropriate for logically ordered text whose main language is RTL (like Arabic or Hebrew), or visually ordered text ready for presentation on a device where character progression is RTL.

Contextual, Contextual Left, Contextual Right

The orientation is induced from the first "strong" character (a character with a "strong" orientation) in the text. Letters from the Hebrew or Arabic alphabets are considered to have a strong RTL orientation. Letters from European (and most other) alphabets are considered to have a strong LTR orientation. Characters such as spaces, punctuation, and even digits, which can be used within either an LTR or an RTL context, have no strong orientation. If a string is specified to have contextual orientation but contains no strong character, the orientation defaults to LTR (if specified as Contextual Left) or to RTL (if specified as Contextual Right).
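Contextual orientation can be sketched with the bidirectional character classes of the Unicode character database, which Python exposes through unicodedata.bidirectional. The function name contextual_orientation and its default parameter are hypothetical; passing "LTR" models Contextual Left and passing "RTL" models Contextual Right. Using Unicode classes is an assumption made for illustration, since OS/400 works with its own code pages.

```python
import unicodedata


def contextual_orientation(text, default="LTR"):
    """Return "LTR" or "RTL" based on the first strong character in text."""
    for ch in text:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class == "L":            # strong left-to-right letter
            return "LTR"
        if bidi_class in ("R", "AL"):    # strong right-to-left (Hebrew, Arabic)
            return "RTL"
    # No strong character found: fall back to the requested default.
    return default


print(contextual_orientation("123 abc"))               # digits are not strong
print(contextual_orientation("(\u05D0)"))              # Hebrew ALEF in parentheses
print(contextual_orientation("12-34", default="RTL"))  # Contextual Right fallback
```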

Text Type

The Text Type can be Logical (alias Implicit) or Visual according to the sequence in which the characters of the text are ordered.

Example: [two figures omitted: visual order shaped text; logical order unshaped text]

Symmetrical Swapping

Within RTL text, some characters may be displayed as their mirror-image counterparts. For instance, a left parenthesis is displayed as a right parenthesis and vice versa. In this way, the code allocated for the left parenthesis keeps the meaning "open parenthesis", and the code for the right parenthesis keeps the meaning "close parenthesis", in both LTR and RTL contexts. The same swapping applies to other pairs of characters with symmetrical shapes and roles, like square brackets, curly brackets, and even less-than and greater-than signs.

The possible values for this attribute are "ON" and "OFF".

If symmetrical swapping is not taken into account, paired characters such as open and close brackets keep their original shape when text is transformed between RTL and LTR layouts. This can cause the text to be misinterpreted, as in the following example:

On a right-to-left window of the screen, the expression:

b < a

is read as "a is greater than b". In storage, the first character is a, followed by < and then b. So we end up having in storage:

a < b

which is, of course, incorrect. To preserve the correct meaning of the expression, the < character must be exchanged in storage with >.
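A minimal sketch of the swapping step, assuming that only the four pairs named above are involved. The function name swap_symmetric is hypothetical, and the code operates on plain Python strings rather than on any OS/400 code page.

```python
# Each character maps to its mirror-image counterpart.
SYMMETRIC_SWAP = str.maketrans("()[]{}<>", ")(][}{><")


def swap_symmetric(text):
    """Exchange every symmetrical character with its counterpart."""
    return text.translate(SYMMETRIC_SWAP)


# The stored expression from the example above, corrected by swapping.
print(swap_symmetric("a < b"))
```

Applying the function twice returns the original text, so the same routine serves both directions of a transformation.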

Text Shaping

As explained above (see "Arabic Shaping"), Arabic text may be encoded as base shapes, or with the contextual shapes (Initial, Middle, Final, Isolated) used for presentation. Accordingly, the values for this attribute may be "Unshaped" (base shapes) or "Shaped".

Numeric Shaping

Some Arabic countries use different shapes for digits, the "Hindi" digits. These digits may have specific encodings, different from the regular ("Arabic") digits. However, it is often convenient, or even mandatory, to keep the digits encoded in their regular codes, for instance to facilitate their handling by software and hardware without special treatment for Bidi environments. In this case, the conversion of the digits to Hindi shapes must be made at presentation. The Numeric Shaping attribute specifies how the different kinds of digits are encoded. Its possible values are:

passthrough

Arabic digits and Hindi digits are encoded separately; no conversion is needed for presentation. During conversion to a string type with numeric shaping set to passthrough, the Hindi or Arabic numerics are preserved in their original state without any conversion.

Arabic

All digits are encoded as Arabic digits; the presentation mechanism must select Hindi digit shapes for numbers appearing in an Arabic context. If the code page has a different encoding for Arabic and Hindi digits, during conversion to a string type with numeric shaping set to Arabic the Hindi digits are converted to Arabic digits.
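The effect of the two values during conversion can be sketched as follows. The sketch assumes Unicode strings, where the Hindi digits occupy U+0660 through U+0669; the function name convert_digits is hypothetical, and real CCSID conversion works on code page values rather than Unicode.

```python
HINDI_DIGITS = "\u0660\u0661\u0662\u0663\u0664\u0665\u0666\u0667\u0668\u0669"
ARABIC_DIGITS = "0123456789"
HINDI_TO_ARABIC = str.maketrans(HINDI_DIGITS, ARABIC_DIGITS)


def convert_digits(text, numeric_shaping):
    """Apply the target string type's Numeric Shaping value to text."""
    if numeric_shaping == "passthrough":
        # Both kinds of digits are preserved exactly as encoded.
        return text
    if numeric_shaping == "Arabic":
        # Hindi digits are converted to the regular ("Arabic") digits.
        return text.translate(HINDI_TO_ARABIC)
    raise ValueError(f"unknown numeric shaping value: {numeric_shaping!r}")


print(convert_digits("\u0661\u0662 12", "Arabic"))
```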

String Types

Certain combinations of Bidi attributes are more frequently used in applications. In order to simplify their specification, each such combination has been equated to a "String Type". The string types used for Bidi data are defined in the following table.

String Type | Text Type | Numeric Shaping | Orientation      | Text Shaping | Symmetrical Swapping
4           | Visual    | passthrough     | LTR              | Shaped       | Off
5           | Implicit  | Arabic          | LTR              | Unshaped     | On
6           | Implicit  | Arabic          | RTL              | Unshaped     | On
7(*)        | Visual    | passthrough     | Contextual*      | Unshaped-Lig | Off
8           | Visual    | passthrough     | RTL              | Shaped       | Off
9           | Visual    | passthrough     | RTL              | Shaped       | On
10          | Implicit  |                 | Contextual Left  |              | On
11          | Implicit  |                 | Contextual Right |              | On
12          | Implicit  | Arabic          | RTL              | Shaped       | Off

Note: (*) String orientation is LTR when the first alphabetic character is a Latin one, and RTL when it is an Arabic or Hebrew (RTL) character; characters are unshaped, but Lam-Alef ligatures are kept and not broken into their constituents.

CCSIDs

"CCSID" is an acronym for "Coded Character Set IDentifier". This is a number in the range 0-65535 (0-FFFF in hexadecimal notation). A CCSID represents a combination of the following elements:

·        Character Set: a repertoire of characters

·        Encoding Scheme: a set of rules by which characters are given a specific encoded value. Examples of encoding schemes are EBCDIC or ASCII.

·        Code Page: a set of numeric values for all characters in the Character Set

·        String Type: a number representing a set of attributes applying to data encoded with this CCSID.

The following CCSID numbers have special meanings.

CCSID 65535 (X'FFFF')

This CCSID is used to show that the associated data should not be processed as text. In other words, data associated with this CCSID must not be converted to another CCSID.

CCSID 65534 (X'FFFE')

This CCSID is used to show that a CCSID value for data at this level of processing is not relevant, and CCSID values should be obtained from data elements at a lower level in the defined hierarchy. For example, if a message file is tagged with this CCSID, processing will be based on CCSIDs assigned to each individual message ID. If a file is tagged with this CCSID, processing will be based on CCSIDs assigned to each individual field.

CCSID Support in OS/400

In environments where the components of the system use multiple languages and multiple data encodings, OS/400 relies on CCSIDs to maintain data integrity. In practice, CCSIDs can be specified at various levels: for data, programs, jobs, and so on. Whenever data flows from a source to a destination, the system functions involved compare the CCSID of the data supplied at the source with the CCSID of the data expected by the destination, and perform the appropriate conversions if needed. For Bidi data, this may include both code page conversion and Bidi attribute transformation. For instance, when a program reads data from a database and the CCSIDs of the job and of the database are different, the data is converted to the CCSID of the job. If the database file is tagged with CCSID 8612 (implicit data) and the interactive job is associated with CCSID 420 (visual display), the data is converted from implicit to visual.

Note: If one of the CCSIDs is 65535, no conversion is done on the data.
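The comparison that the system performs can be pictured with the following sketch. It is not the OS/400 implementation: the CCSIDS dictionary holds only a few Arabic entries, with code page and string type values copied from the table of supported Arabic CCSIDs in this document, and the names CcsidInfo and plan_conversion are hypothetical.

```python
from typing import NamedTuple


class CcsidInfo(NamedTuple):
    code_page: int
    string_type: int


# Small excerpt of the Arabic CCSIDs listed in this document.
CCSIDS = {
    420: CcsidInfo(code_page=420, string_type=4),
    8612: CcsidInfo(code_page=420, string_type=5),
    1256: CcsidInfo(code_page=1256, string_type=5),
    62224: CcsidInfo(code_page=420, string_type=6),
}
NO_CONVERSION = 65535


def plan_conversion(source, target):
    """List the conversions needed when data flows from source to target."""
    if NO_CONVERSION in (source, target) or source == target:
        return []
    src, tgt = CCSIDS[source], CCSIDS[target]
    steps = []
    if src.code_page != tgt.code_page:
        steps.append("code page conversion")
    if src.string_type != tgt.string_type:
        steps.append("Bidi attribute transformation")
    return steps


# Database file tagged 8612 (implicit), interactive job running with 420 (visual).
print(plan_conversion(8612, 420))
```

In this model, reading an 8612 file from a job with CCSID 420 needs only the Bidi attribute transformation, because both CCSIDs share code page 420.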

The following sections describe CCSIDs at different levels in the system.

System level

The system value QCCSID is the default CCSID for all jobs running on the system. QCCSID can be set or changed with the CHGSYSVAL and WRKSYSVAL commands. The system is shipped with a default CCSID of 65535, which prevents data conversion. However, the Arabic NLV (National Language Version) has a default CCSID of 00420, and the Hebrew NLV has a default CCSID of 00424.

User Profile level

A CCSID specified in a user profile is assigned by default to all jobs run under that user profile. The CCSID can be set or changed with the CRTUSRPRF and CHGUSRPRF commands.

Job level

The job CCSID for an interactive job is set initially to the CCSID of the user profile. For batch jobs, the CCSID of the current job is used as the default CCSID for the submitted job, unless a CCSID is specifically entered on the BCHJOB command. A job CCSID can be changed with the CHGJOB command.

File level

The CCSID of a file can be specified when creating it. By default, the file receives the CCSID of the job creating it. The CCSID of a physical file can be changed with the CHGPF command, if it is not explicitly defined in the Data Description Specification (DDS) source description of the file.

Figure 1: Change Physical File (CHGPF) command prompt

                          Change Physical File (CHGPF)                          
                                                                                
 Type choices, press Enter.                                                     
                                                                                
 Language ID  . . . . . . . . . . LANGID         ARA                            
 Record format level check  . . . LVLCHK         *NO                            
 Node group . . . . . . . . . . . NODGRP         *NONE                          
   Library  . . . . . . . . . . .                                               
 Partitioning Key . . . . . . . . PTNKEY         *SAME                          
                           + for more values                                    
 Text 'description' . . . . . . . TEXT         > '                              
                    '                                                           
                                                                                
                            Additional Parameters                               
                                                                                
 Coded character set ID . . . . . CCSID          420                            
                                                                                
                                                                                
                                                                                
                                                                                
                                                                         Bottom 
 F3=Exit   F4=Prompt   F5=Refresh   F12=Cancel   F13=How to use this display    
 F24=More keys

The CCSID of a stream file can be changed by using the EDTF command and pressing <F15> to change the associated CCSID.

                                   EDTF Options Screen                           
                                                                                
 Selection . . . . . . . . . . . .     3                                        
                                                                                
 1. Copy from stream file  . . . .     /tmp/brms/flightrec                      
                                                                                
                                                                                
                                                                                
                                                                                
 2. Copy from database file  . . .                   Name                       
     Library . . . . . . . . . . .                   Name, *LIBL, *CURL         
     Member  . . . . . . . . . . .                   Name, *FIRST               
                                                                                
 3. Change CCSID of file . . . . .     00037    Job CCSID: 00037                
                                                                                
 4. Change CCSID of line . . . . .     *NONE                                    
                                                                                
 5. Stream file EOL option . . . .     *LF      *CR, *LF, *CRLF, *LFCR, *USRDFN 
     User defined. . . . . . . . .              Hexadecimal value               
                                                                                
                                                                                
                                                                                
 F3=Exit   F12=Cancel
 


Figure 2: EDTF Options Screen

Note: Changing the CCSID associated with the file in this manner does not convert the data to the target CCSID. The CCSID is a tag with no effect on the stored data. Refer to "Data conversion across different CCSID".

Field level

The CCSID can be specified at field level in databases.

The field CCSID is specified when the file is created using one of the following:

·       DDS source, in the CCSID field level keyword. See "CCSID (Coded Character Set Identifier) keyword for physical and logical files".

·       SQL statement CREATE TABLE, in the CCSID parameter.

Message File level

The CCSID of a message file that contains Bidi messages should be 65534, which allows the CCSID retrieval from each message independently. Then each message in the message file must be tagged with the proper CCSID.

Supported CCSIDs for Arabic

OS/400 supports the following CCSIDs for Arabic.

CCSID | String Type | Code Page | Description
420   | 4           | 420       | EBCDIC (Original CCSID for Arabic data)
425   | 5           | 425       | EBCDIC with POSIX chars, like [ ] { } etc.
864   | 5           | 864       | PC Data
1046  | 5           | 1046      | Arabic AIX data all presentation shapes
1089  | 5           | 1089      | ISO 8859-6
1256  | 5           | 1256      | MS Windows
8612  | 5           | 420       | EBCDIC
12708 | 8           | 420       | EBCDIC
62218 | 4           | 864       | PC Data
62224 | 6           | 420       | EBCDIC
62228 | 6           | 1256      | MS Windows
62251 | 6           | 425       | EBCDIC with POSIX chars, like [ ] { } etc.

CCSID 420 can be mapped to:
00037, 00256, 00500, 00720, 00737, 00775, 00819, 00850, 00864, 00937, 01008, 01046, 01089, 01112, 01122, 01208, 01256, 04960, 08612, 09030, 09056, 12708, 13488, 28709, 61952, 62218, 62224, 62228.

CCSID 425 can be mapped to:
00037, 00500, 00819, 00864, 01046, 01089, 01252, 01256, 08612, 13488, 61952, 62224, 62228.

For a table of all supported CCSID conversions, see "iSeries Bidirectional CCSID Mapping Information".

Supported CCSIDs for Hebrew

OS/400 supports the following CCSIDs for Hebrew.

CCSID | String Type | Code Page | Description
424   | 4           | 424       | EBCDIC (Original CCSID for Hebrew data)
916   | 5           | 916       | ISO 8859-8
1255  | 5           | 1255      | MS Windows
62210 | 4           | 916       | ISO 8859-8
62211 | 5           | 424       | EBCDIC
62215 | 4           | 1255      | MS Windows
62222 | 6           | 916       | ISO 8859-8
62223 | 6           | 1255      | MS Windows
62235 | 6           | 424       | EBCDIC
62238 | 10          | 916       | ISO 8859-8
62239 | 10          | 1255      | MS Windows
62245 | 10          | 424       | EBCDIC

CCSID 424 can be mapped to:
00037, 00256, 00500, 00737, 00775, 00819, 00850, 00862, 00916, 00937, 01112, 01122, 01208, 01255, 04952, 09030, 13488, 28709, 61952, 62210, 62211, 62215, 62222, 62223, 62235, 62238, 62239, 62245.

For a table of all supported CCSID conversions, see "iSeries Bidirectional CCSID Mapping Information".

 

Data Conversion Strategies

The CCSID support in OS/400 affects many components of the system. Evaluating what conversions will take place based on CCSID values specified for various elements can be a challenge. More information can be found in the following IBM publications:

·        AS/400 series - National Language Support, SC41-5101

·        AS/400 series - International Application Development, SC41-5603

Many strategies are possible, depending on the level of heterogeneity of the data formats involved. We will present only two archetypal ones.

Strategy 1: no system interference

With this strategy, all data conversions done automatically by the system are avoided. This can be achieved by specifying CCSIDs as 65535 wherever possible. This strategy avoids unwanted conversions by the system. On the other hand, whenever conversions are needed, the burden lies with the applications.

Strategy 2: let the system do all the work

With this strategy, all data conversions are left to the system, which knows exactly what to do provided that CCSIDs are scrupulously specified for all the elements involved. This strategy requires a good understanding of the system capabilities, but it relieves the applications of the burden of data conversions.

A few suggestions

In practical situations, users will probably want to use a mix of strategies 1 and 2, starting closer to strategy 1 (for compatibility with existing procedures) and getting closer to strategy 2 as the system is better understood and data interchange between heterogeneous platforms becomes more prevalent.

Traditionally, Bidi data under OS/400 has been stored in Visual LTR format. This corresponds to CCSID 420 for Arabic and CCSID 424 for Hebrew. For compatibility with all this "legacy" data, we suggest making 420 or 424 (for Arabic and Hebrew respectively) the value of QCCSID and the CCSID value for most user profiles.

Other data formats are used for various purposes. The jobs that create these kinds of data should have a corresponding CCSID.

 

·        EBCDIC Bidi data could be stored in Logical format, which is useful for sorting, searching, and data interchange with systems using implicit code pages. This corresponds to CCSID 8612 or 62224 for Arabic and CCSID 62211 or 62235 for Hebrew.
 

·        Windows platforms, and most Unix platforms, expect data in LTR Logical format. This corresponds to CCSIDs 1255 and 1256 for Hebrew and Arabic respectively.
 

·        Web pages should be in Logical format (for Hebrew, Visual format is also used, but this usage is deprecated). The corresponding CCSIDs are 916 or 1255 for Hebrew, 1089 or 1256 for Arabic.