Unicode Support, Part 1
by Claudia Holanda
With the number of companies adopting Unicode in their systems increasing rapidly, Information Builders decided to add Unicode support starting with WebFOCUS Reporting Server 7.6.0.
This major implementation brings other new features and configuration changes, and it has raised questions from many of you. So I decided to highlight the various aspects of the emerging Unicode capability in a series of articles, starting with this one.
Before I get into the specifics, let's first cover the basics.
Unicode is a universal character-encoding standard that assigns a different code to every character and symbol in every spoken and/or written language in the world. Since no other encoding standard supports all languages, Unicode is the only one that ensures you can retrieve or combine data using any combination of languages. Unicode is required with XML, Java, JavaScript, LDAP and other Web-based technologies.
The server will support a Unicode Transformation Format (UTF) called UTF-8 in ASCII environments. As used by the server, this encoding assigns each character a code that can be from one to three bytes long. (UTF-8 itself also defines four-byte codes for supplementary characters outside the Basic Multilingual Plane.) The codes assigned to characters from SBCS languages (Eastern and Western European) are one or two bytes long, and those assigned to characters from DBCS languages (Asian) are three bytes long.
This standard is compatible with the ASCII format because the first 128 UTF-8 codes have the same one-byte representation as the corresponding ASCII codes. For EBCDIC environments, a transformation format called UTF-E is used. This encoding standard assigns each character a code that can be from one to four bytes long.
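To make these byte counts concrete, here is a minimal Python sketch. It is an illustration only and not part of WebFOCUS; the sample characters and the helper name utf8_length are assumptions chosen for this example.

```python
# Sample characters are illustrative assumptions, not taken from the article.
samples = {
    "A": "Latin capital A (ASCII)",
    "\u00e9": "Latin small e with acute (Western European)",
    "\u0416": "Cyrillic capital Zhe (Eastern European)",
    "\u65e5": "CJK ideograph (Asian)",
}


def utf8_length(ch: str) -> int:
    """Return the number of bytes UTF-8 uses to encode one character."""
    return len(ch.encode("utf-8"))


for ch, description in samples.items():
    print(f"{ch!r}: {utf8_length(ch)} byte(s) - {description}")

# The first 128 code points have the same one-byte form in ASCII and UTF-8.
ascii_compatible = all(
    chr(cp).encode("utf-8") == chr(cp).encode("ascii") for cp in range(128)
)
print("First 128 codes identical to ASCII:", ascii_compatible)
```

The European samples encode to one or two bytes and the Asian sample to three, matching the ranges described above.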
In non-Unicode encoding systems, each character is assigned a code that is one byte long, limiting the number of characters that can be encoded. When using those standards, it became common to equate a character with a byte of storage. If you had 10 characters to store, the amount of storage needed was 10 bytes, and many character manipulation routines expected character string lengths to be provided as a number of bytes.
With Unicode encoding, bytes and characters can no longer be equated. If you select the UTF-8 encoding scheme, the WebFOCUS Reporting Server will be configured, by default, to do all character manipulation and interpret all alphanumeric field lengths in terms of characters that consist of up to three bytes each (in ASCII environments). This processing mode is called character semantics. Existing procedures will continue to work without adjustment (in most cases). Although a character may take up more than one byte in memory, it will display using only one space in a report column.
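The difference between counting characters and counting bytes can be sketched in a few lines of Python. This is not how the server is implemented; the function names, the sample strings, and the three-bytes-per-character ceiling used for the storage estimate are assumptions taken from the description above.

```python
def char_length(text: str) -> int:
    """Length under character semantics: one unit per character."""
    return len(text)


def byte_length(text: str) -> int:
    """Length under byte semantics: bytes needed in UTF-8."""
    return len(text.encode("utf-8"))


def fits_field(text: str, field_chars: int) -> bool:
    """Under character semantics, a field of N holds N characters,
    no matter how many bytes each character needs."""
    return char_length(text) <= field_chars


def max_storage_bytes(field_chars: int, max_bytes_per_char: int = 3) -> int:
    """Worst-case storage for a field, assuming up to three bytes
    per character as described for ASCII environments."""
    return field_chars * max_bytes_per_char


for value in ("Zurich", "Z\u00fcrich", "\u6771\u4eac\u90fd"):
    print(f"{value!r}: {char_length(value)} characters, {byte_length(value)} bytes")

print("Worst-case bytes for a 10-character field:", max_storage_bytes(10))
```

A routine that assumes one byte per character would miscount the second and third strings, which is why lengths are interpreted in characters once UTF-8 is selected.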
The main benefit of the new system is the ability to have multiple languages (both European and Asian) in the following objects:
- Titles, descriptions, and names in synonyms.
- Headings and prompts in procedures.
- Column names, titles and sample data in the Data Management Console.
- Data for all supported adapters:
  - SAP BW
  - SAP R/3-ECC
  - Web Services
  - Oracle
  - DB2
  - Microsoft SQL Server
  - Sybase ASE
  - Fixed files
Support for more adapters will be added later.
Configuring WebFOCUS for UTF-8 Encoding
To configure the WebFOCUS Reporting Server for Unicode, select Workspace, then Configuration from the Web Console menu bar. In the Workspace Configuration navigation pane, expand the General folder and choose NLS Settings. Then choose 65001 - Unicode (UTF-8) in the CODE_PAGE field. The server will be configured for character semantics once you save this configuration.
Accessing Non-FOCUS Data Sources with Character Semantics
Having the WebFOCUS Reporting Server configured for UTF-8 doesn't mean that your DBMS must also be UTF-8.
Relational adapters convert data to UTF-8 on retrieval and, when writing to the relational data source, to the encoding that the DBMS API expects (for example, UTF-8 for Oracle; UTF-16 for MSSQL).
The actual format is allocated accordingly. The Adapter for SAP BW converts to and from UTF-16.
Control objects, such as connections, are converted from UTF-8 (or another code page) to UTF-16.
For nonrelational adapters, the actual format may be interpreted as being in any code page. For legacy adapters, this will be a non-Unicode code page.
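The kind of conversion described in this section can be illustrated with a short Python sketch. It does not show the adapters' actual code; the transcode helper, the sample name, and the choice of code page 1252 as the legacy encoding are all assumptions made for the example.

```python
def transcode(data: bytes, source_encoding: str, target_encoding: str) -> bytes:
    """Decode bytes from one encoding and re-encode them in another."""
    return data.decode(source_encoding).encode(target_encoding)


# Data as it might be stored in a non-Unicode (single-byte) code page.
legacy_bytes = "M\u00fcller".encode("cp1252")

# Retrieval: convert the stored bytes to UTF-8 for the server.
utf8_bytes = transcode(legacy_bytes, "cp1252", "utf-8")

# Writing through an API that expects UTF-16 (little-endian, no byte order mark).
utf16_bytes = transcode(utf8_bytes, "utf-8", "utf-16-le")

print("cp1252:", len(legacy_bytes), "bytes")
print("UTF-8: ", len(utf8_bytes), "bytes")
print("UTF-16:", len(utf16_bytes), "bytes")

# Converting back yields the original stored bytes.
round_trip = transcode(utf16_bytes, "utf-16-le", "cp1252")
print("Round trip preserved:", round_trip == legacy_bytes)
```

The same six characters occupy a different number of bytes in each encoding, while the text itself survives the round trip unchanged.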
Watch for future WebFOCUS Newsletter issues, as I'll be covering more areas of Unicode support. In the meantime, if you have any questions or suggestions, drop me a line at claudia_holanda@ibi.com.