Unicode Support, Part 2

By Claudia Holanda

Hello. Oi. Hola. Bonjour. Guten Tag.
Jambo. Hyvää päivää. Ciao. 你好. Cześć.
Selamat pagi. Buna ziua. こんにちは.
여보세요. Szia. Καλή ημέρα. Góðan dag.  
Ba'ax ka wa'alik. Здрáвствуйте! שָׁלוֹם.

Can you identify the languages for each “Hello” expression above?

Fewer than 10: You need to watch more international news.
Between 10 and 15: Consider yourself a linguist.
Between 15 and 20: Pack your bags. You’re ready for a trip around the world!

Look for the languages at the end of this article.

Unicode-related questions continue to be the No. 1 type of query in my inbox. I have written about Unicode before, and now I will share more information on Unicode support for the WebFOCUS Reporting Server.

Selecting, Reformatting, and Manipulating Characters

When the WebFOCUS Reporting Server is configured for Unicode, it does all character manipulation and interprets all alphanumeric field lengths in terms of characters that consist of up to three bytes each (in ASCII environments).

In character semantics mode, selection tests against a mask are automatically adjusted to work with characters rather than bytes. Formats assigned by reformatting a field in a request or by defining a temporary field are interpreted in terms of characters. Character functions interpret all lengths in terms of characters as well.
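The difference between counting characters and counting bytes can be illustrated outside the server. The following is a minimal Python sketch, not WebFOCUS code; the helper name char_and_byte_lengths is made up for this example, and UTF-8 is assumed as the encoding.

```python
# Illustration only: a Python str is a sequence of characters, while the
# encoded bytes object is what is actually stored. This mirrors the
# character/byte distinction; it does not reproduce the server's internals.
def char_and_byte_lengths(text, encoding="utf-8"):
    """Return (number of characters, number of encoded bytes)."""
    return len(text), len(text.encode(encoding))


for greeting in ["Hello", "Cześć", "你好"]:
    chars, nbytes = char_and_byte_lengths(greeting)
    print(f"{greeting}: {chars} characters, {nbytes} bytes in UTF-8")
```

A length expressed in characters stays the same whatever the language; only the number of bytes behind it changes.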

Consider the following DEFINE in the Master File for the EMPLOYEE data source:

DEFINE FIRST_ABBREV/A5 WITH FIRST_NAME = EDIT(FIRST_NAME, '99999$$$$$');$

In character semantics mode, format A5 is interpreted as five characters (up to 15 bytes on ASCII platforms, up to 20 bytes on EBCDIC platforms), and the comparison is performed based on this number of bytes. In byte semantics mode, format A5 is interpreted as five bytes, and the comparison is performed based on five bytes. In either case, the correct characters are compared and extracted.
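To make the character-based extraction concrete, here is a simplified Python emulation of a mask like the one in the DEFINE above. It is a sketch under stated assumptions, not the server's implementation: the function edit_mask is hypothetical, it assumes that '9' keeps a source character and '$' skips one, and the sample name is invented.

```python
def edit_mask(source, mask):
    """Simplified, hypothetical emulation of an EDIT-style mask.

    Assumed rules: '9' keeps the next source character, '$' skips it,
    and any other mask character is inserted into the output as-is.
    """
    out = []
    pos = 0  # index of the next source character, counted in characters
    for m in mask:
        if m == "9":
            if pos < len(source):
                out.append(source[pos])
            pos += 1
        elif m == "$":
            pos += 1
        else:
            out.append(m)
    return "".join(out)


first_name = "Александра"  # invented sample value: 10 characters
abbrev = edit_mask(first_name, "99999$$$$$")
print(abbrev, len(abbrev), len(abbrev.encode("utf-8")))
```

Because the mask is applied character by character, the result is five characters long whether those characters occupy 5 bytes (plain ASCII) or more (here, 10 bytes in UTF-8).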

Consider the following PRINT command:

PRINT FIELD1/A10

In character semantics mode, format A10 is interpreted as 10 characters (up to 30 bytes), meaning that up to 30 bytes must be retrieved when this field is referenced. In byte semantics mode, format A10 means that 10 bytes will be retrieved. In either case, the field displays as 10 characters that take up 10 spaces on the report output.

Sort Order under Unicode

Sort order is based on the binary values assigned to the characters. When the server is configured for Unicode, the sort order is based on the Unicode encoding standard. If ascending values of the codes correspond to the alphabetical order of the letters in the language being used, a report can be sorted in alphabetical order. This is entirely dependent on the encoding standard and its mapping of codes to letters. In most, but not all, cases, the encoding standard assigns codes in the alphabetical order of each language.

For example, Ukrainian added a new letter, Ґ (Cyrillic capital letter ‘ghe’ with upturn), to its alphabet after the UTF-8 coding specification had already been set. This letter was not assigned a code that sorts it alphabetically, either in Unicode or in code page 1251 (used for Ukrainian). As a result, it sorts in a different, and incorrect, position under each encoding scheme.

With code page 1251, this letter sorts as the first letter on the report output. With UTF-8, this letter sorts as the last letter on the report output.
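This behavior can be reproduced with a short Python sketch that sorts the first few letters of the Ukrainian alphabet by their encoded byte values. It only illustrates binary sorting in general; the variable names are made up, and Python's cp1251 codec stands in for code page 1251.

```python
# First letters of the Ukrainian alphabet, listed in alphabetical order.
letters = ["А", "Б", "В", "Г", "Ґ", "Д"]

# Binary sort: order the letters by the bytes each encoding assigns to them.
by_cp1251 = sorted(letters, key=lambda ch: ch.encode("cp1251"))
by_utf8 = sorted(letters, key=lambda ch: ch.encode("utf-8"))

print("cp1251:", " ".join(by_cp1251))
print("UTF-8: ", " ".join(by_utf8))
```

In this sample, Ґ moves to the front under code page 1251 and to the end under UTF-8, while the other letters keep their alphabetical order.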

To determine whether a language sorts alphabetically, you can examine the hexadecimal codes assigned to its letters on the code page you are using and check whether ascending hexadecimal codes match the alphabetical order (see Fall 2006 WebFOCUS Newsletter).
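The check described above can also be scripted. The following Python sketch lists the hexadecimal code of each letter and tests whether the codes ascend in alphabetical order; the function names code_table and sorts_alphabetically are hypothetical, and the alphabet samples are deliberately short.

```python
def code_table(alphabet, encoding):
    """Return (letter, hexadecimal code) pairs for the given encoding."""
    return [(ch, ch.encode(encoding).hex().upper()) for ch in alphabet]


def sorts_alphabetically(alphabet, encoding):
    """True if the encoded values ascend in the same order as the alphabet."""
    codes = [ch.encode(encoding) for ch in alphabet]
    return all(a < b for a, b in zip(codes, codes[1:]))


ukrainian_start = "АБВГҐД"  # first six letters, in alphabetical order
for ch, code in code_table(ukrainian_start, "cp1251"):
    print(ch, code)

print(sorts_alphabetically(ukrainian_start, "cp1251"))
print(sorts_alphabetically("ABCDEF", "cp1251"))
```

Passing the full alphabet of the language you report in, in its correct order, shows quickly whether a binary sort will match alphabetical order on that code page.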

PDF and PS Formats

You’ve asked, and now you have it: Unicode support for the PDF and PS output formats is now available.

There are two fonts that support Unicode characters with PDF and PS formats:

  • Lucida Sans Unicode – used to display single-byte characters only. This font is available on Windows 2000 and later.
  • Arial Unicode MS – used to display both single-byte and double-byte characters. This font is available as an option from the Microsoft Office CD, version 2000 and later. It is not available for the PS format.

Both fonts are listed in \ibi\srv76\home\nls\pdf.fmp. Lucida Sans Unicode will be the default font if the WebFOCUS Reporting Server is configured for UTF8 (code page 65001) or UTFE (code page 65002).

If the server is configured for Unicode and you would like to use the Arial Unicode MS font instead, you must specify it in your StyleSheet. The exception is when pdf.fmp assigns it the “DEFAULT-FONT=YES” attribute, in which case it becomes the default font.

The font transcoding table will be automatically generated when the Unicode configuration is done through the NLS option of the WebFOCUS Reporting Server Console.

--------------------

Answers to the “Hello” language quiz:
English. Portuguese. Spanish. French. German. Swahili. Finnish. Italian. Chinese. Polish. Indonesian. Romanian. Japanese. Korean. Hungarian. Greek. Icelandic. Mayan. Russian. Hebrew.
