Appendix: Unicode Multilingual Support

This appendix describes how Photon handles international characters.

Photon is designed to handle international characters. Following the Unicode standard, Photon provides developers with the ability to create applications that can easily support the world's major languages and scripts.

Unicode is modeled on the ASCII character set, but uses a 32-bit encoding to support full multilingual text. There's no need for escape sequences or control codes when specifying any character in any language. Unicode treats all characters, whether alphabetic characters, ideographs, or symbols, in exactly the same way.

In designing the keyboard driver and the character handling mechanisms, we referred to the X11 keyboard extensions and ISO standards 9995 and 10646-1.

Wide and multibyte characters

ANSI C includes the following concepts:

wide character
A character represented as a value of type wchar_t, which typically is larger than a char.
multibyte character
A sequence of one or more bytes that represents a character, stored in a char array. The number of bytes depends on the character.
wide-character string
An array of wchar_t.
multibyte string
A sequence of multibyte characters stored in a char array.
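
To make the distinction concrete, here's a minimal sketch in standard C that compares the storage used by a wide-character string and a multibyte string. The helper names wide_storage_bytes() and mb_storage_bytes() are hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: bytes needed to store a wide-character string,
 * including its terminating L'\0'. Every element has the same size. */
size_t wide_storage_bytes(const wchar_t *ws)
{
    assert(ws != NULL);
    return (wcslen(ws) + 1) * sizeof(wchar_t);
}

/* Hypothetical helper: bytes needed to store a multibyte string,
 * including its terminating '\0'. A character may occupy one or more
 * bytes, so this is a byte count, not a character count. */
size_t mb_storage_bytes(const char *s)
{
    assert(s != NULL);
    return strlen(s) + 1;
}
```

For a string of ASCII letters such as "abc", the multibyte form needs 4 bytes, while the wide form needs 4 * sizeof(wchar_t) bytes.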

Unicode

Unicode is a 32-bit encoding scheme.

For Unicode character values, see /usr/include/photon/PkKeyDef.h. For more information about Unicode, see the Unicode Consortium's website at www.unicode.org.

UTF-8 encoding

Formerly known as UTF-2, the UTF-8 (for “8-bit form”) transformation format is designed to address the use of Unicode character data in 8-bit UNIX environments. Each Unicode value is encoded as a multibyte UTF-8 sequence.

Here are some of the main features of UTF-8:

The actual encoding is this:
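
As an illustration, the standard UTF-8 bit layout (one to four bytes per character, as in RFC 3629) can be sketched in portable C. This isn't Photon's own implementation, and the name utf8_encode_sketch() is hypothetical; it isn't part of any Photon or QNX library.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not a Photon or QNX function): encode one Unicode
 * code point as UTF-8. Writes up to 4 bytes to out and returns the number
 * of bytes written, or 0 if cp can't be encoded. */
size_t utf8_encode_sketch(uint32_t cp, unsigned char out[4])
{
    assert(out != NULL);

    if (cp < 0x80) {                       /* 0xxxxxxx */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                      /* 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {                    /* 1110xxxx 10xxxxxx 10xxxxxx */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                      /* surrogates aren't characters */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                  /* 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;                              /* beyond the Unicode range */
}
```

For example, è (U+00E8) is encoded as the two bytes 0xC3 0xA8, while any ASCII character is encoded as itself in a single byte.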

Conversion functions

In our C libraries, “wide characters” are assumed to be Unicode, and “multibyte” is UTF-8 in the default locale. The wchar_t type is defined as an unsigned 32-bit type, and wctomb() and mbtowc() implement the UTF-8 encoding in the default locale.


Note: Multibyte characters in the C library are UTF-8 in the default locale; in different locales, multibyte characters might use a different encoding.

You can use the following functions (described in the QNX Neutrino Library Reference) for converting between wide-character and multibyte encodings:

mblen()
Determine the number of bytes in a multibyte character
mbtowc()
Convert a multibyte character to a wide character
mbstowcs()
Convert a multibyte string to a wide-character string
wctomb()
Convert a wide character to its multibyte representation
wcstombs()
Convert a wide-character string to a multibyte string
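
The following sketch shows one way these standard functions might be combined: counting the characters in a multibyte string with mblen(), and round-tripping a string through mbstowcs() and wcstombs(). The helper names count_mb_chars() and roundtrip_ok() and the buffer sizes are assumptions made for the example. The results depend on the current locale: with a UTF-8 locale the helpers handle UTF-8 text, while in a plain "C" locale on other systems typically only ASCII converts.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: count the characters (not bytes) in a multibyte
 * string by stepping through it with mblen(). Returns (size_t)-1 if a
 * sequence is invalid in the current locale. */
size_t count_mb_chars(const char *s)
{
    size_t count = 0;

    assert(s != NULL);
    (void)mblen(NULL, 0);                 /* reset any shift state */
    while (*s != '\0') {
        int n = mblen(s, MB_CUR_MAX);     /* bytes in the next character */
        if (n <= 0)
            return (size_t)-1;
        s += n;
        count++;
    }
    return count;
}

/* Hypothetical helper: convert multibyte -> wide -> multibyte and report
 * whether the result matches the input. Returns 1 on a match, 0 on a
 * conversion error, a mismatch, or input too long for the buffers. */
int roundtrip_ok(const char *s)
{
    wchar_t wide[64];
    char back[256];
    size_t n, m;

    assert(s != NULL);
    n = mbstowcs(wide, s, 64);
    if (n == (size_t)-1 || n >= 64)
        return 0;                         /* invalid, or not terminated */
    m = wcstombs(back, wide, sizeof back);
    if (m == (size_t)-1 || m >= sizeof back)
        return 0;
    return strcmp(back, s) == 0;
}
```

On systems where the default locale isn't UTF-8, calling setlocale(LC_CTYPE, "") before using the helpers selects the environment's encoding.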

Photon libraries use multibyte UTF-8 character strings: any function that handles strings should be able to handle a valid UTF-8 string, and functions that return a string can return a multibyte-character string. This also applies to widget resources. The graphics drivers and font server assume that all strings use UTF-8.

The main Photon library, ph, provides the following non-ANSI functions (described in the Photon Library Reference) for working with multibyte UTF-8 and wide characters:

utf8len()
Count the bytes in a UTF-8 character
utf8strblen()
Find the number of UTF-8 characters in part of a string
utf8strchr()
Search for a UTF-8 character in a string
utf8strichr()
Search for a UTF-8 character in a string, ignoring case
utf8strirchr()
Search backwards for a UTF-8 character in a string, ignoring case
utf8strlen()
Find the length of a UTF-8-character string
utf8strnchr()
Search for a UTF-8 character in part of a string
utf8strncmp()
Compare part of a UTF-8-character string
utf8strndup()
Create a copy of part of a UTF-8-character string
utf8strnichr()
Search for a UTF-8 character in part of a string, ignoring case
utf8strnlen()
Find the number of bytes used by a UTF-8-character string
utf8strrchr()
Search backwards for a UTF-8 character in a string
utf8towc()
Convert a UTF-8 character to a wide-character code
wctolower()
Return the lowercase equivalent of a wide character
wctoutf8()
Convert a wide-character code into a UTF-8 character

These functions are defined in <utf8.h> (notice it isn't <photon/utf8.h>), and use UTF-8 encodings no matter what the current locale is. UTF8_LEN_MAX is defined to be the maximum number of bytes in a UTF-8 character.
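
Because <utf8.h> is specific to QNX, the sketch below uses portable stand-ins to illustrate the difference between the number of bytes in one UTF-8 character and the number of characters in a string. The functions sketch_utf8_char_bytes() and sketch_utf8_char_count() are hypothetical; they aren't the Photon functions, and their error handling may differ from what the library does.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in: number of bytes in the UTF-8 character that
 * starts at s, judged from its lead byte. Returns 0 if the byte can't
 * start a character. */
size_t sketch_utf8_char_bytes(const char *s)
{
    unsigned char c;

    assert(s != NULL);
    c = (unsigned char)*s;
    if (c < 0x80)
        return 1;                         /* 0xxxxxxx */
    if ((c & 0xE0) == 0xC0)
        return 2;                         /* 110xxxxx */
    if ((c & 0xF0) == 0xE0)
        return 3;                         /* 1110xxxx */
    if ((c & 0xF8) == 0xF0)
        return 4;                         /* 11110xxx */
    return 0;                             /* continuation or invalid byte */
}

/* Hypothetical stand-in: number of characters in a NUL-terminated UTF-8
 * string. Returns (size_t)-1 if the string is malformed or truncated. */
size_t sketch_utf8_char_count(const char *s)
{
    size_t count = 0;

    assert(s != NULL);
    while (*s != '\0') {
        size_t n = sketch_utf8_char_bytes(s);
        size_t i;

        if (n == 0)
            return (size_t)-1;
        for (i = 1; i < n; i++) {
            /* each trailing byte must look like 10xxxxxx */
            if (((unsigned char)s[i] & 0xC0) != 0x80)
                return (size_t)-1;
        }
        s += n;
        count++;
    }
    return count;
}
```

For the string "café" encoded in UTF-8, strlen() reports 5 bytes, while sketch_utf8_char_count() reports 4 characters.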

Other encodings

If your application needs to work with other character encodings, you'll need to convert to and from UTF-8. Character sets are defined in the file /usr/photon/translations/charsets.

The following translation functions are provided, and are described in the Photon Library Reference:

PxTranslateFromUTF()
Translate characters from UTF-8
PxTranslateList()
Create a list of all supported character translations
PxTranslateSet()
Install a new character-set translation
PxTranslateStateFromUTF()
Translate characters from UTF-8, using an internal state buffer
PxTranslateStateToUTF()
Translate characters to UTF-8, using an internal state buffer
PxTranslateToUTF()
Translate characters to UTF-8
PxTranslateUnknown()
Control how unknown encodings are handled

Note: These functions are supplied only in static form in the Photon library phexlib. The prototypes are in <photon/PxProto.h>.

Photon supports any Unicode-encoded TrueType font. However, Photon doesn't support complex-script languages such as Hebrew or Arabic. To support these languages, you must obtain a third-party font-rendering engine.

Keyboard drivers

The keyboard driver is table-driven; it handles any keyboard with 127 or fewer physical keys.

A keypress is stored in a structure of type PhKeyEvent_t (described in the Photon Library Reference).

Example: text widgets

The text widgets use the key_sym field for displayable characters. These widgets also check it to detect cursor movement. For example, if the content of the field is Pk_Left, the cursor is moved left. The key_sym is Pk_Left for both the left cursor key and the numeric keypad left cursor key (assuming NumLock is off).

Dead keys and compose sequences

QNX Neutrino supports “dead” keys and “compose” key sequences to generate key_syms that aren't on the keyboard. The key_sym field is valid only on a key press — not on a key release — to ensure that you get only one symbol, not two.

For example, if the keyboard has a dead accent key (such as `) and the user presses it followed by e, the key_sym is an “e” with a grave accent (è). If the e key isn't released and another group of keys (or further compose or dead-key sequences) is pressed, the key_syms have to be stacked for the final releases.

If an invalid key is pressed during a compose sequence, the keyboard drivers generate key_syms for all the intermediate keys, but not an actual press or release.

For a list of compose sequences, see “Photon compose sequences,” below.

Photon compose sequences

Photon comes equipped with standard compose sequences. If your keyboard doesn't include a character from the ISO 8859-1 (Latin-1) character set, you can generate the character using a compose sequence. For example, ó can be generated by pressing the Alt key, followed by the ' key, followed by the o key.


Note: These aren't keychords; press and release each key one after the other.

The following keys can be used for generating accented letters:

Key  Accent      Example sequence  Result
'    acute       Alt ' o           ó
,    cedilla     Alt , c           ç
^    circumflex  Alt ^ o           ô
>    circumflex  Alt > o           ô
"    diaeresis   Alt " o           ö
`    grave       Alt ` o           ò
/    slash       Alt / o           ø
~    tilde       Alt ~ n           ñ
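
A table-driven lookup is one plausible way to model these sequences. The sketch below encodes only the example rows from the table above; the structure, the table, and the function compose_lookup() are hypothetical and don't reflect how the Photon keyboard driver actually stores its tables.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical table entry: the accent key, the base letter, and the
 * Unicode value of the composed character. */
struct compose_entry {
    char accent;
    char base;
    unsigned int result;
};

/* Only the example sequences from the table above. */
static const struct compose_entry compose_table[] = {
    { '\'', 'o', 0x00F3 },   /* ó  acute      */
    { ',',  'c', 0x00E7 },   /* ç  cedilla    */
    { '^',  'o', 0x00F4 },   /* ô  circumflex */
    { '>',  'o', 0x00F4 },   /* ô  circumflex */
    { '"',  'o', 0x00F6 },   /* ö  diaeresis  */
    { '`',  'o', 0x00F2 },   /* ò  grave      */
    { '/',  'o', 0x00F8 },   /* ø  slash      */
    { '~',  'n', 0x00F1 },   /* ñ  tilde      */
};

/* Hypothetical lookup: return the composed character for the two keys
 * pressed after Alt, or 0 if the pair isn't in the table. */
unsigned int compose_lookup(char accent, char base)
{
    size_t i;

    for (i = 0; i < sizeof compose_table / sizeof compose_table[0]; i++) {
        if (compose_table[i].accent == accent &&
            compose_table[i].base == base)
            return compose_table[i].result;
    }
    return 0;
}
```

With this table, compose_lookup('\'', 'o') yields 0xF3 (ó), and a pair that isn't listed yields 0.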

If your keyboard doesn't have the following symbols, you can create them by pressing the Alt key, followed by the first key in the sequence, followed by the second key in the sequence.

Symbol  Description                                 Unicode value  Sequence
æ       small letter ae (ligature)                  E6             Alt e a
Æ       capital letter ae (ligature)                C6             Alt E A
Ð       capital letter eth                          D0             Alt D -
ð       small letter eth                            F0             Alt d -
ß       small letter sharp s (German scharfes s)    DF             Alt s s
µ       micro sign                                  B5             Alt / U
                                                                   Alt / u
þ       small letter thorn                          FE             Alt h t
Þ       capital letter thorn                        DE             Alt H T
#       number sign                                 23             Alt + +
@       commercial at                               40             Alt A A
©       copyright sign                              A9             Alt C 0
                                                                   Alt C O
                                                                   Alt C o
                                                                   Alt c 0
                                                                   Alt c O
                                                                   Alt c o
®       registered trademark sign                   AE             Alt R O
[       left square bracket                         5B             Alt ( (
]       right square bracket                        5D             Alt ) )
{       left curly bracket                          7B             Alt ( -
}       right curly bracket                         7D             Alt ) -
»       right-pointing double angle quotation mark  BB             Alt > >
«       left-pointing double angle quotation mark   AB             Alt < <
^       circumflex accent                           5E             Alt > space
'       apostrophe                                  27             Alt ' space
`       grave accent                                60             Alt ` space
|       vertical bar                                7C             Alt / ^
                                                                   Alt V L
                                                                   Alt v l
\       reverse solidus (backslash)                 5C             Alt / /
                                                                   Alt / <
~       tilde                                       7E             Alt - space
        no-break space                              A0             Alt space space
°       degree sign                                 B0             Alt 0 ^
¡       inverted exclamation mark                   A1             Alt ! !
¿       inverted question mark                      BF             Alt ? ?
¢       cent sign                                   A2             Alt C /
                                                                   Alt C |
                                                                   Alt c /
                                                                   Alt c |
£       pound sign                                  A3             Alt L -
                                                                   Alt L =
                                                                   Alt l -
                                                                   Alt l =
¤       currency sign                               A4             Alt X 0
                                                                   Alt X O
                                                                   Alt X o
                                                                   Alt x 0
                                                                   Alt x O
                                                                   Alt x o
¥       yen sign                                    A5             Alt Y -
                                                                   Alt Y =
                                                                   Alt y -
                                                                   Alt y =
¦       broken (vertical) bar                       A6             Alt ! ^
                                                                   Alt V B
                                                                   Alt v b
                                                                   Alt | |
§       section sign                                A7             Alt S !
                                                                   Alt S 0
                                                                   Alt S O
                                                                   Alt s !
                                                                   Alt s 0
                                                                   Alt s o
¨       diaeresis or umlaut                         A8             Alt " "
·       middle dot                                  B7             Alt . .
                                                                   Alt . ^
¸       cedilla                                     B8             Alt , space
                                                                   Alt , ,
¬       not sign                                    AC             Alt - ,
        soft hyphen                                 AD             Alt - -
¯       macron                                      AF             Alt - ^
                                                                   Alt _ ^
                                                                   Alt _ _
±       plus-minus sign                             B1             Alt + -
¹       superscript one                             B9             Alt 1 ^
                                                                   Alt S 1
                                                                   Alt s 1
²       superscript two                             B2             Alt 2 ^
                                                                   Alt S 2
                                                                   Alt s 2
³       superscript three                           B3             Alt 3 ^
                                                                   Alt S 3
                                                                   Alt s 3
¶       pilcrow sign (paragraph sign)               B6             Alt P !
                                                                   Alt p !
ª       feminine ordinal indicator                  AA             Alt A _
                                                                   Alt a _
º       masculine ordinal indicator                 BA             Alt O _
                                                                   Alt o _
¼       vulgar fraction one quarter                 BC             Alt 1 4
½       vulgar fraction one half                    BD             Alt 1 2
¾       vulgar fraction three quarters              BE             Alt 3 4
÷       division sign                               F7             Alt - :
×       multiplication sign                         D7             Alt x x