Appendix: Unicode Multilingual Support
This appendix describes how Photon handles international characters. It includes:
- Wide and multibyte characters
- Unicode
- UTF-8 encoding
- Conversion functions
- Other encodings
- Keyboard drivers
- Photon compose sequences
Photon is designed to handle international characters. Following the Unicode standard, Photon provides developers with the ability to create applications that can easily support the world's major languages and scripts.
Unicode is modeled on the ASCII character set, but uses a 32-bit encoding to support full multilingual text. There's no need for escape sequences or control codes when specifying any character in any language. Note that Unicode encoding conveniently treats all characters -- whether alphabetic, ideographs, or symbols -- in exactly the same way.
In designing the keyboard driver and the character handling mechanisms, we referred to the X11 keyboard extensions and ISO standards 9995 and 10646-1.
Wide and multibyte characters
ANSI C includes the following concepts:
- wide character
- A character represented as a value of type wchar_t, which typically is larger than a char.
- multibyte character
- A sequence of one or more bytes that represents a character, stored in a char array. The number of bytes depends on the character.
- wide-character string
- An array of wchar_t.
- multibyte string
- A sequence of multibyte characters stored in a char array.
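The difference between the two string types can be seen in a small sketch. It assumes that the multibyte encoding is UTF-8 (which this appendix describes later), and the variable and function names are hypothetical, chosen only for illustration:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* The character with code 0xE9 (e with an acute accent) as a multibyte
   string. The two bytes are its UTF-8 encoding (an assumption about the
   multibyte encoding), written as escapes so that the encoding of the
   source file doesn't matter. */
const char mb_e_acute[] = "\xC3\xA9";

/* The same character as a wide-character string: one wchar_t, followed by
   the terminating null wide character. */
const wchar_t wc_e_acute[] = { 0xE9, 0 };

/* Number of bytes in a multibyte string, not counting the terminator. */
size_t mb_byte_count(const char *s)
{
    assert(s != NULL);
    return strlen(s);
}

/* Number of elements in a wide-character string, not counting the
   terminator. */
size_t wide_char_count(const wchar_t *ws)
{
    assert(ws != NULL);
    return wcslen(ws);
}
```

Here the multibyte string occupies two bytes, while the wide-character string holds a single wchar_t for the same character.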
Unicode
Unicode is a 32-bit encoding scheme:
- Most international characters have codes below 0x10000, so their wide-character representations fit in two bytes per character.
- Codes below 128 define the same characters as the ASCII standard.
- Codes between 128 and 255 define the same characters as in the ISO 8859-1 character set.
- There's a private-use area from 0xE000 to 0xF7FF; Photon maps it as follows:

| Glyphs | Range |
|---|---|
| Nondisplayable keys | 0xF000 -- 0xF0FF |
| Cursor font | 0xE900 -- 0xE9FF |

For Unicode character values, see /usr/include/photon/PkKeyDef.h. For more information about Unicode, see the Unicode Consortium's website at www.unicode.org.
UTF-8 encoding
Formerly known as UTF-2, the UTF-8 (for "8-bit form") transformation format is designed to address the use of Unicode character data in 8-bit UNIX environments. Each Unicode value is encoded as a multibyte UTF-8 sequence.
Here are some of the main features of UTF-8:
- The UTF-8 representation of codes below 128 is the same as in the ASCII standard, so any ASCII string is also a valid UTF-8 string and represents the same characters.
- ASCII values don't otherwise occur in a UTF-8 transformation, giving complete compatibility with historical filesystems that parse for ASCII bytes.
- UTF-8 encodes the ISO 8859-1 character set as double-byte sequences.
- UTF-8 simplifies conversions to and from Unicode text.
- The first byte indicates the number of bytes to follow in a multibyte sequence, allowing for efficient forward parsing.
- Finding the start of a character from an arbitrary location in a byte stream is efficient, because you need to search at most four bytes backwards to find an easily recognizable initial byte. For example:

      isInitialByte = ((byte & 0xC0) != 0x80);

- UTF-8 is reasonably compact in terms of the number of bytes used for encoding.
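The initial-byte test in the list above is enough to resynchronize in a byte stream and to count characters. The following is a minimal sketch, assuming the input is valid UTF-8; the helper names are hypothetical and aren't part of any Photon library:

```c
#include <assert.h>
#include <stddef.h>

/* True for any byte that can start a character, i.e. every byte except the
   10XXXXXX bytes that continue a multibyte sequence. */
static int is_initial_byte(unsigned char byte)
{
    return (byte & 0xC0) != 0x80;
}

/* Hypothetical helper: return the offset of the first byte of the character
   that contains s[pos]. Assumes s holds valid UTF-8 and pos is in range. */
size_t utf8_char_start(const char *s, size_t pos)
{
    assert(s != NULL);
    while (pos > 0 && !is_initial_byte((unsigned char)s[pos])) {
        pos--;
    }
    return pos;
}

/* Hypothetical helper: count the characters in the first n bytes of s by
   counting the initial bytes. */
size_t utf8_count_chars(const char *s, size_t n)
{
    size_t i, count = 0;

    assert(s != NULL);
    for (i = 0; i < n; i++) {
        if (is_initial_byte((unsigned char)s[i])) {
            count++;
        }
    }
    return count;
}
```

For the four-byte string "a\xC3\xA9z", utf8_char_start() maps offset 2 (the second byte of the accented character) back to offset 1, and utf8_count_chars() reports three characters.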
The actual encoding is this:
- For multibyte encodings, the first byte sets 1 in a number of high-order bits equal to the number of bytes used in the encoding; the bit after that is set to 0. For example, a 2-byte sequence always starts with 110 in the first byte.
- For all subsequent bytes in a multibyte encoding, the first two bits are 10. The value of a trailing byte in a multibyte encoding is always greater than or equal to 0x80.
The following table shows the binary form of each byte of the encoding and the minimum and maximum values for the characters represented by 1-, 2-, 3-, and 4-byte encodings:
| Length | First byte | Following bytes | Min. value | Max. value |
|---|---|---|---|---|
| Single byte | 0XXXXXXX | N/A | 0x0000 | 0x007F |
| Two bytes | 110XXXXX | 10XXXXXX | 0x0080 | 0x07FF |
| Three bytes | 1110XXXX | 10XXXXXX | 0x0800 | 0xFFFF |
| Four bytes | 11110XXX | 10XXXXXX | 0x10000 | 0x10FFFF |

- The actual content of the multibyte encoding (i.e. the wide-character encoding) is the catenation of the XX bits in the encoding. A 2-byte encoding of 11011111 10000000 encodes the wide character 11111000000.
- Where there's more than one way to encode a value (such as 0), the shortest is the only legal value. The null character is always a single byte.
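The encoding rules above can be sketched as a pair of portable C functions. This is an illustration, not the library's implementation: the names utf8_encode() and utf8_decode() are hypothetical, and the sketch doesn't treat any range within 0 -- 0x10FFFF specially.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: encode the Unicode value cp into out (at least four
   bytes). Returns the number of bytes written, or 0 if cp is above
   0x10FFFF. */
size_t utf8_encode(unsigned long cp, unsigned char *out)
{
    assert(out != NULL);
    if (cp <= 0x7F) {                   /* 0XXXXXXX */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp <= 0x7FF) {                  /* 110XXXXX 10XXXXXX */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp <= 0xFFFF) {                 /* 1110XXXX 10XXXXXX 10XXXXXX */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {               /* 11110XXX and three 10XXXXXX */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}

/* Hypothetical helper: decode one character from the first n bytes of s
   into *cp. Returns the number of bytes consumed, or 0 if the input is
   truncated, malformed, or not the shortest encoding of its value. */
size_t utf8_decode(const unsigned char *s, size_t n, unsigned long *cp)
{
    /* Smallest value that legitimately needs 1, 2, 3, or 4 bytes. */
    static const unsigned long min_value[5] = { 0, 0, 0x80, 0x800, 0x10000 };
    unsigned long value;
    size_t len, i;

    assert(s != NULL && cp != NULL);
    if (n == 0) {
        return 0;
    }
    if (s[0] < 0x80)                { len = 1; value = s[0]; }
    else if ((s[0] & 0xE0) == 0xC0) { len = 2; value = s[0] & 0x1F; }
    else if ((s[0] & 0xF0) == 0xE0) { len = 3; value = s[0] & 0x0F; }
    else if ((s[0] & 0xF8) == 0xF0) { len = 4; value = s[0] & 0x07; }
    else return 0;                  /* trailing byte or invalid first byte */

    if (len > n) {
        return 0;
    }
    for (i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80) {
            return 0;               /* trailing bytes must be 10XXXXXX */
        }
        value = (value << 6) | (s[i] & 0x3F);
    }
    if (value < min_value[len] || value > 0x10FFFF) {
        return 0;                   /* overlong encoding or out of range */
    }
    *cp = value;
    return len;
}
```

Encoding 0x7C0 (binary 11111000000) yields the bytes 0xDF 0x80, matching the 2-byte example above, and the decoder rejects 0xC0 0x80 because it's a non-shortest encoding of 0.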
Conversion functions
In our C libraries, "wide characters" are assumed to be Unicode, and "multibyte" is UTF-8 in the default locale. The wchar_t type is defined as an unsigned 32-bit type, and wctomb() and mbtowc() implement the UTF-8 encoding in the default locale.
Multibyte characters in the C library are UTF-8 in the default locale; in different locales, multibyte characters might use a different encoding.
You can use the following functions (described in the QNX Neutrino Library Reference) for converting between wide-character and multibyte encodings:
- mblen()
- Count the bytes in a multibyte character
- mbtowc()
- Convert a multibyte character to a wide character
- mbstowcs()
- Convert a multibyte string to a wide-character string
- wctomb()
- Convert a wide character to its multibyte representation
- wcstombs()
- Convert a wide-character string to a multibyte string
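As a rough illustration of the string conversions listed above, the sketch below converts a multibyte string to wide characters and back using only standard C calls. The helper names and the MAX_CHARS limit are hypothetical. With plain ASCII input the result doesn't depend on the locale's multibyte encoding; other input is converted according to the current locale.

```c
#include <assert.h>
#include <limits.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

#define MAX_CHARS 64    /* arbitrary limit chosen for this sketch */

/* Hypothetical helper: count the characters in the multibyte string src by
   converting it to wide characters. Returns (size_t)-1 if src contains an
   invalid sequence or is too long for the buffer. */
size_t count_chars(const char *src)
{
    wchar_t wide[MAX_CHARS];
    size_t n;

    assert(src != NULL);
    n = mbstowcs(wide, src, MAX_CHARS);
    return (n == MAX_CHARS) ? (size_t)-1 : n;
}

/* Hypothetical helper: convert src to a wide-character string and back
   again. Returns 1 if the result matches the original, 0 otherwise. */
int round_trip_preserves(const char *src)
{
    wchar_t wide[MAX_CHARS];
    char back[MAX_CHARS * MB_LEN_MAX];
    size_t nchars, nbytes;

    assert(src != NULL);

    /* Multibyte -> wide. A result equal to MAX_CHARS means there was no
       room left for the terminator. */
    nchars = mbstowcs(wide, src, MAX_CHARS);
    if (nchars == (size_t)-1 || nchars == MAX_CHARS) {
        return 0;
    }

    /* Wide -> multibyte. */
    nbytes = wcstombs(back, wide, sizeof back);
    if (nbytes == (size_t)-1 || nbytes == sizeof back) {
        return 0;
    }
    return strcmp(src, back) == 0;
}
```

Both functions check for the (size_t)-1 error return, which mbstowcs() and wcstombs() use to report a sequence or character that can't be converted.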
Photon libraries use multibyte UTF-8 character strings: any function that handles strings should be able to handle a valid UTF-8 string, and functions that return a string can return a multibyte-character string. This also applies to widget resources. The graphics drivers and font server assume that all strings use UTF-8.
The main Photon library, ph, provides the following non-ANSI functions (described in the Photon Library Reference) for working with multibyte UTF-8 and wide characters:
- utf8len()
- Count the bytes in a UTF-8 character
- utf8strblen()
- Find the number of UTF-8 characters in part of a string
- utf8strchr()
- Search for a UTF-8 character in a string
- utf8strichr()
- Search for a UTF-8 character in a string, ignoring case
- utf8strirchr()
- Search backwards for a UTF-8 character in a string, ignoring case
- utf8strlen()
- Find the length of a UTF-8-character string
- utf8strnchr()
- Search for a UTF-8 character in part of a string
- utf8strncmp()
- Compare part of a UTF-8-character string
- utf8strndup()
- Create a copy of part of a UTF-8-character string
- utf8strnichr()
- Search for a UTF-8 character in part of a string, ignoring case
- utf8strnlen()
- Find the number of bytes used by a UTF-8-character string
- utf8strrchr()
- Search backwards for a UTF-8 character in a string
- utf8towc()
- Convert a UTF-8 character to a wide-character code
- wctolower()
- Return the lowercase equivalent of a wide character
- wctoutf8()
- Convert a wide-character code into a UTF-8 character
These functions are defined in <utf8.h> (notice it isn't <photon/utf8.h>), and use UTF-8 encodings no matter what the current locale is. UTF8_LEN_MAX is defined to be the maximum number of bytes in a UTF-8 character.
Other encodings
If your application needs to work with other character encodings, you'll need to convert to and from UTF-8. Character sets are defined in the file /usr/photon/translations/charsets, and include:
- Big5 (Chinese)
- Cyrillic (KOI8-R)
- Japanese (EUC)
- Japanese (Shift-JIS)
- Korean (EUC)
- Western (ISO 8859-1)
The following translation functions are provided, and are described in the Photon Library Reference:
- PxTranslateFromUTF()
- Translate characters from UTF-8
- PxTranslateList()
- Create a list of all supported character translations
- PxTranslateSet()
- Install a new character-set translation
- PxTranslateStateFromUTF()
- Translate characters from UTF-8, using an internal state buffer
- PxTranslateStateToUTF()
- Translate characters to UTF-8, using an internal state buffer
- PxTranslateToUTF()
- Translate characters to UTF-8
- PxTranslateUnknown()
- Control how unknown encodings are handled
These functions are supplied only in static form in the Photon library phexlib. The prototypes are in <photon/PxProto.h>.
Keyboard drivers
The keyboard driver is table-driven; it handles any keyboard with 127 or fewer physical keys.
A keypress is stored in a structure of type PhKeyEvent_t (described in the Photon Library Reference).
Example: text widgets
The text widgets use the key_sym field for displayable characters. These widgets also check it to detect cursor movement. For example, if the content of the field is Pk_Left, the cursor is moved left. The key_sym is Pk_Left for both the left cursor key and the numeric keypad left cursor key (assuming NumLock is off).
Dead keys and compose sequences
QNX Neutrino supports "dead" keys and "compose" key sequences to generate key_syms that aren't on the keyboard. The key_sym field is valid only on a key press -- not on a key release -- to ensure that you get only one symbol, not two.
For example, if the keyboard has a dead accent key (for example, `) and the user presses it followed by e, the key_sym is an "e" with a grave accent (è). If key_syms were also reported on release, and the e key were still held down while another group of keys (or further compose or dead-key sequences) was pressed, the key_syms would have to be stacked until the final releases.
If an invalid key is pressed during a compose sequence, the keyboard drivers generate key_syms for all the intermediate keys, but not an actual press or release.
For a list of compose sequences, see below.
Photon compose sequences
Photon comes equipped with standard compose sequences. If your keyboard doesn't include a character from the standard ASCII table, you can generate the character using a compose sequence. For example, ó can be generated by pressing the Alt key, followed by the ' key, followed by the o key.
These aren't keychords; press and release each key one after the other.
The following keys can be used for generating accented letters:
| Key | Accent | Example sequence | Result |
|---|---|---|---|
| ' | acute | Alt ' o | ó |
| , | cedilla | Alt , c | ç |
| ^ | circumflex | Alt ^ o | ô |
| > | circumflex | Alt > o | ô |
| " | diaeresis | Alt " o | ö |
| ` | grave | Alt ` o | ò |
| / | slash | Alt / o | ø |
| ~ | tilde | Alt ~ n | ñ |
If your keyboard doesn't have the following symbols, you can create them by pressing the Alt key, followed by the first key in the sequence, followed by the second key in the sequence.
| Symbol | Description | Unicode value | Sequence |
|---|---|---|---|
| æ | small letter ae (ligature) | E6 | Alt e a |
| Æ | capital letter ae (ligature) | C6 | Alt E A |
| Ð | capital letter eth | D0 | Alt D - |
| ð | small letter eth | F0 | Alt d - |
| ß | small letter sharp s (German scharfes s) | DF | Alt s s |
| µ | micro sign | B5 | Alt / U |
| | | | Alt / u |
| þ | small letter thorn | FE | Alt h t |
| Þ | capital letter thorn | DE | Alt H T |
| # | number sign | 23 | Alt + + |
| @ | commercial at | 40 | Alt A A |
| © | copyright sign | A9 | Alt C 0 |
| | | | Alt C O |
| | | | Alt C o |
| | | | Alt c 0 |
| | | | Alt c O |
| | | | Alt c o |
| ® | registered trademark sign | AE | Alt R O |
| [ | left square bracket | 5B | Alt ( ( |
| ] | right square bracket | 5D | Alt ) ) |
| { | left curly bracket | 7B | Alt ( - |
| } | right curly bracket | 7D | Alt ) - |
| » | right-pointing double angle quotation mark | BB | Alt > > |
| « | left-pointing double angle quotation mark | AB | Alt < < |
| ^ | circumflex accent | 5E | Alt > space |
| ' | apostrophe | 27 | Alt ' space |
| ` | grave accent | 60 | Alt ` space |
| \| | vertical bar | 7C | Alt / ^ |
| | | | Alt V L |
| | | | Alt v l |
| \ | reverse solidus (backslash) | 5C | Alt / / |
| | | | Alt / < |
| ~ | tilde | 7E | Alt - space |
| | no-break space | A0 | Alt space space |
| ° | degree sign | B0 | Alt 0 ^ |
| ¡ | inverted exclamation mark | A1 | Alt ! ! |
| ¿ | inverted question mark | BF | Alt ? ? |
| ¢ | cent sign | A2 | Alt C / |
| | | | Alt C \| |
| | | | Alt c / |
| | | | Alt c \| |
| £ | pound sign | A3 | Alt L - |
| | | | Alt L = |
| | | | Alt l - |
| | | | Alt l = |
| ¤ | currency sign | A4 | Alt X 0 |
| | | | Alt X O |
| | | | Alt X o |
| | | | Alt x 0 |
| | | | Alt x O |
| | | | Alt x o |
| ¥ | yen sign | A5 | Alt Y - |
| | | | Alt Y = |
| | | | Alt y - |
| | | | Alt y = |
| ¦ | broken (vertical) bar | A6 | Alt ! ^ |
| | | | Alt V B |
| | | | Alt v b |
| | | | Alt \| \| |
| § | section sign | A7 | Alt S ! |
| | | | Alt S 0 |
| | | | Alt S O |
| | | | Alt s ! |
| | | | Alt s 0 |
| | | | Alt s o |
| ¨ | diaeresis or umlaut | A8 | Alt " " |
| · | middle dot | B7 | Alt . . |
| | | | Alt . ^ |
| ¸ | cedilla | B8 | Alt , space |
| | | | Alt , , |
| ¬ | not sign | AC | Alt - , |
| | soft hyphen | AD | Alt - - |
| ¯ | macron | AF | Alt - ^ |
| | | | Alt _ ^ |
| | | | Alt _ _ |
| ± | plus-minus sign | B1 | Alt + - |
| ¹ | superscript one | B9 | Alt 1 ^ |
| | | | Alt S 1 |
| | | | Alt s 1 |
| ² | superscript two | B2 | Alt 2 ^ |
| | | | Alt S 2 |
| | | | Alt s 2 |
| ³ | superscript three | B3 | Alt 3 ^ |
| | | | Alt S 3 |
| | | | Alt s 3 |
| ¶ | pilcrow sign (paragraph sign) | B6 | Alt P ! |
| | | | Alt p ! |
| ª | feminine ordinal indicator | AA | Alt A _ |
| | | | Alt a _ |
| º | masculine ordinal indicator | BA | Alt O _ |
| | | | Alt o _ |
| ¼ | vulgar fraction one quarter | BC | Alt 1 4 |
| ½ | vulgar fraction one half | BD | Alt 1 2 |
| ¾ | vulgar fraction three quarters | BE | Alt 3 4 |
| ÷ | division sign | F7 | Alt - : |
| × | multiplication sign | D7 | Alt x x |
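
One way to picture the table above is as a lookup from the two keys that follow Alt to a Unicode value. The sketch below is purely illustrative and isn't how the Photon keyboard driver is implemented: the structure, the function name, and the handful of entries (copied from the table) are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical table entry: the two keys pressed after Alt, and the Unicode
   value that the sequence produces. */
struct compose_entry {
    char first;
    char second;
    unsigned int value;
};

/* A few sequences taken from the table above. */
static const struct compose_entry compose_table[] = {
    { 'e', 'a', 0xE6 },   /* small letter ae (ligature) */
    { 's', 's', 0xDF },   /* small letter sharp s */
    { 'C', 'O', 0xA9 },   /* copyright sign */
    { 'c', 'o', 0xA9 },   /* copyright sign, lowercase variant */
    { '1', '2', 0xBD },   /* vulgar fraction one half */
    { '/', '/', 0x5C },   /* reverse solidus (backslash) */
    { '+', '-', 0xB1 }    /* plus-minus sign */
};

/* Hypothetical helper: keys points to the two characters typed after Alt.
   Returns the Unicode value for the sequence, or 0 if the sequence isn't in
   this sketch's table. */
unsigned int compose_lookup(const char *keys)
{
    size_t i;
    size_t count = sizeof compose_table / sizeof compose_table[0];

    assert(keys != NULL);
    for (i = 0; i < count; i++) {
        if (compose_table[i].first == keys[0] &&
            compose_table[i].second == keys[1]) {
            return compose_table[i].value;
        }
    }
    return 0;
}
```

The lookup is case-sensitive because the table is: Alt C O and Alt c o are listed as separate sequences that happen to produce the same symbol.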