Multibyte Conversions


Multibyte Sequences in C · Multibyte Sequences in C++ · Wide-Character Limitations · Wide-Character Encodings · Multibyte Encodings

Multibyte Sequences in C

You can represent a character two different ways within a C (or C++) program: as a multibyte sequence of one or more bytes (elements of type char), or as a wide character, a single element of type wchar_t.

Moreover, a multibyte sequence can have a state-dependent encoding, where one or more byte sequences change the interpretation of the byte sequences that follow. Each such sequence of multibyte characters is presumed to begin in an initial shift state, and it can be brought back to the initial shift state by a suitable homing sequence (which may be state dependent).

As a special case, certain characters can be represented as single-byte sequences in the initial shift state. Such characters form the single-byte character set from traditional C. They are thus candidates for testing and manipulation by the functions defined in the Standard C header <ctype.h>. And they are the characters you write to, and read from, byte streams, using the functions defined in the Standard C header <stdio.h>.

The Standard C library defines a number of functions for converting between multibyte and wide-character representations. See, for example, the functions mbtowc and wctomb, defined in the header <stdlib.h>. The specific conversion rule is implementation-defined. It may be possible to change the rule globally by a call to setlocale, but no standards currently exist for specifying how to do so. Whatever conversion rule applies, however, it must obey several constraints spelled out in the C Standard:

- each character in the basic C character set is represented as a single-byte sequence,
- a byte with all bits zero is interpreted as a nul character, regardless of shift state, and
- a byte with all bits zero never occurs as part of the sequence for any other multibyte character.

Not all multibyte encodings obey all these rules. For example, the last rule is broken by a multibyte encoding that represents each two-byte wide character by its less-significant byte followed by its more-significant byte. Such a little-endian sequence is easy to generate and useful for communicating values between different computer architectures, but it can contain any number of bytes with zero value that are not nul characters. Thus, this encoding is not suitable as an implementation of, say, the function mbtowc. (The encoding that puts the more-significant byte first is called a big-endian sequence, naturally enough.)
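
By way of illustration, here is a minimal sketch (not code from this library) that round-trips one character through mbtowc and wctomb; exactly which bytes flow through it depends on the implementation-defined conversion rule currently in effect:

#include <climits>   // MB_LEN_MAX
#include <clocale>   // setlocale
#include <cstdio>
#include <cstdlib>   // mbtowc, wctomb, MB_CUR_MAX

int main()
{
    std::setlocale(LC_ALL, "");   // adopt the native conversion rule, whatever it is

    wchar_t wc;
    int in = std::mbtowc(&wc, "A", MB_CUR_MAX);   // multibyte to wide

    char buf[MB_LEN_MAX];
    int out = std::wctomb(buf, wc);               // wide back to multibyte

    std::printf("%d byte(s) in, %d byte(s) out\n", in, out);
    return 0;
}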

The C Standard recognizes the need for generalized multibyte sequences, which may break one or more of the rules above, for communicating sequences of characters between programs. The Standard C header <wchar.h> defines a host of functions, such as fwscanf and fwprintf, for reading and writing wide streams. You write a sequence of wide characters to a wide stream and the stream converts the wide characters in the program to generalized multibyte sequences in the external stream. Similarly, you read a sequence of wide characters from a wide stream and the stream converts generalized multibyte sequences in the external stream to a sequence of wide characters in the program. (Wide streams otherwise behave very much like the more traditional byte streams from the earliest days of C.)
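
For example, here is a minimal sketch (the file name is purely illustrative) of writing a wide stream with the C functions; the wide characters are converted to generalized multibyte sequences on their way to the file:

#include <cstdio>   // fopen, fclose
#include <cwchar>   // fwprintf

int main()
{
    std::FILE *fp = std::fopen("hello.txt", "w");
    if (fp == 0)
        return 1;

    // The first wide operation orients the stream as a wide stream.
    std::fwprintf(fp, L"%ls\n", L"hello, wide world");
    std::fclose(fp);
    return 0;
}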

Once again, however, no standards exist for specifying the multibyte encoding rule for converting between a wide-character sequence within the program and the generalized multibyte sequences in the external file or data stream.

Multibyte Sequences in C++

The Standard C++ library incorporates all of the Standard C library, at least through 1998. Thus, it contains all the functions outlined above for converting between wide characters and multibyte sequences and for reading and writing wide streams. Moreover, it extends the traditional iostreams classes to include wide-stream as well as byte-stream operations. For example, you traditionally read a byte stream by extracting bytes (elements of type char) from the object cin, which has type istream. In Standard C++, cin has type basic_istream<char, char_traits<char> > and istream is just a synonym (type definition) for this type. To read the standard input as a wide stream, you extract wide characters (elements of type wchar_t) from the object wcin, which has type basic_istream<wchar_t, char_traits<wchar_t> >. (The type definition wistream is a synonym for this type.) The wide stream responds by reading bytes from the actual stream and composing them, by some rule, into the wide character stream.

Similarly, you can write a byte stream by inserting bytes into the object cout, which has type ostream. In Standard C++, cout has type basic_ostream<char, char_traits<char> > and ostream is just a synonym for this type. To write the standard output as a wide stream, you insert wide characters (elements of type wchar_t) into the object wcout, which has type basic_ostream<wchar_t, char_traits<wchar_t> >. (The type definition wostream is a synonym for this type.) The wide stream responds by writing bytes to the actual stream, composing them, by some rule, from the wide character stream.
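
As a minimal sketch, a filter that reads the standard input as a wide stream and echoes it to the standard output as a wide stream looks like this:

#include <iostream>

int main()
{
    wchar_t wc;
    while (std::wcin.get(wc))   // bytes in, composed into wide characters
        std::wcout.put(wc);     // wide characters out, decomposed into bytes
    return 0;
}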

What multibyte encoding rule do these wide-stream objects apply? That depends, in principle at least, on the locale associated with each wide-stream object. The Standard C function setlocale alters the global behavior of the Standard C library, but C++ provides for greater encapsulation. C++ lets you deal with multiple locales within a program all at the same time. You can construct an object of class locale in a variety of ways. One way is to use a locale name acceptable to setlocale, to make an object that presumably encapsulates the same locale-dependent library behavior as in C. You then associate a locale object with a wide-stream object by calling the member function imbue, as in:

std::locale loc("en_US");  // US English locale
std::wifstream mystr;

mystr.imbue(loc);   // imbue before opening, so the facet is in place from the start
mystr.open("file.txt", std::ios_base::binary);
if (!mystr.is_open())
    throw "open failed";

Henceforth, you can extract wide characters from mystr. The wide stream generates each wide character from a generalized multibyte sequence read from the file file.txt, using a conversion rule that is presumably determined by the locale named en_US.
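
Continuing the sketch above, a simple read loop might look like this (std::getline here is the wide-string overload declared in <string>):

std::wstring line;
while (std::getline(mystr, line))   // each wide character was converted
    std::wcout << line << L'\n';    // from a multibyte sequence in file.txt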

But to belabor the point, no standards exist for specifying the multibyte conversion rule associated with en_US or any other named locale. An implementation may offer a rich set of well documented locales, or it may offer nothing beyond the required "C" locale. It may provide for multiple multibyte encoding rules, or it may apply just one rule universally. Put simply, you cannot in general depend on the availability of predefined locales to supply the multibyte conversion rule(s) you need.

The Standard C++ library offers a more deterministic approach, however. It may not say how to specify the behavior of a locale, in general, but it does specify quite a bit of the behavior of a locale object. A locale object encapsulates references to a couple of dozen locale facets, each of which encapsulates in turn some aspect of locale-dependent library behavior. The facet codecvt<wchar_t, char, mbstate_t>, in particular, performs conversions between wide characters and generalized multibyte sequences. Given a codecvt facet of the appropriate flavor, you can construct a locale object that does what you want and imbue it into the wide stream object(s) you use to read and write files as you desire.

Say, for example, you have a definition for the template class Dinkum::codecvt::codecvt_utf8<Elem> that implements the multibyte encoding rule you want to use for a stream with element type Elem. (Elem is typically wchar_t for a wide stream.) Replace the declaration of loc above with:

std::locale loc(std::locale::classic(),
    new Dinkum::codecvt::codecvt_utf8<wchar_t>);

and you have the locale object you need to imbue into one or more wide stream objects. If your compiler balks at this form, and you are using the Dinkum C++ Library, try the sturdier substitute:

std::locale loc = _ADDFAC(std::locale::classic(),
    new Dinkum::codecvt::codecvt_utf8<wchar_t>);

which works with older compilers as well as newer ones.

This document describes a collection of template classes that can serve as code-conversion facets and how you can use them. Each implements a different multibyte conversion rule. Please note, however, that each of these template classes implements a conversion between two encodings. It is not enough to decide that you want to read a file containing, say, UTF-8 encoded characters. You also have to know what set of wide-character codes you can convert it to. And you need to know what options are available to you with a particular C++ implementation.

Wide-Character Limitations

The compiler imposes two important constraints on the wide-character encodings you can use in a C or C++ program:

- the size of type wchar_t, which limits the range of wide-character codes you can store, and
- the code values the compiler assigns to any wide-character and wide-string literals in the program.

To a lesser extent, it also matters how the library defines:

- the macro WEOF, and
- the macros WCHAR_MIN and WCHAR_MAX.

Size matters the most. A large number of C and C++ compilers represent type wchar_t as a one-, two-, or four-byte integer. Those sizes usually translate into 8-, 16-, or 32-bit representations. The code-conversion facets described here are all designed to work properly if wchar_t is larger or smaller than required. For a value too large to convert, in either direction, you can usually instruct the facet either to truncate the result or to report a conversion error. A wide-character result smaller than a wchar_t value is padded with high-order zero bits.
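
A quick way to see what a given implementation provides is to print the size and limits of wchar_t (a throwaway probe, not part of the library):

#include <cstdio>
#include <cwchar>   // WCHAR_MIN, WCHAR_MAX

int main()
{
    std::printf("sizeof(wchar_t) = %u byte(s)\n", (unsigned)sizeof(wchar_t));
    std::printf("WCHAR_MIN = %ld\n", (long)WCHAR_MIN);
    std::printf("WCHAR_MAX = %lu\n", (unsigned long)WCHAR_MAX);
    return 0;
}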

It is not strictly necessary to convert between generalized multibyte sequences and elements of type wchar_t, by the way. You can specialize template class basic_istream and friends for element types other than char and wchar_t. The code-conversion facets described here are all designed to work properly for an integer element type other than wchar_t. So you can write:

std::locale loc = _ADDFAC(std::locale::classic(),
    new Dinkum::codecvt::codecvt_utf8<unsigned long>);
std::basic_ifstream<unsigned long> mystr;

mystr.imbue(loc);
mystr.open("file.txt", std::ios_base::binary);
if (!mystr.is_open())
    throw "open failed";

and traffic in elements of type unsigned long within the program.

Be warned, however, that single-element inserters and extractors will not work properly with elements of most integer types other than char and wchar_t. If you write:

unsigned long ch;
mystr >> ch;

the extractor will not extract a single element and store it in ch. Rather, it will skip white space, then read a sequence of decimal digits and convert them to an integer value to store in ch. Moreover, the extractor will expect the imbued locale loc to contain a facet of type ctype<unsigned long>, to test for white space. You will have to supply your own version, which may or may not be easy.

In principle, you can specialize a stream on a user-defined type that you supply, not just on an arithmetic type. But be warned that not all Standard C++ libraries are this flexible, and those that are may have different requirements.

If you choose to work with streams with elements other than type char or wchar_t, you should extract elements from an input stream only by calling read, which performs no checking on the value transmitted, as in:

if (!mystr.read(&ch, 1))
    throw "unexpected end of file";

Similarly, you should insert elements into an output stream only by calling write.
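
A parallel sketch for output (the stream and file name are again purely illustrative, and the same char_traits caveats apply):

std::basic_ofstream<unsigned long> outstr;

outstr.imbue(loc);
outstr.open("copy.txt", std::ios_base::binary);
if (!outstr.is_open())
    throw "open failed";

if (!outstr.write(&ch, 1))   // transmit one element, value unchecked
    throw "write failed";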

Once you settle on a wide-character encoding of a representable size, you then have to determine how well it interacts with code generated by the C or C++ compiler. If you use conventional wide streams, with elements of type wchar_t, you have the maximum freedom to use all the inserters and extractors defined by the Standard C++ library. If the program contains wide-character and wide-string literals, they should probably agree with the encoding you choose. Otherwise, you (and succeeding maintainers) will encounter any number of surprises. You can in principle write literals that contain arbitrary wide characters. If you do, the wide-character encoding you use had better exactly match what the compiler presumes. A good coding style, however, is to use just characters from the basic C character set in literals. Then any wide-character encoding that agrees with this subset of values is a safe candidate. (For example, many wide-character encodings use the same code values for the basic C character set as ASCII, a.k.a. ISO 646, and the ISO 8859 family.)

An even safer alternative is to use no wide-character or wide-string literals at all, at least in the parts of a program that need to be maximally flexible. That avoids most potential problems, but not necessarily all.

You may still have to worry about the value of the macro WEOF. It is used throughout both the C and C++ libraries as an end-of-stream indicator, often in a context where you might otherwise expect a wide-character code. Wherever possible, a good implementation will choose a value that cannot be mistaken for a valid code. (The macro EOF is often defined as -1 so that it can never be confused with any of the single-byte codes, each of which is represented as a non-negative value.) This is not possible if the representation of wint_t has no more bits than that for wchar_t. In such a case, the implementation must at least choose a value for WEOF that is invalid as a wide-character code. A common value is (wchar_t)(-1) which has all bits set in an unsigned representation. Many wide-character encodings reserve this value as invalid, but not all.
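
To see why the value matters, consider the classic read loop, sketched here with the C functions from <wchar.h>:

#include <cstdio>
#include <cwchar>   // fgetwc, fputwc, wint_t, WEOF

// Copy a wide stream. WEOF doubles as the end-of-stream indicator,
// so an encoding that uses (wchar_t)WEOF as a valid code cannot pass
// that code through this loop.
void copy_wide(std::FILE *in, std::FILE *out)
{
    std::wint_t c;
    while ((c = std::fgetwc(in)) != WEOF)
        std::fputwc((wchar_t)c, out);
}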

If you choose an encoding that permits all code values, expect problems when you read the value (wchar_t)WEOF from a file. It will almost always be mistaken for an end-of-file indication from lower-level code. You will face similar problems when you write this value to a file. It will almost always be mistaken for a write-error indication from lower-level code. Once again, your safest bet is to extract elements only by calling read and insert elements only by calling write, as described above. These member functions test only the number of elements read or written, without inspecting any values. They are the only such functions that transmit element values transparently.

Some implementations choose a signed-integer representation for wchar_t. In this case, the macro WCHAR_MIN is less than zero. (It must be zero for an unsigned-integer representation.) The code-conversion facets presented here all treat wide characters as non-negative codes. They assume that it is safe to store in a wchar_t object all code values in the range [0, WCHAR_MAX - WCHAR_MIN], and that the value will be recovered if cast to a suitably large unsigned-integer type. In the common case where the computer represents negative numbers in two's complement, with quiet wraparound on overflow, these assumptions are safe. But beware of a representation that has a negative zero, particularly if it sometimes collapses to positive zero. And beware of a representation that traps on apparent integer overflow when converting from unsigned to signed. Both can cause trouble for wide-character encodings that the compiler does not anticipate.
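
A sketch of the kind of mapping these facets rely on (the helper name is mine, not the library's), valid under the two's-complement assumptions just described:

#include <cwchar>   // WCHAR_MIN, WCHAR_MAX

// Map a possibly negative wchar_t value to the non-negative code it stores.
// The mask covers the whole range [0, WCHAR_MAX - WCHAR_MIN]; the casts are
// performed first so the subtraction cannot overflow a signed type.
inline unsigned long to_code(wchar_t wc)
{
    const unsigned long mask =
        (unsigned long)WCHAR_MAX - (unsigned long)WCHAR_MIN;
    return (unsigned long)wc & mask;
}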

Wide-Character Encodings

One-Byte Wide-Character Encodings

A number of character-set encodings fit neatly in a single eight-bit byte. Many of these are based on ASCII, or ISO 646, which defines code values in the range [0, 127]. The character set ISO 8859-1 extends this encoding by defining the remaining codes, in the range [128, 255]. Variations on this popular set exist for several other European alphabets, such as ISO 8859-7 for Greek.

Microsoft Windows and other systems implement a large number of code pages, each of which effectively defines a mapping between multibyte and wide-character encodings. Many of these code pages simply assign one-byte codes to a selection of characters from a larger character set.

An implementation based on one of these eight-bit character-set encodings may well be content to define wchar_t as one of the one-byte character types (char, signed char, or unsigned char). And it will likely adopt the same character-set encoding for both its single-byte and wide-character encoding. Another approach is to allow a broader range of wide-character codes, but to permit wide-character conversions only when a single-byte equivalent is defined. This is a good way to reconcile the numerous ISO 8859-x single-byte encodings, or Windows single-byte code pages, with an unambiguous subset of Unicode (UCS-2) wide-character codes.

A widely used character-set encoding outside the ISO family is EBCDIC. It has been used for several decades by IBM and still has a presence. EBCDIC is also an eight-bit encoding, with the added virtue that all its Unicode equivalents (the ISO 8859-1 subset of UCS-2) also fit in a single byte, so it is another candidate for either a one-byte wide-character encoding or a subset of Unicode with only single-byte multibyte codes.

Two-Byte Wide-Character Encodings

The Japanese were among the first to face the problem of representing character sets whose elements number in the thousands. Over the past couple of decades, a series of Japanese Industrial Standards have evolved to represent a mix of Kanji, Hiragana, and Katakana characters for Japanese, plus the Western characters traditionally used with computers. A widely used encoding is JIS X0208, which uses 16 bits to represent a subset of Kanji plus these other alphabets. As usual, ISO 646 characters form a subset of this larger character set. Thus, JIS X0208 is often used in conjunction with a two-byte representation of wchar_t, usually declared as either short or unsigned short.

Unicode is probably the most widely known encoding for larger character sets. It is maintained by the private Unicode Consortium (see http://www.unicode.org), but has been kept pretty closely in sync with the evolution of ISO 10646, a.k.a. UCS. Both have the virtue of overlapping neatly with ISO 8859-1. And both are serious attempts to provide a single, unified representation for all the characters used throughout the world -- past, present, and future (including Klingon). Unicode got a real boost when it was adopted as the character set encoding for Java, which strives for a high level of portability and international support.

That's the good news. What muddies the picture is that Unicode has been subsetted in ways that now prove to be short-sighted. The full ISO 10646 specification sets aside 31 bits to represent upwards of two billion different characters (UCS-4). These codes can be represented nicely as non-negative values in a 32-bit integer, either signed or unsigned. But until fairly recently, all the defined code values could be represented in 16 bits (UCS-2). Java fixated on a 16-bit representation for the basic Java type char, which is equivalent to the C or C++ type wchar_t. Less rigidly, but for similar reasons, a number of C and C++ implementations have chosen to represent wide characters using the 16-bit subset of Unicode stored in a two-byte wchar_t.

But the set of code values has recently been extended. Currently, all defined characters in ISO 10646 or Unicode fall in the range [0, 0x10FFFF]. As a palliative, some people are proposing the use of UTF-16 as a wide-character encoding. UTF-16 involves a bit of trickery. All the codes above 0xFFFF can be represented in 20 bits. These 20 bits can then be divided into two ten-bit pieces and stuffed into a pair of 16-bit codes that occupy holes left in the range [0, 0xFFFF]. Thus UTF-16 provides a way to represent all the currently defined Unicode characters as either one or two 16-bit words.
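
The arithmetic involved is simple enough to show in a few lines (a sketch, not code from this library):

// Split a code value in the range [0x10000, 0x10FFFF] into a UTF-16 surrogate pair.
void to_surrogates(unsigned long code, unsigned short &hi, unsigned short &lo)
{
    code -= 0x10000;                                // now a 20-bit value
    hi = (unsigned short)(0xD800 + (code >> 10));   // leading (high) surrogate
    lo = (unsigned short)(0xDC00 + (code & 0x3FF)); // trailing (low) surrogate
}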

It is important to emphasize that UTF-16 is neither a proper generalized multibyte encoding, nor a proper multibyte encoding, nor a proper wide-character encoding, in C or C++ terms. It can be made into a generalized multibyte encoding simply by adding rules for specifying the order of individual bytes for the two- or four-byte codes (UTF-16LE for little-endian, UTF-16BE for big-endian, or UTF-16 with a header code that signals the endianness of the codes that follow.) It cannot be made into a multibyte encoding because it contains embedded bytes with zero value. It can be made into a wide-character encoding only by ignoring a fundamental principle -- every character is supposed to be representable as a single element of fixed size. Nevertheless, more than one group has expressed an interest in using UTF-16 as a kind of bastard wide-character encoding, choosing to break a rule for some currently lesser-used characters rather than face the fallout from changing the size of a basic character type.

Four-Byte Wide-Character Encodings

A four-byte representation for wchar_t has the obvious virtue that it can store all elements of the largest character sets currently under consideration. (The costs of the extra storage required are still being debated.) ISO 10646 is an obvious candidate for wide-character encoding in this case. But even here you can find several subsetting choices. The range of valid character codes can be assumed to be:

- [0, 0xFFFF], for the original 16-bit subset (UCS-2),
- [0, 0x10FFFF], for all the code values currently defined by Unicode and representable in UTF-16, or
- [0, 0x7FFFFFFF], for the full 31-bit range of ISO 10646 (UCS-4).

Such range issues will become more apparent when examining the choice of multibyte conversion rules that can be used with each wide-character encoding.

Multibyte Encodings

As you may have gathered from the dozens of code-conversion facets in this library, there are many multibyte encodings. A large number are jiggered to survive transmission via text files, but not all. Some are designed to be economical of storage, using shorter byte sequences for the more frequently used wide-character codes, but not all. Some are designed to permit easy translation to and from wide-character codes, but not all. The one common denominator of the code-conversion facets presented here is that every one translates a multibyte encoding that was created for a good commercial reason.

If you don't recognize an encoding supported by this library, chances are that you have no (current) need for it. If you see one that you need, chances are that this implementation will do what you want. If you want to learn more about any of the mappings implemented in this library, chances are that the source code will supply more precise details. And if you want to add your own code-conversion facets, chances are that one of the ones in this library will serve as background information and inspiration.


See also the Table of Contents and the Index.

Copyright © 1992-2013 by Dinkumware, Ltd. All rights reserved.