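You can represent a character two different ways within a C (or C++) program: as a multibyte sequence, stored in one or more elements of type char, or as a wide character, stored in a single element of type wchar_t.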
Moreover, a multibyte sequence can have a state-dependent encoding, where one or more byte sequences change the interpretation of the byte sequences that follow. Every sequence of such multibyte sequences is presumed to begin in an initial shift state, and can be brought back to the initial shift state by a suitable homing sequence (which may be state dependent).
As a special case, certain characters can be represented as
single-byte sequences in the initial shift state. Such characters form the single-byte
character set from traditional C. They are thus candidates for
testing and manipulation by the functions defined in the Standard
C header <ctype.h>. And they are the characters you write to, and read from,
byte streams using the functions defined in the Standard C header <stdio.h>.
The Standard C library defines a number of functions for
converting between multibyte and wide-character representations.
See, for example, the functions mbtowc and
wctomb, defined in the header <stdlib.h>.
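As a minimal sketch of how these conversions are called, the function below pushes a single character through mbtowc and back through wctomb. The name round_trips is hypothetical, and the sketch assumes the program is still running in the default "C" locale with a non-nul character from the basic C character set.

```cpp
#include <cassert>
#include <climits>
#include <cstdlib>

// Converts one character to a wide character and back again, returning
// true if the round trip reproduces the original single byte.
// (round_trips is a hypothetical helper, not part of any library.)
bool round_trips(char c) {
    assert(c != '\0');  // a nul character makes mbtowc return 0, not 1

    // Calling with a null pointer resets any internal shift state.
    std::mbtowc(nullptr, nullptr, 0);
    std::wctomb(nullptr, 0);

    wchar_t wc = 0;
    int in_len = std::mbtowc(&wc, &c, 1);  // multibyte -> wide
    if (in_len != 1)
        return false;                      // not a single-byte sequence

    char buf[MB_LEN_MAX];
    int out_len = std::wctomb(buf, wc);    // wide -> multibyte
    return out_len == 1 && buf[0] == c;
}
```

A call such as round_trips('x') should succeed for any member of the basic C character set, whatever conversion rule the implementation applies.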
The specific conversion rule is implementation defined. It may
be possible to change the rule globally by a call to setlocale,
but no standards currently exist for specifying how to do so.
Whatever conversion rule applies, however, it must obey several
constraints spelled out in the C Standard:
'x', where x is any member of the basic C character set, must correspond to the (only) byte of the single-byte sequence that represents x
L'x' must correspond to the wide character that represents x
'\0' is represented by the single-byte sequence whose (only) byte has the value zero
L'\0' is represented by the wide character that has the value zero.
Not all multibyte encodings obey all these rules. For example,
the last rule is broken by a multibyte encoding that represents
each two-byte wide character by its less-significant byte followed
by its more-significant byte. Such a little-endian encoding
is easy to generate and useful for communicating values between
different computer architectures, but it can contain any number
of bytes with zero value that are not nul characters.
Thus, this encoding is not suitable as an implementation of, say,
wctomb. (The encoding that puts the more-significant byte first is called a
big-endian encoding.)
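To make the problem concrete, here is a small sketch that writes each two-byte code with its less-significant byte first. The helper names are hypothetical, and char16_t merely stands in for a two-byte wide character.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Serializes each 16-bit code as two bytes, less-significant byte first.
// (Hypothetical helper; char16_t stands in for a two-byte wide character.)
std::string to_little_endian(const std::u16string& codes) {
    std::string bytes;
    for (char16_t code : codes) {
        bytes.push_back(static_cast<char>(code & 0xFF));         // low byte
        bytes.push_back(static_cast<char>((code >> 8) & 0xFF));  // high byte
    }
    assert(bytes.size() == 2 * codes.size());
    return bytes;
}

// Counts the zero-valued bytes in a byte sequence.
std::size_t count_zero_bytes(const std::string& bytes) {
    std::size_t n = 0;
    for (char b : bytes)
        if (b == '\0')
            ++n;
    return n;
}
```

For the two codes in u"AB", the result is the four bytes 'A', 0, 'B', 0. Any function that treats a zero byte as a terminator would stop after the first byte.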
The C Standard recognizes the need for
generalized multibyte sequences,
which may break one or more of the rules above,
for communicating sequences of characters between programs. The Standard
C header <wchar.h> declares a host of functions, such as
fwprintf, for reading and writing
wide streams. You write a sequence
of wide characters to a wide stream and the stream converts the wide characters
in the program to generalized multibyte sequences in the external stream.
Similarly, you read a sequence of wide characters from a wide stream
and the stream converts generalized multibyte sequences in the external stream
to a sequence of wide characters in the program.
(Wide streams otherwise behave very much like
the more traditional byte streams from the earliest days of C.)
Once again, however, no standards exist for specifying the multibyte encoding rule for converting between a wide-character sequence within the program and the generalized multibyte sequences in the external file or data stream.
The Standard C++ library incorporates all of the Standard C library,
at least through 1998. Thus, it contains all the functions outlined above
for converting between wide characters and multibyte sequences and for
reading and writing wide streams. Moreover, it extends the traditional
iostreams classes to include wide-stream as well as byte-stream operations.
For example, you traditionally read a byte stream by
extracting bytes (elements of type char) from the object
cin, which has type istream. In Standard C++,
cin has type basic_istream<char, char_traits<char> >, and
istream is just a synonym (type definition) for this type.
To read the standard input as a wide stream,
you extract wide characters (elements of type
wchar_t) from the object
wcin, which has type
basic_istream<wchar_t, char_traits<wchar_t> >.
(The type definition
wistream is a synonym for this type.) The wide stream responds by reading bytes from
the actual stream and composing them, by some rule, into the wide characters that you extract.
Similarly, you can write a byte stream by inserting bytes into
the object cout, which has type ostream. In Standard C++,
cout has type basic_ostream<char, char_traits<char> >, and
ostream is just a synonym for this type.
To write the standard output as a wide stream,
you insert wide characters (elements of type
wchar_t) into the object
wcout, which has type
basic_ostream<wchar_t, char_traits<wchar_t> >.
(The type definition
wostream is a synonym for this type.) The wide stream responds by writing bytes to
the actual stream, composing them, by some rule, from the wide character stream.
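The synonyms can be checked at compile time, and a function written against the template works with either flavor of stream. The sketch below assumes a C++11 compiler. The function first_token is a hypothetical illustration, and string streams stand in for cin and wcin.

```cpp
#include <cassert>
#include <istream>
#include <ostream>
#include <sstream>
#include <string>
#include <type_traits>

// The traditional names are type definitions for template specializations.
static_assert(std::is_same<std::istream,
    std::basic_istream<char, std::char_traits<char> > >::value,
    "istream is a synonym");
static_assert(std::is_same<std::wistream,
    std::basic_istream<wchar_t, std::char_traits<wchar_t> > >::value,
    "wistream is a synonym");
static_assert(std::is_same<std::ostream,
    std::basic_ostream<char, std::char_traits<char> > >::value,
    "ostream is a synonym");
static_assert(std::is_same<std::wostream,
    std::basic_ostream<wchar_t, std::char_traits<wchar_t> > >::value,
    "wostream is a synonym");

// Extracts one white-space delimited token from a byte stream or a
// wide stream. (Hypothetical helper for illustration only.)
template <class Elem>
std::basic_string<Elem> first_token(
    std::basic_istream<Elem, std::char_traits<Elem> >& in) {
    std::basic_string<Elem> token;
    in >> token;
    return token;
}
```

Passing a std::istringstream deduces Elem as char, while passing a std::wistringstream deduces it as wchar_t. The same source text serves both.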
What multibyte encoding rule do these wide-stream objects apply? That
depends, in principle at least, on the locale associated with each
wide-stream object. The Standard C function
setlocale alters
the global behavior of the Standard C library, but C++ provides for greater
encapsulation. C++ lets you deal with multiple locales within a program
all at the same time. You can construct an object of class
locale in a variety of ways.
One way is to use a locale name acceptable to
setlocale, to
make an object that presumably encapsulates the same locale-dependent
library behavior as in C. You then associate a locale object with a
wide-stream object by calling the member function
imbue, as in:
std::locale loc("en_US");    // US English locale
std::wifstream mystr;
mystr.imbue(loc);
mystr.open("file.txt", std::ios_base::binary);
if (!mystr.is_open())
    throw "open failed";
Henceforth, you can extract wide characters from mystr. The
wide stream generates each wide character from a generalized multibyte
sequence read from the file
file.txt, using a conversion rule
that is presumably determined by the locale named en_US.
But to belabor the point, no standards exist for specifying
the multibyte conversion rule associated with
en_US or any other
named locale. An implementation may offer a rich set of well documented
locales, or it may offer nothing beyond the required "C"
locale. It may provide for multiple multibyte encoding rules, or it may
apply just one rule universally. Put simply, you cannot in general depend
on the availability of predefined locales to supply the multibyte conversion
rule(s) you need.
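One defensive pattern, sketched here with a hypothetical helper name, is to try the named locale and fall back to the "C" locale when the implementation does not provide it. The constructor of std::locale reports an unknown name by throwing std::runtime_error.

```cpp
#include <cassert>
#include <locale>
#include <stdexcept>

// Returns the named locale if the implementation provides it, otherwise
// the always-available "C" locale. (locale_or_classic is hypothetical.)
std::locale locale_or_classic(const char* name) {
    assert(name != nullptr);
    try {
        return std::locale(name);
    } catch (const std::runtime_error&) {
        return std::locale::classic();
    }
}
```

Even when a call such as locale_or_classic("en_US") succeeds, the multibyte conversion rule you get is still whatever the implementation chooses to associate with that name.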
The Standard C++ library offers a more deterministic approach, however.
It may not say how to specify the behavior of a locale, in general,
but it does specify quite a bit of the behavior of an
object of class locale. A
locale object encapsulates references to a couple of dozen
locale facets, each of which
encapsulates in turn some aspect of locale-dependent library behavior. The facet
codecvt<wchar_t, char, mbstate_t>,
in particular, performs conversions between
wide characters and generalized multibyte sequences. Given a
facet of the appropriate flavor, you can construct a locale object that
does what you want and imbue it into the wide stream object(s) you use to
read and write files as you desire.
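To show the mechanics without depending on any particular library of facets, here is a sketch of a deliberately simple code-conversion facet and a locale built around it. The class latin1_codecvt and both helper functions are hypothetical. The facet treats each external byte as one ISO 8859-1 code, which is an assumption made only to keep the example short.

```cpp
#include <cassert>
#include <cstddef>
#include <cwchar>
#include <locale>
#include <string>

// Hypothetical facet: one external byte per wide character, same value.
class latin1_codecvt : public std::codecvt<wchar_t, char, std::mbstate_t> {
protected:
    result do_in(state_type&, const char* from, const char* from_end,
                 const char*& from_next, wchar_t* to, wchar_t* to_end,
                 wchar_t*& to_next) const override {
        for (; from != from_end && to != to_end; ++from, ++to)
            *to = static_cast<wchar_t>(static_cast<unsigned char>(*from));
        from_next = from;
        to_next = to;
        return from == from_end ? ok : partial;
    }
    result do_out(state_type&, const wchar_t* from, const wchar_t* from_end,
                  const wchar_t*& from_next, char* to, char* to_end,
                  char*& to_next) const override {
        result res = ok;
        for (; from != from_end && to != to_end; ++from, ++to) {
            if (static_cast<unsigned long>(*from) > 0xFF) {
                res = error;  // no single-byte equivalent
                break;
            }
            *to = static_cast<char>(static_cast<unsigned char>(*from));
        }
        if (res == ok && from != from_end)
            res = partial;    // ran out of output space
        from_next = from;
        to_next = to;
        return res;
    }
    result do_unshift(state_type&, char* to, char*,
                      char*& to_next) const override {
        to_next = to;         // stateless encoding, nothing to write
        return noconv;
    }
    int do_encoding() const noexcept override { return 1; }
    bool do_always_noconv() const noexcept override { return false; }
    int do_max_length() const noexcept override { return 1; }
    int do_length(state_type&, const char* from, const char* from_end,
                  std::size_t max) const override {
        std::size_t avail = static_cast<std::size_t>(from_end - from);
        return static_cast<int>(avail < max ? avail : max);
    }
};

// Builds a copy of the "C" locale with the conversion facet replaced.
std::locale make_latin1_locale() {
    return std::locale(std::locale::classic(), new latin1_codecvt);
}

// Converts a byte string by calling whatever facet the locale holds.
std::wstring widen_with(const std::locale& loc, const std::string& bytes) {
    typedef std::codecvt<wchar_t, char, std::mbstate_t> facet_type;
    std::wstring wide(bytes.size(), L'\0');
    if (bytes.empty())
        return wide;
    const facet_type& cvt = std::use_facet<facet_type>(loc);
    std::mbstate_t state = std::mbstate_t();
    const char* from_next = nullptr;
    wchar_t* to_next = nullptr;
    facet_type::result res = cvt.in(state, bytes.data(),
        bytes.data() + bytes.size(), from_next,
        &wide[0], &wide[0] + wide.size(), to_next);
    assert(res == facet_type::ok);
    (void)res;
    wide.resize(static_cast<std::size_t>(to_next - &wide[0]));
    return wide;
}
```

A wide file stream imbued with make_latin1_locale() before the file is opened would apply the same rule to each byte it reads. The locale object takes ownership of the facet, so the program never deletes it.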
Say, for example, you have a definition for the template class
codecvt_utf8<Elem>
that implements the multibyte encoding rule
you want to use for a stream with element type
Elem. (Elem is typically
wchar_t for a wide stream.)
Replace the declaration of
loc above with:
std::locale loc(std::locale::classic(), new Dinkum::codecvt::codecvt_utf8<wchar_t>);
and you have the locale object you need to imbue into one or more wide stream objects. If your compiler balks at this form, and you are using the Dinkum C++ Library, try the sturdier substitute:
std::locale loc = _ADDFAC(std::locale::classic(), new Dinkum::codecvt::codecvt_utf8<wchar_t>);
which works with older compilers as well as newer ones.
This document describes a collection of template classes that can serve as code-conversion facets and how you can use them. Each implements a different multibyte conversion rule. Please note, however, that each of these template classes implements a conversion between two encodings. It is not enough to decide that you want to read a file containing, say, UTF-8 encoded characters. You also have to know what set of wide-character codes you can convert it to. And you need to know what options are available to you with a particular C++ implementation.
The compiler imposes two important constraints on the wide-character encodings you can use in a C or C++ program:
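the size of type wchar_t, which stores a wide-character code, and
the encoding of wide-character literals such as L'x' (and of wide-string literals such as L"xyz", by extension)
To a lesser extent, it also matters how the library defines:
the macro WCHAR_MAX, which is also the value returned by the member function numeric_limits<wchar_t>::max()
the macro WCHAR_MIN, which is also the value returned by the member function numeric_limits<wchar_t>::min()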
Size matters the most. A large number of C and C++ compilers represent
wchar_t as a one-, two-, or four-byte integer. Those
sizes usually translate into eight-, 16-, or 32-bit representations. The
code-conversion facets described here are all designed to work properly
whether wchar_t is larger or smaller than required. For a value too
large to convert, in either direction, you can usually instruct the facet
either to truncate the result or to report a conversion error.
A wide-character result smaller than a
wchar_t value is
padded with high-order zero bits.
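The choice between truncating and reporting an error can be pictured with the following sketch. The names on_overflow and store_code are hypothetical and do not describe the interface of any facet in this library. They only illustrate the two behaviors for an integer element type.

```cpp
#include <cassert>
#include <limits>
#include <type_traits>

// Hypothetical policy for a code too large for the element type.
enum class on_overflow { truncate, report_error };

// Stores a character code in an element of type Elem. Returns false only
// when the code does not fit and the caller asked for an error.
template <class Elem>
bool store_code(unsigned long code, Elem& out, on_overflow policy) {
    static_assert(std::is_integral<Elem>::value,
                  "Elem must be an integer type");
    typedef typename std::make_unsigned<Elem>::type UElem;
    const unsigned long long max_code = std::numeric_limits<UElem>::max();
    if (code > max_code && policy == on_overflow::report_error)
        return false;
    // Conversion to the unsigned type keeps only the low-order bits.
    // A small code simply leaves the high-order bits of out zero.
    out = static_cast<Elem>(static_cast<UElem>(code));
    return true;
}
```

With an unsigned char element, the code 0x141 is either rejected or stored as 0x41, depending on the policy.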
It is not strictly necessary to convert between generalized multibyte
sequences and elements of type
wchar_t, by the way. You can
specialize template class
basic_istream and friends for
element types other than wchar_t. The
code-conversion facets described here are all designed to work properly
for an integer element type other than
wchar_t. So you can write:
std::locale loc = _ADDFAC(std::locale::classic(),
    new Dinkum::codecvt::codecvt_utf8<unsigned long>);
std::basic_ifstream<unsigned long> mystr;
mystr.imbue(loc);
mystr.open("file.txt", std::ios_base::binary);
if (!mystr.is_open())
    throw "open failed";
and traffic within the program with elements of type unsigned long.
Be warned, however, that single-element inserters and
extractors will not work properly with elements of most integer types other
than wchar_t. If you write:
unsigned long ch;
mystr >> ch;
the extractor will not extract a single element and store it in
ch. Rather, it will skip white space, then read a sequence of
decimal digits and convert them to an integer value to store in
ch. Moreover, the extractor will expect the imbued locale to
contain a facet of type
ctype<unsigned long>, to test for
white space. You will have to supply your own version, which may or may not
be easy to do.
In principle, you can specialize a stream on a user-defined type that you supply, not just on an arithmetic type. But be warned that not all Standard C++ libraries are this flexible, and those that are may have different requirements.
If you choose to work with streams with elements other than type
wchar_t, you should extract elements
from an input stream only by calling read,
which performs no checking on the value transmitted, as in:
if (!mystr.read(&ch, 1))
    throw "unexpected end of file";
Similarly, you should insert elements into an output stream only by calling write.
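The two calls can be wrapped in small helpers, as in the sketch below. The names read_one and write_one are hypothetical. The templates accept any element type the library supports, and the note after the block uses wchar_t string streams only because every implementation provides them.

```cpp
#include <cassert>
#include <istream>
#include <ostream>
#include <sstream>

// Reads exactly one element with the unformatted member function read,
// which neither skips white space nor interprets the value.
template <class Elem, class Traits>
bool read_one(std::basic_istream<Elem, Traits>& in, Elem& out) {
    return static_cast<bool>(in.read(&out, 1));
}

// Writes exactly one element with the unformatted member function write.
template <class Elem, class Traits>
bool write_one(std::basic_ostream<Elem, Traits>& out, Elem value) {
    return static_cast<bool>(out.write(&value, 1));
}
```

Given a std::wistringstream holding L" x", the first call to read_one delivers the space and the second delivers the x. An extractor would have skipped the space.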
Once you settle on a wide-character encoding of a representable size,
you then have to determine how well it interacts with code generated by
the C or C++ compiler. If you use conventional wide streams, with elements
of type wchar_t, you have the maximum freedom to use all
the inserters and extractors defined by the Standard C++ library.
If the program contains wide-character and wide-string
literals, they should probably agree with the encoding you choose. Otherwise,
you (and succeeding maintainers) will encounter any number of surprises.
You can in principle write literals that contain arbitrary wide characters. If you
do, the wide-character encoding you use had better exactly match what the
compiler presumes. A good coding style, however, is to use just characters
from the basic C character set in literals. Then any wide-character encoding
that agrees with this subset of values is a safe candidate. (For example,
many wide-character encodings use the same code values as ASCII, a.k.a. ISO 646
and ISO 8859, for the basic C character set.)
An even safer alternative is to use no wide-character or wide-string literals at all, at least in the parts of a program that need to be maximally flexible. That avoids most potential problems, but not necessarily all.
You may still have to worry about the value of the macro WEOF.
It is used throughout both the C and C++ libraries as an end-of-stream
indicator, often in a context where you might otherwise expect a wide-character
code. Wherever possible, a good implementation will choose a value that
cannot be mistaken for a valid code. (The macro
EOF is often defined as -1,
so that it can never be confused with any of the single-byte codes,
each of which is represented as a non-negative value.)
This is not possible if the representation of
wint_t has no more bits than that for wchar_t.
In such a case, the implementation must at least
choose a value for
WEOF that is invalid as a wide-character
code. A common value is
(wchar_t)(-1), which has all bits set
in an unsigned representation. Many wide-character encodings reserve this
value as invalid, but not all.
If you choose an encoding that permits all code values, expect problems
when you read the value
(wchar_t)WEOF from a file. It will
almost always be mistaken for an end-of-file indication from lower-level
code. You will face similar problems when you write this value to a file.
It will almost always be mistaken for a write-error indication from
lower-level code. Once again, your safest bet is to extract elements only
by calling read and insert elements only by calling
write, as described above. These member functions test only the
number of elements read or written, without inspecting any values. They
are the only such functions that transmit element values transparently.
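You can test whether a given code is at risk with a check along these lines. The function name is hypothetical. It relies only on std::char_traits<wchar_t>, whose eof member returns WEOF.

```cpp
#include <cassert>
#include <cwchar>
#include <string>

// Reports whether a wide-character code becomes indistinguishable from
// WEOF once it is converted to the integer type wint_t.
// (collides_with_weof is a hypothetical helper.)
bool collides_with_weof(wchar_t code) {
    typedef std::char_traits<wchar_t> traits;
    return traits::eq_int_type(traits::to_int_type(code), traits::eof());
}
```

On an implementation where wint_t has no more bits than wchar_t, you should expect collides_with_weof(static_cast<wchar_t>(WEOF)) to return true. That is exactly the code value an encoding must avoid if it is to pass through the formatted stream functions.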
Some implementations choose a signed-integer representation for
wchar_t. In this case, the macro
WCHAR_MIN is
less than zero.
(It must be zero for an unsigned-integer representation.)
The code-conversion facets presented here all treat wide characters as
non-negative codes. They assume that it is safe to store in a
wchar_t object all code values in the range
[0, WCHAR_MAX - WCHAR_MIN],
and that the value will be recovered if cast to a suitably large unsigned-integer
type. In the common case where the computer represents negative numbers in
twos-complement, with quiet wraparound on overflow, these assumptions are
safe. But beware of a representation that has a negative zero, particularly
if it sometimes collapses to positive zero. And beware of a representation
that traps on apparent integer overflow when converting from unsigned to
signed. Both can cause trouble for
wide-character encodings that the compiler does not anticipate.
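The assumption can be written out as a pair of conversions, sketched below with hypothetical names. The sketch assumes a twos-complement machine with quiet wraparound, so that converting an out-of-range unsigned value to a signed wchar_t preserves the bit pattern.

```cpp
#include <cassert>
#include <limits>
#include <type_traits>

// Unsigned integer type with the same size as wchar_t.
typedef std::make_unsigned<wchar_t>::type uwchar;

// Recovers the non-negative code stored in a wide character.
inline unsigned long code_of(wchar_t wc) {
    return static_cast<unsigned long>(static_cast<uwchar>(wc));
}

// Stores a non-negative code in a wide character. For a signed wchar_t,
// a code above WCHAR_MAX ends up as a negative value (assumes
// twos-complement conversion from unsigned to signed).
inline wchar_t wide_from(unsigned long code) {
    assert(code <= std::numeric_limits<uwchar>::max());
    return static_cast<wchar_t>(static_cast<uwchar>(code));
}
```

Under these assumptions, code_of(wide_from(c)) returns c for every c in the range [0, WCHAR_MAX - WCHAR_MIN], whether wchar_t is signed or unsigned.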
A number of character-set encodings fit neatly in a single eight-bit byte. Many of these are based on ASCII, or ISO 646, which defines code values in the range [0, 127]. The character set ISO 8859-1 extends this encoding by defining the remaining codes, in the range [128, 255]. Variations on this popular set exist for several other alphabets, such as ISO 8859-7 for Greek.
Microsoft Windows and other systems implement a large number of code pages, each of which effectively defines a mapping between multibyte and wide-character encodings. Many of these code pages simply assign one-byte codes to a selection of characters from a larger character set.
An implementation based on one of these eight-bit character-set
encodings may well be content to define
wchar_t as one of the
one-byte character types (char,
signed char, or
unsigned char). And it will likely adopt the same character set
encoding for both its single-byte and wide-character encoding. Yet another approach
is to allow a broader range of wide-character codes, but to permit wide-character
conversions only when a single-byte equivalent is defined. This is a good
way to reconcile the numerous ISO 8859-x single-byte encodings,
or Windows single-byte code pages, with an
unambiguous subset of Unicode (UCS-2) wide-character codes.
A widely used character-set encoding outside the ISO family is EBCDIC. It has been used for several decades by IBM and still has a presence. EBCDIC is also an eight-bit encoding, with the added virtue that all its Unicode equivalents (the ISO 8859-1 subset of UCS-2) also fit in a single byte. So it is another candidate for either a one-byte wide-character encoding or a subset of Unicode with only single-byte multibyte codes.
The Japanese were among the first to face the problem of representing
character sets whose elements number in the thousands. Over the past couple
of decades, a series of Japanese Industrial Standards have evolved to
represent a mix of Kanji, Hiragana, and Katakana characters for Japanese,
plus the Western characters traditionally used with computers. A widely
used encoding is JIS X0208,
which uses 16 bits to represent a subset
of Kanji plus these other alphabets. As usual, ISO 646 characters form a
subset of this larger character set. Thus, JIS X0208 is often used in
conjunction with a two-byte representation of wchar_t,
usually declared as either short or unsigned short.
Unicode is probably the most widely known encoding for larger character sets. It is maintained by the private Unicode Consortium (see http://www.unicode.org), but has been kept pretty closely in sync with the evolution of ISO 10646, a.k.a. UCS. Both have the virtue of overlapping neatly with ISO 8859-1. And both are serious attempts to provide a single, unified representation for all the characters used throughout the world -- past, present, and future (including Klingon). Unicode got a real boost when it was adopted as the character set encoding for Java, which strives for a high level of portability and international support.
That's the good news. What muddies the picture is that Unicode has been
subsetted in ways that now prove to be short sighted. The full ISO 10646
specification sets aside 31 bits to represent upwards of two billion different
characters. These codes can be represented nicely as non-negative values in
a 32-bit integer, either signed or unsigned. But until fairly recently, all
the defined code values could be represented in 16 bits
(UCS-2). Java fixated on a
16-bit representation for the basic Java type
char, which is equivalent
to the C or C++ type
wchar_t. Less rigidly, but for similar reasons,
a number of C and C++ implementations have chosen to represent wide characters
using the 16-bit subset of Unicode stored in a two-byte wchar_t.
But the set of code values has recently been extended. Currently, all defined characters in ISO 10646 or Unicode fall in the range [0, 0x10FFFF]. As a palliative, some people are proposing the use of UTF-16 as a wide-character encoding. UTF-16 involves a bit of trickery. All the codes above 0xFFFF can be represented in 20 bits. These 20 bits can then be divided into two ten-bit pieces and stuffed into a pair of 16-bit codes that occupy holes left in the range [0, 0xFFFF]. Thus UTF-16 provides a way to represent all the currently defined Unicode characters as either one or two 16-bit words.
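The trickery amounts to a little arithmetic. Subtracting 0x10000 from a code in the range [0x10000, 0x10FFFF] leaves a 20-bit value, whose upper ten bits are added to 0xD800 and whose lower ten bits are added to 0xDC00. The sketch below uses hypothetical names, with char16_t standing in for a 16-bit word.

```cpp
#include <cassert>

// A pair of 16-bit codes taken from the holes in the range [0, 0xFFFF].
struct surrogate_pair {
    char16_t high;  // in [0xD800, 0xDBFF]
    char16_t low;   // in [0xDC00, 0xDFFF]
};

// Splits a code in [0x10000, 0x10FFFF] into two 16-bit codes.
inline surrogate_pair to_surrogates(unsigned long code) {
    assert(code >= 0x10000UL && code <= 0x10FFFFUL);
    unsigned long offset = code - 0x10000UL;  // now fits in 20 bits
    surrogate_pair pair;
    pair.high = static_cast<char16_t>(0xD800UL + (offset >> 10));
    pair.low = static_cast<char16_t>(0xDC00UL + (offset & 0x3FFUL));
    return pair;
}

// Reassembles the original code from the two 16-bit codes.
inline unsigned long from_surrogates(surrogate_pair pair) {
    assert(pair.high >= 0xD800 && pair.high <= 0xDBFF);
    assert(pair.low >= 0xDC00 && pair.low <= 0xDFFF);
    return 0x10000UL
        + ((static_cast<unsigned long>(pair.high) - 0xD800UL) << 10)
        + (static_cast<unsigned long>(pair.low) - 0xDC00UL);
}
```

For example, the code 0x1F600 becomes the pair 0xD83D, 0xDE00, and the largest code, 0x10FFFF, becomes 0xDBFF, 0xDFFF.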
It is important to emphasize that UTF-16 is neither a proper generalized multibyte encoding, nor a proper multibyte encoding, nor a proper wide-character encoding, in C or C++ terms. It can be made into a generalized multibyte encoding simply by adding rules for specifying the order of individual bytes for the two- or four-byte codes (UTF-16LE for little-endian, UTF-16BE for big-endian, or UTF-16 with a header code that signals the endianness of the codes that follow.) It cannot be made into a multibyte encoding because it contains embedded bytes with zero value. It can be made into a wide-character encoding only by ignoring a fundamental principle -- every character is supposed to be representable as a single element of fixed size. Nevertheless, more than one group has expressed an interest in using UTF-16 as a kind of bastard wide-character encoding, choosing to break a rule for some currently lesser-used characters rather than face the fallout from changing the size of a basic character type.
A four-byte representation for
wchar_t has the obvious virtue
that it can store all elements of the largest character sets currently
under consideration. (The costs of the extra storage required are still being
debated.) ISO 10646 is an obvious candidate for wide-character encoding in
this case. But even here you can find several subsetting choices. The range
of valid character codes can be assumed to be:
Such range issues will become more apparent when examining the choice of multibyte conversion rules that can be used with each wide-character encoding.
As you may have gathered from the dozens of code-conversion facets in this library, there are many multibyte encodings. A large number are jiggered to survive transmission via text files, but not all. Some are designed to be economical of storage, using shorter byte sequences for the more frequently used wide-character codes, but not all. Some are designed to permit easy translation to and from wide-character codes, but not all. The one common denominator of the code-conversion facets presented here is that every one translates a multibyte encoding that was created for a good commercial reason.
If you don't recognize an encoding supported by this library, chances are that you have no (current) need for it. If you see one that you need, chances are that this implementation will do what you want. If you want to learn more about any of the mappings implemented in this library, chances are that the source code will supply more precise details. And if you want to add your own code-conversion facets, chances are that one of the ones in this library will serve as background information and inspiration.
Copyright © 1992-2013 by Dinkumware, Ltd. All rights reserved.