Multibyte Sequences in C · Multibyte Sequences in C++ · Wide-Character Limitations · Wide-Character Encodings · Multibyte Encodings
You can represent a character two different ways within a C (or C++) program:
- as a wide character, stored in a single element of type wchar_t
- as a multibyte sequence, stored in one or more elements of type char
Moreover, a multibyte sequence can have a state-dependent encoding, where one or more byte sequences change the interpretation of byte sequences that follow. Every sequence of such multibyte sequences is presumed to begin in an initial shift state, and can be brought back to the initial shift state by a suitable homing sequence (which may be state dependent).
As a special case, certain characters can be represented as single-byte sequences in the initial shift state. Such characters form the single-byte character set from traditional C. They are thus candidates for testing and manipulation by the functions defined in the Standard C header <ctype.h>. And they are the characters you write to, and read from, byte streams, using the functions defined in the Standard C header <stdio.h>.
The Standard C library defines a number of functions for converting between multibyte and wide-character representations. See, for example, the functions mbtowc and wctomb, defined in the header <stdlib.h>.
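As a minimal sketch of these two functions at work, the following C++ fragment converts a single-byte character to a wide character and back again. The helper name round_trips is hypothetical, and the sketch assumes the program is still running in the default "C" locale, where any member of the basic C character set must survive the round trip.

```cpp
#include <cassert>
#include <climits>   // MB_LEN_MAX
#include <cstdlib>   // std::mbtowc, std::wctomb

// Convert the single-byte character ch to a wide character and back.
// Returns true if the round trip reproduces ch. (Hypothetical helper.)
bool round_trips(char ch)
{
    assert(ch != '\0');     // keep the sketch simple: skip the nul character

    // Put both conversion functions in the initial shift state.
    std::mbtowc(0, 0, 0);
    std::wctomb(0, 0);

    wchar_t wc = 0;
    if (std::mbtowc(&wc, &ch, 1) != 1)
        return false;       // ch is not a complete single-byte sequence

    char buf[MB_LEN_MAX];
    return std::wctomb(buf, wc) == 1 && buf[0] == ch;
}

void example_round_trip()
{
    assert(round_trips('x'));
    assert(round_trips('7'));
}
```

Calling example_round_trip() exercises the helper on two members of the basic C character set. Characters outside that set may or may not round trip, depending on the conversion rule in effect.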
The specific conversion rule is implementation defined. It may be possible to change the rule globally by a call to setlocale, but no standards currently exist for specifying how to do so.
Whatever conversion rule applies, however, it must obey several
constraints spelled out in the C Standard:
- The character constant 'x', where x is any member of the basic C character set, must correspond to the (only) byte of the single-byte sequence that represents x.
- The wide-character constant L'x' must correspond to the wide character that represents x.
- The nul character '\0' is represented by the single-byte sequence whose (only) byte has the value zero.
- The wide character L'\0' is represented by the wide character that has the value zero.
- A byte with the value zero appears in a multibyte sequence only as the nul character.
Not all multibyte encodings obey all these rules. For example,
the last rule is broken by a multibyte encoding that represents
each two-byte wide character by its less-significant byte followed
by its more-significant byte. Such a
little-endian sequence
is easy to generate and useful for communicating values between
different computer architectures, but it can contain any number
of bytes with zero value that are not nul characters.
Thus, this encoding is not suitable as an implementation of, say, the function mbtowc. (The encoding that puts the more-significant byte first is called a big-endian sequence, naturally enough.)
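To see why a little-endian sequence breaks the rule about zero bytes, consider this sketch. It serializes two-byte codes less-significant byte first and counts the zero-valued bytes that result. The function names are hypothetical, and the example assumes ASCII code values for 'A' and 'B'.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Serialize 16-bit codes as a little-endian byte sequence:
// less-significant byte first, then more-significant byte.
std::vector<unsigned char> to_little_endian(const std::vector<unsigned short>& codes)
{
    std::vector<unsigned char> bytes;
    for (std::size_t i = 0; i < codes.size(); ++i) {
        bytes.push_back(static_cast<unsigned char>(codes[i] & 0xFF));
        bytes.push_back(static_cast<unsigned char>((codes[i] >> 8) & 0xFF));
    }
    return bytes;
}

// Count the bytes with zero value in a byte sequence.
std::size_t count_zero_bytes(const std::vector<unsigned char>& bytes)
{
    std::size_t n = 0;
    for (std::size_t i = 0; i < bytes.size(); ++i)
        if (bytes[i] == 0)
            ++n;
    return n;
}

void example_little_endian()
{
    // 'A' and 'B' as two-byte wide characters (ASCII code values assumed).
    const unsigned short raw[] = { 0x0041, 0x0042 };
    std::vector<unsigned short> codes(raw, raw + 2);
    std::vector<unsigned char> bytes = to_little_endian(codes);

    assert(bytes.size() == 4);
    assert(bytes[0] == 0x41 && bytes[1] == 0x00);
    assert(count_zero_bytes(bytes) == 2);   // zero bytes, but no nul characters
}
```

Any function that treats a zero byte as the end of a string would stop after the first byte of this sequence.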
The C Standard recognizes the need for generalized multibyte sequences, which may break one or more of the rules above, for communicating sequences of characters between programs. The Standard C header <wchar.h> defines a host of functions, such as fwscanf and fwprintf, for reading and writing wide streams. You write a sequence
of wide characters to a wide stream and the stream converts the wide characters
in the program to generalized multibyte sequences in the external stream.
Similarly, you read a sequence of wide characters from a wide stream
and the stream converts generalized multibyte sequences in the external stream
to a sequence of wide characters in the program.
(Wide streams otherwise behave very much like
the more traditional byte streams from the earliest days of C.)
Once again, however, no standards exist for specifying the multibyte encoding rule for converting between a wide-character sequence within the program and the generalized multibyte sequences in the external file or data stream.
The Standard C++ library incorporates all of the Standard C library,
at least through 1998. Thus, it contains all the functions outlined above
for converting between wide characters and multibyte sequences and for
reading and writing wide streams. Moreover, it extends the traditional
iostreams classes to include wide-stream as well as byte-stream operations.
For example, you traditionally read a byte stream by extracting bytes (elements of type char) from the object cin, which has type istream. In Standard C++, cin has type basic_istream<char, char_traits<char> > and istream is just a synonym (type definition) for this type. To read the standard input as a wide stream, you extract wide characters (elements of type wchar_t) from the object wcin, which has type basic_istream<wchar_t, char_traits<wchar_t> >. (The type definition wistream is a synonym for this type.) The wide stream responds by reading bytes from the actual stream and composing them, by some rule, into the wide character stream.
Similarly, you can write a byte stream by inserting bytes into the object cout, which has type ostream. In Standard C++, cout has type basic_ostream<char, char_traits<char> > and ostream is just a synonym for this type. To write the standard output as a wide stream, you insert wide characters (elements of type wchar_t) into the object wcout, which has type basic_ostream<wchar_t, char_traits<wchar_t> >. (The type definition wostream is a synonym for this type.) The wide stream responds by writing bytes to the actual stream, composing them, by some rule, from the wide character stream.
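The synonyms described above can be checked at compile time. The sketch below assumes a C++11 compiler (for static_assert, decltype, and std::is_same). The helper copy_one is a hypothetical name, and in-memory wide streams stand in for wcin and wcout so the example does not depend on any external encoding.

```cpp
#include <cassert>
#include <iostream>
#include <sstream>
#include <string>
#include <type_traits>  // std::is_same (C++11)

// The familiar stream names are synonyms for template specializations.
static_assert(std::is_same<std::istream,
    std::basic_istream<char, std::char_traits<char> > >::value,
    "istream is a synonym");
static_assert(std::is_same<std::wistream,
    std::basic_istream<wchar_t, std::char_traits<wchar_t> > >::value,
    "wistream is a synonym");
static_assert(std::is_same<std::wostream,
    std::basic_ostream<wchar_t, std::char_traits<wchar_t> > >::value,
    "wostream is a synonym");
static_assert(std::is_same<decltype(std::wcin), std::wistream>::value,
    "wcin has type wistream");
static_assert(std::is_same<decltype(std::wcout), std::wostream>::value,
    "wcout has type wostream");

// Copy one wide character from in to out; returns false at end of input.
bool copy_one(std::wistream& in, std::wostream& out)
{
    wchar_t wc;
    if (!in.get(wc))
        return false;
    out.put(wc);
    return true;
}

void example_copy()
{
    std::wistringstream in(L"hi");
    std::wostringstream out;
    assert(copy_one(in, out));
    assert(out.str() == L"h");
}
```

Because copy_one takes references to wistream and wostream, you could pass std::wcin and std::wcout just as well.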
What multibyte encoding rule do these wide-stream objects apply? That depends, in principle at least, on the locale associated with each wide-stream object. The Standard C function setlocale alters the global behavior of the Standard C library, but C++ provides for greater encapsulation. C++ lets you deal with multiple locales within a program all at the same time. You can construct an object of class locale in a variety of ways. One way is to use a locale name acceptable to setlocale, to make an object that presumably encapsulates the same locale-dependent library behavior as in C. You then associate a locale object with a wide-stream object by calling the member function imbue, as in:
    std::locale loc("en_US"); // US English locale
    std::wifstream mystr;
    mystr.imbue(loc);
    mystr.open("file.txt", std::ios_base::binary);
    if (!mystr.is_open())
        throw "open failed";
Henceforth, you can extract wide characters from mystr. The wide stream generates each wide character from a generalized multibyte sequence read from the file file.txt, using a conversion rule that is presumably determined by the locale named en_US.
But to belabor the point, no standards exist for specifying the multibyte conversion rule associated with en_US or any other named locale. An implementation may offer a rich set of well documented locales, or it may offer nothing beyond the required "C" locale. It may provide for multiple multibyte encoding rules, or it may apply just one rule universally. Put simply, you cannot in general depend on the availability of predefined locales to supply the multibyte conversion rule(s) you need.
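Since a named locale may simply not exist, a program that asks for one should be ready for the constructor to throw std::runtime_error. Here is a minimal sketch of one defensive approach. The helper name locale_or_classic is hypothetical, and falling back to the "C" locale is just one possible policy.

```cpp
#include <cassert>
#include <locale>
#include <stdexcept>
#include <string>

// Try to construct the named locale. Fall back to the "C" locale when the
// implementation does not supply one by that name. (Hypothetical helper.)
std::locale locale_or_classic(const char* name)
{
    try {
        return std::locale(name);
    } catch (const std::runtime_error&) {
        return std::locale::classic();
    }
}

void example_locale()
{
    // "C" is the one locale name every implementation must accept.
    assert(locale_or_classic("C").name() == "C");

    // "en_US" may or may not exist. Either way a usable locale comes back,
    // though nothing here says which multibyte conversion rule it applies.
    std::locale loc = locale_or_classic("en_US");
    (void)loc;
}
```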
The Standard C++ library offers a more deterministic approach, however. It may not say how to specify the behavior of a locale, in general, but it does specify quite a bit of the behavior of a locale object. A locale object encapsulates references to a couple of dozen locale facets, each of which encapsulates in turn some aspect of locale-dependent library behavior. The facet codecvt<wchar_t, char, mbstate_t>, in particular, performs conversions between wide characters and generalized multibyte sequences. Given a codecvt facet of the appropriate flavor, you can construct a locale object that does what you want and imbue it into the wide stream object(s) you use to read and write files as you desire.
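A wide stream calls on this facet behind the scenes, but you can also call it directly. The sketch below fetches the codecvt<wchar_t, char, mbstate_t> facet from a locale with std::use_facet and asks it to convert a byte sequence to wide characters. The function name widen is hypothetical. The example uses only the classic "C" locale and characters from the basic C character set, since nothing more can be assumed portably.

```cpp
#include <cassert>
#include <cwchar>     // std::mbstate_t
#include <locale>
#include <stdexcept>
#include <string>

// Convert a multibyte sequence to wide characters with the code-conversion
// facet found in loc. Throws if the facet cannot convert the whole input.
std::wstring widen(const std::string& bytes, const std::locale& loc)
{
    typedef std::codecvt<wchar_t, char, std::mbstate_t> facet_type;
    if (bytes.empty())
        return std::wstring();

    const facet_type& cvt = std::use_facet<facet_type>(loc);
    std::mbstate_t state = std::mbstate_t();    // initial shift state
    std::wstring wide(bytes.size(), L'\0');     // at most one element per byte
    const char* from_next = 0;
    wchar_t* to_next = 0;

    std::codecvt_base::result res = cvt.in(state,
        bytes.data(), bytes.data() + bytes.size(), from_next,
        &wide[0], &wide[0] + wide.size(), to_next);
    if (res != std::codecvt_base::ok)
        throw std::runtime_error("conversion failed or incomplete");

    wide.resize(to_next - &wide[0]);
    return wide;
}

void example_widen()
{
    assert(widen("abc", std::locale::classic()) == L"abc");
    assert(widen("", std::locale::classic()).empty());
}
```

Pass a locale that carries a different code-conversion facet and the same function applies a different multibyte encoding rule.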
Say, for example, you have a definition for the template class Dinkum::codecvt::codecvt_utf8<Elem> that implements the multibyte encoding rule you want to use for a stream with element type Elem. (Elem is typically wchar_t for a wide stream.) Replace the declaration of loc above with:
    std::locale loc(std::locale::classic(), new Dinkum::codecvt::codecvt_utf8<wchar_t>);
and you have the locale object you need to imbue into one or more wide stream objects. If your compiler balks at this form, and you are using the Dinkum C++ Library, try the sturdier substitute:
    std::locale loc = _ADDFAC(std::locale::classic(), new Dinkum::codecvt::codecvt_utf8<wchar_t>);
which works with older compilers as well as newer ones.
This document describes a collection of template classes that can serve as code-conversion facets and how you can use them. Each implements a different multibyte conversion rule. Please note, however, that each of these template classes implements a conversion between two encodings. It is not enough to decide that you want to read a file containing, say, UTF-8 encoded characters. You also have to know what set of wide-character codes you can convert it to. And you need to know what options are available to you with a particular C++ implementation.
The compiler imposes two important constraints on the wide-character encodings you can use in a C or C++ program:
- the size of type wchar_t, which stores a wide-character code, and
- the encoding of wide-character literals such as L'x' (and L"xyz", by extension)
To a lesser extent, it also matters how the library defines:
- the macro WEOF
- the macro WCHAR_MAX, which is also the value returned by the member function numeric_limits<wchar_t>::max()
- the macro WCHAR_MIN, which is also the value returned by the member function numeric_limits<wchar_t>::min()
Size matters the most. A large number of C and C++ compilers represent type wchar_t as a one-, two-, or four-byte integer. Those sizes usually translate into eight-, 16-, or 32-bit representations. The code-conversion facets described here are all designed to work properly if wchar_t is larger or smaller than required. For a value too large to convert, in either direction, you can usually instruct the facet either to truncate the result or to report a conversion error. A wide-character result smaller than a wchar_t value is padded with high-order zero bits.
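The following sketch shows one way a program might test whether a code fits in wchar_t, and what "truncate" could mean when it does not. All three function names are hypothetical and are not the interface of the facets described here. The code assumes a C++11 compiler (for unsigned long long) and a representation in which WCHAR_MAX - WCHAR_MIN has all value bits set.

```cpp
#include <cassert>
#include <climits>   // CHAR_BIT
#include <cwchar>    // WCHAR_MAX, WCHAR_MIN
#include <limits>

// Largest non-negative code a wchar_t can hold when all of its bits are
// used: WCHAR_MAX - WCHAR_MIN, computed in unsigned arithmetic so that
// a signed wchar_t cannot overflow.
unsigned long long max_wchar_code()
{
    typedef std::numeric_limits<wchar_t> lim;
    return static_cast<unsigned long long>(lim::max())
         - static_cast<unsigned long long>(lim::min());
}

// True if code can be stored in a wchar_t without loss.
bool fits_in_wchar(unsigned long long code)
{
    return code <= max_wchar_code();
}

// One possible "truncate" policy: keep only the low-order bits that fit.
unsigned long long truncate_to_wchar(unsigned long long code)
{
    return code & max_wchar_code();
}

void example_limits()
{
    // The macros and numeric_limits report the same bounds.
    assert(WCHAR_MAX == std::numeric_limits<wchar_t>::max());
    assert(WCHAR_MIN == std::numeric_limits<wchar_t>::min());

    assert(fits_in_wchar(0x41));
    assert(truncate_to_wchar(0x41) == 0x41);

    // A 21-bit code such as 0x10FFFF fits only if wchar_t is wide enough.
    assert(fits_in_wchar(0x10FFFF) == (sizeof(wchar_t) * CHAR_BIT >= 21));
}
```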
It is not strictly necessary to convert between generalized multibyte sequences and elements of type wchar_t, by the way. You can specialize template class basic_istream and friends for element types other than char and wchar_t. The code-conversion facets described here are all designed to work properly for an integer element type other than wchar_t. So you can write:
    std::locale loc = _ADDFAC(std::locale::classic(), new Dinkum::codecvt::codecvt_utf8<unsigned long>);
    std::basic_ifstream<unsigned long> mystr;
    mystr.imbue(loc);
    mystr.open("file.txt", std::ios_base::binary);
    if (!mystr.is_open())
        throw "open failed";
and traffic within the program with elements of type unsigned long.
Be warned, however, that single-element inserters and extractors will not work properly with elements of most integer types other than char and wchar_t. If you write:
    unsigned long ch;
    mystr >> ch;
the extractor will not extract a single element and store it in ch. Rather, it will skip white space, then read a sequence of decimal digits and convert them to an integer value to store in ch. Moreover, the extractor will expect the imbued locale loc to contain a facet of type ctype<unsigned long>, to test for white space. You will have to supply your own version, which may or may not be easy.
In principle, you can specialize a stream on a user-defined type that you supply, not just on an arithmetic type. But be warned that not all Standard C++ libraries are this flexible, and those that are may have different requirements.
If you choose to work with streams with elements other than type char or wchar_t, you should extract elements from an input stream only by calling read, which performs no checking on the value transmitted, as in:
    if (!mystr.read(&ch, 1))
        throw "unexpected end of file";
Similarly, you should insert elements into an output stream only by calling write.
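The difference between an extractor and read is easy to demonstrate. The sketch below uses wchar_t elements rather than unsigned long, because the Standard C++ library is only required to supply char_traits for its own character types, so a stream of unsigned long will not compile everywhere. In-memory streams stand in for files, and the function name is hypothetical.

```cpp
#include <cassert>
#include <sstream>
#include <string>

void example_read_vs_extract()
{
    // A formatted extractor skips the leading space before storing.
    std::wistringstream formatted(L" a");
    wchar_t ch = 0;
    formatted >> ch;
    assert(ch == L'a');

    // read() hands back the first element unchanged, whatever its value.
    std::wistringstream raw(L" a");
    ch = 0;
    raw.read(&ch, 1);
    assert(raw.gcount() == 1 && ch == L' ');

    // write() is the matching way to insert an element without inspection.
    std::wostringstream out;
    out.write(&ch, 1);
    assert(out.str() == L" ");
}
```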
Once you settle on a wide-character encoding of a representable size, you then have to determine how well it interacts with code generated by the C or C++ compiler. If you use conventional wide streams, with elements of type wchar_t, you have the maximum freedom to use all the inserters and extractors defined by the Standard C++ library. If the program contains wide-character and wide-string literals, they should probably agree with the encoding you choose. Otherwise, you (and succeeding maintainers) will encounter any number of surprises. You can in principle write literals that contain arbitrary wide characters. If you do, the wide-character encoding you use had better exactly match what the compiler presumes. A good coding style, however, is to use just characters from the basic C character set in literals. Then any wide-character encoding that agrees with this subset of values is a safe candidate. (For example, many wide-character encodings use the same code values as ASCII, a.k.a. ISO 646 and ISO 8859, for the basic C character set.)
An even safer alternative is to use no wide-character or wide-string literals at all, at least in the parts of a program that need to be maximally flexible. That avoids most potential problems, but not necessarily all.
You may still have to worry about the value of the macro WEOF. It is used throughout both the C and C++ libraries as an end-of-stream indicator, often in a context where you might otherwise expect a wide-character code. Wherever possible, a good implementation will choose a value that cannot be mistaken for a valid code. (The macro EOF is often defined as -1 so that it can never be confused with any of the single-byte codes, each of which is represented as a non-negative value.) This is not possible if the representation of wint_t has no more bits than that for wchar_t. In such a case, the implementation must at least choose a value for WEOF that is invalid as a wide-character code. A common value is (wchar_t)(-1), which has all bits set in an unsigned representation. Many wide-character encodings reserve this value as invalid, but not all.
If you choose an encoding that permits all code values, expect problems when you read the value (wchar_t)WEOF from a file. It will almost always be mistaken for an end-of-file indication from lower-level code. You will face similar problems when you write this value to a file. It will almost always be mistaken for a write-error indication from lower-level code. Once again, your safest bet is to extract elements only by calling read and insert elements only by calling write, as described above. These member functions test only the number of elements read or written, without inspecting any values. They are the only such functions that transmit element values transparently.
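You can ask the library itself whether a given wide character would be confused with the end-of-file indicator. This sketch uses only std::char_traits<wchar_t>, and the helper name looks_like_eof is hypothetical. Whether (wchar_t)WEOF actually collides depends on how your implementation defines wchar_t, wint_t, and WEOF, so the example records that result without asserting it.

```cpp
#include <cassert>
#include <cwchar>   // WEOF
#include <string>   // std::char_traits

// True if wc cannot be told apart from the end-of-file indicator once it
// has been converted to the stream's int_type. (Hypothetical helper.)
bool looks_like_eof(wchar_t wc)
{
    typedef std::char_traits<wchar_t> traits;
    return traits::eq_int_type(traits::to_int_type(wc), traits::eof());
}

void example_weof()
{
    // Ordinary codes are never mistaken for end-of-file.
    assert(!looks_like_eof(L'A'));
    assert(!looks_like_eof(L'\0'));

    // This one is implementation dependent: true where wint_t has no
    // spare values beyond those of wchar_t, possibly false elsewhere.
    bool collides = looks_like_eof(static_cast<wchar_t>(WEOF));
    (void)collides;
}
```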
Some implementations choose a signed-integer representation for wchar_t. In this case, the macro WCHAR_MIN is less than zero. (It must be zero for an unsigned-integer representation.) The code-conversion facets presented here all treat wide characters as non-negative codes. They assume that it is safe to store in a wchar_t object all code values in the range [0, WCHAR_MAX - WCHAR_MIN], and that the value will be recovered if cast to a suitably large unsigned-integer type. In the common case where the computer represents negative numbers in twos-complement, with quiet wraparound on overflow, these assumptions are safe. But beware of a representation that has a negative zero, particularly if it sometimes collapses to positive zero. And beware of a representation that traps on apparent integer overflow when converting from unsigned to signed. Both can cause trouble for wide-character encodings that the compiler does not anticipate.
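The assumption described above can be written out as a pair of conversions. The sketch below is not the code the facets use. It assumes a C++11 compiler (for std::make_unsigned), a twos-complement representation with quiet wraparound, and an unsigned long at least as wide as wchar_t. The names uwchar, store_code, and code_of are hypothetical.

```cpp
#include <cassert>
#include <limits>
#include <type_traits>  // std::make_unsigned (C++11)

// Unsigned integer type with the same size as wchar_t.
typedef std::make_unsigned<wchar_t>::type uwchar;

// Store a non-negative code in a wchar_t, even one whose value exceeds
// WCHAR_MAX when wchar_t is signed.
wchar_t store_code(unsigned long code)
{
    return static_cast<wchar_t>(static_cast<uwchar>(code));
}

// Recover the non-negative code. Casting through the same-size unsigned
// type first avoids sign extension into the wider result.
unsigned long code_of(wchar_t wc)
{
    return static_cast<unsigned long>(static_cast<uwchar>(wc));
}

void example_signed_round_trip()
{
    assert(code_of(store_code(0x41)) == 0x41);

    // The largest code, WCHAR_MAX - WCHAR_MIN, has all bits set and may
    // look negative while stored, but it still comes back intact.
    unsigned long top = std::numeric_limits<uwchar>::max();
    assert(code_of(store_code(top)) == top);
}
```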
A number of character-set encodings fit neatly in a single eight-bit byte. Many of these are based on ASCII, or ISO 646, which defines code values in the range [0, 127]. The character set ISO 8859-1 extends this encoding by defining the remaining codes, in the range [128, 255]. Variations on this popular set exist for several other alphabets, such as ISO 8859-7 for Greek.
Microsoft Windows and other systems implement a large number of code pages, each of which effectively defines a mapping between multibyte and wide-character encodings. Many of these code pages simply assign one-byte codes to a selection of characters from a larger character set.
An implementation based on one of these eight-bit character-set encodings may well be content to define wchar_t as one of the one-byte character types (char, signed char, or unsigned char). And it will likely adopt the same character set encoding for both its single-byte and wide-character encoding. Yet another approach is to allow a broader range of wide-character codes, but to permit wide-character conversions only when a single-byte equivalent is defined. This is a good way to reconcile the numerous ISO 8859-x single-byte encodings, or Windows single-byte code pages, with an unambiguous subset of Unicode (UCS-2) wide-character codes.
A widely used character-set encoding outside the ISO family is EBCDIC. It has been used for several decades by IBM and still has a presence. EBCDIC is also an eight-bit encoding, with the added virtue that all its Unicode equivalents (the ISO 8859-1 subset of UCS-2) also fit in a single byte. So it is another candidate for either a one-byte wide-character encoding or a subset of Unicode with only single-byte multibyte codes.
The Japanese were among the first to face the problem of representing character sets whose elements number in the thousands. Over the past couple of decades, a series of Japanese Industrial Standards have evolved to represent a mix of Kanji, Hiragana, and Katakana characters for Japanese, plus the Western characters traditionally used with computers. A widely used encoding is JIS X0208, which uses 16 bits to represent a subset of Kanji plus these other alphabets. As usual, ISO 646 characters form a subset of this larger character set. Thus, JIS X0208 is often used in conjunction with a two-byte representation of wchar_t, usually declared as either short or unsigned short.
Unicode is probably the most widely known encoding for larger character sets. It is maintained by the private Unicode Consortium (see http://www.unicode.org), but has been kept pretty closely in sync with the evolution of ISO 10646, a.k.a. UCS. Both have the virtue of overlapping neatly with ISO 8859-1. And both are serious attempts to provide a single, unified representation for all the characters used throughout the world -- past, present, and future (including Klingon). Unicode got a real boost when it was adopted as the character set encoding for Java, which strives for a high level of portability and international support.
That's the good news. What muddies the picture is that Unicode has been subsetted in ways that now prove to be short-sighted. The full ISO 10646 specification sets aside 31 bits to represent upwards of two billion different characters (UCS-4). These codes can be represented nicely as non-negative values in a 32-bit integer, either signed or unsigned. But until fairly recently, all the defined code values could be represented in 16 bits (UCS-2). Java fixated on a 16-bit representation for the basic Java type char, which is equivalent to the C or C++ type wchar_t. Less rigidly, but for similar reasons, a number of C and C++ implementations have chosen to represent wide characters using the 16-bit subset of Unicode stored in a two-byte wchar_t.
But the set of code values has recently been extended. Currently, all defined characters in ISO 10646 or Unicode fall in the range [0, 0x10FFFF]. As a palliative, some people are proposing the use of UTF-16 as a wide-character encoding. UTF-16 involves a bit of trickery. All the codes above 0xFFFF can be represented in 20 bits, once you subtract 0x10000 from each. These 20 bits can then be divided into two ten-bit pieces and stuffed into a pair of 16-bit codes that occupy holes left in the range [0, 0xFFFF]. Thus UTF-16 provides a way to represent all the currently defined Unicode characters as either one or two 16-bit words.
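The trickery is easier to follow in code. This sketch splits a code above 0xFFFF into its two 16-bit codes and joins them again. The "holes" it uses, [0xD800, 0xDBFF] for the first code and [0xDC00, 0xDFFF] for the second, come from the Unicode definition of UTF-16 rather than from anything stated above. The type and function names are hypothetical.

```cpp
#include <cassert>
#include <utility>

typedef unsigned long code_type;    // a Unicode code value (hypothetical name)
typedef unsigned short unit_type;   // one 16-bit code (hypothetical name)

// Split a code in [0x10000, 0x10FFFF] into a pair of 16-bit codes.
std::pair<unit_type, unit_type> to_surrogates(code_type code)
{
    assert(0x10000 <= code && code <= 0x10FFFF);
    code_type bits20 = code - 0x10000;      // what remains fits in 20 bits
    unit_type high = static_cast<unit_type>(0xD800 + (bits20 >> 10));   // upper ten bits
    unit_type low = static_cast<unit_type>(0xDC00 + (bits20 & 0x3FF));  // lower ten bits
    return std::make_pair(high, low);
}

// Join a pair of 16-bit codes back into the original code.
code_type from_surrogates(unit_type high, unit_type low)
{
    assert(0xD800 <= high && high <= 0xDBFF);
    assert(0xDC00 <= low && low <= 0xDFFF);
    return 0x10000
        + ((static_cast<code_type>(high) - 0xD800) << 10)
        + (static_cast<code_type>(low) - 0xDC00);
}

void example_utf16()
{
    std::pair<unit_type, unit_type> p = to_surrogates(0x1F600);
    assert(p.first == 0xD83D && p.second == 0xDE00);
    assert(from_surrogates(p.first, p.second) == 0x1F600);

    // The two ends of the range map to the two ends of the holes.
    assert(to_surrogates(0x10000) == std::make_pair(unit_type(0xD800), unit_type(0xDC00)));
    assert(to_surrogates(0x10FFFF) == std::make_pair(unit_type(0xDBFF), unit_type(0xDFFF)));
}
```

Note that a program indexing such a sequence by 16-bit element can land in the middle of a pair, which is exactly the fixed-size principle that UTF-16 gives up.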
It is important to emphasize that UTF-16 is neither a proper generalized multibyte encoding, nor a proper multibyte encoding, nor a proper wide-character encoding, in C or C++ terms. It can be made into a generalized multibyte encoding simply by adding rules for specifying the order of individual bytes for the two- or four-byte codes (UTF-16LE for little-endian, UTF-16BE for big-endian, or UTF-16 with a header code that signals the endianness of the codes that follow.) It cannot be made into a multibyte encoding because it contains embedded bytes with zero value. It can be made into a wide-character encoding only by ignoring a fundamental principle -- every character is supposed to be representable as a single element of fixed size. Nevertheless, more than one group has expressed an interest in using UTF-16 as a kind of bastard wide-character encoding, choosing to break a rule for some currently lesser-used characters rather than face the fallout from changing the size of a basic character type.
A four-byte representation for wchar_t has the obvious virtue that it can store all elements of the largest character sets currently under consideration. (The costs of the extra storage required are still being debated.) ISO 10646 is an obvious candidate for wide-character encoding in this case. But even here you can find several subsetting choices. The range of valid character codes can be assumed to be:
Such range issues will become more apparent when examining the choice of multibyte conversion rules that can be used with each wide-character encoding.
As you may have gathered from the dozens of code-conversion facets in this library, there are many multibyte encodings. A large number are jiggered to survive transmission via text files, but not all. Some are designed to be economical of storage, using shorter byte sequences for the more frequently used wide-character codes, but not all. Some are designed to permit easy translation to and from wide-character codes, but not all. The one common denominator of the code-conversion facets presented here is that every one translates a multibyte encoding that was created for a good commercial reason.
If you don't recognize an encoding supported by this library, chances are that you have no (current) need for it. If you see one that you need, chances are that this implementation will do what you want. If you want to learn more about any of the mappings implemented in this library, chances are that the source code will supply more precise details. And if you want to add your own code-conversion facets, chances are that one of the ones in this library will serve as background information and inspiration.
Copyright © 1992-2013 by Dinkumware, Ltd. All rights reserved.