A C++ program can specialize a number of templates from the Dinkum Conversions Library, a portable library for converting wide strings and input/output streams to and from their corresponding multibyte encoding.
"Dinkum/codecvt/wbuffer"
· "Dinkum/codecvt/wstring"
The Dinkum Conversions Library consists of a rich assortment of
code-conversion facets, all suitable as replacements for template
class codecvt, defined in the header <locale> in the Standard C++
library. When you write:
    #include <fstream>
    #include <locale>
    .....
        {   // write Hello as one line to a file
        std::wofstream mystr("myfile.txt");   // open file as wide stream
        mystr << L"Hello" << std::endl;
        }
the program opens a file named myfile.txt and writes six wide
characters (each of type wchar_t) to the file, by converting each of
the wide characters in turn to its corresponding multibyte sequence.
(For more details, see the companion essay on Multibyte Encodings.)
Wide characters inserted into the stream are translated, under the
hood, by the code-conversion facet codecvt<wchar_t, char, mbstate_t>
obtained from the global locale object when the program constructs
mystr.
Similarly, you can construct a wifstream object that reads multibyte
sequences from a file and converts them in turn to a sequence of wide
characters that the program extracts from the object. It uses the same
code-conversion facet as above.
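As a minimal sketch of the reading direction, the helper below opens a file as a wide stream and extracts one line. The function name read_first_line is hypothetical, and the code relies only on the default facet from the global locale, so it is only dependable for text in the basic C character set.

```cpp
#include <fstream>
#include <string>

// Hypothetical helper (not part of any library): read the first line of
// a file through a wide input stream. The stream converts each multibyte
// sequence to a wide character with the default codecvt facet.
std::wstring read_first_line(const char* filename)
{
    std::wifstream mystr(filename);   // open file as wide stream
    std::wstring line;
    std::getline(mystr, line);        // extracts wide characters up to '\n'
    return line;
}
```

Calling read_first_line("myfile.txt") on the file written above should give back the wide string Hello without the trailing newline.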
Unless you explicitly specify otherwise, that facet implements a default
mapping between wide characters and multibyte encodings, which may not
be the mapping you want. Worse, the C++ Standard provides no portable
way to specify the mapping. It leaves each implementation, or programmer,
to supply any desired mappings.
The Dinkum Conversions Library lets you explicitly specify a number
of different mappings, and it works with all the currently popular
implementations of the Standard C++ library. Chances are, it includes the
mapping you want. Say, for example, that you want to convert between Unicode
wide characters within the program and UTF8-encoded files. The header
utf8 defines template class codecvt_utf8 to do this job. So you can write:
    #include <fstream>
    #include <locale>
    #include "Dinkum/codecvt/utf8"
    .....
        {   // write Hello as one line to a file
        std::wofstream mystr("myfile.txt");   // open file as wide stream
        std::locale loc(std::locale::classic(),
            new Dinkum::codecvt::codecvt_utf8<wchar_t>);
        mystr.imbue(loc);   // replace codecvt<wchar_t, char, std::mbstate_t>
        mystr << L"Hello" << std::endl;
        }
In this particular case, the final file contents will be the
pedestrian byte sequence Hello\n, because all codes in the basic C
character set have the obvious single-byte equivalents. But in general,
the program will translate any valid Unicode code to the appropriate
multibyte sequence in the file.
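To make "the appropriate multibyte sequence" concrete, here is a hand-written sketch of the UTF-8 encoding rule for a single code. It is an illustration only, not the library's implementation: the name encode_utf8 is hypothetical, and the sketch does not reject surrogates or codes above 0x10ffff.

```cpp
#include <string>

// Hypothetical helper: encode one Unicode code as a UTF-8 byte sequence.
// Codes below 0x80 map to a single identical byte, which is why "Hello\n"
// comes out unchanged; larger codes need two to four bytes.
std::string encode_utf8(unsigned long code)
{
    std::string out;
    if (code < 0x80) {                 // 1 byte:  0xxxxxxx
        out += static_cast<char>(code);
    } else if (code < 0x800) {         // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (code >> 6));
        out += static_cast<char>(0x80 | (code & 0x3F));
    } else if (code < 0x10000) {       // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (code >> 12));
        out += static_cast<char>(0x80 | ((code >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (code & 0x3F));
    } else {                           // 4 bytes: 11110xxx then three 10xxxxxx
        out += static_cast<char>(0xF0 | (code >> 18));
        out += static_cast<char>(0x80 | ((code >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((code >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (code & 0x3F));
    }
    return out;
}
```

For example, the code 0x48 (H) yields the single byte 0x48, while 0x20AC (the euro sign) yields the three bytes 0xE2 0x82 0xAC.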
Now for an important warning: Not all Standard C++ libraries work
properly with all the code-conversion facets presented here. The Dinkum
C++ Library behaves properly in all cases, but others are less reliable.
So the Conversions Library also includes a stream buffer that offloads
the code conversion process from existing stream buffers, and does it
correctly. The header wbuffer defines template class wbuffer_convert,
which is the conversion stream buffer. So with any Standard C++
library, you can safely replace the code sequence above with:
    #include <fstream>
    #include <locale>
    #include "Dinkum/codecvt/wbuffer"
    #include "Dinkum/codecvt/utf8"
    .....
        {   // write Hello as one line to a file
        std::ofstream bytestream("myfile.txt");   // open file as byte stream
        Dinkum::codecvt::wbuffer_convert<
            Dinkum::codecvt::codecvt_utf8<wchar_t> >
                mybuf(bytestream.rdbuf());   // construct wide stream buffer object
        std::wostream mystr(&mybuf);   // construct wide ostream object
        mystr << L"Hello" << std::endl;
        }
Template class wbuffer_convert looks like a wide stream buffer to
the program. It performs all input and output by calling on an underlying
byte stream buffer, typically an existing one from the Standard C++ library.
Internally, it uses any of the code-conversion facets presented
here to perform the wide-to-multibyte mapping.
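The same layering can be tried without the Dinkum headers, assuming a C++11 library: the Standard has like-named templates std::wbuffer_convert (in <locale>) and std::codecvt_utf8 (in <codecvt>), both deprecated since C++17, so a compiler may warn. The sketch below uses those standard names, not the Dinkum ones, and the helper name write_utf8 is hypothetical. It writes through a wide stream into an in-memory byte stream so the resulting bytes can be inspected.

```cpp
#include <codecvt>
#include <locale>
#include <ostream>
#include <sstream>
#include <string>

// Hypothetical helper: push a wide string through a conversion stream
// buffer layered on top of a byte stream, and return the bytes written.
std::string write_utf8(const std::wstring& text)
{
    std::ostringstream bytestream;            // underlying byte stream
    {
        std::wbuffer_convert<std::codecvt_utf8<wchar_t> >
            mybuf(bytestream.rdbuf());        // wide stream buffer object
        std::wostream mystr(&mybuf);          // wide ostream on top of it
        mystr << text << std::flush;          // flush converted bytes through
    }
    return bytestream.str();
}
```

Because the byte stream here is a std::ostringstream rather than a file, the same function doubles as a quick way to check what a given facet would write to disk.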
Still another approach is to perform code conversion as a mapping between
wide and byte strings. The Standard C++ library uses code-conversion facets
only when performing input and output to a file, but the Conversions
Library provides a string converter as well. The header wstring
defines template class wstring_convert, which is the string converter.
So you can rewrite the running example above as:
    #include <fstream>
    #include <locale>
    #include <string>
    #include "Dinkum/codecvt/wstring"
    #include "Dinkum/codecvt/utf8"
    .....
        {   // write Hello as one line to a file
        std::ofstream bytestream("myfile.txt");   // open file as byte stream
        Dinkum::codecvt::wstring_convert<
            Dinkum::codecvt::codecvt_utf8<wchar_t> > myconv;
        std::string mbstring = myconv.to_bytes(L"Hello\n");   // wide to multibyte
        bytestream << mbstring;   // write the converted bytes
        }
Template class wstring_convert also lets you convert from
multibyte to wide strings, of course.
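Both directions can be sketched with the like-named C++11 templates std::wstring_convert and std::codecvt_utf8 (deprecated since C++17). This assumes the Dinkum string converter behaves the same way for to_bytes and the reverse conversion; the helper names to_utf8 and from_utf8 are hypothetical.

```cpp
#include <codecvt>
#include <locale>
#include <string>

// Hypothetical helpers built on the standard string converter.
typedef std::wstring_convert<std::codecvt_utf8<wchar_t> > Utf8Converter;

// Wide string to UTF-8 encoded byte string.
std::string to_utf8(const std::wstring& wide)
{
    Utf8Converter myconv;
    return myconv.to_bytes(wide);
}

// UTF-8 encoded byte string back to a wide string.
std::wstring from_utf8(const std::string& bytes)
{
    Utf8Converter myconv;
    return myconv.from_bytes(bytes);
}
```

A round trip such as from_utf8(to_utf8(L"Hello\n")) should reproduce the original wide string.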
Here is a table of the code-conversion facets defined in this library:
Header       Wide        Multibyte
File         Code        Code
8859_1       UCS2/4      ISO 8859-1
8859_2       UCS2/4      ISO 8859-2
8859_3       UCS2/4      ISO 8859-3
8859_4       UCS2/4      ISO 8859-4
8859_5       UCS2/4      ISO 8859-5
8859_6       UCS2/4      ISO 8859-6
8859_7       UCS2/4      ISO 8859-7
8859_8       UCS2/4      ISO 8859-8
8859_9       UCS2/4      ISO 8859-9
8859_10      UCS2/4      ISO 8859-10
8859_13      UCS2/4      ISO 8859-13
8859_14      UCS2/4      ISO 8859-14
8859_15      UCS2/4      ISO 8859-15
8859_16      UCS2/4      ISO 8859-16
baltic       UCS2/4      ISO IR-179 Baltic
big5         UCS2/4      BIG5 Chinese double byte
cp037        UCS2/4      Code page 037 IBM US Canada
cp424        UCS2/4      Code page 424 IBM EBCDIC Hebrew
cp437        UCS2/4      Code page 437 DOS Latin US
cp500        UCS2/4      Code page 500 IBM International
cp737        UCS2/4      Code page 737 DOS Greek
cp775        UCS2/4      Code page 775 DOS Baltic Rim
cp850        UCS2/4      Code page 850 DOS Latin1
cp852        UCS2/4      Code page 852 DOS Latin2
cp855        UCS2/4      Code page 855 DOS Cyrillic
cp856        UCS2/4      Code page 856 Hebrew PC
cp857        UCS2/4      Code page 857 DOS Turkish
cp860        UCS2/4      Code page 860 DOS Portuguese
cp861        UCS2/4      Code page 861 DOS Icelandic
cp862        UCS2/4      Code page 862 DOS Hebrew
cp863        UCS2/4      Code page 863 DOS Canada French
cp864        UCS2/4      Code page 864 DOS Arabic
cp865        UCS2/4      Code page 865 DOS Nordic
cp866        UCS2/4      Code page 866 DOS Cyrillic Russian
cp869        UCS2/4      Code page 869 DOS Greek2
cp874        UCS2/4      Code page 874 DOS Thai
cp875        UCS2/4      Code page 875 IBM Greek
cp932        UCS2/4      Code page 932 DOS double byte
cp936        UCS2/4      Code page 936 DOS double byte
cp949        UCS2/4      Code page 949 DOS double byte
cp950        UCS2/4      Code page 950 DOS double byte
cp1006       UCS2/4      Code page 1006 IBM Arabic
cp1026       UCS2/4      Code page 1026 IBM Latin Turkish
cp1250       UCS2/4      Code page 1250
cp1251       UCS2/4      Code page 1251
cp1252       UCS2/4      Code page 1252
cp1253       UCS2/4      Code page 1253
cp1254       UCS2/4      Code page 1254
cp1255       UCS2/4      Code page 1255
cp1256       UCS2/4      Code page 1256
cp1257       UCS2/4      Code page 1257
cp1258       UCS2/4      Code page 1258
cyrillic     UCS2/4      Code page 10007 Mac Cyrillic
ebcdic       UCS2/4      EBCDIC
euc          UCS2/4      EUC Japanese
euc_0208     JIS X0208   EUC Japanese
gb12345      UCS2/4      GB12345-80 double byte
gb2312       UCS2/4      GB2312-80 double byte
greek        UCS2/4      Code page 10006 Mac Greek
iceland      UCS2/4      Code page 10079 Mac Icelandic
jis          UCS2/4      JIS Japanese
jis0201      UCS2/4      JIS X0201 Japanese
jis_0208     JIS X0208   JIS Japanese
ksc5601      UCS2/4      KSC5601 Unified Hangul double byte
latin2       UCS2/4      Code page 10029 Mac Latin2
one_one      UCS2/4      UCS2/4 transparent, optional header, binary
roman        UCS2/4      Code page 10000 Mac Roman
sjis         UCS2/4      Shift JIS Japanese
sjis_0208    JIS X0208   Shift JIS Japanese
turkish      UCS2/4      Code page 10081 Mac Turkish
utf8         UCS2/4      UTF-8, optional header
utf8_utf16   UTF-16      UTF-8, optional header
utf16        UCS2/4      UTF-16, optional header, binary
The first column gives the name of the header file, in the include
subdirectory Dinkum/codecvt. The second column describes the
wide-character encoding assumed by the code-conversion facet
(UCS2/4, JIS X0208, or UTF-16 in this table).
The third column briefly describes the multibyte encoding assumed by the code-conversion facet. For more information, see the header itself. Many are derived from tables made available by various standards bodies. In such cases, the header preserves as comments any descriptive information that accompanies the tables.
All the code-conversion facets are defined as templates with the common form:
    template<class Elem, unsigned long Maxcode = default value>
        class codecvt_XXX {....};
Elem is the wide-character type, typically wchar_t.
Maxcode is the largest wide-character code that the code-conversion
facet will read or write without reporting a conversion error. Each
facet specifies a default value which is most appropriate for its
multibyte encoding.
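The effect of Maxcode can be sketched with the C++11 template std::codecvt_utf8, which takes the same first two parameters (deprecated since C++17). This assumes the Dinkum facets report out-of-range codes the same way. With std::wstring_convert, a conversion error surfaces as a std::range_error exception. The helper name fits_below_0x100 is hypothetical.

```cpp
#include <codecvt>
#include <locale>
#include <stdexcept>
#include <string>

// Hypothetical helper: returns true if every wide character in text can
// be written by a UTF-8 facet whose Maxcode is 0xff, false if the facet
// reports a conversion error for some larger code.
bool fits_below_0x100(const std::wstring& text)
{
    std::wstring_convert<std::codecvt_utf8<wchar_t, 0xff> > myconv;
    try {
        myconv.to_bytes(text);      // throws on a conversion error
        return true;
    } catch (const std::range_error&) {
        return false;
    }
}
```

With this limit, L"abc" converts normally, while a string containing the euro sign (code 0x20ac) is rejected.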
If the multibyte description in the table advertises an "optional header," the template class also has an optional third template parameter to provide additional information about the multibyte encoding. You specify this information as the union of three enumeration constants:
consume_header -- to consume an initial header sequence when reading a multibyte sequence and determine the endianness of the subsequent multibyte sequence to be read
generate_header -- to generate an initial header sequence when writing a multibyte sequence to advertise the endianness of the subsequent multibyte sequence to be written
little_endian -- to generate a multibyte sequence in little-endian order, as opposed to the default big-endian order
Finally, if the multibyte description in the table is labeled as "binary," then it is not suitable for reading and writing as a text file. Such a multibyte sequence may contain nul bytes that do not represent nul characters (which makes it unsuitable for storing in a nul-terminated byte string as well). It may also contain other bytes that get altered or discarded when reading or writing text files. Be sure to open files in binary mode if you read or write wide streams using these code-conversion facets.
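The header options can be sketched with the C++11 facet std::codecvt_utf8, whose third template parameter accepts constants of the same names in namespace std (the facet is deprecated since C++17). This assumes the Dinkum constants behave alike. For UTF-8 the header is the three-byte sequence 0xEF 0xBB 0xBF. The helper names with_header and skip_header are hypothetical.

```cpp
#include <codecvt>
#include <locale>
#include <string>

// Hypothetical helper: convert to UTF-8 and generate an initial header.
std::string with_header(const std::wstring& text)
{
    std::wstring_convert<
        std::codecvt_utf8<wchar_t, 0x10ffff, std::generate_header> > myconv;
    return myconv.to_bytes(text);
}

// Hypothetical helper: convert from UTF-8, consuming an initial header
// if one is present.
std::wstring skip_header(const std::string& bytes)
{
    std::wstring_convert<
        std::codecvt_utf8<wchar_t, 0x10ffff, std::consume_header> > myconv;
    return myconv.from_bytes(bytes);
}
```

Under these assumptions, with_header(L"A") produces the header bytes followed by A, and skip_header applied to that result yields L"A" again.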
To use one of these code-conversion facets, follow the pattern in the examples given above. In more detail:
Choose the header XXX, from the table above, that implements the code-conversion rule you want to use, and include the header "Dinkum/codecvt/XXX" at the top of the C++ source file.
The header defines the template class Dinkum::codecvt::codecvt_XXX. Almost always you will want to specialize the template class on the element type wchar_t, as in Dinkum::codecvt::codecvt_XXX<wchar_t>.
To limit the largest wide-character code the facet accepts, supply the second template argument, as in Dinkum::codecvt::codecvt_XXX<wchar_t, 0x10ffff>. (On an implementation with 32-bit wide characters, this particular value causes a conversion error if the code-conversion facet generates, or is asked to generate, a Unicode wide-character code that is currently undefined.)
For an encoding with an optional header, supply the third template argument, as in Dinkum::codecvt::codecvt_XXX<wchar_t, 0x10ffff, generate_header>.
For more examples of how to use these code-conversion facets, study the test code that exercises each facet.
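Because every facet follows the same template form, the pattern can be written once and parameterized on the facet type. The sketch below uses the C++11 facets std::codecvt_utf8 and std::codecvt_utf16 (deprecated since C++17) as stand-ins for codecvt_XXX; with the Dinkum headers the facet argument would be a Dinkum::codecvt::codecvt_XXX<wchar_t> specialization instead. The helper name to_multibyte is hypothetical.

```cpp
#include <codecvt>
#include <locale>
#include <string>

// Hypothetical helper: convert a wide string to the multibyte encoding
// implemented by Facet, whatever that facet happens to be.
template<class Facet>
std::string to_multibyte(const std::wstring& text)
{
    std::wstring_convert<Facet> myconv;
    return myconv.to_bytes(text);
}
```

For example, to_multibyte<std::codecvt_utf8<wchar_t> >(L"A") gives the single byte A, while to_multibyte<std::codecvt_utf16<wchar_t> >(L"A") gives a nul byte followed by A in the default big-endian order, which illustrates why such "binary" encodings cannot be stored in nul-terminated byte strings.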
Copyright © 1992-2013 by Dinkumware, Ltd. All rights reserved.