Dinkum Conversions Library

A C++ program can specialize a number of templates from the Dinkum Conversions Library, a portable library for converting wide strings and input/output streams to and from their corresponding multibyte encoding.

The Dinkum Conversions Library consists of a rich assortment of code-conversion facets, all suitable as replacements for template class codecvt, defined in the header <locale> in the Standard C++ library. When you write:

#include <fstream>
#include <locale>
.....
    {   // write Hello as one line to a file
    std::wofstream mystr("myfile.txt"); // open file as wide stream
    mystr << L"Hello" << std::endl;
    }

the program opens a file named myfile.txt and writes six wide characters (each of type wchar_t) to the file, by converting each of the wide characters in turn to its corresponding multibyte sequence. (For more details, see the companion essay on Multibyte Encodings.) Wide characters inserted into the stream are translated, under the hood, by the code-conversion facet codecvt<wchar_t, char, mbstate_t> obtained from the global locale object when the program constructs mystr.

Similarly, you can construct a wifstream object that reads multibyte sequences from a file and converts them in turn to a sequence of wide characters that the program extracts from the object. It uses the same code-conversion facet as above. Unless you explicitly specify otherwise, that facet implements a default mapping between wide characters and multibyte encodings, which may not be the mapping you want. Worse, the C++ Standard provides no portable way to specify the mapping. It leaves each implementation, or programmer, to supply any desired mappings.
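To find out what the default facet is on a given implementation, a program can ask the global locale for it directly. The following is a minimal sketch that uses only Standard C++ library calls. The helper name default_facet_max_length is made up for illustration, and the value it reports depends on the implementation and on the current global locale.

```cpp
#include <cwchar>   // std::mbstate_t
#include <locale>

// Hypothetical helper: report the longest multibyte sequence that the
// default code-conversion facet may produce for one wide character.
int default_facet_max_length()
{
    typedef std::codecvt<wchar_t, char, std::mbstate_t> facet_type;
    // std::locale() is a copy of the current global locale, the same
    // locale a wide file stream picks up when it is constructed.
    const facet_type& facet = std::use_facet<facet_type>(std::locale());
    return facet.max_length();
}
```

A result of 1 suggests a simple one-byte-per-character default mapping, which is rarely what a program that handles Unicode text wants.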

The Dinkum Conversions Library lets you explicitly specify a number of different mappings, and it works with all the currently popular implementations of the Standard C++ library. Chances are, it includes the mapping you want. Say, for example, that you want to convert between Unicode wide characters within the program and UTF-8-encoded files. The header utf8 defines template class codecvt_utf8 to do this job. So you can write:

#include <fstream>
#include <locale>
#include "Dinkum/codecvt/utf8"
.....
    {   // write Hello as one line to a file
    std::wofstream mystr("myfile.txt"); // open file as wide stream
    std::locale loc(std::locale::classic(),
        new Dinkum::codecvt::codecvt_utf8<wchar_t>);
    mystr.imbue(loc);  // replace codecvt<wchar_t, char, std::mbstate_t>
    mystr << L"Hello" << std::endl;
    }

In this particular case, the final file contents will be the pedestrian byte sequence Hello\n, because all codes in the basic C character set have the obvious single-byte equivalents. But in general, the program will translate any valid Unicode code to the appropriate multibyte sequence in the file.
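The byte sequences described above can be checked by hand. The following sketch is a hand-written UTF-8 encoder, not the library's implementation. The function name encode_utf8 is hypothetical, and the code assumes C++11 or later so that it can hold Unicode codes in a std::u32string regardless of the size of wchar_t.

```cpp
#include <stdexcept>
#include <string>

// Hand-written UTF-8 encoder, for illustration only.
// Throws std::range_error for surrogates and codes above 0x10FFFF.
std::string encode_utf8(const std::u32string& wide)
{
    std::string bytes;
    for (char32_t c : wide) {
        if (c < 0x80) {                       // one byte: basic characters
            bytes += static_cast<char>(c);
        } else if (c < 0x800) {               // two bytes
            bytes += static_cast<char>(0xC0 | (c >> 6));
            bytes += static_cast<char>(0x80 | (c & 0x3F));
        } else if (c < 0x10000) {             // three bytes
            if (c >= 0xD800 && c <= 0xDFFF)
                throw std::range_error("surrogate is not a valid code");
            bytes += static_cast<char>(0xE0 | (c >> 12));
            bytes += static_cast<char>(0x80 | ((c >> 6) & 0x3F));
            bytes += static_cast<char>(0x80 | (c & 0x3F));
        } else if (c <= 0x10FFFF) {           // four bytes
            bytes += static_cast<char>(0xF0 | (c >> 18));
            bytes += static_cast<char>(0x80 | ((c >> 12) & 0x3F));
            bytes += static_cast<char>(0x80 | ((c >> 6) & 0x3F));
            bytes += static_cast<char>(0x80 | (c & 0x3F));
        } else {
            throw std::range_error("code is above 0x10FFFF");
        }
    }
    return bytes;
}
```

With this sketch, encode_utf8(U"Hello\n") returns the same six bytes unchanged, while a code such as U+00E9 becomes the two-byte sequence 0xC3 0xA9.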

Now for an important warning: Not all Standard C++ libraries work properly with all the code-conversion facets presented here. The Dinkum C++ Library behaves properly in all cases, but others are less reliable. So the Conversions Library also includes a stream buffer that offloads the code conversion process from existing stream buffers, and does it correctly. The header wbuffer defines template class wbuffer_convert, which is the conversion stream buffer. So with any Standard C++ library, you can safely replace the code sequence above with:

#include <fstream>
#include <locale>
#include "Dinkum/codecvt/wbuffer"
#include "Dinkum/codecvt/utf8"
.....
    {   // write Hello as one line to a file
    std::ofstream bytestream("myfile.txt"); // open file as byte stream
    Dinkum::codecvt::wbuffer_convert<
        Dinkum::codecvt::codecvt_utf8<wchar_t> >
            mybuf(bytestream.rdbuf());  // construct wide stream buffer object
    std::wostream mystr(&mybuf); // construct wide ostream object
    mystr << L"Hello" << std::endl;
    }

Template class wbuffer_convert looks like a wide stream to the program. It performs all input and output by calling on an underlying byte stream, typically an existing one from the Standard C++ library. Internally, it uses any of the code-conversion facets presented here to perform the wide-to-multibyte mapping.
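To make the idea of a conversion stream buffer concrete, here is a much-simplified sketch of one. It is not wbuffer_convert itself: the class name utf8_write_buffer is hypothetical, the encoding is fixed at UTF-8, only output is supported, and each wchar_t is assumed to hold one complete Unicode code (no UTF-16 surrogate pairs).

```cpp
#include <ostream>
#include <sstream>
#include <streambuf>
#include <string>

// Hypothetical class, for illustration only: a write-only wide stream
// buffer that encodes each wide character as UTF-8 and forwards the
// resulting bytes to an underlying byte stream buffer.
class utf8_write_buffer : public std::wstreambuf {
public:
    explicit utf8_write_buffer(std::streambuf* bytes) : bytes_(bytes) {}

protected:
    // Called for every character, because no put area is ever set up.
    int_type overflow(int_type ch) override
    {
        if (traits_type::eq_int_type(ch, traits_type::eof()))
            return traits_type::not_eof(ch);   // nothing to write
        const unsigned long c =
            static_cast<unsigned long>(traits_type::to_char_type(ch));
        bool ok = false;                       // stays false above 0x10FFFF
        if (c < 0x80)
            ok = put_byte(c);
        else if (c < 0x800)
            ok = put_byte(0xC0 | (c >> 6)) && put_byte(0x80 | (c & 0x3F));
        else if (c < 0x10000)
            ok = put_byte(0xE0 | (c >> 12))
                && put_byte(0x80 | ((c >> 6) & 0x3F))
                && put_byte(0x80 | (c & 0x3F));
        else if (c <= 0x10FFFF)
            ok = put_byte(0xF0 | (c >> 18))
                && put_byte(0x80 | ((c >> 12) & 0x3F))
                && put_byte(0x80 | ((c >> 6) & 0x3F))
                && put_byte(0x80 | (c & 0x3F));
        return ok ? ch : traits_type::eof();
    }

    // Flushing the wide stream flushes the underlying byte stream.
    int sync() override { return bytes_->pubsync(); }

private:
    // Write one byte to the underlying buffer; false on failure.
    bool put_byte(unsigned long byte)
    {
        typedef std::streambuf::traits_type byte_traits;
        return !byte_traits::eq_int_type(
            bytes_->sputc(static_cast<char>(byte)), byte_traits::eof());
    }

    std::streambuf* bytes_;
};
```

A program would use it much like the example above: construct the buffer over bytestream.rdbuf() (or over a std::stringbuf for testing), then construct a std::wostream on top of it and insert wide characters as usual.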

Still another approach is to perform code conversion as a mapping between wide and byte strings. The Standard C++ library uses code-conversion facets only when performing input and output to a file, but the Conversions Library provides a string converter as well. The header wstring defines template class wstring_convert, which is the string converter. So you can rewrite the running example above as:

#include <fstream>
#include <locale>
#include <string>
#include "Dinkum/codecvt/wstring"
#include "Dinkum/codecvt/utf8"
.....
    {   // write Hello as one line to a file
    std::ofstream bytestream("myfile.txt"); // open file as byte stream
    Dinkum::codecvt::wstring_convert<
        Dinkum::codecvt::codecvt_utf8<wchar_t> > myconv;
    std::string mbstring = myconv.to_bytes(L"Hello\n");
    bytestream << mbstring; // write the converted bytes to the file
    }

Template class wstring_convert also lets you convert from multibyte to wide strings, of course.
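For the reverse direction, the following sketch shows what converting a UTF-8 byte string back to wide characters involves. It is a hand-written decoder, not the library's string converter. The function name decode_utf8 is hypothetical, C++11 or later is assumed, and errors are reported by throwing std::range_error, which is a choice made for this example.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hand-written UTF-8 decoder, for illustration only.
std::u32string decode_utf8(const std::string& bytes)
{
    std::u32string wide;
    std::size_t i = 0;
    while (i < bytes.size()) {
        const unsigned char lead = static_cast<unsigned char>(bytes[i]);
        char32_t code;        // code assembled so far
        std::size_t extra;    // number of continuation bytes expected
        char32_t smallest;    // smallest code that needs this length
        if (lead < 0x80) {
            code = lead; extra = 0; smallest = 0;
        } else if ((lead & 0xE0) == 0xC0) {
            code = lead & 0x1F; extra = 1; smallest = 0x80;
        } else if ((lead & 0xF0) == 0xE0) {
            code = lead & 0x0F; extra = 2; smallest = 0x800;
        } else if ((lead & 0xF8) == 0xF0) {
            code = lead & 0x07; extra = 3; smallest = 0x10000;
        } else {
            throw std::range_error("bad UTF-8 lead byte");
        }
        if (bytes.size() - i <= extra)
            throw std::range_error("truncated UTF-8 sequence");
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char next =
                static_cast<unsigned char>(bytes[i + k]);
            if ((next & 0xC0) != 0x80)
                throw std::range_error("bad UTF-8 continuation byte");
            code = (code << 6) | (next & 0x3F);
        }
        // Reject overlong forms, surrogates, and codes above 0x10FFFF.
        if (code < smallest || code > 0x10FFFF
            || (code >= 0xD800 && code <= 0xDFFF))
            throw std::range_error("invalid code in UTF-8 sequence");
        wide += code;
        i += extra + 1;
    }
    return wide;
}
```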

Code Conversions

Here is a table of the code-conversion facets defined in this library:

Header      Wide        Multibyte
File        Code        Code

8859_1      UCS2/4      ISO 8859-1
8859_2      UCS2/4      ISO 8859-2
8859_3      UCS2/4      ISO 8859-3
8859_4      UCS2/4      ISO 8859-4
8859_5      UCS2/4      ISO 8859-5
8859_6      UCS2/4      ISO 8859-6
8859_7      UCS2/4      ISO 8859-7
8859_8      UCS2/4      ISO 8859-8
8859_9      UCS2/4      ISO 8859-9
8859_10     UCS2/4      ISO 8859-10
8859_13     UCS2/4      ISO 8859-13
8859_14     UCS2/4      ISO 8859-14
8859_15     UCS2/4      ISO 8859-15
8859_16     UCS2/4      ISO 8859-16
baltic      UCS2/4      ISO IR-179 Baltic
big5        UCS2/4      BIG5 Chinese double byte
cp037       UCS2/4      Code page 037 IBM US Canada
cp424       UCS2/4      Code page 424 IBM EBCDIC Hebrew
cp437       UCS2/4      Code page 437 DOS Latin US
cp500       UCS2/4      Code page 500 IBM International
cp737       UCS2/4      Code page 737 DOS Greek
cp775       UCS2/4      Code page 775 DOS Baltic Rim
cp850       UCS2/4      Code page 850 DOS Latin1
cp852       UCS2/4      Code page 852 DOS Latin2
cp855       UCS2/4      Code page 855 DOS Cyrillic
cp856       UCS2/4      Code page 856 Hebrew PC
cp857       UCS2/4      Code page 857 DOS Turkish
cp860       UCS2/4      Code page 860 DOS Portuguese
cp861       UCS2/4      Code page 861 DOS Icelandic
cp862       UCS2/4      Code page 862 DOS Hebrew
cp863       UCS2/4      Code page 863 DOS Canada French
cp864       UCS2/4      Code page 864 DOS Arabic
cp865       UCS2/4      Code page 865 DOS Nordic
cp866       UCS2/4      Code page 866 DOS Cyrillic Russian
cp869       UCS2/4      Code page 869 DOS Greek2
cp874       UCS2/4      Code page 874 DOS Thai
cp875       UCS2/4      Code page 875 IBM Greek
cp932       UCS2/4      Code page 932 DOS double byte
cp936       UCS2/4      Code page 936 DOS double byte
cp949       UCS2/4      Code page 949 DOS double byte
cp950       UCS2/4      Code page 950 DOS double byte
cp1006      UCS2/4      Code page 1006 IBM Arabic
cp1026      UCS2/4      Code page 1026 IBM Latin Turkish
cp1250      UCS2/4      Code page 1250
cp1251      UCS2/4      Code page 1251
cp1252      UCS2/4      Code page 1252
cp1253      UCS2/4      Code page 1253
cp1254      UCS2/4      Code page 1254
cp1255      UCS2/4      Code page 1255
cp1256      UCS2/4      Code page 1256
cp1257      UCS2/4      Code page 1257
cp1258      UCS2/4      Code page 1258
cyrillic    UCS2/4      Code page 10007 Mac Cyrillic
ebcdic      UCS2/4      EBCDIC
euc         UCS2/4      EUC Japanese
euc_0208    JIS X0208   EUC Japanese
gb12345     UCS2/4      GB12345-80 double byte
gb2312      UCS2/4      GB2312-80 double byte
greek       UCS2/4      Code page 10006 Mac Greek
iceland     UCS2/4      Code page 10079 Mac Icelandic
jis         UCS2/4      JIS Japanese
jis0201     UCS2/4      JIS X0201 Japanese
jis_0208    JIS X0208   JIS Japanese
ksc5601     UCS2/4      KSC5601 Unified Hangul double byte
latin2      UCS2/4      Code page 10029 Mac Latin2
one_one     UCS2/4      UCS2/4 transparent, optional header, binary
roman       UCS2/4      Code page 10000 Mac Roman
sjis        UCS2/4      Shift JIS Japanese
sjis_0208   JIS X0208   Shift JIS Japanese
turkish     UCS2/4      Code page 10081 Mac Turkish
utf8        UCS2/4      UTF-8, optional header
utf8_utf16  UTF-16      UTF-8, optional header
utf16       UCS2/4      UTF-16, optional header, binary

The first column gives the name of the header file, in the include subdirectory Dinkum/codecvt. The second column describes the wide-character encoding assumed by the code-conversion facet:

UCS2/4 means Unicode (ISO 10646) encoded within the program as either a 16-bit integer (UCS-2) or a 32-bit integer (UCS-4).
UTF-16 means Unicode encoded within the program as either one or two 16-bit integers. (Note that this does not meet all the requirements of a valid wide-character encoding for Standard C or Standard C++.)
JIS X0208 means the Japanese standard for encoding wide characters within the program as 16-bit integers.
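The "one or two 16-bit integers" of UTF-16 can be illustrated with a short sketch. The function name to_utf16 is hypothetical and the code assumes C++11 or later. Codes below 0x10000 occupy a single 16-bit unit, while larger codes are split into a pair of surrogates.

```cpp
#include <stdexcept>
#include <string>

// Illustrative sketch: encode one Unicode code as UTF-16 code units.
std::u16string to_utf16(char32_t code)
{
    if (code > 0x10FFFF || (code >= 0xD800 && code <= 0xDFFF))
        throw std::range_error("not a valid Unicode code");
    std::u16string units;
    if (code < 0x10000) {
        units += static_cast<char16_t>(code);            // one unit
    } else {
        const char32_t offset = code - 0x10000;          // 20 bits
        units += static_cast<char16_t>(0xD800 + (offset >> 10));    // high surrogate
        units += static_cast<char16_t>(0xDC00 + (offset & 0x3FF));  // low surrogate
    }
    return units;
}
```

Because a single character may need two elements, a program cannot treat every element of such a sequence as a complete character, which is why this form falls short of a valid wide-character encoding.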

The third column briefly describes the multibyte encoding assumed by the code-conversion facet. For more information, see the header itself. Many are derived from tables made available by various standards bodies. In such cases, the header preserves as comments any descriptive information that accompanies the tables.

All the code-conversion facets are defined as templates with the common form:

template<class Elem,
    unsigned long Maxcode = default value>
    class codecvt_XXX {....};

Elem is the wide-character type, typically wchar_t. Maxcode is the largest wide-character code that the code-conversion facet will read or write without reporting a conversion error. Each facet specifies a default value which is most appropriate for its multibyte encoding.
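The effect of Maxcode can be pictured as a simple range check made on every wide-character code. The sketch below is a hypothetical stand-in, not one of the library's facets: the name maxcode_filter and the default value 0x10ffff are assumptions chosen for illustration.

```cpp
// Hypothetical stand-in for a facet's range check; not a library class.
template<class Elem, unsigned long Maxcode = 0x10ffff>
struct maxcode_filter {
    // Returns true if the code may be converted, false if the facet
    // would report a conversion error instead.
    static bool accepts(Elem code)
    {
        return static_cast<unsigned long>(code) <= Maxcode;
    }
};
```

For instance, maxcode_filter<wchar_t, 0xff>::accepts rejects every code above 0xff, which is the kind of limit a single-byte encoding might impose.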

If the multibyte description in the table advertises an "optional header," the template class also has an optional third template parameter to provide additional information about the multibyte encoding. You specify this information as the union (bitwise OR) of any of the following three enumeration constants:

consume_header -- to consume an initial header sequence when reading a multibyte sequence and determine the endianness of the subsequent multibyte sequence to be read
generate_header -- to generate an initial header sequence when writing a multibyte sequence to advertise the endianness of the subsequent multibyte sequence to be written
little_endian -- to generate a multibyte sequence in little-endian order, as opposed to the default big-endian order
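The three constants above can be sketched as bit flags applied to a UTF-16 byte sequence. This is an illustration only: the numeric values of the constants, the enumeration name header_mode, and the helper names utf16_to_bytes and bytes_to_utf16 are all assumptions, and the library's own definitions may differ. The header sequence is taken to be the code 0xFEFF written in the chosen byte order.

```cpp
#include <cstddef>
#include <string>

// Assumed values, for illustration; the library's constants may differ.
enum header_mode { little_endian = 1, generate_header = 2, consume_header = 4 };

// Write 16-bit units as bytes, optionally preceded by a header sequence.
std::string utf16_to_bytes(const std::u16string& units, unsigned mode)
{
    std::u16string all;
    if (mode & generate_header)
        all += static_cast<char16_t>(0xFEFF);   // header advertises byte order
    all += units;
    std::string bytes;
    for (char16_t u : all) {
        const char high = static_cast<char>(u >> 8);
        const char low = static_cast<char>(u & 0xFF);
        if (mode & little_endian) { bytes += low; bytes += high; }
        else                      { bytes += high; bytes += low; }
    }
    return bytes;
}

// Read bytes back as 16-bit units; a trailing odd byte is ignored.
std::u16string bytes_to_utf16(const std::string& bytes, unsigned mode)
{
    bool little = (mode & little_endian) != 0;
    std::size_t i = 0;
    if ((mode & consume_header) && bytes.size() >= 2) {
        const unsigned char b0 = static_cast<unsigned char>(bytes[0]);
        const unsigned char b1 = static_cast<unsigned char>(bytes[1]);
        // The header, if present, overrides the requested byte order.
        if (b0 == 0xFE && b1 == 0xFF)      { little = false; i = 2; }
        else if (b0 == 0xFF && b1 == 0xFE) { little = true;  i = 2; }
    }
    std::u16string units;
    for (; i + 1 < bytes.size(); i += 2) {
        const unsigned b0 = static_cast<unsigned char>(bytes[i]);
        const unsigned b1 = static_cast<unsigned char>(bytes[i + 1]);
        units += static_cast<char16_t>(little ? (b1 << 8) | b0
                                              : (b0 << 8) | b1);
    }
    return units;
}
```

Calling utf16_to_bytes(u"A", generate_header | little_endian) yields the four bytes 0xFF 0xFE 0x41 0x00, and bytes_to_utf16 with consume_header recovers the original text from them whatever byte order the writer chose.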

Finally, if the multibyte description in the table is labeled as "binary," then it is not suitable for reading and writing as a text file. Such a multibyte sequence may contain nul bytes that do not represent nul characters (which makes it unsuitable for storing in a nul-terminated byte string as well). It may also contain other bytes that get altered or discarded when reading or writing text files. Be sure to open files in binary mode if you read or write wide streams using these code-conversion facets.
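The problem with nul bytes is easy to demonstrate. The sketch below uses a hypothetical helper, utf16be_bytes, that writes each 16-bit unit high byte first. It assumes C++11 or later and is not part of the library.

```cpp
#include <cstring>
#include <string>

// Hypothetical helper: big-endian bytes for a sequence of 16-bit units.
std::string utf16be_bytes(const std::u16string& units)
{
    std::string bytes;
    for (char16_t u : units) {
        bytes += static_cast<char>(u >> 8);     // high byte, often nul
        bytes += static_cast<char>(u & 0xFF);   // low byte
    }
    return bytes;
}

// Length of the same bytes when treated as a nul-terminated string.
std::size_t c_string_length(const std::string& bytes)
{
    return std::strlen(bytes.c_str());
}
```

For the text u"Hi" the helper produces four bytes, but the first of them is nul, so c_string_length sees an empty string. The same embedded bytes are what make text-mode file translation unsafe for these encodings.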

To use one of these code-conversion facets, follow the pattern in the examples given above. In more detail:

Determine which header file XXX, from the table above, implements the code-conversion rule you want to use and include the header "Dinkum/codecvt/XXX" at the top of the C++ source file.
Refer to the code-conversion facet by the name Dinkum::codecvt::codecvt_XXX. Almost always you will want to specialize the template class on the element type wchar_t, as in Dinkum::codecvt::codecvt_XXX<wchar_t>.
If you want to disallow conversion of wide-character codes above a certain maximum value, add a second template parameter to specify this value, as in Dinkum::codecvt::codecvt_XXX<wchar_t, 0x10ffff>. (On an implementation with 32-bit wide characters, this particular value causes a conversion error if the code-conversion facet generates, or is asked to generate, a Unicode wide-character code that is currently undefined.)
If the multibyte description in the table advertises an "optional header," you can add a third template parameter as described above, as in Dinkum::codecvt::codecvt_XXX<wchar_t, 0x10ffff, generate_header>.

For more examples of how to use these code-conversion facets, study the test code that exercises each facet.

See also the Table of Contents and the Index.
