ISO C Amendment 1 (MSE)

The Single UNIX Specification, Version 2 includes in its System Interfaces Specification (XSH) the ISO/IEC 9899:1990/Amendment 1:1995 (E) to ISO/IEC 9899:1990, Programming Languages - C (ISO C). This paper is a brief introduction to this extension. It is assumed that the reader is familiar with the C language, and has some basic understanding of internationalization concepts and character encoding methods.

Introduction

ISO C Amendment 1 (MSE) was part of the first amendment made to the ISO C standard. The MSE consists of a set of library functions that provide a relatively complete and consistent set of functions for application programming using multibyte and wide-characters.

The other major items included in this amendment are digraphs, alternate spellings for several C tokens, and the header <iso646.h>. These items are not discussed here since they are outside the scope of this paper.

The ISO C standard laid some groundwork for multibyte and wide-character programming by providing a small number of multibyte and wide character functions. The working group decided to wait for the C developer community to acquire more experience with implementing multibyte and wide-character libraries before extending this model further.

A working group (ISO/JTC1/SC22/WG14) was set up to study the various existing implementations and developed the Multibyte Support Extension as part of the first amendment (called C Integrity) to the ISO C standard.

The System Interfaces Specification, XSH, Issue 4, Version 2, which was developed in 1994, incorporated a draft version of the MSE. XSH, Issue 5 incorporates the final version of the MSE. There were a small number of differences between the draft and final versions of the MSE and these are detailed later in this paper.

Extended Characters

We traditionally think of characters as one byte entities represented by the char data type. This is simple, but allows for a maximum of 256 distinct characters.

In the MSE model, the concept of a character has been extended. Extended characters can be represented in three ways:

multibyte character encodings
wide-character encodings
generalized multibyte encodings.

A multibyte character is a sequence of one or more bytes that can be represented as an array of type char; in other words a single character may occupy one or more consecutive bytes. An example of such an encoding is EUC (Extended UNIX Code). EUC provides a structure by which any number of codesets may be encoded into a multibyte encoding.

The primary advantage to the one byte/one character model is that it is very easy to process data in fixed-width chunks. For this reason, the concept of the wide character was invented. A wide character is an abstract data type large enough to contain the largest character that is supported on a particular platform. To date, most system implementors have chosen 32 bits, although there are implementations with 16 bit and 8 bit wide characters. It should be noted that although many vendors have chosen a 32-bit wide character, because the wide character is an abstract type it is not guaranteed to be the same across all platforms.

To support the concept of wide-characters, the MSE defines the integral type wchar_t. However, it does not define the size of wchar_t, but states it shall be as wide as necessary to hold the largest character in the code sets of the locales that an implementation supports.

In addition to the traditional concept of the multibyte character, the MSE has added the concept of the generalized multibyte character - see below.

Multibyte Characters

There are many different multibyte encoding schemes, but these can be broken down into three basic categories:

restartable multibyte encodings
stateful multibyte encodings
generalized multibyte encodings.

Restartable multibyte encodings are defined such that if you were to process a multibyte data stream it would be possible to determine the correct separation of characters no matter where you were positioned in the data stream. In the case of stateful encodings, you need one extra piece of information to be able to correctly process characters in the data stream. This extra piece of information is commonly referred to as the state of the data stream.

Why must we be able to unambiguously restart a data stream? If any byte sequence can have more than one meaning as a sequence of characters, then the multibyte code is ambiguous; that is, you could have multiple meanings for the same data stream depending upon where you started in the data stream. For example, the following multibyte encoding is not restartable:

    0x41  0x42  0x61  0x62  0x43

In this particular encoding, the combination of 0x61 and 0x62 produces an F. If we start processing this string at the beginning, all the characters would be processed correctly and the result would be the string:

    A B F C

If we start processing the string at 0x62, then the result would be the partial string:

    b C

In a restartable encoding, the conversion interfaces would have recognized the 0x62 as an illegal multibyte character, and our program could choose to ignore that illegal character and move on, or perhaps it might try to back up and see if it could form a complete multibyte character.

In restartable multibyte encodings, each byte sequence in a particular encoding scheme stands for one character; the same character regardless of context. Stateful multibyte encoding schemes have a concept of shift state; certain codes called shift sequences effectively change the data stream to a different shift state, and the meaning of byte sequences is changed according to the current shift state.

If we use the same multibyte encoding and make it a stateful encoding, we will introduce two new operators called shift state operators, SS0 and SS1. The default shift state for this particular codeset is SS0. In this example, the 0x61 in its shifted state produces an F, and in its default state produces an a:

    0x41  0x42  SS1  0x61  SS0  0x43  0x61

Since the default shift state is SS0, the above sequence of bytes should produce the string:


    A B F C a

The stateful multibyte encodings are not restartable either, because if we started processing the string after a shift state operator, we could potentially get the wrong string.
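The effect of shift state can be sketched with a small decoder for the toy encoding above. This is only an illustration: the text does not assign byte values to SS0 and SS1, so the values 0x0E and 0x0F used here are assumptions, as are the names toy_decode(), TOY_SS0 and TOY_SS1. An ASCII execution character set is also assumed.

```c
#include <stddef.h>

/* Hypothetical byte values for the shift operators (not given in the text). */
enum { TOY_SS0 = 0x0E, TOY_SS1 = 0x0F };

/*
 * Decode n bytes of the toy stateful encoding into out, which must have
 * room for n + 1 chars. start_shifted is the shift state assumed at the
 * first byte (0 = SS0, the default). Returns the number of characters
 * written, not counting the terminating null.
 */
size_t toy_decode(const unsigned char *in, size_t n, int start_shifted, char *out)
{
    int shifted = start_shifted;
    size_t count = 0;

    for (size_t i = 0; i < n; i++) {
        if (in[i] == TOY_SS0) {
            shifted = 0;                 /* back to the default state */
        } else if (in[i] == TOY_SS1) {
            shifted = 1;                 /* enter the shifted state */
        } else if (shifted && in[i] == 0x61) {
            out[count++] = 'F';          /* 0x61 means F while shifted */
        } else {
            out[count++] = (char)in[i];  /* every other byte maps to itself */
        }
    }
    out[count] = '\0';
    return count;
}
```

Decoding the whole sequence from the default state gives A B F C a. Decoding from the byte after SS1 with the default state assumed turns the F into an a, which is the wrong string described above.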

Normally, if you try to pass a string containing multibyte characters to a function that does not know about them, such a function treats a string as a sequence of bytes, and interprets certain byte values specially; for example, the null byte and the slash character. Since it is illegal for a multibyte character to use any of the special byte values as part of its encoding, the function should pass it through as if it were a single byte string. [Note: The multibyte encoding may still use the slash or null byte; it just cannot use them as part of another multibyte character.]

This is where the concept of the generalized multibyte encoding arises. Traditionally, we think of multibyte encodings as file code and wide characters as process code, where file code resides on disk and process code is used by an application. This is not to say that multibyte encodings are not used by applications. Indeed, many applications today use multibyte encodings routinely, but because they do not need to process characters as discrete chunks, they have no need to convert the multibyte encodings to wide characters.

In summary, generalized multibyte encodings can be encoded in any way. The special byte values discussed above have no meaning in generalized multibyte encodings. Functions that have no concept of multibyte encodings would fail if they tried to process generalized multibyte encodings. By defining the concept of generalized multibyte encodings, we provide a method by which we can say a particular file is associated with a particular locale, and can only be processed by specific routines running in this locale. Generalized multibyte encodings are more of a logical grouping than a specific definition. They provide us with a way to associate files with specific locales and codesets, and allow us to safely operate on those files as long as we are in the proper locale. The important restriction is that generalized multibyte characters can never be processed directly; they can exist only on disk. [Note: Processed refers to the parsing routines available in C. Any file may be processed as binary data.]

To take an example of a generalized multibyte encoding, Unicode is a 16-bit codeset that can be found on Windows 95 and Windows NT. One of the problems with Unicode is that it has null bytes embedded in its encoding. For example, the string:

    a b c

is actually encoded as follows:

    0x00  0x61  0x00  0x62  0x00  0x63  0x00 0x00

Those who are familiar with any of the string handling routines in C, can see that these routines will have problems with this string. Similarly, if you tried to read this file from a disk as a text file you would have problems. However, with the concept of generalized multibyte encodings we can say this file is associated with a Unicode locale and the stdio routines can be smart enough to know that when they are in the Unicode locale they can read the Unicode file properly.
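The problem can be seen in a short sketch that applies strlen() to the byte sequence shown above. The names abc_utf16be, byte_length() and units16_length() are hypothetical and used only for illustration.

```c
#include <stddef.h>
#include <string.h>

/* "abc" in the 16-bit big-endian form shown above, with a 16-bit terminator. */
static const char abc_utf16be[] = {
    0x00, 0x61, 0x00, 0x62, 0x00, 0x63, 0x00, 0x00
};

/* Length as seen by a byte-oriented routine: stops at the first null byte. */
size_t byte_length(const char *p)
{
    return strlen(p);
}

/* Length in 16-bit units: stops only when both bytes of a unit are zero. */
size_t units16_length(const char *p)
{
    size_t n = 0;

    while (p[2 * n] != 0 || p[2 * n + 1] != 0)
        n++;
    return n;
}
```

Here byte_length(abc_utf16be) is 0, because the very first byte is a null byte, while units16_length(abc_utf16be) is 3.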

Headers

The MSE defines two headers to support the new functionality:

<wctype.h>
Contains the declarations for the functions analogous to those in <ctype.h>; that is, the classification and mapping functions.
<wchar.h>
Contains the remaining declarations.

The header <wchar.h> declares the following types:

wchar_t
An integer type whose range is large enough to represent all distinct values in any extended character set in the supported locales. Known as the wide character type.
mbstate_t
Stores the current parse state of a stream.
wint_t
An integer type that can hold any wide-character and WEOF.

The following macros are declared:

WCHAR_MAX
Maximum value representable by an object of type wchar_t.
WCHAR_MIN
Minimum value representable by an object of type wchar_t.
WEOF
Wide-character end-of-file.

and the following error macro was added to the header <errno.h>:

EILSEQ
An invalid wide-character encoding, or a sequence of bytes that does not form a valid multibyte character, was encountered.

Two standard macros can be used to find out the maximum possible number of bytes in a character:

MB_LEN_MAX
Returns the maximum length of a multibyte character for any supported locale as a positive integer. It is defined in <limits.h>.
MB_CUR_MAX
Returns the maximum number of bytes in a multibyte character in the current locale as a positive integer. The value is never greater than MB_LEN_MAX. It is defined in <stdlib.h>.
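A common use of these macros is sizing a buffer that receives one converted character. The following is a minimal sketch; encode_one() is a hypothetical helper, and it uses the ISO C function wctomb() rather than anything specific to the MSE.

```c
#include <limits.h>
#include <stdlib.h>
#include <wchar.h>

/*
 * Convert one wide character to its multibyte form in the current locale.
 * out must have room for MB_LEN_MAX bytes, which is enough for any
 * supported locale. Returns the number of bytes stored, or -1 if wc has
 * no multibyte representation.
 */
int encode_one(wchar_t wc, char out[MB_LEN_MAX])
{
    return wctomb(out, wc);
}
```

MB_LEN_MAX is a compile-time constant, so it can be used as an array size. MB_CUR_MAX depends on the current locale and is better suited to run-time checks or dynamically sized buffers.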

Character Classification

Character classification determines whether a particular character code refers to an upper-case alphabetic, lower-case alphabetic, alphanumeric, digit, punctuation, control or space character, or any one of a number of other groupings.

In the past macros were often used to classify character codes. This was possible since the assumption was that an application was dealing with ASCII characters. Today, classification functions are used which classify wide-character codes according to the type rules defined by the category LC_CTYPE of the application's current locale.

In the ISO C standard the behavior of character classification functions is affected by the current locale. Some functions have implementation-dependent behavior when not in the POSIX locale. For example, in the POSIX locale, isupper() returns true (non-zero) only for upper-case letters. The MSE contains no description of how the POSIX locale affects the behavior of the above functions, but states that when a character c causes an isxxx(c) function to return true, the corresponding wide-character wc shall cause the corresponding wide-character function to return true. Note, however, that the converse is not true.

The ISO C standard defines 11 classification (also known as character testing) functions. The MSE defines an analogous set of wide-character classification functions, returning non-zero for true and zero for false, for example iswalnum() is analogous to isalnum().

As the number of defined locales increased, the requirement for additional character classes increased. For example, while a classification function such as isupper() makes perfect sense in the English language, it does not make any sense in a language such as Japanese that has no concept of case. Conversely, a function such as iskana() makes perfect sense for Japanese, but doesn't make any sense in English. For this reason, the MSE defined two extensible wide-character classification functions - wctype() and iswctype() - as general-purpose solutions to this problem.

Name	Purpose	Syntax
wctype()	Construct a value with type wctype_t that describes a class of wide characters identified by property	wctype_t wctype(const char *property);
iswctype()	Determine whether a wide character has the property identified by desc	int iswctype(wint_t wc, wctype_t desc);

These two functions are generally used in combination. However, sometimes the wctype() function is used on its own by an application to test whether a character classification is available in a specific locale. If the current setting of the LC_CTYPE locale changes between calls, the behavior is undefined.

The MSE specifies that the following code segments are equivalent to each other:

 

iswctype(wc, wctype("alnum"))    iswalnum(wc)
iswctype(wc, wctype("alpha"))    iswalpha(wc)
iswctype(wc, wctype("cntrl"))    iswcntrl(wc)
iswctype(wc, wctype("digit"))    iswdigit(wc)
iswctype(wc, wctype("graph"))    iswgraph(wc)
iswctype(wc, wctype("lower"))    iswlower(wc)
iswctype(wc, wctype("print"))    iswprint(wc)
iswctype(wc, wctype("punct"))    iswpunct(wc)
iswctype(wc, wctype("space"))    iswspace(wc)
iswctype(wc, wctype("upper"))    iswupper(wc)
iswctype(wc, wctype("xdigit"))   iswxdigit(wc)
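Used together, the two functions might be wrapped as in the sketch below. The helper name has_class() is hypothetical. The sketch relies on wctype() returning zero when the property name is not valid in the current locale.

```c
#include <wctype.h>

/*
 * Returns 1 if wc belongs to the named class, 0 if it does not, and -1
 * if the class name is not recognized in the current locale.
 */
int has_class(wint_t wc, const char *name)
{
    wctype_t desc = wctype(name);

    if (desc == (wctype_t)0)
        return -1;                      /* class not available */
    return iswctype(wc, desc) != 0;
}
```

For example, has_class(L'A', "upper") gives the same answer as iswupper(L'A'), while a locale-specific class name can be probed without needing a dedicated function for it.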

Mapping Functions

Mapping functions are sometimes called case conversion functions, because the original mapping functions simply mapped upper-case to lower-case and vice versa.

In the past, case conversion was often handled by means of macros. This was possible since the assumption was that an application was dealing with ASCII characters. Mapping functions are used to provide case conversion according to shift tables defined in the LC_CTYPE category of the application's current locale.

The following wide-character mapping functions are provided:

MSE	ISO C	Purpose
towlower()	tolower()	Convert an upper-case letter to its corresponding lower-case letter if iswupper() is true and there is a corresponding wide character for which iswlower() is true.
towupper()	toupper()	Convert a lower-case letter to its corresponding upper-case letter if iswlower() is true and there is a corresponding wide character for which iswupper() is true.

As the number of defined locales increased, the requirement for additional character mappings increased. For example, while a function such as toupper() makes perfect sense in the English language, it doesn't make any sense in a language such as Japanese which has no concept of case. Conversely, the tokana() function makes no sense in the English language.

For this reason, the MSE defined two extensible wide-character mapping functions - wctrans() and towctrans() - as general-purpose solutions to this problem. The name of the required character conversion is passed as an argument to the wctrans() function to avoid name space pollution.

Name	Purpose	Syntax
wctrans()	Construct a value with type wctrans_t that describes the mapping between wide characters identified by property	wctrans_t wctrans(const char *property);
towctrans()	Map the wide character per specified mapping.	wint_t towctrans(wint_t wc, wctrans_t desc);

In addition, the MSE specifies that the following code segments are equivalent to each other:

 

towctrans(wc, wctrans("tolower"))    towlower(wc)
towctrans(wc, wctrans("toupper"))    towupper(wc)

The wctrans() function also enables an application to test whether a character mapping is available in a specific locale.
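A minimal sketch of the extensible mapping interface follows. The helper name map_char() is hypothetical. The sketch relies on wctrans() returning zero when the mapping name is not valid in the current locale, in which case the character is returned unchanged.

```c
#include <wctype.h>

/*
 * Apply the named mapping to wc. If the mapping is not available in the
 * current locale, wc is returned unchanged.
 */
wint_t map_char(wint_t wc, const char *name)
{
    wctrans_t desc = wctrans(name);

    if (desc == (wctrans_t)0)
        return wc;                      /* mapping not available */
    return towctrans(wc, desc);
}
```

With this helper, map_char(wc, "toupper") behaves like towupper(wc), and a locale that defines further mappings can expose them by name without adding new functions.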

Number Conversion Functions

Three new functions are included to facilitate conversion from wide-character strings (also known as wide strings) to a variety of numeric formats.

MSE	ISO C	Purpose	Syntax
wcstod()	strtod()	Convert the initial portion of a wide string to a double	double wcstod(const wchar_t *n, wchar_t **end);
wcstol()	strtol()	Convert the initial portion of a wide string to a long	long wcstol(const wchar_t *n, wchar_t **end, int base);
wcstoul()	strtoul()	Convert the initial portion of a wide string to an unsigned long	unsigned long wcstoul(const wchar_t *n, wchar_t **end, int base);

These functions work as follows:

First, the function decomposes the wide-character string into three parts:
- An initial, possibly empty, sequence of white-space wide characters as determined by the iswspace() function
- a subject sequence interpreted as either a floating point constant, long or unsigned long
- a final sequence of one or more unrecognized wide-character codes including the terminating null wide character.
The function then attempts to convert the subject sequence into the required number format by parsing the subject sequence and returning the result. If the subject sequence is empty or does not have the expected form, no conversion is performed.

In other than the POSIX locale, implementation-dependent forms of a subject sequence may be supported.

The function wcstod() has a dependency on the value of the RADIXCHAR item in the application's current locale. In locales where the radix character is not defined, it defaults to a period.
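The three-part decomposition can be observed through the end pointer. The following sketch wraps wcstol(); parse_long() is a hypothetical helper name, and base 10 is chosen only for the example.

```c
#include <wchar.h>

/*
 * Convert the initial portion of s to a long in base 10.
 * *value receives the result and *rest points at the first wide
 * character that was not part of the subject sequence.
 * Returns 1 if a conversion was performed, 0 otherwise.
 */
int parse_long(const wchar_t *s, long *value, const wchar_t **rest)
{
    wchar_t *end;

    *value = wcstol(s, &end, 10);
    *rest = end;
    return end != s;    /* end == s means the subject sequence was empty */
}
```

For the input L"  42abc" the leading white space is skipped, 42 is returned, and rest points at L"abc". For L"xyz" no conversion is performed and rest is left pointing at the start of the string.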

String Handling

Sixteen new wide-character string functions are defined. Most are similar to their char-based counterparts. For example, wcscpy() is analogous to strcpy(), but operates on wide strings. In general, the data types of some parameters differ, but the purpose of the parameters is the same.

The comparison functions wcscmp() and wcsncmp() compare two wide-character strings by comparing the wide characters based on the character's encoded value, while the wcscoll() function compares each wide character interpreted according to the collating sequence information specified by the LC_COLLATE category of the current locale.

The wcsxfrm() function transforms a wide-character string and places the result in an array of wide characters. The transformation is such that if the wcscmp() function is applied to two transformed wide-character strings, the result is the same as if the two wide-character strings were compared using wcscoll(). Both wide-character strings must be transformed using wcsxfrm(). It is invalid to compare a transformed string to a non-transformed string. Note that no function is defined to restore a transformed string to its original layout.

When wide-character strings are likely to be compared more than once, it is more efficient to transform them using wcsxfrm(), compare them using wcscmp(), and retain the transformed strings for subsequent comparisons.
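One way to retain transformed strings is to build a sort key once per string, as in the sketch below. The helper name make_sort_key() is hypothetical. The sketch assumes that calling wcsxfrm() with a zero count and a null destination returns the required length, which the function's definition permits.

```c
#include <stdlib.h>
#include <wchar.h>

/*
 * Return a newly allocated transformed copy of s, suitable for repeated
 * comparison with wcscmp(). The caller frees the result.
 * Returns NULL if the transformation or the allocation fails.
 */
wchar_t *make_sort_key(const wchar_t *s)
{
    size_t n = wcsxfrm(NULL, s, 0);     /* length needed, without the null */
    wchar_t *key;

    if (n == (size_t)-1)
        return NULL;
    key = malloc((n + 1) * sizeof *key);
    if (key != NULL)
        wcsxfrm(key, s, n + 1);
    return key;
}
```

Two keys built this way can be compared with wcscmp() as often as needed, and the sign of the result matches that of wcscoll() applied to the original strings.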

The MSE also defines a number of wide-character array functions. These functions operate on arrays of type wchar_t whose size is specified by a separate count argument. These functions are not affected by locale and all wchar_t values are treated identically, including the null wide character and wide characters not corresponding to valid multibyte characters. Thus, the wmemcmp() function compares each wide-character array element using the encoded value of each wide-character.

MSE	ISO C	Purpose	Syntax
wmemchr()	memchr()	Locate first occurrence of wide character c in the initial n wide characters of s	wchar_t *wmemchr(const wchar_t *s, wchar_t c, size_t n);
wmemcmp()	memcmp()	Compare first n wide characters of s1 and s2	int wmemcmp(const wchar_t *s1, const wchar_t *s2, size_t n);
wmemcpy()	memcpy()	Copy first n wide characters from s2 to s1	wchar_t *wmemcpy(wchar_t *s1, const wchar_t *s2, size_t n);
wmemmove()	memmove()	Copy first n wide characters from s2 to s1	wchar_t *wmemmove(wchar_t *s1, const wchar_t *s2, size_t n);
wmemset()	memset()	Set first n wide characters of s to wide character c	wchar_t *wmemset(wchar_t *s, wchar_t c, size_t n);

The wmemmove() function copies the specified number of wide characters from the object pointed to by s2 into s1. However, unlike wmemcpy(), the objects s1 and s2 can safely overlap. Copying occurs as if the required elements of the object s2 are first copied into a temporary area and then copied into the object pointed to by s1.
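A typical overlapping copy is shifting part of an array by one position. The sketch below uses a hypothetical helper, shift_right(), to illustrate it.

```c
#include <wchar.h>

/*
 * Move the first n elements of buf one slot to the right.
 * buf must have room for at least n + 1 elements. Source and
 * destination overlap, so wmemmove() is required, not wmemcpy().
 */
void shift_right(wchar_t *buf, size_t n)
{
    wmemmove(buf + 1, buf, n);
}
```

After shift_right(buf, 3) on an array holding a, b, c, d, the array holds a, a, b, c: the first three elements have moved up intact and buf[0] is left unchanged.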

The Input/Output Model

The MSE input/output model assumes that characters are handled as wide-characters within an application and stored as multibyte characters in files, and that all the wide-character input/output functions begin executing with the stream positioned at the boundary between two multibyte characters.

The definition of a stream was changed to include the concept of an orientation for both text and binary streams. After a stream is associated with a file, but before any operations are performed on the stream, the stream is without orientation. If a wide-character input or output function is applied to a stream without orientation, the stream becomes wide-oriented. Likewise, if a byte input or output operation is applied to a stream without orientation, the stream becomes byte-oriented. Thereafter, only the fwide() or freopen() functions can alter the orientation of a stream.

Byte input/output functions shall not be applied to a wide-oriented stream and wide-character input/output functions shall not be applied to a byte-oriented stream.

While wide-oriented streams are sequences of wide characters, the external file associated with a wide-oriented stream may be an implementation-dependent multibyte encoding. Furthermore, it is acceptable that the file associated with this stream is a generalized multibyte encoding such as Unicode.

The following function is specified to enable applications to determine and/or set the orientation of a stream:

Name	Purpose	Syntax
fwide()	Determine the orientation of a stream.	int fwide(FILE *stream, int mode);

If mode is zero, stream orientation is not altered. If mode is >0, the function first attempts to make the stream wide-oriented. If mode <0, the function first attempts to make the stream byte-oriented.
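Querying the orientation without changing it might look like the following sketch. The helper name stream_orientation() is hypothetical; it only normalizes the return value of fwide() to -1, 0 or 1.

```c
#include <stdio.h>
#include <wchar.h>

/*
 * Report the orientation of a stream without altering it:
 *  1 = wide-oriented, -1 = byte-oriented, 0 = no orientation yet.
 */
int stream_orientation(FILE *fp)
{
    int r = fwide(fp, 0);       /* mode 0: query only */

    return (r > 0) - (r < 0);
}
```

A freshly opened stream reports 0. After a call such as fputwc() it reports 1, and after a call such as fputc() on a different fresh stream it reports -1.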

Note that the input/output model does not preclude applications from storing data in external files as wide characters.

Wide-character Input Functions

Wide-character input functions read multibyte characters from a stream and convert them to wide characters. An encoding error occurs if the byte sequence does not form a valid wide character in the current locale.

The following table lists the wide-character input functions specified in the MSE together with their equivalent char-based functions:

MSE	ISO C	Purpose	Syntax
getwc()	getc()	Get a wide character from a stream.	wint_t getwc(FILE *stream);
getwchar()	getchar()	Get a wide character from stdin.	wint_t getwchar(void);
fgetwc()	fgetc()	Get a wide character from a stream.	wint_t fgetwc(FILE *stream);
fgetws()	fgets()	Get a wide-character string from a stream.	wchar_t *fgetws(wchar_t *s, int n, FILE *stream);
fwscanf()	fscanf()	Get formatted input from a stream.	int fwscanf(FILE *stream, const wchar_t *format, ...);
swscanf()	sscanf()	Get formatted input from a wide-character string.	int swscanf(const wchar_t *s, const wchar_t *format, ...);
wscanf()	scanf()	Get formatted input from stdin.	int wscanf(const wchar_t *format, ...);
ungetwc()	ungetc()	Push a wide character back on a stream.	wint_t ungetwc(wint_t c, FILE *stream);

All of the above functions work in a similar manner to their corresponding char-based functions, except that format strings must be wide-character strings.

However, the following format specifiers accept an additional l (ell) qualifier:

[: Matches a non-empty sequence of wide characters from a set of expected characters. If an l qualifier is present, the corresponding argument shall be a pointer to a wide-character array; otherwise the corresponding argument is a pointer to a character array.
c: Matches a sequence of wide-characters. If an l qualifier is present, the corresponding argument is a pointer to a wide-character array large enough to hold the sequence; otherwise the corresponding argument is a pointer to a character array, and characters are converted as if by repeated calls to the wcrtomb() function.
s: Matches a sequence of non-white-space wide characters. If an l qualifier is present, the corresponding argument is a pointer to a wide-character array; otherwise it is a pointer to a character array, and characters from the input fields are converted as if by repeated calls to the wcrtomb() function.
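The effect of the l qualifier on the s specifier can be sketched with swscanf(). The helper name parse_pair() and the field width of 15 are assumptions made for the example; both destination arrays are expected to hold at least 16 elements.

```c
#include <wchar.h>

/*
 * Read two white-space separated words from a wide string.
 * The first is stored as wide characters (%ls); the second is converted
 * to multibyte characters and stored in a char array (%s).
 * Returns the number of items assigned.
 */
int parse_pair(const wchar_t *input, wchar_t *wide_word, char *byte_word)
{
    return swscanf(input, L"%15ls %15s", wide_word, byte_word);
}
```

For the input L"hello world" the function assigns two items: wide_word receives L"hello" and byte_word receives "world".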

Wide-character Output Functions

Wide-character output functions convert wide characters to multibyte characters and write them to the stream. An encoding error occurs if the wide character does not correspond to a valid multibyte character in the current locale.

The following table lists the wide-character output functions specified in the MSE together with their equivalent char-based functions:

MSE	ISO C	Purpose	Syntax
putwc()	putc()	Write a wide character to a stream.	wint_t putwc(wchar_t c, FILE *stream);
putwchar()	putchar()	Write a wide character to stdout.	wint_t putwchar(wchar_t c);
fputwc()	fputc()	Write a wide character to a stream.	wint_t fputwc(wchar_t c, FILE *stream);
fputws()	fputs()	Write a wide-character string to a stream.	int fputws(const wchar_t *s, FILE *stream);
fwprintf()	fprintf()	Write to a stream using a wide-character format specification.	int fwprintf(FILE *stream, const wchar_t *format, ...);
wprintf()	printf()	Write to stdout using a wide-character format specification.	int wprintf(const wchar_t *format, ...);
swprintf()	sprintf()	Write to a wide-character array using a wide-character format specification.	int swprintf(wchar_t *s, size_t n, const wchar_t *format, ...);
vfwprintf()	vfprintf()	Equivalent to fwprintf() except using va_list syntax.	int vfwprintf(FILE *stream, const wchar_t *format, va_list arg);
vwprintf()	vprintf()	Equivalent to wprintf() except using va_list syntax.	int vwprintf(const wchar_t *format, va_list arg);
vswprintf()	vsprintf()	Equivalent to swprintf() except using va_list syntax.	int vswprintf(wchar_t *s, size_t n, const wchar_t *format, va_list arg);

All of the above functions work in a similar manner to their corresponding char-based functions, except that format strings must be wide-character strings.

The following format specifiers accept an additional l (ell) qualifier:

c: Write a character. If an l qualifier is present, the corresponding wint_t argument is converted to wchar_t and written; otherwise the argument is converted to a wide character as if by calling btowc(), and the resulting wide-character is written.
s: Write a string. If an l qualifier is present, the corresponding argument is a pointer to a wide-character array; otherwise it is a pointer to a character array, and characters from the array are converted as if by repeated calls to the mbrtowc() function.
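The following sketch combines both forms of the s specifier and the lc specifier in one swprintf() call. The helper name format_greeting() and the text being formatted are assumptions made for the example.

```c
#include <wchar.h>

/*
 * Format a wide string (%ls), a multibyte string (%s) and a wide
 * character (%lc) into out, which holds n wide characters.
 * Returns the number of wide characters written, not counting the
 * terminating null, or a negative value if the result does not fit.
 */
int format_greeting(wchar_t *out, size_t n,
                    const wchar_t *wide_name, const char *byte_name)
{
    return swprintf(out, n, L"%ls and %s%lc",
                    wide_name, byte_name, (wint_t)L'!');
}
```

Note that, unlike sprintf(), swprintf() takes the size of the destination array as its second argument, so an undersized buffer results in a negative return value rather than an overrun.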

Conversion Functions

As discussed earlier, multibyte character streams may have state-dependent encodings. To handle state-dependent encodings, the MSE includes the concept of a conversion state that is associated with each FILE object and that affects the behavior of conversions between multibyte and wide-character encodings.

The conversion state information augments the FILE object's information about the current position of the multibyte character stream with information about the parse state for the next multibyte character to be obtained from the stream. For state-dependent encodings, the remembered shift state is part of this parse state. Every wide-character input or output function makes use of this state information and updates its corresponding FILE object's conversion state accordingly.

The non-array type mbstate_t is defined to encode the conversion state under the rules of the current locale and provide a character accumulator. This implies that encoding rule information is part of the conversion state. No initialization function is provided to initialize mbstate_t. A zero-valued mbstate_t is assumed to describe the initial conversion state. Such a zero-valued mbstate_t object is said to be unbound. Once a multibyte or wide-character conversion function is called with the mbstate_t object as an argument, the object becomes bound and holds the conversion state information which it obtains from the LC_CTYPE category of the current locale. No comparison function is specified for comparing two mbstate_t objects.

The MSE assumes that only wide-character input/output functions can maintain consistency between a stream and its corresponding conversion state. Byte input/output functions do not manipulate or use conversion state information. Wide-character input/output functions are assumed to begin processing a stream at the boundary between two multibyte characters. Seek operations reset the conversion state corresponding to the new file position.

The function mbsinit() is specified because many conversion functions treat the initial shift state as a special case and need a portable means of determining whether an mbstate_t object describes the initial conversion state.

Name	Purpose	Syntax
mbsinit()	Determine whether the referenced mbstate_t object describes an initial conversion state.	int mbsinit(const mbstate_t *ps);
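Since no initialization function exists for mbstate_t, a common approach is to zero the object and, where needed, confirm its state with mbsinit(). The following is a minimal sketch; the function name starts_in_initial_state() is hypothetical.

```c
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: create a zero-valued (unbound) mbstate_t and
 * report whether mbsinit() regards it as the initial conversion state.
 * Returns 1 if so, 0 otherwise. */
int starts_in_initial_state(void)
{
    mbstate_t st;

    memset(&st, 0, sizeof st);   /* zero-valued object: initial state */
    return mbsinit(&st) != 0;
}
```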

The MSE provides a method to distinguish between an invalid sequence of bytes in a multibyte stream and a valid prefix to a still incomplete multibyte character. Upon encountering such an incomplete multibyte sequence, the functions mbrlen() and mbrtowc() return -2 instead of -1, and the character accumulator in the mbstate_t object may store the partial character information. This allows applications to convert streams one byte at a time or even to suspend and resume conversion if required. The conversion functions are thus said to be restartable.
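Restartability can be sketched by feeding a buffer to mbrtowc() one byte at a time and letting the mbstate_t object carry any partial character between calls. The helper convert_bytewise() below is hypothetical and assumes the caller has already selected a suitable locale.

```c
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: convert the n bytes at s one byte at a time.
 * Completed wide characters (including a null wide character, if one
 * is encountered) are stored in out.
 * Returns the number of wide characters stored, or -1 on an encoding
 * error or if out has no room left. */
long convert_bytewise(const char *s, size_t n, wchar_t *out, size_t outmax)
{
    mbstate_t st;
    size_t i;
    long produced = 0;

    memset(&st, 0, sizeof st);              /* initial conversion state */
    for (i = 0; i < n; i++) {
        wchar_t wc;
        size_t r = mbrtowc(&wc, s + i, 1, &st);

        if (r == (size_t)-1)
            return -1;                      /* invalid byte sequence */
        if (r == (size_t)-2)
            continue;                       /* valid prefix, kept in st */
        if ((size_t)produced == outmax)
            return -1;                      /* output buffer is full */
        out[produced++] = wc;               /* a character was completed */
    }
    return produced;
}
```

The same loop could be suspended after any byte and resumed later, provided the mbstate_t object is preserved in between.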

The MSE specifies the following single-byte wide-character conversion functions:

Name	Purpose	Syntax
btowc()	Determine whether c is a valid single-byte character in the initial shift state and, if so, return the corresponding wide character.	wint_t btowc(int c);
wctob()	Determine whether c is a member of the extended character set whose multibyte representation is a single byte in the initial shift state and, if so, return the corresponding single-byte character.	int wctob(wint_t c);

The function btowc() returns WEOF if the character has a value of EOF or if it is not a valid multibyte character in the initial shift state.

The function wctob() returns EOF if the character does not correspond to a valid multibyte character of length 1 in the initial shift state.
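The two functions can be used together, for example to check that a byte survives a round trip through its wide-character form. This is a minimal sketch; the name is_single_byte_roundtrip() is hypothetical.

```c
#include <stdio.h>
#include <wchar.h>

/* Hypothetical helper: returns 1 if c is a valid single-byte character
 * in the initial shift state and converts back to the same byte,
 * 0 otherwise (including when c is EOF). */
int is_single_byte_roundtrip(int c)
{
    wint_t wc = btowc(c);

    if (wc == WEOF)
        return 0;                    /* EOF or not a valid single byte */
    return wctob(wc) == (int)(unsigned char)c;
}
```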

The MSE specifies the following restartable functions which take as their last argument a pointer to an object of type mbstate_t. If the pointer is NULL, each function uses its own internal mbstate_t object instead, which is initialized at startup to the initial conversion state. Note that, unlike their corresponding ISO C standard functions, a function's return value does not represent whether the encoding is state-dependent.

MSE	ISO C	Purpose	Syntax
mbrlen()	mblen()	Determine the length in bytes of a multibyte character.	size_t mbrlen(const char *mbs, size_t n, mbstate_t *ps);
mbrtowc()	mbtowc()	Convert a multibyte character into a wide character.	size_t mbrtowc(wchar_t *pwc, const char *s, size_t n, mbstate_t *ps);
wcrtomb()	wctomb()	Convert a wide character into a multibyte character.	size_t wcrtomb(char *s, wchar_t wc, mbstate_t *ps);
mbsrtowcs()	mbstowcs()	Convert a multibyte string into a wide-character string.	size_t mbsrtowcs(wchar_t *dst, const char **src, size_t len, mbstate_t *ps);
wcsrtombs()	wcstombs()	Convert a wide-character string into a multibyte string.	size_t wcsrtombs(char *dst, const wchar_t **src, size_t len, mbstate_t *ps);

A more detailed explanation of two of the above functions will help to clarify the concept of restartable functions.

The function mbrtowc() inspects at most n bytes to determine the number of bytes needed to complete the next multibyte character. If a multibyte character can be completed, mbrtowc() determines the corresponding wide character and returns it in *pwc. If the corresponding wide character is the null wide character, the conversion state is reset to the initial conversion state. This function returns one of the following:

(size_t)-2

The next n bytes contribute to, but do not complete, a valid multibyte character, and all n bytes have been processed.

(size_t)-1

An encoding error has occurred. The next n or fewer bytes do not contribute to a valid multibyte character. errno is set to EILSEQ. The conversion state is undefined.

0

The next n or fewer bytes complete a valid multibyte character that corresponds to the null wide character.

>0

The number of bytes used to complete a valid multibyte character.

Note: Because size_t is an unsigned type, (size_t)-2 and (size_t)-1 also compare greater than zero, so they should be tested before the >0 case.
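The handling of these return values can be sketched as a loop that counts the multibyte characters in a buffer. The function count_chars() is a hypothetical example; the two special values are tested before the byte count is used.

```c
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: count the multibyte characters in the first n
 * bytes of s, stopping at a null character.
 * Returns the count, -1 on an encoding error, or -2 if the buffer ends
 * in the middle of a multibyte character. */
long count_chars(const char *s, size_t n)
{
    mbstate_t st;
    long count = 0;

    memset(&st, 0, sizeof st);          /* initial conversion state */
    while (n > 0) {
        wchar_t wc;
        size_t r = mbrtowc(&wc, s, n, &st);

        if (r == (size_t)-1)
            return -1;                  /* invalid sequence, errno is EILSEQ */
        if (r == (size_t)-2)
            return -2;                  /* incomplete character at end */
        if (r == 0)
            break;                      /* null wide character */
        s += r;                         /* r bytes were consumed */
        n -= r;
        count++;
    }
    return count;
}
```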

The function mbsrtowcs() is a restartable string conversion routine which converts a sequence of multibyte characters, beginning with the conversion state described by the mbstate_t object pointed to by ps, from the array indirectly pointed to by src into a sequence of corresponding wide characters pointed to by dst. Conversion continues up to and including a terminating null character which is also stored in dst. Each conversion takes place as if by a call to the mbrtowc() function. If an error occurs, errno is set to the macro EILSEQ and mbsrtowcs() returns (size_t)-1.

Conversion stops when one of the following occurs:

A null multibyte character is encountered and processed. In this case the conversion state is reset to the initial conversion state and the pointer referenced by src is assigned a null pointer.
A sequence of bytes is encountered that does not form a valid multibyte character.
len multibyte characters have been converted and stored in dst.
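The way mbsrtowcs() reports where it stopped can be sketched as follows. The wrapper to_wide() is hypothetical; it uses the updated source pointer to tell a complete conversion apart from one that stopped because the destination limit was reached.

```c
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: convert the multibyte string src into at most
 * len wide characters in dst. Sets *finished to 1 if the terminating
 * null was reached (the source pointer became null), 0 otherwise.
 * Returns the number of wide characters stored, excluding any
 * terminating null, or (size_t)-1 on an encoding error. */
size_t to_wide(wchar_t *dst, size_t len, const char *src, int *finished)
{
    mbstate_t st;
    size_t r;

    memset(&st, 0, sizeof st);             /* initial conversion state */
    r = mbsrtowcs(dst, &src, len, &st);    /* src is updated in place */
    *finished = (r != (size_t)-1 && src == NULL);
    return r;
}
```

When *finished is 0 and no error occurred, a real caller would keep the updated source pointer and the mbstate_t object in order to resume the conversion; this sketch discards both for brevity.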

Miscellaneous Functions

MSE	ISO C	Purpose	Syntax
wcsftime()	strftime()	Convert a date and time to a wide-character string.	size_t wcsftime(wchar_t *wcs, size_t maxsize, const wchar_t *format, const struct tm *tptr);

The wcsftime() function behaves as if the character string generated by the strftime() function is passed to the mbstowcs() function as the character string parameter, and the mbstowcs() function places the result in the wcs parameter of wcsftime(), up to the limit of the number of wide characters specified by maxsize.

This function uses the local timezone information. The format parameter is a wide-character string consisting of a sequence of wide-character format codes that specify the format of the date and time to be written to wcs.
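A minimal sketch of a wcsftime() call with a wide-character format follows. The helper format_date() and its parameters are hypothetical; it fills in only the date fields of a struct tm and uses numeric format codes.

```c
#include <time.h>
#include <wchar.h>

/* Hypothetical helper: write a date as YYYY-MM-DD into buf.
 * Returns the number of wide characters stored (excluding the null),
 * or 0 if the result does not fit in maxsize wide characters. */
size_t format_date(wchar_t *buf, size_t maxsize,
                   int year, int month, int day)
{
    struct tm t = {0};

    t.tm_year = year - 1900;    /* years since 1900 */
    t.tm_mon  = month - 1;      /* months since January */
    t.tm_mday = day;
    return wcsftime(buf, maxsize, L"%Y-%m-%d", &t);
}
```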

Compatibility Issues

When XSH, Issue 4, Version 2 was under development, the MSE had not yet been ratified as an amendment to the ISO C standard, but the working draft used (ISO Working Paper SC22/WG14/N204 dated 31st March 1992) was regarded as being stable.

Unfortunately, a number of interfaces changed slightly before the MSE became part of ISO/IEC 9899:1990/Amendment 1:1995 (E), and three functions, wcswcs(), wcswidth(), and wcwidth(), were dropped.

The differences between XSH, Issue 4, Version 2 and XSH, Issue 5 are detailed in the following table:

Name	Purpose	XSH, Issue 4, Version 2	XSH, Issue 5
wcswcs()	Find a wide-character substring in a wide-character string.	Included per draft MSE.	Included but marked EX. Application developers are strongly encouraged to use wcsstr() instead.
wcswidth()	Number of column positions of a wide-character string.	Included per draft MSE.	Included as an extension.
wcwidth()	Number of column positions of a wide-character code.	Included per draft MSE.	Included as an extension.
fputwc()	Put a wide-character code on a stream.	wint_t fputwc(wint_t wc, FILE *stream);	wint_t fputwc(wchar_t wc, FILE *stream);
putwc()	Put a wide character on a stream.	wint_t putwc(wint_t wc, FILE *stream);	wint_t putwc(wchar_t wc, FILE *stream);
putwchar()	Put a wide character on a stream.	wint_t putwchar(wint_t wc);	wint_t putwchar(wchar_t wc);
wcsftime()	Convert date and time to wide-character string.	size_t wcsftime(wchar_t *wcs, size_t maxsize, const char *format, const struct tm *timptr);	size_t wcsftime(wchar_t *wcs, size_t maxsize, const wchar_t *format, const struct tm *timptr);
wcstok()	Split wide-character string into tokens.	wchar_t *wcstok(wchar_t *ws1, const wchar_t *ws2);	wchar_t *wcstok(wchar_t *ws1, const wchar_t *ws2, wchar_t **ptr);

More Information

More information on the Single UNIX Specification, Version 2 can be obtained from the following sources:

The Open Group Source Book "Go Solo 2 - The Authorized Guide to Version 2 of the Single UNIX Specification", 500 pages, ISBN 0-13-575689-8. This book provides complete information on what's new in Version 2, with technical papers written by members of the working groups that developed the specifications, and a CD-ROM containing the complete 3000 page specification in both HTML and PDF formats (including PDF reader software). For more information on the book, see URL http://www.UNIX-systems.org/gosolo2 .
The Single UNIX Specification can be browsed and searched online at The Open Group world wide web site, see the URL http://www.UNIX-systems.org/go/unix .

About the Authors

David Lindner is a Principal Engineer with Digital Equipment Corporation and a former member of The Open Group Internationalization Technical Working Group.

Finnbarr P. Murphy is a principal software engineer with Digital Equipment Corporation and is Vice-Chair of The Open Group Base Technical Working Group.

UNIX is a registered trademark of The Open Group.