On Fri, Dec 03, 2010 at 07:46:37PM +0000, Jason McIntyre wrote:
> On Fri, Dec 03, 2010 at 07:27:46PM +0100, Stefan Sperling wrote:
> > I spent some more time trying to understand the beast that's mbrtowc(3).
> > 
> 
> some tweaks.

Thank you.

I found one error: EUC is a stateless encoding. I confused it with ISO-2022.

Updated diff:

Index: mbrtowc.3
===================================================================
RCS file: /cvs/src/lib/libc/locale/mbrtowc.3,v
retrieving revision 1.2
diff -u -p -r1.2 mbrtowc.3
--- mbrtowc.3   31 May 2007 19:19:29 -0000      1.2
+++ mbrtowc.3   3 Dec 2010 23:16:38 -0000
@@ -28,163 +28,214 @@
 .Dd $Mdocdate: May 31 2007 $
 .Dt MBRTOWC 3
 .Os
-.\" ----------------------------------------------------------------------
 .Sh NAME
 .Nm mbrtowc
 .Nd converts a multibyte character to a wide character (restartable)
-.\" ----------------------------------------------------------------------
 .Sh SYNOPSIS
 .Fd #include <wchar.h>
 .Ft size_t
-.Fn mbrtowc "wchar_t * restrict pwc" "const char * restrict s" "size_t n" \
-"mbstate_t * restrict ps"
-.\" ----------------------------------------------------------------------
+.Fn mbrtowc "wchar_t * restrict wc" "const char * restrict s" "size_t n" \
+"mbstate_t * restrict mbs"
 .Sh DESCRIPTION
 The
 .Fn mbrtowc
-usually converts the multibyte character pointed to by
-.Fa s
-to a wide character, and stores the wide character
+function examines at most
+.Fa n
+bytes of the multibyte character byte string pointed to by
+.Fa s ,
+converts those bytes to a wide character, and stores the wide character
 in the wchar_t object pointed to by
-.Fa pwc
+.Fa wc
 if
-.Fa pwc
-is non-null and
+.Fa wc
+is not null and
 .Fa s
 points to a valid character.
-The conversion happens in accordance with the conversion state
-described in the mbstate_t object pointed to by
-.Fa ps .
-This function may examine at most
-.Fa n
-bytes of the array beginning from
+.Pp
+The conversion will use at most
+.Dv MB_CUR_MAX
+bytes of the byte string pointed to by
 .Fa s .
+.Dv MB_CUR_MAX No is always less than or equal to
+.Dv MB_LEN_MAX .
 .Pp
-If
-.Fa s
-points to a valid character and the character corresponds to a null wide
-character, then the
+Conversion happens in accordance with the conversion state described
+by the mbstate_t object pointed to by
+.Fa mbs .
+The mbstate_t object pointed to by
+.Fa mbs
+must be initialized to zero before the application's first call to
+.Fn mbrtowc .
+.Fa mbs
+can safely be reused without reinitialization after successful conversion.
+.Pp
+The behaviour of
+.Fn mbrtowc
+is affected by the
+.Dv LC_CTYPE
+category of the current locale.
+If the locale is changed without reinitialization of
+.Fa mbs ,
+the behaviour of
 .Fn mbrtowc
-places the mbstate_t object pointed to by
-.Fa ps
-to an initial conversion state.
+is undefined.
 .Pp
 Unlike
 .Xr mbtowc 3 ,
-the
 .Fn mbrtowc
-may accept the byte sequence pointed to by
+will accept an incomplete byte sequence pointed to by
 .Fa s
-not forming a complete multibyte character
-but which may be part of a valid character.
-In this case, this function will accept all such bytes
-and save them into the conversion state object pointed to by
-.Fa ps .
-They will be used at subsequent calls of this function to restart
-the conversion suspended.
+which does not form a complete character but is potentially part of
+a valid character.
+In this case,
+.Fn mbrtowc
+will save all such bytes into the conversion
+state object pointed to by
+.Fa mbs .
+They will be used during subsequent calls of
+.Fn mbrtowc
+to restart the suspended conversion.
 .Pp
-The behaviour of the
+In state-dependent encodings,
+.Fa s
+may point to a special sequence of bytes called a
+.Dq shift sequence .
+Shift sequences switch between character encodings available within an
+encoding scheme, e.g. between one-byte characters and two-byte characters.
+One encoding scheme using shift sequences is ISO/IEC 2022.
+Shift sequence bytes correspond to no individual wide character, so
 .Fn mbrtowc
-is affected by the
-.Dv LC_CTYPE
-category of the current locale.
+treats them as if they were part of the subsequent multibyte character.
 .Pp
-These are the special cases:
+Special cases in interpretation of arguments are as follows:
 .Bl -tag -width 012345678901
-.It "s == NULL"
-.Fn mbrtowc
-sets the conversion state object pointed to by
-.Fa ps
-to an initial state and always returns 0.
-Unlike
-.Xr mbtowc 3 ,
-the value returned does not indicate whether the current encoding of
-the locale is state-dependent.
+.It "wc == NULL"
+The conversion from a multibyte character to a wide character is performed
+and the conversion state may be affected, but the resulting wide character
+is discarded.
 .Pp
-In this case,
+This can be used to find out how many bytes are contained in the
+multibyte character pointed to by
+.Fa s ,
+which is a number between 1 and
+.Dv MB_CUR_MAX
+upon successful conversion.
+.It "s == NULL"
 .Fn mbrtowc
 ignores
-.Fa pwc
+.Fa wc
 and
 .Fa n ,
-and is equivalent to the following call:
+and behaves equivalently to
 .Bd -literal -offset indent
-mbrtowc(NULL, "", 1, ps);
+mbrtowc(NULL, "", 1, mbs);
 .Ed
-.It "pwc == NULL"
-The conversion from a multibyte character to a wide character has
-taken place and the conversion state may be affected, but the resultant
-wide character is discarded.
-.It "ps == NULL"
+.Pp
+which attempts to use the state object pointed to by
+.Fa mbs
+to start or continue conversion using the empty zero-terminated string
+as input, and discards the conversion result.
+.Pp
+If conversion succeeds, this call always returns zero.
+Unlike
+.Xr mbtowc 3 ,
+the value returned does not indicate whether the current encoding of
+the locale is state-dependent, i.e. uses shift sequences.
+.It "mbs == NULL"
 .Fn mbrtowc
 uses its own internal state object to keep the conversion state,
 instead of
-.Fa ps
-mentioned in this manual page.
+.Fa mbs .
+This internal conversion state is initialized once at program startup,
+and is undefined after an encoding error has occurred.
+It is not safe to call
+.Fn mbrtowc
+again with a NULL
+.Fa mbs
+argument if
+.Fn mbrtowc
+returned (size_t)-1.
 .Pp
 Calling any other functions in
 .Em libc
-never change the internal
-state of
-.Fn mbrtowc ,
-which is initialized at startup time of the program.
+never changes the internal
+conversion state object of
+.Fn mbrtowc .
 .El
-.\" ----------------------------------------------------------------------
 .Sh RETURN VALUES
-In the usual cases,
-.Fn mbrtowc
-returns:
 .Bl -tag -width 012345678901
 .It 0
-The next bytes pointed to by
+The bytes pointed to by
 .Fa s
 form a null character.
 .It positive
-If
 .Fa s
-points to a valid character,
+points to a valid character, and the value returned is the number of
+bytes in the character.
+.It (size_t)-1
+.Fa s
+points to an illegal byte sequence which does not form a valid multibyte
+character in the current locale.
+.Fn mbrtowc
+sets
+.Va errno
+to EILSEQ.
+The conversion state object pointed to by
+.Fa mbs
+is left in an undefined state and must be reinitialized before being
+used again.
+.Pp
+Because applications using
 .Fn mbrtowc
-returns the number of bytes in the character.
+are shielded from the specifics of the multibyte character encoding scheme,
+it is impossible to repair byte sequences containing encoding errors.
+Such byte sequences must be treated as invalid and potentially malicious input.
+Applications must stop processing the byte sequence pointed to by
+.Fa s
+and either discard any wide characters already converted, or cope with
+truncated input.
 .It (size_t)-2
 .Fa s
-points to the byte sequence which possibly contains part of a valid
-multibyte character, but which is incomplete.
+points to an incomplete byte sequence which contains part of a valid
+multibyte character.
+.Fn mbrtowc
+does not set
+.Va errno ,
+but stores the bytes belonging to the incomplete sequence in
+.Fa mbs .
+The character may be completed by calling
+.Fn mbrtowc
+again with
+.Fa s
+pointing to one or more subsequent bytes of the multibyte character and
+.Fa mbs
+pointing to the conversion state object used during conversion of the
+incomplete byte sequence.
+.Pp
 When
 .Fa n
 is at least
 .Dv MB_CUR_MAX
-only occurs if the array pointed to by
+this situation only occurs if the byte string pointed to by
 .Fa s
 contains a redundant shift sequence.
-.It (size_t)-1
-.Fa s
-points to an illegal byte sequence which does not form a valid multibyte
-character.
-In this case,
-.Fn mbrtowc
-sets
-.Va errno
-to indicate the error.
 .El
-.\" ----------------------------------------------------------------------
 .Sh ERRORS
 The
 .Fn mbrtowc
-may causes an error in the following case:
+function may cause an error in the following cases:
 .Bl -tag -width Er
 .It Bq Er EILSEQ
 .Fa s
 points to an invalid or incomplete multibyte character.
 .It Bq Er EINVAL
-.Fa ps
+.Fa mbs
 points to an invalid or uninitialized mbstate_t object.
 .El
-.\" ----------------------------------------------------------------------
 .Sh SEE ALSO
 .Xr mbrlen 3 ,
 .Xr mbtowc 3 ,
 .Xr setlocale 3
-.\" ----------------------------------------------------------------------
 .Sh STANDARDS
 The
 .Fn mbrtowc
@@ -196,3 +247,40 @@ The restrict qualifier is added at
 .\" .St -isoC99 .
 ISO/IEC 9899:1999
 .Pq Dq ISO C99 .
+.Sh CAVEATS
+.Fn mbrtowc
+is not suitable for use by programs that care about internals of
+character encoding schemes, and is cumbersome to use in programs
+that need to deal with multiple character encoding schemes.
+.Pp
+It is possible that
+.Fn mbrtowc
+fails because of locale configuration errors.
+An
+.Dq invalid
+character sequence may simply be encoded in a different encoding than that
+of the current locale.
+.Pp
+The special cases for
+.Fa s
+== NULL and
+.Fa mbs
+== NULL do not make any sense.
+Instead of passing NULL for
+.Fa mbs ,
+.Xr mbtowc 3
+can be used.
+.Pp
+Earlier versions of this man page implied that calling
+.Fn mbrtowc
+with a NULL
+.Fa s
+argument would always set
+.Fa mbs
+to the initial conversion state (and shift state, if applicable).
+But this is true only if the previous call to
+.Fn mbrtowc
+using
+.Fa mbs
+did not return (size_t)-1 or (size_t)-2.
+It is recommended to zero the mbstate_t object instead.
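
For what it's worth, here is a rough sketch of how I would expect an
application to drive mbrtowc() according to the text above when input
arrives in chunks. The helper name convert_chunk() and its interface are
made up for illustration and are not part of the diff. It assumes the
caller zeroed the mbstate_t before the first call, stops at a null
character, and zeroes the state again after (size_t)-1:

```c
#include <assert.h>
#include <string.h>
#include <wchar.h>

/*
 * Hypothetical helper: convert up to len bytes at s into at most outlen
 * wide characters in out.  The state in mbs carries any incomplete
 * trailing character over to the next call.  Returns the number of wide
 * characters stored, or (size_t)-1 on an encoding error.
 */
size_t
convert_chunk(wchar_t *out, size_t outlen, const char *s, size_t len,
    mbstate_t *mbs)
{
	size_t nout = 0;

	assert(mbs != NULL);	/* avoid the internal state object */
	while (len > 0 && nout < outlen) {
		wchar_t wc;
		size_t r = mbrtowc(&wc, s, len, mbs);

		if (r == (size_t)-1) {
			/* State is undefined now; zero it before any reuse. */
			memset(mbs, 0, sizeof(*mbs));
			return (size_t)-1;
		}
		if (r == (size_t)-2)
			break;	/* remaining bytes were saved in mbs */
		out[nout++] = wc;
		if (r == 0)
			break;	/* null character: treat as end of input */
		s += r;
		len -= r;
	}
	return nout;
}
```

The caller would memset() the mbstate_t to zero once, then call
convert_chunk() for each chunk read, and give up on the input as soon
as it returns (size_t)-1.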

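The wc == NULL case and the advice to zero the mbstate_t can be shown
together in a few lines. Again this is only a sketch, and mb_char_len()
is a name I made up. It uses a fresh zeroed state on every call, so it
is only meaningful when s points to the start of a character in the
initial shift state:

```c
#include <assert.h>
#include <string.h>
#include <wchar.h>

/*
 * Hypothetical helper: return the number of bytes in the multibyte
 * character at s, examining at most n bytes.  The result is passed
 * through from mbrtowc(): 0 for a null character, (size_t)-1 for an
 * encoding error, (size_t)-2 for an incomplete character.
 */
size_t
mb_char_len(const char *s, size_t n)
{
	mbstate_t mbs;

	assert(s != NULL);	/* s == NULL would trigger the special case */
	memset(&mbs, 0, sizeof(mbs));	/* initial conversion state */
	return mbrtowc(NULL, s, n, &mbs);	/* wide character is discarded */
}
```

This is close to what mbrlen(3) provides, except that the state object
is always explicit and always freshly zeroed.
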