Hi Paul,

Thanks for the feedback.

> > +The @posixheader{ctype.h} API, that was designed only with unibyte
> > +encodings in mind, is useless nowadays; it does not work in
> > +multibyte locales.
>
> It's still useful, even in multibyte locales, when dealing with data
> that is inherently unibyte. Perhaps prepend "for general text
> processing" to the sentence. Similarly for the later occurrence of
> "useless and obsolete".
>
>
> > +While UTF-8 is the most common multibyte encoding, GB18030 is there as
> > +well and will not go away within decades, because it is a Chinese
> > +government standard, last revised in 2022.
>
> Again, let's not focus on GB18030 to the exclusion of other national
> encodings.

I still need to mention GB18030 as the worst-case example, to explain
why strchr() and similar functions may be problematic. BIG5 is not
_that_ bad. DEC-HANYU and ISO-IR-165, which are also that bad, are not
supported as locale encodings in glibc.

> > +For complex string processing, the provided strings functions may not be
>
> strings -> string

Done as follows:


2023-06-19  Bruno Haible  <br...@clisp.org>

	doc: Corrections to the "Strings and Characters" chapter.
	Suggested by Paul Eggert.
	* doc/strings.texi: Corrections: GB18030 is rarely used nowadays.
	<ctype.h> functions can be useful for specific data.

diff --git a/doc/strings.texi b/doc/strings.texi
index 131221f583..cbed6533c4 100644
--- a/doc/strings.texi
+++ b/doc/strings.texi
@@ -76,7 +76,7 @@
 ``unibyte locale'', otherwise of a ``multibyte locale''.

 It is important to realize that the majority of Unix installations
-nowadays use UTF-8 or GB18030 as locale encoding; therefore, the
+nowadays use UTF-8 as locale encoding; therefore, the
 majority of users are using multibyte locales.

 Three important facts to remember are:
@@ -89,8 +89,8 @@
 @itemize @bullet
 @item
 The @posixheader{ctype.h} API, that was designed only with unibyte
-encodings in mind, is useless nowadays; it does not work in
-multibyte locales.
+encodings in mind, is useless nowadays for general text processing; it
+does not work in multibyte locales.
 @item
 The @posixfunc{strlen} function does not return the number of characters
 in a string. Nor does it return the number of screen columns occupied
@@ -107,9 +107,9 @@
 @emph{Multibyte does not imply UTF-8 encoding.}
 @end cartouche

-While UTF-8 is the most common multibyte encoding, GB18030 is there as
-well and will not go away within decades, because it is a Chinese
-government standard, last revised in 2022.
+While UTF-8 is the most common multibyte encoding, GB18030 is also a
+supported locale encoding on GNU systems (mostly because it is a Chinese
+government standard, last revised in 2022).

 @cartouche
 @emph{Searching for a character in a string is not the same as searching
@@ -184,7 +184,7 @@
 @node Iterating through strings
 @subsubsection Iterating through strings

-For complex string processing, the provided strings functions may not be
+For complex string processing, the provided string functions may not be
 enough, and what you need is a way to iterate through a string while
 processing each (possibly multibyte) character in turn. Gnulib provides
 two modules for this purpose. Both iterate through the string in
@@ -604,7 +604,8 @@
 program that runs only in unibyte locales.

 ISO C and POSIX standardized an API for characters of type @code{char},
-in @code{<ctype.h>}. This API is nowadays useless and obsolete.
+in @code{<ctype.h>}. This API is nowadays useless and obsolete, when it
+comes to general text processing.

 The important lessons to remember are: