Re: doc: New chapter "Strings and Characters"

Bruno Haible Mon, 19 Jun 2023 10:30:12 -0700

Hi Paul,

Thanks for the feedback.


> > +The @posixheader{ctype.h} API, that was designed only with unibyte
> > +encodings in mind, is useless nowadays; it does not work in
> > +multibyte locales.
> 
> It's still useful, even in multibyte locales, when dealing with data 
> that is inherently unibyte. Perhaps prepend "for general text 
> processing" to the sentence. Similarly for the later occurrence of 
> "useless and obsolete".
> 
> 
> > +While UTF-8 is the most common multibyte encoding, GB18030 is there as
> > +well and will not go away within decades, because it is a Chinese
> > +government standard, last revised in 2022.
> 
> Again, let's not focus on GB18030 to the exclusion of other national 
> encodings.

I still need to mention GB18030 as the worst-case example, to explain why
strchr() and similar byte-oriented functions may be problematic. BIG5 is not
_that_ bad. DEC-HANYU and ISO-IR-165, which are as bad as GB18030, are not
supported as locale encodings in glibc.
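
To illustrate the strchr() problem, here is a minimal sketch in C, assuming
well-formed GB18030 input. In GB18030 the second and fourth bytes of a
four-byte character are ASCII digits (0x30..0x39), and the second byte of a
two-byte character can lie in 0x40..0x7E, so a byte-wise search for an ASCII
character can stop in the middle of a multibyte character. The names
gb18030_char_len, gb18030_find_ascii and sample are hypothetical, made up for
this illustration; they are not part of Gnulib or libc, and real code would
rather go through a locale-aware search function.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not a Gnulib or libc function): length in bytes
   of the GB18030 character starting at S.  Assumes S does not point at
   the terminating NUL; truncated sequences are treated as shorter ones
   so that the scan never runs past the end of the string.  */
static size_t
gb18030_char_len (const unsigned char *s)
{
  if (s[0] < 0x81 || s[0] == 0xFF || s[1] == '\0')
    return 1;                   /* ASCII, stray byte, or truncated */
  if (s[1] >= 0x30 && s[1] <= 0x39)
    /* Four-byte form: the second byte is an ASCII digit.  */
    return (s[2] != '\0' && s[3] != '\0') ? 4 : 2;
  return 2;                     /* two-byte form */
}

/* Hypothetical helper: find the ASCII character C in the GB18030
   string STR, looking only at whole characters, never at the trailing
   bytes of a multibyte character.  Returns NULL if C does not occur.  */
static const char *
gb18030_find_ascii (const char *str, char c)
{
  const unsigned char *p = (const unsigned char *) str;

  while (*p != '\0')
    {
      size_t len = gb18030_char_len (p);
      if (len == 1 && *p == (unsigned char) c)
        return (const char *) p;
      p += len;
    }
  return NULL;
}

/* One four-byte GB18030 character (bytes 81 30 81 30), followed by the
   ASCII digit '0'.  The literal is split so that the hex escape does
   not swallow the digit.  */
static const char sample[] = "\x81\x30\x81\x30" "0";
```

With this data, strchr (sample, '0') returns sample + 1, which points into
the middle of the four-byte character, whereas gb18030_find_ascii (sample, '0')
returns sample + 4, the actual digit.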

> > +For complex string processing, the provided strings functions may not be
> 
> strings -> string

Done as follows:


2023-06-19  Bruno Haible  <br...@clisp.org>

        doc: Corrections to the "Strings and Characters" chapter.
        Suggested by Paul Eggert.
        * doc/strings.texi: Corrections: GB18030 is rarely used nowadays.
        <ctype.h> functions can be useful for specific data.

diff --git a/doc/strings.texi b/doc/strings.texi
index 131221f583..cbed6533c4 100644
--- a/doc/strings.texi
+++ b/doc/strings.texi
@@ -76,7 +76,7 @@
 ``unibyte locale'', otherwise of a ``multibyte locale''.
 
 It is important to realize that the majority of Unix installations
-nowadays use UTF-8 or GB18030 as locale encoding; therefore, the
+nowadays use UTF-8 as locale encoding; therefore, the
 majority of users are using multibyte locales.
 
 Three important facts to remember are:
@@ -89,8 +89,8 @@
 @itemize @bullet
 @item
 The @posixheader{ctype.h} API, that was designed only with unibyte
-encodings in mind, is useless nowadays; it does not work in
-multibyte locales.
+encodings in mind, is useless nowadays for general text processing; it
+does not work in multibyte locales.
 @item
 The @posixfunc{strlen} function does not return the number of characters
 in a string.  Nor does it return the number of screen columns occupied
@@ -107,9 +107,9 @@
 @emph{Multibyte does not imply UTF-8 encoding.}
 @end cartouche
 
-While UTF-8 is the most common multibyte encoding, GB18030 is there as
-well and will not go away within decades, because it is a Chinese
-government standard, last revised in 2022.
+While UTF-8 is the most common multibyte encoding, GB18030 is also a
+supported locale encoding on GNU systems (mostly because it is a Chinese
+government standard, last revised in 2022).
 
 @cartouche
 @emph{Searching for a character in a string is not the same as searching
@@ -184,7 +184,7 @@
 @node Iterating through strings
 @subsubsection Iterating through strings
 
-For complex string processing, the provided strings functions may not be
+For complex string processing, the provided string functions may not be
 enough, and what you need is a way to iterate through a string while
 processing each (possibly multibyte) character in turn.  Gnulib provides
 two modules for this purpose.  Both iterate through the string in
@@ -604,7 +604,8 @@
 program that runs only in unibyte locales.
 
 ISO C and POSIX standardized an API for characters of type @code{char},
-in @code{<ctype.h>}.  This API is nowadays useless and obsolete.
+in @code{<ctype.h>}.  This API is nowadays useless and obsolete, when it
+comes to general text processing.
 
 The important lessons to remember are:
