Re: [gawk-devel] changing regex lib

Bruno Haible Fri, 10 Aug 2018 16:26:35 -0700

Hi Paul,

> Thanks for checking. I installed the regcomp.c change into glibc and gnulib
> so we should now have the same source there as we have in Gawk.


The patch [1] looks correct to me, but it introduces a misleading comment that
could become the cause of future bugs.

Recall that for arguments c in the range 0x80..0xFF, btowc(c) can very well
be different from c (this is obvious for encodings != ISO-8859-1 on glibc,
and true even for ISO-8859-1 on Solaris and FreeBSD [2]). So, a unibyte
character and a wide character "live" in different domains. There is a risk
that a wide character function (isw*) gets called on a value that is a unibyte
character, and a risk that btowc() gets called on a value that is already a
wide character; both would be bugs.
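
To illustrate the two domains, here is a minimal sketch of classifying a byte
without mixing them up. The helper name byte_is_alpha is hypothetical and does
not appear in regcomp.c; it only shows the byte being converted with btowc()
before any isw* function sees it.

```c
#include <assert.h>
#include <limits.h>
#include <wchar.h>
#include <wctype.h>

/* Hypothetical helper, not taken from regcomp.c: return nonzero if the
   byte C is an alphabetic character in the current locale.  */
static int
byte_is_alpha (int c)
{
  /* C must be a byte value, not a wide character.  */
  assert (0 <= c && c <= UCHAR_MAX);

  /* Cross from the byte domain into the wide character domain.
     The result may differ from C, and is WEOF if C is not a valid
     single-byte character.  */
  wint_t wc = btowc (c);

  /* Only now is it safe to call a wide character function.  */
  return wc != WEOF && iswalpha (wc) != 0;
}
```

Calling iswalpha (c) directly on the byte would compile without complaint,
which is exactly why the distinction needs to be spelled out in comments.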

Therefore I would rewrite the comment

/* Convert the byte B to the corresponding wide character.  In a
   unibyte locale, treat B as itself.  In a multibyte locale, return
   WEOF if B is an encoding error.  */

to

/* Convert the byte B to a value that bounds the iteration through a
   character range.
   In a unibyte locale, we use a bit set based on byte values, therefore
   return B itself.  Note! This may be != btowc (B).
   In a multibyte locale, we use comparison of wide characters, therefore
   return the wide character corresponding to B, or WEOF if B is invalid.  */
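
As a rough sketch of what the proposed comment describes (the names
range_bound and bitset_add_range are made up for illustration and are not the
functions in the patch), the unibyte path could look like this, with the bound
staying in the byte domain so that it can index a 256-bit set:

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <wchar.h>

/* Hypothetical: value that bounds the iteration through a range.
   Unibyte locale: the byte itself (may be != btowc (b)).
   Multibyte locale: the wide character, or WEOF if B is invalid.  */
static wint_t
range_bound (unsigned char b, bool multibyte)
{
  return multibyte ? btowc (b) : (wint_t) b;
}

/* Hypothetical unibyte path: set the bits for the bytes LO..HI.
   LO and HI must be byte values, never wide characters.  */
static void
bitset_add_range (unsigned char set[32], wint_t lo, wint_t hi)
{
  assert (lo <= hi && hi <= UCHAR_MAX);
  for (wint_t c = lo; c <= hi; c++)
    set[c / 8] |= 1u << (c % 8);
}
```

In a multibyte locale the bounds returned by range_bound would instead be
compared against wide characters, and must not be used to index the bit set.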

Bruno

[1] https://sourceware.org/git/?p=glibc.git;a=commitdiff;h=c77bf91b4315efed2b61633567acc7ac3c46959c
[2] https://www.gnu.org/software/libunistring/manual/html_node/The-wchar_005ft-mess.html
