Hi Paul,

> Thanks for checking. I installed the regcomp.c change into glibc and gnulib
> so we should now have the same source there as we have in Gawk.

The patch [1] looks correct to me, but it introduces a misleading comment
that could become the cause of future bugs.

Recall that for arguments c in the range 0x80..0xFF, btowc(c) can very well
be different from c (this is obvious for encodings != ISO-8859-1 on glibc,
and true even for ISO-8859-1 on Solaris and FreeBSD [2]).

So, a unibyte and a wide character "live" in different domains. There is a
risk that a wide character function (isw*) gets called on a value that is a
unibyte, and there is a risk that btowc() gets called on a value that is a
wide character; both would be bugs.

Therefore I would rewrite the comment

  /* Convert the byte B to the corresponding wide character.  In a
     unibyte locale, treat B as itself.  In a multibyte locale, return
     WEOF if B is an encoding error.  */

to

  /* Convert the byte B to a value that bounds the iteration through
     a character range.
     In a unibyte locale, we use a bit set based on byte values,
     therefore return B itself.  Note! This may be != btowc (B).
     In a multibyte locale, we use comparison of wide characters,
     therefore return the wide character corresponding to B, or
     WEOF if B is invalid.  */

Bruno

[1] https://sourceware.org/git/?p=glibc.git;a=commitdiff;h=c77bf91b4315efed2b61633567acc7ac3c46959c
[2] https://www.gnu.org/software/libunistring/manual/html_node/The-wchar_005ft-mess.html
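
To make the two domains concrete, here is a minimal sketch of a helper that behaves the way the proposed comment describes. It is not the code from the patch: the names range_bound, show_byte and check_range_bound, and the MULTIBYTE flag, are hypothetical and chosen only for illustration. Only btowc() and WEOF are standard.

```c
#include <assert.h>
#include <stdio.h>
#include <wchar.h>

/* Hypothetical helper, named for illustration only.
   Convert the byte B to a value that bounds the iteration through a
   character range.  MULTIBYTE is an assumed flag that tells which
   domain the caller compares in.  */
wint_t
range_bound (unsigned char b, int multibyte)
{
  /* Unibyte: the caller indexes a bit set by byte value, so stay in
     the byte domain.  Note that this may be != btowc (b).  */
  if (!multibyte)
    return b;

  /* Multibyte: the caller compares wide characters.  btowc returns
     WEOF if B is not a valid single-byte character in the current
     locale.  */
  return btowc (b);
}

/* Print the byte value next to its btowc result, so that the two
   domains can be compared for a given locale.  */
void
show_byte (unsigned char b)
{
  wint_t wc = btowc (b);

  if (wc == WEOF)
    printf ("byte 0x%02X -> btowc: WEOF\n", (unsigned int) b);
  else
    printf ("byte 0x%02X -> btowc: 0x%lX\n",
            (unsigned int) b, (unsigned long) wc);
}

/* Properties that hold regardless of the current locale.  */
void
check_range_bound (void)
{
  /* Byte domain: the value is returned unchanged.  */
  assert (range_bound (0xE9, 0) == 0xE9);
  /* Wide character domain: the value is whatever btowc says.  */
  assert (range_bound ('A', 1) == btowc ('A'));
}
```

Calling show_byte on the bytes 0x80..0xFF after setlocale (LC_ALL, "") shows, for a given platform and locale, where btowc (c) differs from c. The result of range_bound with MULTIBYTE == 0 must only be used as a bit set index, never passed to an isw* function.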