ctype.h functions on bytes 0x80..0xFF

Grisha Levit Fri, 26 May 2023 02:55:59 -0700

On Mon, May 1, 2023 at 11:48 AM Chet Ramey <chet.ra...@case.edu> wrote:
>
> (And once we get these issues straightened out, if you look back to your
> original example, 0x240 is a blank in my locale, en_US.UTF-8, and will be
> removed from the input stream by the parser unless it's quoted.)


On at least recent macOS versions, it seems that the ctype.h functions
treat [0x80..0xFF] the same as wctype.h functions would.  So while
U+00A0 is a space character in the en_US.UTF-8 locale, and
iswspace(L'\u00A0') returns 1, it is also the case that isspace(0xA0)
returns 1.  But I don't think it's correct to actually rely on the
latter since the single byte 0xA0 doesn't represent _any_ character in
the locale, much less a space.

(I think that's the reason for the behavior Chet noted above, from a
previous thread.)

For example, these outputs would be correct with \uA0 in place of \xA0
below, but I don't think the current behaviour is expected:

$ eval $'printf "<%s>" [\xA0\xA0]'
<[><]>

$ [[ $'\xA0' == [[:space:]] ]]; echo $?
0

Perhaps on platforms like this it would be appropriate to mask ctype
results with something equivalent to `btowc(c) != WEOF'?
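
A minimal sketch of what such masking might look like, assuming a
wrapper around isspace(). The names byte_is_char() and masked_isspace()
are hypothetical, chosen for illustration only, and are not taken from
the bash sources:

```c
#include <assert.h>
#include <ctype.h>
#include <limits.h>
#include <stdio.h>
#include <wchar.h>

/* Hypothetical helper: nonzero if the single byte C is, by itself, a
   complete character in the current locale's encoding. */
static int
byte_is_char (int c)
{
  return btowc (c) != WEOF;
}

/* Hypothetical masked variant of isspace(): the ctype result is ignored
   for bytes that do not stand alone as a character. */
static int
masked_isspace (int c)
{
  /* The ctype.h functions are only defined for EOF and for values
     representable as unsigned char. */
  assert (c == EOF || (c >= 0 && c <= UCHAR_MAX));
  return byte_is_char (c) && isspace (c);
}
```

After setlocale(LC_ALL, "") in a UTF-8 locale, I would expect
masked_isspace(0xA0) to return 0 even where isspace(0xA0) returns 1,
since btowc(0xA0) should be WEOF there, while ASCII blanks are
unaffected.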

(See http://www.openradar.me/FB9973780 for an example of the issue in
an Apple-supplied program.)
