Re: Corrupted multibyte characters in command substitutions fixes may be worse than problem.

Frank Heckenbach Mon, 07 Feb 2022 03:21:52 -0800

> In the case of bash with environment having LC_CTYPE: C.UTF-8 or
> en_US.UTF-8, read:
> 0xC3 (len=1), i.e. Ã ('A' w/tilde in a legacy 8-bit
> latin-compatible charset), but invalid if bash processes the
> environment setting of en_US.UTF-8.
> 
> Should bash process it as legacy input or invalid UTF8?
> Either way, what should it return? a UTF-8 char
> (hex 0xC3 0x83) transcoded from the latin value of A-tilde, or
> keep the binary value the same (return 0xC3),
> should it return a warning message?  If it does, should
> it return NUL for the returned value because the input was erroneous?


Assuming Latin-1 when nothing in the environment points to it seems
questionable. It might just as well be a Cyrillic character in
ISO-8859-5 or whatever.
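
To make the ambiguity concrete, here is a minimal sketch in Python (chosen only for illustration; the helper name `interpretations` is hypothetical) that decodes the single byte 0xC3 under three charsets. Nothing in the byte itself says which reading is intended.

```python
def interpretations(raw: bytes) -> dict:
    """Decode the same bytes under several charsets.

    A value of None marks input that is invalid in that charset.
    """
    results = {}
    for charset in ("latin-1", "iso-8859-5", "utf-8"):
        try:
            results[charset] = raw.decode(charset)
        except UnicodeDecodeError:
            results[charset] = None
    return results


if __name__ == "__main__":
    # One byte, three possible readings depending on the assumed charset.
    for charset, text in interpretations(b"\xc3").items():
        print(charset, repr(text))
```

Under Latin-1 the byte is A-tilde, under ISO-8859-5 it is a Cyrillic capital letter, and under UTF-8 it is an incomplete sequence.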

Email filters were mentioned. Emails may use charsets different from
the current environment -- even several different ones within a mail
(I've sent such mails myself). So if bash were to "fix" input
depending on the environment, even writing a pass-through filter
would require parsing the Content-Type headers and changing the
environment accordingly (or else, use an 8-bit clean charset
throughout).
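
For comparison, a pass-through filter that never interprets its input needs no charset knowledge at all. The following is a minimal sketch in Python, assuming a byte-oriented copy is all that is wanted; the function name `passthrough` and the chunk size are illustrative choices.

```python
import sys


def passthrough(src, dst, chunk_size=65536):
    """Copy bytes from src to dst unchanged, with no decoding step."""
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            break
        dst.write(chunk)


if __name__ == "__main__":
    # Use the binary layers of stdin/stdout so the locale is never consulted.
    passthrough(sys.stdin.buffer, sys.stdout.buffer)
```

Because the data is handled as raw bytes, a mail that mixes several charsets comes out byte-for-byte identical, whatever the environment's locale says.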

So I don't think bash should change the input (unintentionally as
with the original bug or intentionally as discussed here) unless and
until it needs to do charset-dependent operations.
