> In the case of bash with environment having LC_CTYPE: C.UTF-8 or
> en_US.UTF-8
> read:
> 0xC3 (len=1) i.e. Ã ('A' w/tilde in a legacy 8-bit latin-compatible
> charset),
> but invalid if bash processes the environment setting of en_US.UTF-8.
>
> Should bash process it as legacy input or invalid UTF8?
> Either way, what should it return? a UTF-8 char
> (hex 0xc30x83) transcoded from the latin value of A-tilde, or
> keep the binary value the same (return 0x83),
> should it return a warning message? If it does, should
> it return NUL for the returned value because the input was erroneous?

Assuming Latin-1 when nothing in the environment points to it seems questionable. The byte might just as well be a Cyrillic character in ISO-8859-5 or whatever.

Email filters were mentioned. Emails may use charsets different from the current environment -- even several different ones within a single mail (I've sent such mails myself). So if bash were to "fix" input depending on the environment, even writing a pass-through filter would require parsing the Content-Type headers and changing the environment accordingly (or else using an 8-bit clean charset throughout).

So I don't think bash should change the input (unintentionally, as with the original bug, or intentionally, as discussed here) unless and until it needs to do charset-dependent operations.
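
To make the ambiguity concrete, here is a rough sketch in Python (not bash, and not a claim about how bash behaves internally). It decodes the single byte 0xC3 from the question under three charsets and contrasts that with a filter that never decodes at all. The names `RAW`, `interpretations` and `pass_through` are made up for this illustration.

```python
# The single byte from the quoted question.
RAW = b"\xc3"


def interpretations(data: bytes) -> dict:
    """Decode the same bytes under several charsets.

    A value of None marks input that is invalid in that charset.
    """
    results = {}
    for charset in ("latin-1", "iso8859-5", "utf-8"):
        try:
            results[charset] = data.decode(charset)
        except UnicodeDecodeError:
            results[charset] = None
    return results


def pass_through(data: bytes) -> bytes:
    """A filter that never decodes, so it cannot damage any charset."""
    return data


if __name__ == "__main__":
    # Same byte, three different readings depending on the assumed charset.
    for charset, text in interpretations(RAW).items():
        print(charset, repr(text))

    # Transcoding under a Latin-1 assumption changes the bytes on the wire.
    print(RAW.decode("latin-1").encode("utf-8").hex())

    # Passing the bytes through untouched preserves them for any charset.
    print(pass_through(RAW).hex())
```

The byte reads as A-tilde in Latin-1, as a Cyrillic capital U in ISO-8859-5, and is an incomplete sequence in UTF-8. Only the pass-through variant is safe without knowing which of these the sender meant.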