On Mon, Feb 7, 2022 at 1:47 AM L A Walsh <b...@tlinx.org> wrote:
>
> On 2022/02/06 09:26, Frank Heckenbach wrote:
> >> On 2022/01/02 17:43, Frank Heckenbach wrote:
> >>
> >>
> >>> Why would you? Aren't you able to assess the severity of a bug
> >>> yourself? Silent data corruption is certainly one of the most severe
> >>> kinds of bugs ...
> >>>
> >> ---
> >> That's debatable, BTW, as I was reminded of a similar
> >> passthrough of what one might call 'invalid input' w/o warning,
> >>
> >
> > I think you misunderstood the bug. It was not about passing through
> > invalid input or fixing it. It was about bash corrupting valid input
> > (if an internal buffer boundary happened to fall within a UTF-8
> > sequence), which was very unhelpful.
> >
>
> I see that the bug you reported was due to entirely different
> circumstances. The question I might have is: if bash was returning
> input, should bash scan that input for validity? For example, suppose
> bash read these values from a 'read' (spaces separating the individual
> bytes):
>
>   bytes read | returned
>
> 1) The first case is relatively clear:
>    read (len=2): 0x41 0x31
>    returned: A1
>
> 2) read (len=4): 0x41 0x31 0x00 0x00
>    returned: ??? A1 or nothing?
>    error or warning message?

An error on a 0 byte? Er, I dislike warning reports too, due to the ugly
results on the terminal and the big slowdown.
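For what it's worth, rather than arguing about what read ought to do with
those bytes, it is easy to check what a given bash build actually does;
the result may differ by version and locale, and the variable name v below
is only for illustration:

  printf 'A1\0\0' | { IFS= read -r v; printf 'len=%s value=<%s>\n' "${#v}" "$v"; }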
> In the case of bash with an environment having LC_CTYPE: C.UTF-8 or
> en_US.UTF-8:
>    read (len=1): 0xC3, i.e. Ã ('A' w/tilde in a legacy 8-bit
>    latin-compatible charset), but invalid if bash honors the environment
>    setting of en_US.UTF-8.
>
> Should bash process it as legacy input or as invalid UTF-8?
> Either way, what should it return? A UTF-8 char
> (hex 0xC3 0x83) transcoded from the latin value of A-tilde, or
> keep the binary value the same (return 0x83)?
> Should it return a warning message? If it does, should
> it return NUL for the returned value because the input was erroneous?
>
> I.e. should bash try to scan input for validity?
> Should it use legacy ANSI or 8-bit charsets for such input, or
> should it try to decode legacy inputs into Unicode if the environment
> indicates it should be using Unicode values?
> On decode errors, should it issue a warning message? If so, should
> it return the original unencoded value, NUL, or a decoded Unicode value?
>
> If bash is returning a value corrupted by a memory overlap (overlapping
> stack values), should it be testing the returned value for validity
> (especially if the environment suggests it should be returning Unicode
> values)?
>
> I.e. if there was corruption -- either from reading a NUL
> unexpectedly or from incorrectly encoded Unicode values -- and warnings
> were "on", the corruption might be noticed. But even if noticed,
> what should bash return? A binary DWORD value that makes no sense as
> a string in either ASCII or Unicode, like
> 0x00 0x41 0x00 0xC1 -- maybe an attempt at 'AÀ' in UTF-16 on Windows --
> which is where my original bug occurred: reading a registry value that
> could easily be UTF-16 encoded, where the user shell was being run under
> cygwin in a Unicode C.UTF-8 environment.
>
> I.e. bash might be expected to return different results based on
> the environment it was running in and the encoding that environment
> specified, or on whether bash was expecting the reduced ASCII character
> set.
>
> What one thinks bash 'should do', and what environment it was running
> in, can produce very different results, which is why I balked at bash
> issuing warnings in some cases and not others, and at whether it
> returned the original binary values or some sanitized version.
>
> At the time, due to the warning being issued, the read 'failed' and
> a sanitized version was returned -- both responses preventing reading
> the desired value. If bash detected invalid Unicode sequences it might
> help detect memory-based corruption, or it might sanitize such sequences
> before returning them -- either way possibly causing harm, through
> silence or through breaking compatibility.
>
> I just thought it might be desirable to be consistent about what is
> done, or to have it controlled via an option (be strict+warn, or
> ignore+don't warn).
>
> If it's decided to ignore (don't test for validity) and not issue a
> warning as the default action, then the warning for null bytes seems
> like it should be removed -- consistent with the idea of bash not
> testing read input for validity.
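Regarding the 0xC3 example, the same kind of experiment shows what a
particular bash/locale combination actually hands back; this assumes a
UTF-8 locale such as C.UTF-8 is installed, and the bytes you get may well
differ between versions:

  printf '\xc3\n' | LC_ALL=C.UTF-8 bash -c 'IFS= read -r v; printf %s "$v" | od -An -tx1'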
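And on whether bash itself should scan input for validity: a script that
cares can already do that check after the fact. A minimal sketch, assuming
iconv is available (it exits non-zero when the input is not valid in the
source encoding; v is again just a placeholder):

  if ! printf %s "$v" | iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1; then
      echo "warning: value is not valid UTF-8" >&2
  fi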