On Mon, Feb 7, 2022 at 1:47 AM L A Walsh <b...@tlinx.org> wrote:
>
> On 2022/02/06 09:26, Frank Heckenbach wrote:
> >> On 2022/01/02 17:43, Frank Heckenbach wrote:
> >>
> >>
> >>> Why would you? Aren't you able to assess the severity of a bug
> >>> yourself? Silent data corruption is certainly one of the most severe
> >>> kinds of bugs ...
> >>>
> >> ---
> >> That's debatable, BTW, as I was reminded of a similar
> >> passthrough of what one might call 'invalid input' w/o warning,
> >>
> >
> > I think you misunderstood the bug. It was not about passing through
> > invalid input or fixing it. It was about bash corrupting valid input
> > (if an internal buffer boundary happened to fall within a UTF-8
> > sequence), which was very unhelpful.
> >
>
> I see that the bug you reported was due to entirely different
> circumstances. The question I might have is: if bash was returning
> input, should bash scan that input for validity? For example, suppose
> bash read these values from a 'read' (spaces separating the individual
> bytes):
>
>   bytes read | returned
>
> 1) The first case is relatively clear:
>    read (len=2): 0x41 0x31
>    returned: A1
>
> 2) read (len=4): 0x41 0x31 0x00 0x00
>    returned: ??? A1 or nothing?
>    error or warning message?

An error on a 0 byte? Er, I dislike warning reports too, due to the ugly
results on the terminal and the big slowdown.
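For what it's worth, rather than arguing about what read ought to do with
those bytes, it is easy to check what a given bash build actually does;
the result may differ by version and locale, and the variable name v below
is only for illustration:

  printf 'A1\0\0' | { IFS= read -r v; printf 'len=%s value=<%s>\n' "${#v}" "$v"; }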
> In the case of bash with an environment having LC_CTYPE: C.UTF-8 or
> en_US.UTF-8:
>    read (len=1): 0xC3, i.e. Ã ('A' w/tilde in a legacy 8-bit
>    latin-compatible charset), but invalid if bash honors the environment
>    setting of en_US.UTF-8.
>
> Should bash process it as legacy input or as invalid UTF-8?
> Either way, what should it return? A UTF-8 char
> (hex 0xC3 0x83) transcoded from the latin value of A-tilde, or
> keep the binary value the same (return 0x83)?
> Should it return a warning message? If it does, should
> it return NUL for the returned value because the input was erroneous?
>
> I.e. should bash try to scan input for validity?
> Should it use legacy ANSI or 8-bit charsets for such input, or
> should it try to decode legacy inputs into Unicode if the environment
> indicates it should be using Unicode values?
> On decode errors, should it issue a warning message? If so, should
> it return the original unencoded value, NUL, or a decoded Unicode value?
>
> If bash is returning a value corrupted by a memory overlap (overlapping
> stack values), should it be testing the returned value for validity
> (especially if the environment suggests it should be returning Unicode
> values)?
>
> I.e. if there was corruption -- either from reading a NUL
> unexpectedly or from incorrectly encoded Unicode values -- and warnings
> were "on", the corruption might be noticed. But even if noticed,
> what should bash return? A binary DWORD value that makes no sense as
> a string in either ASCII or Unicode, like
> 0x00 0x41 0x00 0xC1 -- maybe an attempt at 'AÀ' in UTF-16 on Windows --
> which is where my original bug occurred: reading a registry value that
> could easily be UTF-16 encoded, where the user shell was being run under
> cygwin in a Unicode C.UTF-8 environment.
>
> I.e. bash might be expected to return different results based on
> the environment it was running in and the encoding that environment
> specified, or on whether bash was expecting the reduced ASCII character
> set.
>
> What one thinks bash 'should do', and what environment it was running
> in, can produce very different results, which is why I balked at bash
> issuing warnings in some cases and not others, and at whether it
> returned the original binary values or some sanitized version.
>
> At the time, due to the warning being issued, the read 'failed' and
> a sanitized version was returned -- both responses preventing reading
> the desired value. If bash detected invalid Unicode sequences it might
> help detect memory-based corruption, or it might sanitize such sequences
> before returning them -- either way possibly causing harm, through
> silence or through breaking compatibility.
>
> I just thought it might be desirable to be consistent about what is
> done, or to have it controlled via an option (be strict+warn, or
> ignore+don't warn).
>
> If it's decided to ignore (don't test for validity) and not issue a
> warning as the default action, then the warning for null bytes seems
> like it should be removed -- consistent with the idea of bash not
> testing read input for validity.
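Regarding the 0xC3 example, the same kind of experiment shows what a
particular bash/locale combination actually hands back; this assumes a
UTF-8 locale such as C.UTF-8 is installed, and the bytes you get may well
differ between versions:

  printf '\xc3\n' | LC_ALL=C.UTF-8 bash -c 'IFS= read -r v; printf %s "$v" | od -An -tx1'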
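And on whether bash itself should scan input for validity: a script that
cares can already do that check after the fact. A minimal sketch, assuming
iconv is available (it exits non-zero when the input is not valid in the
source encoding; v is again just a placeholder):

  if ! printf %s "$v" | iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1; then
      echo "warning: value is not valid UTF-8" >&2
  fi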