Re: Corrupted multibyte characters in command substitutions fixes may be worse than problem.

2022-02-07 Thread Alex fxmbsw7 Ratchev
On Mon, Feb 7, 2022 at 7:45 AM Lawrence Velázquez wrote:
>
> On Mon, Feb 7, 2022, at 1:26 AM, Alex fxmbsw7 Ratchev wrote:
> > well i saw now, printf a char of "\0" results in 0 bytes out to wc -c
>
> % /usr/bin/printf '\0' | wc -c
> 1
>
>
> > however my solution still stays
> > you just use memory locations instead of c strings
> > and those entries in memory are of course of known length, before setting
> > and all is fine
>
> "Your" solution is decades old.  Everyone knows how Pascal-style
> strings work.  This is not cutting-edge computer science.

i dunno what pascal strings are, sorry

> > of course this means to not use these faulty 'c strings', but a self
> > coded solution
>
> As Greg already mentioned, such a system requires converting back
> to C strings for system calls and other external APIs.  It's not
> insurmountable, but it's more involved than just swapping all your
> char * to my_string or whatever.

hard work this way i see
sorry, thanks.
>
> I repeat:
>
> >> It's so simple that you should have no problem converting the entire
> >> bash codebase to Pascal-style strings yourself.  We'll wait.
>
>
> --
> vq
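
[Editorial sketch, not part of the original messages.] The two points made above, that a single NUL is still one byte of output and that a length-tracked buffer has to be turned back into a C string for external APIs, can be illustrated roughly as follows. Python's `bytes` stands in for a length-tracked buffer and `ctypes.c_char_p` stands in for a C API that expects NUL termination; both are illustrative assumptions and say nothing about how bash itself stores strings.

```python
import ctypes

# A single NUL is still one byte, which is what `wc -c` counts.
nul = b"\0"
print(len(nul))

# A length-tracked buffer can hold an embedded NUL without trouble.
data = b"ab\0cd"
print(len(data))

# Viewing the same buffer as a C string stops at the first NUL,
# so everything after it is invisible to a C-string consumer.
as_c_string = ctypes.c_char_p(data).value
print(as_c_string)
```

The last line is the conversion cost mentioned in the thread: any call that takes a `char *` sees only the part before the first NUL.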



Re: Corrupted multibyte characters in command substitutions fixes may be worse than problem.

2022-02-07 Thread Frank Heckenbach
>  In the case of bash with environment having LC_CTYPE: C.UTF-8 or 
> en_US.UTF-8
> read:
> 0xC3 (len=1) i.e. Ã ('A' w/tilde in a legacy 8-bit latin-compatible 
> charset),
> but invalid if bash processes the environment setting of en_US.UTF-8.
> 
> Should bash process it as legacy input or invalid UTF8?
> Either way, what should it return? a UTF-8 char
> (hex 0xc3 0x83) transcoded from the latin value of A-tilde, or
> keep the binary value the same (return 0x83),
> should it return a warning message?  If it does, should
> it return NUL for the returned value because the input was erroneous?

Assuming Latin-1 when nothing in the environment points to it seems
questionable. It might just as well be a Cyrillic character in
ISO-8859-5 or whatever.

Email filters were mentioned. Emails may use charsets different from
the current environment -- even several different ones within a mail
(I've sent such mails myself). So if bash were to "fix" input
depending on the environment, even writing a pass-through filter
would require parsing the Content-Type headers and changing the
environment accordingly (or else, use an 8-bit clean charset
throughout).

So I don't think bash should change the input (unintentionally as
with the original bug or intentionally as discussed here) unless and
until it needs to do charset-dependent operations.
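
[Editorial sketch, not part of the original messages.] The ambiguity described above can be shown with the lone byte 0xC3 from the quoted question. The snippet below uses Python's built-in codecs purely as an illustration; the variable names are hypothetical and nothing here reflects bash internals.

```python
raw = b"\xc3"

# The same byte means different things in different 8-bit charsets.
as_latin1 = raw.decode("iso-8859-1")    # A with tilde
as_cyrillic = raw.decode("iso-8859-5")  # a Cyrillic letter instead

# On its own it is not valid UTF-8: 0xC3 starts a two-byte sequence.
try:
    raw.decode("utf-8")
    valid_utf8 = True
except UnicodeDecodeError:
    valid_utf8 = False

# "Fixing" the input by assuming Latin-1 changes the bytes ...
transcoded = as_latin1.encode("utf-8")

# ... while passing it through untouched keeps them as they were.
passthrough = bytes(raw)

print(valid_utf8, transcoded, passthrough)
```

A pass-through filter in this sketch never needs to know the charset, which is the argument for leaving the bytes alone until a charset-dependent operation is actually required.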



Re: Corrupted multibyte characters in command substitutions fixes may be worse than problem.

2022-02-07 Thread Ángel
On 2022-02-07 at 11:55 +0100, Alex fxmbsw7 Ratchev wrote:
> > > however my solution still stays
> > > you just use memory locations instead of c strings
> > > and those entries in memory are of course of known length, before
> > > setting and all is fine
> > 
> > "Your" solution is decades old.  Everyone knows how Pascal-style
> > strings work.  This is not cutting-edge computer science.
> 
> i dunno what pascal strings are, sorry

"Pascal strings" refers to strings prefixed with their length:
https://en.wikipedia.org/wiki/String_(computer_science)#Length-prefixed

Basically, what you were proposing.


And as Velázquez said, it's ingenuous to propose a solution nobody else
asked for, expecting others to spend the effort of actually
implementing it (plus the criticism of their result, such as a limitation
on the string length, or wasted memory for every pointer).
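
[Editorial sketch, not part of the original messages.] A length-prefixed string of the kind described above might look like this. The one-byte prefix and the helper names `pack_pstring` and `unpack_pstring` are assumptions chosen for illustration; they show both the benefit (embedded NULs survive) and the length limit mentioned as a likely criticism.

```python
import struct

MAX_LEN = 255  # a one-byte length prefix cannot describe more than this


def pack_pstring(payload: bytes) -> bytes:
    """Return payload preceded by a one-byte length prefix."""
    if len(payload) > MAX_LEN:
        raise ValueError("payload too long for a one-byte length prefix")
    return struct.pack("B", len(payload)) + payload


def unpack_pstring(buf: bytes) -> bytes:
    """Read the length prefix, then return exactly that many bytes."""
    (length,) = struct.unpack_from("B", buf, 0)
    return buf[1:1 + length]


packed = pack_pstring(b"ab\0cd")
print(len(packed), unpack_pstring(packed))
```

A wider prefix raises the length limit but costs more memory per string, which is the trade-off the parenthetical above alludes to.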





Re: Corrupted multibyte characters in command substitutions fixes may be worse than problem.

2022-02-07 Thread Alex fxmbsw7 Ratchev
On Tue, Feb 8, 2022 at 12:09 AM Ángel wrote:
>
> On 2022-02-07 at 11:55 +0100, Alex fxmbsw7 Ratchev wrote:
> > > > however my solution still stays
> > > > you just use memory locations instead of c strings
> > > > and those entries in memory are of course of known length, before
> > > > setting and all is fine
> > >
> > > "Your" solution is decades old.  Everyone knows how Pascal-style
> > > strings work.  This is not cutting-edge computer science.
> >
> > i dunno what pascal strings are, sorry
>
> "Pascal strings" refers to strings prefixed with their length:
> https://en.wikipedia.org/wiki/String_(computer_science)#Length-prefixed
>
> Basically, what you were proposing.

i see, thank you for the good explanation ( in your words not url )
>
>
> And as Velázquez said, it's ingenuous to propose a solution nobody else
> asked for, expecting others to spend the effort of actually
> implementing it (plus the criticism of their result, such as a limitation
> on the string length, or wasted memory for every pointer).

well as im an outsider i agree
else, i can just say, rather keep the nulls
you kept the \1'en and \xff :)) ( yeah not you, the c library language
or whatever )

greets