Re: Patch for unicode in varnames...

2017-06-05 Thread George
On Sun, 2017-06-04 at 11:47 -0700, L A Walsh wrote:
> dualbus wrote:
> > 
> > I hadn't realized that bash already supports Unicode in function names!
> > FWIW:
> > 
> >   bash-4.4$ 
> >   Lēv=?
> >   Φ=0.618033988749894848
> >   
> > 
> > With this terrible patch:
> > 
> > dualbus@debian:~/src/gnu/bash$ PAGER= git diff
> >   
> 
> Clarification, please, but it looks like with your
> patch below, Unicode in variable names might be fairly close
> to being achieved?  Seeing how it was done for functions,
> gave you insight into how variables could be done, yes?
> 
> Why do you call it "terrible"? 
To hazard a guess: Each call to legal_identifier() and assignment() in the
patched code requires copying the parameter and translating it to a wide-
character string (with no provision for skipping the added work as a build
option). The memory allocated for these copies appears to leak (I didn't
see any added calls to xfree() to go with the new xmalloc() calls), and the
character type for the conversion is derived from the user's locale (which
means there's no reliable mechanism in place to run a script in a locale
whose character encoding doesn't match that of the script). And he did
mention "issues with compound assignments" as well. Those issues would need
to be resolved.
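
A shell-level analogue of that locale-dependence, for illustration only
(bash's pattern character classes follow LC_CTYPE much as the patch's
wide-character checks would, so the same byte sequence should pass or fail
depending on the locale):

  $ LC_CTYPE=en_US.utf8 bash -c '[[ ē == [[:alpha:]] ]]' && echo accepted
  accepted
  $ LC_CTYPE=C bash -c '[[ ē == [[:alpha:]] ]]' || echo rejected
  rejected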


Re: Patch for unicode in varnames...

2017-06-05 Thread Peter & Kelly Passchier
On 05/06/2560 15:52, George wrote:
> there's not a reliable mechanism in place to run a script in a locale
> whose character encoding doesn't match that of the script

From my experience, running such scripts is no problem, but correct
rendering might depend on the client/editor.



Re: Very minor fixes thanks to cppcheck

2017-06-05 Thread Eric Blake
On 06/04/2017 11:39 AM, Nicola Spanti wrote:
> Hi.
> 
> I used that:
> cppcheck --verbose --quiet --enable=all --force --language=c --std=c89 .
> 
> I fixed some errors that were reported by cppcheck. I published that on
> GitLab.com.
> https://gitlab.com/RyDroid/bash
> 
> The git remote is: https://rydr...@gitlab.com/RyDroid/bash.git
> The branch is cppcheck-fix

Can you also post the patches directly to this list, rather than making
us chase a URL to see what the patch includes?

> 
> Feel to merge it upstream. I don't ask credit for this tiny thing. Of
> course, I give the copyright to the FSF.

Copyright assignment is more formal than that, if your patch is deemed
significant (small patches can be taken without assignment, but large
patches require actual paperwork and signatures, although these days
there are various countries where the paperwork is all electronic).

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3266
Virtualization:  qemu.org | libvirt.org





Re: RFE: Please allow unicode ID chars in identifiers

2017-06-05 Thread Greg Wooledge
On Sun, Jun 04, 2017 at 09:20:45AM +0700, Peter & Kelly Passchier wrote:
> On 04/06/2560 04:48, L A Walsh wrote:
> >> Greg Wooledge wrote:
> >>> Here is a demonstration of the cost of what you are proposing.  In my
> >>> mail user agent, your variable shows up as L??v.
> >>>
> >>> Source code with your UTF-8 identifiers would no longer even be
> >>> READABLE  
> >>
> >> What display/OS do you have that you can't run UTF-8 on?
> 
> So it's his mail client: reading unicode source in their old mail client
> is going to be problematic for some people...

imadev:~$ uname -a
HP-UX imadev B.10.20 A 9000/785 2008897791 two-user license

imadev:~$ ls -lt /usr/bin | tail -1
-r-xr-xr-x   1 binbin  12288 May 30  1996 from

It's the system on which I run my email MUA (mutt) and my IRC client.
It's also the NIS slave server for this floor (subnet).  It's also
where I put the squid proxy that all the Debian systems use for
apt-get.

Up until a few months ago, when our legacy Perforce server died, it
was also where I did all the development for an in-house Tcl (mostly)
application.  Having absolutely no intention of attempting to replace
Perforce with another Perforce, I simply took the existing checkout
(Perforce calls it a "client") that I already had, copied it over to
the Debian half of my dual-boot Debian/Windows system, and made a git
repository out of it.  (Compiling git for HP-UX 10.20 did not sound like
a fun thing to do, with all that https and stuff that it seems to want.)

This is the pace at which changes happen out here in the real world.



Re: Patch for unicode in varnames...

2017-06-05 Thread dualbus
On Mon, Jun 05, 2017 at 04:52:19AM -0400, George wrote:
[...]
> To hazard a guess: Each call to legal_identifier() and assignment() in
> the patched code requires copying the parameter and translating it to
> a wide-character string (with no provision for skipping the added work
> as a build option). The memory allocated for these copies appears to
> leak (I didn't see any added calls to xfree() to go with the new
> xmalloc() calls), and the character type for the conversion is derived
> from the user's locale (which means there's no reliable mechanism in
> place to run a script in a locale whose character encoding doesn't
> match that of the script). And he did mention "issues with compound
> assignments" as well. Those issues would need to be resolved.

Correct. There's also mixed use of wide-character strings and normal
strings, because that was easier to hack quickly.

By the way, ksh93 and zsh already support Unicode identifiers:

  dualbus@debian:~$ for sh in bash mksh ksh93 zsh; do LC_CTYPE=en_US.utf8 $sh -c 'φ=phi; echo $φ'; done
  bash: φ=phi: command not found
  $φ
  mksh: φ=phi: not found
  $φ
  phi
  phi

And all of these four support Unicode function names:

  dualbus@debian:~$ for sh in bash mksh ksh93 zsh; do LC_CTYPE=en_US.utf8 $sh -c 'φ() { echo hi; }; φ'; done
  hi
  hi
  hi
  hi

-- 
Eduardo Bustamante
https://dualbus.me/



Re: Patch for unicode in varnames...

2017-06-05 Thread George
On Mon, 2017-06-05 at 15:59 +0700, Peter & Kelly Passchier wrote:
> On 05/06/2560 15:52, George wrote:
> > 
> > there's not a reliable mechanism in place to run a script in a locale
> > whose character encoding doesn't match that of the script
> From my experience, running such scripts is no problem, but correct
> rendering might depend on the client/editor.
> 
It depends on the source and target encodings. For most pairs of source and
target encodings there is some case where reinterpreting a string from the
source encoding as a string in the target encoding (without proper
conversion) will result in an invalid string in the target encoding.

For instance, if a script were written in ISO-8859-1, many possible sequences
involving accented characters would actually be invalid in UTF-8. (In UTF-8,
a multi-byte character must start with a byte in the 0xC2-0xF4 range and be
followed by the expected number of bytes in the 0x80-0xBF range.) So if you
had "Pokémon" as an identifier in a Latin-1-encoded script (byte value 0xE9
between the "k" and "m") and then tried running that script in a UTF-8
locale, that byte sequence (0xE9 0x6D) would actually be invalid in UTF-8, so
Eduardo's patch would indicate that the identifier is invalid and fail to run
the script.
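
A quick way to see that failure from the command line (assuming GNU iconv
here; it stands in for the mbstowcs() conversion the patch performs):

  $ printf 'Pok\xe9mon' | iconv -f ISO-8859-1 -t UTF-8     # bytes read as Latin-1
  Pokémon
  $ printf 'Pok\xe9mon' | iconv -f UTF-8 -t UTF-8 >/dev/null  # same bytes as UTF-8
  iconv: illegal input sequence at position 3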
UTF-8 is a bit exceptional as variable-width encodings go, in that it is
self-synchronizing and there are many possible byte sequences that are not
valid UTF-8. So converting _from_ UTF-8 tends to be less problematic than
converting _to_ UTF-8. But there are still corner cases where reinterpreting
a UTF-8 byte sequence as another variable-width encoding could fail outright
(rather than just producing a strange character sequence). For instance:

- If reinterpreting UTF-8 as GB-18030 (the current Chinese national standard)
or EUC-JP (a Japanese encoding common on Unix systems), the end-byte of a
UTF-8 multi-byte character could be misinterpreted as the first byte of a
GB-18030 or EUC multi-byte character. If that UTF-8 character is followed by
a byte that's not a valid continuation byte (for instance, any of the
single-byte punctuation or numeral characters in the 0x20-0x39 range), the
string would be invalid in the target encoding, and converting it to a
wide-character string would fail.
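
A concrete illustration of that class of failure (again assuming GNU iconv):
UTF-8 "あ" is the byte sequence 0xE3 0x81 0x82. Read as EUC-JP, 0xE3 opens a
two-byte character, but 0x81 is not a valid second byte, so conversion to a
wide-character string fails rather than yielding a strange-but-valid string:

  $ printf '\xe3\x81\x82' | iconv -f EUC-JP -t UTF-8 >/dev/null
  iconv: illegal input sequence at position 0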
...So basically, while there are many cases where a valid UTF-8 string could
be reinterpreted without conversion to produce a valid string in another
encoding, there are also cases where the conversion would simply fail.

It seems like Korn Shell's behavior is similar to Eduardo's patch: the
session's locale settings determine the behavior of functions like isalpha()
and mbstowcs(), and thus what is considered a "valid" identifier in various
contexts. (Interestingly enough, "フフ" is a valid parameter name in a UTF-8
script, but in an EUC-JP script it needs a leading underscore to work.)

Bash's present behavior seems a bit more cavalier: if a function name
contains a byte outside the ASCII range, it's apparently accepted regardless
of the locale settings. That works out for encodings like ISO-8859-15, UTF-8,
and EUC-JP, where bytes in the ASCII range always represent the corresponding
ASCII character, but it's problematic for encodings like GB18030, where bytes
in the ASCII range are sometimes part of multi-byte characters - and even
where it works, it's not a great rule for how those characters (especially
whitespace) are treated in the syntax.
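
For reference, a small demonstration of that current behavior: the multibyte
name is accepted as an opaque byte string even in the C locale (this should
print "hi" with bash 4.x):

  $ LC_ALL=C bash -c 'φ() { echo hi; }; φ'
  hi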
If Bash did go the route of using the locale to set the character encoding of
a script, I think it would be best to have a mechanism a script can use to
declare the character encoding of the whole script file up front, rather than
setting LC_CTYPE to procedurally change the behavior of the shell. This is
because, in principle at least, the meaning of shell code shouldn't change
based on the state of the shell. (That's not always the case today - there
are compatibility options that enable or disable certain keywords, and some
of those keywords have specific syntax associated with them...) The character
encoding used to interpret a script can fundamentally change how the script
is parsed (especially for encodings like GB18030, where bytes that look like
ASCII characters may actually be part of multi-byte characters) - so it
should be settable just once, at the start of parsing a file, rather than at
any point in the script's execution. And for scripts loaded with "source",
such a script should be able to communicate its own character encoding
without impacting the locale settings of the shell loading it.
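
To make that concrete, a purely hypothetical sketch (no such directive exists
in bash today) of a one-shot, parse-time encoding declaration, analogous to
Python's coding cookie:

  #!/bin/bash
  # coding: GB18030
  # ^ hypothetical directive, not real bash syntax: consumed once by the
  #   parser before any other line is tokenized, so a "source"d file could
  #   carry its own declaration without touching the caller's locale.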


Re: Patch for unicode in varnames...

2017-06-05 Thread L A Walsh

George wrote:
> On Mon, 2017-06-05 at 15:59 +0700, Peter & Kelly Passchier wrote:
>> On 05/06/2560 15:52, George wrote:
>>> there's not a reliable mechanism in place to run a script in a locale
>>> whose character encoding doesn't match that of the script
>> From my experience, running such scripts is no problem, but correct
>> rendering might depend on the client/editor.
> It depends on the source and target encodings. For most pairs of source
> and target encodings there is some case where reinterpreting a string
> from the source encoding as a string in the target encoding (without
> proper conversion) will result in an invalid string in the target
> encoding. For instance, if a script were written in ISO-8859-1, many
> possible sequences involving accented characters would actually be
> invalid in UTF-8.

---
   Um... I think you are answering a case different from the one stated
(i.e., the locale being the same as the one used in the script).  So no
conversion should take place.

(if you have an issue w/that, talk to George.  :-) )

-linda





Re: Patch for unicode in varnames...

2017-06-05 Thread Peter & Kelly Passchier
On 06/06/2560 05:39, George wrote:
> So if you had "Pokémon" as an identifier in a Latin-1-encoded script (byte 
> value 0xE9 between the "k" and "m") and then tried running that script in a
> UTF-8 locale, that byte sequence (0xE9 0x6D) would actually be invalid in 
> UTF-8, so Eduardo's patch would indicate that the identifier is invalid and
> fail to run the script.

I often work with a locale that has a UTF-8 encoding and a
different/older encoding that are incompatible. I haven't tried the
patch, but when I use Unicode characters in function names, write a
script in one encoding, and run it in an environment with the other
encoding, it still runs correctly; it just won't render correctly. (I
guess this depends on whether the editor recognizes different encodings:
Geany renders it correctly, but I don't know of a console editor that
does.)

Peter