Re: Patch for unicode in varnames...
On Sun, 2017-06-04 at 11:47 -0700, L A Walsh wrote:
> dualbus wrote:
> > I hadn't realized that bash already supports Unicode in function names!
> > FWIW:
> >
> > bash-4.4$
> > Lēv=?
> > Φ=0.618033988749894848
> >
> > With this terrible patch:
> >
> > dualbus@debian:~/src/gnu/bash$ PAGER= git diff
>
> Clarification, please, but it looks like with your patch below, Unicode
> in variable names might be fairly close to being achieved? Seeing how it
> was done for functions gave you insight into how variables could be
> done, yes?
>
> Why do you call it "terrible"?

To hazard a guess: each call to legal_identifier() and assignment() in the
patched code requires copying the parameter and translating it to a
wide-character string, with no provision for skipping the added work as a
build option. The memory allocated for these copies appears to leak (I
didn't see any added calls to xfree() to go with those new xmalloc()
calls), and the character type for the conversion is derived from the
user's locale, which means there's no reliable mechanism in place to run a
script in a locale whose character encoding doesn't match that of the
script. And he did mention "issues with compound assignments" as well;
those issues would need to be resolved.
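
To illustrate the first two points, here is a rough sketch of what a
multibyte-aware identifier check might look like with the conversion
buffer freed on every path. This is hypothetical code, not taken from the
patch, and it still has the third problem: mbstowcs() and iswalpha()
follow whatever LC_CTYPE the caller's locale happens to specify.

#include <stdlib.h>
#include <wchar.h>
#include <wctype.h>

/* Hypothetical multibyte-aware identifier check, along the lines
   described above.  Assumes the caller has already done
   setlocale(LC_CTYPE, "").  The conversion buffer is freed on every
   path, which is the detail the patch reportedly missed. */
int
legal_identifier_mb (const char *name)
{
  size_t len, i;
  wchar_t *wname;
  int ok;

  len = mbstowcs (NULL, name, 0);       /* measure; (size_t)-1 if the bytes
                                           are invalid in this locale's
                                           encoding */
  if (len == (size_t)-1 || len == 0)
    return 0;

  wname = malloc ((len + 1) * sizeof (wchar_t));
  if (wname == NULL)
    return 0;
  mbstowcs (wname, name, len + 1);

  ok = (iswalpha ((wint_t)wname[0]) || wname[0] == L'_');
  for (i = 1; ok && i < len; i++)
    ok = (iswalnum ((wint_t)wname[i]) || wname[i] == L'_');

  free (wname);                         /* no leak */
  return ok;
}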
Re: Patch for unicode in varnames...
On 05/06/2560 15:52, George wrote:
> there's not a reliable mechanism in place to run a script in a locale
> whose character encoding doesn't match that of the script

From my experience, running such scripts is no problem, but rendering them
correctly might depend on the client/editor.
Re: Very minor fixes thanks to cppcheck
On 06/04/2017 11:39 AM, Nicola Spanti wrote:
> Hi.
>
> I used that:
> cppcheck --verbose --quiet --enable=all --force --language=c --std=c89 .
>
> I fixed some errors that were reported by cppcheck. I published that on
> GitLab.com.
> https://gitlab.com/RyDroid/bash
>
> The git remote is: https://rydr...@gitlab.com/RyDroid/bash.git
> The branch is cppcheck-fix

Can you also post the patches directly to this list, rather than making us
chase a URL to see what the patch includes?

> Feel free to merge it upstream. I don't ask credit for this tiny thing.
> Of course, I give the copyright to the FSF.

Copyright assignment is more formal than that, if your patch is deemed
significant (small patches can be taken without assignment, but large
patches require actual paperwork and signatures, although these days there
are various countries where the paperwork is all electronic).

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3266
Virtualization: qemu.org | libvirt.org
Re: RFE: Please allow unicode ID chars in identifiers
On Sun, Jun 04, 2017 at 09:20:45AM +0700, Peter & Kelly Passchier wrote:
> On 04/06/2560 04:48, L A Walsh wrote:
> > Greg Wooledge wrote:
> > > Here is a demonstration of the cost of what you are proposing. In my
> > > mail user agent, your variable shows up as L??v.
> > >
> > > Source code with your UTF-8 identifiers would no longer even be
> > > READABLE
> >
> > What display/OS do you have that you can't run UTF-8 on?
>
> So it's his mail client: reading unicode source in their old mail client
> is going to be problematic for some people...

imadev:~$ uname -a
HP-UX imadev B.10.20 A 9000/785 2008897791 two-user license
imadev:~$ ls -lt /usr/bin | tail -1
-r-xr-xr-x   1 bin        bin          12288 May 30  1996 from

It's the system on which I run my email MUA (mutt) and my IRC client.
It's also the NIS slave server for this floor (subnet). It's also where I
put the squid proxy that all the Debian systems use for apt-get.

Up until a few months ago, when our legacy Perforce server died, it was
also where I did all the development for an in-house Tcl (mostly)
application. Having absolutely no intention of attempting to replace
Perforce with another Perforce, I simply took the existing checkout
(Perforce calls it a "client") that I already had, copied it over to the
Debian half of my dual-boot Debian/Windows system, and made a git
repository out of it. (Compiling git for HP-UX 10.20 did not sound like a
fun thing to do, with all that https and stuff that it seems to want.)

This is the pace at which changes happen out here in the real world.
Re: Patch for unicode in varnames...
On Mon, Jun 05, 2017 at 04:52:19AM -0400, George wrote:
[...]
> To hazard a guess: each call to legal_identifier() and assignment() in
> the patched code requires copying the parameter and translating it to
> a wide-character string, with no provision for skipping the added work
> as a build option. The memory allocated for these copies appears to
> leak (I didn't see any added calls to xfree() to go with those new
> xmalloc() calls), and the character type for the conversion is derived
> from the user's locale, which means there's no reliable mechanism in
> place to run a script in a locale whose character encoding doesn't
> match that of the script. And he did mention "issues with compound
> assignments" as well; those issues would need to be resolved.

Correct. There's also mixed use of wide-character strings and normal
strings, because that was easier to hack quickly.

By the way, ksh93 and zsh already support Unicode identifiers:

dualbus@debian:~$ for sh in bash mksh ksh93 zsh; do LC_CTYPE=en_US.utf8 $sh -c 'φ=phi; echo $φ'; done
bash: φ=phi: command not found
$φ
mksh: φ=phi: not found
$φ
phi
phi

And all four shells support Unicode function names:

dualbus@debian:~$ for sh in bash mksh ksh93 zsh; do LC_CTYPE=en_US.utf8 $sh -c 'φ() { echo hi; }; φ'; done
hi
hi
hi
hi

-- 
Eduardo Bustamante
https://dualbus.me/
Re: Patch for unicode in varnames...
On Mon, 2017-06-05 at 15:59 +0700, Peter & Kelly Passchier wrote:
> On 05/06/2560 15:52, George wrote:
> > there's not a reliable mechanism in place to run a script in a locale
> > whose character encoding doesn't match that of the script
> From my experience running such scripts is no problem, but correct
> rendering it might depend on the client/editor.

It depends on the source and target encodings. For most pairs of source
and target encodings there is some case where reinterpreting a string from
the source encoding as a string in the target encoding (without proper
conversion) will produce an invalid string in the target encoding.

For instance, if a script were written in ISO-8859-1, many byte sequences
involving accented characters would be invalid in UTF-8. (In valid UTF-8,
a multi-byte character must start with a lead byte in the 0xC2-0xF4 range,
followed by the expected number of continuation bytes in the 0x80-0xBF
range.) So if you had "Pokémon" as an identifier in a Latin-1-encoded
script (byte value 0xE9 between the "k" and "m") and then tried running
that script in a UTF-8 locale, the byte sequence 0xE9 0x6D would be
invalid in UTF-8, so Eduardo's patch would report the identifier as
invalid and fail to run the script.

UTF-8 is a bit exceptional as variable-width encodings go, in that it is
self-synchronizing, and many possible byte sequences are not valid UTF-8.
So converting _from_ UTF-8 tends to be less problematic than converting
_to_ UTF-8. But there are still corner cases where reinterpreting a UTF-8
byte sequence as another variable-width encoding will fail outright
(rather than just produce a strange character sequence). For instance:

- If reinterpreting UTF-8 as GB-18030 (the current Chinese national
  standard) or EUC-JP (a Japanese encoding common on Unix systems), the
  final byte of a UTF-8 multi-byte character could be misinterpreted as
  the first byte of a GB-18030 or EUC-JP multi-byte character. If that
  UTF-8 character is followed by a byte that's not a valid continuation
  byte (for instance, any of the single-byte punctuation or numeral
  characters in 0x20-0x30), the string is invalid in the target encoding,
  and converting it to a wide-character string will fail.

So basically, while there are many cases where a valid UTF-8 string can be
reinterpreted without conversion as a valid string in another encoding,
there are cases where the conversion fails as well.

It seems that Korn Shell's behavior is similar to Eduardo's patch: the
session's locale settings determine the behavior of functions like
isalpha() and mbstowcs(), and thus what is considered a "valid" identifier
in various contexts. (Interestingly enough, "フフ" is a valid parameter
name in a UTF-8 script, but in an EUC-JP script it needs a leading
underscore to work.)

Bash's present behavior is a bit more cavalier: if a function name
contains a byte outside the ASCII range, it's apparently accepted
regardless of the locale settings. That works out for encodings like
ISO-8859-15, UTF-8, and EUC-JP, where bytes in the ASCII range always
represent the corresponding ASCII character, but it's problematic for
encodings like GB-18030, where bytes in the ASCII range are sometimes part
of multi-byte characters. And while it mostly works, it's not a great rule
for how these characters (esp. whitespace) are treated in the syntax.
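
To make the "Pokémon" case concrete, here is a small standalone C program
(assuming a system with an en_US.UTF-8 locale installed) showing the
locale-dependent conversion rejecting the Latin-1 bytes:

#include <locale.h>
#include <stdio.h>
#include <stdlib.h>

/* The Latin-1 byte 0xE9 ('é') followed by 'm' (0x6D) is not a valid
   UTF-8 sequence, so a conversion like the one an identifier check
   would rely on fails outright in a UTF-8 locale. */
int
main (void)
{
  const char latin1_name[] = "Pok\xe9mon";      /* ISO-8859-1 bytes */
  size_t n;

  if (setlocale (LC_CTYPE, "en_US.UTF-8") == NULL)
    {
      fprintf (stderr, "UTF-8 locale not available\n");
      return 1;
    }

  n = mbstowcs (NULL, latin1_name, 0);
  if (n == (size_t)-1)
    printf ("invalid multibyte sequence in this locale\n");  /* taken */
  else
    printf ("converted to %lu wide characters\n", (unsigned long)n);
  return 0;
}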
If Bash did go the route of using the locale to set the character encoding
of a script, I think it would be best to have a mechanism a script can use
to declare the character encoding for the whole script file up front,
rather than setting LC_CTYPE to procedurally change the behavior of the
shell. This is because, in principle at least, the meaning of shell code
shouldn't change based on the state of the shell. (That's not always the
case: there are compatibility options that enable or disable certain
keywords, and some of those keywords have specific syntax associated with
them...) The character encoding used to interpret a script can
fundamentally change how the script is parsed (especially for encodings
like GB-18030, where bytes that look like ASCII characters may actually be
part of multi-byte characters), so it should be set just once, at the
start of parsing a file, rather than at any point in the script's
execution. And a script loaded with "source" should be able to communicate
its own character encoding without impacting the locale settings of the
shell loading it.
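
As a sketch of what that could look like internally: the parser could pin
a POSIX.1-2008 locale object to the file when it opens it, and route all
of its multibyte conversions through that object instead of through the
shell's current LC_CTYPE. The code below is hypothetical (parse_state, the
function names, and the per-script encoding declaration are all invented
for illustration), but newlocale() and uselocale() are standard:

#define _POSIX_C_SOURCE 200809L
#include <locale.h>
#include <stdlib.h>
#include <wchar.h>

/* Hypothetical parse-time state: the script's character encoding is
   fixed once, when the file is opened for parsing. */
typedef struct parse_state
{
  const char *path;
  locale_t script_locale;       /* fixed for the lifetime of this file */
} parse_state;

int
parser_open (parse_state *ps, const char *path, const char *declared_locale)
{
  ps->path = path;
  /* declared_locale would be derived from the script's (hypothetical)
     up-front encoding declaration; fall back to the invoking
     environment if the script doesn't declare one. */
  ps->script_locale = newlocale (LC_CTYPE_MASK,
                                 declared_locale ? declared_locale : "",
                                 (locale_t) 0);
  return ps->script_locale != (locale_t) 0;
}

size_t
parser_mbstowcs (parse_state *ps, wchar_t *dst, const char *src, size_t n)
{
  /* Switch to the script's locale just for this conversion, so a
     runtime assignment to LC_CTYPE can't change how the file parses. */
  locale_t saved = uselocale (ps->script_locale);
  size_t r = mbstowcs (dst, src, n);
  uselocale (saved);
  return r;
}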
Re: Patch for unicode in varnames...
George wrote:
> On Mon, 2017-06-05 at 15:59 +0700, Peter & Kelly Passchier wrote:
> > On 05/06/2560 15:52, George wrote:
> > > there's not a reliable mechanism in place to run a script in a
> > > locale whose character encoding doesn't match that of the script
> > From my experience running such scripts is no problem, but correct
> > rendering it might depend on the client/editor.
>
> It depends on the source and target encodings. For most pairs of source
> and target encodings there is some case where reinterpreting a string
> from the source encoding as a string in the target encoding (without
> proper conversion) will produce an invalid string in the target
> encoding. For instance, if a script were written in ISO-8859-1, many
> byte sequences involving accented characters would be invalid in UTF-8.
---
Um... I think you are answering a case that is different from the one that
was stated (i.e. the locale being the same as the one used in the script),
so no conversion should take place. (If you have an issue w/that, talk to
George. :-) )

-linda
Re: Patch for unicode in varnames...
On 06/06/2560 05:39, George wrote:
> So if you had "Pokémon" as an identifier in a Latin-1-encoded script
> (byte value 0xE9 between the "k" and "m") and then tried running that
> script in a UTF-8 locale, the byte sequence 0xE9 0x6D would be invalid
> in UTF-8, so Eduardo's patch would report the identifier as invalid and
> fail to run the script.

I often work with a locale that has both a UTF-8 encoding and a different,
older encoding, and the two are incompatible. I haven't tried the patch,
but when I use Unicode characters in function names, write a script in one
encoding, and run it in an environment using the other encoding, it still
runs correctly; it just won't render correctly. (I guess this depends on
whether the editor recognizes different encodings: Geany, for instance,
renders it correctly, but I don't know of a console editor that does.)

Peter