It's all but impossible to establish an "absolute need" for a new feature. In the extreme case there is always the argument that one does not "absolutely need" a feature added to Bash, because one could simply use Bash to launch another tool that provides that feature - write in Python rather than in shell, etc. Very, very few "needs" qualify as "absolutes". But there are decisions that can make Bash a better programming language that are worth considering even if they are not "absolutely" needed - just as associative arrays and regular expressions were not "absolutely" needed, but are quite useful.
I agree that allowing Unicode in parameter names is problematic:

- There are characters that should be equivalent in principle, but aren't. For instance, the Greek letter pi (π) and the mathematical symbol pi (𝛑) may render the same in some fonts, but they are distinct code points. Some of these characters will look like Bash syntax, but be encoded differently.

- There are characters that are equivalent, but can be encoded multiple ways. For instance, 'é' may be encoded as $'\u00E9' or as $'e\u0301'. Broadly speaking this falls under the scope of "Unicode normalization" - a well-explored problem, but not a trivial one. (And it gets much worse with Asian languages, for instance.) A short demonstration appears below.

- Unicode introduces more whitespace characters. To allow Unicode glyphs in parameter names, one must also decide whether to interpret Unicode whitespace as whitespace (complicating parsing rules, word-splitting rules, etc.), or to treat only ASCII whitespace as whitespace (in which case Unicode whitespace can form part of a "word" without quoting - which could make such code visually confusing).

- As you pointed out, this requires the shell to somehow establish a convention governing the character set used to interpret shell scripts - a point on which the shell has so far been able to remain noncommittal (as long as the character set is ASCII-compatible).

I think the value is that it allows programmers to express themselves in their native language. The history of computing up to this point has produced a situation where most programming languages are oriented toward native English speakers and the Latin alphabet, and it would be very difficult to make a programming language truly multilingual. But I think it's a worthwhile goal, particularly in the shell (which can be used for scripting, but also for general, interactive operation of the computer), to accommodate foreign vocabulary in those contexts in which the user chooses to use it.

From an English speaker's perspective, allowing Unicode could let variable names distinguish between "resumé" and "resume" (we don't use accented characters much). For various European languages, accented characters (and other characters outside ASCII) can be necessary to make certain words clear: drop an accent mark from a word that normally uses it, and the word looks like something else. And for Asian languages like Japanese, ideogram characters can help to distinguish between homophones.

Personally I also feel it's worthwhile to support symbols as in Walsh's examples: for instance, the symbol "pi" has a very clear meaning. You can see it and know it doesn't mean "previous value of an iteration variable that was casually/sloppily named $i" or something else. Where that symbol appears, it almost always refers to the circle ratio.

Some of Walsh's other examples I think would just be needlessly confusing: for instance, using 'Ⅽ' (U+216D ROMAN NUMERAL ONE HUNDRED) as a variable name just results in a name that's hard to type and distinct from, but almost visually indistinguishable from, the Latin letter 'C'. I see cases like that as "enough rope to hang oneself with". You can use Unicode to do silly or confusing things, and it's not the responsibility of the shell to prevent someone from doing such things.
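Coming back to the normalization point above, here's the promised demonstration. Nothing hypothetical in it - this is current Bash behavior (it needs bash 4.2 or later for the $'\u...' escapes, and a UTF-8 locale):

    # Two encodings of "é" that render alike but are distinct to the shell.
    a=$'\u00E9'     # precomposed: U+00E9 LATIN SMALL LETTER E WITH ACUTE
    b=$'e\u0301'    # decomposed: 'e' + U+0301 COMBINING ACUTE ACCENT
    printf '%s / %s\n' "$a" "$b"                      # both display as "é"
    [[ $a == "$b" ]] && echo equal || echo not-equal  # prints "not-equal"
    echo "${#a} ${#b}"                                # prints "1 2"

If é were allowed in parameter names, the same would hold for two visually identical variable names - unless the shell normalized them, with all the complexity that entails.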
Greg raises a fair point: some platforms will simply be unable to view scripts written with Unicode symbols. Likewise, some platforms will be unable to view scripts written with features that are already present, such as Unicode string literals or Unicode command names. But, on the other hand:

- Even if your editor or terminal can't display the UTF-8 code, that doesn't mean the shell process can't RUN it.
- If your terminal can't display UTF-8... well, you can edit the code from a terminal that CAN.

So I don't think it's that big of a problem. Frankly, making shell scripts broadly compatible (let alone "universally" compatible) is very, very difficult. Can you believe someone wrote a shell script that used the colon character as part of a "tar" archive name? They clearly didn't have GNU tar in mind when they wrote that... This would be just another case of something useful but not universally compatible. (In other words, if the feature is added, it becomes one of those things you don't use if you want your script to be portable. Frankly, if it were added to Bash now, most of us would have to wait 5-10 years for today's version of Bash to become "the oldest version our project must still support" before we could use the feature. This is part of why I tend to be rather forward-looking when it comes to the question of what should be in the shell.)

To address your questions on related design decisions that would have to be made for this to work, here's how I'd approach it for a pre-existing language like Bash with a long legacy:

1: For an interactive session, the character encoding of commands is taken from the locale.

2: For a script, the character encoding of commands must be explicitly specified, probably via a shell option. (Ideally I think it should be specified per file, but I don't know if Bash supports any kind of per-file shell options. This is so, for instance, a non-Unicode session that sources a Unicode shell script does not become a Unicode session, and a Unicode script that sources a non-Unicode script does not interpret that script as Unicode.)

3: If a script does not specify its character encoding, then the behavior is like current versions of Bash: multi-byte characters are supported in some contexts (quoted strings, command names, words) but not others (parameter names).

4: Sub-shell invocation, command/process substitution, etc. inherit the character encoding of the parent shell.

5: Enabling Unicode parsing of the script doesn't add Unicode whitespace characters to IFS - it affects how the script is interpreted, not how the parsed code operates.

6: If POSIX mode is active, I think all of this gets disabled and Unicode characters are disallowed in parameter names.

I'm not certain about normalization. I think the easy answer is to just say the shell doesn't normalize the Unicode it gets - which means scripts could contain instances of Unicode code points that should be equivalent (character composition, etc.) but aren't considered equal by the shell.

I think another angle worth considering is that most of what I outlined above is only important if you want to take advantage of it for other features. For instance:

- In order to support transcoding - to allow (for instance) a Latin-1 script to source a Unicode script and for equivalently-named entities in the two to connect.
- In order to support Unicode whitespace as part of the language syntax.

Without those additional requirements, the current status quo - which allows UTF-8 characters to appear in function names, command names, coprocess names, alias names, file paths, quoted or unquoted command arguments, associative array subscripts... almost everything but variable names for assignment or expansion - could simply be expanded to allow UTF-8 in parameter names as well. (The snippet below shows where that line falls today.)
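This is observable behavior in current releases (default, non-POSIX mode, UTF-8 locale), not part of the proposal:

    # Where multibyte characters already work - and where they don't.
    π() { echo "3.14159265358979324"; }   # multibyte function name: accepted
    π                                     # runs fine, prints the value

    declare -A const
    const[π]=3.14159          # multibyte associative-array subscript: accepted
    echo "${const[π]}"        # prints 3.14159

    π=3.14159    # multibyte parameter assignment: NOT accepted; bash parses
                 # this as a command and reports "π=3.14159: command not found"
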
I think it would be BETTER to have an implementation that explicitly establishes source-file encoding, provides transcoding in some cases, validates the character encoding, and supports Unicode whitespace, at least, as part of the language syntax... But we could also just keep things as they are, except allow bytes outside the ASCII range to be part of variable names in assignment and parameter expansion.

----- Original Message -----
From: "dualbus" <dual...@gmail.com>
To: "L A Walsh" <b...@tlinx.org>
Cc: "bug-bash" <bug-bash@gnu.org>
Sent: Thu, 1 Jun 2017 23:52:55 -0500
Subject: Re: RFE: Please allow unicode ID chars in identifiers

Please remember that there's always a *cost* associated with new features:

- Someone has to do the work of extending the current lexer / parser to be able to ingest multibyte character sequences. It helps a bunch if you do that work, and submit it for review.

- People then have to test the new implementation, to ensure that there are no regressions, and no new bugs introduced. I'm happy to volunteer once there's a working implementation.

- There are some questions that must be answered first:

  * How do you decode multibyte character sequences into Unicode? Should UTF-8 be assumed?
  * Will the parsing of a script depend upon the user locale?
  * Should this special parsing code be disabled if POSIX mode is enabled?
  * Right now `name' or `identifier' is defined as:

        name: A word consisting only of alphanumeric characters and
        underscores, and beginning with an alphabetic character or an
        underscore. Also referred to as an identifier.

    What will the definition look like with Unicode identifiers?

> Variable names like:
>
> Lēv -- (3 letter word pronounced like Leave), (most recent try...)

Use `Lev'.

> string constants:
> $Φ="0.618033988749894848"
> $ɸ="1.61803398874989485"
> $π="3.14159265358979324"
> $␉=$'\x09'
> $Ⅼ=50 $Ⅽ=100 $Ⅾ=500 $Ⅿ=1000
> $Ⅰ=1 $Ⅴ=5 $Ⅹ=10
> $㎏="kilogram"
> $㎆=1024*$㎅,
> etc...

What prevents you from using?

phi='...' pi='...' ht='...' kg='...'

I'm still not convinced there's an actual need here.

--
Eduardo Bustamante
https://dualbus.me/