It's all but impossible to establish an "absolute need" for a new feature. In the extreme case there is always the argument that one does not "absolutely need" a feature added to Bash, because one could simply use Bash to launch another tool that provides that feature - write in Python rather than in shell, etc. Very, very few "needs" qualify as "absolutes". But there are decisions that can make Bash a better programming language that are worth considering even if they are not "absolutely" needed - just as associative arrays and regular expressions were not "absolutely" needed, but are quite useful.
I agree that allowing Unicode in parameter names is problematic:

- There are characters that should be equivalent in principle, but aren't. For instance, the Greek letter pi (π) and the mathematical symbol pi (𝛑) may render the same in some fonts, but they are distinct code points. Some of these characters will look like Bash syntax, but be encoded differently.

- There are characters that are equivalent, but can be encoded multiple ways. For instance, 'é' may be encoded as $'\u00E9' or as $'e\u0301'. Broadly speaking this falls under the scope of "Unicode normalization" - a well-explored problem, but not a trivial one. (And it gets much worse with Asian languages, for instance.) A short demonstration appears below.

- Unicode introduces more whitespace characters. To allow Unicode glyphs in parameter names, one must also decide whether to interpret Unicode whitespace as whitespace (complicating parsing rules, word-splitting rules, etc.), or to treat only ASCII whitespace as whitespace (in which case Unicode whitespace can form part of a "word" without quoting - which could make such code visually confusing).

- As you pointed out, this requires the shell to somehow establish a convention governing the character set used to interpret shell scripts - a point on which the shell has so far been able to remain noncommittal (as long as the character set is ASCII-compatible).

I think the value is that it allows programmers to express themselves in their native language. The history of computing up to this point has produced a situation where most programming languages are oriented toward native English speakers and the Latin alphabet, and it would be very difficult to make a programming language truly multilingual. But I think it's a worthwhile goal, particularly in the shell (which can be used for scripting, but also for general, interactive operation of the computer), to accommodate foreign vocabulary in those contexts in which the user chooses to use it.

From an English speaker's perspective, allowing Unicode could let variable names distinguish between "resumé" and "resume" (we don't use accented characters much). For various European languages, accented characters (and other characters outside ASCII) can be necessary to make certain words clear: drop an accent mark from a word that normally uses it, and the word looks like something else. And for Asian languages like Japanese, ideogram characters can help to distinguish between homophones.

Personally I also feel it's worthwhile to support symbols as in Walsh's examples: for instance, the symbol "pi" has a very clear meaning. You can see it and know it doesn't mean "previous value of an iteration variable that was casually/sloppily named $i" or something else. Where that symbol appears, it almost always refers to the circle ratio.

Some of Walsh's other examples I think would just be needlessly confusing: for instance, using 'Ⅽ' (U+216D ROMAN NUMERAL ONE HUNDRED) as a variable name just results in a name that's hard to type and distinct from, but almost visually indistinguishable from, the Latin letter 'C'. I see cases like that as "enough rope to hang oneself with". You can use Unicode to do silly or confusing things, and it's not the responsibility of the shell to prevent someone from doing such things.
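Coming back to the normalization point above, here's the promised demonstration. Nothing hypothetical in it - this is current Bash behavior (it needs bash 4.2 or later for the $'\u...' escapes, and a UTF-8 locale):

    # Two encodings of "é" that render alike but are distinct to the shell.
    a=$'\u00E9'     # precomposed: U+00E9 LATIN SMALL LETTER E WITH ACUTE
    b=$'e\u0301'    # decomposed: 'e' + U+0301 COMBINING ACUTE ACCENT
    printf '%s / %s\n' "$a" "$b"                      # both display as "é"
    [[ $a == "$b" ]] && echo equal || echo not-equal  # prints "not-equal"
    echo "${#a} ${#b}"                                # prints "1 2"

If é were allowed in parameter names, the same would hold for two visually identical variable names - unless the shell normalized them, with all the complexity that entails.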
Greg raises a fair point: some platforms will simply be unable to view scripts written with Unicode symbols. Likewise, some platforms will be unable to view scripts written with features that are already present, such as Unicode string literals or Unicode command names. But, on the other hand:

- Even if your editor or terminal can't display the UTF-8 code, that doesn't mean the shell process can't RUN it.
- If your terminal can't display UTF-8... well, you can edit the code from a terminal that CAN.

So I don't think it's that big of a problem. Frankly, making shell scripts broadly compatible (let alone "universally" compatible) is very, very difficult. Can you believe someone wrote a shell script that used the colon character as part of a "tar" archive name? They clearly didn't have GNU tar in mind when they wrote that... This would be just another case of something useful but not universally compatible. (In other words, if the feature is added, it becomes one of those things you don't use if you want your script to be portable. Frankly, if it were added to Bash now, most of us would have to wait 5-10 years for today's version of Bash to become "the oldest version our project must still support" before we could use the feature. This is part of why I tend to be rather forward-looking when it comes to the question of what should be in the shell.)

To address your questions on related design decisions that would have to be made for this to work, here's how I'd approach it for a pre-existing language like Bash with a long legacy:

1: For an interactive session, the character encoding of commands is taken from the locale.

2: For a script, the character encoding of commands must be explicitly specified, probably via a shell option. (Ideally I think it should be specified per file, but I don't know if Bash supports any kind of per-file shell options. This is so, for instance, a non-Unicode session that sources a Unicode shell script does not become a Unicode session, and a Unicode script that sources a non-Unicode script does not interpret that script as Unicode.)

3: If a script does not specify its character encoding, then the behavior is like current versions of Bash: multi-byte characters are supported in some contexts (quoted strings, command names, words) but not others (parameter names).

4: Sub-shell invocation, command/process substitution, etc. inherit the character encoding of the parent shell.

5: Enabling Unicode parsing of the script doesn't add Unicode whitespace characters to IFS - it affects how the script is interpreted, not how the parsed code operates.

6: If POSIX mode is active, I think all of this gets disabled and Unicode characters are disallowed in parameter names.

I'm not certain about normalization. I think the easy answer is to just say the shell doesn't normalize the Unicode it gets - which means scripts could contain instances of Unicode code points that should be equivalent (character composition, etc.) but aren't considered equal by the shell.

I think another angle worth considering is that most of what I outlined above is only important if you want to take advantage of it for other features. For instance:

- In order to support transcoding - to allow (for instance) a Latin-1 script to source a Unicode script and for equivalently-named entities in the two to connect.
- In order to support Unicode whitespace as part of the language syntax.

Without those additional requirements, the current status quo - which allows UTF-8 characters to appear in function names, command names, coprocess names, alias names, file paths, quoted or unquoted command arguments, associative array subscripts... almost everything but variable names for assignment or expansion - could simply be expanded to allow UTF-8 in parameter names as well. (The snippet below shows where that line falls today.)
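This is observable behavior in current releases (default, non-POSIX mode, UTF-8 locale), not part of the proposal:

    # Where multibyte characters already work - and where they don't.
    π() { echo "3.14159265358979324"; }   # multibyte function name: accepted
    π                                     # runs fine, prints the value

    declare -A const
    const[π]=3.14159          # multibyte associative-array subscript: accepted
    echo "${const[π]}"        # prints 3.14159

    π=3.14159    # multibyte parameter assignment: NOT accepted; bash parses
                 # this as a command and reports "π=3.14159: command not found"
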
I think it would be BETTER to have an implementation that explicitly establishes source-file encoding, provides transcoding in some cases, validates the character encoding, and supports Unicode whitespace, at least, as part of the language syntax... But we could also just keep things as they are, except allow bytes outside the ASCII range to be part of variable names in assignment and parameter expansion.

----- Original Message -----
From: "dualbus" <dual...@gmail.com>
To: "L A Walsh" <b...@tlinx.org>
Cc: "bug-bash" <bug-bash@gnu.org>
Sent: Thu, 1 Jun 2017 23:52:55 -0500
Subject: Re: RFE: Please allow unicode ID chars in identifiers

Please remember that there's always a *cost* associated with new features:

- Someone has to do the work of extending the current lexer / parser to be able to ingest multibyte character sequences. It helps a bunch if you do that work, and submit it for review.

- People then have to test the new implementation, to ensure that there are no regressions, and no new bugs introduced. I'm happy to volunteer once there's a working implementation.

- There are some questions that must be answered first:

  * How do you decode multibyte character sequences into Unicode? Should UTF-8 be assumed?
  * Will the parsing of a script depend upon the user locale?
  * Should this special parsing code be disabled if POSIX mode is enabled?
  * Right now `name' or `identifier' is defined as:

        name: A word consisting only of alphanumeric characters and
        underscores, and beginning with an alphabetic character or an
        underscore. Also referred to as an identifier.

    What will the definition look like with Unicode identifiers?

> Variable names like:
>
> Lēv -- (3 letter word pronounced like Leave), (most recent try...)

Use `Lev'.

> string constants:
> $Φ="0.618033988749894848"
> $ɸ="1.61803398874989485"
> $π="3.14159265358979324"
> $␉=$'\x09'
> $Ⅼ=50 $Ⅽ=100 $Ⅾ=500 $Ⅿ=1000
> $Ⅰ=1 $Ⅴ=5 $Ⅹ=10
> $㎏="kilogram"
> $㎆=1024*$㎅,
> etc...

What prevents you from using?

phi='...' pi='...' ht='...' kg='...'

I'm still not convinced there's an actual need here.

--
Eduardo Bustamante
https://dualbus.me/