On 05/21/2012 01:51 PM, Linda Walsh wrote:
> POSIX is not supposed to be prescriptive -- but **descriptive**...
>
> I can't think of anywhere that a-z or A-Z would have included letters
> from the opposite case... so how did POSIX come to *prescribe* that this
> be the case... since I can't see that as being descriptive.
POSIX 1992 was the culprit that prescribed that [A-Z] must be in collation order across all locales, but without giving good guidance on how to write a collation sequence, and without defining a C function to easily get at that collation ordering. And remember, 20 years ago when POSIX 1992 was written, there was very little implementation experience with internationalization, compared to what has happened in the meantime (that was back when Unicode was brand new, and most users still had single-byte locales or used shift-lock encodings like Big5).

It is possible to write a locale definition where [A-Z] gives only upper-case letters while still providing case-insensitive sorting, but not all locale writers know how to do this. Even now in 2012, while most glibc locales have been corrected in this manner, there still exist several glibc locales that aren't written very well. The complication stems from the fact that your locale file becomes exponentially harder to write: instead of having a single upper and lower case rule, you have to have one rule per letter, with rules intermixed in a different order.

As soon as people started obeying POSIX 1992 to the letter, and realized that range expressions had unusual semantics as a result of the 1992 specification, POSIX 2001 quickly reverted things, but by then, the cat was out of the bag. POSIX 2001 had to continue to allow existing implementations, which it did by stating that range expressions in anything but the C locale are explicitly undefined.

There is currently a movement under way to introduce 'Rational Range Interpretation' (RRI), where [A-Z] means the 26 uppercase letters across ALL locales, by omitting all accented letters and ignoring collation ordering. Since POSIX 2001 and later allow this behavior, it is gaining traction: already, GNU sed, GNU grep, and GNU awk have had patches applied or under consideration to introduce this consistent behavior. Search those mailing list archives if you want more details.
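To make the difference concrete, here is a minimal sketch of the two readings of [A-Z]. It does not use any real locale: the interleaved sequence aAbB...zZ is a made-up stand-in for a collation order that mixes cases, and plain code-point comparison is assumed as a stand-in for RRI on ASCII letters. The names COLLATION, RANK, in_range_collation and in_range_rri are hypothetical and exist only for this illustration.

```python
import string

# Hypothetical collation sequence that interleaves cases: a A b B ... z Z.
# Real locale definitions are more elaborate; this is only an illustration.
COLLATION = "".join(
    lower + upper
    for lower, upper in zip(string.ascii_lowercase, string.ascii_uppercase)
)
RANK = {ch: i for i, ch in enumerate(COLLATION)}


def in_range_collation(ch, start, end):
    """Range membership by position in the collation sequence (1992 reading)."""
    if ch not in RANK:
        return False
    return RANK[start] <= RANK[ch] <= RANK[end]


def in_range_rri(ch, start, end):
    """Range membership by code point (assumed stand-in for RRI)."""
    return ord(start) <= ord(ch) <= ord(end)


letters = string.ascii_letters
collation_matches = "".join(
    c for c in letters if in_range_collation(c, "A", "Z")
)
rri_matches = "".join(c for c in letters if in_range_rri(c, "A", "Z"))

print("collation order:", collation_matches)
print("RRI:            ", rri_matches)
```

With this made-up sequence, the collation reading of [A-Z] picks up every lowercase letter except 'a', while the code-point reading yields exactly the 26 uppercase letters. Actual results in a given locale depend on how that locale's collation sequence was written.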
Gnulib has already had patches as part of this movement, and GNU coreutils and bash should be picking up on these improvements in a future version; we also hope to get glibc to agree to them.

In other words, we recognize that this is an issue, and eventually, we _do_ want to reach the point where all GNU tools use RRI, since POSIX 2001 already allows RRI as part of its recognition that the decision made in POSIX 1992 causes pain when coupled with poorly-written locale definitions.

For example, here is an RRI patch for gnulib:
https://lists.gnu.org/archive/html/bug-gnulib/2012-04/msg00185.html

--
Eric Blake   ebl...@redhat.com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org