On 6/2/11 9:12 PM, Marcel (Felix) Giannelia wrote:
> Hello,
>
> I realize the issue of character range expressions not working as expected
> (because of locale settings) has been done to death, but I thought I should
> point this out.
>
> The bash man page says:
>
> "A pair of characters separated by a hyphen denotes a range expression; any
> character that ***sorts between those two characters,*** inclusive, using
> the current locale's collating sequence and character set, is matched."
> (emphasis mine)
>
> That is incorrect because, for instance, an uppercase 'C' sorts between
> lowercase 'a' and lowercase 'c' (sometimes), as in this example (locale is
> en_GB.UTF-8):

I'm not going to add much to this discussion except to note that I believe
`sorts' is correct.  Consider the following script:

unset LANG LC_ALL LC_COLLATE

export LC_COLLATE=de_DE.UTF-8
printf "%s\n" {A..Z} {a..z} | sort | tr $'\n' ' '
echo

export LC_COLLATE=en_GB.UTF-8
printf "%s\n" {A..Z} {a..z} | sort | tr $'\n' ' '
echo

export LC_COLLATE=C
printf "%s\n" {A..Z} {a..z} | sort | tr $'\n' ' '
echo

It uses the system `sort' to decide how things sort according to the
locale.  When I run it on a random Linux system, RHEL5 in this case, I get

a A b B c C d D e E f F g G h H i I j J k K l L m M n N o O p P q Q r R s S t T u U v V w W x X y Y z Z
a A b B c C d D e E f F g G h H i I j J k K l L m M n N o O p P q Q r R s S t T u U v V w W x X y Y z Z
A B C D E F G H I J K L M N O P Q R S T U V W X Y Z a b c d e f g h i j k l m n o p q r s t u v w x y z

That sure looks like `C' doesn't sort between `a' and `c' in de_DE.UTF-8
and en_GB.UTF-8.

> I believe it would also be helpful for the documentation to then go on to
> say something like this:
>
> "This means that character ranges are neither case-sensitive nor
> case-insensitive in most locales. For instance (in the en_ locales), the
> range [a-c] is equivalent to [aAbBc] (note the absence of uppercase 'C'!).
> Thus, sub-ranges of the character class [[:alpha:]] must be used with great
> care, and probably should not be used at all, in locales other than C. It
> is not possible, for example, to specify a range of greater than one or
> fewer than 26 lowercase letters in the en_US.UTF-8 locale. If you desire to
> match [abcdefghij] in this locale, you must not use a range, but specify
> all of those characters explicitly, or use LC_COLLATE from the C locale."

You might like the text in item 13 of the COMPAT file included in the bash
distribution.  It doesn't take quite so cautionary a tone, but the basic
information is there.
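
The script above asks `sort' how letters collate.  A complementary check,
offered only as a minimal sketch, is to ask bash itself which letters a
range pattern matches under a given locale.  The helper name
`range_members' is hypothetical and not part of bash.  Apart from the C
locale, the result depends on which locales are installed, on the bash
version, and on whether the `globasciiranges' shell option is enabled, so
no output is claimed for en_GB.UTF-8.

```shell
#!/usr/bin/env bash
# range_members LOCALE PATTERN   (hypothetical helper, for illustration only)
# Prints the ASCII letters that PATTERN matches when LC_ALL is LOCALE.
range_members() {
    (
        # Subshell, so the locale change does not leak into the caller.
        export LC_ALL=$1
        matched=
        for ch in {a..z} {A..Z}; do
            case $ch in
                # Unquoted on purpose: $2 is used as a shell pattern.
                $2) matched+="$ch " ;;
            esac
        done
        # Drop the trailing space before printing.
        printf '%s\n' "${matched% }"
    )
}

range_members C '[a-c]'            # prints: a b c
range_members en_GB.UTF-8 '[a-c]'  # result varies by system and bash version
```

Comparing the second line of output against the `sort' output for the same
locale shows whether pattern matching follows the collating sequence on a
particular system.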

Chet
-- 
``The lyf so short, the craft so long to lerne.'' - Chaucer
``Ars longa, vita brevis'' - Hippocrates
Chet Ramey, ITS, CWRU    c...@case.edu    http://cnswww.cns.cwru.edu/~chet/