On 6/8/11 5:45 PM, Marcel (Felix) Giannelia wrote:
> On 07/06/11 13:45, Chet Ramey wrote:
>> [...]
>> I'm not going to add much to this discussion except to note that I believe
>> `sorts' is correct. Consider the following script:
>>
>> unset LANG LC_ALL LC_COLLATE
>>
>> export LC_COLLATE=de_DE.UTF-8
>> printf "%s\n" {A..Z} {a..z} | sort | tr $'\n' ' '
>> echo
> That's really interesting -- and not just your intended point, but what
> happens with those ranges if you take 'sort' out of the pipe. The curly
> brace {A..Z} syntax doesn't obey the locale! Observe:
No, it doesn't. It's not part of any standard, and it's not part of pattern
matching, so I implemented it with the traditional C semantics because that
seemed the most straightforward. I'd also argue that it's not really feasible
to implement it any other way, since there's no standard way to enumerate a
collating sequence from C using Posix interfaces.

> $ printf "%s\n" {A..Z} {a..z} | sort | tr $'\n' ' '
> a A b B c C d D e E f F g G h H i I j J k K l L m M n N o O p P q Q r R s S
> t T u U v V w W x X y Y z Z
>
> (as you expect, but...)
>
> $ printf "%s\n" {A..Z} {a..z} | tr $'\n' ' '
> A B C D E F G H I J K L M N O P Q R S T U V W X Y Z a b c d e f g h i j k l
> m n o p q r s t u v w x y z
>
> So, if I want C-like behaviour out of "[a-z]*", I can write it as
> "{a..z}*"? Is that a bug or a feature?

Neither. They are two different features.

>> [...]
>>
>> That sure looks like `C' doesn't sort between `a' and `c' in de_DE.UTF-8
>> and en_GB.UTF-8.
> Not in a case like that, with single-character strings. But my point was
> that it's possible for 'C' to sort between 'a' and 'c' in longer strings.

True. However, a bracket expression matches only a single character.

> I realize it's pedantic, but documentation should be pedantically accurate
> :) I would be OK with changing the man page so it says, "sorts between
> those two characters in a list of single-character strings", as that would
> also describe the current behaviour.

But it only matches a single character, by definition. It should not be
necessary to specify the list of single-character strings part.

> "Within a bracket expression, a range expression consists of two characters
> separated by a hyphen. It matches any single character that sorts between
> the two characters, inclusive, using the locale's collating sequence and
> character set. For example, in the default C locale, [a-d] is equivalent to
> [abcd].
> Many locales sort characters in dictionary order, and in these
> locales [a-d] is typically not equivalent to [abcd]; it might be equivalent
> to [aBbCcDd], for example. To obtain the traditional interpretation of
> bracket expressions, you can use the C locale by setting the LC_ALL
> environment variable to the value C."

The bash texinfo documentation says just about the same thing.

-- 
``The lyf so short, the craft so long to lerne.'' - Chaucer
``Ars longa, vita brevis'' - Hippocrates
Chet Ramey, ITS, CWRU    c...@case.edu    http://cnswww.cns.cwru.edu/~chet/
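
A minimal sketch of the two features contrasted in this thread, assuming bash.
It is not taken from the original messages. The array name `letters` and the
helper `matches_in_c_locale` are made up for illustration. Only the C locale is
used for the bracket expression, because what [a-d] matches in other locales
depends on the locale data installed on the system.

```bash
#!/usr/bin/env bash

# Brace expansion: generated with C semantics (character-code order),
# so the result does not depend on the locale.
letters=( {A..E} {a..e} )
echo "brace expansion: ${letters[*]}"

# Bracket range in a pattern: matched using the locale's collating sequence.
# Setting LC_ALL=C inside a subshell gives the traditional interpretation,
# in which [a-d] is equivalent to [abcd].
matches_in_c_locale() (
    LC_ALL=C
    matched=()
    for ch in a B b C c D d; do
        case $ch in
            [a-d]) matched+=( "$ch" ) ;;
        esac
    done
    echo "${matched[*]}"
)
echo "[a-d] in the C locale matches: $(matches_in_c_locale)"
```

In a locale that sorts in dictionary order, the same `case` pattern may also
accept some of the uppercase letters, which is the behaviour the quoted
documentation describes.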