On 28-May-10 14:37:39, Duncan Murdoch wrote:
> On 28/05/2010 9:24 AM, (Ted Harding) wrote:
>> An experiment:
>>
>> sort(c("AACD","A CD"))
>> # [1] "AACD" "A CD"
>>
>> sort(c("ABCD","A CD"))
>> # [1] "ABCD" "A CD"
>>
>> sort(c("ACCD","A CD"))
>> # [1] "ACCD" "A CD"
>>
>> sort(c("ADCD","A CD"))
>> # [1] "A CD" "ADCD"
>>
>> sort(c("AECD","A CD"))
>> # [1] "A CD" "AECD"
>> ## (with results for "AFCD", ... "AZCD" similar to the last two).
>>
>> LC_COLLATE=en_GB.UTF-8
>>
>> (R version 2.11.0 (2010-04-22) on Linux).
>>
>> So this behaves, in en_GB.UTF-8, as though " " (SPACE) is between
>> "C" and "D".
>>
>> This is nuts!!!
>>
>> Curable if I set (e.g.) LC_COLLATE="C" on startup. But what else
>> might break if I do so?
>>
>
> You have to realize that to a large extent this is not under our
> control. Your system will have linked to some library (outside of R)
> to do string collation, and the problem lies in that library. You
> should determine which system library is handling your collations.
>
> I'd like to tell you how to do that, but I don't know for your build.
> You can find out if you're using the recommended ICU library by
> running example(icuSetCollate); that gives a number of warnings like
>
> In icuSetCollate(locale = "da_DK", case_first = "default") :
> ICU is not supported on this build
>
> in Windows. If you don't see those, then you want to talk to the ICU
> people. If you do, then you'll need to look deeper to find out what
> you're actually using.
>
> Duncan Murdoch

Thanks for the further guidance, Duncan. I indeed get 4 such warnings
from example(icuSetCollate), indicating that ICU is not being used.

I have now thrown the above experiment straight at Linux, entering
command-line commands as follows (with the results shown on the lines
starting with "#"):

sort << EOT
"AACD"
"A CD"
EOT
# "AACD"
# "A CD"

sort << EOT
"ABCD"
"A CD"
EOT
# "ABCD"
# "A CD"

sort << EOT
"ACCD"
"A CD"
EOT
# "ACCD"
# "A CD"

sort << EOT
"ADCD"
"A CD"
EOT
# "A CD"
# "ADCD"

This clearly shows that the Linux collating order sees " " (SPACE)
as coming between "C" and "D", as when I tried it in R.

I am now spamming my Linux contacts about it!

The result of the "locale" command in Linux includes:

LC_COLLATE="en_GB.UTF-8"

This happens consistently on a Debian Lenny and a Debian Etch system.

Thanks,
Ted.

--------------------------------------------------------------------
E-Mail: (Ted Harding) <ted.hard...@manchester.ac.uk>
Fax-to-email: +44 (0)870 094 0861
Date: 28-May-10                                       Time: 21:14:54
------------------------------ XFMail ------------------------------
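
For comparison, the same pairs can be run through sort with the collation
forced to "C" for that one command only, which is a narrower form of the
cure mentioned in the quoted message. What follows is a minimal sketch,
not a recommendation: the helper name c_sort_pair is made up for
illustration, and it assumes a POSIX shell with the standard sort and
paste utilities.

```shell
#!/bin/sh
# Hypothetical helper (name is illustrative only): sort two strings under
# the byte-wise "C" collation and print them comma-separated on one line.
# LC_ALL=C applies to this single sort invocation; the rest of the
# session keeps its en_GB.UTF-8 (or other) settings.
c_sort_pair() {
    printf '%s\n' "$1" "$2" | LC_ALL=C sort | paste -sd, -
}

# Same pairs as in the experiment above. In the C locale SPACE (0x20)
# sorts before every letter, so "A CD" is listed first each time.
for other in AACD ABCD ACCD ADCD AECD; do
    c_sort_pair "$other" "A CD"
done
```

Inside an R session the analogous call would be
Sys.setlocale("LC_COLLATE", "C"), which changes only the collation
category and leaves the other locale categories as they were. Whether
that is acceptable depends on what else in the session relies on
locale-aware ordering.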