Re: [Rd] collation order

Peter Dalgaard Fri, 17 Mar 2006 14:56:51 -0800

Thomas Lumley <[EMAIL PROTECTED]> writes:

> The following caused a hard-to-diagnose problem for a user of the survey 
> package.  Presumably this is a strange Unicode thing, but is there a 
> convenient reference for how the collation order is determined? I am 
> surprised that adding the same character to the end of two strings of the 
> same length can change the sorting order.
> 
> in en_US.utf8 locale
> > "1//"<"10/"
> [1] TRUE
> > "1//2"<"10/2"
> [1] FALSE
> 
> in C locale on the same system.
> > "1//"<"10/"
> [1] TRUE
> > "1//2"<"10/2"
> [1] TRUE
> 
> [This is in r-devel of March 6, but the problem that was reported to me 
> involved Windows vs Linux on released versions]


Unicode has nothing to do with it (same thing in ISO-8859-1). It is
(I think) about characters being skipped during collating, i.e. the
same effect as this:

> Sys.setlocale(locale="C")
[1] "C"
> "Thomas  O'Malley" < "Thomas Lumley"
[1] TRUE
> Sys.setlocale(locale="en_US.UTF8")
[1] "LC_CTYPE=en_US.UTF8;LC_NUMERIC=C;LC_TIME=en_US.UTF8;LC_COLLATE=en_US.UTF8;LC_MONETARY=en_US.UTF8;LC_MESSAGES=C;LC_PAPER=C;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=C;LC_IDENTIFICATION=C"
> "Thomas  O'Malley" < "Thomas Lumley"
[1] FALSE
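
A minimal sketch for repeating this check on a given system without
leaving the session in a changed locale. The helper name lt_in_locale
and the locale name "en_US.UTF-8" are illustrative assumptions, and the
sketch relies on Sys.setlocale() returning "" (with a warning) when the
requested locale is not installed. Only the C-locale results are fixed,
since C collation compares byte values; the other result depends on the
platform's collation tables.

```r
# Compare a < b under a chosen LC_COLLATE, then restore the old setting.
# `lt_in_locale` is a hypothetical helper, not part of base R.
lt_in_locale <- function(a, b, loc) {
  old <- Sys.getlocale("LC_COLLATE")
  on.exit(Sys.setlocale("LC_COLLATE", old), add = TRUE)
  set <- suppressWarnings(Sys.setlocale("LC_COLLATE", loc))
  if (identical(set, "")) {
    return(NA)  # requested locale is not available on this system
  }
  a < b
}

# C locale: byte order, and "/" (47) sorts before "0" (48).
lt_in_locale("1//",  "10/",  "C")   # TRUE
lt_in_locale("1//2", "10/2", "C")   # TRUE

# C locale: the second space (32) sorts before "L" (76).
lt_in_locale("Thomas  O'Malley", "Thomas Lumley", "C")   # TRUE

# Locale-aware collation: platform dependent, NA if the locale is missing.
lt_in_locale("1//2", "10/2", "en_US.UTF-8")
```

If code has to give the same order on every platform, one option is to
set LC_COLLATE to "C" around the comparison, as the helper does.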


> 
>       -thomas
> 
> Thomas Lumley                 Assoc. Professor, Biostatistics
> [EMAIL PROTECTED]     University of Washington, Seattle
> 
> ______________________________________________
> R-devel@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
> 

-- 
   O__  ---- Peter Dalgaard             Øster Farimagsgade 5, Entr.B
  c/ /'_ --- Dept. of Biostatistics     PO Box 2099, 1014 Cph. K
 (*) \(*) -- University of Copenhagen   Denmark          Ph:  (+45) 35327918
~~~~~~~~~~ - ([EMAIL PROTECTED])                  FAX: (+45) 35327907

______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
