Thomas Lumley <[EMAIL PROTECTED]> writes:

> The following caused a hard-to-diagnose problem for a user of the survey
> package. Presumably this is a strange Unicode thing, but is there a
> convenient reference for how the collation order is determined? I am
> surprised that adding the same character to the end of two strings of the
> same length can change the sorting order.
>
> in en_US.utf8 locale
> > "1//"<"10/"
> [1] TRUE
> > "1//2"<"10/2"
> [1] FALSE
>
> in C locale on the same system.
> > "1//"<"10/"
> [1] TRUE
> > "1//2"<"10/2"
> [1] TRUE
>
> [This is in r-devel of March 6, but the problem that was reported to me
> involved Windows vs Linux on released versions]
Unicode has nothing to do with it (same thing in ISO-8859-1). It is (I
think) about characters being skipped during collating, i.e. the same
effect as this:

> Sys.setlocale(locale="C")
[1] "C"
> "Thomas O'Malley" < "Thomas Lumley"
[1] TRUE
> Sys.setlocale(locale="en_US.UTF8")
[1] "LC_CTYPE=en_US.UTF8;LC_NUMERIC=C;LC_TIME=en_US.UTF8;LC_COLLATE=en_US.UTF8;LC_MONETARY=en_US.UTF8;LC_MESSAGES=C;LC_PAPER=C;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=C;LC_IDENTIFICATION=C"
> "Thomas O'Malley" < "Thomas Lumley"
[1] FALSE

> -thomas
>
> Thomas Lumley                   Assoc. Professor, Biostatistics
> [EMAIL PROTECTED]               University of Washington, Seattle
>
> ______________________________________________
> R-devel@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
>

--
   O__  ---- Peter Dalgaard             Øster Farimagsgade 5, Entr.B
  c/ /'_ --- Dept. of Biostatistics     PO Box 2099, 1014 Cph. K
 (*) \(*) -- University of Copenhagen   Denmark          Ph:  (+45) 35327918
~~~~~~~~~~ - ([EMAIL PROTECTED])                  FAX: (+45) 35327907
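
The "characters being skipped" explanation above can be mimicked without relying on any particular locale being installed. This is only a rough sketch, assuming (as the reply itself merely suspects) that punctuation is ignored on the first comparison pass. The helpers strip_punct and first_pass_less are hypothetical names made up for illustration; they are not part of R or of any locale implementation, and real collation tables are more elaborate.

```r
# Hypothetical helpers for illustration only; real locale collation
# (for example the en_US tables) is more elaborate than this toy model.
strip_punct <- function(x) gsub("[[:punct:]]", "", x)

# Compare two strings as if punctuation were invisible on the first pass.
first_pass_less <- function(a, b) strip_punct(a) < strip_punct(b)

# Pin collation to "C" so that the comparisons below are by byte value
# and do not depend on the locale the session happens to run in.
old <- Sys.getlocale("LC_COLLATE")
invisible(Sys.setlocale("LC_COLLATE", "C"))

c_short  <- "1//"  < "10/"                    # plain C-locale comparison
c_long   <- "1//2" < "10/2"                   # plain C-locale comparison
fp_short <- first_pass_less("1//",  "10/")    # compares "1"  with "10"
fp_long  <- first_pass_less("1//2", "10/2")   # compares "12" with "102"
fp_names <- first_pass_less("Thomas O'Malley", "Thomas Lumley")

# Restore whatever collation setting was active before.
invisible(Sys.setlocale("LC_COLLATE", old))

print(c(c_short = c_short, c_long = c_long,
        fp_short = fp_short, fp_long = fp_long, fp_names = fp_names))
```

Under this toy model the two C-locale comparisons are both TRUE, while the punctuation-skipping comparison is TRUE for "1//" vs "10/" and FALSE for "1//2" vs "10/2". Once the slashes are dropped, the appended "2" is compared against the "0" of "10", which is how adding the same character to both strings can flip the order.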