The 'stringi' package claims robust cross-platform performance. It exposes much of the functionality of the ICU library and will attempt to install ICU when it is not present. The function 'stri_sort' accepts collation options through its 'opts_collator' argument, which can be built with 'stri_opts_collator'.
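As a rough sketch of what that could look like (assuming 'stringi' is installed; the sample vector and the choice of the "en_US" locale are only illustrative, and results may still vary with the ICU version in use):

```r
# Sketch only: assumes the 'stringi' package is installed.
library(stringi)

# Hypothetical sample data: mixed-case ASCII identifiers.
x <- c("b", "A", "a", "B", "_x", "Z")

# Build an explicit collator so the ordering is requested from ICU
# rather than taken from the session's LC_COLLATE setting.
# "en_US" is an illustrative choice, not a recommendation.
coll <- stri_opts_collator(locale = "en_US", strength = 3L)

# Sort the strings using that collator.
sorted <- stri_sort(x, opts_collator = coll)
print(sorted)

# The same options can be used to get an ordering permutation,
# e.g. for reordering the rows of a data frame.
ord <- stri_order(x, opts_collator = coll)
print(x[ord])
```

Passing the collator explicitly on every call avoids touching the session locale with Sys.setlocale, which is the step the Note in ?Sys.setlocale warns about.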

On Sun, Nov 23, 2014 at 5:15 PM, Martin Morgan <mtmor...@fredhutch.org> wrote:

> For many scientific applications one is really dealing with ASCII
> characters and LC_COLLATE="C", even if the user is running in non-C
> locales. What robust approaches (if any?) are available to write code that
> sorts in a locale-independent way? The Note in ?Sys.setlocale is not overly
> optimistic about setting the locale within a session.
>
> Martin Morgan
>
> On 11/23/2014 03:44 AM, Prof Brian Ripley wrote:
>
>> On 23/11/2014 09:39, peter dalgaard wrote:
>>
>>> On 23 Nov 2014, at 01:05 , Henrik Bengtsson <h...@biostat.ucsf.edu> wrote:
>>>
>>>> On Sat, Nov 22, 2014 at 12:42 PM, Duncan Murdoch
>>>> <murdoch.dun...@gmail.com> wrote:
>>>>
>>>>> On 22/11/2014, 2:59 PM, Stuart Ambler wrote:
>>>>>
>>>>>> A colleague's R program behaved differently when I ran it, and we
>>>>>> thought we traced it probably to different results from string
>>>>>> comparisons as below, with different R versions. However the
>>>>>> platforms also differed. A friend ran it on a few machines and found
>>>>>> that the comparison behavior didn't correlate with R version, but
>>>>>> rather with platform.
>>>>>>
>>>>>> I wonder if you've seen this. If it's not some setting I'm unaware
>>>>>> of, maybe someone should look into it. Sorry I haven't taken the
>>>>>> time to read the source code myself.
>>>>>
>>>>> Looks like a collation order issue. See ?Comparison.
>>>>
>>>> With the oddity that both platforms use what look like similar locales:
>>>>
>>>> LC_COLLATE=en_US.UTF-8
>>>> LC_COLLATE=en_US.utf8
>>>
>>> It's the sort of thing that I've tried to wrap my mind around multiple
>>> times and failed, but have a look at
>>>
>>> http://stackoverflow.com/questions/19967555/postgres-collation-differences-osx-v-ubuntu
>>>
>>> which seems to be essentially the same issue, just for Postgres. If you
>>> have the stamina, also look into the python question that it links to.
>>>
>>> As I understand it, there are two potential reasons: Either the two
>>> platforms are not using the same collation table for en_US, or at least
>>> one of them is not fully implementing the Unicode Collation Algorithm.
>>
>> And I have seen both with R. At the very least, check if ICU is being
>> used (capabilities("ICU") in current R, maybe not in some of the obsolete
>> versions seen in this thread).
>>
>> As a further possibility, there are choices in the UCA (in R, see
>> ?icuSetCollate) and ICU can be compiled with different default choices.
>> It is not clear to me what (if any) difference ICU versions make, but in
>> R-devel extSoftVersion() reports that.
>>
>>> In general, collation is a minefield: Some languages have the same
>>> letters in different order (e.g. Estonian with Z between S and T);
>>> accented characters sort with the unaccented counterpart in some
>>> languages but as separate characters in others; some locales sort ABab,
>>> others AaBb, yet others aAbB; sometimes punctuation is ignored,
>>> sometimes not; sometimes multiple characters count as one, etc.
>>
>> As ?Comparison has long said.
>
> --
> Computational Biology / Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N.
> PO Box 19024 Seattle, WA 98109
>
> Location: Arnold Building M1 B861
> Phone: (206) 667-2793
>
> ______________________________________________
> R-devel@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel