Re: [Rd] R string comparisons may vary with platform (plain text)

Martin Morgan Sun, 23 Nov 2014 08:18:26 -0800

For many scientific applications one is really dealing with ASCII characters and LC_COLLATE="C", even if the user is running in non-C locales. What robust approaches (if any?) are available to write code that sorts in a locale-independent way? The Note in ?Sys.setlocale is not overly optimistic about setting the locale within a session.
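
A minimal sketch of two base-R options, assuming ASCII input. The helper name with_c_collate is hypothetical and not part of R. sort(method = "radix") exists only in R 3.3.0 and later, which postdates this thread; it is documented to order character vectors as in the C locale whatever the session locale is. The second option switches LC_COLLATE for a single call and restores it afterwards, so it remains subject to the caveats in the Note in ?Sys.setlocale.

```r
x <- c("b", "A", "a", "B")

# Option 1 (R >= 3.3.0): radix sort uses C-locale ordering for character
# vectors, independent of the session's LC_COLLATE.
sort(x, method = "radix")
# [1] "A" "B" "a" "b"

# Option 2: hypothetical helper that evaluates an expression with
# LC_COLLATE temporarily set to "C" and restores the old value on exit.
with_c_collate <- function(expr) {
  old <- Sys.getlocale("LC_COLLATE")
  on.exit(Sys.setlocale("LC_COLLATE", old), add = TRUE)
  Sys.setlocale("LC_COLLATE", "C")
  force(expr)  # 'expr' is a promise, so it is evaluated only now
}

with_c_collate(sort(x))     # same order as the radix sort above
with_c_collate("B" < "a")   # TRUE: upper case precedes lower case in "C"
```

Option 1 avoids touching the locale at all, which may be preferable in package code; option 2 also covers comparison operators such as < and functions like order() that take their collation from the session.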


Martin Morgan

On 11/23/2014 03:44 AM, Prof Brian Ripley wrote:

On 23/11/2014 09:39, peter dalgaard wrote:

On 23 Nov 2014, at 01:05 , Henrik Bengtsson <h...@biostat.ucsf.edu> wrote:

On Sat, Nov 22, 2014 at 12:42 PM, Duncan Murdoch
<murdoch.dun...@gmail.com> wrote:

On 22/11/2014, 2:59 PM, Stuart Ambler wrote:

A colleague's R program behaved differently when I ran it, and we thought
we traced it probably to different results from string comparisons as
below, with different R versions.  However the platforms also differed.  A
friend ran it on a few machines and found that the comparison behavior
didn't correlate with R version, but rather with platform.

I wonder if you've seen this.  If it's not some setting I'm unaware of,
maybe someone should look into it.  Sorry I haven't taken the time to read
the source code myself.


Looks like a collation order issue.  See ?Comparison.


With the oddity that both platforms use what look like similar locales:

LC_COLLATE=en_US.UTF-8
LC_COLLATE=en_US.utf8


It's the sort of thing that I've tried to wrap my mind around multiple times
and failed, but have a look at

http://stackoverflow.com/questions/19967555/postgres-collation-differences-osx-v-ubuntu


which seems to be essentially the same issue, just for Postgres. If you have
the stamina, also look into the python question that it links to.

As I understand it, there are two potential reasons: Either the two platforms
are not using the same collation table for en_US, or at least one of them is
not fully implementing the Unicode Collation Algorithm.


And I have seen both with R.  At the very least, check if ICU is being used
(capabilities("ICU") in current R, maybe not in some of the obsolete versions
seen in this thread).

As a further possibility, there are choices in the UCA (in R, see
?icuSetCollate) and ICU can be compiled with different default choices.  It is
not clear to me what (if any) difference ICU versions make, but in R-devel
extSoftVersion() reports that.
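
The checks mentioned above can be sketched as follows. This is only a diagnostic sketch: the variable names are illustrative, extSoftVersion() is assumed to be available (it is guarded here because older R versions lack it), and the effect of icuSetCollate() depends on the build and on the current locale, since ICU is not used for collation in the C or POSIX locales.

```r
# Collect the settings that can make string comparisons differ by platform.
has_icu <- unname(capabilities("ICU"))
collation_info <- list(
  lc_collate  = Sys.getlocale("LC_COLLATE"),
  icu         = has_icu,
  # ICU version as reported by R; empty string if ICU is not in use
  icu_version = if (exists("extSoftVersion", mode = "function"))
    extSoftVersion()[["ICU"]] else NA_character_
)
str(collation_info)

# Where ICU is in use, one of the UCA choices can be changed and the
# resulting order compared (results vary with locale and build).
x <- c("Aarhus", "aarhus", "safe", "test", "Zoo")
if (has_icu) {
  print(sort(x))
  icuSetCollate(case_first = "upper")
  print(sort(x))
  icuSetCollate(case_first = "default")  # restore the default choice
}
```

Comparing the output of str(collation_info) across the machines in question is a cheap first step before looking at the comparisons themselves.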

In general, collation is a minefield: Some languages have the same letters in
different order (e.g. Estonian with Z between S and T); accented characters
sort with the unaccented counterpart in some languages but as separate
characters in others; some locales sort ABab, others AaBb, yet others aAbB;
sometimes punctuation is ignored, sometimes not; sometimes multiple characters
count as one, etc.

As ?Comparison has long said.



--
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M1 B861
Phone: (206) 667-2793

______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
