Re: [R] Problem comparing two strings

peter dalgaard Mon, 18 Nov 2019 07:48:52 -0800

A version of this came up not long ago in a slightly different context (bug 
17369: parse() doesn't honor unicode in NFD normalization).


The basic issue is that Unicode has several normalization forms; the two 
that matter here are NFC (composed) and NFD (decomposed).

Briefly, accented characters exist in two forms, one as a single code point and 
another as the base letter followed by the accent. 

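I.e., there is the single letter "ä" (U+00E4) and then "a\u0308", which is 
"a" followed by a "combining diaeresis" (U+0308), which effectively puts a ¨ 
on top of the preceding character. In UTF-8 the first is the two bytes c3 a4 
and the second is the three bytes 61 cc 88, which is exactly the difference 
between your string1 and string2.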
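
As a minimal sketch of the two forms (base R only; the variable names 
"composed" and "decomposed" are made up for illustration), something like 
this should reproduce the difference:

```r
# Precomposed form: a single code point, U+00E4.
composed <- "\u00e4"
# Decomposed form: "a" followed by U+0308 COMBINING DIAERESIS.
decomposed <- "a\u0308"

# The two print alike but are different code point sequences.
composed == decomposed    # FALSE

# Code points as integers.
utf8ToInt(composed)       # 228
utf8ToInt(decomposed)     # 97 776

# Underlying UTF-8 bytes.
charToRaw(composed)       # c3 a4
charToRaw(decomposed)     # 61 cc 88
```

The byte values are the ones that appear around the "J" in the two strings 
quoted below.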

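The utf8 package has code for normalizing strings (utf8_normalize(), which 
converts to NFC), so normalizing both sets of names before comparing or 
joining them should make them match.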
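
A rough sketch of how that might look, assuming the utf8 package is installed 
from CRAN; string1 and string2 are rebuilt here from escapes to mimic the two 
data sources, so this is an illustration and not your actual data:

```r
# Assumption: the CRAN package 'utf8' is installed, e.g. via
# install.packages("utf8").

# Precomposed "ä", as in the csv file.
string1 <- "Alexander J\u00e4ger"
# Decomposed "a" + combining diaeresis, as in the macOS file names.
string2 <- "Alexander Ja\u0308ger"

# The raw strings differ.
string1 == string2

# Bring both to the same normalization form, then compare.
norm1 <- utf8::utf8_normalize(string1)
norm2 <- utf8::utf8_normalize(string2)
norm1 == norm2
```

For the join, the same call would be applied to the name column of both data 
sets before merging.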

-pd

> On 18 Nov 2019, at 16:11 , Björn Fisseler <bjoern.fisse...@googlemail.com> 
> wrote:
> 
> Hello,
> 
> I'm struggling to compare two strings that come from different data 
> sets. The strings look identical: "Alexander Jäger"
> 
> But when I compare these strings: string1 == string2
> the result is FALSE.
> 
> Looking at the raw bytes used to encode the strings, the results are 
> different:
> 
> string1: 41 6c 65 78 61 6e 64 65 72 20 4a c3 a4 67 65 72
> string2: 41 6c 65 78 61 6e 64 65 72 20 4a 61 cc 88 67 65 72
> 
> string2 comes from the file names of different files on my machine 
> (macOS), string1 comes from a data file (csv, UTF8 encoding).
> 
> It's obviously the umlaut "ä" in this example, which is encoded with two 
> bytes in string1 and three bytes in string2. How can I change this? This 
> problem makes it impossible to join the two data sets based on the 
> names. I already checked the settings on my machine: Sys.getlocale() 
> returns "de_DE.UTF-8/de_DE.UTF-8/de_DE.UTF-8/C/de_DE.UTF-8/de_DE.UTF-8". 
> Changing/forcing the encoding of the data didn't bring the results I 
> expected.
> 
> What else can I try?
> 
> Best regards
> 
>         Björn
> 
> ______________________________________________
> R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

-- 
Peter Dalgaard, Professor,
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45)38153501
Office: A 4.23
Email: pd....@cbs.dk  Priv: pda...@gmail.com
