On 18/11/2019 10:11 a.m., Björn Fisseler wrote:
Hello,
I'm struggling to compare two strings that come from different data
sets. The strings look identical: "Alexander Jäger"
But when I compare these strings: string1 == string2
the result is FALSE.
Looking at the raw bytes used to encode the strings, the results are
different:
string1: 41 6c 65 78 61 6e 64 65 72 20 4a c3 a4 67 65 72
string2: 41 6c 65 78 61 6e 64 65 72 20 4a 61 cc 88 67 65 72
string2 comes from the file names of different files on my machine
(macOS), string1 comes from a data file (csv, UTF8 encoding).
It's obviously the umlaut "ä" in this example, which is encoded with two
bytes in one string and three in the other. The question is how to change this? This
problem makes it impossible to join the two data sets based on the
names. I already checked the settings on my machine: Sys.getlocale()
returns "de_DE.UTF-8/de_DE.UTF-8/de_DE.UTF-8/C/de_DE.UTF-8/de_DE.UTF-8".
Changing/forcing the encoding of the data didn't bring the results I
expected.
What else can I try?
Characters like ä have two (or more) representations in Unicode: a
single code point (U+00E4, the bytes c3 a4 in your string1), or the code
point for "a" followed by a combining code point (U+0308, the bytes cc 88
in your string2) that says "add an umlaut".
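To make the two representations concrete, here is a small base-R sketch. The variable names composed and decomposed are only illustrative, and the strings are built from \u escapes rather than read from your files, so treat it as a demonstration of the effect, not of your actual data.

```r
# Precomposed form: a single code point, U+00E4
composed <- "Alexander J\u00e4ger"
# Decomposed form: plain "a" followed by U+0308 (combining diaeresis)
decomposed <- "Alexander Ja\u0308ger"

# Both usually display the same, but they are different byte sequences
same_string <- composed == decomposed

# Raw UTF-8 bytes: the umlaut is c3 a4 in one string and 61 cc 88 in the other
bytes_composed   <- charToRaw(composed)
bytes_decomposed <- charToRaw(decomposed)

# Number of code points in each string (they differ by one)
n_composed   <- length(utf8ToInt(composed))
n_decomposed <- length(utf8ToInt(decomposed))

print(same_string)
print(bytes_composed)
print(bytes_decomposed)
```

Printing the raw vectors should reproduce the byte dumps from your message, which is a quick way to check which form a given data source uses.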
If you want to compare strings, you need a consistent representation.
This is called normalizing the string.
There are several possible normalizations; for your purposes it doesn't
matter which one you use, as long as you use the same normalization for
both strings. See
<https://withblue.ink/2019/03/11/why-you-need-to-normalize-unicode-strings.html>
for details.
In R, there are several functions that do the normalization for you.
Two are utf8::utf8_normalize and stringi::stri_trans_nfc. So you'd want
something like
library(utf8)
string1 <- utf8_normalize(string1)
string2 <- utf8_normalize(string2)
string1 == string2 # Should now work as expected
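Since the goal is to join two data sets on the names, the same idea can be applied to the key columns before merging. The following is only a sketch: the data frames csv_data and file_data and their columns are made up for illustration, and it uses stringi::stri_trans_nfc instead of utf8_normalize, which should be interchangeable for this purpose.

```r
library(stringi)

# Hypothetical stand-ins for the two data sets:
# names from the CSV file (precomposed ä) ...
csv_data <- data.frame(name = "Alexander J\u00e4ger", score = 1,
                       stringsAsFactors = FALSE)
# ... and names taken from macOS file names (decomposed ä)
file_data <- data.frame(name = "Alexander Ja\u0308ger", file = "a.txt",
                        stringsAsFactors = FALSE)

# Without normalization the keys do not match
rows_before <- nrow(merge(csv_data, file_data, by = "name"))

# Normalize both key columns to the same form (NFC) before joining
csv_data$name  <- stri_trans_nfc(csv_data$name)
file_data$name <- stri_trans_nfc(file_data$name)

joined <- merge(csv_data, file_data, by = "name")
rows_after <- nrow(joined)

print(rows_before)
print(rows_after)
```

Normalizing once, right after reading each data set, keeps later comparisons and joins consistent without having to remember it at every call.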
Duncan Murdoch