A version of this came up not long ago in a slightly different context (bug 17369: parse() doesn't honor unicode in NFD normalization).
The basic issue is that Unicode has several normalization forms (look it up...). Briefly, accented characters exist in two forms: one as a single code point, and another as the base letter followed by the accent. I.e., there is the single letter "ä" (U+00E4) and then "a\u0308", which is "a" followed by "combining diaeresis" (U+0308), which effectively puts a ¨ on top of the preceding character. In your byte dumps, c3 a4 is the UTF-8 encoding of the single code point (string1, composed form, NFC) and 61 cc 88 is "a" plus the combining diaeresis (string2, decomposed form, NFD). Both strings are valid UTF-8, so changing the encoding does not help; they need to be brought to the same normalization form before comparing or joining.

The utf8 package has code for normalizing strings.

-pd

> On 18 Nov 2019, at 16:11 , Björn Fisseler <bjoern.fisse...@googlemail.com> wrote:
>
> Hello,
>
> I'm struggling with comparing two strings, which come from different data
> sets. These strings are identical: "Alexander Jäger"
>
> But when I compare these strings: string1 == string2
> the result is FALSE.
>
> Looking at the raw bytes used to encode the strings, the results are
> different:
>
> string1: 41 6c 65 78 61 6e 64 65 72 20 4a c3 a4 67 65 72
> string2: 41 6c 65 78 61 6e 64 65 72 20 4a 61 cc 88 67 65 72
>
> string2 comes from the file names of different files on my machine
> (macOS), string1 comes from a data file (csv, UTF8 encoding).
>
> It's obviously the umlaut "ä" in this example which is encoded with two
> respectively three bytes. The question is how to change this? This
> problem makes it impossible to join the two data sets based on the
> names. I already checked the settings on my machine: Sys.getlocale()
> returns "de_DE.UTF-8/de_DE.UTF-8/de_DE.UTF-8/C/de_DE.UTF-8/de_DE.UTF-8".
> Changing/forcing the encoding of the data didn't bring the results I
> expected.
>
> What else can I try?
>
> Best regards
>
> Björn
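
To make the point concrete, here is a minimal sketch that rebuilds the two strings from the byte dumps quoted in this thread and compares them before and after normalization. It is written in Python rather than R, purely as a language-neutral illustration using the standard library's unicodedata module; the helper name nfc is made up for this example, and in R the normalization step would be done with the utf8 package instead.

```python
import unicodedata

# Byte sequences copied from the quoted message.
string1 = bytes.fromhex(
    "41 6c 65 78 61 6e 64 65 72 20 4a c3 a4 67 65 72"
).decode("utf-8")
string2 = bytes.fromhex(
    "41 6c 65 78 61 6e 64 65 72 20 4a 61 cc 88 67 65 72"
).decode("utf-8")


def nfc(s: str) -> str:
    """Hypothetical helper: return s in composed form (NFC)."""
    return unicodedata.normalize("NFC", s)


# The strings look the same when printed but differ code point by code point:
# string1 holds U+00E4, string2 holds "a" followed by U+0308.
print(string1 == string2)            # False
print(len(string1), len(string2))    # 15 16

# After normalizing both sides to the same form they compare equal,
# so the normalized values can serve as a join key.
print(nfc(string1) == nfc(string2))  # True
```

Normalizing both columns (rather than only the one that came from the file names) is the safer choice, since it does not depend on knowing which form each data source happens to use.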
--
Peter Dalgaard, Professor,
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45)38153501
Office: A 4.23
Email: pd....@cbs.dk  Priv: pda...@gmail.com

______________________________________________
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.