Re: [R] Matching names with non-English characters

Duncan Murdoch Mon, 13 May 2013 10:00:06 -0700

On 13/05/2013 12:05 PM, Spencer Graves wrote:

Hello:



        How can one match names containing non-English characters that
appear differently in different but related data files?  For example, I
have data on Raúl Grijalva, who represents the third district of Arizona
in the US House of Representatives.  This first name appears as "RaÃºl"
in data read from one file and "Raul" from another.


        The ideal would convert both "RaÃºl" and "Raúl" to "Raul".

You shouldn't have both "RaÃºl" and "Raúl" in the same file. They are
different encodings for the same characters. (The first looks like
UTF-8, the second is your native encoding, presumably the Windows
Latin-1 variant, CP-1252.) So your first problem is to identify the
encodings of your input files, and read them all in to a common
encoding. Converting them to UTF-8 in R makes the most sense, because
it includes the characters from all other encodings you're ever likely
to see.
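As a rough illustration of the encoding step, here is a minimal sketch that repairs strings like "RaÃºl", on the assumption that they are UTF-8 bytes that were misread as Latin-1/CP-1252. The helper name fix_mojibake is hypothetical (it is not part of base R), and it only covers characters in the Latin-1 range. Declaring the right encoding when the file is read is cleaner when that is possible.

```r
# Hypothetical helper: undo "UTF-8 bytes misread as Latin-1" damage.
# Assumption: the damaged strings contain only Latin-1 range characters.
fix_mojibake <- function(x) {
  x <- enc2utf8(x)
  # Map each character back to the single byte it was misread from ...
  fixed <- iconv(x, from = "UTF-8", to = "latin1")
  # ... then declare that those bytes are really UTF-8.
  Encoding(fixed) <- "UTF-8"
  # Keep the repair only where it yields valid UTF-8, so strings that
  # were already correct (or plain ASCII) are left untouched.
  ok <- !is.na(fixed) & validUTF8(fixed)
  x[ok] <- fixed[ok]
  x
}

# \u escapes keep this example independent of the script's own encoding:
# "Ra\u00c3\u00bal" is the damaged form, "Ra\u00fal" the correct one.
repaired <- fix_mojibake(c("Ra\u00c3\u00bal", "Ra\u00fal", "Raul"))
```

The validity check is a heuristic: a correctly encoded name that happens to look like valid mojibake would be altered, so it is worth reviewing the changed entries.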

Having both "Raúl" and "Raul" in the same file is a different issue.
The second one is an error or a variant spelling. In this case, you can
use


iconv("Raúl", to="ASCII//TRANSLIT")

on most platforms to find an ASCII approximation. (This works on my
Windows system; your mileage may vary.) As Jeff said, this is an
impossible problem in general, so you may well need some manual fixups
at the end.
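To make the matching step concrete, here is a minimal sketch that builds an ASCII key for each name and matches on that key. The helper name ascii_key and the two small vectors are made up for illustration. The result of "ASCII//TRANSLIT" depends on the platform's iconv implementation, so names that fail to match still need to be checked by hand.

```r
# Hypothetical helper: lower-case ASCII approximation used only as a
# matching key; the original names are kept for display.
ascii_key <- function(x) {
  x <- enc2utf8(x)
  tolower(iconv(x, from = "UTF-8", to = "ASCII//TRANSLIT"))
}

# Illustrative data: the same people spelled differently in two files.
file_a <- c("Ra\u00fal Grijalva", "Ben Quayle")   # accented spelling
file_b <- c("Ben Quayle", "Raul Grijalva")        # plain ASCII spelling

# Position in file_b of each name in file_a (NA if no match was found).
idx <- match(ascii_key(file_a), ascii_key(file_b))

# Names left over for manual fixup.
unmatched <- file_a[is.na(idx)]
```

Pasting the district onto the key, for example with paste(), would reduce false matches between different people who share a name.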


Duncan Murdoch

        A reasonable alternative would identify the non-English characters and
match on everything else ("^Ra" and "l$" in this case).  The files all
contain state and district, so "AZ-3" could be part of the solution.
However, the file also contains data on Grijalva's predecessor in that
office, Ben Quayle, so "AZ-3" is not enough.


        Thanks,
        Spencer


p.s.  My current data contains other similar cases, e.g.:


      Recipient     District
RaÃºl Grijalva   AZ House 3
Tony CÃ¡rdenas   CA House 29
Linda SÃ¡nchez   CA House 38
RaÃºl Labrador   ID House 1
AndrÃ© Carson    IN House 7
Bob MenÃ©ndez    NJ Senate
Ben Ray LujÃ¡n   NM House 3
JosÃ© Serrano    NY House 15
Nydia VelÃ¡zquez NY House 7
RubÃ©n Hinojosa  TX House 15


        These names all appear differently in another file I have. I've
written an ugly function that can identify "nonstandard characters".
I'm confident I can solve this problem.  However, I'm adding things like
this to the Ecdat package, and it would be more useful for others if I
made better use of other capabilities in R.

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

