Hello:

How can one match names containing non-English characters that appear differently in different but related data files? For example, I have data on Raúl Grijalva, who represents the third district of Arizona in the US House of Representatives. His first name appears as "Raúl" in data read from one file and "Raul" in data read from another.


The ideal would convert both "Raúl" and "Raul" to "Raul". A reasonable alternative would identify the non-English characters and match on everything else ("^Ra" and "l$" in this case). The files all contain state and district, so "AZ-3" could be part of the solution. However, the file also contains data on Grijalva's predecessor in that office, Ben Quayle, so "AZ-3" alone is not enough.
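
A minimal sketch of the first idea, in case it helps frame the question. The helper name strip_accents, the vectors name1/name2 and the district strings are made up for illustration, and the character mapping covers only the accented letters I have seen so far. (iconv(x, to = "ASCII//TRANSLIT") might do the same job more generally, but its output is platform-dependent, so I have not relied on it here.)

```r
# Hypothetical helper: map a fixed set of accented letters to plain ASCII.
# Only the letters listed here are handled; extend both strings together.
strip_accents <- function(x) {
  accented <- "\u00e1\u00e9\u00ed\u00f3\u00fa\u00f1\u00c1\u00c9\u00cd\u00d3\u00da\u00d1"
  plain    <- "aeiounAEIOUN"
  chartr(accented, plain, enc2utf8(x))
}

# Made-up sample data standing in for the two files.
name1 <- c("Ra\u00fal Grijalva", "Ben Quayle")   # accented spelling
dist1 <- c("AZ-3", "AZ-3")
name2 <- c("Ben Quayle", "Raul Grijalva")        # unaccented spelling
dist2 <- c("AZ-3", "AZ-3")

# Build a key from district plus normalized name, then match on the key.
key1 <- paste(dist1, strip_accents(name1))
key2 <- paste(dist2, strip_accents(name2))
idx  <- match(key1, key2)   # row in file 2 for each row in file 1
```

Including the district in the key keeps two different people with the same normalized name in different districts apart, while the name keeps Grijalva and Quayle apart within AZ-3.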


      Thanks,
      Spencer


p.s.  My current data contains other similar cases, e.g.:


Recipient       District
Raúl Grijalva   AZ House 3
Tony Cárdenas   CA House 29
Linda Sánchez   CA House 38
Raúl Labrador   ID House 1
André Carson    IN House 7
Bob Menéndez    NJ Senate
Ben Ray Luján   NM House 3
José Serrano    NY House 15
Nydia Velázquez NY House 7
Rubén Hinojosa  TX House 15


These names all appear differently in another file I have. I've written an ugly function that can identify "nonstandard characters". I'm confident I can solve this problem. However, I'm adding things like this to the Ecdat package, and it would be more useful for others if I made better use of other capabilities in R.
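
For concreteness, here is a rough sketch of the "identify and match on everything else" idea. It is not my actual function; the names has_non_ascii and wildcard_pattern are invented here. It assumes each non-ASCII character corresponds to exactly one character in the other file and that the names contain no regular-expression metacharacters.

```r
# Hypothetical helper: TRUE for strings with any code point above 127.
has_non_ascii <- function(x) {
  vapply(enc2utf8(x), function(s) any(utf8ToInt(s) > 127L),
         logical(1), USE.NAMES = FALSE)
}

# Hypothetical helper: replace each non-ASCII character with ".",
# which matches any single character, and anchor the whole name.
wildcard_pattern <- function(x) {
  vapply(enc2utf8(x), function(s) {
    cp <- utf8ToInt(s)
    ch <- intToUtf8(cp, multiple = TRUE)  # one element per character
    ch[cp > 127L] <- "."
    paste0("^", paste(ch, collapse = ""), "$")
  }, character(1), USE.NAMES = FALSE)
}

other_file <- c("Raul Grijalva", "Ben Quayle")   # made-up sample
flagged    <- has_non_ascii("Ra\u00fal Grijalva")
pat        <- wildcard_pattern("Ra\u00fal Grijalva")
hits       <- grepl(pat, other_file)
```

The pattern would still need to be combined with state and district to avoid accidental matches across districts.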

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
