Build a lookup table for your data. I think it is a fool's errand to try to automatically "normalize" arbitrary Unicode characters to an ASCII form that everyone will agree on.
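A minimal sketch of what I mean, in base R. Everything here is hypothetical and for illustration only: the data frames file1 and file2, the column names, the helper add_key(), and the numbers in the value column are all made up, and the canonical spellings are simply ones picked by hand. Non-ASCII letters are written as \u escapes so the script does not depend on the file encoding.

```r
# Hypothetical lookup table: one row per spelling seen in any file,
# mapped to a single canonical key chosen by hand.
lookup <- data.frame(
  variant   = c("Ra\u00fal Grijalva", "Raul Grijalva",
                "Tony C\u00e1rdenas", "Tony Cardenas"),
  canonical = c("Raul Grijalva", "Raul Grijalva",
                "Tony Cardenas", "Tony Cardenas"),
  stringsAsFactors = FALSE
)

# Two hypothetical input files that spell the same people differently.
file1 <- data.frame(
  Recipient = c("Ra\u00fal Grijalva", "Tony C\u00e1rdenas"),
  District  = c("AZ House 3", "CA House 29"),
  stringsAsFactors = FALSE
)
file2 <- data.frame(
  Recipient = c("Raul Grijalva", "Tony Cardenas"),
  value     = c(100, 250),  # made-up numbers, for illustration only
  stringsAsFactors = FALSE
)

# Attach the canonical key to a data frame. match() returns NA for any
# spelling missing from the lookup table, which flags rows to review.
add_key <- function(df, lookup) {
  df$key <- lookup$canonical[match(df$Recipient, lookup$variant)]
  df
}
file1k <- add_key(file1, lookup)
file2k <- add_key(file2, lookup)

# Spellings not yet covered by the lookup table.
unmatched <- c(file1k$Recipient[is.na(file1k$key)],
               file2k$Recipient[is.na(file2k$key)])

# Join on the canonical key rather than on the raw names.
merged <- merge(file1k[, c("key", "District")],
                file2k[, c("key", "value")],
                by = "key")
```

The point of the design is that the mapping is explicit and reviewable: whenever unmatched is not empty, you add the new spellings to the table by hand instead of trusting an automatic transliteration.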
BTW: To avoid propagating open joins your data should probably have some kind of id for the term those Representatives are serving.
---------------------------------------------------------------------------
Jeff Newmiller                        The     .....       .....  Go Live...
DCN:<jdnew...@dcn.davis.ca.us>        Basics: ##.#.       ##.#.  Live Go...
                                      Live:   OO#.. Dead: OO#..  Playing
Research Engineer (Solar/Batteries            O.O#.       #.O#.  with
/Software/Embedded Controllers)               .OO#.       .OO#.  rocks...1k
---------------------------------------------------------------------------
Sent from my phone. Please excuse my brevity.

Spencer Graves <spencer.gra...@structuremonitoring.com> wrote:

>Hello:
>
>      How can one match names containing non-English characters that
>appear differently in different but related data files?  For example, I
>have data on Raúl Grijalva, who represents the third district of
>Arizona in the US House of Representatives.  This first name appears as
>"Raúl" in data read from one file and "Raul" from another.
>
>      The ideal would convert both "Raúl" and "Raúl" to "Raul".  A
>reasonable alternative would identify the non-English characters and
>match on everything else ("^Ra" and "l$" in this case).  The files all
>contain state and district, so "AZ-3" could be part of the solution.
>However, the file also contains data on Grijalva's predecessor in that
>office, Ben Quayle, so "AZ-3" is not enough.
>
>      Thanks,
>      Spencer
>
>p.s.  My current data contains other similar cases, e.g.:
>
>     Recipient         District
>Raúl Grijalva          AZ House 3
>Tony Cárdenas          CA House 29
>Linda Sánchez          CA House 38
>Raúl Labrador          ID House 1
>André Carson           IN House 7
>Bob Menéndez           NJ Senate
>Ben Ray Luján          NM House 3
>José Serrano           NY House 15
>Nydia Velázquez        NY House 7
>Rubén Hinojosa         TX House 15
>
>      These names all appear differently in another file I have.  I've
>written an ugly function that can identify "nonstandard characters".
>I'm confident I can solve this problem.
>However, I'm adding things like this to the Ecdat package, and it would
>be more useful for others if I made better use of other capabilities in R.
>
>______________________________________________
>R-help@r-project.org mailing list
>https://stat.ethz.ch/mailman/listinfo/r-help
>PLEASE do read the posting guide
>http://www.R-project.org/posting-guide.html
>and provide commented, minimal, self-contained, reproducible code.