This may work for your needs with a little fine-tuning. Special and accented characters can be represented in HTML by a character name or by a numeric value. For example, " can be represented as &quot; or as &#34;, and it appears from your example that both forms are used. I've attached a dput(HTMLChars) to the end of this message with the concordances. The following works on your data, but I haven't included any error checking. Assuming the contents of your .csv file are in a character vector called txt and the data.frame HTMLChars is loaded:

# Search for &Name; (alnum rather than alpha, so names such as &sup2; and &frac14; also match)
lsta <- unique(unlist(regmatches(txt, gregexpr("&[[:alnum:]]+;", txt))))
lsta <- data.frame(Name=lsta)
matches <- merge(HTMLChars, lsta)
# seq_len() skips the loop when nothing matched; 1:nrow(matches) would run over c(1, 0)
for (i in seq_len(nrow(matches))) {
   txt <- gsub(matches$Name[i], matches$Character[i], txt, fixed=TRUE)
}

# Search for &#Number;
lstn <- unique(unlist(regmatches(txt, gregexpr("&#[[:digit:]]+;", txt))))
lstn <- data.frame(Number=lstn)
matches <- merge(HTMLChars, lstn)
for (i in seq_len(nrow(matches))) {
   txt <- gsub(matches$Number[i], matches$Character[i], txt, fixed=TRUE)
}

txt now contains the converted characters.

dput(HTMLChars)
structure(list(Character = c("\"", "'", "&", "<", ">", "\u00a0",
"¡", "¢", "£", "¤", "¥", "¦", "§", "¨", "©", "ª", "«", "¬", "\u00ad",
"®", "¯", "°", "±", "²", "³", "´", "µ", "¶", "·", "¸", "¹", "º", "»",
"¼", "½", "¾", "¿", "×", "÷", "À", "Á", "Â", "Ã", "Ä", "Å", "Æ", "Ç",
"È", "É", "Ê", "Ë", "Ì", "Í", "Î", "Ï", "Ð", "Ñ", "Ò", "Ó", "Ô", "Õ",
"Ö", "Ø", "Ù", "Ú", "Û", "Ü", "Ý", "Þ", "ß", "à", "á", "â", "ã", "ä",
"å", "æ", "ç", "è", "é", "ê", "ë", "ì", "í", "î", "ï", "ð", "ñ", "ò",
"ó", "ô", "õ", "ö", "ø", "ù", "ú", "û", "ü", "ý", "þ"),
Number = c("&#34;", "&#39;", "&#38;", "&#60;", "&#62;", "&#160;",
"&#161;", "&#162;", "&#163;", "&#164;", "&#165;", "&#166;", "&#167;",
"&#168;", "&#169;", "&#170;", "&#171;", "&#172;", "&#173;", "&#174;",
"&#175;", "&#176;", "&#177;", "&#178;", "&#179;", "&#180;", "&#181;",
"&#182;", "&#183;", "&#184;", "&#185;", "&#186;", "&#187;", "&#188;",
"&#189;", "&#190;", "&#191;", "&#215;", "&#247;", "&#192;", "&#193;",
"&#194;", "&#195;", "&#196;", "&#197;", "&#198;", "&#199;", "&#200;",
"&#201;", "&#202;", "&#203;", "&#204;", "&#205;", "&#206;", "&#207;",
"&#208;", "&#209;", "&#210;", "&#211;", "&#212;", "&#213;", "&#214;",
"&#216;", "&#217;", "&#218;", "&#219;", "&#220;", "&#221;", "&#222;",
"&#223;", "&#224;", "&#225;", "&#226;", "&#227;", "&#228;", "&#229;",
"&#230;", "&#231;", "&#232;", "&#233;", "&#234;", "&#235;", "&#236;",
"&#237;", "&#238;", "&#239;", "&#240;", "&#241;", "&#242;", "&#243;",
"&#244;", "&#245;", "&#246;", "&#248;", "&#249;", "&#250;", "&#251;",
"&#252;", "&#253;", "&#254;"),
Name = c("&quot;", "&apos;", "&amp;", "&lt;", "&gt;", "&nbsp;",
"&iexcl;", "&cent;", "&pound;", "&curren;", "&yen;", "&brvbar;",
"&sect;", "&uml;", "&copy;", "&ordf;", "&laquo;", "&not;", "&shy;",
"&reg;", "&macr;", "&deg;", "&plusmn;", "&sup2;", "&sup3;", "&acute;",
"&micro;", "&para;", "&middot;", "&cedil;", "&sup1;", "&ordm;",
"&raquo;", "&frac14;", "&frac12;", "&frac34;", "&iquest;", "&times;",
"&divide;", "&Agrave;", "&Aacute;", "&Acirc;", "&Atilde;", "&Auml;",
"&Aring;", "&AElig;", "&Ccedil;", "&Egrave;", "&Eacute;", "&Ecirc;",
"&Euml;", "&Igrave;", "&Iacute;", "&Icirc;", "&Iuml;", "&ETH;",
"&Ntilde;", "&Ograve;", "&Oacute;", "&Ocirc;", "&Otilde;", "&Ouml;",
"&Oslash;", "&Ugrave;", "&Uacute;", "&Ucirc;", "&Uuml;", "&Yacute;",
"&THORN;", "&szlig;", "&agrave;", "&aacute;", "&acirc;", "&atilde;",
"&auml;", "&aring;", "&aelig;", "&ccedil;", "&egrave;", "&eacute;",
"&ecirc;", "&euml;", "&igrave;", "&iacute;", "&icirc;", "&iuml;",
"&eth;", "&ntilde;", "&ograve;", "&oacute;", "&ocirc;", "&otilde;",
"&ouml;", "&oslash;", "&ugrave;", "&uacute;", "&ucirc;", "&uuml;",
"&yacute;", "&thorn;")),
.Names = c("Character", "Number", "Name"),
row.names = c(NA, 100L), class = "data.frame")

-------
David

> -----Original Message-----
> From: Michael Friendly [mailto:frien...@yorku.ca]
> Sent: Friday, August 10, 2012 12:14 PM
> To: dcarl...@tamu.edu
> Cc: 'R-help'
> Subject: Re: [R] translating HTML character entities to accented characters
>
> Thanks, David
>
> I need an all-R solution for this, because the author.csv file is
> exported from a database that enforces the HTML
> encoding and the import into R may have to be repeated several times as
> the database is updated.
>
> -Michael
>
> On 8/10/2012 12:40 PM, David L Carlson wrote:
> > It's not quite an R solution, but I just pasted your examples into a script
> > window in R and saved it as chars.html. Then I opened it in Firefox and
> > pasted the results here (with returns inserted to match your original).
> >
> >> grep("&", author$lname, value=TRUE)
> > [1] "Frère de Montizon" "Lumière"
> > [3] "Lumière" "Niépce"
> > [5] "Süssmilch" "Schüpbach"
> >> grep("&", author$birthplace, value=TRUE)
> > [1] "Marbach, Württemberg"
> > [2] "Côte-d'Or"
> > [3] "Chalon-sur-Saône, Saône-et-Loire"
> > [4] "Groß Särchen, Germany"
> >> apropos("HTML")
> > For a CSV file you would want to preserve the lines by adding <br> to the
> > end of each line first.
> >
> > ----------------------------------------------
> > David L Carlson
> > Associate Professor of Anthropology
> > Texas A&M University
> > College Station, TX 77843-4352
> >
> >> -----Original Message-----
> >> From: r-help-boun...@r-project.org [mailto:r-help-bounces@r-project.org] On Behalf Of Michael Friendly
> >> Sent: Friday, August 10, 2012 11:15 AM
> >> To: R-help
> >> Subject: [R] translating HTML character entities to accented characters
> >>
> >> I've imported a .csv file where character strings that contained
> >> accented characters were written as HTML
> >> character entities. Is there a function that works on a vector to
> >> translate them back to accented (latin1) characters?
> >>
> >> Some examples:
> >>
> >> > grep("&", author$lname, value=TRUE)
> >> [1] "Frère de Montizon" "Lumière"
> >> [3] "Lumière" "Niépce"
> >> [5] "Süssmilch" "Schüpbach"
> >> > grep("&", author$birthplace, value=TRUE)
> >> [1] "Marbach, Württemberg"
> >> [2] "Côte-d'Or"
> >> [3] "Chalon-sur-Saône, Saône-et-Loire"
> >> [4] "Groß Särchen, Germany"
> >> > apropos("HTML")
> >>
> >> thx,
> >> -Michael
> >>
> >> --
> >> Michael Friendly Email: friendly AT yorku DOT ca
> >> Professor, Psychology Dept.
> >> York University Voice: 416 736-2100 x66249 Fax: 416 736-5814
> >> 4700 Keele Street Web: http://www.datavis.ca
> >> Toronto, ONT M3J 1P3 CANADA
> >>
> >> ______________________________________________
> >> R-help@r-project.org mailing list
> >> https://stat.ethz.ch/mailman/listinfo/r-help
> >> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> >> and provide commented, minimal, self-contained, reproducible code.
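
A possible variation on the numeric-entity step in David's reply, offered only as a sketch: decimal entities of the form &#Number; carry the character's code point, so they could be decoded directly with base R's intToUtf8() instead of being looked up in a table. The function name decode_numeric_entities is hypothetical and not part of the thread. It assumes decimal entities only and returns UTF-8 strings, which may need converting with iconv() if latin1 is required.

```r
# Hypothetical helper (not from the original thread): decode decimal
# numeric entities such as "&#233;" straight from their code points.
# Assumes decimal entities only; hexadecimal forms (&#xE9;) are ignored.
decode_numeric_entities <- function(x) {
  hits <- gregexpr("&#[0-9]+;", x)
  regmatches(x, hits) <- lapply(regmatches(x, hits), function(ent) {
    # Strip the "&#" prefix and ";" suffix, leaving the code point
    codes <- as.integer(gsub("[&#;]", "", ent))
    # One UTF-8 character per code point (character(0) when no matches)
    vapply(codes, intToUtf8, character(1))
  })
  x
}

decode_numeric_entities(c("Ni&#233;pce", "Gro&#223; S&#228;rchen, Germany"))
```

Named entities such as &eacute; still need a lookup table like HTMLChars, so this would replace only the second loop, not the first.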