Hi Matt and everyone else,

Thanks for the help so far. I ended up using the tips provided to create a "dirty hack" based on a translation table between the code and the Hebrew letters.
For the future (and for any suggestions), I am attaching this code below:

Best,
Tal


# the translation table:
translation.table.Hebrew <- structure(list(
    V1 = structure(1:27, .Label = c("05D0", "05D1", "05D2", "05D3", "05D4",
        "05D5", "05D6", "05D7", "05D8", "05D9", "05DA", "05DB", "05DC", "05DD",
        "05DE", "05DF", "05E0", "05E1", "05E2", "05E3", "05E4", "05E5", "05E6",
        "05E7", "05E8", "05E9", "05EA"), class = "factor"),
    V2 = structure(1:27, .Label = c("א", "ב", "ג", "ד", "ה", "ו", "ז", "ח",
        "ט", "י", "ך", "כ", "ל", "ם", "מ", "ן", "נ", "ס", "ע", "ף", "פ", "ץ",
        "צ", "ק", "ר", "ש", "ת"), class = "factor")),
    .Names = c("CODE", "HEBREW"), class = "data.frame", row.names = c(NA, -27L))
# translation.table

# STRING = inp
turn_nohash <- function(STRING)
{
    require(stringr)
    # Turn an entity such as "&#x5E9;" into the table code "05E9".
    # str_replace_all is used because str_replace only changes the first match.
    nohash <- str_replace_all(STRING, "#", "0")  # "#" becomes the leading "0"
    nohash <- str_replace_all(nohash, ";", "")   # drop ";"
    nohash <- str_replace_all(nohash, "&", "")   # drop "&"
    nohash <- str_replace_all(nohash, "x", "")   # drop "x"
    return(nohash)
}

translate.all.chars <- function(STRING, TABLE = translation.table.Hebrew)
{
    # TABLE is of the form:
    #   CODE HEBREW
    # 1 05D0 א
    # 2 05D1 ב
    # 3 05D2 ג
    require(stringr)
    i.chars.to.check <- seq_len(dim(TABLE)[1])
    for(i in i.chars.to.check)
    {
        # replace every occurrence of the i-th code with its Hebrew letter
        STRING <- str_replace_all(STRING, as.character(TABLE[i,1]), as.character(TABLE[i,2]))
    }
    return(STRING)
}

HTML_heb_decode <- function(STRING, TABLE = translation.table.Hebrew)
{
    STRING <- turn_nohash(STRING)
    STRING <- translate.all.chars(STRING, TABLE)
    return(STRING)
}

# example of use:
inp <- "&#x5E9;&#x5DC;&#x5D5;&#x5DD;"
HTML_heb_decode(inp)
inp <- "&#x5E9;&#x5DC;&#x5D5;&#x5DD; &#x5D7;&#x5E0;&#x5D5;&#x5DA; "
HTML_heb_decode(inp)


output:

> HTML_heb_decode(inp)
Loading required package: stringr
Loading required package: plyr
[1] "שלום"
> inp <- "&#x5E9;&#x5DC;&#x5D5;&#x5DD; &#x5D7;&#x5E0;&#x5D5;&#x5DA; "
> HTML_heb_decode(inp)
[1] "שלום חנוך "


----------------Contact Details:-------------------------------------------------------
Contact me: tal.gal...@gmail.com | 972-52-7275845
Read me: www.talgalili.com (Hebrew) | www.biostatistics.co.il (Hebrew) |
www.r-statistics.com (English)
----------------------------------------------------------------------------------------------


On Fri, Dec 10, 2010 at 12:00 AM, Matt Shotwell <shotw...@musc.edu> wrote:

> Tal,
>
> OK, let me clarify my understanding. The original and decoded file are
> text, encoded by UTF-8. In the original file, there are HTML `entities'
> that represent UTF-8 Hebrew characters. In the decoded file, the
> entities are converted to UTF-8 characters. The question is how to
> convert these entities within R. It's not the same as converting between
> character encodings, otherwise iconv() might offer a solution.
>
> I'll have a look around to find a solution, and I hope others will too.
> My first idea is to check RCurl, XML, and the related utils::URLdecode.
> If there really is no existing solution, I think it might be worthwhile
> to look at how PHP and Python do it (and maybe borrow some code :) ).
>
> -Matt
>
>
> On Thu, 2010-12-09 at 14:27 -0500, Tal Galili wrote:
> > Hi Matt,
> > Thanks for having a look at this.
> > I just spent some time looking around and couldn't find any R function
> > to decode decimal HTML code.
> >
> > Do you (or someone else on the list) know how to program this sort of
> > thing? (is there a formula for the translation?)
> >
> > p.s:
> > For it to work on my end I added the encoding parameter:
> > readLines("http://biostatmatt.com/temp/Hebrew-decoded", warn=FALSE,
> >           encoding= "UTF-8")
> >
> > p.p.s: The Hebrew word I used means "peace"
> >
> > Cheers,
> > Tal
> >
> >
> > On Thu, Dec 9, 2010 at 8:38 PM, Matt Shotwell <shotw...@musc.edu> wrote:
> >     Tal,
> >
> >     It looks like the data you received has HTML special hex characters.
> >     That is, '&#x5E9;' is just an ASCII HTML representation of a hex
> >     character. It's not encoded in a special manner.
> >
> >     The trick is to substitute the HTML encoded hex character for its
> >     binary representation, or "decode" the character. I don't know of
> >     any R function that does this, but there are web services, for
> >     example:
> >     http://www.hashemian.com/tools/html-url-encode-decode.php
> >
> >     I decoded your file using this service and posted it on my website.
> >     You can see the difference by running:
> >
> >     readLines("http://biostatmatt.com/temp/Hebrew-original", warn=FALSE)
> >
> >     readLines("http://biostatmatt.com/temp/Hebrew-decoded", warn=FALSE)
> >
> >     The second should display the Hebrew characters correctly (it does
> >     in my terminal). The next thing to think about is how to automate
> >     this in R without using the web service... We may need to write an
> >     HTMLDecode function if there isn't one already.
> >
> >     By the way, what's the Hebrew text in English?
> >
> >     Best,
> >     Matt
> >
> >
> >     On Thu, 2010-12-09 at 12:21 -0500, Tal Galili wrote:
> > > I am bumping this question in the hopes that someone might be able to
> > > advise.
> > > This Hebrew and R business is not as smooth as I had hoped...
> > >
> > > Thanks,
> > > Tal
> > >
> > > Older message:
> > >
> > > On Tue, Dec 7, 2010 at 2:30 PM, Tal Galili <tal.gal...@gmail.com> wrote:
> > >
> > > > Hello all,
> > > >
> > > > # I am trying to read the text in this URL:
> > > > u <- "http://google.com/complete/search?output=toolbar&q=%d7%a9%d7%9c%d7%95%d7%9d"
> > > > # By using this command:
> > > > readLines(u)
> > > >
> > > > And no matter what variation I tried, I keep getting this output:
> > > > [1] "<?xml version=\"1.0\"?><toplevel><CompleteSuggestion><suggestion
> > > > data=\"&#x5E9;&#x5DC;&#x5D5;&#x5DD;\"/>< (etc...)
> > > >
> > > > Instead of this output:
> > > > <?xml version="1.0"?><toplevel><CompleteSuggestion><suggestion data="שלום"/><num_queries
> > > > int="16800000"/></CompleteSuggestion><CompleteSuggestion><suggestion
> > > > data="שלום חנוך"/><num_queries int="232000"/></CompleteSuggestion>
> > > > <CompleteSuggestion><suggestion data="שלום עליכם"/
> > > > (etc....)
> > > >
> > > > I tried:
> > > > readLines(u, encoding= "latin1")
> > > > readLines(u, encoding= "UTF-8")
> > > > And also changing Sys.setlocale:
> > > > Sys.setlocale("LC_ALL", "Hebrew") # must be done for Hebrew to work.
> > > > Sys.setlocale("LC_ALL", "English") # must be done for Hebrew to work.
> > > >
> > > > Are there any more options I could try to get this text properly encoded?
> > > >
> > > > Thanks!
> > > > Tal
> >
> >     --
> >     Matthew S. Shotwell
> >     Graduate Student
> >     Division of Biostatistics and Epidemiology
> >     Medical University of South Carolina
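
On the question raised in the thread of whether there is a formula for the translation: a numeric HTML entity is simply the character's Unicode code point, written in hex after "&#x" or in decimal after "&#", so a lookup table is not strictly required. The following is only a minimal base-R sketch under that assumption. The name decode_numeric_entities is a hypothetical helper, not an existing R function, and named entities such as "&amp;" are deliberately left untouched.

```r
# Hypothetical helper (not part of base R or of any package):
# decode numeric HTML entities, hex ("&#x5E9;") or decimal ("&#1513;"),
# in every element of a character vector.
decode_numeric_entities <- function(x) {
  pattern <- "&#([xX][0-9A-Fa-f]+|[0-9]+);"
  matches <- gregexpr(pattern, x)
  regmatches(x, matches) <- lapply(regmatches(x, matches), function(ents) {
    vapply(ents, function(ent) {
      body <- substr(ent, 3, nchar(ent) - 1)      # strip "&#" and ";"
      code <- if (grepl("^[xX]", body)) {
        strtoi(substring(body, 2), base = 16L)    # hex code point
      } else {
        strtoi(body, base = 10L)                  # decimal code point
      }
      intToUtf8(code)                             # code point -> UTF-8 character
    }, character(1), USE.NAMES = FALSE)
  })
  x
}

# Example of use (the four code points are the letters of the word
# discussed in this thread):
decoded <- decode_numeric_entities("&#x5E9;&#x5DC;&#x5D5;&#x5DD;")
utf8ToInt(decoded)   # 1513 1500 1493 1501, i.e. 0x5E9 0x5DC 0x5D5 0x5DD
```

Because the conversion goes through the code point itself, the same function should work for any script, not only Hebrew, and text without entities is returned unchanged.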
______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.