Re: [R] Encoding problem - I fails to read Hebrew text from online

Tal Galili Fri, 10 Dec 2010 12:35:44 -0800

Hi Matt and everyone else,
Thanks for the help so far.

I ended up using the tips provided to create a "dirty hack" based on a
translation table between the code and the Hebrew letters.


For the future (and for any suggestions), I am attaching this code below:

Best,
Tal

# the translation table (hex code point -> Hebrew letter):
translation.table.Hebrew <- data.frame(
  CODE = c("05D0", "05D1", "05D2", "05D3", "05D4", "05D5", "05D6", "05D7", "05D8",
           "05D9", "05DA", "05DB", "05DC", "05DD", "05DE", "05DF", "05E0", "05E1",
           "05E2", "05E3", "05E4", "05E5", "05E6", "05E7", "05E8", "05E9", "05EA"),
  HEBREW = c("א", "ב", "ג", "ד", "ה", "ו", "ז", "ח", "ט",
             "י", "ך", "כ", "ל", "ם", "מ", "ן", "נ", "ס",
             "ע", "ף", "פ", "ץ", "צ", "ק", "ר", "ש", "ת"),
  stringsAsFactors = FALSE
)
# translation.table
# STRING = inp
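
A possible way to avoid typing (and mis-encoding) the letters at all is to derive them from the code points, so the script itself stays pure ASCII. This is only a sketch; the name translation.table.generated is made up for illustration and is not used by the functions in this post.

```r
# Hypothetical alternative: build the CODE/HEBREW table from the code points
# U+05D0 (alef) to U+05EA (tav) instead of typing the letters by hand.
code.points <- 0x05D0:0x05EA                         # 27 consecutive code points
translation.table.generated <- data.frame(
  CODE   = sprintf("%04X", code.points),             # "05D0", "05D1", ...
  HEBREW = intToUtf8(code.points, multiple = TRUE),  # one letter per code point
  stringsAsFactors = FALSE
)
```

Passing this table as the TABLE argument should behave the same as the hand-written one, as long as the codes in the input use upper-case hex digits.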

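turn_nohash <- function(STRING)
{
  # Strip the entity markup so that "&#x5E9;" becomes the table code "05E9".
  # str_replace_all() is used because str_replace() changes only the first match.
  library(stringr)
  nohash <- str_replace_all(STRING, fixed("#"), "0")  # "#" -> "0" (pads the code to 4 digits)
  nohash <- str_replace_all(nohash, fixed(";"), "")   # drop the trailing ";"
  nohash <- str_replace_all(nohash, fixed("&"), "")   # drop the leading "&"
  nohash <- str_replace_all(nohash, fixed("x"), "")   # drop the hex marker "x"
  return(nohash)
}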

translate.all.chars <- function(STRING, TABLE = translation.table.Hebrew)
{
  # TABLE is of the form:
  #   CODE HEBREW
  # 1 05D0      א
  # 2 05D1      ב
  # 3 05D2      ג
  library(stringr)
  for (i in seq_len(nrow(TABLE)))
  {
    # replace every occurrence of the i-th code with its letter
    STRING <- str_replace_all(STRING, fixed(as.character(TABLE[i, 1])),
                              as.character(TABLE[i, 2]))
  }
  return(STRING)
}


HTML_heb_decode <- function(STRING, TABLE = translation.table.Hebrew)
{
STRING <- turn_nohash(STRING)
STRING <- translate.all.chars(STRING, TABLE)
 return(STRING)
}


# example of use:
inp <- "&#x5E9;&#x5DC;&#x5D5;&#x5DD;"
HTML_heb_decode(inp)
inp <- "&#x5E9;&#x5DC;&#x5D5;&#x5DD; &#x5D7;&#x5E0;&#x5D5;&#x5DA; "
HTML_heb_decode(inp)


output:


> HTML_heb_decode(inp)
Loading required package: stringr
Loading required package: plyr
[1] "שלום"
> inp <- "&#x5E9;&#x5DC;&#x5D5;&#x5DD; &#x5D7;&#x5E0;&#x5D5;&#x5DA; "
> HTML_heb_decode(inp)
[1] "שלום חנוך "
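
For what it's worth, the lookup table may not be strictly needed: a numeric entity already carries its code point, so base R can convert it directly with strtoi() and intToUtf8(). Below is a minimal sketch, assuming only numeric entities (hex "&#x...;" or decimal "&#...;") need decoding. The function name decode_numeric_entities is made up for illustration; named entities such as "&amp;" are left untouched and out-of-range code points are not handled.

```r
# Sketch (not the approach used above): decode numeric HTML entities in a
# single string using base R only.
decode_numeric_entities <- function(x) {
  stopifnot(is.character(x), length(x) == 1L)
  pattern <- "&#([xX][0-9A-Fa-f]+|[0-9]+);"
  m <- gregexpr(pattern, x)
  entities <- regmatches(x, m)[[1]]
  if (length(entities) == 0L) return(x)               # nothing to decode
  body <- substr(entities, 3L, nchar(entities) - 1L)  # drop "&#" and ";"
  is.hex <- grepl("^[xX]", body)
  code.points <- ifelse(is.hex,
                        strtoi(sub("^[xX]", "", body), base = 16L),
                        strtoi(body, base = 10L))
  # put the decoded characters back in place of the matched entities
  regmatches(x, m) <- list(intToUtf8(code.points, multiple = TRUE))
  x
}

decoded <- decode_numeric_entities("&#x5E9;&#x5DC;&#x5D5;&#x5DD;")
utf8ToInt(decoded)  # the code points of the decoded letters
```

Whether the result prints as Hebrew still depends on the console and locale, which is a separate issue from the decoding itself.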




----------------Contact
Details:-------------------------------------------------------
Contact me: tal.gal...@gmail.com |  972-52-7275845
Read me: www.talgalili.com (Hebrew) | www.biostatistics.co.il (Hebrew) |
www.r-statistics.com (English)
----------------------------------------------------------------------------------------------




On Fri, Dec 10, 2010 at 12:00 AM, Matt Shotwell <shotw...@musc.edu> wrote:

> Tal,
>
> OK, let me clarify my understanding. The original and decoded file are
> text, encoded by UTF-8. In the original file, there are HTML `entities'
> that represent UTF-8 Hebrew characters. In the decoded file, the
> entities are converted to UTF-8 characters. The question is how to
> convert these entities within R. It's not the same as converting between
> character encodings, otherwise iconv() might offer a solution.
>
> I'll have a look around to find a solution, and I hope others will too.
> My first idea is to check RCurl, XML, and the related utils::URLdecode.
> If there really is no existing solution, I think it might be worthwhile
> to look at how PHP and Python do it (and maybe borrow some code :) ).
>
> -Matt
>
>
> On Thu, 2010-12-09 at 14:27 -0500, Tal Galili wrote:
> > Hi Matt,
> > Thanks for having a look at this.
> > I just spent some time looking around and couldn't find any R function
> > to decode  decimal HTML code.
> >
> >
> > Do you (or someone else on the list) know how to program this sort of
> > thing? (Is there a formula for the translation?)
> >
> >
> >
> >
> > p.s:
> > For it to work on my end I added the encoding parameter:
> > readLines("http://biostatmatt.com/temp/Hebrew-decoded", warn=FALSE,
> > encoding= "UTF-8")
> >
> >
> > p.p.s: The Hebrew word I used means "peace"
> >
> >
> > Cheers,
> > Tal
> >
> >
> >
> >
> >
> >
> > On Thu, Dec 9, 2010 at 8:38 PM, Matt Shotwell <shotw...@musc.edu>
> > wrote:
> >         Tal,
> >
> >         It looks like the data you received has HTML special hex characters.
> >         That is, '&#x5E9;' is just an ASCII HTML representation of a hex
> >         character. It's not encoded in a special manner.
> >
> >         The trick is to substitute the HTML encoded hex character for its
> >         binary representation, or "decode" the character. I don't know of
> >         any R function that does this, but there are web services, for
> >         example:
> >         http://www.hashemian.com/tools/html-url-encode-decode.php
> >
> >         I decoded your file using this service and posted it on my website.
> >         You can see the difference by running:
> >
> >         readLines("http://biostatmatt.com/temp/Hebrew-original", warn=FALSE)
> >
> >         readLines("http://biostatmatt.com/temp/Hebrew-decoded", warn=FALSE)
> >
> >         The second should display the Hebrew characters correctly (it does
> >         in my terminal). The next thing to think about is how to automate
> >         this in R without using the web service... We may need to write an
> >         HTMLDecode function if there isn't one already.
> >
> >         By the way, what's the Hebrew text in English?
> >
> >         Best,
> >         Matt
> >
> >
> >
> >         On Thu, 2010-12-09 at 12:21 -0500, Tal Galili wrote:
> >         > I am bumping this question in the hopes that someone might be able to
> >         > advise.
> >         > This Hebrew and R business is not as smooth as I had hoped...
> >         >
> >         > Thanks,
> >         > Tal
> >         >
> >         > Older message:
> >         >
> >         > On Tue, Dec 7, 2010 at 2:30 PM, Tal Galili
> >         <tal.gal...@gmail.com> wrote:
> >         >
> >         > > Hello all,
> >         > >
> >         > > # I am trying to read the text in this URL:
> >         > > u <-
> >         > > "http://google.com/complete/search?output=toolbar&q=%d7%a9%d7%9c%d7%95%d7%9d"
> >         > > # By using this command:
> >         > > readLines(u)
> >         > >
> >         > > And no matter what variation I tried, I keep getting this
> >         output:
> >         > > [1] "<?xml version=\"1.0\"?><toplevel><CompleteSuggestion><suggestion
> >         > > data=\"&#x5E9;&#x5DC;&#x5D5;&#x5DD;\"/><   (etc...)
> >         > >
> >         >
> >         >
> >         > > Instead of this output:
> >         > > <?xml version="1.0"?><toplevel><CompleteSuggestion><suggestion data="שלום"/>
> >         > > <num_queries int="16800000"/></CompleteSuggestion>
> >         > > <CompleteSuggestion><suggestion data="שלום חנוך"/><num_queries int="232000"/></CompleteSuggestion>
> >         > > <CompleteSuggestion><suggestion data="שלום עליכם"/
> >         > > (etc....)
> >         > >
> >         > >
> >         >
> >         > > I tried:
> >         > >   readLines(u, encoding= "latin1")
> >         > >   readLines(u, encoding= "UTF-8")
> >         > > And also changing Sys.setlocale:
> >         > >   Sys.setlocale("LC_ALL", "Hebrew") # must be done for Hebrew to work.
> >         > >   Sys.setlocale("LC_ALL", "English") # must be done for Hebrew to work.
> >         > >
> >         > > Are there any more options I could try to get this text
> >         properly encoded?
> >         > >
> >         > > Thanks!
> >         > > Tal
> >         > >
> >         > >
> >         > >
> >         > >
> >         > >
> >         > >
> >         >
> >
> >         >
> >
> >         --
> >         Matthew S. Shotwell
> >         Graduate Student
> >         Division of Biostatistics and Epidemiology
> >         Medical University of South Carolina
> >
> >
> >
>
> --
> Matthew S. Shotwell
> Graduate Student
> Division of Biostatistics and Epidemiology
> Medical University of South Carolina
>
>


______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
