Hi Matt and everyone else,
Thanks for the help so far.
I ended up using the tips provided to create a "dirty hack" based on a
translation table between the code and the Hebrew letters.
For future reference (and in case anyone has suggestions), I am attaching the code below:
Best,
Tal
# the translation table:
translation.table.Hebrew <- structure(list(V1 = structure(1:27, .Label =
c("05D0", "05D1",
"05D2", "05D3", "05D4", "05D5", "05D6", "05D7", "05D8",
"05D9", "05DA", "05DB", "05DC", "05DD", "05DE", "05DF",
"05E0", "05E1", "05E2", "05E3", "05E4", "05E5", "05E6",
"05E7", "05E8", "05E9", "05EA"), class = "factor"), V2 = structure(1:27,
.Label = c("א",
"ב", "ג", "ד", "ה", "ו", "ז", "ח", "ט", "י", "ך", "כ", "ל", "ם",
"מ", "ן", "נ", "ס", "ע", "ף", "פ", "ץ", "צ", "ק", "ר", "ש", "ת"
), class = "factor")), .Names = c("CODE", "HEBREW"), class = "data.frame",
row.names = c(NA,
-27L))
# translation.table
# STRING = inp
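turn_nohash <- function(STRING)
{
# strip the "&#x...;" wrapper so that "&#x5E9;" becomes the table code "05E9"
require(stringr)
nohash <- str_replace_all(STRING, fixed("#"), "0") # convert every "#" to "0"
nohash <- str_replace_all(nohash, fixed(";"), "") # drop the trailing ";"
nohash <- str_replace_all(nohash, fixed("&"), "") # drop the leading "&"
nohash <- str_replace_all(nohash, fixed("x"), "") # drop the hex marker "x" (this also removes any literal "x" in the text)
return(nohash)
}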
translate.all.chars <- function(STRING, TABLE = translation.table.Hebrew)
{
# TABLE is of the form:
# CODE HEBREW
# 1 05D0 א
# 2 05D1 ב
# 3 05D2 ג
require(stringr)
i.chars.to.check <- seq_len(nrow(TABLE))
for(i in i.chars.to.check)
{
# replace every occurrence of the code, not just the first one
STRING <- str_replace_all(STRING, fixed(as.character(TABLE[i,1])),
as.character(TABLE[i,2]))
}
return(STRING)
}
HTML_heb_decode <- function(STRING, TABLE = translation.table.Hebrew)
{
STRING <- turn_nohash(STRING)
STRING <- translate.all.chars(STRING, TABLE)
return(STRING)
}
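Note that the table covers only the 27 Hebrew letters. A numeric entity is
just the Unicode code point of the character, written in hex (&#x5E9;) or in
decimal (&#1513;), so a table-free variant seems possible with base R's
strtoi() and intToUtf8(). What follows is a minimal sketch, assuming the input
contains only numeric entities (named ones such as &amp; are left untouched).
decode_numeric_entities is a made-up name, not an existing function:

```r
# Sketch: decode numeric HTML entities (&#xHHHH; and &#DDDD;) with base R only.
# decode_numeric_entities is a hypothetical helper, not part of any package.
decode_numeric_entities <- function(x) {
  pattern <- "&#[xX][0-9A-Fa-f]+;|&#[0-9]+;"
  vapply(x, function(s) {
    m <- gregexpr(pattern, s)
    entities <- regmatches(s, m)[[1]]
    if (length(entities) == 0L) return(s)   # nothing to decode
    # drop the leading "&#" and the trailing ";"
    body <- substr(entities, 3L, nchar(entities) - 1L)
    is_hex <- grepl("^[xX]", body)
    # hex bodies start with "x"; everything else is read as decimal
    code <- ifelse(is_hex,
                   strtoi(sub("^[xX]", "", body), base = 16L),
                   strtoi(body, base = 10L))
    # put the decoded characters back in place of the entities
    regmatches(s, m) <- list(vapply(code, intToUtf8, character(1)))
    s
  }, character(1), USE.NAMES = FALSE)
}

decode_numeric_entities("&#x5E9;&#x5DC;&#x5D5;&#x5DD;")  # the four letters of "shalom"
```

The result is UTF-8 text; in a non-UTF-8 locale it may print as <U+05E9>
style escapes instead of Hebrew glyphs.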
# example of use:
inp <- "&#x5E9;&#x5DC;&#x5D5;&#x5DD;"
HTML_heb_decode(inp)
inp <- "&#x5E9;&#x5DC;&#x5D5;&#x5DD; &#x5D7;&#x5E0;&#x5D5;&#x5DA; "
HTML_heb_decode(inp)
output:
> HTML_heb_decode(inp)
Loading required package: stringr
Loading required package: plyr
[1] "שלום"
> inp <- "&#x5E9;&#x5DC;&#x5D5;&#x5DD; &#x5D7;&#x5E0;&#x5D5;&#x5DA; "
> HTML_heb_decode(inp)
[1] "שלום חנוך "
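Since the text being read is XML, another option may be to let an XML parser
resolve the entities while parsing, in line with Matt's suggestion to check
the XML package. This is only a sketch, assuming the XML package is installed.
The sample string imitates the structure of the reply quoted below and is not
real Google output, and the encoding declaration is my own addition:

```r
# Sketch only: relies on the XML package; txt is a made-up sample document.
txt <- paste0('<?xml version="1.0" encoding="UTF-8"?><toplevel><CompleteSuggestion>',
              '<suggestion data="&#x5E9;&#x5DC;&#x5D5;&#x5DD;"/>',
              '</CompleteSuggestion></toplevel>')

if (requireNamespace("XML", quietly = TRUE)) {
  # the parser turns numeric entities into characters while reading
  doc <- XML::xmlParse(txt, asText = TRUE)
  # pull the "data" attribute out of every <suggestion> node
  suggestions <- XML::xpathSApply(doc, "//suggestion", XML::xmlGetAttr, "data")
  print(suggestions)
} else {
  message("XML package not available")
}
```

With this route no translation table is needed, at the cost of depending on
an extra package.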
----------------Contact Details:-------------------------------------------------------
Contact me: [email protected] | 972-52-7275845
Read me: www.talgalili.com (Hebrew) | www.biostatistics.co.il (Hebrew) |
www.r-statistics.com (English)
----------------------------------------------------------------------------------------------
On Fri, Dec 10, 2010 at 12:00 AM, Matt Shotwell <[email protected]> wrote:
> Tal,
>
> OK, let me clarify my understanding. The original and decoded file are
> text, encoded by UTF-8. In the original file, there are HTML `entities'
> that represent UTF-8 Hebrew characters. In the decoded file, the
> entities are converted to UTF-8 characters. The question is how to
> convert these entities within R. It's not the same as converting between
> character encodings, otherwise iconv() might offer a solution.
>
> I'll have a look around to find a solution, and I hope others will too.
> My first idea is to check RCurl, XML, and the related utils::URLdecode.
> If there really is no existing solution, I think it might be worthwhile
> to look at how PHP and Python do it (and maybe borrow some code :) ).
>
> -Matt
>
>
> On Thu, 2010-12-09 at 14:27 -0500, Tal Galili wrote:
> > Hi Matt,
> > Thanks for having a look at this.
> > I just spent some time looking around and couldn't find any R function
> > to decode decimal HTML code.
> >
> >
> > Do you (or someone else on the list) know how to program this sort of
> > thing? (Is there a formula for the translation?)
> >
> >
> >
> >
> > p.s:
> > For it to work on my end I added the encoding parameter:
> > readLines("http://biostatmatt.com/temp/Hebrew-decoded", warn=FALSE,
> > encoding= "UTF-8")
> >
> >
> > p.p.s: The Hebrew word I used means "peace"
> >
> >
> > Cheers,
> > Tal
> >
> >
> >
> >
> >
> >
> > On Thu, Dec 9, 2010 at 8:38 PM, Matt Shotwell <[email protected]>
> > wrote:
> > Tal,
> >
> > It looks like the data you received has HTML special hex characters.
> > That is, '&#x5E9;' is just an ASCII HTML representation of a hex
> > character. It's not encoded in a special manner.
> >
> > The trick is to substitute the HTML encoded hex character for its binary
> > representation, or "decode" the character. I don't know of any R
> > function that does this, but there are web services, for example:
> > http://www.hashemian.com/tools/html-url-encode-decode.php
> >
> > I decoded your file using this service and posted it on my website. You
> > can see the difference by running:
> >
> > readLines("http://biostatmatt.com/temp/Hebrew-original", warn=FALSE)
> >
> > readLines("http://biostatmatt.com/temp/Hebrew-decoded", warn=FALSE)
> >
> > The second should display the Hebrew characters correctly (it does in my
> > terminal). The next thing to think about is how to automate this in R
> > without using the web service... We may need to write an HTMLDecode
> > function if there isn't one already.
> >
> > By the way, what's the Hebrew text in English?
> >
> > Best,
> > Matt
> >
> >
> >
> > On Thu, 2010-12-09 at 12:21 -0500, Tal Galili wrote:
> > > I am bumping this question in the hopes that someone might be able to
> > > advise.
> > > This Hebrew and R business is not as smooth as I had hoped...
> > >
> > > Thanks,
> > > Tal
> > >
> > > Older message:
> > >
> > > On Tue, Dec 7, 2010 at 2:30 PM, Tal Galili <[email protected]> wrote:
> > >
> > > > Hello all,
> > > >
> > > > # I am trying to read the text in this URL:
> > > > u <- "http://google.com/complete/search?output=toolbar&q=%d7%a9%d7%9c%d7%95%d7%9d"
> > > > # By using this command:
> > > > readLines(u)
> > > >
> > > > And no matter what variation I tried, I keep getting this output:
> > > > [1] "<?xml version=\"1.0\"?><toplevel><CompleteSuggestion><suggestion
> > > > data=\"&#x5E9;&#x5DC;&#x5D5;&#x5DD;\"/>< (etc...)
> > > >
> > >
> > >
> > > > Instead of this output:
> > > > <?xml version="1.0"?><toplevel><CompleteSuggestion><suggestion data="שלום"/>
> > > > <num_queries int="16800000"/></CompleteSuggestion>
> > > > <CompleteSuggestion><suggestion data="שלום חנוך"/><num_queries int="232000"/></CompleteSuggestion>
> > > > <CompleteSuggestion><suggestion data="שלום עליכם"/
> > > > (etc....)
> > > >
> > > >
> > >
> > > > I tried:
> > > > readLines(u, encoding= "latin1")
> > > > readLines(u, encoding= "UTF-8")
> > > > And also changing Sys.setlocale:
> > > > Sys.setlocale("LC_ALL", "Hebrew") # must be done for Hebrew to work.
> > > > Sys.setlocale("LC_ALL", "English") # must be done for Hebrew to work.
> > > >
> > > > Are there any more options I could try to get this text properly encoded?
> > > >
> > > > Thanks!
> > > > Tal
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > >
> >
> > >
> >
> > --
> > Matthew S. Shotwell
> > Graduate Student
> > Division of Biostatistics and Epidemiology
> > Medical University of South Carolina
> >
> >
> >
>
>
>
______________________________________________
[email protected] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.