XML is a good tool for reading data from the web within R, but I wonder how to
get the encoding right.
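library(XML)
url <- 'http://www.szitic.com/docc/jz-lmzq.html'
# parse the page into an internal document so that XPath queries work
xml <- htmlTreeParse(url, useInternalNodes = TRUE)
q <- "//tbody/tr/td"
# text of every table cell, as one character vector
dat <- unlist(xpathApply(xml, q, xmlValue))
# four cells per table row
df <- as.data.frame(t(matrix(dat, 4)))
dt <- as.character(df[15, 1])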
The first column of df holds dates in Chinese, and dt is one of them.
When I copied the content of dt into this email, it became the following:
> dt
[1]
"2008å岹砀戀㐀㄀㈀鰀\x8825æ岗砀愀㔀∀ഀ਀㸀 
In the R console itself, it looks like this:
> dt
[1] "2008\345\271\xb412\346\234\x8825\346\227\xa5"
and the color of the numbers differs a little.
> getOption("encoding")
[1] "native.enc"
> Sys.getlocale()
[1] "LC_COLLATE=Chinese (Simplified)_People's Republic of
China.936;LC_CTYPE=Chinese (Simplified)_People's Republic of
China.936;LC_MONETARY=Chinese (Simplified)_People's Republic of
China.936;LC_NUMERIC=C;LC_TIME=Chinese (Simplified)_People's Republic of
China.936"
Package: XML
Version: 1.98-1
Date: 2008/10/17
R version 2.8.0 (2008-10-20)
Windows Vista Basic, Simplified Chinese edition.
There is no problem using Chinese characters in R code.
How can I get the Chinese characters correctly with XML? Or is there any
method that could help me convert the encoding of the characters from UTF-8 to
Unicode in R?
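To make the question concrete, here is a minimal sketch of the kind of conversion I have in mind. It uses only base R (Encoding() and iconv()) and rebuilds the string from the bytes printed above, so it does not need the web page. That the bytes are UTF-8 is my assumption from how they look, and the encoding argument of htmlTreeParse() in the last comment is something I have not verified with this version of XML.

```r
# Bytes of dt exactly as the console prints them:
# "2008" E5 B9 B4 "12" E6 9C 88 "25" E6 97 A5
bytes <- as.raw(c(0x32, 0x30, 0x30, 0x38, 0xe5, 0xb9, 0xb4,
                  0x31, 0x32, 0xe6, 0x9c, 0x88,
                  0x32, 0x35, 0xe6, 0x97, 0xa5))
dt <- rawToChar(bytes)

# Option 1: only declare that the bytes are UTF-8; the bytes stay the same,
# but R now knows how to interpret them.
Encoding(dt) <- "UTF-8"
n_chars <- nchar(dt, type = "chars")  # characters, not bytes
n_bytes <- nchar(dt, type = "bytes")

# Option 2: re-encode from UTF-8 to the native encoding of the session
# (to = "" means the current locale, CP936 on my machine). The result is NA
# if the locale cannot represent the characters.
dt_native <- iconv(dt, from = "UTF-8", to = "")

# Possibly the encoding can also be declared at parse time (unverified):
# xml <- htmlTreeParse(url, useInternalNodes = TRUE, encoding = "UTF-8")
```

If the first option is enough, the same assignment could be applied to the whole vector dat before building df.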
Regards,
Wind
------------------
http://windspeedo.spaces.live.com
______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.