The problem seems to lie in the XML methods. The parsed object xml itself is OK, and its header reads as follows:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html><head><meta http-equiv="Content-Type" content="text/html; charset=gb2312"><title>深圳国投</title>
The charset there is correctly gb2312, which is also the encoding of the web page's content.

> doc <- xmlRoot(xml)
> doc[[1]]
<head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8"><title>娣卞湷鍥芥姇</title>

The charset has been changed to UTF-8.

> doc1 <- xmlRoot(xml, encoding="gb2312")
> doc1[[1]]
<head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8"><title>娣卞湷鍥芥姇</title>

It seems that some methods of XML change the charset to UTF-8 of their own accord.


Wind2 wrote:
>
> XML is a good tool for reading data from the web within R. But I wonder how
> I could get the encoding right.
>
> library(XML)
> url <- 'http://www.szitic.com/docc/jz-lmzq.html'
> xml <- htmlTreeParse(url, useInternal=TRUE)
> q <- "//tbody/tr/td"
> dat <- unlist(xpathApply(xml, q, xmlValue))
> df <- as.data.frame(t(matrix(dat, 4)))
> dt <- as.character(df[15,1])
>
> The first column of df holds dates in Chinese. dt is one of those Chinese
> dates.
> When I copied the content of dt into the email, it became the following:
>> dt
> [1]
> "2008å岹砀戀㐀㄀㈀鰀\x8825æ岗砀愀㔀∀ഀ਀㸀
>
> In R itself, it looks like this:
>> dt
> [1] "2008\345\271\xb412\346\234\x8825\346\227\xa5"
>
> and the color of the numbers differs a little.
>
>> getOption("encoding")
> [1] "native.enc"
>> Sys.getlocale()
> [1] "LC_COLLATE=Chinese (Simplified)_People's Republic of
> China.936;LC_CTYPE=Chinese (Simplified)_People's Republic of
> China.936;LC_MONETARY=Chinese (Simplified)_People's Republic of
> China.936;LC_NUMERIC=C;LC_TIME=Chinese (Simplified)_People's Republic of
> China.936"
>>
>
> Package: XML
> Version: 1.98-1
> Date: 2008/10/17
>
> R version 2.8.0 (2008-10-20)
> Windows Vista Basic, Simplified Chinese edition.
>
> There is no problem using Chinese characters in R code.
>
> I wonder how I could get the Chinese characters with XML. Or is there any
> method that could help me convert the encoding of the characters from UTF-8
> to unicode in R?
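
On that last question: the escapes printed for dt above (\345\271\xb4 and so on) spell out the UTF-8 bytes of a date string, so R appears to be holding UTF-8 text that is not marked as such. One possible way to handle this, offered only as a sketch, is to declare the encoding with Encoding() or to convert with iconv(), both from base R. The byte vector below is retyped from the output in the post and xmlValue() is not called; whether the same applies to the whole dat vector on the original Windows setup is an assumption.

```r
# UTF-8 bytes retyped from the printed value of dt in the post.
# Building the string from raw bytes keeps the example independent of
# the encoding of this file and of the current locale.
bytes <- as.raw(c(0x32, 0x30, 0x30, 0x38,   # "2008"
                  0xe5, 0xb9, 0xb4,         # \345\271\xb4
                  0x31, 0x32,               # "12"
                  0xe6, 0x9c, 0x88,         # \346\234\x88
                  0x32, 0x35,               # "25"
                  0xe6, 0x97, 0xa5))        # \346\227\xa5
dt <- rawToChar(bytes)   # a string with no declared encoding

# Option 1: leave the bytes alone and declare them to be UTF-8.
Encoding(dt) <- "UTF-8"

# Option 2: re-encode into the native encoding of the session
# (to = "" means the current locale). iconv() returns NA if the
# characters cannot be represented in that encoding.
dt_native <- iconv(dt, from = "UTF-8", to = "")

nchar(dt, type = "bytes")   # number of bytes
nchar(dt, type = "chars")   # number of characters
```

If this works on one string, the same call could be tried on the full vector, for example iconv(dat, from = "UTF-8", to = ""), before building df.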
>
> Regards,
> Wind
>
> ------------------
> http://windspeedo.spaces.live.com
> ______________________________________________
> R-help@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>

--
View this message in context: http://www.nabble.com/Chinese-characters-encoding-problem-with-XML-tp21225957p21230340.html
Sent from the R help mailing list archive at Nabble.com.