Hi all, Sorry for the rather uninformative subject, but the error I get is not very informative either.
When using the XML and RCurl packages to retrieve the content of an HTML page, htmlTreeParse fails, printing out the beginning of the HTML:

    Error in htmlTreeParse(getURL(url)) :
      File <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
    <html xmlns="http://www.w3.org/1999/xhtml" xml:lang="de" lang="de">
    <head>
    <title>Deutsches Krebsforschungszentrum</title>
    <meta http-equiv="content-type" content="text/html; charset=ISO-8859-1" />
    <meta http-equiv="Content-Style-Type" content="text/css" />
    <meta http-equiv="imagetoolbar" content="no" />
    <meta name="MSSmartTagsPreventParsing" content="true" />
    <meta name="revisit-after" content="5 days" />
    <meta name="language" content="de" />
    <meta lang="de" content="" xml:lang="de" name="keywords">
    <meta lang="de" xml:lang="de" name="description" content="Das Deutsche Krebsforschungszentrum hat die Aufgabe, die Mechanismen der Krebsentstehung systematisch zu erforschen und Risikofaktoren für Krebserkrankungen zu erfassen. Aus den Ergebnissen dieser grundlegenden Arbeiten sollen neue Ans

This code reproduces the error:

    library(RCurl)
    library(XML)
    url <- "www.dkfz.de/en/genetics/pages/projects/bioinformatics/Custom_Chip_Definition_File.html"
    htmlTreeParse(getURL(url))

The issue seems to originate in htmlTreeParse, as getURL alone works and returns the expected content. I checked whether it could be an encoding issue and, as far as I can tell, it is not. Moreover, using

    htmlParse(paste("http://", url, sep=""))

works. Note that

    htmlTreeParse(getURL(paste("http://", url, sep="")))

fails too; the "http://" is important only for htmlParse, so that it identifies its argument as a URL.

This issue is rather new, and as I've been using the same versions of XML and RCurl, I suppose it might have to do with some of the content of the website having been updated, but given the error, I can't quite figure out what is raising it.
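
Since the message starts with "File", one guess (an assumption, not something verified against the page above) is that htmlTreeParse takes the downloaded string for a file name rather than for HTML content. A minimal sketch of a way to test that guess is to pass asText = TRUE, which tells the parser that its first argument is the document text itself. The helper name parse_html_text and the small inline page are made up for illustration; the inline page only stands in for the result of getURL(url).

```r
library(XML)

# Hypothetical helper (not part of XML or RCurl): parse text that has
# already been downloaded, stating explicitly that the argument is HTML
# content and not a file name or URL.
parse_html_text <- function(html_text) {
  htmlTreeParse(html_text, asText = TRUE, useInternalNodes = TRUE)
}

# Small inline page standing in for the string returned by getURL(url)
sample_html <- paste0(
  "<html><head><title>Deutsches Krebsforschungszentrum</title></head>",
  "<body><p>test</p></body></html>"
)

doc <- parse_html_text(sample_html)

# Extract the title to confirm that the text was parsed as a document
page_title <- xpathSApply(doc, "//title", xmlValue)
print(page_title)

# With the real page (needs network access and RCurl):
# library(RCurl)
# doc <- parse_html_text(getURL(url))
```

If this also parses the real page, fetching and parsing remain two separate calls, so any additional getURL arguments are unaffected.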
Although htmlParse works on that simple example, it is not really a workaround, as I need to use additional arguments in the getURL call (such as userpwd), which I can't provide to htmlParse.

Any hints would be greatly appreciated,

Cheers,

Nico

sessionInfo()
R version 2.15.0 (2012-03-30)
Platform: i386-apple-darwin9.8.0/i386 (32-bit)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] XML_3.9-4      RCurl_1.91-1   bitops_1.0-4.1

loaded via a namespace (and not attached):
[1] tools_2.15.0

---------------------------------------------------------------
Nicolas Delhomme
Nathaniel Street Lab
Department of Plant Physiology
Umeå Plant Science Center
Tel: +46 90 786 7989
Email: nicolas.delho...@plantphys.umu.se
SLU - Umeå universitet
Umeå S-901 87 Sweden
---------------------------------------------------------------

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.