Hi all,

Sorry for the rather uninformative subject, but the error I get is not very 
informative either.

When using the XML and RCurl packages to retrieve the content of an HTML page, 
htmlTreeParse fails, printing out the beginning of the HTML:

Error in htmlTreeParse(getURL(url)) : 
  File   <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" 
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
  <html xmlns="http://www.w3.org/1999/xhtml" xml:lang="de" lang="de">
    <head>
      <title>Deutsches Krebsforschungszentrum</title>
      <meta http-equiv="content-type" content="text/html; charset=ISO-8859-1" />
      <meta http-equiv="Content-Style-Type" content="text/css" />
      <meta http-equiv="imagetoolbar" content="no" />
      <meta name="MSSmartTagsPreventParsing" content="true" />
      <meta name="revisit-after" content="5 days" />
      <meta name="language" content="de" />
      <meta lang="de" content="" xml:lang="de" name="keywords">
      <meta lang="de" xml:lang="de" name="description" content="Das Deutsche 
Krebsforschungszentrum hat die Aufgabe, die Mechanismen der Krebsentstehung 
systematisch zu erforschen und Risikofaktoren für Krebserkrankungen zu 
erfassen. Aus den Ergebnissen dieser grundlegenden Arbeiten sollen neue Ans√

This code reproduces the error:

library(RCurl)
library(XML)
url <- 
"www.dkfz.de/en/genetics/pages/projects/bioinformatics/Custom_Chip_Definition_File.html"
htmlTreeParse(getURL(url))

The issue seems to originate in htmlTreeParse, as getURL alone works and returns 
the expected content. I checked whether it could be an encoding issue, and as 
far as I can tell it is not.

Moreover, using htmlParse(paste("http://",url,sep="")) works. Note that 
htmlTreeParse(getURL(paste("http://",url,sep=""))) fails too; the "http://" is 
important only for htmlParse, so that it identifies the string as a URL.

This issue is rather new, and as I've been using the same versions of XML and 
RCurl, I suppose it might have to do with some of the website's content 
having been updated, but given the error, I can't quite figure out what is 
raising it.

Although it works on that simple example, using htmlParse is not really a 
workaround, as I need to pass additional arguments to the getURL call (such as 
userpwd), which I can't provide to htmlParse.
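
One thing that might be worth trying, sketched here as an assumption rather than a confirmed fix: the asText argument of htmlTreeParse tells the parser to treat its input as content rather than as a file name. The inline string below is a made-up stand-in for what getURL(url, userpwd = ...) would return, so the sketch runs without network access; I have not verified it against the page above.

```r
library(XML)

# Hypothetical stand-in for the string returned by
# getURL(url, userpwd = "user:password"); not the real page content
page <- "<html><head><title>Test page</title></head><body><p>Hello</p></body></html>"

# asText = TRUE: treat 'page' as HTML content, not as a file name or URL
doc <- htmlTreeParse(page, asText = TRUE, useInternalNodes = TRUE)

# Extract the title to confirm that a tree was actually built
title <- xpathSApply(doc, "//title", xmlValue)
print(title)
```

With the real page, the first argument would be the result of the getURL call (including any extra options such as userpwd) instead of the inline string.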

Any hints would be greatly appreciated,

Cheers,

Nico

sessionInfo()
R version 2.15.0 (2012-03-30)
Platform: i386-apple-darwin9.8.0/i386 (32-bit)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] XML_3.9-4      RCurl_1.91-1   bitops_1.0-4.1

loaded via a namespace (and not attached):
[1] tools_2.15.0

---------------------------------------------------------------
Nicolas Delhomme

Nathaniel Street Lab
Department of Plant Physiology
Umeå Plant Science Center

Tel: +46 90 786 7989
Email: nicolas.delho...@plantphys.umu.se
SLU - Umeå universitet
Umeå S-901 87 Sweden
---------------------------------------------------------------

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
