Re: [R] import HTML tables

Duncan Temple Lang Wed, 13 May 2009 06:57:22 -0700

Dieter Menne wrote:


Dimitri Szerman-2 wrote:

Hello,
I was wondering if there is a function in R that imports tables directly
from a HTML document.


The XML package can do this:

http://markmail.org/message/cyicoa3htme4gei2

Duncan Temple Lang:

The htmlParse() and htmlTreeParse() functions in the XML package use the
non-strict HTML parser in libxml2 and so the HTML document can be malformed.


Indeed. Thanks Dieter.

htmlParse() reads the document; getNodeSet() lets us
easily find the table or tables of interest.
We can also find the th and td entries easily using XPath.
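
As a minimal sketch of these steps (assuming the XML package is installed;
the inline HTML string is a made-up example, with the closing </tr> tags
left out on purpose to exercise the non-strict parser):

```r
library(XML)

# Hypothetical input: a small table whose rows are never closed with </tr>
html <- "<html><body>
<table>
<tr><th>name</th><th>value</th>
<tr><td>a</td><td>1</td>
<tr><td>b</td><td>2</td>
</table>
</body></html>"

# Parse the HTML text with the non-strict libxml2 HTML parser
doc <- htmlParse(html, asText = TRUE)

# Find all tables in the document
tables <- getNodeSet(doc, "//table")

# Within the first table, pull out the header and data cells via XPath
headers <- xpathSApply(tables[[1]], ".//th", xmlValue)
cells   <- xpathSApply(tables[[1]], ".//td", xmlValue)
```

Here headers and cells are plain character vectors; turning them into a
table with typed columns is a separate step.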

The less automated part is how to meaningfully process the content.
That is where a human should be involved, deciding whether to trim
white space, how to convert text to values, and how to deal with missing cells.
We can do a lot by default, but ...
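
To make those decisions concrete, here is one possible set of choices,
sketched on a made-up table: trim white space, pad short rows with NA, and
let type.convert() guess the column types. These are illustrative defaults
only, not what any particular function does.

```r
library(XML)

# Hypothetical input: padded cell text and a row with a missing cell
html <- "<table>
<tr><th>name</th><th>score</th></tr>
<tr><td> a </td><td> 1.5 </td></tr>
<tr><td>b</td></tr>
</table>"

doc  <- htmlParse(html, asText = TRUE)

# Only rows that contain data cells (skips the header row)
rows <- getNodeSet(doc, "//table//tr[td]")

# One character vector per row, with surrounding white space trimmed
cells <- lapply(rows, function(r) trimws(xpathSApply(r, "./td", xmlValue)))

# Pad short rows with NA so that every row has the same number of cells
width  <- max(sapply(cells, length))
padded <- lapply(cells, function(x) c(x, rep(NA_character_, width - length(x))))

# Assemble a data frame and take the column names from the th cells
df <- as.data.frame(do.call(rbind, padded), stringsAsFactors = FALSE)
names(df) <- trimws(xpathSApply(doc, "//table//th", xmlValue))

# Convert text to values, column by column
df[] <- lapply(df, type.convert, as.is = TRUE)
```

Padding on the right assumes that missing cells are always the trailing
ones; a table with rowspan or colspan attributes would need more care.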


There is a relatively simple function at

  http://www.omegahat.org/ParseXML/readHTMLTable.R

that provides something resembling read.table.
It is not well tested because, in the past, I have just used XPath
directly: once you know XPath, extracting content from HTML/XML is
very straightforward.
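
For comparison, a hedged sketch of the read.table-like route. It assumes an
installed version of the XML package that exports readHTMLTable(); whether
that function behaves the same as the file at the URL above is an
assumption. The HTML string is again a made-up example.

```r
library(XML)

# Hypothetical input table
html <- "<table>
<tr><th>name</th><th>value</th></tr>
<tr><td>a</td><td>1</td></tr>
<tr><td>b</td><td>2</td></tr>
</table>"

doc <- htmlParse(html, asText = TRUE)

# which = 1 asks for the first table only, returned as a data frame;
# stringsAsFactors = FALSE keeps the cell text as character columns
tab <- readHTMLTable(doc, header = TRUE, which = 1, stringsAsFactors = FALSE)
```

The cell values come back as text here, so numeric columns still need to be
converted, for example with as.numeric(), unless column classes are supplied.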

  D.



Dieter

