On 05/30/2011 09:04 AM, eric wrote:
Hi, I'm looking for help extracting some information from the Zillow website.
I'd like to do this for the general case where I manually change the address
by modifying the url (see code below). With the url containing the address,
I'd like to be able to extract the same information each time. The specific
information I'd like to be able to extract includes the homedetails url,
price (zestimate), number of beds, number of baths, and the Sqft. All this
information is shown in a bubble on the webpage.
I use the code below to try and do this but it's not working. I know the
information I'm interested in is there because if I print out "doc", I see it
all in one area. I've attached the relevant section of "doc" that shows and
highlights all the information I'm interested in (note that either URL
that's highlighted in doc is fine).
http://r.789695.n4.nabble.com/file/n3561075/relevant-section-of-doc.pdf
Hi Eric -- the problem is that the highlighted text is not in the XML
per se, but embedded in a comment. You can extract the text of the
comment as
getNodeSet(doc, 'string(//div[@id="resurrection-page-state"]/comment())')
You could go on to put some of that text into another XML document and
use XPath on that, but... you're really 'screen scraping' here, which
doesn't really showcase what XML is about. If you're trying to learn to
use XML, then I'd suggest choosing a simpler example. If you're trying
to corner the housing market (or whatever one does to housing markets)
then you'll want to find a better data source.
Hope that helps,
Martin
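To make that comment-extraction idea concrete, here is a base-R sketch of pulling individual fields out of the comment's text. The JSON-like payload below is a stand-in (the exact format of the live page's comment, and the field names, are assumptions, not something confirmed in this thread):

```r
## Stand-in for the text returned by the
## string(//div[@id="resurrection-page-state"]/comment()) XPath;
## the real payload's format and field names are assumptions here.
txt <- paste0('{"homedetails":"http://www.zillow.com/homedetails/",',
              '"price":"$120,500","beds":"3","baths":"1.0","sqft":"1224"}')

## Grab the value of one "field":"value" pair, or NA if absent
grab <- function(field, x) {
  pat <- sprintf('"%s":"([^"]*)"', field)
  m <- regexpr(pat, x)
  if (m == -1L) return(NA_character_)
  sub(pat, "\\1", regmatches(x, m))
}

grab("beds", txt)   # "3"
grab("sqft", txt)   # "1224"
grab("acres", txt)  # NA
```

In practice you would replace `txt` with the string the XPath above returns, and adjust the regular expression to whatever the comment actually contains.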
I'm guessing my xpath statements are wrong or getNodeSet needs something
else to get to information contained in a bubble on a webpage. Any
suggestions or ideas would be GREATLY appreciated.
library(XML)
url <- "http://www.zillow.com/homes/511 W Lafayette St, Norristown, PA_rb"
doc <- htmlTreeParse(url, useInternalNodes = TRUE, isURL = TRUE)
f1 <- getNodeSet(doc, "//a[contains(@href,'homedetails')]")
f2 <- getNodeSet(doc, "//span[contains(@class,'price')]")
f3 <- getNodeSet(doc, "//LIST[@Beds]")
f4 <- getNodeSet(doc, "//LIST[@Baths]")
f5 <- getNodeSet(doc, "//LIST[@Sqft]")
g1 <- sapply(f1, xmlValue)
g2 <- sapply(f2, xmlValue)
g3 <- sapply(f3, xmlValue)
g4 <- sapply(f4, xmlValue)
g5 <- sapply(f5, xmlValue)
print(f1)
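For what it's worth, a quick way to see why element-oriented XPath like the above comes back empty is to dump the comment nodes directly. This is a sketch against a minimal stand-in page (the real page's comment payload is of course much larger):

```r
library(XML)

## Minimal stand-in for the Zillow page: the listing data is hidden
## inside an HTML comment, not in ordinary elements
html <- paste0('<html><body><div id="resurrection-page-state">',
               '<!-- {"beds":"3","baths":"1.0"} -->',
               '</div></body></html>')
doc <- htmlParse(html, asText = TRUE)

## Element-oriented XPath finds nothing inside the div...
length(getNodeSet(doc, "//div[@id='resurrection-page-state']//span"))  # 0

## ...but the comment() node test reaches the hidden text
xpathSApply(doc, "//comment()", xmlValue)
```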
--
View this message in context:
http://r.789695.n4.nabble.com/Need-help-reading-website-info-with-XML-package-and-XPath-tp3561075p3561075.html
Sent from the R help mailing list archive at Nabble.com.
______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
--
Computational Biology
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109
Location: M1-B861
Telephone: 206 667-2793