On 05/30/2011 09:04 AM, eric wrote:
Hi, I'm looking for help extracting some information from the Zillow website.
I'd like to do this for the general case where I manually change the address
by modifying the url (see code below). With the url containing the address,
I'd like to be able to extract the same information each time. The specific
information I'd like to be able to extract includes the homedetails url,
price (zestimate), number of beds, number of baths, and the Sqft. All this
information is shown in a bubble on the webpage.
I use the code below to try and do this but it's not working. I know the
information I'm interested in is there because if I print out "doc", I see it
all in one area. I've attached the relevant section of "doc" that shows and
highlights all the information I'm interested in (note that either URL
that's highlighted in doc is fine).
http://r.789695.n4.nabble.com/file/n3561075/relevant-section-of-doc.pdf
Hi Eric -- the problem is that the highlighted text is not in the XML
per se, but embedded in a comment. You can extract the text of the
comment as
getNodeSet(doc, 'string(//div[@id="resurrection-page-state"]/comment())')
You could go on to put some of that text into another XML document and
use XPath on that, but... you're really 'screen scraping' here, which
doesn't really showcase what XML is about. If you're trying to learn to
use XML, then I'd suggest choosing a simpler example. If you're trying
to corner the housing market (or whatever one does to housing markets)
then you'll want to find a better data source.
Hope that helps,
Martin
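To make that comment-extraction idea concrete, here is a base-R sketch of pulling individual fields out of the comment's text. The JSON-like payload below is a stand-in (the exact format of the live page's comment, and the field names, are assumptions, not something confirmed in this thread):

```r
## Stand-in for the text returned by the
## string(//div[@id="resurrection-page-state"]/comment()) XPath;
## the real payload's format and field names are assumptions here.
txt <- paste0('{"homedetails":"http://www.zillow.com/homedetails/",',
              '"price":"$120,500","beds":"3","baths":"1.0","sqft":"1224"}')

## Grab the value of one "field":"value" pair, or NA if absent
grab <- function(field, x) {
  pat <- sprintf('"%s":"([^"]*)"', field)
  m <- regexpr(pat, x)
  if (m == -1L) return(NA_character_)
  sub(pat, "\\1", regmatches(x, m))
}

grab("beds", txt)   # "3"
grab("sqft", txt)   # "1224"
grab("acres", txt)  # NA
```

In practice you would replace `txt` with the string the XPath above returns, and adjust the regular expression to whatever the comment actually contains.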
I'm guessing my xpath statements are wrong or getNodeSet needs something
else to get to information contained in a bubble on a webpage. Any
suggestions or ideas would be GREATLY appreciated.
library(XML)
url <- "http://www.zillow.com/homes/511 W Lafayette St, Norristown, PA_rb"
doc <- htmlTreeParse(url, useInternalNodes = TRUE, isURL = TRUE)
f1 <- getNodeSet(doc, "//a[contains(@href,'homedetails')]")
f2 <- getNodeSet(doc, "//span[contains(@class,'price')]")
f3 <- getNodeSet(doc, "//LIST[@Beds]")
f4 <- getNodeSet(doc, "//LIST[@Baths]")
f5 <- getNodeSet(doc, "//LIST[@Sqft]")
g1 <- sapply(f1, xmlValue)
g2 <- sapply(f2, xmlValue)
g3 <- sapply(f3, xmlValue)
g4 <- sapply(f4, xmlValue)
g5 <- sapply(f5, xmlValue)
print(f1)
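For what it's worth, a quick way to see why element-oriented XPath like the above comes back empty is to dump the comment nodes directly. This is a sketch against a minimal stand-in page (the real page's comment payload is of course much larger):

```r
library(XML)

## Minimal stand-in for the Zillow page: the listing data is hidden
## inside an HTML comment, not in ordinary elements
html <- paste0('<html><body><div id="resurrection-page-state">',
               '<!-- {"beds":"3","baths":"1.0"} -->',
               '</div></body></html>')
doc <- htmlParse(html, asText = TRUE)

## Element-oriented XPath finds nothing inside the div...
length(getNodeSet(doc, "//div[@id='resurrection-page-state']//span"))  # 0

## ...but the comment() node test reaches the hidden text
xpathSApply(doc, "//comment()", xmlValue)
```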
--
View this message in context:
http://r.789695.n4.nabble.com/Need-help-reading-website-info-with-XML-package-and-XPath-tp3561075p3561075.html
Sent from the R help mailing list archive at Nabble.com.
______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
--
Computational Biology
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109
Location: M1-B861
Telephone: 206 667-2793