Michael Conklin wrote:

> I would like to be able to submit a list of URLs of various webpages and
> extract the "content", i.e. not the mark-up, of those pages. I can find
> plenty of examples in the XML library of extracting links from pages, but
> I cannot seem to find a way to extract the text. Any help would be greatly
> appreciated - I will not know the structure of the URLs I would submit in
> advance. Any suggestions on where to look would be greatly appreciated.
>
> Mike
>
> W. Michael Conklin
> Chief Methodologist
What kind of "content" are you after? Tables? Chunks of text?

For tables you can use the readHTMLTable() function in the XML package.
There was also some discussion of alternate ways to extract data from
tables in this thread:
http://n4.nabble.com/Downloading-data-from-from-internet-td889838.html#a889845

If you're after text, then it's probably a matter of locating the element
that encloses the data you want -- perhaps by using getNodeSet() along with
an XPath [1] expression that specifies the element you are interested in.
The text can then be recovered using the xmlValue() function.

Hope this helps!

-Charlie

[1]: http://www.w3schools.com/XPath/xpath_syntax.asp
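
P.S. A rough sketch of both approaches, in case it helps -- untested, and
the URL and the "//p" XPath below are just placeholders you would swap for
your own:

  library(XML)

  url <- "http://www.example.com/somepage.html"   # placeholder URL
  doc <- htmlParse(url)                           # parse the page once

  ## Tables: readHTMLTable() returns a list of data frames,
  ## one per <table> element found on the page.
  tables <- readHTMLTable(doc)

  ## Text: locate the enclosing nodes with an XPath expression and
  ## pull out their text with xmlValue(). "//p" grabs every paragraph;
  ## adjust the expression to target the element you actually want.
  nodes <- getNodeSet(doc, "//p")
  txt   <- sapply(nodes, xmlValue)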