If you only need to grab the text, lynx does this conveniently. This example is for Windows, but it's nearly the same on other platforms:
> out <- shell("lynx.bat --dump --nolist http://www.google.com", intern = TRUE)
> head(out)
[1] ""
[2] " Web Images Videos Maps News Books Gmail more »"
[3] " iGoogle | Search settings | Sign in"
[4] " "
[5] " Google"
[6] " "

On Thu, Dec 3, 2009 at 5:29 PM, Michael Conklin <michael.conk...@markettools.com> wrote:

> I would like to be able to submit a list of URLs of various webpages and
> extract the "content" i.e. not the mark-up of those pages. I can find plenty
> of examples in the XML library of extracting links from pages but I cannot
> seem to find a way to extract the text. Any help would be greatly
> appreciated - I will not know the structure of the URLs I would submit in
> advance. Any suggestions on where to look would be greatly appreciated.
>
> Mike
>
> W. Michael Conklin
> Chief Methodologist
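
For the list-of-URLs case in the question, the same idea can be wrapped in a small loop. This is only a rough sketch: it assumes a Unix-like system where a `lynx` executable is on the PATH and uses `system()` in place of the Windows-only `shell()`. The helper names `lynx_cmd` and `dump_text` are made up for illustration.

```r
# Build the lynx command line for one URL (hypothetical helper).
# shQuote() protects characters such as & and ? from the shell.
lynx_cmd <- function(url, lynx = "lynx") {
  paste(lynx, "--dump", "--nolist", shQuote(url))
}

# Run lynx once per URL and collect the rendered text.
# Returns a list named by URL; each element is a character vector of lines,
# or NA if the command could not be run at all.
dump_text <- function(urls, lynx = "lynx") {
  out <- lapply(urls, function(u) {
    tryCatch(system(lynx_cmd(u, lynx), intern = TRUE),
             error = function(e) NA_character_)
  })
  names(out) <- urls
  out
}
```

It could then be called as `pages <- dump_text(c("http://www.google.com", "http://www.r-project.org"))`, with `head(pages[[1]])` showing the first lines of the first page. On Windows, passing `lynx = "lynx.bat"` may work as well, though that is untested here.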
______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.