[R] Scraping a web page.

2012-05-16 Thread Keith Weintraub
for your detailed reply, KW Message: 139 Date: Tue, 15 May 2012 21:02:05 -0700 From: Duncan Temple Lang To: r-help@r-project.org Subject: Re: [R] Scraping a web page. Message-ID: <4fb326bd.9080...@wald.ucdavis.edu> Content-Type: text/plain; charset=ISO-8859-1 Hi Keith Of course, it d

Re: [R] Scraping a web page.

2012-05-16 Thread Keith Weintraub
Thanks Gabor, Nifty regexp. I never used strapplyc before and I am sure this will become a nice addition to my toolkit. KW Message: 5 Date: Tue, 15 May 2012 07:55:33 -0400 From: Gabor Grothendieck To: Keith Weintraub Cc: r-help@r-project.org Subject: Re: [R] Scraping a web page. Message-ID

Re: [R] Scraping a web page.

2012-05-15 Thread Duncan Temple Lang
Hi Keith Of course, it doesn't necessarily matter how you get the job done if it actually works correctly. But for a general approach, it is useful to use general tools and can lead to more correct, more robust, and more maintainable code. Since htmlParse() in the XML package can both retrieve
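The general approach Duncan describes can be sketched as below. This is a minimal illustration, not his original code: the HTML fragment is made up (modeled on the URLs quoted later in the thread), and the XML package is assumed to be installed.

```r
# Sketch of the XML-package approach: parse the HTML, then pull every
# href attribute with an XPath expression. The HTML string is illustrative.
library(XML)

html <- '<table><tr>
  <td><a href="/en/Ships/A-8605507.html">A</a></td>
  <td><a href="/en/Ships/Aalborg-8122830.html">Aalborg</a></td>
</tr></table>'

doc   <- htmlParse(html, asText = TRUE)
hrefs <- xpathSApply(doc, "//a/@href")
unname(hrefs)
```

Because the document is parsed rather than pattern-matched, the same code keeps working if the page's whitespace or attribute order changes.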

Re: [R] Scraping a web page.

2012-05-15 Thread Gabor Grothendieck
On Tue, May 15, 2012 at 7:06 AM, Keith Weintraub wrote: > Thanks, >  That was very helpful. > > I am using readLines and grep. If grep isn't powerful enough I might end up > using the XML package but I hope that won't be necessary. > This only uses readLines and strapplyc (from gsubfn). It scra
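The readLines()/strapplyc() combination Gabor mentions can be sketched like this; the input lines are made-up stand-ins for the page source (the gsubfn package is assumed to be installed), not his original script.

```r
# Sketch: strapplyc() from gsubfn returns the parenthesized capture group
# for every match, here the 7-digit number just before ".html".
library(gsubfn)

lines <- c('<td><a href="/en/Ships/A-8605507.html">A</a></td>',
           '<td><a href="/en/Ships/Aalborg-8122830.html">Aalborg</a></td>')

ids <- unlist(strapplyc(lines, "(\\d{7})\\.html"))
ids
```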

Re: [R] Scraping a web page.

2012-05-15 Thread Keith Weintraub
Thanks, That was very helpful. I am using readLines and grep. If grep isn't powerful enough I might end up using the XML package but I hope that won't be necessary. Thanks again, KW -- On May 14, 2012, at 7:18 PM, J Toll wrote: > On Mon, May 14, 2012 at 4:17 PM, Keith Weintraub wrote: >> F

Re: [R] Scraping a web page.

2012-05-14 Thread J Toll
On Mon, May 14, 2012 at 4:17 PM, Keith Weintraub wrote: > Folks, >  I want to scrape a series of web-page sources for strings like the following: > > "/en/Ships/A-8605507.html" > "/en/Ships/Aalborg-8122830.html" > > which appear in an href inside an <a> tag inside a tag inside a table. > > In fact a

[R] Scraping a web page.

2012-05-14 Thread Keith Weintraub
Folks, I want to scrape a series of web-page sources for strings like the following: "/en/Ships/A-8605507.html" "/en/Ships/Aalborg-8122830.html" which appear in an href inside an <a> tag inside a tag inside a table. In fact all I want is the (exactly) 7-digit number before ".html". The good new
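The extraction Keith asks for can also be done in base R alone, without extra packages; a sketch using the two example URLs from the post:

```r
# Base-R sketch: match exactly seven digits immediately before ".html".
# The lookahead (?=\.html) keeps the ".html" suffix out of the match.
urls <- c("/en/Ships/A-8605507.html", "/en/Ships/Aalborg-8122830.html")

m   <- regexpr("[0-9]{7}(?=\\.html)", urls, perl = TRUE)
ids <- regmatches(urls, m)
ids
```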

Re: [R] Scraping a web page

2009-12-03 Thread Duncan Temple Lang
Hi Michael If you just want all of the text that is displayed in the HTML document, then you might use an XPath expression to get all the text() nodes and get their value. An example is doc = htmlParse("http://www.omegahat.org/") txt = xpathSApply(doc, "//body//text()", xmlValue) The resul

Re: [R] Scraping a web page

2009-12-03 Thread hadley wickham
> If you're after text, then it's probably a matter of locating the element > that encloses the data you want-- perhaps by using getNodeSet along with an > XPath[1] that specifies the element you are interested in.  The text can > then be recovered using the xmlValue() function. And rather than tr
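The getNodeSet()/xmlValue() combination quoted above can be sketched as follows; the HTML fragment and the `id` used in the XPath are illustrative, and the XML package is assumed to be installed.

```r
# Sketch: locate the enclosing element with getNodeSet(), then recover
# its displayed text with xmlValue(). The markup below is made up.
library(XML)

doc   <- htmlParse("<html><body><div id='content'>The data we want</div></body></html>",
                   asText = TRUE)
nodes <- getNodeSet(doc, "//div[@id='content']")
sapply(nodes, xmlValue)
```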

Re: [R] Scraping a web page

2009-12-03 Thread Sharpie
Michael Conklin wrote: > > I would like to be able to submit a list of URLs of various webpages and > extract the "content" i.e. not the mark-up of those pages. I can find > plenty of examples in the XML library of extracting links from pages but I > cannot seem to find a way to extract the text

Re: [R] Scraping a web page

2009-12-03 Thread Gabor Grothendieck
If you only need to grab text it can be conveniently done with lynx. This example is for Windows but it's nearly the same on other platforms: > out <- shell("lynx.bat --dump --nolist http://www.google.com", intern = TRUE) > head(out) [1] "" [2] " Web Images Videos Maps News Books Gmail more »"

[R] Scraping a web page

2009-12-03 Thread Michael Conklin
I would like to be able to submit a list of URLs of various webpages and extract the "content" i.e. not the mark-up of those pages. I can find plenty of examples in the XML library of extracting links from pages but I cannot seem to find a way to extract the text. Any help would be greatly appr