Thanks for your detailed reply,
KW
Message: 139
Date: Tue, 15 May 2012 21:02:05 -0700
From: Duncan Temple Lang
To: r-help@r-project.org
Subject: Re: [R] Scraping a web page.
Message-ID: <4fb326bd.9080...@wald.ucdavis.edu>
Content-Type: text/plain; charset=ISO-8859-1
Hi Keith
Of course, it doesn't necessarily matter how you get the job done
Thanks Gabor,
Nifty regexp. I never used strapplyc before and I am sure this will become a
nice addition to my toolkit.
KW
Message: 5
Date: Tue, 15 May 2012 07:55:33 -0400
From: Gabor Grothendieck
To: Keith Weintraub
Cc: r-help@r-project.org
Subject: Re: [R] Scraping a web page.
Message-ID
Hi Keith
Of course, it doesn't necessarily matter how you get the job done
if it actually works correctly. But for a general approach,
it is useful to use general tools, which can lead to more correct,
more robust, and more maintainable code.
Since htmlParse() in the XML package can both retrieve and parse an HTML page, it is a natural fit here.
On Tue, May 15, 2012 at 7:06 AM, Keith Weintraub wrote:
> Thanks,
> That was very helpful.
>
> I am using readLines and grep. If grep isn't powerful enough I might end up
> using the XML package but I hope that won't be necessary.
>
This only uses readLines and strapplyc (from gsubfn). It scrapes the page and extracts the 7-digit numbers in one pass.
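Gabor's readLines-plus-strapplyc approach can be sketched roughly as follows; the sample HTML lines and the exact pattern are assumptions for illustration, not the code from his original post:

```r
library(gsubfn)

# Hypothetical sample of the page source described in the thread
html <- c('<td><a href="/en/Ships/A-8605507.html">A</a></td>',
          '<td><a href="/en/Ships/Aalborg-8122830.html">Aalborg</a></td>')

# strapplyc returns the capture group -- here the 7 digits before ".html"
ids <- unlist(strapplyc(html, "/en/Ships/[^\"]*-([0-9]{7})\\.html"))
ids  # "8605507" "8122830"
```

On a real page one would first do `html <- readLines(url)` and feed that vector to strapplyc in the same way.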
Thanks,
That was very helpful.
I am using readLines and grep. If grep isn't powerful enough I might end up
using the XML package but I hope that won't be necessary.
Thanks again,
KW
--
On May 14, 2012, at 7:18 PM, J Toll wrote:
> On Mon, May 14, 2012 at 4:17 PM, Keith Weintraub wrote:
>> Folks,
On Mon, May 14, 2012 at 4:17 PM, Keith Weintraub wrote:
> Folks,
> I want to scrape a series of web-page sources for strings like the following:
>
> "/en/Ships/A-8605507.html"
> "/en/Ships/Aalborg-8122830.html"
>
> which appear in an href inside an tag inside a tag inside a table.
>
> In fact all I want is the (exactly) 7-digit number before ".html".
Folks,
I want to scrape a series of web-page sources for strings like the following:
"/en/Ships/A-8605507.html"
"/en/Ships/Aalborg-8122830.html"
which appear in an href inside an tag inside a tag inside a table.
In fact all I want is the (exactly) 7-digit number before ".html".
The good news…
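For the extraction Keith describes, readLines plus base-R regular expressions is already enough; a minimal sketch, with hypothetical sample lines standing in for the page source:

```r
# Hypothetical lines from the scraped page source
x <- c('"/en/Ships/A-8605507.html"',
       '"/en/Ships/Aalborg-8122830.html"')

# A lookahead keeps only the 7-digit number immediately before ".html"
ids <- regmatches(x, regexpr("[0-9]{7}(?=\\.html)", x, perl = TRUE))
ids  # "8605507" "8122830"
```

`regexpr` finds the first match per line and `regmatches` extracts it, so no extra packages are needed.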
Hi Michael
If you just want all of the text that is displayed in the
HTML document, then you might use an XPath expression to get
all the text() nodes and get their value.
An example is
doc = htmlParse("http://www.omegahat.org/")
txt = xpathSApply(doc, "//body//text()", xmlValue)
The result is a character vector holding the text of each text node.
> If you're after text, then it's probably a matter of locating the element
> that encloses the data you want-- perhaps by using getNodeSet along with an
> XPath[1] that specifies the element you are interest with. The text can
> then be recovered using the xmlValue() function.
And rather than tr…
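The getNodeSet/xmlValue route described above might look like this; the document and XPath expression are made-up placeholders:

```r
library(XML)

# A stand-in document; in practice htmlParse() would fetch the real page
page <- '<html><body><div id="main"><p>first</p><p>second</p></div></body></html>'
doc <- htmlParse(page, asText = TRUE)

# Locate the enclosing elements with XPath, then recover their displayed text
nodes <- getNodeSet(doc, "//div[@id='main']//p")
txt <- sapply(nodes, xmlValue)
txt  # "first" "second"
```

`xpathSApply(doc, "//div[@id='main']//p", xmlValue)` collapses the last two steps into one call.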
Michael Conklin wrote:
>
> I would like to be able to submit a list of URLs of various webpages and
> extract the "content" i.e. not the mark-up of those pages. I can find
> plenty of examples in the XML library of extracting links from pages but I
> cannot seem to find a way to extract the text
If you only need to grab text it can be conveniently done with lynx. This
example is for Windows, but it's nearly the same on other platforms:
> out <- shell("lynx.bat --dump --nolist http://www.google.com", intern = TRUE)
> head(out)
[1] ""
[2] " Web Images Videos Maps News Books Gmail more »"
I would like to be able to submit a list of URLs of various webpages and
extract the "content" i.e. not the mark-up of those pages. I can find plenty of
examples in the XML library of extracting links from pages but I cannot seem to
find a way to extract the text. Any help would be greatly appreciated.