Hi R people.

I'm currently trying to construct a piece of R code that can retrieve a list
of webpages I have stored in a csv file and save the content of the
webpages into separate txt files. I want to retrieve a total of 6,000
threads posted on a forum, to try to build/train a classifier that can tell
me whether a thread contains valuable information.

So far I have managed to get the following code to work:

library(foreign)
library(RCurl)                       # also loads the required package 'bitops'

addresses <- read.csv("~/Extract post - forum.csv")

# 'addresses' has a single column (Link), so this fetches all the URLs at once
for (i in addresses) full.text <- getURL(i)

text.sub <- gsub("<.+?>", "", full.text)   # strip the HTML tags
text <- data.frame(text.sub)

outpath <- "~/forum - RawData"
for (i in 1:nrow(text)) {
  write(as.character(text[i, 1]), file = paste(outpath, "/", i, ".txt", sep = ""))
}
(I run this on both macOS and Windows.)

This piece of code is not my own work, so I send a warm thank-you to
Christopher Gandrud and his co-authors for providing it.

*The problem*
The code works like a charm, looking up all the different addresses I have
stored in my csv file. The csv file is constructed as:
Link
"webaddress 1"
"webaddress 2"
"webaddress n"
The problem is that I get empty output files and files saying "Server
overloaded". However, I do also get files that contain the intended
information. The pattern of "bad" and "good" files is different each time
I run the code on the full set of n pages, which tells me that the code
itself is not the problem. Needless to say, it is probably my many requests
that are causing the overload, and as I am pretty new to the area I did not
expect this to be a problem. When I realised that it WAS a problem, I tried
reducing the number of requests to 100 at a time, which gave me text files
that all contained the info I wanted.
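
For what it is worth, a minimal sketch of that batching idea (assuming the
csv file and Link column from above, a chunk size of 100, and an arbitrary
5-second pause between chunks) could look something like this:

library(RCurl)

addresses <- read.csv("~/Extract post - forum.csv", stringsAsFactors = FALSE)
urls <- addresses$Link

# fetch the URLs in chunks of (at most) 100, pausing between chunks
chunks <- split(urls, ceiling(seq_along(urls) / 100))
full.text <- character(0)
for (chunk in chunks) {
  full.text <- c(full.text, getURL(chunk))
  Sys.sleep(5)   # give the server a breather before the next batch
}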

Therefore I am looking for some kind of solution to this problem. My own
best idea would be to build something into the code so that it sends x
requests at a given interval z (every 5 seconds, maybe) until it has
retrieved all n webpages in the csv file. If it fails to retrieve a
webpage, it would be nice to sort the "bad" pages into a "redo" folder,
which could then be run again afterwards.

Any type of solution is welcome. As said, I am pretty new to R coding, but
I have some coding experience with VBA.

Best
Kasper

