Perhaps ?Sys.sleep between scrapes. If this slows things down too much you may be able to parallelize by host site with ?mclapply. A rough sketch of both ideas follows below the signature.

---------------------------------------------------------------------------
Jeff Newmiller                        DCN:<jdnew...@dcn.davis.ca.us>
Research Engineer (Solar/Batteries/Software/Embedded Controllers)
---------------------------------------------------------------------------
Sent from my phone. Please excuse my brevity.
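A minimal, untested sketch of that suggestion, assuming the RCurl::getURL() workflow and the single "Link" column from the quoted post below (the 5-second pause and the host-extracting regex are illustrative choices, not tested against the actual forum):

library(RCurl)
library(parallel)  # for mclapply (mc.cores > 1 is not supported on Windows)

addresses <- read.csv("~/Extract post - forum.csv", stringsAsFactors = FALSE)
urls <- as.character(addresses$Link)   # "Link" is the column name from the csv below

# Polite serial version: pause before every request so the server is not flooded
fetch_politely <- function(url, pause = 5) {
  Sys.sleep(pause)   # assumed 5-second pause between requests; adjust as needed
  getURL(url)
}
pages <- lapply(urls, fetch_politely)

# If that is too slow, parallelize by host so that each host still sees only
# one throttled stream of requests
hosts   <- sub("^https?://([^/]+).*", "\\1", urls)
by_host <- split(urls, hosts)
pages   <- unlist(mclapply(by_host, function(u) lapply(u, fetch_politely),
                           mc.cores = length(by_host)), recursive = FALSE)

Since all 6000 threads here apparently live on one forum (one host), the serial throttled loop is the part that matters; the mclapply variant only helps when the addresses span several sites.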
Kasper Christensen <kasper2...@gmail.com> wrote:

>Hi R people.
>
>I'm currently trying to construct a piece of R code that can read a list
>of web pages I have stored in a csv file and save the content of each page
>into a separate txt file. I want to retrieve a total of 6000 threads posted
>on a forum, to build/train a classifier that can tell me whether a thread
>contains valuable information.
>
>*Until now* I have managed to get the following code to work:
>
>> library(foreign)
>> library(RCurl)
>Loading required package: bitops
>> addresses <- read.csv("~/Extract post - forum.csv")
>> for (i in addresses) full.text <- getURL(i)
>> text.sub <- gsub("<.+?>", "", full.text)
>> text <- data.frame(text.sub)
>> outpath <- "~/forum - RawData"
>> x <- 1:nrow(text)
>> for(i in x) {
>+   write(as.character(text[i,1]), file = paste(outpath,"/",i,".txt",sep=""))
>+ }
>
>(I have both Mac OS X and Windows.)
>
>This piece of code is not my own work, and I therefore send a warm thank
>you to Christopher Gandrud and co-authors for providing it.
>
>*The problem*
>The code works like a charm, looking up all the different addresses I have
>stored in my csv file. I constructed the csv file as:
>
>Link
>"webaddress 1"
>"webaddress 2"
>"webaddress n"
>
>The problem is that I get empty output files and files saying "Server
>overloaded". However, I do also get files that contain the intended
>information. The pattern of "bad" and "good" files differs each time I run
>the code over the full n, which tells me that it is not the code itself
>that is the problem. Needless to say, it is probably my many requests that
>are causing the overload; being pretty new in this area, I did not expect
>that to be a problem. When I realized that it WAS a problem, I tried
>reducing the number of requests to 100 at a time, which gave me text files
>containing all the information I wanted.
>
>Therefore I am looking for some kind of solution to this problem. My own
>best idea would be to build something into the code that makes it send x
>requests at a given interval z (5 seconds, maybe) until I have retrieved
>the total n of web pages in the csv file. If it fails to retrieve a web
>page, it would be nice to sort the "bad" text files into a "redo" folder
>which could then be run afterwards.
>
>Any type of solution is welcome. As said, I am pretty new to R coding, but
>I have some coding experience with VBA.
>
>Best
>Kasper

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
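For the batching/"redo" idea described in the quoted post, a rough, untested sketch along the same lines as the original loop might look like the following. The "Server overloaded" string comes from the post itself; treating download errors via tryCatch and writing the leftover addresses to a redo.txt file are assumptions about how the failures can be detected and collected, not tested code.

library(RCurl)

addresses <- read.csv("~/Extract post - forum.csv", stringsAsFactors = FALSE)
urls    <- as.character(addresses$Link)
outpath <- "~/forum - RawData"
redo    <- character(0)              # addresses that failed and should be retried

for (i in seq_along(urls)) {
  page <- tryCatch(getURL(urls[i]), error = function(e) NA_character_)
  if (is.na(page) || grepl("Server overloaded", page, fixed = TRUE)) {
    redo <- c(redo, urls[i])         # collect failures for a later "redo" run
  } else {
    text <- gsub("<.+?>", "", page)  # strip HTML tags as in the original code
    write(text, file = file.path(outpath, paste0(i, ".txt")))
  }
  Sys.sleep(5)                       # wait 5 seconds between requests
}

writeLines(redo, file.path(outpath, "redo.txt"))  # rerun the loop over these later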