Hi R people. I'm currently trying to construct a piece of R code that can read a list of web pages I have stored in a CSV file and save the content of each page to a separate .txt file. I want to retrieve a total of 6000 threads posted on a forum, in order to build/train a classifier that can tell me whether a thread contains valuable information.
Until now I have managed to get the following code to work (I have both macOS and Windows):

library(foreign)
library(RCurl)    # prints "Loading required package: bitops"

# read the list of web addresses
addresses <- read.csv("~/Extract post - forum.csv")

# download the pages (each i is a whole column of the data frame,
# so getURL fetches all the URLs in one call)
for (i in addresses) full.text <- getURL(i)

# strip the HTML tags and write one .txt file per page
text.sub <- gsub("<.+?>", "", full.text)
text <- data.frame(text.sub)
outpath <- "~/forum - RawData"
x <- 1:nrow(text)
for (i in x) {
  write(as.character(text[i, 1]), file = paste(outpath, "/", i, ".txt", sep = ""))
}

This piece of code is not my own work, so I send a warm thank-you to Christopher Gandrud and co-authors for providing it.

The problem

The code works like a charm, looking up all the different addresses I have stored in my CSV file. The CSV file is constructed as:

Link
"webaddress 1"
"webaddress 2"
"webaddress n"

The problem is that I get empty output files and files saying "Server overloaded". I do also get files that contain the intended information, but the pattern of "bad" and "good" files differs each time I run the code over the full n, which tells me that the code itself is not the problem. Needless to say, it is probably my many requests that are causing the overload; as I am fairly new to this area I did not expect that to be an issue. When I realized that it WAS a problem, I tried reducing the number of requests to 100 at a time, and then every text file contained the information I wanted.

Therefore I am looking for some kind of solution to this problem. My own best idea would be to build something into the code that sends x requests at a given interval z (every 5 seconds, maybe) until the total n of web pages in the CSV file has been retrieved. If a page fails to download, it would also be nice to sort the "bad" text files into a "redo" folder that could then be run again afterwards. Any type of solution is welcome. As said, I am pretty new to R coding, but I have some coding experience with VBA.

Best
Kasper
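Something along these lines is what I imagine, although I have not got it working (the column name "Link", the 5-second pause, the redo folder path, and the check for the string "Server overloaded" are just my guesses):

library(RCurl)

addresses <- read.csv("~/Extract post - forum.csv")
urls      <- as.character(addresses$Link)    # assuming the column is called "Link"
outpath   <- "~/forum - RawData"
redopath  <- "~/forum - RawData/redo"        # failed downloads noted here for a second run

for (i in seq_along(urls)) {
  # fetch one page at a time; an error in the request gives an empty string
  page <- tryCatch(getURL(urls[i]), error = function(e) "")
  text <- gsub("<.+?>", "", page)

  # treat empty pages and "Server overloaded" replies as failures
  if (nchar(text) == 0 || grepl("Server overloaded", text)) {
    write(urls[i], file = paste(redopath, "/", i, ".txt", sep = ""))
  } else {
    write(text, file = paste(outpath, "/", i, ".txt", sep = ""))
  }

  Sys.sleep(5)    # pause 5 seconds between requests to avoid overloading the server
}

The idea is that a second pass over the URLs saved in the redo folder could then fetch the pages that failed the first time.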