On 19.08.2011 15:50, Paul Hiemstra wrote:
On 08/17/2011 10:53 PM, Alex Ruiz Euler wrote:

Dear R community,

I have a 2 million by 2 matrix that looks like this:

x <- sample(1:15, 2000000, replace=T)
y <- sample(1:10*1000, 2000000, replace=T)

        x     y
 [1,]  10  4000
 [2,]   3  1000
 [3,]   3  4000
 [4,]   8  6000
 [5,]   2  9000
 [6,]   3  8000
 [7,]   2 10000
 (...)

The first column is a population expansion factor for the number in the second column (household income). I want to expand the second column by the first, so that I end up with a vector beginning with 10 observations of 4000, then 3 observations of 1000, and so on.

In my mind the natural approach would be to create a NULL vector and append the expansions:

myvar <- NULL
myvar <- append(myvar, replicate(x[1], y[1]), 1)
for (i in 2:length(x)) {
  myvar <- append(myvar, replicate(x[i], y[i]), sum(x[1:i]) + 1)
}

to end up with a vector of length sum(x), which in my real database corresponds to 22 million observations. This works fine -- if I only run it for the first, say, 1000 observations. If I try to perform it on all 2 million observations it takes far too long to be useful (I left it running for 11 hours yesterday to no avail). I know R performs well with operations on relatively large vectors. Why is this so inefficient, and what would be the smart way to do it?

Hi Alex,

The other reply already gave you the R way of doing this while avoiding the for loop. However, there is a more general reason why your for loop is terribly inefficient. A small set of examples:

largeVector = runif(10e4)
outputVector = NULL
system.time(for(i in 1:length(largeVector)) {
Please do teach people to use seq_along(largeVector) rather than 1:length(largeVector) (the latter is not safe in the case of length-0 objects).
Uwe Ligges
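A minimal illustration of Uwe's point (not part of the original thread): for a zero-length vector, the colon operator counts *down*, so the loop body runs over indices that do not exist, while seq_along() produces an empty index set.

empty <- numeric(0)
1:length(empty)   # 1:0 gives c(1, 0) -- two bogus iterations over empty
seq_along(empty)  # integer(0) -- the loop body is never entered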
  outputVector = append(outputVector, largeVector[i] + 1)
})
#  user  system elapsed
# 6.591   0.168   6.786

The problem with this code is that outputVector keeps on growing and growing. The operating system needs to allocate more and more space as the object grows, and this process is really slow. Several (much) faster alternatives exist:

# Pre-allocating outputVector
outputVector = rep(0, length(largeVector))
system.time(for(i in 1:length(largeVector)) {
  outputVector[i] = largeVector[i] + 1
})
#  user  system elapsed
# 0.178   0.000   0.178
# A speed-up of 37 times; this will only increase for larger
# lengths of largeVector

# Using apply functions
system.time(outputVector <- sapply(largeVector, function(x) return(x + 1)))
#  user  system elapsed
# 0.124   0.000   0.125
# Even a bit faster

# Using vectorisation
system.time(outputVector <- largeVector + 1)
#  user  system elapsed
# 0.000   0.000   0.001
# Practically instant, 6780 times faster than the first example

It is not always clear which method is most suitable and which performs best, but all of them perform much, much better than the naive option of letting outputVector grow.

cheers,
Paul

Thanks in advance,
Alex

______________________________________________
[email protected] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
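[Editorial note, not part of the original thread: the vectorised expansion that "the other reply" presumably gave can be sketched with rep() and its vector-valued times argument, shown here on a small hypothetical sample of Alex's data.]

# Small hypothetical version of Alex's two columns
x <- c(10, 3, 3)           # expansion factors
y <- c(4000, 1000, 4000)   # household incomes
# rep() repeats each y[i] exactly x[i] times, replacing the
# entire append() loop with a single vectorised call
myvar <- rep(y, times = x)
length(myvar) == sum(x)    # TRUE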

