When doing repeated regressions on large data sets, I'm finding that the time spent on garbage collection often exceeds the time spent on the regression itself. Consider this test program, which I'm running on an Intel Haswell i7-4470 processor under Linux 3.13 using R 3.1.2 compiled with ICPC 14.1:

nate@haswell:~$ cat > gc.R
library(speedglm)

createData <- function(n) {
    int <- -5
    x <- rnorm(n, 50, 7)
    e <- rnorm(n, 0, 1)
    y <- int + (1.2 * x) + e
    return(data.frame(y, x))
}

gc.time()
data <- createData(500000)
data.y <- as.matrix(data[1])
data.x <- model.matrix(y ~ ., data)
for (i in 1:100) speedglm.wfit(X=data.x, y=data.y, family=gaussian())
gc.time()

nate@haswell:~$ time Rscript gc.R
Loading required package: Matrix
Loading required package: methods
[1] 0 0 0 0 0
[1] 10.410  0.024 10.441  0.000  0.000

real    0m17.167s
user    0m16.996s
sys     0m0.176s

The total execution time is 17 seconds, and the time spent on garbage collection is almost 2/3 of that. My actual use case is a package that creates an ensemble from a variety of cross-validated regressions, and exhibits the same poor performance. Is this expected behavior?

I've found that I can reduce the garbage collection time to a tolerable level by setting the R_VSIZE environment value to a large enough value:

nate@haswell:~$ time R_VSIZE=1GB Rscript gc.R
Loading required package: Matrix
Loading required package: methods
[1] 0 0 0 0 0
[1] 0.716 0.025 0.739 0.000 0.000

real    0m7.694s
user    0m7.388s
sys     0m0.309s

I can do slightly better with even higher values, and by using R_GC_MEM_GROW=3. But while using the environment variables solves the issue for me, I fear that the end users of my package won't be able to set them. Is there a way that I can achieve the higher performance from within R rather than from the command line?

Thanks!

--nate

______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
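For anyone who wants to reproduce the measurement inside a single R session rather than with the shell's `time`, here is a minimal sketch. The helper name `gc_share` is hypothetical and not part of the script above; it only relies on base R's `gc.time()` and `proc.time()`, and assumes the third element of `gc.time()` is elapsed seconds, as in the output shown.

```r
# Hypothetical helper (not from the original script): evaluate `expr` and
# report total elapsed time, elapsed time spent in GC, and the GC share.
gc_share <- function(expr) {
  gc_before <- gc.time()            # cumulative GC timings so far
  t_before  <- proc.time()
  force(expr)                       # the lazily passed expression runs here
  total <- (proc.time() - t_before)[["elapsed"]]
  gc_el <- (gc.time() - gc_before)[3]   # element 3 = elapsed GC seconds
  share <- if (total > 0) gc_el / total else NA_real_
  c(elapsed = total, gc = gc_el, share = share)
}

# Example workload: repeated allocation of large vectors.
res <- gc_share(for (i in 1:20) tmp <- rnorm(1e6))
print(res)
```

Wrapping the `speedglm.wfit` loop in `gc_share(...)` should give the same kind of breakdown as the two `gc.time()` calls in gc.R, without needing to subtract by hand.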