Re: [Rd] Build R with MKL and ICC
As a short and simple approach, I just compiled the current R release on Ubuntu with ICC and MKL using just this:

$ tar -xzf R-3.2.2.tar.gz
$ cd R-3.2.2
$ CC=icc CXX=icpc AR=xiar LD=xild CFLAGS="-g -O3 -xHost" CXXFLAGS="-g -O3 -xHost" ./configure --with-blas="-lmkl_rt -lpthread" --with-lapack --enable-memory-profiling --enable-R-shlib
$ make
$ sudo make install
$ R --version
R version 3.2.2 (2015-08-14) -- "Fire Safety"

If you have 'ifort' available, you would probably want to add it to the list of environment variables.

--nate
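A quick way to confirm that the MKL link actually took effect is to time a large matrix multiply from within the freshly built R; something along these lines (the size is arbitrary, just large enough for the BLAS call to dominate):

# Rough sanity check: with a multithreaded MKL this typically finishes in a
# small fraction of the time the reference BLAS needs on the same machine.
n <- 2000
A <- matrix(rnorm(n * n), n, n)
system.time(B <- A %*% A)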
[Rd] sample.int() algorithms
I was experiencing a strange pattern of slowdowns when using sample.int(), where sampling from one population would sometimes take 1000x longer than taking the same number of samples from a slightly larger population. For my application, this resulted in a runtime of several hours rather than a few seconds. Looking into it, I saw that sample.int() is hardcoded to switch algorithms when the population is larger than 1e7, and I was straddling this boundary:

sample.int <- function(n, size = n, replace = FALSE, prob = NULL)
{
    if (!replace && is.null(prob) && n > 1e7 && size <= n/2)
        .Internal(sample2(n, size))
    else .Internal(sample(n, size, replace, prob))
}

do_sample2() takes the approach of drawing a candidate and then checking whether it duplicates an earlier draw. As long as the population size is much larger than the number of samples, this is efficient, which explains the check for "size <= n/2". But I'm not sure why the "n > 1e7" check is needed.

I put together some sample code to show the difference in timing when letting sample.int() choose the algorithm versus manually specifying do_sample2():

### compare times for sample.int() vs internal function sample2()
compareSampleTimes = function(popSizeList=c(1e5, 1e6, 1e7, 1e8, 1e9),
                              sampleSizeList=c(10, 100, 1000, 1),
                              numReplications=1000) {
  for (sampleSize in sampleSizeList) {
    for (popSize in popSizeList) {
      elapsed1 = system.time(replicate(numReplications, sample.int(popSize, sampleSize)))[["elapsed"]]
      elapsed2 = system.time(replicate(numReplications, .Internal(sample2(popSize, sampleSize))))[["elapsed"]]
      cat(sprintf("Sample %d from %.0e: %f vs %f seconds\n", sampleSize, popSize, elapsed1, elapsed2))
    }
    cat("\n")
  }
}
compareSampleTimes()

https://gist.github.com/nkurz/8fa6ff3772a054294531

And here's the output showing the 1000x slowdowns at a population size of 1e7 under R-3.2.2:

$ Rscript compareSampleTimes.R
Sample 10 from 1e+05: 0.133000 vs 0.003000 seconds
Sample 10 from 1e+06: 0.931000 vs 0.003000 seconds
Sample 10 from 1e+07: 13.190000 vs 0.003000 seconds
Sample 10 from 1e+08: 0.004000 vs 0.003000 seconds
Sample 10 from 1e+09: 0.004000 vs 0.002000 seconds

Sample 100 from 1e+05: 0.180000 vs 0.007000 seconds
Sample 100 from 1e+06: 0.908000 vs 0.006000 seconds
Sample 100 from 1e+07: 13.161000 vs 0.007000 seconds
Sample 100 from 1e+08: 0.007000 vs 0.006000 seconds
Sample 100 from 1e+09: 0.007000 vs 0.006000 seconds

Sample 1000 from 1e+05: 0.194000 vs 0.057000 seconds
Sample 1000 from 1e+06: 1.084000 vs 0.049000 seconds
Sample 1000 from 1e+07: 13.226000 vs 0.049000 seconds
Sample 1000 from 1e+08: 0.047000 vs 0.046000 seconds
Sample 1000 from 1e+09: 0.048000 vs 0.047000 seconds

Sample 1 from 1e+05: 0.414000 vs 0.712000 seconds
Sample 1 from 1e+06: 1.100000 vs 0.453000 seconds
Sample 1 from 1e+07: 14.882000 vs 0.443000 seconds
Sample 1 from 1e+08: 0.448000 vs 0.443000 seconds
Sample 1 from 1e+09: 0.445000 vs 0.443000 seconds

Since my usage involves taking samples of 1K from populations of about 10M, the do_sample2() approach is the clear winner: 0.05 seconds vs 13 seconds. I tested on a couple of machines and got similar results on both. Would there be a downside to having sample.int() use the faster do_sample2() approach for populations of all sizes whenever the ratio of population size to sample size is large?

--nate
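For readers unfamiliar with the internals, the rejection idea can be sketched in plain R. This is only an illustration of the strategy, not the actual C code behind do_sample2(), and the function name is invented:

# Sampling without replacement by rejection: draw uniformly from 1..n and
# retry on duplicates.  Runtime is nearly independent of n, but degrades
# badly as size approaches n -- hence the "size <= n/2" guard.  The real
# code does the duplicate check in C far more efficiently than %in% here.
sample2_sketch <- function(n, size) {
  out <- numeric(size)
  for (i in seq_len(size)) {
    repeat {
      candidate <- floor(runif(1, 1, n + 1))
      if (!candidate %in% out[seq_len(i - 1)]) break
    }
    out[i] <- candidate
  }
  out
}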
Re: [Rd] Build R with MKL and ICC
Hi Arnaud --

I'm glad it's working for you. I'm not sure I understand your final answer. Are you saying that the version I posted worked for you as given, or that you had to remove some of the other options? Perhaps you could post the full final recipe in a way that others can copy and paste if they find this thread in the future?

Also, did you determine whether the same approach to linking MKL works when using a non-Intel compiler? That is, can you substitute the gcc or clang tool names and get the same result?

--nate

On Tue, Sep 29, 2015 at 9:07 AM, arnaud gaboury wrote:
> On Wed, Sep 9, 2015 at 11:26 PM, Nathan Kurz wrote:
>>
>> As a short and simple approach, I just compiled the current R release
>> on Ubuntu with ICC and MKL using just this:
>>
>> $ tar -xzf R-3.2.2.tar.gz
>> $ cd R-3.2.2
>> $ CC=icc CXX=icpc AR=xiar LD=xild CFLAGS="-g -O3 -xHost" CXXFLAGS="-g
>> -O3 -xHost" ./configure --with-blas="-lmkl_rt -lpthread" --with-lapack
>> --enable-memory-profiling --enable-R-shlib
>> $ make
>> $ sudo make install
>> $ R --version
>> R version 3.2.2 (2015-08-14) -- "Fire Safety"
>
> That is exactly the right combo: with-blas="-lmkl_rt -lpthread"
> Nothing more for $MKL
>
> now
> $ ldd bin/exec/R
>     linux-vdso.so.1 (0x00007ffe305f9000)
>     libmkl_rt.so => /opt/intel/mkl/lib/intel64_lin/libmkl_rt.so (0x00007f216c9e3000)
>     .
[Rd] Cost of garbage collection seems excessive
When doing repeated regressions on large data sets, I'm finding that the time spent on garbage collection often exceeds the time spent on the regression itself. Consider this test program, which I'm running on an Intel Haswell i7-4470 processor under Linux 3.13 using R 3.1.2 compiled with ICPC 14.1:

nate@haswell:~$ cat > gc.R
library(speedglm)
createData <- function(n) {
  int <- -5
  x <- rnorm(n, 50, 7)
  e <- rnorm(n, 0, 1)
  y <- int + (1.2 * x) + e
  return(data.frame(y, x))
}
gc.time()
data <- createData(500000)
data.y <- as.matrix(data[1])
data.x <- model.matrix(y ~ ., data)
for (i in 1:100) speedglm.wfit(X=data.x, y=data.y, family=gaussian())
gc.time()

nate@haswell:~$ time Rscript gc.R
Loading required package: Matrix
Loading required package: methods
[1] 0 0 0 0 0
[1] 10.410  0.024 10.441  0.000  0.000

real    0m17.167s
user    0m16.996s
sys     0m0.176s

The total execution time is 17 seconds, and the time spent on garbage collection is almost 2/3 of that. My actual use case is a package that creates an ensemble from a variety of cross-validated regressions, and it exhibits the same poor performance. Is this expected behavior?

I've found that I can reduce the garbage collection time to a tolerable level by setting the R_VSIZE environment variable to a large enough value:

nate@haswell:~$ time R_VSIZE=1GB Rscript gc.R
Loading required package: Matrix
Loading required package: methods
[1] 0 0 0 0 0
[1] 0.716 0.025 0.739 0.000 0.000

real    0m7.694s
user    0m7.388s
sys     0m0.309s

I can do slightly better with even higher values, and by using R_GC_MEM_GROW=3. But while using the environment variables solves the issue for me, I fear that the end users of my package won't be able to set them. Is there a way that I can achieve the higher performance from within R rather than from the command line?

Thanks!

--nate
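One way to watch this from inside R is gcinfo(), which prints a line for every collection; running the same workload with and without R_VSIZE set makes the extra collections visible. Something along these lines (the workload here is just an arbitrary stand-in that churns the vector heap):

# gcinfo(TRUE) reports each garbage collection as it happens; gc.time()
# then gives the cumulative cost for the session.
gcinfo(TRUE)
x <- replicate(20, rnorm(1e6))   # stand-in allocation-heavy workload
gcinfo(FALSE)
gc.time()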
Re: [Rd] Request to speed up save()
On Thu, Jan 15, 2015 at 11:08 AM, Simon Urbanek wrote:
> In addition to the major points that others made: if you care about speed,
> don't use compression. With today's fast disks it's an order of magnitude
> slower to use compression:
>
>> d=lapply(1:10, function(x) as.integer(rnorm(1e7)))
>> system.time(saveRDS(d, file="test.rds.gz"))
>    user  system elapsed
>  17.210   0.148  17.397
>> system.time(saveRDS(d, file="test.rds", compress=F))
>    user  system elapsed
>   0.482   0.355   0.929
>
> The above example is intentionally well compressible, in real life the
> differences are actually even bigger. As people that deal with big data know
> well, disks are no longer the bottleneck - it's the CPU now.

Respectfully, while your example would imply this, I don't think it is correct in the general case. Much faster compression schemes exist, and using them can improve disk I/O tremendously. Some schemes are so fast that it's quicker to transfer compressed data from main RAM to CPU cache and then decompress it than to be limited by RAM bandwidth: https://github.com/Blosc/c-blosc

Repeating that for emphasis: compressing and uncompressing can actually be faster than a straight memcpy()!

Really, the issue is that 'gzip' and 'bzip2' are the bottlenecks. As Stewart suggests, this can be mitigated by throwing more cores at the problem. That isn't a bad solution, as there are often excess underutilized cores. But it would be much better to choose a faster compression scheme first, and then parallelize that across cores if still necessary. Sometimes the tradeoff is between amount of compression and speed, and sometimes some algorithms are just faster than others. Here's some sample data for the test file that your example creates:

> d=lapply(1:10, function(x) as.integer(rnorm(1e7)))
> system.time(saveRDS(d, file="test.rds", compress=F))
  user  system elapsed
 0.554   0.336   0.890

nate@ubuntu:~/R/rds$ ls -hs test.rds
382M test.rds

nate@ubuntu:~/R/rds$ time gzip -c test.rds > test.rds.gz
real: 16.207 sec
nate@ubuntu:~/R/rds$ ls -hs test.rds.gz
35M test.rds.gz
nate@ubuntu:~/R/rds$ time gunzip -c test.rds.gz > discard
real: 2.330 sec

nate@ubuntu:~/R/rds$ time gzip -c --fast test.rds > test.rds.gz
real: 4.759 sec
nate@ubuntu:~/R/rds$ ls -hs test.rds.gz
56M test.rds.gz
nate@ubuntu:~/R/rds$ time gunzip -c test.rds.gz > discard
real: 2.942 sec

nate@ubuntu:~/R/rds$ time pigz -c test.rds > test.rds.gz
real: 2.180 sec
nate@ubuntu:~/R/rds$ ls -hs test.rds.gz
35M test.rds.gz
nate@ubuntu:~/R/rds$ time gunzip -c test.rds.gz > discard
real: 2.375 sec

nate@ubuntu:~/R/rds$ time pigz -c --fast test.rds > test.rds.gz
real: 0.739 sec
nate@ubuntu:~/R/rds$ ls -hs test.rds.gz
57M test.rds.gz
nate@ubuntu:~/R/rds$ time gunzip -c test.rds.gz > discard
real: 2.851 sec

nate@ubuntu:~/R/rds$ time lz4c test.rds > test.rds.lz4
Compressed 400000102 bytes into 125584749 bytes ==> 31.40%
real: 1.024 sec
nate@ubuntu:~/R/rds$ ls -hs test.rds.lz4
120M test.rds.lz4
nate@ubuntu:~/R/rds$ time lz4 test.rds.lz4 > discard
Compressed 125584749 bytes into 95430573 bytes ==> 75.99%
real: 0.775 sec

Reading that last one more closely: with single-threaded lz4 compression, we're getting 3x compression at about 400 MB/s and decompression at about 500 MB/s. That is faster than almost any single disk will be, and multithreaded implementations will make even the fastest RAID the bottleneck.
It's probably worth noting that the speeds reported in your simple example for the uncompressed case are likely the speed of writing to memory, with the actual write to disk happening at some later time. Sustained throughput will likely be slower than your example would imply.

If saving data to disk is a bottleneck, I think Stewart is right that there is a lot of room for improvement.

--nate
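For anyone who wants to measure the tradeoff without leaving R, saveRDS() already exposes several compressors, so a comparison is easy to script; for example:

# Compare the compressors built into saveRDS() on the same test object.
# Absolute times and sizes will vary by machine; the point is only that
# the choice of algorithm, not compression per se, dominates the cost.
d <- lapply(1:10, function(x) as.integer(rnorm(1e7)))
for (cmp in c("gzip", "bzip2", "xz")) {
  f <- tempfile()
  t <- system.time(saveRDS(d, file = f, compress = cmp))[["elapsed"]]
  cat(sprintf("%-6s %7.2f sec %7.1f MB\n", cmp, t, file.info(f)$size / 2^20))
  unlink(f)
}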
Re: [Rd] default min-v/nsize parameters
On Thu, Jan 15, 2015 at 3:55 PM, Michael Lawrence wrote:
> Just wanted to start a discussion on whether R could ship with more
> appropriate GC parameters.

I've been doing a number of similar measurements, and have come to the same conclusion. R is currently very conservative about memory usage, and this leads to unnecessarily poor performance on certain problems. Changing the defaults to sizes that are more appropriate for modern machines can often produce a 2x speedup.

On Sat, Jan 17, 2015 at 8:39 AM, wrote:
> Martin Morgan discussed this a year or so ago and as I recall bumped
> up these values to the current defaults. I don't recall details about
> why we didn't go higher -- maybe Martin does.

I just checked, and it doesn't seem that any of the relevant values have been increased in the last ten years. Do you have a link to the discussion you recall so we can see why the changes weren't made?

> I suspect the main concern would be with small memory machines in student labs
> and less developed countries.

While a reasonable concern, I'm doubtful there are many machines for which the current numbers are optimal. The current minimum size increases for the node and vector heaps are 40KB and 80KB respectively. The increment grows with the heap (min + .05 * heap), but it still means that we do many more expensive garbage collections while growing than we need to. Paradoxically, the SMALL_MEMORY compile option (which is suggested for computers with up to 32MB of RAM) has slightly larger minimum increments of 50KB and 100KB.

I think we'd get a significant benefit for most users by being less conservative about memory consumption. The exact sizes should be discussed, but with RAM costing about $10/GB it doesn't seem unreasonable to assume most machines running R have multiple GB installed, and those that don't will quite likely be running an OS that needs a custom compiled binary anyway.

I could be way off, but a 10MB start with 1MB minimum increments for SMALL_MEMORY, a 100MB start with 10MB increments for NORMAL_MEMORY, and a 1GB start with 100MB increments for LARGE_MEMORY might be a reasonable spread. Or one could go even larger, noting that on most systems overcommitted memory is not a problem until it is used: until we write to it, it doesn't consume physical RAM, just virtual address space. Or we could stay small, but make it possible to programmatically increase the granularity from within R.

For ease of reference, here are the relevant sections of code:

https://github.com/wch/r-source/blob/master/src/include/Defn.h#L217
(ripley last authored on Jan 26, 2000 / pd last authored on May 8, 1999)

217 #ifndef R_NSIZE
218 #define R_NSIZE 350000L
219 #endif
220 #ifndef R_VSIZE
221 #define R_VSIZE 6291456L
222 #endif

https://github.com/wch/r-source/blob/master/src/main/startup.c#L169
(ripley last authored on Jun 9, 2004)

157     Rp->vsize = R_VSIZE;
158     Rp->nsize = R_NSIZE;

166 #define Max_Nsize 50000000  /* about 1.4Gb 32-bit, 2.8Gb 64-bit */
167 #define Max_Vsize R_SIZE_T_MAX  /* unlimited */

169 #define Min_Nsize 220000
170 #define Min_Vsize (1*Mega)

https://github.com/wch/r-source/blob/master/src/main/memory.c#L335
(luke last authored on Nov 1, 2000)

    #ifdef SMALL_MEMORY
336 /* On machines with only 32M of memory (or on a classic Mac OS port)
337    it might be a good idea to use settings like these that are more
338    aggressive at keeping memory usage down. */
339 static double R_NGrowIncrFrac = 0.0, R_NShrinkIncrFrac = 0.2;
340 static int R_NGrowIncrMin = 50000, R_NShrinkIncrMin = 0;
341 static double R_VGrowIncrFrac = 0.0, R_VShrinkIncrFrac = 0.2;
342 static int R_VGrowIncrMin = 100000, R_VShrinkIncrMin = 0;
343 #else
344 static double R_NGrowIncrFrac = 0.05, R_NShrinkIncrFrac = 0.2;
345 static int R_NGrowIncrMin = 40000, R_NShrinkIncrMin = 0;
346 static double R_VGrowIncrFrac = 0.05, R_VShrinkIncrFrac = 0.2;
347 static int R_VGrowIncrMin = 80000, R_VShrinkIncrMin = 0;
348 #endif

static void AdjustHeapSize(R_size_t size_needed)
{
    R_size_t R_MinNFree = (R_size_t)(orig_R_NSize * R_MinFreeFrac);
    R_size_t R_MinVFree = (R_size_t)(orig_R_VSize * R_MinFreeFrac);
    R_size_t NNeeded = R_NodesInUse + R_MinNFree;
    R_size_t VNeeded = R_SmallVallocSize + R_LargeVallocSize + size_needed + R_MinVFree;
    double node_occup = ((double) NNeeded) / R_NSize;
    double vect_occup = ((double) VNeeded) / R_VSize;

    if (node_occup > R_NGrowFrac) {
        R_size_t change = (R_size_t)(R_NGrowIncrMin + R_NGrowIncrFrac * R_NSize);
        if (R_MaxNSize >= R_NSize + change)
            R_NSize += change;
    }
    else if (node_occup < R_NShrinkFrac) {
        R_NSize -= (R_NShrinkIncrMin + R_NShrinkIncrFrac * R_NSize);
        if (R_NSize < NNeeded)
            R_NSize = (NNeeded < R_MaxNSize) ? NNeeded : R_MaxNSize;
        if (R_NSize < orig_R_NSize)
            R_NSize = orig_R_NSize;
    }

    if (vect_occup > 1.0 && VNeeded < R_MaxVSize)
        R_VSize = VNeeded;
    if
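As a practical footnote, the start-up sizes under discussion are already adjustable without recompiling, via the R_NSIZE/R_VSIZE environment variables or the --min-nsize/--min-vsize command-line options, and gc() shows where the heap currently sits; for example:

# gc(verbose = TRUE) reports the current heap sizes and the "gc trigger" at
# which the next collection will run, so the effect of larger start-up
# values is easy to observe by comparing a stock session with one started
# as, say, R_NSIZE=10M R_VSIZE=1GB R.
gc(verbose = TRUE)
Sys.getenv(c("R_NSIZE", "R_VSIZE"))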
Re: [Rd] [PATCH] Makefile: add support for git svn clones
On Mon, Jan 19, 2015 at 1:00 PM, Felipe Balbi wrote:
> I just thought that such a small patch which causes no visible change to
> SVN users and allow for git users to build R would be acceptable, but if
> it isn't, that's fine too.

Felipe ---

It would appear that you are unaware that you are walking a minefield of entrenched positions and personality conflicts. For those like myself who are mystified by the positions taken in this thread, a partial back story may be helpful.

In 2012, Han-Tak Leung reported a problem compiling the development version of R that he had checked out using git's svn compatibility feature:
https://stat.ethz.ch/pipermail/r-devel/2012-October/065133.html

In 2013, Brian Ripley applied a patch with the comment "trap HK Leung misuse" explicitly to prevent users from being able to do this:
https://github.com/wch/r-source/commit/4f13e5325dfbcb9fc8f55fc6027af9ae9c7750a3

Shortly thereafter, Han-Tak tried to start a discussion on this list about that patch, suggesting that preventing the use of non-SVN mirrors reduced the frequency with which development versions would be tested:
https://stat.ethz.ch/pipermail/r-devel/2013-March/066128.html

The opinions expressed on the thread were universally against Leung. Peter Dalgaard summarized as:

"The generic point is that you are given access to a working tool that is internal to the core R developers. We are not putting restrictions on what you do with that access, but if you want to play the game by other rules than we do, you need to take the consequences. If things don't work and you start complaining about them being "broken", steps may be taken to make it clearer who broke them."
https://stat.ethz.ch/pipermail/r-devel/2013-March/066131.html

As a newcomer hoping to contribute to R who had already encountered this same compilation issue and considered it a bug, I am astounded to learn that it is instead desired and intentional behavior.

--nate
[Rd] Recycling memory with a small free list
I'm trying to improve the performance of the update loop within a logistic regression, and am struggling against the overhead of memory allocation and garbage collection. The main issue I'd like to solve is with assignments inside of loops like this:

reweight = function(iter, w, Q) {
  for (i in 1:iter) {
    wT = w * Q
  }
}

If the matrix Q is large, I can get a significant gain in performance if I can reuse the same allocation for wT rather than making a new allocation on each loop iteration. While I can solve this problem in this case with an Rcpp function that does the modifications in place, this seems like a case where a more general solution might work well.

While normally we don't know whether an allocation has gone out of scope, the left-hand side of an assignment is a frequent exception. Instead of simply letting the LHS become unreferenced, has the possibility of adding a small free list been considered? That is, before the RHS is executed, the LHS allocation would be added to a small fixed-length list of available space which is checked before future allocations. If the same size is requested before the next garbage collection, the allocation is short-circuited and the existing space is reused. This list could be very small, possibly even only a single entry. Entries would only be put on the list if they have no other references. If a garbage collection is triggered, the list would be emptied and the contents collected.

While my understanding of R's memory strategy is still limited, it looks like this would be a reasonably straightforward patch that could touch only the assignment operator, the vector allocator, and the collection routine. Overall memory usage should go down, and in the cases I'm considering, the performance improvement would be large.

Measurement details are here:
http://stackoverflow.com/questions/28532493/reusing-existing-memory-for-blas-operations-in-r

Are there downsides or difficulties to this approach that I'm missing?

Nathan Kurz
n...@verse.com
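To put rough numbers on the churn, the loop can be timed against gc.time(); for example (sizes here are arbitrary, chosen only to make the allocations dominate the loop overhead):

# Each pass through the loop allocates a fresh vector for wT and abandons
# the previous one, so the collector runs repeatedly during the timing
# below even though only one result buffer is ever live at a time.
reweight <- function(iter, w, Q) {
  for (i in 1:iter) {
    wT <- w * Q
  }
}
w <- runif(5e6)
Q <- runif(5e6)
gc0 <- gc.time()
print(system.time(reweight(100, w, Q)))
print(gc.time() - gc0)   # portion of that time spent in the collector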
Re: [Rd] Recycling memory with a small free list
On Wed, Feb 18, 2015 at 7:19 AM, Radford Neal wrote:
>> ... with assignments inside of loops like this:
>>
>> reweight = function(iter, w, Q) {
>>   for (i in 1:iter) {
>>     wT = w * Q
>>   }
>> }
>>
>> ... before the RHS is executed, the LHS allocation would be added
>> to a small fixed length list of available space which is checked
>> before future allocations. If the same size is requested before the
>> next garbage collection, the allocation is short-circuited and the
>> allocation is reused. This list could be very small, possibly even
>> only a single entry. Entries would only be put on the list if they
>> have no other references.

Here's an article about the benefits of this approach in Go that might explain it better than I was able to:
https://blog.cloudflare.com/recycling-memory-buffers-in-go/

Their charts explain the goal very clearly: stabilize at a smaller amount of memory to reduce churn, which improves performance in a myriad of ways.

> Reusing the LHS storage immediately isn't possible in general, because
> evaluation of the RHS might produce an error, in which case the LHS
> variable is supposed to be unchanged.

What's the guarantee R actually makes? What's an example of a use case where this behaviour would be required? More generally, can one not assume "a = NULL; a = func()" is equivalent to "a = func()" unless func() references 'a' or has it as an argument? Or is the difficulty that there is no way to know in advance whether it will be referenced?

> Detecting special cases where
> there is guaranteed to be no error, or at least no error after the
> first modification to newly allocated memory, might be too
> complicated.

Yes, if required, the complexity of guaranteeing this might well rule out the approach I suggested.

> Putting the LHS storage on a small free list for later reuse (only
> after the old value of the variable will definitely be replaced) seems
> more promising (then one would need only two copies for examples such
> as above, with them being used in alternate iterations).

OK, let's consider that potentially easier option instead: do nothing immediately, but add a small queue for recycling from which the temporary might be drawn. It has slightly worse cache behavior, but should handle most of the issues with memory churn.

> However,
> there's a danger of getting carried away and essentially rewriting
> malloc. To avoid this, one might try just calling "free" on the
> no-longer-needed object, letting "malloc" then figure out when it can
> be re-used.

Yes, I think that's what I was anticipating: add a free() equivalent that does nothing if the object has multiple references/names, but adds the object to a small fixed-size "free list" if it does not. Perhaps this is done only for certain types, or only for objects above a certain size. When requesting memory, allocVector() or perhaps R_alloc() does a quick check of that "free list" to see if it has anything of the exact requested size. If it does, it short-circuits and recycles it. If it doesn't, normal allocation takes place.

The "free list" is stored as two small fixed-size arrays containing size/address pairs. Searching is done linearly using code that optimizes to SIMD comparisons. For 4/8/16 slots, the overhead of the search should be unmeasurably small.

The key to the approach would be keeping it simple, and realizing that the goal is only to get the lowest-hanging fruit: repeated assignments of large arrays used in a loop. If it's complex, skip it --- the behavior will be no worse than current.
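To make the bookkeeping concrete, here is a toy sketch at the R level. The helper names are invented, and the real change would of course live in R's C allocator rather than in user code, so this only illustrates the size-matching idea:

# A tiny fixed-size pool holds abandoned vectors keyed by length.  A request
# for a matching length short-circuits allocation, anything else falls
# through to a fresh allocation, and a full pool simply drops the buffer
# and lets the GC reclaim it.
make_pool <- function(slots = 4) {
  pool <- vector("list", slots)
  list(
    take = function(n) {
      for (i in seq_along(pool)) {
        buf <- pool[[i]]
        if (!is.null(buf) && length(buf) == n) {
          pool[i] <<- list(NULL)   # empty the slot but keep its position
          return(buf)
        }
      }
      numeric(n)                   # miss: fall back to a normal allocation
    },
    give = function(buf) {
      for (i in seq_along(pool)) {
        if (is.null(pool[[i]])) {
          pool[[i]] <<- buf
          return(invisible(TRUE))
        }
      }
      invisible(FALSE)             # pool full: let the GC have it
    }
  )
}

p <- make_pool()
p$give(numeric(1e6))
length(p$take(1e6))   # 1000000: the pooled buffer is handed back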
By the way, what's happening with Luke's refcnt patches? From the outside, they seem like a great improvement.
http://homepage.stat.uiowa.edu/~luke/talks/dsc2014.pdf
http://developer.r-project.org/Refcnt.html
Are they slated to become the standard approach? Are they going to be dropped? Will both approaches be kept in parallel?

> Unfortunately, that seems not to be safe, because it's
> possible that there is a reference to the no-longer-needed object on
> the PROTECT stack, even though no one should actually be looking at
> it any more.

Can you explain this case? I don't think I understand it.

> In the current version of pqR (see pqR-project.org), modifications are
> (often) done in place for statements such as w = w * Q, but not
> currently when the LHS variable does not appear on the RHS.

Yes, I looked at it earlier, and was excited to see that Luke had ported half of your approach to standard R:
https://github.com/wch/r-source/blob/trunk/src/main/arithmetic.h#L65

But only the RHS temporary-variable optimizations made it over. Your LHS "w = w * Q" optimizations did not, and I didn't see any discussion of why. Was it attempted and were issues found?

I think what I'm suggesting is complementary to that. Direct reuse is best if it can be detected, but recycling will provide more opportunities for optimization. Of course, what I'm suggesting is quite obvious, and I presume it's part of what he includes