Re: [Rd] Build R with MKL and ICC

2015-09-09 Thread Nathan Kurz
As a short and simple approach, I just compiled the current R release
on Ubuntu with ICC and MKL using just this:

$ tar -xzf R-3.2.2.tar.gz
$ cd R-3.2.2
$ CC=icc CXX=icpc AR=xiar LD=xild CFLAGS="-g -O3 -xHost" CXXFLAGS="-g -O3 -xHost" \
    ./configure --with-blas="-lmkl_rt -lpthread" --with-lapack \
    --enable-memory-profiling --enable-R-shlib
$ make
$ sudo make install
$ R --version
R version 3.2.2 (2015-08-14) -- "Fire Safety"

If you have 'ifort' available, you would probably want to add it to
the list of environment variables.
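
As a rough sanity check from within R that the MKL BLAS is actually in
use (only a timing heuristic, not proof of linkage), a large matrix
multiply should be far faster than with the reference BLAS:

    n <- 4000
    a <- matrix(rnorm(n * n), n, n)
    system.time(b <- a %*% a)   # typically well under a second with a tuned BLAS
    La_version()                # LAPACK version string R was built against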

--nate



[Rd] sample.int() algorithms

2015-09-09 Thread Nathan Kurz
I was experiencing a strange pattern of slowdowns when using
sample.int(), where sampling from one population would sometimes
take 1000x longer than taking the same number of samples from a
slightly larger population.  For my application, this resulted in a
runtime of several hours rather than a few seconds.  Looking into it,
I saw that sample.int() is hardcoded to switch algorithms when the
population is larger than 1e+7, and I was straddling this boundary:

sample.int <- function(n, size = n, replace = FALSE, prob = NULL)
{
    if (!replace && is.null(prob) && n > 1e7 && size <= n/2)
        .Internal(sample2(n, size))
    else .Internal(sample(n, size, replace, prob))
}

do_sample2() works by drawing each sample and then checking whether it
duplicates one already drawn.  As long as the population size is much
larger than the number of samples, this is efficient, which explains
the check for "size <= n/2".  But I'm not sure why the "n > 1e7"
check is needed.
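
To illustrate the idea, here is a toy R sketch of that strategy (the
real do_sample2() is implemented in C and is far more efficient):

    ## draw 'size' distinct values from 1..n by repeatedly sampling and
    ## rejecting duplicates; efficient only when size is much less than n
    sampleByRejection <- function(n, size) {
        picked <- integer(0)
        while (length(picked) < size) {
            candidate <- ceiling(n * runif(1))
            if (!(candidate %in% picked))     # reject duplicates
                picked <- c(picked, candidate)
        }
        picked
    }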

I put together some sample code to show the difference in timing
between letting sample.int() choose the algorithm and manually
specifying the use of do_sample2():

###  compare times for sample.int() vs internal function sample2()
compareSampleTimes = function(popSizeList=c(1e5, 1e6, 1e7, 1e8, 1e9),
                              sampleSizeList=c(10, 100, 1000, 10000),
                              numReplications=1000) {
    for (sampleSize in sampleSizeList) {
        for (popSize in popSizeList) {
            elapsed1 = system.time(replicate(numReplications,
                sample.int(popSize, sampleSize)))[["elapsed"]]
            elapsed2 = system.time(replicate(numReplications,
                .Internal(sample2(popSize, sampleSize))))[["elapsed"]]
            cat(sprintf("Sample %d from %.0e: %f vs %f seconds\n",
                sampleSize, popSize, elapsed1, elapsed2))
        }
        cat("\n")
    }
}

compareSampleTimes()

https://gist.github.com/nkurz/8fa6ff3772a054294531

And here's the output showing the 1000x slowdowns at population sizes
of 1e+7 under R-3.2.2:

$ Rscript compareSampleTimes.R
Sample 10 from 1e+05: 0.133000 vs 0.003000 seconds
Sample 10 from 1e+06: 0.931000 vs 0.003000 seconds
Sample 10 from 1e+07: 13.190000 vs 0.003000 seconds
Sample 10 from 1e+08: 0.004000 vs 0.003000 seconds
Sample 10 from 1e+09: 0.004000 vs 0.002000 seconds

Sample 100 from 1e+05: 0.180000 vs 0.007000 seconds
Sample 100 from 1e+06: 0.908000 vs 0.006000 seconds
Sample 100 from 1e+07: 13.161000 vs 0.007000 seconds
Sample 100 from 1e+08: 0.007000 vs 0.006000 seconds
Sample 100 from 1e+09: 0.007000 vs 0.006000 seconds

Sample 1000 from 1e+05: 0.194000 vs 0.057000 seconds
Sample 1000 from 1e+06: 1.084000 vs 0.049000 seconds
Sample 1000 from 1e+07: 13.226000 vs 0.049000 seconds
Sample 1000 from 1e+08: 0.047000 vs 0.046000 seconds
Sample 1000 from 1e+09: 0.048000 vs 0.047000 seconds

Sample 10000 from 1e+05: 0.414000 vs 0.712000 seconds
Sample 10000 from 1e+06: 1.100000 vs 0.453000 seconds
Sample 10000 from 1e+07: 14.882000 vs 0.443000 seconds
Sample 10000 from 1e+08: 0.448000 vs 0.443000 seconds
Sample 10000 from 1e+09: 0.445000 vs 0.443000 seconds

Since my usage involves taking samples of 1K from populations of about
10M, the do_sample2() approach is the clear winner: 0.05 seconds vs 13
seconds.  I tested on a couple of machines, and got similar results on
both.  Would there be a downside to having sample.int() use the faster
do_sample2() approach for populations of all sizes whenever the ratio
of population size to sample size is large?
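
For concreteness, here's the kind of dispatch I have in mind (only a
sketch: it keeps the existing "size <= n/2" test and drops the absolute
population-size check, and .Internal(sample2()) is of course not a
supported user-level interface):

    sampleFast <- function(n, size, replace = FALSE, prob = NULL) {
        ## use the duplicate-checking algorithm whenever the sample is at
        ## most half the population, regardless of the population's size
        if (!replace && is.null(prob) && size <= n/2)
            .Internal(sample2(n, size))
        else .Internal(sample(n, size, replace, prob))
    }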

--nate



Re: [Rd] Build R with MKL and ICC

2015-09-29 Thread Nathan Kurz
Hi Arnaud --

I'm glad it's working for you.  I'm not sure I understand your final
answer.  Are you saying that the version I posted worked for you as
given, or that you had to remove some of the other options?

Perhaps you could post the full final recipe in a way that others can
copy and paste if they find this thread in the future?

Also, did you determine whether the same approach to linking MKL works
when using a non-Intel compiler?  That is, can you substitute the gcc
or clang tool names and get the same result?

--nate

On Tue, Sep 29, 2015 at 9:07 AM, arnaud gaboury wrote:
> On Wed, Sep 9, 2015 at 11:26 PM, Nathan Kurz  wrote:
>>
>> As a short and simple approach, I just compiled the current R release
>> on Ubuntu with ICC and MKL using just this:
>>
>> $ tar -xzf R-3.2.2.tar.gz
>> $ cd R-3.2.2
>> $ CC=icc CXX=icpc AR=xiar LD=xild CFLAGS="-g -O3 -xHost" CXXFLAGS="-g
>> -O3 -xHost" ./configure --with-blas="-lmkl_rt -lpthread" --with-lapack
>> --enable-memory-profiling --enable-R-shlib
>> $ make
>> $ sudo make install
>> $ R --version
>> R version 3.2.2 (2015-08-14) -- "Fire Safety"
>
>
> That is exactly the right combo: with-blas="-lmkl_rt -lpthread"
> Nothing more for $MKL
>
> now
> $ ldd bin/exec/R
> linux-vdso.so.1 (0x7ffe305f9000)
> libmkl_rt.so => /opt/intel/mkl/lib/intel64_lin/libmkl_rt.so 
> (0x7f216c9e3000)
> .



[Rd] Cost of garbage collection seems excessive

2015-01-09 Thread Nathan Kurz
When doing repeated regressions on large data sets, I'm finding that
the time spent on garbage collection often exceeds the time spent on
the regression itself.  Consider this test program, which I'm running
on an Intel Haswell i7-4470 processor under Linux 3.13 using R 3.1.2
compiled with ICPC 14.1:

nate@haswell:~$ cat > gc.R
  library(speedglm)
  createData <- function(n) {
      int <- -5
      x <- rnorm(n, 50, 7)
      e <- rnorm(n, 0, 1)
      y <- int + (1.2 * x) + e
      return(data.frame(y, x))
  }
  gc.time()
  data <- createData(500000)
  data.y <- as.matrix(data[1])
  data.x <- model.matrix(y ~ ., data)
  for (i in 1:100) speedglm.wfit(X=data.x, y=data.y, family=gaussian())
  gc.time()

nate@haswell:~$ time Rscript gc.R
 Loading required package: Matrix
 Loading required package: methods
 [1] 0 0 0 0 0
 [1] 10.410  0.024 10.441  0.000  0.000
real 0m17.167s
user 0m16.996s
sys 0m0.176s

The total execution time is 17 seconds, and the time spent on garbage
collection is almost 2/3 of that.  My actual use case is a package
that creates an ensemble from a variety of cross-validated
regressions, and exhibits the same poor performance. Is this expected
behavior?
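
For reference, here is a quick way to confirm that the collector is
the culprit, using only base R tools (this assumes the same data.x and
data.y objects as in the script above):

    gcinfo(TRUE)            # print a message on every garbage collection
    invisible(gc(reset = TRUE))
    for (i in 1:100) speedglm.wfit(X=data.x, y=data.y, family=gaussian())
    gcinfo(FALSE)
    gc()                    # report heap sizes and current 'gc trigger' values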

I've found that I can reduce the garbage collection time to a
tolerable level by setting the R_VSIZE environment variable to a large
enough value:

nate@haswell:~$ time R_VSIZE=1GB Rscript gc.R
Loading required package: Matrix
Loading required package: methods
[1] 0 0 0 0 0
[1] 0.716 0.025 0.739 0.000 0.000
real 0m7.694s
user 0m7.388s
sys 0m0.309s

I can do slightly better with even higher values, and by using
R_GC_MEM_GROW=3.  But while using the environment variables solves the
issue for me, I fear that the end users of my package won't be able to
set them.   Is there a way that I can achieve the higher performance
from within R rather than from the command line?

Thanks!

--nate



Re: [Rd] Request to speed up save()

2015-01-15 Thread Nathan Kurz
On Thu, Jan 15, 2015 at 11:08 AM, Simon Urbanek wrote:
> In addition to the major points that others made: if you care about speed, 
> don't use compression. With today's fast disks it's an order of magnitude 
> slower to use compression:
>
>> d=lapply(1:10, function(x) as.integer(rnorm(1e7)))
>> system.time(saveRDS(d, file="test.rds.gz"))
>user  system elapsed
>  17.210   0.148  17.397
>> system.time(saveRDS(d, file="test.rds", compress=F))
>user  system elapsed
>   0.482   0.355   0.929
>
> The above example is intentionally well compressible, in real life the 
> differences are actually even bigger. As people that deal with big data know 
> well, disks are no longer the bottleneck - it's the CPU now.

Respectfully, while your example would imply this, I don't think this
is correct in the general case.  Much faster compression schemes
exist, and using these can improve disk I/O tremendously.  Some
schemes are so fast that it's even quicker to transfer compressed
data from main RAM to CPU cache and then decompress it than to be
limited by RAM bandwidth: https://github.com/Blosc/c-blosc

Repeating that for emphasis: compressing and decompressing can
actually be faster than a straight memcpy()!

Really, the issue is that 'gzip' and 'bzip2' are bottlenecks.  As
Stewart suggests, this can be mitigated by throwing more cores at the
problem.  This isn't a bad solution, as there are often excess
underutilized cores.  But much better would be to choose a faster
compression scheme first, and then parallelize that across cores if
still necessary.
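
For comparison within R itself, saveRDS() already exposes the built-in
alternatives, which makes the size/speed tradeoff easy to measure.  A
sketch (numbers will vary by machine):

    d <- lapply(1:10, function(x) as.integer(rnorm(1e7)))
    for (comp in list(FALSE, "gzip", "bzip2", "xz")) {
        f <- tempfile(fileext = ".rds")
        secs <- system.time(saveRDS(d, file = f, compress = comp))[["elapsed"]]
        cat(sprintf("compress=%-5s: %6.1f MB in %6.2f seconds\n",
                    as.character(comp), file.info(f)$size / 2^20, secs))
        unlink(f)
    }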

Sometimes the tradeoff is between amount of compression and speed, and
sometimes some algorithms are just faster than others.   Here's some
sample data for the test file that your example creates:

> d=lapply(1:10, function(x) as.integer(rnorm(1e7)))
> system.time(saveRDS(d, file="test.rds", compress=F))
   user  system elapsed
  0.554   0.336   0.890

nate@ubuntu:~/R/rds$ ls -hs test.rds
382M test.rds
nate@ubuntu:~/R/rds$ time gzip -c test.rds > test.rds.gz
real: 16.207 sec
nate@ubuntu:~/R/rds$ ls -hs test.rds.gz
35M test.rds.gz
nate@ubuntu:~/R/rds$ time gunzip -c test.rds.gz > discard
real: 2.330 sec

nate@ubuntu:~/R/rds$ time gzip -c --fast test.rds > test.rds.gz
real: 4.759 sec
nate@ubuntu:~/R/rds$ ls -hs test.rds.gz
56M test.rds.gz
nate@ubuntu:~/R/rds$ time gunzip -c test.rds.gz > discard
real: 2.942 sec

nate@ubuntu:~/R/rds$ time pigz -c  test.rds > test.rds.gz
real: 2.180 sec
nate@ubuntu:~/R/rds$ ls -hs test.rds.gz
35M test.rds.gz
nate@ubuntu:~/R/rds$ time gunzip -c test.rds.gz > discard
real: 2.375 sec

nate@ubuntu:~/R/rds$ time pigz -c --fast test.rds > test.rds.gz
real: 0.739 sec
nate@ubuntu:~/R/rds$ ls -hs test.rds.gz
57M test.rds.gz
nate@ubuntu:~/R/rds$ time gunzip -c test.rds.gz > discard
real: 2.851 sec

nate@ubuntu:~/R/rds$ time lz4c test.rds > test.rds.lz4
Compressed 40102 bytes into 125584749 bytes ==> 31.40%
real: 1.024 sec
nate@ubuntu:~/R/rds$ ls -hs test.rds.lz4
120M test.rds.lz4
nate@ubuntu:~/R/rds$ time lz4 test.rds.lz4 > discard
Compressed 125584749 bytes into 95430573 bytes ==> 75.99%
real: 0.775 sec

Reading that last one more closely, with single threaded lz4
compression, we're getting 3x compression at about 400MB/s, and
decompression at about 500MB/s.   This is faster than almost any
single disk will be.  Multithreaded implementations will make even the
fastest RAID be the bottleneck.
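
For what it's worth, a faster or parallel compressor can already be
used from R today by writing through a pipe connection.  A sketch,
assuming pigz is installed and on the PATH:

    ## stream the serialized object through pigz using 4 threads
    con <- pipe("pigz -p4 --fast > test.rds.gz", open = "wb")
    saveRDS(d, con)       # 'd' as in the examples above
    close(con)
    ## readRDS() reads the gzip-compressed result back transparently
    d2 <- readRDS("test.rds.gz")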

It's probably worth noting that the speeds reported in your simple
example for the uncompressed case are likely the speed of writing to
memory, with the actual write to disk happening at some later time.
Sustained throughput will likely be slower than your example would
imply.

If saving data to disk is a bottleneck, I think Stewart is right that
there is a lot of room for improvement.

--nate



Re: [Rd] default min-v/nsize parameters

2015-01-17 Thread Nathan Kurz
On Thu, Jan 15, 2015 at 3:55 PM, Michael Lawrence wrote:
> Just wanted to start a discussion on whether R could ship with more
> appropriate GC parameters.

I've been doing a number of similar measurements, and have come to the
same conclusion.  R is currently very conservative about memory usage,
and this leads to unnecessarily poor performance on certain problems.
Changing the defaults to sizes that are more appropriate for modern
machines can often produce a 2x speedup.

On Sat, Jan 17, 2015 at 8:39 AM,   wrote:
> Martin Morgan discussed this a year or so ago and as I recall bumped
> up these values to the current defaults. I don't recall details about
> why we didn't go higher -- maybe Martin does.

I just checked, and it doesn't seem that any of the relevant values
have been increased in the last ten years.  Do you have a link to the
discussion you recall so we can see why the changes weren't made?

> I suspect the main concern would be with small memory machines in student labs
> and less developed countries.

While a reasonable concern, I'm doubtful there are many machines for
which the current numbers are optimal.  The current minimum size
increases for the node and vector heaps are 40KB and 80KB respectively.
This grows as the heap grows (min + .05 * heap), but still means that
we do many more expensive garbage collections while growing than we
need to.  Paradoxically, the SMALL_MEMORY compile option (which is
suggested for computers with up to 32MB of RAM) has slightly larger
minimums, at 50KB and 100KB.
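
A rough way to see this from within R (only an illustration; results
will vary):

    gcinfo(TRUE)          # print a message on every garbage collection
    x <- replicate(200, rnorm(1e5), simplify = FALSE)   # ~160MB of doubles
    gcinfo(FALSE)
    gc()                  # shows the grown heap sizes and 'gc trigger' values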

I think we'd get significant benefit for most users by being less
conservative about memory consumption.  The exact sizes should be
discussed, but with RAM costing about $10/GB it doesn't seem
unreasonable to assume most machines running R have multiple GB
installed, and those that don't will quite likely be running an OS
that needs a custom compiled binary anyway.

I could be way off, but a 10MB start with 1MB minimum increments for
SMALL_MEMORY, a 100MB start with 10MB increments for NORMAL_MEMORY,
and a 1GB start with 100MB increments for LARGE_MEMORY might be a
reasonable spread.

Or one could go even larger, noting that on most systems,
overcommitted memory is not a problem until it is used.  Until we
write to it, it doesn't actually use physical RAM, just virtual
address space.  Or we could stay small, but make it possible to
programmatically increase the granularity from within R.

For ease of reference, here are the relevant sections of code:

https://github.com/wch/r-source/blob/master/src/include/Defn.h#L217
(ripley last authored on Jan 26, 2000 / pd last authored on May 8, 1999)
217  #ifndef R_NSIZE
218  #define R_NSIZE 350000L
219  #endif
220  #ifndef R_VSIZE
221  #define R_VSIZE 6291456L
222  #endif

https://github.com/wch/r-source/blob/master/src/main/startup.c#L169
(ripley last authored on Jun 9, 2004)
157 Rp->vsize = R_VSIZE;
158 Rp->nsize = R_NSIZE;
166  #define Max_Nsize 50000000 /* about 1.4Gb 32-bit, 2.8Gb 64-bit */
167  #define Max_Vsize R_SIZE_T_MAX /* unlimited */
169  #define Min_Nsize 220000
170  #define Min_Vsize (1*Mega)

https://github.com/wch/r-source/blob/master/src/main/memory.c#L335
(luke last authored on Nov 1, 2000)
335  #ifdef SMALL_MEMORY
336  /* On machines with only 32M of memory (or on a classic Mac OS port)
337  it might be a good idea to use settings like these that are more
338  aggressive at keeping memory usage down. */
339  static double R_NGrowIncrFrac = 0.0, R_NShrinkIncrFrac = 0.2;
340  static int R_NGrowIncrMin = 50000, R_NShrinkIncrMin = 0;
341  static double R_VGrowIncrFrac = 0.0, R_VShrinkIncrFrac = 0.2;
342  static int R_VGrowIncrMin = 100000, R_VShrinkIncrMin = 0;
343  #else
344  static double R_NGrowIncrFrac = 0.05, R_NShrinkIncrFrac = 0.2;
345  static int R_NGrowIncrMin = 40000, R_NShrinkIncrMin = 0;
346  static double R_VGrowIncrFrac = 0.05, R_VShrinkIncrFrac = 0.2;
347  static int R_VGrowIncrMin = 80000, R_VShrinkIncrMin = 0;
348  #endif

static void AdjustHeapSize(R_size_t size_needed)
{
    R_size_t R_MinNFree = (R_size_t)(orig_R_NSize * R_MinFreeFrac);
    R_size_t R_MinVFree = (R_size_t)(orig_R_VSize * R_MinFreeFrac);
    R_size_t NNeeded = R_NodesInUse + R_MinNFree;
    R_size_t VNeeded = R_SmallVallocSize + R_LargeVallocSize +
        size_needed + R_MinVFree;
    double node_occup = ((double) NNeeded) / R_NSize;
    double vect_occup = ((double) VNeeded) / R_VSize;

    if (node_occup > R_NGrowFrac) {
        R_size_t change = (R_size_t)(R_NGrowIncrMin + R_NGrowIncrFrac * R_NSize);
        if (R_MaxNSize >= R_NSize + change)
            R_NSize += change;
    }
    else if (node_occup < R_NShrinkFrac) {
        R_NSize -= (R_NShrinkIncrMin + R_NShrinkIncrFrac * R_NSize);
        if (R_NSize < NNeeded)
            R_NSize = (NNeeded < R_MaxNSize) ? NNeeded : R_MaxNSize;
        if (R_NSize < orig_R_NSize)
            R_NSize = orig_R_NSize;
    }

    if (vect_occup > 1.0 && VNeeded < R_MaxVSize)
        R_VSize = VNeeded;
    if

Re: [Rd] [PATCH] Makefile: add support for git svn clones

2015-01-19 Thread Nathan Kurz
On Mon, Jan 19, 2015 at 1:00 PM, Felipe Balbi  wrote:
> I just thought that such a small patch which causes no visible change to
> SVN users and allow for git users to build R would be acceptable, but if
> it isn't, that's fine too.

Felipe ---

It would appear that you are unaware that you are walking a minefield
of entrenched positions and personality conflicts.  For those like
myself who are mystified by the positions taken in this thread, a
partial back story may be helpful.

In 2012, Han-Tak Leung reported a problem compiling the development
version of R that he had checked out using git's svn compatibility
feature: https://stat.ethz.ch/pipermail/r-devel/2012-October/065133.html

In 2013, Brian Ripley applied a patch with the comment "trap HK Leung
misuse" explicitly to prevent users from being able to do this:
https://github.com/wch/r-source/commit/4f13e5325dfbcb9fc8f55fc6027af9ae9c7750a3

Shortly thereafter, Han-Tak tried to start discussion on this list
about that patch, suggesting that preventing the use of non-SVN
mirrors reduced the frequency with which development versions would be
tested:
https://stat.ethz.ch/pipermail/r-devel/2013-March/066128.html

The opinions expressed on the thread were universally against Leung.
Peter Dalgaard summarized as:
"The generic point is that you are given access to a working tool that
is internal to the core R developers. We are not putting restrictions
on what you do with that access, but if you want to play the game by
other rules than we do, you need to take the consequences. If things
don't work and you start complaining about them being "broken", steps
may be taken to make it clearer who broke them."
https://stat.ethz.ch/pipermail/r-devel/2013-March/066131.html

As a newcomer hoping to contribute to R who had already encountered
this same compilation issue and considered it a bug, I am astounded
to learn that it is instead desired and intentional behavior.

--nate



[Rd] Recycling memory with a small free list

2015-02-17 Thread Nathan Kurz
I'm trying to improve the performance of the update loop within a
logistic regression, and am struggling against the overhead of memory
allocation and garbage collection.   The main issue I'd like to solve
is with assignments inside of loops like this:

reweight = function(iter, w, Q) {
    for (i in 1:iter) {
        wT = w * Q
    }
}

If the matrix Q is large I can get a significant gain in performance
if I can reuse the same allocation for wT rather than making a new
allocation on each loop iteration.

While I can solve this problem in this case with an Rcpp function that
does the modifications in place, this seems like a case where a more
general solution might work well.   While normally we don't know
whether an allocation has gone out of scope, the left-hand-side of an
assignment is a frequent exception.  Instead of simply letting the LHS
become unreferenced, has the possibility of adding a small free-list
been considered?

That is, before the RHS is executed, the LHS allocation would be added
to a small fixed-length list of available space which is checked
before future allocations.  If an allocation of the same size is
requested before the next garbage collection, the request is
short-circuited and the space is reused.  This list could be very
small, possibly even only a single entry.  Entries would only be put
on the list if they have no other references.  If a garbage collection
is triggered, the list would be emptied and its contents collected.
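
As a toy R-level illustration of the bookkeeping I have in mind (the
real change would live in R's C allocator; the names here are made up):

    makeFreeList <- function(slots = 4L) {
        pool  <- vector("list", slots)     # the small fixed-size free list
        sizes <- rep(0L, slots)            # 0 marks an empty slot
        list(
            release = function(x) {        # stash a no-longer-referenced buffer
                i <- match(0L, sizes, nomatch = 0L)
                if (i > 0L) { pool[[i]] <<- x; sizes[i] <<- length(x) }
                invisible(NULL)
            },
            acquire = function(n) {        # the allocator checks the list first
                i <- match(n, sizes, nomatch = 0L)
                if (i == 0L) return(numeric(n))   # miss: allocate normally
                out <- pool[[i]]
                pool[i] <<- list(NULL); sizes[i] <<- 0L
                out                               # exact-size hit: recycle it
            }
        )
    }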

While my understanding of R's memory strategy is still limited, it
looks like this would be a reasonably straightforward patch that could
touch only the assignment operator, the vector allocator, and the
collection routine.   Overall memory usage should go down, and in the
cases I'm considering, the performance improvement would be large.
Measurement details are here:
http://stackoverflow.com/questions/28532493/reusing-existing-memory-for-blas-operations-in-r

Are there downsides or difficulties to this approach that I'm missing?

Nathan Kurz
n...@verse.com



Re: [Rd] Recycling memory with a small free list

2015-02-18 Thread Nathan Kurz
On Wed, Feb 18, 2015 at 7:19 AM, Radford Neal  wrote:
>> ... with assignments inside of loops like this:
>>
>> reweight = function(iter, w, Q) {
>>   for (i in 1:iter) {
>> wT = w * Q
>>   }
>> }
>> ... before the RHS is executed, the LHS allocation would be added
>> to a small fixed length list of available space which is checked
>> before future allocations.   If the same size is requested before the
>> next garbage collection, the allocation is short-circuited and the
>> allocation is reused.   This list could be very small, possibly even
>> only a single entry.  Entries would only be put on the list if they
>> have no other references.

Here's an article about the benefits of this approach in Go that might
explain better than I was able:
https://blog.cloudflare.com/recycling-memory-buffers-in-go/
Their charts explain the goal very clearly: stabilize at a smaller
amount of memory to reduce churn, which improves performance in a
myriad of ways.

> Reusing the LHS storage immediately isn't possible in general, because
> evaluation of the RHS might produce an error, in which case the LHS
> variable is supposed to be unchanged.

What's the guarantee R actually makes?  What's an example of a use
case where this behaviour would be required?  More generally, can one
not assume "a = NULL; a = func()" is equivalent to "a = func()" unless
func() references 'a' or has it as an argument?  Or is the difficulty
that there is no way to know in advance if it will be referenced?
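
For concreteness, the guarantee in question looks like this in plain R:

    a <- 1:3
    try(a <- stop("error while evaluating the RHS"))
    print(a)    # still 1:3 -- the old value survives the failed assignment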

> Detecting special cases where
> there is guaranteed to be no error, or at least no error after the
> first modification to newly allocated memory, might be too
> complicated.

Yes, if required, the complexity of guaranteeing this might  well rule
out the approach I suggested.

> Putting the LHS storage on a small free list for later reuse (only
> after the old value of the variable will definitely be replaced) seems
> more promising (then one would need only two copies for examples such
> as above, with them being used in alternate iterations).

OK, let's consider that potentially easier option instead:  do nothing
immediately, but add a small queue for recycling from which the
temporary might be drawn.   It has slightly worse cache behavior, but
should handle most of the issues with memory churn.

> However,
> there's a danger of getting carried away and essentially rewriting
> malloc.  To avoid this, one might try just calling "free" on the
> no-longer-needed object, letting "malloc" then figure out when it can
> be re-used.

Yes, I think that's what I was anticipating:  add a free() equivalent
that does nothing if the object has multiple references/names, but
adds the object to small fixed size "free list" if it does not.
Perhaps this is only for certain types or for objects above a certain
size.

When requesting memory, allocvector() or perhaps R_alloc() does a
quick check of that "free list" to see if it has anything of the exact
requested size.  If it does, it short circuits and recycles it.  If it
doesn't, normal allocation takes place.

The "free list" is stored as two small fixed size arrays containing
size/address pairs.   Searching is done linearly using code that
optimizes to SIMD comparisons.   For 4/8/16 slots overhead of the
search should be unmeasurably fast.

The key to the approach would be keeping it simple, and realizing that
the goal is only to get the lowest hanging fruit:  repeated
assignments of large arrays used in a loop.  If it's complex, skip it
--- the behavior will be no worse than current.

By the way, what's happening with Luke's refcnt patches?  From the
outside, they seem like a great improvement.
http://homepage.stat.uiowa.edu/~luke/talks/dsc2014.pdf
http://developer.r-project.org/Refcnt.html
Are they slated to become the standard approach?  Are they going to be dropped?
 Will both approaches be kept in parallel?

> Unfortunately, that seems not to be safe, because it's
> possible that there is a reference to the no-longer-needed object on
> the PROTECT stack, even though no one should actually be looking at
> it any more.

Can you explain this case?   I don't think I understand it.

> In the current version of pqR (see pqR-project.org), modifications are
> (often) done in place for statements such as w = w * Q, but not
> curretly when the LHS variable does not appear on the RHS.

Yes, I looked at it earlier, and was excited to see that Luke had
ported half of your approach to standard R:
https://github.com/wch/r-source/blob/trunk/src/main/arithmetic.h#L65

But only the RHS temporary-variable optimizations made it over.  Your
LHS "w = w * Q" optimization did not, but I didn't see any discussion
of why.  Was it attempted, and were issues found?

I think what I'm suggesting is complementary to that.  Direct reuse
is best if it can be detected, but recycling will provide more
opportunities for optimization.  Of course, what I'm suggesting is
fairly obvious, and I presume it's part of what he includes