On 10/1/21 6:07 PM, Brodie Gaslam via R-devel wrote:
On Thursday, September 30, 2021, 01:25:02 PM EDT, wrote:
On Thu, 30 Sep 2021, brodie gaslam via R-devel wrote:
André,
I'm not an R core member, but happen to have looked a little bit at this
issue myself. I've seen similar things on Skylake and Coffee Lake 2
(9700, one generation past your latest) too. I think it would make sense
to have some handling of this, although I would want to see the trade-off
against the performance impact on CPUs that are not affected by this, and on
vectors that don't actually contain NAs, and similar. I think the performance
impact is likely to be small so long as branch prediction is effective, but
since branch prediction is involved you might need to test with different
ratios of NAs (not so much for your NA bailout branch itself, but e.g. for the
interaction between what you add and the existing `na.rm=TRUE` logic).
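For concreteness, the kind of early-exit being discussed looks roughly like
the sketch below. This is illustrative C only, not R's actual summation
code, which among other things also has to track the NA-versus-NaN
distinction and the `na.rm` path:

    #include <math.h>
    #include <stddef.h>

    /* Sketch of an "NA bailout": once the running sum is NaN/NA it can
       only stay NaN/NA, so the remaining additions (slow on x87 once a
       NaN is involved) can be skipped.  The per-iteration isnan() check
       is the branch whose cost would need to be measured. */
    double sum_bailout(const double *x, size_t n)
    {
        long double s = 0.0;
        for (size_t i = 0; i < n; i++) {
            s += x[i];
            if (isnan((double) s))
                return (double) s;   /* result is already NaN/NA */
        }
        return (double) s;
    }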
I would want to see realistic examples where this matters, not
microbenchmarks, before thinking about complicating the code. Not all
but most cases where sum(x) returns NaN/NA would eventually result in
an error; getting to the error faster is not likely to be useful.
That's a very good point, and I'll admit I did not consider it
sufficiently. There are examples such as `rowSums`/`colSums` where only some
rows/columns evaluate to NA, so the result as a whole still contains meaningful
data. The same goes for any loop that applies `sum` to list elements where
some might contain NAs and others not; `tapply` or any other group-based
aggregation comes to mind.
My understanding is that arm64 does not support proper long doubles
(they are the same as regular doubles).
Mine is the same.
Then there is the issue that "long double" (where it is not equivalent to
"double") is implemented differently on different platforms, with
different properties. We have run into that on Power, where "long
double" is implemented as a sum of doubles ("double-double"). If
we could rely just on "double", we would not have to worry about such
things.
So code using long doubles isn't getting the hoped-for improved
precision. Since that architecture is becoming more common we should
probably be looking at replacing uses of long doubles with better
algorithms that can work with regular doubles, e.g. Kahan summation or
variants for sum.
This is probably the bigger issue. If the future is ARM/AMD, the value of
Intel x87-only optimizations becomes questionable.
More general is the question of whether to completely replace long
double with algorithmic precision methods, at the cost of performance on
systems that do support hardware long doubles (Intel or otherwise), or
whether both code pathways should be kept and selected at compile time. Or
maybe the aggregation functions could gain a low-precision flag for simple
double-precision accumulation.
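A hedged sketch of what compile-time selection of the accumulator might look
like; the macro and type names here are made up for illustration and are not
taken from R's sources:

    #include <stddef.h>

    /* Hypothetical configure-time switch for the accumulator type. */
    #ifdef HAVE_USEFUL_LONG_DOUBLE
    typedef long double ACCUM;
    #else
    typedef double ACCUM;   /* would likely be paired with compensated summation */
    #endif

    static double vector_sum(const double *x, size_t n)
    {
        ACCUM s = 0.0;
        for (size_t i = 0; i < n; i++)
            s += x[i];
        return (double) s;
    }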
I'm curious to look at the performance and precision implications of e.g.
Kahan summation if no one has done that yet. Some quick poking around
shows people using processor-specific intrinsics to take advantage of
advanced multi-element instructions, but I imagine R would not want to do
that. Assuming others have not done this already, I will have a look and
report back.
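For reference, textbook Kahan summation over plain doubles looks roughly like
this (a generic C sketch of the algorithm, not code from R):

    #include <stddef.h>

    /* Kahan (compensated) summation.  Must not be compiled with
       value-unsafe optimizations such as -ffast-math, or the
       compensation term gets optimized away. */
    double kahan_sum(const double *x, size_t n)
    {
        double s = 0.0;   /* running sum */
        double c = 0.0;   /* compensation for lost low-order bits */
        for (size_t i = 0; i < n; i++) {
            double y = x[i] - c;
            double t = s + y;
            c = (t - s) - y;  /* error introduced when adding y to s */
            s = t;
        }
        return s;
    }

The extra operations per element are the performance cost to measure against
the long double accumulator.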
Processor-specific (or even compiler-specific) code is better avoided,
but sometimes it is possible to write portable code that is tailored to
run fast on common platforms, while still working correctly with
acceptable performance on others.
Sometimes one can give hints to the compiler via OpenMP pragmas to
vectorize code and/or use vectorized instructions, e.g. when it is ok to
ignore some specific corner cases (R uses this in mayHaveNaNOrInf to
tell the compiler it is ok to assume associativity of addition in a
specific loop/variable, hence allowing it to vectorize better).
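As an illustration of that kind of hint (a generic sketch, not the loop R
actually uses), an OpenMP `simd` reduction tells the compiler it may
reassociate the additions and therefore vectorize them:

    #include <stddef.h>

    double simd_sum(const double *x, size_t n)
    {
        double s = 0.0;
        /* The reduction clause licenses reassociation of the additions,
           which is what allows the compiler to emit vector code.  The
           pragma is ignored if OpenMP SIMD support is not enabled. */
        #pragma omp simd reduction(+:s)
        for (size_t i = 0; i < n; i++)
            s += x[i];
        return s;
    }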
Since we're on the topic I want to point out that the default NA in R
starts off as a signaling NA:
example(numToBits) # for `bitC`
bitC(NA_real_)
## [1] 0 11111111111 | 0000000000000000000000000000000000000000011110100010
bitC(NA_real_ + 0)
## [1] 0 11111111111 | 1000000000000000000000000000000000000000011110100010
Notice the leading bit of the significand starts off as zero, which marks
it as a signaling NA, but becomes 1, i.e. non-signaling (quiet), after any
operation[2].
This is meaningful because the mere act of loading a signaling NA into the
x87 FPU is sufficient to trigger the slowdowns, even if the NA is not
actually used in arithmetic operations; this happens under some
optimization levels. I don't know of any benefit to starting off with a
signaling NA, especially since the encoding is lost pretty much as soon as
it is used. If folks are interested I can provide a patch to turn the NA
quiet by default.
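The same thing can be seen at the C level with a standalone sketch (not R
code; 0x7ff00000000007a2 is the 64-bit pattern of a fresh NA_real_, i.e. an
all-ones exponent, quiet bit clear, and the 1954 payload in the low bits):

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        uint64_t before = 0x7ff00000000007a2ULL;  /* signaling NA_real_ */
        double na;
        memcpy(&na, &before, sizeof na);

        volatile double zero = 0.0;     /* keep the addition at run time */
        double used = na + zero;        /* arithmetic quiets the NaN */

        uint64_t after;
        memcpy(&after, &used, sizeof after);

        printf("before: %016llx quiet bit %d\n",
               (unsigned long long) before, (int) ((before >> 51) & 1));
        printf("after:  %016llx quiet bit %d\n",
               (unsigned long long) after,  (int) ((after  >> 51) & 1));
        return 0;
    }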
In principle this might be a good idea, but the current bit pattern is
unfortunately baked into a number of packages and documents on
internals, as well as serialized objects. The work needed to sort that
out is probably