https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106902
Alexander Monakov <amonakov at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
CC| |amonakov at gcc dot gnu.org
--- Comment #6 from Alexander Monakov <amonakov at gcc dot gnu.org> ---
This is a lovely showcase of how optimizations cooperate to produce something
unexpected.
TL;DR: SLP introduces redundant computations and then fma formation contracts
some (but not all) of those, dramatically reducing numerical stability. In
principle that's similar to incorrectly "optimizing"
double f(double x)
{
    double y = x * x;
    return y - y;
}

(which is guaranteed to return either NaN or 0) to

double f(double x)
{
    return fma(x, x, -(x * x));
}
which returns the round-off tail of x * x (or NaN). I think there's already
another bug with a similar root cause.
In this bug, we begin with (note: all of the following examples are meant to be
compiled without fma contraction, i.e. -O0, plain -O2, or -O2 -ffp-contract=off
if your target has fma):
#include <math.h>
#include <stdio.h>
double one = 1;
double b1 = 0x1.70e906b54fe4fp+1;
double b2 = -0x1.62adb4752c14ep+1;
double b3 = 0x1.c7001a6f3bd8p-1;
double B = 0x1.29c9034e7cp-13;
void f1(void)
{
    double x1 = 0, x2 = 0, x3 = 0;
    for (int i = 0; i < 99; ) {
        double t = B * one + x1 * b1 + x2 * b2 + x3 * b3;
        printf("%d %g\t%a\n", i++, t, t);
        x3 = x2, x2 = x1, x1 = t;
    }
}
predcom unrolls by 3 to get rid of moves:
void f2(void)
{
    double x1 = 0, x2 = 0, x3 = 0;
    for (int i = 0; i < 99; ) {
        x3 = B * one + x1 * b1 + x2 * b2 + x3 * b3;
        printf("%d %g\t%a\n", i++, x3, x3);
        x2 = B * one + x3 * b1 + x1 * b2 + x2 * b3;
        printf("%d %g\t%a\n", i++, x2, x2);
        x1 = B * one + x2 * b1 + x3 * b2 + x1 * b3;
        printf("%d %g\t%a\n", i++, x1, x1);
    }
}
SLP introduces some redundant vector computations:
typedef double f64v2 __attribute__((vector_size(16)));

void f3(void)
{
    double x1 = 0, x2 = 0, x3 = 0;
    f64v2 x32 = { 0 }, x21 = { 0 };
    for (int i = 0; i < 99; ) {
        x3 = B * one + x21[1] * b1 + x2 * b2 + x3 * b3;
        f64v2 x13b1 = { x21[1] * b1, x3 * b1 };
        x32 = B * one + x13b1 + x21 * b2 + x32 * b3;
        x2 = B * one + x3 * b1 + x1 * b2 + x2 * b3;
        f64v2 x13b2 = { b2 * x1, b2 * x32[0] };
        x21 = B * one + x32 * b1 + x13b2 + x21 * b3;
        x1 = B * one + x2 * b1 + x32[0] * b2 + x1 * b3;
        printf("%d %g\t%a\n", i++, x32[0], x32[0]);
        printf("%d %g\t%a\n", i++, x32[1], x32[1]);
        printf("%d %g\t%a\n", i++, x21[1], x21[1]);
    }
}
Note that this is still bit-identical to the initial function. But then
tree-ssa-math-opts "randomly" forms some FMAs:
f64v2 vfma(f64v2 x, f64v2 y, f64v2 z)
{
    return (f64v2){ fma(x[0], y[0], z[0]), fma(x[1], y[1], z[1]) };
}

void f4(void)
{
    f64v2 vone = { one, one }, vB = { B, B };
    f64v2 vb1 = { b1, b1 }, vb2 = { b2, b2 }, vb3 = { b3, b3 };
    double x1 = 0, x2 = 0, x3 = 0;
    f64v2 x32 = { 0 }, x21 = { 0 };
    for (int i = 0; i < 99; ) {
        x3 = fma(b3, x3, fma(b2, x2, fma(B, one, x21[1] * b1)));
        f64v2 x13b1 = { x21[1] * b1, x3 * b1 };
        x32 = vfma(vb3, x32, vfma(vb2, x21, vfma(vB, vone, x13b1)));
        x2 = fma(b3, x2, b2 * x1 + fma(B, one, x3 * b1));
        f64v2 x13b2 = { b2 * x1, b2 * x32[0] };
        x21 = vfma(vb3, x21, x13b2 + vfma(vB, vone, x32 * vb1));
        x1 = fma(b3, x1, b2 * x32[0] + fma(B, one, b1 * x2));
        printf("%d %g\t%a\n", i++, x32[0], x32[0]);
        printf("%d %g\t%a\n", i++, x32[1], x32[1]);
        printf("%d %g\t%a\n", i++, x21[1], x21[1]);
    }
}
and here some of the redundantly computed values now come out differently,
depending on whether the rounding after a multiplication was omitted. Somehow
this is enough to make the computation explode numerically.