https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106902
Alexander Monakov <amonakov at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |amonakov at gcc dot gnu.org

--- Comment #6 from Alexander Monakov <amonakov at gcc dot gnu.org> ---
This is a lovely showcase of how optimizations cooperatively produce something
unexpected. TL;DR: SLP introduces redundant computations, and then FMA
formation contracts some (but not all) of them, dramatically reducing
numerical stability.

In principle that's similar to incorrectly "optimizing"

double f(double x)
{
    double y = x * x;
    return y - y;
}

(which is guaranteed to return either NaN or 0) to

double f(double x)
{
    return fma(x, x, -(x * x));
}

which returns the round-off tail of x * x (or NaN). I think there's already
another bug with a similar root cause.

In this bug, we begin with the following (note: all of the examples below are
supposed to be compiled without FMA contraction, i.e. at -O0, plain -O2, or
-O2 -ffp-contract=off if your target has FMA):

#include <math.h>
#include <stdio.h>

double one = 1;
double b1 = 0x1.70e906b54fe4fp+1;
double b2 = -0x1.62adb4752c14ep+1;
double b3 = 0x1.c7001a6f3bd8p-1;
double B = 0x1.29c9034e7cp-13;

void f1(void)
{
    double x1 = 0, x2 = 0, x3 = 0;
    for (int i = 0; i < 99; ) {
        double t = B * one + x1 * b1 + x2 * b2 + x3 * b3;
        printf("%d %g\t%a\n", i++, t, t);
        x3 = x2, x2 = x1, x1 = t;
    }
}

predcom unrolls by 3 to get rid of the moves:

void f2(void)
{
    double x1 = 0, x2 = 0, x3 = 0;
    for (int i = 0; i < 99; ) {
        x3 = B * one + x1 * b1 + x2 * b2 + x3 * b3;
        printf("%d %g\t%a\n", i++, x3, x3);
        x2 = B * one + x3 * b1 + x1 * b2 + x2 * b3;
        printf("%d %g\t%a\n", i++, x2, x2);
        x1 = B * one + x2 * b1 + x3 * b2 + x1 * b3;
        printf("%d %g\t%a\n", i++, x1, x1);
    }
}

SLP introduces some redundant vector computations:

typedef double f64v2 __attribute__((vector_size(16)));

void f3(void)
{
    double x1 = 0, x2 = 0, x3 = 0;
    f64v2 x32 = { 0 }, x21 = { 0 };
    for (int i = 0; i < 99; ) {
        x3 = B * one + x21[1] * b1 + x2 * b2 + x3 * b3;
        f64v2 x13b1 = { x21[1] * b1, x3 * b1 };
        x32 = B * one + x13b1 + x21 * b2 + x32 * b3;
        x2 = B * one + x3 * b1 + x1 * b2 + x2 * b3;
        f64v2 x13b2 = { b2 * x1, b2 * x32[0] };
        x21 = B * one + x32 * b1 + x13b2 + x21 * b3;
        x1 = B * one + x2 * b1 + x32[0] * b2 + x1 * b3;
        printf("%d %g\t%a\n", i++, x32[0], x32[0]);
        printf("%d %g\t%a\n", i++, x32[1], x32[1]);
        printf("%d %g\t%a\n", i++, x21[1], x21[1]);
    }
}

Note that this is still bit-identical to the initial function. But then
tree-ssa-math-opts "randomly" forms some FMAs:

f64v2 vfma(f64v2 x, f64v2 y, f64v2 z)
{
    return (f64v2){ fma(x[0], y[0], z[0]), fma(x[1], y[1], z[1]) };
}

void f4(void)
{
    f64v2 vone = { one, one }, vB = { B, B };
    f64v2 vb1 = { b1, b1 }, vb2 = { b2, b2 }, vb3 = { b3, b3 };
    double x1 = 0, x2 = 0, x3 = 0;
    f64v2 x32 = { 0 }, x21 = { 0 };
    for (int i = 0; i < 99; ) {
        x3 = fma(b3, x3, fma(b2, x2, fma(B, one, x21[1] * b1)));
        f64v2 x13b1 = { x21[1] * b1, x3 * b1 };
        x32 = vfma(vb3, x32, vfma(vb2, x21, vfma(vB, vone, x13b1)));
        x2 = fma(b3, x2, b2 * x1 + fma(B, one, x3 * b1));
        f64v2 x13b2 = { b2 * x1, b2 * x32[0] };
        x21 = vfma(vb3, x21, x13b2 + vfma(vB, vone, x32 * vb1));
        x1 = fma(b3, x1, b2 * x32[0] + fma(B, one, b1 * x2));
        printf("%d %g\t%a\n", i++, x32[0], x32[0]);
        printf("%d %g\t%a\n", i++, x32[1], x32[1]);
        printf("%d %g\t%a\n", i++, x21[1], x21[1]);
    }
}

Here some of the redundantly computed values are computed differently,
depending on where the rounding after a multiplication was omitted. Somehow
this is enough to make the computation explode numerically.