https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106902

Alexander Monakov <amonakov at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |amonakov at gcc dot gnu.org

--- Comment #6 from Alexander Monakov <amonakov at gcc dot gnu.org> ---
This is a lovely showcase of how optimizations cooperatively produce something
unexpected.

TL;DR: SLP introduces redundant computations and then fma formation contracts
some (but not all) of those, dramatically reducing numerical stability. In
principle that's similar to incorrectly "optimizing"

double f(double x)
{
  double y = x * x;
  return y - y;
}

(which is guaranteed to return either NaN or 0: y - y is exactly 0 for any
finite y, and NaN when y is NaN or infinite) to

double f(double x)
{
  return fma(x, x, -(x * x));
}

which returns the round-off tail of x * x (or NaN). I think there's already
another bug with a similar root cause.
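To see the round-off tail concretely, here is a small driver of my own (not
part of the testcase): for x = 1 + 2^-27, the exact square 1 + 2^-26 + 2^-54
rounds down to 1 + 2^-26, so the first function returns 0 while the
"optimized" one returns the 2^-54 tail (compile without contraction, link
with -lm):

#include <math.h>
#include <stdio.h>

int main(void)
{
  volatile double x = 1 + 0x1p-27; /* volatile: keep the compiler from folding */
  double y = x * x;
  printf("%a\n", y - y);               /* 0x0p+0 */
  printf("%a\n", fma(x, x, -(x * x))); /* 0x1p-54, the round-off tail */
  return 0;
}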

In this bug, we begin with the following (note that all of the examples below
are meant to be compiled without fma contraction, i.e. at -O0, at plain -O2,
or at -O2 -ffp-contract=off if your target has fma):

#include <math.h>
#include <stdio.h>

double one = 1;

double b1 = 0x1.70e906b54fe4fp+1;
double b2 = -0x1.62adb4752c14ep+1;
double b3 = 0x1.c7001a6f3bd8p-1;
double B = 0x1.29c9034e7cp-13;

void f1(void)
{
        double x1 = 0, x2 = 0, x3 = 0;

        for (int i = 0; i < 99; ) {
                /* x_n = B*one + b1*x_{n-1} + b2*x_{n-2} + b3*x_{n-3} */
                double t = B * one + x1 * b1 + x2 * b2 + x3 * b3;
                printf("%d %g\t%a\n", i++, t, t);

                x3 = x2, x2 = x1, x1 = t;
        }
}

predcom (predictive commoning) unrolls the loop by 3 to get rid of the moves:

void f2(void)
{
        double x1 = 0, x2 = 0, x3 = 0;

        for (int i = 0; i < 99; ) {
                x3 = B * one + x1 * b1 + x2 * b2 + x3 * b3;
                printf("%d %g\t%a\n", i++, x3, x3);

                x2 = B * one + x3 * b1 + x1 * b2 + x2 * b3;
                printf("%d %g\t%a\n", i++, x2, x2);

                x1 = B * one + x2 * b1 + x3 * b2 + x1 * b3;
                printf("%d %g\t%a\n", i++, x1, x1);
        }
}

SLP introduces some redundant vector computations:

typedef double f64v2 __attribute__((vector_size(16)));

void f3(void)
{
        double x1 = 0, x2 = 0, x3 = 0;

        f64v2 x32 = { 0 }, x21 = { 0 }; /* lanes: x32 = { x3, x2 }, x21 = { x2, x1 } */

        for (int i = 0; i < 99; ) {
                x3 = B * one + x21[1] * b1 + x2 * b2 + x3 * b3;

                f64v2 x13b1 = { x21[1] * b1, x3 * b1 };

                x32 = B * one + x13b1 + x21 * b2 + x32 * b3;

                x2 = B * one + x3 * b1 + x1 * b2 + x2 * b3;

                f64v2 x13b2 = { b2 * x1, b2 * x32[0] };

                x21 = B * one + x32 * b1 + x13b2 + x21 * b3;

                x1 = B * one + x2 * b1 + x32[0] * b2 + x1 * b3;

                printf("%d %g\t%a\n", i++, x32[0], x32[0]);
                printf("%d %g\t%a\n", i++, x32[1], x32[1]);
                printf("%d %g\t%a\n", i++, x21[1], x21[1]);
        }
}

Note that this still produces output bit-identical to the initial function.
But then tree-ssa-math-opts "randomly" forms some FMAs:

f64v2 vfma(f64v2 x, f64v2 y, f64v2 z)
{
        return (f64v2){ fma(x[0], y[0], z[0]), fma(x[1], y[1], z[1]) };
}

void f4(void)
{
        f64v2 vone = { one, one }, vB = { B, B };
        f64v2 vb1 = { b1, b1 }, vb2 = { b2, b2 }, vb3 = { b3, b3 };

        double x1 = 0, x2 = 0, x3 = 0;

        f64v2 x32 = { 0 }, x21 = { 0 };

        for (int i = 0; i < 99; ) {
                x3 = fma(b3, x3, fma(b2, x2, fma(B, one, x21[1] * b1)));

                f64v2 x13b1 = { x21[1] * b1, x3 * b1 };

                x32 = vfma(vb3, x32, vfma(vb2, x21, vfma(vB, vone, x13b1)));

                x2 = fma(b3, x2, b2 * x1 + fma(B, one, x3 * b1));

                f64v2 x13b2 = { b2 * x1, b2 * x32[0] };

                x21 = vfma(vb3, x21, x13b2 + vfma(vB, vone, x32 * vb1));

                x1 = fma(b3, x1, b2 * x32[0] + fma(B, one, b1 * x2));

                printf("%d %g\t%a\n", i++, x32[0], x32[0]);
                printf("%d %g\t%a\n", i++, x32[1], x32[1]);
                printf("%d %g\t%a\n", i++, x21[1], x21[1]);
        }
}

and here some of the redundantly computed values come out differently,
depending on where the rounding after a multiplication was omitted. Somehow
this is enough to make the computation explode numerically.
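
To make that concrete, here is my reading of one such pair (a sketch, not
actual compiler output), using the same file-scope b1, b2, b3, B, one as
above: both expressions compute "x2", but only the vector lane has b2 * x1
contracted, so the two copies can differ in the last bits, and that
difference feeds back into the recurrence:

double x2_both_ways(double x1, double x2, double x3)
{
        double t = fma(B, one, x3 * b1);
        double scalar = fma(b3, x2, b2 * x1 + t);    /* scalar x2 in f4   */
        double vector = fma(b3, x2, fma(b2, x1, t)); /* lane x32[1] in f4 */
        return scalar - vector; /* nonzero once the two copies diverge */
}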
