https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106902

Alexander Monakov <amonakov at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |amonakov at gcc dot gnu.org

--- Comment #6 from Alexander Monakov <amonakov at gcc dot gnu.org> ---
This is a lovely showcase of how optimizations cooperatively produce something
unexpected.

TL;DR: SLP introduces redundant computations and then fma formation contracts
some (but not all) of those, dramatically reducing numerical stability. In
principle that's similar to incorrectly "optimizing"

double f(double x)
{
  double y = x * x;
  return y - y;
}

(which is guaranteed to return either NaN or 0: y - y is exactly 0 for any
finite y, and NaN when y is NaN or infinite) to

double f(double x)
{
  return fma(x, x, -(x * x));
}

which returns the round-off tail of x * x (or NaN). I think there's already
another bug with a similar root cause.
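To see the round-off tail concretely, here is a small driver of my own (not
part of the testcase): for x = 1 + 2^-27, the exact square 1 + 2^-26 + 2^-54
rounds down to 1 + 2^-26, so the first function returns 0 while the
"optimized" one returns the 2^-54 tail (compile without contraction, link
with -lm):

#include <math.h>
#include <stdio.h>

int main(void)
{
  volatile double x = 1 + 0x1p-27; /* volatile: keep the compiler from folding */
  double y = x * x;
  printf("%a\n", y - y);               /* 0x0p+0 */
  printf("%a\n", fma(x, x, -(x * x))); /* 0x1p-54, the round-off tail */
  return 0;
}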

In this bug, we begin with the following (note that all of the examples below
are meant to be compiled without fma contraction, i.e. at -O0, at plain -O2,
or at -O2 -ffp-contract=off if your target has fma):

#include <math.h>
#include <stdio.h>

double one = 1;

double b1 = 0x1.70e906b54fe4fp+1;
double b2 = -0x1.62adb4752c14ep+1;
double b3 = 0x1.c7001a6f3bd8p-1;
double B = 0x1.29c9034e7cp-13;

void f1(void)
{
        double x1 = 0, x2 = 0, x3 = 0;

        for (int i = 0; i < 99; ) {
                /* x_n = B*one + b1*x_{n-1} + b2*x_{n-2} + b3*x_{n-3} */
                double t = B * one + x1 * b1 + x2 * b2 + x3 * b3;
                printf("%d %g\t%a\n", i++, t, t);

                x3 = x2, x2 = x1, x1 = t;
        }
}

predcom (predictive commoning) unrolls the loop by 3 to get rid of the moves:

void f2(void)
{
        double x1 = 0, x2 = 0, x3 = 0;

        for (int i = 0; i < 99; ) {
                x3 = B * one + x1 * b1 + x2 * b2 + x3 * b3;
                printf("%d %g\t%a\n", i++, x3, x3);

                x2 = B * one + x3 * b1 + x1 * b2 + x2 * b3;
                printf("%d %g\t%a\n", i++, x2, x2);

                x1 = B * one + x2 * b1 + x3 * b2 + x1 * b3;
                printf("%d %g\t%a\n", i++, x1, x1);
        }
}

SLP introduces some redundant vector computations:

typedef double f64v2 __attribute__((vector_size(16)));

void f3(void)
{
        double x1 = 0, x2 = 0, x3 = 0;

        f64v2 x32 = { 0 }, x21 = { 0 }; /* lanes: x32 = { x3, x2 }, x21 = { x2, x1 } */

        for (int i = 0; i < 99; ) {
                x3 = B * one + x21[1] * b1 + x2 * b2 + x3 * b3;

                f64v2 x13b1 = { x21[1] * b1, x3 * b1 };

                x32 = B * one + x13b1 + x21 * b2 + x32 * b3;

                x2 = B * one + x3 * b1 + x1 * b2 + x2 * b3;

                f64v2 x13b2 = { b2 * x1, b2 * x32[0] };

                x21 = B * one + x32 * b1 + x13b2 + x21 * b3;

                x1 = B * one + x2 * b1 + x32[0] * b2 + x1 * b3;

                printf("%d %g\t%a\n", i++, x32[0], x32[0]);
                printf("%d %g\t%a\n", i++, x32[1], x32[1]);
                printf("%d %g\t%a\n", i++, x21[1], x21[1]);
        }
}

Note that this still produces output bit-identical to the initial function.
But then tree-ssa-math-opts "randomly" forms some FMAs:

f64v2 vfma(f64v2 x, f64v2 y, f64v2 z)
{
        return (f64v2){ fma(x[0], y[0], z[0]), fma(x[1], y[1], z[1]) };
}

void f4(void)
{
        f64v2 vone = { one, one }, vB = { B, B };
        f64v2 vb1 = { b1, b1 }, vb2 = { b2, b2 }, vb3 = { b3, b3 };

        double x1 = 0, x2 = 0, x3 = 0;

        f64v2 x32 = { 0 }, x21 = { 0 };

        for (int i = 0; i < 99; ) {
                x3 = fma(b3, x3, fma(b2, x2, fma(B, one, x21[1] * b1)));

                f64v2 x13b1 = { x21[1] * b1, x3 * b1 };

                x32 = vfma(vb3, x32, vfma(vb2, x21, vfma(vB, vone, x13b1)));

                x2 = fma(b3, x2, b2 * x1 + fma(B, one, x3 * b1));

                f64v2 x13b2 = { b2 * x1, b2 * x32[0] };

                x21 = vfma(vb3, x21, x13b2 + vfma(vB, vone, x32 * vb1));

                x1 = fma(b3, x1, b2 * x32[0] + fma(B, one, b1 * x2));

                printf("%d %g\t%a\n", i++, x32[0], x32[0]);
                printf("%d %g\t%a\n", i++, x32[1], x32[1]);
                printf("%d %g\t%a\n", i++, x21[1], x21[1]);
        }
}

and here some of the redundantly computed values come out differently,
depending on where the rounding after a multiplication was omitted. Somehow
this is enough to make the computation explode numerically.
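
To make that concrete, here is my reading of one such pair (a sketch, not
actual compiler output), using the same file-scope b1, b2, b3, B, one as
above: both expressions compute "x2", but only the vector lane has b2 * x1
contracted, so the two copies can differ in the last bits, and that
difference feeds back into the recurrence:

double x2_both_ways(double x1, double x2, double x3)
{
        double t = fma(B, one, x3 * b1);
        double scalar = fma(b3, x2, b2 * x1 + t);    /* scalar x2 in f4   */
        double vector = fma(b3, x2, fma(b2, x1, t)); /* lane x32[1] in f4 */
        return scalar - vector; /* nonzero once the two copies diverge */
}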
