https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81082

--- Comment #3 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to Jakub Jelinek from comment #2)
> So is there anything we can do here?
> Isn't the bigger problem that we no longer notice the multiplications can't
> overflow?

Yes, that's the issue with all arithmetic we drop to unsigned because of
reassociation.  It's always a trade-off ...
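
To make the trade-off concrete, here is a hand-written sketch (mine, not
something GCC emits) of the two forms of a * b + a * c:

/* Hand-written illustration.  Both functions compute a * b + a * c
   for inputs where that signed expression itself has no UB.  */

int factored_signed (int a, int b, int c)
{
  /* Signed overflow would be UB, so the optimizer may assume it does
     not happen and keep tight value ranges -- but as discussed below
     this form is only valid when a is provably neither -1 nor 0.  */
  return a * (b + c);
}

int factored_unsigned (int a, int b, int c)
{
  /* Unsigned wrap-around is defined, so this form can be created
     unconditionally (the final conversion relies on GCC's defined
     modulo behavior), but the no-overflow knowledge is gone.  */
  return (int) ((unsigned) a * ((unsigned) b + (unsigned) c));
}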

>  With the int multiplications gone, we need to assume [1, INT_MAX]
> ranges on b1, b2 and b3 in the loop (otherwise it wouldn't loop at all), and
> [0, INT_MAX - 1] ranges for i1, i2 and i3.  But, from the i1 * b2 * b3 + i2
> * b3 + (i3 - 1) we could have determined that b1 * b2 * b3 is [0, INT_MAX].
> 
> We do vectorize:
> int
> f (int *x, int b1, int b2, int b3)
> {
>   int foo = 0;
>   for (int i1 = 0; i1 < b1; ++i1)
>     for (int i2 = 0; i2 < b2; ++i2)
>       for (int i3 = 0; i3 < b3; ++i3)
>         foo += x[(i1 * b2 + i2) * b3 + (i3 - 1)];
>   return foo;
> }
> without gather loads, while the #c0 testcase vectorizes only with them
> (if available).
> 
> Regarding the PR66313 optimization, first of all, what happened to the move
> of it to match.pd?

Given we have the reassoc pass doing this "properly" (well, not for
signed integer types...) I have not pursued fully moving it.  We might
want to move some of the more "obvious" cases (though IIRC some cases
involving constants are already handled).

So, the ultimate plan is to make reassoc stronger, rewriting associated
signed arithmetic to unsigned where required.
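
Concretely, for the index expression in this testcase such a rewrite could
look like the following hand-written sketch (my C, not actual reassoc
output; the helper name is made up):

/* Sketch: (i1 * b2 + i2) * b3 + (i3 - 1) distributed into the
   t1 + t2 + t3 form quoted below, computed in unsigned arithmetic so
   the new association order cannot introduce signed-overflow UB.  */
static inline int
reassociated_index (int i1, int i2, int i3, int b2, int b3)
{
  unsigned t1 = (unsigned) i1 * b2 * b3;
  unsigned t2 = (unsigned) i2 * b3;
  unsigned t3 = (unsigned) i3 - 1;
  /* The conversion back to int relies on GCC's defined modulo
     semantics for out-of-range unsigned-to-signed conversion.  */
  return (int) (t1 + t2 + t3);
}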

> As we vectorize:
> int
> f (int *x, int b1, int b2, int b3)
> {
>   int foo = 0;
>   for (int i1 = 0; i1 < b1; ++i1)
>     for (int i2 = 0; i2 < b2; ++i2)
>       for (int i3 = 0; i3 < b3; ++i3)
>       {
>         int t1 = i1 * b2 * b3;
>         int t2 = i2 * b3;
>         int t3 = i3 - 1;
>         foo += x[t1 + t2 + t3];
>       }
>   return foo;
> }
> 
> it seems that hasn't been done.
> 
> Second thing, in fold_plusminus_mult_expr I don't understand the:
>       /* We are neither factoring zero nor minus one.  */
>       || TREE_CODE (same) == INTEGER_CST)
> part, shouldn't that be || (TREE_CODE (same) == INTEGER_CST &&
> !integer_zerop (same) && !integer_minus_onep (same)) ?

The comment is saying that same == 0 and same == -1 cannot occur there
by construction, so the stricter check would be redundant.

> Another thing is, for signed a, b, c we can't optimize a * b + a * c as
> a * (b + c) if a can be -1 or 0, exhaustive proof for signed char:
> int
> main ()
> {
>   int a, b, c;
>   for (a = -128; a < 128; a++)
>     for (b = -128; b < 128; b++)
>       for (c = -128; c < 128; c++)
>       {
>         int ab = a * b;
>         int ac = a * c;
>         int ab_ac = ab + ac;
>         if (ab < -128 || ab >= 128 || ac < -128 || ac >= 128
>             || ab_ac < -128 || ab_ac >= 128)
>           continue;
>         /* Ok, in this case a * b + a * c is without UB.  */
>         int b_c = b + c;
> #ifdef NEW
>         b_c = (signed char) b_c;
> #endif
>         int r = a * b_c;
>         if (b_c < -128 || b_c >= 128 || r != ab_ac)
>           __builtin_printf ("%d * %d + %d * %d = %d wrong opt (%d)\n",
>                             a, b, a, c, ab_ac, r);
>       }
>   return 0;
> }
> 
> But, if we know for sure that a is not -1, we could optimize it as
>   a * (signed)((unsigned) b + c)
> (which doesn't work for a == -1, but works for others) instead of
>   (signed)(a * ((unsigned) b + c))
> (which works for all, but is a too big hammer).
> 
> Perhaps this optimization could be moved over to match.pd, deferred
> (when a isn't constant) until at least vrp1, and tuned based on range
> info?
> The range for b3 at least during vrp pass itself is [1, INT_MAX], so neither
> -1 nor 0 are possible and we could safely reassociate it as a * (b + c)
> rather than the variants with unsigned.
> 
> Or do we want to somehow mark the MULT_EXPR we've created, propagate it
> through the IL and if we in vrp notice this way marked MULT_EXPR (or IFN) we
> fold it as integer multiplication?  That would be ugly.

I don't really see a good solution to this PR either :/

Delaying the folding to GIMPLE would be OK with me, but as said I'd like
to see reassoc enhanced rather than stuffing more and more "complex"
patterns into match.pd.
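
For what it's worth, Jakub's range-gated variants spelled out as
hand-written C (the a_min/a_max parameters stand in for a VRP range
query; this is a sketch, not an existing API):

/* Which factoring of a * b + a * c is safe depends on whether the
   known range [a_min, a_max] of a excludes -1 and/or 0.  */
int fold_mult_plus (int a, int b, int c, int a_min, int a_max)
{
  if (a_min > 0 || a_max < -1)
    /* a is provably neither -1 nor 0: plain signed factoring.  */
    return a * (b + c);
  else if (a_min > -1)
    /* a >= 0 here, so a is provably not -1: only the addition needs
       to be carried out in unsigned arithmetic.  */
    return a * (int) ((unsigned) b + c);
  else
    /* No useful range info: the "big hammer", everything unsigned.  */
    return (int) ((unsigned) a * ((unsigned) b + c));
}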

Ultimately we'd want to compute SCEV for all relevant IVs early and
cache them.  That should be easy when only constants are involved but
will be tricky (impossible) when symbols are involved.  In the case of
this testcase it's all constants, of course.
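
For this testcase the evolution is indeed trivial: viewed over the
flattened iteration space, the load index is a single IV stepping by 1.
A quick hand-written check (with made-up constant bounds):

#include <assert.h>

int main (void)
{
  const int b1 = 3, b2 = 4, b3 = 5;  /* assumed constant bounds */
  int iv = -1;                       /* index value at i1 = i2 = i3 = 0 */
  for (int i1 = 0; i1 < b1; ++i1)
    for (int i2 = 0; i2 < b2; ++i2)
      for (int i3 = 0; i3 < b3; ++i3)
        {
          /* (i1 * b2 + i2) * b3 + (i3 - 1) advances by exactly 1 per
             innermost iteration: a single linear recurrence.  */
          assert ((i1 * b2 + i2) * b3 + (i3 - 1) == iv);
          ++iv;
        }
  return 0;
}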

I once experimented with "lowering" GIMPLE to -fwrapv semantics (as RTL
already has) before late reassoc and that "worked" (well, it of course
regressed stuff, and it was merely to prove that reassociation of signed
arithmetic is useful).
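
At the source level that lowering amounts to computing every signed
operation modulo 2^N; a hand-written analogy:

/* -fwrapv-style semantics: signed overflow becomes defined
   wrap-around, matching what RTL assumes, at the cost of all
   UB-based no-overflow assumptions.  */
int wrapv_mul_add (int a, int b, int c)
{
  return (int) ((unsigned) a * b + (unsigned) a * c);
}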
