https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101895
Bug ID: 101895
Summary: [11/12 Regression] SLP Vectorizer change pushes
VEC_PERM_EXPR into bad location spoiling further
optimization opportunities
Product: gcc
Version: 11.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: law at gcc dot gnu.org
CC: rguenth at gcc dot gnu.org
Target Milestone: ---
Consider this code:
void foo(int * restrict a, int b, int *c) {
  a[0] = c[0]*b + a[0];
  a[1] = c[2]*b + a[1];
  a[2] = c[1]*b + a[2];
  a[3] = c[3]*b + a[3];
}
Prior to this commit:
commit 126ed72b9f48f8530b194532cc281fb761690435
Author: Richard Biener <[email protected]>
Date: Wed Sep 30 17:08:01 2020 +0200
optimize permutes in SLP, remove vect_attempt_slp_rearrange_stmts
This introduces a permute optimization phase for SLP which is
intended to cover the existing permute eliding for SLP reductions
plus handling commonizing the easy cases.
It currently uses graphds to compute a postorder on the reverse
SLP graph and it handles all cases vect_attempt_slp_rearrange_stmts
did (hopefully - I've adjusted most testcases that triggered it
a few days ago). It restricts itself to move around bijective
permutations to simplify things for now, mainly around constant nodes.
As a prerequisite it makes the SLP graph cyclic (ugh). It looks
like it would pay off to compute a PRE/POST order visit array
once and elide all the recursive SLP graph walks and their
visited hash-set. At least for the time where we do not change
the SLP graph during such walk.
I do not like using graphds too much but at least I don't have to
re-implement yet another RPO walk, so maybe it isn't too bad.
It now computes permute placement during iteration and thus should
get cycles more obviously correct.
[ ... ]
GCC would generate this (x86_64 -O3 -march=native):
vect__1.6_27 = VEC_PERM_EXPR <vect__1.5_25, vect__1.5_25, { 0, 2, 1, 3 }>;
vect__2.7_29 = vect__1.6_27 * _28;
_1 = *c_18(D);
_2 = _1 * b_19(D);
vectp.9_30 = a_20(D);
vect__3.10_31 = MEM <vector(4) int> [(int *)vectp.9_30];
vect__4.11_32 = vect__2.7_29 + vect__3.10_31;
This is good. Note how the VEC_PERM_EXPR happens before the vector multiply
and how the vector multiply directly feeds the vector add. Our target has a
vector multiply-add instruction, which would be generated here, and all is good.
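As a rough standalone sketch of this ordering (written with GCC's generic
vector extensions rather than the vectorizer's actual output; the v4si
typedef and the function name are only illustrative):

typedef int v4si __attribute__((vector_size(16)));

/* Illustrative only: permute the loaded c[] lanes first, so the multiply
   result feeds the add directly and a target multiply-add can combine
   them.  */
v4si good_order(v4si c, v4si a, int b)
{
  v4si vb = { b, b, b, b };
  v4si cperm = __builtin_shuffle(c, (v4si){ 0, 2, 1, 3 });
  return cperm * vb + a;
}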
After the above change we generate this:
vect__2.6_28 = vect__1.5_25 * _27;
_29 = VEC_PERM_EXPR <vect__2.6_28, vect__2.6_28, { 0, 2, 1, 3 }>;
_1 = *c_18(D);
_2 = _1 * b_19(D);
vectp.8_30 = a_20(D);
vect__3.9_31 = MEM <vector(4) int> [(int *)vectp.8_30];
vect__4.10_32 = _29 + vect__3.9_31;
Note how we have the vmul, then permute, then vadd. This spoils our ability to
generate a vmadd. This behavior is still seen on the trunk as well.
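Using the same illustrative generic-vector sketch as above (again, not the
actual vectorized IL), the post-change ordering looks like this; the permute
now sits between the multiply and the add, breaking the chain a multiply-add
would fuse:

/* Illustrative only: multiply first, then permute.  The shuffle separates
   the multiply from the add, so they can no longer be combined into a
   vmadd.  */
v4si bad_order(v4si c, v4si a, int b)
{
  v4si vb = { b, b, b, b };
  v4si m = c * vb;
  v4si mperm = __builtin_shuffle(m, (v4si){ 0, 2, 1, 3 });
  return mperm + a;
}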
Conceptually, it seems better to keep a permute at the start or end of a
chain of dependent vector operations than to push it into the middle of the
chain.
We could probably fix this in the backend with some special patterns, but ISTM
that getting it right in SLP would be better.