https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98138
--- Comment #3 from Kewen Lin <linkw at gcc dot gnu.org> ---
(In reply to Richard Biener from comment #2)
> So the expected vectorization builds vectors
>
> { tmp[0][0], tmp[1][0], tmp[2][0], tmp[3][0] }
>
> that's not SLP, SLP tries to build the
>
> { tmp[i][0], tmp[i][1], tmp[i][2], tmp[i][3] }
>
> vector and "succeeds" - the SLP tree turns out to be
> highly inefficient though. So for the stores your desire
> is to see an interleaving scheme with VF 4 (the number of
> iterations). But interleaving fails because it would require
> a VF of 16 and there are not enough iterations in the loop.
>
> The classical SLP scheme degenerates (also due to the plus/minus
> mixed ops) to uniform vectors as we venture beyond the a{0,2} {+,-} a{1,3}
> expression.
>
> Starting SLP discovery from the grouped loads would get things going
> up to the above same expression.
>
> So not sure what's the best approach to this case. The testcase
> can be simplified still showing the SLP discovery issue:
>
> extern void test(unsigned int t[4][4]);
>
> void foo(int *p1, int i1, int *p2, int i2)
> {
>   unsigned int tmp[4][4];
>   unsigned int a0, a1, a2, a3;
>
>   for (int i = 0; i < 4; i++, p1 += i1, p2 += i2) {
>     a0 = (p1[0] - p2[0]);
>     a1 = (p1[1] - p2[1]);
>     a2 = (p1[2] - p2[2]);
>     a3 = (p1[3] - p2[3]);
>
>     int t0 = a0 + a1;
>     int t1 = a0 - a1;
>     int t2 = a2 + a3;
>     int t3 = a2 - a3;
>
>     tmp[i][0] = t0 + t2;
>     tmp[i][2] = t0 - t2;
>     tmp[i][1] = t1 + t3;
>     tmp[i][3] = t1 - t3;
>   }
>   test(tmp);
> }
>
> So it's basically SLP discovery degenerating to an interleaving scheme
> on the load side but not actually "implementing" it.
IIUC, in the current implementation we get four grouped stores,
{ tmp[i][0], tmp[i][1], tmp[i][2], tmp[i][3] } /i=0,1,2,3/, each tried independently.
When all of these attempts fail, could we retry with the groups
{ tmp[0][i], tmp[1][i], tmp[2][i], tmp[3][i] } /i=0,1,2,3/,
adding one extra intermediate layer that covers the original groups, and then start
discovery from these newly adjusted groups? The operands built should then be isomorphic.
Maybe too hackish?
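
To illustrate the regrouping idea, here is a hand-written sketch in plain C, not
what the vectorizer does today. Each "vector" holds one value per loop iteration
(lane k = iteration k), so every operation is uniform across its lanes. The names
v4u, v_add, v_sub, foo_scalar, foo_transposed and results_match are hypothetical
helpers made up for this example. The sketch also uses unsigned arithmetic
throughout, where the testcase mixes int and unsigned, and takes the output array
as a parameter instead of calling test().

```c
#include <string.h>

/* Hypothetical 4-lane vector: lane k holds the value for iteration i == k. */
typedef struct { unsigned int lane[4]; } v4u;

static v4u v_add(v4u a, v4u b)
{
  v4u r;
  for (int k = 0; k < 4; k++)
    r.lane[k] = a.lane[k] + b.lane[k];
  return r;
}

static v4u v_sub(v4u a, v4u b)
{
  v4u r;
  for (int k = 0; k < 4; k++)
    r.lane[k] = a.lane[k] - b.lane[k];
  return r;
}

/* Reference: the scalar loop from the simplified testcase. */
static void foo_scalar(const int *p1, int i1, const int *p2, int i2,
                       unsigned int tmp[4][4])
{
  for (int i = 0; i < 4; i++, p1 += i1, p2 += i2) {
    unsigned int a0 = (unsigned int)p1[0] - (unsigned int)p2[0];
    unsigned int a1 = (unsigned int)p1[1] - (unsigned int)p2[1];
    unsigned int a2 = (unsigned int)p1[2] - (unsigned int)p2[2];
    unsigned int a3 = (unsigned int)p1[3] - (unsigned int)p2[3];
    unsigned int t0 = a0 + a1, t1 = a0 - a1;
    unsigned int t2 = a2 + a3, t3 = a2 - a3;
    tmp[i][0] = t0 + t2;
    tmp[i][2] = t0 - t2;
    tmp[i][1] = t1 + t3;
    tmp[i][3] = t1 - t3;
  }
}

/* Regrouped form: one store group per column, { tmp[0][j], ..., tmp[3][j] }. */
static void foo_transposed(const int *p1, int i1, const int *p2, int i2,
                           unsigned int tmp[4][4])
{
  v4u a[4];

  /* a[j] gathers p1[j] - p2[j] from all four iterations (strided loads). */
  for (int j = 0; j < 4; j++)
    for (int i = 0; i < 4; i++)
      a[j].lane[i] = (unsigned int)p1[i * i1 + j] - (unsigned int)p2[i * i2 + j];

  /* Every statement below applies a single opcode to all lanes. */
  v4u t0 = v_add(a[0], a[1]);
  v4u t1 = v_sub(a[0], a[1]);
  v4u t2 = v_add(a[2], a[3]);
  v4u t3 = v_sub(a[2], a[3]);

  v4u c[4];
  c[0] = v_add(t0, t2);
  c[2] = v_sub(t0, t2);
  c[1] = v_add(t1, t3);
  c[3] = v_sub(t1, t3);

  /* Column stores: lane i of c[j] goes to tmp[i][j]. */
  for (int j = 0; j < 4; j++)
    for (int i = 0; i < 4; i++)
      tmp[i][j] = c[j].lane[i];
}

/* Returns 1 if both forms produce identical results on some sample data. */
static int results_match(void)
{
  int p1[24], p2[24];
  unsigned int r1[4][4], r2[4][4];

  for (int k = 0; k < 24; k++) {
    p1[k] = (k * 7) % 13;
    p2[k] = (k * 3) % 11;
  }
  foo_scalar(p1, 5, p2, 6, r1);
  foo_transposed(p1, 5, p2, 6, r2);
  return memcmp(r1, r2, sizeof r1) == 0;
}
```

In this form the plus/minus mix no longer shows up inside a single vector, since
each of t0..t3 and each result column uses one opcode for all lanes. The cost
moves to the strided loads and the column stores, which a real implementation
would presumably have to handle with permutes or an interleaving scheme.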