https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91573
Tamar Christina <tnfchris at gcc dot gnu.org> changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |tnfchris at gcc dot gnu.org
--- Comment #6 from Tamar Christina <tnfchris at gcc dot gnu.org> ---
In this case though, wouldn't the loop vectorizer also be able to handle it if
the permute were simpler? Re-rolling the loop or creating a minimal SLP tree
should be equivalent to:

char src[512];
char dst[512];

#define WIDTH 8

void foo(int height, int a, int b, int c, int d, int dst_stride) {
    char * ptr_src = src;
    char * ptr_dst = dst;
    for( int y = 0; y < height; y++ )
    {
        for( int x = 0; x < WIDTH; x++ )
        {
            int p1 = a + c;
            int p2 = b + d;
            char x1 = (p1 * ptr_src[x]  ) >> 6;
            char x2 = (p2 * ptr_src[x+1]) >> 6;
            ptr_dst[x] = x1 + x2;
        }
        ptr_dst += dst_stride;
        ptr_src += 32;
    }
}

This version does vectorize (using Andre's patch for the SUM reductions with
sign-change casts).
We've seen multiple other cases where doing so would (significantly) improve
vectorization and code generation. So perhaps we should try re-rolling the loop
or creating the smallest (in terms of height) possible SLP tree?
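
To make the re-rolling idea concrete, here is a rough sketch of what I mean.
It assumes the unrolled body is simply eight explicit stores per row, which is
my assumption for illustration and not the testcase attached to this PR. The
names foo_rolled, foo_unrolled, STORE and rolled_matches_unrolled are made up
for this example, and both functions write to a caller-supplied buffer only so
that the results can be compared.

```c
#include <string.h>

#define WIDTH 8

static char src[512];
static char dst_rolled[512];
static char dst_unrolled[512];

/* Rolled form: same body as foo() above, but with the output buffer passed
   in so that the two variants can be compared. */
static void foo_rolled(char *out, int height, int a, int b, int c, int d,
                       int dst_stride)
{
    const char *ptr_src = src;
    char *ptr_dst = out;
    for (int y = 0; y < height; y++) {
        for (int x = 0; x < WIDTH; x++) {
            int p1 = a + c;
            int p2 = b + d;
            char x1 = (p1 * ptr_src[x]) >> 6;
            char x2 = (p2 * ptr_src[x + 1]) >> 6;
            ptr_dst[x] = x1 + x2;
        }
        ptr_dst += dst_stride;
        ptr_src += 32;
    }
}

/* One output element of the hypothetical hand-unrolled body. */
#define STORE(i)                                    \
    do {                                            \
        char x1 = (p1 * ptr_src[(i)]) >> 6;         \
        char x2 = (p2 * ptr_src[(i) + 1]) >> 6;     \
        ptr_dst[(i)] = x1 + x2;                     \
    } while (0)

/* Hypothetical unrolled form: the inner loop over x written out WIDTH (8)
   times. Re-rolling would turn this back into the inner loop above. */
static void foo_unrolled(char *out, int height, int a, int b, int c, int d,
                         int dst_stride)
{
    const char *ptr_src = src;
    char *ptr_dst = out;
    int p1 = a + c;
    int p2 = b + d;
    for (int y = 0; y < height; y++) {
        STORE(0); STORE(1); STORE(2); STORE(3);
        STORE(4); STORE(5); STORE(6); STORE(7);
        ptr_dst += dst_stride;
        ptr_src += 32;
    }
}

/* Fill src with small non-negative values, run both variants, and return 1
   if they produced identical output. */
static int rolled_matches_unrolled(int height, int a, int b, int c, int d,
                                   int dst_stride)
{
    for (int i = 0; i < 512; i++)
        src[i] = (char)((i * 7) % 128);
    memset(dst_rolled, 0, sizeof dst_rolled);
    memset(dst_unrolled, 0, sizeof dst_unrolled);
    foo_rolled(dst_rolled, height, a, b, c, d, dst_stride);
    foo_unrolled(dst_unrolled, height, a, b, c, d, dst_stride);
    return memcmp(dst_rolled, dst_unrolled, sizeof dst_rolled) == 0;
}
```

The helper only checks that the two forms compute the same thing for inputs
that stay inside the 512-byte buffers (e.g. height 8, dst_stride 16). It says
nothing about which form the vectorizer actually handles; that still needs to
be checked on the real testcase.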