https://gcc.gnu.org/bugzilla/show_bug.cgi?id=121393

--- Comment #4 from Andrew Stubbs <ams at gcc dot gnu.org> ---
Here's a minimal reproducer. It requires an unsigned offset variable, a
negative step, and a non-contiguous load large enough to warrant
vectorization. The original testcase doesn't vectorize without collapsing only
because the array dimensions are unhelpful sizes.

float a[100];

int f()
{
  unsigned int i;

#pragma omp target simd
  for (i=100; i != 0; i-=2)
    a[i] = a[i] + 5.5;
}

int main()
{
  f();
  return 0;
}

This will reproduce both with offloading, and when compiled using the
standalone amdgcn compiler directly.

The "optimized" dump shows this stmt:

vect__1.7_16 = .MASK_GATHER_LOAD (&MEM <float[100]> [(void *)&a + 400B], 64B, {
0, 4294967294, 4294967292, 4294967290, 4294967288, [snip.....]

The offsets are 32-bit signed values (4294967294 is the two's complement
encoding of -2, 4294967292 of -4, and so on), but they are marked as unsigned.

The problem, therefore, is *not* in the OMP collapse but in the vectorizer,
although I've yet to put my finger on it.  Basically, when the step is negative
and the loop variable is unsigned, the gather routines are broken.

If I adjust the testcase so that the step is -1 (not -2) it emits the following
contiguous load and permutation pair:

vect__1.9_25 = MEM <vector(32) float> [(float *)_35];
vect__1.10_26 = VEC_PERM_EXPR <vect__1.9_25, vect__1.9_25, { 31, 30, 29, 28,
27, 26, [snip ....]

I'm not sure whether the hardware would prefer the gather operation for
performance, but for correctness this approach would also work for the
step==-2 case.
