https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98544

--- Comment #8 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to Martin Liška from comment #2)
> Confirmed, one can reduce that to a single loop vectorization:
> 
> $ g++ bug2.cc  -std=c++17 -O1 -mavx -ftree-loop-vectorize
> -fdbg-cnt=vect_loop:10-10 && ./a.out
> 
> but the loop is quite huge.

Btw, each of 11-11, 12-12 and 13-13 is also enough on its own to trigger a
miscompare.
The 11-11 loop looks like the smallest one to me:

***dbgcnt: lower limit 11 reached for vect_loop.***
***dbgcnt: upper limit 11 reached for vect_loop.***
fft1d.h:1256:23: optimized: loop vectorized using 32 byte vectors
fft1d.h:1256:23: optimized:  loop versioned for vectorization because of possible aliasing

It also needs only a single alias check (I am just guessing at where things
may go wrong).
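
For readers not familiar with the "versioned for vectorization" note:
conceptually the vectorizer guards the vector loop with a runtime overlap
test and keeps a scalar copy as the fallback.  A rough hand-written sketch of
that shape follows.  The names ranges_overlap and scale_versioned and the
toy loop body are made up for illustration; this is not the code GCC emits
for the loop in this bug.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper: true if [a, a+n) and [b, b+n) share any element.
// Addresses are compared as integers to avoid relational comparison of
// pointers into unrelated arrays.
static bool ranges_overlap(const double *a, const double *b, std::size_t n)
{
  auto pa = reinterpret_cast<std::uintptr_t>(a);
  auto pb = reinterpret_cast<std::uintptr_t>(b);
  std::uintptr_t bytes = n * sizeof(double);
  return pa < pb + bytes && pb < pa + bytes;
}

// Hypothetical toy loop dst[i] = 2*src[i], laid out like a versioned loop.
void scale_versioned(double *dst, const double *src, std::size_t n)
{
  if (!ranges_overlap(dst, src, n))
    {
      // "Vector" version: all loads of a group happen before its stores,
      // which is only valid because the ranges were checked to be disjoint.
      std::size_t i = 0;
      for (; i + 4 <= n; i += 4)
        {
          double t0 = src[i], t1 = src[i+1], t2 = src[i+2], t3 = src[i+3];
          dst[i] = 2*t0; dst[i+1] = 2*t1; dst[i+2] = 2*t2; dst[i+3] = 2*t3;
        }
      // Scalar epilogue for the remaining elements.
      for (; i < n; ++i)
        dst[i] = 2*src[i];
    }
  else
    {
      // Scalar fallback: keeps the original element-by-element order.
      for (std::size_t i = 0; i < n; ++i)
        dst[i] = 2*src[i];
    }
}
```

If the overlap test is wrong or missing for one pair of accesses, the vector
version can reorder a load past a store it depends on, which is one way a
miscompare like the one reported here could arise.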

The source corresponds to

template<typename T> void radb2(size_t ido, size_t l1,
  const T * DUCC0_RESTRICT cc, T * DUCC0_RESTRICT ch,
  const T0 * DUCC0_RESTRICT wa) const
  {
  auto WA = [wa,ido](size_t x, size_t i) { return wa[i+x*(ido-1)]; };
  auto CC = [cc,ido](size_t a, size_t b, size_t c) -> const T&
    { return cc[a+ido*(b+2*c)]; };
  auto CH = [ch,ido,l1](size_t a, size_t b, size_t c) -> T&
    { return ch[a+ido*(b+l1*c)]; };

  for (size_t k=0; k<l1; k++)
    PM (CH(0,k,0),CH(0,k,1),CC(0,0,k),CC(ido-1,1,k));
  if ((ido&1)==0)
    for (size_t k=0; k<l1; k++)
      {
      CH(ido-1,k,0) = T0( 2)*CC(ido-1,0,k);
      CH(ido-1,k,1) = T0(-2)*CC(0    ,1,k);
      }
  if (ido<=2) return;
  for (size_t k=0; k<l1;++k)
====>  this loop
    for (size_t i=2; i<ido; i+=2)
      {
      size_t ic=ido-i;
      T ti2, tr2;
      PM (CH(i-1,k,0),tr2,CC(i-1,0,k),CC(ic-1,1,k));
      PM (ti2,CH(i  ,k,0),CC(i  ,0,k),CC(ic  ,1,k));
      MULPM (CH(i,k,1),CH(i-1,k,1),WA(0,i-2),WA(0,i-1),ti2,tr2);
      }
<====
  }
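
To experiment with just the flagged loop nest outside the full testcase, a
standalone sketch might look like the following.  Assumptions, none of which
are confirmed in this report: PM and MULPM are the usual add/subtract and
multiply helpers (a=c+d, b=c-d and a=c*e+d*f, b=c*f-d*e), T and T0 are both
double, and the free function name radb2_inner is made up.

```cpp
#include <cstddef>

// Assumed helper definitions; they are not shown in this report.
template<typename T> inline void PM(T &a, T &b, T c, T d)
  { a=c+d; b=c-d; }
template<typename T> inline void MULPM(T &a, T &b, T c, T d, T e, T f)
  { a=c*e+d*f; b=c*f-d*e; }

// Hypothetical standalone copy of the flagged loop nest, T = T0 = double.
void radb2_inner(std::size_t ido, std::size_t l1,
  const double *cc, double *ch, const double *wa)
  {
  // Same index helpers as in the quoted source.
  auto WA = [wa,ido](std::size_t x, std::size_t i) { return wa[i+x*(ido-1)]; };
  auto CC = [cc,ido](std::size_t a, std::size_t b, std::size_t c) -> const double&
    { return cc[a+ido*(b+2*c)]; };
  auto CH = [ch,ido,l1](std::size_t a, std::size_t b, std::size_t c) -> double&
    { return ch[a+ido*(b+l1*c)]; };

  for (std::size_t k=0; k<l1; ++k)
    for (std::size_t i=2; i<ido; i+=2)
      {
      std::size_t ic=ido-i;
      double ti2, tr2;
      PM (CH(i-1,k,0),tr2,CC(i-1,0,k),CC(ic-1,1,k));
      PM (ti2,CH(i  ,k,0),CC(i  ,0,k),CC(ic  ,1,k));
      MULPM (CH(i,k,1),CH(i-1,k,1),WA(0,i-2),WA(0,i-1),ti2,tr2);
      }
  }
```

This could be built with the options from comment #2 and its output compared
against an unvectorized build.  Whether this reduced form still reproduces
the miscompare is untested.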
