13 regression] vectorization causes loop unrolling test slowdown as measured by Adobe's C++Benchmark

crazylht at gmail dot com via Gcc-bugs Sun, 29 May 2022 23:41:05 -0700

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53533


--- Comment #45 from Hongtao.liu <crazylht at gmail dot com> ---
A reduced testcase.

int a[256];
int b[256];

void foo (void)
{
  int i;
  for (i = 0; i < 256; ++i)
    {
      int tmp = a[i] + 12345;
      tmp *= 914237;
      tmp += 12332;
      tmp *= 914237;
      tmp += 12332;
      tmp *= 914237;
      tmp -= 13;
      tmp *= 8000;
      b[i] = tmp;
    }
}

GCC now simplifies pmulld into pslld + padd + psub, and the vectorizer cost model
looks fine. The scalar version, however, is further optimized in pass_combine from
4 * mult + 3 * add to 1 * mult + 2 * add, which the vectorizer does not take into
account. The vectorized version is not simplified later.

        mov     eax, DWORD PTR a[rdx]
        add     rdx, 4
        add     eax, 12345
        imul    eax, eax, -1564285888
        sub     eax, 333519936
        mov     DWORD PTR b[rdx-4], eax
        cmp     rdx, 1024
        jne     .L2


I'm wondering whether GIMPLE could also simplify

      tmp *= 914237;
      tmp += 12332;
      tmp *= 914237;
      tmp += 12332;
      tmp *= 914237;
      tmp -= 13;
      tmp *= 8000;

to 
     tmp *= -1564285888;
     tmp -= 333519936;

See https://godbolt.org/z/qYMYMTxEY

Then the vectorized code would be better optimized.
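
For reference, a minimal sketch of how the two constants can be derived and
checked. It is not what pass_combine does internally; it only composes each
step of the chain as an affine map x -> mul * x + add. It uses uint32_t rather
than int so that wraparound is well defined in C (signed overflow is not). The
names struct affine, then_mul, then_add, chain_original, fold_chain and
check_fold are made up for this illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper type: the map x -> mul * x + add, modulo 2^32.  */
struct affine { uint32_t mul, add; };

/* Append "x *= m" to the chain: (mul * x + add) * m.  */
static struct affine then_mul (struct affine f, uint32_t m)
{
  f.mul *= m;
  f.add *= m;
  return f;
}

/* Append "x += c"; a subtraction is an addition of the wrapped negative.  */
static struct affine then_add (struct affine f, uint32_t c)
{
  f.add += c;
  return f;
}

/* The chain from the testcase, applied to tmp = a[i] + 12345.  */
uint32_t chain_original (uint32_t tmp)
{
  tmp *= 914237u;
  tmp += 12332u;
  tmp *= 914237u;
  tmp += 12332u;
  tmp *= 914237u;
  tmp -= 13u;
  tmp *= 8000u;
  return tmp;
}

/* Fold the same seven steps into a single multiply and a single add.  */
struct affine fold_chain (void)
{
  struct affine f = { 1u, 0u };   /* identity map */
  f = then_mul (f, 914237u);
  f = then_add (f, 12332u);
  f = then_mul (f, 914237u);
  f = then_add (f, 12332u);
  f = then_mul (f, 914237u);
  f = then_add (f, 0u - 13u);     /* tmp -= 13 */
  f = then_mul (f, 8000u);
  return f;
}

/* Check the folded constants against the ones in the scalar asm above,
   and compare both forms over a range of inputs.  */
void check_fold (void)
{
  struct affine f = fold_chain ();
  assert (f.mul == (uint32_t) -1564285888);
  assert (f.add == (uint32_t) -333519936);
  for (uint32_t x = 0; x < 100000u; ++x)
    assert (chain_original (x) == f.mul * x + f.add);
}
```

Calling check_fold () from a small main should pass silently if the folding is
right. The folded add constant is the wrapped negative of 333519936, which is
why the scalar code can use a sub.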

[Bug rtl-optimization/53533] [10/11/12/13 regression] vectorization causes loop unrolling test slowdown as measured by Adobe's C++Benchmark
