https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103144

--- Comment #3 from rguenther at suse dot de <rguenther at suse dot de> ---
On Wed, 1 Dec 2021, crazylht at gmail dot com wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103144
> 
> --- Comment #2 from Hongtao.liu <crazylht at gmail dot com> ---
> Another issue is for SLP, when trip count is small and loop is completely
> unrolled. SLP failed to generate vlshr_optab.
> 
> #include<stdint.h>
> void
> foo (uint64_t* __restrict pdst, uint64_t* psrc, uint64_t shift)
> {
>     for (int64_t i = 0; i != 4; i++)
>     {
>       pdst[i] = psrc[i] + shift;
>       shift >>= 1;
>     }
> }
> 
> .175t.slp1 dump:
> 
>   _8 = *psrc_12(D);
>   _28 = _8 + shift_10(D);
>   shift_30 = shift_10(D) >> 1;
>   _39 = psrc_12(D) + 8;
>   _40 = *_39;
>   _41 = pdst_13(D) + 8;
>   _42 = shift_30 + _40;
>   shift_44 = shift_30 >> 1;
>   _53 = psrc_12(D) + 16;
>   _54 = *_53;
>   _55 = pdst_13(D) + 16;
>   _56 = shift_44 + _54;
>   shift_58 = shift_44 >> 1;
>   _3 = psrc_12(D) + 24;
>   _4 = *_3;
>   _5 = pdst_13(D) + 24;
>   _6 = _4 + shift_58;
>   _15 = {_28, _42, _56, _6};
>   vectp.8_18 = pdst_13(D);
>   MEM <vector(4) long unsigned int> [(uint64_t *)vectp.8_18] = _15;

The issue is that the shifts are dependent on each other:

  shift_30 = shift_10(D) >> 1;
  shift_44 = shift_30 >> 1;
  shift_58 = shift_44 >> 1;

and one addend is shift_10(D) which isn't shifted (SLP build does
not support inserting no-op operations into some lanes like +- 0
or */ 1 or in this case >> 0).  So first of all we either need
to simplify the above to

  shift_30 = shift_10(D) >> 1;
  shift_44 = shift_10(D) >> 2;
  shift_58 = shift_10(D) >> 3;

then the shift_10(D) >> 0 issue remains.

With

void
foo (uint64_t* __restrict pdst, uint64_t* psrc, uint64_t shift)
{
    for (int64_t i = 0; i != 4; i++)
    {
      shift >>= 1;
      pdst[i] = psrc[i] + shift;
    }
}

you see SLP discovery works but we vectorize it as

  shift_23 = shift_10(D) >> 1;
  shift_37 = shift_23 >> 1;
  shift_51 = shift_37 >> 1;
  _17 = {shift_10(D), shift_23, shift_37, shift_51};
  vect_shift_23.10_8 = _17 >> 1;

which eats from the profitability.  We also fail to improve
such code (if it were written by a user) later realizing
we could merge the scalar shifts with the vector shift.

Reply via email to