Richard Sandiford <[email protected]> writes:
> Richard Biener via Gcc-patches <[email protected]> writes:
>> The following makes sure to limit the shift operand when vectorizing
>> (short)((int)x >> 31) via (short)x >> 31 as the out of bounds shift
>> operand otherwise invokes undefined behavior. When we determine
>> whether we can demote the operand we know we at most shift in the
>> sign bit so we can adjust the shift amount.
>>
>> Note this has the possibility of un-CSEing common shift operands
>> as there's no good way to share pattern stmts between patterns.
>> We'd have to separately pattern recognize the definition.
>>
>> Bootstrapped on x86_64-unknown-linux-gnu, testing in progress.
>>
>> Not sure about LSHIFT_EXPR, it probably has the same issue but
>> the fallback optimistic zero for out-of-range shifts is at least
>> "corrrect". Not sure we ever try to demote rotates (probably not).
>
> I guess you mean "correct" for x86? But that's just a quirk of x86.
> IMO the behaviour is equally wrong for LSHIFT_EXPR.

Sorry for the multiple messages. I wanted to get something out quickly
because I wasn't sure how long it would take me to write this...

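To make the discussion concrete: the demotion in question clamps the
shift amount to the narrow precision minus one, which is safe for
arithmetic right shifts because only sign bits are shifted in.  A
standalone check of that identity (just an illustration, not the
pattern code):

#include <assert.h>
#include <stdint.h>

/* Check that for every 16-bit x, (short)((int)x >> 31) equals
   (short)(x >> 15), i.e. an out-of-range shift amount can be clamped
   to precision - 1 when only the sign bit is shifted in.  */
int
main (void)
{
  for (int32_t i = INT16_MIN; i <= INT16_MAX; ++i)
    {
      int16_t x = (int16_t) i;
      int16_t wide = (int16_t) ((int32_t) x >> 31);
      int16_t narrow = (int16_t) (x >> 15);
      assert (wide == narrow);
    }
  return 0;
}
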
On rotates, for:

void
foo (unsigned short *restrict ptr)
{
  for (int i = 0; i < 200; ++i)
    {
      unsigned int x = ptr[i] & 0xff0;
      ptr[i] = (x << 1) | (x >> 31);
    }
}

we do get:

  can narrow to unsigned:13 without loss of precision: _5 = x_12 r>> 31;

(the mask leaves 12 significant bits, so the rotated result needs 13),
although aarch64 doesn't provide rrotate patterns, so nothing actually
comes of it.

I think the handling of variable shifts is flawed for other reasons.
Given:

void
uu (unsigned short *restrict ptr1, unsigned short *restrict ptr2)
{
  for (int i = 0; i < 200; ++i)
    ptr1[i] = ptr1[i] >> ptr2[i];
}

void
us (unsigned short *restrict ptr1, short *restrict ptr2)
{
  for (int i = 0; i < 200; ++i)
    ptr1[i] = ptr1[i] >> ptr2[i];
}

void
su (short *restrict ptr1, unsigned short *restrict ptr2)
{
  for (int i = 0; i < 200; ++i)
    ptr1[i] = ptr1[i] >> ptr2[i];
}

void
ss (short *restrict ptr1, short *restrict ptr2)
{
  for (int i = 0; i < 200; ++i)
    ptr1[i] = ptr1[i] >> ptr2[i];
}

we only narrow uu and ss, due to:

  /* Ignore codes that don't take uniform arguments.  */
  if (!types_compatible_p (TREE_TYPE (op), type))
    return;

in vect_determine_precisions_from_range (in us and su the shift
amount's type differs in signedness from the shifted operand's, so
the check rejects them).  Maybe we should drop the shift handling
from there and instead rely on
vect_determine_precisions_from_users, extending:

  if (TREE_CODE (shift) != INTEGER_CST
      || !wi::ltu_p (wi::to_widest (shift), precision))
    return;

to handle ranges where the max is known to be < precision.
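
Something along these lines, perhaps (completely untested, and only a
sketch: I'm assuming the ranger API here, and the surrounding names
are taken from the existing code from memory):

  if (TREE_CODE (shift) != INTEGER_CST)
    {
      /* Sketch only: accept a variable shift amount whose range is
         known to stay below the precision we want to narrow to.
         A signed shift type would also need its lower bound checked.  */
      int_range_max r;
      if (!get_range_query (cfun)->range_of_expr (r, shift)
          || r.undefined_p ()
          || wi::geu_p (widest_int::from (r.upper_bound (),
                                          TYPE_SIGN (TREE_TYPE (shift))),
                        precision))
        return;
    }
  else if (!wi::ltu_p (wi::to_widest (shift), precision))
    return;
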
There again, if masking is enough for right shifts and right rotates,
maybe we should keep the current handling for them (with your fix)
and skip the types_compatible_p check for those cases.

So:

- restrict shift handling in vect_determine_precisions_from_range to
  RSHIFT_EXPR and RROTATE_EXPR
- remove the types_compatible_p restriction for those cases
  (a sketch of these two changes follows the list)
- extend the vect_determine_precisions_from_users shift handling to
  check for ranges on the shift amount (as sketched above)
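
For the first two points, I'm imagining something like this around
the existing types_compatible_p check (again untested, and the
surrounding structure is from memory):

  if (code == LSHIFT_EXPR)
    /* Left shifts can't be fixed up by clamping the shift amount,
       so leave them to vect_determine_precisions_from_users.  */
    return;
  else if (code == RSHIFT_EXPR || code == RROTATE_EXPR)
    /* The shift amount needn't be uniform with the shifted operand:
       an out-of-range amount can be clamped (right shifts) or
       masked (right rotates).  */
    ;
  /* Ignore codes that don't take uniform arguments.  */
  else if (!types_compatible_p (TREE_TYPE (op), type))
    return;
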
Does that sound right?
Thanks,
Richard