On Wed, 2 Aug 2023, Richard Sandiford wrote:
> Richard Biener <[email protected]> writes:
> > On Tue, 1 Aug 2023, Richard Sandiford wrote:
> >
> >> Richard Sandiford <[email protected]> writes:
> >> > Richard Biener via Gcc-patches <[email protected]> writes:
> >> >> The following makes sure to limit the shift operand when vectorizing
> >> >> (short)((int)x >> 31) via (short)x >> 31 as the out of bounds shift
> >> >> operand otherwise invokes undefined behavior. When we determine
> >> >> whether we can demote the operand we know we at most shift in the
> >> >> sign bit so we can adjust the shift amount.
> >> >>
> >> >> Note this has the possibility of un-CSEing common shift operands
> >> >> as there's no good way to share pattern stmts between patterns.
> >> >> We'd have to separately pattern recognize the definition.
> >> >>
> >> >> Bootstrapped on x86_64-unknown-linux-gnu, testing in progress.
> >> >>
> >> >> Not sure about LSHIFT_EXPR, it probably has the same issue but
> >> >> the fallback optimistic zero for out-of-range shifts is at least
> >> >> "corrrect". Not sure we ever try to demote rotates (probably not).
> >> >
> >> > I guess you mean "correct" for x86? But that's just a quirk of x86.
> >> > IMO the behaviour is equally wrong for LSHIFT_EXPR.
> >
> > I meant "correct" for the constant folding that evaluates out-of-bound
> > shifts as zero.
> >
> >> Sorry for the multiple messages. Wanted to get something out quickly
> >> because I wasn't sure how long it would take me to write this...
> >>
> >> On rotates, for:
> >>
> >> void
> >> foo (unsigned short *restrict ptr)
> >> {
> >> for (int i = 0; i < 200; ++i)
> >> {
> >> unsigned int x = ptr[i] & 0xff0;
> >> ptr[i] = (x << 1) | (x >> 31);
> >> }
> >> }
> >>
> >> we do get:
> >>
> >> can narrow to unsigned:13 without loss of precision: _5 = x_12 r>> 31;
> >>
> >> although aarch64 doesn't provide rrotate patterns, so nothing actually
> >> comes of it.
> >
> > I think it's still correct that we only need unsigned:13 for the input,
> > we know other bits are zero. But of course when actually applying
> > this as documented
> >
> > /* Record that STMT_INFO could be changed from operating on TYPE to
> > operating on a type with the precision and sign given by PRECISION
> > and SIGN respectively.
> >
> > the operation itself has to be altered (the above doesn't suggest
> > promoting/demoting the operands to TYPE is the only thing to do).
> >
> > So it seems to be the burden is on the consumers of the information?
>
> Yeah, textually that seems fair. Not sure I was thinking of it in
> those terms at the time though. :)
>
> >> I think the handling of variable shifts is flawed for other reasons.
> >> Given:
> >>
> >> void
> >> uu (unsigned short *restrict ptr1, unsigned short *restrict ptr2)
> >> {
> >> for (int i = 0; i < 200; ++i)
> >> ptr1[i] = ptr1[i] >> ptr2[i];
> >> }
> >>
> >> void
> >> us (unsigned short *restrict ptr1, short *restrict ptr2)
> >> {
> >> for (int i = 0; i < 200; ++i)
> >> ptr1[i] = ptr1[i] >> ptr2[i];
> >> }
> >>
> >> void
> >> su (short *restrict ptr1, unsigned short *restrict ptr2)
> >> {
> >> for (int i = 0; i < 200; ++i)
> >> ptr1[i] = ptr1[i] >> ptr2[i];
> >> }
> >>
> >> void
> >> ss (short *restrict ptr1, short *restrict ptr2)
> >> {
> >> for (int i = 0; i < 200; ++i)
> >> ptr1[i] = ptr1[i] >> ptr2[i];
> >> }
> >>
> >> we only narrow uu and ss, due to:
> >>
> >> /* Ignore codes that don't take uniform arguments. */
> >> if (!types_compatible_p (TREE_TYPE (op), type))
> >> return;
> >
> > I suppose that's because we care about the shift operand at all here.
> > We could possibly use [0 .. precision-1] as known range for it
> > and only if that doesn't fit 'type' give up (and otherwise simply
> > ignore the input range of the shift operands here).
> >
> >> in vect_determine_precisions_from_range. Maybe we should drop
> >> the shift handling from there and instead rely on
> >> vect_determine_precisions_from_users, extending:
> >>
> >> if (TREE_CODE (shift) != INTEGER_CST
> >> || !wi::ltu_p (wi::to_widest (shift), precision))
> >> return;
> >>
> >> to handle ranges where the max is known to be < precision.
> >>
> >> There again, if masking is enough for right shifts and right rotates,
> >> maybe we should keep the current handling for then (with your fix)
> >> and skip the types_compatible_p check for those cases.
> >
> > I think it should be enough for left-shifts as well? If we lshift
> > out like 0x100 << 9 so the lhs range is [0,0] the input range from
> > op0 will still make us use HImode. I think we only ever get overly
> > conservative answers for left-shifts from this function?
>
> But if we have:
>
> short x, y;
> int z = (int) x << (int) y;
>
> and at runtime, x == 1, y == 16, (short) z should be 0 (no UB),
> whereas x << y would invoke UB and x << (y & 15) would be 1.
True, but we start with the range of the LHS which in this case
would be of type 'int' and thus 1 << 16 and not zero. You
might call that a failure of vect_determine_precisions_from_range
of course, since it makes it not exactly a forward propagation ...
> > Whatever works for RROTATE should also work for LROTATE.
>
> I think the same problem affects LROTATE.
>
> >> So:
> >>
> >> - restrict shift handling in vect_determine_precisions_from_range to
> >> RSHIFT_EXPR and RROTATE_EXPR
> >>
> >> - remove types_compatible_p restriction for those cases
> >>
> >> - extend vect_determine_precisions_from_users shift handling to check
> >> for ranges on the shift amount
> >>
> >> Does that sound right?
> >
> > I'm not sure. This all felt somewhat fragile when looking closer
> > (I was hoping you would eventually tackle it from the older
> > referenced bug) ... so my main idea was to perform incremental changes
> > where I have test coverage (as with the new wrong-code bug).
>
> Fair :)
>
> > Originally I completely disabled shift support but that regressed
> > the over-widen testcases a lot which at least have widened shifts
> > by constants a lot.
> >
> > x86 has vector rotates only for AMD XOP (which is dead) plus
> > some for V1TImode AFAICS, but I think we pattern-match rotates
> > to shifts, so maybe the precision stuff is interesting for the
> > case where we match the pattern rotate sequence for widenings?
> >
> > So for the types_compatible_p issue something along
> > the following? We could also exempt the shift operand from
> > being covered by min_precision so the consumer would have
> > to make sure it can be represented (I think that's never going
> > to be an issue in practice until we get 256bit integers vectorized).
> > It will have to fixup the shift operands anyway.
> >
> > diff --git a/gcc/tree-vect-patterns.cc b/gcc/tree-vect-patterns.cc
> > index e4ab8c2d65b..cdeeaf98a47 100644
> > --- a/gcc/tree-vect-patterns.cc
> > +++ b/gcc/tree-vect-patterns.cc
> > @@ -6378,16 +6378,26 @@ vect_determine_precisions_from_range
> > (stmt_vec_info stmt_info, gassign *stmt)
> > }
> > else if (TREE_CODE (op) == SSA_NAME)
> > {
> > - /* Ignore codes that don't take uniform arguments. */
> > - if (!types_compatible_p (TREE_TYPE (op), type))
> > + /* Ignore codes that don't take uniform arguments. For shifts
> > + the shift amount is known to be in-range. */
>
> I guess it's more "we can assume that the amount is in range"?
Yes.
> > + if (code == LSHIFT_EXPR
> > + || code == RSHIFT_EXPR
> > + || code == LROTATE_EXPR
> > + || code == RROTATE_EXPR)
> > + {
> > + min_value = wi::min (min_value, 0, sign);
> > + max_value = wi::max (max_value, TYPE_PRECISION (type),
> > sign);
>
> LGTM for shifts right. Because of the above lshift thing, I think we
> need something like:
>
> if (code == LSHIFT_EXPR || code == LROTATE_EXPR)
> {
> wide_int op_min_value, op_max_value;
> if (!vect_get_range_info (op, &op_min_value, op_max_value))
> return;
>
> /* We can ignore left shifts by negative amounts, which are UB. */
> min_value = wi::min (min_value, 0, sign);
>
> /* Make sure the highest non-UB shift amount doesn't become UB. */
> op_max_value = wi::umin (op_max_value, TYPE_PRECISION (type));
> auto mask = wi::mask (TYPE_PRECISION (type), false,
> op_max_value.to_uhwi ());
> max_value = wi::max (max_value, mask, sign);
> }
>
> Does that look right?
As said it looks overly conservative to me? For example with my patch
for
void foo (signed char *v, int s)
{
if (s < 1 || s > 7)
return;
for (int i = 0; i < 1024; ++i)
v[i] = v[i] << s;
}
I get
t.c:5:21: note: _7 has range [0xffffc000, 0x3f80]
t.c:5:21: note: can narrow to signed:15 without loss of precision: _7 =
_6 << s_12(D);
t.c:5:21: note: only the low 15 bits of _6 are significant
t.c:5:21: note: _6 has range [0xffffff80, 0x7f]
...
t.c:5:21: note: vect_recog_over_widening_pattern: detected: _7 = _6 <<
s_12(D);
t.c:5:21: note: demoting int to signed short
t.c:5:21: note: Splitting statement: _6 = (int) _5;
t.c:5:21: note: into pattern statements: patt_24 = (signed short) _5;
t.c:5:21: note: and: patt_23 = (int) patt_24;
t.c:5:21: note: created pattern stmt: patt_21 = patt_24 << patt_22;
t.c:5:21: note: over_widening pattern recognized: patt_18 = (int)
patt_21;
and eventually
patt_22 = (signed short) s_12(D);
...
vect__5.7_28 = MEM <vector(16) signed char> [(signed char
*)vectp_v.5_10];
vect_patt_24.8_29 = [vec_unpack_lo_expr] vect__5.7_28;
vect_patt_24.8_30 = [vec_unpack_hi_expr] vect__5.7_28;
vect_patt_21.9_31 = vect_patt_24.8_29 << patt_22;
vect_patt_21.9_32 = vect_patt_24.8_30 << patt_22;
vect_patt_17.10_33 = VEC_PACK_TRUNC_EXPR <vect_patt_21.9_31,
vect_patt_21.9_32>;
MEM <vector(16) signed char> [(signed char *)vectp_v.11_34] =
vect_patt_17.10_33;
which I think is correct. If I'd up the range of the shift
operand then the range for _7 would be larger and we wouldn't
be able to shrink the range. I suppose the _from_users
might back-propagate the truncation to signed char and enable
that though?
> Thanks, and sorry for the scope creep.
Well, if we can agree on that the patch as originally is
correct (but incomplete) we can maybe still install it.
As seen above I wonder if there's really an existing problem
for LSHIFT_EXPR. I'm sure we mishandle rotates, but again
at the consumer side - in priciple we could analyze them
as lshift + rshift which means the existing behavior is
probably also correct.
Ricahrd.
> Richard
>
> > + }
> > + else if (types_compatible_p (TREE_TYPE (op), type))
> > + {
> > + wide_int op_min_value, op_max_value;
> > + if (!vect_get_range_info (op, &op_min_value,
> > &op_max_value))
> > + return;
> > + min_value = wi::min (min_value, op_min_value, sign);
> > + max_value = wi::max (max_value, op_max_value, sign);
> > + }
> > + else
> > return;
> > -
> > - wide_int op_min_value, op_max_value;
> > - if (!vect_get_range_info (op, &op_min_value, &op_max_value))
> > - return;
> > -
> > - min_value = wi::min (min_value, op_min_value, sign);
> > - max_value = wi::max (max_value, op_max_value, sign);
> > }
> > else
> > return;