RE: [PATCH 1/2]middle-end: Fix wrong overmatching of div-bitmask by using new optabs [PR108583]

Tamar Christina via Gcc-patches Fri, 10 Feb 2023 08:33:54 -0800

> -----Original Message-----
> From: Richard Sandiford <richard.sandif...@arm.com>
> Sent: Friday, February 10, 2023 4:25 PM
> To: Tamar Christina <tamar.christ...@arm.com>
> Cc: Tamar Christina via Gcc-patches <gcc-patches@gcc.gnu.org>; nd
> <n...@arm.com>; rguent...@suse.de; j...@ventanamicro.com
> Subject: Re: [PATCH 1/2]middle-end: Fix wrong overmatching of div-bitmask
> by using new optabs [PR108583]
> 
> Tamar Christina <tamar.christ...@arm.com> writes:
> >> -----Original Message-----
> >> From: Richard Sandiford <richard.sandif...@arm.com>
> >> Sent: Friday, February 10, 2023 3:57 PM
> >> To: Tamar Christina <tamar.christ...@arm.com>
> >> Cc: Tamar Christina via Gcc-patches <gcc-patches@gcc.gnu.org>; nd
> >> <n...@arm.com>; rguent...@suse.de; j...@ventanamicro.com
> >> Subject: Re: [PATCH 1/2]middle-end: Fix wrong overmatching of
> >> div-bitmask by using new optabs [PR108583]
> >>
> >> Tamar Christina <tamar.christ...@arm.com> writes:
> >> >> > a/gcc/tree-vect-patterns.cc b/gcc/tree-vect-patterns.cc index
> >> >> >
> >> >>
> >>
> 6934aebc69f231af24668f0a1c3d140e97f55487..e39d7e6b362ef44eb2fc467f33
> >> >> 69
> >> >> > de2afea139d6 100644
> >> >> > --- a/gcc/tree-vect-patterns.cc
> >> >> > +++ b/gcc/tree-vect-patterns.cc
> >> >> > @@ -3914,12 +3914,82 @@ vect_recog_divmod_pattern (vec_info
> >> *vinfo,
> >> >> >        return pattern_stmt;
> >> >> >      }
> >> >> >    else if ((cst = uniform_integer_cst_p (oprnd1))
> >> >> > -        && targetm.vectorize.can_special_div_by_const (rhs_code,
> >> >> vectype,
> >> >> > -                                                       wi::to_wide
> (cst),
> >> >> > -                                                       NULL,
> NULL_RTX,
> >> >> > -                                                       NULL_RTX))
> >> >> > +        && TYPE_UNSIGNED (itype)
> >> >> > +        && rhs_code == TRUNC_DIV_EXPR
> >> >> > +        && vectype
> >> >> > +        && direct_internal_fn_supported_p (IFN_ADDH, vectype,
> >> >> > +                                           OPTIMIZE_FOR_SPEED))
> >> >> >      {
> >> >> > -      return NULL;
> >> >> > +      /* div optimizations using narrowings
> >> >> > +       we can do the division e.g. shorts by 255 faster by 
> >> >> > calculating it
> as
> >> >> > +       (x + ((x + 257) >> 8)) >> 8 assuming the operation is done in
> >> >> > +       double the precision of x.
> >> >> > +
> >> >> > +       If we imagine a short as being composed of two blocks of
> >> >> > + bytes
> >> then
> >> >> > +       adding 257 or 0b0000_0001_0000_0001 to the number is
> >> >> > + equivalent
> >> to
> >> >> > +       adding 1 to each sub component:
> >> >> > +
> >> >> > +         short value of 16-bits
> >> >> > +       ┌──────────────┬────────────────┐
> >> >> > +       │              │                │
> >> >> > +       └──────────────┴────────────────┘
> >> >> > +      8-bit part1 ▲  8-bit part2   ▲
> >> >> > +                  │                │
> >> >> > +                  │                │
> >> >> > +                 +1               +1
> >> >> > +
> >> >> > +       after the first addition, we have to shift right by 8, and 
> >> >> > narrow
> the
> >> >> > +       results back to a byte.  Remember that the addition must
> >> >> > + be done
> >> in
> >> >> > +       double the precision of the input.  However if we know
> >> >> > + that the
> >> >> addition
> >> >> > +       `x + 257` does not overflow then we can do the operation
> >> >> > + in the
> >> >> current
> >> >> > +       precision.  In which case we don't need the pack and unpacks.
> */
> >> >> > +      auto wcst = wi::to_wide (cst);
> >> >> > +      int pow = wi::exact_log2 (wcst + 1);
> >> >> > +      if (pow == (int) (element_precision (vectype) / 2))
> >> >> > +     {
> >> >> > +       wide_int min,max;
> >> >> > +       /* If we're in a pattern we need to find the orginal
> definition.  */
> >> >> > +       tree op0 = oprnd0;
> >> >> > +       gimple *stmt = SSA_NAME_DEF_STMT (oprnd0);
> >> >> > +       stmt_vec_info stmt_info = vinfo->lookup_stmt (stmt);
> >> >> > +       if (is_pattern_stmt_p (stmt_info))
> >> >> > +         {
> >> >> > +           auto orig_stmt = STMT_VINFO_RELATED_STMT
> (stmt_info);
> >> >> > +           if (is_gimple_assign (STMT_VINFO_STMT (orig_stmt)))
> >> >> > +             op0 = gimple_assign_lhs (STMT_VINFO_STMT
> (orig_stmt));
> >> >> > +         }
> >> >>
> >> >> If this is generally safe (I'm skipping thinking about it in the
> >> >> interests of a quick review :-)), then I think it should be done
> >> >> in vect_get_range_info instead.  Using gimple_get_lhs would be
> >> >> more general than handling just assignments.
> >> >>
> >> >> > +
> >> >> > +       /* Check that no overflow will occur.  If we don't have range
> >> >> > +          information we can't perform the optimization.  */
> >> >> > +       if (vect_get_range_info (op0, &min, &max))
> >> >> > +         {
> >> >> > +           wide_int one = wi::to_wide (build_one_cst (itype));
> >> >> > +           wide_int adder = wi::add (one, wi::lshift (one, pow));
> >> >> > +           wi::overflow_type ovf;
> >> >> > +           /* We need adder and max in the same precision.  */
> >> >> > +           wide_int zadder
> >> >> > +             = wide_int_storage::from (adder, wi::get_precision
> (max),
> >> >> > +                                       UNSIGNED);
> >> >> > +           wi::add (max, zadder, UNSIGNED, &ovf);
> >> >>
> >> >> Could you explain this a bit more?  When do we have mismatched
> >> >> precisions?
> >> >
> >> > C promotion rules will promote e.g.
> >> >
> >> > void fun2(uint8_t* restrict pixel, uint8_t level, int n) {
> >> >   for (int i = 0; i < n; i+=1)
> >> >     pixel[i] = (pixel[i] + level) / 0xff; }
> >> >
> >> > And have the addition be done as a 32 bit integer.  The vectorizer
> >> > will demote this down to a short, but range information is not
> >> > stored for patterns.  So In the above the range will correctly be
> >> > 0x1fe but the precision will be that of the original expression, so
> >> > 32.  This will be a mismatch with itype which is derived from the
> >> > size the vectorizer
> >> will perform the operation in.
> >>
> >> Gah, missed this first time round, sorry.
> >>
> >> Richi would know better than me, but I think it's dangerous to rely
> >> on the orig/pattern link for range information.  The end result of a
> >> pattern
> >> (vect_stmt_to_vectorize) has to have the same type as the lhs of the
> >> original statement.  But the other statements in the pattern sequence
> >> can do arbitrary things.  Their range isn't predictable from the
> >> range of the original statement result.
> >>
> >> IIRC, the addition above is converted to:
> >>
> >>   a' = (uint16_t) pixel[i]
> >>   b' = (uint16_t) level
> >>   sum' = a' + b'
> >>   sum = (int) sum'
> >>
> >> where sum is the direct replacement of "pixel[i] + level", with the
> >> same type and range.  The division then uses sum' instead of sum.
> >>
> >> But the fact that sum' is part of the same pattern as sum doesn't
> >> guarantee that sum' has the same range as sum.  E.g. the pattern
> >> statements added by the division optimisation wouldn't have this
> property.
> >
> > So my assumption is that no pattern would replace a statement with
> > something That has higher precision than the C statement. The pattern
> > above is demoted By the vectorizer based on range information already.
> > My assumption was that the precision can only ever be smaller, because
> > otherwise the pattern has violated the semantics of the C code, which
> would be dangerous if e.g. the expression escapes?
> 
> IMO the difference in precisions was a symptom of the problem rather than
> the direct cause.
> 
> The point is more that "B = vect_orig_stmt(A)" just says "A is used somehow
> in a new calculation of B".  A might equal B (if A replaces B), or A might be 
> an
> arbitrary temporary result.  The code above is instead using it to mean "A
> equals B, expressed in a different type".  That happens to be true for sum' in
> the sequence above, but it isn't true of non-final pattern statements in
> general.
>


Sorry for being dense, but I though that's exactly what the code does and what I
tried explain before. If B isn't a final statement than it won't have an 
original statement.
AFAIK, the only places we set original statement is the root of the pattern 
expression.

> In other words, the code hasn't proved that the path from A to
> vect_stmt_to_vectorize(B) just involves conversions.
> 
> Applying the range of a pattern result to all temporary results in the pattern
> could lead to wrong results even when the precisions are all the same.

But maybe I'm misremembering here. I don't believe we'd ever match in the 
middle of
A multi pattern sequence because the additional patterns are not emitted in the 
instruction
stream.  That's why we have append_pattern_def_seq which appends the additional
statemens to the pattern's def sequence.

Unlike the original seed for the pattern, these aren't materialized until 
codegen or SLP build.

But I could be wrong...

Tamar

> 
> >> Is it possible to tell ranger to compute the range of expressions
> >> that haven't been added to the IL?  (Genuine question, haven't looked.
> >> It seems pretty powerful though.)
> >
> > I don't know either, I guess for things it has explicit knowledge
> > about it's ok, so
> > +w or *w would be fine, but with a random IFN_ it'll likely have to punt as
> varying.
> 
> Yeah.  But sum' above involves simple arithmetic and conversions, so IFNs
> shouldn't be a problem.
> 
> Thanks,
> Richard

RE: [PATCH 1/2]middle-end: Fix wrong overmatching of div-bitmask by using new optabs [PR108583]

Reply via email to