> -----Original Message----- > From: Richard Sandiford <richard.sandif...@arm.com> > Sent: Friday, February 10, 2023 4:25 PM > To: Tamar Christina <tamar.christ...@arm.com> > Cc: Tamar Christina via Gcc-patches <gcc-patches@gcc.gnu.org>; nd > <n...@arm.com>; rguent...@suse.de; j...@ventanamicro.com > Subject: Re: [PATCH 1/2]middle-end: Fix wrong overmatching of div-bitmask > by using new optabs [PR108583] > > Tamar Christina <tamar.christ...@arm.com> writes: > >> -----Original Message----- > >> From: Richard Sandiford <richard.sandif...@arm.com> > >> Sent: Friday, February 10, 2023 3:57 PM > >> To: Tamar Christina <tamar.christ...@arm.com> > >> Cc: Tamar Christina via Gcc-patches <gcc-patches@gcc.gnu.org>; nd > >> <n...@arm.com>; rguent...@suse.de; j...@ventanamicro.com > >> Subject: Re: [PATCH 1/2]middle-end: Fix wrong overmatching of > >> div-bitmask by using new optabs [PR108583] > >> > >> Tamar Christina <tamar.christ...@arm.com> writes: > >> >> > a/gcc/tree-vect-patterns.cc b/gcc/tree-vect-patterns.cc index > >> >> > > >> >> > >> > 6934aebc69f231af24668f0a1c3d140e97f55487..e39d7e6b362ef44eb2fc467f33 > >> >> 69 > >> >> > de2afea139d6 100644 > >> >> > --- a/gcc/tree-vect-patterns.cc > >> >> > +++ b/gcc/tree-vect-patterns.cc > >> >> > @@ -3914,12 +3914,82 @@ vect_recog_divmod_pattern (vec_info > >> *vinfo, > >> >> > return pattern_stmt; > >> >> > } > >> >> > else if ((cst = uniform_integer_cst_p (oprnd1)) > >> >> > - && targetm.vectorize.can_special_div_by_const (rhs_code, > >> >> vectype, > >> >> > - wi::to_wide > (cst), > >> >> > - NULL, > NULL_RTX, > >> >> > - NULL_RTX)) > >> >> > + && TYPE_UNSIGNED (itype) > >> >> > + && rhs_code == TRUNC_DIV_EXPR > >> >> > + && vectype > >> >> > + && direct_internal_fn_supported_p (IFN_ADDH, vectype, > >> >> > + OPTIMIZE_FOR_SPEED)) > >> >> > { > >> >> > - return NULL; > >> >> > + /* div optimizations using narrowings > >> >> > + we can do the division e.g. shorts by 255 faster by > >> >> > calculating it > as > >> >> > + (x + ((x + 257) >> 8)) >> 8 assuming the operation is done in > >> >> > + double the precision of x. > >> >> > + > >> >> > + If we imagine a short as being composed of two blocks of > >> >> > + bytes > >> then > >> >> > + adding 257 or 0b0000_0001_0000_0001 to the number is > >> >> > + equivalent > >> to > >> >> > + adding 1 to each sub component: > >> >> > + > >> >> > + short value of 16-bits > >> >> > + ┌──────────────┬────────────────┐ > >> >> > + │ │ │ > >> >> > + └──────────────┴────────────────┘ > >> >> > + 8-bit part1 ▲ 8-bit part2 ▲ > >> >> > + │ │ > >> >> > + │ │ > >> >> > + +1 +1 > >> >> > + > >> >> > + after the first addition, we have to shift right by 8, and > >> >> > narrow > the > >> >> > + results back to a byte. Remember that the addition must > >> >> > + be done > >> in > >> >> > + double the precision of the input. However if we know > >> >> > + that the > >> >> addition > >> >> > + `x + 257` does not overflow then we can do the operation > >> >> > + in the > >> >> current > >> >> > + precision. In which case we don't need the pack and unpacks. > */ > >> >> > + auto wcst = wi::to_wide (cst); > >> >> > + int pow = wi::exact_log2 (wcst + 1); > >> >> > + if (pow == (int) (element_precision (vectype) / 2)) > >> >> > + { > >> >> > + wide_int min,max; > >> >> > + /* If we're in a pattern we need to find the orginal > definition. */ > >> >> > + tree op0 = oprnd0; > >> >> > + gimple *stmt = SSA_NAME_DEF_STMT (oprnd0); > >> >> > + stmt_vec_info stmt_info = vinfo->lookup_stmt (stmt); > >> >> > + if (is_pattern_stmt_p (stmt_info)) > >> >> > + { > >> >> > + auto orig_stmt = STMT_VINFO_RELATED_STMT > (stmt_info); > >> >> > + if (is_gimple_assign (STMT_VINFO_STMT (orig_stmt))) > >> >> > + op0 = gimple_assign_lhs (STMT_VINFO_STMT > (orig_stmt)); > >> >> > + } > >> >> > >> >> If this is generally safe (I'm skipping thinking about it in the > >> >> interests of a quick review :-)), then I think it should be done > >> >> in vect_get_range_info instead. Using gimple_get_lhs would be > >> >> more general than handling just assignments. > >> >> > >> >> > + > >> >> > + /* Check that no overflow will occur. If we don't have range > >> >> > + information we can't perform the optimization. */ > >> >> > + if (vect_get_range_info (op0, &min, &max)) > >> >> > + { > >> >> > + wide_int one = wi::to_wide (build_one_cst (itype)); > >> >> > + wide_int adder = wi::add (one, wi::lshift (one, pow)); > >> >> > + wi::overflow_type ovf; > >> >> > + /* We need adder and max in the same precision. */ > >> >> > + wide_int zadder > >> >> > + = wide_int_storage::from (adder, wi::get_precision > (max), > >> >> > + UNSIGNED); > >> >> > + wi::add (max, zadder, UNSIGNED, &ovf); > >> >> > >> >> Could you explain this a bit more? When do we have mismatched > >> >> precisions? > >> > > >> > C promotion rules will promote e.g. > >> > > >> > void fun2(uint8_t* restrict pixel, uint8_t level, int n) { > >> > for (int i = 0; i < n; i+=1) > >> > pixel[i] = (pixel[i] + level) / 0xff; } > >> > > >> > And have the addition be done as a 32 bit integer. The vectorizer > >> > will demote this down to a short, but range information is not > >> > stored for patterns. So In the above the range will correctly be > >> > 0x1fe but the precision will be that of the original expression, so > >> > 32. This will be a mismatch with itype which is derived from the > >> > size the vectorizer > >> will perform the operation in. > >> > >> Gah, missed this first time round, sorry. > >> > >> Richi would know better than me, but I think it's dangerous to rely > >> on the orig/pattern link for range information. The end result of a > >> pattern > >> (vect_stmt_to_vectorize) has to have the same type as the lhs of the > >> original statement. But the other statements in the pattern sequence > >> can do arbitrary things. Their range isn't predictable from the > >> range of the original statement result. > >> > >> IIRC, the addition above is converted to: > >> > >> a' = (uint16_t) pixel[i] > >> b' = (uint16_t) level > >> sum' = a' + b' > >> sum = (int) sum' > >> > >> where sum is the direct replacement of "pixel[i] + level", with the > >> same type and range. The division then uses sum' instead of sum. > >> > >> But the fact that sum' is part of the same pattern as sum doesn't > >> guarantee that sum' has the same range as sum. E.g. the pattern > >> statements added by the division optimisation wouldn't have this > property. > > > > So my assumption is that no pattern would replace a statement with > > something That has higher precision than the C statement. The pattern > > above is demoted By the vectorizer based on range information already. > > My assumption was that the precision can only ever be smaller, because > > otherwise the pattern has violated the semantics of the C code, which > would be dangerous if e.g. the expression escapes? > > IMO the difference in precisions was a symptom of the problem rather than > the direct cause. > > The point is more that "B = vect_orig_stmt(A)" just says "A is used somehow > in a new calculation of B". A might equal B (if A replaces B), or A might be > an > arbitrary temporary result. The code above is instead using it to mean "A > equals B, expressed in a different type". That happens to be true for sum' in > the sequence above, but it isn't true of non-final pattern statements in > general. >
Sorry for being dense, but I though that's exactly what the code does and what I tried explain before. If B isn't a final statement than it won't have an original statement. AFAIK, the only places we set original statement is the root of the pattern expression. > In other words, the code hasn't proved that the path from A to > vect_stmt_to_vectorize(B) just involves conversions. > > Applying the range of a pattern result to all temporary results in the pattern > could lead to wrong results even when the precisions are all the same. But maybe I'm misremembering here. I don't believe we'd ever match in the middle of A multi pattern sequence because the additional patterns are not emitted in the instruction stream. That's why we have append_pattern_def_seq which appends the additional statemens to the pattern's def sequence. Unlike the original seed for the pattern, these aren't materialized until codegen or SLP build. But I could be wrong... Tamar > > >> Is it possible to tell ranger to compute the range of expressions > >> that haven't been added to the IL? (Genuine question, haven't looked. > >> It seems pretty powerful though.) > > > > I don't know either, I guess for things it has explicit knowledge > > about it's ok, so > > +w or *w would be fine, but with a random IFN_ it'll likely have to punt as > varying. > > Yeah. But sum' above involves simple arithmetic and conversions, so IFNs > shouldn't be a problem. > > Thanks, > Richard