On Mon, Oct 09, 2023 at 01:54:19PM +0100, Richard Sandiford wrote:
> > I've additionally built it with the incremental attached patch and
> > on make -C gcc check-gcc check-g++ -j32 -k it didn't show any
> > wide_int/widest_int heap allocations unless a > 128-bit _BitInt or wb/uwb
> > constant needing > 128-bit _BitInt was used in a testcase.
>
> Overall it looks really good to me FWIW.  Some comments about the
> wide-int.h changes below.  Will send a separate message about wide-int.cc.
Thanks, just quick answers; I will work on the patch adjustments after trying
to get rid of rwide_int.  dwarf2out seems to have very limited needs from it:
just some routine to construct it in GCed memory (never changed afterwards)
from a const wide_int_ref & or so, and then working operator ==, get_precision,
elt, get_len and get_val methods.  So I think we could just have
struct dw_wide_int { unsigned int prec, len; HOST_WIDE_INT val[1]; };
and perform the methods on it after converting it to a storage ref.

> > @@ -380,7 +406,11 @@ namespace wi
> >
> >    /* The integer has a constant precision (known at GCC compile time)
> >       and is signed.  */
> > -  CONST_PRECISION
> > +  CONST_PRECISION,
> > +
> > +  /* Like CONST_PRECISION, but with WIDEST_INT_MAX_PRECISION or larger
> > +     precision where not all elements of arrays are always present.  */
> > +  WIDEST_CONST_PRECISION
> >   };
>
> Sorry to bring this up so late, but how about using INL_CONST_PRECISION
> for the fully inline case and CONST_PRECISION for the general case?
> That seems more consistent with the other naming in the patch.

Ok.

> > @@ -482,6 +541,18 @@ namespace wi
> >    };
> >
> >    template <typename T1, typename T2>
> > +  struct binary_traits <T1, T2, WIDEST_CONST_PRECISION,
> > +			WIDEST_CONST_PRECISION>
> > +  {
> > +    STATIC_ASSERT (int_traits <T1>::precision == int_traits <T2>::precision);
>
> Should this assert for equal inl_precision too?  Although it probably
> isn't necessary computationally, it seems a bit arbitrary to pick the
> first inl_precision...

inl_precision is only used for widest_int/widest2_int, so if precision
is equal, inl_precision is as well.

> > +inline wide_int_storage::wide_int_storage (const wide_int_storage &x)
> > +{
> > +  len = x.len;
> > +  precision = x.precision;
> > +  if (UNLIKELY (precision > WIDE_INT_MAX_INL_PRECISION))
> > +    {
> > +      u.valp = XNEWVEC (HOST_WIDE_INT, CEIL (precision, HOST_BITS_PER_WIDE_INT));
> > +      memcpy (u.valp, x.u.valp, len * sizeof (HOST_WIDE_INT));
> > +    }
> > +  else if (LIKELY (precision))
> > +    memcpy (u.val, x.u.val, len * sizeof (HOST_WIDE_INT));
> > +}
>
> Does the variable-length memcpy pay for itself?  If so, perhaps that's a
> sign that we should have a smaller inline buffer for this class (say 2 HWIs).

Guess I'll try to see what results in smaller .text size.

> > +namespace wi
> > +{
> > +  template <int N>
> > +  struct int_traits < widest_int_storage <N> >
> > +  {
> > +    static const enum precision_type precision_type = WIDEST_CONST_PRECISION;
> > +    static const bool host_dependent_precision = false;
> > +    static const bool is_sign_extended = true;
> > +    static const bool needs_write_val_arg = true;
> > +    static const unsigned int precision
> > +      = N / WIDE_INT_MAX_INL_PRECISION * WIDEST_INT_MAX_PRECISION;
>
> What's the reasoning behind this calculation?  It would give 0 for
> N < WIDE_INT_MAX_INL_PRECISION, and the "MAX" suggests that N
> shouldn't be > WIDE_INT_MAX_INL_PRECISION either.
>
> I wonder whether this should be a second template parameter, with an
> assert that precision > inl_precision.

Maybe.  Yet another option would be to always use WIDE_INT_MAX_INL_PRECISION
as the inline precision (and use the N template parameter just to decide
about the overall precision), regardless of whether it is widest_int or
widest2_int.  The latter is very rare, and it is even much rarer for something
not to fit into WIDE_INT_MAX_INL_PRECISION when not using _BitInt.
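For concreteness, here is a standalone sketch (not GCC code; the two constants
below are placeholders, the real macros are target dependent, and the 32640
figure is only the widest_int constant precision quoted near the end of this
mail) of what the formula is understood to compute for the two instantiations
that are actually used:

  /* Editorial sketch of N / WIDE_INT_MAX_INL_PRECISION * WIDEST_INT_MAX_PRECISION.  */
  #include <cstdio>

  int
  main ()
  {
    const unsigned int inl_prec = 512;      /* placeholder for WIDE_INT_MAX_INL_PRECISION */
    const unsigned int widest_prec = 32640; /* placeholder for WIDEST_INT_MAX_PRECISION */
    /* widest_int uses N == inl_prec, widest2_int uses N == 2 * inl_prec.  */
    for (unsigned int n = inl_prec; n <= 2 * inl_prec; n += inl_prec)
      printf ("N = %u -> precision = %u\n", n, n / inl_prec * widest_prec);
    /* Prints widest_prec for widest_int and 2 * widest_prec for widest2_int;
       any N below inl_prec would indeed yield 0, as noted above.  */
    return 0;
  }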
The reason for introducing inl_precision was to avoid the heap allocation
for widest2_int unless _BitInt is in use, but maybe that isn't worth it.

> Nit: might format more naturally with:
>
>   using res_traits = wi::int_traits <WI_BINARY_RESULT (T1, T2)>:
>   ...

Ok.

> > @@ -2203,6 +2781,9 @@ wi::sext (const T &x, unsigned int offse
> >    unsigned int precision = get_precision (result);
> >    WIDE_INT_REF_FOR (T) xi (x, precision);
> >
> > +  if (result.needs_write_val_arg)
> > +    val = result.write_val (MAX (xi.len,
> > +				 CEIL (offset, HOST_BITS_PER_WIDE_INT)));
>
> Why MAX rather than MIN?

Because it needs to be an upper bound.  In this case, sext_large has

  unsigned int len = offset / HOST_BITS_PER_WIDE_INT;
  /* Extending beyond the precision is a no-op.  If we have only stored
     OFFSET bits or fewer, the rest are already signs.  */
  if (offset >= precision || len >= xlen)
    {
      for (unsigned i = 0; i < xlen; ++i)
	val[i] = xval[i];
      return xlen;
    }
  unsigned int suboffset = offset % HOST_BITS_PER_WIDE_INT;
  for (unsigned int i = 0; i < len; i++)
    val[i] = xval[i];
  if (suboffset > 0)
    {
      val[len] = sext_hwi (xval[len], suboffset);
      len += 1;
    }
  return canonize (val, len, precision);

So, in some cases it returns xlen (xi.len in the caller), in other cases
something <= CEIL (offset, HOST_BITS_PER_WIDE_INT) (smaller if canonize
finds redundancy).
Sure, the inline caller could try to be more accurate, like

  if (result.needs_write_val_arg)
    {
      unsigned int exp_len = offset / HOST_BITS_PER_WIDE_INT;
      if (offset >= precision || exp_len >= xi.len)
	exp_len = xi.len;
      else if (offset % HOST_BITS_PER_WIDE_INT)
	++exp_len;
      val = result.write_val (exp_len);
    }

etc., but I'm afraid that the larger the inline code is, the less likely it
is to get inlined, and if it is inlined, it will make the code size even
larger.  That is also why I'm using that ugly needs_write_val_arg static data
member; it should be constant evaluated already from the FE and so optimized
away even without if constexpr for wide_int/fixed_wide_int_storage etc.

> >    if (offset <= HOST_BITS_PER_WIDE_INT)
> >      {
> >        val[0] = sext_hwi (xi.ulow (), offset);
>
> I wondered for this kind of thing whether we should have:
>
>   if (result.needs_write_val_arg)
>     val = result.write_val (1);
>
> and leave the complicated case in the slow path.  But maybe it doesn't
> pay for itself.

Yes, I'm also afraid it will make the inline function even larger.

> > @@ -2259,6 +2843,9 @@ wi::set_bit (const T &x, unsigned int bi
> >    WI_UNARY_RESULT_VAR (result, val, T, x);
> >    unsigned int precision = get_precision (result);
> >    WIDE_INT_REF_FOR (T) xi (x, precision);
> > +  if (result.needs_write_val_arg)
> > +    val = result.write_val (MAX (xi.len,
> > +				 bit / HOST_BITS_PER_WIDE_INT + 1));
> >    if (precision <= HOST_BITS_PER_WIDE_INT)
> >      {
> >        val[0] = xi.ulow () | (HOST_WIDE_INT_1U << bit);
> > @@ -2280,6 +2867,8 @@ wi::bswap (const T &x)
> >    WI_UNARY_RESULT_VAR (result, val, T, x);
> >    unsigned int precision = get_precision (result);
> >    WIDE_INT_REF_FOR (T) xi (x, precision);
> > +  if (result.needs_write_val_arg)
> > +    gcc_unreachable (); /* bswap on widest_int makes no sense.  */
>
> Doesn't this work as a static_assert?  (You might have covered this
> before, sorry.)

I think it indeed could be a static_assert.
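As a standalone illustration of the idea (made-up names, not the actual
wide-int API): when the property is a compile-time constant of the type,
the runtime trap can become a compile-time rejection, roughly like

  /* Editorial sketch, not GCC code.  */
  template <typename T>
  struct int_traits
  {
    static const bool needs_write_val_arg = false;
  };

  struct widest_like {};

  template <>
  struct int_traits<widest_like>
  {
    static const bool needs_write_val_arg = true;
  };

  template <typename T>
  void
  bswap_like (const T &)
  {
    static_assert (!int_traits<T>::needs_write_val_arg,
		   "bswap on widest_int makes no sense");
    /* ... actual byte swapping would go here ...  */
  }

  int
  main ()
  {
    bswap_like (0);                    /* OK for a fixed-precision type.  */
    /* bswap_like (widest_like ());       would fail to compile, as intended.  */
    return 0;
  }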
> > @@ -2643,6 +3252,8 @@ wi::mul (const T1 &x, const T2 &y)
> >    unsigned int precision = get_precision (result);
> >    WIDE_INT_REF_FOR (T1) xi (x, precision);
> >    WIDE_INT_REF_FOR (T2) yi (y, precision);
> > +  if (result.needs_write_val_arg)
> > +    val = result.write_val (xi.len + yi.len + 2);
> >    if (precision <= HOST_BITS_PER_WIDE_INT)
> >      {
> >        val[0] = xi.ulow () * yi.ulow ();
>
> I realise this is deliberately conservative, just curious: why + 2
> rather than + 1?

I think I've run into a buffer overflow there; I had xi.len + yi.len + 1
until the Sep 26th patch and not in the Sep 28th version.
Even the wide_int.cc side attempts to be conservative:

  if (prec > WIDE_INT_MAX_INL_PRECISION && !high)
    prec = (op1len + op2len + 1) * HOST_BITS_PER_WIDE_INT;

The + 1 in there is meant e.g. for the case where the sign changes; in that
case blocks_needed is (op1len + op2len + 1) and half_blocks_needed twice
that.  So, I think the wi_pack while loop should iterate (op1len + op2len + 1)
times and in_len & 1 should always be false, but the precision passed to
wi::pack is for widest_int the constant precision (i.e. 32640), so

  else if (j < blocks_needed)
    result[j++] = 0;

is most likely true, because blocks_needed will be 510 (32640 / 64).
So we will actually store xi.len + yi.len + 2 elements and then hopefully
canonize will decrease that in most cases.

	Jakub
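As a quick standalone check of the numbers above (editorial sketch, not GCC
code; only the 32640-bit constant precision and 64-bit HWIs are taken from
the mail, the operand lengths are arbitrary):

  #include <cstdio>

  int
  main ()
  {
    const unsigned int host_bits_per_wide_int = 64;
    const unsigned int widest_int_max_precision = 32640;
    unsigned int blocks_needed = widest_int_max_precision / host_bits_per_wide_int;
    unsigned int op1len = 2, op2len = 3;   /* arbitrary small operand lengths */
    /* blocks_needed (510) vastly exceeds op1len + op2len + 1 (6) here, which is
       why the zero-padding branch quoted above is normally taken.  */
    printf ("blocks_needed = %u, op1len + op2len + 1 = %u\n",
	    blocks_needed, op1len + op2len + 1);
    return 0;
  }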