On Fri, Feb 19, 2021 at 9:08 AM Alexandre Oliva <ol...@adacore.com> wrote:
>
> Here's an improved version of the patch.  Regstrapped on
> x86_64-linux-gnu, with and without a patchlet that moved multi-pieces
> ahead of setmem, and also tested with riscv32-elf.
>
> Is it ok to install?  Or should it wait for stage1?
It generally looks OK, but I'd wait for stage1.  I'd also like to see
comments from others.

Note that I think we need to guard emitting the loop on
optimize_*_for_speed/!size.  It's also not entirely clear to me how the
code avoids expanding all memsets with block size > 1 byte to a loop
(thus bypassing more optimal libc implementations for larger sizes).
I also remember that we sometimes do a dynamic dispatch between inlined
and non-inlined code paths, though that might be target-specific
handling in the x86 backend.

Thanks,
Richard.

> [PR94092] introduce try store by multiple pieces
>
> From: Alexandre Oliva <ol...@adacore.com>
>
> The ldist pass turns even very short loops into memset calls.  E.g.,
> the TFmode emulation calls end with a loop of up to 3 iterations, to
> zero out trailing words, and the loop distribution pass turns them
> into calls of the memset builtin.
>
> Though short constant-length clearing memsets are usually dealt with
> efficiently, for non-constant-length ones the options are setmemM or
> a function call.
>
> RISC-V doesn't have any setmemM pattern, so the loops above end up
> "optimized" into memset calls, incurring not only the overhead of an
> explicit call, but also discarding the information the compiler has
> about the alignment of the destination, and that the length is a
> multiple of the word alignment.
>
> This patch handles variable lengths with multiple conditional
> power-of-2-constant-sized stores-by-pieces, so as to reduce the
> overhead of length compares.
>
> It also changes the last copy-prop pass into ccp, so that pointer
> alignment and the length's nonzero bits are detected and made
> available for the expander, even for ldist-introduced SSA_NAMEs.
>
>
> for gcc/ChangeLog
>
>         PR tree-optimization/94092
>         * builtins.c (try_store_by_multiple_pieces): New.
>         (expand_builtin_memset_args): Use it.  If target_char_cast
>         fails, proceed as for non-constant val.  Pass len's ctz to...
>         * expr.c (clear_storage_hints): ... this.  Try store by
>         multiple pieces after setmem.
>         (clear_storage): Adjust.
>         * expr.h (clear_storage_hints): Likewise.
>         (try_store_by_multiple_pieces): Declare.
>         * passes.def: Replace the last copy_prop with ccp.
> ---
>  gcc/builtins.c |  182 ++++++++++++++++++++++++++++++++++++++++++++++++++++++--
>  gcc/expr.c     |    9 ++-
>  gcc/expr.h     |   13 ++++
>  gcc/passes.def |    5 +-
>  4 files changed, 197 insertions(+), 12 deletions(-)
>
> diff --git a/gcc/builtins.c b/gcc/builtins.c
> index 0aed008687ccc..05b14f2a3f6d3 100644
> --- a/gcc/builtins.c
> +++ b/gcc/builtins.c
> @@ -6600,6 +6600,166 @@ expand_builtin_memset (tree exp, rtx target, machine_mode mode)
>    return expand_builtin_memset_args (dest, val, len, target, mode, exp);
>  }
>
> +/* Try to store VAL (or, if NULL_RTX, VALC) in LEN bytes starting at TO.
> +   Return TRUE if successful, FALSE otherwise.  TO is assumed to be
> +   aligned at an ALIGN-bits boundary.  LEN must be a multiple of
> +   1<<CTZ_LEN between MIN_LEN and MAX_LEN.
> +
> +   The strategy is to issue one store_by_pieces for each power of two,
> +   from most to least significant, guarded by a test on whether there
> +   are at least that many bytes left to copy in LEN.
> +
> +   ??? Should we skip some powers of two in favor of loops?  Maybe start
> +   at the max of TO/LEN/word alignment, at least when optimizing for
> +   size, instead of ensuring O(log len) dynamic compares?
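> +
> +   A rough sketch of the emitted sequence, for the illustrative case
> +   of LEN in [0, 28], LEN a multiple of 4 (CTZ_LEN 2), and no
> +   initial fixed-size block:
> +
> +     if (len >= 16) { store 16 bytes; to += 16; len -= 16; }
> +     if (len >= 8)  { store 8 bytes;  to += 8;  len -= 8;  }
> +     if (len >= 4)  { store 4 bytes; }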
> +   */
> +
> +bool
> +try_store_by_multiple_pieces (rtx to, rtx len, unsigned int ctz_len,
> +                              unsigned HOST_WIDE_INT min_len,
> +                              unsigned HOST_WIDE_INT max_len,
> +                              rtx val, char valc, unsigned int align)
> +{
> +  int max_bits = floor_log2 (max_len);
> +  int min_bits = floor_log2 (min_len);
> +  int sctz_len = ctz_len;
> +
> +  gcc_checking_assert (sctz_len >= 0);
> +
> +  if (val)
> +    valc = 1;
> +
> +  /* Bits more significant than TST_BITS are part of the shared prefix
> +     in the binary representation of both min_len and max_len.  Since
> +     they're identical, we don't need to test them in the loop.  */
> +  int tst_bits = (max_bits != min_bits ? max_bits
> +                  : floor_log2 (max_len ^ min_len));
> +
> +  /* Check whether it's profitable to start by storing a fixed BLKSIZE
> +     bytes, to lower max_bits.  In the unlikely case of a constant LEN
> +     (implied by identical MAX_LEN and MIN_LEN), we want to issue a
> +     single store_by_pieces, but otherwise, select the minimum multiple
> +     of the ALIGN (in bytes) and of the GCD of the possible LENs that
> +     brings MAX_LEN below TST_BITS, if that's lower than min_len.  */
> +  unsigned HOST_WIDE_INT blksize;
> +  if (max_len > min_len)
> +    {
> +      unsigned HOST_WIDE_INT alrng = MAX (HOST_WIDE_INT_1U << ctz_len,
> +                                          align / BITS_PER_UNIT);
> +      blksize = max_len - (HOST_WIDE_INT_1U << tst_bits) + alrng;
> +      blksize &= ~(alrng - 1);
> +    }
> +  else if (max_len == min_len)
> +    blksize = max_len;
> +  else
> +    gcc_unreachable ();
> +  if (min_len >= blksize)
> +    {
> +      min_len -= blksize;
> +      min_bits = floor_log2 (min_len);
> +      max_len -= blksize;
> +      max_bits = floor_log2 (max_len);
> +
> +      tst_bits = (max_bits != min_bits ? max_bits
> +                  : floor_log2 (max_len ^ min_len));
> +    }
> +  else
> +    blksize = 0;
> +
> +  /* Check that we can use store by pieces for the maximum store count
> +     we may issue (initial fixed-size block, plus conditional
> +     power-of-two-sized stores from max_bits to ctz_len).  */
> +  unsigned HOST_WIDE_INT xlenest = blksize;
> +  if (max_bits >= 0)
> +    xlenest += ((HOST_WIDE_INT_1U << max_bits) * 2
> +                - (HOST_WIDE_INT_1U << ctz_len));
> +  if (!can_store_by_pieces (xlenest, builtin_memset_read_str,
> +                            &valc, align, true))
> +    return false;
> +
> +  rtx (*constfun) (void *, HOST_WIDE_INT, scalar_int_mode);
> +  void *constfundata;
> +  if (val)
> +    {
> +      constfun = builtin_memset_gen_str;
> +      constfundata = val = force_reg (TYPE_MODE (unsigned_char_type_node),
> +                                      val);
> +    }
> +  else
> +    {
> +      constfun = builtin_memset_read_str;
> +      constfundata = &valc;
> +    }
> +
> +  rtx ptr = copy_addr_to_reg (convert_to_mode (ptr_mode, XEXP (to, 0), 0));
> +  rtx rem = copy_to_mode_reg (ptr_mode, convert_to_mode (ptr_mode, len, 0));
> +  to = replace_equiv_address (to, ptr);
> +  set_mem_align (to, align);
> +
> +  if (blksize)
> +    {
> +      to = store_by_pieces (to, blksize,
> +                            constfun, constfundata,
> +                            align, true,
> +                            max_len != 0 ? RETURN_END : RETURN_BEGIN);
> +      if (max_len == 0)
> +        return true;
> +
> +      /* Adjust PTR, TO and REM.  Since TO's address is likely
> +         PTR+offset, we have to replace it.  */
> +      emit_move_insn (ptr, XEXP (to, 0));
> +      to = replace_equiv_address (to, ptr);
> +      emit_move_insn (rem, plus_constant (ptr_mode, rem, -blksize));
> +    }
> +
> +  /* Iterate over power-of-two block sizes from the maximum length to
> +     the least significant bit possibly set in the length.
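> +
> +     For illustration (values picked arbitrarily): with LEN in
> +     [20, 28] and CTZ_LEN 2, bit 4 lies in the prefix shared by
> +     MIN_LEN and MAX_LEN, so the 16-byte store is issued
> +     unconditionally, followed by guarded 8- and 4-byte stores.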
> +     */
> +  for (int i = max_bits; i >= sctz_len; i--)
> +    {
> +      rtx_code_label *label = NULL;
> +      blksize = HOST_WIDE_INT_1U << i;
> +
> +      /* If we're past the bits shared between min_ and max_len, expand
> +         a test on the dynamic length, comparing it with the
> +         BLKSIZE.  */
> +      if (i <= tst_bits)
> +        {
> +          label = gen_label_rtx ();
> +          emit_cmp_and_jump_insns (rem, GEN_INT (blksize), LT, NULL,
> +                                   ptr_mode, 1, label,
> +                                   profile_probability::even ());
> +        }
> +      /* If we are at a bit that is in the prefix shared by min_ and
> +         max_len, skip this BLKSIZE if the bit is clear.  */
> +      else if ((max_len & blksize) == 0)
> +        continue;
> +
> +      /* Issue a store of BLKSIZE bytes.  */
> +      to = store_by_pieces (to, blksize,
> +                            constfun, constfundata,
> +                            align, true,
> +                            i != sctz_len ? RETURN_END : RETURN_BEGIN);
> +
> +      /* Adjust REM and PTR, unless this is the last iteration.  */
> +      if (i != sctz_len)
> +        {
> +          emit_move_insn (ptr, XEXP (to, 0));
> +          to = replace_equiv_address (to, ptr);
> +          emit_move_insn (rem, plus_constant (ptr_mode, rem, -blksize));
> +        }
> +
> +      if (label)
> +        {
> +          emit_label (label);
> +
> +          /* Given conditional stores, the offset can no longer be
> +             known, so clear it.  */
> +          clear_mem_offset (to);
> +        }
> +    }
> +
> +  return true;
> +}
> +
>  /* Helper function to do the actual work for expand_builtin_memset.  The
>     arguments to the builtin_memset call DEST, VAL, and LEN are broken out
>     so that this can also be called without constructing an actual CALL_EXPR.
> @@ -6654,7 +6814,8 @@ expand_builtin_memset_args (tree dest, tree val, tree len,
>        dest_mem = get_memory_rtx (dest, len);
>        val_mode = TYPE_MODE (unsigned_char_type_node);
>
> -      if (TREE_CODE (val) != INTEGER_CST)
> +      if (TREE_CODE (val) != INTEGER_CST
> +          || target_char_cast (val, &c))
>         {
>           rtx val_rtx;
>
> @@ -6678,7 +6839,12 @@ expand_builtin_memset_args (tree dest, tree val, tree len,
>           else if (!set_storage_via_setmem (dest_mem, len_rtx, val_rtx,
>                                             dest_align, expected_align,
>                                             expected_size, min_size, max_size,
> -                                           probable_max_size))
> +                                           probable_max_size)
> +                  && !try_store_by_multiple_pieces (dest_mem, len_rtx,
> +                                                    tree_ctz (len),
> +                                                    min_size, max_size,
> +                                                    val_rtx, 0,
> +                                                    dest_align))
>             goto do_libcall;
>
>           dest_mem = force_operand (XEXP (dest_mem, 0), NULL_RTX);
> @@ -6686,9 +6852,6 @@ expand_builtin_memset_args (tree dest, tree val, tree len,
>           return dest_mem;
>         }
>
> -  if (target_char_cast (val, &c))
> -    goto do_libcall;
> -
>    if (c)
>      {
>        if (tree_fits_uhwi_p (len)
> @@ -6702,7 +6865,12 @@ expand_builtin_memset_args (tree dest, tree val, tree len,
>                                    gen_int_mode (c, val_mode),
>                                    dest_align, expected_align,
>                                    expected_size, min_size, max_size,
> -                                  probable_max_size))
> +                                  probable_max_size)
> +              && !try_store_by_multiple_pieces (dest_mem, len_rtx,
> +                                                tree_ctz (len),
> +                                                min_size, max_size,
> +                                                NULL_RTX, c,
> +                                                dest_align))
>         goto do_libcall;
>
>        dest_mem = force_operand (XEXP (dest_mem, 0), NULL_RTX);
> @@ -6716,7 +6884,7 @@ expand_builtin_memset_args (tree dest, tree val, tree len,
>                                     ? BLOCK_OP_TAILCALL : BLOCK_OP_NORMAL,
>                                     expected_align, expected_size,
>                                     min_size, max_size,
> -                                   probable_max_size);
> +                                   probable_max_size, tree_ctz (len));
>
>    if (dest_addr == 0)
>      {
> diff --git a/gcc/expr.c b/gcc/expr.c
> index 86dc1b6c97349..f92e4136f67e5 100644
> --- a/gcc/expr.c
> +++ b/gcc/expr.c
> @@ -3056,7 +3056,8 @@ clear_storage_hints (rtx object, rtx size, enum block_op_methods method,
>                      unsigned int expected_align, HOST_WIDE_INT expected_size,
>                      unsigned HOST_WIDE_INT min_size,
>                      unsigned HOST_WIDE_INT max_size,
> -                    unsigned HOST_WIDE_INT probable_max_size)
> +                    unsigned HOST_WIDE_INT probable_max_size,
> +                    unsigned ctz_size)
>  {
>    machine_mode mode = GET_MODE (object);
>    unsigned int align;
> @@ -3103,6 +3104,10 @@ clear_storage_hints (rtx object, rtx size, enum block_op_methods method,
>                              expected_align, expected_size,
>                              min_size, max_size, probable_max_size))
>      ;
> +  else if (try_store_by_multiple_pieces (object, size, ctz_size,
> +                                         min_size, max_size,
> +                                         NULL_RTX, 0, align))
> +    ;
>    else if (ADDR_SPACE_GENERIC_P (MEM_ADDR_SPACE (object)))
>      return set_storage_via_libcall (object, size, const0_rtx,
>                                      method == BLOCK_OP_TAILCALL);
> @@ -3120,7 +3125,7 @@ clear_storage (rtx object, rtx size, enum block_op_methods method)
>      min = max = UINTVAL (size);
>    else
>      max = GET_MODE_MASK (GET_MODE (size));
> -  return clear_storage_hints (object, size, method, 0, -1, min, max, max);
> +  return clear_storage_hints (object, size, method, 0, -1, min, max, max, 0);
>  }
>
>
> diff --git a/gcc/expr.h b/gcc/expr.h
> index 1f0177a4cfa5d..e43f7cf3b9565 100644
> --- a/gcc/expr.h
> +++ b/gcc/expr.h
> @@ -193,7 +193,8 @@ extern rtx clear_storage_hints (rtx, rtx, enum block_op_methods,
>                                 unsigned int, HOST_WIDE_INT,
>                                 unsigned HOST_WIDE_INT,
>                                 unsigned HOST_WIDE_INT,
> -                               unsigned HOST_WIDE_INT);
> +                               unsigned HOST_WIDE_INT,
> +                               unsigned);
>  /* The same, but always output a library call.  */
>  extern rtx set_storage_via_libcall (rtx, rtx, rtx, bool = false);
>
> @@ -224,6 +225,16 @@ extern int can_store_by_pieces (unsigned HOST_WIDE_INT,
>  extern rtx store_by_pieces (rtx, unsigned HOST_WIDE_INT, by_pieces_constfn,
>                             void *, unsigned int, bool, memop_ret);
>
> +/* If can_store_by_pieces passes for worst-case values near MAX_LEN,
> +   call store_by_pieces within conditionals so as to handle variable
> +   LEN efficiently, storing VAL, if non-NULL_RTX, or VALC instead.  */
> +extern bool try_store_by_multiple_pieces (rtx to, rtx len,
> +                                          unsigned int ctz_len,
> +                                          unsigned HOST_WIDE_INT min_len,
> +                                          unsigned HOST_WIDE_INT max_len,
> +                                          rtx val, char valc,
> +                                          unsigned int align);
> +
>  /* Emit insns to set X from Y.  */
>  extern rtx_insn *emit_move_insn (rtx, rtx);
>  extern rtx_insn *gen_move_insn (rtx, rtx);
> diff --git a/gcc/passes.def b/gcc/passes.def
> index e9ed3c7bc5773..8568151233d78 100644
> --- a/gcc/passes.def
> +++ b/gcc/passes.def
> @@ -332,8 +332,9 @@ along with GCC; see the file COPYING3.  If not see
>        NEXT_PASS (pass_thread_jumps);
>        NEXT_PASS (pass_vrp, false /* warn_array_bounds_p */);
>        /* Threading can leave many const/copy propagations in the IL.
> -        Clean them up.  */
> -      NEXT_PASS (pass_copy_prop);
> +        Clean them up.  Instead of just copy_prop, we use ccp to
> +        compute alignment and nonzero bits.
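> +        For instance, a length that ldist introduces as n * 4 has
> +        its two low bits known to be zero, so tree_ctz can recover
> +        a ctz of 2 at expansion time.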
> +        */
> +      NEXT_PASS (pass_ccp, true /* nonzero_p */);
>        NEXT_PASS (pass_warn_restrict);
>        NEXT_PASS (pass_dse);
>        NEXT_PASS (pass_cd_dce, true /* update_address_taken_p */);
>
>
> --
> Alexandre Oliva, happy hacker                https://FSFLA.org/blogs/lxo/
>    Free Software Activist                       GNU Toolchain Engineer
>        Vim, Vi, Voltei pro Emacs -- GNUlius Caesar