Wilco Dijkstra via Gcc-patches <gcc-patches@gcc.gnu.org> writes:
>>> +      || (size <= (max_copy_size / 2)
>>> +          && (aarch64_tune_params.extra_tuning_flags
>>> +              & AARCH64_EXTRA_TUNE_NO_LDP_STP_QREGS)))
>>> +    copy_bits = GET_MODE_BITSIZE (TImode);
>>
>> (Looks like the mailer has eaten some tabs here.)
>
> The email contains the correct tabs at the time I send it.
Yeah, sorry for the noise, looks like Exchange ate them at my end.
The version on gmane was ok.

>> As discussed in Sudi's setmem patch, I think we should make the
>> AARCH64_EXTRA_TUNE_NO_LDP_STP_QREGS conditional on optimising
>> for speed.  For size, using LDP Q and STP Q is a win regardless
>> of what the CPU wants.
>
> I've changed the logic slightly based on benchmarking.  It's actually
> better to fall back to calling memcpy for larger sizes in this case
> rather than emit an inlined Q-register memcpy.

OK.

>>>   int n_bits = GET_MODE_BITSIZE (next_mode).to_constant ();
>>
>> As I mentioned in the reply I've just sent to Sudi's patch,
>> I think it might help to add:
>>
>>   gcc_assert (n_bits <= mode_bits);
>>
>> to show why this is guaranteed not to overrun the original copy.
>> (I agree that the code does guarantee no overrun.)
>
> I've added the same assert so the code remains similar.
>
> Here is the updated version:
>
>
> Improve the inline memcpy expansion.  Use integer load/store for copies
> <= 24 bytes instead of SIMD.  Set the maximum copy to expand to 256 bytes
> by default, except that -Os or no Neon expands up to 128 bytes.  When
> using LDP/STP of Q-registers, also use Q-register accesses for the
> unaligned tail, saving 2 instructions (e.g. all sizes up to 48 bytes emit
> exactly 4 instructions).  Clean up code and comments.
>
> The code size gain vs the GCC 10 expansion is 0.05% on SPECINT2017.
>
> Passes bootstrap and regress.  OK for commit?

OK, thanks.

Richard

> ChangeLog:
> 2020-11-16  Wilco Dijkstra  <wdijk...@arm.com>
>
>     * config/aarch64/aarch64.c (aarch64_expand_cpymem): Cleanup code and
>     comments, tweak expansion decisions and improve tail expansion.
>
> ---
>
> diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
> index 41e2a699108146e0fa7464743607bd34e91ea9eb..4b2d5fa7d452dc53ff42308dd2781096ff8c95d2 100644
> --- a/gcc/config/aarch64/aarch64.c
> +++ b/gcc/config/aarch64/aarch64.c
> @@ -21255,35 +21255,39 @@ aarch64_copy_one_block_and_progress_pointers (rtx *src, rtx *dst,
>  bool
>  aarch64_expand_cpymem (rtx *operands)
>  {
> -  /* These need to be signed as we need to perform arithmetic on n as
> -     signed operations.  */
> -  int n, mode_bits;
> +  int mode_bits;
>    rtx dst = operands[0];
>    rtx src = operands[1];
>    rtx base;
> -  machine_mode cur_mode = BLKmode, next_mode;
> -  bool speed_p = !optimize_function_for_size_p (cfun);
> -
> -  /* When optimizing for size, give a better estimate of the length of a
> -     memcpy call, but use the default otherwise.  Moves larger than 8 bytes
> -     will always require an even number of instructions to do now.  And each
> -     operation requires both a load+store, so divide the max number by 2.  */
> -  unsigned int max_num_moves = (speed_p ? 16 : AARCH64_CALL_RATIO) / 2;
> +  machine_mode cur_mode = BLKmode;
>
> -  /* We can't do anything smart if the amount to copy is not constant.  */
> +  /* Only expand fixed-size copies.  */
>    if (!CONST_INT_P (operands[2]))
>      return false;
>
> -  unsigned HOST_WIDE_INT tmp = INTVAL (operands[2]);
> +  unsigned HOST_WIDE_INT size = INTVAL (operands[2]);
>
> -  /* Try to keep the number of instructions low.  For all cases we will do at
> -     most two moves for the residual amount, since we'll always overlap the
> -     remainder.  */
> -  if (((tmp / 16) + (tmp % 16 ? 2 : 0)) > max_num_moves)
> -    return false;
> +  /* Inline up to 256 bytes when optimizing for speed.  */
> +  unsigned HOST_WIDE_INT max_copy_size = 256;
>
> -  /* At this point tmp is known to have to fit inside an int.  */
> -  n = tmp;
> +  if (optimize_function_for_size_p (cfun))
> +    max_copy_size = 128;
> +
> +  int copy_bits = 256;
> +
> +  /* Default to 256-bit LDP/STP on large copies, however small copies, no SIMD
> +     support or slow 256-bit LDP/STP fall back to 128-bit chunks.  */
> +  if (size <= 24
> +      || !TARGET_SIMD
> +      || (aarch64_tune_params.extra_tuning_flags
> +          & AARCH64_EXTRA_TUNE_NO_LDP_STP_QREGS))
> +    {
> +      copy_bits = 128;
> +      max_copy_size = max_copy_size / 2;
> +    }
> +
> +  if (size > max_copy_size)
> +    return false;
>
>    base = copy_to_mode_reg (Pmode, XEXP (dst, 0));
>    dst = adjust_automodify_address (dst, VOIDmode, base, 0);
> @@ -21291,15 +21295,8 @@ aarch64_expand_cpymem (rtx *operands)
>    base = copy_to_mode_reg (Pmode, XEXP (src, 0));
>    src = adjust_automodify_address (src, VOIDmode, base, 0);
>
> -  /* Convert n to bits to make the rest of the code simpler.  */
> -  n = n * BITS_PER_UNIT;
> -
> -  /* Maximum amount to copy in one go.  We allow 256-bit chunks based on the
> -     AARCH64_EXTRA_TUNE_NO_LDP_STP_QREGS tuning parameter and TARGET_SIMD.  */
> -  const int copy_limit = ((aarch64_tune_params.extra_tuning_flags
> -                           & AARCH64_EXTRA_TUNE_NO_LDP_STP_QREGS)
> -                          || !TARGET_SIMD)
> -                          ? GET_MODE_BITSIZE (TImode) : 256;
> +  /* Convert size to bits to make the rest of the code simpler.  */
> +  int n = size * BITS_PER_UNIT;
>
>    while (n > 0)
>      {
> @@ -21307,24 +21304,28 @@ aarch64_expand_cpymem (rtx *operands)
>          or writing.  */
>        opt_scalar_int_mode mode_iter;
>        FOR_EACH_MODE_IN_CLASS (mode_iter, MODE_INT)
> -        if (GET_MODE_BITSIZE (mode_iter.require ()) <= MIN (n, copy_limit))
> +        if (GET_MODE_BITSIZE (mode_iter.require ()) <= MIN (n, copy_bits))
>            cur_mode = mode_iter.require ();
>
>        gcc_assert (cur_mode != BLKmode);
>
>        mode_bits = GET_MODE_BITSIZE (cur_mode).to_constant ();
> +
> +      /* Prefer Q-register accesses for the last bytes.  */
> +      if (mode_bits == 128 && copy_bits == 256)
> +        cur_mode = V4SImode;
> +
>        aarch64_copy_one_block_and_progress_pointers (&src, &dst, cur_mode);
>
>        n -= mode_bits;
>
> -      /* Do certain trailing copies as overlapping if it's going to be
> -         cheaper.  i.e. less instructions to do so.  For instance doing a 15
> -         byte copy it's more efficient to do two overlapping 8 byte copies than
> -         8 + 6 + 1.  */
> -      if (n > 0 && n <= 8 * BITS_PER_UNIT)
> +      /* Emit trailing copies using overlapping unaligned accesses - this is
> +         smaller and faster.  */
> +      if (n > 0 && n < copy_bits / 2)
>         {
> -          next_mode = smallest_mode_for_size (n, MODE_INT);
> +          machine_mode next_mode = smallest_mode_for_size (n, MODE_INT);
>           int n_bits = GET_MODE_BITSIZE (next_mode).to_constant ();
> +          gcc_assert (n_bits <= mode_bits);
>           src = aarch64_move_pointer (src, (n - n_bits) / BITS_PER_UNIT);
>           dst = aarch64_move_pointer (dst, (n - n_bits) / BITS_PER_UNIT);
>           n = n_bits;
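
(For context, the overlapping-tail trick the new code relies on can be shown
with a small standalone C sketch.  This is not part of the patch, and copy15
is just a hypothetical name for illustration: a fixed 15-byte copy is done as
two 8-byte accesses, the second shifted back so it ends exactly at byte 15 and
overlaps the first, instead of a longer sequence of smaller non-overlapping
moves.  Compilers lower fixed-size memcpy calls like these to single
load/store instructions.)

  #include <string.h>

  /* Hypothetical illustration only -- not from the patch.  Copy exactly
     15 bytes using two overlapping 8-byte moves.  */
  void
  copy15 (char *dst, const char *src)
  {
    memcpy (dst, src, 8);          /* Bytes 0..7.  */
    memcpy (dst + 7, src + 7, 8);  /* Bytes 7..14, overlapping byte 7.  */
  }

The patch applies the same idea with Q-register (128-bit) accesses for the
unaligned tail when 256-bit chunks are in use, which is why, per the
description above, all sizes up to 48 bytes need only two loads and two
stores.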