[PATCH] aarch64: Fix PR97638
Hi

Currently the testcase in the patch was failing to produce a 'bti c' at the
beginning of the function. This was because in aarch64_pac_insn_p, we were
wrongly returning at the first check. This patch fixes the return value.

Bootstrap and regression tested on aarch64-none-linux-gnu.
Is this ok for trunk and gcc 10 backport?

Thanks
Sudi

gcc/ChangeLog:

2020-10-30  Sudakshina Das

	PR target/97638
	* config/aarch64/aarch64-bti-insert.c (aarch64_pac_insn_p): Update
	return value on INSN_P check.

gcc/testsuite/ChangeLog:

2020-10-30  Sudakshina Das

	PR target/97638
	* gcc.target/aarch64/pr97638.c: New test.

### Attachment also inlined for ease of reply###

diff --git a/gcc/config/aarch64/aarch64-bti-insert.c b/gcc/config/aarch64/aarch64-bti-insert.c
index 57663ee23b490162dbe7ffe2f618066e71cea455..98026695fdbbe2eda84e0befad94b5fe4ce22754 100644
--- a/gcc/config/aarch64/aarch64-bti-insert.c
+++ b/gcc/config/aarch64/aarch64-bti-insert.c
@@ -95,7 +95,7 @@ static bool
 aarch64_pac_insn_p (rtx x)
 {
   if (!INSN_P (x))
-    return x;
+    return false;
 
   subrtx_var_iterator::array_type array;
   FOR_EACH_SUBRTX_VAR (iter, array, PATTERN (x), ALL)
diff --git a/gcc/testsuite/gcc.target/aarch64/pr97638.c b/gcc/testsuite/gcc.target/aarch64/pr97638.c
new file mode 100644
index ..e5869e86c449aef5606541c4c7a51069a1426793
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/pr97638.c
@@ -0,0 +1,17 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -mbranch-protection=bti" } */
+
+char *foo (const char *s, const int c)
+{
+  const char *p = 0;
+  for (;;)
+    {
+      if (*s == c)
+        p = s;
+      if (p != 0 || *s++ == 0)
+        break;
+    }
+  return (char *)p;
+}
+
+/* { dg-final { scan-assembler "hint\t34" } } */
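For readers skimming the thread, here is a minimal standalone sketch (illustrative only, not GCC code; the struct and function names are made up) of why the original `return x;` compiled cleanly but misbehaved. The patch's one-line change to `return false;` removes exactly this conversion.

```c
#include <stdbool.h>

/* In a function declared to return bool, "return x;" where x is a pointer
   converts the pointer to bool, so any non-null x reads as true.  In
   aarch64_pac_insn_p that presumably meant non-insns were reported as PAC
   instructions, so the BTI pass assumed the function already began with an
   implicit landing pad and omitted the 'bti c'.  */
struct fake_rtx { int code; };

static bool
looks_like_pac (const struct fake_rtx *x)
{
  if (x->code != 1)   /* stand-in for the !INSN_P (x) check */
    return x;         /* bug: converts to true for every non-null x */
  return true;        /* pretend the rest of the scan matched */
}
```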
RE: [PATCH] aarch64: Fix PR97638
Hi Richard

> -----Original Message-----
> From: Richard Sandiford
> Sent: 02 November 2020 10:31
> To: Sudakshina Das
> Cc: gcc-patches@gcc.gnu.org; Kyrylo Tkachov ;
> Richard Earnshaw
> Subject: Re: [PATCH] aarch64: Fix PR97638
>
> Sudakshina Das writes:
> > Hi
> >
> > Currently the testcase in the patch was failing to produce a 'bti c'
> > at the beginning of the function. This was because in
> > aarch64_pac_insn_p, we were wrongly returning at the first check. This
> > patch fixes the return value.
> >
> > Bootstrap and regression tested on aarch64-none-linux-gnu.
> > Is this ok for trunk and gcc 10 backport?
>
> OK for both, thanks.

Thank you! Pushed to trunk. Will wait for a couple of days before backport.

Thanks
Sudi

>
> Richard
RE: [PATCH] aarch64: Add backend support for expanding __builtin_memset
Hi Richard

> -----Original Message-----
> From: Richard Sandiford
> Sent: 30 October 2020 19:56
> To: Sudakshina Das
> Cc: Wilco Dijkstra ; gcc-patches@gcc.gnu.org;
> Kyrylo Tkachov ; Richard Earnshaw
>
> Subject: Re: [PATCH] aarch64: Add backend support for expanding
> __builtin_memset
>
> > +  base = copy_to_mode_reg (Pmode, XEXP (dst, 0));
> > +  dst = adjust_automodify_address (dst, VOIDmode, base, 0);
> > +
> > +  /* Prepare the val using a DUP v0.16B, val. */
> > +  if (CONST_INT_P (val))
> > +    {
> > +      val = force_reg (QImode, val);
> > +    }
> > +  src = gen_reg_rtx (V16QImode);
> > +  emit_insn (gen_aarch64_simd_dupv16qi(src, val));
>
> I think we should use:
>
>   src = expand_vector_broadcast (V16QImode, val);
>
> here (without the CONST_INT_P check), so that for constants we just move a
> constant directly into a register.
>

Sorry to bring this up again. When I tried expand_vector_broadcast, I see the
following behaviour:

for __builtin_memset(p, 1, 24), where the duplicated constant fits,

        movi    v0.16b, 0x1
        mov     x1, 72340172838076673
        str     x1, [x0, 16]
        str     q0, [x0]

and an ICE for __builtin_memset(p, 1, 32), where I am guessing the duplicated
constant does not fit:

x.c:7:30: error: unrecognizable insn:
    7 | { __builtin_memset(p, 1, 32);}
      |                              ^
(insn 8 7 0 2 (parallel [
            (set (mem:V16QI (reg:DI 94) [0 MEM [(void *)p_2(D)]+0 S16 A8])
                (const_vector:V16QI [
                        (const_int 1 [0x1]) repeated x16
                    ]))
            (set (mem:V16QI (plus:DI (reg:DI 94)
                        (const_int 16 [0x10])) [0 MEM [(void *)p_2(D)]+16 S16 A8])
                (const_vector:V16QI [
                        (const_int 1 [0x1]) repeated x16
                    ]))
        ]) "x.c":7:3 -1
     (nil))
during RTL pass: vregs

> Sudakshina Das writes:
> >> > +
> >> > +  /* "Cast" the *dst to the correct mode. */
> >> > +  *dst = adjust_address (*dst, mode, 0);
> >> > +  /* Emit the memset. */
> >> > +  emit_move_insn (*dst, reg);
> >> > +  /* Move the pointer forward. */
> >> > +  *dst = aarch64_progress_pointer (*dst);
> >> > +}
> >> > +
> >> > +/* Expand setmem, as if from a __builtin_memset. Return true if
> >> > +   we succeed, otherwise return false. */
> >> > +
> >> > +bool
> >> > +aarch64_expand_setmem (rtx *operands)
> >> > +{
> >> > +  int n, mode_bits;
> >> > +  unsigned HOST_WIDE_INT len;
> >> > +  rtx dst = operands[0];
> >> > +  rtx val = operands[2], src;
> >> > +  rtx base;
> >> > +  machine_mode cur_mode = BLKmode, next_mode;
> >> > +  bool speed_p = !optimize_function_for_size_p (cfun);
> >> > +  unsigned max_set_size = speed_p ? 256 : 128;
> >>
> >> What's the basis for the size value?  AIUI (and I've probably got
> >> this wrong), that effectively means a worst case of 3+2 stores
> >> (3 STP Qs and 2 mop-up stores).  Then we need one instruction to set
> >> up the constant.  So if that's right, it looks like the worst-case
> >> size is 6 instructions.
> >>
> >> AARCH64_CALL_RATIO has a value of 8, but I'm not sure how that
> >> relates to the number of instructions in a call.  I guess the best
> >> case is 4 (3 instructions for the parameters and one for the call itself).
> >>
> >
> > This one I will ask Wilco to chime in. We discussed offline what would
> > be the largest case that this builtin should allow and he suggested
> > 256 bytes. It would actually generate 9 instructions (it's in the
> > memset-corner-case.c).
> > Personally I am not sure what the best decisions are in this case so I
> > will rely on Wilco's suggestions.
>
> Ah, sorry, by "the size value", I meant the !speed_p value of 128.
> I now realise that that was far from clear given that the variable is
> called max_set_size :-)
>
> So yeah, I'm certainly not questioning the speed_p value of 256.
> I'm sure you and Wilco have picked the best value for that.  But -Os
> stuff can usually be justified on first principles and I wasn't sure
> where the value of 128 came from.
>

I had another chat with Wilco about the 128-byte value for !speed_p. We
estimate the average number of instructions up to 128 bytes to be ~3, which
is similar to doing a memset call. But I did go back and think about the
tuning argument of AARCH64_EXTRA_TUNE_NO_LDP_STP_QREGS a bit more, because
you are right that based on that the average instruction count can double.
I would propose using 256/128 based on speed_p but halving the value based
on the tune parameter. Obviously the assumption here is that we are
respecting the core's choice of avoiding stp of q registers (given that I do
not see other uses of AARCH64_EXTRA_TUNE_NO_LDP_STP_QREGS being changed by
-Os).

There might be a debate on how useful AARCH64_EXTRA_TUNE_NO_LDP_STP_QREGS is
in the context of memset/memcpy, but that needs more analysis and I would
say should be a separate patch.

> >> > +
> >> > +  /* Convert len to bits to make the rest of the co
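For concreteness, here is a small sketch of the budget calculation proposed above (names follow aarch64.c; this is only an illustration of the proposal as described here, not the hunk that was eventually committed):

```c
/* 256 bytes when optimizing for speed, 128 when optimizing for size,
   halved again when the core asks us to avoid STP of Q registers.  */
unsigned max_set_size = speed_p ? 256 : 128;
if (aarch64_tune_params.extra_tuning_flags
    & AARCH64_EXTRA_TUNE_NO_LDP_STP_QREGS)
  max_set_size /= 2;
```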
RE: [PATCH] aarch64: Add backend support for expanding __builtin_memset
Hi Richard > -Original Message- > From: Richard Sandiford > Sent: 03 November 2020 11:34 > To: Sudakshina Das > Cc: Wilco Dijkstra ; gcc-patches@gcc.gnu.org; > Kyrylo Tkachov ; Richard Earnshaw > > Subject: Re: [PATCH] aarch64: Add backend support for expanding > __builtin_memset > > Sudakshina Das writes: > >> -Original Message- > >> From: Richard Sandiford > >> Sent: 30 October 2020 19:56 > >> To: Sudakshina Das > >> Cc: Wilco Dijkstra ; gcc-patches@gcc.gnu.org; > >> Kyrylo Tkachov ; Richard Earnshaw > >> > >> Subject: Re: [PATCH] aarch64: Add backend support for expanding > >> __builtin_memset > >> > >> > + base = copy_to_mode_reg (Pmode, XEXP (dst, 0)); dst = > >> > + adjust_automodify_address (dst, VOIDmode, base, 0); > >> > + > >> > + /* Prepare the val using a DUP v0.16B, val. */ if (CONST_INT_P > >> > + (val)) > >> > +{ > >> > + val = force_reg (QImode, val); > >> > +} > >> > + src = gen_reg_rtx (V16QImode); > >> > + emit_insn (gen_aarch64_simd_dupv16qi(src, val)); > >> > >> I think we should use: > >> > >> src = expand_vector_broadcast (V16QImode, val); > >> > >> here (without the CONST_INT_P check), so that for constants we just > >> move a constant directly into a register. > >> > > > > Sorry to bring this up again. When I tried expand_vector_broadcast, I > > see the following behaviour: > > for __builtin_memset(p, 1, 24) where the duplicated constant fits > > moviv0.16b, 0x1 > > mov x1, 72340172838076673 > > str x1, [x0, 16] > > str q0, [x0] > > and an ICE for __builtin_memset(p, 1, 32) where I am guessing the > > duplicated constant does not fit > > x.c:7:30: error: unrecognizable insn: > > 7 | { __builtin_memset(p, 1, 32);} > > | ^ > > (insn 8 7 0 2 (parallel [ > > (set (mem:V16QI (reg:DI 94) [0 MEM [(void > > *)p_2(D)]+0 > S16 A8]) > > (const_vector:V16QI [ > > (const_int 1 [0x1]) repeated x16 > > ])) > > (set (mem:V16QI (plus:DI (reg:DI 94) > > (const_int 16 [0x10])) [0 MEM [(void > > *)p_2(D)]+16 > S16 A8]) > > (const_vector:V16QI [ > > (const_int 1 [0x1]) repeated x16 > > ])) > > ]) "x.c":7:3 -1 > > (nil)) > > during RTL pass: vregs > > Ah, yeah, I guess we need to call force_reg on the result. > > >> So yeah, I'm certainly not questioning the speed_p value of 256. > >> I'm sure you and Wilco have picked the best value for that. But -Os > >> stuff can usually be justified on first principles and I wasn't sure > >> where the value of 128 came from. > >> > > > > I had another chat with Wilco about the 128byte value for !speed_p. We > > estimate the average number of instructions upto 128byte would be ~3 > > which is similar to do a memset call. But I did go back and think > > about the tuning argument of > AARCH64_EXTRA_TUNE_NO_LDP_STP_QREGS a > > bit more because you are right that based on that the average instructions > can become double. > > I would propose using 256/128 based on speed_p but halving the value > > based on the tune parameter. Obviously the assumption here is that we > > are respecting the core's choice of avoiding stp of q registers (given > > that I do not see other uses of > AARCH64_EXTRA_TUNE_NO_LDP_STP_QREGS being changed by -Os). > > Yeah, but I think the lack of an -Os check in the existing code might be a > mistake. The point is that STP Q is smaller than two separate STR Qs, so > using > it is a size optimisation even if it's not a speed optimisation. > And like I say, -Os isn't supposed to be striking a balance between size and > speed: it's supposed to be going for size quite aggressively. 
> > So TBH I have slight preference for keeping the current value and only > checking the tuning flag for speed_p. But I agree that halving the value > would be self-consistent, so if you or Wilco believe strongly that halving is > better, that'd be OK with me too. > > > There might be a debate on how useful > > AARCH64_EXTRA_TUNE_NO_LDP_STP_QREGS > > is in the context of memset/memcpy but that needs more analysis and I > > would say should be a separate patch. > > Agreed. > > >> >> > + if (n > 0 && n < copy_limit / 2) > >> >> > + { > >> >> > + next_mode = smallest_mode_for_size (n, MODE_INT); > >> >> > + /* Last 1-byte causes the compiler to optimize to STRB when > >> >> > +it > >> >> should > >> >> > + use STR Bx, [mem] since we already used SIMD registers. > >> >> > + So force it to HImode. */ > >> >> > + if (next_mode == QImode) > >> >> > + next_mode = HImode; > >> >> > >> >> Is this always better? E.g. for variable inputs and zero it seems > >> >> quite natural to store the original scalar GPR. > >> >> > >> >> If we do do this, I think we should assert before the loop that n > 1. > >> >> > >>
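In code, the force_reg suggestion presumably amounts to something along these lines (a sketch rather than the committed hunk):

```c
/* Broadcast VAL into a V16QI value; for a constant VAL this yields a
   CONST_VECTOR rather than a register, so force the result into a
   register before using it as the source of the stores.  */
src = expand_vector_broadcast (V16QImode, val);
src = force_reg (V16QImode, src);
```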
RE: [PATCH] aarch64: Add backend support for expanding __builtin_memset
Hi Richard > -Original Message- > From: Richard Sandiford > Sent: 11 November 2020 17:52 > To: Sudakshina Das > Cc: Wilco Dijkstra ; gcc-patches@gcc.gnu.org; > Kyrylo Tkachov ; Richard Earnshaw > > Subject: Re: [PATCH] aarch64: Add backend support for expanding > __builtin_memset > > Sudakshina Das writes: > > Apologies for the delay. I have attached another version of the patch. > > I have disabled the test cases for ILP32. This is only because > > function body check fails because there is an addition unsigned extension > instruction for src pointer in > > every test (uxtwx0, w0). The actual inlining is not different. > > Yeah, agree that's the best way of handling the ILP32 difference. > > > […] > > +/* SET_RATIO is similar to CLEAR_RATIO, but for a non-zero constant. > Without > > + -mstrict-align, make decisions in "setmem". Otherwise follow a sensible > > + default: when optimizing for size adjust the ratio to account for > > +the > > nit: should just be one space after “:” > > > […] > > @@ -21289,6 +21292,134 @@ aarch64_expand_cpymem (rtx *operands) > >return true; > > } > > > > +/* Like aarch64_copy_one_block_and_progress_pointers, except for > memset where > > + *src is a register we have created with the duplicated value to be > > +set. */ > > “*src” -> SRC > since there's no dereference now > > > […] > > + /* In case we are optimizing for size or if the core does not > > + want to use STP Q regs, lower the max_set_size. */ > > + max_set_size = (!speed_p > > + || (aarch64_tune_params.extra_tuning_flags > > + & AARCH64_EXTRA_TUNE_NO_LDP_STP_QREGS)) > > + ? max_set_size/2 : max_set_size; > > Formatting nit: should be a space either side of “/”. > > > + while (n > 0) > > +{ > > + /* Find the largest mode in which to do the copy in without > > +over writing. */ > > s/in without/without/ > > > + opt_scalar_int_mode mode_iter; > > + FOR_EACH_MODE_IN_CLASS (mode_iter, MODE_INT) > > + if (GET_MODE_BITSIZE (mode_iter.require ()) <= MIN (n, copy_limit)) > > + cur_mode = mode_iter.require (); > > + > > + gcc_assert (cur_mode != BLKmode); > > + > > + mode_bits = GET_MODE_BITSIZE (cur_mode).to_constant (); > > + aarch64_set_one_block_and_progress_pointer (src, &dst, > > + cur_mode); > > + > > + n -= mode_bits; > > + > > + /* Do certain trailing copies as overlapping if it's going to be > > +cheaper. i.e. less instructions to do so. For instance doing a 15 > > +byte copy it's more efficient to do two overlapping 8 byte copies > than > > +8 + 4 + 2 + 1. */ > > + if (n > 0 && n < copy_limit / 2) > > + { > > + next_mode = smallest_mode_for_size (n, MODE_INT); > > + int n_bits = GET_MODE_BITSIZE (next_mode).to_constant (); > > Sorry for the runaround, but looking at this again, I'm a bit worried that we > only indirectly test that n_bits is within the length of the original set. I > guess > it is because if n < copy_limit / 2 then n < mode_bits, and so n_bits will > never > exceed mode_bits. I think it might be worth adding an assert to make that > “clearer” (maybe only to me, probably obvious to everyone else): > > gcc_assert (n_bits <= mode_bits); > > OK with those changes, thanks. Thank you! Committed as 54bbde5 with those changes. Sudi > > Richard > > > + dst = aarch64_move_pointer (dst, (n - n_bits) / BITS_PER_UNIT); > > + n = n_bits; > > + } > > +} > > + > > + return true; > > +} > > + > > + > > /* Split a DImode store of a CONST_INT SRC to MEM DST as two > > SImode stores. Handle the case when the constant has identical > > bottom and top halves. 
This is beneficial when the two stores can > > be
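For reference, the trailing-copy step with the requested assert looks roughly like this once the quoted pieces are put together (assembled from the hunks quoted above, so treat it as a sketch rather than the exact committed code):

```c
/* Do certain trailing copies as overlapping if it's going to be cheaper,
   i.e. fewer instructions: a 15-byte set is better done as two
   overlapping 8-byte stores than as 8 + 4 + 2 + 1.  */
if (n > 0 && n < copy_limit / 2)
  {
    next_mode = smallest_mode_for_size (n, MODE_INT);
    int n_bits = GET_MODE_BITSIZE (next_mode).to_constant ();
    gcc_assert (n_bits <= mode_bits);
    dst = aarch64_move_pointer (dst, (n - n_bits) / BITS_PER_UNIT);
    n = n_bits;
  }
```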
[PATCH] aarch64: Add backend support for expanding __builtin_memset
Hi

This patch implements aarch64 backend expansion for __builtin_memset. Most of
the implementation is based on the expansion of __builtin_memcpy. We change
the values of SET_RATIO and MOVE_RATIO for cases where we do not have to
strictly align and where we can benefit from NEON instructions in the backend.

So for a test case like:

void foo (void* p) { __builtin_memset (p, 1, 7); }

instead of generating:

        mov     w3, 16843009
        mov     w2, 257
        mov     w1, 1
        str     w3, [x0]
        strh    w2, [x0, 4]
        strb    w1, [x0, 6]
        ret

we now generate:

        movi    v0.16b, 0x1
        str     s0, [x0]
        str     s0, [x0, 3]
        ret

Bootstrapped and regression tested on aarch64-none-linux-gnu. With this patch
I have seen an overall improvement of 0.27% in Spec2017 Int and 0.19% in
Spec2017 FP benchmarks on Neoverse N1.

Is this ok for trunk?

gcc/ChangeLog:

2020-xx-xx  Sudakshina Das

	* config/aarch64/aarch64-protos.h (aarch64_expand_setmem): New
	declaration.
	* config/aarch64/aarch64.c (aarch64_gen_store_pair): Add case for
	E_V16QImode.
	(aarch64_set_one_block_and_progress_pointer): New helper for
	aarch64_expand_setmem.
	(aarch64_expand_setmem): Define the expansion for memset.
	* config/aarch64/aarch64.h (CLEAR_RATIO): Tweak to favor
	aarch64_expand_setmem when allowed and profitable.
	(SET_RATIO): Likewise.
	* config/aarch64/aarch64.md: Define pattern for setmemdi.

gcc/testsuite/ChangeLog:

2020-xx-xx  Sudakshina Das

	* g++.dg/tree-ssa/pr90883.C: Remove xfail for aarch64.
	* gcc.dg/tree-prof/stringop-2.c: Add xfail for aarch64.
	* gcc.target/aarch64/memset-corner-cases.c: New test.
	* gcc.target/aarch64/memset-q-reg.c: New test.

Thanks
Sudi

### Attachment also inlined for ease of reply###

diff --git a/gcc/config/aarch64/aarch64-protos.h b/gcc/config/aarch64/aarch64-protos.h
index 7a34c841355bad88365381912b163c61c5a35811..2aa3f1fddaafae58f0bfb26e5b33fe6a94e85e06 100644
--- a/gcc/config/aarch64/aarch64-protos.h
+++ b/gcc/config/aarch64/aarch64-protos.h
@@ -510,6 +510,7 @@ bool aarch64_emit_approx_div (rtx, rtx, rtx);
 bool aarch64_emit_approx_sqrt (rtx, rtx, bool);
 void aarch64_expand_call (rtx, rtx, rtx, bool);
 bool aarch64_expand_cpymem (rtx *);
+bool aarch64_expand_setmem (rtx *);
 bool aarch64_float_const_zero_rtx_p (rtx);
 bool aarch64_float_const_rtx_p (rtx);
 bool aarch64_function_arg_regno_p (unsigned);
diff --git a/gcc/config/aarch64/aarch64.h b/gcc/config/aarch64/aarch64.h
index 00b5f8438863bb52c348cfafd5d4db478fe248a7..bcb654809c9662db0f51fc1368e37e42969efd29 100644
--- a/gcc/config/aarch64/aarch64.h
+++ b/gcc/config/aarch64/aarch64.h
@@ -1024,16 +1024,18 @@ typedef struct
 #define MOVE_RATIO(speed) \
   (!STRICT_ALIGNMENT ? 2 : (((speed) ? 15 : AARCH64_CALL_RATIO) / 2))
 
-/* For CLEAR_RATIO, when optimizing for size, give a better estimate
-   of the length of a memset call, but use the default otherwise.  */
+/* Like MOVE_RATIO, without -mstrict-align, make decisions in "setmem" when
+   we would use more than 3 scalar instructions.
+   Otherwise follow a sensible default: when optimizing for size, give a better
+   estimate of the length of a memset call, but use the default otherwise.  */
 #define CLEAR_RATIO(speed) \
-  ((speed) ? 15 : AARCH64_CALL_RATIO)
+  (!STRICT_ALIGNMENT ? 4 : (speed) ? 15 : AARCH64_CALL_RATIO)
 
 /* SET_RATIO is similar to CLEAR_RATIO, but for a non-zero constant, so when
    optimizing for size adjust the ratio to account for the overhead of loading
    the constant.  */
 #define SET_RATIO(speed) \
-  ((speed) ? 15 : AARCH64_CALL_RATIO - 2)
+  (!STRICT_ALIGNMENT ? 0 : (speed) ? 15 : AARCH64_CALL_RATIO - 2)
 
 /* Disable auto-increment in move_by_pieces et al.  Use of auto-increment is
    rarely a good idea in straight-line code since it adds an extra address
diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index a8cc545c37044345c3f1d3bf09151c8a9578a032..16ac0c076adcc82627af43473a938e78d3a7ecdc 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -7058,6 +7058,9 @@ aarch64_gen_store_pair (machine_mode mode, rtx mem1, rtx reg1, rtx mem2,
     case E_V4SImode:
       return gen_vec_store_pairv4siv4si (mem1, reg1, mem2, reg2);
 
+    case E_V16QImode:
+      return gen_vec_store_pairv16qiv16qi (mem1, reg1, mem2, reg2);
+
     default:
       gcc_unreachable ();
     }
@@ -21373,6 +21376,134 @@ aarch64_expand_cpymem (rtx *operands)
   return true;
 }
 
+/* Like aarch64_copy_one_block_and_progress_pointers, except for memset where
+   *src is a register we have created with the duplicated value to be set.  */
+static void
+aarch64_set_one_block_and_progress_pointer (rtx *src, rtx *dst,
+					    machine_mode mode)
+{
+  /* If we are copying 128bits or 256bits, we can do that s
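A quick way to see the effect described above is a test along these lines; the options and scan patterns here are an assumption for illustration, and the patch's own memset-corner-cases.c and memset-q-reg.c remain the authoritative tests:

```c
/* { dg-do compile } */
/* { dg-options "-O2" } */

/* 7-byte memset: expect a byte-duplicated NEON register and two
   overlapping 4-byte stores instead of the old scalar sequence.  */
void set7 (void *p)
{
  __builtin_memset (p, 1, 7);
}

/* { dg-final { scan-assembler {movi\tv[0-9]+\.16b, 0x1} } } */
/* { dg-final { scan-assembler-times {str\ts[0-9]+} 2 } } */
```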
RE: [PATCH] aarch64: Add backend support for expanding __builtin_memset
Hi Richard Thank you for the review. Please find my comments inlined. > -Original Message- > From: Richard Sandiford > Sent: 30 October 2020 15:03 > To: Sudakshina Das > Cc: gcc-patches@gcc.gnu.org; Kyrylo Tkachov ; > Richard Earnshaw > Subject: Re: [PATCH] aarch64: Add backend support for expanding > __builtin_memset > > Sudakshina Das writes: > > diff --git a/gcc/config/aarch64/aarch64.h > > b/gcc/config/aarch64/aarch64.h index > > > 00b5f8438863bb52c348cfafd5d4db478fe248a7..bcb654809c9662db0f51fc1368 > e3 > > 7e42969efd29 100644 > > --- a/gcc/config/aarch64/aarch64.h > > +++ b/gcc/config/aarch64/aarch64.h > > @@ -1024,16 +1024,18 @@ typedef struct #define MOVE_RATIO(speed) \ > >(!STRICT_ALIGNMENT ? 2 : (((speed) ? 15 : AARCH64_CALL_RATIO) / 2)) > > > > -/* For CLEAR_RATIO, when optimizing for size, give a better estimate > > - of the length of a memset call, but use the default otherwise. */ > > +/* Like MOVE_RATIO, without -mstrict-align, make decisions in "setmem" > when > > + we would use more than 3 scalar instructions. > > + Otherwise follow a sensible default: when optimizing for size, give a > better > > + estimate of the length of a memset call, but use the default > > +otherwise. */ > > #define CLEAR_RATIO(speed) \ > > - ((speed) ? 15 : AARCH64_CALL_RATIO) > > + (!STRICT_ALIGNMENT ? 4 : (speed) ? 15 : AARCH64_CALL_RATIO) > > > > /* SET_RATIO is similar to CLEAR_RATIO, but for a non-zero constant, so > when > > optimizing for size adjust the ratio to account for the overhead of > > loading > > the constant. */ > > #define SET_RATIO(speed) \ > > - ((speed) ? 15 : AARCH64_CALL_RATIO - 2) > > + (!STRICT_ALIGNMENT ? 0 : (speed) ? 15 : AARCH64_CALL_RATIO - 2) > > Think it would help to adjust the SET_RATIO comment too, otherwise it's not > obvious why its !STRICT_ALIGNMNENT value is 0. > Will do. > > > > /* Disable auto-increment in move_by_pieces et al. Use of auto- > increment is > > rarely a good idea in straight-line code since it adds an extra > > address diff --git a/gcc/config/aarch64/aarch64.c > > b/gcc/config/aarch64/aarch64.c index > > > a8cc545c37044345c3f1d3bf09151c8a9578a032..16ac0c076adcc82627af43473a9 > 3 > > 8e78d3a7ecdc 100644 > > --- a/gcc/config/aarch64/aarch64.c > > +++ b/gcc/config/aarch64/aarch64.c > > @@ -7058,6 +7058,9 @@ aarch64_gen_store_pair (machine_mode mode, > rtx mem1, rtx reg1, rtx mem2, > > case E_V4SImode: > >return gen_vec_store_pairv4siv4si (mem1, reg1, mem2, reg2); > > > > +case E_V16QImode: > > + return gen_vec_store_pairv16qiv16qi (mem1, reg1, mem2, reg2); > > + > > default: > >gcc_unreachable (); > > } > > @@ -21373,6 +21376,134 @@ aarch64_expand_cpymem (rtx *operands) > >return true; > > } > > > > +/* Like aarch64_copy_one_block_and_progress_pointers, except for > memset where > > + *src is a register we have created with the duplicated value to be > > +set. */ > > AIUI, *SRC doesn't accumulate across calls in the way that it does for > aarch64_copy_one_block_and_progress_pointers, so it might be better to > pass an rtx rather than an “rtx *”. > Will do. > > +static void > > +aarch64_set_one_block_and_progress_pointer (rtx *src, rtx *dst, > > + machine_mode mode) > > +{ > > + /* If we are copying 128bits or 256bits, we can do that straight from > > + the SIMD register we prepared. */ > > Nit: excess space before “the”. > Will do. > > + if (known_eq (GET_MODE_BITSIZE (mode), 256)) > > +{ > > + mode = GET_MODE (*src); > > Excess space before “GET_MODE”. > Will do. > > + /* "Cast" the *dst to the correct mode. 
*/ > > + *dst = adjust_address (*dst, mode, 0); > > + /* Emit the memset. */ > > + emit_insn (aarch64_gen_store_pair (mode, *dst, *src, > > +aarch64_progress_pointer (*dst), > *src)); > > + > > + /* Move the pointers forward. */ > > + *dst = aarch64_move_pointer (*dst, 32); > > + return; > > +} > > + else if (known_eq (GET_MODE_BITSIZE (mode), 128)) > > Nit: more usual in GCC not to have an “else” after an early return. > Will do. > > +{ > > + /* "Cast" the *dst to the correct mode. */ > > + *dst = adjust_address (*dst, GET_MODE (*src), 0); > > + /* Emit the memset. */ > > + emit_move_insn (*dst, *src); > > + /* Move the pointers forward. */ > > + *dst = aarch64_move_pointer (*dst, 16); > > + return; > > +} > > + /* For copying less, we have to extract the right amount from *src. > > + */ machine_mode vq_mode = aarch64_vq_mode > > + (GET_MODE_INNER(mode)).require (); > > Nit: should be a space before “(mode)”. > Will do. > > + *src = convert_to_mode (vq_mode, *src, 0); rtx reg = > > + simplify_gen_subreg (mode, *src, vq_mode, 0); > > I was surprised that this needed a two-step conversion. Does a direct subreg > of the original V16QI src blow up for s