[PATCH] aarch64: Fix PR97638

2020-11-02 Thread Sudakshina Das via Gcc-patches
Hi

The testcase in this patch was failing to produce a 'bti c'
at the beginning of the function. This was because aarch64_pac_insn_p
was wrongly returning the rtx itself at the first check: the function
returns bool, so any non-null rtx converted to true and non-insns were
misidentified as PAC instructions. This patch fixes the return value.

Bootstrap and regression tested on aarch64-none-linux-gnu.
Is this ok for trunk and gcc 10 backport?

Thanks
Sudi

gcc/ChangeLog:

2020-10-30  Sudakshina Das  

PR target/97638
* config/aarch64/aarch64-bti-insert.c (aarch64_pac_insn_p): Update
return value on INSN_P check.

gcc/testsuite/ChangeLog:

2020-10-30  Sudakshina Das  

PR target/97638
* gcc.target/aarch64/pr97638.c: New test.


### Attachment also inlined for ease of reply###

diff --git a/gcc/config/aarch64/aarch64-bti-insert.c b/gcc/config/aarch64/aarch64-bti-insert.c
index 57663ee23b490162dbe7ffe2f618066e71cea455..98026695fdbbe2eda84e0befad94b5fe4ce22754 100644
--- a/gcc/config/aarch64/aarch64-bti-insert.c
+++ b/gcc/config/aarch64/aarch64-bti-insert.c
@@ -95,7 +95,7 @@ static bool
 aarch64_pac_insn_p (rtx x)
 {
   if (!INSN_P (x))
-    return x;
+    return false;
 
   subrtx_var_iterator::array_type array;
   FOR_EACH_SUBRTX_VAR (iter, array, PATTERN (x), ALL)
diff --git a/gcc/testsuite/gcc.target/aarch64/pr97638.c b/gcc/testsuite/gcc.target/aarch64/pr97638.c
new file mode 100644
index 0000000000000000000000000000000000000000..e5869e86c449aef5606541c4c7a51069a1426793
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/pr97638.c
@@ -0,0 +1,17 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -mbranch-protection=bti" } */
+
+char *foo (const char *s, const int c)
+{
+  const char *p = 0;
+  for (;;)
+  {
+    if (*s == c)
+      p = s;
+    if (p != 0 || *s++ == 0)
+      break;
+  }
+  return (char *)p;
+}
+
+/* { dg-final { scan-assembler "hint\t34" } } */
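
(For reference: 'bti c' sits in the NOP-compatible hint space and is
encoded as hint #34, which is what the scan-assembler pattern above
matches.)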




RE: [PATCH] aarch64: Fix PR97638

2020-11-02 Thread Sudakshina Das via Gcc-patches
Hi Richard

> -Original Message-
> From: Richard Sandiford 
> Sent: 02 November 2020 10:31
> To: Sudakshina Das 
> Cc: gcc-patches@gcc.gnu.org; Kyrylo Tkachov ;
> Richard Earnshaw 
> Subject: Re: [PATCH] aarch64: Fix PR97638
> 
> Sudakshina Das  writes:
> > Hi
> >
> > The testcase in this patch was failing to produce a 'bti c'
> > at the beginning of the function. This was because in
> > aarch64_pac_insn_p, we were wrongly returning at the first check. This
> > patch fixes the return value.
> >
> > Bootstrap and regression tested on aarch64-none-linux-gnu.
> > Is this ok for trunk and gcc 10 backport?
> 
> OK for both, thanks.

Thank you! Pushed to trunk. Will wait for a couple of days before backport.

Thanks
Sudi

> 
> Richard


RE: [PATCH] aarch64: Add backend support for expanding __builtin_memset

2020-11-03 Thread Sudakshina Das via Gcc-patches
Hi Richard

> -Original Message-
> From: Richard Sandiford 
> Sent: 30 October 2020 19:56
> To: Sudakshina Das 
> Cc: Wilco Dijkstra ; gcc-patches@gcc.gnu.org;
> Kyrylo Tkachov ; Richard Earnshaw
> 
> Subject: Re: [PATCH] aarch64: Add backend support for expanding
> __builtin_memset
> 
> > +  base = copy_to_mode_reg (Pmode, XEXP (dst, 0));
> > +  dst = adjust_automodify_address (dst, VOIDmode, base, 0);
> > +
> > +  /* Prepare the val using a DUP v0.16B, val.  */
> > +  if (CONST_INT_P (val))
> > +    {
> > +      val = force_reg (QImode, val);
> > +    }
> > +  src = gen_reg_rtx (V16QImode);
> > +  emit_insn (gen_aarch64_simd_dupv16qi(src, val));
> 
> I think we should use:
> 
>   src = expand_vector_broadcast (V16QImode, val);
> 
> here (without the CONST_INT_P check), so that for constants we just move a
> constant directly into a register.
>

Sorry to bring this up again. When I tried expand_vector_broadcast, I 
see the following behaviour:
for __builtin_memset(p, 1, 24), where the duplicated constant fits:
	movi	v0.16b, 0x1
	mov	x1, 72340172838076673
	str	x1, [x0, 16]
	str	q0, [x0]
and an ICE for __builtin_memset(p, 1, 32), where I am guessing the duplicated
constant does not fit:
x.c:7:30: error: unrecognizable insn:
7 | { __builtin_memset(p, 1, 32);}
  |  ^
(insn 8 7 0 2 (parallel [
            (set (mem:V16QI (reg:DI 94) [0 MEM  [(void *)p_2(D)]+0 S16 A8])
                (const_vector:V16QI [
                        (const_int 1 [0x1]) repeated x16
                    ]))
            (set (mem:V16QI (plus:DI (reg:DI 94)
                        (const_int 16 [0x10])) [0 MEM  [(void *)p_2(D)]+16 S16 A8])
                (const_vector:V16QI [
                        (const_int 1 [0x1]) repeated x16
                    ]))
        ]) "x.c":7:3 -1
     (nil))
during RTL pass: vregs
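
(My guess is that for a CONST_INT input expand_vector_broadcast simply
returns the duplicated const_vector, so the store-pair parallel above ends
up with constant operands that no insn pattern accepts.)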

> Sudakshina Das  writes:
> >> > +
> >> > +  /* "Cast" the *dst to the correct mode.  */  *dst =
> >> > + adjust_address (*dst, mode, 0);
> >> > +  /* Emit the memset.  */
> >> > +  emit_move_insn (*dst, reg);
> >> > +  /* Move the pointer forward.  */  *dst =
> >> > + aarch64_progress_pointer (*dst); }
> >> > +
> >> > +/* Expand setmem, as if from a __builtin_memset.  Return true if
> >> > +   we succeed, otherwise return false.  */
> >> > +
> >> > +bool
> >> > +aarch64_expand_setmem (rtx *operands)
> >> > +{
> >> > +  int n, mode_bits;
> >> > +  unsigned HOST_WIDE_INT len;
> >> > +  rtx dst = operands[0];
> >> > +  rtx val = operands[2], src;
> >> > +  rtx base;
> >> > +  machine_mode cur_mode = BLKmode, next_mode;
> >> > +  bool speed_p = !optimize_function_for_size_p (cfun);
> >> > +  unsigned max_set_size = speed_p ? 256 : 128;
> >>
> >> What's the basis for the size value?  AIUI (and I've probably got
> >> this wrong), that effectively means a worst case of 3+2 stores
> >> (3 STP Qs and 2 mop-up stores).  Then we need one instruction to set
> >> up the constant.  So if that's right, it looks like the worst-case size
> >> is 6 instructions.
> >>
> >> AARCH64_CALL_RATIO has a value of 8, but I'm not sure how that
> >> relates to the number of instructions in a call.  I guess the best
> >> case is 4 (3 instructions for the parameters and one for the call itself).
> >>
> >
> > This one I will ask Wilco to chime in. We discussed offline what would
> > be the largest case that this builtin should allow and he suggested
> > 256 bytes. It would actually generate 9 instructions (it's in
> > memset-corner-cases.c).
> > Personally I am not sure what the best decisions are in this case so I
> > will rely on Wilco's suggestions.
> 
> Ah, sorry, by “the size value”, I meant the !speed_p value of 128.
> I now realise that that was far from clear given that the variable is called
> max_set_size :-)
> 
> So yeah, I'm certainly not questioning the speed_p value of 256.
> I'm sure you and Wilco have picked the best value for that.  But -Os stuff can
> usually be justified on first principles and I wasn't sure where the value of
> 128 came from.
>

I had another chat with Wilco about the 128-byte value for !speed_p. We
estimate the average number of instructions up to 128 bytes would be ~3,
which is similar to the cost of a memset call. But I did go back and think
about the tuning argument of AARCH64_EXTRA_TUNE_NO_LDP_STP_QREGS a bit more
because you are right that, based on that, the average instruction count can
double. I would propose using 256/128 based on speed_p but halving the value
based on the tune parameter; see the sketch below. Obviously the assumption
here is that we are respecting the core's choice of avoiding stp of q
registers (given that I do not see other uses of
AARCH64_EXTRA_TUNE_NO_LDP_STP_QREGS being changed by -Os).

There might be a debate on how useful AARCH64_EXTRA_TUNE_NO_LDP_STP_QREGS
is in the context of memset/memcpy, but that needs more analysis and I would
say should be a separate patch.
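
In code, the proposal would look roughly like this (a sketch only; the
exact form was settled during review):

  /* Sketch of the proposed limit: start from 256/128 based on speed_p,
     then halve it when the core prefers to avoid STP of Q registers.  */
  unsigned max_set_size = speed_p ? 256 : 128;
  if (aarch64_tune_params.extra_tuning_flags
      & AARCH64_EXTRA_TUNE_NO_LDP_STP_QREGS)
    max_set_size /= 2;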
 
> >> > +
> >> > +  /* Convert len to bits to make the rest of the co

RE: [PATCH] aarch64: Add backend support for expanding __builtin_memset

2020-11-11 Thread Sudakshina Das via Gcc-patches
Hi Richard

> -Original Message-
> From: Richard Sandiford 
> Sent: 03 November 2020 11:34
> To: Sudakshina Das 
> Cc: Wilco Dijkstra ; gcc-patches@gcc.gnu.org;
> Kyrylo Tkachov ; Richard Earnshaw
> 
> Subject: Re: [PATCH] aarch64: Add backend support for expanding
> __builtin_memset
> 
> Sudakshina Das  writes:
> >> -Original Message-
> >> From: Richard Sandiford 
> >> Sent: 30 October 2020 19:56
> >> To: Sudakshina Das 
> >> Cc: Wilco Dijkstra ; gcc-patches@gcc.gnu.org;
> >> Kyrylo Tkachov ; Richard Earnshaw
> >> 
> >> Subject: Re: [PATCH] aarch64: Add backend support for expanding
> >> __builtin_memset
> >>
> >> > +  base = copy_to_mode_reg (Pmode, XEXP (dst, 0));
> >> > +  dst = adjust_automodify_address (dst, VOIDmode, base, 0);
> >> > +
> >> > +  /* Prepare the val using a DUP v0.16B, val.  */
> >> > +  if (CONST_INT_P (val))
> >> > +    {
> >> > +      val = force_reg (QImode, val);
> >> > +    }
> >> > +  src = gen_reg_rtx (V16QImode);
> >> > +  emit_insn (gen_aarch64_simd_dupv16qi(src, val));
> >>
> >> I think we should use:
> >>
> >>   src = expand_vector_broadcast (V16QImode, val);
> >>
> >> here (without the CONST_INT_P check), so that for constants we just
> >> move a constant directly into a register.
> >>
> >
> > Sorry to bring this up again. When I tried expand_vector_broadcast, I
> > see the following behaviour:
> > for __builtin_memset(p, 1, 24), where the duplicated constant fits:
> > 	movi	v0.16b, 0x1
> > 	mov	x1, 72340172838076673
> > 	str	x1, [x0, 16]
> > 	str	q0, [x0]
> > and an ICE for __builtin_memset(p, 1, 32), where I am guessing the
> > duplicated constant does not fit:
> > x.c:7:30: error: unrecognizable insn:
> > 7 | { __builtin_memset(p, 1, 32);}
> >   |  ^
> > (insn 8 7 0 2 (parallel [
> >             (set (mem:V16QI (reg:DI 94) [0 MEM  [(void *)p_2(D)]+0 S16 A8])
> >                 (const_vector:V16QI [
> >                         (const_int 1 [0x1]) repeated x16
> >                     ]))
> >             (set (mem:V16QI (plus:DI (reg:DI 94)
> >                         (const_int 16 [0x10])) [0 MEM  [(void *)p_2(D)]+16 S16 A8])
> >                 (const_vector:V16QI [
> >                         (const_int 1 [0x1]) repeated x16
> >                     ]))
> >         ]) "x.c":7:3 -1
> >      (nil))
> > during RTL pass: vregs
> 
> Ah, yeah, I guess we need to call force_reg on the result.
> 
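With force_reg the sketch then becomes (assuming SRC and VAL as in the
patch):

  /* Broadcast VAL into all 16 byte lanes.  For a CONST_INT this can yield
     a bare const_vector, so force the result into a register before using
     it in stores.  */
  src = expand_vector_broadcast (V16QImode, val);
  src = force_reg (V16QImode, src);
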
> >> So yeah, I'm certainly not questioning the speed_p value of 256.
> >> I'm sure you and Wilco have picked the best value for that.  But -Os
> >> stuff can usually be justified on first principles and I wasn't sure
> >> where the value of 128 came from.
> >>
> >
> > I had another chat with Wilco about the 128-byte value for !speed_p. We
> > estimate the average number of instructions up to 128 bytes would be ~3,
> > which is similar to the cost of a memset call. But I did go back and think
> > about the tuning argument of AARCH64_EXTRA_TUNE_NO_LDP_STP_QREGS a
> > bit more because you are right that, based on that, the average
> > instruction count can double.
> > I would propose using 256/128 based on speed_p but halving the value
> > based on the tune parameter. Obviously the assumption here is that we
> > are respecting the core's choice of avoiding stp of q registers (given
> > that I do not see other uses of
> > AARCH64_EXTRA_TUNE_NO_LDP_STP_QREGS being changed by -Os).
> 
> Yeah, but I think the lack of an -Os check in the existing code might be a
> mistake.  The point is that STP Q is smaller than two separate STR Qs, so
> using it is a size optimisation even if it's not a speed optimisation.
> And like I say, -Os isn't supposed to be striking a balance between size and
> speed: it's supposed to be going for size quite aggressively.
> 
> So TBH I have slight preference for keeping the current value and only
> checking the tuning flag for speed_p.  But I agree that halving the value
> would be self-consistent, so if you or Wilco believe strongly that halving is
> better, that'd be OK with me too.
> 
> > There might be a debate on how useful
> > AARCH64_EXTRA_TUNE_NO_LDP_STP_QREGS
> > is in the context of memset/memcpy but that needs more analysis and I
> > would say should be a separate patch.
> 
> Agreed.
> 
> >> >> > +  if (n > 0 && n < copy_limit / 2)
> >> >> > +	{
> >> >> > +	  next_mode = smallest_mode_for_size (n, MODE_INT);
> >> >> > +	  /* Last 1-byte causes the compiler to optimize to STRB when it
> >> >> > +	     should use STR Bx, [mem] since we already used SIMD registers.
> >> >> > +	     So force it to HImode.  */
> >> >> > +	  if (next_mode == QImode)
> >> >> > +	    next_mode = HImode;
> >> >>
> >> >> Is this always better?  E.g. for variable inputs and zero it seems
> >> >> quite natural to store the original scalar GPR.
> >> >>
> >> >> If we do do this, I think we should assert before the loop that n > 1.
> >> >>
> >> 

RE: [PATCH] aarch64: Add backend support for expanding __builtin_memset

2020-11-13 Thread Sudakshina Das via Gcc-patches
Hi Richard

> -Original Message-
> From: Richard Sandiford 
> Sent: 11 November 2020 17:52
> To: Sudakshina Das 
> Cc: Wilco Dijkstra ; gcc-patches@gcc.gnu.org;
> Kyrylo Tkachov ; Richard Earnshaw
> 
> Subject: Re: [PATCH] aarch64: Add backend support for expanding
> __builtin_memset
> 
> Sudakshina Das  writes:
> > Apologies for the delay. I have attached another version of the patch.
> > I have disabled the test cases for ILP32. This is only because the
> > function body check fails: there is an additional unsigned extension
> > instruction (uxtw x0, w0) for the src pointer in every test. The
> > actual inlining is not different.
> 
> Yeah, agree that's the best way of handling the ILP32 difference.
> 
> > […]
> > +/* SET_RATIO is similar to CLEAR_RATIO, but for a non-zero constant.  Without
> > +   -mstrict-align, make decisions in "setmem".  Otherwise follow a sensible
> > +   default:  when optimizing for size adjust the ratio to account for the
> 
> nit: should just be one space after “:”
> 
> > […]
> > @@ -21289,6 +21292,134 @@ aarch64_expand_cpymem (rtx *operands)
> >return true;
> >  }
> >
> > +/* Like aarch64_copy_one_block_and_progress_pointers, except for memset where
> > +   *src is a register we have created with the duplicated value to be set.  */
> 
> “*src” -> SRC
> since there's no dereference now
> 
> > […]
> > +  /* In case we are optimizing for size or if the core does not
> > +     want to use STP Q regs, lower the max_set_size.  */
> > +  max_set_size = (!speed_p
> > +		  || (aarch64_tune_params.extra_tuning_flags
> > +		      & AARCH64_EXTRA_TUNE_NO_LDP_STP_QREGS))
> > +		  ? max_set_size/2 : max_set_size;
> 
> Formatting nit: should be a space either side of “/”.
> 
> > +  while (n > 0)
> > +{
> > +  /* Find the largest mode in which to do the copy in without
> > +     over writing.  */
> 
> s/in without/without/
> 
> > +  opt_scalar_int_mode mode_iter;
> > +  FOR_EACH_MODE_IN_CLASS (mode_iter, MODE_INT)
> > +	if (GET_MODE_BITSIZE (mode_iter.require ()) <= MIN (n, copy_limit))
> > +	  cur_mode = mode_iter.require ();
> > +
> > +  gcc_assert (cur_mode != BLKmode);
> > +
> > +  mode_bits = GET_MODE_BITSIZE (cur_mode).to_constant ();
> > +  aarch64_set_one_block_and_progress_pointer (src, &dst, cur_mode);
> > +
> > +  n -= mode_bits;
> > +
> > +  /* Do certain trailing copies as overlapping if it's going to be
> > +     cheaper.  i.e. less instructions to do so.  For instance doing a 15
> > +     byte copy it's more efficient to do two overlapping 8 byte copies than
> > +     8 + 4 + 2 + 1.  */
> > +  if (n > 0 && n < copy_limit / 2)
> > +	{
> > +	  next_mode = smallest_mode_for_size (n, MODE_INT);
> > +	  int n_bits = GET_MODE_BITSIZE (next_mode).to_constant ();
> 
> Sorry for the runaround, but looking at this again, I'm a bit worried that we
> only indirectly test that n_bits is within the length of the original set.
> I guess it is because if n < copy_limit / 2 then n < mode_bits, and so
> n_bits will never exceed mode_bits.  I think it might be worth adding an
> assert to make that “clearer” (maybe only to me, probably obvious to
> everyone else):
> 
> gcc_assert (n_bits <= mode_bits);
> 
> OK with those changes, thanks.

Thank you! Committed as 54bbde5 with those changes.

Sudi

> 
> Richard
> 
> > +	  dst = aarch64_move_pointer (dst, (n - n_bits) / BITS_PER_UNIT);
> > +	  n = n_bits;
> > +	}
> > +}
> > +
> > +  return true;
> > +}
> > +
> > +
> >  /* Split a DImode store of a CONST_INT SRC to MEM DST as two
> >     SImode stores.  Handle the case when the constant has identical
> >     bottom and top halves.  This is beneficial when the two stores can be


[PATCH] aarch64: Add backend support for expanding __builtin_memset

2020-10-29 Thread Sudakshina Das via Gcc-patches
Hi

This patch implements aarch64 backend expansion for __builtin_memset. Most of
the implementation is based on the expansion of __builtin_memcpy. We change
the values of SET_RATIO and CLEAR_RATIO for cases where we do not have to
strictly align and where we can benefit from NEON instructions in the backend.

So for a test case like:

void foo (void* p) { __builtin_memset (p, 1, 7); }

instead of generating:
	mov	w3, 16843009
	mov	w2, 257
	mov	w1, 1
	str	w3, [x0]
	strh	w2, [x0, 4]
	strb	w1, [x0, 6]
	ret
we now generate:
	movi	v0.16b, 0x1
	str	s0, [x0]
	str	s0, [x0, 3]
	ret
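
Note that the two str s0 stores overlap (offsets 0 and 3): two 4-byte
stores cover all 7 bytes, which is cheaper than the 4 + 2 + 1 byte
sequence above.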

Bootstrapped and regression tested on aarch64-none-linux-gnu.
With this patch I have seen an overall improvement of 0.27% in Spec2017 Int
and 0.19% in Spec2017 FP benchmarks on Neoverse N1.

Is this ok for trunk?

gcc/ChangeLog:

2020-xx-xx  Sudakshina Das  

* config/aarch64/aarch64-protos.h (aarch64_expand_setmem): New
declaration.
* config/aarch64/aarch64.c (aarch64_gen_store_pair): Add case for
E_V16QImode.
(aarch64_set_one_block_and_progress_pointer): New helper for
aarch64_expand_setmem.
(aarch64_expand_setmem): Define the expansion for memset.
* config/aarch64/aarch64.h (CLEAR_RATIO): Tweak to favor
aarch64_expand_setmem when allowed and profitable.
(SET_RATIO): Likewise.
* config/aarch64/aarch64.md: Define pattern for setmemdi.

gcc/testsuite/ChangeLog:

2020-xx-xx  Sudakshina Das  

* g++.dg/tree-ssa/pr90883.C: Remove xfail for aarch64.
* gcc.dg/tree-prof/stringop-2.c: Add xfail for aarch64.
* gcc.target/aarch64/memset-corner-cases.c: New test.
* gcc.target/aarch64/memset-q-reg.c: New test.

Thanks
Sudi

### Attachment also inlined for ease of reply###

diff --git a/gcc/config/aarch64/aarch64-protos.h b/gcc/config/aarch64/aarch64-protos.h
index 7a34c841355bad88365381912b163c61c5a35811..2aa3f1fddaafae58f0bfb26e5b33fe6a94e85e06 100644
--- a/gcc/config/aarch64/aarch64-protos.h
+++ b/gcc/config/aarch64/aarch64-protos.h
@@ -510,6 +510,7 @@ bool aarch64_emit_approx_div (rtx, rtx, rtx);
 bool aarch64_emit_approx_sqrt (rtx, rtx, bool);
 void aarch64_expand_call (rtx, rtx, rtx, bool);
 bool aarch64_expand_cpymem (rtx *);
+bool aarch64_expand_setmem (rtx *);
 bool aarch64_float_const_zero_rtx_p (rtx);
 bool aarch64_float_const_rtx_p (rtx);
 bool aarch64_function_arg_regno_p (unsigned);
diff --git a/gcc/config/aarch64/aarch64.h b/gcc/config/aarch64/aarch64.h
index 00b5f8438863bb52c348cfafd5d4db478fe248a7..bcb654809c9662db0f51fc1368e37e42969efd29 100644
--- a/gcc/config/aarch64/aarch64.h
+++ b/gcc/config/aarch64/aarch64.h
@@ -1024,16 +1024,18 @@ typedef struct
 #define MOVE_RATIO(speed) \
   (!STRICT_ALIGNMENT ? 2 : (((speed) ? 15 : AARCH64_CALL_RATIO) / 2))
 
-/* For CLEAR_RATIO, when optimizing for size, give a better estimate
-   of the length of a memset call, but use the default otherwise.  */
+/* Like MOVE_RATIO, without -mstrict-align, make decisions in "setmem" when
+   we would use more than 3 scalar instructions.
+   Otherwise follow a sensible default: when optimizing for size, give a better
+   estimate of the length of a memset call, but use the default otherwise.  */
 #define CLEAR_RATIO(speed) \
-  ((speed) ? 15 : AARCH64_CALL_RATIO)
+  (!STRICT_ALIGNMENT ? 4 : (speed) ? 15 : AARCH64_CALL_RATIO)
 
 /* SET_RATIO is similar to CLEAR_RATIO, but for a non-zero constant, so when
    optimizing for size adjust the ratio to account for the overhead of loading
    the constant.  */
 #define SET_RATIO(speed) \
-  ((speed) ? 15 : AARCH64_CALL_RATIO - 2)
+  (!STRICT_ALIGNMENT ? 0 : (speed) ? 15 : AARCH64_CALL_RATIO - 2)
 
 /* Disable auto-increment in move_by_pieces et al.  Use of auto-increment is
    rarely a good idea in straight-line code since it adds an extra address
diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index a8cc545c37044345c3f1d3bf09151c8a9578a032..16ac0c076adcc82627af43473a938e78d3a7ecdc 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -7058,6 +7058,9 @@ aarch64_gen_store_pair (machine_mode mode, rtx mem1, rtx reg1, rtx mem2,
 case E_V4SImode:
   return gen_vec_store_pairv4siv4si (mem1, reg1, mem2, reg2);
 
+case E_V16QImode:
+  return gen_vec_store_pairv16qiv16qi (mem1, reg1, mem2, reg2);
+
 default:
   gcc_unreachable ();
 }
@@ -21373,6 +21376,134 @@ aarch64_expand_cpymem (rtx *operands)
   return true;
 }
 
+/* Like aarch64_copy_one_block_and_progress_pointers, except for memset where
+   *src is a register we have created with the duplicated value to be set.  */
+static void
+aarch64_set_one_block_and_progress_pointer (rtx *src, rtx *dst,
+					    machine_mode mode)
+{
+  /* If we are copying 128bits or 256bits, we can do that s

RE: [PATCH] aarch64: Add backend support for expanding __builtin_memset

2020-10-30 Thread Sudakshina Das via Gcc-patches
Hi Richard

Thank you for the review. Please find my comments inlined.

> -Original Message-
> From: Richard Sandiford 
> Sent: 30 October 2020 15:03
> To: Sudakshina Das 
> Cc: gcc-patches@gcc.gnu.org; Kyrylo Tkachov ;
> Richard Earnshaw 
> Subject: Re: [PATCH] aarch64: Add backend support for expanding
> __builtin_memset
> 
> Sudakshina Das  writes:
> > diff --git a/gcc/config/aarch64/aarch64.h b/gcc/config/aarch64/aarch64.h
> > index 00b5f8438863bb52c348cfafd5d4db478fe248a7..bcb654809c9662db0f51fc1368e37e42969efd29 100644
> > --- a/gcc/config/aarch64/aarch64.h
> > +++ b/gcc/config/aarch64/aarch64.h
> > @@ -1024,16 +1024,18 @@ typedef struct
> >  #define MOVE_RATIO(speed) \
> >    (!STRICT_ALIGNMENT ? 2 : (((speed) ? 15 : AARCH64_CALL_RATIO) / 2))
> >
> > -/* For CLEAR_RATIO, when optimizing for size, give a better estimate
> > -   of the length of a memset call, but use the default otherwise.  */
> > +/* Like MOVE_RATIO, without -mstrict-align, make decisions in "setmem" when
> > +   we would use more than 3 scalar instructions.
> > +   Otherwise follow a sensible default: when optimizing for size, give a better
> > +   estimate of the length of a memset call, but use the default otherwise.  */
> >  #define CLEAR_RATIO(speed) \
> > -  ((speed) ? 15 : AARCH64_CALL_RATIO)
> > +  (!STRICT_ALIGNMENT ? 4 : (speed) ? 15 : AARCH64_CALL_RATIO)
> >
> >  /* SET_RATIO is similar to CLEAR_RATIO, but for a non-zero constant, so when
> >     optimizing for size adjust the ratio to account for the overhead of loading
> >     the constant.  */
> >  #define SET_RATIO(speed) \
> > -  ((speed) ? 15 : AARCH64_CALL_RATIO - 2)
> > +  (!STRICT_ALIGNMENT ? 0 : (speed) ? 15 : AARCH64_CALL_RATIO - 2)
> 
> Think it would help to adjust the SET_RATIO comment too, otherwise it's not
> obvious why its !STRICT_ALIGNMNENT value is 0.
> 

Will do.

> >
> >  /* Disable auto-increment in move_by_pieces et al.  Use of auto-increment is
> >     rarely a good idea in straight-line code since it adds an extra address
> > diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
> > index a8cc545c37044345c3f1d3bf09151c8a9578a032..16ac0c076adcc82627af43473a938e78d3a7ecdc 100644
> > --- a/gcc/config/aarch64/aarch64.c
> > +++ b/gcc/config/aarch64/aarch64.c
> > @@ -7058,6 +7058,9 @@ aarch64_gen_store_pair (machine_mode mode, rtx mem1, rtx reg1, rtx mem2,
> >  case E_V4SImode:
> >return gen_vec_store_pairv4siv4si (mem1, reg1, mem2, reg2);
> >
> > +case E_V16QImode:
> > +  return gen_vec_store_pairv16qiv16qi (mem1, reg1, mem2, reg2);
> > +
> >  default:
> >gcc_unreachable ();
> >  }
> > @@ -21373,6 +21376,134 @@ aarch64_expand_cpymem (rtx *operands)
> >return true;
> >  }
> >
> > +/* Like aarch64_copy_one_block_and_progress_pointers, except for memset where
> > +   *src is a register we have created with the duplicated value to be set.  */
> 
> AIUI, *SRC doesn't accumulate across calls in the way that it does for
> aarch64_copy_one_block_and_progress_pointers, so it might be better to
> pass an rtx rather than an “rtx *”.
> 

Will do.

> > +static void
> > +aarch64_set_one_block_and_progress_pointer (rtx *src, rtx *dst,
> > +					    machine_mode mode)
> > +{
> > +  /* If we are copying 128bits or 256bits, we can do that straight from
> > +  the SIMD register we prepared.  */
> 
> Nit: excess space before “the”.
>
 
Will do.

> > +  if (known_eq (GET_MODE_BITSIZE (mode), 256))
> > +    {
> > +      mode =  GET_MODE (*src);
> 
> Excess space before “GET_MODE”.
>
  
Will do.

> > +  /* "Cast" the *dst to the correct mode.  */
> > +  *dst = adjust_address (*dst, mode, 0);
> > +  /* Emit the memset.  */
> > +  emit_insn (aarch64_gen_store_pair (mode, *dst, *src,
> > +aarch64_progress_pointer (*dst),
> *src));
> > +
> > +  /* Move the pointers forward.  */
> > +  *dst = aarch64_move_pointer (*dst, 32);
> > +  return;
> > +}
> > +  else if (known_eq (GET_MODE_BITSIZE (mode), 128))
> 
> Nit: more usual in GCC not to have an “else” after an early return.
>

Will do.
 
> > +    {
> > +      /* "Cast" the *dst to the correct mode.  */
> > +      *dst = adjust_address (*dst, GET_MODE (*src), 0);
> > +      /* Emit the memset.  */
> > +      emit_move_insn (*dst, *src);
> > +      /* Move the pointers forward.  */
> > +      *dst = aarch64_move_pointer (*dst, 16);
> > +      return;
> > +    }
> > +  /* For copying less, we have to extract the right amount from *src.  */
> > +  machine_mode vq_mode = aarch64_vq_mode (GET_MODE_INNER(mode)).require ();
> 
> Nit: should be a space before “(mode)”.
> 

Will do.

> > +  *src = convert_to_mode (vq_mode, *src, 0);
> > +  rtx reg = simplify_gen_subreg (mode, *src, vq_mode, 0);
> 
> I was surprised that this needed a two-step conversion.  Does a direct subreg
> of the original V16QI src blow up for s