Re: [PATCH] AArch64 SIMD: convert mvn+shrn into mvni+subhn

2025-06-13 Thread Remi Machet


On 6/13/25 11:22, Alex Coplan wrote:
> External email: Use caution opening links or attachments
> 
> 
> Hi Remi,
> 
> On 12/06/2025 17:02, Richard Sandiford wrote:
>> Remi Machet  writes:
>>> Add an optimization to aarch64 SIMD converting mvn+shrn into mvni+subhn
>>> which
>>> allows for better optimization when the code is inside a loop by using a
>>> constant.
> 
> It can be helpful for reviewers to show the codegen you get before/after
> your patch in the cover letter.
> 
> In this case, it might also be helpful to include a brief argument for
> why the transformation is correct, e.g. noting that, for a uint16_t x:
>   -x = ~x + 1
>   ~x = -1 - x
> so:
>   (u8)(~x >> imm)
> is equivalent to:
>   (u8)(((u16)-1 - x) >> imm).
> 
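(A quick brute-force check of that identity, separate from the thread itself -- a
standalone C program that verifies (u8)(~x >> 8) == (u8)((0xffff - x) >> 8) for
every uint16_t value:)

#include <assert.h>
#include <stdint.h>

int main (void)
{
  for (uint32_t x = 0; x <= UINT16_MAX; x++)
    {
      uint16_t v = (uint16_t) x;
      /* mvn + shrn view: complement, then take the high byte.  */
      uint8_t via_not = (uint8_t) ((uint16_t) ~v >> 8);
      /* mvni + subhn view: subtract from all-ones, then take the high byte.  */
      uint8_t via_sub = (uint8_t) ((uint16_t) (0xffff - v) >> 8);
      assert (via_not == via_sub);
    }
  return 0;
}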
>>>
>>> Bootstrapped and regtested on aarch64-linux-gnu.
>>>
>>> Signed-off-by: Remi Machet 
>>>
>>> gcc/ChangeLog:
>>>
>>>   * config/aarch64/aarch64-simd.md (*shrn_to_subhn_<mode>): Add pattern
>>>   converting mvn+shrn into mvni+subhn.
>>>
>>> gcc/testsuite/ChangeLog:
>>>
>>>   * gcc.target/aarch64/simd/shrn2subhn.c: New test.
>>>
>>> diff --git a/gcc/config/aarch64/aarch64-simd.md
>>> b/gcc/config/aarch64/aarch64-simd.md
> 
> As Kyrill noted, the formatting of the patch seems to be quite broken.
> This diff --git line shouldn't be wrapped, for example.  There are also
> quite a few unicode NBSP (non-breaking space) characters in the patch.
> These things make it difficult to apply the patch (it needs manual
> edits).  Attaching the patch file next time should help.
> 
>>> index 6e30dc48934..f49e6fe6a26 100644
>>> --- a/gcc/config/aarch64/aarch64-simd.md
>>> +++ b/gcc/config/aarch64/aarch64-simd.md
>>> @@ -5028,6 +5028,34 @@
>>>  DONE;
>>>})
>>>
>>> +;; convert what would be a mvn+shrn into a mvni+subhn because the use of a
>>> +;; constant load rather than not instructions allows for better loop
>>> +;; optimization. On some implementations the use of subhn also result in better
>>> +;; throughput.
>>> +(define_insn_and_split "*shrn_to_subhn_<mode>"
>>> +  [(set (match_operand:<VNARROWQ> 0 "register_operand" "=&w")
>>> +    (truncate:<VNARROWQ>
>>> +      (lshiftrt:VQN
>>> +        (not:VQN (match_operand:VQN 1 "register_operand" "w"))
>>> +        (match_operand:VQN 2 "aarch64_simd_shift_imm_vec_exact_top"))))]
>>
>> Very minor, sorry, but it would be good to format this so that the
>> (truncate ...) lines up with the (match_operand...):
>>
>>  [(set (match_operand:<VNARROWQ> 0 "register_operand" "=&w")
>>        (truncate:<VNARROWQ>
>>          (lshiftrt:VQN
>>            (not:VQN (match_operand:VQN 1 "register_operand" "w"))
>>            (match_operand:VQN 2 "aarch64_simd_shift_imm_vec_exact_top"))))]
>>
>>> +  "TARGET_SIMD"
>>> +  "#"
>>> +  "&& true"
>>> +  [(const_int 0)]
>>> +{
>>> +  rtx tmp;
>>> +  if (can_create_pseudo_p ())
>>> +    tmp = gen_reg_rtx (<MODE>mode);
>>> +  else
>>> +    tmp = gen_rtx_REG (<MODE>mode, REGNO (operands[0]));
>>> +  emit_insn (gen_move_insn (tmp,
>>> +      aarch64_simd_gen_const_vector_dup (<MODE>mode, -1)));
>>
>> This can be simplified to:
>>
>>   emit_insn (gen_move_insn (tmp, CONSTM1_RTX (<MODE>mode)));
> 
> Is there a reason to prefer emit_insn (gen_move_insn (x,y)) over just
> emit_move_insn (x,y)?  I tried the latter locally and it seemed to work.
> 
>>
>>> +  emit_insn (gen_aarch64_subhn<mode>_insn (operands[0], tmp,
>>> +operands[1], operands[2]));
>>
>> Sorry for the formatting nit, but: "operands[1]" should line up with
>> "operands[0]".
>>
>>> +  DONE;
>>> +})
>>> +
>>> +
>>>;; pmul.
>>>
>>>(define_insn "aarch64_pmul"
>>> diff --git a/gcc/testsuite/gcc.target/aarch64/simd/shrn2subhn.c
>>> b/gcc/testsuite/gcc.target/aarch64/simd/shrn2subhn.c
>>> new file mode 100644
>>> index 000..d03af815671
>>> --- /dev/null
>>> +++ b/gcc/testsuite/gcc.target/aarch64/simd/shrn2subhn.c
>>> @@ -0,0 +1,18 @@
>>> +/* This test case checks that replacing a not+shift by a sub -1 works. */
>>> +/* { dg-do compile } */
>>> +/* { d

[PATCH v2] AArch64 SIMD: convert mvn+shrn into mvni+subhn

2025-06-13 Thread Remi Machet
Add an optimization to aarch64 SIMD converting mvn+shrn into mvni+subhn 
which
allows for better optimization when the code is inside a loop by using a
constant.

Bootstrapped and regtested on aarch64-linux-gnu.

Signed-off-by: Remi Machet 

gcc/ChangeLog:

* config/aarch64/aarch64-simd.md (*shrn_to_subhn_<mode>): Add pattern
 converting mvn+shrn into mvni+subhn.

gcc/testsuite/ChangeLog:

* gcc.target/aarch64/simd/shrn2subhn.c: New test.

diff --git a/gcc/config/aarch64/aarch64-simd.md 
b/gcc/config/aarch64/aarch64-simd.md
index 6e30dc48934..7ce5b19a638 100644
--- a/gcc/config/aarch64/aarch64-simd.md
+++ b/gcc/config/aarch64/aarch64-simd.md
@@ -5028,6 +5028,33 @@
DONE;
  })

+;; convert what would be a mvn+shrn into a mvni+subhn because the use of a
+;; constant load rather than not instructions allows for better loop
+;; optimization.
+;; On some implementations the use of subhn also result in better throughput.
+(define_insn_and_split "*shrn_to_subhn_<mode>"
+  [(set (match_operand:<VNARROWQ> 0 "register_operand" "=&w")
+        (truncate:<VNARROWQ>
+          (lshiftrt:VQN
+            (not:VQN (match_operand:VQN 1 "register_operand" "w"))
+            (match_operand:VQN 2 "aarch64_simd_shift_imm_vec_exact_top"))))]
+  "TARGET_SIMD"
+  "#"
+  "&& true"
+  [(const_int 0)]
+{
+  rtx tmp;
+  if (can_create_pseudo_p ())
+    tmp = gen_reg_rtx (<MODE>mode);
+  else
+    tmp = gen_rtx_REG (<MODE>mode, REGNO (operands[0]));
+  emit_insn (gen_move_insn (tmp, CONSTM1_RTX (<MODE>mode)));
+  emit_insn (gen_aarch64_subhn<mode>_insn (operands[0], tmp,
+                                           operands[1], operands[2]));
+  DONE;
+})
+
+
  ;; pmul.

  (define_insn "aarch64_pmul"
diff --git a/gcc/testsuite/gcc.target/aarch64/simd/shrn2subhn.c 
b/gcc/testsuite/gcc.target/aarch64/simd/shrn2subhn.c
new file mode 100644
index 000..06e94b48108
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/simd/shrn2subhn.c
@@ -0,0 +1,38 @@
+/* This test case checks that replacing a not+shift by a sub -1 works. */
+/* { dg-do compile } */
+/* { dg-additional-options "-O1" } */
+/* { dg-final { scan-assembler-times "\\tsubhn\\t" 6 } } */
+
+#include
+#include
+#include
+
+uint8x8_t neg_narrow_v8hi(uint16x8_t a) {
+  uint16x8_t b = vmvnq_u16(a);
+  return vshrn_n_u16(b, 8);
+}
+
+uint8x8_t neg_narrow_vsubhn_v8hi(uint16x8_t a) {
+  uint16x8_t ones = vdupq_n_u16(0xffff);
+  return vsubhn_u16(ones, a);
+}
+
+uint16x4_t neg_narrow_v4si(uint32x4_t a) {
+  uint32x4_t b = vmvnq_u32(a);
+  return vshrn_n_u32(b, 16);
+}
+
+uint16x4_t neg_narrow_vsubhn_v4si(uint32x4_t a) {
+  uint32x4_t ones = vdupq_n_u32(0xffffffff);
+  return vsubhn_u32(ones, a);
+}
+
+uint32x2_t neg_narrow_v2di(uint64x2_t a) {
+  uint64x2_t b = ~a;
+  return vshrn_n_u64(b, 32);
+}
+
+uint32x2_t neg_narrow_vsubhn_v2di(uint64x2_t a) {
+  uint64x2_t ones = vdupq_n_u64(0xffffffffffffffff);
+  return vsubhn_u64(ones, a);
+}



[PATCH] AArch64 SIMD: convert mvn+shrn into mvni+subhn

2025-06-12 Thread Remi Machet
Add an optimization to aarch64 SIMD converting mvn+shrn into mvni+subhn 
which
allows for better optimization when the code is inside a loop by using a
constant.

Bootstrapped and regtested on aarch64-linux-gnu.

Signed-off-by: Remi Machet 

gcc/ChangeLog:

     * config/aarch64/aarch64-simd.md (*shrn_to_subhn_<mode>): Add pattern
     converting mvn+shrn into mvni+subhn.

gcc/testsuite/ChangeLog:

     * gcc.target/aarch64/simd/shrn2subhn.c: New test.

diff --git a/gcc/config/aarch64/aarch64-simd.md 
b/gcc/config/aarch64/aarch64-simd.md
index 6e30dc48934..f49e6fe6a26 100644
--- a/gcc/config/aarch64/aarch64-simd.md
+++ b/gcc/config/aarch64/aarch64-simd.md
@@ -5028,6 +5028,34 @@
    DONE;
  })

+;; convert what would be a mvn+shrn into a mvni+subhn because the use of a
+;; constant load rather than not instructions allows for better loop
+;; optimization. On some implementations the use of subhn also result in better
+;; throughput.
+(define_insn_and_split "*shrn_to_subhn_<mode>"
+  [(set (match_operand:<VNARROWQ> 0 "register_operand" "=&w")
+    (truncate:<VNARROWQ>
+      (lshiftrt:VQN
+        (not:VQN (match_operand:VQN 1 "register_operand" "w"))
+        (match_operand:VQN 2 "aarch64_simd_shift_imm_vec_exact_top"))))]
+  "TARGET_SIMD"
+  "#"
+  "&& true"
+  [(const_int 0)]
+{
+  rtx tmp;
+  if (can_create_pseudo_p ())
+    tmp = gen_reg_rtx (<MODE>mode);
+  else
+    tmp = gen_rtx_REG (<MODE>mode, REGNO (operands[0]));
+  emit_insn (gen_move_insn (tmp,
+  aarch64_simd_gen_const_vector_dup (<MODE>mode, -1)));
+  emit_insn (gen_aarch64_subhn<mode>_insn (operands[0], tmp,
+    operands[1], operands[2]));
+  DONE;
+})
+
+
  ;; pmul.

  (define_insn "aarch64_pmul"
diff --git a/gcc/testsuite/gcc.target/aarch64/simd/shrn2subhn.c 
b/gcc/testsuite/gcc.target/aarch64/simd/shrn2subhn.c
new file mode 100644
index 000..d03af815671
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/simd/shrn2subhn.c
@@ -0,0 +1,18 @@
+/* This test case checks that replacing a not+shift by a sub -1 works. */
+/* { dg-do compile } */
+/* { dg-additional-options "--save-temps -O1" } */
+/* { dg-final { scan-assembler-times "\\tsubhn\\t" 2 } } */
+
+#include
+#include
+#include
+
+uint8x8_t neg_narrow(uint16x8_t a) {
+  uint16x8_t b = vmvnq_u16(a);
+  return vshrn_n_u16(b, 8);
+}
+
+uint8x8_t neg_narrow_vsubhn(uint16x8_t a) {
+  uint16x8_t ones = vdupq_n_u16(0xffff);
+  return vsubhn_u16(ones, a);
+}



Re: [PATCH] AArch64 SIMD: convert mvn+shrn into mvni+subhn

2025-06-12 Thread Remi Machet

On 6/12/25 12:02, Richard Sandiford wrote:
> External email: Use caution opening links or attachments
>
>
> Remi Machet  writes:
>> Add an optimization to aarch64 SIMD converting mvn+shrn into mvni+subhn
>> which
>> allows for better optimization when the code is inside a loop by using a
>> constant.
>>
>> Bootstrapped and regtested on aarch64-linux-gnu.
>>
>> Signed-off-by: Remi Machet 
>>
>> gcc/ChangeLog:
>>
>>   * config/aarch64/aarch64-simd.md (*shrn_to_subhn_<mode>): Add pattern
>>   converting mvn+shrn into mvni+subhn.
>>
>> gcc/testsuite/ChangeLog:
>>
>>   * gcc.target/aarch64/simd/shrn2subhn.c: New test.
>>
>> diff --git a/gcc/config/aarch64/aarch64-simd.md
>> b/gcc/config/aarch64/aarch64-simd.md
>> index 6e30dc48934..f49e6fe6a26 100644
>> --- a/gcc/config/aarch64/aarch64-simd.md
>> +++ b/gcc/config/aarch64/aarch64-simd.md
>> @@ -5028,6 +5028,34 @@
>>  DONE;
>>})
>>
>> +;; convert what would be a mvn+shrn into a mvni+subhn because the use of a
>> +;; constant load rather than not instructions allows for better loop
>> +;; optimization. On some implementations the use of subhn also result in better
>> +;; throughput.
>> +(define_insn_and_split "*shrn_to_subhn_<mode>"
>> +  [(set (match_operand:<VNARROWQ> 0 "register_operand" "=&w")
>> +    (truncate:<VNARROWQ>
>> +      (lshiftrt:VQN
>> +        (not:VQN (match_operand:VQN 1 "register_operand" "w"))
>> +        (match_operand:VQN 2 "aarch64_simd_shift_imm_vec_exact_top"))))]
> Very minor, sorry, but it would be good to format this so that the
> (truncate ...) lines up with the (match_operand...):
>
>  [(set (match_operand:<VNARROWQ> 0 "register_operand" "=&w")
>        (truncate:<VNARROWQ>
>          (lshiftrt:VQN
>            (not:VQN (match_operand:VQN 1 "register_operand" "w"))
>            (match_operand:VQN 2 "aarch64_simd_shift_imm_vec_exact_top"))))]
>
>> +  "TARGET_SIMD"
>> +  "#"
>> +  "&& true"
>> +  [(const_int 0)]
>> +{
>> +  rtx tmp;
>> +  if (can_create_pseudo_p ())
>> +    tmp = gen_reg_rtx (<MODE>mode);
>> +  else
>> +    tmp = gen_rtx_REG (<MODE>mode, REGNO (operands[0]));
>> +  emit_insn (gen_move_insn (tmp,
>> +      aarch64_simd_gen_const_vector_dup (<MODE>mode, -1)));
> This can be simplified to:
>
>   emit_insn (gen_move_insn (tmp, CONSTM1_RTX (<MODE>mode)));
>
>> +  emit_insn (gen_aarch64_subhn<mode>_insn (operands[0], tmp,
>> +operands[1], operands[2]));
> Sorry for the formatting nit, but: "operands[1]" should line up with
> "operands[0]".
>
>> +  DONE;
>> +})
>> +
>> +
>>;; pmul.
>>
>>(define_insn "aarch64_pmul"
>> diff --git a/gcc/testsuite/gcc.target/aarch64/simd/shrn2subhn.c
>> b/gcc/testsuite/gcc.target/aarch64/simd/shrn2subhn.c
>> new file mode 100644
>> index 000..d03af815671
>> --- /dev/null
>> +++ b/gcc/testsuite/gcc.target/aarch64/simd/shrn2subhn.c
>> @@ -0,0 +1,18 @@
>> +/* This test case checks that replacing a not+shift by a sub -1 works. */
>> +/* { dg-do compile } */
>> +/* { dg-additional-options "--save-temps -O1" } */
> The --save-temps isn't needed, since dg-do compile compiles to assembly
> anyway.
>
> Looks really good otherwise, thanks!  So OK with those changes from my POV,
> but please leave a day or so for others to comment.
>
> Richard
>
>> +/* { dg-final { scan-assembler-times "\\tsubhn\\t" 2 } } */
>> +
>> +#include
>> +#include
>> +#include
>> +
>> +uint8x8_t neg_narrow(uint16x8_t a) {
>> +  uint16x8_t b = vmvnq_u16(a);
>> +  return vshrn_n_u16(b, 8);
>> +}
>> +
>> +uint8x8_t neg_narrow_vsubhn(uint16x8_t a) {
>> +  uint16x8_t ones = vdupq_n_u16(0xffff);
>> +  return vsubhn_u16(ones, a);
>> +}


Hi Richard,

Thank you for reviewing the patch and for the feedback! I will send a 
follow up patch with all the suggestions, note that the truncate vs. 
match_operand text alignment seems to be an issue with my email program 
rather than the patch itself: in the patch they align if you treat the 
tabs as 8 characters, while I think in the email they got converted to 4 
spaces.

Remi



[PATCH v3] AArch64 SIMD: convert mvn+shrn into mvni+subhn

2025-06-30 Thread Remi Machet
Hi,

Attached is the updated patch for the aarch64 conversion of some
mvn+shrn patterns into mvni+subhn. Hopefully the attachment fixes the tab
issues; the cover letter was updated to better explain what the patch
does, the code was changed to use emit_move_insn, and the testcase was
cleaned up per Richard's and Alex's suggestions.

Sorry for the delay in fixing all suggestions but I was traveling for 
the past 2 weeks.

Remi
From 102b7358be9d030f9f518c2accd329d14fe545a3 Mon Sep 17 00:00:00 2001
From: Remi Machet 
Date: Fri, 13 Jun 2025 18:44:53 +
Subject: [PATCH v3] AArch64 SIMD: convert mvn+shrn into mvni+subhn

Add an optimization to aarch64 SIMD converting mvn+shrn into mvni+subhn when
possible, which allows for better optimization when the code is inside a loop
by using a constant.

The conversion is based on the fact that for an unsigned integer:
  -x = ~x + 1 => ~x = -1 - x
thus '(u8)(~x >> imm)' is equivalent to '(u8)(((u16)-1 - x) >> imm)'.
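A worked instance of this identity (an illustrative example, not part of the
original cover letter), taking imm = 8 and x = 0x1234:
  ~x >> 8            = 0xedcb >> 8              = 0xed
  ((u16)-1 - x) >> 8 = (0xffff - 0x1234) >> 8   = 0xedcb >> 8 = 0xed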

For the following function:
uint8x8_t neg_narrow_v8hi(uint16x8_t a) {
  uint16x8_t b = vmvnq_u16(a);
  return vshrn_n_u16(b, 8);
}

Without this patch the assembly looks like:
not     v0.16b, v0.16b
shrn    v0.8b, v0.8h, 8

After the patch it becomes:
mvni    v31.4s, 0
subhn   v0.8b, v31.8h, v0.8h

Bootstrapped and regtested on aarch64-linux-gnu.

Signed-off-by: Remi Machet 

gcc/ChangeLog:

* config/aarch64/aarch64-simd.md (*shrn_to_subhn_<mode>): Add pattern
converting mvn+shrn into mvni+subhn.

gcc/testsuite/ChangeLog:

* gcc.target/aarch64/simd/shrn2subhn.c: New test.
---
 gcc/config/aarch64/aarch64-simd.md            | 30 ++++++++++++++++++++
 .../gcc.target/aarch64/simd/shrn2subhn.c      | 36 ++++++++++++++++++++++++
 2 files changed, 66 insertions(+)
 create mode 100644 gcc/testsuite/gcc.target/aarch64/simd/shrn2subhn.c

diff --git a/gcc/config/aarch64/aarch64-simd.md 
b/gcc/config/aarch64/aarch64-simd.md
index 6e30dc48934..29796f7cf1b 100644
--- a/gcc/config/aarch64/aarch64-simd.md
+++ b/gcc/config/aarch64/aarch64-simd.md
@@ -5028,6 +5028,36 @@
   DONE;
 })
 
+;; convert (truncate)(~x >> imm) into (truncate)(((u16)-1 - x) >> imm)
+;; because it will result in the 'not' being replaced with a constant load
+;; which allows for better loop optimization.
+;; We limit this to truncations that take the upper half and shift it to the
+;; lower half as we use subhn (patterns that would have generated an shrn
+;; otherwise).
+;; On some implementations the use of subhn also results in better throughput.
+(define_insn_and_split "*shrn_to_subhn_<mode>"
+  [(set (match_operand:<VNARROWQ> 0 "register_operand" "=&w")
+        (truncate:<VNARROWQ>
+          (lshiftrt:VQN
+            (not:VQN (match_operand:VQN 1 "register_operand" "w"))
+            (match_operand:VQN 2 "aarch64_simd_shift_imm_vec_exact_top"))))]
+  "TARGET_SIMD"
+  "#"
+  "&& true"
+  [(const_int 0)]
+{
+  rtx tmp;
+  if (can_create_pseudo_p ())
+    tmp = gen_reg_rtx (<MODE>mode);
+  else
+    tmp = gen_rtx_REG (<MODE>mode, REGNO (operands[0]));
+  emit_move_insn (tmp, CONSTM1_RTX (<MODE>mode));
+  emit_insn (gen_aarch64_subhn<mode>_insn (operands[0], tmp,
+                                           operands[1], operands[2]));
+  DONE;
+})
+
+
 ;; pmul.
 
 (define_insn "aarch64_pmul"
diff --git a/gcc/testsuite/gcc.target/aarch64/simd/shrn2subhn.c 
b/gcc/testsuite/gcc.target/aarch64/simd/shrn2subhn.c
new file mode 100644
index 000..f90ea134f09
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/simd/shrn2subhn.c
@@ -0,0 +1,36 @@
+/* This test case checks that replacing a not+shift by a sub -1 works. */
+/* { dg-do compile } */
+/* { dg-additional-options "-O1" } */
+/* { dg-final { scan-assembler-times "\\tsubhn\\t" 6 } } */
+
+#include <arm_neon.h>
+
+uint8x8_t neg_narrow_v8hi(uint16x8_t a) {
+  uint16x8_t b = vmvnq_u16(a);
+  return vshrn_n_u16(b, 8);
+}
+
+uint8x8_t neg_narrow_vsubhn_v8hi(uint16x8_t a) {
+  uint16x8_t ones = vdupq_n_u16(0xffff);
+  return vsubhn_u16(ones, a);
+}
+
+uint16x4_t neg_narrow_v4si(uint32x4_t a) {
+  uint32x4_t b = vmvnq_u32(a);
+  return vshrn_n_u32(b, 16);
+}
+
+uint16x4_t neg_narrow_vsubhn_v4si(uint32x4_t a) {
+  uint32x4_t ones = vdupq_n_u32(0xffffffff);
+  return vsubhn_u32(ones, a);
+}
+
+uint32x2_t neg_narrow_v2di(uint64x2_t a) {
+  uint64x2_t b = ~a;
+  return vshrn_n_u64(b, 32);
+}
+
+uint32x2_t neg_narrow_vsubhn_v2di(uint64x2_t a) {
+  uint64x2_t ones = vdupq_n_u64(0xffffffffffffffff);
+  return vsubhn_u64(ones, a);
+}
-- 
2.34.1



Re: [PATCH 4/7] aarch64: Use EOR3 for DImode values

2025-07-07 Thread Remi Machet
On 7/7/25 06:18, Kyrylo Tkachov wrote:
> External email: Use caution opening links or attachments
>
>
> Hi all,
>
> Similar to BCAX, we can use EOR3 for DImode, but we have to be careful
> not to force GP<->SIMD moves unnecessarily, so add a splitter for that case.
>
> So for input:
> uint64_t eor3_d_gp (uint64_t a, uint64_t b, uint64_t c) { return EOR3 (a, b, 
> c); }
> uint64x1_t eor3_d (uint64x1_t a, uint64x1_t b, uint64x1_t c) { return EOR3 
> (a, b, c); }
>
> We generate the desired:
> eor3_d_gp:
> eor x1, x1, x2
> eor x0, x1, x0
> ret
>
> eor3_d:
> eor3 v0.16b, v0.16b, v1.16b, v2.16b
> ret
>
> Bootstrapped and tested on aarch64-none-linux-gnu.
> Ok for trunk?
> Thanks,
> Kyrill
>
> Signed-off-by: Kyrylo Tkachov 
>
> gcc/
>
>  * config/aarch64/aarch64-simd.md (*eor3qdi4): New
>  define_insn_and_split.
>
> gcc/testsuite/
>
>  * gcc.target/aarch64/simd/eor3_d.c: Add tests for DImode operands.

Hi Kyrill,

I assume compact syntax is a no-go because of the different modifiers on 
operand 0 ('=' and '&')?

Also, shouldn't the second variant use '=&r' for operand 0 instead of '&r'?

Remi




Re: [PATCH 5/7] aarch64: Use SVE2 NBSL for DImode arguments

2025-07-07 Thread Remi Machet

On 7/7/25 06:19, Kyrylo Tkachov wrote:

External email: Use caution opening links or attachments


Hi all,

Similar to the BCAX and EOR3 patterns from TARGET_SHA3 we can use the
SVE2 NBSL instruction for DImode arugments when they come in SIMD registers.

Minor nit: there is a typo in "arugments"


Again, this is accomplished with a new splitter for the GP case. I noticed
that the split has a side-effect of producing a GP EON instruction where it
wasn't getting generated before because the BSL insn-and-split got in the way.
So for the inputs:

uint64_t nbsl_gp(uint64_t a, uint64_t b, uint64_t c) { return NBSL (a, b, c); }
uint64x1_t nbsl_d (uint64x1_t a, uint64x1_t b, uint64x1_t c) { return NBSL (a, 
b, c); }

We now generate:
nbsl_gp:
eor x0, x0, x1
and x0, x0, x2
eon x0, x0, x1
ret

nbsl_d:
nbsl z0.d, z0.d, z1.d, z2.d
ret

instead of:
nbsl_gp:
eor x0, x1, x0
and x0, x0, x2
eor x0, x0, x1
mvn x0, x0
ret

nbsl_d:
bif v0.8b, v1.8b, v2.8b
mvn v0.8b, v0.8b
ret
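
(For reference, the NBSL macro used in these tests is presumably the inverted
bitwise select; a scalar C sketch consistent with the eor/and/eon sequence
above, under that assumption:)

#include <stdint.h>

/* Assumed reference semantics: each result bit is ~(c ? a : b),
   i.e. ~((a & c) | (b & ~c)), matching the scalar sequence above.  */
static inline uint64_t nbsl_ref (uint64_t a, uint64_t b, uint64_t c)
{
  return ~((a & c) | (b & ~c));
}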

Bootstrapped and tested on aarch64-none-linux-gnu.
Ok for trunk?

Looks good to me aside from the nit above.

Remi


Thanks,
Kyrill

Signed-off-by: Kyrylo Tkachov 

gcc/

* config/aarch64/aarch64-sve.md (*aarch64_sve2_nbsl_unpreddi): New
define_insn_and_split.

gcc/testsuite/

* gcc.target/aarch64/sve2/nbsl_d.c: New test.



Re: [PATCH 6/7] aarch64: Use SVE2 BSL1N for DImode arguments

2025-07-07 Thread Remi Machet

On 7/7/25 06:19, Kyrylo Tkachov wrote:
> External email: Use caution opening links or attachments
>
>
> Hi all,
>
> Similar to other patches in this series, this patch adds a splitter
> for DImode BSL1N operations, taking care to generate the right code
> in the GP regs case.
>
> Thus for the testcase we generate:
> bsl1n_gp:
> eon x0, x0, x1
> and x0, x0, x2
> eor x0, x0, x1
> ret
>
> bsl1n_d:
> bsl1n z0.d, z0.d, z1.d, z2.d
> ret
>
> instead of the previous:
> bsl1n_gp: // The same, avoid moves to FP regs.
> eon x0, x0, x1
> and x0, x0, x2
> eor x0, x0, x1
> ret
>
> bsl1n_d:
> fmov x0, d0
> fmov x1, d1
> eon x0, x1, x0
> fmov d31, x0
> and v2.8b, v31.8b, v2.8b
> eor v0.8b, v2.8b, v1.8b
> ret
>
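(For reference, assuming BSL1N here selects ~a where the third operand c is
set, the scalar sequence above can be read as:
  ((~(a ^ b)) & c) ^ b  =  (~a & c) | (b & ~c)
i.e. a bitwise select with the first input inverted, which is what the bsl1n
instruction computes on the SIMD side.)
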
> Bootstrapped and tested on aarch64-none-linux-gnu.
> Ok for trunk?

Looks good to me,

Remi

> Thanks,
> Kyrill
>
> Signed-off-by: Kyrylo Tkachov 
>
> gcc/
>
>  * config/aarch64/aarch64-sve2.md (*aarch64_sve2_bsl1n_unpreddi): New
>  define_insn_and_split.
>
> gcc/testsuite/
>
>  * gcc.target/aarch64/sve2/bsl1n_d.c: New test.
>


Re: [PATCH] aarch64: Extend HVLA permutations to big-endian

2025-07-09 Thread Remi Machet

On 7/9/25 11:00, Richard Sandiford wrote:

External email: Use caution opening links or attachments


Richard Sandiford  
writes:


TARGET_VECTORIZE_VEC_PERM_CONST has code to match the SVE2.1
"hybrid VLA" DUPQ, EXTQ, UZPQ{1,2}, and ZIPQ{1,2} instructions.
This matching was conditional on !BYTES_BIG_ENDIAN.

The ACLE code also lowered the associated SVE2.1 intrinsics into
suitable VEC_PERM_EXPRs.  This lowering was not conditional on
!BYTES_BIG_ENDIAN.

The mismatch led to lots of ICEs in the ACLE tests on big-endian
targets: we lowered to VEC_PERM_EXPRs that are not supported.

I think the !BYTES_BIG_ENDIAN restriction was unnecessary.
SVE maps the first memory element to the least significant end of
the register for both endiannesses, so no endian correction or lane
number adjustment is necessary.

This is in some ways a bit counterintuitive.  ZIPQ1 is conceptually
"apply Advanced SIMD ZIP1 to each 128-bit block" and endianness does
matter when choosing between Advanced SIMD ZIP1 and ZIP2.  For example,
the V4SI permute selector { 0, 4, 1, 5 } corresponds to ZIP1 for little-
endian and ZIP2 for big-endian.  But the difference between the hybrid
VLA and Advanced SIMD permute selectors is a consequence of the
difference between the SVE and Advanced SIMD element orders.
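
(Concretely: for V4SI inputs a = {a0, a1, a2, a3} and b = {b0, b1, b2, b3},
VEC_PERM_EXPR with selector {0, 4, 1, 5} yields {a0, b0, a1, b1}; whether that
element pattern is what Advanced SIMD ZIP1 or ZIP2 produces depends on which
architectural lane GCC's element 0 maps to, hence the endianness dependence
described above.)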

The same thing applies to ACLE intrinsics.  The current lowering of
svzipq1 etc. is correct for both endiannesses.  If ACLE code does:

  2x svld1_s32 + svzipq1_s32 + svst1_s32

then the byte-for-byte result is the same for both endiannesses.
On big-endian targets, this is different from using the Advanced SIMD
sequence below for each 128-bit block:

  2x LDR + ZIP1 + STR

In contrast, the byte-for-byte result of:

  2x svld1q_gather_s32 + svzipq1_s32 + svst1q_scatter_s32

depends on endianness, since the quadword gathers and scatters use
Advanced SIMD byte ordering for each 128-bit block.  This gather/scatter
sequence behaves in the same way as the Advanced SIMD LDR+ZIP1+STR
sequence for both endiannesses.

Programmers writing ACLE code have to be aware of this difference
if they want to support both endiannesses.

The patch includes some new execution tests to verify the expansion
of the VEC_PERM_EXPRs.

Tested on aarch64-linux-gnu and aarch64_be-elf.  OK to install?

Richard


gcc/
  * doc/sourcebuild.texi (aarch64_sve2_hw, aarch64_sve2p1_hw): Document.
  * config/aarch64/aarch64.cc (aarch64_evpc_hvla): Extend to
  BYTES_BIG_ENDIAN.

gcc/testsuite/
  * lib/target-supports.exp (check_effective_target_aarch64_sve2p1_hw):
  New proc.
  * gcc.target/aarch64/sve2/dupq_1.c: Extend to big-endian.  Add
  noipa attributes.
  * gcc.target/aarch64/sve2/extq_1.c: Likewise.
  * gcc.target/aarch64/sve2/uzpq_1.c: Likewise.
  * gcc.target/aarch64/sve2/zipq_1.c: Likewise.



Just noticed that I failed to add noipa to the other files -- will fix.



  * gcc.target/aarch64/sve2/dupq_1_run.c: New test.
  * gcc.target/aarch64/sve2/extq_1_run.c: Likewise.
  * gcc.target/aarch64/sve2/uzpq_1_run.c: Likewise.
  * gcc.target/aarch64/sve2/zipq_1_run.c: Likewise.


Hi Richard,

Looks good to me (but I cannot approve as I am neither a reviewer nor an 
approver).

Remi


Re: [PATCH 7/7] aarch64: Use BSL2N for DImode operands

2025-07-09 Thread Remi Machet

On 7/7/25 06:20, Kyrylo Tkachov wrote:
> External email: Use caution opening links or attachments
>
>
> Hi all,
>
> The intent of the patch is similar to previous in the series.
> Make more use of BSL2N when we have DImode operands in SIMD regs,
> but still use the GP instructions when that's where the operands are.
> Compared to the previous patches there are a couple of complications:
> * The operands are a bit more complex and get rejected by RTX costs during
> combine. This is fixed by adding some costing logic to aarch64_rtx_costs.
>
> * The GP split sequence requires two temporaries instead of just one.
> I've marked operand 1 to be an input/output earlyclobber operand to give
> the second temporary together with the earlyclobber operand 0. This means
> that operand is marked with "+" even for the "w" alternatives as the modifier
> is global, but I don't see another way out here. Suggestions welcome.
>
> With these fixed for the testcase we generate:
> bsl2n_gp: // unchanged scalar output
> orr x1, x2, x1
> and x0, x0, x2
> orn x0, x0, x1
> ret
>
> bsl2n_d:
> bsl2n z0.d, z0.d, z1.d, z2.d
> ret
>
> compared to the previous:
> bsl2n_gp:
> orr x1, x2, x1
> and x0, x0, x2
> orn x0, x0, x1
> ret
>
> bsl2n_d:
> orr v1.8b, v2.8b, v1.8b
> and v0.8b, v2.8b, v0.8b
> orn v0.8b, v0.8b, v1.8b
> ret
>
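(For reference, reading off the scalar sequence above, BSL2N here presumably
computes:
  (a & c) | ~(b | c)  =  (a & c) | (~b & ~c)
i.e. it selects a where c is set and ~b elsewhere, with the select operand c
appearing on both sides.)
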
> Bootstrapped and tested on aarch64-none-linux-gnu.
> Ok for trunk?
> Thanks,
> Kyrill
Hi Kyrill

I think in the GPR variant some overlap in operands is possible, like "[ 
&r   , &r    , r1    , r01 ; *  ] #".

In aarch64_bsl2n_rtx_form_p() shouldn't there be a check for one 
parameter being the same on both sides (the select)?

Otherwise looks good to me (but I cannot approve as I am neither 
reviewer or approver).

Remi
>
> Signed-off-by: Kyrylo Tkachov 
>
> gcc/
>
>  * config/aarch64/aarch64-sve2.md (*aarch64_sve2_bsl2n_unpreddi): New
>  define_insn_and_split.
>  * config/aarch64/aarch64.cc (aarch64_bsl2n_rtx_form_p): Define.
>  (aarch64_rtx_costs): Use the above. Cost BSL2N ops.
>
> gcc/testsuite/
>
>  * gcc.target/aarch64/sve2/bsl2n_d.c: New test.


Re: [PATCH] aarch64: Support unpacked SVE integer division

2025-07-11 Thread Remi Machet

On 7/11/25 08:21, Spencer Abson wrote:

External email: Use caution opening links or attachments


This patch extends the existing patterns for SVE_INT_BINARY_SD to
support partial SVE integer modes, including those that implement the
conditional form.

gcc/ChangeLog:

* config/aarch64/aarch64-sve.md (<optab><mode>3): Extend
to SVE_SDI_SIMD.
(@aarch64_pred_<optab><mode>): Likewise.
(@cond_<optab><mode>): Extend to SVE_SDI.
(*cond_<optab><mode>_2): Likewise.
(*cond_<optab><mode>_3): Likewise.
(*cond_<optab><mode>_any): Likewise.
* config/aarch64/iterators.md (SVE_SDI): New iterator for
all SVE vector modes with 32-bit or 64-bit elements.
(SVE_SDI_SIMD): New iterator.  As above, but including
V4SI and V2DI.

gcc/testsuite/ChangeLog:

* g++.target/aarch64/sve/cond_arith_1.C: Rename TEST_SHIFT
to TEST_OP, add tests for SDIV and UDIV.
* g++.target/aarch64/sve/cond_arith_2.C: Likewise.
* g++.target/aarch64/sve/cond_arith_3.C: Likewise.
* g++.target/aarch64/sve/cond_arith_4.C: Likewise.
* gcc.target/aarch64/sve/div_2.c: New test.

---

Bootstrapped & regtested on aarch64-linux-gnu.  OK for master?

Thanks,
Spencer

---
 gcc/config/aarch64/aarch64-sve.md | 64 +--
 gcc/config/aarch64/iterators.md   |  7 ++
 .../g++.target/aarch64/sve/cond_arith_1.C | 25 +---
 .../g++.target/aarch64/sve/cond_arith_2.C | 25 +---
 .../g++.target/aarch64/sve/cond_arith_3.C | 27 +---
 .../g++.target/aarch64/sve/cond_arith_4.C | 27 +---
 gcc/testsuite/gcc.target/aarch64/sve/div_2.c  | 22 +++
 7 files changed, 127 insertions(+), 70 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/div_2.c

diff --git a/gcc/config/aarch64/aarch64-sve.md 
b/gcc/config/aarch64/aarch64-sve.md
index 6b5113eb70f..871b31623bb 100644
--- a/gcc/config/aarch64/aarch64-sve.md
+++ b/gcc/config/aarch64/aarch64-sve.md
@@ -4712,12 +4712,12 @@
 ;; We can use it with Advanced SIMD modes to expose the V2DI and V4SI
 ;; optabs to the midend.
 (define_expand "<optab><mode>3"
-  [(set (match_operand:SVE_FULL_SDI_SIMD 0 "register_operand")
-   (unspec:SVE_FULL_SDI_SIMD
+  [(set (match_operand:SVE_SDI_SIMD 0 "register_operand")
+   (unspec:SVE_SDI_SIMD
  [(match_dup 3)
-  (SVE_INT_BINARY_SD:SVE_FULL_SDI_SIMD
-(match_operand:SVE_FULL_SDI_SIMD 1 "register_operand")
-(match_operand:SVE_FULL_SDI_SIMD 2 "register_operand"))]
+  (SVE_INT_BINARY_SD:SVE_SDI_SIMD
+(match_operand:SVE_SDI_SIMD 1 "register_operand")
+(match_operand:SVE_SDI_SIMD 2 "register_operand"))]
  UNSPEC_PRED_X))]
   "TARGET_SVE"
   {
@@ -4727,12 +4727,12 @@

 ;; Integer division predicated with a PTRUE.
 (define_insn "@aarch64_pred_<optab><mode>"
-  [(set (match_operand:SVE_FULL_SDI_SIMD 0 "register_operand")
-   (unspec:SVE_FULL_SDI_SIMD
+  [(set (match_operand:SVE_SDI_SIMD 0 "register_operand")
+   (unspec:SVE_SDI_SIMD
  [(match_operand:<VPRED> 1 "register_operand")
-  (SVE_INT_BINARY_SD:SVE_FULL_SDI_SIMD
-(match_operand:SVE_FULL_SDI_SIMD 2 "register_operand")
-(match_operand:SVE_FULL_SDI_SIMD 3 "register_operand"))]
+  (SVE_INT_BINARY_SD:SVE_SDI_SIMD
+(match_operand:SVE_SDI_SIMD 2 "register_operand")
+(match_operand:SVE_SDI_SIMD 3 "register_operand"))]
  UNSPEC_PRED_X))]
   "TARGET_SVE"
   {@ [ cons: =0 , 1   , 2 , 3 ; attrs: movprfx ]
@@ -4744,25 +4744,25 @@

 ;; Predicated integer division with merging.
 (define_expand "@cond_<optab><mode>"
-  [(set (match_operand:SVE_FULL_SDI 0 "register_operand")
-   (unspec:SVE_FULL_SDI
+  [(set (match_operand:SVE_SDI 0 "register_operand")
+   (unspec:SVE_SDI
  [(match_operand:<VPRED> 1 "register_operand")
-  (SVE_INT_BINARY_SD:SVE_FULL_SDI
-(match_operand:SVE_FULL_SDI 2 "register_operand")
-(match_operand:SVE_FULL_SDI 3 "register_operand"))
-  (match_operand:SVE_FULL_SDI 4 "aarch64_simd_reg_or_zero")]
+  (SVE_INT_BINARY_SD:SVE_SDI
+(match_operand:SVE_SDI 2 "register_operand")
+(match_operand:SVE_SDI 3 "register_operand"))
+  (match_operand:SVE_SDI 4 "aarch64_simd_reg_or_zero")]
  UNSPEC_SEL))]
   "TARGET_SVE"
 )

 ;; Predicated integer division, merging with the first input.
 (define_insn "*cond_<optab><mode>_2"
-  [(set (match_operand:SVE_FULL_SDI 0 "register_operand")
-   (unspec:SVE_FULL_SDI
+  [(set (match_operand:SVE_SDI 0 "register_operand")
+   (unspec:SVE_SDI
  [(match_operand:<VPRED> 1 "register_operand")
-  (SVE_INT_BINARY_SD:SVE_FULL_SDI
-(match_operand:SVE_FULL_SDI 2 "register_operand")
-(match_operand:SVE_FULL_SDI 3 "register_operand"))
+  (SVE_INT_BINARY_SD:SVE_SDI
+(match_operand:SVE_SDI 2 "register_operand")
+(match_operand:SVE_SDI 3 "register_operand"))
   (match_dup 2)]
  UNSPEC_SEL))]
   "TARGET_SVE"
@@ -4774,1

Re: [PATCH] aarch64: Support unpacked SVE integer division

2025-07-14 Thread Remi Machet

On 7/14/25 06:35, Spencer Abson wrote:
> External email: Use caution opening links or attachments
>
>
> On Fri, Jul 11, 2025 at 02:40:46PM +, Remi Machet wrote:
>> On 7/11/25 08:21, Spencer Abson wrote:
>>
>> External email: Use caution opening links or attachments
>>
>>
>> This patch extends the existing patterns for SVE_INT_BINARY_SD to
>> support partial SVE integer modes, including those that implement the
>> conditional form.
>>
>> gcc/ChangeLog:
>>
>>  * config/aarch64/aarch64-sve.md (<optab><mode>3): Extend
>>  to SVE_SDI_SIMD.
>>  (@aarch64_pred_<optab><mode>): Likewise.
>>  (@cond_<optab><mode>): Extend to SVE_SDI.
>>  (*cond_<optab><mode>_2): Likewise.
>>  (*cond_<optab><mode>_3): Likewise.
>>  (*cond_<optab><mode>_any): Likewise.
>>  * config/aarch64/iterators.md (SVE_SDI): New iterator for
>>  all SVE vector modes with 32-bit or 64-bit elements.
>>  (SVE_SDI_SIMD): New iterator.  As above, but including
>>  V4SI and V2DI.
>>
>> gcc/testsuite/ChangeLog:
>>
>>  * g++.target/aarch64/sve/cond_arith_1.C: Rename TEST_SHIFT
>>  to TEST_OP, add tests for SDIV and UDIV.
>>  * g++.target/aarch64/sve/cond_arith_2.C: Likewise.
>>  * g++.target/aarch64/sve/cond_arith_3.C: Likewise.
>>  * g++.target/aarch64/sve/cond_arith_4.C: Likewise.
>>  * gcc.target/aarch64/sve/div_2.c: New test.
>>
>> ---
>>
>> Bootstrapped & regtested on aarch64-linux-gnu.  OK for master?
>>
>> Thanks,
>> Spencer
>>
>> ---
>>   gcc/config/aarch64/aarch64-sve.md | 64 +--
>>   gcc/config/aarch64/iterators.md   |  7 ++
>>   .../g++.target/aarch64/sve/cond_arith_1.C | 25 +---
>>   .../g++.target/aarch64/sve/cond_arith_2.C | 25 +---
>>   .../g++.target/aarch64/sve/cond_arith_3.C | 27 +---
>>   .../g++.target/aarch64/sve/cond_arith_4.C | 27 +---
>>   gcc/testsuite/gcc.target/aarch64/sve/div_2.c  | 22 +++
>>   7 files changed, 127 insertions(+), 70 deletions(-)
>>   create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/div_2.c
>>
>> diff --git a/gcc/config/aarch64/aarch64-sve.md 
>> b/gcc/config/aarch64/aarch64-sve.md
>> index 6b5113eb70f..871b31623bb 100644
>> --- a/gcc/config/aarch64/aarch64-sve.md
>> +++ b/gcc/config/aarch64/aarch64-sve.md
>> @@ -4712,12 +4712,12 @@
>>   ;; We can use it with Advanced SIMD modes to expose the V2DI and V4SI
>>   ;; optabs to the midend.
>>   (define_expand "<optab><mode>3"
>> -  [(set (match_operand:SVE_FULL_SDI_SIMD 0 "register_operand")
>> -   (unspec:SVE_FULL_SDI_SIMD
>> +  [(set (match_operand:SVE_SDI_SIMD 0 "register_operand")
>> +   (unspec:SVE_SDI_SIMD
>>[(match_dup 3)
>> -  (SVE_INT_BINARY_SD:SVE_FULL_SDI_SIMD
>> -(match_operand:SVE_FULL_SDI_SIMD 1 "register_operand")
>> -(match_operand:SVE_FULL_SDI_SIMD 2 "register_operand"))]
>> +  (SVE_INT_BINARY_SD:SVE_SDI_SIMD
>> +(match_operand:SVE_SDI_SIMD 1 "register_operand")
>> +(match_operand:SVE_SDI_SIMD 2 "register_operand"))]
>>UNSPEC_PRED_X))]
>> "TARGET_SVE"
>> {
>> @@ -4727,12 +4727,12 @@
>>
>>   ;; Integer division predicated with a PTRUE.
>>   (define_insn "@aarch64_pred_<optab><mode>"
>> -  [(set (match_operand:SVE_FULL_SDI_SIMD 0 "register_operand")
>> -   (unspec:SVE_FULL_SDI_SIMD
>> +  [(set (match_operand:SVE_SDI_SIMD 0 "register_operand")
>> +   (unspec:SVE_SDI_SIMD
>>[(match_operand:<VPRED> 1 "register_operand")
>> -  (SVE_INT_BINARY_SD:SVE_FULL_SDI_SIMD
>> -(match_operand:SVE_FULL_SDI_SIMD 2 "register_operand")
>> -(match_operand:SVE_FULL_SDI_SIMD 3 "register_operand"))]
>> +  (SVE_INT_BINARY_SD:SVE_SDI_SIMD
>> +(match_operand:SVE_SDI_SIMD 2 "register_operand")
>> +(match_operand:SVE_SDI_SIMD 3 "register_operand"))]
>>UNSPEC_PRED_X))]
>> "TARGET_SVE"
>> {@ [ cons: =0 , 1   , 2 , 3 ; attrs: movprfx ]
>> @@ -4744,25 +4744,25 @@
>>
>>   ;; Predicated integer division with merging.
>>   (define_expand "@cond_<optab><mode>"
>> -  [(set (match_operand:SVE_FULL_SDI 0 "register_operand")
>> -   (unspec:SVE_FULL_SDI
>> +  [(set (match_operand:SVE_SDI 0 &qu

Re: [PATCH] aarch64: Use SVE2 NBSL for vector NOR and NAND for Advanced SIMD modes

2025-07-15 Thread Remi Machet

On 7/15/25 08:57, Kyrylo Tkachov wrote:
> External email: Use caution opening links or attachments
>
>
> Hi all,
>
> We already have patterns to use the NBSL instruction to implement vector
> NOR and NAND operations for SVE types and modes. It is straightforward to
> have similar patterns for the fixed-width Advanced SIMD modes as well, though
> it requires combine patterns without the predicate operand and an explicit 'Z'
> output modifier. This patch does so.
>
> So now for example we generate for:
>
> uint64x2_t nand_q(uint64x2_t a, uint64x2_t b) { return NAND(a, b); }
> uint64x2_t nor_q(uint64x2_t a, uint64x2_t b) { return NOR(a, b); }
>
> nand_q:
>  nbsl z0.d, z0.d, z1.d, z1.d
>  ret
>
> nor_q:
>  nbsl z0.d, z0.d, z1.d, z0.d
>  ret
>
> instead of the previous:
> nand_q:
>  and v0.16b, v0.16b, v1.16b
>  not v0.16b, v0.16b
>  ret
>
> nor_q:
>  orr v0.16b, v0.16b, v1.16b
>  not v0.16b, v0.16b
>  ret
>
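(A short check of why tying operands gives NAND and NOR, assuming
NBSL(a, b, c) = ~((a & c) | (b & ~c)):
  c = b:  ~((a & b) | (b & ~b)) = ~(a & b)   -- NAND, matching nbsl z0.d, z0.d, z1.d, z1.d
  c = a:  ~((a & a) | (b & ~a)) = ~(a | b)   -- NOR,  matching nbsl z0.d, z0.d, z1.d, z0.d)
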
> The tied operand requirements for NBSL mean that we can generate the MOVPRFX
> when the operands fall that way, but I guess having a 2-insn MOVPRFX form is
> not worse than the current 2-insn codegen at least, and the MOVPRFX can be
> fused by many cores.
>
> Bootstrapped and tested on aarch64-none-linux-gnu.
> Ok for trunk?

Looks good to me.

Remi

> Thanks,
> Kyrill
>
> Signed-off-by: Kyrylo Tkachov 
>
> gcc/
>
>  * config/aarch64/aarch64-sve2.md (*aarch64_sve2_unpred_nor):
>  New define_insn.
>  (*aarch64_sve2_nand_unpred): Likewise.
>
> gcc/testsuite/
>
>  * gcc.target/aarch64/sve2/nbsl_nor_nand_neon.c: New test.
>


Re: [PATCH 2/2] aarch64: Allow CPU tuning to avoid INS-(W|X)ZR instructions

2025-07-18 Thread Remi Machet

On 7/18/25 05:39, Kyrylo Tkachov wrote:
> External email: Use caution opening links or attachments
>
>
> Hi all,
>
> For inserting zero into a vector lane we usually use an instruction like:
>  ins v0.h[2], wzr
>
> This, however, has not-so-great performance on some CPUs.
> On Grace, for example it has a latency of 5 and throughput 1.
> The alternative sequence:
>  movi    v31.8b, #0
>  ins     v0.h[2], v31.h[0]
> is preferable because the MOVI-0 is often a zero-latency operation that is
> eliminated by the CPU frontend and the lane-to-lane INS has a latency of 2 and
> throughput of 4.
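(For illustration, a source pattern that typically expands to this lane-zeroing
insert -- an assumed example, not taken from the patch or its testsuite change:)

#include <arm_neon.h>

/* Zero lane 2 of a uint16x8_t; depending on tuning this becomes either
   "ins v0.h[2], wzr" or the movi + lane-to-lane ins sequence above.  */
uint16x8_t zero_lane2 (uint16x8_t x)
{
  return vsetq_lane_u16 (0, x, 2);
}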
> We can avoid the merging of the two instructions into the 
> aarch64_simd_vec_set_zero
> insn through rtx costs.  We just need to handle the right VEC_MERGE form in
> aarch64_rtx_costs. The new CPU-specific cost field ins_gp is introduced to 
> describe
> this operation.
> According to a similar LLVM PR: 
> https://github.com/llvm/llvm-project/pull/146538
> and looking at some Arm SWOGs I expect the Neoverse-derived cores to benefit 
> from this,
> whereas little cores like Cortex-A510 won't (INS-WZR has a respectable latency
> 3 in Cortex-A510).
>
> Practically, a value of COSTS_N_INSNS (2) and higher for ins_gp causes the 
> split
> into two instructions, values lower than that retain the INS-WZR form.
> cortexa76_extra_costs, from which Grace and many other Neoverse cores derive 
> from,
> sets ins_gp to COSTS_N_INSNS (3) to reflect a latency of 5 cycles.  3 is the 
> number
> of cycles above the normal cheapest SIMD instruction on such cores (which 
> take 2 cycles
> for the cheapest one).
>
> cortexa53_extra_costs and all other costs set ins_gp to COSTS_N_INSNS (1) to
> preserve the current codegen, though I'd be happy to increase it for generic 
> tuning.
>
> For -Os we don't add any extra cost so the shorter INS-WZR form is still
> generated always.
>
> Bootstrapped and tested on aarch64-none-linux-gnu.
> Ok for trunk?
> Thanks,
> Kyrill

Minor nit: one line in gcc/config/aarch64/aarch64.cc is past 80 characters.

Looks good to me otherwise (but I cannot approve).

Remi

>
> Signed-off-by: Kyrylo Tkachov 
>
> gcc/
>
>  * config/arm/aarch-common-protos.h (vector_cost_table): Add ins_gp
>  field.  Add comments to other vector cost fields.
>  * config/aarch64/aarch64.cc (aarch64_rtx_costs): Handle VEC_MERGE 
> case.
>  * config/aarch64/aarch64-cost-tables.h (qdf24xx_extra_costs,
>  thunderx_extra_costs, thunderx2t99_extra_costs,
>  thunderx3t110_extra_costs, tsv110_extra_costs, a64fx_extra_costs,
>  ampere1_extra_costs, ampere1a_extra_costs, ampere1b_extra_costs):
>  Specify ins_gp cost.
>  * config/arm/aarch-cost-tables.h (generic_extra_costs,
>  cortexa53_extra_costs, cortexa57_extra_costs, cortexa76_extra_costs,
>  exynosm1_extra_costs, xgene1_extra_costs): Likewise.
>
> gcc/testsuite/
>
>  * gcc.target/aarch64/simd/mf8_data_1.c (test_set_lane4,
>  test_setq_lane4): Relax allowed assembly.
>