https://gcc.gnu.org/g:db76482175c4e76db273d7fb3a00ae0f932529a6

commit r15-7832-gdb76482175c4e76db273d7fb3a00ae0f932529a6
Author: Kyrylo Tkachov <ktkac...@nvidia.com>
Date:   Thu Feb 27 09:00:25 2025 -0800

    PR rtl-optimization/119046: Don't mark PARALLEL RTXes with floating-point 
mode as trapping
    
    In this testcase late-combine was failing to merge:
            dup     v31.4s, v31.s[3]
            fmla    v30.4s, v31.4s, v29.4s
    into the lane-wise fmla form.
    This is because late-combine checks may_trap_p under the hood on the dup 
insn.
    This ended up returning true for the insn:
    (set (reg:V4SF 152 [ _32 ])
            (vec_duplicate:V4SF (vec_select:SF (reg:V4SF 111 [ rhs_panel.8_31 ])
                    (parallel:V4SF [
                            (const_int 3 [0x3])]))))
    
    Although mem_trap_p correctly reasoned that vec_duplicate and vec_select of
    floating-point modes can't trap, it assumed that the V4SF parallel can trap.
    The correct behaviour is to recurse into vector inside the PARALLEL and 
check
    the sub-expression.  This patch adjusts may_trap_p_1 to do just that.
    With this check the above insn is not deemed to be trapping and is 
propagated
    into the FMLA giving:
            fmla    vD.4s, vA.4s, vB.s[3]
    
    Bootstrapped and tested on aarch64-none-linux-gnu.
    Apparently this also fixes a regression in
    gcc.target/aarch64/vmul_element_cost.c that I observed.
    
    Signed-off-by: Kyrylo Tkachov <ktkac...@nvidia.com>
    
    gcc/
    
            PR rtl-optimization/119046
            * rtlanal.cc (may_trap_p_1): Don't mark FP-mode PARALLELs as 
trapping.
    
    gcc/testsuite/
    
            PR rtl-optimization/119046
            * gcc.target/aarch64/pr119046.c: New test.

Diff:
---
 gcc/rtlanal.cc                              |  1 +
 gcc/testsuite/gcc.target/aarch64/pr119046.c | 16 ++++++++++++++++
 2 files changed, 17 insertions(+)

diff --git a/gcc/rtlanal.cc b/gcc/rtlanal.cc
index 8caffafdaa44..7ad67afb9fe8 100644
--- a/gcc/rtlanal.cc
+++ b/gcc/rtlanal.cc
@@ -3252,6 +3252,7 @@ may_trap_p_1 (const_rtx x, unsigned flags)
        return true;
       break;
 
+    case PARALLEL:
     case NEG:
     case ABS:
     case SUBREG:
diff --git a/gcc/testsuite/gcc.target/aarch64/pr119046.c 
b/gcc/testsuite/gcc.target/aarch64/pr119046.c
new file mode 100644
index 000000000000..aa5fa7c848c1
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/pr119046.c
@@ -0,0 +1,16 @@
+/* { dg-do compile } */
+/* { dg-additional-options "-O2" } */
+
+#include <arm_neon.h>
+
+float32x4_t madd_helper_1(float32x4_t a, float32x4_t b, float32x4_t d)
+{
+  float32x4_t t = a;
+  t = vfmaq_f32 (t, vdupq_n_f32(vgetq_lane_f32 (b, 1)), d);
+  t = vfmaq_f32 (t, vdupq_n_f32(vgetq_lane_f32 (b, 1)), d);
+  return t;
+}
+
+/* { dg-final { scan-assembler-not {\tdup\tv[0-9]+\.4s, v[0-9]+.s\[1\]\n} } } 
*/
+/* { dg-final { scan-assembler-times {\tfmla\tv[0-9]+\.4s, v[0-9]+\.4s, 
v[0-9]+\.s\[1\]\n} 2 } } */
+

Reply via email to