Kyrylo Tkachov <ktkac...@nvidia.com> writes: > Hi all, > > In this testcase late-combine was failing to merge: > dup v31.4s, v31.s[3] > fmla v30.4s, v31.4s, v29.4s > into the lane-wise fmla form. > This is because late-combine checks may_trap_p under the hood on the dup insn. > This ended up returning true for the insn: > (set (reg:V4SF 152 [ _32 ]) > (vec_duplicate:V4SF (vec_select:SF (reg:V4SF 111 [ rhs_panel.8_31 ]) > (parallel:V4SF [ > (const_int 3 [0x3])])))) > > Although mem_trap_p correctly reasoned that vec_duplicate and vec_select of > floating-point modes can't trap, it assumed that the V4SF parallel can trap. > The correct behaviour is to recurse into vector inside the PARALLEL and check > the sub-expression. This patch adjusts may_trap_p_1 to do just that. > With this check the above insn is not deemed to be trapping and is propagated > into the FMLA giving: > fmla vD.4s, vA.4s, vB.s[3] > > Bootstrapped and tested on aarch64-none-linux-gnu. > Apparently this also fixes a regression in > gcc.target/aarch64/vmul_element_cost.c that I observed. > > Signed-off-by: Kyrylo Tkachov <ktkac...@nvidia.com> > > gcc/ > > PR rtl-optimization/119046 > * rtlanal.cc (may_trap_p_1): Don't mark FP-mode PARALLELs as trapping. > > gcc/testsuite/ > > PR rtl-optimization/119046 > * gcc.target/aarch64/pr119046.c: New test. > > From 39fa1be73e8fed7fc673329e1854ba41587623c4 Mon Sep 17 00:00:00 2001 > From: Kyrylo Tkachov <ktkac...@nvidia.com> > Date: Thu, 27 Feb 2025 09:00:25 -0800 > Subject: [PATCH] PR rtl-optimization/119046: Don't mark PARALLEL RTXes with > floating-point mode as trapping > > In this testcase late-combine was failing to merge: > dup v31.4s, v31.s[3] > fmla v30.4s, v31.4s, v29.4s > into the lane-wise fmla form. > This is because late-combine checks may_trap_p under the hood on the dup insn. > This ended up returning true for the insn: > (set (reg:V4SF 152 [ _32 ]) > (vec_duplicate:V4SF (vec_select:SF (reg:V4SF 111 [ rhs_panel.8_31 ]) > (parallel:V4SF [ > (const_int 3 [0x3])])))) > > Although mem_trap_p correctly reasoned that vec_duplicate and vec_select of > floating-point modes can't trap, it assumed that the V4SF parallel can trap. > The correct behaviour is to recurse into vector inside the PARALLEL and check > the sub-expression. This patch adjusts may_trap_p_1 to do just that. > With this check the above insn is not deemed to be trapping and is propagated > into the FMLA giving: > fmla vD.4s, vA.4s, vB.s[3] > > Bootstrapped and tested on aarch64-none-linux-gnu. > Apparently this also fixes a regression in > gcc.target/aarch64/vmul_element_cost.c that I observed. > > Signed-off-by: Kyrylo Tkachov <ktkac...@nvidia.com> > > gcc/ > > PR rtl-optimization/119046 > * rtlanal.cc (may_trap_p_1): Don't mark FP-mode PARALLELs as trapping.
I don't object to this, if someone else agrees that it's the right fix. But the mode on the parallel looks iffy to me, in that the data is not floating-point data. VOIDmode would probably be more accurate and seems to be what x86 uses. Thanks, Richard > > gcc/testsuite/ > > PR rtl-optimization/119046 > * gcc.target/aarch64/pr119046.c: New test. > --- > gcc/rtlanal.cc | 1 + > gcc/testsuite/gcc.target/aarch64/pr119046.c | 16 ++++++++++++++++ > 2 files changed, 17 insertions(+) > create mode 100644 gcc/testsuite/gcc.target/aarch64/pr119046.c > > diff --git a/gcc/rtlanal.cc b/gcc/rtlanal.cc > index 8caffafdaa4..7ad67afb9fe 100644 > --- a/gcc/rtlanal.cc > +++ b/gcc/rtlanal.cc > @@ -3252,6 +3252,7 @@ may_trap_p_1 (const_rtx x, unsigned flags) > return true; > break; > > + case PARALLEL: > case NEG: > case ABS: > case SUBREG: > diff --git a/gcc/testsuite/gcc.target/aarch64/pr119046.c > b/gcc/testsuite/gcc.target/aarch64/pr119046.c > new file mode 100644 > index 00000000000..aa5fa7c848c > --- /dev/null > +++ b/gcc/testsuite/gcc.target/aarch64/pr119046.c > @@ -0,0 +1,16 @@ > +/* { dg-do compile } */ > +/* { dg-additional-options "-O2" } */ > + > +#include <arm_neon.h> > + > +float32x4_t madd_helper_1(float32x4_t a, float32x4_t b, float32x4_t d) > +{ > + float32x4_t t = a; > + t = vfmaq_f32 (t, vdupq_n_f32(vgetq_lane_f32 (b, 1)), d); > + t = vfmaq_f32 (t, vdupq_n_f32(vgetq_lane_f32 (b, 1)), d); > + return t; > +} > + > +/* { dg-final { scan-assembler-not {\tdup\tv[0-9]+\.4s, v[0-9]+.s\[1\]\n} } > } */ > +/* { dg-final { scan-assembler-times {\tfmla\tv[0-9]+\.4s, v[0-9]+\.4s, > v[0-9]+\.s\[1\]\n} 2 } } */ > +