Kyrylo Tkachov <ktkac...@nvidia.com> writes:
> Hi all,
>
> In this testcase late-combine was failing to merge:
> dup v31.4s, v31.s[3]
> fmla v30.4s, v31.4s, v29.4s
> into the lane-wise fmla form.
> This is because late-combine checks may_trap_p under the hood on the dup insn.
> This ended up returning true for the insn:
> (set (reg:V4SF 152 [ _32 ])
>       (vec_duplicate:V4SF (vec_select:SF (reg:V4SF 111 [ rhs_panel.8_31 ])
>               (parallel:V4SF [
>                       (const_int 3 [0x3])]))))
>
> Although mem_trap_p correctly reasoned that vec_duplicate and vec_select of
> floating-point modes can't trap, it assumed that the V4SF parallel can trap.
> The correct behaviour is to recurse into vector inside the PARALLEL and check
> the sub-expression. This patch adjusts may_trap_p_1 to do just that.
> With this check the above insn is not deemed to be trapping and is propagated
> into the FMLA giving:
> fmla vD.4s, vA.4s, vB.s[3]
>
> Bootstrapped and tested on aarch64-none-linux-gnu.
> Apparently this also fixes a regression in
> gcc.target/aarch64/vmul_element_cost.c that I observed.
>
> Signed-off-by: Kyrylo Tkachov <ktkac...@nvidia.com>
>
> gcc/
>
>       PR rtl-optimization/119046
>       * rtlanal.cc (may_trap_p_1): Don't mark FP-mode PARALLELs as trapping.
>
> gcc/testsuite/
>
>       PR rtl-optimization/119046
>       * gcc.target/aarch64/pr119046.c: New test.
>
> From 39fa1be73e8fed7fc673329e1854ba41587623c4 Mon Sep 17 00:00:00 2001
> From: Kyrylo Tkachov <ktkac...@nvidia.com>
> Date: Thu, 27 Feb 2025 09:00:25 -0800
> Subject: [PATCH] PR rtl-optimization/119046: Don't mark PARALLEL RTXes with
>  floating-point mode as trapping
>
> In this testcase late-combine was failing to merge:
>         dup     v31.4s, v31.s[3]
>         fmla    v30.4s, v31.4s, v29.4s
> into the lane-wise fmla form.
> This is because late-combine checks may_trap_p under the hood on the dup insn.
> This ended up returning true for the insn:
> (set (reg:V4SF 152 [ _32 ])
>         (vec_duplicate:V4SF (vec_select:SF (reg:V4SF 111 [ rhs_panel.8_31 ])
>                 (parallel:V4SF [
>                         (const_int 3 [0x3])]))))
>
> Although mem_trap_p correctly reasoned that vec_duplicate and vec_select of
> floating-point modes can't trap, it assumed that the V4SF parallel can trap.
> The correct behaviour is to recurse into vector inside the PARALLEL and check
> the sub-expression.  This patch adjusts may_trap_p_1 to do just that.
> With this check the above insn is not deemed to be trapping and is propagated
> into the FMLA giving:
>         fmla    vD.4s, vA.4s, vB.s[3]
>
> Bootstrapped and tested on aarch64-none-linux-gnu.
> Apparently this also fixes a regression in
> gcc.target/aarch64/vmul_element_cost.c that I observed.
>
> Signed-off-by: Kyrylo Tkachov <ktkac...@nvidia.com>
>
> gcc/
>
>       PR rtl-optimization/119046
>       * rtlanal.cc (may_trap_p_1): Don't mark FP-mode PARALLELs as trapping.

I don't object to this, if someone else agrees that it's the right fix.
But the mode on the parallel looks iffy to me, in that the data is not
floating-point data.  VOIDmode would probably be more accurate and
seems to be what x86 uses.

Thanks,
Richard

>
> gcc/testsuite/
>
>       PR rtl-optimization/119046
>       * gcc.target/aarch64/pr119046.c: New test.
> ---
>  gcc/rtlanal.cc                              |  1 +
>  gcc/testsuite/gcc.target/aarch64/pr119046.c | 16 ++++++++++++++++
>  2 files changed, 17 insertions(+)
>  create mode 100644 gcc/testsuite/gcc.target/aarch64/pr119046.c
>
> diff --git a/gcc/rtlanal.cc b/gcc/rtlanal.cc
> index 8caffafdaa4..7ad67afb9fe 100644
> --- a/gcc/rtlanal.cc
> +++ b/gcc/rtlanal.cc
> @@ -3252,6 +3252,7 @@ may_trap_p_1 (const_rtx x, unsigned flags)
>       return true;
>        break;
>  
> +    case PARALLEL:
>      case NEG:
>      case ABS:
>      case SUBREG:
> diff --git a/gcc/testsuite/gcc.target/aarch64/pr119046.c 
> b/gcc/testsuite/gcc.target/aarch64/pr119046.c
> new file mode 100644
> index 00000000000..aa5fa7c848c
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/pr119046.c
> @@ -0,0 +1,16 @@
> +/* { dg-do compile } */
> +/* { dg-additional-options "-O2" } */
> +
> +#include <arm_neon.h>
> +
> +float32x4_t madd_helper_1(float32x4_t a, float32x4_t b, float32x4_t d)
> +{
> +  float32x4_t t = a;
> +  t = vfmaq_f32 (t, vdupq_n_f32(vgetq_lane_f32 (b, 1)), d);
> +  t = vfmaq_f32 (t, vdupq_n_f32(vgetq_lane_f32 (b, 1)), d);
> +  return t;
> +}
> +
> +/* { dg-final { scan-assembler-not {\tdup\tv[0-9]+\.4s, v[0-9]+.s\[1\]\n} } 
> } */
> +/* { dg-final { scan-assembler-times {\tfmla\tv[0-9]+\.4s, v[0-9]+\.4s, 
> v[0-9]+\.s\[1\]\n} 2 } } */
> +

Reply via email to