RE: [PATCH] aarch64: Recognize vector permute patterns suitable for FMOV [PR100165]

2025-02-18 Thread quic_pzheng
> Pengxuan Zheng  writes:
> > This patch optimizes certain vector permute expansions with the FMOV
> > instruction when one of the input vectors is a vector of all zeros and
> > the result of the vector permute is as if the upper lane of the
> > non-zero input vector is set to zero and the lower lane remains
> > unchanged.
> >
> > Note that the patch also propagates zero_op0_p and zero_op1_p during
> > re-encode now.  They will be used by aarch64_evpc_fmov to check if the
> > input vectors are valid candidates.
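
For illustration, a minimal sketch of the shape of permute being described,
keeping the low lane and zeroing the high lane (a hypothetical example in GNU
C vector syntax, not taken from the patch's testsuite); on little-endian this
can be emitted as a single "fmov d0, d0", since a scalar write to a D
register clears the upper 64 bits of the vector register:

typedef double v2df __attribute__ ((vector_size (16)));

v2df
zero_high_lane (v2df x)
{
  v2df zeros = { 0.0, 0.0 };
  /* Lane 0 from X, lane 1 from ZEROS (little-endian lane numbering).  */
  return __builtin_shufflevector (x, zeros, 0, 2);
}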
> >
> > PR target/100165
> >
> > gcc/ChangeLog:
> >
> > * config/aarch64/aarch64-simd.md (aarch64_simd_vec_set_zero_fmov):
> > New define_insn.
> > * config/aarch64/aarch64.cc (aarch64_evpc_reencode): Copy zero_op0_p
> > and zero_op1_p.
> > (aarch64_evpc_fmov): New function.
> > (aarch64_expand_vec_perm_const_1): Add call to aarch64_evpc_fmov.
> >
> > gcc/testsuite/ChangeLog:
> >
> > * gcc.target/aarch64/vec-set-zero.c: Update test accordingly.
> > * gcc.target/aarch64/fmov.c: New test.
> > * gcc.target/aarch64/fmov-be.c: New test.
> 
> Nice!  Thanks for doing this.  Some comments on the patch below.
> >
> > Signed-off-by: Pengxuan Zheng 
> > ---
> >  gcc/config/aarch64/aarch64-simd.md|  14 +++
> >  gcc/config/aarch64/aarch64.cc |  74 +++-
> >  gcc/testsuite/gcc.target/aarch64/fmov-be.c|  74 
> >  gcc/testsuite/gcc.target/aarch64/fmov.c   | 110 ++
> >  .../gcc.target/aarch64/vec-set-zero.c |   6 +-
> >  5 files changed, 275 insertions(+), 3 deletions(-)
> >  create mode 100644 gcc/testsuite/gcc.target/aarch64/fmov-be.c
> >  create mode 100644 gcc/testsuite/gcc.target/aarch64/fmov.c
> >
> > diff --git a/gcc/config/aarch64/aarch64-simd.md
> > b/gcc/config/aarch64/aarch64-simd.md
> > index e456f693d2f..543126948e7 100644
> > --- a/gcc/config/aarch64/aarch64-simd.md
> > +++ b/gcc/config/aarch64/aarch64-simd.md
> > @@ -1190,6 +1190,20 @@ (define_insn "aarch64_simd_vec_set<mode>"
> >    [(set_attr "type" "neon_ins, neon_from_gp, neon_load1_one_lane")]
> >  )
> >
> > +(define_insn "aarch64_simd_vec_set_zero_fmov"
> > +  [(set (match_operand:VP_2E 0 "register_operand" "=w")
> > +   (vec_merge:VP_2E
> > +   (match_operand:VP_2E 1 "aarch64_simd_imm_zero" "Dz")
> > +   (match_operand:VP_2E 3 "register_operand" "w")
> > +   (match_operand:SI 2 "immediate_operand" "i")))]
> > +  "TARGET_SIMD
> > +   && (ENDIAN_LANE_N (<nunits>, exact_log2 (INTVAL (operands[2]))) == 1)"
> > +  {
> > +return "fmov\\t%0, %3";
> > +  }
> > +  [(set_attr "type" "fmov")]
> > +)
> > +
> 
> I think this shows that target-independent code is missing some
> canonicalisation of vec_merge.  combine has:
> 
>   unsigned n_elts = 0;
>   if (GET_CODE (x) == VEC_MERGE
>   && CONST_INT_P (XEXP (x, 2))
>   && GET_MODE_NUNITS (GET_MODE (x)).is_constant (&n_elts)
>   && (swap_commutative_operands_p (XEXP (x, 0), XEXP (x, 1))
> /* Two operands have same precedence, then
>first bit of mask select first operand.  */
> || (!swap_commutative_operands_p (XEXP (x, 1), XEXP (x, 0))
> && !(UINTVAL (XEXP (x, 2)) & 1))))
> {
>   rtx temp = XEXP (x, 0);
>   unsigned HOST_WIDE_INT sel = UINTVAL (XEXP (x, 2));
>   unsigned HOST_WIDE_INT mask = HOST_WIDE_INT_1U;
>   if (n_elts == HOST_BITS_PER_WIDE_INT)
>   mask = -1;
>   else
>   mask = (HOST_WIDE_INT_1U << n_elts) - 1;
>   SUBST (XEXP (x, 0), XEXP (x, 1));
>   SUBST (XEXP (x, 1), temp);
>   SUBST (XEXP (x, 2), GEN_INT (~sel & mask));
> }
> 
> which AFAICT would prefer to put the immediate second, not first.  I think
> we should be doing the same canonicalisation in simplify_ternary_operation,
> and possibly elsewhere, so that the .md pattern only needs to match the
> canonical form (i.e. register, immediate, mask).

Thanks for the suggestion. I've added the canonicalization in a separate
patch.
https://gcc.gnu.org/pipermail/gcc-patches/2025-February/676105.html
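
For reference, the canonicalisation is semantics-preserving because swapping
the two vec_merge operands while complementing the selector (masked to the
element count) picks exactly the same elements.  A small C model of the
selector semantics (my own sketch, not GCC code):

#include <stdint.h>

/* Toy model of (vec_merge a b sel) on N-element vectors: bit i of SEL
   set means element i comes from A, clear means it comes from B.  */
static void
vec_merge_model (const int *a, const int *b, uint64_t sel,
                 unsigned n, int *out)
{
  for (unsigned i = 0; i < n; i++)
    out[i] = ((sel >> i) & 1) ? a[i] : b[i];
}

With this model, calling it with (a, b, sel) or with (b, a, ~sel masked to
the low n bits) fills OUT identically, which is what lets combine (and, with
the linked patch, the simplification code) normalise towards one operand
order.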

> 
> On:
> 
> > +   && (ENDIAN_LANE_N (<nunits>, exact_log2 (INTVAL (operands[2]))) == 1)"
> 
> it seems dangerous to pass exact_log2 to ENDIAN_LANE_N when we haven't
> checked whether it is a power of 2.  (0b00 or 0b11 ought to get
> simplified, but I don't think we can ignore the possibility.)
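
For context: GCC's exact_log2 returns -1 when its argument is zero or not a
power of two, so for a selector like 0b11 the unguarded condition would hand
-1 to ENDIAN_LANE_N.  A standalone stand-in (my own sketch, not GCC's
implementation) showing the behaviour being relied on:

#include <stdio.h>

/* Stand-in for GCC's exact_log2: log2 (X) if X is a power of two,
   otherwise -1.  */
static int
my_exact_log2 (unsigned long long x)
{
  if (x == 0 || (x & (x - 1)) != 0)
    return -1;
  int log = 0;
  while (x >>= 1)
    log++;
  return log;
}

int
main (void)
{
  /* Selectors 0b01, 0b10, 0b11: prints "0 1 -1".  */
  printf ("%d %d %d\n", my_exact_log2 (1), my_exact_log2 (2),
          my_exact_log2 (3));
  return 0;
}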
> 
> Rather than restrict the pattern to pairs, could we instead handle
> VALL_F16 minus the QI elements, with the 16-bit elements restricted to
> TARGET_F16?  E.g. we should be able to handle V4SI using an FMOV of S
> registers if only the low element is nonzero.

Good point! I've addressed these in the latest version. Please let me know
if I missed anything.
https://gcc.gnu.org/pipermail/gcc-patches/2025-February/676106.html
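
As an illustration of the wider case mentioned above (again a hypothetical
example, not the patch's own test): for V4SI, keeping only the low element
and zeroing the rest can become "fmov s0, s0" on little-endian, since a
scalar write to an S register clears bits 32-127 of the vector register:

typedef int v4si __attribute__ ((vector_size (16)));

v4si
keep_low_element (v4si x)
{
  v4si zeros = { 0, 0, 0, 0 };
  /* Element 0 from X, elements 1-3 from ZEROS.  */
  return __builtin_shufflevector (x, zeros, 0, 5, 6, 7);
}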

Thanks,
Pengxuan

> 
> Part of me thinks that this should just be described as a plain old AND,
> but I suppose that doesn't work well for FP modes.  Still, handling ANDs
> might be an interesting follow-up :)
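
For integer element modes the AND view is easy to picture: zeroing the
unwanted lane is the same as ANDing with a constant mask that is all-ones in
the kept lane (a rough sketch of my own, not from the patch); for FP modes
the mask would have to operate on the bit pattern, which is what makes the
AND form less natural there:

typedef long long v2di __attribute__ ((vector_size (16)));

v2di
zero_high_lane_via_and (v2di x)
{
  /* All-ones in lane 0, zero in lane 1: same result as the permute that
     keeps the low lane and zeroes the high lane.  */
  v2di mask = { -1LL, 0 };
  return x & mask;
}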
> 
> Thank
