RE: [PATCH] aarch64: Recognize vector permute patterns suitable for FMOV [PR100165]
> Pengxuan Zheng writes:
> > This patch optimizes certain vector permute expansion with the FMOV
> > instruction when one of the input vectors is a vector of all zeros and
> > the result of the vector permute is as if the upper lane of the
> > non-zero input vector is set to zero and the lower lane remains unchanged.
> >
> > Note that the patch also propagates zero_op0_p and zero_op1_p during
> > re-encode now.  They will be used by aarch64_evpc_fmov to check if the
> > input vectors are valid candidates.
> >
> > 	PR target/100165
> >
> > gcc/ChangeLog:
> >
> > 	* config/aarch64/aarch64-simd.md (aarch64_simd_vec_set_zero_fmov<mode>):
> > 	New define_insn.
> > 	* config/aarch64/aarch64.cc (aarch64_evpc_reencode): Copy zero_op0_p and
> > 	zero_op1_p.
> > 	(aarch64_evpc_fmov): New function.
> > 	(aarch64_expand_vec_perm_const_1): Add call to aarch64_evpc_fmov.
> >
> > gcc/testsuite/ChangeLog:
> >
> > 	* gcc.target/aarch64/vec-set-zero.c: Update test accordingly.
> > 	* gcc.target/aarch64/fmov.c: New test.
> > 	* gcc.target/aarch64/fmov-be.c: New test.
>
> Nice!  Thanks for doing this.  Some comments on the patch below.
>
> > Signed-off-by: Pengxuan Zheng
> > ---
> >  gcc/config/aarch64/aarch64-simd.md            |  14 +++
> >  gcc/config/aarch64/aarch64.cc                 |  74 +++-
> >  gcc/testsuite/gcc.target/aarch64/fmov-be.c    |  74
> >  gcc/testsuite/gcc.target/aarch64/fmov.c       | 110 ++
> >  .../gcc.target/aarch64/vec-set-zero.c         |   6 +-
> >  5 files changed, 275 insertions(+), 3 deletions(-)
> >  create mode 100644 gcc/testsuite/gcc.target/aarch64/fmov-be.c
> >  create mode 100644 gcc/testsuite/gcc.target/aarch64/fmov.c
> >
> > diff --git a/gcc/config/aarch64/aarch64-simd.md b/gcc/config/aarch64/aarch64-simd.md
> > index e456f693d2f..543126948e7 100644
> > --- a/gcc/config/aarch64/aarch64-simd.md
> > +++ b/gcc/config/aarch64/aarch64-simd.md
> > @@ -1190,6 +1190,20 @@ (define_insn "aarch64_simd_vec_set<mode>"
> >    [(set_attr "type" "neon_ins<q>, neon_from_gp<q>, neon_load1_one_lane<q>")]
> >  )
> >
> > +(define_insn "aarch64_simd_vec_set_zero_fmov<mode>"
> > +  [(set (match_operand:VP_2E 0 "register_operand" "=w")
> > +	(vec_merge:VP_2E
> > +	    (match_operand:VP_2E 1 "aarch64_simd_imm_zero" "Dz")
> > +	    (match_operand:VP_2E 3 "register_operand" "w")
> > +	    (match_operand:SI 2 "immediate_operand" "i")))]
> > +  "TARGET_SIMD
> > +   && (ENDIAN_LANE_N (<nunits>, exact_log2 (INTVAL (operands[2]))) == 1)"
> > +  {
> > +    return "fmov\\t%0, %3";
> > +  }
> > +  [(set_attr "type" "fmov")]
> > +)
> > +
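To make the intent concrete, here is the kind of source-level shuffle the new
pattern is aimed at.  This is a hand-written illustration, not one of the new
tests, and the function name is made up; on little-endian the expectation is
that it becomes a single fmov rather than a multi-instruction permute:

    typedef int v2si __attribute__ ((vector_size (8)));
    typedef float v2sf __attribute__ ((vector_size (8)));

    /* Keep the low lane of x and zero the upper lane.  The permute merges
       x with an all-zeros vector, which matches the situation described in
       the patch: one input is all zeros and only the low lane of the other
       survives.  */
    v2sf
    clear_upper_lane (v2sf x)
    {
      v2sf zeros = { 0.0f, 0.0f };
      return __builtin_shuffle (x, zeros, (v2si) { 0, 2 });
    }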
> I think this shows that target-independent code is missing some
> canonicalisation of vec_merge.  combine has:
>
>       unsigned n_elts = 0;
>       if (GET_CODE (x) == VEC_MERGE
> 	  && CONST_INT_P (XEXP (x, 2))
> 	  && GET_MODE_NUNITS (GET_MODE (x)).is_constant (&n_elts)
> 	  && (swap_commutative_operands_p (XEXP (x, 0), XEXP (x, 1))
> 	      /* Two operands have same precedence, then
> 		 first bit of mask select first operand.  */
> 	      || (!swap_commutative_operands_p (XEXP (x, 1), XEXP (x, 0))
> 		  && !(UINTVAL (XEXP (x, 2)) & 1))))
> 	{
> 	  rtx temp = XEXP (x, 0);
> 	  unsigned HOST_WIDE_INT sel = UINTVAL (XEXP (x, 2));
> 	  unsigned HOST_WIDE_INT mask = HOST_WIDE_INT_1U;
> 	  if (n_elts == HOST_BITS_PER_WIDE_INT)
> 	    mask = -1;
> 	  else
> 	    mask = (HOST_WIDE_INT_1U << n_elts) - 1;
> 	  SUBST (XEXP (x, 0), XEXP (x, 1));
> 	  SUBST (XEXP (x, 1), temp);
> 	  SUBST (XEXP (x, 2), GEN_INT (~sel & mask));
> 	}
>
> which AFAICT would prefer to put the immediate second, not first.  I think we
> should be doing the same canonicalisation in simplify_ternary_operation, and
> possibly elsewhere, so that the .md pattern only needs to match the canonical
> form (i.e. register, immediate, mask).

Thanks for the suggestion. I've added the canonicalization in a separate patch.
https://gcc.gnu.org/pipermail/gcc-patches/2025-February/676105.html

> On:
>
> > +   && (ENDIAN_LANE_N (<nunits>, exact_log2 (INTVAL (operands[2]))) == 1)"
>
> it seems dangerous to pass exact_log2 to ENDIAN_LANE_N when we haven't
> checked whether it is a power of 2.  (0b00 or 0b11 ought to get simplified,
> but I don't think we can ignore the possibility.)
>
> Rather than restrict the pattern to pairs, could we instead handle
> VALL_F16 minus the QI elements, with the 16-bit elements restricted to
> TARGET_F16?  E.g. we should be able to handle V4SI using an FMOV of S
> registers if only the low element is nonzero.

Good point! I've addressed these in the latest version. Please let me know if I
missed anything.
https://gcc.gnu.org/pipermail/gcc-patches/2025-February/676106.html

Thanks,
Pengxuan

> Part of me thinks that this should just be described as a plain old AND, but I
> suppose that doesn't work well for FP modes.  Still, handling ANDs might be
> an interesting follow-up :)
>
> Thank
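P.S. The V4SI case mentioned above would look something like this at the
source level.  Again a hand-written sketch with a made-up function name, not
taken from the updated patch or its tests; the expectation is a single fmov of
the S register, since a scalar FMOV write zeroes the remaining bits of the
vector register:

    typedef int v4si __attribute__ ((vector_size (16)));

    /* Keep only the low lane; lanes 1-3 become zero.  */
    v4si
    clear_upper_lanes (v4si x)
    {
      v4si zeros = { 0, 0, 0, 0 };
      return __builtin_shuffle (x, zeros, (v4si) { 0, 4, 5, 6 });
    }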