Pengxuan Zheng <quic_pzh...@quicinc.com> writes:
> This patch optimizes certain vector permute expansion with the FMOV
> instruction when one of the input vectors is a vector of all zeros and the
> result of the vector permute is as if the upper lane of the non-zero input
> vector is set to zero and the lower lane remains unchanged.
>
> Note that the patch also propagates zero_op0_p and zero_op1_p during
> re-encode now.  They will be used by aarch64_evpc_fmov to check if the input
> vectors are valid candidates.
>
>       PR target/100165
>
> gcc/ChangeLog:
>
>       * config/aarch64/aarch64-protos.h (aarch64_lane0_mask_p): New.
>       * config/aarch64/aarch64-simd.md
>       (@aarch64_simd_vec_set_zero_fmov<mode>): New define_insn.
>       * config/aarch64/aarch64.cc (aarch64_lane0_mask_p): New.
>       (aarch64_evpc_reencode): Copy zero_op0_p and zero_op1_p.
>       (aarch64_evpc_fmov): New.
>       (aarch64_expand_vec_perm_const_1): Add call to aarch64_evpc_fmov.
>       * config/aarch64/iterators.md (VALL_F16_NO_QI): New mode iterator.
>
> gcc/testsuite/ChangeLog:
>
>       * gcc.target/aarch64/vec-set-zero.c: Update test accordingly.
>       * gcc.target/aarch64/fmov-1.c: New test.
>       * gcc.target/aarch64/fmov-2.c: New test.
>       * gcc.target/aarch64/fmov-3.c: New test.
>       * gcc.target/aarch64/fmov-be-1.c: New test.
>       * gcc.target/aarch64/fmov-be-2.c: New test.
>       * gcc.target/aarch64/fmov-be-3.c: New test.

Sorry to be awkward, but looking at this again, and going back to my
previous comment:

  Part of me thinks that this should just be described as a plain old AND,
  but I suppose that doesn't work well for FP modes.  Still, handling ANDs
  might be an interesting follow-up :)

I wonder whether we should model this as an AND after all.  That is, any
permute that blends a vector with zero can be interpreted as an AND with a
mask.  We could even provide a target-independent routine for detecting
that case.

At present:

  v4hf
  f_v4hf (v4hf x)
  {
    return __builtin_shuffle (x, (v4hf){ 0, 0, 0, 0 }, (v4hi){ 4, 1, 6, 3 });
  }

generates:

  f_v4hf:
          uzp1    v0.2d, v0.2d, v0.2d
          adrp    x0, .LC0
          ldr     d31, [x0, #:lo12:.LC0]
          tbl     v0.8b, {v0.16b}, v31.8b
          ret
  .LC0:
          .byte   -1
          .byte   -1
          .byte   2
          .byte   3
          .byte   -1
          .byte   -1
          .byte   6
          .byte   7

whereas with SVE enabled it could just be:

  f_v4hf:
          and     z0.d, z0.d, #0xffff0000ffff
          ret

and even without SVE it would be:

  f_v4hf:
          movi    v31.2s, 0xff, msl 8
          and     v0.8b, v0.8b, v31.8b
          ret

Then, using FMOV would be an optimisation of AND.

I think this would also simplify the evpc detection, since the requirement
for using AND is the same for big-endian and little-endian: index I of the
result must come either from index I of the nonzero vector or from any
element of the zero vector.  (What differs between big-endian and
little-endian is which masks correspond to FMOV.)

Sorry again for the run-around.

Richard
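
The AND condition described above (index I of the result comes either from index I of the nonzero vector or from any element of the zero vector) could be sketched roughly as follows. This is a minimal illustration in plain C, not GCC code and not part of the patch: blend_with_zero_p, its parameters and the keep array are hypothetical names, and the selector is assumed to follow the two-input __builtin_shuffle numbering. A real routine would work on GCC's own permute representation rather than raw arrays.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper (assumption, not an existing GCC function).

   SEL holds NELT permute indices in the two-input __builtin_shuffle
   convention: 0 .. NELT-1 select from the first input and
   NELT .. 2*NELT-1 select from the second input.  ZERO_OP1 is true if
   the second input is the all-zeros vector, false if the first is.

   Return true if the permute is a blend of the nonzero input with
   zero.  In that case KEEP[I] is set to true when result element I is
   element I of the nonzero input, and to false when it is zero.  */
bool
blend_with_zero_p (const unsigned *sel, size_t nelt, bool zero_op1,
                   bool *keep)
{
  for (size_t i = 0; i < nelt; i++)
    {
      unsigned idx = sel[i];
      if (idx >= 2 * nelt)
        return false;                 /* Out-of-range selector.  */

      bool from_op1 = idx >= nelt;
      if (from_op1 == zero_op1)
        keep[i] = false;              /* Any element of the zero vector.  */
      else if (idx % nelt == i)
        keep[i] = true;               /* Element I stays in place.  */
      else
        return false;                 /* A real shuffle, not a blend.  */
    }
  return true;
}
```

For the f_v4hf selector { 4, 1, 6, 3 } with the second input being zero, the sketch accepts the permute and marks elements 1 and 3 as kept. The check never looks at endianness, which matches the point that only the choice of FMOV-compatible masks is endian-dependent.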