Pengxuan Zheng <quic_pzh...@quicinc.com> writes:
> This patch optimizes certain vector permute expansion with the FMOV instruction
> when one of the input vectors is a vector of all zeros and the result of the
> vector permute is as if the upper lane of the non-zero input vector is set to
> zero and the lower lane remains unchanged.
>
> Note that the patch also propagates zero_op0_p and zero_op1_p during re-encode
> now.  They will be used by aarch64_evpc_fmov to check if the input vectors are
> valid candidates.
>
>       PR target/100165
>
> gcc/ChangeLog:
>
>       * config/aarch64/aarch64-protos.h (aarch64_lane0_mask_p): New.
>       * config/aarch64/aarch64-simd.md (@aarch64_simd_vec_set_zero_fmov<mode>):
>       New define_insn.
>       * config/aarch64/aarch64.cc (aarch64_lane0_mask_p): New.
>       (aarch64_evpc_reencode): Copy zero_op0_p and zero_op1_p.
>       (aarch64_evpc_fmov): New.
>       (aarch64_expand_vec_perm_const_1): Add call to aarch64_evpc_fmov.
>       * config/aarch64/iterators.md (VALL_F16_NO_QI): New mode iterator.
>
> gcc/testsuite/ChangeLog:
>
>       * gcc.target/aarch64/vec-set-zero.c: Update test accordingly.
>       * gcc.target/aarch64/fmov-1.c: New test.
>       * gcc.target/aarch64/fmov-2.c: New test.
>       * gcc.target/aarch64/fmov-3.c: New test.
>       * gcc.target/aarch64/fmov-be-1.c: New test.
>       * gcc.target/aarch64/fmov-be-2.c: New test.
>       * gcc.target/aarch64/fmov-be-3.c: New test.

Sorry to be awkward, but looking at this again, and going back to my
previous comment:

  Part of me thinks that this should just be described as a plain old AND,
  but I suppose that doesn't work well for FP modes.  Still, handling ANDs
  might be an interesting follow-up :)

I wonder whether we should model this as an AND after all.  That is,
any permute that blends a vector with zero can be interpreted as an AND
with a mask.  We could even provide a target-independent routine for
detecting that case.

At present:

v4hf
f_v4hf (v4hf x)
{
  return __builtin_shuffle (x, (v4hf){ 0, 0, 0, 0 }, (v4hi){ 4, 1, 6, 3 });
}

generates:

f_v4hf:
        uzp1    v0.2d, v0.2d, v0.2d
        adrp    x0, .LC0
        ldr     d31, [x0, #:lo12:.LC0]
        tbl     v0.8b, {v0.16b}, v31.8b
        ret
.LC0:
        .byte   -1
        .byte   -1
        .byte   2
        .byte   3
        .byte   -1
        .byte   -1
        .byte   6
        .byte   7

whereas with SVE enabled it could just be:

f_v4hf:
        and     z0.d, z0.d, #0xffff0000ffff0000
        ret

and even without SVE it would be:

f_v4hf:
        mvni    v31.2s, 0xff, msl 8
        and     v0.8b, v0.8b, v31.8b
        ret

Then, using fmov would be an optimisation of AND.

I think this would also simplify the evpc detection, since the requirement
for using AND is the same for big-endian and little-endian, namely that
index I of the result must either come from index I of the nonzero
vector or from any element of the zero vector.  (What differs between
big-endian and little-endian is which masks correspond to FMOV.)
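
As a rough illustration of that rule (a sketch only, not GCC code): the
helper names blend_with_zero_p and le_mask64 are hypothetical, the selector
is assumed to index the concatenation of the two inputs in the way
__builtin_shuffle does, and the mask layout assumes little-endian lane
numbering in a 64-bit vector.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper, not an existing GCC routine.  SEL holds NELT
   indices into the concatenation of two NELT-element vectors.  ZERO_OP1
   is true if the second input is the all-zeros vector and false if the
   first input is.  Return true if the permute is equivalent to an AND of
   the nonzero input with a mask, setting KEEP[I] when element I of the
   result is element I of the nonzero input.  */
static bool
blend_with_zero_p (const unsigned *sel, unsigned nelt, bool zero_op1,
                   bool *keep)
{
  for (unsigned i = 0; i < nelt; i++)
    {
      assert (sel[i] < 2 * nelt);
      bool from_op1 = sel[i] >= nelt;
      if (from_op1 == zero_op1)
        keep[i] = false;        /* Any element of the zero vector.  */
      else if (sel[i] % nelt == i)
        keep[i] = true;         /* Index I of the nonzero vector.  */
      else
        return false;           /* A real shuffle, not a blend.  */
    }
  return true;
}

/* Build the AND mask for a 64-bit vector, assuming little-endian lane
   numbering (element 0 in the least significant bits).  */
static uint64_t
le_mask64 (const bool *keep, unsigned nelt)
{
  unsigned bits = 64 / nelt;
  uint64_t lane = bits == 64 ? ~UINT64_C (0) : (UINT64_C (1) << bits) - 1;
  uint64_t mask = 0;
  for (unsigned i = 0; i < nelt; i++)
    if (keep[i])
      mask |= lane << (i * bits);
  return mask;
}
```

For the { 4, 1, 6, 3 } selector above, with the second input being zero,
this keeps elements 1 and 3 and gives the little-endian mask
0xffff0000ffff0000.  Only le_mask64 depends on endianness; the detection
loop itself does not.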

Sorry again for the run-around.

Richard
