Hi all,
We've received requests to optimise the attached intrinsics testcase.
We currently generate:
foo_1:
uaddlp v0.4s, v0.8h
uaddlv d31, v0.4s
fmov x0, d31
ret
foo_2:
uaddlp v0.4s, v0.8h
addv s31, v0.4s
fmov w0, s31
ret
foo_3:
saddlp v0.4s, v0.8h
addv s31, v0.4s
fmov w0, s31
ret
The widening pair-wise addition addlp instructions can be omitted if we're just
doing an ADDV afterwards.
Making this optimisation would be quite simple if we had a standard RTL PLUS
vector reduction code.
As we don't, we can use UNSPEC_ADDV as a stand in.
This patch expresses the SADDLV and UADDLV instructions as an UNSPEC_ADDV over
a widened input, thus removing
the need for separate UNSPEC_SADDLV and UNSPEC_UADDLV codes.
To optimise the testcases involved we add two splitters that match a vector
addition where all participating elements
are taken and widened from the same vector and then fed into an UNSPEC_ADDV. In
that case we can just remove the
vector PLUS and just emit the simple RTL for SADDLV/UADDLV.
Bootstrapped and tested on aarch64-none-linux-gnu.
Pushing to trunk.
Thanks,
Kyrill
gcc/ChangeLog:
* config/aarch64/aarch64-protos.h (aarch64_parallel_select_half_p):
Define prototype.
(aarch64_pars_overlap_p): Likewise.
* config/aarch64/aarch64-simd.md (aarch64_<su>addlv<mode>):
Express in terms of UNSPEC_ADDV.
(*aarch64_<su>addlv<VDQV_L:mode>_ze<GPI:mode>): Likewise.
(*aarch64_<su>addlv<mode>_reduction): Define.
(*aarch64_uaddlv<mode>_reduction_2): Likewise.
* config/aarch64/aarch64.cc (aarch64_parallel_select_half_p):
Define.
(aarch64_pars_overlap_p): Likewise.
* config/aarch64/iterators.md (UNSPEC_SADDLV, UNSPEC_UADDLV): Delete.
(VQUADW): New mode attribute.
(VWIDE2X_S): Likewise.
(USADDLV): Delete.
(su): Delete handling of UNSPEC_SADDLV, UNSPEC_UADDLV.
* config/aarch64/predicates.md (vect_par_cnst_select_half): Define.
gcc/testsuite/ChangeLog:
* gcc.target/aarch64/simd/addlv_1.c: New test.
addlv2.patch
Description: addlv2.patch
