On 04/06/18 18:40, Kyrill Tkachov wrote:
Hi all,

This patch adds support for generating LDPs and STPs of Q-registers. This allows for more compact code generation and makes better use of the ISA.

It's implemented in a straightforward way by allowing 16-byte modes in the sched-fusion machinery and adding appropriate peepholes in aarch64-ldpstp.md as well as the patterns themselves in aarch64-simd.md.

I didn't see any non-noise performance effect on SPEC2017 on Cortex-A72 and Cortex-A53.
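For readers who want a concrete picture, here is a minimal sketch of the kind of source that could benefit. It is not taken from the patch or its tests; the function name `copy_pair` and the `v4si` typedef are made up for illustration. It uses the GCC vector extension rather than arm_neon.h so that it compiles on any target. With the patch applied, an AArch64 compiler at -O2 may combine the two adjacent 128-bit loads into one `ldp` of Q-registers and the two stores into one `stp`, but whether that happens depends on scheduling and tuning, so treat it as an expectation rather than a guarantee.

```c
#include <stdint.h>

/* 128-bit vector of four 32-bit ints (GCC vector extension).
   Hypothetical typedef, chosen only for this illustration.  */
typedef int32_t v4si __attribute__ ((vector_size (16)));

/* Copy two adjacent 128-bit vectors from SRC to DST.
   The two loads and the two stores access consecutive 16-byte
   slots, which is the shape a load/store-pair peephole looks for.  */
void
copy_pair (v4si *dst, const v4si *src)
{
  v4si a = src[0];
  v4si b = src[1];
  dst[0] = a;
  dst[1] = b;
}
```

Compiling this with and without the patch and comparing the assembly (for example with -O2 -S) is a quick way to see whether the pair instructions are formed on a given configuration.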
Adding some folks who know more about other CPUs as well. Are you okay with enabling these instructions in AArch64? If you could give this a spin on some benchmarks you care about on your platforms it would be really useful data.

Thanks,
Kyrill
Bootstrapped and tested on aarch64-none-linux-gnu.

Ok for trunk?

Thanks,
Kyrill

2018-06-04  Kyrylo Tkachov  <kyrylo.tkac...@arm.com>

	* config/aarch64/aarch64.c (aarch64_mode_valid_for_sched_fusion_p):
	Allow 16-byte modes.
	(aarch64_classify_address): Allow 16-byte modes for load_store_pair_p.
	* config/aarch64/aarch64-ldpstp.md: Add peepholes for LDP STP of
	128-bit modes.
	* config/aarch64/aarch64-simd.md (load_pair<VQ:mode><VQ2:mode>):
	New pattern.
	(vec_store_pair<VQ:mode><VQ2:mode>): Likewise.
	* config/aarch64/iterators.md (VQ2): New mode iterator.

2018-06-04  Kyrylo Tkachov  <kyrylo.tkac...@arm.com>

	* gcc.target/aarch64/ldp_stp_q.c: New test.
	* gcc.target/aarch64/stp_vec_128_1.c: Likewise.