On Thu, Jun 07, 2018 at 05:58:01AM -0500, Kyrill Tkachov wrote: > > On 05/06/18 18:28, James Greenhalgh wrote: > > On Tue, Jun 05, 2018 at 11:32:06AM -0500, Kyrill Tkachov wrote: > >> On 04/06/18 18:40, Kyrill Tkachov wrote: > >>> Hi all, > >>> > >>> This patch adds support for generating LDPs and STPs of Q-registers. > >>> This allows for more compact code generation and makes better use of the > >>> ISA. > >>> > >>> It's implemented in a straightforward way by allowing 16-byte modes in the > >>> sched-fusion machinery and adding appropriate peepholes in > >>> aarch64-ldpstp.md > >>> as well as the patterns themselves in aarch64-simd.md. > >>> > >>> I didn't see any non-noise performance effect on SPEC2017 on Cortex-A72 > >>> and Cortex-A53. > >>> > >> Adding some folks who know more about other CPUs as well. > >> Are you okay with enabling these instructions in AArch64? > >> > >> If you could give this a spin on some benchmarks you > >> care about on your platforms it would be really useful data. > > From an architecture perspective, I think this is the right thing for us > > to do. Given the feedback from Andrew and Siddhesh I think we should support > > this patch, defaulting to on; but behind a tuning flag for those who want > > to disable it for their -mcpu tuning. > > > > If you can respin it behind a tuning parameter and give the community > > another 48 hours or so to respond, I think we'd have a good patch here. > > > > I'm also adding some more contributors to the AArch64 cores file for their > > thoughts on the proposal. > > Ok, here's an updated patch adding a new no_ldp_stp_qregs tuning flag. > I use it to restrict the peepholes in aarch64-ldpstp.md from merging the > operations together into PARALLELs. I also use it to restrict the sched fusion > check that brings such loads and stores together. This is enough to avoid > forming the pairs. > > Given Philipp's comments I've enabled this flag for xgene1_tunings so that > -mtune=xgene1 > will not generate these pairs, but every other tuning will by default. > > The tests are updated to explicitly pass -moverride=tune=none so that > they do not depend on a particular default -mcpu option. > This -moverride option disables all the tuning flags, so it disables > the "disabling" of LDP/STP-Q. > > I've manually confirmed that for -mcpu=xgene1 the pairs are not formed > and added a test that the new tuning flag does indeed disable the generation. > > Bootstrapped and tested on aarch64-none-linux-gnu. > What do folks think of this version?
OK for trunk. We can tune up other cores ready for the GCC 9 release; but no reason not to have this in now. If, as we get close to GCC 9 the consensus in our supported cores is to go back to GCC 8 behaviour, we can flick the switch to no_ldp_stp_qregs in generic, and cores which will benefit from this can turn it back on in their private tuning structures. Thanks, James > 2018-06-07 Kyrylo Tkachov <kyrylo.tkac...@arm.com> > > * config/aarch64/aarch64-tuning-flags.def (no_ldp_stp_qregs): New. > * config/aarch64/aarch64.c (xgene1_tunings): Add > AARCH64_EXTRA_TUNE_NO_LDP_STP_QREGS to tune_flags. > (aarch64_mode_valid_for_sched_fusion_p): > Allow 16-byte modes. > (aarch64_classify_address): Allow 16-byte modes for load_store_pair_p. > * config/aarch64/aarch64-ldpstp.md: Add peepholes for LDP STP of > 128-bit modes. > * config/aarch64/aarch64-simd.md (load_pair<VQ:mode><VQ2:mode>): > New pattern. > (vec_store_pair<VQ:mode><VQ2:mode>): Likewise. > * config/aarch64/iterators.md (VQ2): New mode iterator. > > 2018-06-07 Kyrylo Tkachov <kyrylo.tkac...@arm.com> > > * gcc.target/aarch64/ldp_stp_q.c: New test. > * gcc.target/aarch64/stp_vec_128_1.c: Likewise. > * gcc.target/aarch64/ldp_stp_q_disable.c: Likewise.