Many of the NEON load/store instructions only allow addresses of the form:

  (reg rN)
  (post_inc (reg rN))
  (post_modify (reg rN) (reg rM))

with no reg+const alternative.  If vectorised code has several consecutive
loads, it's often better to use a series of post_incs such as:

  *r1++
  *r1++
  *r1++
  *r1++

rather than:

  *r1
  r2 = r1 + size
  *r2
  r3 = r1 + size * 2
  *r3
  r4 = r1 + size * 3
  *r4
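For concreteness, here's a hypothetical source loop (my own illustration,
not one of the libav loops mentioned below) that tends to produce such
sequences: with NEON's four-wide float vectors, vectorising and fully
unrolling it gives four consecutive loads from each input pointer and four
consecutive stores to the output pointer, i.e. exactly the *r1++ pattern
above.

/* Hypothetical example: once vectorised and fully unrolled, the
   accesses to A, B and DST are issued back-to-back from their base
   registers, so each one is a candidate for post_inc addressing.  */
void
add16 (float *__restrict dst, const float *__restrict a,
       const float *__restrict b)
{
  int i;

  for (i = 0; i < 16; i++)
    dst[i] = a[i] + b[i];
}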
At the moment, auto-inc-dec.c only considers pairs of instructions, so it
can't optimise this kind of sequence.  The attached patch is a WIP (but
almost complete) attempt to handle longer sequences too.  It's a rewrite
of the pass, so I've attached the C file rather than a diff.

The patch improves the performance of several libav loops on Cortex A8 by
up to 8%.  On a popular but unnamable embedded benchmark suite, it
improves the score of three individual tests by 20%.

Like the current auto-inc-dec.c, the pass only considers single basic
blocks.  It might be interesting to relax that restriction in future, but
it wouldn't really help with the kind of cases that matter for NEON.
I haven't tried to implement anything as ambitious as the old
optimise-related-values pass, but I think the new pass structure would
make that kind of optimisation easier to add than it would be at present.

Apart from the new cases above, the other main change is to use
insn_rtx_cost.  That only works well with an additional patch:

Index: gcc/gcc/rtlanal.c
===================================================================
--- gcc.orig/gcc/rtlanal.c
+++ gcc/gcc/rtlanal.c
@@ -4779,7 +4779,8 @@ insn_rtx_cost (rtx pat, bool speed)
   else
     return 0;
 
-  cost = rtx_cost (SET_SRC (set), SET, speed);
+  cost = (rtx_cost (SET_DEST (set), SET, speed)
+          + rtx_cost (SET_SRC (set), SET, speed));
   return cost > 0 ? cost : COSTS_N_INSNS (1);
 }

which, when I tested it on a few targets a couple of weeks ago, showed
some small CSiBE improvements.

The new pass should still handle the cases that the current one does,
and, thanks to its value tracking, it can handle a few pairs that the
current one can't.  I also have some ARM patches to change the MEM rtx
costs (address writeback is expensive for core Cortex A8 instructions)
and to model address writeback in the Cortex A8 and A9 schedulers.

Using rtx costs does make things worse on some targets, which I think is
due to dubious MEM costs.  m68k is a particularly bad case, because for
-Os, it uses byte counts rather than COSTS_N_INSNS.  The insn_rtx_cost
code above:

  return cost > 0 ? cost : COSTS_N_INSNS (1);

then makes register moves very expensive.
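To make the units clash concrete, here's a minimal standalone sketch (not
GCC source; the zero rtx_cost for a bare register and the 2-byte figure
for a register-to-register move.l are illustrative assumptions):

/* Standalone sketch, not GCC source.  COSTS_N_INSNS matches the
   definition in gcc/rtl.h; the byte counts are illustrative.  */
#include <stdio.h>

#define COSTS_N_INSNS(N) ((N) * 4)

/* Mimics the fallback at the end of insn_rtx_cost quoted above.  */
static int
clamped_cost (int rtx_cost_result)
{
  return rtx_cost_result > 0 ? rtx_cost_result : COSTS_N_INSNS (1);
}

int
main (void)
{
  /* rtx_cost of a bare register is typically 0, so a plain
     register-to-register move falls through to the clamp ...  */
  printf ("clamped reg-move cost: %d\n", clamped_cost (0));
  /* ... which, read as a byte count under -Os, is twice the ~2 bytes
     that an actual move.l between data registers would occupy.  */
  printf ("assumed actual size: %d\n", 2);
  return 0;
}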
The new pass is still linear if you consider splay tree lookups to have
amortised linear complexity.  I've borrowed the splay-tree.c approach to
these lookups; the outcome of the question I asked on gcc@ recently was
that the current code was arrived at after some experimentation (and
after bad experiences with previous versions).

I didn't see any significant increase in compile time for an extreme case
like:

#define A *x++ = 1;
#define B A A A A A A A A
#define C B B B B B B B B
#define D C C C C C C C C
#define E D D D D D D D D
#define F E E E E E E E E

void
foo (volatile int *x)
{
  E E E
}

(it's actually slightly quicker after the patch, presumably because there
are fewer instructions for later passes to handle).  The new pass will
take more memory than the old pass though.  What's the best way of
getting a figure?

Tested on arm-linux-gnueabi.  Thoughts?

Richard

Attachment: auto-inc-dec.c.bz2 (BZip2 compressed data)