Many of the NEON load/store instructions only allow addresses of the form:

    (reg rN)
    (post_inc (reg rN))
    (post_modify (reg rN) (reg rM))

with no reg+const alternative.  If vectorised code has several
consecutive loads, it's often better to use a series of post_incs such as:

    *r1++
    *r1++
    *r1++
    *r1++

rather than:

    *r1
    r2 = r1 + size
    *r2
    r3 = r1 + size * 2
    *r3
    r4 = r1 + size * 3
    *r4
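
As a made-up example (rather than one of the libav loops), the following
intrinsics code produces this kind of back-to-back load sequence when
built for NEON:

    #include <arm_neon.h>

    /* Hypothetical illustration only.  Each vld1q_f32 becomes a vld1.32,
       which only supports the address forms above, so without post_incs
       the compiler has to compute x + 4, x + 8 and x + 12 (in elements)
       into separate registers.  */
    float32x4_t
    sum4 (const float *x)
    {
      float32x4_t a = vld1q_f32 (x);
      float32x4_t b = vld1q_f32 (x + 4);
      float32x4_t c = vld1q_f32 (x + 8);
      float32x4_t d = vld1q_f32 (x + 12);
      return vaddq_f32 (vaddq_f32 (a, b), vaddq_f32 (c, d));
    }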

At the moment, auto-inc-dec.c only considers pairs of instructions,
so it can't optimise this kind of sequence.  The attached patch is a
WIP (but almost complete) attempt to handle longer sequences too.
It's a rewrite of the pass, so I've attached the C file rather than
a diff.
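
In the notation above, the pairwise transformation is roughly (register
numbers and the constant made up for illustration):

    (set (mem (reg r1)) (reg r2))
    (set (reg r1) (plus (reg r1) (const_int 16)))

      =>

    (set (mem (post_inc (reg r1))) (reg r2))

and the point of the rewrite is to let a whole chain of accesses off the
same base register share this treatment, rather than just one pair.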

The patch improves the performance of several libav loops on Cortex A8
by up to 8%.  On a popular but unnamable embedded benchmark suite,
it improves the score of three individual tests by 20%.

Like the current auto-inc-dec.c, the pass only considers single basic
blocks.  It might be interesting to relax that restriction in future,
but it wouldn't really help with the kind of cases that matter for NEON.

I haven't tried to implement anything as ambitious as the old
optimise-related-values pass, but I think the new pass structure
would make that kind of optimisation easier to add than it would
be at present.

Apart from the new cases above, the other main change is to use
insn_rtx_cost.  That only works well with an additional patch:

Index: gcc/gcc/rtlanal.c
===================================================================
--- gcc.orig/gcc/rtlanal.c
+++ gcc/gcc/rtlanal.c
@@ -4779,7 +4779,8 @@ insn_rtx_cost (rtx pat, bool speed)
   else
     return 0;
 
-  cost = rtx_cost (SET_SRC (set), SET, speed);
+  cost = (rtx_cost (SET_DEST (set), SET, speed)
+         + rtx_cost (SET_SRC (set), SET, speed));
   return cost > 0 ? cost : COSTS_N_INSNS (1);
 }
 
which, when I tested it on a few targets a couple of weeks ago, showed
some small CSiBE improvements.
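
To give an idea of how the costs get used, the kind of profitability
check I mean is along these lines (an illustrative sketch rather than
code from the attached file; the function name is made up):

    /* Sketch only, not taken from the attached pass: decide whether
       replacing MEM_INSN and ADD_INSN with a single instruction whose
       pattern is NEW_PAT looks profitable, using insn_rtx_cost.
       Assumes the usual rtl.h declarations.  */
    static bool
    attempt_profitable_p (rtx mem_insn, rtx add_insn, rtx new_pat,
                          bool speed)
    {
      int old_cost = (insn_rtx_cost (PATTERN (mem_insn), speed)
                      + insn_rtx_cost (PATTERN (add_insn), speed));
      int new_cost = insn_rtx_cost (new_pat, speed);

      return new_cost <= old_cost;
    }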

The new pass should still handle the cases that the current one does
and, thanks to value tracking, it can handle a few pairs that the
current one can't.

I also have some ARM patches to change the MEM rtx costs (address
writeback is expensive for core Cortex A8 instructions) and to model
address writeback in the Cortex A8 and A9 schedulers.

Using rtx costs does make things worse on some targets, which I think
is due to dubious MEM costs.  m68k is a particularly bad case, because
for -Os, it uses byte counts rather than COSTS_N_INSNS.  The
insn_rtx_cost code above:

   return cost > 0 ? cost : COSTS_N_INSNS (1);

then makes register moves very expensive: both sides of a
register-to-register SET have an rtx_cost of 0, so such moves fall
through to the COSTS_N_INSNS (1) fallback.
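
For reference, COSTS_N_INSNS is defined in rtl.h as:

    #define COSTS_N_INSNS(N) ((N) * 4)

so the fallback charges 4 units, whereas a typical m68k
register-to-register move is only 2 bytes.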

The new pass is still linear if you consider splay tree lookups to have
amortised constant complexity.  I've borrowed the splay-tree.c approach
to these lookups; the outcome of the question I asked on gcc@ recently
was that the current code was arrived at after some experimentation
(and after bad experiences with previous versions).

I didn't see any significant increase in compile time for an extreme
case like:

    #define A *x++ = 1;
    #define B A A A A A A A A
    #define C B B B B B B B B
    #define D C C C C C C C C
    #define E D D D D D D D D
    #define F E E E E E E E E
    void foo (volatile int *x) { E E E }

(each E expands to 4096 copies of "*x++ = 1;", so foo contains 12288 of
them; compilation is actually slightly quicker after the patch,
presumably because there are fewer instructions for later passes to
handle).

The new pass will take more memory than the old one, though.  What's the
best way of getting a figure?

Tested on arm-linux-gnueabi.  Thoughts?

Richard

Attachment: auto-inc-dec.c.bz2