On 16/10/14 21:43, Andrew Pinski wrote:
On Thu, Oct 16, 2014 at 1:38 PM, Sebastian Pop <seb...@gmail.com> wrote:
Richard Biener wrote:

I have posted 5 patches as part of a larger series to merge
(parts) from the match-and-simplify branch.  While I think
there was overall consensus that the idea behind the project
is sound, there are technical questions left about how the
thing should look in the end.  I've raised them in 3/n,
which is the only patch of the series that contains any
patterns so far.

To reiterate here (as I expect most people will only look
at [0/n] patches ;)), the question is whether we are fine
with making fold-const (thus fold_{unary,binary,ternary})
no longer handle some cases it currently handles.

I have tested on aarch64 all the code on the match-and-simplify branch
against trunk as of the last merge at r216315:

2014-10-16  Richard Biener  <rguent...@suse.de>

         Merge from trunk r216235 through r216315.

Overall, I see more perf regressions (about 2/3 of the tests) than
improvements (about 1/3).  I will try to reduce the regressing tests.


For instance, saxpy regresses at -O3 on aarch64:

void saxpy(double* x, double* y, double* z) {
    int i;
    for (i = 0; i < ARRAY_SIZE; i++) {
        z[i] = x[i] + scalar * y[i];
    }
}
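
(ARRAY_SIZE and scalar are defined elsewhere in the benchmark and are not
shown in this thread.  As a sketch, definitions like the following, placed
before the function, would make the snippet self-contained; the values are
my assumptions, although the 800-byte loop bound in the assembly below
suggests an array of 100 doubles:

/* Assumed context, not from the actual benchmark.  */
#define ARRAY_SIZE 100   /* guess: "cmp x0, 800" = 100 doubles * 8 bytes */
double scalar = 3.0;     /* placeholder value */
)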

This looks like a scheduling issue rather than anything else.  The
scheduler model for the Cortex-A57 is not complete and does not model
some things, such as the fusion of compares and branches, which is most
likely what you are seeing.


Huh!!  How is that related to the code generation shown by Seb?

See the replacement of subs by cmp and sub.  Folding the cmp into other flag-setting instructions is a very useful optimization on ARM and AArch64, and that's what appears to be missing in fold-const.  That may be what's causing the slowdown.  I've never known that to be caused by any scheduler vagaries!
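
To make the point concrete, here is a sketch (not Seb's actual benchmark;
work() and countdown() are made-up names) of the kind of count-down loop
where this matters.  The decrement itself can set the condition flags for
the branch, but only while the exit test is still expressed against the
decremented value:

extern void work(void);        /* hypothetical loop body */

void countdown(int n) {
    do {
        work();
    } while (--n != 0);        /* test the decremented value */
}

In the fused form the backend emits a single "subs w4, w4, #1" followed by
"bne", as in base.s below; if the exit test has instead been rewritten as a
comparison against 1 before the subtraction, the flags must come from a
separate compare, giving the "cmp w4, 1" / "sub w4, w4, #1" / "bne"
sequence seen in mas.s.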

regards
Ramana

Thanks,
Andrew Pinski


$ diff -u base.s mas.s
--- base.s      2014-10-16 15:30:15.351430000 -0500
+++ mas.s       2014-10-16 15:30:16.183035000 -0500
@@ -2,12 +2,14 @@
         add     x1, x2, 800
         ldr     q0, [x0, x2]
         add     x3, x2, 1600
+       cmp     x0, 784
         ldr     q1, [x0, x1]
+       add     x1, x0, 16
         fmla    v0.2d, v1.2d, v2.2d
         str     q0, [x0, x3]
-       add     x0, x0, 16
-       cmp     x0, 800
+       mov     x0, x1
         bne     .L140
  .LBE179:
-       subs    w4, w4, #1
+       cmp     w4, 1
+       sub     w4, w4, #1
         bne     .L139

Thanks,
Sebastian
