LTO slows down calculix by more than 10% on aarch64

Prathamesh Kulkarni via Gcc Wed, 26 Aug 2020 03:33:49 -0700

Hi,
We're seeing a consistent regression >10% on calculix with -O2 -flto vs -O2
on aarch64 in our validation CI. I tried to investigate this issue a
bit, and it seems the regression comes from inlining of orthonl into
e_c3d. Disabling that brings back the performance. However, inlining
orthonl into e_c3d increases its size from 3187 to 3837, i.e. by around
20.4% (the growth of 650 is 16.9% of the inlined size), which isn't too
large.


I have attached two test-cases: e_c3d.f, which has orthonl manually
inlined into e_c3d to "simulate" LTO's inlining, and e_c3d-orig.f,
which contains the unmodified function (gauss.f is included by
e_c3d.f). To reproduce, passing -O2 is sufficient.

It seems that inlining orthonl causes 20 hoistings into block 181,
which are then hoisted to block 173, in particular hoistings of
w(1, 1) ... w(3, 3), which weren't possible without inlining. The
hoistings happen because the basic block that computes orthonl at
line 672 uses w(1, 1) ... w(3, 3), and so does the following block at
line 1035 in e_c3d.f:

      senergy=
     &                    (s11*w(1,1)+s12*(w(1,2)+w(2,1))
     &                    +s13*(w(1,3)+w(3,1))+s22*w(2,2)
     &                    +s23*(w(2,3)+w(3,2))+s33*w(3,3))*weight
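
As a rough illustration of what code hoisting does, here is a
hypothetical C sketch. It is not code from e_c3d.f, and the names
before_hoisting, after_hoisting, check_hoisting_sketch, w, cond and s
are made up for this example. When every path leaving a block evaluates
the same expression, the pass computes it once in that block instead.
The result then stays live from the hoist point until its last use.

```c
#include <assert.h>

/* Before hoisting: both arms of the branch compute w[0] + w[1]. */
double before_hoisting(const double *w, int cond, double s)
{
    if (cond)
        return s * (w[0] + w[1]);
    return s - (w[0] + w[1]);
}

/* After hoisting: the common expression is computed once in the block
   that dominates both arms, so t is live across the branch. */
double after_hoisting(const double *w, int cond, double s)
{
    double t = w[0] + w[1];
    if (cond)
        return s * t;
    return s - t;
}

/* The transformation must not change results on either path. */
void check_hoisting_sketch(void)
{
    const double w[2] = { 1.0, 2.0 };
    assert(before_hoisting(w, 1, 2.0) == after_hoisting(w, 1, 2.0));
    assert(before_hoisting(w, 0, 2.0) == after_hoisting(w, 0, 2.0));
}
```

In this sketch only one value is hoisted; in the case described above
there are 20 such values, including the nine elements of w.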

Disabling hoisting into blocks 173 (and 181) brings back most of the
performance. However, I am not able to understand why (if?) these
hoistings of w(1, 1) ... w(3, 3) are causing the slowdown. Looking at
the assembly, the hot code-path from perf in e_c3d shows the following
code-gen diff.
For the inlined version:
.L122:
        ldr     d15, [x1, -248]
        add     w0, w0, 1
        add     x2, x2, 24
        add     x1, x1, 72
        fmul    d15, d17, d15
        fmul    d15, d15, d18
        fmul    d14, d15, d14
        fmadd   d16, d14, d31, d16
        cmp     w0, 4
        beq     .L121
        ldr     d14, [x2, -8]
        b       .L122

and for the non-inlined version:
.L118:
        ldr     d0, [x1, -248]
        add     w0, w0, 1
        ldr     d2, [x2, -8]
        add     x1, x1, 72
        add     x2, x2, 24
        fmul    d0, d3, d0
        fmul    d0, d0, d5
        fmul    d0, d0, d2
        fmadd   d1, d4, d0, d1
        cmp     w0, 4
        bne     .L118

which corresponds to the following loop at line 1014:
                                do n1=1,3
                                  s(iii1,jjj1)=s(iii1,jjj1)
     &                                  +anisox(m1,k1,n1,l1)
     &                                  *w(k1,l1)*vo(i1,m1)*vo(j1,n1)
     &                                  *weight
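
For readers matching the assembly against the source, here is a hedged
C transliteration of the n1 loop. It assumes anisox is dimensioned
(3,3,3,3) and vo is (3,3), stored column-major as Fortran does. Those
shapes are not stated in this mail, but they are consistent with the
72-byte and 24-byte pointer increments in both listings. The function
name inner_n1_loop, the flattened 0-based indexing and the scalar w_kl
(standing for w(k1,l1)) are choices made for this sketch.

```c
#include <assert.h>

/* Sketch of the n1 loop. Assumed shapes: anisox(3,3,3,3) and vo(3,3),
   flattened column-major into 0-based C arrays. */
double inner_n1_loop(double s, const double anisox[81], const double vo[9],
                     double w_kl, double weight,
                     int i1, int j1, int k1, int l1, int m1)
{
    /* All indices are 0-based here and must lie in 0..2. */
    assert(i1 >= 0 && i1 < 3 && j1 >= 0 && j1 < 3);
    assert(k1 >= 0 && k1 < 3 && l1 >= 0 && l1 < 3 && m1 >= 0 && m1 < 3);

    for (int n1 = 0; n1 < 3; n1++) {
        /* anisox(m1,k1,n1,l1): stepping n1 moves 9 doubles = 72 bytes. */
        double a = anisox[m1 + 3 * k1 + 9 * n1 + 27 * l1];
        /* vo(j1,n1): stepping n1 moves 3 doubles = 24 bytes. */
        double v = vo[j1 + 3 * n1];
        /* Same left-to-right product as the Fortran statement;
           vo(i1,m1), w(k1,l1) and weight do not change inside the loop. */
        s += a * w_kl * vo[i1 + 3 * m1] * v * weight;
    }
    return s;
}
```

Under these assumptions only two loads depend on n1, which would match
the two pointers (x1 and x2) advanced in each listing.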

I am not sure why hoisting would have any direct effect on this loop,
except perhaps that hoisting allocated more registers and led to
increased register pressure. Perhaps that's why it's using
higher-numbered regs for code-gen in the inlined version? However,
disabling hoisting in blocks 173 and 181 also leads to 6 extra spills
overall (counted by grepping for str to sp), so hoisting is also
helping here? I am not sure how to proceed further, and would be
grateful for suggestions.

Thanks,
Prathamesh

e_c3d.f
Description: Binary data

e_c3d-orig.f
Description: Binary data

gauss.f
Description: Binary data
