Hi,

We're seeing a consistent regression of more than 10% on calculix with -O2 -flto vs. -O2 on aarch64 in our validation CI. I investigated this a bit, and the regression seems to come from inlining orthonl into e_c3d; disabling that inlining brings back the performance. However, inlining orthonl into e_c3d increases its size from 3187 to 3837, i.e. by around 20.4%, which isn't too large.

I have attached two test cases: e_c3d.f, which has orthonl manually inlined into e_c3d to "simulate" LTO's inlining, and e_c3d-orig.f, which contains the unmodified function (gauss.f is included by e_c3d.f). To reproduce, passing -O2 is sufficient.

It seems that inlining orthonl causes 20 hoistings into block 181, which are then hoisted to block 173, in particular hoistings of w(1, 1) ... w(3, 3), which weren't possible without inlining. The hoistings happen because both the basic block that computes orthonl at line 672 and the following block at line 1035 in e_c3d.f use w(1, 1) ... w(3, 3):

      senergy=
     &  (s11*w(1,1)+s12*(w(1,2)+w(2,1))
     &  +s13*(w(1,3)+w(3,1))+s22*w(2,2)
     &  +s23*(w(2,3)+w(3,2))+s33*w(3,3))*weight

Disabling hoisting into block 173 (and 181) brings back most of the performance. However, I am not able to understand why (or if) these hoistings of w(1, 1) ... w(3, 3) are causing the slowdown.

Looking at the assembly, the hot code path reported by perf in e_c3d shows the following code-gen diff.

For the inlined version:

.L122:
        ldr     d15, [x1, -248]
        add     w0, w0, 1
        add     x2, x2, 24
        add     x1, x1, 72
        fmul    d15, d17, d15
        fmul    d15, d15, d18
        fmul    d14, d15, d14
        fmadd   d16, d14, d31, d16
        cmp     w0, 4
        beq     .L121
        ldr     d14, [x2, -8]
        b       .L122

and for the non-inlined version:

.L118:
        ldr     d0, [x1, -248]
        add     w0, w0, 1
        ldr     d2, [x2, -8]
        add     x1, x1, 72
        add     x2, x2, 24
        fmul    d0, d3, d0
        fmul    d0, d0, d5
        fmul    d0, d0, d2
        fmadd   d1, d4, d0, d1
        cmp     w0, 4
        bne     .L118

which corresponds to the following loop at line 1014:

      do n1=1,3
        s(iii1,jjj1)=s(iii1,jjj1)
     &    +anisox(m1,k1,n1,l1)
     &    *w(k1,l1)*vo(i1,m1)*vo(j1,n1)
     &    *weight

I am not sure why hoisting would have any direct effect on this loop, except perhaps that hoisting allocated more registers and led to increased register pressure. Perhaps that's why the inlined version uses higher-numbered registers in its code-gen? However, disabling hoisting in blocks 173 and 181 also leads to 6 extra spills overall (counted by grepping for str to sp), so hoisting is also helping here?

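The spill comparison above can be made repeatable with a small script. The following is only a minimal sketch: the function name count_sp_stores and the sample assembly are made up for illustration, and it assumes that every str/stp/stur whose base register is sp counts as a spill. That over-counts, since prologue saves of callee-saved registers match as well, so the number is mainly useful as a difference between two builds of the same function.

```python
import re

# Matches AArch64 stores whose address is based on sp, for example
# "str d15, [sp, 120]" or "stp x29, x30, [sp, -96]!".
# Assumption: any such store is treated as a spill. Byte/halfword
# stores (strb, strh) are deliberately not matched in this sketch.
SP_STORE = re.compile(r"^[ \t]*(?:str|stp|stur)[ \t]+[^\n]*\[sp\b", re.MULTILINE)


def count_sp_stores(asm_text: str) -> int:
    """Return how many str/stp/stur instructions store relative to sp."""
    return len(SP_STORE.findall(asm_text))


if __name__ == "__main__":
    # Hypothetical snippet, not taken from the attached test cases.
    sample = """
.L1:
        str     d15, [sp, 120]
        stp     x29, x30, [sp, -96]!
        ldr     d14, [x2, -8]
        str     d0, [x1, 8]
        ldr     d1, [sp, 16]
"""
    print(count_sp_stores(sample))
```

To compare the two variants, one could generate assembly for each (for instance with gfortran -O2 -S, assuming gfortran is the compiler in use), read both .s files, and subtract the two counts.
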
I am not sure how to proceed further, and would be grateful for suggestions.

Thanks,
Prathamesh
Attachments: e_c3d.f, e_c3d-orig.f, gauss.f