https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63503
--- Comment #16 from Evandro <e.menezes at samsung dot com> ---
(In reply to Wilco from comment #15)
> Using -Ofast is not any different from -O3 -ffast-math when compiling
> non-Fortran code. As comment 10 shows, both loops are vectorized, however
> LLVM unrolls twice and uses multiple accumulators while GCC doesn't.

You're right.  LLVM produces:

.LBB0_1:                                // %vector.body
                                        // =>This Inner Loop Header: Depth=1
        add     x11, x9, x8
        add     x12, x10, x8
        ldp     q2, q3, [x11]
        ldp     q4, q5, [x12]
        add     x8, x8, #32             // =32
        fmla    v0.2d, v2.2d, v4.2d
        fmla    v1.2d, v3.2d, v5.2d
        cmp     x8, #128, lsl #12       // =524288
        b.ne    .LBB0_1

And GCC:

.L3:
        ldr     q2, [x2, x0]
        add     w1, w1, 1
        ldr     q1, [x3, x0]
        cmp     w1, w4
        add     x0, x0, 16
        fmla    v0.2d, v2.2d, v1.2d
        bcc     .L3

> I still don't see what this has to do with A57. You should open a generic
> bug about GCC not applying basic loop optimizations with -O3 (in fact
> limited unrolling is useful even for -O2).

Indeed, but I think that there's still a code-generation opportunity for A57
here.  Note above that LLVM loads the registers in pairs, while GCC, when it
does unroll the loop (more aggressively, BTW), loads each vector individually:

.L3:
        ldr     q28, [x15, x16]
        add     x17, x16, 16
        ldr     q29, [x14, x16]
        add     x0, x16, 32
        ldr     q30, [x15, x17]
        add     x18, x16, 48
        ldr     q31, [x14, x17]
        add     x1, x16, 64
        ...
        fmla    v27.2d, v28.2d, v29.2d
        ...
        fmla    v27.2d, v30.2d, v31.2d
        ...                             # Rest of 8x unroll
        bcc     .L3

It goes without saying that this code could also benefit from the
post-increment addressing mode.
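
For readers following along, here is a rough C sketch of the difference between the two vectorized loops above. The test case itself is not reproduced in this comment, so this assumes a dot-product-style reduction over doubles; the function names dot_single and dot_two_acc are made up for illustration. The first form keeps one running sum, as in the GCC output (a single v0 or v27 accumulator). The second keeps two independent sums, roughly what LLVM does with v0 and v1, so consecutive multiply-adds do not wait on each other. Splitting the sum reorders floating-point additions, which is why a compiler only does this on its own under -ffast-math or -Ofast.

```c
#include <stddef.h>

/* One accumulator: every multiply-add depends on the previous one. */
double dot_single(const double *a, const double *b, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += a[i] * b[i];
    return sum;
}

/* Two accumulators: two independent dependency chains, combined at the end.
   Hand-written equivalent of unrolling twice with accumulator splitting. */
double dot_two_acc(const double *a, const double *b, size_t n)
{
    double s0 = 0.0, s1 = 0.0;
    size_t i = 0;

    for (; i + 1 < n; i += 2) {
        s0 += a[i] * b[i];
        s1 += a[i + 1] * b[i + 1];
    }
    if (i < n)                  /* leftover element when n is odd */
        s0 += a[i] * b[i];

    return s0 + s1;
}
```

The two functions agree exactly only when no rounding occurs (small integer-valued inputs, for example); in general the results can differ in the last bits.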