https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63503

--- Comment #16 from Evandro <e.menezes at samsung dot com> ---
(In reply to Wilco from comment #15)
> Using -Ofast is not any different from -O3 -ffast-math when compiling
> non-Fortran code. As comment 10 shows, both loops are vectorized, however
> LLVM unrolls twice and uses multiple accumulators while GCC doesn't.

You're right.  LLVM produces:

.LBB0_1:                                // %vector.body
                                        // =>This Inner Loop Header: Depth=1
        add      x11, x9, x8
        add      x12, x10, x8
        ldp      q2, q3, [x11]
        ldp      q4, q5, [x12]
        add      x8, x8, #32             // =32
        fmla     v0.2d, v2.2d, v4.2d
        fmla     v1.2d, v3.2d, v5.2d
        cmp      x8, #128, lsl #12      // =524288
        b.ne    .LBB0_1

And GCC:

.L3:
        ldr     q2, [x2, x0]
        add     w1, w1, 1
        ldr     q1, [x3, x0]
        cmp     w1, w4
        add     x0, x0, 16
        fmla    v0.2d, v2.2d, v1.2d
        bcc     .L3
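
For reference, the difference between the two loops can be sketched at the
source level. This is only a scalar analogue, assuming the kernel is a dot
product over arrays of doubles (the test case itself is not quoted in this
comment); the function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical kernel with a single accumulator: every multiply-add
 * depends on the result of the previous one, as in the GCC loop. */
double dot_one_acc(const double *a, const double *b, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += a[i] * b[i];
    return sum;
}

/* Same kernel unrolled twice with two independent accumulators, as in
 * the LLVM loop: the two multiply-add chains can overlap in the pipeline.
 * For simplicity this sketch assumes n is even. */
double dot_two_acc(const double *a, const double *b, size_t n)
{
    assert(n % 2 == 0);
    double sum0 = 0.0, sum1 = 0.0;
    for (size_t i = 0; i < n; i += 2) {
        sum0 += a[i] * b[i];
        sum1 += a[i + 1] * b[i + 1];
    }
    /* The partial sums are combined once, after the loop. */
    return sum0 + sum1;
}
```

Splitting the sum reorders the floating-point additions, so the two versions
can round differently in general. That is why a compiler only makes this
transformation by itself under -ffast-math or -Ofast.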

> I still don't see what this has to do with A57. You should open a generic
> bug about GCC not applying basic loop optimizations with -O3 (in fact
> limited unrolling is useful even for -O2).

Indeed, but I think that there's still a code-generation opportunity for A57
here.

Note above that LLVM loads the registers in pairs, while GCC, when it unrolls
the loop (more aggressively, BTW), loads each vector individually:

.L3:
        ldr     q28, [x15, x16]
        add     x17, x16, 16
        ldr     q29, [x14, x16]
        add     x0, x16, 32
        ldr     q30, [x15, x17]
        add     x18, x16, 48
        ldr     q31, [x14, x17]
        add     x1, x16, 64
        ...
        fmla    v27.2d, v28.2d, v29.2d
        ...
        fmla    v27.2d, v30.2d, v31.2d
        ...     # Rest of 8x unroll
        bcc     .L3

It goes without saying that this code could also benefit from the
post-increment addressing mode.
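
As a rough source-level illustration of the access pattern that post-increment
addressing fits, the kernel can be written with pointers that are advanced
after each load. This is a sketch under the same assumption as before (a dot
product over doubles, hypothetical function name). Whether a given compiler
actually selects post-indexed loads for it depends on its induction-variable
and addressing-mode choices, so it is not guaranteed.

```c
#include <stddef.h>

/* Hypothetical kernel: both arrays are walked with pointers that are
 * bumped right after each load, so no separate index register or
 * address computation is needed in the source. */
double dot_ptr_bump(const double *a, const double *b, size_t n)
{
    const double *end = a + n;  /* one past the last element of a */
    double sum = 0.0;
    while (a != end)
        sum += *a++ * *b++;     /* load, then advance each pointer */
    return sum;
}
```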
