https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79390

--- Comment #4 from Richard Biener <rguenth at gcc dot gnu.org> ---
On trunk I see with -fno-split-paths:

.L5:
        movq    (%r14,%rdx,8), %rcx
        vmovsd  (%rcx,%rbx), %xmm0
        vandpd  %xmm3, %xmm0, %xmm0
        vucomisd        %xmm1, %xmm0
        jbe     .L4
        vmovapd %xmm0, %xmm1
        movl    %edx, %r9d
.L4:
        addq    $1, %rdx
        cmpq    %rdi, %rdx
        jne     .L5

so a conditional jump instead of the max/cmov sequence.  I wonder how this
sub-loop can account for a 10% performance difference...  the main part should
be the loop nest

            for (ii=j+1; ii<M; ii++)
            {
                double *Aii = A[ii];
                double *Aj = A[j];
                double AiiJ = Aii[j];
                int jj;
                for (jj=j+1; jj<N; jj++)
                  Aii[jj] -= AiiJ * Aj[jj];

            }

but I never profiled LU...
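
For reference, the assembly above reads like a partial-pivoting search: load
the row pointer, load the column-j element, mask off the sign bit (fabs),
compare against the running maximum, and conditionally update both the maximum
and the row index.  A minimal C sketch of such a loop follows.  The function
name find_pivot_row and its signature are hypothetical, chosen for
illustration only; the actual benchmark source may be laid out differently.

```c
#include <math.h>

/* Hypothetical helper (not taken from the benchmark source): returns the
 * index of the row in [j, M) whose column-j entry has the largest absolute
 * value.  A is an array of row pointers, as in the loop nest quoted above. */
int find_pivot_row(double **A, int M, int j)
{
    int jp = j;                /* best row so far (movl %edx, %r9d) */
    double t = fabs(A[j][j]);  /* running maximum (%xmm1 in the asm) */
    int i;

    for (i = j + 1; i < M; i++) {
        double ab = fabs(A[i][j]);  /* vmovsd + vandpd */
        if (ab > t) {               /* vucomisd + jbe skips the update */
            t = ab;
            jp = i;
        }
    }
    return jp;
}
```

Because the index update sits under the same condition as the maximum update,
the compiler has to choose between a branch (as in the listing above) and an
if-converted max/cmov form.  The search is a single pass over one column per
outer iteration, while the elimination nest touches the whole trailing
submatrix.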
