https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79390
--- Comment #4 from Richard Biener <rguenth at gcc dot gnu.org> ---
On trunk I see with -fno-split-paths:

.L5:
        movq    (%r14,%rdx,8), %rcx
        vmovsd  (%rcx,%rbx), %xmm0
        vandpd  %xmm3, %xmm0, %xmm0
        vucomisd        %xmm1, %xmm0
        jbe     .L4
        vmovapd %xmm0, %xmm1
        movl    %edx, %r9d
.L4:
        addq    $1, %rdx
        cmpq    %rdi, %rdx
        jne     .L5

so a jump vs. the max/cmov.  I wonder how this subloop can account for
10% of performance difference...  the main part should be the nest

            for (ii=j+1; ii<M; ii++)
            {
                double *Aii = A[ii];
                double *Aj = A[j];
                double AiiJ = Aii[j];
                int jj;

                for (jj=j+1; jj<N; jj++)
                    Aii[jj] -= AiiJ * Aj[jj];
            }

but I never profiled LU...
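
For reference, a minimal C sketch of the two loops being compared. The pivot search is reconstructed from the assembly above (load A[i], load A[i][j], mask off the sign bit, compare, conditionally update the maximum and its index), so its exact source form is an assumption; the names pivot_branch, pivot_select and eliminate are made up for illustration. The update nest is the one quoted in the comment.

```c
#include <assert.h>
#include <math.h>

/* Branchy pivot search: corresponds to the jbe-guarded update in the
 * listing.  Assumed shape, reconstructed from the assembly. */
int pivot_branch(double **A, int M, int j)
{
    assert(j >= 0 && j < M);
    int jp = j;
    double t = fabs(A[j][j]);
    for (int i = j + 1; i < M; i++) {
        double ab = fabs(A[i][j]);
        if (ab > t) {           /* both updates sit behind one branch */
            t = ab;
            jp = i;
        }
    }
    return jp;
}

/* Same search written with conditional expressions.  A compiler may
 * lower these to max/cmov, but that is not guaranteed. */
int pivot_select(double **A, int M, int j)
{
    assert(j >= 0 && j < M);
    int jp = j;
    double t = fabs(A[j][j]);
    for (int i = j + 1; i < M; i++) {
        double ab = fabs(A[i][j]);
        int gt = ab > t;
        jp = gt ? i : jp;
        t = gt ? ab : t;
    }
    return jp;
}

/* The update nest quoted in the comment, wrapped in a function. */
void eliminate(double **A, int M, int N, int j)
{
    for (int ii = j + 1; ii < M; ii++) {
        double *Aii = A[ii];
        double *Aj = A[j];
        double AiiJ = Aii[j];
        for (int jj = j + 1; jj < N; jj++)
            Aii[jj] -= AiiJ * Aj[jj];
    }
}
```

Per column j the pivot search runs M-j-1 iterations while the nest runs (M-j-1)*(N-j-1) inner iterations, which is why the nest would be expected to dominate unless the matrix is small.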