https://gcc.gnu.org/bugzilla/show_bug.cgi?id=51017
--- Comment #24 from Richard Biener <rguenth at gcc dot gnu.org> ---

GCC 4.5 vs GCC 5 still shows GCC 4.5 is faster almost everywhere

Benchmarking: Traditional DES [128/128 BS SSE2-16]... DONE    Benchmarking: Traditional DES [128/128 BS SSE2-16]... DONE
Many salts: 3636K c/s real, 3636K c/s virtual               | Many salts: 3488K c/s real, 3488K c/s virtual
Only one salt: 3047K c/s real, 3047K c/s virtual            | Only one salt: 2896K c/s real, 2896K c/s virtual
Benchmarking: BSDI DES (x725) [128/128 BS SSE2-16]... DONE    Benchmarking: BSDI DES (x725) [128/128 BS SSE2-16]... DONE
Many salts: 127360 c/s real, 127360 c/s virtual             | Many salts: 108800 c/s real, 108800 c/s virtual
Only one salt: 124288 c/s real, 123057 c/s virtual          | Only one salt: 106112 c/s real, 106112 c/s virtual
Benchmarking: FreeBSD MD5 [32/64 X2]... DONE                  Benchmarking: FreeBSD MD5 [32/64 X2]... DONE
Raw: 15392 c/s real, 15392 c/s virtual                      | Raw: 15936 c/s real, 15936 c/s virtual
Benchmarking: OpenBSD Blowfish (x32) [32/64 X2]... DONE       Benchmarking: OpenBSD Blowfish (x32) [32/64 X2]... DONE
Raw: 900 c/s real, 900 c/s virtual                          | Raw: 892 c/s real, 892 c/s virtual
Benchmarking: Kerberos AFS DES [48/64 4K]... DONE             Benchmarking: Kerberos AFS DES [48/64 4K]... DONE
Short: 478208 c/s real, 473473 c/s virtual                  | Short: 476672 c/s real, 476672 c/s virtual
Long: 1470K c/s real, 1470K c/s virtual                     | Long: 1473K c/s real, 1473K c/s virtual
Benchmarking: LM DES [128/128 BS SSE2-16]... DONE             Benchmarking: LM DES [128/128 BS SSE2-16]... DONE
Raw: 16977K c/s real, 16977K c/s virtual                    | Raw: 14971K c/s real, 14971K c/s virtual
Benchmarking: generic crypt(3) [?/64]... DONE                 Benchmarking: generic crypt(3) [?/64]... DONE
Many salts: 362784 c/s real, 362784 c/s virtual             | Many salts: 296352 c/s real, 296352 c/s virtual
Only one salt: 361728 c/s real, 361728 c/s virtual          | Only one salt: 292182 c/s real, 295104 c/s virtual
Benchmarking: dummy [N/A]... DONE                             Benchmarking: dummy [N/A]... DONE
Raw: 60157K c/s real, 60157K c/s virtual                    | Raw: 53849K c/s real, 53316K c/s virtual

GCC 5 vs. GCC 6 shows some progress (and some small regressions), but not
for BSDI DES.

Benchmarking: Traditional DES [128/128 BS SSE2-16]... DONE    Benchmarking: Traditional DES [128/128 BS SSE2-16]... DONE
Many salts: 3488K c/s real, 3488K c/s virtual               | Many salts: 3446K c/s real, 3446K c/s virtual
Only one salt: 2896K c/s real, 2896K c/s virtual            | Only one salt: 2895K c/s real, 2895K c/s virtual
Benchmarking: BSDI DES (x725) [128/128 BS SSE2-16]... DONE    Benchmarking: BSDI DES (x725) [128/128 BS SSE2-16]... DONE
Many salts: 108800 c/s real, 108800 c/s virtual             | Many salts: 104934 c/s real, 105984 c/s virtual
Only one salt: 106112 c/s real, 106112 c/s virtual          | Only one salt: 103040 c/s real, 103040 c/s virtual
Benchmarking: FreeBSD MD5 [32/64 X2]... DONE                  Benchmarking: FreeBSD MD5 [32/64 X2]... DONE
Raw: 15936 c/s real, 15936 c/s virtual                      | Raw: 15864 c/s real, 15864 c/s virtual
Benchmarking: OpenBSD Blowfish (x32) [32/64 X2]... DONE       Benchmarking: OpenBSD Blowfish (x32) [32/64 X2]... DONE
Raw: 892 c/s real, 892 c/s virtual                          | Raw: 916 c/s real, 916 c/s virtual
Benchmarking: Kerberos AFS DES [48/64 4K]... DONE             Benchmarking: Kerberos AFS DES [48/64 4K]... DONE
Short: 476672 c/s real, 476672 c/s virtual                  | Short: 471808 c/s real, 471808 c/s virtual
Long: 1473K c/s real, 1473K c/s virtual                     | Long: 1449K c/s real, 1449K c/s virtual
Benchmarking: LM DES [128/128 BS SSE2-16]... DONE             Benchmarking: LM DES [128/128 BS SSE2-16]... DONE
Raw: 14971K c/s real, 14971K c/s virtual                    | Raw: 15917K c/s real, 15917K c/s virtual
Benchmarking: generic crypt(3) [?/64]... DONE                 Benchmarking: generic crypt(3) [?/64]... DONE
Many salts: 296352 c/s real, 296352 c/s virtual             | Many salts: 348096 c/s real, 348096 c/s virtual
Only one salt: 292182 c/s real, 295104 c/s virtual          | Only one salt: 347616 c/s real, 347616 c/s virtual
Benchmarking: dummy [N/A]... DONE                             Benchmarking: dummy [N/A]... DONE
Raw: 53849K c/s real, 53316K c/s virtual                    | Raw: 60114K c/s real, 60114K c/s virtual

Note that -fno-tree-pre no longer helps. With GCC 5/6 most intrinsics
are using a generic implementation and thus are transparent to the GIMPLE
middle-end apart from __builtin_ia32_pandn128 which is used by
_mm_andnot_si128.

What helps is -fno-tree-loop-im in addition to -fno-tree-pre so the
underlying issue is still that of register pressure it seems and it is
not really the loop-carried stuff we introduce but the excessive invariant
motion.

Benchmarking: BSDI DES (x725) [128/128 BS SSE2-16]... DONE
Many salts: 118528 c/s real, 117354 c/s virtual
Only one salt: 114944 c/s real, 114944 c/s virtual

        movq    DES_bs_all+18632(%rip), %rdi
        movq    DES_bs_all+18624(%rip), %rcx
        movq    DES_bs_all+18712(%rip), %rbp
        movq    DES_bs_all+18696(%rip), %r9
        movq    DES_bs_all+18688(%rip), %r10
        movq    DES_bs_all+18680(%rip), %r11
        movq    %rdi, 624(%rsp)
        movq    %rcx, 616(%rsp)
        movq    %rbp, 320(%rsp)

etc. - of course quite stupid to load sth and then spill it immediately...

We're also back to all unaligned loads/stores.
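
To illustrate the register-pressure point, here is a hypothetical, much-reduced C stand-in for a bitslice kernel. It is not code from the benchmarked program: the names bs_ctx, ctx, bs_init, andnot and mix_rounds are invented for this sketch, and andnot is only a scalar analogue of what _mm_andnot_si128 computes, (~a) & b. Every load through ctx.ks[i] inside the loop is loop-invariant, so invariant motion may hoist it in front of the loop. With four values they all fit in registers; a kernel referencing many more such values could plausibly end up with the load-then-spill pattern seen in the assembly, though that is an assumption about the cause rather than something this sketch proves.

```c
#include <assert.h>
#include <stdint.h>

#define NKEYS 4 /* hypothetical; a real kernel would reference far more */

/* Hypothetical stand-in for a large global context such as DES_bs_all. */
struct bs_ctx {
    uint64_t keys[NKEYS];      /* key material */
    const uint64_t *ks[NKEYS]; /* pointers into keys[], fixed after init */
};

struct bs_ctx ctx;

/* Scalar analogue of _mm_andnot_si128(a, b): (~a) & b. */
static inline uint64_t andnot(uint64_t a, uint64_t b)
{
    return ~a & b;
}

/* Fill the context with simple byte patterns 0x01.., 0x02.., 0x03.., 0x04.. */
void bs_init(void)
{
    for (int i = 0; i < NKEYS; i++) {
        ctx.keys[i] = 0x0101010101010101ULL * (uint64_t)(i + 1);
        ctx.ks[i] = &ctx.keys[i];
    }
}

uint64_t mix_rounds(uint64_t block, int rounds)
{
    assert(rounds >= 0);
    for (int r = 0; r < rounds; r++) {
        /* Nothing in the loop stores to ctx, so each ctx.ks[i] (and the
         * value it points to) is invariant and a candidate for hoisting. */
        block = andnot(block, *ctx.ks[0]) ^ *ctx.ks[1];
        block = andnot(*ctx.ks[2], block) ^ *ctx.ks[3];
    }
    return block;
}
```

One way to look at the effect is to compile this with gcc -O2 -S, once as-is and once with -fno-tree-pre -fno-tree-loop-im added, and compare where the loads from ctx end up. The outcome will vary with GCC version and target.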