https://gcc.gnu.org/bugzilla/show_bug.cgi?id=54349
Alexander Peslyak <solar-gcc at openwall dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |solar-gcc at openwall dot com

--- Comment #10 from Alexander Peslyak <solar-gcc at openwall dot com> ---
I confirm that this is fixed in 4.9. Since a lot of people are still using
pre-4.9 gcc and may stumble upon this bug, here's my experience with the bug
and with working around it:

The bug manifests itself worst when only a pre-SSE4.1 instruction set is
available (such as when compiling for x86_64 with no -m... options given), and
(at least for me) especially on AMD Bulldozer: fully working around the bug
gave over a 26% speedup in a plain SSE2 build of yescrypt with Ubuntu 12.04's
gcc 4.6.3 on an FX-8120. On Intel CPUs, the impact of the bug is typically 5%
to 10%. Enabling SSE4.1 (or AVX or better) mostly mitigates the bug, resulting
in intermediate or full speeds (varying by CPU), since "(v)pextrq $0," is then
generated, which is almost (but not exactly) as good as "(v)movq".

The suggested "-mtune=corei7" workaround works, but the option is only
recognized by gcc 4.6 and up (thus, it is only for versions 4.6.x to 4.8.x).
At the source file level, this works:

#if defined(__x86_64__) && \
    __GNUC__ == 4 && __GNUC_MINOR__ >= 6 && __GNUC_MINOR__ < 9
#pragma GCC target ("tune=corei7")
#endif

A related bug is that those versions of gcc, with that workaround, wrongly
generate "movd" (as in e.g. "movd %xmm0,%rax") instead of "movq". Luckily,
binutils primarily looks at the register names and silently corrects this
error (the disassembly shows "movq").
For a much wider range of gcc versions - 4.0 and up - this works:

#if defined(__x86_64__) && __GNUC__ == 4 && __GNUC_MINOR__ < 9
#ifdef __AVX__
#define MAYBE_V "v"
#else
#define MAYBE_V ""
#endif
#define _mm_cvtsi128_si64(x) ({ \
    uint64_t result; \
    __asm__(MAYBE_V "movq %1,%0" : "=r" (result) : "x" (x)); \
    result; \
})
#endif

A drawback of using inline asm for a single instruction is that it may hurt
gcc's instruction scheduling (gcc ends up unaware of the inlined instruction's
timings). However, on this specific occasion (with yescrypt) I am not seeing
any slowdown of such code compared to the "tune=corei7" approach, nor compared
to gcc 4.9+. It just works for me. Still, because of this concern, it might
be wise to combine the two approaches, resorting to inline asm only on
pre-4.6 gcc:

/* gcc before 4.9 would unnecessarily use store/load (without SSE4.1) or
 * (V)PEXTR (with SSE4.1 or AVX) instead of simply (V)MOV. */
#if defined(__x86_64__) && \
    __GNUC__ == 4 && __GNUC_MINOR__ >= 6 && __GNUC_MINOR__ < 9
#pragma GCC target ("tune=corei7")
#endif

#include <stdint.h>
#include <emmintrin.h>

#if defined(__x86_64__) && __GNUC__ == 4 && __GNUC_MINOR__ < 6
#ifdef __AVX__
#define MAYBE_V "v"
#else
#define MAYBE_V ""
#endif
#define _mm_cvtsi128_si64(x) ({ \
    uint64_t result; \
    __asm__(MAYBE_V "movq %1,%0" : "=r" (result) : "x" (x)); \
    result; \
})
#endif

Unfortunately, unlike the pure inline asm workaround, this relies on binutils
correcting the "movd" for gcc 4.6.x to 4.8.x. Oh well.

I've tested the combined workaround above on these gcc versions (and it
works):

4.0.0 4.1.0 4.1.2 4.2.0 4.2.4 4.3.0 4.3.6 4.4.0 4.4.1 4.4.2 4.4.3 4.4.4
4.4.5 4.4.6 4.5.0 4.5.3 4.6.0 4.6.2 4.7.0 4.7.4 4.8.0 4.8.4 4.9.0 4.9.2