https://gcc.gnu.org/bugzilla/show_bug.cgi?id=54349
Alexander Peslyak <solar-gcc at openwall dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |solar-gcc at openwall dot com

--- Comment #10 from Alexander Peslyak <solar-gcc at openwall dot com> ---
I confirm that this is fixed in 4.9. Since a lot of people are still using
pre-4.9 gcc and may stumble upon this bug, here's my experience with the bug
and with working around it:

The bug manifests itself worst when only a pre-SSE4.1 instruction set is
available (such as when compiling for x86_64 with no -m... options given), and
(at least for me) especially on AMD Bulldozer: fully working around the bug
gave over a 26% speedup in a plain SSE2 build of yescrypt with Ubuntu 12.04's
gcc 4.6.3 on an FX-8120. On Intel CPUs, the impact of the bug is typically 5%
to 10%. Enabling SSE4.1 (or AVX or better) mostly mitigates the bug, resulting
in intermediate or full speeds (varying by CPU), since "(v)pextrq $0," is then
generated, which is almost (but not exactly) as good as "(v)movq".

The suggested "-mtune=corei7" workaround works, but the option is only
recognized by gcc 4.6 and up (thus, it is only for versions 4.6.x to 4.8.x).
At the source file level, this works:

#if defined(__x86_64__) && \
    __GNUC__ == 4 && __GNUC_MINOR__ >= 6 && __GNUC_MINOR__ < 9
#pragma GCC target ("tune=corei7")
#endif

A related bug is that those versions of gcc, with that workaround, wrongly
generate "movd" (as in e.g. "movd %xmm0,%rax") instead of "movq". Luckily,
binutils primarily looks at the register names and silently corrects this
error (the disassembly shows "movq").
For a much wider range of gcc versions - 4.0 and up - this works:

#if defined(__x86_64__) && __GNUC__ == 4 && __GNUC_MINOR__ < 9
#ifdef __AVX__
#define MAYBE_V "v"
#else
#define MAYBE_V ""
#endif
#define _mm_cvtsi128_si64(x) ({ \
    uint64_t result; \
    __asm__(MAYBE_V "movq %1,%0" : "=r" (result) : "x" (x)); \
    result; \
})
#endif

A drawback of using inline asm for a single instruction is that it may hurt
gcc's instruction scheduling (gcc ends up unaware of the inlined instruction's
timings). However, on this specific occasion (with yescrypt) I am not seeing
any slowdown of such code compared to the "tune=corei7" approach, nor compared
to gcc 4.9+. It just works for me. Still, because of this concern, it might
be wise to combine the two approaches, resorting to inline asm only on
pre-4.6 gcc:

/* gcc before 4.9 would unnecessarily use store/load (without SSE4.1) or
 * (V)PEXTR (with SSE4.1 or AVX) instead of simply (V)MOV. */
#if defined(__x86_64__) && \
    __GNUC__ == 4 && __GNUC_MINOR__ >= 6 && __GNUC_MINOR__ < 9
#pragma GCC target ("tune=corei7")
#endif

#include <stdint.h>
#include <emmintrin.h>

#if defined(__x86_64__) && __GNUC__ == 4 && __GNUC_MINOR__ < 6
#ifdef __AVX__
#define MAYBE_V "v"
#else
#define MAYBE_V ""
#endif
#define _mm_cvtsi128_si64(x) ({ \
    uint64_t result; \
    __asm__(MAYBE_V "movq %1,%0" : "=r" (result) : "x" (x)); \
    result; \
})
#endif

Unfortunately, unlike the pure inline asm workaround, this relies on binutils
correcting the "movd" for gcc 4.6.x to 4.8.x. Oh well.

I've tested the combined workaround above on these gcc versions (and it
works):

4.0.0 4.1.0 4.1.2 4.2.0 4.2.4 4.3.0 4.3.6 4.4.0 4.4.1 4.4.2 4.4.3 4.4.4
4.4.5 4.4.6 4.5.0 4.5.3 4.6.0 4.6.2 4.7.0 4.7.4 4.8.0 4.8.4 4.9.0 4.9.2