In order to perform faster loads from an unaligned memory location into an SSE
register, a common trick is to replace the default unaligned load instruction
(e.g., MOVUPS for floats) with one MOVSD followed by one MOVHPS. Using
intrinsics, this can be implemented as follows:

#include <emmintrin.h> // SSE/SSE2 intrinsics

inline __m128 ploadu(const float* from) {
  __m128 r;
  // load the two low floats (64 bits) with a single MOVSD...
  r = _mm_castpd_ps(_mm_load_sd((const double*)(from)));
  // ...then fill the two high floats with MOVHPS
  r = _mm_loadh_pi(r, (const __m64*)(from+2));
  return r;
}

Unfortunately, when optimizations are enabled (-O2), I found that GCC can
incorrectly reorder these instructions, leading to invalid code. For instance,
with the following example:

float data[4] = {1, 2, 3, 4};
__attribute__ ((aligned(16))) float aligned_data[4];
_mm_store_ps(aligned_data, ploadu(data));
std::cout << aligned_data[0] << " " << aligned_data[1] << " "
          << aligned_data[2] << " " << aligned_data[3] << "\n";

GCC generates the following ASM:

 movsd 32(%rsp), %xmm0
 movl  $0x40400000, 40(%rsp)
 movl  $0x40800000, 44(%rsp)
 movl  $0x3f800000, 32(%rsp)
 movhps  40(%rsp), %xmm0
 movl  $0x40000000, 36(%rsp)
 movaps  %xmm0, 16(%rsp)

where the MOVSD load is scheduled before the stores that initialize the
"data" array.

If we use the standard _mm_loadu_ps intrinsic instead, then the generated ASM
is obviously correct:

 movl  $0x3f800000, 32(%rsp)
 movl  $0x40000000, 36(%rsp)
 movl  $0x40400000, 40(%rsp)
 movl  $0x40800000, 44(%rsp)
 movups  32(%rsp), %xmm0
 movaps  %xmm0, 16(%rsp)
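
For reference, the _mm_loadu_ps-based variant compared against above boils
down to the following (the helper name is mine, only for illustration):

inline __m128 ploadu_ref(const float* from) {
  // plain unaligned load; here GCC correctly keeps the dependency on the
  // preceding stores
  return _mm_loadu_ps(from);
}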

Please see the attachment for a complete example.
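
As a side note, a workaround that seems to avoid the wrong scheduling is to
put a compiler-level memory barrier in front of the combined load, so that GCC
cannot hoist the MOVSD above the stores that initialize the source array. This
is only a sketch and I have not verified that it is a sufficient fix:

inline __m128 ploadu_workaround(const float* from) {
  __m128 r;
  // compiler-only barrier: tells GCC that memory may be read or written here,
  // which should keep the two loads below after any earlier stores
  asm volatile("" ::: "memory");
  r = _mm_castpd_ps(_mm_load_sd((const double*)(from)));
  r = _mm_loadh_pi(r, (const __m64*)(from+2));
  return r;
}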


-- 
           Summary: wrong instr. dependency with some SSE intrinsics
           Product: gcc
           Version: 4.3.2
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
        AssignedTo: unassigned at gcc dot gnu dot org
        ReportedBy: gael dot guennebaud at gmail dot com
 GCC build triplet: x86_64-pc-linux
  GCC host triplet: x86_64-pc-linux
GCC target triplet: x86_64-pc-linux


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40537
