To perform faster loads from an unaligned memory location into an SSE register, a common trick is to replace the default unaligned load instruction (e.g., MOVUPS for floats) with one MOVSD followed by one MOVHPS. Using intrinsics, this can be implemented as follows:
inline __m128 ploadu(const float* from)
{
  __m128 r;
  r = _mm_castpd_ps(_mm_load_sd((double*)(from)));
  r = _mm_loadh_pi(r, (const __m64*)(from+2));
  return r;
}

Unfortunately, when optimizations are enabled (-O2), I found that GCC can incorrectly reorder the instructions, leading to invalid code. For instance, with the following example:

  float data[4] = {1, 2, 3, 4};
  __attribute__ ((aligned(16))) float aligned_data[4];
  _mm_store_ps(aligned_data, ploadu(data));
  std::cout << aligned_data[0] << " " << aligned_data[1] << " "
            << aligned_data[2] << " " << aligned_data[3] << "\n";

GCC generates the following ASM:

  movsd   32(%rsp), %xmm0
  movl    $0x40400000, 40(%rsp)
  movl    $0x40800000, 44(%rsp)
  movl    $0x3f800000, 32(%rsp)
  movhps  40(%rsp), %xmm0
  movl    $0x40000000, 36(%rsp)
  movaps  %xmm0, 16(%rsp)

where the MOVSD instruction is executed before the values of the array "data" have been set.

If we use the standard _mm_loadu_ps intrinsic instead, then the generated ASM is obviously correct:

  movl    $0x3f800000, 32(%rsp)
  movl    $0x40000000, 36(%rsp)
  movl    $0x40400000, 40(%rsp)
  movl    $0x40800000, 44(%rsp)
  movups  32(%rsp), %xmm0
  movaps  %xmm0, 16(%rsp)

Please see the attachment for a complete example.

-- 
           Summary: wrong instr. dependency with some SSE intrinsics
           Product: gcc
           Version: 4.3.2
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
        AssignedTo: unassigned at gcc dot gnu dot org
        ReportedBy: gael dot guennebaud at gmail dot com
 GCC build triplet: x86_64-pc-linux
  GCC host triplet: x86_64-pc-linux
GCC target triplet: x86_64-pc-linux


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40537
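For reference, here is a stand-alone version of the snippets above that should reproduce the problem on x86-64 with g++ -O2. This is only a sketch reconstructed from the inline snippets; the actual attachment may differ in details.

  // Stand-alone sketch reconstructed from the snippets above; the real
  // attachment may differ. Build on x86-64 with: g++ -O2
  #include <iostream>
  #include <emmintrin.h>

  inline __m128 ploadu(const float* from)
  {
    __m128 r;
    r = _mm_castpd_ps(_mm_load_sd((double*)(from)));
    r = _mm_loadh_pi(r, (const __m64*)(from+2));
    return r;
  }

  int main()
  {
    float data[4] = {1, 2, 3, 4};
    __attribute__ ((aligned(16))) float aligned_data[4];
    _mm_store_ps(aligned_data, ploadu(data));
    // With the reordering shown above, MOVSD reads data[0..1] before those
    // elements are stored, so the first two values printed are garbage
    // instead of "1 2 3 4".
    std::cout << aligned_data[0] << " " << aligned_data[1] << " "
              << aligned_data[2] << " " << aligned_data[3] << "\n";
    return 0;
  }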
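A possible workaround at the call site (an untested sketch on my side, reusing ploadu from above, not something I can confirm works on every GCC version) is to place a compiler-level memory barrier between the initialization of data and the call to ploadu, so that GCC cannot move the loads above the stores:

  // Possible workaround sketch (untested assumption): an empty asm with a
  // "memory" clobber tells GCC that any memory may be read or written here,
  // so the stores to data[] cannot be reordered with the loads in ploadu().
  void test_with_barrier()
  {
    float data[4] = {1, 2, 3, 4};
    asm volatile ("" : : : "memory");   // compiler barrier
    __attribute__ ((aligned(16))) float aligned_data[4];
    _mm_store_ps(aligned_data, ploadu(data));
    std::cout << aligned_data[0] << " " << aligned_data[1] << " "
              << aligned_data[2] << " " << aligned_data[3] << "\n";
  }

Since the (double*) and (__m64*) casts technically break the aliasing rules, building with -fno-strict-aliasing might also be worth trying, but I have not verified whether that is actually related to the reordering.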