https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65709
--- Comment #10 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
(In reply to Yann Collet from comment #9)
> Looking at the assembler generated, we see that GCC generates a MOVDQA
> instruction for it.
>
>     movdqa (%rdi,%rax,1),%xmm0
>
>     $rdi=0x7fffea4b53e6
>     $rax=0x0
>
> This seems wrong on 2 levels:
>
> - The function only wants to copy 8 bytes. MOVDQA works on a full SSE
> register, which is 16 bytes. This spells trouble, if only for buffer
> boundary checks: the algorithm uses 8 bytes because it knows it can
> safely read/write that size without crossing buffer limits. With 16
> bytes, there is no such guarantee.

The function has been inlined into its callers, producing loops like:

    do { LZ4_copy8(d,s); d+=8; s+=8; } while (d<e);

and this loop is then vectorized.  The vectorization prologue of course
has to adjust if s is not 16-byte aligned, but it can assume that both s
and d are 8-byte aligned (otherwise the behavior is undefined).  So, if
they aren't 8-byte aligned, you can get crashes etc.  The load is then
performed as an aligned load, because the vectorization prologue has
ensured it is aligned (unless the program has undefined behavior), while
the stores are done using movups, because the two pointers might not be
aligned the same way.

> - MOVDQA requires both positions to be aligned.
> I read it as being SSE-size aligned, which means 16-byte aligned.
> But they are not; these pointers are only supposed to be 8-byte aligned.
>
> (A bit off topic, but from a general perspective, I don't understand the
> use of MOVDQA, which requires such a strong alignment condition, when
> there is also MOVDQU available, which works at any memory address while
> suffering no performance penalty on aligned memory addresses.  MOVDQU
> looks like a better choice in every circumstance.)

On most CPUs there is a significant performance difference between the
two, even when MOVDQU is used with aligned addresses.
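
To make the undefined-behavior point concrete, here is a minimal sketch
of the pattern under discussion (the helper names and the cast-based body
are illustrative assumptions, not the exact lz4.c source): dereferencing
through uint64_t * promises the compiler that both pointers are 8-byte
aligned, while a memcpy-based copy makes no such promise.

    #include <stdint.h>
    #include <string.h>

    /* Cast-based copy: dereferencing a uint64_t * tells the compiler
       that both d and s are 8-byte aligned.  If they are not, the
       behavior is undefined, and the vectorizer's aligned loads
       (movdqa) can fault. */
    static void LZ4_copy8_cast(void *d, const void *s)
    {
        *(uint64_t *)d = *(const uint64_t *)s;
    }

    /* memcpy-based copy: no alignment promise is made, so the
       compiler must emit code that is correct for any address. */
    static void LZ4_copy8_memcpy(void *d, const void *s)
    {
        memcpy(d, s, 8);
    }

    /* The loop shape described above; after inlining, GCC vectorizes
       it and, for the cast-based variant, relies on the 8-byte
       alignment promise when peeling iterations to reach a 16-byte
       aligned movdqa. */
    static void wildCopy(void *dst, const void *src, void *dstEnd)
    {
        uint8_t *d = (uint8_t *)dst;
        const uint8_t *s = (const uint8_t *)src;
        uint8_t *const e = (uint8_t *)dstEnd;
        do { LZ4_copy8_cast(d, s); d += 8; s += 8; } while (d < e);
    }

With the cast-based variant, the prologue's peeling arithmetic is only
correct if the promised 8-byte alignment actually holds, which is exactly
the assumption being violated here.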
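
For the MOVDQA/MOVDQU distinction itself, a small SSE2 intrinsics sketch
(function names are hypothetical):

    #include <emmintrin.h>  /* SSE2 */

    /* Aligned load: compiles to movdqa; p must be 16-byte aligned,
       otherwise the CPU raises a general-protection fault. */
    static __m128i load_aligned(const void *p)
    {
        return _mm_load_si128((const __m128i *)p);
    }

    /* Unaligned load: compiles to movdqu; any address is accepted,
       but, as noted above, on most CPUs of this era it is measurably
       slower than movdqa even when the address happens to be
       16-byte aligned. */
    static __m128i load_unaligned(const void *p)
    {
        return _mm_loadu_si128((const __m128i *)p);
    }

That performance gap is why the vectorizer prefers the aligned form
whenever it can prove (or, as here, is entitled to assume) alignment.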