https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65709
--- Comment #10 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
(In reply to Yann Collet from comment #9)
> Looking at the assembler generated, we see that GCC generates a MOVDQA
> instruction for it.
>
>     movdqa (%rdi,%rax,1),%xmm0
>
>     $rdi=0x7fffea4b53e6
>     $rax=0x0
>
> This seems wrong on 2 levels:
>
> - The function only wants to copy 8 bytes. MOVDQA works on a full SSE
> register, which is 16 bytes. This spells trouble, if only for buffer
> boundary checks: the algorithm uses 8 bytes because it knows it can
> safely read/write that size without crossing buffer limits. With 16
> bytes, there is no such guarantee.

The function has been inlined into its callers, producing loops like:

    do { LZ4_copy8(d,s); d+=8; s+=8; } while (d<e);

and this loop is then vectorized.  The vectorization prologue of course
has to adjust if s is not 16-byte aligned, but it can assume that both s
and d are 8-byte aligned (otherwise the behavior is undefined).  So, if
they aren't 8-byte aligned, you can get crashes etc.  The load is then
performed as an aligned load, because the vectorization prologue has
ensured it is aligned (unless the program has undefined behavior), while
the stores are done using movups, because the two pointers might not be
aligned the same way.

> - MOVDQA requires both positions to be aligned.
> I read it as being SSE-size aligned, which means 16-byte aligned.
> But they are not; these pointers are only supposed to be 8-byte aligned.
>
> (A bit off topic, but from a general perspective, I don't understand the
> use of MOVDQA, which requires such a strong alignment condition, when
> there is also MOVDQU available, which works at any memory address while
> suffering no performance penalty on aligned memory addresses.  MOVDQU
> looks like a better choice in every circumstance.)

On most CPUs there is a significant performance difference between the
two, even when MOVDQU is used with aligned addresses.
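
To make the undefined-behavior point concrete, here is a minimal sketch
of the pattern under discussion (the helper names and the cast-based body
are illustrative assumptions, not the exact lz4.c source): dereferencing
through uint64_t * promises the compiler that both pointers are 8-byte
aligned, while a memcpy-based copy makes no such promise.

    #include <stdint.h>
    #include <string.h>

    /* Cast-based copy: dereferencing a uint64_t * tells the compiler
       that both d and s are 8-byte aligned.  If they are not, the
       behavior is undefined, and the vectorizer's aligned loads
       (movdqa) can fault. */
    static void LZ4_copy8_cast(void *d, const void *s)
    {
        *(uint64_t *)d = *(const uint64_t *)s;
    }

    /* memcpy-based copy: no alignment promise is made, so the
       compiler must emit code that is correct for any address. */
    static void LZ4_copy8_memcpy(void *d, const void *s)
    {
        memcpy(d, s, 8);
    }

    /* The loop shape described above; after inlining, GCC vectorizes
       it and, for the cast-based variant, relies on the 8-byte
       alignment promise when peeling iterations to reach a 16-byte
       aligned movdqa. */
    static void wildCopy(void *dst, const void *src, void *dstEnd)
    {
        uint8_t *d = (uint8_t *)dst;
        const uint8_t *s = (const uint8_t *)src;
        uint8_t *const e = (uint8_t *)dstEnd;
        do { LZ4_copy8_cast(d, s); d += 8; s += 8; } while (d < e);
    }

With the cast-based variant, the prologue's peeling arithmetic is only
correct if the promised 8-byte alignment actually holds, which is exactly
the assumption being violated here.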
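
For the MOVDQA/MOVDQU distinction itself, a small SSE2 intrinsics sketch
(function names are hypothetical):

    #include <emmintrin.h>  /* SSE2 */

    /* Aligned load: compiles to movdqa; p must be 16-byte aligned,
       otherwise the CPU raises a general-protection fault. */
    static __m128i load_aligned(const void *p)
    {
        return _mm_load_si128((const __m128i *)p);
    }

    /* Unaligned load: compiles to movdqu; any address is accepted,
       but, as noted above, on most CPUs of this era it is measurably
       slower than movdqa even when the address happens to be
       16-byte aligned. */
    static __m128i load_unaligned(const void *p)
    {
        return _mm_loadu_si128((const __m128i *)p);
    }

That performance gap is why the vectorizer prefers the aligned form
whenever it can prove (or, as here, is entitled to assume) alignment.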