https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89071

--- Comment #5 from Peter Cordes <peter at cordes dot ca> ---
(In reply to H.J. Lu from comment #4)
> (In reply to Peter Cordes from comment #2)

> >  Can you show some
> > asm where this performs better?
> 
> Please try cvtsd2ss branch at:
> 
> https://github.com/hjl-tools/microbenchmark/
> 
> On Intel Core i7-6700K, I got

I have the same CPU.

> [hjl@gnu-skl-2 microbenchmark]$ make
> gcc -g -I.    -c -o test.o test.c
> gcc -g   -c -o sse.o sse.S
> gcc -g   -c -o sse-clear.o sse-clear.S
> gcc -g   -c -o avx.o avx.S
> gcc -g   -c -o avx2.o avx2.S
> gcc -g   -c -o avx-clear.o avx-clear.S
> gcc -o test test.o sse.o sse-clear.o avx.o avx2.o avx-clear.o
> ./test
> sse      : 24533145
> sse_clear: 24286462
> avx      : 64117779
> avx2     : 62186716
> avx_clear: 58684727
> [hjl@gnu-skl-2 microbenchmark]$

You forgot the RET at the end of the AVX functions (but not the SSE ones); the
AVX functions fall through into each other, then into __libc_csu_init before
jumping around and eventually returning.  That's why they're much slower. 
Single-step through the loop in GDB...

   │0x555555555660 <avx>                    vcvtsd2ss xmm0,xmm0,xmm1
  >│0x555555555664                          nop    WORD PTR cs:[rax+rax*1+0x0]
   │0x55555555566e                          xchg   ax,ax
   │0x555555555670 <avx2>                   vcvtsd2ss xmm0,xmm1,xmm1
   │0x555555555674                          nop    WORD PTR cs:[rax+rax*1+0x0]
   │0x55555555567e                          xchg   ax,ax
   │0x555555555680 <avx_clear>              vxorps xmm0,xmm0,xmm0
   │0x555555555684 <avx_clear+4>            vcvtsd2ss xmm0,xmm0,xmm1
   │0x555555555688                          nop    DWORD PTR [rax+rax*1+0x0]
   │0x555555555690 <__libc_csu_init>        endbr64
   │0x555555555694 <__libc_csu_init+4>      push   r15
   │0x555555555696 <__libc_csu_init+6>      mov    r15,rdx

And BTW, SSE vs. SSE_clear are about the same speed because your loop
bottlenecks on the store/reload latency of keeping a loop counter in memory
(because you compiled the C without optimization).  Plus, the C caller does
write-only loads into XMM0 and XMM1 every iteration, breaking any loop-carried
dependency chain the false dep would otherwise create.
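
For illustration, here is a minimal sketch of loops that would let a false
dependency show up, assuming GNU C on x86-64 and compiling with optimization
so the counter stays in a register.  The names cvt_merge and cvt_clear are
hypothetical, not taken from the microbenchmark repo, and timing is left to the
caller.  The compiler is free to pick different instructions than the ones the
comments suggest, so check the generated asm before trusting any numbers.

```c
#include <assert.h>

#if defined(__SSE2__) && defined(__GNUC__)
#include <emmintrin.h>

/* Merging form: every conversion takes the previous result as its
   merge operand, so the dependency is loop-carried through acc. */
float cvt_merge(double x, long iters)
{
    assert(iters > 0);
    __m128d src = _mm_set_sd(x);
    __m128 acc = _mm_setzero_ps();
    for (long i = 0; i < iters; i++) {
        acc = _mm_cvtsd_ss(acc, src);
        /* Empty asm: forces acc into a vector register and stops the
           compiler from collapsing or hoisting the conversion. */
        __asm__ volatile("" : "+x"(acc));
    }
    return _mm_cvtss_f32(acc);
}

/* Cleared form: the merge operand is a fresh zero every iteration,
   so no iteration depends on the previous one. */
float cvt_clear(double x, long iters)
{
    assert(iters > 0);
    __m128d src = _mm_set_sd(x);
    __m128 acc = _mm_setzero_ps();
    for (long i = 0; i < iters; i++) {
        acc = _mm_cvtsd_ss(_mm_setzero_ps(), src);
        __asm__ volatile("" : "+x"(acc));
    }
    return _mm_cvtss_f32(acc);
}
#else
/* Portable fallback so the sketch still builds elsewhere; it does not
   exercise cvtsd2ss and is useless for measurement. */
float cvt_merge(double x, long iters)
{
    assert(iters > 0);
    float acc = 0.0f;
    for (long i = 0; i < iters; i++)
        acc = (float)x;
    return acc;
}

float cvt_clear(double x, long iters)
{
    return cvt_merge(x, iters);
}
#endif
```

Both functions return (float)x; only the dependency pattern differs.  Timing
cvt_merge against cvt_clear for a large iters would be one way to compare the
two, under the assumptions above.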

I'm not sure why it makes a measurable difference to run the extra NOPs, and 3x
vcvtsd2ss instead of 1 for avx() vs. avx_clear(), because the C caller should
still be breaking dependencies for the AVX-128 instructions.

But whatever the effect is, it's totally unrelated to what you were *trying* to
test. :/
