https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89071
--- Comment #5 from Peter Cordes <peter at cordes dot ca> ---
(In reply to H.J. Lu from comment #4)
> (In reply to Peter Cordes from comment #2)
> > Can you show some
> > asm where this performs better?
>
> Please try cvtsd2ss branch at:
>
> https://github.com/hjl-tools/microbenchmark/
>
> On Intel Core i7-6700K, I got

I have the same CPU.

> [hjl@gnu-skl-2 microbenchmark]$ make
> gcc -g -I. -c -o test.o test.c
> gcc -g -c -o sse.o sse.S
> gcc -g -c -o sse-clear.o sse-clear.S
> gcc -g -c -o avx.o avx.S
> gcc -g -c -o avx2.o avx2.S
> gcc -g -c -o avx-clear.o avx-clear.S
> gcc -o test test.o sse.o sse-clear.o avx.o avx2.o avx-clear.o
> ./test
> sse      : 24533145
> sse_clear: 24286462
> avx      : 64117779
> avx2     : 62186716
> avx_clear: 58684727
> [hjl@gnu-skl-2 microbenchmark]$

You forgot the RET at the end of the AVX functions (but not the SSE ones). The AVX functions fall through into each other, then into __libc_csu_init before jumping around and eventually returning. That's why they're much slower. Single-step through the loop in GDB...

    │0x555555555660 <avx>               vcvtsd2ss xmm0,xmm0,xmm1
   >│0x555555555664                     nop    WORD PTR cs:[rax+rax*1+0x0]
    │0x55555555566e                     xchg   ax,ax
    │0x555555555670 <avx2>              vcvtsd2ss xmm0,xmm1,xmm1
    │0x555555555674                     nop    WORD PTR cs:[rax+rax*1+0x0]
    │0x55555555567e                     xchg   ax,ax
    │0x555555555680 <avx_clear>         vxorps xmm0,xmm0,xmm0
    │0x555555555684 <avx_clear+4>       vcvtsd2ss xmm0,xmm0,xmm1
    │0x555555555688                     nop    DWORD PTR [rax+rax*1+0x0]
    │0x555555555690 <__libc_csu_init>   endbr64
    │0x555555555694 <__libc_csu_init+4> push   r15
    │0x555555555696 <__libc_csu_init+6> mov    r15,rdx

And BTW, SSE vs. SSE_clear are about the same speed because your loop bottlenecks on the store/reload latency of keeping a loop counter in memory (because you compiled the C without optimization). Plus, the C caller loads write-only into XMM0 and XMM1 every iteration, breaking any loop-carried dependency the false dep would create.

I'm not sure why it makes a measurable difference to run the extra NOPs and 3x vcvtsd2ss instead of 1 for avx() vs. avx_clear(), because the C caller should still be breaking dependencies for the AVX-128 instructions. But whatever the effect is, it's totally unrelated to what you were *trying* to test. :/
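
To illustrate the loop-counter point made above, here is a minimal C sketch, not taken from the microbenchmark repository. It uses `volatile` as a stand-in for what unoptimized code does: the counter is loaded from and stored to memory on every iteration. All names here (`kernel_fn`, `dummy_kernel`, `run_counter_in_memory`, `run_counter_in_register`) are hypothetical, and `dummy_kernel` is only a placeholder for the function being measured.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the function under test (e.g. sse() or avx()). */
typedef void (*kernel_fn)(void);

/* Counts invocations so the harness itself can be sanity-checked. */
uint64_t kernel_calls;
void dummy_kernel(void) { kernel_calls++; }

/* Counter lives in memory: `volatile` forces a load and a store of `i`
 * on every iteration, much like code compiled without optimization. */
uint64_t run_counter_in_memory(kernel_fn kernel, uint64_t iterations)
{
    assert(kernel != NULL);
    volatile uint64_t i;
    for (i = 0; i < iterations; i++)
        kernel();
    return i;
}

/* Counter is an ordinary local: with optimization enabled the compiler
 * is free to keep it in a register, so no store/reload chain is forced. */
uint64_t run_counter_in_register(kernel_fn kernel, uint64_t iterations)
{
    assert(kernel != NULL);
    uint64_t i;
    for (i = 0; i < iterations; i++)
        kernel();
    return i;
}
```

Timing both variants around the same kernel (for example with `clock_gettime` or `perf stat`, compiled with optimization) is one way to check whether the loop overhead, rather than the instruction under test, dominates the measurement.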