https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77287
--- Comment #4 from Petr <kobalicek.petr at gmail dot com> ---

Adding -fschedule-insns is definitely a huge improvement in this case. I wonder why it isn't enabled by default at -O2 and -Os, since it really improves things and produces shorter output. Or is that just in this particular case?

Here is the assembly produced by gcc with -fschedule-insns:

    push ebp
    mov ebp, esp
    and esp, -32
    lea esp, [esp-32]
    mov ecx, DWORD PTR [ebp+8]
    mov edx, DWORD PTR [ebp+32]
    mov eax, DWORD PTR [ebp+36]
    vmovdqu ymm5, YMMWORD PTR [ecx]
    mov ecx, DWORD PTR [ebp+12]
    vmovdqu ymm3, YMMWORD PTR [edx]
    vmovdqu ymm6, YMMWORD PTR [eax]
    vmovdqu ymm2, YMMWORD PTR [ecx]
    mov ecx, DWORD PTR [ebp+28]
    vpackuswb ymm7, ymm2, ymm3
    vpaddw ymm7, ymm7, ymm2
    vpsubw ymm7, ymm7, ymm3
    vmovdqu ymm4, YMMWORD PTR [ecx]
    mov ecx, DWORD PTR [ebp+16]
    vpackuswb ymm0, ymm5, ymm4
    vpaddw ymm0, ymm0, ymm5
    vpsubw ymm0, ymm0, ymm4
    vmovdqu ymm1, YMMWORD PTR [ecx]
    vpackuswb ymm0, ymm0, ymm7
    mov ecx, DWORD PTR [ebp+20]
    vpackuswb ymm2, ymm1, ymm6
    vmovdqu ymm4, YMMWORD PTR [edx+32]
    vpaddw ymm1, ymm2, ymm1
    mov edx, DWORD PTR [ebp+24]
    vpsubw ymm1, ymm1, ymm6
    vmovdqu ymm5, YMMWORD PTR [ecx]
    vpackuswb ymm0, ymm0, ymm1
    vpackuswb ymm3, ymm5, ymm4
    vmovdqa YMMWORD PTR [esp], ymm3
    vmovdqu ymm2, YMMWORD PTR [eax+32]    ; LOOK HERE
    vpaddw ymm5, ymm5, YMMWORD PTR [esp]
    vmovdqu ymm3, YMMWORD PTR [edx]       ; AND HERE
    vpsubw ymm4, ymm5, ymm4
    vpackuswb ymm7, ymm3, ymm2
    vpackuswb ymm0, ymm0, ymm4
    vpaddw ymm3, ymm7, ymm3
    vpsubw ymm2, ymm3, ymm2
    vpackuswb ymm2, ymm0, ymm2
    vpextrd eax, xmm2, 1
    vzeroupper
    leave
    ret

This is already pretty close to clang's output. However, look at this part:

    vmovdqa YMMWORD PTR [esp], ymm3         ; Spill YMM3
    vmovdqu ymm2, YMMWORD PTR [eax+32]
    vpaddw ymm5, ymm5, YMMWORD PTR [esp]    ; Mem instead of YMM3?
    vmovdqu ymm3, YMMWORD PTR [edx]         ; Old YMM3 becomes dead here

The spill is completely unnecessary in our case, and it's the only reason why the prolog/epilog needs code to perform dynamic stack alignment.
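For illustration, here is a hand-edited sketch (not compiler output) of how that fragment could look without the spill. It assumes the allocator simply keeps the vpackuswb result in ymm3 until vpaddw has consumed it, which looks possible because nothing writes ymm3 between those two instructions:

    vpackuswb ymm3, ymm5, ymm4
    vmovdqu ymm2, YMMWORD PTR [eax+32]
    vpaddw ymm5, ymm5, ymm3                 ; Register operand, no stack slot
    vmovdqu ymm3, YMMWORD PTR [edx]         ; Old YMM3 becomes dead here

With no 32-byte stack slot in use, the "and esp, -32" / "lea esp, [esp-32]" pair in the prolog should no longer be needed either.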
I mean, if this one thing were eliminated, GCC would generate code basically comparable to clang's. But thanks for the -fschedule-insns hint, I didn't know about it.