https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102264
--- Comment #2 from Nicholai Tukanov <ntukanov at cmu dot edu> ---
(In reply to Andrew Pinski from comment #1)
> There seems to be some extra moves the register allocator cannot remove and
> that is causing some extra spilling.
>
> Your loop has 32 live variables and that is just at the limit.

Can the register allocator be modified to recognize the other registers
(zmm16-zmm31)? The problem seems limited to the compute instruction (vpdpwssd
in this case). I specifically chose 32 live variables to max out the registers.
Since the compute instruction gets limited to half of them (zmm0-zmm15), the
extra moves are killing the performance.