https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103252
--- Comment #7 from Jason A. Donenfeld <jason at zx2c4 dot com> ---
The strange thing in this case is that the non-AVX-512 codegen _doesn't_ spill to memory. It just uses the GPRs that are available. So it seems that, somehow, the mere existence of the mask registers makes the register allocator lazier than usual, and the effects combine to produce suboptimal code.