https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99912
--- Comment #2 from Erik Schnetter <schnetter at gmail dot com> ---
I did not describe the scale of the issue. This is not a matter of just a few inefficient or unnecessary operations. The loop kernel (a single basic block) extends from address 0x1240 to 0xbf27 in the attached disassembled object file. Of the roughly 6000 instructions in the loop, 1000 are inefficient (and likely superfluous) moves that copy one 32-byte stack slot into another using 16-byte-wide copies. For example, the stack slot 9376(%rsp) is written 9 times in the loop kernel but read only once.
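
For anyone who wants to reproduce this kind of count on their own listing, here is a rough sketch of one way to tally stores and loads for a single stack slot. It is not the method used to obtain the numbers above. It assumes AT&T-syntax input with one instruction per line, already stripped of addresses and opcode bytes, and it treats the slot as written when it is the last (destination) operand and as read otherwise. The function name and the sample listing are made up for illustration.

```python
import re
from collections import Counter

# Split on commas that are not inside parentheses, so that an operand such as
# "(%rax,%rbx,8)" stays in one piece.
_OPERAND_SEP = re.compile(r",\s*(?![^()]*\))")


def count_slot_accesses(asm_text, slot):
    """Count reads and writes of one memory operand in AT&T-syntax assembly.

    Simplification (an assumption of this sketch): an instruction counts as a
    write when the slot is its last (destination) operand, and as a read
    otherwise. Read-modify-write and single-operand instructions are therefore
    not classified precisely.
    """
    counts = Counter(reads=0, writes=0)
    for line in asm_text.splitlines():
        parts = line.split(None, 1)  # mnemonic, then the operand string
        if len(parts) < 2:
            continue
        operands = [op.strip() for op in _OPERAND_SEP.split(parts[1])]
        if slot not in operands:
            continue
        if len(operands) > 1 and operands[-1] == slot:
            counts["writes"] += 1
        else:
            counts["reads"] += 1
    return counts


# Made-up listing, only to show the call; it is not taken from the attachment.
sample = """
vmovdqa %xmm0, 9376(%rsp)
vmovdqa 9344(%rsp), %xmm1
vmovdqa %xmm1, 9376(%rsp)
vmovapd 9376(%rsp), %ymm2
vaddpd (%rax,%rbx,8), %ymm2, %ymm3
"""
print(count_slot_accesses(sample, "9376(%rsp)"))
```

Note that the operand is matched as literal text, so the offset has to be written the same way as in the listing (a disassembler may print it in hexadecimal rather than decimal).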