https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97127
--- Comment #6 from Michael_S <already5chosen at yahoo dot com> ---
Why do you see it as the addition of a peephole pattern? I see it as a removal. Like, "do what's written in the source and don't try to be tricky". Probably I am too removed from how compilers work :(

Or maybe handle it at the level of instruction costs? I don't know how gcc works internally, but it seems that currently the cost of a register move [on Haswell and Skylake] is underestimated. Although it is true that a register move has no cost in terms of execution ports and latency, it still has the same cost as, say, an integer ALU instruction in terms of the front end and the renamer.

Also, as pointed out above by Alexander, the cost of FMA3 with (base+index) or (index*scale) memory operands could also be underestimated. Unlike Alexander, I am not sure that the difference between (base+index) and (base) is really what matters. IMO, the cost of FMA3 with *any* memory operand is underestimated, but I am not going to insist on that.

In an ideal world the compiler would reason as I do myself when coding in asm: estimate which resource is critical in the given loop and then try to reduce pressure on that particular resource. In this particular loop the critical resource appears to be the renamer, so the cost of an instruction should be seen as its cost at the renamer. In other situations the critical resource could be the throughput of a particular issue port. In yet another situation it could be the latency of the instructions that form a dependency chain across multiple iterations of the loop. The latter is especially common in "reduction" algorithms, of which dot product is the most common example.

A single instruction-cost value simply can't cover all these different cases in a satisfactory manner. Maybe gcc already has something like that; I don't know.
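To illustrate the reduction case mentioned above, here is a minimal sketch (my own, not from the bug report): in the naive dot product each iteration's multiply-add depends on the previous value of the accumulator, so loop throughput is bounded by the latency of that chain (the FMA latency, once gcc contracts the multiply and add), not by issue-port throughput. Splitting the sum across independent accumulators breaks the chain. The function names and the unroll factor of 4 are arbitrary choices for the example.

```c
#include <stddef.h>

/* Naive reduction: every iteration's multiply-add reads the `sum`
 * produced by the previous iteration, so the loop-carried dependency
 * chain (FMA latency, ~4-5 cycles on Haswell/Skylake) limits speed. */
double dot_naive(const double *a, const double *b, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += a[i] * b[i];
    return sum;
}

/* Four independent accumulators: four multiply-adds can be in flight
 * at once, so the bottleneck shifts from FMA latency to FMA issue
 * throughput. Note the summation order differs from dot_naive, so
 * results can differ slightly in floating point. */
double dot_unrolled(const double *a, const double *b, size_t n)
{
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += a[i + 0] * b[i + 0];
        s1 += a[i + 1] * b[i + 1];
        s2 += a[i + 2] * b[i + 2];
        s3 += a[i + 3] * b[i + 3];
    }
    double sum = (s0 + s1) + (s2 + s3);
    for (; i < n; i++)  /* handle the remaining 0-3 elements */
        sum += a[i] * b[i];
    return sum;
}
```

This is exactly the situation where a single per-instruction cost value fails: the same FMA is "cheap" in the unrolled loop and "expensive" in the naive one, because what matters is its position on the critical dependency chain.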