https://gcc.gnu.org/bugzilla/show_bug.cgi?id=118076
Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Priority|P3                          |P2
           Keywords|                            |needs-bisection
                 CC|                            |hjl.tools at gmail dot com,
                   |                            |rguenth at gcc dot gnu.org
            Summary|extra memcpy for passing    |[12/13/14/15 Regression]
                   |large arguments in some     |extra memcpy for passing
                   |cases                       |large arguments in some
                   |                            |cases, introduces STLF
                   |                            |fails
             Target|                            |x86_64-*-*
   Target Milestone|---                         |12.5

--- Comment #4 from Richard Biener <rguenth at gcc dot gnu.org> ---
It's also very bad for performance as the two DImode stores do not forward
to the TImode load.

The issue is that we fail to re-use the dead-after-call stack space for the
argument slot (or rather the other way around since the call needs
appropriate placement of the aggregate on the stack).  In some cases RTL
opts are able to elide 's' and directly copy the registers to the argument
slot, but I guess the inline expanded block copy via XMM confuses RTL ops
here.

This is probably a regression (on x86-64) for when we started to use XMM
to populate aggregate argument slots.  GCC 11 used

        movq    %rdi, (%rsp)
        movq    %rsi, 8(%rsp)
        movq    %rdx, 16(%rsp)
        movq    %rcx, 24(%rsp)
        pushq   24(%rsp)
        .cfi_def_cfa_offset 56
        pushq   24(%rsp)
        .cfi_def_cfa_offset 64
        pushq   24(%rsp)
        .cfi_def_cfa_offset 72
        pushq   24(%rsp)
        .cfi_def_cfa_offset 80
        call    extern_func

and with GCC 12 we started using the bad sequence.
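
For readers without the attached test case at hand, the pattern under discussion can be sketched roughly as follows. This is a guess at the shape of the code, not the reporter's actual test case: the struct name `large`, the function `caller`, and the global `observed_sum` are hypothetical, and `extern_func` is given a local noinline body here only so that the sketch links and runs on its own. The assumption is a 32-byte aggregate built from four register arguments, which matches the four stores and four pushes in the GCC 11 listing.

```c
/* Hypothetical reduced example; names other than extern_func are
   illustrative and do not come from the bug report.  */

/* Four longs: 32 bytes on x86-64, so the x86-64 SysV ABI passes this
   aggregate in memory (in the caller's outgoing argument area).  */
struct large { long a, b, c, d; };

/* Records what the callee saw, so the example has observable output.  */
long observed_sum;

/* Stand-in for the report's extern_func.  In the report it is external;
   noinline is used here (an assumption) to keep a real call.  */
__attribute__((noinline)) void extern_func(struct large s)
{
    observed_sum = s.a + s.b + s.c + s.d;
}

/* 's' is populated from the incoming register arguments and is dead
   after the call, so its stack slot could in principle serve directly
   as the argument slot instead of being copied.  */
void caller(long a, long b, long c, long d)
{
    struct large s = { a, b, c, d };
    extern_func(s);
}
```

Compiling such a file with `gcc -O2 -S` under different GCC releases is one way to compare how the argument slot is populated; the exact options used for the listings in this report are not stated in the comment.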