https://gcc.gnu.org/bugzilla/show_bug.cgi?id=118076
Richard Biener <rguenth at gcc dot gnu.org> changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
           Priority|P3                          |P2
           Keywords|                            |needs-bisection
                 CC|                            |hjl.tools at gmail dot com,
                   |                            |rguenth at gcc dot gnu.org
            Summary|extra memcpy for passing    |[12/13/14/15 Regression]
                   |large arguments in some     |extra memcpy for passing
                   |cases                       |large arguments in some
                   |                            |cases, introduces STLF
                   |                            |fails
             Target|                            |x86_64-*-*
   Target Milestone|---                         |12.5
--- Comment #4 from Richard Biener <rguenth at gcc dot gnu.org> ---
It's also very bad for performance, as the two DImode stores do not forward to
the TImode load (a store-to-load forwarding, STLF, failure).
The issue is that we fail to re-use the dead-after-call stack space for the
argument slot (or rather the other way around since the call needs appropriate
placement of the aggregate on the stack).
In some cases RTL opts are able to elide 's' and directly copy the registers
to the argument slot, but I guess the inline expanded block copy via XMM
confuses RTL opts here.
This is probably a regression (on x86-64) from when we started to use XMM
to populate aggregate argument slots.
GCC 11 used
        movq    %rdi, (%rsp)
        movq    %rsi, 8(%rsp)
        movq    %rdx, 16(%rsp)
        movq    %rcx, 24(%rsp)
        pushq   24(%rsp)
        .cfi_def_cfa_offset 56
        pushq   24(%rsp)
        .cfi_def_cfa_offset 64
        pushq   24(%rsp)
        .cfi_def_cfa_offset 72
        pushq   24(%rsp)
        .cfi_def_cfa_offset 80
        call    extern_func
and with GCC 12 we started using the bad sequence.
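
For reference, the kind of source that leads to the sequences discussed above
can be sketched roughly as follows. This is an assumption reconstructed from
the GCC 11 assembly (four integer registers stored to a 32-byte stack object
that is then pushed as the by-value argument), not the test case attached to
this report. The names 'struct big', 'caller' and 'last_seen', and the stub
body of 'extern_func', are hypothetical.

```c
#include <stdint.h>

/* Hypothetical 32-byte aggregate: larger than 16 bytes, so the x86-64
   SysV ABI passes it by value in memory (on the stack). */
struct big {
    uint64_t a, b, c, d;
};

/* Records the last aggregate received so the call can be checked. */
static struct big last_seen;

/* Stub standing in for the external callee (assumption: in a real
   reproducer this would only be declared here and defined elsewhere). */
__attribute__((noinline))
void extern_func(struct big s)
{
    last_seen = s;
}

/* The four scalars arrive in %rdi, %rsi, %rdx and %rcx; the local 's'
   is built from them and then has to end up in the outgoing argument
   slot for the call. */
__attribute__((noinline))
void caller(uint64_t a, uint64_t b, uint64_t c, uint64_t d)
{
    struct big s = { a, b, c, d };
    extern_func(s);
}
```

To compare code generation, one would presumably keep only a declaration of
'extern_func' in this file and inspect the output of 'gcc -O2 -S' with
different releases; whether this exact sketch reproduces the sequences quoted
above has not been verified.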