https://gcc.gnu.org/bugzilla/show_bug.cgi?id=123631
Bug ID: 123631
Summary: Odd choice for vector constant materialization
Product: gcc
Version: 16.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: rguenth at gcc dot gnu.org
Target Milestone: ---
I'm seeing
void foo (int *q)
{
  q[0] = 10;
  q[1] = 10;
  q[2] = 10;
  q[3] = 10;
}
with -march=znver2
   0:   b8 0a 00 00 00          mov    $0xa,%eax
   5:   c5 f9 6e c0             vmovd  %eax,%xmm0
   9:   c4 e2 79 58 c0          vpbroadcastd %xmm0,%xmm0
   e:   c5 fa 7f 07             vmovdqu %xmm0,(%rdi)
and -march=znver4
   0:   b8 0a 00 00 00          mov    $0xa,%eax
   5:   62 f2 7d 08 7c c0       vpbroadcastd %eax,%xmm0
   b:   c5 fa 7f 07             vmovdqu %xmm0,(%rdi)
which are both larger (18 and 15 bytes of code) than the 12-byte sequence
used for a non-uniform vector constant, which is loaded from memory:
   0:   c5 f9 6f 05 00 00 00    vmovdqa 0x0(%rip),%xmm0        # 8 <foo+0x8>
   7:   00
   8:   c5 fa 7f 07             vmovdqu %xmm0,(%rdi)
and which, I think, also has comparable (if not lower) latency when the
constant is in cache (the broadcast sequences pay for a GPR<->XMM move),
and for sure fewer uops and less port pressure.
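
For comparison, a non-uniform variant along the following lines should
show the load from memory. This is only a sketch: the name bar and the
values 10..13 are made up for illustration, and -O2 alongside the -march
option is my assumption about the flags.

```c
/* Hypothetical counterpart to foo: the four stored values differ, so
   the vector constant is non-uniform.  Name and values are made up
   for illustration.  */
void
bar (int *q)
{
  q[0] = 10;
  q[1] = 11;
  q[2] = 12;
  q[3] = 13;
}
```

Compiling foo and bar with the same options and comparing the objdump -d
output side by side shows the size difference.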
With FP we're broadcasting from scalar memory using vbroadcastss. For
same-sized integer data that should be possible as well (vpbroadcastd
with a memory operand), but it is one byte larger than the vector load
(though possibly better for dcache, esp. when broadcasting to %ymm or
%zmm).
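
Hypothetical testcases for those two cases, as a sketch: foo_fp is the FP
analogue of foo, and foo_wide stores eight ints so that the constant
could be broadcast to a %ymm register. Both names are made up, and I have
not verified the code generated for either; whether foo_wide is
vectorized to a single %ymm store depends on the options and tuning.

```c
/* FP analogue of foo.  Per the description above, the uniform FP
   constant is expected to be broadcast from a scalar in memory with
   vbroadcastss.  Hypothetical name.  */
void
foo_fp (float *q)
{
  q[0] = 10.0f;
  q[1] = 10.0f;
  q[2] = 10.0f;
  q[3] = 10.0f;
}

/* Eight uniform int stores, i.e. 32 bytes, which is the width of a
   %ymm register.  Hypothetical name; the resulting vector width is an
   assumption, not something checked here.  */
void
foo_wide (int *q)
{
  for (int i = 0; i < 8; i++)
    q[i] = 10;
}
```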