https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117875
Richard Biener <rguenth at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
Component|target |tree-optimization
--- Comment #1 from Richard Biener <rguenth at gcc dot gnu.org> ---
Samples: 530K of event 'cycles:Pu', Event count (approx.): 680879110118
Overhead Samples Command Shared Object Symbol
51.45% 273953 hmmer_peak.amd6 hmmer_peak.amd64-m64-gcc42-nn [.] P7Viterbi
38.49% 202968 hmmer_base.amd6 hmmer_base.amd64-m64-gcc42-nn [.] P7Viterbi
71 │4135c0┌─ vmovd (%r11,%rdi,4),%xmm3
1361 │4135c6│ vpaddd %xmm3,%xmm0,%xmm0
29411 │4135ca│ mov %rdi,%r8
15 │4135cd│ vmovd %xmm0,0x4(%rdx,%rdi,4)
5826 │4135d3│ vmovd (%rax,%rdi,4),%xmm4
981 │4135d8│ vmovd (%r10,%rdi,4),%xmm3
725 │4135de│ vpaddd %xmm3,%xmm4,%xmm3
3186 │4135e2│ vmovdqa 0x47346(%rip),%xmm4
787 │4135ea│ vpmaxsd %xmm4,%xmm3,%xmm3
3801 │4135ef│ vpmaxsd %xmm0,%xmm3,%xmm0
28932 │4135f4│ vmovd %xmm0,0x4(%rdx,%rdi,4)
2073 │4135fa│ inc %rdi
3464 │4135fd├── cmp %r8,%r9
11 │413600└── jne 4135c0 <P7Viterbi+0x1100>
vs.
│413aa0┌─ vmovd (%r11,%rdi,4),%xmm3
208 │413aa6│ mov %rdi,%r8
393 │413aa9│ vpaddd %xmm3,%xmm0,%xmm0
11199 │413aad│ vmovd %xmm0,0x4(%rdx,%rdi,4)
11840 │413ab3│ vmovd (%rax,%rdi,4),%xmm5
3889 │413ab8│ vmovd (%r10,%rdi,4),%xmm3
340 │413abe│ vpaddd %xmm3,%xmm5,%xmm3
2829 │413ac2│ vmovdqa 0x48656(%rip),%xmm5
720 │413aca│ vpmaxsd %xmm5,%xmm3,%xmm3
1047 │413acf│ vpmaxsd %xmm0,%xmm3,%xmm0
10698 │413ad4│ vmovd %xmm0,0x4(%rdx,%rdi,4)
12478 │413ada│ inc %rdi
2966 │413add├── cmp %r8,%r9
1 │413ae0└── jne 413aa0 <P7Viterbi+0x1760>
That's the scalar epilog; -mtune-ctrl=^avx512_two_epilogues does not help.
The regression also shows up on Icelake.
For some reason we're dealing with branch misses here: there are none
for BASE in the above loop but plenty with PEAK.
This seems to be related to loop splitting: for PEAK we have two
iterating loops while for BASE there's simply fallthru code before.
-fno-split-loops fixes this.
We do not seem to realize that splitting
  for (k = 1; k <= M; k++) {
    if (k < M) {
    }
  }
has the k == M loop run only once. That causes us to vectorize the
epilog loop as well.
A simplified testcase looks like
int a[1024], b[1024];
void foo (int M)
{
  for (int k = 1; k <= M; ++k)
    {
      a[k] = a[k] + 1;
      if (k < M)
        b[k] = b[k] + 1;
    }
}
likely "caused" by the loop splitting improvements, though for the simplified
testcase above the generated code is the same.
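To make the single-iteration property concrete, here is a hand-written sketch of what splitting the simplified testcase on k < M amounts to. This is not the GIMPLE the pass emits; the names foo_ref, foo_split, check_split and the duplicated reference arrays are hypothetical and exist only for the comparison. It assumes M <= 1023 so all accesses stay in bounds.

```c
#include <string.h>

#define N 1024

static int a[N], b[N];          /* arrays from the testcase */
static int a_ref[N], b_ref[N];  /* hypothetical copies for the unsplit loop */

/* Original loop from the testcase, run on the reference copies.  */
static void foo_ref (int M)
{
  for (int k = 1; k <= M; ++k)
    {
      a_ref[k] = a_ref[k] + 1;
      if (k < M)
        b_ref[k] = b_ref[k] + 1;
    }
}

/* Hand-split form.  Returns how often the second loop body ran.  */
static int foo_split (int M)
{
  int k = 1;
  int tail_iters = 0;

  /* First loop: k < M is known true, so the condition is elided.  */
  for (; k < M; ++k)
    {
      a[k] = a[k] + 1;
      b[k] = b[k] + 1;
    }

  /* Second loop: k < M is known false.  It starts at k == max (1, M)
     and exits once k > M, so the body runs at most once.  */
  for (; k <= M; ++k)
    {
      a[k] = a[k] + 1;
      ++tail_iters;
    }

  return tail_iters;
}

/* Run both forms for the same M.  Returns the second loop's iteration
   count if the results match, -1 otherwise.  */
static int check_split (int M)
{
  memset (a, 0, sizeof a);
  memset (b, 0, sizeof b);
  memset (a_ref, 0, sizeof a_ref);
  memset (b_ref, 0, sizeof b_ref);

  foo_ref (M);
  int tail_iters = foo_split (M);

  if (memcmp (a, a_ref, sizeof a) != 0 || memcmp (b, b_ref, sizeof b) != 0)
    return -1;
  return tail_iters;
}
```

Calling check_split for any M in [0, 1023] should never report more than one iteration of the second loop, which is the fact the niter analysis would need to derive for the split loop.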
I'll note that with GCC 14 we do
fast_algorithms.c:145:10: optimized: loop split
fast_algorithms.c:133:19: optimized: Loop 3 distributed: split to 3 loops and 0 library calls.
fast_algorithms.c:133:19: optimized: Loop 5 distributed: split to 2 loops and 0 library calls.
fast_algorithms.c:133:19: optimized: loop vectorized using 64 byte vectors
fast_algorithms.c:133:19: optimized: loop versioned for vectorization because of possible aliasing
fast_algorithms.c:133:19: optimized: loop vectorized using 32 byte vectors
fast_algorithms.c:133:19: optimized: loop vectorized using 64 byte vectors
fast_algorithms.c:133:19: optimized: loop versioned for vectorization because of possible aliasing
fast_algorithms.c:133:19: optimized: loop vectorized using 32 byte vectors
fast_algorithms.c:133:19: optimized: loop vectorized using 64 byte vectors
fast_algorithms.c:133:19: optimized: loop versioned for vectorization because of possible aliasing
fast_algorithms.c:133:19: optimized: loop vectorized using 32 byte vectors
fast_algorithms.c:133:19: optimized: loop with 6 iterations completely unrolled (header execution count 7100547)
fast_algorithms.c:134:7: optimized: loop turned into non-loop; it never loops
fast_algorithms.c:133:19: optimized: loop with 6 iterations completely unrolled (header execution count 20163246)
fast_algorithms.c:134:7: optimized: loop turned into non-loop; it never loops
fast_algorithms.c:133:19: optimized: loop with 6 iterations completely unrolled (header execution count 16089390)
fast_algorithms.c:134:7: optimized: loop turned into non-loop; it never loops
while trunk does
fast_algorithms.c:145:10: optimized: loop split
fast_algorithms.c:133:19: optimized: Loop 3 distributed: split to 3 loops and 0 library calls.
fast_algorithms.c:133:19: optimized: Loop 5 distributed: split to 2 loops and 0 library calls.
fast_algorithms.c:165:19: optimized: loop vectorized using 64 byte vectors
fast_algorithms.c:165:19: optimized: loop vectorized using 32 byte vectors
fast_algorithms.c:165:19: optimized: loop vectorized using 16 byte vectors
fast_algorithms.c:133:19: optimized: loop vectorized using 64 byte vectors
fast_algorithms.c:133:19: optimized: loop versioned for vectorization because of possible aliasing
fast_algorithms.c:133:19: optimized: loop vectorized using 32 byte vectors
fast_algorithms.c:133:19: optimized: loop vectorized using 16 byte vectors
fast_algorithms.c:133:19: optimized: loop vectorized using 64 byte vectors
fast_algorithms.c:133:19: optimized: loop versioned for vectorization because of possible aliasing
fast_algorithms.c:133:19: optimized: loop vectorized using 32 byte vectors
fast_algorithms.c:133:19: optimized: loop vectorized using 16 byte vectors
fast_algorithms.c:133:19: optimized: loop with 2 iterations completely unrolled (header execution count 21835320)
fast_algorithms.c:134:7: optimized: loop turned into non-loop; it never loops
fast_algorithms.c:134:7: optimized: loop turned into non-loop; it never loops
fast_algorithms.c:133:19: optimized: loop with 2 iterations completely unrolled (header execution count 13974604)
fast_algorithms.c:134:7: optimized: loop turned into non-loop; it never loops
fast_algorithms.c:134:7: optimized: loop turned into non-loop; it never loops
Which is mostly the same (but neither realizes that the loop produced by
splitting doesn't iterate).
The loop splitting is quite pointless (but it elides the condition).
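For comparison, a sketch of the shape the simplified testcase could take if the second loop were known to run at most once: one loop without the condition, followed by straight-line code. This is written by hand for illustration, not compiler output; foo_peeled is a hypothetical name, and M <= 1023 is assumed.

```c
#define N 1024

static int a[N], b[N];

/* Hand-written equivalent of the testcase with the k == M iteration
   as straight-line code instead of a second loop.  */
static void foo_peeled (int M)
{
  int k;

  /* k < M holds for every iteration here, so no condition is needed.  */
  for (k = 1; k < M; ++k)
    {
      a[k] = a[k] + 1;
      b[k] = b[k] + 1;
    }

  /* k == max (1, M) at this point; the remaining iteration exists
     only when M >= 1 and touches a[] alone.  */
  if (k <= M)
    a[k] = a[k] + 1;
}
```

This keeps the benefit of eliding the condition while leaving only one iterating loop for the vectorizer to consider.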