https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103750
Bug ID: 103750
Summary: [i386] GCC schedules KMOV instructions that destroys
performance in loop
Product: gcc
Version: 12.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: thiago at kde dot org
Target Milestone: ---
Testcase:
const char16_t *qustrchr(char16_t *n, char16_t *e, char16_t c) noexcept
{
__m256i mch256 = _mm256_set1_epi16(c);
for ( ; n < e; n += 32) {
__m256i data1 = _mm256_loadu_si256(reinterpret_cast<const __m256i
*>(n));
__m256i data2 = _mm256_loadu_si256(reinterpret_cast<const __m256i *>(n)
+ 1);
__mmask16 mask1 = _mm256_cmpeq_epu16_mask(data1, mch256);
__mmask16 mask2 = _mm256_cmpeq_epu16_mask(data2, mch256);
if (_kortestz_mask16_u8(mask1, mask2))
continue;
unsigned idx = _tzcnt_u32(mask1);
if (mask1 == 0) {
idx = __tzcnt_u16(mask2);
n += 16;
}
return n + idx;
}
return e;
}
The assembly for this produces:
vmovdqu16 (%rdi), %ymm1
vmovdqu16 32(%rdi), %ymm2
vpcmpuw $0, %ymm0, %ymm1, %k0
vpcmpuw $0, %ymm0, %ymm2, %k1
kmovw %k0, %edx
kmovw %k1, %eax
kortestw %k1, %k0
je .L10
Those two KMOVW instructions aren't required for the check that follows.
They're also dispatched on port 0, same as the KORTESTW, meaning the KORTEST
can't be dispatched until those two have executed, thus introducing a 2-cycle
delay in this loop.
Clang generates:
.LBB0_2: # =>This Inner Loop Header: Depth=1
vpcmpeqw (%rdi), %ymm0, %k0
vpcmpeqw 32(%rdi), %ymm0, %k1
kortestw %k0, %k1
jne .LBB0_3
ICC inserts one KMOVW, but not the other.
Godbolt build link: https://gcc.godbolt.org/z/cc3heo48M
LLVM-MCA analysis: https://analysis.godbolt.org/z/dGvY1Wj78
It shows the Clang loop runs on average 2.0 cycles per loop, whereas the GCC
code is 3 cycles/loop.
LLVM-MCA says the ICC loop with one of the two KMOV also runs at 2.0 cycles per
loop, because it can run in parallel with the second load, given that the loads
are ports 2 and 3.