https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103750
Bug ID: 103750 Summary: [i386] GCC schedules KMOV instructions that destroys performance in loop Product: gcc Version: 12.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: thiago at kde dot org Target Milestone: --- Testcase: const char16_t *qustrchr(char16_t *n, char16_t *e, char16_t c) noexcept { __m256i mch256 = _mm256_set1_epi16(c); for ( ; n < e; n += 32) { __m256i data1 = _mm256_loadu_si256(reinterpret_cast<const __m256i *>(n)); __m256i data2 = _mm256_loadu_si256(reinterpret_cast<const __m256i *>(n) + 1); __mmask16 mask1 = _mm256_cmpeq_epu16_mask(data1, mch256); __mmask16 mask2 = _mm256_cmpeq_epu16_mask(data2, mch256); if (_kortestz_mask16_u8(mask1, mask2)) continue; unsigned idx = _tzcnt_u32(mask1); if (mask1 == 0) { idx = __tzcnt_u16(mask2); n += 16; } return n + idx; } return e; } The assembly for this produces: vmovdqu16 (%rdi), %ymm1 vmovdqu16 32(%rdi), %ymm2 vpcmpuw $0, %ymm0, %ymm1, %k0 vpcmpuw $0, %ymm0, %ymm2, %k1 kmovw %k0, %edx kmovw %k1, %eax kortestw %k1, %k0 je .L10 Those two KMOVW instructions aren't required for the check that follows. They're also dispatched on port 0, same as the KORTESTW, meaning the KORTEST can't be dispatched until those two have executed, thus introducing a 2-cycle delay in this loop. Clang generates: .LBB0_2: # =>This Inner Loop Header: Depth=1 vpcmpeqw (%rdi), %ymm0, %k0 vpcmpeqw 32(%rdi), %ymm0, %k1 kortestw %k0, %k1 jne .LBB0_3 ICC inserts one KMOVW, but not the other. Godbolt build link: https://gcc.godbolt.org/z/cc3heo48M LLVM-MCA analysis: https://analysis.godbolt.org/z/dGvY1Wj78 It shows the Clang loop runs on average 2.0 cycles per loop, whereas the GCC code is 3 cycles/loop. LLVM-MCA says the ICC loop with one of the two KMOV also runs at 2.0 cycles per loop, because it can run in parallel with the second load, given that the loads are ports 2 and 3.