https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103750

            Bug ID: 103750
           Summary: [i386] GCC schedules KMOV instructions that destroys
                    performance in loop
           Product: gcc
           Version: 12.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: thiago at kde dot org
  Target Milestone: ---

Testcase:

const char16_t *qustrchr(char16_t *n, char16_t *e, char16_t c) noexcept
{
    __m256i mch256 = _mm256_set1_epi16(c);
    for ( ; n < e; n += 32) {
        __m256i data1 = _mm256_loadu_si256(reinterpret_cast<const __m256i *>(n));
        __m256i data2 = _mm256_loadu_si256(reinterpret_cast<const __m256i *>(n) + 1);
        __mmask16 mask1 = _mm256_cmpeq_epu16_mask(data1, mch256);
        __mmask16 mask2 = _mm256_cmpeq_epu16_mask(data2, mch256);
        if (_kortestz_mask16_u8(mask1, mask2))
            continue;

        unsigned idx = _tzcnt_u32(mask1);
        if (mask1 == 0) {
            idx = __tzcnt_u16(mask2);
            n += 16;
        }
        return n + idx;
    }
    return e;
}

The assembly for this produces:

        vmovdqu16       (%rdi), %ymm1
        vmovdqu16       32(%rdi), %ymm2
        vpcmpuw $0, %ymm0, %ymm1, %k0
        vpcmpuw $0, %ymm0, %ymm2, %k1
        kmovw   %k0, %edx
        kmovw   %k1, %eax
        kortestw        %k1, %k0
        je      .L10

Those two KMOVW instructions aren't required for the check that follows.
They're also dispatched on port 0, same as the KORTESTW, meaning the KORTEST
can't be dispatched until those two have executed, thus introducing a 2-cycle
delay in this loop.

Clang generates:

.LBB0_2:                                # =>This Inner Loop Header: Depth=1
        vpcmpeqw        (%rdi), %ymm0, %k0
        vpcmpeqw        32(%rdi), %ymm0, %k1
        kortestw        %k0, %k1
        jne     .LBB0_3

ICC inserts one KMOVW, but not the other.

Godbolt build link: https://gcc.godbolt.org/z/cc3heo48M

LLVM-MCA analysis: https://analysis.godbolt.org/z/dGvY1Wj78
It shows the Clang loop averaging 2.0 cycles per iteration, whereas the GCC
loop takes 3 cycles per iteration.

LLVM-MCA says the ICC loop, which keeps one of the two KMOVW instructions, also
runs at 2.0 cycles per iteration: the KMOVW can execute in parallel with the
second load, given that the loads are dispatched on ports 2 and 3.
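
The three LLVM-MCA figures are consistent with a simple port-pressure estimate,
sketched below in C++. This is only a back-of-the-envelope model and not how
LLVM-MCA itself works: it ignores latency, the branch and front-end limits. The
struct, function and constant names are made up for illustration. Placing the
vector compares on port 5 is an assumption that this report does not state, and
the ICC row assumes its loop has the same two loads and two compares as the
others; the port 0 and port 2/3 assignments are the ones given in the report.

```cpp
#include <algorithm>

// Uops per loop iteration, grouped by the execution ports they compete for.
struct LoopUops {
    const char *name;
    int p0;     // KMOVW and KORTESTW (port 0, as stated in the report)
    int p5;     // vector compares into a mask register (ASSUMED to be port 5)
    int loads;  // loads, shared between ports 2 and 3 (as stated in the report)
};

// Lower bound on cycles per iteration: the busiest port sets the pace.
// Loads are spread over two ports, so their cost is loads / 2, rounded up.
constexpr int cycles_lower_bound(const LoopUops &l) {
    return std::max({l.p0, l.p5, (l.loads + 1) / 2});
}

// GCC: two KMOVW plus one KORTESTW all land on port 0.
constexpr LoopUops kGcc   = {"gcc",   3, 2, 2};
// Clang: only the KORTESTW uses port 0.
constexpr LoopUops kClang = {"clang", 1, 2, 2};
// ICC: one KMOVW plus the KORTESTW (compare and load counts are ASSUMED).
constexpr LoopUops kIcc   = {"icc",   2, 2, 2};
```

Under these assumptions cycles_lower_bound() gives 3 for kGcc and 2 for both
kClang and kIcc: only the GCC loop has port 0 as its busiest port, which is why
removing a single KMOVW is already enough to get back to 2 cycles.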
