https://gcc.gnu.org/bugzilla/show_bug.cgi?id=120647
Bug ID: 120647
Summary: [X86] Sub optimal code generated for counting the number matches between two array elements
Product: gcc
Version: 16.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: rtl-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: vekumar at gcc dot gnu.org
Target Milestone: ---

For the test case below:

    unsigned vector_comparison(const char * rhs, const char * lhs)
    {
        unsigned n = 0;
        for (unsigned i = 0; i < 48; ++i)
            if (lhs[i] == rhs[i])
                ++n;
        return n;
    }

GCC generates suboptimal code.
ref: https://godbolt.org/z/s1sxxq77s

When "popcnt" is available, e.g. with -O3 -march=znver5, we should be generating simpler code:

    vmovdqu   (%rdi), %ymm0         # Load first 32 bytes
    vpcmpeqb  (%rsi), %ymm0, %k0    # Direct compare 32 bytes
    vmovdqu   32(%rdi), %xmm0       # Load next 16 bytes
    vpcmpeqb  32(%rsi), %xmm0, %k1  # Compare next 16 bytes
    kmovd     %k0, %eax             # Move first mask to a general register
    popcntl   %eax, %ecx            # Count matches (32 bytes)
    kmovw     %k1, %eax             # Move second mask to a general register
    popcntl   %eax, %eax            # Count matches (16 bytes)
    addl      %ecx, %eax            # Sum results (total 48 bytes)
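
For illustration only, the mask-then-popcount idea behind the suggested sequence can be written in portable C. This is a minimal sketch, not what GCC emits or is expected to emit: the names vector_comparison_mask and N_BYTES are hypothetical (not from the report), and it assumes a compiler that provides __builtin_popcountll (GCC and Clang do). Each matching byte sets one bit of a 48-bit mask, and a single population count then yields the number of matches.

```c
#include <assert.h>
#include <stdint.h>

enum { N_BYTES = 48 };  /* trip count taken from the test case */

/* Reference: the loop from the report, with the same behavior. */
unsigned vector_comparison(const char *rhs, const char *lhs)
{
    unsigned n = 0;
    for (unsigned i = 0; i < N_BYTES; ++i)
        if (lhs[i] == rhs[i])
            ++n;
    return n;
}

/* Hypothetical equivalent: build one bit per matching byte, then
   count the set bits, mirroring the compare-to-mask + popcnt shape. */
unsigned vector_comparison_mask(const char *rhs, const char *lhs)
{
    uint64_t mask = 0;
    for (unsigned i = 0; i < N_BYTES; ++i)
        mask |= (uint64_t)(lhs[i] == rhs[i]) << i;

    unsigned n = (unsigned)__builtin_popcountll(mask);
    assert(n <= N_BYTES);  /* at most one bit per compared byte */
    return n;
}
```

Both functions return the same count for any pair of 48-byte inputs; the second only makes explicit that the per-byte compare results can be gathered into a mask and summed with a population count instead of being accumulated one increment at a time.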