https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62011
--- Comment #7 from Yuri Rumyantsev <ysrumyan at gmail dot com> ---
Please ignore my previous comment - if we insert a zeroing (xor) of the destination
register before each popcnt (and lzcnt), performance is restored:
original test results:
  unsigned  83886630000  0.848533 sec  24.715  GB/s
  uint64_t  83886630000  1.37436  sec  15.2592 GB/s
fixed popcnt:
  unsigned  90440370000  0.853753 sec  24.5639 GB/s
  uint64_t  83886630000  0.694458 sec  30.1984 GB/s
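For reference, a minimal C-level sketch of the kind of loop being timed (assuming a
uint64_t buffer summed with __builtin_popcountll; the actual attached testcase may
differ in its unroll factor and timing harness):

#include <stdint.h>
#include <stddef.h>

uint64_t sum_popcounts(const uint64_t *buf, size_t n)
{
    uint64_t sum = 0;
    /* n is assumed to be a multiple of 4, matching the 32-byte
       (4 x 8-byte) stride of the unrolled loop shown below. */
    for (size_t i = 0; i < n; i += 4) {
        sum += __builtin_popcountll(buf[i]);
        sum += __builtin_popcountll(buf[i + 1]);
        sum += __builtin_popcountll(buf[i + 2]);
        sum += __builtin_popcountll(buf[i + 3]);
    }
    return sum;
}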
Here is the assembly for the 2nd (uint64_t) loop:
.L16:
        xorq    %rax, %rax
        popcntq -8(%rdx), %rax
        xorq    %rcx, %rcx
        popcntq (%rdx), %rcx
        addq    %rax, %rcx
        xorq    %rax, %rax
        popcntq 8(%rdx), %rax
        addq    %rcx, %rax
        addq    $32, %rdx
        xorq    %rcx, %rcx
        popcntq -16(%rdx), %rcx
        addq    %rax, %rcx
        addq    %rcx, %r13
        cmpq    %rsi, %rdx
        jne     .L16
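For anyone who needs a source-level workaround until GCC emits the xor itself, here
is a sketch that forces the dependency-breaking zeroing with inline asm (the function
name and constraints are just an illustration, not part of the attached testcase):

#include <stdint.h>

static inline uint64_t popcnt64_nodep(uint64_t x)
{
    uint64_t r;
    /* Zero the destination first so popcnt has no false input
       dependency on whatever the register previously held. */
    __asm__ ("xorq %0, %0\n\t"
             "popcntq %1, %0"
             : "=&r" (r)   /* early-clobber: keep r out of x's register */
             : "rm" (x)
             : "cc");
    return r;
}

The early-clobber keeps the destination register distinct from the input, so the
generated code matches the xor/popcnt pairs in the loop above.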