https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115749
--- Comment #1 from kim.walisch at gmail dot com ---
I played a bit more with my C/C++ code snippet and managed to further simplify
it. The GCC performance issue seems to be mostly caused by GCC producing worse
assembly than Clang for the integer modulo by a constant on x86-64 CPUs:
unsigned long func(unsigned long x)
{
    return x % 240;
}
GCC trunk produces the following 11-instruction assembly sequence (without
mulx) when compiled with -O3 -mbmi2:
func:
        movabs  rax, -8608480567731124087
        mul     rdi
        mov     rax, rdx
        shr     rax, 7
        mov     rdx, rax
        sal     rdx, 4
        sub     rdx, rax
        mov     rax, rdi
        sal     rdx, 4
        sub     rax, rdx
        ret
Clang trunk produces the following shorter and faster 8-instruction assembly
sequence (with mulx) when compiled with -O3 -mbmi2:
func:
        mov     rax, rdi
        movabs  rcx, -8608480567731124087
        mov     rdx, rdi
        mulx    rcx, rcx, rcx
        shr     rcx, 7
        imul    rcx, rcx, 240
        sub     rax, rcx
        ret
In my first post one can see that Clang uses mulx for both the integer division
by a constant and the integer modulo by a constant, while GCC does not use
mulx. For the integer division by a constant, GCC uses the same number of
instructions as Clang (even without mulx). For the integer modulo by a
constant, however, GCC uses more instructions (11 versus 8 in the example
above, about 38% more) and is noticeably slower.
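For readers who do not want to decode the assembly, here is a rough C
transcription of the two -O3 -mbmi2 sequences above. It is only a sketch to
illustrate the difference, not code taken from either compiler: the helper
and function names (mulhi64, mod240_imul, mod240_shift_sub, MAGIC_DIV15) are
made up for this comment, and unsigned __int128 is a GCC/Clang extension.
Both variants compute the quotient the same way (high half of the 64x64
multiply, then shr 7) and differ only in how they multiply the quotient back
by 240:

```c
#include <stdint.h>

/* High 64 bits of the 128-bit product a * b (what mul/mulx leave in the
   high-half register). Relies on unsigned __int128, a GCC/Clang extension. */
static uint64_t mulhi64(uint64_t a, uint64_t b)
{
    return (uint64_t)(((unsigned __int128)a * b) >> 64);
}

/* The movabs constant -8608480567731124087 reinterpreted as unsigned.
   It equals ceil(2^67 / 15), so mulhi64(x, MAGIC_DIV15) >> 7 == x / 240. */
#define MAGIC_DIV15 UINT64_C(0x8888888888888889)

/* Clang-style: quotient, then one multiply by 240 (the imul). */
uint64_t mod240_imul(uint64_t x)
{
    uint64_t q = mulhi64(x, MAGIC_DIV15) >> 7;  /* q = x / 240 */
    return x - q * 240;
}

/* GCC-style: quotient, then 240 * q built from two shift/sub steps. */
uint64_t mod240_shift_sub(uint64_t x)
{
    uint64_t q = mulhi64(x, MAGIC_DIV15) >> 7;  /* q = x / 240 */
    uint64_t t = (q << 4) - q;                  /* t = 15 * q  (sal 4, sub) */
    t <<= 4;                                    /* t = 240 * q (sal 4)      */
    return x - t;
}
```

Written this way, the extra instructions in GCC's output correspond to the
shift/sub expansion of the multiply by 240 plus the register moves around it.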
Please note that Clang's assembly for the integer modulo by a constant on
x86-64 CPUs is also shorter than GCC's when compiling without -mbmi2, e.g.
with just -O3. Clang's output in that case (8 instructions):
func:
        movabs  rcx, -8608480567731124087
        mov     rax, rdi
        mul     rcx
        shr     rdx, 7
        imul    rax, rdx, 240
        sub     rdi, rax
        mov     rax, rdi
        ret