https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115749
Bug ID: 115749
Summary: Missed BMI2 optimization on x86-64
Product: gcc
Version: 14.1.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: c++
Assignee: unassigned at gcc dot gnu.org
Reporter: kim.walisch at gmail dot com
Target Milestone: ---
Hi,
I have debugged a performance issue in one of my C++ applications on x86-64
CPUs where GCC (all versions I tested) produces noticeably slower code than
Clang. I was able to find that the performance issue is caused by GCC not
using the mulx instruction from BMI2, even when compiling with -mbmi2. Clang,
on the other hand, uses the mulx instruction and produces a shorter and faster
assembly sequence. For this particular code sequence, Clang uses up to 30%
fewer instructions than GCC.
Here is a minimal C/C++ code snippet that reproduces the issue:
extern const unsigned long array[240];

unsigned long func(unsigned long x)
{
    unsigned long index = x / 240;
    return array[index % 240];
}
GCC trunk produces the following 15-instruction assembly sequence (without
mulx) when compiled with -O3 -mbmi2:
func(unsigned long):
        movabs  rcx, -8608480567731124087
        mov     rax, rdi
        mul     rcx
        mov     rdi, rdx
        shr     rdi, 7
        mov     rax, rdi
        mul     rcx
        shr     rdx, 7
        mov     rax, rdx
        sal     rax, 4
        sub     rax, rdx
        sal     rax, 4
        sub     rdi, rax
        mov     rax, QWORD PTR array[0+rdi*8]
        ret
Clang trunk produces the following shorter and faster 12-instruction assembly
sequence (with mulx) when compiled with -O3 -mbmi2:
func(unsigned long):                    # @func(unsigned long)
        movabs  rax, -8608480567731124087
        mov     rdx, rdi
        mulx    rdx, rdx, rax
        shr     rdx, 7
        movabs  rax, 153722867280912931
        mulx    rax, rax, rax
        shr     eax
        imul    eax, eax, 240
        sub     edx, eax
        mov     rax, qword ptr [rip + array@GOTPCREL]
        mov     rax, qword ptr [rax + 8*rdx]
        ret