https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110823
Bug ID: 110823
Summary: [missed optimization] >50% speedup for x86-64 ASCII
processing a la GNU diffutils
Product: gcc
Version: 13.1.1
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: rtl-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: eggert at cs dot ucla.edu
Target Milestone: ---
Created attachment 55643
--> https://gcc.gnu.org/bugzilla/attachment.cgi?id=55643&action=edit
preprocessed source code inspired by GNU diffutils
This is GCC 13.1.1 20230614 (Red Hat 13.1.1-4) on x86-64.
While tuning GNU diffutils I noticed that its loops to process mostly-ASCII
text were not compiled well by GCC on x86-64. For a stripped-down example of
the problem, compile the attached program with:
        gcc -O2 -S code-mbcel1.i
The result is in the attached file code-mbcel1.s. Its loop kernel for the
ASCII case (starting on line 212 of that file) looks like this:
.L33:
        testb %al, %al
        js .L30
        movl $1, %edx
.L31:
        movl %eax, %eax
        addq %rdx, %rbx
        addq %rax, %rbp
        movsbl (%rbx), %eax
        testb %al, %al
        jne .L33
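As I understand it, the "movl %eax, %eax" is unnecessary, as all code that
reaches .L31 guarantees that %rax's top 32 bits are zero. Within the loop, for
instance, %eax was last written by movsbl, and on x86-64 a write to a 32-bit
register clears the upper half of the corresponding 64-bit register.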
Also, the loop body executes "testb %al, %al" twice when once would suffice.
(As a minor point, since the compiler can easily tell that %al is positive when
the loop is entered, it can omit the first testb.)
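For readers who prefer C to assembly, here is a rough sketch of what the
kernel above appears to compute. It is reconstructed from the listing, not
copied from the attached code-mbcel1.i; the names sum_chars and slow_path are
hypothetical, and slow_path is only a placeholder for whatever the code at
.L30 really does with a non-ASCII byte.

```c
#include <stddef.h>

/* Hypothetical placeholder for the non-ASCII path (.L30 in the listing).
   This sketch just treats the byte as a one-byte character. */
static size_t slow_path(const char *p, unsigned int *ch)
{
    *ch = (unsigned char) *p;
    return 1;
}

/* Sum the character values of a NUL-terminated string, taking a fast
   path for ASCII bytes (sign bit clear). */
unsigned long long sum_chars(const char *p)
{
    unsigned long long sum = 0;     /* %rbp in the listing */
    signed char c;                  /* %al */

    while ((c = (signed char) *p) != 0) {   /* movsbl; testb; jne */
        unsigned int ch;            /* %eax */
        size_t len;                 /* %rdx */

        if (c < 0) {                /* testb; js .L30 */
            len = slow_path(p, &ch);
        } else {
            ch = (unsigned int) c;
            len = 1;                /* movl $1, %edx */
        }
        p += len;                   /* addq %rdx, %rbx */
        sum += ch;                  /* addq %rax, %rbp */
    }
    return sum;
}
```

Written this way, the byte is compared against zero once for the loop
condition and once more for the sign check, which mirrors the two testb
instructions in the generated code.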
Suppose we change the above code to the following, as is done in the attached
file code-mbcel1-opt.s:
.L33:
        movl $1, %edx
.L31:
        addq %rdx, %rbx
        addq %rax, %rbp
        movsbl (%rbx), %eax
        testb %al, %al
        jg .L33
        js .L30
This small change improves performance significantly: for me, the test program
runs 55% faster on a circa-2021 Intel Xeon W-1350, and 74% faster on a
circa-2010 AMD Phenom II x4 910e, using the following commands to benchmark:
        gcc -O2 code-mbcel1.i -o code-mbcel1
        gcc -O2 code-mbcel1-opt.s -o code-mbcel1-opt
        time ./code-mbcel1
        time ./code-mbcel1-opt