https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110823
Bug ID: 110823 Summary: [missed optimization] >50% speedup for x86-64 ASCII processing a la GNU diffutils Product: gcc Version: 13.1.1 Status: UNCONFIRMED Severity: normal Priority: P3 Component: rtl-optimization Assignee: unassigned at gcc dot gnu.org Reporter: eggert at cs dot ucla.edu Target Milestone: --- Created attachment 55643 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=55643&action=edit proprocessed source code inspired by GNU diffutils This is GCC 13.1.1 20230614 (Red Hat 13.1.1-4) on x86-64. While tuning GNU diffutils I noticed that its loops to process mostly-ASCII text were not compiled well by GCC on x86-64. For a stripped-down example of the problem, compile the attached program with: gcc -O2 -S code-mbcel1.i The result is in the attached file code-mbcel1.s. Its loop kernel assuming ASCII text (starting on line 212) looks like this: .L33: testb %al, %al js .L30 movl $1, %edx .L31: movl %eax, %eax addq %rdx, %rbx addq %rax, %rbp movsbl (%rbx), %eax testb %al, %al jne .L33 As I understand it the "movl %eax, %eax" is unnecessary, as all code that reaches .L31 guarantees that %rax's top 32 bits are zero. Also, the loop body executes "testb %al, %al" twice when once would suffice. (As a minor point, since the compiler can easily tell that %al is positive when the loop is entered, it can omit the first testb.) Suppose we change the above code to the following, as is done in the attached file code-mbcel1-opt.s: .L33: movl $1, %edx .L31: addq %rdx, %rbx addq %rax, %rbp movsbl (%rbx), %eax testb %al, %al jg .L33 js .L30 This small change improves performance significantly: for me, the test program runs 55% faster on a circa-2021 Intel Xeon W-1350, and 74% faster on a circa-2010 AMD Phenom II x4 910e, using the following commands to benchmark: gcc -O2 code-mbcel1.i -o code-mbcel1 gcc -O2 code-mbcel1-opt.s -o code-mbcel1-opt time ./code-mbcel1 time ./code-mbcel1-opt