50% speedup for x86-64 ASCII processing a la GNU diffutils

eggert at cs dot ucla.edu via Gcc-bugs Wed, 26 Jul 2023 12:37:28 -0700

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110823


            Bug ID: 110823
           Summary: [missed optimization] >50% speedup for x86-64 ASCII
                    processing a la GNU diffutils
           Product: gcc
           Version: 13.1.1
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: rtl-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: eggert at cs dot ucla.edu
  Target Milestone: ---

Created attachment 55643
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=55643&action=edit
proprocessed source code inspired by GNU diffutils

This is GCC 13.1.1 20230614 (Red Hat 13.1.1-4) on x86-64.

While tuning GNU diffutils I noticed that its loops to process mostly-ASCII
text were not compiled well by GCC on x86-64. For a stripped-down example of
the problem, compile the attached program with:

gcc -O2 -S code-mbcel1.i

The result is in the attached file code-mbcel1.s. Its loop kernel assuming
ASCII text (starting on line 212) looks like this:

  .L33:
        testb   %al, %al
        js      .L30
        movl    $1, %edx
  .L31:
        movl    %eax, %eax
        addq    %rdx, %rbx
        addq    %rax, %rbp
        movsbl  (%rbx), %eax
        testb   %al, %al
        jne     .L33

As I understand it the "movl %eax, %eax" is unnecessary, as all code that
reaches .L31 guarantees that %rax's top 32 bits are zero.

Also, the loop body executes "testb %al, %al" twice when once would suffice.
(As a minor point, since the compiler can easily tell that %al is positive when
the loop is entered, it can omit the first testb.)

Suppose we change the above code to the following, as is done in the attached
file code-mbcel1-opt.s:

  .L33:
        movl    $1, %edx
  .L31:
        addq    %rdx, %rbx
        addq    %rax, %rbp
        movsbl  (%rbx), %eax
        testb   %al, %al
        jg      .L33
        js      .L30

This small change improves performance significantly: for me, the test program
runs 55% faster on a circa-2021 Intel Xeon W-1350, and 74% faster on a
circa-2010 AMD Phenom II x4 910e, using the following commands to benchmark:

gcc -O2 code-mbcel1.i -o code-mbcel1
gcc -O2 code-mbcel1-opt.s -o code-mbcel1-opt
time ./code-mbcel1
time ./code-mbcel1-opt

[Bug rtl-optimization/110823] New: [missed optimization] >50% speedup for x86-64 ASCII processing a la GNU diffutils

Reply via email to