[Bug target/94650] New: Missed x86-64 peephole optimization: x >= large power of two

pascal_cuoq at hotmail dot com Sat, 18 Apr 2020 11:29:13 -0700

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94650


            Bug ID: 94650
           Summary: Missed x86-64 peephole optimization: x >= large power
                    of two
           Product: gcc
           Version: 9.3.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: pascal_cuoq at hotmail dot com
  Target Milestone: ---

Consider the three functions check, test0 and test1:

(Compiler Explorer link: https://gcc.godbolt.org/z/Sh4GpR )

#include <string.h>

#define LARGE_POWER_OF_TWO (1UL << 40)

int check(unsigned long m)
{
    return m >= LARGE_POWER_OF_TWO;
}

void g(int);

void test0(unsigned long m)
{
    if (m >= LARGE_POWER_OF_TWO) g(0);
}

void test1(unsigned long m)
{
    if (m >= LARGE_POWER_OF_TWO) g(m);
}

At least in the case of check and test0, the optimal way to compare m to
1UL << 40 is to shift m right by 40 bits and test whether the result is zero.
This is the code generated for these functions by Clang 10:

check:                                  # @check
        xorl    %eax, %eax
        shrq    $40, %rdi
        setne   %al
        retq
test0:                                  # @test0
        shrq    $40, %rdi
        je      .LBB1_1
        xorl    %edi, %edi
        jmp     g                       # TAILCALL
.LBB1_1:
        retq

In contrast, GCC 9.3 uses a 64-bit constant that needs to be loaded in a
register with movabsq:

check:
        movabsq $1099511627775, %rax
        cmpq    %rax, %rdi
        seta    %al
        movzbl  %al, %eax
        ret
test0:
        movabsq $1099511627775, %rax
        cmpq    %rax, %rdi
        ja      .L5
        ret
.L5:
        xorl    %edi, %edi
        jmp     g
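
The immediate in the GCC listing is 2^40 - 1: the generated code tests
m > 2^40 - 1 (seta/ja), which is the same predicate as m >= 2^40. The constant
has to go through movabsq because cmpq only accepts a sign-extended 32-bit
immediate. A small sketch of both facts follows; GCC_IMMEDIATE, gcc_form and
fits_simm32 are hypothetical names, and uint64_t is assumed in place of
unsigned long.

```c
#include <stdint.h>

/* The immediate that appears in the GCC 9.3 listing. */
#define GCC_IMMEDIATE UINT64_C(1099511627775)

/* Predicate as GCC encodes it: unsigned "above" 2^40 - 1. */
static int gcc_form(uint64_t m)
{
    return m > GCC_IMMEDIATE;
}

/* Nonzero if c can be encoded as a sign-extended 32-bit immediate,
   i.e. its upper 33 bits are all zero or all one. */
static int fits_simm32(uint64_t c)
{
    return c <= UINT64_C(0x7FFFFFFF) || c >= UINT64_C(0xFFFFFFFF80000000);
}
```

Here fits_simm32(GCC_IMMEDIATE) is 0, whereas the shift count 40 fits in the
8-bit immediate of shrq.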


In the case of the function test1, the comparison is between the following two
versions. Because the shift is destructive and m is still needed as the
argument to g, Clang first copies m to another register:

Clang 10:
test1:                                  # @test1
        movq    %rdi, %rax
        shrq    $40, %rax
        je      .LBB2_1
        jmp     g                       # TAILCALL
.LBB2_1:
        retq

GCC 9.3:
test1:
        movabsq $1099511627775, %rax
        cmpq    %rax, %rdi
        ja      .L8
        ret
.L8:
        jmp     g

It is less obvious which approach is better in the case of the function test1,
but generally speaking the shift approach should still be faster. The
register-register move can be free on Skylake (in the sense of not needing any
execution port), whereas movabsq requires an execution port and is also a
10-byte instruction.
