https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79665

            Bug ID: 79665
           Summary: gcc's signed (x*x)/200 is slower than clang's
           Product: gcc
           Version: 7.0
               URL: https://github.com/rust-lang/rust/issues/39446#issueco
                    mment-280798073
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: c
          Assignee: unassigned at gcc dot gnu.org
          Reporter: jistone at redhat dot com
  Target Milestone: ---

Created attachment 40809
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=40809&action=edit
C source from rust-lang/rust#39446

The linked Rust issue is initially comparing rustc and clang, but I found that
GCC is also generating worse code than clang.  Profiling pointed me to these:

          x_x = (x * x) / 200;
          y_y = (y * y) / 200;

Clang optimizes these to just 4 instructions each, like:

        movl    %edx, %ebp
        imull   %ebp, %ebp
        imulq   $1374389535, %rbp, %rbp # imm = 0x51EB851F
        shrq    $38, %rbp

But clang -fwrapv generates 8 instructions each, same as rustc, and it turns
out GCC uses a similar longer sequence too regardless of -fwrapv.  GCC 6.3
gives:

        movl    %edi, %r13d
        imull   %edi, %r13d
        movl    %r13d, %eax
        sarl    $31, %r13d
        imull   %ebx
        sarl    $6, %edx
        movl    %edx, %ecx
        subl    %r13d, %ecx

And GCC 7 is the same, although it interleaves the two x and y lines.

Here they are on Compiler Explorer: https://godbolt.org/g/Fy8Fiy

I believe what's happening is that clang identifies that signed (x*x) is always
positive, thanks to overflow UB, and then (x*x)/200 is essentially unsigned
division, which requires less fixup than signed constant division.

Reply via email to