https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79665
Bug ID: 79665 Summary: gcc's signed (x*x)/200 is slower than clang's Product: gcc Version: 7.0 URL: https://github.com/rust-lang/rust/issues/39446#issueco mment-280798073 Status: UNCONFIRMED Severity: normal Priority: P3 Component: c Assignee: unassigned at gcc dot gnu.org Reporter: jistone at redhat dot com Target Milestone: --- Created attachment 40809 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=40809&action=edit C source from rust-lang/rust#39446 The linked Rust issue is initially comparing rustc and clang, but I found that GCC is also generating worse code than clang. Profiling pointed me to these: x_x = (x * x) / 200; y_y = (y * y) / 200; Clang optimizes these to just 4 instructions each, like: movl %edx, %ebp imull %ebp, %ebp imulq $1374389535, %rbp, %rbp # imm = 0x51EB851F shrq $38, %rbp But clang -fwrapv generates 8 instructions each, same as rustc, and it turns out GCC uses a similar longer sequence too regardless of -fwrapv. GCC 6.3 gives: movl %edi, %r13d imull %edi, %r13d movl %r13d, %eax sarl $31, %r13d imull %ebx sarl $6, %edx movl %edx, %ecx subl %r13d, %ecx And GCC 7 is the same, although it interleaves the two x and y lines. Here they are on Compiler Explorer: https://godbolt.org/g/Fy8Fiy I believe what's happening is that clang identifies that signed (x*x) is always positive, thanks to overflow UB, and then (x*x)/200 is essentially unsigned division, which requires less fixup than signed constant division.