[Bug c++/85466] Performance is slow when doing 'branchless' conditional style math operations

jgreenhalgh at gcc dot gnu.org Thu, 19 Apr 2018 04:47:41 -0700

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85466


James Greenhalgh <jgreenhalgh at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |jgreenhalgh at gcc dot gnu.org

--- Comment #3 from James Greenhalgh <jgreenhalgh at gcc dot gnu.org> ---
Created attachment 43988
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=43988&action=edit
Reduced testcase

I believe this testcase shows the issue being reported here. Clang seems to
spot that the loop is essentially a memset across the array, while GCC doesn't.

On AArch64 with Clang:

  .LBB1_9:                                // =>This Inner Loop Header: Depth=1
        stp     q0, q0, [x8, #-16]
        subs    x20, x20, #8            // =8
        add     x8, x8, #32             // =32
        b.ne    .LBB1_9

On x86-64 with Clang:

  .LBB1_9:                                # =>This Inner Loop Header: Depth=1
        movups  %xmm0, -144(%rax,%rcx,4)
        movups  %xmm0, -128(%rax,%rcx,4)
        movups  %xmm0, -112(%rax,%rcx,4)
        movups  %xmm0, -96(%rax,%rcx,4)
        movups  %xmm0, -80(%rax,%rcx,4)
        movups  %xmm0, -64(%rax,%rcx,4)
        movups  %xmm0, -48(%rax,%rcx,4)
        movups  %xmm0, -32(%rax,%rcx,4)
        movups  %xmm0, -16(%rax,%rcx,4)
        movups  %xmm0, (%rax,%rcx,4)
        addq    $40, %rcx
        cmpq    $100036, %rcx           # imm = 0x186C4
        jne     .LBB1_9

GCC doesn't spot this.
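
For illustration only, here is a hypothetical loop of the kind that can collapse
into a memset-style fill. This is not the attached testcase: the function names,
the float element type and the particular 'branchless' expression are all
assumptions made for this sketch. The point is that the stored value does not
depend on the loop index, so the computation can be hoisted and the loop reduced
to storing one constant across the array, which is what the Clang output above
appears to be doing.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical example, NOT the attached testcase.
// 'Branchless' select: each comparison yields 0 or 1, which scales a constant.
// x is loop-invariant, so every element receives the same value.
void fill_branchless(std::vector<float>& out, float x) {
    for (std::size_t i = 0; i < out.size(); ++i) {
        out[i] = (x > 0.0f) * 1.0f + (x <= 0.0f) * -1.0f;
    }
}

// The same computation with the invariant value hoisted by hand, leaving a
// plain fill of the array (the memset-like form).
void fill_hoisted(std::vector<float>& out, float x) {
    const float v = (x > 0.0f) * 1.0f + (x <= 0.0f) * -1.0f;
    std::fill(out.begin(), out.end(), v);
}

// Checks that both forms produce identical arrays for a given input.
void check_equivalent(std::size_t n, float x) {
    std::vector<float> a(n, 0.0f), b(n, 0.0f);
    fill_branchless(a, x);
    fill_hoisted(b, x);
    assert(a == b);
}
```

Comparing the code generated for the two functions at -O2 or -O3 is one way to
see whether a given compiler performs the hoisting itself.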

On the other hand, G++'s inlining of the various random number initialisation
routines puts it well ahead of Clang there, as Clang ends up emulating 128-bit
arithmetic on AArch64.
