https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93142

            Bug ID: 93142
           Summary: Missed optimization : Use of adc when checking
                    overflow
           Product: gcc
           Version: 9.2.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: rtl-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: madhur4127 at gmail dot com
  Target Milestone: ---

Created attachment 47585
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=47585&action=edit
Complete code for computing dot product (same as the godbolt link)

Consider emulating a 192-bit integer using a 128-bit integer and a 64-bit
integer. In the code sample, this emulated integer is used to compute the dot
product of two uint64_t vectors of length N.

    // loop computing the dot product of two vectors
    using u128 = unsigned __int128;
    const int N = 2048;
    uint64_t a[N], b[N];
    u128 sum = 0;
    uint64_t overflow = 0;   // counts carries out of the 128-bit sum
    for (int i = 0; i < N; ++i) {
        u128 prod = (u128) a[i] * (u128) b[i];
        sum += prod;
        // gcc recomputes the carry, clang just uses: adc overflow, 0
        overflow += sum < prod;
    }

To check for overflow in the 128-bit addition, GCC emits an extra 128-bit
comparison (cmp/mov/sbb) that recomputes the carry, although the carry flag
left by the preceding add/adc pair could be consumed directly by a single
`adc`. Clang recognizes this idiom of manually checking the carry flag.

Thus the extra comparison can be eliminated in this case.
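
For reference, the idiom can be wrapped into a small self-contained function.
This is only a sketch for experimenting with the codegen, not the attached
code: the names `U192` and `dot192` are hypothetical, and it assumes a
compiler that provides `unsigned __int128` (GCC or Clang on a 64-bit target).

```cpp
#include <cstddef>
#include <cstdint>

using u128 = unsigned __int128;

// Hypothetical 192-bit accumulator: low 128 bits plus a 64-bit carry count.
struct U192 {
    u128 lo;
    uint64_t hi;
};

// Dot product of two uint64_t vectors, accumulated in 192 bits.
U192 dot192(const uint64_t* a, const uint64_t* b, std::size_t n) {
    u128 sum = 0;
    uint64_t overflow = 0;
    for (std::size_t i = 0; i < n; ++i) {
        u128 prod = (u128)a[i] * (u128)b[i];
        sum += prod;
        // Unsigned addition wraps modulo 2^128, so a carry out of bit 127
        // occurred exactly when the result is smaller than the addend.
        overflow += sum < prod;
    }
    return {sum, overflow};
}
```

The comparison `sum < prod` is the source-level equivalent of reading the
carry flag after the add/adc pair, which is why a single `adc` should suffice.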

g++ 9.2.0 with -O3 -Wall -Wextra -march=broadwell -fno-unroll-loops produces:

.L3:
        mov     rdx, QWORD PTR [rdi+rcx]
        mulx    r9, r8, QWORD PTR [rsi+rcx]
        add     r14, r8
        adc     r15, r9

        cmp     r14, r8    <----- redundant 128-bit compare for : overflow += sum < prod;
        mov     rax, r15
        sbb     rax, r9
        adc     r10, 0

        add     rcx, 8
        cmp     rcx, 16384
        jne     .L3

whereas clang++ 9.0.0 with -O3 -Wall -Wextra -march=broadwell -fno-unroll-loops
produces:

.LBB0_1:               
        mov     rax, qword ptr [rsi + 8*rcx]
        mul     qword ptr [rdi + 8*rcx]
        add     r10, rax
        adc     r9, rdx

        adc     r11, 0     <------- carry flag reused directly by clang

        inc     rcx
        cmp     rcx, 2048
        jne     .LBB0_1
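
The same carry can also be requested explicitly at the source level with
`__builtin_add_overflow`, a builtin available in both GCC and Clang. The
sketch below is an untested alternative: the function name
`dot_overflow_builtin` is hypothetical, and whether it changes the code g++
9.2.0 generates for this loop has not been checked here.

```cpp
#include <cstddef>
#include <cstdint>

using u128 = unsigned __int128;

// Hypothetical variant of the loop: returns the number of carries out of the
// 128-bit sum and stores the low 128 bits of the dot product in sum_out.
uint64_t dot_overflow_builtin(const uint64_t* a, const uint64_t* b,
                              std::size_t n, u128& sum_out) {
    u128 sum = 0;
    uint64_t overflow = 0;
    for (std::size_t i = 0; i < n; ++i) {
        u128 prod = (u128)a[i] * (u128)b[i];
        // __builtin_add_overflow stores the wrapped result in sum and
        // returns true when the exact result did not fit in 128 bits.
        overflow += __builtin_add_overflow(sum, prod, &sum);
    }
    sum_out = sum;
    return overflow;
}
```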

For complete source code please visit: https://godbolt.org/z/ktdA4b
