https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93142
Bug ID: 93142
Summary: Missed optimization: Use of adc when checking overflow
Product: gcc
Version: 9.2.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: rtl-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: madhur4127 at gmail dot com
Target Milestone: ---
Created attachment 47585
--> https://gcc.gnu.org/bugzilla/attachment.cgi?id=47585&action=edit
Complete code for computing dot product (same as the godbolt link)
Consider emulating a 192-bit integer using a 128-bit integer and a 64-bit
integer. In the code sample, this emulated integer is used to compute the dot
product of two uint64_t vectors of length N.
// function to compute dot product of two vectors
using u128 = unsigned __int128;
const int N = 2048;
uint64_t a[N], b[N];
u128 sum = 0;
uint64_t overflow = 0;
for (int i = 0; i < N; ++i) {
    u128 prod = (u128) a[i] * (u128) b[i];
    sum += prod;
    // gcc recomputes the carry with cmp/sbb, clang just uses: adc overflow, 0
    overflow += sum < prod;
}
To check for overflow in the 128-bit addition, GCC emits an extra 128-bit
comparison (`cmp`/`sbb`) that recomputes the carry, although the carry flag left
by the preceding `adc` could be consumed directly by a single `adc`. This idiom
of manually checking the carry flag is recognized by clang.
Thus the extra comparison can be eliminated in this case.
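The idiom can be sketched as a self-contained function. This is a minimal illustration, not the attached code: the names `Acc192` and `dot_product` are hypothetical, and `unsigned __int128` is assumed to be available as a GCC/clang extension.

```cpp
#include <cstddef>
#include <cstdint>

using u128 = unsigned __int128;  // GCC/clang extension, as in the report

// Hypothetical helper type: a 192-bit accumulator split into
// a low 128-bit part and a high 64-bit part.
struct Acc192 {
    u128 lo;
    std::uint64_t hi;
};

// Hypothetical name: accumulates a[i] * b[i] into a 192-bit value.
Acc192 dot_product(const std::uint64_t* a, const std::uint64_t* b,
                   std::size_t n) {
    Acc192 acc{0, 0};
    for (std::size_t i = 0; i < n; ++i) {
        u128 prod = static_cast<u128>(a[i]) * b[i];
        acc.lo += prod;
        // Unsigned addition wraps modulo 2^128, so the wrapped sum is
        // smaller than the addend exactly when a carry out of bit 127
        // occurred. That carry is what a single `adc hi, 0` would add.
        acc.hi += acc.lo < prod;
    }
    return acc;
}
```

For example, with three elements all equal to 2^64 - 1 in both vectors, the low part wraps twice, so `hi` ends up as 2.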
g++ 9.2.0 with -O3 -Wall -Wextra -march=broadwell -fno-unroll-loops produces:
.L3:
        mov     rdx, QWORD PTR [rdi+rcx]
        mulx    r9, r8, QWORD PTR [rsi+rcx]
        add     r14, r8
        adc     r15, r9
        cmp     r14, r8    <----- extra compare because of: overflow += sum < prod;
        mov     rax, r15
        sbb     rax, r9
        adc     r10, 0
        add     rcx, 8
        cmp     rcx, 16384
jne .L3
whereas clang++ 9.0.0 with -O3 -Wall -Wextra -march=broadwell -fno-unroll-loops
produces:
.LBB0_1:
        mov     rax, qword ptr [rsi + 8*rcx]
        mul     qword ptr [rdi + 8*rcx]
        add     r10, rax
        adc     r9, rdx
        adc     r11, 0    <------- carry flag consumed directly by clang
        inc     rcx
        cmp     rcx, 2048
jne .LBB0_1
For the complete source code, see: https://godbolt.org/z/ktdA4b
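The same carry can also be requested explicitly with the `__builtin_add_overflow` builtin available in GCC and clang. The sketch below is an equivalent formulation for comparison only, not a confirmed workaround: whether it changes the code either compiler generates has not been checked here. `Sum192` and `dot_product_builtin` are hypothetical names.

```cpp
#include <cstddef>
#include <cstdint>

using u128 = unsigned __int128;  // GCC/clang extension, as in the report

// Hypothetical helper type: low 128 bits plus a 64-bit carry counter.
struct Sum192 {
    u128 lo;
    std::uint64_t hi;
};

// Hypothetical name: same accumulation as the `sum < prod` idiom,
// but the carry is taken from the builtin's return value.
Sum192 dot_product_builtin(const std::uint64_t* a, const std::uint64_t* b,
                           std::size_t n) {
    Sum192 s{0, 0};
    for (std::size_t i = 0; i < n; ++i) {
        u128 prod = static_cast<u128>(a[i]) * b[i];
        // Stores the wrapped sum in s.lo and returns true when the
        // exact sum does not fit in 128 bits, i.e. when a carry occurred.
        bool carry = __builtin_add_overflow(s.lo, prod, &s.lo);
        s.hi += carry;
    }
    return s;
}
```

Both formulations compute the same 192-bit result, so either can be used to check what a given compiler version emits for the carry.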