https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93142
Bug ID: 93142
Summary: Missed optimization : Use of adc when checking overflow
Product: gcc
Version: 9.2.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: rtl-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: madhur4127 at gmail dot com
Target Milestone: ---

Created attachment 47585
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=47585&action=edit
Complete code for computing dot product (same as the godbolt link)

Consider emulating a 192-bit integer using a 128-bit integer and a 64-bit
integer. In the code sample, this emulated integer is used to compute the dot
product of two uint64_t vectors of length N.

    // compute the dot product of two vectors
    using u128 = unsigned __int128;
    const int N = 2048;
    uint64_t a[N], b[N];
    u128 sum = 0;
    uint64_t overflow = 0;
    for (int i = 0; i < N; ++i) {
        u128 prod = (u128) a[i] * (u128) b[i];
        sum += prod;
        // gcc re-compares, clang just uses: adc overflow, 0
        overflow += sum < prod;
    }

To check for overflow in the 128-bit addition, GCC emits an extra compare
sequence (cmp/mov/sbb) that recomputes the carry, even though the carry flag
left by the preceding adc could be consumed directly by a single `adc`. This
idiom of manually checking the carry flag works with clang, so the extra
instructions can be eliminated in this case.

g++ 9.2.0 with -O3 -Wall -Wextra -march=broadwell -fno-unroll-loops produces:

.L3:
        mov     rdx, QWORD PTR [rdi+rcx]
        mulx    r9, r8, QWORD PTR [rsi+rcx]
        add     r14, r8
        adc     r15, r9
        cmp     r14, r8         <----- extra compare for: overflow += sum < prod;
        mov     rax, r15
        sbb     rax, r9
        adc     r10, 0
        add     rcx, 8
        cmp     rcx, 16384
        jne     .L3

whereas clang++ 9.0.0 with -O3 -Wall -Wextra -march=broadwell
-fno-unroll-loops produces:

.LBB0_1:
        mov     rax, qword ptr [rsi + 8*rcx]
        mul     qword ptr [rdi + 8*rcx]
        add     r10, rax
        adc     r9, rdx
        adc     r11, 0          <------- extra compare eliminated in clang
        inc     rcx
        cmp     rcx, 2048
        jne     .LBB0_1

For complete source code please visit: https://godbolt.org/z/ktdA4b
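For anyone wanting to reproduce this outside the attachment, the following is a minimal self-contained sketch of the accumulator idiom from the report. The names Acc192, add_compare, add_builtin and dot are hypothetical and do not come from the attachment. add_compare is the `sum < prod` check used above; add_builtin is an alternative spelling using the GCC/clang builtin __builtin_add_overflow, included only for comparison. Whether either form compiles down to a single adc depends on the compiler and version and is not claimed here.

```cpp
#include <cstddef>
#include <cstdint>

using u128 = unsigned __int128;  // GCC/clang extension, as in the report

// Hypothetical 192-bit accumulator: low 128 bits in `sum`,
// high 64 bits in `overflow`.
struct Acc192 {
    u128 sum = 0;
    std::uint64_t overflow = 0;
};

// Idiom from the report: an unsigned add wrapped around exactly when
// the result is smaller than one of the addends.
inline void add_compare(Acc192& acc, u128 prod) {
    acc.sum += prod;
    acc.overflow += acc.sum < prod;
}

// Alternative (an assumption, not from the report): ask the compiler
// for the carry-out directly. The builtin returns true on wraparound.
inline void add_builtin(Acc192& acc, u128 prod) {
    acc.overflow += __builtin_add_overflow(acc.sum, prod, &acc.sum);
}

// Dot product of two uint64_t vectors, accumulated in 192 bits.
template <void (*Add)(Acc192&, u128)>
Acc192 dot(const std::uint64_t* a, const std::uint64_t* b, std::size_t n) {
    Acc192 acc;
    for (std::size_t i = 0; i < n; ++i) {
        Add(acc, static_cast<u128>(a[i]) * static_cast<u128>(b[i]));
    }
    return acc;
}
```

Calling dot<add_compare>(a, b, N) and dot<add_builtin>(a, b, N) on the same inputs should give identical results, which makes it easy to compare the generated loops for the two spellings side by side.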