https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110104
--- Comment #5 from Mason <slash.tmp at free dot fr> ---
FWIW, trunk (gcc14) translates testcase3 to the same code as the other testcases, while the source itself stays portable across all architectures:

$ gcc-trunk -O3 -march=bdver3 testcase3.c

typedef unsigned long long u64;
typedef unsigned __int128 u128;

void testcase3(u64 *acc, u64 a, u64 b)
{
  int c1, c2;
  u128 res = (u128)a * b;
  u64 lo = res, hi = res >> 64;
  c1 = __builtin_add_overflow(lo, acc[0], &acc[0]);
  c2 = __builtin_add_overflow(hi, acc[1], &acc[1])
     | __builtin_add_overflow(c1, acc[1], &acc[1]);
  __builtin_add_overflow(c2, acc[2], &acc[2]);
}

testcase3:
  movq %rsi, %rax
  mulq %rdx
  addq %rax, (%rdi)
  adcq %rdx, 8(%rdi)
  adcq $0, 16(%rdi)
  ret

Thanks again, Jakub.
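testcase3 computes a 64x64 -> 128-bit product and adds it into a three-limb (192-bit) accumulator acc[0..2], which is why it can collapse to one mul, one add and two adc. A minimal sketch of a cross-check, assuming GCC or Clang on a 64-bit target (for unsigned __int128): it compares testcase3 against a hypothetical reference_mac that does the 128-bit add in one go and propagates only the final carry. The names reference_mac, check_one, xorshift64 and selftest are illustrative and not part of the bug report.

```c
#include <assert.h>
#include <limits.h>

typedef unsigned long long u64;
typedef unsigned __int128 u128;

/* testcase3 as posted in the comment, unchanged. */
void testcase3(u64 *acc, u64 a, u64 b)
{
  int c1, c2;
  u128 res = (u128)a * b;
  u64 lo = res, hi = res >> 64;
  c1 = __builtin_add_overflow(lo, acc[0], &acc[0]);
  c2 = __builtin_add_overflow(hi, acc[1], &acc[1])
     | __builtin_add_overflow(c1, acc[1], &acc[1]);
  __builtin_add_overflow(c2, acc[2], &acc[2]);
}

/* Hypothetical reference: treat acc[1]:acc[0] as one 128-bit value,
   add the full product, and push the carry-out into acc[2]. */
static void reference_mac(u64 *acc, u64 a, u64 b)
{
  u128 low = ((u128)acc[1] << 64) | acc[0];
  u128 prod = (u128)a * b;
  u128 sum = low + prod;        /* wraps modulo 2^128 */
  acc[0] = (u64)sum;
  acc[1] = (u64)(sum >> 64);
  acc[2] += (sum < prod);       /* 1 iff the 128-bit add wrapped */
}

/* Run both versions on copies of init and assert they agree. */
static void check_one(const u64 init[3], u64 a, u64 b)
{
  u64 x[3] = {init[0], init[1], init[2]};
  u64 y[3] = {init[0], init[1], init[2]};
  testcase3(x, a, b);
  reference_mac(y, a, b);
  assert(x[0] == y[0] && x[1] == y[1] && x[2] == y[2]);
}

/* Marsaglia xorshift64 (13, 7, 17); fixed seed, so runs are repeatable. */
static u64 xorshift64(u64 *s)
{
  u64 x = *s;
  x ^= x << 13;
  x ^= x >> 7;
  x ^= x << 17;
  return *s = x;
}

/* Checks every combination of edge values, then pseudo-random inputs.
   Returns the number of cases checked. */
int selftest(void)
{
  static const u64 edge[4] = {0, 1, ULLONG_MAX - 1, ULLONG_MAX};
  u64 s = 0x9E3779B97F4A7C15ull;
  int n = 0;

  /* 5 operands (acc[0], acc[1], acc[2], a, b), 4 edge values each. */
  for (int m = 0; m < 1024; m++) {
    u64 init[3] = {edge[m & 3], edge[(m >> 2) & 3], edge[(m >> 4) & 3]};
    check_one(init, edge[(m >> 6) & 3], edge[(m >> 8) & 3]);
    n++;
  }

  for (int t = 0; t < 100000; t++) {
    u64 init[3];
    init[0] = xorshift64(&s);
    init[1] = xorshift64(&s);
    init[2] = xorshift64(&s);
    u64 a = xorshift64(&s);
    u64 b = xorshift64(&s);
    check_one(init, a, b);
    n++;
  }
  return n;
}
```

Calling selftest() from main() should return 101024 (1024 edge-value cases plus 100000 pseudo-random ones); any mismatch trips an assert. As a hand-checked case, acc = {ULLONG_MAX, ULLONG_MAX, 0} with a = b = ULLONG_MAX ends as {0, ULLONG_MAX - 1, 1}, because (2^64 - 1)^2 = 2^128 - 2^65 + 1 and (2^128 - 1) + (2^128 - 2^65 + 1) = 2^128 + (2^64 - 2) * 2^64.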