https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108862
--- Comment #1 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
Note, -O2 -mcpu=power9:

__attribute__((noipa)) unsigned __int128
foo (unsigned __int128 x, unsigned long long y, unsigned long long z)
{
  return x + (unsigned __int128) y * z;
}

int
main ()
{
  unsigned __int128 x = foo (0, 0x04a13945d898c296ULL, 0x0000100000000fffULL);
  if ((unsigned long long) (x >> 64) != 0x0000004a13945dd3ULL
      || (unsigned long long) x != 0x9b1c8443b3909d6aULL)
    __builtin_abort ();
  return 0;
}

works correctly, in that case we get:
        maddhdu 10,5,6,3
        maddld 3,5,6,3
        add 4,10,4
which is correct.  But for the #c0 testcase above, e.g. with
-O2 -fno-unroll-loops -mcpu=power9
we get
.L3:
        ldu 9,8(8)
        ldu 10,-8(5)
        maddld 3,9,10,3
        maddhdu 9,9,10,3
        add 4,9,4
        bdnz .L3
in the inner loop, which looks wrong because maddhdu in that case uses result
of maddld as last operand rather than the low part of the 128-bit counter (w).
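
To illustrate why the instruction order matters, here is a rough software model of the two sequences in portable GNU C. It is only a sketch: model_maddld, model_maddhdu, u128_pair, madd_good and madd_loop_order are hypothetical names invented for this example, not anything in GCC, and the model assumes maddld returns the low 64 bits and maddhdu the high 64 bits of the unsigned 128-bit value a * b + c.

```c
#include <stdint.h>

/* Assumed model of maddld: low 64 bits of a * b + c.  */
uint64_t
model_maddld (uint64_t a, uint64_t b, uint64_t c)
{
  return (uint64_t) ((unsigned __int128) a * b + c);
}

/* Assumed model of maddhdu: high 64 bits of the unsigned
   128-bit value a * b + c.  */
uint64_t
model_maddhdu (uint64_t a, uint64_t b, uint64_t c)
{
  return (uint64_t) (((unsigned __int128) a * b + c) >> 64);
}

/* A 128-bit accumulator split into two 64-bit halves, like r3/r4.  */
typedef struct { uint64_t lo, hi; } u128_pair;

/* Order seen for foo: maddhdu reads the old low half before
   maddld overwrites it.  */
u128_pair
madd_good (u128_pair acc, uint64_t y, uint64_t z)
{
  uint64_t r10 = model_maddhdu (y, z, acc.lo);
  uint64_t r3 = model_maddld (y, z, acc.lo);
  u128_pair res = { r3, r10 + acc.hi };
  return res;
}

/* Order seen in the inner loop: maddld overwrites the low half
   first, so maddhdu adds the already updated value.  */
u128_pair
madd_loop_order (u128_pair acc, uint64_t y, uint64_t z)
{
  uint64_t r3 = model_maddld (y, z, acc.lo);
  uint64_t r9 = model_maddhdu (y, z, r3);   /* reads clobbered r3 */
  u128_pair res = { r3, r9 + acc.hi };
  return res;
}
```

With a zero accumulator and the operands from the testcase above, madd_good gives the expected halves, while madd_loop_order gives the same low half but a high half that is one too large, because the low half of the product is effectively added twice and carries out.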