https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113560
--- Comment #5 from accelerator0099 at gmail dot com --- If we are targeting an arch without BMI2, we can use a single MUL instruction instead. Here is the description of MUL reg64/mem64: it multiplies a 64-bit register or memory operand by the contents of the RAX register and stores the 128-bit result in the RDX:RAX register pair, with the high-order bits of the product in RDX and the low-order bits in RAX. On the Zen 4 arch, this costs 3 or 4 cycles.
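
For illustration, here is a minimal C sketch of the full 64x64 -> 128-bit multiply that MUL computes. The helper name mul_full is hypothetical and not taken from this bug's test case. It relies on the unsigned __int128 extension that GCC provides on x86-64. Which instruction the compiler actually picks (MUL, MULX, or something else) depends on the target flags and the surrounding code, so treat this as a sketch rather than a guaranteed code sequence.

```c
#include <stdint.h>

/* Hypothetical helper: full 64x64 -> 128-bit unsigned product.
 * Returns the low 64 bits (RAX after MUL) and stores the high
 * 64 bits (RDX after MUL) through *hi. */
static inline uint64_t mul_full(uint64_t a, uint64_t b, uint64_t *hi)
{
    /* unsigned __int128 is a GCC extension, assumed available here. */
    unsigned __int128 p = (unsigned __int128)a * b;
    *hi = (uint64_t)(p >> 64);
    return (uint64_t)p;
}
```

Compiling this with -O2 and comparing the output with and without -mbmi2 should show which multiply instruction is selected for a given target.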