https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89845
Jakub Jelinek <jakub at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |drepper at gcc dot gnu.org,
                   |                            |law at gcc dot gnu.org

--- Comment #1 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
I admit I haven't done any measurements yet, but just from eyeballing the
generated code with both trunk gcc and clang, say at -O2 -mtune=generic with
their
https://raw.githubusercontent.com/lemire/fastmod/master/include/fastmod.h
it doesn't seem that compelling; the fastmod code is in pretty much all cases
larger (part of it is that we aren't good at optimizing the __uint128_t or
__int128_t arithmetic in there down to just a highpart 64x64 multiplication
and nothing else), plus all these fastmod sequences have the larger and more
costly movabsq instructions compared to what we emit currently.
For the __int128_t arithmetic, perhaps it would help to let
tree-ssa-math-opts.c pattern recognize MULT_HIGHPART_EXPR (or do that in
match.pd) even in non-vectorized code if can_mult_highpart_p returns 1.

#include "fastmod.h"
uint32_t ud3 (uint32_t x) { return fastmod::fastdiv<3> (x); }
uint32_t ud3_ (uint32_t x) { return x / 3; }
uint32_t um3 (uint32_t x) { return fastmod::fastmod<3> (x); }
uint32_t um3_ (uint32_t x) { return x % 3; }
int32_t sd3 (int32_t x) { return fastmod::fastdiv<3> (x); }
int32_t sd3_ (int32_t x) { return x / 3; }
int32_t sm3 (int32_t x) { return fastmod::fastmod<3> (x); }
int32_t sm3_ (int32_t x) { return x % 3; }
uint32_t ud61 (uint32_t x) { return fastmod::fastdiv<61> (x); }
uint32_t ud61_ (uint32_t x) { return x / 61; }
uint32_t um61 (uint32_t x) { return fastmod::fastmod<61> (x); }
uint32_t um61_ (uint32_t x) { return x % 61; }
int32_t sd61 (int32_t x) { return fastmod::fastdiv<61> (x); }
int32_t sd61_ (int32_t x) { return x / 61; }
int32_t sm61 (int32_t x) { return fastmod::fastmod<61> (x); }
int32_t sm61_ (int32_t x) { return x % 61; }
uint32_t ud279 (uint32_t x) { return fastmod::fastdiv<279> (x); }
uint32_t ud279_ (uint32_t x) { return x / 279; }
uint32_t um279 (uint32_t x) { return fastmod::fastmod<279> (x); }
uint32_t um279_ (uint32_t x) { return x % 279; }
int32_t sd279 (int32_t x) { return fastmod::fastdiv<279> (x); }
int32_t sd279_ (int32_t x) { return x / 279; }
int32_t sm279 (int32_t x) { return fastmod::fastmod<279> (x); }
int32_t sm279_ (int32_t x) { return x % 279; }

E.g. in ud3 there is
        movl    %edi, %eax
        movabsq $6148914691236517206, %rdi
        mulq    %rdi
        movq    %rdx, %rax
and in ud3_
        movl    %edi, %eax
        movl    $2863311531, %edi
        imulq   %rdi, %rax
        shrq    $33, %rax
In um3 there is:
        movabsq $6148914691236517206, %rcx
        movl    %edi, %edi
        xorl    %edx, %edx
        imulq   %rcx, %rdi
        movq    %rdi, %rax
        movq    %rdi, %rsi
        movq    %rdx, %rdi
        addq    %rax, %rsi
        shldq   $1, %rax, %rdi
        addq    %rsi, %rax
        adcq    %rdi, %rdx
        movq    %rdx, %rax
which is terrible; in um3_
        movl    %edi, %eax
        movl    $2863311531, %edx
        imulq   %rdx, %rax
        shrq    $33, %rax
        leal    (%rax,%rax,2), %eax
        subl    %eax, %edi
        movl    %edi, %eax

In *.optimized um3 is:
  _3 = (long unsigned int) x_2(D);
  lowbits_4 = _3 * 6148914691236517206;
  _6 = lowbits_4 w* 3;
  _7 = _6 >> 64;
  _8 = (unsigned int) _7;
where lowbits_4 is 64-bit and _6, _7 are 128-bit.  This is another case where
just using
  _3 = (long unsigned int) x_2(D);
  lowbits_4 = _3 * 6148914691236517206;
  _9 = lowbits_4 h* 3;
  _8 = (unsigned int) _9;
might result in better code, though not sure if we have a good expander for
highpart multiply by constant.
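
To illustrate the suggested h* form in plain C, here is a minimal standalone
sketch (not taken from fastmod.h; the helper names mulhi64 and fastmod3 are
made up for illustration, and it assumes a target with __uint128_t such as
x86-64):

#include <stdint.h>
#include <stdio.h>

/* High 64 bits of a 64x64->128 multiply; this is what a single
   MULT_HIGHPART_EXPR (a lone mulq on x86-64) would compute, i.e. the
   w* followed by >> 64 sequence from the *.optimized dump above.  */
static uint64_t
mulhi64 (uint64_t a, uint64_t b)
{
  return (uint64_t) (((__uint128_t) a * b) >> 64);
}

/* fastmod-style x % 3 for 32-bit x, with M = floor(2^64 / 3) + 1.  */
static uint32_t
fastmod3 (uint32_t x)
{
  uint64_t lowbits = UINT64_C (6148914691236517206) * x;
  return (uint32_t) mulhi64 (lowbits, 3);
}

int
main (void)
{
  for (uint32_t x = 0; x < 1000000; x++)
    if (fastmod3 (x) != x % 3)
      {
        printf ("mismatch at %u\n", x);
        return 1;
      }
  printf ("ok\n");
  return 0;
}

With a highpart-multiply expander the whole modulo would then be just two
multiplies, instead of the full 128-bit widening multiply, add and shift
sequence currently emitted for um3.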