https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89845
Jakub Jelinek <jakub at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
CC| |drepper at gcc dot gnu.org,
| |law at gcc dot gnu.org
--- Comment #1 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
I admit I haven't done any measurements yet, but just from eyeballing the
code generated by both trunk gcc and clang, say at -O2 -mtune=generic, with
their
https://raw.githubusercontent.com/lemire/fastmod/master/include/fastmod.h
it doesn't seem that compelling; the fastmod code is in pretty much all cases
larger (part of it is that we aren't good at optimizing the __uint128_t or
__int128_t arithmetic in there down to just a highpart 64x64 multiplication
and nothing else, plus all these fastmod sequences contain the larger and
more costly movabsq instructions compared to what we emit currently).
For the __int128_t arithmetic, perhaps it would help to let
tree-ssa-math-opts.c pattern recognize MULT_HIGHPART_EXPR (or do that in
match.pd) even in non-vectorized code if can_mult_highpart_p returns 1.
#include "fastmod.h"
uint32_t ud3 (uint32_t x) { return fastmod::fastdiv<3> (x); }
uint32_t ud3_ (uint32_t x) { return x / 3; }
uint32_t um3 (uint32_t x) { return fastmod::fastmod<3> (x); }
uint32_t um3_ (uint32_t x) { return x % 3; }
int32_t sd3 (int32_t x) { return fastmod::fastdiv<3> (x); }
int32_t sd3_ (int32_t x) { return x / 3; }
int32_t sm3 (int32_t x) { return fastmod::fastmod<3> (x); }
int32_t sm3_ (int32_t x) { return x % 3; }
uint32_t ud61 (uint32_t x) { return fastmod::fastdiv<61> (x); }
uint32_t ud61_ (uint32_t x) { return x / 61; }
uint32_t um61 (uint32_t x) { return fastmod::fastmod<61> (x); }
uint32_t um61_ (uint32_t x) { return x % 61; }
int32_t sd61 (int32_t x) { return fastmod::fastdiv<61> (x); }
int32_t sd61_ (int32_t x) { return x / 61; }
int32_t sm61 (int32_t x) { return fastmod::fastmod<61> (x); }
int32_t sm61_ (int32_t x) { return x % 61; }
uint32_t ud279 (uint32_t x) { return fastmod::fastdiv<279> (x); }
uint32_t ud279_ (uint32_t x) { return x / 279; }
uint32_t um279 (uint32_t x) { return fastmod::fastmod<279> (x); }
uint32_t um279_ (uint32_t x) { return x % 279; }
int32_t sd279 (int32_t x) { return fastmod::fastdiv<279> (x); }
int32_t sd279_ (int32_t x) { return x / 279; }
int32_t sm279 (int32_t x) { return fastmod::fastmod<279> (x); }
int32_t sm279_ (int32_t x) { return x % 279; }
E.g. in ud3 there is
movl %edi, %eax
movabsq $6148914691236517206, %rdi
mulq %rdi
movq %rdx, %rax
and in ud3_
movl %edi, %eax
movl $2863311531, %edi
imulq %rdi, %rax
shrq $33, %rax
In um3 there is:
movabsq $6148914691236517206, %rcx
movl %edi, %edi
xorl %edx, %edx
imulq %rcx, %rdi
movq %rdi, %rax
movq %rdi, %rsi
movq %rdx, %rdi
addq %rax, %rsi
shldq $1, %rax, %rdi
addq %rsi, %rax
adcq %rdi, %rdx
movq %rdx, %rax
which is terrible, while in um3_ there is:
movl %edi, %eax
movl $2863311531, %edx
imulq %rdx, %rax
shrq $33, %rax
leal (%rax,%rax,2), %eax
subl %eax, %edi
movl %edi, %eax
In *.optimized um3 is:
_3 = (long unsigned int) x_2(D);
lowbits_4 = _3 * 6148914691236517206;
_6 = lowbits_4 w* 3;
_7 = _6 >> 64;
_8 = (unsigned int) _7;
where lowbits_4 is 64-bit and _6, _7 are 128-bit. This is another case where
just
using
_3 = (long unsigned int) x_2(D);
lowbits_4 = _3 * 6148914691236517206;
_9 = lowbits_4 h* 3;
_8 = (unsigned int) _9;
might result in better code, though I'm not sure we have a good expander for
a highpart multiply by a constant.