https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89845

Jakub Jelinek <jakub at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |drepper at gcc dot gnu.org,
                   |                            |law at gcc dot gnu.org

--- Comment #1 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
I admit I haven't done any measurements yet, but just eyeballing the code
generated by both trunk gcc and clang, say at -O2 -mtune=generic, with their
https://raw.githubusercontent.com/lemire/fastmod/master/include/fastmod.h
it doesn't seem that compelling; the fastmod code is in pretty much all cases
larger (partly because we aren't good at optimizing the __uint128_t or
__int128_t arithmetic in there down to just a highpart 64x64 multiplication
and nothing else), and all these fastmod sequences also contain the larger
and more costly movabsqs compared to what we emit currently.

For the int128_t arithmetic, perhaps it would help to let tree-ssa-math-opts.c
(or match.pd) pattern recognize MULT_HIGHPART_EXPR even in non-vectorized
code when can_mult_highpart_p returns 1.
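The source-level idiom in question would be something like the following
sketch: a widening 64x64->128 multiply whose only use is the top 64 bits,
which is exactly a MULT_HIGHPART_EXPR (function name hypothetical):

```cpp
#include <cstdint>

// Sketch of the idiom that could be pattern-recognized as
// MULT_HIGHPART_EXPR when can_mult_highpart_p says the target
// supports it: a widening 64x64->128 multiply (w* in GIMPLE)
// whose only consumer is the shift extracting the high 64 bits.
uint64_t hipart (uint64_t a, uint64_t b)
{
  __uint128_t wide = (__uint128_t) a * b;  // widening multiply
  return (uint64_t) (wide >> 64);          // only the highpart is used
}
```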

#include <cstdint>
#include "fastmod.h"

uint32_t ud3 (uint32_t x) { return fastmod::fastdiv<3> (x); }
uint32_t ud3_ (uint32_t x) { return x / 3; }
uint32_t um3 (uint32_t x) { return fastmod::fastmod<3> (x); }
uint32_t um3_ (uint32_t x) { return x % 3; }
int32_t sd3 (int32_t x) { return fastmod::fastdiv<3> (x); }
int32_t sd3_ (int32_t x) { return x / 3; }
int32_t sm3 (int32_t x) { return fastmod::fastmod<3> (x); }
int32_t sm3_ (int32_t x) { return x % 3; }
uint32_t ud61 (uint32_t x) { return fastmod::fastdiv<61> (x); }
uint32_t ud61_ (uint32_t x) { return x / 61; }
uint32_t um61 (uint32_t x) { return fastmod::fastmod<61> (x); }
uint32_t um61_ (uint32_t x) { return x % 61; }
int32_t sd61 (int32_t x) { return fastmod::fastdiv<61> (x); }
int32_t sd61_ (int32_t x) { return x / 61; }
int32_t sm61 (int32_t x) { return fastmod::fastmod<61> (x); }
int32_t sm61_ (int32_t x) { return x % 61; }
uint32_t ud279 (uint32_t x) { return fastmod::fastdiv<279> (x); }
uint32_t ud279_ (uint32_t x) { return x / 279; }
uint32_t um279 (uint32_t x) { return fastmod::fastmod<279> (x); }
uint32_t um279_ (uint32_t x) { return x % 279; }
int32_t sd279 (int32_t x) { return fastmod::fastdiv<279> (x); }
int32_t sd279_ (int32_t x) { return x / 279; }
int32_t sm279 (int32_t x) { return fastmod::fastmod<279> (x); }
int32_t sm279_ (int32_t x) { return x % 279; }

E.g. in ud3 there is
        movl    %edi, %eax
        movabsq $6148914691236517206, %rdi
        mulq    %rdi
        movq    %rdx, %rax
and in ud3_
        movl    %edi, %eax
        movl    $2863311531, %edi
        imulq   %rdi, %rax
        shrq    $33, %rax
In um3 there is:
        movabsq $6148914691236517206, %rcx
        movl    %edi, %edi
        xorl    %edx, %edx
        imulq   %rcx, %rdi
        movq    %rdi, %rax
        movq    %rdi, %rsi
        movq    %rdx, %rdi
        addq    %rax, %rsi
        shldq   $1, %rax, %rdi
        addq    %rsi, %rax
        adcq    %rdi, %rdx
        movq    %rdx, %rax
which is terrible; in um3_:
        movl    %edi, %eax
        movl    $2863311531, %edx
        imulq   %rdx, %rax
        shrq    $33, %rax
        leal    (%rax,%rax,2), %eax
        subl    %eax, %edi
        movl    %edi, %eax
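For reference, the sequence we currently emit for ud3_/um3_ corresponds to
this C recipe (a sketch; the magic constant is the one from the imulq above,
2863311531 == ceil(2^33 / 3)):

```cpp
#include <cstdint>

// Sketch of GCC's current codegen for x / 3 and x % 3 on uint32_t:
// multiply by the magic constant ceil(2^33 / 3) in 64-bit arithmetic,
// shift right by 33 (the imulq/shrq pair above), then for the modulus
// subtract q * 3 (the leal/subl pair above).
static inline uint32_t div3_current (uint32_t x)
{
  return (uint32_t) (((uint64_t) x * 2863311531u) >> 33);
}

static inline uint32_t mod3_current (uint32_t x)
{
  uint32_t q = div3_current (x);
  return x - q * 3;
}
```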
In *.optimized um3 is:
  _3 = (long unsigned int) x_2(D);
  lowbits_4 = _3 * 6148914691236517206;
  _6 = lowbits_4 w* 3;
  _7 = _6 >> 64;
  _8 = (unsigned int) _7;
where lowbits_4 is 64-bit and _6 and _7 are 128-bit.  This is another case
where just using
  _3 = (long unsigned int) x_2(D);
  lowbits_4 = _3 * 6148914691236517206;
  _9 = lowbits_4 h* 3;
  _8 = (unsigned int) _9;
might result in better code, though I'm not sure we have a good expander for
highpart multiply by a constant.
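In C the suggested shape would look something like the following sketch
(mulhi_u64 is a hypothetical helper standing in for the h* operation; on
x86-64 it should lower to a single mulq):

```cpp
#include <cstdint>

// Hypothetical stand-in for the h* (highpart multiply) operation.
static inline uint64_t mulhi_u64 (uint64_t a, uint64_t b)
{
  return (uint64_t) (((__uint128_t) a * b) >> 64);
}

// fastmod-style x % 3: M = ceil(2^64 / 3) = 6148914691236517206
// (the movabsq constant above); the modulus is the high 64 bits of
// lowbits * 3, with no other 128-bit arithmetic needed.
static inline uint32_t mod3_fastmod (uint32_t x)
{
  uint64_t lowbits = 6148914691236517206ull * x;
  return (uint32_t) mulhi_u64 (lowbits, 3);
}
```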
