Feel free to copy this email and attachment to anyone who might be interested. I'm very happy to answer any questions anyone has.

The program can be compiled and run like this on Linux with GNU GCC:

    gcc -O2 -o expmed2.exe expmed2.c
    ./expmed2.exe
This email deals with making part of the GNU GCC compiler - integer division by a constant divisor - faster. (The calculation of the parameters for the machine code will be faster; compiled programs won't run faster.) Further down I mention inequality (1), which can be used to make the LLVM compiler somewhat faster, because that currently uses code based on (2). I don't know what - if anything - the Java JVM uses for this, or how other compilers do this, but these ideas may be useful there.

By significantly faster I mean that I have benchmarked alternative versions of choose_multiplier which, on my low-specification netbook, can take less than half the time the current version takes. Time saved in compilation is much less important than time saved in running compiled programs, but the code for the faster versions is about the same length as the code for the current version, and is only a bit more complicated, so it seems worth considering.

A short summary of what follows: choose_multiplier currently uses an elegant algorithm due to Granlund & Montgomery, but as implemented it seems rather slow. We can make it faster while retaining the basic structure, and using a different, mostly equivalent, algorithm may be a bit faster still.

Licencing: in Fractint people's words, "Don't want money, want recognition!" The version choose_multiplier_v2 is based on - but improves - what's in the GCC choose_multiplier function in file expmed.c, so the GCC licence applies. The version choose_multiplier_v4 is based on - but improves - magicu2 in "Hacker's Delight", so the licence is: you needn't acknowledge the source, but it would be nice to credit the code as from magicu2 in "Hacker's Delight" by Henry S Warren (http://hackersdelight.org) with improvements by Colin Bartlett <coli...@gmail.com>. The latter also applies to choose_multiplier_power2_divrem, because that is also an obvious (?) idea from magicu2 in "Hacker's Delight".
The idea is that using "wide int" seems slow compared to "unsigned HOST_WIDE_INT", so it makes sense to avoid using "wide int" as much as possible. We can easily rewrite choose_multiplier to only use "wide int" to calculate the initial mlow; this is choose_multiplier_v2. An alternative for choose_multiplier_v2 completely avoids "wide int" by iterating upwards for the initial mlow; and if that benchmarks as faster than using "wide int" even once (I suspect it might), then just iterating upwards all the way may be a bit faster still; this is choose_multiplier_v4.

The attachment is self-contained, and I have checked that the values produced agree with a "Hacker's Delight" table of M/2^s for small d and n = precision = 32.

What follows is a short summary of the theory, applying it to choose_multiplier.

Problem: find M/2^s so that for 0 <= i <= iMax, where iMax >= d, we have floor(i/d) = floor(i*M/2^s).

Let qc = floor((iMax+1)/d); nc = qc*d - 1; lgup = ceiling(log2(d)).
Given s, let M = floor(2^s/d) + 1; delta = M*d - 2^s.

For GCC choose_multiplier:

* Equivalent necessary & sufficient conditions:

  (1) 0 < delta and qc*delta < M
  (2) 0 < delta and nc*delta < 2^s

  Proof of (1) if and only if (2):
  qc*delta*d - delta = nc*delta < 2^s = M*d - delta

  (3) 1/d < M/2^s < (1 + 1/nc) * 1/d

* Equivalent sufficient conditions: we have nc < 2^precision, so:

  (4)   0 < delta and 2^precision * delta <= 2^s
  (4.1) 0 < delta and delta <= 2^(s-precision)
  (5)   1/d < M/2^s <= (1 + 1/2^precision) * 1/d
  (5.1) 2^s/d < M <= (2^s + 2^(s-precision))/d

(1) seems to be new to the literature, but is equivalent to (2), which is in "Hacker's Delight" (Chapter 10) by Henry S Warren. Both give upwards-iterating algorithms which give minimal M/2^s and which avoid needing to use "wide int", but the algorithm code using (1) is simpler and faster than that using (2). (4.1) is in "Hacker's Delight"; it gives an upwards-iterating algorithm which avoids using "wide int", and the algorithm code is a bit simpler than using (1).
Mostly this gives the same result as GCC choose_multiplier, which implements (possibly slowly, because it uses "wide int") the elegant algorithm due to Granlund & Montgomery (which perhaps builds on work by Alverson). We can prove that if iMax < 2^precision, then for 0 <= d < 2^(precision/2) inequality (4) etc give the same M/2^s as inequality (1) etc. For n = precision = 32 the smallest d for which (4) gives a non-minimal M is d = 102807.

Rough, not necessarily reliable, statistics suggest that using (4) etc with n = precision, the average post_shift is lgup - 1. So if we can *quickly* calculate M/2^s at s = n + lgup - 1, then either that fails and we need to use an extra "top" bit for M and use post_shift = lgup, or it works and we can try reducing M. The catch is *if* we can *quickly* calculate M/2^s at s = n + lgup - 1: it's easy to calculate it directly at that s using "wide int", but "wide int" seems slow, and it might actually be faster to iterate upwards from s = n and completely avoid "wide int". In any case "wide int" seems slow compared with "unsigned HOST_WIDE_INT", so it makes sense to avoid "wide int" as far as possible.

So in the attachment choose_multiplier_v2 rewrites choose_multiplier to either:

(a) only use "wide int" to calculate the initial mlow at s = n + lgup - 1; all other calculations are made using "unsigned HOST_WIDE_INT"; or

(b) avoid using "wide int" to calculate the initial mlow at s = n + lgup - 1, by iterating upwards from s = n to find the initial mlow.

But if (b) benchmarks as faster than (a), which I suspect might be the case, then it may even be on average a bit faster to use choose_multiplier_v4, which iterates upwards to find M/2^s, and which avoids needing to calculate lgup unless that is useful as a return value of choose_multiplier.

Colin Bartlett
#include <stdio.h>
#include <stdlib.h>

#define HOST_WIDE_INT int
#define HOST_BITS_PER_WIDE_INT 32
#define HOST_BITS_PER_DOUBLE_INT (2 * HOST_BITS_PER_WIDE_INT)

/* Making GNU GCC choose_multiplier in expmed.c significantly faster.

   By which we mean up to 50% or more faster for the compiler to calculate
   parameters for the machine code for integer division by a constant
   divisor.  (Compiled programs won't run faster.)

   Licencing: in Fractint people's words "Don't want money, want
   recognition!"  The version choose_multiplier_v2 is based - but improves -
   on what's in the GCC choose_multiplier function in file expmed.c, so the
   GCC licence.  The version choose_multiplier_v4 is based - but improves -
   on magicu2 in "Hacker's Delight", so the licence is you needn't
   acknowledge the source but it would be nice to credit the code as from
   magicu2 in "Hacker's Delight" by Henry S Warren http://hackersdelight.org
   with improvements by Colin Bartlett <coli...@gmail.com>.  This latter
   also applies to choose_multiplier_power2_divrem because that is also an
   obvious (?) idea from magicu2 in "Hacker's Delight".  */

int
ceil_log2 (unsigned long long int iv)
{
  /* For now do it the long way.  */
  int s;
  if (iv == 0)
    return -1;
  iv -= 1;
  s = 0;
  while (iv)
    {
      s += 1;
      iv = iv >> 1;
    }
  return s;
}

void
gcc_assert (int iv)
{
  if (iv)
    return;
  printf ("gcc_assert error\n");
  exit (1);
}

/* Choose a minimal N + 1 bit approximation to 1/D that can be used to
   replace division by D, and put the least significant N bits of the
   result in *MULTIPLIER_PTR and return the most significant bit.

   The width of operations is N (should be <= HOST_BITS_PER_WIDE_INT), the
   needed precision is in PRECISION (should be <= N).

   PRECISION should be as small as possible so this function can choose the
   multiplier more freely.

   The rounded-up logarithm of D is placed in *LGUP_PTR.  A shift count
   that is to be used for a final right shift is placed in *POST_SHIFT_PTR.
   Using this function, x/D will be equal to (x * m) >> (*POST_SHIFT_PTR),
   where m is the full HOST_BITS_PER_WIDE_INT + 1 bit multiplier.  */

#define CMUHWI1 ((unsigned HOST_WIDE_INT) 1)
// #define CMUHWI2 ((unsigned HOST_WIDE_INT) 2)

/* Licencing: */

void
choose_multiplier_power2_divrem (int two_exp, unsigned HOST_WIDE_INT d,
                                 unsigned HOST_WIDE_INT *quotient,
                                 unsigned HOST_WIDE_INT *remainder)
{
  /* Must have that 2^two_exp / d < 2^HOST_BITS_PER_WIDE_INT.  */
  unsigned HOST_WIDE_INT q, r, rtest;
  int s;
  if (two_exp < HOST_BITS_PER_WIDE_INT)
    {
      r = CMUHWI1 << two_exp;
      q = r / d;
      r = r - q * d;
    }
  // else if (0 == 0)
  //   {
  //     /* Using wide_int seems slowish & it may be faster to iterate
  //        upwards.  */
  //     wide_int val = wi::set_bit_in_zero (two_exp, HOST_BITS_PER_DOUBLE_INT);
  //     q = wi::udiv_trunc (val, d).to_uhwi ();
  //     r = 0 - q * d;
  //   }
  else
    {
      /* Iterate upwards to get q, r; there may be "overflows" but that's
         OK as we are using unsigned integers.  */
      s = HOST_BITS_PER_WIDE_INT;
      q = (0 - d) / d + 1;
      r = 0 - q * d;
      rtest = (d - 1) >> 1;
      while (s < two_exp)
        {
          s = s + 1;
          if (r <= rtest)
            {
              q = q << 1;
              r = r << 1;
            }
          else
            {
              q = (q << 1) | 1;
              r = (r << 1) - d;
            }
        }
    }
  *quotient = q;
  *remainder = r;
  return;
}

/* Why is choose_multiplier "unsigned HOST_WIDE_INT" instead of just "int"?
   It only returns 0 or 1.  */

int
choose_multiplier_v2 (unsigned HOST_WIDE_INT d, int n, int precision,
                      unsigned HOST_WIDE_INT *multiplier_ptr,
                      int *post_shift_ptr, int *lgup_ptr)
{
  int lgup, shiftv, topbit, s;
  unsigned HOST_WIDE_INT mlow, mhigh, mlowv, mhighv;
  /* lgup = ceil(log2(divisor)); */
  lgup = ceil_log2 (d);
  gcc_assert (lgup <= n);
  if (lgup == 0)
    {
      /* It's easier to deal with d = 1 separately, as that is the only d
         for which we need to be very careful about avoiding shifting bits
         by >= HOST_BITS_PER_WIDE_INT.
      */
      *multiplier_ptr = CMUHWI1 << (n - precision);
      *post_shift_ptr = 0;
      *lgup_ptr = lgup;
      return 1;
    }
  topbit = 0;
  /* shiftv = (n or precision) + lgup - 1
     mlow = 2^shiftv / d
     mhigh = (2^shiftv + 2^(shiftv - precision)) / d
     An unlikely case is if precision < lgup, when we could just use
     mhigh = 0, shiftv = 0.  */
  shiftv = n + lgup - 1;
  choose_multiplier_power2_divrem (shiftv, d, &mlow, &mlowv);
  s = shiftv - precision;
  /* Avoid shifts by >= HOST_BITS_PER_WIDE_INT.  */
  mhigh = precision < HOST_BITS_PER_WIDE_INT ? mlow >> precision : 0;
  mhighv = (s < HOST_BITS_PER_WIDE_INT ? CMUHWI1 << s : 0) - d * mhigh;
  mhigh += mlow + (mlowv < d - mhighv ? 0 : 1);
  if (mlow < mhigh)
    {
      /* Reduce to lowest terms.  */
      shiftv -= n;
      while (shiftv > 0)
        {
          mlowv = mlow >> 1;
          mhighv = mhigh >> 1;
          if (mlowv >= mhighv)
            break;
          mlow = mlowv;
          mhigh = mhighv;
          shiftv -= 1;
        }
    }
  else
    {
      mhigh = ((mhigh - (CMUHWI1 << (n - 1))) << 1) | 1;
      shiftv = lgup;
      topbit = 1;
    }
  *multiplier_ptr = mhigh;
  *post_shift_ptr = shiftv;
  *lgup_ptr = lgup;
  return topbit;
}

int
choose_multiplier_v4 (unsigned HOST_WIDE_INT d, int n, int precision,
                      unsigned HOST_WIDE_INT *multiplier_ptr,
                      int *post_shift_ptr, int *lgup_ptr)
{
  int lgup, shiftv, topbit;
  unsigned HOST_WIDE_INT q, delta, deltatest, twos, topbitcheck;
  /* lgup = ceil(log2(divisor)); */
  lgup = ceil_log2 (d);
  gcc_assert (lgup <= n);
  if (lgup == 0)
    {
      /* It's easier to deal with d = 1 separately, as that is the only d
         for which we need to be very careful about avoiding shifting bits
         by >= HOST_BITS_PER_WIDE_INT.  */
      *multiplier_ptr = CMUHWI1 << (n - precision);
      *post_shift_ptr = 0;
      *lgup_ptr = lgup;
      return 1;
    }
  /* Iterate upwards to find multiplier and post_shift.
   */
  topbit = 0;
  shiftv = 0;
  twos = CMUHWI1 << (n - precision);
  topbitcheck = CMUHWI1 << (n - 1);
  deltatest = d >> 1;
  if (n < HOST_BITS_PER_WIDE_INT)
    {
      delta = CMUHWI1 << n;
      q = delta / d;
    }
  else
    {
      delta = 0;
      q = (delta - d) / d + 1;
    }
  delta = (q + 1) * d - delta;
  while (delta > twos)
    {
      shiftv += 1;
      if (delta <= deltatest)
        {
          q = (q << 1) | 1;
          delta = delta << 1;
        }
      else if (q < topbitcheck)
        {
          q = q << 1;
          delta = (delta << 1) - d;
        }
      else
        {
          topbit = 1;
          q = (q - topbitcheck) << 1;
          break;
        }
      twos = twos << 1;
    }
  *multiplier_ptr = q + 1;
  *post_shift_ptr = shiftv;
  *lgup_ptr = lgup;
  return topbit;
}

int
test_v2 (int n, int precision, unsigned HOST_WIDE_INT d, int qshow)
{
  int s, lgup, rv, sv, rvv, neq;
  unsigned HOST_WIDE_INT m, mv;
  rv = choose_multiplier_v2 (d, n, precision, &m, &s, &lgup);
  rvv = choose_multiplier_v4 (d, n, precision, &mv, &sv, &lgup);
  neq = (m == mv ? 0 : m > mv ? 1 : 2) | (rv == rvv ? 0 : 4) | (s == sv ?
        0 : 8);
  if (qshow >= 4 || (neq && qshow))
    printf ("//#// N %2d P %2d L %2d d %10d = 0x%8x; "
            "M 0x %8x %8x rv %1d %1d s %2d %2d; neq %2d;\n",
            n, precision, lgup, d, d, m, mv, rv, rvv, s, sv, neq);
  return neq;
}

void
many_test_v2 (int qshow)
{
  test_v2 (32, 32, 1, qshow);
  test_v2 (32, 32, 2, qshow);
  test_v2 (32, 32, 4, qshow);
  test_v2 (32, 32, 8, qshow);
  test_v2 (32, 32, 1 << 6, qshow);
  test_v2 (32, 31, 1 << 6, qshow);
  test_v2 (32, 30, 1 << 6, qshow);
  test_v2 (32, 29, 1 << 6, qshow);
  test_v2 (32, 28, 1 << 6, qshow);
  test_v2 (32, 27, 1 << 6, qshow);
  test_v2 (32, 26, 1 << 6, qshow);
  test_v2 (32, 25, 1 << 6, qshow);
  test_v2 (32, 32, 0x800000, qshow);
  test_v2 (32, 32, 3, qshow);
  test_v2 (32, 32, 5, qshow);
  test_v2 (32, 32, 6, qshow);
  test_v2 (32, 32, 7, qshow);
  test_v2 (32, 31, 7, qshow);
  test_v2 (32, 30, 7, qshow);
  test_v2 (32, 29, 7, qshow);
  test_v2 (32, 28, 7, qshow);
  test_v2 (32, 32, 9, qshow);
  test_v2 (32, 32, 10, qshow);
  test_v2 (32, 32, 11, qshow);
  test_v2 (32, 32, 12, qshow);
  test_v2 (32, 32, 25, qshow);
  test_v2 (32, 32, 125, qshow);
  test_v2 (32, 32, 625, qshow);
  test_v2 (32, 32, 102807, qshow);
  test_v2 (32, 31, 102807, qshow);
  test_v2 (32, 30, 102807, qshow);
  return;
}

void
lots_test_v2 ()
{
  int qshow = 1;
  test_v2 (32, 32, 1, qshow);
  return;
}

int
main ()
{
  many_test_v2 (4);
  exit (0);
}