Feel free to copy this email and attachment to anyone who might be interested. I'm very happy to answer any questions anyone has.

The program can be compiled and run like this on Linux with GNU GCC:

    gcc -O2 -o expmed2.exe expmed2.c
    ./expmed2.exe
This email deals with making part of the GNU GCC compiler - integer division by a constant divisor - faster. (The calculation of the parameters for the machine code will be faster; compiled programs won't run faster.) Further down I mention inequality (1), which can be used to make the LLVM compiler somewhat faster, because that currently uses code based on (2). I don't know what - if anything - the Java JVM uses for this, or how other compilers do this, but these ideas may be useful there.

By significantly faster I mean that I have benchmarked alternative versions of choose_multiplier which, on my low-specification netbook, can take less than half the time the current version takes. Time saved in compilation is much less important than time saved in running compiled programs, but the code for the faster versions is about the same length as the code for the current version, and is only a bit more complicated, so it seems worth considering.

A short summary of what follows: choose_multiplier currently uses an elegant algorithm due to Granlund & Montgomery, but as implemented it seems rather slow. We can make it faster while retaining the basic structure, and using a different, mostly equivalent, algorithm may be a bit faster still.

Licencing: in Fractint people's words, "Don't want money, want recognition!" The version choose_multiplier_v2 is based on - but improves - what's in the GCC choose_multiplier function in file expmed.c, so the GCC licence applies. The version choose_multiplier_v4 is based on - but improves - magicu2 in "Hacker's Delight", so the licence is: you needn't acknowledge the source, but it would be nice to credit the code as from magicu2 in "Hacker's Delight" by Henry S Warren (http://hackersdelight.org) with improvements by Colin Bartlett <coli...@gmail.com>. The latter also applies to choose_multiplier_power2_divrem, because that is also an obvious (?) idea from magicu2 in "Hacker's Delight".
The idea is that using "wide int" seems slow compared to "unsigned HOST_WIDE_INT", so it makes sense to avoid using "wide int" as much as possible. We can easily rewrite choose_multiplier to only use "wide int" to calculate the initial mlow; this is choose_multiplier_v2. An alternative for choose_multiplier_v2 completely avoids "wide int" by iterating upwards for the initial mlow; and if that benchmarks as faster than using "wide int" even once (I suspect it might), then just iterating upwards all the way may be a bit faster still; this is choose_multiplier_v4.

The attachment is self-contained, and I have checked that the values produced agree with a "Hacker's Delight" table of M/2^s for small d and n = precision = 32.

What follows is a short summary of the theory, applying it to choose_multiplier.

Problem: find M/2^s so that for 0 <= i <= iMax, where iMax >= d, we have floor(i/d) = floor(i*M/2^s).

Let qc = floor((iMax+1)/d); nc = qc*d - 1; lgup = ceiling(log2(d)).
Given s, let M = floor(2^s/d) + 1; delta = M*d - 2^s.

For GCC choose_multiplier:

* Equivalent necessary & sufficient conditions:

  (1) 0 < delta and qc*delta < M
  (2) 0 < delta and nc*delta < 2^s

  Proof of (1) if and only if (2):
  qc*delta*d - delta = nc*delta < 2^s = M*d - delta

  (3) 1/d < M/2^s < (1 + 1/nc) * 1/d

* Equivalent sufficient conditions: we have nc < 2^precision, so:

  (4)   0 < delta and 2^precision * delta <= 2^s
  (4.1) 0 < delta and delta <= 2^(s-precision)
  (5)   1/d < M/2^s <= (1 + 1/2^precision) * 1/d
  (5.1) 2^s/d < M <= (2^s + 2^(s-precision))/d

(1) seems to be new to the literature, but is equivalent to (2), which is in "Hacker's Delight" (Chapter 10) by Henry S Warren. Both give upwards-iterating algorithms which give minimal M/2^s and which avoid needing to use "wide int", but the algorithm code using (1) is simpler and faster than that using (2). (4.1) is in "Hacker's Delight"; it gives an upwards-iterating algorithm which avoids using "wide int", and the algorithm code is a bit simpler than using (1).
Mostly this gives the same result as GCC choose_multiplier, which implements (possibly slowly, because it uses "wide int") the elegant algorithm due to Granlund & Montgomery (which perhaps builds on work by Alverson). We can prove that if iMax < 2^precision, then for 0 <= d < 2^(precision/2) inequality (4) etc give the same M/2^s as inequality (1) etc. For n = precision = 32 the smallest d for which (4) gives a non-minimal M is d = 102807.

Rough, not necessarily reliable, statistics suggest that using (4) etc with n = precision, the average post_shift is lgup - 1. So if we can *quickly* calculate M/2^s at s = n + lgup - 1, then either that fails and we need to use an extra "top" bit for M and use post_shift = lgup, or it works and we can try reducing M. The catch is *if* we can *quickly* calculate M/2^s at s = n + lgup - 1: it's easy to calculate it directly at that s using "wide int", but "wide int" seems slow, and it might actually be faster to iterate upwards from s = n and completely avoid "wide int". In any case "wide int" seems slow compared with "unsigned HOST_WIDE_INT", so it makes sense to avoid "wide int" as far as possible.

So in the attachment choose_multiplier_v2 rewrites choose_multiplier to either:

(a) only use "wide int" to calculate the initial mlow at s = n + lgup - 1; all other calculations are made using "unsigned HOST_WIDE_INT"; or

(b) avoid using "wide int" to calculate the initial mlow at s = n + lgup - 1, by iterating upwards from s = n to find the initial mlow.

But if (b) benchmarks as faster than (a), which I suspect might be the case, then it may even be on average a bit faster to use choose_multiplier_v4, which iterates upwards to find M/2^s, and which avoids needing to calculate lgup unless that is useful as a return value of choose_multiplier.

Colin Bartlett
#include <stdio.h>
#include <stdlib.h>

#define HOST_WIDE_INT int
#define HOST_BITS_PER_WIDE_INT 32
#define HOST_BITS_PER_DOUBLE_INT (2 * HOST_BITS_PER_WIDE_INT)

/* Making GNU GCC choose_multiplier in expmed.c significantly faster.

   By which we mean up to 50% or more faster for the compiler to calculate
   parameters for the machine code for integer division by a constant
   divisor.  (Compiled programs won't run faster.)

   Licencing: in Fractint people's words "Don't want money, want
   recognition!"  The version choose_multiplier_v2 is based - but improves -
   on what's in the GCC choose_multiplier function in file expmed.c, so the
   GCC licence.  The version choose_multiplier_v4 is based - but improves -
   on magicu2 in "Hacker's Delight", so the licence is you needn't
   acknowledge the source but it would be nice to credit the code as from
   magicu2 in "Hacker's Delight" by Henry S Warren http://hackersdelight.org
   with improvements by Colin Bartlett <coli...@gmail.com>.  This latter
   also applies to choose_multiplier_power2_divrem because that is also an
   obvious (?) idea from magicu2 in "Hacker's Delight".  */

int
ceil_log2 (unsigned long long int iv)
{
  /* For now do it the long way.  */
  int s;
  if (iv == 0)
    return -1;
  iv -= 1;
  s = 0;
  while (iv)
    {
      s += 1;
      iv = iv >> 1;
    }
  return s;
}

void
gcc_assert (int iv)
{
  if (iv)
    return;
  printf ("gcc_assert error\n");
  exit (1);
}

/* Choose a minimal N + 1 bit approximation to 1/D that can be used to
   replace division by D, and put the least significant N bits of the
   result in *MULTIPLIER_PTR and return the most significant bit.

   The width of operations is N (should be <= HOST_BITS_PER_WIDE_INT), the
   needed precision is in PRECISION (should be <= N).

   PRECISION should be as small as possible so this function can choose the
   multiplier more freely.

   The rounded-up logarithm of D is placed in *LGUP_PTR.  A shift count
   that is to be used for a final right shift is placed in *POST_SHIFT_PTR.
   Using this function, x/D will be equal to (x * m) >> (*POST_SHIFT_PTR),
   where m is the full HOST_BITS_PER_WIDE_INT + 1 bit multiplier.  */

#define CMUHWI1 ((unsigned HOST_WIDE_INT) 1)
// #define CMUHWI2 ((unsigned HOST_WIDE_INT) 2)

/* Licencing: */

void
choose_multiplier_power2_divrem (int two_exp, unsigned HOST_WIDE_INT d,
                                 unsigned HOST_WIDE_INT *quotient,
                                 unsigned HOST_WIDE_INT *remainder)
{
  /* Must have that 2^two_exp / d < 2^HOST_BITS_PER_WIDE_INT.  */
  unsigned HOST_WIDE_INT q, r, rtest;
  int s;
  if (two_exp < HOST_BITS_PER_WIDE_INT)
    {
      r = CMUHWI1 << two_exp;
      q = r / d;
      r = r - q * d;
    }
  // else if (0 == 0)
  //   {
  //     /* Using wide_int seems slowish & it may be faster to iterate
  //        upwards.  */
  //     wide_int val = wi::set_bit_in_zero (two_exp, HOST_BITS_PER_DOUBLE_INT);
  //     q = wi::udiv_trunc (val, d).to_uhwi ();
  //     r = 0 - q * d;
  //   }
  else
    {
      /* Iterate upwards to get q, r; there may be "overflows" but that's
         OK as we are using unsigned integers.  */
      s = HOST_BITS_PER_WIDE_INT;
      q = (0 - d) / d + 1;
      r = 0 - q * d;
      rtest = (d - 1) >> 1;
      while (s < two_exp)
        {
          s = s + 1;
          if (r <= rtest)
            {
              q = q << 1;
              r = r << 1;
            }
          else
            {
              q = (q << 1) | 1;
              r = (r << 1) - d;
            }
        }
    }
  *quotient = q;
  *remainder = r;
  return;
}

/* Why is choose_multiplier "unsigned HOST_WIDE_INT" instead of just "int"?
   It only returns 0 or 1.  */

int
choose_multiplier_v2 (unsigned HOST_WIDE_INT d, int n, int precision,
                      unsigned HOST_WIDE_INT *multiplier_ptr,
                      int *post_shift_ptr, int *lgup_ptr)
{
  int lgup, shiftv, topbit, s;
  unsigned HOST_WIDE_INT mlow, mhigh, mlowv, mhighv;
  /* lgup = ceil(log2(divisor)); */
  lgup = ceil_log2 (d);
  gcc_assert (lgup <= n);
  if (lgup == 0)
    {
      /* It's easier to deal with d = 1 separately, as that is the only d
         for which we need to be very careful about avoiding shifting bits
         by >= HOST_BITS_PER_WIDE_INT.
      */
      *multiplier_ptr = CMUHWI1 << (n - precision);
      *post_shift_ptr = 0;
      *lgup_ptr = lgup;
      return 1;
    }
  topbit = 0;
  /* shiftv = (n or precision) + lgup - 1
     mlow = 2^shiftv / d
     mhigh = (2^shiftv + 2^(shiftv - precision)) / d
     An unlikely case is if precision < lgup, when we could just use
     mhigh = 0, shiftv = 0.  */
  shiftv = n + lgup - 1;
  choose_multiplier_power2_divrem (shiftv, d, &mlow, &mlowv);
  s = shiftv - precision;
  /* Avoid shifts by >= HOST_BITS_PER_WIDE_INT.  */
  mhigh = precision < HOST_BITS_PER_WIDE_INT ? mlow >> precision : 0;
  mhighv = (s < HOST_BITS_PER_WIDE_INT ? CMUHWI1 << s : 0) - d * mhigh;
  mhigh += mlow + (mlowv < d - mhighv ? 0 : 1);
  if (mlow < mhigh)
    {
      /* Reduce to lowest terms.  */
      shiftv -= n;
      while (shiftv > 0)
        {
          mlowv = mlow >> 1;
          mhighv = mhigh >> 1;
          if (mlowv >= mhighv)
            break;
          mlow = mlowv;
          mhigh = mhighv;
          shiftv -= 1;
        }
    }
  else
    {
      mhigh = ((mhigh - (CMUHWI1 << (n - 1))) << 1) | 1;
      shiftv = lgup;
      topbit = 1;
    }
  *multiplier_ptr = mhigh;
  *post_shift_ptr = shiftv;
  *lgup_ptr = lgup;
  return topbit;
}

int
choose_multiplier_v4 (unsigned HOST_WIDE_INT d, int n, int precision,
                      unsigned HOST_WIDE_INT *multiplier_ptr,
                      int *post_shift_ptr, int *lgup_ptr)
{
  int lgup, shiftv, topbit;
  unsigned HOST_WIDE_INT q, delta, deltatest, twos, topbitcheck;
  /* lgup = ceil(log2(divisor)); */
  lgup = ceil_log2 (d);
  gcc_assert (lgup <= n);
  if (lgup == 0)
    {
      /* It's easier to deal with d = 1 separately, as that is the only d
         for which we need to be very careful about avoiding shifting bits
         by >= HOST_BITS_PER_WIDE_INT.  */
      *multiplier_ptr = CMUHWI1 << (n - precision);
      *post_shift_ptr = 0;
      *lgup_ptr = lgup;
      return 1;
    }
  /* Iterate upwards to find multiplier and post_shift.
   */
  topbit = 0;
  shiftv = 0;
  twos = CMUHWI1 << (n - precision);
  topbitcheck = CMUHWI1 << (n - 1);
  deltatest = d >> 1;
  if (n < HOST_BITS_PER_WIDE_INT)
    {
      delta = CMUHWI1 << n;
      q = delta / d;
    }
  else
    {
      delta = 0;
      q = (delta - d) / d + 1;
    }
  delta = (q + 1) * d - delta;
  while (delta > twos)
    {
      shiftv += 1;
      if (delta <= deltatest)
        {
          q = (q << 1) | 1;
          delta = delta << 1;
        }
      else if (q < topbitcheck)
        {
          q = q << 1;
          delta = (delta << 1) - d;
        }
      else
        {
          topbit = 1;
          q = (q - topbitcheck) << 1;
          break;
        }
      twos = twos << 1;
    }
  *multiplier_ptr = q + 1;
  *post_shift_ptr = shiftv;
  *lgup_ptr = lgup;
  return topbit;
}

int
test_v2 (int n, int precision, unsigned HOST_WIDE_INT d, int qshow)
{
  int s, lgup, rv, sv, rvv, neq;
  unsigned HOST_WIDE_INT m, mv;
  rv = choose_multiplier_v2 (d, n, precision, &m, &s, &lgup);
  rvv = choose_multiplier_v4 (d, n, precision, &mv, &sv, &lgup);
  neq = (m == mv ? 0 : m > mv ? 1 : 2) | (rv == rvv ? 0 : 4) | (s == sv ?
        0 : 8);
  if (qshow >= 4 || (neq && qshow))
    printf ("//#// N %2d P %2d L %2d d %10d = 0x%8x; "
            "M 0x %8x %8x rv %1d %1d s %2d %2d; neq %2d;\n",
            n, precision, lgup, d, d, m, mv, rv, rvv, s, sv, neq);
  return neq;
}

void
many_test_v2 (int qshow)
{
  test_v2 (32, 32, 1, qshow);
  test_v2 (32, 32, 2, qshow);
  test_v2 (32, 32, 4, qshow);
  test_v2 (32, 32, 8, qshow);
  test_v2 (32, 32, 1 << 6, qshow);
  test_v2 (32, 31, 1 << 6, qshow);
  test_v2 (32, 30, 1 << 6, qshow);
  test_v2 (32, 29, 1 << 6, qshow);
  test_v2 (32, 28, 1 << 6, qshow);
  test_v2 (32, 27, 1 << 6, qshow);
  test_v2 (32, 26, 1 << 6, qshow);
  test_v2 (32, 25, 1 << 6, qshow);
  test_v2 (32, 32, 0x800000, qshow);
  test_v2 (32, 32, 3, qshow);
  test_v2 (32, 32, 5, qshow);
  test_v2 (32, 32, 6, qshow);
  test_v2 (32, 32, 7, qshow);
  test_v2 (32, 31, 7, qshow);
  test_v2 (32, 30, 7, qshow);
  test_v2 (32, 29, 7, qshow);
  test_v2 (32, 28, 7, qshow);
  test_v2 (32, 32, 9, qshow);
  test_v2 (32, 32, 10, qshow);
  test_v2 (32, 32, 11, qshow);
  test_v2 (32, 32, 12, qshow);
  test_v2 (32, 32, 25, qshow);
  test_v2 (32, 32, 125, qshow);
  test_v2 (32, 32, 625, qshow);
  test_v2 (32, 32, 102807, qshow);
  test_v2 (32, 31, 102807, qshow);
  test_v2 (32, 30, 102807, qshow);
  return;
}

void
lots_test_v2 ()
{
  int qshow = 1;
  test_v2 (32, 32, 1, qshow);
  return;
}

int
main ()
{
  many_test_v2 (4);
  exit (0);
}