Hello,
I noticed other shells have the same bug, and perhaps you would like to use this code in other GNU projects (like making a library call or an executable). The best, I think, is to transfer the copyright to bash maintainers. You can now copyright and license it the way you want: template<typename T> inline T ipow(register /* unsigned */ T x, register /* unsigned */ T y) { if (x == 0 && y != 0) return 0; // 1: ipow(x,y) = x ** y = Product [i=0; i<=log2(y)] (x ** (((y>>i)&1)*2**i)) // 2: x**(2**i) = x**(2**(i-1)) * x**(2**(i-1)) register T xy = 1; for(; y; y>>=1, x *= x) if (y & 1) xy *= x; return xy; } If you don't wish to use this algoritm, then there is another way of getting around the bug: for exemple, on 64 bit systems, when performing x ** y, if y > 64 then the result will be too large anyway to stand in 64 bits. It is true when you suppose x = 2, and (even more) when x > 2. In this case you may then throw an overflow error. In other cases, y < 64 is a manageable number for the loop algorithm. But, I still prefer the O(log(y)) algorithm, which always computes the exact value (modulus 2**64) in less numbers of steps on average. This is, in detail, how the algorithm works: When, for example, computing 3 ** 3 = 3 * 3 * 3 = 27, you can write it also with a binary power: 3 ** (1*2**1 + 1*2**0), and because, x ** (a+b) = x**a * x**b, you may develop it: 3 ** (2**1) * 3 ** (2**0) This gives the first equation in the comments of the code: // 1: ipow(x,y) = x ** y = Product [i=0; i<=log2(y)] (x ** (((y>>i)&1)*2**i)) The second thing to notice is that we need to compute: 3**2**i, but as 2**i = 2*2**(i-1) = 2**(i-1) + 2**(i-1), then 3**2**i = 3**(2**(i-1) + 2**(i-1)), since, x ** (a+b) = x**a * x**b, then 3**2**i = 3**(2**(i-1) + 2**(i-1)) = 3**(2**(i-1)) * 3**(2**(i-1)). This gives the second equation in the comments of the code: // 2: x**(2**i) = x**(2**(i-1)) * x**(2**(i-1)) and a reccurence to compute x**2**(i+1) when we know x**2**i, from x**2**0 = x. We just then have to roll though the bits of y, computing x = x * x each step, and multiplying xy (the temporary result) by one of the squares of x when the corresponding bit of y is set. As we always do multiplications modulus 2**64, it gives the result modulus 2**64. We only need to use 3 registers: one to roll y, one to hold the squares of x, and one to hold the result. Writing the same algoritm with assembler language can save a few instructions per loop, because gcc doesn't catch that it can use the out-going bit of y to directly jump over xy *= x; if not set. But it won't be portable then. Best regards, Nicolas Argyrou