[issue29282] Fused multiply-add: proposal to add math.fma()
Juraj Sukop added the comment: FWIW, there is a new implementation of FMA [1] which is licensed very permissively [2]. Perhaps it could be used here as well..? [1] https://github.com/smasher164/fma [2] https://github.com/smasher164/fma/commit/4e85d2388c7d4d850be12df918f9431ca687f57a -- ___ Python tracker <https://bugs.python.org/issue29282> ___ ___ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue43053] Speed up math.isqrt, again
New submission from Juraj Sukop : This is a follow up to https://bugs.python.org/issue36887 and https://bugs.python.org/issue36957 . The new `isqrt` is remarkably simple but it does not split the number at hand optimally. Ideally one would want to have 2n/n division everywhere but since the last iteration takes as much effort as all of the iterations before it this is what the attached code focuses on. At least in my testing the `isqrt_2` code below improved the performance by 50% (3s down to 2s, for example) and, if used, perhaps the original `isqrt` could do without the final correction `a - (a*a > n)`. -- components: Extension Modules files: isqrt_2.py messages: 385848 nosy: juraj.sukop priority: normal severity: normal status: open title: Speed up math.isqrt, again type: performance versions: Python 3.8 Added file: https://bugs.python.org/file49770/isqrt_2.py ___ Python tracker <https://bugs.python.org/issue43053> ___ ___ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue43053] Speed up math.isqrt, again
Juraj Sukop added the comment: What the proof goes, you did most of the work already. Consider the following: l = (n.bit_length() - 1)//4 a = isqrt(n >> 2*l) a = ((a << l) + n//(a << l))//2 return a - (a*a > n) This computes the square root of the (possibly longer) upper half, applies one Heron's step and a single correction. I think it is functionally equal to what you wrote. Those zeros don't contribute to the quotient so we could instead write: a = ((a << l) + (n >> l)//a)//2 The problem is that the 3n/n division in the step `(a + n//a)//2` basically recomputes the upper half we already know and so we want to avoid it: instead of 3n/n giving 2n quotient, we want 2n/n giving 1n quotient. If the upper half is correct, the lower half to be taken care of is `n - a**2`: a = (a << l) + ((n - (a << l)**2) >> l)//a//2 And there is no need to square the zeros either: a = (a << l) + ((n - (a**2 << 2*l) >> l)//a//2 So I *think* it should be correct, the only thing I'm not sure about is whether the final correction in the original `isqrt` is needed. Perhaps the automated proof of yours could give an answer? -- ___ Python tracker <https://bugs.python.org/issue43053> ___ ___ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue43053] Speed up math.isqrt, again
Juraj Sukop added the comment: The rounding down of `l` might compute more than half of the bits so that the final Heron' step in `isqrt_2` might correct the uncertain low bit if `a - (a*a > n)` is missing from `isqrt`. As it currently stands, `a - (a*a > n)` is computed both in `isqrt` and `isqrt_2`. So I was thinking that maybe the former might be dropped. Are you saying that both correction are need? -- ___ Python tracker <https://bugs.python.org/issue43053> ___ ___ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue43053] Speed up math.isqrt, again
Juraj Sukop added the comment: Mark, thank you very much for the thorough reply! I believe it is not for me to say whether the complication is worth it or not. What really huge numbers goes, if I'm not mistaken, the original `isqrt` is already "fast" and only constant speed-up is possible. I was also trying to speed up the last correction, i.e. that full-width multiplication, but didn't get too far (only M(n) -> M(n/2)). Continuing from `isqrt_2`: a = (a << l) + ((n - (a**2 << 2*l)) >> l)//a//2 return a - (a*a > n) a = (a << l) + ((n - (a**2 << 2*l)) >> (l + 1))//a return a - (a**2 > n) x = a a = (x << l) + ((n - (x**2 << 2*l)) >> (l + 1))//x return a - (((x << l) + ((n - (x**2 << 2*l)) >> (l + 1))//x)**2 > n) x = a x2 = x**2 a = (x << l) + ((n - (x2 << 2*l)) >> (l + 1))//x return a - (((x << l) + ((n - (x2 << 2*l)) >> (l + 1))//x)**2 > n) x = a x2 = x**2 t = (n - (x2 << 2*l)) >> (l + 1) u = t//x a = (x << l) + u b = ((x << l) + u)**2 return a - (b > n) x = a x2 = x**2 t = (n - (x2 << 2*l)) >> (l + 1) u = t//x a = (x << l) + u b = ((x << l)**2 + 2*(x << l)*u + u**2) return a - (b > n) x = a x2 = x**2 t = (n - (x2 << 2*l)) >> (l + 1) u, v = divmod(t, x) a = (x << l) + u b = ((x2 << 2*l) + ((t - v) << (l + 1)) + u**2) return a - (b > n) But then I got stuck on that `u**2`. The above should be a bit faster but probably is not worth the complication. On the other hand, if it was worth the complication, each iteration would use only single half width multiplication as `x2` of one interaction is `b` of previous iteration. Another idea maybe worth trying (only in C) is to use such an `l` during the iterations that it would make the numbers land on the word boundaries. In other words, computing few more bits here and there may not hurt the performance much but it would replace the shifts by pointer arithmetic. Here the only change is `//64*64`: def isqrt_3(n): l = (n.bit_length() - 1)//4//64*64 a = isqrt(n >> 2*l) a = (a << l) + ((n - (a**2 << 2*l)) >> l)//a//2 return a - (a*a > n) -- ___ Python tracker <https://bugs.python.org/issue43053> ___ ___ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue29282] Fused multiply-add: proposal to add math.fma()
New submission from Juraj Sukop: Fused multiply-add (henceforth FMA) is an operation which calculates the product of two numbers and then the sum of the product and a third number with just one floating-point rounding. More concretely: r = x*y + z The value of `r` is the same as if the RHS was calculated with infinite precision and the rounded to a 32-bit single-precision or 64-bit double-precision floating-point number [1]. Even though one FMA CPU instruction might be calculated faster than the two separate instructions for multiply and add, its main advantage comes from the increased precision of numerical computations that involve the accumulation of products. Examples which benefit from using FMA are: dot product [2], compensated arithmetic [3], polynomial evaluation [4], matrix multiplication, Newton's method and many more [5]. C99 includes [6] `fma` function to `math.h` and emulates the calculation if the FMA instruction is not present on the host CPU [7]. PEP 7 states that "Python versions greater than or equal to 3.6 use C89 with several select C99 features" and that "Future C99 features may be added to this list in the future depending on compiler support" [8]. This proposal is then about adding new `fma` function with the following signature to `math` module: math.fma(x, y, z) '''Return a float representing the result of the operation `x*y + z` with single rounding error, as defined by the platform C library. The result is the same as if the operation was carried with infinite precision and rounded to a floating-point number.''' Attached is a simple module for Python 3 demonstrating the fused multiply-add operation. On my machine, `example.py` prints: 40037.524591982365 horner_double 40037.48821639768horner_fma 40037.49486325783horner_compensated 40037.49486325783horner_decimal [1] https://en.wikipedia.org/wiki/Multiply%E2%80%93accumulate_operation [2] S. Graillat, P. Langlois, N. Louvet. Accurate dot products with FMA. 2006 [3] S. Graillat, Accurate Floating Point Product and Exponentiation. 2007. [4] S. Graillat, P. Langlois, N. Louvet. Improving the compensated Horner scheme with a Fused Multiply and Add. 2006 [5] J.-M. Muller, N. Brisebarre, F. de Dinechin, C.-P. Jeannerod, V. Lefèvre, G. Melquiond, N. Revol, D. Stehlé, S. Torres. Handbook of Floating-Point Arithmetic. 2010. Chapter 5 [6] ISO/IEC 9899:TC3, "7.12.13.1 The fma functions", Committee Draft - Septermber 7, 2007 [7] https://git.musl-libc.org/cgit/musl/tree/src/math/fma.c [8] https://www.python.org/dev/peps/pep-0007/ -- components: Library (Lib) messages: 285547 nosy: juraj.sukop priority: normal severity: normal status: open title: Fused multiply-add: proposal to add math.fma() type: enhancement versions: Python 3.7 ___ Python tracker <http://bugs.python.org/issue29282> ___ ___ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue29282] Fused multiply-add: proposal to add math.fma()
Changes by Juraj Sukop : Added file: http://bugs.python.org/file46298/xmathmodule.c ___ Python tracker <http://bugs.python.org/issue29282> ___ ___ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue29282] Fused multiply-add: proposal to add math.fma()
Changes by Juraj Sukop : Added file: http://bugs.python.org/file46299/example.py ___ Python tracker <http://bugs.python.org/issue29282> ___ ___ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue29282] Fused multiply-add: proposal to add math.fma()
Changes by Juraj Sukop : Added file: http://bugs.python.org/file46300/setup.py ___ Python tracker <http://bugs.python.org/issue29282> ___ ___ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue29282] Fused multiply-add: proposal to add math.fma()
Juraj Sukop added the comment: I would say because it has wide applicability, especially considering the amount of code it adds. It is similar in spirit to `copysign` or `fsum` which are already there and C99 includes it anyway. For simpler things like dot product or polynomial evaluation importing Numpy might be too much of an dependency. -- ___ Python tracker <http://bugs.python.org/issue29282> ___ ___ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com