[issue29282] Fused multiply-add: proposal to add math.fma()

2020-01-13 Thread Juraj Sukop


Juraj Sukop  added the comment:

FWIW, there is a new implementation of FMA [1] which is licensed very 
permissively [2]. Perhaps it could be used here as well..?

[1] https://github.com/smasher164/fma
[2] 
https://github.com/smasher164/fma/commit/4e85d2388c7d4d850be12df918f9431ca687f57a

--

___
Python tracker 
<https://bugs.python.org/issue29282>
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue43053] Speed up math.isqrt, again

2021-01-28 Thread Juraj Sukop


New submission from Juraj Sukop :

This is a follow up to https://bugs.python.org/issue36887 and 
https://bugs.python.org/issue36957 .

The new `isqrt` is remarkably simple but it does not split the number at hand 
optimally. Ideally one would want to have 2n/n division everywhere but since 
the last iteration takes as much effort as all of the iterations before it this 
is what the attached code focuses on.

At least in my testing the `isqrt_2` code below improved the performance by 50% 
(3s down to 2s, for example) and, if used, perhaps the original `isqrt` could 
do without the final correction `a - (a*a > n)`.

--
components: Extension Modules
files: isqrt_2.py
messages: 385848
nosy: juraj.sukop
priority: normal
severity: normal
status: open
title: Speed up math.isqrt, again
type: performance
versions: Python 3.8
Added file: https://bugs.python.org/file49770/isqrt_2.py

___
Python tracker 
<https://bugs.python.org/issue43053>
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue43053] Speed up math.isqrt, again

2021-01-28 Thread Juraj Sukop


Juraj Sukop  added the comment:

What the proof goes, you did most of the work already. Consider the following:

l = (n.bit_length() - 1)//4
a = isqrt(n >> 2*l)
a = ((a << l) + n//(a << l))//2
return a - (a*a > n)

This computes the square root of the (possibly longer) upper half, applies one 
Heron's step and a single correction. I think it is functionally equal to what 
you wrote. Those zeros don't contribute to the quotient so we could instead 
write:

a = ((a << l) + (n >> l)//a)//2

The problem is that the 3n/n division in the step `(a + n//a)//2` basically 
recomputes the upper half we already know and so we want to avoid it: instead 
of 3n/n giving 2n quotient, we want 2n/n giving 1n quotient. If the upper half 
is correct, the lower half to be taken care of is `n - a**2`:

a = (a << l) + ((n - (a << l)**2) >> l)//a//2

And there is no need to square the zeros either:

a = (a << l) + ((n - (a**2 << 2*l) >> l)//a//2

So I *think* it should be correct, the only thing I'm not sure about is whether 
the final correction in the original `isqrt` is needed. Perhaps the automated 
proof of yours could give an answer?

--

___
Python tracker 
<https://bugs.python.org/issue43053>
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue43053] Speed up math.isqrt, again

2021-01-28 Thread Juraj Sukop


Juraj Sukop  added the comment:

The rounding down of `l` might compute more than half of the bits so that the 
final Heron' step in `isqrt_2` might correct the uncertain low bit if `a - (a*a 
> n)` is missing from `isqrt`.

As it currently stands, `a - (a*a > n)` is computed both in `isqrt` and 
`isqrt_2`. So I was thinking that maybe the former might be dropped.

Are you saying that both correction are need?

--

___
Python tracker 
<https://bugs.python.org/issue43053>
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue43053] Speed up math.isqrt, again

2021-01-30 Thread Juraj Sukop


Juraj Sukop  added the comment:

Mark, thank you very much for the thorough reply!

I believe it is not for me to say whether the complication is worth it or not. 
What really huge numbers goes, if I'm not mistaken, the original `isqrt` is 
already "fast" and only constant speed-up is possible.

I was also trying to speed up the last correction, i.e. that full-width 
multiplication, but didn't get too far (only M(n) -> M(n/2)). Continuing from 
`isqrt_2`:

a = (a << l) + ((n - (a**2 << 2*l)) >> l)//a//2
return a - (a*a > n)

a = (a << l) + ((n - (a**2 << 2*l)) >> (l + 1))//a
return a - (a**2 > n)

x = a
a = (x << l) + ((n - (x**2 << 2*l)) >> (l + 1))//x
return a - (((x << l) + ((n - (x**2 << 2*l)) >> (l + 1))//x)**2 > n)

x = a
x2 = x**2
a = (x << l) + ((n - (x2 << 2*l)) >> (l + 1))//x
return a - (((x << l) + ((n - (x2 << 2*l)) >> (l + 1))//x)**2 > n)

x = a
x2 = x**2
t = (n - (x2 << 2*l)) >> (l + 1)
u = t//x
a = (x << l) + u
b = ((x << l) + u)**2
return a - (b > n)

x = a
x2 = x**2
t = (n - (x2 << 2*l)) >> (l + 1)
u = t//x
a = (x << l) + u
b = ((x << l)**2 + 2*(x << l)*u + u**2)
return a - (b > n)

x = a
x2 = x**2
t = (n - (x2 << 2*l)) >> (l + 1)
u, v = divmod(t, x)
a = (x << l) + u
b = ((x2 << 2*l) + ((t - v) << (l + 1)) + u**2)
return a - (b > n)

But then I got stuck on that `u**2`. The above should be a bit faster but 
probably is not worth the complication. On the other hand, if it was worth the 
complication, each iteration would use only single half width multiplication as 
`x2` of one interaction is `b` of previous iteration.

Another idea maybe worth trying (only in C) is to use such an `l` during the 
iterations that it would make the numbers land on the word boundaries. In other 
words, computing few more bits here and there may not hurt the performance much 
but it would replace the shifts by pointer arithmetic. Here the only change is 
`//64*64`:

def isqrt_3(n):
l = (n.bit_length() - 1)//4//64*64
a = isqrt(n >> 2*l)
a = (a << l) + ((n - (a**2 << 2*l)) >> l)//a//2
return a - (a*a > n)

--

___
Python tracker 
<https://bugs.python.org/issue43053>
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue29282] Fused multiply-add: proposal to add math.fma()

2017-01-16 Thread Juraj Sukop

New submission from Juraj Sukop:

Fused multiply-add (henceforth FMA) is an operation which calculates the 
product of two numbers and then the sum of the product and a third number with 
just one floating-point rounding. More concretely:

r = x*y + z

The value of `r` is the same as if the RHS was calculated with infinite 
precision and the rounded to a 32-bit single-precision or 64-bit 
double-precision floating-point number [1].

Even though one FMA CPU instruction might be calculated faster than the two 
separate instructions for multiply and add, its main advantage comes from the 
increased precision of numerical computations that involve the accumulation of 
products. Examples which benefit from using FMA are: dot product [2], 
compensated arithmetic [3], polynomial evaluation [4], matrix multiplication, 
Newton's method and many more [5].

C99 includes [6] `fma` function to `math.h` and emulates the calculation if the 
FMA instruction is not present on the host CPU [7]. PEP 7 states that "Python 
versions greater than or equal to 3.6 use C89 with several select C99 features" 
and that "Future C99 features may be added to this list in the future depending 
on compiler support" [8].

This proposal is then about adding new `fma` function with the following 
signature to `math` module:

math.fma(x, y, z)

'''Return a float representing the result of the operation `x*y + z` with 
single rounding error, as defined by the platform C library. The result is the 
same as if the operation was carried with infinite precision and rounded to a 
floating-point number.'''

Attached is a simple module for Python 3 demonstrating the fused multiply-add 
operation. On my machine, `example.py` prints:

40037.524591982365   horner_double
40037.48821639768horner_fma
40037.49486325783horner_compensated
40037.49486325783horner_decimal

[1] https://en.wikipedia.org/wiki/Multiply%E2%80%93accumulate_operation
[2] S. Graillat, P. Langlois, N. Louvet. Accurate dot products with FMA. 2006
[3] S. Graillat, Accurate Floating Point Product and Exponentiation. 2007.
[4] S. Graillat, P. Langlois, N. Louvet. Improving the compensated Horner 
scheme with a Fused Multiply and Add. 2006
[5] J.-M. Muller, N. Brisebarre, F. de Dinechin, C.-P. Jeannerod, V. Lefèvre, 
G. Melquiond, N. Revol, D. Stehlé, S. Torres. Handbook of Floating-Point 
Arithmetic. 2010. Chapter 5
[6] ISO/IEC 9899:TC3, "7.12.13.1 The fma functions", Committee Draft - 
Septermber 7, 2007
[7] https://git.musl-libc.org/cgit/musl/tree/src/math/fma.c
[8] https://www.python.org/dev/peps/pep-0007/

--
components: Library (Lib)
messages: 285547
nosy: juraj.sukop
priority: normal
severity: normal
status: open
title: Fused multiply-add: proposal to add math.fma()
type: enhancement
versions: Python 3.7

___
Python tracker 
<http://bugs.python.org/issue29282>
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue29282] Fused multiply-add: proposal to add math.fma()

2017-01-16 Thread Juraj Sukop

Changes by Juraj Sukop :


Added file: http://bugs.python.org/file46298/xmathmodule.c

___
Python tracker 
<http://bugs.python.org/issue29282>
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue29282] Fused multiply-add: proposal to add math.fma()

2017-01-16 Thread Juraj Sukop

Changes by Juraj Sukop :


Added file: http://bugs.python.org/file46299/example.py

___
Python tracker 
<http://bugs.python.org/issue29282>
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue29282] Fused multiply-add: proposal to add math.fma()

2017-01-16 Thread Juraj Sukop

Changes by Juraj Sukop :


Added file: http://bugs.python.org/file46300/setup.py

___
Python tracker 
<http://bugs.python.org/issue29282>
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue29282] Fused multiply-add: proposal to add math.fma()

2017-01-16 Thread Juraj Sukop

Juraj Sukop added the comment:

I would say because it has wide applicability, especially considering the 
amount of code it adds. It is similar in spirit to `copysign` or `fsum` which 
are already there and C99 includes it anyway. For simpler things like dot 
product or polynomial evaluation importing Numpy might be too much of an 
dependency.

--

___
Python tracker 
<http://bugs.python.org/issue29282>
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com