Re: [Numpy-discussion] NEP 48: Spending NumPy Project funds

2021-02-23 Thread Ralf Gommers
On Mon, Feb 22, 2021 at 9:34 PM Stephan Hoyer  wrote:

> On Mon, Feb 22, 2021 at 4:08 AM Pearu Peterson 
> wrote:
>
>> Hi,
>>
>> See GH discussion starting at
>> https://github.com/numpy/numpy/pull/18454#discussion_r579967791 for the
>> raised issue that is now moved here.
>>
>> Re "Compensating fairly" section:
>>
>> The NEP proposes location-dependent contracts for fair pay.
>>
>> I think this is a contradictory approach as location is not the only
>> factor that may influence fairness. As an example, contractors may have
>> different levels of obligations to their families, and one might argue this
>> should be taken into consideration as well because the family size and the
>> required level of commitment to the family members (kids, members who need
>> special care, etc.) can have a huge influence on the contractors' living
>> standards, not just the level of average rent in the particular location.
>> It would be unfair to take into account location but not the family
>> situation. There may be other factors as well that may influence fairness
>> and I think this will make the decision-making about contracting harder
>> and, most importantly, controversial.
>>
>> My proposal is that factors like location, family situation, etc should
>> be discarded when negotiating contract terms. The efficiency of using the
>> project funding should be defined by how well and quickly a particular
>> contractor is able to get the job done, but not how the contractors are
>> likely to spend their pay - it is nobody's business, IMHO, and is likely
>> very hard if not impossible to verify.
>>
>
> One difference is that it is illegal (at least under US law) to consider
> factors such as family situation in determining pay.
>
> However, it is both legal and standard to consider location. I'm not
> saying we should necessarily do it, but it's an accepted practice. NumPy
> development is global, but prevailing wages are not.
>

Regarding location, that's clearly one of the most complicated things to
deal with. Aside from legality, it's precisely because it's standard practice
that we have to deal with it. The NEP text explains why both doing what's
standard and a completely location-independent approach are considered
unfair. If I had to choose between those two, I'd agree that
location-independent compensation is *less unfair*. It would, however, either
make it impossible to contract with people in expensive locations, or require
compensation levels that are up to 10x higher than the norm for other
locations.

> "The efficiency of using the project funding "

This is exactly the contradiction. We don't just want to get the most for
our money. That's the usual corporate approach: pay as little as you can
get away with. And it would lead to very strong location-dependent choices.

The proposed approach is: first figure out what we want to fund. Then look
for a great candidate, taking into account the factors listed, such as whether
someone is already part of the team and has the required skills. And after
that's settled, determine a fair compensation level. This ordering may not be
as clear as it should be in the current text; I'll try to make it more
explicit.

Cheers,
Ralf
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] ENH: Proposal to add KML_BLAS support

2021-02-23 Thread ChunLin Fang
Thanks for asking; here are brief answers to your questions:
1. The download link of KML_BLAS:
The Chinese page is
https://www.huaweicloud.com/kunpeng/software/KML_BLAS.html
The English page is https://kunpeng.huawei.com/en/#/developer/devkit/library -
you can find a "Math Library" navigation entry at the bottom of that page,
and KML_BLAS is listed there.
2. The license/redistribution policy of KML_BLAS:
The license is very similar to Intel MKL's; the license file is still being
prepared.
3. How to support KML_BLAS:
The support process is similar to BLIS; it just needs to be added to
numpy.distutils (see the sketch at the end of this message). KML_BLAS will
not be open-sourced in the near future.
4. What kind of ARM chips are supported:
Any ARMv8 chip is supported.
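
For context, here is the sketch mentioned under item 3, modelled on the
existing openblas_info/blis_info classes in numpy.distutils.system_info. This
is a hypothetical illustration only, not Huawei's actual patch; the class
name, site.cfg section, environment variable, and library name are all
assumptions:

from numpy.distutils.system_info import system_info

class kml_blas_info(system_info):
    section = 'kml_blas'        # would be read from a [kml_blas] section in site.cfg
    dir_env_var = 'KML_BLAS'    # optional environment variable with the install prefix
    _lib_names = ['kblas']      # assumed library name, e.g. libkblas.so

    def calc_info(self):
        lib_dirs = self.get_lib_dirs()
        libs = self.get_libs('libraries', self._lib_names)
        info = self.check_libs(lib_dirs, libs)
        if info is None:
            return
        info['language'] = 'c'
        self.set_info(**info)

In practice the new class would also have to be registered in system_info's
lookup (and considered by blas_opt_info) so that the build actually probes
it, which is presumably what the proposed PR would add.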
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] ENH: Proposal to add KML_BLAS support

2021-02-23 Thread Ralf Gommers
On Tue, Feb 23, 2021 at 1:42 PM ChunLin Fang  wrote:

> Thanks for asking, this is a simple explanation for your questions:
> 1. The download link of KML_BLAS:
> The Chinese page is
> https://www.huaweicloud.com/kunpeng/software/KML_BLAS.html
> The English page is
> https://kunpeng.huawei.com/en/#/developer/devkit/library,  you can find a
> "Math Library" Navigation entry in the bottom of this page. "KML_BLAS" lies
> in there.
> 2. The license/redistribution policy of KML_BLAS:
> The license is very similar to intel MKL, The license file is in the
> process of making.
> 3.How to support KML_BLAS:
> The support process is similar to BLIS, just need to add to
> numpy.distutils, KML_BLAS will not open source in the near future.
>

This sounds fine to me, and the performance is potentially interesting to
ARMv8 users. Do you want to open a PR?

Side note: the email client you are using is breaking threading; you may
want to tweak a setting for that or change to another client.

Cheers,
Ralf

> 4. What kind of ARM chips are supported:
> Any ARMv8 chip is supported.
>
___
> NumPy-Discussion mailing list
> NumPy-Discussion@python.org
> https://mail.python.org/mailman/listinfo/numpy-discussion
>
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] NEP: array API standard adoption (NEP 47)

2021-02-23 Thread Ralf Gommers
On Mon, Feb 22, 2021 at 7:49 PM Sebastian Berg 
wrote:

> On Sun, 2021-02-21 at 17:30 +0100, Ralf Gommers wrote:
> > Hi all,
> >
> > Here is a NEP, written together with Stephan Hoyer and Aaron Meurer,
> > for
> > discussion on adoption of the array API standard (
> > https://data-apis.github.io/array-api/latest/). This will add a new
> > numpy.array_api submodule containing that standardized API. The main
> > purpose of this API is to be able to write code that is portable to
> > other
> > array/tensor libraries like CuPy, PyTorch, JAX, TensorFlow, Dask, and
> > MXNet.
> >
> > We expect this NEP to remain in draft state for quite a while, while
> > we're
> > gaining experience with using it in downstream libraries, discuss
> > adding it
> > to other array libraries, and finishing some of the loose ends (e.g.,
> > specifications for linear algebra functions that aren't merged yet,
> > see
> > https://github.com/data-apis/array-api/pulls) in the API standard
> > itself.
>
>
> There is too much to unpack in a day, I hope I did not miss something
> particularly important while reading.
> Do you have plans to try some of this outside of NumPy, or maybe make a
> repo in the numpy org for it?
>

Sorry, I forgot to answer this question. That is what we're doing now; the
current prototype is at
https://github.com/data-apis/numpy/tree/array-api/numpy/_array_api. I do
expect that as soon as we need any changes in C code, that will become
impractical. I think merging it as a private submodule (numpy._array_api)
makes sense. That will help with WIP PRs to other libraries - then we can
use the "test against master" CI for that, rather than having to make a
mess injecting things into CI.

Also, there are a few parts of the NEP that are improvements outside of the
new submodule: not only DLPack, but also consistency in "stacks of
matrices" in linalg functions, adding a missing keepdims keyword, the
never-copy mode for asarray, and improving the API for inspecting dtype
families (https://github.com/numpy/numpy/issues/17325). Those things can
all be pushed forward.
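
(For context on that last point, the dtype-family inspection that gh-17325
wants to improve is currently spelled roughly like this; the snippet below
shows the existing NumPy API, not the proposed replacement.)

import numpy as np

# Current way to ask whether a dtype belongs to a given family:
print(np.issubdtype(np.float32, np.floating))     # True
print(np.issubdtype(np.int64, np.integer))        # True
print(np.issubdtype(np.complex128, np.floating))  # False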

Cheers,
Ralf
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


[Numpy-discussion] C-coded dot 1000x faster than numpy?

2021-02-23 Thread Neal Becker
I have code that performs dot product of a 2D matrix of size (on the
order of) [1000,16] with a vector of size [1000].  The matrix is
float64 and the vector is complex128.  I was using numpy.dot but it
turned out to be a bottleneck.

So I coded dot2x1 in C++ (using xtensor-python just for the
interface). No fancy SIMD was used, unless g++ did it on its own.

On a simple benchmark using timeit I find my hand-coded routine is on
the order of 1000x faster than numpy? Here is the test code.
My custom C++ code is dot2x1; I'm not copying it here because it has
some dependencies. Any idea what is going on?

import numpy as np

from dot2x1 import dot2x1

a = np.ones ((1000,16))
b = np.array([ 0.80311816+0.80311816j,  0.80311816-0.80311816j,
   -0.80311816+0.80311816j, -0.80311816-0.80311816j,
1.09707981+0.29396165j,  1.09707981-0.29396165j,
   -1.09707981+0.29396165j, -1.09707981-0.29396165j,
0.29396165+1.09707981j,  0.29396165-1.09707981j,
   -0.29396165+1.09707981j, -0.29396165-1.09707981j,
0.25495815+0.25495815j,  0.25495815-0.25495815j,
   -0.25495815+0.25495815j, -0.25495815-0.25495815j])

def F1():
d = dot2x1 (a, b)

def F2():
d = np.dot (a, b)

from timeit import timeit
print (timeit ('F1()', globals=globals(), number=1000))
print (timeit ('F2()', globals=globals(), number=1000))

In [13]: 0.013910860987380147 << 1st timeit
28.608758996007964  << 2nd timeit
-- 
Those who don't understand recursion are doomed to repeat it
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] C-coded dot 1000x faster than numpy?

2021-02-23 Thread Andrea Gavana
Hi,

On Tue, 23 Feb 2021 at 19.11, Neal Becker  wrote:

> I have code that performs dot product of a 2D matrix of size (on the
> order of) [1000,16] with a vector of size [1000].  The matrix is
> float64 and the vector is complex128.  I was using numpy.dot but it
> turned out to be a bottleneck.
>
> So I coded dot2x1 in c++ (using xtensor-python just for the
> interface).  No fancy simd was used, unless g++ did it on it's own.
>
> On a simple benchmark using timeit I find my hand-coded routine is on
> the order of 1000x faster than numpy?  Here is the test code:
> My custom c++ code is dot2x1.  I'm not copying it here because it has
> some dependencies.  Any idea what is going on?



I had a similar experience - albeit with an older numpy and Python 2.7, so
my comments may well be outdated and irrelevant. This was on Windows 10
64-bit, with more than enough RAM.

It took me forever to find out that numpy.dot was the culprit, and I ended
up using Fortran + f2py. Even with the overhead of having to go through
the f2py bridge, the Fortran dot_product was several times faster.

Sorry if it doesn't help much.

Andrea.



___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] C-coded dot 1000x faster than numpy?

2021-02-23 Thread Roman Yurchak
For the first benchmark: A.dot(B) with A real and B complex is apparently a
known performance issue, see https://github.com/numpy/numpy/issues/10468


In general, it might be worth trying different BLAS backends. For 
instance, if you install numpy from conda-forge you should be able to 
switch between OpenBLAS, MKL and BLIS: 
https://conda-forge.org/docs/maintainer/knowledge_base.html#switching-blas-implementation


Roman



___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] C-coded dot 1000x faster than numpy?

2021-02-23 Thread Neal Becker
One suspect is that the numpy version seems to be multi-threading.
That isn't useful here, because I'm running parallel Monte Carlo
simulations using all cores. Perhaps this is perversely slowing
things down? I don't know how to account for a 1000x slowdown, though.
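
One quick way to test that hypothesis is to pin the BLAS thread pool to a
single thread and re-run the benchmark. A sketch using the third-party
threadpoolctl package (not part of NumPy, and not the poster's actual code):

import numpy as np
from timeit import timeit
from threadpoolctl import threadpool_info, threadpool_limits

print(threadpool_info())  # shows which BLAS is loaded and how many threads it uses

a = np.ones((1000, 16))
b = np.ones(16, dtype=np.complex128)

with threadpool_limits(limits=1):   # force single-threaded BLAS inside this block
    print(timeit(lambda: np.dot(a, b), number=1000))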



-- 
Those who don't understand recursion are doomed to repeat it
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] C-coded dot 1000x faster than numpy?

2021-02-23 Thread Carl Kleffner
https://stackoverflow.com/questions/19839539/how-to-get-faster-code-than-numpy-dot-for-matrix-multiplication

Maybe C_CONTIGUOUS vs F_CONTIGUOUS?
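
A minimal way to check that (illustrative only; the array mirrors the earlier
benchmark):

import numpy as np

a = np.ones((1000, 16))
print(a.flags['C_CONTIGUOUS'], a.flags['F_CONTIGUOUS'])  # True False for a fresh array

# Column-major copy, in case the BLAS path prefers Fortran order:
a_f = np.asfortranarray(a)
print(a_f.flags['F_CONTIGUOUS'])  # True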

Carl


___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] C-coded dot 1000x faster than numpy?

2021-02-23 Thread Neal Becker
I'm using the standard numpy on Fedora 33.
ldd says:

/usr/lib64/python3.9/site-packages/numpy/core/_multiarray_umath.cpython-39-x86_64-linux-gnu.so:
linux-vdso.so.1 (0x7ffdd1487000)
libflexiblas.so.3 => /lib64/libflexiblas.so.3 (0x7f0512787000)

So whatever FlexiBLAS is doing controls the BLAS.
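
One way to see what actually gets dispatched at runtime is to compare the
build-time configuration with what is loaded in the running process; a sketch
using the third-party threadpoolctl package (added here for illustration):

import numpy as np

np.show_config()  # build-time BLAS/LAPACK configuration

try:
    from threadpoolctl import threadpool_info
    for mod in threadpool_info():
        print(mod['internal_api'], mod.get('version'), mod['filepath'])
except ImportError:
    print("threadpoolctl not installed")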




-- 
Those who don't understand recursion are doomed to repeat it
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] C-coded dot 1000x faster than numpy?

2021-02-23 Thread David Menéndez Hurtado
On Tue, 23 Feb 2021, 7:41 pm Roman Yurchak,  wrote:

> For the first benchmark apparently A.dot(B) with A real and B complex is
> a known issue performance wise https://github.com/numpy/numpy/issues/10468


I split B into an array of size (N, 2) holding the real and imaginary parts,
and that makes the multiplication twice as fast.
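
For reference, a minimal sketch of that workaround under the shapes from the
original benchmark (variable names are illustrative, not the actual code used):

import numpy as np

rng = np.random.default_rng(0)
a = np.ones((1000, 16))                                      # float64 matrix
b = rng.standard_normal(16) + 1j * rng.standard_normal(16)   # complex128 vector

# Split b into an (N, 2) float64 array so BLAS sees a real-by-real product:
b_ri = np.column_stack([b.real, b.imag])
d_ri = a @ b_ri                                              # shape (1000, 2), real path
d = d_ri[:, 0] + 1j * d_ri[:, 1]                             # reassemble the complex result

assert np.allclose(d, np.dot(a, b))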


My configuration (also on Fedora 33), as reported by np.show_config():


blas_mkl_info:
  NOT AVAILABLE
blis_info:
  NOT AVAILABLE
openblas_info:
libraries = ['openblas', 'openblas']
library_dirs = ['/usr/local/lib']
language = c
define_macros = [('HAVE_CBLAS', None)]
blas_opt_info:
libraries = ['openblas', 'openblas']
library_dirs = ['/usr/local/lib']
language = c
define_macros = [('HAVE_CBLAS', None)]
lapack_mkl_info:
  NOT AVAILABLE
openblas_lapack_info:
libraries = ['openblas', 'openblas']
library_dirs = ['/usr/local/lib']
language = c
define_macros = [('HAVE_CBLAS', None)]
lapack_opt_info:
libraries = ['openblas', 'openblas']
library_dirs = ['/usr/local/lib']
language = c
define_macros = [('HAVE_CBLAS', None)]
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] C-coded dot 1000x faster than numpy?

2021-02-23 Thread Carl Kleffner
The Stack Overflow link above contains a simple test case:

>>> from scipy.linalg import get_blas_funcs
>>> gemm = get_blas_funcs("gemm", [X, Y])
>>> np.all(gemm(1, X, Y) == np.dot(X, Y))
True

It would be of interest to benchmark gemm against np.dot. Maybe np.dot
doesn't use BLAS at all for whatever reason?
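
One way to run that comparison, following the shapes from the earlier
benchmark (a sketch; gemm needs 2-D operands, so the vector is reshaped into
a column):

import numpy as np
from scipy.linalg import get_blas_funcs
from timeit import timeit

X = np.ones((1000, 16))
Y = np.ones((16, 1))                     # column vector so dgemm applies

gemm = get_blas_funcs("gemm", (X, Y))    # returns the BLAS routine matching the dtypes

print(timeit(lambda: gemm(1.0, X, Y), number=1000))  # direct BLAS call
print(timeit(lambda: np.dot(X, Y), number=1000))     # NumPy's normal dispatch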


___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] C-coded dot 1000x faster than numpy?

2021-02-23 Thread Charles R Harris
On Tue, Feb 23, 2021 at 11:10 AM Neal Becker  wrote:

> I have code that performs dot product of a 2D matrix of size (on the
> order of) [1000,16] with a vector of size [1000].  The matrix is
> float64 and the vector is complex128.  I was using numpy.dot but it
> turned out to be a bottleneck.
>
> So I coded dot2x1 in c++ (using xtensor-python just for the
> interface).  No fancy simd was used, unless g++ did it on it's own.
>
> On a simple benchmark using timeit I find my hand-coded routine is on
> the order of 1000x faster than numpy?  Here is the test code:
> My custom c++ code is dot2x1.  I'm not copying it here because it has
> some dependencies.  Any idea what is going on?
>
> import numpy as np
>
> from dot2x1 import dot2x1
>
> a = np.ones ((1000,16))
> b = np.array([ 0.80311816+0.80311816j,  0.80311816-0.80311816j,
>-0.80311816+0.80311816j, -0.80311816-0.80311816j,
> 1.09707981+0.29396165j,  1.09707981-0.29396165j,
>-1.09707981+0.29396165j, -1.09707981-0.29396165j,
> 0.29396165+1.09707981j,  0.29396165-1.09707981j,
>-0.29396165+1.09707981j, -0.29396165-1.09707981j,
> 0.25495815+0.25495815j,  0.25495815-0.25495815j,
>-0.25495815+0.25495815j, -0.25495815-0.25495815j])
>
> def F1():
> d = dot2x1 (a, b)
>
> def F2():
> d = np.dot (a, b)
>
> from timeit import timeit
> print (timeit ('F1()', globals=globals(), number=1000))
> print (timeit ('F2()', globals=globals(), number=1000))
>
> In [13]: 0.013910860987380147 << 1st timeit
> 28.608758996007964  << 2nd timeit
>

I'm going to guess threading, although huge pages can also be a problem on
a machine under heavy load running other processes. Call overhead may also
matter for such small matrices.
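
A rough way to separate fixed per-call overhead from the actual BLAS work (a
sketch; the numbers will of course vary with machine, load and BLAS library):

import numpy as np
from timeit import timeit

b = np.ones(16, dtype=np.complex128)
a_small = np.ones((8, 16))       # tiny problem: dominated by per-call overhead
a_large = np.ones((8000, 16))    # larger problem: dominated by the actual math

t_small = timeit(lambda: np.dot(a_small, b), number=10000) / 10000
t_large = timeit(lambda: np.dot(a_large, b), number=1000) / 1000
print(f"small: {t_small * 1e6:.2f} us   large: {t_large * 1e6:.2f} us")
# If the small-matrix time does not shrink with problem size, fixed overhead
# (dispatch, threading setup) is what dominates.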

Chuck
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] C-coded dot 1000x faster than numpy?

2021-02-23 Thread Charles R Harris
On Tue, Feb 23, 2021 at 5:47 PM Charles R Harris 
wrote:

>
>
> On Tue, Feb 23, 2021 at 11:10 AM Neal Becker  wrote:
>
>> I have code that performs dot product of a 2D matrix of size (on the
>> order of) [1000,16] with a vector of size [1000].  The matrix is
>> float64 and the vector is complex128.  I was using numpy.dot but it
>> turned out to be a bottleneck.
>>
>> So I coded dot2x1 in c++ (using xtensor-python just for the
>> interface).  No fancy simd was used, unless g++ did it on it's own.
>>
>> On a simple benchmark using timeit I find my hand-coded routine is on
>> the order of 1000x faster than numpy?  Here is the test code:
>> My custom c++ code is dot2x1.  I'm not copying it here because it has
>> some dependencies.  Any idea what is going on?
>>
>> import numpy as np
>>
>> from dot2x1 import dot2x1
>>
>> a = np.ones ((1000,16))
>> b = np.array([ 0.80311816+0.80311816j,  0.80311816-0.80311816j,
>>-0.80311816+0.80311816j, -0.80311816-0.80311816j,
>> 1.09707981+0.29396165j,  1.09707981-0.29396165j,
>>-1.09707981+0.29396165j, -1.09707981-0.29396165j,
>> 0.29396165+1.09707981j,  0.29396165-1.09707981j,
>>-0.29396165+1.09707981j, -0.29396165-1.09707981j,
>> 0.25495815+0.25495815j,  0.25495815-0.25495815j,
>>-0.25495815+0.25495815j, -0.25495815-0.25495815j])
>>
>> def F1():
>> d = dot2x1 (a, b)
>>
>> def F2():
>> d = np.dot (a, b)
>>
>> from timeit import timeit
>> print (timeit ('F1()', globals=globals(), number=1000))
>> print (timeit ('F2()', globals=globals(), number=1000))
>>
>> In [13]: 0.013910860987380147 << 1st timeit
>> 28.608758996007964  << 2nd timeit
>>
>
> I'm going to guess threading, although huge pages can also be a problem on
> a machine under heavy load running other processes. Call overhead may also
> matter for such small matrices.
>
>
What BLAS library are you using? I get much better results using an
8-year-old i5 and ATLAS.

Chuck
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


[Numpy-discussion] NumPy Development Meeting Wednesday - Triage Focus

2021-02-23 Thread Sebastian Berg
Hi all,

Our bi-weekly triage-focused NumPy development meeting is Wednesday,
Feb 24th at 11 am Pacific Time (19:00 UTC).
Everyone is invited to join in and edit the work-in-progress meeting
topics and notes:
https://hackmd.io/68i_JvOYQfy9ERiHgXMPvg

I encourage everyone to notify us of issues or PRs that you feel should
be prioritized, discussed, or reviewed.

Best regards

Sebastian




signature.asc
Description: This is a digitally signed message part
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion