[Numpy-discussion] How to avoid this memory copy overhead in d=a*b+c?

2022-09-16 Thread 腾刘
Hello everyone, I'm here again to ask a naive question about Numpy
performance.

As far as I know, Numpy's vectorized operations are very effective because
they use SIMD instructions and multiple threads, compared with index-style
programming (using a "for" loop and assigning each element by its index
in the array).

I'm wondering how fast Numpy can be, so I ran some experiments. Take this
simple task as an example:
a = np.random.rand(10_000_000)
b = np.random.rand(10_000_000)
c = a + b

To check the performance, I wrote a simple C++ implementation of adding two
arrays, also using multiple threads (compiled with -O3 -mavx2). I found
that the C++ implementation is slightly faster than Numpy (running each
version 100 times to get a fairly convincing statistic).
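(For reference, the NumPy side of such a measurement can be sketched with
timeit; a minimal version, using the same 100-run count as above:)

    import timeit
    import numpy as np

    a = np.random.rand(10_000_000)
    b = np.random.rand(10_000_000)

    # Time 100 evaluations of the vectorized add and report the mean per run
    t = timeit.timeit('a + b', globals={'a': a, 'b': b}, number=100)
    print(f'mean per run: {t / 100 * 1e3:.2f} ms')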

*Here comes the first question: where does this efficiency gap come from?*
My guess is that Numpy needs to load the shared object, find the wrapper of
the ufunc, and only then execute the underlying computation. Am I right?
Am I missing something here?

Then I did another experiment with the statement d = a * b + c, where a,
b, c and d are all numpy arrays. I implemented the same logic in C++ and
found that the C++ version is about twice as fast as Numpy on average
(again executing each version 100 times).

I guess this is because in Python we first calculate:

temporary_var = a * b

and then:

d = temporary_var + c

so we pay an unnecessary memory transfer overhead: since each array is
very large, Numpy has to write temporary_var out to main memory and then
read it back into cache.
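(For scale: 10,000,000 float64 values occupy 80 MB per array, so writing
the temporary out and reading it back adds roughly 160 MB of extra memory
traffic per evaluation, far more than fits in any CPU cache.)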

In C++, however, we can simply write d[i] = a[i] * b[i] + c[i]; no
temporary array is created and no extra memory transfer penalty is paid.

*So the second question is: is there a way to avoid this kind of overhead?*
I've learned that in Numpy we can create our own ufunc with *frompyfunc*,
but it seems that no SIMD optimization or multi-threading is applied, since
this approach is 100 times slower than the plain *"d = a * b + c"* way.
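(A minimal sketch of this frompyfunc attempt, for concreteness; frompyfunc
calls the wrapped Python function once per element and returns an
object-dtype array, which is why no SIMD or multi-threading can help:)

    import numpy as np

    a = np.random.rand(10_000_000)
    b = np.random.rand(10_000_000)
    c = np.random.rand(10_000_000)

    # np.frompyfunc(func, nin, nout) wraps a Python callable element-wise
    fma = np.frompyfunc(lambda x, y, z: x * y + z, 3, 1)

    # Each element goes through the Python interpreter, and the result
    # comes back with dtype=object, so it has to be cast to float64
    d = fma(a, b, c).astype(np.float64)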


[Numpy-discussion] Re: How to avoid this memory copy overhead in d=a*b+c?

2022-09-16 Thread Kevin Sheppard
You can use inplace operators where appropriate to avoid memory allocation:

a *= b
c += a

Kevin
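(A minimal sketch expanding on this, assuming the inputs a, b and c must
stay intact and a separate output d is wanted: the ufunc out= argument
writes into a preallocated array, so no full-size temporary is created,
although this still takes two passes over memory rather than the single
fused pass of a C++ loop.)

    import numpy as np

    a = np.random.rand(10_000_000)
    b = np.random.rand(10_000_000)
    c = np.random.rand(10_000_000)

    d = np.empty_like(a)      # preallocate the output once
    np.multiply(a, b, out=d)  # first pass: d = a * b, no temporary
    np.add(d, c, out=d)       # second pass: d = d + c, in place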


[Numpy-discussion] Re: How to avoid this memory copy overhead in d=a*b+c?

2022-09-16 Thread 腾刘
Thanks a lot for answering, but I still have some uncertainties.

I'm trying to improve time efficiency as much as possible, so memory
allocation is not my main worry; in my opinion it won't cost much.
Instead, memory access is my central concern, because of the cache-miss
penalty.

In your snippet there are 4 whole-array accesses:

- a (in a *= b)
- b (in a *= b)
- c (in c += a)
- a (in c += a)

This is better than d = a * b + c, but I really do need a new array d to
hold the final result, because I don't want to clobber the data in array c
either.

So let's replace c += a with d = a + c; that way there are 5 whole-array
accesses in total.

Under optimal conditions, however, which the C++ implementation achieves
(d[i] = a[i] * b[i] + c[i]), only four whole-array accesses are needed.

On a modern CPU the cost of this kind of simple arithmetic is negligible
compared to memory access, so I guess I still need a better way.

So much thanks again for your reply!



[Numpy-discussion] Re: How to avoid this memory copy overhead in d=a*b+c?

2022-09-16 Thread Kevin Sheppard
Have a look at numexpr (https://github.com/pydata/numexpr). It can achieve
large speedups on operations like this, at the cost of writing the
expensive expressions as strings, e.g. d = ne.evaluate('a * b + c'). You
could also write a gufunc in numba that would be memory- and
access-efficient.
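(A minimal sketch of the numexpr route, assuming numexpr is installed:)

    import numexpr as ne
    import numpy as np

    a = np.random.rand(10_000_000)
    b = np.random.rand(10_000_000)
    c = np.random.rand(10_000_000)

    # numexpr compiles the string once and evaluates it block by block,
    # so the a*b intermediate stays in cache-sized chunks instead of
    # being written to memory as a full-size temporary array
    d = ne.evaluate('a * b + c')

(And a closely related sketch of the numba route; this uses an @njit loop
with prange rather than a true gufunc, assumes numba is installed, and the
name fused_fma is made up for the example. It fuses the whole expression
into one compiled, multithreaded pass, so no temporary array is ever
written:)

    import numba as nb
    import numpy as np

    @nb.njit(parallel=True, fastmath=True)
    def fused_fma(a, b, c, d):
        # Single fused pass: read a[i], b[i], c[i], write d[i]; no temporary
        for i in nb.prange(a.shape[0]):
            d[i] = a[i] * b[i] + c[i]

    a = np.random.rand(10_000_000)
    b = np.random.rand(10_000_000)
    c = np.random.rand(10_000_000)
    d = np.empty_like(a)
    fused_fma(a, b, c, d)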

Kevin



[Numpy-discussion] Re: How to avoid this memory copy overhead in d=a*b+c?

2022-09-16 Thread Francesc Alted
This is exactly what numexpr is meant for:
https://numexpr.readthedocs.io/projects/NumExpr3/en/latest/

In particular, see these benchmarks (made around 10 years ago, but they
should still apply):
https://numexpr.readthedocs.io/projects/NumExpr3/en/latest/intro.html#expected-performance

Cheers



-- 
Francesc Alted

[Numpy-discussion] Re: How to avoid this memory copy overhead in d=a*b+c?

2022-09-16 Thread 腾刘
Thanks a million!! I will check these thoroughly~


[Numpy-discussion] Re: How to avoid this memory copy overhead in d=a*b+c?

2022-09-16 Thread 腾刘
I'm still pretty new to Python; there truly are lots of beautiful libraries
at hand.

Thanks a lot for suggestions!!

[Numpy-discussion] Re: Enhancement for AArch64 SVE instruction set

2022-09-16 Thread kawakam...@fujitsu.com
Hi,

It's been a long time since I first wrote to this list, but yesterday I
submitted my pull request adding support for the Arm64 SVE architecture:
https://github.com/numpy/numpy/pull/22265

Since there may be no public CI environment that runs the SVE instruction
set, I tested my code on an in-house server (a Fujitsu FX700 with A64FX;
the A64FX is one of the Armv8.2-a + SVE compliant CPUs). I believe the
regression test completed successfully:
- python3 runtests.py --cpu-baseline=sve --cpu-dispatch=none
The result shows "21354 passed, 203 skipped, 1302 deselected, 30 xfailed,
7 xpassed".

My implementation is similar to those for AVX/AVX2/ASIMD:
- the SVE intrinsics are defined in the numpy/core/src/common/simd/sve/*.h
files.

Travis CI reported errors
(https://github.com/numpy/numpy/pull/22265/checks?check_run_id=8384699529),
but it seems that the job exceeded the maximum log length and was
terminated.

I would appreciate your review of the pull request, as well as any comments
and advice on this mailing list.


Thanks,
Kentaro


[Numpy-discussion] Re: Enhancement for AArch64 SVE instruction set

2022-09-16 Thread matti picus
It seems cirrus-ci offers AWS EKS Graviton2 instances [0], and this is
free for open-source projects. Do you know whether that offering has
SVE-enabled CPUs?

Matti
[0] https://cirrus-ci.org/guide/linux/



[Numpy-discussion] Ways to achieve faster np.nanpercentile() calculation?

2022-09-16 Thread Aron Gergely

Hi all,

On my system, np.nanpercentile() is orders of magnitude (>100x) slower
than np.percentile(). I am using numpy 1.23.1 and wondering if there is a
way to speed it up.

I came across this workaround for 3D arrays:
https://krstn.eu/np.nanpercentile()-there-has-to-be-a-faster-way/

But I need a generalized solution that works in N dimensions, so I started
adapting the above. I am wondering, though, whether I am reinventing the
wheel here.


Is there already a Python package that implements a speedier nanpercentile
for numpy (similar in idea to the 'Bottleneck' package)?

Or are there other known workarounds that achieve the same result?
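(For concreteness, a minimal sketch of one cheap guard in this direction;
the name fast_nanpercentile is made up, and it only helps when the input
happens to contain no NaNs at all:)

    import numpy as np

    def fast_nanpercentile(a, q, axis=None):
        # Fast path: with no NaNs anywhere, np.percentile gives the
        # identical answer at full speed
        if not np.isnan(a).any():
            return np.percentile(a, q, axis=axis)
        # Otherwise fall back to the NaN-aware (slow) implementation
        return np.nanpercentile(a, q, axis=axis)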

Best regards,
Aron
