[Numpy-discussion] How to avoid this memory copy overhead in d=a*b+c?
Hello everyone, I'm here again to ask a naive question about NumPy performance.

As far as I know, NumPy's vectorized operators are very effective because they use SIMD instructions and multiple threads, compared to index-style programming (a "for" loop assigning each element by its index into the array).

I was wondering how fast NumPy can be, so I ran some experiments. Take this simple task as an example:

a = np.random.rand(10_000_000)
b = np.random.rand(10_000_000)
c = a + b

To check the performance, I wrote a simple C++ implementation of adding two arrays, also multi-threaded (compiled with -O3 -mavx2). I found that the C++ implementation is slightly faster than NumPy (running each version 100 times to get a reasonably convincing statistic).

*Here comes the first question: where does this efficiency gap come from?* I guess it is because NumPy has to load the shared object, find the wrapper of the ufunc, and only then execute the underlying computation. Am I right? Am I missing something here?

Then I ran another experiment for the statement d = a * b + c, where a, b, c and d are all NumPy arrays. I also implemented this logic in C++ and found that C++ is 2 times faster than NumPy on average (also over 100 runs each).

I guess this is because in Python we first calculate

temporary_var = a * b

and then

d = temporary_var + c

so we have an unnecessary memory-transfer overhead. Since each array is very large, NumPy has to write temporary_var out to memory and then read it back into cache. In C++, however, we can just write d[i] = a[i] * b[i] + c[i], and no temporary array (with its memory-transfer penalty) is created.

*So the second question is: is there a method to avoid this kind of overhead?* I've learned that in NumPy we can create our own ufunc with *frompyfunc*, but it seems to get neither SIMD optimization nor multi-threading, since it is about 100 times slower than the plain *d = a * b + c*.
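The thread does not include the actual benchmark harness; a minimal sketch of how the NumPy side of such a comparison can be timed (array size and repeat count follow the description above, exact numbers will vary by machine):

import numpy as np
import timeit

n = 10_000_000
a = np.random.rand(n)
b = np.random.rand(n)
c = np.random.rand(n)

# Each call allocates a temporary for a * b, then a second pass adds c.
t = timeit.timeit(lambda: a * b + c, number=100)
print(f"d = a * b + c: {t / 100 * 1e3:.2f} ms per call")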
[Numpy-discussion] Re: How to avoid this memory copy overhead in d=a*b+c?
You can use in-place operators where appropriate to avoid memory allocation.

a *= b
c += a

Kevin

From: 腾刘
Sent: Friday, September 16, 2022 8:35 AM
To: Discussion of Numerical Python
Subject: [Numpy-discussion] How to avoid this memory copy overhead in d=a*b+c?
[Numpy-discussion] Re: How to avoid this memory copy overhead in d=a*b+c?
Thanks a lot for answering this question, but I still have some uncertainties.

I'm trying to improve the time efficiency as much as possible, so memory allocation is not my main worry; in my opinion it won't cost much. Instead, the memory accesses are my central concern, because of the cache-miss penalty.

In your snippet there will be 4 accesses to the whole arrays, namely:

the access to a (in a *= b)
the access to b (in a *= b)
the access to c (in c += a)
the access to a (in c += a)

This is better than d = a * b + c, but I really do need a new array d to hold the final result, because I don't want to spoil the data in array c either. So let's replace c += a with d = a + c; that way there will be 5 accesses to the whole arrays in total.

However, under optimal conditions, which the C++ implementation achieves (d[i] = a[i] * b[i] + c[i]), only four whole-array accesses are needed. On a modern CPU the cost of this kind of simple calculation is negligible compared to memory access, so I guess I still need a better way.

So much thanks again for your reply!

Kevin Sheppard wrote on Fri, Sep 16, 2022 at 15:38:
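One standard NumPy idiom not spelled out in the thread is the ufunc out= argument, which writes the product straight into the result buffer. A minimal sketch (still two passes over d, so not quite the single-pass C++ ideal, but there is no temporary and a, b, c stay intact):

import numpy as np

a = np.random.rand(10_000_000)
b = np.random.rand(10_000_000)
c = np.random.rand(10_000_000)

d = np.empty_like(a)       # allocate the result once, reusable across calls
np.multiply(a, b, out=d)   # d = a * b with no temporary array
d += c                     # in-place add; a, b and c are untouched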
[Numpy-discussion] Re: How to avoid this memory copy overhead in d=a*b+c?
Have a look at numexpr (https://github.com/pydata/numexpr). It can achieve large speedups in operations like this, at the cost of having to write the expensive operations as strings, e.g. d = ne.evaluate('a * b + c'). You could also write a gufunc in numba that would be memory- and access-efficient.

Kevin

On Fri, Sep 16, 2022 at 8:53 AM 腾刘 <27rabbi...@gmail.com> wrote:
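A sketch of the numba route (assuming numba is available; @vectorize compiles a fused element-wise kernel, and target='parallel' adds multi-threading, so the intermediate a * b never becomes a full-size array):

import numpy as np
from numba import vectorize, float64

@vectorize([float64(float64, float64, float64)], target='parallel')
def mul_add(a, b, c):
    # Compiled as one kernel: the multiply and add are fused,
    # so no temporary array is materialized.
    return a * b + c

a = np.random.rand(10_000_000)
b = np.random.rand(10_000_000)
c = np.random.rand(10_000_000)
d = mul_add(a, b, c)   # one pass over a, b, c and one write to d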
[Numpy-discussion] Re: How to avoid this memory copy overhead in d=a*b+c?
This is exactly what numexpr is meant for: https://numexpr.readthedocs.io/projects/NumExpr3/en/latest/

In particular, see these benchmarks (made around 10 years ago, but they should still apply): https://numexpr.readthedocs.io/projects/NumExpr3/en/latest/intro.html#expected-performance

Cheers

On Fri, Sep 16, 2022 at 9:57 AM 腾刘 <27rabbi...@gmail.com> wrote:

--
Francesc Alted
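For completeness, a minimal runnable form of the numexpr approach (ne.evaluate pulls a, b and c from the calling frame and evaluates the expression blockwise across threads):

import numpy as np
import numexpr as ne

a = np.random.rand(10_000_000)
b = np.random.rand(10_000_000)
c = np.random.rand(10_000_000)

# Evaluated in cache-sized blocks across threads; the a * b
# intermediate never exists as a 10-million-element array.
d = ne.evaluate('a * b + c')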
[Numpy-discussion] Re: How to avoid this memory copy overhead in d=a*b+c?
Thanks a million!! I will check these out thoroughly~

Kevin Sheppard wrote on Fri, Sep 16, 2022 at 16:11:
[Numpy-discussion] Re: How to avoid this memory copy overhead in d=a*b+c?
Still so new to Python; there truly are lots of beautiful libraries at hand. Thanks a lot for the suggestions!!

Francesc Alted wrote on Fri, Sep 16, 2022 at 16:15:
[Numpy-discussion] Re: Enhancement for AArch64 SVE instruction set
Hi,

It's been a long time since I first made contact here, but yesterday I submitted my pull request adding support for the Arm64 SVE architecture: https://github.com/numpy/numpy/pull/22265

Since there may be no public CI environment that runs the SVE instruction set, I tested my code on an in-house server (a Fujitsu FX700 with A64FX).
- A64FX is one of the Armv8.2-a + SVE compliant CPUs.

I think the regression test completed successfully:
- python3 runtests.py --cpu-baseline=sve --cpu-dispatch=none
The result shows "21354 passed, 203 skipped, 1302 deselected, 30 xfailed, 7 xpassed".

My implementation is similar to those for AVX/AVX2/ASIMD:
- The SVE intrinsics are defined in the numpy/core/src/common/simd/sve/*.h files.

Travis CI reported errors (https://github.com/numpy/numpy/pull/22265/checks?check_run_id=8384699529), but it seems the job exceeded the maximum log length and was terminated.

I would appreciate a review of my pull request, as well as comments and advice on this mailing list.

Thanks,
Kentaro
[Numpy-discussion] Re: Enhancement for AArch64 SVE instruction set
It seems cirrus-ci offers AWS EKS Graviton2 instances [0], and this is free for open-source projects. Do you know if that offering has SVE-enabled CPUs?

Matti

[0] https://cirrus-ci.org/guide/linux/

On Fri, Sep 16, 2022 at 5:54 AM kawakam...@fujitsu.com wrote:
[Numpy-discussion] Ways to achieve faster np.nanpercentile() calculation?
Hi all,

On my system, np.nanpercentile() is orders of magnitude (>100x) slower than np.percentile(). I am using numpy 1.23.1, and I am wondering if there is a way to speed it up.

I came across this workaround for 3-D arrays: https://krstn.eu/np.nanpercentile()-there-has-to-be-a-faster-way/

But I would need a generalized solution that works in N dimensions. I have started adapting the above, but am I reinventing the wheel here? Is there already a Python package that implements a speedier nanpercentile for numpy (similar in spirit to the Bottleneck package)? Or are there other known workarounds that achieve the same result?

Best regards,
Aron
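One way the sort-based trick in that post generalizes to any axis (an illustrative sketch under stated assumptions: scalar q only, reproducing np.nanpercentile's default linear interpolation): np.sort pushes NaNs to the end of each lane, so counting the finite values per lane gives the position to interpolate at:

import numpy as np

def fast_nanpercentile(a, q, axis=-1):
    # Sort-based nanpercentile sketch: scalar q, linear interpolation.
    a = np.moveaxis(a, axis, -1)            # reduce along the last axis
    s = np.sort(a, axis=-1)                 # NaNs sort to the end of each lane
    valid = np.sum(~np.isnan(a), axis=-1)   # count of finite values per lane
    idx = (valid - 1) * (q / 100.0)         # fractional position of q
    lo = np.clip(np.floor(idx), 0, None).astype(np.intp)
    hi = np.clip(np.ceil(idx), 0, None).astype(np.intp)
    lo_val = np.take_along_axis(s, lo[..., None], axis=-1)[..., 0]
    hi_val = np.take_along_axis(s, hi[..., None], axis=-1)[..., 0]
    out = lo_val + (hi_val - lo_val) * (idx - lo)
    return np.where(valid > 0, out, np.nan)  # all-NaN lanes stay NaN

x = np.random.rand(1000, 1000)
x[x < 0.1] = np.nan
print(np.allclose(fast_nanpercentile(x, 75.0, axis=1),
                  np.nanpercentile(x, 75.0, axis=1)))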