[Numpy-discussion] win32 wheels for Python 3.10

2021-11-10 Thread Aivar Annamaa
Has it been decided to stop publishing win32 wheels for Python 3.10? NumPy
1.21.4 is currently missing a win32 wheel for Python 3.10 on PyPI. Is this temporary?

I'm confused, because 
https://www.mail-archive.com/numpy-discussion@python.org/msg05512.html says:

> We will continue to offer 32-bit wheels on Windows as it remains popular there.


[Numpy-discussion] Conversion from C-layout to Fortran-layout in Cython

2021-11-10 Thread Ilhan Polat
I've asked this on the Cython mailing list, but I should probably get some
feedback here too.

I have the following function defined in Cython, using flat memory
pointers to hold n-by-n array data:


cdef some_C_layout_func(double[:, :, ::1] Am) nogil:
    # ...
    cdef double *work1 = <double *>malloc(n*n*sizeof(double))
    cdef double *work2 = <double *>malloc(n*n*sizeof(double))
    # ...
    # Lots of C-layout operations here
    # ...
    dgetrs('T', &n, &n, &work1[0], &n, &ipiv[0], &work2[0], &n, &info)
    dcopy(&n2, &work2[0], &int1, &Am[0, 0, 0], &int1)
    free(...)

Here I have done everything in C layout with work1 and work2, but I have to
convert work2 to Fortran layout to be able to solve AX = B. A can be
transposed internally by LAPACK via the flag 'T', so the only obstacle I
have now is to shuffle work2 which, since it is still in C layout, holds B
transposed in the eyes of Fortran.

If I naively write loops to convert one layout to the other, that spoils all
the speed benefit of this Cythonization due to cache misses; in fact, 60% of
the whole function's time is spent in that naive loop. The same goes for the
copy_fortran() method of memoryviews.

I have measured the regular NumPy np.asfortranarray(), and its performance is
quite good compared to the actual linear solve. Hence, whatever it is doing
underneath, I would like to reach down and do the same, possibly via the
C-API. But my C knowledge basically failed me around this line:
https://github.com/numpy/numpy/blob/8dbd507fb6c854b362c26a0dd056cd04c9c10f25/numpy/core/src/multiarray/multiarraymodule.c#L1817

I have found the SO post at
https://stackoverflow.com/questions/45143381/making-a-memoryview-c-contiguous-fortran-contiguous
but I am not sure whether that is the canonical way to do it in newer Python
versions.
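
For reference, my best guess at what np.asfortranarray() boils down to for a
C-ordered input is the C-API copy routine, which (as far as I know) is also
exposed through NumPy's Cython declarations. A minimal, untested sketch; note
that it creates and returns a Python ndarray, so it would not fit inside a
nogil block as-is:

cimport numpy as cnp

cnp.import_array()

cdef cnp.ndarray as_fortran_copy(cnp.ndarray a):
    # Ask the C-API for a new F-contiguous copy of `a`; roughly what
    # np.asfortranarray() ends up doing when `a` is C-ordered.
    return cnp.PyArray_NewCopy(a, cnp.NPY_FORTRANORDER)

Whether that buys anything over copy_fortran() of memoryviews, I am not sure.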

Can anyone show me how to go about it without interacting with Python
objects?

Best,
ilhan


[Numpy-discussion] Re: Conversion from C-layout to Fortran-layout in Cython

2021-11-10 Thread Benjamin Root
I have found that a bunch of LAPACK functions seem to have arguments for
stating whether or not the given arrays are C- or F-ordered, so you wouldn't
need to worry about handling the layout yourself. For example, I have some
C++ code like so:

extern "C" {

/**
 * Forward declaration for LAPACK's Fortran dgemm function to allow use in C/C++ code.
 *
 * This function is used for matrix multiplication between two arrays of doubles.
 *
 * For complete reference:
 * http://www.netlib.org/lapack/explore-html/d1/d54/group__double__blas__level3_gaeda3cbd99c8fb834a60a6412878226e1.html
 */
void dgemm_(const char* TRANSA, const char* TRANSB, const int* M, const int* N, const int* K,
            const double* ALPHA, const double* A, const int* LDA, const double* B, const int* LDB,
            const double* BETA, double* C, const int* LDC);
}

...

dgemm_("C", "C", &nLayers, &N, &nVariables, &alpha, matrices.IW->data(), &nVariables,
       inputs.data(), &N, &beta, intermediate.data(), &nLayers);

(in this case, I was using Boost multiarrays, but the basic idea is the
same). IIRC, a bunch of other LAPACK functions had similar features.

I hope this is helpful.

Ben Root





[Numpy-discussion] Re: Conversion from C-layout to Fortran-layout in Cython

2021-11-10 Thread Ilhan Polat
Indeed, for matrix multiplication and many other L3 BLAS functions we are
lucky; however, for the linear solve function ?getrs there is unfortunately no
such option.
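
For reference, this is the ?getrs interface as I read it from the LAPACK
documentation (the same signature that scipy.linalg.cython_lapack exposes):
the only transpose-related argument is trans, and it applies to the factored
A, not to B/X, so B itself has to be laid out column-major.

from scipy.linalg.cython_lapack cimport dgetrs

# dgetrs(trans, n, nrhs, a, lda, ipiv, b, ldb, info)
#   trans ('N', 'T' or 'C') selects op(A) in op(A) X = B; B is always read and
#   overwritten as an ldb-by-nrhs column-major array, so there is no flag to
#   declare it as row-major.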



[Numpy-discussion] Re: Conversion from C-layout to Fortran-layout in Cython

2021-11-10 Thread Andras Deak
On Thursday, November 11, 2021, Ilhan Polat  wrote:

> If I go naively and make loops to get one layout to the other that
> actually spoils all the speed benefits from this Cythonization due to cache
> misses. In fact 60% of the time is spent in that naive loop across the
> whole function.
>
>
Sorry if this is a dumb question, but is this true whether or not you loop
over contiguous blocks of the input vs the output array? Or is the faster
of the two options still slower than the linsolve?
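
That is, my reading of the question is whether the two loop orders below
behave the same; in one the reads are contiguous, in the other the writes are:

# reads work2 contiguously, writes work3 with stride n
for i in range(n):
    for j in range(n):
        work3[j*n + i] = work2[i*n + j]

# writes work3 contiguously, reads work2 with stride n
for j in range(n):
    for i in range(n):
        work3[j*n + i] = work2[i*n + j]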

András




[Numpy-discussion] Re: Conversion from C-layout to Fortran-layout in Cython

2021-11-10 Thread Ilhan Polat
Hmm, not sure I understand the question, but this is what I mean by naive
looping. Suppose I allocate a scratch buffer work3; then:
for i in range(n):
    for j in range(n):
        work3[j*n+i] = work2[i*n+j]



This basically does row-to-column indexing, and obviously it creates a lot of
cache misses, since the work3 entries are accessed in a shuffled fashion. The
whole point of this Cython attempt is to avoid such access patterns: if the
original some_C_layout_func takes 10 units of time, 6 of them are spent in
this loop when the data doesn't fit the cache. When I discard the correctness
of the function, comment out this loop, and re-measure, the original function
spends roughly 3 units of time. However, take any random C-ordered array in
NumPy from regular Python and call np.asfortranarray() on it, and it spends
roughly 0.1 units of time. So apparently it is possible to do this at a low
level in a performant way. That's what I would like to understand, or to clear
up my misunderstanding.
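
For what it's worth, that np.asfortranarray() number came from a plain
Python-level measurement, something along these lines (the size and repeat
count here are only illustrative, not the actual benchmark):

import numpy as np
from timeit import timeit

a = np.random.rand(60, 60)                        # C-ordered input
t = timeit(lambda: np.asfortranarray(a), number=10000)
print(t / 10000)                                  # per-call cost of the C-to-F copy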







[Numpy-discussion] Re: Conversion from C-layout to Fortran-layout in Cython

2021-11-10 Thread Ilhan Polat
Here are some actual numbers in the context of the operation (nogil removed
and the function def'd for line tracing):


Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    80                                           # Bilinear identity to shave off some flops
    81                                           # inv(V-U) (V+U) = inv(V-U) (V-U+2V) = I + 2 inv(V-U) U
    82         1         15.0     15.0      0.1  daxpy(&n2, &neg_one, &work2[0], &int1, &work3[0], &int1)
    83
    84                                           # Convert array layout for solving AX = B
    85         1          3.0      3.0      0.0  for i in range(n):
    86        60        137.0      2.3      0.9      for j in range(n):
    87      3600       8437.0      2.3     57.7          work4[j*n+i] = work2[i*n+j]
    88
    89         1        408.0    408.0      2.8  dgetrf(&n, &n, &work3[0], &n, &ipiv[0], &info)
    90         1       4122.0   4122.0     28.2  dgetrs('T', &n, &n, &work3[0], &n, &ipiv[0], &work4[0], &n, &info)
    91         1         25.0     25.0      0.2  dscal(&n2, &two, &work4[0], &int1)
    92                                           # Add identity matrix
    93         1          4.0      4.0      0.0  for i in range(n):
    94        60        146.0      2.4      1.0      work4[i*(n+1)] += 1.
    95         1         16.0     16.0      0.1  dcopy(&n2, &work4[0], &int1, &Am[0, 0, 0], &int1)

[Numpy-discussion] Re: Conversion from C-layout to Fortran-layout in Cython

2021-11-10 Thread Sebastian Berg
On Thu, 2021-11-11 at 01:04 +0100, Ilhan Polat wrote:
> Hmm not sure I understand the question but this is what I mean by naive
> looping, suppose I allocate a scratch register work3, then
> 
> for i in range(n):
>     for j in range(n):
>         work3[j*n+i] = work2[i*n+j]
> 

NumPy does not end up doing anything special here. Special would be to use
a blocked iteration, and NumPy unfortunately doesn't have one.
The only thing it does is use pointers to cut some overheads, something
(very rough) like:

ptr1 = arr1.data
ptr2_col = arr2.data

strides2_col = arr2.strides[0]
strides2_row = arr2.strides[1]

for i in range(n):
    ptr2 = ptr2_col
    for j in range(n):
        *ptr2 = *ptr1
        ptr1++
        ptr2 += strides2_row
    ptr2_col += strides2_col

And if you write that in Cython, you are likely faster, since you can cut
quite a few corners (everything is aligned, contiguous, etc.), modulo
loop-unrolling and compiler-optimization fluctuations; NumPy probably tells
GCC to unroll and optimize the innermost loop there.

I would not be surprised if you can find a lightweight, fast copy-transpose
out there, or if tools like MKL/CUDA just include one. It is too bad NumPy is
missing it.
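
For concreteness, a blocked copy-transpose of the kind described above might
look roughly like this in Cython (an untested sketch; the tile size is a guess
and would need tuning to the cache sizes at hand):

cdef void copy_transpose_blocked(double *src, double *dst, int n) nogil:
    # Copy the C-ordered n x n buffer `src` into `dst` in Fortran order,
    # one small tile at a time, so both buffers are touched in cache-sized
    # chunks instead of one of them being walked with stride n.
    cdef int i, j, ii, jj, i_end, j_end
    cdef int bs = 64          # tile edge in elements; tune this
    for ii in range(0, n, bs):
        i_end = ii + bs
        if i_end > n:
            i_end = n
        for jj in range(0, n, bs):
            j_end = jj + bs
            if j_end > n:
                j_end = n
            for i in range(ii, i_end):
                for j in range(jj, j_end):
                    dst[j*n + i] = src[i*n + j]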

Cheers,

Sebastian



[Numpy-discussion] new time for community meetings

2021-11-10 Thread Inessa Pawson
It’s that time of the year again when most of us have to adjust our clocks.
For some of us the fun doesn’t stop there as we have to adjust our work
schedules too. Let’s see if we can find a new time slot on Wednesday for
bi-weekly NumPy community meetings that works for everyone:
http://whenisgood.net/hhy4hk8
--
Every good wish,
Inessa Pawson


[Numpy-discussion] Re: Conversion from C-layout to Fortran-layout in Cython

2021-11-10 Thread Ilhan Polat
Ah, I see. Thank you, Sebastian. I was hoping to avoid all that blocking
(since hardware dependence leaves some performance on the table on many
machines) or recursive-zooming stuff by using some off-the-shelf tool, but
apparently I'm walking the dusty corners again, collecting spider webs :) As
you said, there is quite a lot of low-hanging fruit we could collect around
such data manipulations, which would boost basically everything, since these
ops are ubiquitous.

In case anyone is wondering about the context: this is for the
scipy.linalg.expm overhaul, mainly kept updated at
https://github.com/scipy/scipy/issues/12838

