Re: [Cython] gsoc: array expressions

2012-05-22 Thread Robert Bradshaw
On Mon, May 21, 2012 at 11:57 PM, Dag Sverre Seljebotn wrote:
> On 05/22/2012 08:48 AM, Robert Bradshaw wrote:
>>
>> On Mon, May 21, 2012 at 11:36 PM, Dag Sverre Seljebotn wrote:
>>>
>>> On 05/22/2012 08:11 AM, Robert Bradshaw wrote:


 On Mon, May 21, 2012 at 3:34 AM, Dag Sverre Seljebotn wrote:
>
>
> On 05/20/2012 04:03 PM, mark florisson wrote:
>>
>>
>>
>> Hey,
>>
>> For my GSoC we already have some simple initial ideas, i.e.
>> elementwise vector expressions (a + b with a and b arrays with
>> arbitrary rank); I don't think these need any discussion. However,
>> there are a lot of things that haven't been formally discussed on the
>> mailing list, so here goes.
>>
>> Frédéric, I am CCing you since you expressed interest on the numpy
>> mailing list, and I think your insights as a Theano developer can be
>> very helpful in this discussion.
>>
>> User Interface
>> ===
>> Besides simple array expressions for dense arrays I would like a
>> mechanism for "custom ufuncs", although to a different extent than what
>> NumPy or Numba provide. There are several ways in which we could want
>> them, e.g. as typed (cdef or external C) functions, as
>> lambdas or Python functions in the same module, or as general objects
>> (e.g. functions Cython doesn't know about).
>> To achieve maximum efficiency it will likely be good to allow sharing
>> these functions in .pxd files. We have 'cdef inline' functions, but I
>> would prefer annotated def functions where the parameters are
>> specialized on demand, e.g.
>>
>> @elemental
>> def add(a, b):
>>     # elemental functions can have any number of
>>     # arguments and operate on any compatible dtype
>>     return a + b
>>
>> When calling cdef functions or elemental functions with memoryview
>> arguments, the call performs a (broadcasted) elementwise
>> operation. Alternatively, we can have a parallel.elementwise function
>> which maps the function elementwise, which would also work for object
>> callables. I prefer the former, since I think it will read much
>> easier.
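
A rough sketch of the proposed semantics, modeled here in plain
Python with NumPy (the @elemental decorator and its broadcasting
dispatch are the proposal above, not an existing Cython or NumPy API;
NumPy's broadcasting stands in for the compiled specializations):

    import numpy as np

    def elemental(func):
        # Stand-in for the proposed decorator: broadcast array
        # arguments and apply func elementwise. The real feature
        # would compile a typed specialization per argument dtype.
        def wrapper(*args):
            return func(*(np.asarray(a) for a in args))
        return wrapper

    @elemental
    def add(a, b):
        return a + b

    x = np.arange(6, dtype=np.float64).reshape(2, 3)
    y = np.arange(3, dtype=np.float64)   # broadcasts against x
    print(add(x, y))                     # broadcasted elementwise add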
>>
>> Secondly, we can have a reduce function (and maybe a scan function),
>> which reduces (respectively scans) along a specified axis or set of axes.
>> E.g.
>>
>>     parallel.reduce(add, a, b, axis=(0, 2))
>>
>> where the default for axis is "all axes". As for the default value,
>> this could perhaps be provided optionally to the elemental decorator.
>> Otherwise, the reducer will have to get the default values from each
>> dimension that is reduced in, and then skip those values when
>> reducing. (Of course, the reducer function must be associative and
>> commutative). Also, a lambda could be passed in instead of an
>
>
>
>
> Only associative, right?
>
> Sounds good to me.
>
>
>> elementwise or typed cdef function.
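
For illustration, a single-array model of these reduction semantics in
plain Python/NumPy (parallel.reduce does not exist yet, the exact call
signature above with both a and b is still open, and np.frompyfunc
merely stands in for a compiled reducer):

    import numpy as np

    def reduce(func, arr, axis=None):
        # Default for axis is "all axes", as proposed above.
        axes = tuple(range(arr.ndim)) if axis is None else axis
        ufunc = np.frompyfunc(func, 2, 1)
        # Reduce the highest axes first so lower axis numbers stay valid.
        for ax in sorted(axes, reverse=True):
            arr = ufunc.reduce(arr, axis=ax)
        return arr

    a = np.arange(24).reshape(2, 3, 4)
    print(reduce(lambda x, y: x + y, a, axis=(0, 2)))  # shape (3,)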
>>
>> Finally, we would have a parallel.nditer/ndenumerate/nditerate
>> function, which would iterate over N memoryviews, and provide a
>> sensible memory access pattern (like numpy.nditer). I'm not sure if it
>> should provide only the indices, or also the values. E.g., an in-place
>> elementwise add would read as follows:
>>
>>     for i, j, k in parallel.nditerate(A, B):
>>         A[i, j, k] += B[i, j, k]
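
A pure-Python stand-in for what nditerate might yield, ignoring for
the moment the access-pattern question discussed next (nditerate is
only proposed here; np.broadcast_shapes models the broadcasting):

    import numpy as np

    def nditerate(*arrays):
        # Yield index tuples over the broadcast shape. A real
        # implementation would pick a cache-friendly (tiled) order.
        shape = np.broadcast_shapes(*(a.shape for a in arrays))
        return np.ndindex(*shape)

    A = np.zeros((2, 3, 4))
    B = np.ones((2, 3, 4))
    for i, j, k in nditerate(A, B):
        A[i, j, k] += B[i, j, k]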
>
>
>
>
>
> I think this sounds good; I guess I don't see a particular reason for
> "ndenumerate", I think code like the above is clearer.



 I'm assuming the index computations would not be re-done in this case
 (i.e. there's more magic going on here than it looks like at first
 glance)? Otherwise there is an advantage to ndenumerate.
>>>
>>>
>>>
>>> Ideally, there is a lot more magic going on, though I don't know how far
>>> Mark wants to go.
>>>
>>> Imagine "nditerate(A, A.T)": in that case it would have to make many
>>> small
>>> tiles so that for each tile being processed, A has a tile in cache and
>>> A.T
>>> has another tile in cache (so that one doesn't waste cache line
>>> transfers).
>>>
>>> So those array lookups would potentially look up in different memory
>>> buffers, with the strides known at compile time.
>>
>>
>> Yes, being clever about the order in which to iterate over the indices
>> is the hard problem to solve here. I was thinking more in terms of the
>> inner loop iterating over the innermost dimension only to do the
>> indexing (retrieval and assignment), similar to how the generic NumPy
>> iterator works.
>
>
> The point isn't only being clever about the *order*...you need "copy-in,
> copy-out".
>
> The point is that the NumPy iterator is not good enough (for out-of-cache
> situations). Since you grab a cache line (64 bytes) each time from main
> memory, a plain NumPy broadcasted iterator throws away a lot of memory
> bandwidth for "A + A.T": for ndim > 1 there's NO iteration order which
> isn't bad (for instance, if you iterate in the order of A, then for each
> element of A.T you fetch, a full 64 bytes is transferred).
>
> So the solution is to copy A.T block-wise to a temporary scratch space in
> cache, so that you use all the elements in the cache line before throwing
> it out of cache.
>
> In C, I've seen a simple blocking transpose operation be over four times
> faster than the brute-force transpose for this reason.
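
Dag's blocking point, sketched in plain Python/NumPy (the block size
and the blocked_transpose function are illustrative only; a real
implementation would be in C with the tile sized to the cache):

    import numpy as np

    def blocked_transpose(a, block=64):
        # Copy a.T tile by tile: each input tile is consumed while its
        # cache lines are still resident, instead of touching a new
        # cache line for every element as a row-by-row transpose does.
        n, m = a.shape
        out = np.empty((m, n), dtype=a.dtype)
        for i in range(0, n, block):
            for j in range(0, m, block):
                out[j:j+block, i:i+block] = a[i:i+block, j:j+block].T
        return out

    a = np.arange(10000, dtype=np.float64).reshape(100, 100)
    assert np.array_equal(blocked_transpose(a), a.T)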

Re: [Cython] gsoc: array expressions

2012-05-22 Thread Dag Sverre Seljebotn

On 05/22/2012 09:06 AM, Robert Bradshaw wrote:

> [snip]

> Yes, I understand this. Truly element-wise arithmetic with arrays

Re: [Cython] 0.17

2012-05-22 Thread mark florisson
On 6 May 2012 15:28, mark florisson  wrote:
> Hey,
>
> I think we already have quite a bit of functionality (nearly) ready;
> after merging some pending pull requests, maybe it will be a good time
> for a 0.17 release? I think it would be good to also document to what
> extent PyPy support works: what works and what doesn't. Stefan, since
> you added a large majority of the features, would you want to be the
> release manager?
>
> In summary, the following pull requests should likely go in
>    - array.array support (unless further discussion prevents that)
>    - fused types runtime buffer dispatch
>    - newaxis
>    - more?
>
> The memoryview documentation should also be reworked a bit. Matthew,
> are you still willing to have a go at that? Otherwise I can clean up
> the mess first (some things are no longer true and are simply outdated)
> and then ask for a second opinion.
>
> Mark

I think we have enough stuff in to go for a 0.17 release; I have a few
more fixes and a refactoring that I'll finish tonight that might be
useful to get in as well. Currently Jenkins is yellow though, as the
reduce_pickle test fails in Python 3.


Re: [Cython] gsoc: array expressions

2012-05-22 Thread mark florisson
On 22 May 2012 07:48, Robert Bradshaw  wrote:
> [snip]
>
> Yes, being clever about the order in which to iterate over the indices
> is the hard problem to solve here. I was thinking more in terms of the
> inner loop iterating over the innermost dimension only to do the
> indexing (retrieval and assignment), similar to how the generic NumPy
> iterator works.

That's a valid point, but my experience has been that any worthy C
compiler will do common subexpression elimination for the outer
dimensions and not recompute the offset every time. The flat offset
computation actually generated marginally faster code for scalar
assignment than a "cascaded pointer assignment", i.e. faster than

p0 = data;
for (...) {
    p1 = p0 + i * strides[0];
    for (...) {
        p2 = p1 + j * strides[1];
        ...
    }
}

(haven't tried manual strength reduction there though).
