Re: [Cython] gsoc: array expressions

2012-05-21 Thread Dag Sverre Seljebotn

On 05/20/2012 04:03 PM, mark florisson wrote:

Hey,

For my GSoC we already have some simple initial ideas, i.e.
elementwise vector expressions (a + b with a and b arrays of
arbitrary rank); I don't think these need any discussion. However,
there are a lot of things that haven't been formally discussed on the
mailing list, so here goes.

Frédéric, I am CCing you since you expressed interest on the numpy
mailing list, and I think your insights as a Theano developer can be
very helpful in this discussion.

User Interface
===
Besides simple array expressions for dense arrays I would like a
mechanism for "custom ufuncs", although to a different extent than what
NumPy or Numba provide. There are several ways in which we could want
them, e.g. as typed (cdef or external C) functions, as
lambdas or Python functions in the same module, or as general objects
(e.g. functions Cython doesn't know about).
To achieve maximum efficiency it will likely be good to allow sharing
these functions in .pxd files. We have 'cdef inline' functions, but I
would prefer annotated def functions where the parameters are
specialized on demand, e.g.

@elemental
def add(a, b):
    # elemental functions can have any number of arguments
    # and operate on any compatible dtype
    return a + b

When calling cdef functions or elemental functions with memoryview
arguments, the call performs a (broadcasted) elementwise
operation. Alternatively, we can have a parallel.elementwise function
which maps the function elementwise, which would also work for object
callables. I prefer the former, since I think it will read much
easier.
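A pure-Python sketch of the broadcasting call semantics described above (toy representation: operands are (shape, flat C-ordered list) pairs; all names are hypothetical, not the proposed Cython API):

```python
from itertools import product

def broadcast_shape(*shapes):
    # Right-align shapes and apply NumPy-style broadcasting rules:
    # along each dimension, sizes must match or be 1.
    ndim = max(len(s) for s in shapes)
    padded = [(1,) * (ndim - len(s)) + tuple(s) for s in shapes]
    result = []
    for dims in zip(*padded):
        target = max(dims)
        if any(d not in (1, target) for d in dims):
            raise ValueError("shapes are not broadcastable: %r" % (shapes,))
        result.append(target)
    return tuple(result)

def elementwise(func, *arrays):
    # arrays: (shape, flat_list) pairs in C (row-major) order.
    # Returns the broadcasted result as a (shape, flat_list) pair.
    out_shape = broadcast_shape(*[shape for shape, _ in arrays])
    out = []
    for idx in product(*[range(n) for n in out_shape]):
        args = []
        for shape, data in arrays:
            # Map the output index onto each input, clamping
            # broadcast (size-1) dimensions to index 0.
            pad = len(out_shape) - len(shape)
            flat = 0
            for dim, i in zip(shape, idx[pad:]):
                flat = flat * dim + (i if dim > 1 else 0)
            args.append(data[flat])
        out.append(func(*args))
    return out_shape, out
```

E.g. a (2, 1) operand combined with a (3,) operand broadcasts to shape (2, 3), just as in NumPy.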

Secondly, we can have a reduce function (and maybe a scan function),
that reduces (respectively scans) over a specified axis or set of axes.
E.g.

 parallel.reduce(add, a, b, axis=(0, 2))

where the default for axis is "all axes". As for the default value,
this could perhaps be provided optionally to the elemental decorator.
Otherwise, the reducer will have to get the default values from each
dimension that is reduced in, and then skip those values when
reducing. (Of course, the reducer function must be associative and
commutative.) Also, a lambda could be passed in instead of an


Only associative, right?

Sounds good to me.


elementwise or typed cdef function.
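A sequential sketch of the proposed reduce-over-axes semantics (toy representation: a flat C-ordered list plus its shape; names hypothetical, and the real version would be compiled and parallel):

```python
from functools import reduce
from itertools import product

def reduce_axes(func, shape, data, axis):
    # Reduce a flat C-ordered array of the given shape with a binary
    # associative function over the dimensions listed in `axis`.
    kept = [d for d in range(len(shape)) if d not in axis]
    out_shape = tuple(shape[d] for d in kept)
    # Row-major strides in elements.
    strides, s = [], 1
    for n in reversed(shape):
        strides.append(s)
        s *= n
    strides = strides[::-1]
    out = []
    for kept_idx in product(*[range(shape[d]) for d in kept]):
        values = []
        for red_idx in product(*[range(shape[d]) for d in axis]):
            idx = [0] * len(shape)
            for d, i in zip(kept, kept_idx):
                idx[d] = i
            for d, i in zip(axis, red_idx):
                idx[d] = i
            values.append(data[sum(i * st for i, st in zip(idx, strides))])
        out.append(reduce(func, values))
    return out_shape, out
```

E.g. reducing a (2, 2, 2) array over axis=(0, 2) with add leaves one axis of length 2, mirroring parallel.reduce(add, a, b, axis=(0, 2)) above.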

Finally, we would have a parallel.nditer/ndenumerate/nditerate
function, which would iterate over N memoryviews, and provide a
sensible memory access pattern (like numpy.nditer). I'm not sure if it
should provide only the indices, or also the values. E.g. an in-place
elementwise add would read as follows:

 for i, j, k in parallel.nditerate(A, B):
 A[i, j, k] += B[i, j, k]



I think this sounds good; I guess I don't see a particular reason for 
"ndenumerate", I think code like the above is clearer.


It's perhaps worth at least thinking about how to support "for idx in 
...", "A[idx[2], Ellipsis] = ...", i.e. an arbitrary number of 
dimensions. Not in the first iteration, though.


Putting it in "parallel" is nice because prange already has 
out-of-order semantics. But of course, there are performance benefits 
even within a single thread because of the out-of-order aspect. This 
should at least be a big NOTE box in the documentation.
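The nditerate pattern above can be sketched in pure Python (a toy, in-order version over nested lists; real generated code would be free to reorder iteration for locality or parallelism, which is the out-of-order point; names hypothetical):

```python
from itertools import product

def nditerate(*shapes):
    # Yield index tuples covering the common shape of the operands,
    # in C order. Broadcasting and reordering are omitted in this toy.
    if len(set(shapes)) != 1:
        raise ValueError("shapes must match (broadcasting omitted here)")
    yield from product(*[range(n) for n in shapes[0]])

# In-place elementwise add, mirroring the example above (2-D case):
def add_inplace(A, B, shape):
    for i, j in nditerate(shape, shape):
        A[i][j] += B[i][j]
```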




Implementation
===
Frédéric, feel free to correct me at any point here :)

As for the implementation, I think it will be a good idea to at least
reuse (optionally through command line flags) Theano's optimization
pipeline. I think it would be reasonably easy to build a Theano
expression graph (after fusing the expressions in Cython first), run
the Theano optimizations on that and map back to a Cython AST.
Optionally, we could store a pickled graph representation (or even a
compiled Theano function?), and provide it as an optional
specialization at runtime (but mapping back correctly to memoryviews
where needed, etc.). As Numba matures, a Numba runtime specialization
could optionally be provided.
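As a toy illustration of what such an expression graph buys (hypothetical representation, not Theano's actual API): an elementwise tree like a + b * c can be evaluated in one fused loop with no intermediate temporaries, which is the shape of code the pipeline would generate:

```python
# Minimal expression-graph node: an op name plus child nodes
# (a leaf is ("load", name)). All names here are hypothetical.
class Node:
    def __init__(self, op, *children):
        self.op, self.children = op, children

def evaluate_fused(node, operands, n):
    # operands: dict name -> flat array. One pass, no temporaries:
    # the whole tree is evaluated per element.
    def eval_at(nd, i):
        if nd.op == "load":
            return operands[nd.children[0]][i]
        left = eval_at(nd.children[0], i)
        right = eval_at(nd.children[1], i)
        return left + right if nd.op == "add" else left * right
    return [eval_at(node, i) for i in range(n)]

# The graph for a + b * c:
expr = Node("add", Node("load", "a"),
            Node("mul", Node("load", "b"), Node("load", "c")))
```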


Can you enlighten us a bit about what Theano's optimizations involve? 
You mention doing the iteration specializations yourself below, and also 
the tiling..


Is it just "scalar" optimizations of the form "x**3 -> x * x * x" and 
numeric stabilization like "log(1 + x) -> log1p(x)" that would be 
provided by Theano?


If so, such optimizations should be done also for our scalar 
computations, not just vector, right?


Or does Theano deal with the memory access patterns?



As for the implementation of the C specializations, I currently think
we should implement our own, since Theano heavily uses the NumPy C
API, and since it's easier to have an optional Theano runtime
specialization anyway. I propose the following specializations, to be
selected at runtime:

 - vectorized contiguous, collapsed and aligned
   (this function can be called by a strided, inner-dimension-contiguous
   specialization)
 - tiled (ndenumerate/nditerate)
 - tiled vectorized
 - plain C loops
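The runtime selection could look roughly like this (a sketch only; the itemsize, alignment threshold, and all names are assumptions, not the proposed implementation):

```python
def contiguous(strides, shape, itemsize):
    # C-contiguity check: strides must match a dense row-major layout.
    expected = itemsize
    for n, s in zip(reversed(shape), reversed(strides)):
        if n > 1 and s != expected:
            return False
        expected *= n
    return True

def select_specialization(operands, itemsize=8, alignment=16):
    # operands: list of (base_address, shape, strides) triples.
    # Pick the most specialized loop the operands' layout allows.
    if all(contiguous(st, sh, itemsize) for _, sh, st in operands):
        if all(addr % alignment == 0 for addr, _, _ in operands):
            return "vectorized contiguous, collapsed and aligned"
        return "strided, inner dimension contiguous"
    if all(st[-1] == itemsize for _, _, st in operands):
        return "strided, inner dimension contiguous"
    return "tiled"  # fall back to tiled / plain C loops
```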

With 'aligned' it is not 

Re: [Cython] gsoc: array expressions

2012-05-21 Thread mark florisson
On 21 May 2012 11:34, Dag Sverre Seljebotn  wrote:
> On 05/20/2012 04:03 PM, mark florisson wrote:
>>
>> [...] (Of course, the reducer function must be associative and
>> commutative.) [...]
>
>
> Only associative, right?
>
> Sounds good to me.
>

Ah, I guess, because we can reduce thread-local results manually in a
specified (elementwise) order (I was thinking of generating OpenMP
annotated loops, that can be enabled/disabled at the C level, with an
'if' clause with a sensible lower bound of iterations required).
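The point about ordered thread-local reduction can be sketched sequentially (hypothetical names; OpenMP would execute the chunk loop in parallel): because each thread reduces an in-order slice and the partials are then combined in chunk order, operand order is never swapped, so only associativity of the operator is required, not commutativity.

```python
from functools import reduce

def chunked_reduce(func, data, num_threads):
    # Split data into contiguous, in-order chunks (one per "thread").
    n = len(data)
    bounds = [n * t // num_threads for t in range(num_threads + 1)]
    # "Thread-local" partial reductions, each over an in-order slice.
    partials = [reduce(func, data[bounds[t]:bounds[t + 1]])
                for t in range(num_threads)
                if bounds[t] < bounds[t + 1]]
    # Combine the partials in a fixed (chunk) order.
    return reduce(func, partials)
```

String concatenation is associative but not commutative, and still reduces correctly here, which is exactly the property being discussed.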

>> [...]
> It's perhaps worth at least thinking about how to support "for idx in ...",
> "A[idx[2], Ellipsis] = ...", i.e. arbitrary number of dimensions. Not in
> first iteration though.

Yeah, definitely.

> [...]
> Is it just "scalar" optimizations of the form "x**3 -> x * x * x" and
> numeric stabilization like "log(1 + x) -> log1p(x)" that would be provided
> by Theano?

Yes, it does those kinds of things, and it also eliminates common
subexpressions, and it transforms certain expressions to BLAS/LAPACK
functionality.

Re: [Cython] gsoc: array expressions

2012-05-21 Thread mark florisson
On 21 May 2012 11:34, Dag Sverre Seljebotn  wrote:
> [...]
> If so, such optimizations should be done also for our scalar computations,
> not just vector, right?

As for this, I don't think CSE is important for scalar computations:
if they are objects, you have to go through a Python layer since
you have no idea what the code does, and if they are C objects, the C
compiler will readily optimize that out. You want this for vector
expressions since the compu
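The kind of CSE that matters for vector expressions can be sketched as hash-consing over the expression graph (toy representation: tuples like ("add", x, y) with strings as leaves; not Theano's actual machinery):

```python
def cse(expr, table=None):
    # Canonicalize an expression tree so identical subtrees become
    # the same object, via a value-numbering table.
    if table is None:
        table = {}
    if isinstance(expr, str):  # leaf: a variable name
        return expr, table
    op, *args = expr
    canon = [cse(a, table)[0] for a in args]
    key = (op, *canon)
    # Reuse the already-seen node for an identical subtree.
    return table.setdefault(key, key), table

# Build the two a*b subtrees as distinct (but equal) objects.
left = tuple(["mul", "a", "b"])
right = tuple(["mul", "a", "b"])
shared, _ = cse(("add", left, right))
```

After CSE both children of the add are the same node, so a code generator would compute a*b once per element instead of twice.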

Re: [Cython] gsoc: array expressions

2012-05-21 Thread Dag Sverre Seljebotn

On 05/21/2012 12:56 PM, mark florisson wrote:

On 21 May 2012 11:34, Dag Sverre Seljebotn  wrote:

[...]

Yes, it does those kinds of things, and it also eliminates common
subexpressions, and it transforms certain expressions to BLAS/LAPACK
functionality. I'm not sure we want that specifically. I'm thinking it
might be more fruitful to start off with a Theano-only specialization,
and implement low-level code generation


Re: [Cython] gsoc: array expressions

2012-05-21 Thread Robert Bradshaw
On Mon, May 21, 2012 at 3:34 AM, Dag Sverre Seljebotn
 wrote:
> On 05/20/2012 04:03 PM, mark florisson wrote:
>>
>> [...]
>> Finally, we would have a parallel.nditer/ndenumerate/nditerate
>> function, which would iterate over N memoryviews, and provide a
>> sensible memory access pattern (like numpy.nditer). I'm not sure if it
>> should provide only the indices, or also the values. e.g. an inplace
>> elementwise add would read as follows:
>>
>>     for i, j, k in parallel.nditerate(A, B):
>>         A[i, j, k] += B[i, j, k]
>
>
>
> I think this sounds good; I guess I don't see a particular reason for
> "ndenumerate", I think code like the above is clearer.

I'm assuming the index computations would not be re-done in this case
(i.e. there's more magic going on here than looks like at first
glance)? Otherwise there is an advantage to ndenumerate.
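To make the semantics under discussion concrete, here is a minimal pure-Python sketch of what the proposed `parallel.nditerate` call above could mean. The name and API are part of the proposal, not an existing function, and a real implementation would compile this to a native (possibly tiled, parallel) loop nest rather than yield Python tuples:

```python
import itertools

import numpy as np


def nditerate(*arrays):
    # Yield index tuples covering the broadcast shape of all arguments,
    # in C order. Purely illustrative of the user-visible semantics.
    shape = np.broadcast_shapes(*(a.shape for a in arrays))
    return itertools.product(*(range(n) for n in shape))


A = np.arange(24, dtype=np.float64).reshape(2, 3, 4)
B = np.ones_like(A)
for i, j, k in nditerate(A, B):
    A[i, j, k] += B[i, j, k]
```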

- Robert
___
cython-devel mailing list
cython-devel@python.org
http://mail.python.org/mailman/listinfo/cython-devel


Re: [Cython] gsoc: array expressions

2012-05-21 Thread Dag Sverre Seljebotn

On 05/22/2012 08:11 AM, Robert Bradshaw wrote:

On Mon, May 21, 2012 at 3:34 AM, Dag Sverre Seljebotn
  wrote:

On 05/20/2012 04:03 PM, mark florisson wrote:


[...]
 for i, j, k in parallel.nditerate(A, B):
 A[i, j, k] += B[i, j, k]




I think this sounds good; I guess don't see a particular reason for
"ndenumerate", I think code like the above is clearer.


I'm assuming the index computations would not be re-done in this case
(i.e. there's more magic going on here than looks like at first
glance)? Otherwise there is an advantage to ndenumerate.


Ideally, there is a lot more magic going on, though I don't know how far 
Mark wants to go.


Imagine "nditerate(A, A.T)"; in that case it would have to make many 
small tiles so that for each tile being processed, A has a tile in cache 
and A.T has another tile in cache (so that one doesn't waste cache line 
transfers).


So those array lookups would potentially look up in different memory 
buffers, with the strides known at compile time.
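The tiling described here can be sketched in NumPy for the "A plus A.T" case. The function name, the block size, and the use of `np.ascontiguousarray` as the copy-in "scratch" step are illustrative assumptions about what the generated code would do, not an existing API:

```python
import numpy as np


def add_transpose_tiled(A, blocksize=64):
    # Sketch of what nditerate(A, A.T) could generate for a square 2-D
    # array: process blocksize x blocksize tiles so the tile of A and
    # the gathered tile of A.T both stay cache-resident while in use.
    n = A.shape[0]
    out = np.empty_like(A)
    for i0 in range(0, n, blocksize):
        for j0 in range(0, n, blocksize):
            tile = A[i0:i0 + blocksize, j0:j0 + blocksize]
            # copy-in: gather the strided transposed tile into
            # contiguous scratch space once, instead of re-fetching a
            # cache line per element.
            scratch = np.ascontiguousarray(
                A.T[i0:i0 + blocksize, j0:j0 + blocksize])
            out[i0:i0 + blocksize, j0:j0 + blocksize] = tile + scratch
    return out
```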


Which begs the question: What about this body?

if i < 100:
    continue
else:
    A[i, j, k] += B[i - 100, j, k]

I guess just fall back to a non-tiled version? One could of course do 
some shifting of which tiles of B to grab etc., but there's a limit to 
how smart one should try to be; one could emit a warning and say that 
one should slice and dice the memoryviews into shape before they are 
passed to nditerate.


Dag


Re: [Cython] gsoc: array expressions

2012-05-21 Thread Robert Bradshaw
On Mon, May 21, 2012 at 11:36 PM, Dag Sverre Seljebotn
 wrote:
> On 05/22/2012 08:11 AM, Robert Bradshaw wrote:
>>
>> On Mon, May 21, 2012 at 3:34 AM, Dag Sverre Seljebotn
>>   wrote:
>>>
>>> On 05/20/2012 04:03 PM, mark florisson wrote:


[...]
     for i, j, k in parallel.nditerate(A, B):
         A[i, j, k] += B[i, j, k]
>>>
>>>
>>>
>>>
>>> I think this sounds good; I guess don't see a particular reason for
>>> "ndenumerate", I think code like the above is clearer.
>>
>>
>> I'm assuming the index computations would not be re-done in this case
>> (i.e. there's more magic going on here than looks like at first
>> glance)? Otherwise there is an advantage to ndenumerate.
>
>
> Ideally, there is a lot more magic going on, though I don't know how far
> Mark wants to go.
>
> Imagine "nditerate(A, A.T)", in that case it would have to make many small
> tiles so that for each tile being processed, A has a tile in cache and A.T
> has another tile in cache (so that one doesn't waste cache line transfers).
>
> So those array lookups would potentially look up in different memory
> buffers, with the strides known at compile time.

Yes, being clever about the order in which to iterate over the indices
is the hard problem to solve here. I was thinking more in terms of the
inner loop iterating over the innermost dimension only to do the
indexing (retrieval and assignment), similar to how the generic NumPy
iterator works.

> Which begs the question: What about this body?
>
> if i < 100:
>    continue
> else:
>    A[i, j, k] += B[i - 100, j, k]
>
> I guess just fall back to a non-tiled version? One could of course do some
> shifting of which tiles of B to grab etc., but there's a limit to how smart
> one should try to be; one could emit a warning and say that one should slice
> and dice the memoryviews into shape before they are passed to nditerate.

Linear transformations of the index variables could probably be
handled, but that's certainly not v1 (and not too difficult for the
user to express manually).
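For the concrete body quoted above, the manual expression Robert mentions is just a shifted slice assignment; the shapes below are placeholders chosen for illustration:

```python
import numpy as np

A = np.zeros((200, 4, 4))
B = np.ones((200, 4, 4))

# Equivalent of: for all i >= 100, A[i, j, k] += B[i - 100, j, k].
# Sliced like this the access pattern is regular again, so the
# tiled/blocked iteration applies without any special casing.
A[100:] += B[:100]
```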

- Robert

Re: [Cython] gsoc: array expressions

2012-05-21 Thread Dag Sverre Seljebotn

On 05/22/2012 08:48 AM, Robert Bradshaw wrote:

[...]
Yes, being clever about the order in which to iterate over the indices
is the hard problem to solve here. I was thinking more in terms of the
inner loop iterating over the innermost dimension only to do the
indexing (retrieval and assignment), similar to how the generic NumPy
iterator works.


The point isn't only being clever about the *order*...you need "copy-in, 
copy-out".


The point is that the NumPy iterator is not good enough (for 
out-of-cache situations). Since you grab a cache line (64 bytes) each 
time from main memory, a plain NumPy broadcasted iterator throws away a 
lot of memory for "A + A.T": for ndim > 1 there's NO iteration order 
which isn't bad (for instance, if you iterate in the order of A, then 
for each single element of A.T you fetch, a full 64-byte cache line is 
transferred).


So the solution is to copy A.T block-wise to a temporary scratch space 
in cache so that you use all the elements in the cache line before 
throwing it out of cache.


In C, I've seen a simple blocking transpose operation be over four times 
faster than the brute-force transpose for this reason.
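A sketch of such a blocking transpose follows; NumPy is used only to show the loop structure, and the roughly four-times win mentioned above comes from the equivalent C loops, not from this Python version:

```python
import numpy as np


def blocked_transpose(A, blocksize=64):
    # A naive transpose walks the output column-wise, wasting most of
    # each 64-byte cache line it touches. Doing the transpose tile by
    # tile keeps both the reads from A and the writes to out within
    # cache-resident blocks.
    n, m = A.shape
    out = np.empty((m, n), dtype=A.dtype)
    for i in range(0, n, blocksize):
        for j in range(0, m, blocksize):
            out[j:j + blocksize, i:i + blocksize] = \
                A[i:i + blocksize, j:j + blocksize].T
    return out
```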


Dag





Re: [Cython] gsoc: array expressions

2012-05-21 Thread Dag Sverre Seljebotn

On 05/22/2012 08:57 AM, Dag Sverre Seljebotn wrote:

[...]
The point isn't only being clever about the *order*...you need "copy-in,
copy-out".

The point is that the NumPy iterator is not good enough (for
out-of-cache situations). Since you grab a cache line (64 bytes) each
time from main memory, a plain NumPy broadcasted iterator throws away a
lot of memory for "A + A.T", since for ndim>1 there's NO iteration order
which isn't bad (for instance, you could iterate in the order of A, and
the result would be that for each element of A.T you fetch there is 64
bytes transferred).


I meant, "throws away a lot of memory *bandwidth*".

Dag


