Re: [Cython] [cython-users] freelist benchmarks

2013-02-27 Thread Robert Bradshaw
On Tue, Feb 26, 2013 at 11:24 PM, Stefan Behnel  wrote:
> Robert Bradshaw, 26.02.2013 21:16:
>> On Mon, Feb 25, 2013 at 1:17 AM, Stefan Behnel wrote:
>>> David Roe, 25.02.2013 00:00:
 I changed the current type pointer check to look at tp_basicsize instead.

> That made it work for almost all classes in lxml's own Element hierarchy,
> with only a couple of exceptions in lxml.objectify that have one additional
> object field. So, just extending the freelist support to use two different
> lists for different struct sizes instead of just one would make it work for
> all of lxml already. Taking a look at Sage to see how the situation appears
> over there would be interesting, I guess.

 I found some chains of length 5.  This could be shortened to 4 by putting
 the freelist at the level of Element (which is where you most care about
 speed of object creation).
>>>
>>> It's substantially easier to keep it in the top-level base class, though.
>>> Otherwise, we'd need a new protocol between inheriting types as I
>>> previously described. That adds a *lot* of complexity.
>>>
>>>
 SageObject
 -> Element (_parent attribute and cdef methods)
 -> Vector (_degree)
 -> FreeModuleElement (_is_mutable)
 -> FreeModuleElement_generic_dense (_entries)

 SageObject
 -> Element (_parent attribute and cdef methods)
 -> sage.structure.element.Matrix (_nrows)
 -> sage.matrix.matrix.Matrix (_base_ring)
 -> Matrix_integer_dense (_entries)
>>
>> I don't know that (expensive) matrices are the best example, and often
>> the chains are larger for elements one really cares about.
>>
>> sage: def base_tree(x): return [] if x is None else [x] + base_tree(x.__base__)
>> ....:
>>
>> sage: base_tree(Integer)
>> [<type 'sage.rings.integer.Integer'>,
>>  <type 'sage.structure.element.EuclideanDomainElement'>,
>>  <type 'sage.structure.element.PrincipalIdealDomainElement'>,
>>  <type 'sage.structure.element.DedekindDomainElement'>,
>>  <type 'sage.structure.element.IntegralDomainElement'>,
>>  <type 'sage.structure.element.CommutativeRingElement'>,
>>  <type 'sage.structure.element.RingElement'>,
>>  <type 'sage.structure.element.ModuleElement'>,
>>  <type 'sage.structure.element.Element'>,
>>  <type 'sage.structure.sage_object.SageObject'>,
>>  <type 'object'>]
>>
>> sage: base_tree(RealDoubleElement)
>> [<type 'sage.rings.real_double.RealDoubleElement'>,
>>  <type 'sage.structure.element.FieldElement'>,
>>  <type 'sage.structure.element.CommutativeRingElement'>,
>>  <type 'sage.structure.element.RingElement'>,
>>  <type 'sage.structure.element.ModuleElement'>,
>>  <type 'sage.structure.element.Element'>,
>>  <type 'sage.structure.sage_object.SageObject'>,
>>  <type 'object'>]
>>
>> sage: base_tree(type(mod(1, 10)))
>> [<type 'sage.rings.finite_rings.integer_mod.IntegerMod_int'>,
>>  <type 'sage.rings.finite_rings.integer_mod.IntegerMod_abstract'>,
>>  <type 'sage.rings.finite_rings.element_base.FiniteRingElement'>,
>>  <type 'sage.structure.element.CommutativeRingElement'>,
>>  <type 'sage.structure.element.RingElement'>,
>>  <type 'sage.structure.element.ModuleElement'>,
>>  <type 'sage.structure.element.Element'>,
>>  <type 'sage.structure.sage_object.SageObject'>,
>>  <type 'object'>]
>
> My original question was whether they have differently sized object structs
> or not. Those that don't would currently go into the same freelist.

They all add to the struct at the leaf subclass.
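
For reference, a quick way to check that from a Sage session is to compare
tp_basicsize along the chain (a sketch reusing the base_tree() helper quoted
above; __basicsize__ is the Python-level view of tp_basicsize):

sage: for t in base_tree(Integer):
....:     print t.__name__, t.__basicsize__
....:

Types that add C attributes report a larger __basicsize__ than their base;
only the ones with equal struct sizes would currently share a freelist.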

>>> Ok, so even for something as large as Sage, we'd apparently end up with
>>> just a couple of freelists for a given base type. That really makes it
>>> appear reasonable to make that number a compile time constant as well. I
>>> mean, even if you *really* oversize it, all you lose is the static memory
>>> for a couple of pointers. On a 64 bit system, if you use a freelist size of
>>> 8 objects and provision freelists for 8 differently sized subtypes, that's
>>> 8*8*8 bytes in total, or half a KB, statically allocated. Even a hundred
>>> times that size shouldn't hurt anyone. Unused subtype freelists really take
>>> almost no space and won't hurt performance either.
>>
>> Elements in Sage are typically larger than 8 bytes
>
> I wasn't adding up the size of the objects, only of the pointers in the
> freelists. If the objects end up in the freelist, they'll also be used on
> the next instantiation, so their size doesn't really matter.

It does if you use a lot of them; they then just sit around forever. But I
suppose if you've used that many once, you're willing to pay the price. It
also doesn't make sense for many element types that are rather expensive
anyway (e.g. every kind of matrix or polynomial specialization).

>> and our
>> experiments for Integer showed that the benefit (for this class)
>> extended well beyond 8 items. On the other hand lots of elements are
>> so expensive that they don't merit this at all.
>>
>> I think one thing to keep in mind is that Python's heap is essentially
>> a "freelist" of objects of every size up to 128(?) bytes, so what are
>> we trying to save by putting it at the base type and going up and down
>> the __cinit__/__dealloc__ chain?
>
> Allocation still seems to take its time.

Yes, it does.

>> I suppose we save the zero-ing out of
>> memory and a function call

Re: [Cython] freelist benchmarks

2013-02-27 Thread Stefan Behnel
Robert Bradshaw, 27.02.2013 09:54:
> On Tue, Feb 26, 2013 at 11:24 PM, Stefan Behnel wrote:
>> I imagine that the freelist could leave the initial vtable untouched in
>> some cases, but that would mean that we need a freelist per actual type,
>> instead of object struct size.
>>
>> Now, if we move the freelist handling into each subtype (as you and Mark
>> proposed already), we'd get some of this for free, because the objects that
>> get freed are already properly set up for the specific type, including
>> vtable etc. All that remains to be done is to zero out the (known) C typed
>> attributes, set the (known) object attributes to None, and call any
>> __cinit__() methods in the super types to do the rest for us. We might have
>> to do it in the right order, i.e. initialise some attributes, call the
>> corresponding __cinit__() method, initialise some more attributes, ...
>>
>> So, basically, we'd manually inline the bottom-up aggregation of all tp_new
>> functions into the current one, skipping those operations that we don't
>> consider necessary in the freelist case, such as the vtable setup.
>>
>> Now, the only remaining issue is how to get at the __cinit__() functions if
>> the base type isn't in the same module, but as Mark proposed, that could
>> still be done if we require it to be exported in a C-API (and assume that
>> it doesn't exist if not?). Would be better to know it at compile time,
>> though...
> 
> Yes, and that's still going to (potentially) be expensive. I'd rather
> have a way of controlling what, if anything, gets zero'd out/set to
> None, as most of that (in Sage's case at least) will still be valid
> for the newly-reused type or instantly over-written (though perhaps
> the default could be to call __dealloc__/__cinit__). With this we
> could skip going up and down the type hierarchy at all.

I don't think the zeroing is a problem. Just bursting out static data to
memory should be plenty fast these days and not incur any wait cycles or
pipeline stalls, as long as the compiler/processor can figure out that
there are no interdependencies between the assignments. The None
assignments may be a problem due to the INCREFs, but even in that case, the
C compiler and processor should be able to detect that they are all just
incrementing the same address in memory and may end up reducing a series of
updates into one. The only real problem is the calls to __cinit__(), which
run user code and can thus do anything. If they can't be inlined, the C
compiler has to give up a lot of its assumptions.
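
To make that concrete, here is a minimal hand-written sketch of the reuse path
at the Cython source level (the generated code would do this inside
tp_new/tp_dealloc; the type Point and the helpers _new_point()/_release_point()
are made up for illustration):

cdef class Point:
    cdef double x, y       # known C-typed attributes: zeroed on reuse
    cdef object label      # known object attribute: reset to None on reuse

_freelist = []             # per-type freelist, so the vtable is already correct

cdef Point _new_point():
    cdef Point p
    if _freelist:
        p = _freelist.pop()
        p.x = 0.0
        p.y = 0.0
        p.label = None
        # ...then run whatever the __cinit__() chain would have done
    else:
        p = Point.__new__(Point)
    return p

cdef void _release_point(Point p):
    if len(_freelist) < 8:  # bounded, mirroring the 8-objects-per-list idea
        _freelist.append(p)

The open question is exactly which of those reset steps can be skipped because
the fields are either still valid or immediately overwritten anyway.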

Would it make sense to require users to implement __cinit__() as an inline
method in a .pxd file if they want to use a freelist on a subtype? Or would
that be overly restrictive? It would prevent them from using module
globals, for example. That's quite a restriction normally, but I'm not sure
how much it hurts the "average" code in the specific case of __cinit__().

Stefan



[Cython] Non-deterministic behaviour?

2013-02-27 Thread Dave Hirschfeld
Using the following test code:

import numpy as np
from lapack import dgelsy
from numpy.linalg import lstsq

A = np.array([[ 0.12, -8.19,  7.69, -2.26, -4.71],
              [-6.91,  2.22, -5.12, -9.08,  9.96],
              [-3.33, -8.94, -6.72, -4.4 , -9.98],
              [ 3.97,  3.33, -2.74, -7.92, -3.2 ]])
#
b = np.array([[ 7.3 ,  0.47, -6.28],
              [ 1.33,  6.58, -3.42],
              [ 2.68, -1.71,  3.46],
              [-9.62, -0.79,  0.41]])
#
print '#'
print '# ACTUAL RESULT #'
print '#'
print lstsq(A, b)[0]
print
print '#'
print '# DGELSY RESULT #'
print '#'
print dgelsy(A, b)

I get:

#
# ACTUAL RESULT #
#
[[-0.6858 -0.2431  0.0642]
 [-0.795  -0.0836  0.2118]
 [ 0.3767  0.1208 -0.6541]
 [ 0.2885 -0.2415  0.4176]
 [ 0.2916  0.3525 -0.3015]]

#
# DGELSY RESULT #
#
[[-0.6858 -0.2431  0.0642]
 [-0.795  -0.0836  0.2118]
 [ 0.3767  0.1208 -0.6541]
 [ 0.2885 -0.2415  0.4176]
 [ 0.2916  0.3525 -0.3015]]


All well and good; however, if I type the `tmp` variable as a memview in the
following code

cdef double[:] res
cdef double[:,:] tmp
tmp = np.zeros([ldb, nrhs], order='F', dtype=np.float64)
tmp[:b.shape[0]] = b
res = np.ravel(tmp, order='F')

the result changes?!?

#
# DGELSY RESULT #
#
[[-0.7137 -0.2429  0.0876]
 [-0.2882 -0.0884 -0.2117]
 [-0.4282  0.1284  0.0185]
 [ 0.9564 -0.2478 -0.1404]
 [ 0.3625  0.3519 -0.3607]]


Remove the `cdef double[:,:] tmp` and I'm back to the correct result.
Does this make any sense?

To try and figure out what was going on I put in a couple of debugging print 
statements:

print 'res = ', repr(np.asarray(res))
print 'res.flags = {{{flags}}}'.format(flags=np.asarray(res).flags)


Adding only these print statements (with no other change) gave the same
incorrect result:

#
# DGELSY RESULT #
#
res =  array([ 7.3 ,  1.33,  2.68, -9.62,  0.,
   0.47,  6.58, -1.71, -0.79,  0.,
  -6.28, -3.42,  3.46,  0.41,  0.])

res.flags = {  C_CONTIGUOUS : True
  F_CONTIGUOUS : True
  OWNDATA : False
  WRITEABLE : True
  ALIGNED : True
  UPDATEIFCOPY : False}

[[-0.7137 -0.2429  0.0876]
 [-0.2882 -0.0884 -0.2117]
 [-0.4282  0.1284  0.0185]
 [ 0.9564 -0.2478 -0.1404]
 [ 0.3625  0.3519 -0.3607]]


Removing (only) the print statements again gave me the correct results.

So, it seems either typing the array as a memview or printing res
will screw up the calculation.

The cython code is given below. Any ideas if this is a cython bug or something 
I'm doing wrong?

Thanks,
Dave


cdef extern from "mkl_lapack.h" nogil:
    void DGELSY(const MKL_INT* m,
                const MKL_INT* n,
                const MKL_INT* nrhs,
                double* a,
                const MKL_INT* lda,
                double* b,
                const MKL_INT* ldb,
                MKL_INT* jpvt,
                const double* rcond,
                MKL_INT* rank,
                double* work,
                const MKL_INT* lwork,
                MKL_INT* info)


@cython.embedsignature(True)
@cython.boundscheck(False)
@cython.wraparound(False)
@cython.cdivision(True)
def dgelsy(double[:,:] A,
           double[:,:] b,
           double rcond=1e-15,
           overwrite_A=False,
           overwrite_b=False):
    cdef MKL_INT rank, info
    cdef MKL_INT m = A.shape[0]
    cdef MKL_INT n = A.shape[1]
    cdef MKL_INT nrhs = b.shape[1]
    cdef MKL_INT lda = m
    cdef MKL_INT ldb = max(m, n)
    cdef MKL_INT lwork = -1
    cdef double worksize = 0
    #cdef double[:,:] tmp
    cdef double[:] res, work
    cdef MKL_INT[:] jpvt

    if b.shape[0] != m:
        message = "b.shape[0] must be equal to A.shape[0].\n"
        message += "b.shape[0] = {b.shape[0]}\n"
        message += "A.shape[0] = {A.shape[0]}\n"
        raise MKL_LAPACK_ERROR(message.format(A=A, b=b))

    flags = np.asarray(A).flags
    if not flags['F_CONTIGUOUS'] or not overwrite_A:
        A = A.copy_fortran()

    flags = np.asarray(b).flags
    if not flags['F_CONTIGUOUS'] or not overwrite_b or b.shape[0] < n:
        tmp = np.zeros([ldb, nrhs], order='F', dtype=np.float64)
        tmp[:b.shape[0]] = b
        res = np.ravel(tmp, order='F')
    else:
        res = np.ravel(b, order='F')

    #print 'res = ', repr(np.asarray(res))
    #print 'res.flags = {{{flags}}}'.format(flags=np.asarray(res).flags)

    jpvt = np.empty(n, dtype=np.int32)
    # workspace query: lwork == -1 asks DGELSY for the optimal work size
    DGELSY(&m, &n, &nrhs, &A[0,0], &lda, &res[0], &ldb, &jpvt[0], &rcond,
           &rank, &worksize, &lwork, &info)
    if info != 0:
        message = "Parameter {i} had an illegal value when calling gelsy."
        raise MKL_LAPACK_ERROR(message.format(i=info))

    lwork = int(worksize)
    work = np.empty(lwork, dtype=np.float64)
    DGELSY(&m, &n, &nrhs, &A[0,0], &lda, &res[0], &ldb, &jpvt[0], &rcond,
           &rank, &worksize, &lwork, &info)
    if info != 0:
        message = "Parameter {i} had an illegal value when calling gelsy."
        raise MKL_LAPACK_ERROR(message.format(i=info))

Re: [Cython] freelist benchmarks

2013-02-27 Thread Robert Bradshaw
On Wed, Feb 27, 2013 at 5:09 AM, Stefan Behnel  wrote:
> Robert Bradshaw, 27.02.2013 09:54:
>> On Tue, Feb 26, 2013 at 11:24 PM, Stefan Behnel wrote:
>>> I imagine that the freelist could leave the initial vtable untouched in
>>> some cases, but that would mean that we need a freelist per actual type,
>>> instead of object struct size.
>>>
>>> Now, if we move the freelist handling into each subtype (as you and Mark
>>> proposed already), we'd get some of this for free, because the objects that
>>> get freed are already properly set up for the specific type, including
>>> vtable etc. All that remains to be done is to zero out the (known) C typed
>>> attributes, set the (known) object attributes to None, and call any
>>> __cinit__() methods in the super types to do the rest for us. We might have
>>> to do it in the right order, i.e. initialise some attributes, call the
>>> corresponding __cinit__() method, initialise some more attributes, ...
>>>
>>> So, basically, we'd manually inline the bottom-up aggregation of all tp_new
>>> functions into the current one, skipping those operations that we don't
>>> consider necessary in the freelist case, such as the vtable setup.
>>>
>>> Now, the only remaining issue is how to get at the __cinit__() functions if
>>> the base type isn't in the same module, but as Mark proposed, that could
>>> still be done if we require it to be exported in a C-API (and assume that
>>> it doesn't exist if not?). Would be better to know it at compile time,
>>> though...
>>
>> Yes, and that's still going to (potentially) be expensive. I'd rather
>> have a way of controlling what, if anything, gets zero'd out/set to
>> None, as most of that (in Sage's case at least) will still be valid
>> for the newly-reused type or instantly over-written (though perhaps
>> the default could be to call __dealloc__/__cinit__). With this we
>> could skip going up and down the type hierarchy at all.
>
> I don't think the zeroing is a problem. Just bursting out static data to
> memory should be plenty fast these days and not incur any wait cycles or
> pipeline stalls, as long as the compiler/processor can figure out that
> there are no interdependencies between the assignments. The None
> assignments may be a problem due to the INCREFs, but even in that case, the
> C compiler and processor should be able to detect that they are all just
> incrementing the same address in memory and may end up reducing a series of
> updates into one. The only real problem is the calls to __cinit__(), which
> run user code and can thus do anything. If they can't be inlined, the C
> compiler has to give up a lot of its assumptions.
>
> Would it make sense to require users to implement __cinit__() as an inline
> method in a .pxd file if they want to use a freelist on a subtype? Or would
> that be overly restrictive? It would prevent them from using module
> globals, for example. That's quite a restriction normally, but I'm not sure
> how much it hurts the "average" code in the specific case of __cinit__().

It would hurt in the couple of examples I've thought about (e.g. fast
Sage elements, where one wants to set the Parent field correctly).
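
For example (a made-up sketch, not actual Sage code), such a __cinit__()
typically reaches for a module-level parent object, which is exactly the kind
of module global that an inline-in-.pxd __cinit__() could not use:

cdef class Element:
    cdef object _parent

_the_integer_ring = object()   # stands in for a module-level parent object

cdef class FastInteger(Element):
    def __cinit__(self):
        # relies on the module global above, so it could not be written as an
        # inline method in a .pxd file
        self._parent = _the_integer_ring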

- Robert


[Cython] MemoryViews require writeable arrays?

2013-02-27 Thread Dave Hirschfeld
%%cython
cimport cython

import numpy as np
cimport numpy as np

@cython.boundscheck(False)
@cython.wraparound(False)
@cython.cdivision(True)
cpdef double[:] return_one(double[:] x):
    return np.array([1.0])


In [43]: x = randn(3)
...: return_one(x)
Out[43]: 

In [44]: x.flags['WRITEABLE'] = False
...: return_one(x)
Traceback (most recent call last):

  File "", line 2, in 
return_one(x)

  File "_cython_magic_7761e77f78c4e321261152684b47c674.pyx", line 11, in 
_cython_magic_7761e77f78c4e321261152684b47c674.return_one 
(C:\Users\dhirschfeld\.ipython\cython\_cython_magic_7761e77f78c4e321261152684b47
c674.c:1727)

  File "stringsource", line 619, in View.MemoryView.memoryview_cwrapper 
(C:\Users\dhirschfeld\.ipython\cython\_cython_magic_7761e77f78c4e321261152684b47
c674.c:8819)

  File "stringsource", line 327, in View.MemoryView.memoryview.__cinit__ 
(C:\Users\dhirschfeld\.ipython\cython\_cython_magic_7761e77f78c4e321261152684b47
c674.c:5594)

ValueError: buffer source array is read-only



Is this a required restriction? Is there any workaround?
The context is calling cython routines using IPython.parallel.
IIUC any input arrays sent over zmq are necessarily read-only.

As can be seen from the example, even if we don't modify (or use)
the input array at all we still get the error.

Any help, esp. in regards to a workaround would be greatly appreciated!

Thanks,
Dave





Re: [Cython] MemoryViews require writeable arrays?

2013-02-27 Thread Dave Hirschfeld
Dave Hirschfeld  writes:

> 
> cpdef double[:] return_one(double[:] x):
>     return np.array([1.0])
> 
> In [43]: x = randn(3)
> ...: return_one(x)
> Out[43]: 
> 
> In [44]: x.flags['WRITEABLE'] = False
> ...: return_one(x)
> ValueError: buffer source array is read-only
> 
> 
> Any help, esp. in regards to a workaround would be greatly appreciated!
> 
> Thanks,
> Dave
> 
> 


It seems using the numpy buffer interface works but I guess it would still
be good if this worked for memviews too:


%%cython
cimport cython

import numpy as np
cimport numpy as np

ctypedef np.float64_t float64_t

@cython.boundscheck(False)
@cython.wraparound(False)
@cython.cdivision(True)
cpdef double[:] return_one_np(np.ndarray[float64_t, ndim=1] x):
    return np.array([1.0])


In [203]: return_one_np(x)
Out[203]: 
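
For completeness, another workaround sketch (assuming an extra copy is
acceptable) is to copy any read-only input before calling the memoryview
version, since np.array() returns a writable copy:

return_one(np.array(x))   # np.array() makes a writable copy, so double[:] accepts it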


Cheers,
Dave
