Re: [Cython] [cython-users] freelist benchmarks
On Tue, Feb 26, 2013 at 11:24 PM, Stefan Behnel wrote:
> Robert Bradshaw, 26.02.2013 21:16:
>> On Mon, Feb 25, 2013 at 1:17 AM, Stefan Behnel wrote:
>>> David Roe, 25.02.2013 00:00:
>>>>> I changed the current type pointer check to look at tp_basicsize
>>>>> instead. That made it work for almost all classes in lxml's own
>>>>> Element hierarchy, with only a couple of exceptions in lxml.objectify
>>>>> that have one additional object field. So, just extending the freelist
>>>>> support to use two different lists for different struct sizes instead
>>>>> of just one would make it work for all of lxml already. Taking a look
>>>>> at Sage to see how the situation appears over there would be
>>>>> interesting, I guess.
>>>>
>>>> I found some chains of length 5. This could be shortened to 4 by
>>>> putting the freelist at the level of Element (which is where you most
>>>> care about speed of object creation).
>>>
>>> It's substantially easier to keep it in the top-level base class, though.
>>> Otherwise, we'd need a new protocol between inheriting types as I
>>> previously described. That adds a *lot* of complexity.
>>>
>>>> SageObject -> Element (_parent attribute and cdef methods) -> Vector
>>>> (_degree) -> FreeModuleElement (_is_mutable) ->
>>>> FreeModuleElement_generic_dense (_entries)
>>>>
>>>> SageObject -> Element (_parent attribute and cdef methods) ->
>>>> sage.structure.element.Matrix (_nrows) -> sage.matrix.matrix.Matrix
>>>> (_base_ring) -> Matrix_integer_dense (_entries)
>>
>> I don't know that (expensive) matrices are the best example, and often
>> the chains are larger for elements one really cares about.
>>
>> sage: def base_tree(x): return [] if x is None else [x] + base_tree(x.__base__)
>> ...:
>>
>> sage: base_tree(Integer)
>> [<type 'sage.rings.integer.Integer'>,
>>  <type 'sage.structure.element.EuclideanDomainElement'>,
>>  <type 'sage.structure.element.PrincipalIdealDomainElement'>,
>>  <type 'sage.structure.element.DedekindDomainElement'>,
>>  <type 'sage.structure.element.IntegralDomainElement'>,
>>  <type 'sage.structure.element.CommutativeRingElement'>,
>>  <type 'sage.structure.element.RingElement'>,
>>  <type 'sage.structure.element.ModuleElement'>,
>>  <type 'sage.structure.element.Element'>,
>>  <type 'sage.structure.sage_object.SageObject'>,
>>  <type 'object'>]
>>
>> sage: base_tree(RealDoubleElement)
>> [<type 'sage.rings.real_double.RealDoubleElement'>,
>>  <type 'sage.structure.element.FieldElement'>,
>>  <type 'sage.structure.element.CommutativeRingElement'>,
>>  <type 'sage.structure.element.RingElement'>,
>>  <type 'sage.structure.element.ModuleElement'>,
>>  <type 'sage.structure.element.Element'>,
>>  <type 'sage.structure.sage_object.SageObject'>,
>>  <type 'object'>]
>>
>> sage: base_tree(type(mod(1, 10)))
>> [<type 'sage.rings.finite_rings.integer_mod.IntegerMod_int'>,
>>  <type 'sage.rings.finite_rings.integer_mod.IntegerMod_abstract'>,
>>  <type 'sage.rings.finite_rings.element_base.FiniteRingElement'>,
>>  <type 'sage.structure.element.CommutativeRingElement'>,
>>  <type 'sage.structure.element.RingElement'>,
>>  <type 'sage.structure.element.ModuleElement'>,
>>  <type 'sage.structure.element.Element'>,
>>  <type 'sage.structure.sage_object.SageObject'>,
>>  <type 'object'>]
>
> My original question was if they have differently sized object structs or
> not. Those that don't would currently go into the same freelist.

They all add to the struct at the leaf subclass.

>>> Ok, so even for something as large as Sage, we'd apparently end up with
>>> just a couple of freelists for a given base type. That really makes it
>>> appear reasonable to make that number a compile time constant as well.
>>> I mean, even if you *really* oversize it, all you lose is the static
>>> memory for a couple of pointers. On a 64 bit system, if you use a
>>> freelist size of 8 objects and provision freelists for 8 differently
>>> sized subtypes, that's 8*8*8 bytes in total, or half a KB, statically
>>> allocated. Even a hundred times that size shouldn't hurt anyone. Unused
>>> subtype freelists really take almost no space and won't hurt performance
>>> either.
>>
>> Elements in Sage are typically larger than 8 bytes
>
> I wasn't adding up the size of the objects, only of the pointers in the
> freelists. If the objects end up in the freelist, they'll also be used on
> the next instantiation, so their size doesn't really matter.

It does if you use a lot of them, then they just sit around forever, but I
suppose if you use them once you're willing to pay the price. It also
doesn't make sense for a lot of them that are rather expensive anyways
(e.g. every kind of matrix or polynomial specialization).

>> and our experiments for Integer showed that the benefit (for this class)
>> extended well beyond 8 items. On the other hand lots of elements are so
>> expensive that they don't merit this at all.
>>
>> I think one thing to keep in mind is that Python's heap is essentially
>> a "freelist" of objects of every size up to 128(?) bytes, so what are
>> we trying to save by putting it at the base type and going up and down
>> the __cinit__/__dealloc__ chain?
>
> Allocation still seems to take its time.

Yes, it does.

>> I suppose we save the zero-ing out of memory and a function call
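For context, the mechanism being benchmarked here surfaces in released
Cython as the @cython.freelist(N) class decorator; a minimal usage sketch
(the class and field names are invented for illustration, not taken from
the thread):

    # freelist_demo.pyx -- minimal sketch of the freelist directive
    cimport cython

    @cython.freelist(8)          # keep up to 8 dead instances for reuse
    cdef class Vec2:
        cdef double x, y         # C fields live in the object struct

        def __cinit__(self, double x=0.0, double y=0.0):
            # __cinit__ still runs on every instantiation, whether the
            # object comes from the freelist or from a fresh allocation
            self.x = x
            self.y = y

Creating and dropping Vec2 instances in a tight loop then cycles through
the same few object structs instead of paying for a full allocation each
time.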
Re: [Cython] freelist benchmarks
Robert Bradshaw, 27.02.2013 09:54:
> On Tue, Feb 26, 2013 at 11:24 PM, Stefan Behnel wrote:
>> I imagine that the freelist could leave the initial vtable untouched in
>> some cases, but that would mean that we need a freelist per actual type,
>> instead of object struct size.
>>
>> Now, if we move the freelist handling into each subtype (as you and Mark
>> proposed already), we'd get some of this for free, because the objects
>> that get freed are already properly set up for the specific type,
>> including vtable etc. All that remains to be done is to zero out the
>> (known) C typed attributes, set the (known) object attributes to None,
>> and call any __cinit__() methods in the super types to do the rest for
>> us. We might have to do it in the right order, i.e. initialise some
>> attributes, call the corresponding __cinit__() method, initialise some
>> more attributes, ...
>>
>> So, basically, we'd manually inline the bottom-up aggregation of all
>> tp_new functions into the current one, skipping those operations that we
>> don't consider necessary in the freelist case, such as the vtable setup.
>>
>> Now, the only remaining issue is how to get at the __cinit__() functions
>> if the base type isn't in the same module, but as Mark proposed, that
>> could still be done if we require it to be exported in a C-API (and
>> assume that it doesn't exist if not?). Would be better to know it at
>> compile time, though...
>
> Yes, and that's still going to (potentially) be expensive. I'd rather
> have a way of controlling what, if anything, gets zero'd out/set to
> None, as most of that (in Sage's case at least) will still be valid
> for the newly-reused type or instantly over-written (though perhaps
> the default could be to call __dealloc__/__cinit__). With this we
> could skip going up and down the type hierarchy at all.

I don't think the zeroing is a problem. Just bursting out static data to
memory should be plenty fast these days and not incur any wait cycles or
pipeline stalls, as long as the compiler/processor can figure out that
there are no interdependencies between the assignments. The None
assignments may be a problem due to the INCREFs, but even in that case,
the C compiler and processor should be able to detect that they are all
just incrementing the same address in memory and may end up reducing a
series of updates into one. The only real problem is the calls to
__cinit__(), which run user code and can thus do anything. If they can't
be inlined, the C compiler needs to lessen a lot of its assumptions.

Would it make sense to require users to implement __cinit__() as an inline
method in a .pxd file if they want to use a freelist on a subtype? Or would
that be overly restrictive? It would prevent them from using module
globals, for example. That's quite a restriction normally, but I'm not sure
how much it hurts the "average" code in the specific case of __cinit__().

Stefan
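To make the tp_new bookkeeping concrete, here is a small invented hierarchy
(illustrative names only, not from Sage or lxml) showing the work that a
freelist hit would still have to redo: zeroed C fields, object fields reset
to None, and the chained __cinit__() calls, base class first:

    # On a normal instantiation of Derived, the generated tp_new
    # effectively does: allocate struct -> zero C fields -> set object
    # fields to None -> Base.__cinit__ -> Derived.__cinit__.
    # A freelist hit could skip the allocation and vtable setup, but the
    # remaining steps are what the discussion above is about.
    cdef class Base:
        cdef object _parent      # object attribute, reset to None on reuse
        cdef int _flags          # C attribute, zeroed on reuse

        def __cinit__(self):
            self._flags = 1      # user code: cannot be skipped safely

    cdef class Derived(Base):
        cdef double _value       # extends the struct (larger tp_basicsize)

        def __cinit__(self):
            self._value = 0.0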
[Cython] Non-deterministic behaviour?
Using the following test code:

import numpy as np
from lapack import dgelsy
from numpy.linalg import lstsq

A = np.array([[ 0.12, -8.19,  7.69, -2.26, -4.71],
              [-6.91,  2.22, -5.12, -9.08,  9.96],
              [-3.33, -8.94, -6.72, -4.4 , -9.98],
              [ 3.97,  3.33, -2.74, -7.92, -3.2 ]])
#
b = np.array([[ 7.3 ,  0.47, -6.28],
              [ 1.33,  6.58, -3.42],
              [ 2.68, -1.71,  3.46],
              [-9.62, -0.79,  0.41]])
#
print '#'
print '# ACTUAL RESULT #'
print '#'
print lstsq(A, b)[0]
print
print '#'
print '# DGELSY RESULT #'
print '#'
print dgelsy(A, b)

I get:

#
# ACTUAL RESULT #
#
[[-0.6858 -0.2431  0.0642]
 [-0.795  -0.0836  0.2118]
 [ 0.3767  0.1208 -0.6541]
 [ 0.2885 -0.2415  0.4176]
 [ 0.2916  0.3525 -0.3015]]

#
# DGELSY RESULT #
#
[[-0.6858 -0.2431  0.0642]
 [-0.795  -0.0836  0.2118]
 [ 0.3767  0.1208 -0.6541]
 [ 0.2885 -0.2415  0.4176]
 [ 0.2916  0.3525 -0.3015]]

All well and good, however if I type the `tmp` variable as a memview in the
following code

    cdef double[:] res
    cdef double[:,:] tmp
    tmp = np.zeros([ldb, nrhs], order='F', dtype=np.float64)
    tmp[:b.shape[0]] = b
    res = np.ravel(tmp, order='F')

the result changes?!?

#
# DGELSY RESULT #
#
[[-0.7137 -0.2429  0.0876]
 [-0.2882 -0.0884 -0.2117]
 [-0.4282  0.1284  0.0185]
 [ 0.9564 -0.2478 -0.1404]
 [ 0.3625  0.3519 -0.3607]]

Remove the `cdef double[:,:] tmp` and I'm back to the correct result. Does
this make any sense?

To try and figure out what was going on I put in a couple of debugging
print statements:

    print 'res = ', repr(np.asarray(res))
    print 'res.flags = {{{flags}}}'.format(flags=np.asarray(res).flags)

Only changing these lines resulted in the same incorrect result:

#
# DGELSY RESULT #
#
res =  array([ 7.3 ,  1.33,  2.68, -9.62,  0.  ,  0.47,  6.58, -1.71,
       -0.79,  0.  , -6.28, -3.42,  3.46,  0.41,  0.  ])
res.flags = {
  C_CONTIGUOUS : True
  F_CONTIGUOUS : True
  OWNDATA : False
  WRITEABLE : True
  ALIGNED : True
  UPDATEIFCOPY : False}
[[-0.7137 -0.2429  0.0876]
 [-0.2882 -0.0884 -0.2117]
 [-0.4282  0.1284  0.0185]
 [ 0.9564 -0.2478 -0.1404]
 [ 0.3625  0.3519 -0.3607]]

Removing (only) the print statements again gave me the correct results. So,
it seems either typing the array as a memview or printing res will screw up
the calculation.

The cython code is given below. Any ideas if this is a cython bug or
something I'm doing wrong?

Thanks,
Dave

cdef extern from "mkl_lapack.h" nogil:
    void DGELSY(const MKL_INT* m, const MKL_INT* n, const MKL_INT* nrhs,
                double* a, const MKL_INT* lda, double* b, const MKL_INT* ldb,
                MKL_INT* jpvt, const double* rcond, MKL_INT* rank,
                double* work, const MKL_INT* lwork, MKL_INT* info)

@cython.embedsignature(True)
@cython.boundscheck(False)
@cython.wraparound(False)
@cython.cdivision(True)
def dgelsy(double[:,:] A, double[:,:] b, double rcond=1e-15,
           overwrite_A=False, overwrite_b=False):
    cdef MKL_INT rank, info
    cdef MKL_INT m = A.shape[0]
    cdef MKL_INT n = A.shape[1]
    cdef MKL_INT nrhs = b.shape[1]
    cdef MKL_INT lda = m
    cdef MKL_INT ldb = max(m, n)
    cdef MKL_INT lwork = -1
    cdef double worksize = 0
    #cdef double[:,:] tmp
    cdef double[:] res, work
    cdef MKL_INT[:] jpvt

    if b.shape[0] != m:
        message = "b.shape[0] must be equal to A.shape[0].\n"
        message += "b.shape[0] = {b.shape[0]}\n"
        message += "A.shape[0] = {A.shape[0]}\n"
        raise MKL_LAPACK_ERROR(message.format(A=A, b=b))

    flags = np.asarray(A).flags
    if not flags['F_CONTIGUOUS'] or not overwrite_A:
        A = A.copy_fortran()

    flags = np.asarray(b).flags
    if not flags['F_CONTIGUOUS'] or not overwrite_b or b.shape[0] < n:
        tmp = np.zeros([ldb, nrhs], order='F', dtype=np.float64)
        tmp[:b.shape[0]] = b
        res = np.ravel(tmp, order='F')
    else:
        res = np.ravel(b, order='F')

    #print 'res = ', repr(np.asarray(res))
    #print 'res.flags = {{{flags}}}'.format(flags=np.asarray(res).flags)

    jpvt = np.empty(n, dtype=np.int32)

    DGELSY(&m, &n, &nrhs, &A[0,0], &lda, &res[0], &ldb, &jpvt[0], &rcond,
           &rank, &worksize, &lwork, &info)
    if info != 0:
        message = "Parameter {i} had an illegal value when calling gelsy."
        raise MKL_LAPACK_ERROR(message.format(i=info))

    lwork = int(worksize)
    work = np.empty(lwork, dtype=np.float64)

    DGELSY(&m, &n, &nrhs, &A[0,0], &lda, &res[0], &ldb, &jpvt[0], &rcond,
           &rank, &worksize, &lwork, &info)
    if info != 0:
        message = "Parameter {i} had an illegal value when calling gelsy."
        raise MKL_LAPACK_ERROR(message.format(i=info))
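One thing worth checking when behaviour changes with a typed memoryview is
whether np.ravel() handed back a view of `tmp` or a silent copy, since the
flags dump above shows OWNDATA: False. A small, hypothetical debugging
helper (not part of the original code, names invented):

    import numpy as np

    def check_ravel(tmp):
        """Report whether np.ravel() returned a view of tmp or a copy."""
        arr = np.asarray(tmp)                   # also works for memoryviews
        res = np.ravel(arr, order='F')
        shares = np.may_share_memory(res, arr)  # view -> True, copy -> False
        return res, shares

Whether res aliases tmp's buffer determines whether in-place writes to one
are visible through the other, which is useful to know before handing the
pointer to LAPACK.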
Re: [Cython] freelist benchmarks
On Wed, Feb 27, 2013 at 5:09 AM, Stefan Behnel wrote:
> Robert Bradshaw, 27.02.2013 09:54:
>> On Tue, Feb 26, 2013 at 11:24 PM, Stefan Behnel wrote:
>>> I imagine that the freelist could leave the initial vtable untouched in
>>> some cases, but that would mean that we need a freelist per actual type,
>>> instead of object struct size.
>>>
>>> Now, if we move the freelist handling into each subtype (as you and Mark
>>> proposed already), we'd get some of this for free, because the objects
>>> that get freed are already properly set up for the specific type,
>>> including vtable etc. All that remains to be done is to zero out the
>>> (known) C typed attributes, set the (known) object attributes to None,
>>> and call any __cinit__() methods in the super types to do the rest for
>>> us. We might have to do it in the right order, i.e. initialise some
>>> attributes, call the corresponding __cinit__() method, initialise some
>>> more attributes, ...
>>>
>>> So, basically, we'd manually inline the bottom-up aggregation of all
>>> tp_new functions into the current one, skipping those operations that we
>>> don't consider necessary in the freelist case, such as the vtable setup.
>>>
>>> Now, the only remaining issue is how to get at the __cinit__() functions
>>> if the base type isn't in the same module, but as Mark proposed, that
>>> could still be done if we require it to be exported in a C-API (and
>>> assume that it doesn't exist if not?). Would be better to know it at
>>> compile time, though...
>>
>> Yes, and that's still going to (potentially) be expensive. I'd rather
>> have a way of controlling what, if anything, gets zero'd out/set to
>> None, as most of that (in Sage's case at least) will still be valid
>> for the newly-reused type or instantly over-written (though perhaps
>> the default could be to call __dealloc__/__cinit__). With this we
>> could skip going up and down the type hierarchy at all.
>
> I don't think the zeroing is a problem. Just bursting out static data to
> memory should be plenty fast these days and not incur any wait cycles or
> pipeline stalls, as long as the compiler/processor can figure out that
> there are no interdependencies between the assignments. The None
> assignments may be a problem due to the INCREFs, but even in that case,
> the C compiler and processor should be able to detect that they are all
> just incrementing the same address in memory and may end up reducing a
> series of updates into one. The only real problem is the calls to
> __cinit__(), which run user code and can thus do anything. If they can't
> be inlined, the C compiler needs to lessen a lot of its assumptions.
>
> Would it make sense to require users to implement __cinit__() as an
> inline method in a .pxd file if they want to use a freelist on a subtype?
> Or would that be overly restrictive? It would prevent them from using
> module globals, for example. That's quite a restriction normally, but I'm
> not sure how much it hurts the "average" code in the specific case of
> __cinit__().

It would hurt in the couple of examples I've thought about (e.g. fast Sage
elements, where one wants to set the Parent field correctly).

- Robert
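To illustrate the objection, a deliberately invented sketch (not actual
Sage code) of an element type whose __cinit__ needs a module-level global;
under the proposed restriction, an inline __cinit__ living in a .pxd file
could not reference that global:

    cdef class Element:
        cdef object _parent

    cdef object _default_parent = object()   # stand-in for a module global

    cdef class FastInteger(Element):
        def __cinit__(self):
            # needs the module global, so it could not be written as an
            # inline method in a .pxd file
            self._parent = _default_parent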
[Cython] MemoryViews require writeable arrays?
%%cython
cimport cython
import numpy as np
cimport numpy as np

@cython.boundscheck(False)
@cython.wraparound(False)
@cython.cdivision(True)
cpdef double[:] return_one(double[:] x):
    return np.array([1.0])

In [43]: x = randn(3)
    ...: return_one(x)
Out[43]:

In [44]: x.flags['WRITEABLE'] = False
    ...: return_one(x)
Traceback (most recent call last):
  File "<ipython-input-...>", line 2, in <module>
    return_one(x)
  File "_cython_magic_7761e77f78c4e321261152684b47c674.pyx", line 11, in
_cython_magic_7761e77f78c4e321261152684b47c674.return_one
(C:\Users\dhirschfeld\.ipython\cython\_cython_magic_7761e77f78c4e321261152684b47c674.c:1727)
  File "stringsource", line 619, in View.MemoryView.memoryview_cwrapper
(C:\Users\dhirschfeld\.ipython\cython\_cython_magic_7761e77f78c4e321261152684b47c674.c:8819)
  File "stringsource", line 327, in View.MemoryView.memoryview.__cinit__
(C:\Users\dhirschfeld\.ipython\cython\_cython_magic_7761e77f78c4e321261152684b47c674.c:5594)
ValueError: buffer source array is read-only

Is this a required restriction? Is there any workaround?

The context is calling cython routines using IPython.parallel. IIUC any
input arrays sent over zmq are necessarily read-only. As can be seen with
the example, even if we don't modify (or use) the input array at all we
still get the error.

Any help, esp. in regards to a workaround would be greatly appreciated!

Thanks,
Dave
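One version-independent workaround is to hand the function a writable copy
at the Python boundary; a minimal sketch (the helper name is invented for
illustration):

    import numpy as np

    def call_with_writable(func, x):
        # memoryview acquisition only fails on read-only buffers, so make
        # a writable copy first; costs one allocation plus a memcpy
        if isinstance(x, np.ndarray) and not x.flags.writeable:
            x = x.copy()
        return func(x)

Calling return_one(x.copy()) directly does the same thing; the price is the
copy, which may matter for large arrays shipped around with
IPython.parallel.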
Re: [Cython] MemoryViews require writeable arrays?
Dave Hirschfeld writes:
>
> cpdef double[:] return_one(double[:] x):
>     return np.array([1.0])
>
> In [43]: x = randn(3)
>     ...: return_one(x)
> Out[43]:
>
> In [44]: x.flags['WRITEABLE'] = False
>     ...: return_one(x)
> ValueError: buffer source array is read-only
>
> Any help, esp. in regards to a workaround would be greatly appreciated!
>
> Thanks,
> Dave

It seems using the numpy buffer interface works, but I guess it would still
be good if this worked for memviews too:

%%cython
cimport cython
import numpy as np
cimport numpy as np

ctypedef np.float64_t float64_t

@cython.boundscheck(False)
@cython.wraparound(False)
@cython.cdivision(True)
cpdef double[:] return_one_np(np.ndarray[float64_t, ndim=1] x):
    return np.array([1.0])

In [203]: return_one_np(x)
Out[203]:

Cheers,
Dave
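A pointer for later readers: newer Cython releases (0.28 and later, if
memory serves) accept read-only buffers when the memoryview's element type
is declared const, so the memoryview signature can be kept. A minimal
sketch, assuming such a Cython version:

    import numpy as np

    cpdef double[:] return_one_const(const double[:] x):
        # a read-only ndarray (WRITEABLE=False) is accepted here because
        # the argument memoryview is declared const
        return np.array([1.0])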