Re: [Numpy-discussion] sparse array data

Wolfgang Kerzendorf Sat, 05 May 2012 20:24:28 -0700

Hey guys,

Thanks for all the suggestions. They seem already relatively involved, but I'll 
have a look at the table implementation. That seems to be the easiest of them 
all.


Cheers
    Wolfgang 

On 2012-05-03, at 9:39 AM, Charles R Harris wrote:

> 
> 
> On Thu, May 3, 2012 at 3:41 AM, Nathaniel Smith <[email protected]> wrote:
> On Thu, May 3, 2012 at 4:44 AM, Charles R Harris
> <[email protected]> wrote:
> >
> >
> > On Wed, May 2, 2012 at 3:20 PM, Nathaniel Smith <[email protected]> wrote:
> >> This coordinate format is also what's used by the MATLAB Tensor
> >> Toolbox. They have a paper justifying this choice and describing some
> >> tricks for how to work with them:
> >>  http://epubs.siam.org/sisc/resource/1/sjoce3/v30/i1/p205_s1
> >> (Spoiler: you use a lot of sort operations. Conveniently, timsort
> >> appears to be perfectly adapted for their algorithmic requirements.)
> >>
> >
> > Timsort probably isn't the best choice here, it is optimized for python
> > lists of python objects where there is at least one level of indirection and
> > compares are expensive, even more expensive for compound objects. If the
> > coordinates are stored in numpy structured arrays an ordinary sort is likely
> > to be faster. Lexsort might even be faster as it could access aligned
> > integer data and not have to move lists of indexes around.
> 
> To be clear, I don't mean Timsort-the-implementation, I mean
> Timsort-the-algorithm (which is now also the default sorting algorithm
> in Java). That said, it may well be optimized for expensive compares
> and something like a radix sort would be even better.
> 
> 
> Java uses Timsort for sorting object arrays (pointers) and dual pivot 
> quicksort for sorting arrays of native types, ints and such. Timsort is very 
> good for almost sorted arrays and the mixed algorithms are becoming popular, 
> i.e., introsort and recent updates to the dual pivot sort that also look for 
> runs. One of the reasons compares can be expensive for arrays of pointers to 
> objects is that the objects can be located all over memory, which blows the 
> cache.
> 
> There are also a few mods to the numpy quicksort that might speed things up a 
> bit more for common cases where there are a lot of repeated elements.
> 
> In these sparse tensor algorithms, we often need to sort by one
> coordinate axis, and then sort by another (a "corner turn"). The
> reason Timsort seems appealing is that (1) it goes faster than O(n log
> n) when there is existing structure in the data being sorted, (2)
> because it's a stable sort, sorting on one axis then sorting on
> another will leave lots of that structure behind to be exploited
> later. So one can expect it to hit its happy case relatively often.
> 
> 
> Yes, that's why we have mergesort. An optimistic version making some use of 
> runs might make it faster. We do have object arrays and no type specialized 
> sort for them, so bringing Timsort in for those could be useful.
> 
> Chuck  
> 
> 
> _______________________________________________
> NumPy-Discussion mailing list
> [email protected]
> http://mail.scipy.org/mailman/listinfo/numpy-discussion

_______________________________________________
NumPy-Discussion mailing list
[email protected]
http://mail.scipy.org/mailman/listinfo/numpy-discussion

Re: [Numpy-discussion] sparse array data

Reply via email to