On Thu, May 3, 2012 at 9:57 AM, Robert Kern <[email protected]> wrote:
> On Thu, May 3, 2012 at 2:50 PM, Robert Elsner <[email protected]> wrote:
> >
> > On 03.05.2012 15:45, Robert Kern wrote:
> >> On Thu, May 3, 2012 at 2:24 PM, Robert Elsner <[email protected]> wrote:
> >>> Hello Everybody,
> >>>
> >>> is there any news on the status of np.bincount with respect to "big"
> >>> numbers? It seems I have just been bitten by #225. Is there an efficient
> >>> way around? I found the np.histogram function painfully slow.
> >>>
> >>> Below a simple script, that demonstrates bincount failing with a memory
> >>> error on big numbers
> >>>
> >>> import numpy as np
> >>>
> >>> x = np.array((30e9,)).astype(int)
> >>> np.bincount(x)
> >>>
> >>>
> >>> Any good idea how to work around it. My arrays contain somewhat 50M
> >>> entries in the range from 0 to 30e9. And I would like to have them
> >>> bincounted...
> >>
> >> You need a sparse data structure, then. Are you sure you even have
> >> duplicates?
> >>
> >> Anyways, I won't work out all of the details, but let me sketch
> >> something that might get you your answers. First, sort your array.
> >> Then use np.not_equal(x[:-1], x[1:]) as a mask on np.arange(1,len(x))
> >> to find the indices where each sorted value changes over to the next.
> >> The np.diff() of that should give you the size of each. Use np.unique
> >> to get the sorted unique values to match up with those sizes.
> >>
> >> Fixing all of the off-by-one errors and dealing with the boundary
> >> conditions correctly is left as an exercise for the reader.
> >>
> >
> > ?? I suspect that this mail was meant to end up in the thread about
> > sparse array data?
>
> No, I am responding to you.

Hi Robert (Elsner),

Just to expand a bit on Robert Kern's explanation: Your problem is only partly related to Ticket #225 <http://projects.scipy.org/numpy/ticket/225>.
Even if that is fixed, you won't be able to call `bincount` with an array containing `30e9` unless you implement something using sparse arrays, because `bincount` wants to return an array that's `30e9 + 1` in length, which isn't going to happen.

-Tony
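
For reference, the sort-based counting that Robert Kern outlines in the quoted text might look roughly like the sketch below. It is only an illustration under a few assumptions: the input is a 1-D array of 64-bit integers, and the function name `sparse_bincount` is made up here and is not part of NumPy. Rather than a dense array of length `max + 1`, it returns the distinct values together with how often each occurs, so memory scales with the number of entries instead of the largest value.

```python
import numpy as np

def sparse_bincount(x):
    """Return (values, counts) for the distinct entries of a 1-D integer array.

    Hypothetical helper, not a NumPy function.
    """
    x = np.sort(np.asarray(x).ravel())
    if x.size == 0:
        return x, np.array([], dtype=np.intp)
    # Indices where the sorted value changes over to the next one.
    change = np.flatnonzero(x[1:] != x[:-1]) + 1
    # Add the two boundaries so every run has a start and an end.
    edges = np.concatenate(([0], change, [x.size]))
    counts = np.diff(edges)      # length of each run of equal values
    values = x[edges[:-1]]       # first element of each run
    return values, counts

# Small stand-in for the 50M-entry array described in the thread.
x = np.array([30_000_000_000, 5, 5, 0, 30_000_000_000, 5], dtype=np.int64)
values, counts = sparse_bincount(x)
print(values, counts)
```

NumPy 1.9 and later should give the same pair directly via `np.unique(x, return_counts=True)`, which is probably the simpler choice where it is available.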
