On Thu, May 3, 2012 at 12:51 PM, Tony Yu <[email protected]> wrote:
>
> On Thu, May 3, 2012 at 9:57 AM, Robert Kern <[email protected]> wrote:
>>
>> On Thu, May 3, 2012 at 2:50 PM, Robert Elsner <[email protected]> wrote:
>> >
>> > On 03.05.2012 15:45, Robert Kern wrote:
>> >> On Thu, May 3, 2012 at 2:24 PM, Robert Elsner <[email protected]> wrote:
>> >>> Hello Everybody,
>> >>>
>> >>> Is there any news on the status of np.bincount with respect to "big"
>> >>> numbers? It seems I have just been bitten by #225. Is there an
>> >>> efficient way around it? I found the np.histogram function painfully
>> >>> slow.
>> >>>
>> >>> Below is a simple script that demonstrates bincount failing with a
>> >>> memory error on big numbers:
>> >>>
>> >>> import numpy as np
>> >>>
>> >>> x = np.array((30e9,)).astype(int)
>> >>> np.bincount(x)
>> >>>
>> >>> Any good idea how to work around it? My arrays contain some 50M
>> >>> entries in the range from 0 to 30e9, and I would like to have them
>> >>> bincounted...
>> >>
>> >> You need a sparse data structure, then. Are you sure you even have
>> >> duplicates?
>> >>
>> >> Anyway, I won't work out all of the details, but let me sketch
>> >> something that might get you your answers. First, sort your array.
>> >> Then use np.not_equal(x[:-1], x[1:]) as a mask on np.arange(1, len(x))
>> >> to find the indices where each sorted value changes over to the next.
>> >> The np.diff() of that should give you the size of each run. Use
>> >> np.unique to get the sorted unique values to match up with those
>> >> sizes.
>> >>
>> >> Fixing all of the off-by-one errors and dealing with the boundary
>> >> conditions correctly is left as an exercise for the reader.
>> >
>> > ?? I suspect that this mail was meant to end up in the thread about
>> > sparse array data?
>>
>> No, I am responding to you.
>
> Hi Robert (Elsner),
>
> Just to expand a bit on Robert Kern's explanation: your problem is only
> partly related to Ticket #225. Even if that is fixed, you won't be able
> to call `bincount` with an array containing `30e9` unless you implement
> something using sparse arrays, because `bincount` wants to return an
> array that's `30e9 + 1` in length, which isn't going to happen.
>
> -Tony
>
> _______________________________________________
> NumPy-Discussion mailing list
> [email protected]
> http://mail.scipy.org/mailman/listinfo/numpy-discussion
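[Editor's note: working out Robert Kern's sketch above, with the off-by-one and boundary details he left as an exercise filled in, might look like the following. `sparse_bincount` is a hypothetical helper name, not a NumPy function.]

```python
import numpy as np

def sparse_bincount(x):
    """Count occurrences of each value in a non-empty array without
    allocating a dense array of length max(x) + 1 as np.bincount would."""
    x = np.sort(x)
    # Indices where the sorted value changes over to the next one.
    change = np.nonzero(x[1:] != x[:-1])[0] + 1
    # Bracket with 0 and len(x) so np.diff yields every run length,
    # including the first and last runs.
    boundaries = np.concatenate(([0], change, [len(x)]))
    counts = np.diff(boundaries)
    values = x[boundaries[:-1]]  # first element of each run
    return values, counts

# Values near 30e9 overflow int32, so force a 64-bit dtype.
values, counts = sparse_bincount(np.array([30_000_000_000, 5, 5, 7],
                                          dtype=np.int64))
```

The memory cost is proportional to the number of entries rather than to the largest value, which is what makes it workable for values up to 30e9.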
Hi Robert,

I suggest you try the value_counts instance method on pandas.Series:

In [9]: ints = np.random.randint(0, 30e9, size=100000)

In [10]: all_ints = Series(ints.repeat(500))

In [11]: all_ints.value_counts()
Out[11]:
16420382874    500
7147863689     500
4019588415     500
17462388002    500
11956087699    500
14888898988    500
3811318398     500
6333517765     500
16077665866    500
17559759721    500
5898309082     500
25213150655    500
17877388690    500
3122117900     500
6242860212     500
...
6344689036     500
16817048573    500
16361777055    500
4376828961     500
15910505187    500
12051499627    500
23857610954    500
24557975709    500
28135006018    500
1661624653     500
6747702840     500
24601775145    500
7290769930     500
9417075109     500
12071596222    500
Length: 100000

This method uses a C hash table and takes about 1 second to compute the bin
counts for 50M entries and 100k unique values.

- Wes
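[Editor's note: for readers on NumPy 1.9 or later (released after this 2012 thread), `np.unique` gained a `return_counts` flag that performs the same sparse counting directly, with no pandas dependency:]

```python
import numpy as np

# np.unique with return_counts=True (NumPy >= 1.9) returns the sorted
# unique values alongside how often each occurs, so no dense array of
# length max(x) + 1 is ever allocated.
x = np.array([30_000_000_000, 5, 5, 7], dtype=np.int64)
values, counts = np.unique(x, return_counts=True)
```
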
