On Thu, May 3, 2012 at 12:51 PM, Tony Yu <[email protected]> wrote:
>
> On Thu, May 3, 2012 at 9:57 AM, Robert Kern <[email protected]> wrote:
>>
>> On Thu, May 3, 2012 at 2:50 PM, Robert Elsner <[email protected]> wrote:
>> >
>> > On 03.05.2012 15:45, Robert Kern wrote:
>> >> On Thu, May 3, 2012 at 2:24 PM, Robert Elsner <[email protected]> wrote:
>> >>> Hello Everybody,
>> >>>
>> >>> Is there any news on the status of np.bincount with respect to "big"
>> >>> numbers? It seems I have just been bitten by #225. Is there an
>> >>> efficient way around it? I found the np.histogram function painfully
>> >>> slow.
>> >>>
>> >>> Below is a simple script that demonstrates bincount failing with a
>> >>> memory error on big numbers:
>> >>>
>> >>> import numpy as np
>> >>>
>> >>> x = np.array((30e9,)).astype(int)
>> >>> np.bincount(x)
>> >>>
>> >>> Any good idea how to work around it? My arrays contain some 50M
>> >>> entries in the range from 0 to 30e9, and I would like to have them
>> >>> bincounted...
>> >>
>> >> You need a sparse data structure, then. Are you sure you even have
>> >> duplicates?
>> >>
>> >> Anyway, I won't work out all of the details, but let me sketch
>> >> something that might get you your answers. First, sort your array.
>> >> Then use np.not_equal(x[:-1], x[1:]) as a mask on np.arange(1, len(x))
>> >> to find the indices where each sorted value changes over to the next.
>> >> The np.diff() of that should give you the size of each run. Use
>> >> np.unique to get the sorted unique values to match up with those
>> >> sizes.
>> >>
>> >> Fixing all of the off-by-one errors and dealing with the boundary
>> >> conditions correctly is left as an exercise for the reader.
>> >
>> > ?? I suspect that this mail was meant to end up in the thread about
>> > sparse array data?
>>
>> No, I am responding to you.
>
> Hi Robert (Elsner),
>
> Just to expand a bit on Robert Kern's explanation: your problem is only
> partly related to Ticket #225. Even if that is fixed, you won't be able
> to call `bincount` with an array containing `30e9` unless you implement
> something using sparse arrays, because `bincount` wants to return an
> array that's `30e9 + 1` in length, which isn't going to happen.
>
> -Tony
>
> _______________________________________________
> NumPy-Discussion mailing list
> [email protected]
> http://mail.scipy.org/mailman/listinfo/numpy-discussion
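[Editor's note: working out Robert Kern's sketch above, with the off-by-one and boundary details he left as an exercise filled in, might look like the following. `sparse_bincount` is a hypothetical helper name, not a NumPy function.]

```python
import numpy as np

def sparse_bincount(x):
    """Count occurrences of each value in a non-empty array without
    allocating a dense array of length max(x) + 1 as np.bincount would."""
    x = np.sort(x)
    # Indices where the sorted value changes over to the next one.
    change = np.nonzero(x[1:] != x[:-1])[0] + 1
    # Bracket with 0 and len(x) so np.diff yields every run length,
    # including the first and last runs.
    boundaries = np.concatenate(([0], change, [len(x)]))
    counts = np.diff(boundaries)
    values = x[boundaries[:-1]]  # first element of each run
    return values, counts

# Values near 30e9 overflow int32, so force a 64-bit dtype.
values, counts = sparse_bincount(np.array([30_000_000_000, 5, 5, 7],
                                          dtype=np.int64))
```

The memory cost is proportional to the number of entries rather than to the largest value, which is what makes it workable for values up to 30e9.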
Hi Robert,

I suggest you try the value_counts instance method on pandas.Series:

In [9]: ints = np.random.randint(0, 30e9, size=100000)

In [10]: all_ints = Series(ints.repeat(500))

In [11]: all_ints.value_counts()
Out[11]:
16420382874    500
7147863689     500
4019588415     500
17462388002    500
11956087699    500
14888898988    500
3811318398     500
6333517765     500
16077665866    500
17559759721    500
5898309082     500
25213150655    500
17877388690    500
3122117900     500
6242860212     500
...
6344689036     500
16817048573    500
16361777055    500
4376828961     500
15910505187    500
12051499627    500
23857610954    500
24557975709    500
28135006018    500
1661624653     500
6747702840     500
24601775145    500
7290769930     500
9417075109     500
12071596222    500
Length: 100000

This method uses a C hash table and takes about 1 second to compute the bin
counts for 50M entries and 100k unique values.

- Wes
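[Editor's note: for readers on NumPy 1.9 or later (released after this 2012 thread), `np.unique` gained a `return_counts` flag that performs the same sparse counting directly, with no pandas dependency:]

```python
import numpy as np

# np.unique with return_counts=True (NumPy >= 1.9) returns the sorted
# unique values alongside how often each occurs, so no dense array of
# length max(x) + 1 is ever allocated.
x = np.array([30_000_000_000, 5, 5, 7], dtype=np.int64)
values, counts = np.unique(x, return_counts=True)
```
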
