Re: Distinct Counter Proposal for Cassandra

2012-06-29 Thread Tim Wintle
Would it be possible to support this in a more general case by providing a distributed |= operator over arbitrary byte strings (like the + operator on counter columns), which would allow distributed bloom filters as well? Tim Wintle On Fri, Jun 29, 2012 at 6:31 AM, Chris Burroughs wrote: > Well
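As a rough illustration of the idea above: a Bloom filter is a bit array, so two replicas' filters can be combined with a bytewise OR, and the result answers membership queries for the union of both inputs. The sketch below is plain Python, not Cassandra code. The names `M_BITS`, `K_HASHES`, `new_filter`, `add`, `might_contain` and `merge_or` are hypothetical, and the salted SHA-256 hashing is only one possible choice.

```python
import hashlib

M_BITS = 1024   # filter size in bits (hypothetical choice)
K_HASHES = 3    # number of hash positions per item (hypothetical choice)


def _positions(item: bytes):
    """Yield K_HASHES bit positions for item, derived from salted SHA-256."""
    for i in range(K_HASHES):
        digest = hashlib.sha256(bytes([i]) + item).digest()
        yield int.from_bytes(digest[:8], "big") % M_BITS


def new_filter() -> bytearray:
    """Return an empty filter: M_BITS zero bits stored as a byte string."""
    return bytearray(M_BITS // 8)


def add(bits: bytearray, item: bytes) -> None:
    """Set the bits for item in place."""
    for pos in _positions(item):
        bits[pos // 8] |= 1 << (pos % 8)


def might_contain(bits, item: bytes) -> bool:
    """True if item may be present; False means it is definitely absent."""
    return all(bits[pos // 8] & (1 << (pos % 8)) for pos in _positions(item))


def merge_or(a, b) -> bytes:
    """Bytewise OR of two equal-length byte strings (the proposed |= step)."""
    if len(a) != len(b):
        raise ValueError("byte strings must have the same length")
    return bytes(x | y for x, y in zip(a, b))
```

Because OR is commutative, associative and idempotent, replicas could apply such merges in any order, or more than once, and still converge on the same bytes. This is unlike `+` on counters, where applying the same update twice changes the result.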

Re: Distinct Counter Proposal for Cassandra

2012-06-29 Thread Chris Burroughs
Well I obviously think it would be handy. If this gets proposed and ends up using stream-lib don't be shy about asking for help. On a more general note, it would be great to see the special case Counter code become more general atomic operation code. On 06/13/2012 01:15 PM, Utku Can Topçu wrot

Re: Distinct Counter Proposal for Cassandra

2012-06-29 Thread Chris Burroughs
On 06/13/2012 01:00 PM, Yuki Morishita wrote: > The above implementation and most of the other ones (including stream-lib) > implement the optimized version of the algorithm which counts up to 10^9, so > may need some work. > > Another alternative is a self-learning bitmap > (http://ect.bell-labs.c

Re: Distinct Counter Proposal for Cassandra

2012-06-13 Thread Utku Can Topçu
Hi Yuki, I think I should have used the word discussion instead of proposal in the subject line. I have a fair amount of a design in mind, but I think it's not yet ripe enough to formalize. I'll try to simplify it and open a Jira ticket. But first I'm wondering if there would be any excitement

Re: Distinct Counter Proposal for Cassandra

2012-06-13 Thread Yuki Morishita
You can open a JIRA ticket at https://issues.apache.org/jira/browse/CASSANDRA with your proposal. Just for the input: I once implemented a HyperLogLog counter for internal use in Cassandra, but it turned out I didn't need it, so I just put it in a gist. You can find it here: https://gist.github.

Distinct Counter Proposal for Cassandra

2012-06-13 Thread Utku Can Topçu
Hi All, Let's assume we have a use case where we need to count the number of columns for a given key. Let's say the key is the URL and the column-name is the IP address or any cardinality identifier. The straightforward implementation seems to be simple, just inserting the IP addresses as columns
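The use case described above can be modelled, purely as an in-memory stand-in, like this: each row key (URL) maps to a set of column names (IP addresses), and the distinct count is the number of columns in the row. This is not the Cassandra API. `columns_by_key`, `record_visit` and `distinct_count` are hypothetical names used only for illustration.

```python
from collections import defaultdict

# Row key (URL) -> set of column names (IP addresses).
# Hypothetical in-memory stand-in for a wide row, not Cassandra itself.
columns_by_key = defaultdict(set)


def record_visit(url: str, ip: str) -> None:
    """Insert the IP as a column under the URL's row; repeat inserts are no-ops."""
    columns_by_key[url].add(ip)


def distinct_count(url: str) -> int:
    """Count the columns in the row, i.e. the distinct IPs seen for the URL."""
    return len(columns_by_key.get(url, ()))
```

Note that this exact approach stores every identifier, so its memory use grows with the number of distinct values per key.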