Bitmap indexes - reviving CASSANDRA-1472

2013-04-10 Thread mrevilgnome
I'm currently building a distributed cluster on top of Cassandra to perform
fast set manipulation via bitmap indexes. This gives me the ability to
perform unions, intersections, and set subtraction across sub-queries.
Currently I'm storing index information for thousands of dimensions as
Cassandra rows, and my cluster keeps this information cached, distributed,
and replicated in order to answer queries.

Every couple of days I think to myself that this should really exist in C*.
Given all the benefits, would there be any interest in
reviving CASSANDRA-1472?

One downside is that this is very memory intensive, even for sparse
bitmaps.


Re: Bitmap indexes - reviving CASSANDRA-1472

2013-04-10 Thread Jonathan Ellis
If you mean, "Can someone help me figure out how to get started updating
these old patches to trunk and cleaning out the Avro?" then yes, I've been
knee-deep in indexing code recently.





-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder, http://www.datastax.com
@spyced


Re: [VOTE] Release Apache Cassandra 1.2.4

2013-04-10 Thread Brandon Williams
On Mon, Apr 8, 2013 at 12:54 PM, Sylvain Lebresne wrote:

> A fair enough number of bugs fixed since 1.2.3, I propose the following
> artifacts for release as 1.2.4.
>
> sha1: 2e96d07114dc88f85e56d88f7322a6b91da36c0d
>

+1


Re: Bitmap indexes - reviving CASSANDRA-1472

2013-04-10 Thread mrevilgnome
What do you think about set manipulation via indexes in Cassandra? I'm
interested in answering queries such as "give me all users that performed
events 1, 2, and 3, but not 4." If the answer is yes, then I can make a case
for spending my time on C*. The only downside for us would be that our current
prototype is in C++, so we would lose some performance and the ability to
dedicate an entire machine to caching/performing queries.
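
A rough sketch of that kind of query, assuming each event's bitmap is held as
a Python integer with bit i set when user i performed the event. The names
(event_index, users_matching) and the sample data are hypothetical and are not
taken from CASSANDRA-1472 or from the C++ prototype:

```python
def bitmap_from_ids(user_ids):
    """Build a bitmap (a Python int) with bit i set for each user id i."""
    bitmap = 0
    for uid in user_ids:
        bitmap |= 1 << uid
    return bitmap


def ids_from_bitmap(bitmap):
    """Return the sorted list of bit positions that are set in the bitmap."""
    ids = []
    position = 0
    while bitmap:
        if bitmap & 1:
            ids.append(position)
        bitmap >>= 1
        position += 1
    return ids


def users_matching(index, required, excluded):
    """Intersect the bitmaps of all required events, then subtract excluded ones."""
    result = None
    for event in required:
        result = index[event] if result is None else result & index[event]
    if result is None:
        # No required events: nothing to intersect, so return the empty set.
        return 0
    for event in excluded:
        # AND with the complement implements set subtraction.
        result &= ~index[event]
    return result


# Hypothetical sample index: event id -> bitmap of user ids.
event_index = {
    1: bitmap_from_ids([0, 2, 3, 5]),
    2: bitmap_from_ids([2, 3, 5, 7]),
    3: bitmap_from_ids([1, 2, 3, 5]),
    4: bitmap_from_ids([3, 8]),
}

# Users who performed events 1, 2, and 3, but not 4.
matching = ids_from_bitmap(users_matching(event_index, [1, 2, 3], [4]))
print(matching)  # [2, 5]
```

Each sub-query reduces to one bitwise operation per bitmap, which is why the
approach is fast but also why the bitmaps need to be cheap to load.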




Re: Bitmap indexes - reviving CASSANDRA-1472

2013-04-10 Thread Brian O'Neill

How does this compare with Druid?
https://github.com/metamx/druid

We're currently evaluating Acunu, Vertica and Druid...
http://brianoneill.blogspot.com/2013/04/bianalytics-on-big-datacassandra.html

With its bitmapped indexes, Druid appears to have the most potential.  
They boast some pretty impressive stats, especially WRT handling "real-time" 
updates and adding new dimensions.

They also use a compression algorithm, CONCISE, to cut down on the space 
requirements.
http://ricerca.mat.uniroma3.it/users/colanton/concise.html
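
To illustrate why compression helps so much with sparse bitmaps, here is a
minimal sketch of plain run-length encoding. This is a much simpler scheme
than CONCISE and is not its algorithm; the function names and sample data are
made up for illustration:

```python
def run_length_encode(bits):
    """Encode a list of 0/1 values as a list of (bit, run_length) pairs."""
    runs = []
    for bit in bits:
        if runs and runs[-1][0] == bit:
            # Same bit as the current run: extend it.
            runs[-1] = (bit, runs[-1][1] + 1)
        else:
            # Bit changed: start a new run.
            runs.append((bit, 1))
    return runs


def run_length_decode(runs):
    """Expand (bit, run_length) pairs back into the original list of bits."""
    bits = []
    for bit, length in runs:
        bits.extend([bit] * length)
    return bits


# A sparse bitmap: 1000 positions, only two of them set.
sparse = [0] * 1000
sparse[3] = 1
sparse[700] = 1

encoded = run_length_encode(sparse)
print(encoded)  # [(0, 3), (1, 1), (0, 696), (1, 1), (0, 299)]
```

The 1000-entry bitmap collapses to five runs, and the saving grows with the
length of the gaps between set bits.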

I haven't looked too deeply into the Druid code, but I've been meaning to see if 
it could be backed by C*.

We'd be game to join the hunt if you pursue such a beast. (with your code, or 
with portions of Druid)

-brian



-- 
Brian ONeill
Lead Architect, Health Market Science (http://healthmarketscience.com)
mobile:215.588.6024
blog: http://weblogs.java.net/blog/boneill42/
blog: http://brianoneill.blogspot.com/



Re: Bitmap indexes - reviving CASSANDRA-1472

2013-04-10 Thread Matt Stump
Druid was our inspiration to layer bitmap indexes on top of Cassandra.
Druid doesn't work for us because our data set is too large. We would need
many hundreds of nodes just for the pre-processed data. What I envisioned
was the ability to perform Druid-style queries (no aggregation) without the
limitations imposed by having the entire dataset in memory. I primarily
need to query whether a user performed some event, but I also intend to add
trigram indexes for LIKE, ILIKE, or possibly regex-style matching.
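
As a rough illustration of the trigram idea, the sketch below indexes each
value by its three-character substrings and answers a case-insensitive
substring match (roughly ILIKE '%needle%'). The function names and sample rows
are hypothetical, and a real implementation would store posting lists as
bitmaps rather than Python sets:

```python
from collections import defaultdict


def trigrams(text):
    """Return the set of lowercase three-character substrings of text."""
    text = text.lower()
    return {text[i:i + 3] for i in range(len(text) - 2)}


def build_index(rows):
    """Map each trigram to the set of row ids whose value contains it."""
    index = defaultdict(set)
    for row_id, value in rows.items():
        for gram in trigrams(value):
            index[gram].add(row_id)
    return index


def like_contains(index, rows, needle):
    """Return sorted row ids whose value contains needle, ignoring case."""
    grams = trigrams(needle)
    if grams:
        # Rows must contain every trigram of the needle to be candidates.
        candidates = set.intersection(*(index.get(g, set()) for g in grams))
    else:
        # Needle shorter than three characters: the index cannot help.
        candidates = set(rows)
    # Trigram matches can be false positives, so verify each candidate.
    needle_lower = needle.lower()
    return sorted(r for r in candidates if needle_lower in rows[r].lower())


# Hypothetical sample data: row id -> indexed string value.
rows = {1: "cassandra", 2: "sandra bullock", 3: "druid", 4: "Sandcastle"}
index = build_index(rows)

print(like_contains(index, rows, "sand"))  # [1, 2, 4]
print(like_contains(index, rows, "andr"))  # [1, 2]
```

The intersection step is the same set operation used for the event queries,
which is why trigram matching fits naturally on top of a bitmap index.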

I wasn't aware of CONCISE; thanks for the pointer. We are currently
evaluating FastBit, which is a very similar project:
https://sdm.lbl.gov/fastbit/




Re: Bitmap indexes - reviving CASSANDRA-1472

2013-04-10 Thread Carl Yeksigian
This discussion is off topic for the dev list. If you want to continue it,
please move to user@.

Thanks,
Carl




Re: Bitmap indexes - reviving CASSANDRA-1472

2013-04-10 Thread Brandon Williams
On Wed, Apr 10, 2013 at 9:50 PM, Carl Yeksigian  wrote:

> This discussion is off topic for the dev list. If you want to continue it,
> please move to user@.
>

I disagree entirely, this is absolutely dev-oriented.

-Brandon


Re: Bitmap indexes - reviving CASSANDRA-1472

2013-04-10 Thread Jawed
The information shared in this discussion is quite informative for developers.
I would like to see this kind of discussion continue in the group.

