Implementing a secondary index

2021-11-17 Thread Claude Warren
Greetings,

I am looking to implement a Multidimensional Bloom filter index [1] [2] on
a Cassandra table.  OK, I know that is a lot to take in.  What I need is
any documentation that explains the architecture of the index options, or
someone I can ask questions of -- a mentor if you will.

I have a proof of concept for the index that works from the client side
[3].  What I want to do is move some of that processing to the server
side.

Basically, I think I need to do the following (a rough sketch follows below):

   1. On each partition create an SST to store the index data.  This table
   comprises 2 integer data points and the primary key for the data table.
   2. When the index cell gets updated in the original table (there will
   only be one column), update one or more rows in the SST table.
   3. When querying, perform multiple queries against the index data, and
   return the primary key values (or the data associated with the primary keys
   -- I am unclear on this bit).
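
To make steps 1 and 3 concrete, here is a rough in-memory sketch of the shape
I have in mind. Class and variable names are placeholders, not the actual
layout from the PoC, and it is keyed by single bit positions rather than the
two integers in step 1 just to keep the sketch short; the map stands in for
the index table:

import java.util.BitSet;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Placeholder sketch: one index entry per set bit of a row's Bloom filter, and a
// query that intersects the keys returned by a lookup for each set bit of the
// query filter.  The map below stands in for the index table described in step 1.
public class BloomIndexSketch
{
    // bit position -> primary keys of data rows whose filter has that bit set
    private final Map<Integer, Set<String>> index = new HashMap<>();

    public void add(BitSet rowFilter, String dataPk)
    {
        rowFilter.stream().forEach(
            bit -> index.computeIfAbsent(bit, b -> new HashSet<>()).add(dataPk));
    }

    // A row matches when its filter contains every bit of the query filter, i.e.
    // its key shows up in the result of every per-bit lookup (step 3).
    public Set<String> query(BitSet queryFilter)
    {
        Set<String> result = null;
        for (int bit = queryFilter.nextSetBit(0); bit >= 0;
             bit = queryFilter.nextSetBit(bit + 1))
        {
            Set<String> keys = index.getOrDefault(bit, Collections.emptySet());
            if (result == null)
                result = new HashSet<>(keys);
            else
                result.retainAll(keys);
            if (result.isEmpty())
                break;
        }
        return result == null ? Collections.emptySet() : result;
    }

    public static void main(String[] args)
    {
        BloomIndexSketch idx = new BloomIndexSketch();
        idx.add(BitSet.valueOf(new long[]{ 0b1011 }), "pk-a"); // bits 0, 1, 3
        idx.add(BitSet.valueOf(new long[]{ 0b0110 }), "pk-b"); // bits 1, 2
        System.out.println(idx.query(BitSet.valueOf(new long[]{ 0b0010 }))); // both keys
        System.out.println(idx.query(BitSet.valueOf(new long[]{ 0b1010 }))); // only pk-a
    }
}

In the real index each per-bit lookup would be a SELECT against the index
table, and whether the intersection (and the final fetch of the data rows)
happens server side is exactly the part I am unclear on.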

Any help or guidance would be appreciated,
Claude

[1] https://archive.org/details/arxiv-1501.01941/mode/2up
[2] https://archive.fosdem.org/2020/schedule/event/bloom_filters/
[3] https://github.com/Claude-at-Instaclustr/blooming_cassandra




-- 



*Claude Warren*

Principal Software Engineer

Instaclustr


Re: [VOTE] CEP-17: SSTable format API

2021-11-17 Thread Benjamin Lerer
+1

On Tue, Nov 16, 2021 at 6:05 PM Joshua McKenzie  wrote:

> +1
>
> On Tue, Nov 16, 2021 at 10:14 AM Andrés de la Peña 
> wrote:
>
> > +1
> >
> > On Tue, 16 Nov 2021 at 08:39, Sam Tunnicliffe  wrote:
> >
> > > +1
> > >
> > > > On 15 Nov 2021, at 19:42, Branimir Lambov 
> wrote:
> > > >
> > > > Hi everyone,
> > > >
> > > > I would like to start a vote on this CEP.
> > > >
> > > > Proposal:
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-17%3A+SSTable+format+API
> > > >
> > > > Discussion:
> > > >
> > >
> >
> https://lists.apache.org/thread.html/r636bebcab4e678dbee042285449193e8e75d3753200a1b404fcc7196%40%3Cdev.cassandra.apache.org%3E
> > > >
> > > > The vote will be open for 72 hours.
> > > > A vote passes if there are at least three binding +1s and no binding
> > > vetoes.
> > > >
> > > > Regards,
> > > > Branimir
> > >
> > >
> > > -
> > > To unsubscribe, e-mail: dev-unsubscr...@cassandra.apache.org
> > > For additional commands, e-mail: dev-h...@cassandra.apache.org
> > >
> > >
> >
>


Re: Implementing a secondary index

2021-11-17 Thread DuyHai Doan
Hello Claude

I wrote a blog post about 2nd index architecture a long time ago, but most
of the content should still be relevant, so it is worth checking:

https://www.doanduyhai.com/blog/?p=13191

Regards

Duy Hai DOAN

On Wed, Nov 17, 2021 at 10:17 AM Claude Warren 
wrote:

> Greetings,
>
> I am looking to implement a Multidimensional Bloom filter index [1] [2] on
> a Cassandra table.  OK, I know that is a lot to take in.  What I need is
> any documentation that explains the architecture of the index options, or
> someone I can ask questions of -- a mentor if you will.
>
> I have a proof of concept for the index that works from the client side
> [3].  What I want to do is move some of that processing to the server
> side.
>
> Basically, I think I need to do the following:
>
>    1. On each partition create an SST to store the index data.  This table
>    comprises 2 integer data points and the primary key for the data table.
>    2. When the index cell gets updated in the original table (there will
>    only be one column), update one or more rows in the SST table.
>    3. When querying, perform multiple queries against the index data, and
>    return the primary key values (or the data associated with the primary
>    keys -- I am unclear on this bit).
>
> Any help or guidance would be appreciated,
> Claude
>
> [1] https://archive.org/details/arxiv-1501.01941/mode/2up
> [2] https://archive.fosdem.org/2020/schedule/event/bloom_filters/
> [3] https://github.com/Claude-at-Instaclustr/blooming_cassandra
>
>
>
>
> --
>
>
>
> *Claude Warren*
>
> Principal Software Engineer
>
> Instaclustr
>


Re: [DISCUSS] Releasable trunk and quality

2021-11-17 Thread Henrik Ingo
There's an old joke: How many people read Slashdot? The answer is 5. The
rest of us just write comments without reading... In that spirit, I wanted
to share some thoughts in response to your question, even if I know some of
it will have been said in this thread already :-)

Basically, I just want to share what has worked well in my past projects...

Visualization: Now that we have Butler running, we can already see a
decline in failing tests for 4.0 and trunk! This shows that contributors
want to do the right thing, we just need the right tools and processes to
achieve success.

Process: I'm confident we will soon be back to seeing 0 failures for 4.0
and trunk. However, keeping that state requires constant vigilance! At
Mongodb we had a role called Build Baron (aka Build Cop, etc...). This is a
weekly rotating role where the person who is the Build Baron will at least
once per day go through all of the Butler dashboards to catch new
regressions early. We have used the same process also at Datastax to guard
our downstream fork of Cassandra 4.0. It's the responsibility of the Build
Baron to
 - file a jira ticket for new failures
 - determine which commit is responsible for introducing the regression.
Sometimes this is obvious, sometimes this requires "bisecting" by running
more builds, e.g. between two nightly builds (see the sketch after this list).
 - assign the jira ticket to the author of the commit that introduced the
regression
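
For the bisecting step the logic is just a binary search over the builds
between the last nightly that passed and the first that failed; a toy sketch
(the build list and the failure check are made up):

import java.util.List;
import java.util.function.Predicate;

// Toy sketch of "bisecting" between two nightly builds: binary search for the first
// build whose test run fails, assuming the list starts at a passing build, ends at a
// failing one, and flips from pass to fail exactly once.
public class BisectSketch
{
    public static <T> T firstBad(List<T> builds, Predicate<T> fails)
    {
        int lo = 0, hi = builds.size() - 1;
        while (lo < hi)
        {
            int mid = (lo + hi) / 2;
            if (fails.test(builds.get(mid)))
                hi = mid;          // the regression is at mid or earlier
            else
                lo = mid + 1;      // the regression is after mid
        }
        return builds.get(lo);
    }

    public static void main(String[] args)
    {
        List<String> commits = List.of("a", "b", "c", "d", "e");
        System.out.println(firstBad(commits, c -> c.compareTo("c") >= 0)); // prints c
    }
}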

Given that Cassandra is a community that includes part time and volunteer
developers, we may want to try some variation of this, such as pairing 2
build barons each week?

Reverting: A policy that the commit causing the regression is automatically
reverted can be scary. It takes courage to be the junior test engineer who
reverts yesterday's commit from the founder and CTO, just to give an
example... Yet this is the most efficient way to keep the build green. And
it turns out it's not that much additional work for the original author to
fix the issue and then re-merge the patch.

Merge-train: For any project with more than 1 commit per day, it will
inevitably happen that you need to rebase a PR before merging, and even if
it passed all tests before, after rebase it won't. In the downstream
Cassandra fork previously mentioned, we have tried to enable a github rule
which requires a) that all tests passed before merging, and b) the PR is
against the head of the branch merged into, and c) the tests were run after
such rebase. Unfortunately this leads to infinite loops where a large PR
may never be able to commit because it has to be rebased again and again
when smaller PRs can merge faster. The solution to this problem is to have
an automated process for the rebase-test-merge cycle. GitLab supports such
a feature and calls it a merge train:
https://docs.gitlab.com/ee/ci/pipelines/merge_trains.html
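
To make the mechanics concrete, here is a toy model of the loop such a train
automates. This is not how GitLab implements it, just the shape of the
process, with made-up names:

import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Toy model of the rebase-test-merge loop a merge train automates: each queued change
// is tested on top of trunk as it stands at that point in the queue, merged if the
// combined state passes CI, and kicked out of the train otherwise.
public class MergeTrainSketch
{
    public static List<String> run(List<String> queue, Predicate<List<String>> ciPasses)
    {
        List<String> trunk = new ArrayList<>();
        for (String change : queue)
        {
            List<String> candidate = new ArrayList<>(trunk);
            candidate.add(change);      // the "rebase" onto everything merged ahead of it
            if (ciPasses.test(candidate))
                trunk = candidate;      // green: merge
            else
                System.out.println("kicked out of the train: " + change);
        }
        return trunk;
    }

    public static void main(String[] args)
    {
        // Pretend CI fails any trunk state that contains the "bad-patch" change.
        List<String> merged = run(List.of("CASSANDRA-1", "bad-patch", "CASSANDRA-2"),
                                  state -> !state.contains("bad-patch"));
        System.out.println("trunk after the train: " + merged); // [CASSANDRA-1, CASSANDRA-2]
    }
}

As far as I understand, the real thing also runs the queued entries in
parallel and re-tests the ones behind a failure, but the invariant is the
same: nothing merges without having been tested against the exact state it
lands on.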

The merge-train can be considered an advanced feature and we can return to
it later. The other points should be sufficient to keep a reasonably green
trunk.

I guess the major area where we can improve daily test coverage would be
performance tests. To that end we recently open-sourced a nice tool that
can algorithmically detect performance regressions in a timeseries history
of benchmark results: https://github.com/datastax-labs/hunter Just like
with correctness testing it's my experience that catching regressions the
day they happened is much better than trying to do it at beta or rc time.
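
To illustrate the general idea -- this is deliberately not Hunter's actual
change point detection, just a toy mean-shift check over a series of nightly
results:

import java.util.Arrays;

// Toy illustration of detecting a performance regression in a benchmark timeseries:
// flag the first point where the mean of the window after it drops below the mean of
// the window before it by more than the given tolerance.
public class RegressionSketch
{
    static double mean(double[] xs, int from, int to)
    {
        return Arrays.stream(xs, from, to).average().orElse(Double.NaN);
    }

    /** Index where the mean shift starts, or -1 if the series looks stable. */
    static int findShift(double[] throughput, int window, double tolerance)
    {
        for (int i = window; i + window <= throughput.length; i++)
        {
            double before = mean(throughput, i - window, i);
            double after = mean(throughput, i, i + window);
            if (before - after > tolerance * before) // throughput dropped by more than tolerance
                return i;
        }
        return -1;
    }

    public static void main(String[] args)
    {
        double[] nightly = { 100, 101, 99, 100, 102, 88, 87, 89, 88, 86 };
        System.out.println(findShift(nightly, 3, 0.10)); // prints 5 (the first bad nightly)
    }
}

Hunter does something much more principled statistically to cope with noisy
benchmarks, but the workflow is the same: feed it the nightly results and let
it point at the run where the shift started.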

Piotr also blogged about Hunter when it was released:
https://medium.com/building-the-open-data-stack/detecting-performance-regressions-with-datastax-hunter-c22dc444aea4

henrik



On Sat, Oct 30, 2021 at 4:00 PM Joshua McKenzie 
wrote:

> We as a project have gone back and forth on the topic of quality and the
> notion of a releasable trunk for quite a few years. If people are
> interested, I'd like to rekindle this discussion a bit and see if we're
> happy with where we are as a project or if we think there's steps we should
> take to change the quality bar going forward. The following questions have
> been rattling around for me for a while:
>
> 1. How do we define what "releasable trunk" means? All reviewed by M
> committers? Passing N% of tests? Passing all tests plus some other metrics
> (manual testing, raising the number of reviewers, test coverage, usage in
> dev or QA environments, etc)? Something else entirely?
>
> 2. With a definition settled upon in #1, what steps, if any, do we need to
> take to get from where we are to having *and keeping* that releasable
> trunk? Anything to codify there?
>
> 3. What are the benefits of having a releasable trunk as defined here? What
> are the costs? Is it worth pursuing? What are the alternatives (for
> instance: a freeze before a release + stabilization focus by the community
> i.e. 4.0 push or the tock in tick-tock)?
>
> Given the large volumes of work coming down the pike with CEP's, this seems
> like a good time to at least check in on this topic as a community.
>
> Full disclosure

Re: [DISCUSS] Mentoring newcomers

2021-11-17 Thread Blake Eggleston
I’m happy to help out

> On Nov 12, 2021, at 9:04 AM, Benjamin Lerer  wrote:
> 
> Hi everybody
> 
> As discussed in the *Creating a new slack channel for newcomers* thread, a
> solution to help newcomers engage with the project would be to provide a
> list of mentors that newcomers can contact when they feel insecure about
> asking questions through our cassandra-dev channel or through the mailing
> list.
> 
> I would like to collect the list of people that are interested in helping
> out newcomers so that we can post that list on our website.


-
To unsubscribe, e-mail: dev-unsubscr...@cassandra.apache.org
For additional commands, e-mail: dev-h...@cassandra.apache.org



Re: [DISCUSS] Releasable trunk and quality

2021-11-17 Thread Joshua McKenzie
Thanks for the feedback and insight Henrik; it's valuable to hear how other
large complex infra projects have tackled this problem set.

To attempt to summarize, what I got from your email:
[Phase one]
1) Build Barons: rotation where there's always someone active tying
failures to changes and adding those failures to our ticketing system
2) Best effort process of "test breakers" being assigned tickets to fix the
things their work broke
3) Moving to a culture where we regularly revert commits that break tests
4) Running tests before we merge changes

[Phase two]
1) Suite of performance tests on a regular cadence against trunk (w/hunter
or otherwise)
2) Integration w/ github merge-train pipelines

That cover the highlights? I agree with these points as useful places for
us to invest in as a project and I'll work on getting this into a gdoc for
us to align on and discuss further this week.

~Josh


On Wed, Nov 17, 2021 at 10:23 AM Henrik Ingo 
wrote:

> There's an old joke: How many people read Slashdot? The answer is 5. The
> rest of us just write comments without reading... In that spirit, I wanted
> to share some thoughts in response to your question, even if I know some of
> it will have been said in this thread already :-)
>
> Basically, I just want to share what has worked well in my past projects...
>
> Visualization: Now that we have Butler running, we can already see a
> decline in failing tests for 4.0 and trunk! This shows that contributors
> want to do the right thing, we just need the right tools and processes to
> achieve success.
>
> Process: I'm confident we will soon be back to seeing 0 failures for 4.0
> and trunk. However, keeping that state requires constant vigilance! At
> Mongodb we had a role called Build Baron (aka Build Cop, etc...). This is a
> weekly rotating role where the person who is the Build Baron will at least
> once per day go through all of the Butler dashboards to catch new
> regressions early. We have used the same process also at Datastax to guard
> our downstream fork of Cassandra 4.0. It's the responsibility of the Build
> Baron to
>  - file a jira ticket for new failures
>  - determine which commit is responsible for introducing the regression.
> Sometimes this is obvious, sometimes this requires "bisecting" by running
> more builds e.g. between two nightly builds.
>  - assign the jira ticket to the author of the commit that introduced the
> regression
>
> Given that Cassandra is a community that includes part time and volunteer
> developers, we may want to try some variation of this, such as pairing 2
> build barons each week?
>
> Reverting: A policy that the commit causing the regression is automatically
> reverted can be scary. It takes courage to be the junior test engineer who
> reverts yesterday's commit from the founder and CTO, just to give an
> example... Yet this is the most efficient way to keep the build green. And
> it turns out it's not that much additional work for the original author to
> fix the issue and then re-merge the patch.
>
> Merge-train: For any project with more than 1 commit per day, it will
> inevitably happen that you need to rebase a PR before merging, and even if
> it passed all tests before, after rebase it won't. In the downstream
> Cassandra fork previously mentioned, we have tried to enable a github rule
> which requires a) that all tests passed before merging, and b) the PR is
> against the head of the branch merged into, and c) the tests were run after
> such rebase. Unfortunately this leads to infinite loops where a large PR
> may never be able to commit because it has to be rebased again and again
> when smaller PRs can merge faster. The solution to this problem is to have
> an automated process for the rebase-test-merge cycle. GitLab supports such
> a feature and calls it a merge train:
> https://docs.gitlab.com/ee/ci/pipelines/merge_trains.html
>
> The merge-train can be considered an advanced feature and we can return to
> it later. The other points should be sufficient to keep a reasonably green
> trunk.
>
> I guess the major area where we can improve daily test coverage would be
> performance tests. To that end we recently open-sourced a nice tool that
> can algorithmically detect performance regressions in a timeseries history
> of benchmark results: https://github.com/datastax-labs/hunter Just like
> with correctness testing it's my experience that catching regressions the
> day they happened is much better than trying to do it at beta or rc time.
>
> Piotr also blogged about Hunter when it was released:
>
> https://medium.com/building-the-open-data-stack/detecting-performance-regressions-with-datastax-hunter-c22dc444aea4
>
> henrik
>
>
>
> On Sat, Oct 30, 2021 at 4:00 PM Joshua McKenzie 
> wrote:
>
> > We as a project have gone back and forth on the topic of quality and the
> > notion of a releasable trunk for quite a few years. If people are
> > interested, I'd like to rekindle this discussion a bit and see if we're
> > 

Re: [DISCUSS] Releasable trunk and quality

2021-11-17 Thread bened...@apache.org
I raised this before, but to highlight it again: how do these approaches 
interface with our merge strategy?

We might have to rebase several dependent merge commits and want to merge them 
atomically. So far as I know these tools don’t work fantastically in this 
scenario, but if I’m wrong that’s fantastic. If not, given how important these 
things are, should we consider revisiting our merge strategy?

From: Joshua McKenzie 
Date: Wednesday, 17 November 2021 at 16:39
To: dev@cassandra.apache.org 
Subject: Re: [DISCUSS] Releasable trunk and quality
Thanks for the feedback and insight Henrik; it's valuable to hear how other
large complex infra projects have tackled this problem set.

To attempt to summarize, what I got from your email:
[Phase one]
1) Build Barons: rotation where there's always someone active tying
failures to changes and adding those failures to our ticketing system
2) Best effort process of "test breakers" being assigned tickets to fix the
things their work broke
3) Moving to a culture where we regularly revert commits that break tests
4) Running tests before we merge changes

[Phase two]
1) Suite of performance tests on a regular cadence against trunk (w/hunter
or otherwise)
2) Integration w/ github merge-train pipelines

That cover the highlights? I agree with these points as useful places for
us to invest in as a project and I'll work on getting this into a gdoc for
us to align on and discuss further this week.

~Josh


On Wed, Nov 17, 2021 at 10:23 AM Henrik Ingo 
wrote:

> There's an old joke: How many people read Slashdot? The answer is 5. The
> rest of us just write comments without reading... In that spirit, I wanted
> to share some thoughts in response to your question, even if I know some of
> it will have been said in this thread already :-)
>
> Basically, I just want to share what has worked well in my past projects...
>
> Visualization: Now that we have Butler running, we can already see a
> decline in failing tests for 4.0 and trunk! This shows that contributors
> want to do the right thing, we just need the right tools and processes to
> achieve success.
>
> Process: I'm confident we will soon be back to seeing 0 failures for 4.0
> and trunk. However, keeping that state requires constant vigilance! At
> Mongodb we had a role called Build Baron (aka Build Cop, etc...). This is a
> weekly rotating role where the person who is the Build Baron will at least
> once per day go through all of the Butler dashboards to catch new
> regressions early. We have used the same process also at Datastax to guard
> our downstream fork of Cassandra 4.0. It's the responsibility of the Build
> Baron to
>  - file a jira ticket for new failures
>  - determine which commit is responsible for introducing the regression.
> Sometimes this is obvious, sometimes this requires "bisecting" by running
> more builds e.g. between two nightly builds.
>  - assign the jira ticket to the author of the commit that introduced the
> regression
>
> Given that Cassandra is a community that includes part time and volunteer
> developers, we may want to try some variation of this, such as pairing 2
> build barons each week?
>
> Reverting: A policy that the commit causing the regression is automatically
> reverted can be scary. It takes courage to be the junior test engineer who
> reverts yesterday's commit from the founder and CTO, just to give an
> example... Yet this is the most efficient way to keep the build green. And
> it turns out it's not that much additional work for the original author to
> fix the issue and then re-merge the patch.
>
> Merge-train: For any project with more than 1 commit per day, it will
> inevitably happen that you need to rebase a PR before merging, and even if
> it passed all tests before, after rebase it won't. In the downstream
> Cassandra fork previously mentioned, we have tried to enable a github rule
> which requires a) that all tests passed before merging, and b) the PR is
> against the head of the branch merged into, and c) the tests were run after
> such rebase. Unfortunately this leads to infinite loops where a large PR
> may never be able to commit because it has to be rebased again and again
> when smaller PRs can merge faster. The solution to this problem is to have
> an automated process for the rebase-test-merge cycle. GitLab supports such
> a feature and calls it a merge train:
> https://docs.gitlab.com/ee/ci/pipelines/merge_trains.html
>
> The merge-train can be considered an advanced feature and we can return to
> it later. The other points should be sufficient to keep a reasonably green
> trunk.
>
> I guess the major area where we can improve daily test coverage would be
> performance tests. To that end we recently open-sourced a nice tool that
> can algorithmically detect performance regressions in a timeseries history
> of benchmark results: https://github.com/datastax-labs/hunter Just like
> with correctness testing it's my experience that catching regressions the
> da

Re: [DISCUSS] Releasable trunk and quality

2021-11-17 Thread Joshua McKenzie
Sorry for not catching that, Benedict; you're absolutely right. So long as
we're using merge commits between branches, I don't think auto-merging via
train or blocking on green CI are options via the tooling, and multi-branch
reverts will be something we should document very clearly should we even
choose to go that route (a lot of room to make mistakes there).

It may not be a huge issue as we can expect the more disruptive changes
(i.e. potentially destabilizing) to be happening on trunk only, so perhaps
we can get away with slightly different workflows or policies based on
whether you're doing a multi-branch bugfix or a feature on trunk. Bears
thinking more deeply about.

I'd also be game for revisiting our merge strategy. I don't see much
difference in labor between merging between branches vs. preparing separate
patches for an individual developer; however, I'm sure there are maintenance
and integration implications there that I'm not thinking of right now.

On Wed, Nov 17, 2021 at 12:03 PM bened...@apache.org 
wrote:

> I raised this before, but to highlight it again: how do these approaches
> interface with our merge strategy?
>
> We might have to rebase several dependent merge commits and want to merge
> them atomically. So far as I know these tools don’t work fantastically in
> this scenario, but if I’m wrong that’s fantastic. If not, given how
> important these things are, should we consider revisiting our merge
> strategy?
>
> From: Joshua McKenzie 
> Date: Wednesday, 17 November 2021 at 16:39
> To: dev@cassandra.apache.org 
> Subject: Re: [DISCUSS] Releasable trunk and quality
> Thanks for the feedback and insight Henrik; it's valuable to hear how other
> large complex infra projects have tackled this problem set.
>
> To attempt to summarize, what I got from your email:
> [Phase one]
> 1) Build Barons: rotation where there's always someone active tying
> failures to changes and adding those failures to our ticketing system
> 2) Best effort process of "test breakers" being assigned tickets to fix the
> things their work broke
> 3) Moving to a culture where we regularly revert commits that break tests
> 4) Running tests before we merge changes
>
> [Phase two]
> 1) Suite of performance tests on a regular cadence against trunk (w/hunter
> or otherwise)
> 2) Integration w/ github merge-train pipelines
>
> That cover the highlights? I agree with these points as useful places for
> us to invest in as a project and I'll work on getting this into a gdoc for
> us to align on and discuss further this week.
>
> ~Josh
>
>
> On Wed, Nov 17, 2021 at 10:23 AM Henrik Ingo 
> wrote:
>
> > There's an old joke: How many people read Slashdot? The answer is 5. The
> > rest of us just write comments without reading... In that spirit, I
> wanted
> > to share some thoughts in response to your question, even if I know some
> of
> > it will have been said in this thread already :-)
> >
> > Basically, I just want to share what has worked well in my past
> projects...
> >
> > Visualization: Now that we have Butler running, we can already see a
> > decline in failing tests for 4.0 and trunk! This shows that contributors
> > want to do the right thing, we just need the right tools and processes to
> > achieve success.
> >
> > Process: I'm confident we will soon be back to seeing 0 failures for 4.0
> > and trunk. However, keeping that state requires constant vigilance! At
> > Mongodb we had a role called Build Baron (aka Build Cop, etc...). This
> is a
> > weekly rotating role where the person who is the Build Baron will at
> least
> > once per day go through all of the Butler dashboards to catch new
> > regressions early. We have used the same process also at Datastax to
> guard
> > our downstream fork of Cassandra 4.0. It's the responsibility of the
> Build
> > Baron to
> >  - file a jira ticket for new failures
> >  - determine which commit is responsible for introducing the regression.
> > Sometimes this is obvious, sometimes this requires "bisecting" by running
> > more builds e.g. between two nightly builds.
> >  - assign the jira ticket to the author of the commit that introduced the
> > regression
> >
> > Given that Cassandra is a community that includes part time and volunteer
> > developers, we may want to try some variation of this, such as pairing 2
> > build barons each week?
> >
> > Reverting: A policy that the commit causing the regression is
> automatically
> > reverted can be scary. It takes courage to be the junior test engineer
> who
> > reverts yesterday's commit from the founder and CTO, just to give an
> > example... Yet this is the most efficient way to keep the build green.
> And
> > it turns out it's not that much additional work for the original author
> to
> > fix the issue and then re-merge the patch.
> >
> > Merge-train: For any project with more than 1 commit per day, it will
> > inevitably happen that you need to rebase a PR before merging, and even
> if
> > it passed all tests before, after rebase it won'