Re: Save CircleCI resources with optional test jobs

2021-11-01 Thread Andrés de la Peña
Hi all,

I have just created CASSANDRA-17113 adding scripting options to select the
workflow to be used, trying to implement Benedict's suggestion.

It adds some flags to the existing .circleci/generate.sh config generation
script. A -p flag generates only the pre-commit workflows, whereas a -s
flag generates only the workflows with separate approval steps for each
test job. Both flags can be used together to generate the two pairs of
workflows. The default option is to generate all the workflows, so users can
decide which workflow they are going to use in the CircleCI GUI after
pushing their changes. We can easily change the workflows that are
generated by default to anything that suits us better.

Additionally, there is a -r flag that disables the first approval step of
the generated workflows. For the separate_tests workflows it means that the
build is automatically run, but the individual steps still need to be
manually approved in the GUI. For the pre-commit_tests workflows, the -r
flag will automatically run the build and the most relevant tests. That
way, users pushing a mostly-ready patch who want to run the tests with
HIGHRES would probably want to generate their config file with generate.sh
-hpr.
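As a rough illustration of how these flag combinations compose (this is a hypothetical sketch, not the actual generate.sh implementation; only the flag letters and their meanings come from the description above):

```shell
#!/bin/sh
# Hypothetical sketch of generate.sh-style flag handling, not the real script.
# -p: pre-commit workflows only, -s: separate-approval workflows only,
# -r: drop the first approval step, -h: use HIGHRES resource settings.
parse_flags() {
    PRE=false; SEP=false; RUN=false; HIGH=false; OPTIND=1
    while getopts "psrh" opt "$@"; do
        case "$opt" in
            p) PRE=true ;;
            s) SEP=true ;;
            r) RUN=true ;;
            h) HIGH=true ;;
        esac
    done
    # Default: with neither -p nor -s given, generate all the workflows.
    if [ "$PRE" = false ] && [ "$SEP" = false ]; then
        PRE=true; SEP=true
    fi
    echo "pre-commit=$PRE separate=$SEP auto-run=$RUN highres=$HIGH"
}

parse_flags -hpr   # the "mostly-ready patch with HIGHRES" case from above
```

Under this sketch, `generate.sh -hpr` would select only the pre-commit workflows, drop the initial approval step, and use the HIGHRES resource settings.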

What do you think?

On Thu, 21 Oct 2021 at 14:37, Andrés de la Peña 
wrote:

> The two separate workflows try to serve both types of pushes, early and
> almost ready. The separate-test j8/j11 workflows are for those who push
> early patches, or special cases. The j8/j11 pre-commit workflows are for
> final stages, and it’s probably what those who push mostly-final patches
> would use. I think that these folks can just ignore the separate-tests
> workflows and use the “run everything” approval step on the pre-commit
> workflow.
>
> Focusing on those who push mostly final patches, and therefore use the
> pre-commit workflow, would there be any value in running the build
> automatically and then requesting manual approval to run all the tests? If
> the patch is almost ready the tests have to be run, so I think there isn’t
> a difference between running the build step or not. In the end, we still
> have to press a single approval step, and if the build fails the tests
> aren’t going to start.
>
>
> El El jue, 21 oct 2021 a las 14:47, Oleksandr Petrov <
> oleksandr.pet...@gmail.com> escribió:
>
>> Thank you for responding,
>>
>> If there's an option that Benedict has suggested (to allow folks who push
>> mostly-final patches, and the folks who push rather-early patches and then
>> update them continuously, to coexist and be able to quickly switch between
>> configs) I'd be more in favour of this rather than just enabling a
>> build/compile step for everyone.
>>
>> On Wed, Oct 20, 2021 at 8:25 PM Andrés de la Peña 
>> wrote:
>>
>> > Hi all,
>> >
>> > As Ekaterina has mentioned I’ll be OOO until Monday, and I won’t be
>> able to
>> > make any changes until then.
>> >
>> > Alex, I missed answering your suggestion about building on every commit,
>> > apologies for that. The discussion was open on multiple fronts and I
>> > somehow missed that one. I can easily change it back to automatic
>> builds on
>> > Monday if we prefer to do so.
>> >
>> > The pre-commit workflows have a single button to start the build and all
>> > the tests that were previously run automatically. If we had automatic
>> > builds we would still have a button to start the tests. Thus, I think
>> that
>> > automatic builds in the pre-commit workflow wouldn’t make a difference in
>> > terms of usability, and we would be wasting resources in intermediate
>> > commits.
>> >
>> > As for the workflows with separate approval steps, the automatic build
>> > would save us a click at the cost of wasting resources in some cases.
>> Since
>> > the cost of building is not that high it might make more sense to have
>> > automatic builds in these workflows. Alternatively, a detail that could
>> > improve things a bit in the separate-tests workflows is making the
>> approval
>> > steps for running the tests depend on the approval for the build. That
>> > wouldn’t save us any clicks, but it would make it impossible to miss the
>> build
>> > approval, since we would need to click it in order to enable the buttons
>> > starting the tests.
>> >
>> > It would be really great if the build started when one approves a
>> certain
>> > test job, but as it has been mentioned that doesn’t seem possible with
>> the
>> > current CircleCI features.
>> >
>> > Having a config that entirely satisfies the needs of everyone doesn’t
>> seem
>> > possible, and I also think that we’ll eventually need better tooling to
>> > generate the config file, although we still need to agree on a default
>> > config. CASSANDRA-16989 has recently added flags to the config
>> generation
>> > script allowing users to swap resources and specify the environment vars. We
>> > should probably continue adding similar options to be able to manipulate
>> > the approval steps, parallelism, etc.
>> >
>> > Regards,

Re: [DISCUSS] CEP-17: SSTable format API (CASSANDRA-17056)

2021-11-01 Thread Branimir Lambov
As Jacek is not a committer, this proposal needs a shepherd. I would be
happy to take this role.

> to me the interfaces have to be at the SSTable level, which then expose
readers/writers, but also have to expose the other things we do outside of
those paths

Could you give some detail on what these things are? Are they something
different from what the standalone Cassandra tools (scrub/verify/upgrade)
are currently doing? Obviously, any pluggability proposal will have to
provide a solution to these, and it would be helpful to know what needs to
be done beyond making sure the bundled tools work correctly (which includes
iterating indexes; format-specific operations (e.g. index summary
redistribution) are excluded as they are to be handled by the individual
format).

There is another problem in the current code, alluded to in the question:
"SSTableReader" (tied to the sstable format and ready for querying data,
i.e. with open data files and bloom filters loaded in memory) is the only
concept that the code uses to work with sstables. As I understand it, this
proposal does not aim to solve that problem, only to make sure that we can
properly read and write sstables of a given format, including in streaming
and standalone tools. In other words, it aims to provide the machinery to
convert sstable descriptors into sstable readers and writers.

I see this as an expansion of CASSANDRA-7443 and cleanup of any changes
that came after it and broke the intended capability.

Regards,
Branimir

On Thu, Oct 28, 2021 at 7:43 PM David Capwell 
wrote:

> Sorry about that; used -1/+1 to show preference, not binding action
>
> > On Oct 28, 2021, at 5:50 AM, bened...@apache.org wrote:
> >
> >> I am -1 here, for the reasons listed above; the problem (in my eye) is
> not reader/writer but higher level at the actual SSTable.  If we plug out
> read/write but still allow direct file access, then these abstractions fail
> to provide the goals of the CEP.
> >
> > Be careful dropping -1s, as your -1s here are binding. I realise this
> isn’t a vote thread, but the effect is the same. IMO we should try to
> express our preferences and defer to the collective opinion where possible.
> True -1s should very rarely appear.
> >
> >
> > From: David Capwell 
> > Date: Wednesday, 27 October 2021 at 15:33
> > To: dev@cassandra.apache.org 
> > Subject: Re: [DISCUSS] CEP-17: SSTable format API (CASSANDRA-17056)
> > Reading the CEP I don’t see any mention of the systems which access
> SSTables; such as streaming (small callout to zero-copy-streaming with
> ZeroCopyBigTableWriter) and repair.  If you are abstracting out
> BigTableReader then you are not dealing with the implementation assumptions
> that users of SSTables have (such as direct mutation to auxiliary files
> outside of -Data.db).
> >
> >> Audience
> >>   • Cassandra developers who wish to see SSTableReader and
> SSTableWriter more modular than they are today,
> >
> > This statement relates to the above comment, many parts of the code do
> not use Reader/Writer but instead use direct format knowledge to apply
> changes to the file format (normally outside of -Data.db); to me the
> interfaces have to be at the SSTable level, which then expose
> readers/writers, but also have to expose the other things we do outside of
> those paths.
> >
> >>   • move the metrics related to sstable format out from
> TableMetrics class and make them tied to certain sstable implementation
> >
> > I am curious about this comment; are you proposing to stop exposing this
> information?
> >
> >>   • have a single factory for creating both readers and writers for
> particular implementation of sstable and use it consistently - no direct
> creation of any reader / writer
> >
> > I am -1 here, for the reasons listed above; the problem (in my eye) is
> not reader/writer but higher level at the actual SSTable.  If we plug out
> read/write but still allow direct file access, then these abstractions fail
> to provide the goals of the CEP.
> >
> > I am +1 to the intent of the CEP.
> >
> > And last comment, which I have also done in the other modularity thread…
> backwards compatibility and maintenance. It is not clear right now what
> java interfaces may not break and how we can maintain and extend such
> interfaces in the future.  If the goal is to allow 3rd parties to plugin
> and offer new SSTable formats, are we as a project ok with having a minor
> release do a binary or source non-compatible change?  If not how do we
> detect this?  Until this problem is solved, I do not think we should add
> any such interfaces.
> >
> >> On Oct 22, 2021, at 7:23 AM, Jeremiah Jordan 
> wrote:
> >>
> >> Hi Stefan,
> >> That idea is not related to this CEP which is about the file formats of
> the
> >> sstables, not file system access.  But you should take a look at the
> work
> >> recently committed in
> https://issues.apache.org/jira/browse/CASSANDRA-16926
> >> to switch to using java.nio.file.Path for file access.  This sh

[DISCUSS] CEP-3: Guardrails

2021-11-01 Thread Andrés de la Peña
Hi everyone,

I'd like to start a discussion about Guardrails proposal:
https://cwiki.apache.org/confluence/display/CASSANDRA/%28DRAFT%29+-+CEP-3%3A+Guardrails

Guardrails are an easy way to enforce system-wide soft and hard limits to
prevent anti-patterns of bad usage and, in the long run, make it impossible
to severely degrade the performance of a node/cluster through user actions
such as having too many secondary indexes, too large partitions, almost
full disks, etc.
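Not part of the proposal text, but as a minimal sketch of the soft/hard-limit idea (all names here are hypothetical, not the CEP's actual API): a guardrail checks a value against a warn threshold (soft limit) and a fail threshold (hard limit).

```java
// Minimal sketch of a soft/hard-limit guardrail; names are hypothetical,
// not the actual CEP-3 API.
public class Guardrail {
    private final String name;
    private final long warnThreshold; // soft limit: emit a warning
    private final long failThreshold; // hard limit: reject the operation

    public Guardrail(String name, long warnThreshold, long failThreshold) {
        this.name = name;
        this.warnThreshold = warnThreshold;
        this.failThreshold = failThreshold;
    }

    /**
     * Returns a warning message if the soft limit is exceeded (null otherwise);
     * throws if the hard limit is exceeded.
     */
    public String guard(long value) {
        if (value > failThreshold)
            throw new IllegalStateException(
                name + "=" + value + " exceeds fail threshold " + failThreshold);
        if (value > warnThreshold)
            return name + "=" + value + " exceeds warn threshold " + warnThreshold;
        return null;
    }

    public static void main(String[] args) {
        Guardrail indexes = new Guardrail("secondary_indexes_per_table", 5, 10);
        System.out.println(indexes.guard(7)); // soft limit crossed: warning only
    }
}
```

The same check can back many limits (partition size, index count, disk fullness) by varying only the name and thresholds.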

Thanks,


Re: [DISCUSS] CEP-17: SSTable format API (CASSANDRA-17056)

2021-11-01 Thread David Capwell
Inline

> On Nov 1, 2021, at 9:23 AM, Branimir Lambov  wrote:
> 
> As Jacek is not a committer, this proposal needs a shepherd. I would be
> happy to take this role.
> 
>> to me the interfaces have to be at the SSTable level, which then expose
> readers/writers, but also have to expose the other things we do outside of
> those paths
> 
> Could you give some detail on what these things are? Are they something
> different from what the standalone Cassandra tools (scrub/verify/upgrade)
> are currently doing? Obviously, any pluggability proposal will have to
> provide a solution to these, and it would be helpful to know what needs to
> be done beyond making sure the bundled tools work correctly (which includes
> iterating indexes; format-specific operations (e.g. index summary
> redistribution) are excluded as they are to be handled by the individual
> format).

Looking closer at compaction and repair, I had forgotten that they were changed 
in CASSANDRA-15861 to go through the reader interface rather than directly 
mutating the files (a concurrency bug).  I was thinking of the logic which is now 
org.apache.cassandra.io.sstable.format.SSTableReader#mutateLevelAndReload and 
org.apache.cassandra.io.sstable.format.SSTableReader#mutateRepairedAndReload; 
so I believe compaction/repair may be OK with reader/writer; ignore those 
examples.

Checking usage of descriptor you find examples like

org.apache.cassandra.db.streaming.CassandraEntireSSTableStreamReader#read - 
which calls: 
writer.descriptor.getMetadataSerializer().mutate(writer.descriptor, 
description, transform);
org.apache.cassandra.tools.Util#metadataFromSSTable - which is used by 
sstablemetadata tool
org.apache.cassandra.io.sstable.KeyIterator#KeyIterator - directly loads 
primary index from descriptor: new In(new 
File(desc.filenameFor(Component.PRIMARY_INDEX)));

None of the examples I see couldn’t be rewritten to use reader/writer; so relying 
on reader/writer as the main interfaces would work.
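To make the single-factory idea from the CEP concrete, here is a toy sketch (all names are illustrative, not Cassandra's real types) of one entry point per sstable format that turns a descriptor into a reader or writer, so call sites never construct format-specific classes directly:

```java
// Toy sketch of the "single factory per sstable format" idea from the CEP;
// every name here is illustrative, not Cassandra's actual API.
public class FormatFactoryDemo {
    public interface Reader { String describe(); }
    public interface Writer { String describe(); }

    // One factory per format: the only way to obtain readers/writers, so no
    // call site news up a format-specific class (e.g. a BigTableReader) itself.
    public interface SSTableFormat {
        Reader openReader(String descriptor);
        Writer openWriter(String descriptor);
    }

    // A stand-in "big" format implementation backed by trivial lambdas.
    public static final SSTableFormat BIG = new SSTableFormat() {
        public Reader openReader(String d) { return () -> "big-reader:" + d; }
        public Writer openWriter(String d) { return () -> "big-writer:" + d; }
    };

    public static void main(String[] args) {
        // Streaming, compaction, and tools would all go through the factory.
        System.out.println(BIG.openReader("nb-1-big").describe());
    }
}
```

The point of the indirection is that a pluggable format only needs to supply its own SSTableFormat implementation; callers that today reach for the descriptor directly would instead ask the factory.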

> 
> There is another problem in the current code alluded to in the question, in
> the fact that "SSTableReader" (tied to the sstable format and ready for
> querying data (i.e. with open data files and bloom filters loaded in
> memory)) is the only concept that the code uses to work with sstables. As I
> understand it, this proposal does not aim to solve that problem, only to
> make sure that we can properly read and write sstables of a given format,
> including in streaming and standalone tools. In other words, to provide the
> machinery to convert sstable descriptors into sstable readers and writers.
> 
> I see this as an expansion of CASSANDRA-7443 and cleanup of any changes
> that came after it and broke the intended capability.
> 
> Regards,
> Branimir
> 
> On Thu, Oct 28, 2021 at 7:43 PM David Capwell 
> wrote:
> 
>> Sorry about that; used -1/+1 to show preference, not binding action
>> 
>>> On Oct 28, 2021, at 5:50 AM, bened...@apache.org wrote:
>>> 
 I am -1 here, for the reasons listed above; the problem (in my eye) is
>> not reader/writer but higher level at the actual SSTable.  If we plug out
>> read/write but still allow direct file access, then these abstractions fail
>> to provide the goals of the CEP.
>>> 
>>> Be careful dropping -1s, as your -1s here are binding. I realise this
>> isn’t a vote thread, but the effect is the same. IMO we should try to
>> express our preferences and defer to the collective opinion where possible.
>> True -1s should very rarely appear.
>>> 
>>> 
>>> From: David Capwell 
>>> Date: Wednesday, 27 October 2021 at 15:33
>>> To: dev@cassandra.apache.org 
>>> Subject: Re: [DISCUSS] CEP-17: SSTable format API (CASSANDRA-17056)
>>> Reading the CEP I don’t see any mention to the systems which access
>> SSTables; such as streaming (small callout to zero-copy-streaming with
>> ZeroCopyBigTableWriter) and repair.  If you are abstracting out
>> BigTableReader then you are not dealing with the implementation assumptions
>> that users of SSTables have (such as direct mutation to auxiliary files
>> outside of -Data.db).
>>> 
 Audience
  • Cassandra developers who wish to see SSTableReader and
>> SSTableWriter more modular than they are today,
>>> 
>>> This statement relates to the above comment, many parts of the code do
>> not use Reader/Writer but instead use direct format knowledge to apply
>> changes to the file format (normally outside of -Data.db); to me the
>> interfaces has to be at the SSTable level, which then expose
>> readers/writers, but also has to expose the other things we do outside of
>> those paths.
>>> 
  • move the metrics related to sstable format out from
>> TableMetrics class and make them tied to certain sstable implementation
>>> 
>>> I am curious about this comment, are you removing exposing this
>> information?
>>> 
  • have a single factory for creating both readers and writers for
>> particular implementation of sstable and use it consistently - no direct
>> creation of any read

Re: [DISCUSS] CEP-3: Guardrails

2021-11-01 Thread Jeff Jirsa
Without bike-shedding too much, guardrails would be great, building them
into a more general purpose framework that limits various dangerous things
would be fantastic. The CEP says that the guardrails should be distinct
from the capability restrictions (
https://issues.apache.org/jira/browse/CASSANDRA-8303 ), but I don't see why
that needs to be the case. A system-level guardrail and a personal-level
guardrail are both restrictions, they just have different scopes, so
implement the restriction framework first, and allow the scopes to be
expanded as needed?

Naming-wise, I don't know that I'd actually surface these as "guardrails",
but more as general "limits"; and having them only configured via yaml
seems like a bad outcome.



https://issues.apache.org/jira/browse/CASSANDRA-8303



On Mon, Nov 1, 2021 at 9:31 AM Andrés de la Peña 
wrote:

> Hi everyone,
>
> I'd like to start a discussion about Guardrails proposal:
>
> https://cwiki.apache.org/confluence/display/CASSANDRA/%28DRAFT%29+-+CEP-3%3A+Guardrails
>
> Guardrails are an easy way to enforce system-wide soft and hard limits to
> prevent anti-patterns of bad usage and in the long run make it not possible
> to severely degrade the performance of a node/cluster through user actions
> such as having too many secondary indexes, too large partitions, almost
> full disks, etc.
>
> Thanks,
>


Re: [DISCUSS] CEP-3: Guardrails

2021-11-01 Thread David Capwell
Under "Migrating existing cassandra.yaml warn/fail thresholds”, I recently 
added a few things which are basically guardrails, so should be included in 
this set; they are configured by track_warnings (coordinator_read_size, 
local_read_size, and row_index_size).  With track_warnings I setup the plumbing 
to have read queries trigger warnings (or abort the query) to the client exists 
(under "Event logging" you mention "and also to the client connection when 
applicable”) and isn’t limited to the coordinator participating in the query 
(previous limitation for tombstone warnings).  One thing I found which was 
problematic for track_warnings was that altering clients is annoying as java 
and python both ignore the error message we send (see 
https://github.com/datastax/java-driver/blob/3.11.0/driver-core/src/main/java/com/datastax/driver/core/Responses.java#L73-L131).
 We log client warnings (if enabled) but ignore any detailed error message 
received from the server; it would be good to talk about client integrations 
and how users are informed of issues in more detail.


> On Nov 1, 2021, at 9:46 AM, Jeff Jirsa  wrote:
> 
> Without bike-shedding too much, guardrails would be great, building them
> into a more general purpose framework that limits various dangerous things
> would be fantastic. The CEP says that the guardrails should be distinct
> from the capability restrictions (
> https://issues.apache.org/jira/browse/CASSANDRA-8303 ), but I don't see why
> that needs to be the case. A system-level guardrail and a personal-level
> guardrail are both restrictions, they just have different scopes, so
> implement the restriction framework first, and allow the scopes to be
> expanded as needed?
> 
> Naming wise, I don't know that I'd actually surface these as "guardrails",
> but more as general "limits", and having them only configured via yaml
> seems like a bad outcome
> 
> 
> 
> https://issues.apache.org/jira/browse/CASSANDRA-8303
> 
> 
> 
> On Mon, Nov 1, 2021 at 9:31 AM Andrés de la Peña 
> wrote:
> 
>> Hi everyone,
>> 
>> I'd like to start a discussion about Guardrails proposal:
>> 
>> https://cwiki.apache.org/confluence/display/CASSANDRA/%28DRAFT%29+-+CEP-3%3A+Guardrails
>> 
>> Guardrails are an easy way to enforce system-wide soft and hard limits to
>> prevent anti-patterns of bad usage and in the long run make it not possible
>> to severely degrade the performance of a node/cluster through user actions
>> such as having too many secondary indexes, too large partitions, almost
>> full disks, etc.
>> 
>> Thanks,
>> 


-
To unsubscribe, e-mail: dev-unsubscr...@cassandra.apache.org
For additional commands, e-mail: dev-h...@cassandra.apache.org



Re: [DISCUSS] CEP-3: Guardrails

2021-11-01 Thread C. Scott Andreas

Thank you for starting discussion on this CEP, Andrés!

Can the "Scope" section of the doc be filled out? It currently reads "TBD," but 
having a better understanding of the scope of work would help focus discussion: 
https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-3%3A+Guardrails

Re: configuration via yaml, it will be important that these guardrails can be 
modified via JMX as well - e.g., in the case of a user running up against a limit 
and needing a path to being unblocked that doesn't require a yaml change and 
rolling restart.

As David mentions, raising the visibility of the soft limit warnings will help 
users avoid being caught off-guard. Enabling the logging of wire protocol 
warnings received in CQL responses in the drivers by default would help if 
related JIRA tickets in those projects can be considered.

Re: [DISCUSS] CEP-3: Guardrails

2021-11-01 Thread Patrick McFadin
"it will be important that these guardrails can be modified via JMX as well"

I think you all know my feels on JMX. Maybe this is something we can go
straight to virtual tables?


On Mon, Nov 1, 2021 at 12:12 PM C. Scott Andreas 
wrote:

> Thank you for starting discussion on this CEP, Andrés!
>
> Can the "Scope" section of the doc be filled out? It currently reads
> "TBD," but having a better understanding of the scope of work would help
> focus discussion:
> https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-3%3A+Guardrails
>
> Re: configuration via yaml, it will be important that these guardrails can
> be modified via JMX as well - e.g., in the case of a user running up
> against a limit and needing a path to being unblocked that doesn't require
> a yaml change and rolling restart.
>
> As David mentions, raising the visibility of the soft limit warnings will
> help users avoid being caught off-guard. Enabling the logging of wire
> protocol warnings received in CQL responses in the drivers by default would
> help if related JIRA tickets in those projects can be considered.
>
> On Nov 1, 2021, at 10:05 AM, David Capwell 
> wrote:
>
>
> Under "Migrating existing cassandra.yaml warn/fail thresholds”, I recently
> added a few things which are basically guardrails, so should be included in
> this set; they are configured by track_warnings (coordinator_read_size,
> local_read_size, and row_index_size). With track_warnings I setup the
> plumbing to have read queries trigger warnings (or abort the query) to the
> client exists (under "Event logging" you mention "and also to the client
> connection when applicable”) and isn’t limited to the coordinator
> participating in the query (previous limitation for tombstone warnings).
> One thing I found which was problematic for track_warnings was that
> altering clients is annoying as java and python both ignore the error
> message we send (see
> https://github.com/datastax/java-driver/blob/3.11.0/driver-core/src/main/java/com/datastax/driver/core/Responses.java#L73-L131).
> We log client warnings (if enabled) but ignore any detailed error message
> received from the server; it would be good to talk about client
> integrations and how users are informed of issues in more detail.
>
>
> On Nov 1, 2021, at 9:46 AM, Jeff Jirsa  wrote:
>
> Without bike-shedding too much, guardrails would be great, building them
> into a more general purpose framework that limits various dangerous things
> would be fantastic. The CEP says that the guardrails should be distinct
> from the capability restrictions (
> https://issues.apache.org/jira/browse/CASSANDRA-8303 ), but I don't see
> why
> that needs to be the case. A system-level guardrail and a personal-level
> guardrail are both restrictions, they just have different scopes, so
> implement the restriction framework first, and allow the scopes to be
> expanded as needed?
>
> Naming wise, I don't know that I'd actually surface these as "guardrails",
> but more as general "limits", and having them only configured via yaml
> seems like a bad outcome
>
>
>
> https://issues.apache.org/jira/browse/CASSANDRA-8303
>
>
>
> On Mon, Nov 1, 2021 at 9:31 AM Andrés de la Peña 
> wrote:
>
> Hi everyone,
>
> I'd like to start a discussion about Guardrails proposal:
>
>
> https://cwiki.apache.org/confluence/display/CASSANDRA/%28DRAFT%29+-+CEP-3%3A+Guardrails
>
> Guardrails are an easy way to enforce system-wide soft and hard limits to
> prevent anti-patterns of bad usage and in the long run make it not possible
> to severely degrade the performance of a node/cluster through user actions
> such as having too many secondary indexes, too large partitions, almost
> full disks, etc.
>
> Thanks,
>
>
>
>
>
>
>
>


Re: [DISCUSS] CEP-3: Guardrails

2021-11-01 Thread bened...@apache.org
> having them only configured via yaml seems like a bad outcome

+1

I would like to see us move towards configuration being driven through virtual 
tables where possible, so that the whole cluster can be managed from a single 
interface. Not sure if this is the right place to bite this off, but perhaps?

From: Jeff Jirsa 
Date: Monday, 1 November 2021 at 16:47
To: Cassandra DEV 
Subject: Re: [DISCUSS] CEP-3: Guardrails
Without bike-shedding too much, guardrails would be great, building them
into a more general purpose framework that limits various dangerous things
would be fantastic. The CEP says that the guardrails should be distinct
from the capability restrictions (
https://issues.apache.org/jira/browse/CASSANDRA-8303 ), but I don't see why
that needs to be the case. A system-level guardrail and a personal-level
guardrail are both restrictions, they just have different scopes, so
implement the restriction framework first, and allow the scopes to be
expanded as needed?

Naming wise, I don't know that I'd actually surface these as "guardrails",
but more as general "limits", and having them only configured via yaml
seems like a bad outcome



https://issues.apache.org/jira/browse/CASSANDRA-8303



On Mon, Nov 1, 2021 at 9:31 AM Andrés de la Peña 
wrote:

> Hi everyone,
>
> I'd like to start a discussion about Guardrails proposal:
>
> https://cwiki.apache.org/confluence/display/CASSANDRA/%28DRAFT%29+-+CEP-3%3A+Guardrails
>
> Guardrails are an easy way to enforce system-wide soft and hard limits to
> prevent anti-patterns of bad usage and in the long run make it not possible
> to severely degrade the performance of a node/cluster through user actions
> such as having too many secondary indexes, too large partitions, almost
> full disks, etc.
>
> Thanks,
>


Re: [DISCUSS] CEP-3: Guardrails

2021-11-01 Thread David Capwell
If anyone wants to bite off making 
https://github.com/apache/cassandra/blob/ab920c30310a8c095ba76b363142b8e74cbf0a0a/src/java/org/apache/cassandra/db/virtual/SettingsTable.java
support mutability, then we get vtable support.  I am cool with JMX and/or 
vtable; to me it's just more important to allow dynamic setting of these configs.

> On Nov 1, 2021, at 10:36 AM, bened...@apache.org wrote:
> 
>> having them only configured via yaml seems like a bad outcome
> 
> +1
> 
> I would like to see us move towards configuration being driven through 
> virtual tables where possible, so that the whole cluster can be managed from 
> a single interface. Not sure if this is the right place to bite this off, but 
> perhaps?
> 
> From: Jeff Jirsa 
> Date: Monday, 1 November 2021 at 16:47
> To: Cassandra DEV 
> Subject: Re: [DISCUSS] CEP-3: Guardrails
> Without bike-shedding too much, guardrails would be great, building them
> into a more general purpose framework that limits various dangerous things
> would be fantastic. The CEP says that the guardrails should be distinct
> from the capability restrictions (
> https://issues.apache.org/jira/browse/CASSANDRA-8303 ), but I don't see why
> that needs to be the case. A system-level guardrail and a personal-level
> guardrail are both restrictions, they just have different scopes, so
> implement the restriction framework first, and allow the scopes to be
> expanded as needed?
> 
> Naming wise, I don't know that I'd actually surface these as "guardrails",
> but more as general "limits", and having them only configured via yaml
> seems like a bad outcome
> 
> 
> 
> https://issues.apache.org/jira/browse/CASSANDRA-8303
> 
> 
> 
> On Mon, Nov 1, 2021 at 9:31 AM Andrés de la Peña 
> wrote:
> 
>> Hi everyone,
>> 
>> I'd like to start a discussion about Guardrails proposal:
>> 
>> https://cwiki.apache.org/confluence/display/CASSANDRA/%28DRAFT%29+-+CEP-3%3A+Guardrails
>> 
>> Guardrails are an easy way to enforce system-wide soft and hard limits to
>> prevent anti-patterns of bad usage and in the long run make it not possible
>> to severely degrade the performance of a node/cluster through user actions
>> such as having too many secondary indexes, too large partitions, almost
>> full disks, etc.
>> 
>> Thanks,
>> 



Re: [DISCUSS] CEP-3: Guardrails

2021-11-01 Thread C. Scott Andreas

Re: "I think you all know my feels on JMX."

Super fair - I'd meant to speak in terms of desired outcome ("the feature 
should be dynamically configurable at runtime") rather than implementation 
("this should be via JMX"). 👍

Re: [DISCUSS] Releasable trunk and quality

2021-11-01 Thread David Capwell
> How do we define what "releasable trunk" means?

One thing I would love is for us to adopt a “run all tests needed to release 
before commit” mentality, and to link a successful run in JIRA when closing (we 
talked about this once in slack).  If we look at CircleCI we currently do not 
run all the tests needed to sign off; below are the tests disabled in the 
“pre-commit” workflows (see 
https://github.com/apache/cassandra/blob/trunk/.circleci/config-2_1.yml#L381):

start_utests_long
start_utests_compression
start_utests_stress
start_utests_fqltool
start_utests_system_keyspace_directory
start_jvm_upgrade_dtest
start_upgrade_tests

Given the configuration right now we have to opt in to upgrade tests, but we 
can’t release if those are broken (same for compression/fqltool/cdc (not 
covered in circle)).
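For context on those opt-in steps: in a CircleCI workflow, a job with `type: approval` gates its dependents until someone approves it in the UI. A toy sketch, using a made-up fragment rather than the real config-2_1.yml:

```python
# Toy stand-in for a fragment of a CircleCI workflow definition; the real
# .circleci/config-2_1.yml is far larger. Jobs with `type: approval` are
# the opt-in gates (start_upgrade_tests etc.) listed above, and anything
# that `requires` them will not run until they are approved.
config = """\
workflows:
  pre-commit:
    jobs:
      - build
      - start_upgrade_tests: {type: approval}
      - upgrade_tests: {requires: [start_upgrade_tests]}
      - start_utests_long: {type: approval}
      - utests_long: {requires: [start_utests_long]}
"""

# Collect the names of the approval (opt-in) jobs from the fragment.
approval_jobs = [line.strip().lstrip("- ").split(":")[0]
                 for line in config.splitlines()
                 if "type: approval" in line]
print(approval_jobs)
```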

> On Oct 30, 2021, at 6:24 AM, bened...@apache.org wrote:
> 
>> How do we define what "releasable trunk" means?
> 
> For me, the major criterion is ensuring that work is not merged that is known 
> to require follow-up work, or could reasonably have been known to require 
> follow-up work if better QA practices had been followed.
> 
> So, a big part of this is ensuring we continue to exceed our targets for 
> improved QA. For me this means trying to weave tools like Harry and the 
> Simulator into our development workflow early on, but we’ll see how well 
> these tools gain broader adoption. This also means focus in general on 
> possible negative effects of a change.
> 
> I think we could do with producing guidance documentation for how to approach 
> QA, where we can record our best practices and evolve them as we discover 
> flaws or pitfalls, either for ergonomics or for bug discovery.
> 
>> What are the benefits of having a releasable trunk as defined here?
> 
> If we want to have any hope of meeting reasonable release cadences _and_ the 
> high project quality we expect today, then I think a ~shippable trunk policy 
> is an absolute necessity.
> 
> I don’t think this means guaranteeing there are no failing tests (though ideally 
> this would also happen), but about ensuring our best practices are followed 
> for every merge. 4.0 took so long to release because of the amount of hidden 
> work that was created by merging work that didn’t meet the standard for 
> release.
> 
> Historically we have also had significant pressure to backport features to 
> earlier versions due to the cost and risk of upgrading. If we maintain 
> broader version compatibility for upgrade, and reduce the risk of adopting 
> newer versions, then this pressure is also reduced significantly. Though 
> perhaps we will stick to our guns here anyway, as there seems to be renewed 
> pressure to limit work in GA releases to bug fixes exclusively. It remains to 
> be seen if this holds.
> 
>> What are the costs?
> 
> I think the costs are quite low, perhaps even negative. Hidden work produced 
> by merges that break things can be much more costly than getting the work 
> right first time, as attribution is much more challenging.
> 
> One cost that is created, however, is for version compatibility as we cannot 
> say “well, this is a minor version bump so we don’t need to support 
> downgrade”. But I think we should be investing in this anyway for operator 
> simplicity and confidence, so I actually see this as a benefit as well.
> 
>> Full disclosure: running face-first into 60+ failing tests on trunk
> 
> I have to apologise here. CircleCI did not uncover these problems, apparently 
> due to some way it resolves dependencies, and so I am responsible for a 
> significant number of these and have been quite sick since.
> 
> I think a push to eliminate flaky tests will probably help here in future, 
> though, and perhaps the project needs to have some (low) threshold of flaky 
> or failing tests at which point we block merges to force a correction.
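That threshold could be as simple as the sketch below; the results format and tooling are hypothetical, not something the project has today:

```python
# Hypothetical merge gate: block the merge once the number of failing
# (or known-flaky) tests passes a threshold. The results dict maps a
# test name to whether it passed.
def merge_allowed(test_results, max_failing=10):
    failing = [name for name, passed in test_results.items() if not passed]
    return len(failing) <= max_failing, failing

ok, failing = merge_allowed({"test_a": True, "test_b": False}, max_failing=0)
print(ok, failing)  # a zero threshold blocks on any failure
```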
> 
> 
> From: Joshua McKenzie 
> Date: Saturday, 30 October 2021 at 14:00
> To: dev@cassandra.apache.org 
> Subject: [DISCUSS] Releasable trunk and quality
> We as a project have gone back and forth on the topic of quality and the
> notion of a releasable trunk for quite a few years. If people are
> interested, I'd like to rekindle this discussion a bit and see if we're
> happy with where we are as a project or if we think there's steps we should
> take to change the quality bar going forward. The following questions have
> been rattling around for me for awhile:
> 
> 1. How do we define what "releasable trunk" means? All reviewed by M
> committers? Passing N% of tests? Passing all tests plus some other metrics
> (manual testing, raising the number of reviewers, test coverage, usage in
> dev or QA environments, etc)? Something else entirely?
> 
> 2. With a definition settled upon in #1, what steps, if any, do we need to
> take to get from where we are to having *and keeping* that releasable
> trunk? Anything to codify there?
> 
> 3. What are the benefits of having a releasable trunk as defined here

Re: [DISCUSS] Releasable trunk and quality

2021-11-01 Thread David Capwell
> I have to apologise here. CircleCI did not uncover these problems, apparently 
> due to some way it resolves dependencies,

I double-checked your CircleCI run for the trunk branch, and the problem is 
not with how it “resolves dependencies”; the problem is that our CI is too 
complex and doesn’t natively support multi-branch commits.

1. Right now you need to opt in to 2 builds to run the single jvm-dtest 
upgrade test build (missed in your CI); this should not be opt-in (see my 
previous comment about this), and it really shouldn’t be 2 approvals for a 
single build…
2. Enabling “upgrade tests” does not run all the upgrade tests… you need to 
approve 2 other builds to run the full set of upgrade tests (see problem 
above). I see that in your build you ran the upgrade tests, which only touch 
the python-dtest upgrade tests.
3. Lastly, you need to hack the CircleCI configuration to support 
multi-branch CI; if you do not, it will run against whatever is already 
committed to 2.2, 3.0, 3.11, and 4.0. Multi-branch commits are very normal 
for our project, but doing CI properly in these cases is way too hard (you 
cannot do multi-branch tests in Jenkins 
https://ci-cassandra.apache.org/view/patches/job/Cassandra-devbranch-test/build 
; there is no support to run against your other branches).

> On Nov 1, 2021, at 3:03 PM, David Capwell  wrote:
> 
>> How do we define what "releasable trunk" means?
> 
> One thing I would love is for us to adopt a “run all tests needed to release 
> before commit” mentality, and to link a successful run in JIRA when closing 
> (we talked about this once in slack).  If we look at CircleCI we currently do 
> not run all the tests needed to sign off; below are the tests disabled in the 
> “pre-commit” workflows (see 
> https://github.com/apache/cassandra/blob/trunk/.circleci/config-2_1.yml#L381):
> 
> start_utests_long
> start_utests_compression
> start_utests_stress
> start_utests_fqltool
> start_utests_system_keyspace_directory
> start_jvm_upgrade_dtest
> start_upgrade_tests
> 
> Given the configuration right now we have to opt in to upgrade tests, but we 
> can’t release if those are broken (same for compression/fqltool/cdc (not 
> covered in circle)).
> 
>> On Oct 30, 2021, at 6:24 AM, bened...@apache.org wrote:
>> 
>>> How do we define what "releasable trunk" means?
>> 
>> For me, the major criterion is ensuring that work is not merged that is known 
>> to require follow-up work, or could reasonably have been known to require 
>> follow-up work if better QA practices had been followed.
>> 
>> So, a big part of this is ensuring we continue to exceed our targets for 
>> improved QA. For me this means trying to weave tools like Harry and the 
>> Simulator into our development workflow early on, but we’ll see how well 
>> these tools gain broader adoption. This also means focus in general on 
>> possible negative effects of a change.
>> 
>> I think we could do with producing guidance documentation for how to 
>> approach QA, where we can record our best practices and evolve them as we 
>> discover flaws or pitfalls, either for ergonomics or for bug discovery.
>> 
>>> What are the benefits of having a releasable trunk as defined here?
>> 
>> If we want to have any hope of meeting reasonable release cadences _and_ the 
>> high project quality we expect today, then I think a ~shippable trunk policy 
>> is an absolute necessity.
>> 
>> I don’t think this means guaranteeing there are no failing tests (though ideally 
>> this would also happen), but about ensuring our best practices are followed 
>> for every merge. 4.0 took so long to release because of the amount of hidden 
>> work that was created by merging work that didn’t meet the standard for 
>> release.
>> 
>> Historically we have also had significant pressure to backport features to 
>> earlier versions due to the cost and risk of upgrading. If we maintain 
>> broader version compatibility for upgrade, and reduce the risk of adopting 
>> newer versions, then this pressure is also reduced significantly. Though 
>> perhaps we will stick to our guns here anyway, as there seems to be renewed 
>> pressure to limit work in GA releases to bug fixes exclusively. It remains 
>> to be seen if this holds.
>> 
>>> What are the costs?
>> 
>> I think the costs are quite low, perhaps even negative. Hidden work produced 
>> by merges that break things can be much more costly than getting the work 
>> right first time, as attribution is much more challenging.
>> 
>> One cost that is created, however, is for version compatibility as we cannot 
>> say “well, this is a minor version bump so we don’t need to support 
>> downgrade”. But I think we should be investing in this anyway for operator 
>> simplicity and confidence, so I actually see this as a benefit as well.
>> 
>>> Full disclosure: running face-first into 60+ failing tests on trunk
>> 
>> I have to apologise here. CircleCI did 

Re: [DISCUSS] Releasable trunk and quality

2021-11-01 Thread Jacek Lewandowski
>
> I don’t think this means guaranteeing there are no failing tests (though
> ideally this would also happen), but about ensuring our best practices are
> followed for every merge. 4.0 took so long to release because of the amount
> of hidden work that was created by merging work that didn’t meet the
> standard for release.
>

Tests are sometimes considered flaky because they fail intermittently, but
the failures may not be caused by an insufficiently deterministic test
implementation and can reveal a real problem in the production code. I have
seen that in various codebases, and I think it would be great if each such
test (or test group) were guaranteed to have a ticket, with some preliminary
analysis done to confirm it is just a test problem, before releasing the new
version.

Historically we have also had significant pressure to backport features to
> earlier versions due to the cost and risk of upgrading. If we maintain
> broader version compatibility for upgrade, and reduce the risk of adopting
> newer versions, then this pressure is also reduced significantly. Though
> perhaps we will stick to our guns here anyway, as there seems to be renewed
> pressure to limit work in GA releases to bug fixes exclusively. It remains
> to be seen if this holds.


Are there any precise requirements for supported upgrade and downgrade
paths?

Thanks
- - -- --- -  -
Jacek Lewandowski


On Sat, Oct 30, 2021 at 4:07 PM bened...@apache.org 
wrote:

> > How do we define what "releasable trunk" means?
>
> For me, the major criterion is ensuring that work is not merged that is
> known to require follow-up work, or could reasonably have been known to
> require follow-up work if better QA practices had been followed.
>
> So, a big part of this is ensuring we continue to exceed our targets for
> improved QA. For me this means trying to weave tools like Harry and the
> Simulator into our development workflow early on, but we’ll see how well
> these tools gain broader adoption. This also means focus in general on
> possible negative effects of a change.
>
> I think we could do with producing guidance documentation for how to
> approach QA, where we can record our best practices and evolve them as we
> discover flaws or pitfalls, either for ergonomics or for bug discovery.
>
> > What are the benefits of having a releasable trunk as defined here?
>
> If we want to have any hope of meeting reasonable release cadences _and_
> the high project quality we expect today, then I think a ~shippable trunk
> policy is an absolute necessity.
>
> I don’t think this means guaranteeing there are no failing tests (though
> ideally this would also happen), but about ensuring our best practices are
> followed for every merge. 4.0 took so long to release because of the amount
> of hidden work that was created by merging work that didn’t meet the
> standard for release.
>
> Historically we have also had significant pressure to backport features to
> earlier versions due to the cost and risk of upgrading. If we maintain
> broader version compatibility for upgrade, and reduce the risk of adopting
> newer versions, then this pressure is also reduced significantly. Though
> perhaps we will stick to our guns here anyway, as there seems to be renewed
> pressure to limit work in GA releases to bug fixes exclusively. It remains
> to be seen if this holds.
>
> > What are the costs?
>
> I think the costs are quite low, perhaps even negative. Hidden work
> produced by merges that break things can be much more costly than getting
> the work right first time, as attribution is much more challenging.
>
> One cost that is created, however, is for version compatibility as we
> cannot say “well, this is a minor version bump so we don’t need to support
> downgrade”. But I think we should be investing in this anyway for operator
> simplicity and confidence, so I actually see this as a benefit as well.
>
> > Full disclosure: running face-first into 60+ failing tests on trunk
>
> I have to apologise here. CircleCI did not uncover these problems,
> apparently due to some way it resolves dependencies, and so I am
> responsible for a significant number of these and have been quite sick
> since.
>
> I think a push to eliminate flaky tests will probably help here in future,
> though, and perhaps the project needs to have some (low) threshold of flaky
> or failing tests at which point we block merges to force a correction.
>
>
> From: Joshua McKenzie 
> Date: Saturday, 30 October 2021 at 14:00
> To: dev@cassandra.apache.org 
> Subject: [DISCUSS] Releasable trunk and quality
> We as a project have gone back and forth on the topic of quality and the
> notion of a releasable trunk for quite a few years. If people are
> interested, I'd like to rekindle this discussion a bit and see if we're
> happy with where we are as a project or if we think there's steps we should
> take to change the quality bar going forward. The following questions have
> been rattling around for me for a

Re: [DISCUSS] Releasable trunk and quality

2021-11-01 Thread Berenguer Blasi
Hi,

we already have a way to confirm flakiness on circle by running the test
repeatedly N times. Like 100 or 500. That has proven to work very well
so far, at least for me. #collaborating #justfyi
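The same repeat-N idea, sketched in Python; the test callable below is a stand-in for invoking a real test class, not the actual circle job:

```python
import random

# Rough sketch of the "run it N times" flakiness check: run the same test
# repeatedly (e.g. 100 or 500 times) and report what fraction of runs fail.
def flake_rate(test_fn, runs=100):
    failures = sum(1 for _ in range(runs) if not test_fn())
    return failures / runs

random.seed(42)
flaky = lambda: random.random() > 0.05  # stand-in test that fails ~5% of runs
print(f"{flake_rate(flaky, runs=500):.1%} of runs failed")
```

A rate of exactly zero over a few hundred runs is the signal used here to call a fix good.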

On the 60+ failures, it is not as bad as it looks. Let me explain. I have
been tracking failures in 4.0 and trunk daily; it's grown into a habit
after the 4.0 push. 4.0 and trunk were hovering solidly around <10
failures (you can check the Jenkins CI graphs). The occasional bisect or
fix was needed, leaving behind the 3 or 4 tests that have already defeated
2 or 3 committers, i.e. the really tough ones. I am reasonably convinced
that once the fix for the 60+ failures merges we'll be back to <10
failures with relatively little effort.

So we're just in the middle of a 'fix', but overall things aren't as bad
as they look now, as we've been quite good at keeping CI green-ish imo.

Also +1 to releasable branches; whatever we settle on, it should mean no
wall of failures, because of the reasons explained, like the hidden costs,
etc.

My 2cts.

On 2/11/21 6:07, Jacek Lewandowski wrote:
>> I don’t think this means guaranteeing there are no failing tests (though
>> ideally this would also happen), but about ensuring our best practices are
>> followed for every merge. 4.0 took so long to release because of the amount
>> of hidden work that was created by merging work that didn’t meet the
>> standard for release.
>>
> Tests are sometimes considered flaky because they fail intermittently but
> it may not be related to the insufficiently consistent test implementation
> and can reveal some real problem in the production code. I saw that in
> various codebases and I think that it would be great if each such test (or
> test group) was guaranteed to have a ticket and some preliminary analysis
> was done to confirm it is just a test problem before releasing the new
> version
>
> Historically we have also had significant pressure to backport features to
>> earlier versions due to the cost and risk of upgrading. If we maintain
>> broader version compatibility for upgrade, and reduce the risk of adopting
>> newer versions, then this pressure is also reduced significantly. Though
>> perhaps we will stick to our guns here anyway, as there seems to be renewed
>> pressure to limit work in GA releases to bug fixes exclusively. It remains
>> to be seen if this holds.
>
> Are there any precise requirements for supported upgrade and downgrade
> paths?
>
> Thanks
> - - -- --- -  -
> Jacek Lewandowski
>
>
> On Sat, Oct 30, 2021 at 4:07 PM bened...@apache.org 
> wrote:
>
>>> How do we define what "releasable trunk" means?
>> For me, the major criterion is ensuring that work is not merged that is
>> known to require follow-up work, or could reasonably have been known to
>> require follow-up work if better QA practices had been followed.
>>
>> So, a big part of this is ensuring we continue to exceed our targets for
>> improved QA. For me this means trying to weave tools like Harry and the
>> Simulator into our development workflow early on, but we’ll see how well
>> these tools gain broader adoption. This also means focus in general on
>> possible negative effects of a change.
>>
>> I think we could do with producing guidance documentation for how to
>> approach QA, where we can record our best practices and evolve them as we
>> discover flaws or pitfalls, either for ergonomics or for bug discovery.
>>
>>> What are the benefits of having a releasable trunk as defined here?
>> If we want to have any hope of meeting reasonable release cadences _and_
>> the high project quality we expect today, then I think a ~shippable trunk
>> policy is an absolute necessity.
>>
>> I don’t think this means guaranteeing there are no failing tests (though
>> ideally this would also happen), but about ensuring our best practices are
>> followed for every merge. 4.0 took so long to release because of the amount
>> of hidden work that was created by merging work that didn’t meet the
>> standard for release.
>>
>> Historically we have also had significant pressure to backport features to
>> earlier versions due to the cost and risk of upgrading. If we maintain
>> broader version compatibility for upgrade, and reduce the risk of adopting
>> newer versions, then this pressure is also reduced significantly. Though
>> perhaps we will stick to our guns here anyway, as there seems to be renewed
>> pressure to limit work in GA releases to bug fixes exclusively. It remains
>> to be seen if this holds.
>>
>>> What are the costs?
>> I think the costs are quite low, perhaps even negative. Hidden work
>> produced by merges that break things can be much more costly than getting
>> the work right first time, as attribution is much more challenging.
>>
>> One cost that is created, however, is for version compatibility as we
>> cannot say “well, this is a minor version bump so we don’t need to support
>> downgrade”. But I think we should be investing in this anyway for operator
>> simplicity and