Re: [DISCUSS] Limiting query results by size (CASSANDRA-11745)

2023-07-10 Thread Jacek Lewandowski
Given what was said, I propose rephrasing this functionality to limit the
memory used to execute a query. We will not expose the page size measured
in bytes to the client. Instead, an upper limit will be a guardrail so that
we won't fetch more data.

Aggregation query with grouping is a special case in which we would count
only those columns marked as queried in a ColumnFilter for a grouped result
(maximum sizes of those columns in a group).

This way, we can still achieve the goal of making the server more stable
under heavy load. Letting the user specify a page size in bytes is indeed a
separate story, as the result set size needs to be measured on a higher
level, where the selectors are applied.
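
To make the guardrail idea concrete, here is a minimal sketch in Python (not Cassandra code). All names (`Page`, `row_size`, `group_size`, `fetch_page`, `max_bytes`) are hypothetical, rows are modelled as plain dicts of column name to serialized bytes, and the `queried` set stands in for the columns a ColumnFilter marks as queried. It assumes the page is cut as soon as the byte budget is reached, so a short page can still report that more pages exist.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Sequence, Set, Tuple

Row = Dict[str, bytes]  # hypothetical model: column name -> serialized value


@dataclass
class Page:
    rows: List[Row]
    has_more_pages: bool


def row_size(row: Row, queried: Set[str]) -> int:
    """Bytes counted for one row: only the queried columns contribute."""
    return sum(len(value) for name, value in row.items() if name in queried)


def group_size(rows: Iterable[Row], queried: Set[str]) -> int:
    """Approximate size of a grouped result: for each queried column,
    take the largest value seen in the group, then sum those maxima."""
    maxima: Dict[str, int] = {}
    for row in rows:
        for name, value in row.items():
            if name in queried:
                maxima[name] = max(maxima.get(name, 0), len(value))
    return sum(maxima.values())


def fetch_page(rows: Sequence[Row], start: int, queried: Set[str],
               page_rows: int, max_bytes: int) -> Tuple[Page, int]:
    """Collect rows from `start` until the row count or the byte budget
    is reached. Returns the page and the position to resume from."""
    out: List[Row] = []
    used = 0
    pos = start
    while pos < len(rows) and len(out) < page_rows:
        out.append(rows[pos])
        used += row_size(rows[pos], queried)
        pos += 1
        if used >= max_bytes:
            break  # guardrail reached: stop fetching, return a short page
    return Page(out, has_more_pages=pos < len(rows)), pos
```

In this sketch the client never sees `max_bytes`; it only sees that a page may hold fewer rows than requested while `has_more_pages` is still true.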

thanks,
Jacek


On Tue, 13 Jun 2023 at 10:42, Benjamin Lerer wrote:

>> So my other question - for aggregation with the "group by" clause, we
>> return an aggregated row which is computed from a group of rows - with my
>> current implementation, it is approximated by counting the size of the
>> largest row in that group - I think it is the safest and simplest
>> approximation - wdyt?
>
>
> I feel that there is something that was not discussed here. The storage
> engine can return some rows that are much larger than the actual row
> returned to the user, depending on the projections being used. Therefore
> there will only be a reliable match between the size of the page loaded
> internally and the size of the page returned to the user when the full row
> is queried without transformation. For all the other cases the difference
> can be really significant. For group by queries doing a count(*), the
> approach suggested will return a page size that is totally off from what
> was requested.
>
> On Tue, 13 Jun 2023 at 07:00, Jacek Lewandowski <
> lewandowski.ja...@gmail.com> wrote:
>
>> Josh, that answers my question exactly; thank you.
>>
>> I will not implement limiting the result set in CQL (that is, by LIMIT
>> clause) and stay with just paging. Whether the page size is defined in
>> bytes or rows can be determined by a flag - there are many unused bits for
>> that.
>>
>> So my other question - for aggregation with the "group by" clause, we
>> return an aggregated row which is computed from a group of rows - with my
>> current implementation, it is approximated by counting the size of the
>> largest row in that group - I think it is the safest and simplest
>> approximation - wdyt?
>>
>>
>> On Mon, 12 Jun 2023 at 22:55, Josh McKenzie wrote:
>>
>>> As long as it is valid in the paging protocol to return a short page,
>>> but still say “there are more pages”, I think that is fine to do that.
>>>
>>> Thankfully the v3-v5 spec all make it clear that clients need to respect
>>> what the server has to say about there being more pages:
>>> https://github.com/apache/cassandra/blob/trunk/doc/native_protocol_v5.spec#L1247-L1253
>>>
>>>   - Clients should not rely on the actual size of the result set
>>> returned to decide if there are more results to fetch or not. Instead,
>>> they should always check the Has_more_pages flag (unless they did not
>>> enable paging for the query obviously). Clients should also not assert
>>> that no result will have more than <result_page_size> results. While
>>> the current implementation always respects the exact value of
>>> <result_page_size>, we reserve the right to return slightly smaller or
>>> bigger pages in the future for performance reasons.
>>>
>>>
>>> On Mon, Jun 12, 2023, at 3:19 PM, Jeremiah Jordan wrote:
>>>
>>> As long as it is valid in the paging protocol to return a short page,
>>> but still say “there are more pages”, I think that is fine to do that.  For
>>> an actual LIMIT that is part of the user query, I think the server must
>>> always have returned all data that fits into the LIMIT when all pages have
>>> been returned.
>>>
>>> -Jeremiah
>>>
>>> On Jun 12, 2023 at 12:56:14 PM, Josh McKenzie 
>>> wrote:
>>>
>>>
>>> Yeah, my bad. I have paging on the brain. Seriously.
>>>
>>> I can't think of a use-case in which a LIMIT based on # bytes makes
>>> sense from a user perspective.
>>>
>>> On Mon, Jun 12, 2023, at 1:35 PM, Jeff Jirsa wrote:
>>>
>>>
>>>
>>> On Mon, Jun 12, 2023 at 9:50 AM Benjamin Lerer 
>>> wrote:
>>>
>>> If you have rows that vary significantly in their size, your latencies
>>> could end up being pretty unpredictable using a LIMIT BY . Being
>>> able to specify a limit by bytes at the driver / API level would allow app
>>> devs to get more deterministic results out of their interaction w/the DB if
>>> they're looking to respond back to a client within a certain time frame and
>>> / or determine next steps in the app (continue paging, stop, etc) based on
>>> how long it took to get results back.
>>>
>>>
>>> Are you talking about the page size or the LIMIT? Once the LIMIT is
>>> reached there is no "continue paging". LIMIT is also at the CQL level not
>>> at the driver level.
>>> I can totally understand the need for a page size in bytes

Re: [DISCUSS] When to run CheckStyle and other verifications

2023-07-10 Thread Jacek Lewandowski
Maxim, I don't think it would work, especially this command:


"ant test -Dno-build=true"


would execute the whole pipeline up to the "test" target, skipping only the
"build" target itself. None of the dependencies of "build" would be skipped:
in Ant, when a target is skipped because of a property, the skip applies
only to that target, and its dependencies still run as usual.
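
The behaviour can be illustrated with a small Python simulation. This is a sketch, not Ant itself: the `Target` class and `executed_targets` function are hypothetical, and the target and property names are taken from the step order proposed earlier in this thread rather than from the real build.xml. It assumes the plain `unless="property-name"` form, which tests only whether the property is set, not what its value is.

```python
from dataclasses import dataclass
from typing import Dict, List, Mapping, Optional, Set, Tuple


@dataclass(frozen=True)
class Target:
    name: str
    depends: Tuple[str, ...] = ()
    unless: Optional[str] = None  # skip this target if the property is set


def executed_targets(targets: Dict[str, Target], goal: str,
                     properties: Mapping[str, str]) -> List[str]:
    """Return the targets that actually run, in execution order."""
    executed: List[str] = []
    visited: Set[str] = set()

    def visit(name: str) -> None:
        if name in visited:
            return
        visited.add(name)
        target = targets[name]
        # Dependencies are resolved first, regardless of the skip condition.
        for dep in target.depends:
            visit(dep)
        # Only now is the skip condition checked, and only for this target.
        if target.unless is not None and target.unless in properties:
            return
        executed.append(name)

    visit(goal)
    return executed


# Hypothetical pipeline mirroring the step order discussed in the thread.
PIPELINE = {t.name: t for t in [
    Target("init"),
    Target("_build_java", ("init",)),
    Target("checks", ("_build_java",), unless="no-checks"),
    Target("build", ("checks",), unless="no-build"),
    Target("jar", ("build",)),
    Target("test", ("jar",)),
]}
```

Calling `executed_targets(PIPELINE, "test", {"no-build": "true"})` in this model still runs init, _build_java and checks; only "build" itself is dropped. Passing `{"no-build": "false"}` gives the same result, because only the presence of the property is tested.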


Also, I started this discussion because the commands we execute locally
should be short, simple, easy to remember, and self-explanatory. Nobody
should have to dig into build.xml to learn the flags.


Please correct me if I'm wrong, but I believe this is what most of us do
from the terminal:

- cleaning,
- building jars,
- running checks (currently implicitly by building or testing),
- running tests of a single method / test class,
- generating IDE configuration.


Everyone has excellent ideas, but those suggestions do not converge into a
single solution. That's why I'd like to make one simple change that would
reduce the use of flags in everyday development. Concretely, can we agree
to:

   - Remove the checkstyle dependency from "jar" and "test"
   - Create a single "check" target that includes all the checks we expect
   to pass in the CI (currently Checkstyle, RAT, and Eclipse-Warnings), making
   this task the default?


thanks,

Jacek

On Thu, 6 Jul 2023 at 19:55, Jon Meredith wrote:

> sorry, hit send early.
>
> ant test is an interesting one as it seems impractical to run all tests
> sequentially, but somebody may want to I suppose.
>
> On Thu, Jul 6, 2023 at 11:53 AM Jon Meredith 
> wrote:
>
>> I think the -Dno-blah settings have usability issues. As they look at
>> the property name, not the value, you cannot override them or default
>> them with ANT_ARGS or by importing to another ant build file.  The way
>> rat.skip does it seems much better using configured value.
>>
>> Ideally, I would like an easy/fast configuration to set a default for
>> checks that slow up the compilation/test cycle locally to be able to
>> iterate quickly compile and deal with javadoc/checkstyle comments when
>> they're ready to commit, or opt into them on the commandline when
>> needed.
>>
>> e.g.
>> export ANT_ARGS="-Dcheckstyle.default=skip -Djavadoc.default=skip"
>> ant # should just compile, no checkstyle/javadoc etc
>> ant checkstyle  # explicitly requested, run checkstyle
>>
>> Similarly I'd like to have the option to configure any CI system I run so
>> all non-execution essential checks run in their own pipeline and fail the
>> build if there's a problem, but still run the other test targets despite
>> violations. Otherwise each builder wastes time running checks that only
>> need to happen once, and you don't get feedback from the tests that
>> could have run. Of course not everybody may want that and the main
>> Apache Cassandra CI may only want to run tests for checked commits
>> for resource reasons.
>>
>> Also, as a minor nuisance, if you forget the =true as in the examples,
>> ant consumes the next argument as the value, so "ant publish
>> -Dno-tests -Dno-checks" would set no-tests=-Dno-checks and run the
>> checks you tried to skip anyway.
>>
>> Back to the proposal, I like the idea of an explicit check target that
>> runs all checks. I would not personally have the default target run them
>> but think that's fine as long as you can disable them.
>>
>> ant test is an interesting one
>>
>> On Thu, Jul 6, 2023 at 7:30 AM Maxim Muzafarov  wrote:
>>
>>> In my humble opinion, it is better to have only one plain and
>>> straightforward build pipeline for the whole project, with custom
>>> flags used to skip a particular step, than to have multiple pipelines
>>> under the ant tool with multiple endpoints accordingly. I mean, all
>>> the steps need to be lined up, with each step in the pipeline
>>> executing everything that stands before it unless skip flags are
>>> specified. Meanwhile, I like your idea of grouping all the checks
>>> under the dedicated step (and changing the no-checkstyle flag to
>>> no-checks accordingly as Ekaterina mentioned).
>>>
>>>
>>> Let me share a simple example of what I'm talking about with one
>>> single endpoint.
>>> Let's assume the following step order:
>>>
>>> init -> _build_java (compile) -> checks -> build -> jar -> test ->
>>> artifacts -> publish;
>>>
>>> So, the use would be:
>>>
>>> ant jar -Dno-checks
>>> ant test -Dno-build
>>> ant publish -Dno-tests -Dno-checks
>>>
>>>
>>> I'm not saying what you've proposed is bad, in fact, we're not
>>> currently doing the pipeline I'm talking about, but adding an
>>> additional endpoint is something we should consider very carefully as
>>> it may create some difficulties for Maven/Gradle migration if it ever
>>> happens.
>>>
>>> So, if I'm not mistaken, you're trying to add a new endpoint to the
>>> way we might build the project:
>>>
>>> - "ant [check]" = build + all checks (first endpoint)
>>> - "ant 

Re: [DISCUSS] When to run CheckStyle and other verifications

2023-07-10 Thread Brandon Williams
On Mon, Jul 10, 2023 at 6:07 AM Jacek Lewandowski wrote:
> Remove the checkstyle dependency from "jar" and "test"
> Create a single "check" target that includes all the checks we expect to pass 
> in the CI (currently Checkstyle, RAT, and Eclipse-Warnings), making this task 
> the default.

I support this.  Having checkstyle run when building is clearly
constant friction for many, even though you can disable it.


Removal of CloudstackSnitch

2023-07-10 Thread Miklosovic, Stefan
Hi list,

I want to ask about the future of CloudstackSnitch.

This snitch was added 9 years ago (1). I contacted the original author of that
snitch, Pierre-Yves Ritschard, who is currently the CEO of a company he wrote
that snitch for.

In a nutshell, Pierre answered that he does not think this snitch is relevant
anymore and that the company now fetches metadata from a node in a different
way, rendering CloudstackSnitch, as is, irrelevant for them.

I also wrote an email to the user mailing list (2) about two weeks ago and
nobody answered that they are using it.

The current implementation uses this approach (3), but I think it is already
obsolete: the snitch appends a path to the parsed metadata service IP, and that
path is probably not there at all in the default implementation of the
Cloudstack data server.

What also bothers me is that we, as a community, do not seem to be able to test
the functionality of this snitch, as I do not know anybody with a Cloudstack
deployment who would be able to test this reliably.

For completeness, in (1), Brandon expressed his opinion that unless users come
forward for this snitch, he thinks retiring it is the best option.

For all cloud-based snitches, we refactored the code in 16555, and we are
working on an improvement in 18438 that introduces a generic way of calling
metadata services. Plugging in custom logic or reusing a default implementation
of a cloud connector becomes very easy, which makes this snitch even less
relevant.

This being said, should we:

1) remove it in 5.0
2) keep it there in 5.0 but mark it @Deprecated
3) keep it there 

Regards

(1) https://issues.apache.org/jira/browse/CASSANDRA-7147
(2) https://lists.apache.org/thread/k4woljlk23m2oylvrbnod6wocno2dlm3
(3) https://docs.cloudstack.apache.org/en/latest/adminguide/virtual_machines/user-data.html#determining-the-virtual-router-address-without-dns

Re: Removal of CloudstackSnitch

2023-07-10 Thread Ekaterina Dimitrova
Hi Stefan,

I think we should follow our deprecation rules and deprecate it in 5.0,
potentially remove in 6.0. (Deprecate in one major, remove in the next
major)
Maybe the deprecation can come with a note on your findings for the users,
just in case someone somewhere uses it and did not follow the user mailing
list?

Thank you
Ekaterina



Re: Removal of CloudstackSnitch

2023-07-10 Thread Miklosovic, Stefan
Hey,

should we still keep it around if we are not even sure it still works? As I see
it, we are not able to verify that it works on the 5.0 release. What value is
there in a snitch we do not know is still functioning?

Regards


From: Ekaterina Dimitrova 
Sent: Monday, July 10, 2023 15:54
To: dev@cassandra.apache.org
Subject: Re: Removal of CloudstackSnitch


Hi Stefan,

I think we should follow our deprecation rules and deprecate it in 5.0, 
potentially remove in 6.0. (Deprecate in one major, remove in the next major)
Maybe the deprecation can come with a note on your findings for the users, just 
in case someone somewhere uses it and did not follow the user mailing list?

Thank you
Ekaterina



Re: Removal of CloudstackSnitch

2023-07-10 Thread Brandon Williams
I agree with Ekaterina, but also want to point out that snitches are
pluggable, so whatever we do should be pretty safe.  If someone
discovers after the removal that they need it, they can just plug it
back in.

Kind Regards,
Brandon

On Mon, Jul 10, 2023 at 8:54 AM Ekaterina Dimitrova
 wrote:
>
> Hi Stefan,
>
> I think we should follow our deprecation rules and deprecate it in 5.0, 
> potentially remove in 6.0. (Deprecate in one major, remove in the next major)
> Maybe the deprecation can come with a note on your findings for the users, 
> just in case someone somewhere uses it and did not follow the user mailing 
> list?
>
> Thank you
> Ekaterina


Re: Fwd: [DISCUSS] Formalizing requirements for pre-commit patches on new CI

2023-07-10 Thread Josh McKenzie
I'm personally not thinking about CircleCI at all; I'm envisioning a world 
where all of us have 1 CI *software* system (i.e. reproducible on any env) that 
we use for pre-commit validation, and then post-commit happens on reference ASF 
hardware.

So:
1: Pre-commit subset of tests (suites + matrices + env) runs. On green, merge.
2: Post-commit tests (all suites, matrices, env) runs. If failure, link back to 
the JIRA where the commit took place

Circle would need to remain in lockstep with the requirements for point 1 here.

On Mon, Jul 10, 2023, at 1:04 AM, Berenguer Blasi wrote:
> +1 to Josh which is exactly my line of thought as well. But that is only 
> valid if we have a solid Jenkins that will eventually run all test configs. 
> So I think I lost track a bit here. Are you proposing:
> 
> 1- CircleCI: Run pre-commit a single (the most common/meaningful, TBD) config 
> of tests
> 
> 2- Jenkins: Runs post-commit _all_ test configs and emails/notifies you in 
> case of problems?
> 
> Or something different, like having 1 also in Jenkins?
> 
> On 7/7/23 17:55, Andrés de la Peña wrote:
>> I think 500 runs combining all configs could be reasonable, since it's 
>> unlikely to have config-specific flaky tests. As in five configs with 100 
>> repetitions each.
>> 
>> On Fri, 7 Jul 2023 at 16:14, Josh McKenzie  wrote:
>>> Maybe. Kind of depends on how long we write our tests to run doesn't it? :)
>>> 
>>> But point taken. Any non-trivial test would start to be something of a 
>>> beast under this approach.
>>> 
>>> On Fri, Jul 7, 2023, at 11:12 AM, Brandon Williams wrote:
>>>> On Fri, Jul 7, 2023 at 10:09 AM Josh McKenzie wrote:
>>>> > 3. Multiplexed tests (changed, added) run against all JDK's and a
>>>> > broader range of configs (no-vnode, vnode default, compression, etc)
>>>>
>>>> I think this is going to be too heavy...we're taking 500 iterations
>>>> and multiplying that by like 4 or 5?
>>>>
>>> 


Re: [DISCUSS] When to run CheckStyle and other verifications

2023-07-10 Thread Josh McKenzie
>  • Remove the checkstyle dependency from "jar" and "test"
>  • Create a single "check" target that includes all the checks we expect to 
> pass in the CI (currently Checkstyle, RAT, and Eclipse-Warnings), making this 
> task the default.
+1 here.

(of note: haven't forgotten the request from this thread to share local env; 
just gotten sidetracked by things and also realized how little I've actually 
modified locally since I just run most of the linting against delta'ed files 
only to keep my changed work in compliance. Still a very noisy mess when 
SpotBugs is run against the entire codebase proper)

On Mon, Jul 10, 2023, at 7:13 AM, Brandon Williams wrote:
> On Mon, Jul 10, 2023 at 6:07 AM Jacek Lewandowski
>  wrote:
> > Remove the checkstyle dependency from "jar" and "test"
> > Create a single "check" target that includes all the checks we expect to 
> > pass in the CI (currently Checkstyle, RAT, and Eclipse-Warnings), making 
> > this task the default.
> 
> I support this.  Having checkstyle run when building is clearly
> constant friction for many, even though you can disable it.
> 


Re: Removal of CloudstackSnitch

2023-07-10 Thread Josh McKenzie
> 2) keep it there in 5.0 but mark it @Deprecated
I'd say deprecate it, log warnings that it's neither supported nor maintained,
that people use it at their own risk, and that it's going to be removed.

That is, assuming the maintenance burden of it isn't high. I assume not since, 
as Brandon said, they're quite pluggable and well modularized.

On Mon, Jul 10, 2023, at 9:57 AM, Brandon Williams wrote:
> I agree with Ekaterina, but also want to point out that snitches are
> pluggable, so whatever we do should be pretty safe.  If someone
> discovers after the removal that they need it, they can just plug it
> back in.
> 
> Kind Regards,
> Brandon


Proposed update to cassandra-stress to use Apache Commons CLI

2023-07-10 Thread Brad
The Apache Commons CLI library provides an API for parsing command line
options in the package org.apache.commons.cli, and it is already used by
a dozen existing Cassandra utilities, including:

SSTableMetadataViewer, StandaloneScrubber, StandaloneSplitter,
SSTableExport, BulkLoader, and others.


However, cassandra-stress is an outlier which uses its own custom classes
to parse command line options with classes such as OptionsSimple.  In
addition, the options syntax for username, password, and others are not
aligned with the format used by CQLSH.

This suggestion is to:

a) Upgrade cassandra-stress to use Apache Commons CLI (no new dependencies
are required as this library is already used by the project)

b) Align the cassandra-stress CLI options with those in CQLSH.

For example, using the new syntax like CQLSH:


cassandra-stress -username foo -password bar


and replacing the old syntax:

cassandra-stress -mode username=foo and password=bar


This will simplify and unify the code base, eliminate code, and reduce the
confusion between similarly named classes such
as org.apache.cassandra.stress.settings.{Option, OptionsMulti,
OptionsSimple} and org.apache.commons.cli.{Option, OptionGroup, Options}.
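
To illustrate the difference between the two styles, here is a rough sketch in Python. It uses the standard argparse module purely as a stand-in for a declarative option library; it is not Commons CLI and not cassandra-stress code. The function names `parse_new` and `parse_legacy` are hypothetical, only the username and password options from the example above are modelled, and the legacy form is assumed to be a list of key=value tokens following -mode.

```python
import argparse
from typing import Dict, List, Optional


def parse_new(argv: List[str]) -> Dict[str, Optional[str]]:
    """Declarative style: options are described once and the library
    does the parsing (argparse here, standing in for Commons CLI)."""
    parser = argparse.ArgumentParser(prog="cassandra-stress")
    parser.add_argument("-username")
    parser.add_argument("-password")
    return vars(parser.parse_args(argv))


def parse_legacy(argv: List[str]) -> Dict[str, str]:
    """Hand-rolled style: key=value tokens after '-mode' are split
    manually, which is the kind of custom code the proposal removes."""
    settings: Dict[str, str] = {}
    if argv and argv[0] == "-mode":
        for token in argv[1:]:
            if "=" in token:
                key, value = token.split("=", 1)
                settings[key] = value
    return settings
```

Both functions produce the same settings for equivalent input; the point of the sketch is only that the first one needs no parsing logic of its own.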

If there are no significant objections, I can raise a Jira for this
proposal.

Regards,

Brad Schoening


Re: Proposed update to cassandra-stress to use Apache Commons CLI

2023-07-10 Thread Ekaterina Dimitrova
Hey Brad,

Thanks for raising the topic. I wanted to mention we are now on a very old
version of commons-cli (1.1 from 2007). So I would suggest we first update
it.
While there is activity in the commons-cli github repo (a lot of dependency
updates as far as I can tell from a quick look), the last version is 1.5
from 2021.

Best regards,
Ekaterina



Re: Removal of CloudstackSnitch

2023-07-10 Thread Jeff Jirsa
+1


On Mon, Jul 10, 2023 at 8:42 AM Josh McKenzie  wrote:

> 2) keep it there in 5.0 but mark it @Deprecated
>
> I'd say Deprecate, log warnings that it's not supported nor maintained and
> people to use it at their own risk, and that it's going to be removed.
>
> That is, assuming the maintenance burden of it isn't high. I assume not
> since, as Brandon said, they're quite pluggable and well modularized.
>
> On Mon, Jul 10, 2023, at 9:57 AM, Brandon Williams wrote:
>
> I agree with Ekaterina, but also want to point out that snitches are
> pluggable, so whatever we do should be pretty safe.  If someone
> discovers after the removal that they need it, they can just plug it
> back in.
>
> Kind Regards,
> Brandon
>
> On Mon, Jul 10, 2023 at 8:54 AM Ekaterina Dimitrova
>  wrote:
> >
> > Hi Stefan,
> >
> > I think we should follow our deprecation rules and deprecate it in 5.0,
> potentially remove in 6.0. (Deprecate in one major, remove in the next
> major)
> > Maybe the deprecation can come with a note on your findings for the
> users, just in case someone somewhere uses it and did not follow the user
> mailing list?
> >
> > Thank you
> > Ekaterina
> >
> > On Mon, 10 Jul 2023 at 9:47, Miklosovic, Stefan <
> stefan.mikloso...@netapp.com> wrote:
> >>
> >> Hi list,
> >>
> >> I want to ask about the future of CloudstackSnitch.
> >>
> >> This snitch was added 9 years ago (1). I contacted the original author
> of that snitch, Pierre-Yves Ritschard, who is currently CEO of a company he
> coded that snitch for.
> >>
> >> In a nutshell, Pierre answered that he does not think this snitch is
> relevant anymore and the company is using a different way to fetch
> metadata from a node, rendering CloudstackSnitch, as is, irrelevant for
> them.
> >>
> >> I also wrote an email to user ML list (2) about two weeks ago and
> nobody answered that they are using it either.
> >>
> >> The current implementation is using this approach (3) but I think that
> it is already obsolete in the snitch because snitch is adding a path to
> parsed metadata service IP which is probably not there at all in the
> default implementation of Cloudstack data server.
> >>
> >> What also bothers me is that we, as a community, seem to not be able to
> test the functionality of this snitch as I do not know anybody with a
> Cloudstack deployment who would be able to test this reliably.
> >>
> >> For completeness, in (1), Brandon expressed his opinion that unless
> users come forward for this snitch, he thinks retiring it is the best
> option.
> >>
> >> For all cloud-based snitches, we refactored the code in
> 16555 and are working on an improvement in 18438 which introduces a generic
> way to call metadata services, so that plugging in custom logic or reusing a
> default implementation of a cloud connector is very easy, further making
> this snitch less relevant.
> >>
> >> This being said, should we:
> >>
> >> 1) remove it in 5.0
> >> 2) keep it there in 5.0 but mark it @Deprecated
> >> 3) keep it there
> >>
> >> Regards
> >>
> >> (1) https://issues.apache.org/jira/browse/CASSANDRA-7147
> >> (2) https://lists.apache.org/thread/k4woljlk23m2oylvrbnod6wocno2dlm3
> >> (3)
> https://docs.cloudstack.apache.org/en/latest/adminguide/virtual_machines/user-data.html#determining-the-virtual-router-address-without-dns
>
>
>


Re: [DISCUSS] When to run CheckStyle and other verificiations

2023-07-10 Thread Jon Meredith
+1 from me too. I would support removing all of the optional checks from
jar/test as I also hit issues with rat from time to time while iterating,
as long as the CI system runs them and makes it very clear for any
committer there are failures.

On Mon, Jul 10, 2023 at 9:40 AM Josh McKenzie  wrote:

>
> - Remove the checkstyle dependency from "jar" and "test"
> - Create a single "check" target that includes all the checks we
>   expect to pass in the CI (currently Checkstyle, RAT, and Eclipse-Warnings),
>   making this task the default.
>
> +1 here.
>
> (of note: haven't forgotten the request from this thread to share local
> env; just gotten sidetracked by things and also realized how little I've
> actually modified locally since I just run most of the linting against
> delta'ed files only to keep my changed work in compliance. Still a very
> noisy mess when SpotBugs is run against the entire codebase proper)
>
> On Mon, Jul 10, 2023, at 7:13 AM, Brandon Williams wrote:
>
> On Mon, Jul 10, 2023 at 6:07 AM Jacek Lewandowski
>  wrote:
> > Remove the checkstyle dependency from "jar" and "test"
> > Create a single "check" target that includes all the checks we expect to
> pass in the CI (currently Checkstyle, RAT, and Eclipse-Warnings), making
> this task the default.
>
> I support this.  Having checkstyle run when building is clearly
> constant friction for many, even though you can disable it.
>
>
>


Re: Changing the output of tooling between majors

2023-07-10 Thread Eric Evans
On Fri, Jul 7, 2023 at 10:20 AM Miklosovic, Stefan <
stefan.mikloso...@netapp.com> wrote:

> Hi list,
>
> I want to clarify the policy we have when we want to (or are going to) change
> the output of the tooling (nodetool or tools/bin etc.).
>
> I am not sure it is written down anywhere explicitly, but my understanding from
> the gossip over the years is that we should not change the output (e.g.
> changing the names of fields etc.) in minors, but for majors (4.0 -> 5.0),
> this is OK, correct?
>
> For example, when some tool prints this:
>
> thisIsAStatistic: 10
>
> and we see that all other lines in that output print it like this:
>
> This Is Another Statistic: abc
>
> scratching the itch is almost irresistible so we want to change the output
> to:
>
> This Is a Statistic: 10
>
> This is the natural way fixes are done. We are improving the output,
> making it consistent etc.
>
> Someone may argue that we are changing a "public api", that people are
> actually parsing the output like this, and that we had better not change it
> because we might break "the scripts" for somebody.
>

If that output is (or at some earlier point, was) the most obvious way to
obtain something, then for all intents and purposes it is a public api.
When it's changed, you should assume it *will* break scripts.


> While I get this for minors and it is understandable that minors should be
> same, is this relevant for majors? Because if we care about majors too in
> this situation, how are we supposed to evolve the output over time? Is it
> supposed to be just frozen for ever? I do not buy this argument. For
> minors, fine. But for majors, I do not think so.
>

Majors are expected to be more disruptive, but I don't personally interpret
that as a license to be disruptive.  The pain this creates isn't any less
because it is a major.


>
> I feel like "not break the output because API" is more or less an urban
> legend we keep repeating to ourselves. I have yet to meet somebody who is
> stressing over the fact that her output changed *between majors*.
>

Hi Stefan, I'm Eric, nice to meet you; I stress a great deal at all of the
changes —large and small— that occur between major versions. :)  They
create additional work, introduce risk, and often end up delaying (by
months and years even) a major upgrade.  You might be surprised by the kind
of breakage (often subtle) that even the smallest changes can create, or
how frustrating it can be when it was only done to satisfy a sense of
aesthetics.


> If that is the case, we should start to treat this problem completely
> differently and we should not rely on the output of tooling at all and we
> should either provide corresponding JMX method to retrieve it or we should
> offer other formats tooling prints, like JSON or YAML.


Absolutely.  But I think the right thing in this situation is to
acknowledge that the console output is a contract, and act accordingly.  In
this case: offer (and promote) those structured replacements (JSON, YAML),
document the intended instability, and follow through after a sufficient
window.
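
To make the contract point concrete, here is a minimal Python sketch (an illustration, not taken from nodetool itself). The two text samples reuse the statistic lines from Stefan's example; the JSON document and its key name `this_is_a_statistic` are hypothetical stand-ins for a structured format. It shows why a script keyed on the human-readable label breaks when the label is tidied up, while a script reading a structured field would not.

```python
import json
import re

# Label used by an existing operator script; taken from the example in this thread.
OLD_LABEL = "thisIsAStatistic"


def parse_text(output, label):
    """Return the integer after 'label:' in human-readable output, or None."""
    pattern = r"^" + re.escape(label) + r":\s*(\d+)\s*$"
    match = re.search(pattern, output, re.MULTILINE)
    return int(match.group(1)) if match else None


def parse_json(output, key):
    """Return a field from structured output (the key name is hypothetical)."""
    return json.loads(output)[key]


old_text = "thisIsAStatistic: 10\nThis Is Another Statistic: abc\n"
new_text = "This Is a Statistic: 10\nThis Is Another Statistic: abc\n"
structured = '{"this_is_a_statistic": 10}'

print(parse_text(old_text, OLD_LABEL))  # script works against the old output
print(parse_text(new_text, OLD_LABEL))  # same script finds nothing after the rename
print(parse_json(structured, "this_is_a_statistic"))
```

The failure on the renamed label is silent (the parser returns nothing rather than raising), which is the kind of subtle breakage described above.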

-- 
Eric Evans
john.eric.ev...@gmail.com


Re: Changing the output of tooling between majors

2023-07-10 Thread Eric Evans
On Sun, Jul 9, 2023 at 9:10 PM Dinesh Joshi  wrote:

> On Jul 8, 2023, at 8:43 AM, Miklosovic, Stefan <
> stefan.mikloso...@netapp.com> wrote:
>
>
>
> If we are providing CQL / JSON / YAML for couple years, I do not believe
> that the argument "lets not break it for folks in nodetool" is still
> relevant. CQL output is there from times of 4.0 at least (at least!) and
> YAML / JSON is also not something completely new. It is not like we are
> suddenly forcing people to change their habits, there was enough time to
> update the stuff to CQL / json / yaml etc ...
>
>
> What % of Cassandra users are using 4.0+? Operators who upgrade to 4.0 and
> beyond may still use their existing scripts. Therefore keeping things
> stable is important. Until nodetool can support JSON as output format for
> all interaction and there is a significant adoption in the user community,
> I would strongly advise against making breaking changes to the CLI output.
>

+1

-- 
Eric Evans
john.eric.ev...@gmail.com


Re: Removal of CloudstackSnitch

2023-07-10 Thread Miklosovic, Stefan
OK, thanks all, we will go with 2): we will deprecate it in 5.0 and remove it
in the next major.


From: Jeff Jirsa 
Sent: Monday, July 10, 2023 18:13
To: dev@cassandra.apache.org
Subject: Re: Removal of CloudstackSnitch




+1


On Mon, Jul 10, 2023 at 8:42 AM Josh McKenzie <jmcken...@apache.org> wrote:
2) keep it there in 5.0 but mark it @Deprecated
I'd say Deprecate, log warnings that it's neither supported nor maintained,
that people use it at their own risk, and that it's going to be removed.

That is, assuming the maintenance burden of it isn't high. I assume not since, 
as Brandon said, they're quite pluggable and well modularized.

On Mon, Jul 10, 2023, at 9:57 AM, Brandon Williams wrote:
I agree with Ekaterina, but also want to point out that snitches are
pluggable, so whatever we do should be pretty safe.  If someone
discovers after the removal that they need it, they can just plug it
back in.

Kind Regards,
Brandon

On Mon, Jul 10, 2023 at 8:54 AM Ekaterina Dimitrova
<e.dimitr...@gmail.com> wrote:
>
> Hi Stefan,
>
> I think we should follow our deprecation rules and deprecate it in 5.0, 
> potentially remove in 6.0. (Deprecate in one major, remove in the next major)
> Maybe the deprecation can come with a note on your findings for the users, 
> just in case someone somewhere uses it and did not follow the user mailing 
> list?
>
> Thank you
> Ekaterina
>
> On Mon, 10 Jul 2023 at 9:47, Miklosovic, Stefan
> <stefan.mikloso...@netapp.com> wrote:
>>
>> Hi list,
>>
>> I want to ask about the future of CloudstackSnitch.
>>
>> This snitch was added 9 years ago (1). I contacted the original author of 
>> that snitch, Pierre-Yves Ritschard, who is currently CEO of a company he 
>> coded that snitch for.
>>
>> In a nutshell, Pierre answered that he does not think this snitch is
>> relevant anymore and the company is using a different way to fetch
>> metadata from a node, rendering CloudstackSnitch, as is, irrelevant for them.
>>
>> I also wrote an email to user ML list (2) about two weeks ago and nobody 
>> answered that they are using it either.
>>
>> The current implementation is using this approach (3) but I think that it is 
>> already obsolete in the snitch because snitch is adding a path to parsed 
>> metadata service IP which is probably not there at all in the default 
>> implementation of Cloudstack data server.
>>
>> What also bothers me is that we, as a community, seem to not be able to test 
>> the functionality of this snitch as I do not know anybody with a Cloudstack 
>> deployment who would be able to test this reliably.
>>
>> For completeness, in (1), Brandon expressed his opinion that unless users
>> come forward for this snitch, he thinks retiring it is the best option.
>>
>> For all cloud-based snitches, we refactored the code in
>> 16555 and are working on an improvement in 18438 which introduces a generic
>> way to call metadata services, so that plugging in custom logic or reusing a
>> default implementation of a cloud connector is very easy, further making
>> this snitch less relevant.
>>
>> This being said, should we:
>>
>> 1) remove it in 5.0
>> 2) keep it there in 5.0 but mark it @Deprecated
>> 3) keep it there
>>
>> Regards
>>
>> (1) 
>> https://issues.apache.org/jira/browse/CASSANDRA-7147
>> (2) 
>> https://lists.apache.org/thread/k4woljlk23m2oylvrbnod6wocno2dlm3
>> (3) 
>> https://docs.cloudstack.apache.org/en/latest/adminguide/virtual_machines/user-data.html#determining-the-virtual-router-address-without-dns

Re: Changing the output of tooling between majors

2023-07-10 Thread Fleming, Jackson
We use Nodetool in scripts sparingly. In my opinion, trying to programmatically
parse the human-readable output should be avoided as much as possible; it
usually leads to implementations that are brittle.

I certainly agree you don’t want to make these kinds of changes in 3.11 or 4.x
(and I don’t think that’s what Stefan was suggesting), but I don’t necessarily
agree that you can’t make these kinds of changes in major versions. Chasing
compatibility like this seems like a deep rabbit hole one could go
down. I personally don’t see it as unreasonable for commands that are designed
to be read by humans to be updated over time to improve readability, as that is
the purpose of those commands. While people script against that output, I don’t
think anyone is going to say it’s an official API, and the project makes no
public commitment to that either.

If the proposal is to treat Nodetool input and output like a contract/API, it’d
be great to have a formal specification, or at least to have the documentation
updated to cover what users should expect as output from Nodetool. If the
project is going to such effort to maintain a specification, why not make it
official? That way the maintainers of scripts have a fighting chance of finding
incompatibilities before upgrading their infrastructure, and the project could
make these kinds of changes and provide a mechanism for users to validate.

Currently the argument could be made that there’s no guarantee about Nodetool 
output since it’s not actually written down anywhere official outside the 
codebase.

Isn’t this one of the reasons Cassandra maintains the NEWS and CHANGES files in 
the repo, and follows semantic versioning, to communicate potentially breaking 
changes as clearly as possible? Surely a message like (but with some more 
detail) “Nodetool command x has had its human readable output restructured, 
item y was removed/renamed to z” would suffice.

Not sure if you can deprecate the human readable output without generating a 
lot of noise for the user, and if it’s being parsed by a bash script, the user 
would never see it anyway, but sounds like that’s what the project needs.

To the note about having users migrate over to more machine friendly output 
types (JSON etc), in my experience the operators who maintain these scripts 
aren’t going to re-write them just because a better way of doing them is newly 
available, usually they’re too busy with other work and will keep using those 
old scripts until they stop working, so in my view it’s not really a solution 
to this problem.

Regards,

Jackson

From: Eric Evans 
Date: Tuesday, 11 July 2023 at 4:14 am
To: dev@cassandra.apache.org 
Subject: Re: Changing the output of tooling between majors




On Sun, Jul 9, 2023 at 9:10 PM Dinesh Joshi <djo...@apache.org> wrote:
On Jul 8, 2023, at 8:43 AM, Miklosovic, Stefan
<stefan.mikloso...@netapp.com> wrote:

If we are providing CQL / JSON / YAML for couple years, I do not believe that 
the argument "lets not break it for folks in nodetool" is still relevant. CQL 
output is there from times of 4.0 at least (at least!) and YAML / JSON is also 
not something completely new. It is not like we are suddenly forcing people to 
change their habits, there was enough time to update the stuff to CQL / json / 
yaml etc ...

What % of Cassandra users are using 4.0+? Operators who upgrade to 4.0 and 
beyond may still use their existing scripts. Therefore keeping things stable is 
important. Until nodetool can support JSON as output format for all interaction 
and there is a significant adoption in the user community, I would strongly 
advise against making breaking changes to the CLI output.

+1

--
Eric Evans
john.eric.ev...@gmail.com


Re: CASSANDRA-18654 - start publishing CQLSH to PyPI as part of the release process

2023-07-10 Thread German Eichberger via dev
Same - really appreciate those efforts and also welcome the upstreaming and 
release automation...

German

From: Jeff Widman 
Sent: Sunday, July 9, 2023 1:44 PM
To: Max C. 
Cc: dev@cassandra.apache.org ; Brad Schoening 

Subject: [EXTERNAL] Re: CASSANDRA-18654 - start publishing CQLSH to PyPI as 
part of the release process

Thanks Max, always encouraging to hear that the time I spend on open source is 
helping others.

Your use case is very similar to what drove my original desire to get involved 
with the project. Being able to `pip install cqlsh` from a dev machine was so 
much lighter weight than the alternatives.

Anyone else care to weigh in on this?

What are the next steps to move to a decision?

Cheers,
Jeff

On Sat, Jul 8, 2023, 7:23 PM Max C. <mc_cassand...@core43.com> wrote:

As a user, I really appreciate your efforts Jeff & Brad.  I would *love* for 
the C* project to officially support this.

In our environment we have a lot of client machines that all share common NFS 
mounted directories.  It's much easier for us to create a Python virtual 
environment on a file server with the cqlsh PyPI package installed than it is 
to install the Cassandra RPMs on every single machine.  Before I discovered 
your PyPI package, our developers would need to login to  a Cassandra node in 
order to run cqlsh.  The cqlsh PyPI package, however, is in our standard 
"python dev tools" virtual environment -- along with Ansible, black, isort and 
various other Python packages; which means it's accessible to everyone, 
everywhere.

I agree that this should not replace packaging cqlsh in the Cassandra RPM so
much as provide an additional option for installing cqlsh without the baggage of
installing the full Cassandra package.

Thanks again for your work Jeff & Brad.

- Max

On 7/6/2023 5:55 PM, Jeff Widman wrote:
Myself and Brad Schoening currently maintain https://pypi.org/project/cqlsh/ 
which repackages CQLSH that ships with every Cassandra release.

This way:

  *   anyone who wants a lightweight client to talk to a remote cassandra can 
simply `pip install cqlsh` without having to download the full cassandra 
source, unzip it, etc.
  *   it's very easy for folks to use it as scaffolding in their python 
scripts/tooling since they can simply include it in the list of their required 
dependencies.

We currently handle the packaging by waiting for a release, then manually 
copy/pasting the code out of the cassandra source tree into 
https://github.com/jeffwidman/cqlsh which has some additional build/python 
package configuration files, then using standard python tooling to publish to 
PyPI.

Given that our project is simply a build/packaging project, I wanted to start a 
conversation about upstreaming this into core Cassandra. I realize that 
Cassandra has no interest in maintaining lots of build targets... but given 
that cqlsh is written in Python and publishing to PyPI enables DBA's to share 
more complicated tooling built on top of it this seems like a natural fit for 
core cassandra rather than a standalone project.

Goal:
When a Cassandra release happens, the build/release process automatically 
publishes cqlsh to https://pypi.org/project/cqlsh/.

Non-Goal: This is _not_ about having cassandra itself rely on PyPI. There was 
some initial chatter about that in 
https://issues.apache.org/jira/browse/CASSANDRA-18654, but that adds a lot of 
complexity, and I'm honestly not sure it's a great idea. Even if folks later 
want to go that route, the first hurdle is publishing to PyPI, so for now let's 
keep the scope of the discussion limited to treating PyPI purely as a release 
target, and not as an ingredient to a release.

From an implementation perspective, this should be very straightforward. We
don't have any differences from the CQLSH source that's in cassandra, instead
we point folks to make changes to cqlsh in the Cassandra source. In fact we've
made multiple contributions back to `cqlsh` ourselves and have drastically
cleaned up the code:
https://github.com/search?q=repo%3Aapache%2Fcassandra%20is%3Apr%20author%3Ajeffwidman%20author%3Abschoening&type=pullrequests.
So the only real change is adding the package config files and the build /
release pipeline.

We realize the Cassandra team isn't python/PyPI experts, so we'd be more than 
happy to help wire this up and maintain it. I am also a maintainer of kazoo and 
kafka-python which are both popular python clients for other distributed 
databases. So I'm very familiar with open source, python, and distributed 
databases.

My one hesitation around this discussion is that I'm a little concerned that we 
might lose the nimbleness we've currently got from having a separate project. 
Ie, if something is screwed up on PyPI / the build process, we can quickly get 
it fixed and get a n

Re: CASSANDRA-18654 - start publishing CQLSH to PyPI as part of the release process

2023-07-10 Thread Patrick McFadin
I would say it helps a lot of people. 45k downloads in just the last month:
https://pypistats.org/packages/cqlsh

I feel like a CEP would be in order, along the lines of CEP-8:
https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-8%3A+Datastax+Drivers+Donation

Unless anyone objects, I can help you get the CEP together and we can get a
vote, then a JIRA in place for any changes in trunk.

Patrick

On Mon, Jul 10, 2023 at 4:58 PM German Eichberger via dev <
dev@cassandra.apache.org> wrote:

> Same - really appreciate those efforts and also welcome the upstreaming
> and release automation...
>
> German
> --
> *From:* Jeff Widman 
> *Sent:* Sunday, July 9, 2023 1:44 PM
> *To:* Max C. 
> *Cc:* dev@cassandra.apache.org ; Brad Schoening
> 
> *Subject:* [EXTERNAL] Re: CASSANDRA-18654 - start publishing CQLSH to
> PyPI as part of the release process
>
> Thanks Max, always encouraging to hear that the time I spend on open
> source is helping others.
>
> Your use case is very similar to what drove my original desire to get
> involved with the project. Being able to `pip install cqlsh` from a dev
> machine was so much lighter weight than the alternatives.
>
> Anyone else care to weigh in on this?
>
> What are the next steps to move to a decision?
>
> Cheers,
> Jeff
>
> On Sat, Jul 8, 2023, 7:23 PM Max C.  wrote:
>
> As a user, I really appreciate your efforts Jeff & Brad.  I would *love*
> for the C* project to officially support this.
>
> In our environment we have a lot of client machines that all share common
> NFS mounted directories.  It's much easier for us to create a Python
> virtual environment on a file server with the cqlsh PyPI package installed
> than it is to install the Cassandra RPMs on every single machine.  Before I
> discovered your PyPI package, our developers would need to login to  a
> Cassandra node in order to run cqlsh.  The cqlsh PyPI package, however, is
> in our standard "python dev tools" virtual environment -- along with
> Ansible, black, isort and various other Python packages; which means it's
> accessible to everyone, everywhere.
>
> I agree that this should not *replace* packaging cqlsh in the Cassandra
> RPM so much as provide an additional *option* for installing cqlsh without
> the baggage of installing the full Cassandra package.
>
> Thanks again for your work Jeff & Brad.
>
> - Max
> On 7/6/2023 5:55 PM, Jeff Widman wrote:
>
> Myself and Brad Schoening currently maintain
> https://pypi.org/project/cqlsh/ which repackages CQLSH that ships with
> every Cassandra release.
>
> This way:
>
> - anyone who wants a lightweight client to talk to a remote cassandra
>   can simply `pip install cqlsh` without having to download the full
>   cassandra source, unzip it, etc.
> - it's very easy for folks to use it as scaffolding in their python
>   scripts/tooling since they can simply include it in the list of their
>   required dependencies.
>
> We currently handle the packaging by waiting for a release, then manually
> copy/pasting the code out of the cassandra source tree into
> https://github.com/jeffwidman/cqlsh which has some additional
> build/python package configuration files, then using standard
> python tooling to publish to PyPI.
>
> Given that our project is simply a build/packaging project, I wanted to
> start a conversation about upstreaming this into core Cassandra. I realize
> that Cassandra has no interest in maintaining lots of build targets... but
> given that cqlsh is written in Python and publishing to PyPI enables DBA's
> to share more complicated tooling built on top of it this seems like a
> natural fit for core cassandra rather than a standalone project.
>
> Goal:
> When a Cassandra release happens, the build/release process automatically
> publishes cqlsh to https://pypi.org/project/cqlsh/.
>
> Non-Goal: This is _not_ about having cassandra itself rely on PyPI. There
> was some initial chatter about that in
> https://issues.apache.org/jira/browse/CASSANDRA-18654, but that adds a
> lot of complexity, and I'm honestly not sure it's a great idea. Even if
> folks later want to go that route, the first hurdle is publishing to PyPI,
> so for now let's keep the scope of the discussion limited to treating PyPI
> purely as a release target, and not as an ingredient to a release.
>
> From an implementation perspective, this should be very straightforward.
> We don't have any differences from the CQLSH source that's in cassandra,
> instead we point folks to make changes to cqlsh in the Cassandra source. In
> fact we've made multiple contributions back to `cqlsh` ourselves and have
> drastically cleaned up the code:
> https://github.com/search?q=repo%3Aapache%2Fcassandra%20is%3Apr%20author%3Ajeffwidman%20author%3Abschoening&type=pullrequests.
> So the only real change is adding the package config files and the b

[DISCUSS] Conducting a User Survey

2023-07-10 Thread Patrick McFadin
For quite a few years, I have done Twitter polls to gather helpful
information about how people use Apache Cassandra. Twitter is no longer the
best place to conduct this kind of activity since it has become a ghost
town.

We should ask more comprehensive questions to get the pulse of our user
community. I want to do a simple Google Form survey that we can promote on
every channel for a few weeks. I'll anonymize the results and post them on
cassandra.apache.org.

Here are the proposed questions I have compiled. A pretty basic set of
questions, but it would be fun to know the answer to several of these:
https://docs.google.com/document/d/18627E1UV-BjLyuNFgV0cgPwPmtjUHy7Th9Mk15ll1IA/edit?usp=sharing

Comments are open to all. Please let me know what you think.

Patrick


Bloom filter calculation

2023-07-10 Thread Claude Warren, Jr via dev
Can someone explain to me how the Bloom filter table in
BloomFilterCalculations was derived and how it is supposed to work?  As I
read the table it seems to indicate that with 14 hashes and 20 bits you get
a fp of 6.71e-05.  But if you plug those numbers into the Bloom filter
calculator [1],  that is calculated only for 1 item being in the filter.
If you merge multiple filters together the false positive rate goes up.
And as [1] shows by 5 merges you are over 50% fp rate and by 10 you are at
close to 100% fp.  So I have to assume this analysis is wrong.  Can someone
point me to the correct calculations?

Claude

[1] https://hur.st/bloomfilter/?n=&p=6.71e-05&m=20&k=14
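
One possible reading, offered as an assumption rather than a confirmed answer from the code: if the table's 20 is bits per element (m/n) rather than a filter of 20 bits in total, the standard approximation p = (1 - e^(-k*n/m))^k reproduces the quoted figure of about 6.71e-05 for k = 14. The Python sketch below is a hand calculation under that assumption; the function name is illustrative and is not part of Cassandra.

```python
import math


def false_positive_rate(bits_per_element, hash_count):
    """Standard Bloom filter approximation: p = (1 - e^(-k / (m/n)))^k."""
    return (1.0 - math.exp(-hash_count / bits_per_element)) ** hash_count


# Assumption: "20 bits" means 20 bits per stored element, with 14 hashes.
p = false_positive_rate(bits_per_element=20, hash_count=14)
print(f"{p:.3g}")

# If instead the whole filter had only 20 bits, every added element would
# lower the bits-per-element ratio and the rate would climb quickly.
for n in (1, 5, 10):
    print(n, f"{false_positive_rate(20 / n, 14):.3g}")
```

Under the per-element reading the filter grows with the number of keys, so the ratio stays at 20 and the rate stays put; under the fixed 20-bit reading the rate passes 50% by five elements, which matches the behaviour described above.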


Re: Fwd: [DISCUSS] Formalizing requirements for pre-commit patches on new CI

2023-07-10 Thread Berenguer Blasi
Add a 'devBranch' jenkins job to that imo: The possibility to run the 
full suite + multiplex new tests before commit when you're about to 
release a Kraken into the codebase: Accord, TCM, TTL, SAI, Vector, 
JDK... So:


1: Pre-commit subset of tests (suites + matrices + env) runs. On green, 
merge.
2: Pre-commit 'devBranch' full suite for high risk/disruptive merges: at 
reviewer's discretion
3: Post-commit tests (all suites, matrices, env) runs. If failure, link 
back to the JIRA where the commit took place


My 2cts

On 10/7/23 17:36, Josh McKenzie wrote:
I'm personally not thinking about CircleCI at all; I'm envisioning a 
world where all of us have 1 CI /software/ system (i.e. reproducible 
on any env) that we use for pre-commit validation, and then 
post-commit happens on reference ASF hardware.


So:
1: Pre-commit subset of tests (suites + matrices + env) runs. On 
green, merge.
2: Post-commit tests (all suites, matrices, env) runs. If failure, 
link back to the JIRA where the commit took place


Circle would need to remain in lockstep with the requirements for 
point 1 here.


On Mon, Jul 10, 2023, at 1:04 AM, Berenguer Blasi wrote:


+1 to Josh which is exactly my line of thought as well. But that is 
only valid if we have a solid Jenkins that will eventually run all 
test configs. So I think I lost track a bit here. Are you proposing:


1- CircleCI: Run pre-commit a single (the most common/meaningful, 
TBD) config of tests


2- Jenkins: Runs post-commit _all_ test configs and emails/notifies 
you in case of problems?


Or something different, like having 1 also in Jenkins?

On 7/7/23 17:55, Andrés de la Peña wrote:
I think 500 runs combining all configs could be reasonable, since 
it's unlikely to have config-specific flaky tests. As in five 
configs with 100 repetitions each.


On Fri, 7 Jul 2023 at 16:14, Josh McKenzie wrote:


Maybe. Kind of depends on how long we write our tests to run
doesn't it? :)

But point taken. Any non-trivial test would start to be
something of a beast under this approach.

On Fri, Jul 7, 2023, at 11:12 AM, Brandon Williams wrote:

On Fri, Jul 7, 2023 at 10:09 AM Josh McKenzie
<jmcken...@apache.org> wrote:
> 3. Multiplexed tests (changed, added) run against all JDK's
and a broader range of configs (no-vnode, vnode default,
compression, etc)

I think this is going to be too heavy...we're taking 500 iterations
and multiplying that by like 4 or 5?





Re: [DISCUSS] When to run CheckStyle and other verificiations

2023-07-10 Thread Jacek Lewandowski
Thanks,

I will follow that path then,



On Mon, 10 Jul 2023 at 19:03, Jon Meredith wrote:

> +1 from me too. I would support removing all of the optional checks from
> jar/test as I also hit issues with rat from time to time while iterating,
> as long as the CI system runs them and makes it very clear for any
> committer there are failures.
>
> On Mon, Jul 10, 2023 at 9:40 AM Josh McKenzie 
> wrote:
>
>>
>> - Remove the checkstyle dependency from "jar" and "test"
>> - Create a single "check" target that includes all the checks we
>>   expect to pass in the CI (currently Checkstyle, RAT, and Eclipse-Warnings),
>>   making this task the default.
>>
>> +1 here.
>>
>> (of note: haven't forgotten the request from this thread to share local
>> env; just gotten sidetracked by things and also realized how little I've
>> actually modified locally since I just run most of the linting against
>> delta'ed files only to keep my changed work in compliance. Still a very
>> noisy mess when SpotBugs is run against the entire codebase proper)
>>
>> On Mon, Jul 10, 2023, at 7:13 AM, Brandon Williams wrote:
>>
>> On Mon, Jul 10, 2023 at 6:07 AM Jacek Lewandowski
>>  wrote:
>> > Remove the checkstyle dependency from "jar" and "test"
>> > Create a single "check" target that includes all the checks we expect
>> to pass in the CI (currently Checkstyle, RAT, and Eclipse-Warnings), making
>> this task the default.
>>
>> I support this.  Having checkstyle run when building is clearly
>> constant friction for many, even though you can disable it.
>>
>>
>>