Re: Meaningless emptiness and filtering

2025-02-13 Thread Mick Semb Wever
On Tue, 11 Feb 2025 at 19:56, Caleb Rackliffe 
wrote:

> When we add IS [NOT] NULL support, that would preferably NOT match EMPTY
> values for the types where empty means something, like strings. For
> everything else, EMPTY could be equivalent to null and match IS NULL.
>


Makes sense to me to say this is what we intend in advance of IS NULL
landing.

i.e. `isEmptyValueMeaningless=true and v=EMPTY_BYTE_BUFFER` is for now
equivalent to what will be `IS NULL`, so isEmptyValueMeaningless
effectively (temporarily?) means isEmptyValueTreatedAsNull

And in the meantime, just say that SAI currently does not support such NULL
values (and leave the behaviour as-is, even if wrong, in 2i and SASI – they're
legacy – despite this making CQL statements inconsistent depending on the index
implementation).
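
For illustration, a minimal Java sketch of the semantics described above (not
the actual SAI code; the class and method are hypothetical, and the boolean
stands in for the isEmptyValueMeaningless flag mentioned above):

import java.nio.ByteBuffer;

final class NullSemanticsSketch
{
    // An absent column is NULL. An empty buffer only behaves like NULL for types
    // where empty carries no meaning (e.g. int); for text, '' is a real value and
    // must NOT match IS NULL.
    static boolean matchesIsNull(ByteBuffer value, boolean emptyValueMeaningless)
    {
        if (value == null)
            return true;
        return !value.hasRemaining() && emptyValueMeaningless;
    }
}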


Re: Merging compaction improvements to 5.0

2025-02-13 Thread Jordan West
For 15452 that’s correct (and I believe also for 20092). For 15452, the
trunk and 5.0 patch are basically identical.

Jordan

On Thu, Feb 13, 2025 at 01:06 C. Scott Andreas  wrote:

> Checking to confirm the specific patches proposed for backport – is it the
> trunk commit for C-20092 and the open GitHub PR against the 5.0 branch for
> C-15452 linked below?
>
> CASSANDRA-20092: Introduce SSTableSimpleScanner for compaction (committed
> to trunk)
> https://github.com/apache/cassandra/commit/3078aea1cfc70092a185bab8ac5dc8a35627330f
>
>  CASSANDRA-15452: Improve disk access patterns during compaction and range
> reads (PR available) https://github.com/apache/cassandra/pull/3606
>
> Thanks,
>
> – Scott
>
> On Feb 12, 2025, at 9:45 PM, guo Maxwell  wrote:
>
>
> Of course, I definitely hope to see it merged into 5.0.x as soon as
> possible
>
> Jordan West wrote on Thu, Feb 13, 2025 at 10:48:
>
>> Regarding the buffer size, it is configurable. My personal take is that
>> we’ve tested this on a variety of hardware (from laptops to large instance
>> sizes) already, as well as a few different disk configs (it’s also been run
>> internally, in test, at a few places) and that it has been reviewed by four
>> committers and another contributor. Always love to see more numbers. if
>> folks want to take it for a spin on Alibaba cloud, azure, etc and determine
>> the best buffer size that’s awesome. We could document which is suggested
>> for the community. I don’t think it’s necessary to block on that however.
>>
>> Also I am of course +1 to including this in 5.0.
>>
>> Jordan
>>
>> On Wed, Feb 12, 2025 at 19:50 guo Maxwell  wrote:
>>
>>> My understanding is that there will be some differences in block
>>> storage among the various cloud platforms. Most obviously, the default
>>> read-ahead size will not be the same: for example, AWS EBS seems to be 256K,
>>> and Alibaba Cloud seems to be 512K (if I remember correctly).
>>>
>>> Just like 19488, provide the test method, see who can assist with the
>>> testing, and share the results.
>>>
>>> Jon Haddad wrote on Thu, Feb 13, 2025 at 08:30:
>>>
 Can you elaborate why?  This would be several hundred hours of work and
 would cost me thousands of $$ to perform.

 Filesystems and block devices are well understood.  Could you give me
 an example of what you think might be different here?  This is already one
 of the most well tested and documented performance patches ever contributed
 to the project.

 On Wed, Feb 12, 2025 at 4:26 PM guo Maxwell 
 wrote:

> I think it should be tested on most cloud platforms (at least
> AWS, Azure, GCP) before being merged into 5.0, just like CASSANDRA-19488.
>
> Paulo Motta wrote on Thu, Feb 13, 2025 at 6:10 AM:
>
>> I'm looking forward to these improvements, compaction needs tlc. :-)
>> A couple of questions:
>>
>> Has this been tested only on EBS, or also EC2/bare-metal/Azure/etc? My
>> only concern is if this is an optimization for EBS that can be a
>> deoptimization for other environments.
>>
>> Are there reproducible scripts that anyone can run to verify the
>> improvements in their own environments ? This could help alleviate any
>> concerns and gain confidence to introduce a perf. improvement in a
>> patch release.
>>
>> I have not read the ticket in detail, so apologies if this was already
>> discussed there or elsewhere.
>>
>> On Wed, Feb 12, 2025 at 3:01 PM Jon Haddad 
>> wrote:
>> >
>> > Hey folks,
>> >
>> > Over the last 9 months Jordan and I have worked on CASSANDRA-15452
>> [1].  The TL;DR is that we're internalizing a read ahead buffer to allow 
>> us
>> to do fewer requests to disk during compaction and range reads.  This
>> results in far fewer system calls (roughly 16x reduction) and on systems
>> with higher read latency, a significant improvement in compaction
>> throughput.  We've tested several different EBS configurations and found 
>> it
>> delivers up to a 10x improvement when read ahead is optimized to minimize
>> read latency.  I worked with AWS and the EBS team directly on this and 
>> the
>> Best Practices for C* on EBS [2] I wrote for them.  I've performance 
>> tested
>> this patch extensively with hundreds of billions of operations across
>> several clusters and thousands of compactions.  It has less of an impact 
>> on
>> local NVMe, since the p99 latency is already 10-30x less than what you 
>> see
>> on EBS (100micros vs 1-3ms), and you can do hundreds of thousands of IOPS
>> vs a max of 16K.
>> >
>> > Related to this, Branimir wrote CASSANDRA-20092 [3], which
>> significantly improves compaction by avoiding reading the partition 
>> index.
>> CASSANDRA-20092 has been merged to trunk already [4].
>> >
>> > I think we should merge both of these patches into 5.0, as the perf
>> improvement should allow teams to increase
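
A rough Java sketch of the read-ahead idea described in the quoted message
above (illustrative only, not the CASSANDRA-15452 implementation; the class
name and the 256KB buffer size are assumptions): many small sequential reads
are served out of one large buffered read, so each 256KB of data costs a single
request to the device instead of roughly sixteen.

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;

final class ReadAheadSketch
{
    private final FileChannel channel;
    private final ByteBuffer buffer = ByteBuffer.allocateDirect(256 * 1024); // one large request
    private long bufferStart = -1; // file offset corresponding to buffer position 0

    ReadAheadSketch(FileChannel channel)
    {
        this.channel = channel;
    }

    // Copies dst.remaining() bytes starting at offset, refilling the large buffer
    // only when the requested range falls outside it.
    void read(long offset, ByteBuffer dst) throws IOException
    {
        while (dst.hasRemaining())
        {
            if (bufferStart < 0 || offset < bufferStart || offset >= bufferStart + buffer.limit())
            {
                buffer.clear();
                int n = channel.read(buffer, offset); // single large read instead of many small ones
                if (n <= 0)
                    throw new IOException("unexpected end of file at offset " + offset);
                buffer.flip();
                bufferStart = offset;
            }
            ByteBuffer slice = buffer.duplicate();
            slice.position((int) (offset - bufferStart));
            slice.limit(Math.min(slice.position() + dst.remaining(), buffer.limit()));
            offset += slice.remaining();
            dst.put(slice);
        }
    }
}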

Re: Merging compaction improvements to 5.0

2025-02-13 Thread Patrick McFadin
I’ve been following this for a while and I think it’s just some solid
engineering based on real-world challenges. Probably one of the best types
of contributions to have. I’m +1 on adding it to 5

Patrick

On Thu, Feb 13, 2025 at 7:31 AM Dmitry Konstantinov 
wrote:

> +1 (nb) from my side, I raised a few comments for CASSANDRA-15452 some
> time ago and Jordan addressed them.
> I have also backported CASSANDRA-15452 changes to my internal 4.1 fork and
> got about 15% reduction in compaction time even for a node with a local SSD.
>
> On Thu, 13 Feb 2025 at 13:22, Jordan West  wrote:
>
>> For 15452 that’s correct (and I believe also for 20092). For 15452, the
>> trunk and 5.0 patch are basically identical.
>>
>> Jordan
>>
>> On Thu, Feb 13, 2025 at 01:06 C. Scott Andreas 
>> wrote:
>>
>>> Checking to confirm the specific patches proposed for backport – is it
>>> the trunk commit for C-20092 and the open GitHub PR against the 5.0 branch
>>> for C-15452 linked below?
>>>
>>> CASSANDRA-20092: Introduce SSTableSimpleScanner for compaction
>>> (committed to trunk)
>>> https://github.com/apache/cassandra/commit/3078aea1cfc70092a185bab8ac5dc8a35627330f
>>>
>>>  CASSANDRA-15452: Improve disk access patterns during compaction and
>>> range reads (PR available) https://github.com/apache/cassandra/pull/3606
>>>
>>> Thanks,
>>>
>>> – Scott
>>>
>>> On Feb 12, 2025, at 9:45 PM, guo Maxwell  wrote:
>>>
>>>
>>> Of course, I definitely hope to see it merged into 5.0.x as soon as
>>> possible
>>>
>>> Jordan West wrote on Thu, Feb 13, 2025 at 10:48:
>>>
 Regarding the buffer size, it is configurable. My personal take is that
 we’ve tested this on a variety of hardware (from laptops to large instance
 sizes) already, as well as a few different disk configs (it’s also been run
 internally, in test, at a few places) and that it has been reviewed by four
 committers and another contributor. Always love to see more numbers. if
 folks want to take it for a spin on Alibaba cloud, azure, etc and determine
 the best buffer size that’s awesome. We could document which is suggested
 for the community. I don’t think it’s necessary to block on that however.

 Also I am of course +1 to including this in 5.0.

 Jordan

 On Wed, Feb 12, 2025 at 19:50 guo Maxwell  wrote:

> My understanding is that there will be some differences in block
> storage among the various cloud platforms. Most obviously, the default
> read-ahead size will not be the same: for example, AWS EBS seems to be 256K,
> and Alibaba Cloud seems to be 512K (if I remember correctly).
>
> Just like 19488, provide the test method, see who can assist with the
> testing, and share the results.
>
> Jon Haddad wrote on Thu, Feb 13, 2025 at 08:30:
>
>> Can you elaborate why?  This would be several hundred hours of work
>> and would cost me thousands of $$ to perform.
>>
>> Filesystems and block devices are well understood.  Could you give me
>> an example of what you think might be different here?  This is already 
>> one
>> of the most well tested and documented performance patches ever 
>> contributed
>> to the project.
>>
>> On Wed, Feb 12, 2025 at 4:26 PM guo Maxwell 
>> wrote:
>>
>>> I think it should be tested on most cloud platforms (at least
>>> AWS, Azure, GCP) before being merged into 5.0, just like CASSANDRA-19488.
>>>
>>> Paulo Motta wrote on Thu, Feb 13, 2025 at 6:10 AM:
>>>
 I'm looking forward to these improvements, compaction needs tlc. :-)
 A couple of questions:

 Has this been tested only on EBS, or also EC2/bare-metal/Azure/etc?
 My
 only concern is if this is an optimization for EBS that can be a
 deoptimization for other environments.

 Are there reproducible scripts that anyone can run to verify the
 improvements in their own environments ? This could help alleviate
 any
 concerns and gain confidence to introduce a perf. improvement in a
 patch release.

 I have not read the ticket in detail, so apologies if this was
 already
 discussed there or elsewhere.

 On Wed, Feb 12, 2025 at 3:01 PM Jon Haddad 
 wrote:
 >
 > Hey folks,
 >
 > Over the last 9 months Jordan and I have worked on
 CASSANDRA-15452 [1].  The TL;DR is that we're internalizing a read 
 ahead
 buffer to allow us to do fewer requests to disk during compaction and 
 range
 reads.  This results in far fewer system calls (roughly 16x reduction) 
 and
 on systems with higher read latency, a significant improvement in
 compaction throughput.  We've tested several different EBS 
 configurations
 and found it delivers up to a 10x improvement when read ahead is 
 optimized
 to minimize read latency.  I worked with

Re: Merging compaction improvements to 5.0

2025-02-13 Thread Abe Ratnofsky
Another +1 (nb) in favor of merging to 5.0. This patch has been thoroughly 
tested and reviewed, and will likely be a strong reason for users to upgrade.


Re: [Discuss] Decoupling java driver dependency from server code; migrate tools and tests to 4.x driver

2025-02-13 Thread Tolbert, Andy
Thanks Abe!  I had a bit of a blind spot in checking for prior tickets.  It
was good to look at the discussion on CASSANDRA-15750.

> Out of curiosity - why do you prefer tests move towards 4.x driver vs.
in-tree SimpleClient

Great call out, I think we should definitely evaluate whether SimpleClient
could be used in place of a full driver implementation.  If it's not too
difficult to implement functionality the driver provides that we need for
tests, it may be worth it.

On the other hand, the fact that the driver is now an Apache project and that
it would no longer be a core server dependency makes it more justifiable to
use, even if it's just for tests and tools.

In any case, it's something we should look at before we get to the point of
porting test code to the 4.x driver.

Andy


On Thu, Feb 13, 2025 at 2:06 PM Abe Ratnofsky  wrote:

> Thanks for opening this discussion Andy. I'm also supportive of the plan
> you've proposed.
>
> Pushback from past discussion was mostly due to the 4.0 stabilization
> effort. Since then, cassandra-java-driver has been donated to ASF and
> driver 4.x has had a number of releases, so it feels like the right time to
> update.
>
> CASSANDRA-15750
> CASSANDRA-17231
>
> As far as I know, it's safe for the two drivers to co-exist on the same
> classpath as well.
>
> Out of curiosity - why do you prefer tests move towards 4.x driver vs.
> in-tree SimpleClient? We're already using SimpleClient in tests, and it's
> in-tree so we don't need to be concerned with API compatibility or leakage.
> We'd probably have to add some convenience APIs, like binding to prepared
> statements, to make the transition easier.
>


Re: [Discuss] Decoupling java driver dependency from server code; migrate tools and tests to 4.x driver

2025-02-13 Thread Jon Haddad
Yeah, I also lean strongly towards using the Java driver.

Dogfooding the Java driver has real benefits.  I don't see any benefit to
maintaining the SimpleClient.

Jon

On Thu, Feb 13, 2025 at 12:25 PM Tolbert, Andy  wrote:

> Thanks Abe!  I had a bit of a blind spot in checking for prior tickets.
> It was good to look at the discussion on CASSANDRA-15750.
>
> > Out of curiosity - why do you prefer tests move towards 4.x driver vs.
> in-tree SimpleClient
>
> Great call out, I think we should definitely evaluate whether SimpleClient
> could be used in place of a full driver implementation.  If it's not too
> difficult to implement functionality the driver provides that we need for
> tests, it may be worth it.
>
> On the other hand, the fact that the driver is now an Apache project and
> that it would no longer be a core server dependency makes it more justifiable
> to use, even if it's just for tests and tools.
>
> In any case, it's something we should look at before we get to the point
> of porting test code to the 4.x driver.
>
> Andy
>
>
> On Thu, Feb 13, 2025 at 2:06 PM Abe Ratnofsky  wrote:
>
>> Thanks for opening this discussion Andy. I'm also supportive of the plan
>> you've proposed.
>>
>> Pushback from past discussion was mostly due to the 4.0 stabilization
>> effort. Since then, cassandra-java-driver has been donated to ASF and
>> driver 4.x has had a number of releases, so it feels like the right time to
>> update.
>>
>> CASSANDRA-15750
>> CASSANDRA-17231
>>
>> As far as I know, it's safe for the two drivers to co-exist on the same
>> classpath as well.
>>
>> Out of curiosity - why do you prefer tests move towards 4.x driver vs.
>> in-tree SimpleClient? We're already using SimpleClient in tests, and it's
>> in-tree so we don't need to be concerned with API compatibility or leakage.
>> We'd probably have to add some convenience APIs, like binding to prepared
>> statements, to make the transition easier.
>>
>


Re: [Discuss] Decoupling java driver dependency from server code; migrate tools and tests to 4.x driver

2025-02-13 Thread Jeremiah Jordan
Given that we do not have any end users who use SimpleClient, and now that the
Java driver is part of the project, I would suggest we focus more on use of
the Java driver.
I think it is important that the end-to-end integration tests that use a
driver use one that actual clients will use.  This better ensures that those
features will work when deployed in the wild.

-Jeremiah

On Feb 13, 2025 at 2:24:18 PM, "Tolbert, Andy"  wrote:

> Thanks Abe!  I had a bit of a blind spot in checking for prior tickets.
> It was good to look at the discussion on CASSANDRA-15750.
>
> > Out of curiosity - why do you prefer tests move towards 4.x driver vs.
> in-tree SimpleClient
>
> Great call out, I think we should definitely evaluate whether SimpleClient
> could be used in place of a full driver implementation.  If its not too
> difficult to implement functionality the driver provides that we need for
> tests, it may be worth it.
>
> On the other hand, the fact that the driver is now an Apache project and
> that it would no longer be a core server dependency makes it more justifiable
> to use, even if it's just for tests and tools.
>
> In any case, it's something we should look at before we get to the point
> of porting test code to the 4.x driver.
>
> Andy
>
>
> On Thu, Feb 13, 2025 at 2:06 PM Abe Ratnofsky  wrote:
>
>> Thanks for opening this discussion Andy. I'm also supportive of the plan
>> you've proposed.
>>
>> Pushback from past discussion was mostly due to the 4.0 stabilization
>> effort. Since then, cassandra-java-driver has been donated to ASF and
>> driver 4.x has had a number of releases, so it feels like the right time to
>> update.
>>
>> CASSANDRA-15750
>> CASSANDRA-17231
>>
>> As far as I know, it's safe for the two drivers to co-exist on the same
>> classpath as well.
>>
>> Out of curiosity - why do you prefer tests move towards 4.x driver vs.
>> in-tree SimpleClient? We're already using SimpleClient in tests, and it's
>> in-tree so we don't need to be concerned with API compatibility or leakage.
>> We'd probably have to add some convenience APIs, like binding to prepared
>> statements, to make the transition easier.
>>
>


Re: Merging compaction improvements to 5.0

2025-02-13 Thread Dmitry Konstantinov
+1 (nb) from my side, I raised a few comments for CASSANDRA-15452 some time
ago and Jordan addressed them.
I have also backported CASSANDRA-15452 changes to my internal 4.1 fork and
got about 15% reduction in compaction time even for a node with a local SSD.

On Thu, 13 Feb 2025 at 13:22, Jordan West  wrote:

> For 15452 that’s correct (and I believe also for 20092). For 15452, the
> trunk and 5.0 patch are basically identical.
>
> Jordan
>
> On Thu, Feb 13, 2025 at 01:06 C. Scott Andreas 
> wrote:
>
>> Checking to confirm the specific patches proposed for backport – is it
>> the trunk commit for C-20092 and the open GitHub PR against the 5.0 branch
>> for C-15452 linked below?
>>
>> CASSANDRA-20092: Introduce SSTableSimpleScanner for compaction (committed
>> to trunk)
>> https://github.com/apache/cassandra/commit/3078aea1cfc70092a185bab8ac5dc8a35627330f
>>
>>  CASSANDRA-15452: Improve disk access patterns during compaction and
>> range reads (PR available) https://github.com/apache/cassandra/pull/3606
>>
>> Thanks,
>>
>> – Scott
>>
>> On Feb 12, 2025, at 9:45 PM, guo Maxwell  wrote:
>>
>>
>> Of course, I definitely hope to see it merged into 5.0.x as soon as
>> possible
>>
>> Jordan West wrote on Thu, Feb 13, 2025 at 10:48:
>>
>>> Regarding the buffer size, it is configurable. My personal take is that
>>> we’ve tested this on a variety of hardware (from laptops to large instance
>>> sizes) already, as well as a few different disk configs (it’s also been run
>>> internally, in test, at a few places) and that it has been reviewed by four
>>> committers and another contributor. Always love to see more numbers. if
>>> folks want to take it for a spin on Alibaba cloud, azure, etc and determine
>>> the best buffer size that’s awesome. We could document which is suggested
>>> for the community. I don’t think it’s necessary to block on that however.
>>>
>>> Also I am of course +1 to including this in 5.0.
>>>
>>> Jordan
>>>
>>> On Wed, Feb 12, 2025 at 19:50 guo Maxwell  wrote:
>>>
 My understanding is that there will be some differences in block
 storage among the various cloud platforms. Most obviously, the default
 read-ahead size will not be the same: for example, AWS EBS seems to be 256K,
 and Alibaba Cloud seems to be 512K (if I remember correctly).

 Just like 19488, provide the test method, see who can assist with the
 testing, and share the results.

 Jon Haddad wrote on Thu, Feb 13, 2025 at 08:30:

> Can you elaborate why?  This would be several hundred hours of work
> and would cost me thousands of $$ to perform.
>
> Filesystems and block devices are well understood.  Could you give me
> an example of what you think might be different here?  This is already one
> of the most well tested and documented performance patches ever 
> contributed
> to the project.
>
> On Wed, Feb 12, 2025 at 4:26 PM guo Maxwell 
> wrote:
>
>> I think it should be tested on most cloud platforms (at least
>> AWS, Azure, GCP) before being merged into 5.0, just like CASSANDRA-19488.
>>
>> Paulo Motta wrote on Thu, Feb 13, 2025 at 6:10 AM:
>>
>>> I'm looking forward to these improvements, compaction needs tlc. :-)
>>> A couple of questions:
>>>
>>> Has this been tested only on EBS, or also EC2/bare-metal/Azure/etc?
>>> My
>>> only concern is if this is an optimization for EBS that can be a
>>> deoptimization for other environments.
>>>
>>> Are there reproducible scripts that anyone can run to verify the
>>> improvements in their own environments ? This could help alleviate
>>> any
>>> concerns and gain confidence to introduce a perf. improvement in a
>>> patch release.
>>>
>>> I have not read the ticket in detail, so apologies if this was
>>> already
>>> discussed there or elsewhere.
>>>
>>> On Wed, Feb 12, 2025 at 3:01 PM Jon Haddad 
>>> wrote:
>>> >
>>> > Hey folks,
>>> >
>>> > Over the last 9 months Jordan and I have worked on CASSANDRA-15452
>>> [1].  The TL;DR is that we're internalizing a read ahead buffer to 
>>> allow us
>>> to do fewer requests to disk during compaction and range reads.  This
>>> results in far fewer system calls (roughly 16x reduction) and on systems
>>> with higher read latency, a significant improvement in compaction
>>> throughput.  We've tested several different EBS configurations and 
>>> found it
>>> delivers up to a 10x improvement when read ahead is optimized to 
>>> minimize
>>> read latency.  I worked with AWS and the EBS team directly on this and 
>>> the
>>> Best Practices for C* on EBS [2] I wrote for them.  I've performance 
>>> tested
>>> this patch extensively with hundreds of billions of operations across
>>> several clusters and thousands of compactions.  It has less of an 
>>> impact on
>>> local NVMe, since the p99 latency is already 10-30x less than what you 
>>> see
>>

Re: [Discuss] Decoupling java driver dependency from server code; migrate tools and tests to 4.x driver

2025-02-13 Thread Abe Ratnofsky
SimpleClient is definitely limited - it doesn't manage connection pools, load 
balancing, or error handling. I'd love to get to the point where we can check a 
driver release by running C* tests against the latest snapshot as part of 
qualification, so I'm on board for consolidating on driver 4.x where it's 
appropriate.
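
For context, a minimal sketch of what a driver-4.x-based test helper could look
like (the package names are the 4.x driver's; the contact point and datacenter
name are assumptions for a local test node):

import java.net.InetSocketAddress;
import com.datastax.oss.driver.api.core.CqlSession;
import com.datastax.oss.driver.api.core.cql.PreparedStatement;
import com.datastax.oss.driver.api.core.cql.Row;

public class DriverSmokeTest
{
    public static void main(String[] args)
    {
        // The 4.x driver gives us connection pooling, retries and load balancing,
        // which SimpleClient does not attempt to provide.
        try (CqlSession session = CqlSession.builder()
                .addContactPoint(new InetSocketAddress("127.0.0.1", 9042))
                .withLocalDatacenter("datacenter1")
                .build())
        {
            PreparedStatement ps = session.prepare(
                    "SELECT release_version FROM system.local WHERE key = ?");
            Row row = session.execute(ps.bind("local")).one();
            System.out.println("release_version = " + row.getString("release_version"));
        }
    }
}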

Re: Merging compaction improvements to 5.0

2025-02-13 Thread Jon Haddad
Yeah, this is how I feel too.

This is different from CASSANDRA-19488 in that there aren't any cloud
provider specific details that we need to account for with our patch.
We're doing normal IO here.  The same code works everywhere.  The results
will vary based on disk latency and quotas, but imo, figuring out what that
variability is should not be our responsibility.  We're not here to
benchmark hardware.

From the standpoint of understanding variability across cloud providers,
there's nothing in Azure that can make 16 synchronous requests of 16KB faster
than a single request of 256KB.  What we have here is faster than the page
cache; that's the optimal filesystem path.
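
(Rough numbers for illustration only: at roughly 1 ms per EBS round trip, 16
serialized 16KB reads spend on the order of 16 ms fetching 256KB of data, while
one 256KB request spends roughly 1 ms, which is where the ~16x reduction in
system calls and I/O wait comes from.)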

If folks want to point to the docs for each cloud provider for the maximum
block size per IO request, we can certainly document that somewhere.  If
someone wants to spend the time figuring that out, I warmly welcome it, but
I don't think it should be a blocker on merging to 5.0.

Jon


On Wed, Feb 12, 2025 at 6:50 PM Jordan West  wrote:

> Regarding the buffer size, it is configurable. My personal take is that
> we’ve tested this on a variety of hardware (from laptops to large instance
> sizes) already, as well as a few different disk configs (it’s also been run
> internally, in test, at a few places) and that it has been reviewed by four
> committers and another contributor. Always love to see more numbers. if
> folks want to take it for a spin on Alibaba cloud, azure, etc and determine
> the best buffer size that’s awesome. We could document which is suggested
> for the community. I don’t think it’s necessary to block on that however.
>
> Also I am of course +1 to including this in 5.0.
>
> Jordan
>
> On Wed, Feb 12, 2025 at 19:50 guo Maxwell  wrote:
>
>> My understanding is that there will be some differences in block storage
>> among the various cloud platforms. Most obviously, the default read-ahead
>> size will not be the same: for example, AWS EBS seems to be 256K, and Alibaba
>> Cloud seems to be 512K (if I remember correctly).
>>
>> Just like 19488, provide the test method, see who can assist with the
>> testing, and share the results.
>>
>> Jon Haddad wrote on Thu, Feb 13, 2025 at 08:30:
>>
>>> Can you elaborate why?  This would be several hundred hours of work and
>>> would cost me thousands of $$ to perform.
>>>
>>> Filesystems and block devices are well understood.  Could you give me an
>>> example of what you think might be different here?  This is already one of
>>> the most well tested and documented performance patches ever contributed to
>>> the project.
>>>
>>> On Wed, Feb 12, 2025 at 4:26 PM guo Maxwell 
>>> wrote:
>>>
 I think it should be tested on most cloud platforms (at least
 AWS, Azure, GCP) before being merged into 5.0, just like CASSANDRA-19488.

 Paulo Motta wrote on Thu, Feb 13, 2025 at 6:10 AM:

> I'm looking forward to these improvements, compaction needs tlc. :-)
> A couple of questions:
>
> Has this been tested only on EBS, or also EC2/bare-metal/Azure/etc? My
> only concern is if this is an optimization for EBS that can be a
> deoptimization for other environments.
>
> Are there reproducible scripts that anyone can run to verify the
> improvements in their own environments ? This could help alleviate any
> concerns and gain confidence to introduce a perf. improvement in a
> patch release.
>
> I have not read the ticket in detail, so apologies if this was already
> discussed there or elsewhere.
>
> On Wed, Feb 12, 2025 at 3:01 PM Jon Haddad 
> wrote:
> >
> > Hey folks,
> >
> > Over the last 9 months Jordan and I have worked on CASSANDRA-15452
> [1].  The TL;DR is that we're internalizing a read ahead buffer to allow 
> us
> to do fewer requests to disk during compaction and range reads.  This
> results in far fewer system calls (roughly 16x reduction) and on systems
> with higher read latency, a significant improvement in compaction
> throughput.  We've tested several different EBS configurations and found 
> it
> delivers up to a 10x improvement when read ahead is optimized to minimize
> read latency.  I worked with AWS and the EBS team directly on this and the
> Best Practices for C* on EBS [2] I wrote for them.  I've performance 
> tested
> this patch extensively with hundreds of billions of operations across
> several clusters and thousands of compactions.  It has less of an impact 
> on
> local NVMe, since the p99 latency is already 10-30x less than what you see
> on EBS (100micros vs 1-3ms), and you can do hundreds of thousands of IOPS
> vs a max of 16K.
> >
> > Related to this, Branimir wrote CASSANDRA-20092 [3], which
> significantly improves compaction by avoiding reading the partition index.
> CASSANDRA-20092 has been merged to trunk already [4].
> >
> > I think we should merge both of these patches into 5.0, as the perf
> improve

Re: [Discuss] Decoupling java driver dependency from server code; migrate tools and tests to 4.x driver

2025-02-13 Thread Abe Ratnofsky
Thanks for opening this discussion Andy. I'm also supportive of the plan you've 
proposed.

Pushback from past discussion was mostly due to the 4.0 stabilization effort. 
Since then, cassandra-java-driver has been donated to ASF and driver 4.x has 
had a number of releases, so it feels like the right time to update.

CASSANDRA-15750
CASSANDRA-17231

As far as I know, it's safe for the two drivers to co-exist on the same 
classpath as well.

Out of curiosity - why do you prefer tests move towards 4.x driver vs. in-tree 
SimpleClient? We're already using SimpleClient in tests, and it's in-tree so we 
don't need to be concerned with API compatibility or leakage. We'd probably 
have to add some convenience APIs, like binding to prepared statements, to make 
the transition easier.

Re: [Discuss] Decoupling java driver dependency from server code; migrate tools and tests to 4.x driver

2025-02-13 Thread Tolbert, Andy
Hi All,

I went ahead and started an epic to track the work here:
https://issues.apache.org/jira/browse/CASSANDRA-20326

Initially the focus will be on removing the driver as a dependency of
core server code.  Once that has been achieved we can start working on
migrating tooling and test code to the 4.x driver.

Thanks,
Andy


On Wed, Feb 12, 2025 at 8:40 PM Tolbert, Andy  wrote:
>
> Jon
>
> > We can probably skip cassandra-stress, since it looks like easy-cass-stress 
> > can be donated.  That does need a driver upgrade to support a vector 
> > workload, but imo there's no point in investing more in cassandra-stress 
> > when we have an alternative with more features available.  Not a hill I'm 
> > going to die on, just an opportunity to do less work.
>
> That sounds great!   I think it's likely that we can manage separate 
> classpaths for the various tools, so we could update the driver dependency in 
> fqltool and cassandra-loader in the meantime and leave cassandra-stress as is 
> if it is going to be superseded.
>
> JD,
>
> > For the tests, maybe we can have two test class paths for a while?  One for 
> > driver 3 and one for driver 4?  That way we don’t need to migrate them all 
> > in a giant big bang patch?  They could be moved over a few at a time making 
> > review much easier.
>
> I'll explore if this is possible as I think that could potentially work and 
> be more manageable.  I can also evaluate whether it's possible for the two 
> drivers to co-exist on the same classpath (I think that may be the case, but 
> I'm not certain).
>
> Thanks,
> Andy
>
>
> On Wed, Feb 12, 2025 at 6:59 PM J. D. Jordan  
> wrote:
>>
>> Sounds like a reasonable plan to me. +1
>>
>> For the tests, maybe we can have two test class paths for a while?  One for 
>> driver 3 and one for driver 4?  That way we don’t need to migrate them all 
>> in a giant big bang patch?  They could be moved over a few at a time making 
>> review much easier.
>>
>> On Feb 12, 2025, at 6:35 PM, Jon Haddad  wrote:
>>
>> 
>> Hey Andy,
>>
>> This seems like a reasonable proposal.
>>
>> We can probably skip cassandra-stress, since it looks like easy-cass-stress 
>> can be donated.  That does need a driver upgrade to support a vector 
>> workload, but imo there's no point in investing more in cassandra-stress 
>> when we have an alternative with more features available.  Not a hill I'm 
>> going to die on, just an opportunity to do less work.
>>
>> Jon
>>
>>
>>
>>
>> On Wed, Feb 12, 2025 at 3:06 PM Tolbert, Andy  wrote:
>>>
>>> Hi All,
>>>
>>> I'd like to propose decoupling the java driver as a dependency from the core
>>> Cassandra server code.
>>>
>>> I also want to propose a path towards eventually migrating test and tools 
>>> code
>>> from Apache Cassandra java driver 3.x to 4.x when the time is right for the
>>> project.
>>>
>>> Refactoring test code to 4.x is likely to be quite invasive, as I count
>>> 128 source files utilizing driver code.  We'd want to find a good time to do
>>> this to minimize disruption to ongoing development.
>>>
>>> Java driver 4.x is effectively a rewrite of the 3.x driver.  Its first 
>>> release
>>> was in March of 2019. While it has similar APIs, it is not binary compatible
>>> with the 3.x driver [1].
>>>
>>> While there hasn't been a clear decision on how the 3.x driver will be
>>> supported going forward (although we should consider discussing this!), we
>>> expect and have seen active development take place almost exclusively
>>> on the 4.x driver.
>>>
>>> It would be useful to migrate to the 4.x driver to test new and future 
>>> features
>>> which the 4.x driver will actively support.  For example, the 4.x driver
>>> supports Vector types, where the 3.x driver does not.
>>>
>>> I've gone through the codebase and identified the following uses of the driver:
>>>
>>> 0. Core code that uses the driver
>>>
>>> * UntypedResultSet uses CodecUtils.fromUnsignedToSignedInt from the driver
>>>   which is just adding Integer.MIN_VALUE to an int, so it can easily be
>>>   removed (see the sketch at the end of this message).
>>> * PreparedStatementHelper is used only by dtest fuzz tests to validate
>>>   Prepared Statements.  Can be moved to test code.
>>> * ThreadAwareSecurityManager.checkPermission makes reference to skipping
>>>   checking accessDeclaredMembers due to use of CodecUtils, can probably 
>>> remove
>>>   that with its use removed.
>>> * sstableloader uses the driver to fetch schema and metadata
>>>
>>> 1. Tools that use the driver
>>>
>>> * fqltool replay (replaying queries from captured logs)
>>> * cassandra-stress (making queries to generate load)
>>>
>>> 2. Test code
>>>
>>> * Understandably, quite a bit of test code uses the driver. This is where I
>>>   anticipate the most work would be needed.
>>>
>>> I'd like to propose doing the following:
>>>
>>> Can be done now:
>>>
>>> * Move sstableloader source into its own tools directory, much like fqltool
>>>   and cassandra-stress.  For compatibility, we could retain the existing 
>>> shell
>>>   script

Re: Merging compaction improvements to 5.0

2025-02-13 Thread Doug Rohrer
+1 - Thanks for doing the work to figure this out and find a good fix.

Doug

> On Feb 13, 2025, at 11:28 AM, Patrick McFadin  wrote:
> 
> I’ve been following this for a while and I think it’s just some solid 
> engineering based on real-world challenges. Probably one of the best types of 
> contributions to have. I’m +1 on adding it to 5
> 
> Patrick
> 
> On Thu, Feb 13, 2025 at 7:31 AM Dmitry Konstantinov wrote:
>> +1 (nb) from my side, I raised a few comments for CASSANDRA-15452 some time 
>> ago and Jordan addressed them.
>> I have also backported CASSANDRA-15452 changes to my internal 4.1 fork and 
>> got about 15% reduction in compaction time even for a node with a local SSD.
>> 
>> On Thu, 13 Feb 2025 at 13:22, Jordan West wrote:
>>> For 15452 that’s correct (and I believe also for 20092). For 15452, the 
>>> trunk and 5.0 patch are basically identical. 
>>> 
>>> Jordan 
>>> 
>>> On Thu, Feb 13, 2025 at 01:06 C. Scott Andreas wrote:
 Checking to confirm the specific patches proposed for backport – is it the 
 trunk commit for C-20092 and the open GitHub PR against the 5.0 branch for 
 C-15452 linked below?
 
 CASSANDRA-20092: Introduce SSTableSimpleScanner for compaction (committed 
 to trunk) 
 https://github.com/apache/cassandra/commit/3078aea1cfc70092a185bab8ac5dc8a35627330f
 
  CASSANDRA-15452: Improve disk access patterns during compaction and range 
 reads (PR available) https://github.com/apache/cassandra/pull/3606
 
 Thanks,
 
 – Scott
 
> On Feb 12, 2025, at 9:45 PM, guo Maxwell wrote:
> 
> 
> Of course, I definitely hope to see it merged into 5.0.x as soon as 
> possible
> 
> Jordan West <jw...@apache.org> wrote on Thu, Feb 13, 2025 at 10:48:
>> Regarding the buffer size, it is configurable. My personal take is that 
>> we’ve tested this on a variety of hardware (from laptops to large 
>> instance sizes) already, as well as a few different disk configs (it’s 
>> also been run internally, in test, at a few places) and that it has been 
>> reviewed by four committers and another contributor. Always love to see 
>> more numbers. if folks want to take it for a spin on Alibaba cloud, 
>> azure, etc and determine the best buffer size that’s awesome. We could 
>> document which is suggested for the community. I don’t think it’s 
>> necessary to block on that however. 
>> 
>> Also I am of course +1 to including this in 5.0. 
>> 
>> Jordan 
>> 
>> On Wed, Feb 12, 2025 at 19:50 guo Maxwell wrote:
>>> My understanding is that there will be some differences in block
>>> storage among the various cloud platforms. Most obviously, the default
>>> read-ahead size will not be the same: for example, AWS EBS seems to be
>>> 256K, and Alibaba Cloud seems to be 512K (if I remember correctly).
>>>
>>> Just like 19488, provide the test method, see who can assist with the
>>> testing, and share the results.
>>> 
>>> Jon Haddad <j...@rustyrazorblade.com> wrote on Thu, Feb 13, 2025 at 08:30:
 Can you elaborate why?  This would be several hundred hours of work 
 and would cost me thousands of $$ to perform.
 
 Filesystems and block devices are well understood.  Could you give me 
 an example of what you think might be different here?  This is already 
 one of the most well tested and documented performance patches ever 
 contributed to the project.
 
 On Wed, Feb 12, 2025 at 4:26 PM guo Maxwell wrote:
> I think it should be tested on most cloud platforms (at least
> AWS, Azure, GCP) before being merged into 5.0, just like CASSANDRA-19488.
> 
> Paulo Motta <pa...@apache.org> wrote on Thu, Feb 13, 2025 at 6:10 AM:
>> I'm looking forward to these improvements, compaction needs tlc. :-)
>> A couple of questions:
>> 
>> Has this been tested only on EBS, or also EC2/bare-metal/Azure/etc? 
>> My
>> only concern is if this is an optimization for EBS that can be a
>> deoptimization for other environments.
>> 
>> Are there reproducible scripts that anyone can run to verify the
>> improvements in their own environments ? This could help alleviate 
>> any
>> concerns and gain confidence to introduce a perf. improvement in a
>> patch release.
>> 
>> I have not read the ticket in detail, so apologies if this was 
>> already
>> discussed there or elsewhere.
>> 
>> On Wed, Feb 12, 2025 at 3:01 PM Jon Haddad wrote:
>> >
>> > Hey folks,
>

Re: Merging compaction improvements to 5.0

2025-02-13 Thread Paulo Motta
> My personal take is that we’ve tested this on a variety of hardware (from
laptops to large instance sizes) already, as well as a few different disk
configs (it’s also been run internally, in test, at a few places) and that
it has been reviewed by four committers and another contributor.

Thanks for the additional context Jordan.

I will not be able to test this soon, but it looks like it has broad
support and no apparent objections, so this seems like a welcome change. +1
to including this extraordinary perf improvement in 5.0 if there are no objections.

I will add additional feedback once I have the chance to test this.

On Wed, 12 Feb 2025 at 21:48 Jordan West  wrote:

> Regarding the buffer size, it is configurable. My personal take is that
> we’ve tested this on a variety of hardware (from laptops to large instance
> sizes) already, as well as a few different disk configs (it’s also been run
> internally, in test, at a few places) and that it has been reviewed by four
> committers and another contributor. Always love to see more numbers. if
> folks want to take it for a spin on Alibaba cloud, azure, etc and determine
> the best buffer size that’s awesome. We could document which is suggested
> for the community. I don’t think it’s necessary to block on that however.
>
> Also I am of course +1 to including this in 5.0.
>
> Jordan
>
> On Wed, Feb 12, 2025 at 19:50 guo Maxwell  wrote:
>
>> My understanding is that there will be some differences in block storage
>> among the various cloud platforms. Most obviously, the default read-ahead
>> size will not be the same: for example, AWS EBS seems to be 256K, and Alibaba
>> Cloud seems to be 512K (if I remember correctly).
>>
>> Just like 19488, provide the test method, see who can assist with the
>> testing, and share the results.
>>
>> Jon Haddad wrote on Thu, Feb 13, 2025 at 08:30:
>>
>>> Can you elaborate why?  This would be several hundred hours of work and
>>> would cost me thousands of $$ to perform.
>>>
>>> Filesystems and block devices are well understood.  Could you give me an
>>> example of what you think might be different here?  This is already one of
>>> the most well tested and documented performance patches ever contributed to
>>> the project.
>>>
>>> On Wed, Feb 12, 2025 at 4:26 PM guo Maxwell 
>>> wrote:
>>>
 I think it should be tested on most cloud platforms (at least
 AWS, Azure, GCP) before being merged into 5.0, just like CASSANDRA-19488.

 Paulo Motta wrote on Thu, Feb 13, 2025 at 6:10 AM:

> I'm looking forward to these improvements, compaction needs tlc. :-)
> A couple of questions:
>
> Has this been tested only on EBS, or also EC2/bare-metal/Azure/etc? My
> only concern is if this is an optimization for EBS that can be a
> deoptimization for other environments.
>
> Are there reproducible scripts that anyone can run to verify the
> improvements in their own environments ? This could help alleviate any
> concerns and gain confidence to introduce a perf. improvement in a
> patch release.
>
> I have not read the ticket in detail, so apologies if this was already
> discussed there or elsewhere.
>
> On Wed, Feb 12, 2025 at 3:01 PM Jon Haddad 
> wrote:
> >
> > Hey folks,
> >
> > Over the last 9 months Jordan and I have worked on CASSANDRA-15452
> [1].  The TL;DR is that we're internalizing a read ahead buffer to allow 
> us
> to do fewer requests to disk during compaction and range reads.  This
> results in far fewer system calls (roughly 16x reduction) and on systems
> with higher read latency, a significant improvement in compaction
> throughput.  We've tested several different EBS configurations and found 
> it
> delivers up to a 10x improvement when read ahead is optimized to minimize
> read latency.  I worked with AWS and the EBS team directly on this and the
> Best Practices for C* on EBS [2] I wrote for them.  I've performance 
> tested
> this patch extensively with hundreds of billions of operations across
> several clusters and thousands of compactions.  It has less of an impact 
> on
> local NVMe, since the p99 latency is already 10-30x less than what you see
> on EBS (100micros vs 1-3ms), and you can do hundreds of thousands of IOPS
> vs a max of 16K.
> >
> > Related to this, Branimir wrote CASSANDRA-20092 [3], which
> significantly improves compaction by avoiding reading the partition index.
> CASSANDRA-20092 has been merged to trunk already [4].
> >
> > I think we should merge both of these patches into 5.0, as the perf
> improvement should allow teams to increase density of EBS backed C*
> clusters by 2-5x, driving cost way down.  There's a lot of teams running 
> C*
> on EBS now.  I'm currently working with one that's bottlenecked on maxed
> out EBS GP3 storage.  I propose we merge both, because without
> CASSANDRA-20092, we won't get the performance improvements in