Re: Merging compaction improvements to 5.0

Josh McKenzie Fri, 14 Feb 2025 05:17:44 -0800

> If folks want to point to the docs for each cloud provider for the maximum 
> block size per IO request, we can certainly document that somewhere.
Meh, that will probably change on their side over time right? At most I'd say 
we link to their docs, but even then those external links will go stale and 
let's be honest: we're not going to keep up with that.


I think a bread crumb of "hey, check the docs for your infra to see what the 
optimal max block size is" would be just right.

Also: +1 to including in 5.0. Great work on this!

On Thu, Feb 13, 2025, at 8:25 PM, Paulo Motta wrote:
> > My personal take is that we’ve tested this on a variety of hardware (from 
> > laptops to large instance sizes) already, as well as a few different disk 
> > configs (it’s also been run internally, in test, at a few places) and that 
> > it has been reviewed by four committers and another contributor.
> 
> Thanks for the additional context Jordan.
> 
> I will not be able to test this soon, but it looks like it has broad support 
> and no apparent objections so this seems like a welcome change. +1 to include 
> this extraordinary perf improvement in 5.0 if no objections.
> 
> I will add additional feedback once I have the chance to test this.
> 
> On Wed, 12 Feb 2025 at 21:48 Jordan West <[email protected]> wrote:
>> Regarding the buffer size, it is configurable. My personal take is that 
>> we’ve tested this on a variety of hardware (from laptops to large instance 
>> sizes) already, as well as a few different disk configs (it’s also been run 
>> internally, in test, at a few places) and that it has been reviewed by four 
>> committers and another contributor. Always love to see more numbers. If 
>> folks want to take it for a spin on Alibaba Cloud, Azure, etc. and determine 
>> the best buffer size, that’s awesome. We could document which is suggested 
>> for the community. I don’t think it’s necessary to block on that however. 
>> 
>> Also I am of course +1 to including this in 5.0. 
>> 
>> Jordan 
>> 
>> On Wed, Feb 12, 2025 at 19:50 guo Maxwell <[email protected]> wrote:
>>> What I understand is that there will be some differences in block storage 
>>> among various cloud platforms. More intuitively, the default read-ahead 
>>> size will be the same. For example, AWS EBS seems to be 256K, and Alibaba 
>>> Cloud seems to be 512K (if I remember correctly).
>>> 
>>> Just like CASSANDRA-19488: provide the test method, see who can assist 
>>> with the testing, and share the results.
>>> 
>>> On Thu, Feb 13, 2025 at 08:30 Jon Haddad <[email protected]> wrote:
>>>> Can you elaborate why?  This would be several hundred hours of work and 
>>>> would cost me thousands of $$ to perform.
>>>> 
>>>> Filesystems and block devices are well understood.  Could you give me an 
>>>> example of what you think might be different here?  This is already one of 
>>>> the most well tested and documented performance patches ever contributed 
>>>> to the project.
>>>> 
>>>> On Wed, Feb 12, 2025 at 4:26 PM guo Maxwell <[email protected]> wrote:
>>>>> I think it should be tested on most cloud platforms (at least 
>>>>> AWS, Azure, GCP) before being merged into 5.0, just like CASSANDRA-19488.
>>>>> 
>>>>> On Thu, Feb 13, 2025 at 6:10 AM Paulo Motta <[email protected]> wrote:
>>>>>> I'm looking forward to these improvements, compaction needs tlc. :-)
>>>>>> A couple of questions:
>>>>>> 
>>>>>> Has this been tested only on EBS, or also EC2/bare-metal/Azure/etc? My
>>>>>> only concern is if this is an optimization for EBS that can be a
>>>>>> deoptimization for other environments.
>>>>>> 
>>>>>> Are there reproducible scripts that anyone can run to verify the
>>>>>> improvements in their own environments? This could help alleviate any
>>>>>> concerns and gain confidence to introduce a perf. improvement in a
>>>>>> patch release.
>>>>>> 
>>>>>> I have not read the ticket in detail, so apologies if this was already
>>>>>> discussed there or elsewhere.
>>>>>> 
>>>>>> On Wed, Feb 12, 2025 at 3:01 PM Jon Haddad <[email protected]> 
>>>>>> wrote:
>>>>>> >
>>>>>> > Hey folks,
>>>>>> >
>>>>>> > Over the last 9 months Jordan and I have worked on CASSANDRA-15452 
>>>>>> > [1].  The TL;DR is that we're internalizing a read ahead buffer to 
>>>>>> > allow us to do fewer requests to disk during compaction and range 
>>>>>> > reads.  This results in far fewer system calls (roughly 16x reduction) 
>>>>>> > and on systems with higher read latency, a significant improvement in 
>>>>>> > compaction throughput.  We've tested several different EBS 
>>>>>> > configurations and found it delivers up to a 10x improvement when read 
>>>>>> > ahead is optimized to minimize read latency.  I worked with AWS and 
>>>>>> > the EBS team directly on this and the Best Practices for C* on EBS [2] 
>>>>>> > I wrote for them.  I've performance tested this patch extensively with 
>>>>>> > hundreds of billions of operations across several clusters and 
>>>>>> > thousands of compactions.  It has less of an impact on local NVMe, 
>>>>>> > since the p99 latency is already 10-30x less than what you see on EBS 
>>>>>> > (100micros vs 1-3ms), and you can do hundreds of thousands of IOPS vs 
>>>>>> > a max of 16K.
>>>>>> >
>>>>>> > Related to this, Branimir wrote CASSANDRA-20092 [3], which 
>>>>>> > significantly improves compaction by avoiding reading the partition 
>>>>>> > index.  CASSANDRA-20092 has been merged to trunk already [4].
>>>>>> >
>>>>>> > I think we should merge both of these patches into 5.0, as the perf 
>>>>>> > improvement should allow teams to increase density of EBS backed C* 
>>>>>> > clusters by 2-5x, driving cost way down.  There's a lot of teams 
>>>>>> > running C* on EBS now.  I'm currently working with one that's 
>>>>>> > bottlenecked on maxed out EBS GP3 storage.  I propose we merge both, 
>>>>>> > because without CASSANDRA-20092, we won't get the performance 
>>>>>> > improvements in CASSANDRA-15452 with BTI, only BIG format.  I've 
>>>>>> > tested BTI in other situations and found it to be far more performant 
>>>>>> > than BIG.
>>>>>> >
>>>>>> > If we were looking at a small win, I wouldn't care much, but since 
>>>>>> > these patches, combined with UCS, allow more teams to run C* on EBS 
>>>>>> > at > 10TB / node, I think it's worth doing now.
>>>>>> >
>>>>>> > Thanks in advance,
>>>>>> > Jon
>>>>>> >
>>>>>> > [1] https://issues.apache.org/jira/browse/CASSANDRA-15452
>>>>>> > [2] 
>>>>>> > https://aws.amazon.com/blogs/database/best-practices-for-running-apache-cassandra-with-amazon-ebs/
>>>>>> > [3] https://issues.apache.org/jira/browse/CASSANDRA-20092
>>>>>> > [4] 
>>>>>> > https://github.com/apache/cassandra/commit/3078aea1cfc70092a185bab8ac5dc8a35627330f
>>>>>> >
