Re: [DISCUSS] Replacement of SSTable's partition cardinality implementation from stream-lib to Apache Datasketches

2025-03-12 Thread Štefan Miklošovič
Benedict,

I have reached out to the Datasketches community (1) and asked what they
think about Clearspring and whether it is convertible to Datasketches, as
you earlier suggested we might try to convert one to the other.

Based on what they wrote, I do not think that is possible (2); they say
that Clearspring has "serious error problems" and does not implement
Google's HLL++ paper correctly, etc.

As I see it, in case we have SSTables in both the old and the new format,
we might compute keys as in (3). This code would be exercised only as long
as there are mixed formats. If we upgrade SSTables to the new format, or if
old SSTables are compacted away into new-format SSTables, we would not do
it as in (3) anymore.
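
To make the mixed-format case concrete, below is a minimal, hypothetical
sketch of the two merge paths involved, assuming the published Clearspring
(HyperLogLogPlus) and Datasketches (HllSketch/Union) Java APIs; it is an
illustration of the idea, not the code in the PR:

    import java.io.IOException;
    import java.util.List;
    import com.clearspring.analytics.stream.cardinality.CardinalityMergeException;
    import com.clearspring.analytics.stream.cardinality.HyperLogLogPlus;
    import com.clearspring.analytics.stream.cardinality.ICardinality;
    import org.apache.datasketches.hll.HllSketch;
    import org.apache.datasketches.hll.Union;

    // While mixed formats exist, merging has to stay in the Clearspring
    // domain, because the two serialized forms are not convertible.
    static long mergeClearspring(List<byte[]> logs) throws IOException, CardinalityMergeException
    {
        ICardinality merged = HyperLogLogPlus.Builder.build(logs.get(0));
        for (int i = 1; i < logs.size(); i++)
            merged = merged.merge(HyperLogLogPlus.Builder.build(logs.get(i)));
        return merged.cardinality();
    }

    // Once every SSTable carries a Datasketches log, merging moves here.
    static long mergeDatasketches(List<byte[]> logs)
    {
        Union union = new Union(13); // lgMaxK, matching the precision discussed in this thread
        for (byte[] log : logs)
            union.update(HllSketch.heapify(log));
        return (long) union.getResult().getEstimate();
    }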

If we are OK with this, then I would try to spend more time on finishing
the PR and run some perf tests etc. so we can compare before / after.

How does that sound?

Regards

(1) https://lists.apache.org/thread/4rhbqzqyh1cn0pmbst8som4kvvko8gqp
(2) https://lists.apache.org/thread/l00yv67wwtztgl5lopdtbw3z9s7fng5b
(3) https://github.com/apache/cassandra/pull/3767/files#r1989136062

On Fri, Jan 3, 2025 at 1:47 PM Benedict  wrote:

> I’ve had a quick skim of the data sketches library, and it does seem to
> have made some more efficient decisions in its design than clearspring,
> appears to maybe support off-heap representations, and has reasonably good
> documentation about the theoretical properties of the sketches. The chair
> of the project is a published author on the topic, and the library has
> newer algorithms for cardinality estimation than HLL.
>
> So, honestly, it might not be a bad idea to (carefully) consider a
> migration, even if the current library isn’t broken for our needs.
>
> It would not be high up my priority list for the project, but I would
> support it if it scratches someone’s itch.
>
> On 3 Jan 2025, at 12:16, Štefan Miklošovič  wrote:
>
> 
> Okay ... first problems.
>
> These 2000 bytes I mentioned in my response to Chris were indeed correct,
> but that was with Datasketches and the main parameter for HllSketch
> (DEFAULT_LG_K) set to 12. When I changed that to 13, to match what we
> currently have in Cassandra with Clearspring, it doubled the size to
> ~4000 bytes.
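
For reference, a minimal standalone sketch of the size effect described
above, assuming the datasketches-java HllSketch API (the roughly 2x growth
per +1 of lgK is expected, since the register array doubles):

    import org.apache.datasketches.hll.HllSketch;

    public class LgKSize
    {
        public static void main(String[] args)
        {
            for (int lgK : new int[] { 12, 13 })
            {
                HllSketch sketch = new HllSketch(lgK);
                for (long i = 0; i < 1_000_000; i++)
                    sketch.update(i); // saturate the sketch so it leaves sparse mode
                System.out.println("lgK=" + lgK + " -> "
                                   + sketch.toCompactByteArray().length + " bytes");
            }
        }
    }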
>
> When we do not use Datasketches, what Clearspring generates is about ~5000
> bytes for the array itself, but that array is wrapped in an ICardinality
> object of Clearspring, and we need that object in order to merge another
> ICardinality into it. So we would need to cache this ICardinality object
> instead of just the array itself. If we don't cache the whole ICardinality,
> we would then need to do basically what
> CompactionMetadata.CompactionMetadataSerializer.deserialize is doing, which
> would allocate a lot / often (ICardinality cardinality =
> HyperLogLogPlus.Builder.build(that_cached_array)).
>
> To avoid the allocations every time we compute, we would just cache that
> whole ICardinality of Clearspring, but that whole object measures around
> 11-12 KB. So even 10k SSTables would occupy around 100 MB; 50k SSTables,
> 500 MB. That is becoming quite a problem.
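
A hypothetical sketch of the trade-off being weighed here (the cache map
and method names are invented for illustration):

    import java.io.IOException;
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;
    import com.clearspring.analytics.stream.cardinality.HyperLogLogPlus;
    import com.clearspring.analytics.stream.cardinality.ICardinality;

    final class CardinalityCache
    {
        // Option 1: cache only the ~5 KB serialized array and rebuild the
        // object on every merge; cheap to hold, but allocates a fresh
        // HyperLogLogPlus per call.
        static ICardinality rebuild(byte[] cachedArray) throws IOException
        {
            return HyperLogLogPlus.Builder.build(cachedArray);
        }

        // Option 2: cache the live ICardinality (~11-12 KB each); no
        // per-read allocation, but 10k SSTables pin ~100 MB of heap.
        final Map<String, ICardinality> live = new ConcurrentHashMap<>();
    }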
>
> On the other hand, HllSketch of Datasketches, array included, adds minimal
> overhead: if the array has ~5000 bytes, the whole object is around 5500.
> You get the idea ...
>
> If we are still OK with these sizes, sure ... I am just being transparent
> about the consequences here.
>
> A user would just opt in to this (by default it would be turned off).
>
> On the other hand, if we have 10k SSTables, reading that 10+ KB from disk
> takes around 2-3 ms per SSTable, so we would spend 20-30 seconds reading
> the disk every time we hit that metric (and we haven't even started to
> merge the logs).
>
> If this is still not something that would sell Datasketches as a viable
> alternative, then I guess we need to stick with these numbers and cache it
> all with Clearspring, occupying far more memory.
>
> On Thu, Jan 2, 2025 at 10:15 PM Benedict  wrote:
>
>> I would like to see somebody who has some experience writing data
>> structures, preferably someone we trust as a community to be competent at
>> this (i.e. having some experience contributing at this level within the
>> project), look at the code as if they were at least lightly reviewing the
>> feature as a contribution to this project.
>>
>> This should be the bar for any new library really, but triply so for
>> replacing a library that works fine.
>>
>> On 2 Jan 2025, at 21:02, Štefan Miklošovič 
>> wrote:
>>
>> 
>> Point 2) is pretty hard to fulfil; I cannot imagine what would be "enough"
>> for you to be persuaded. What should concretely happen? Whoever comes and
>> says "yeah, this is a good lib, it works" is probably not going to be
>> enough, given the vague requirements you put under 2). What exactly would
>> you like to see?
>>
>> The way it looks to me, this would just be shut down because of the
>> perceived churn it causes, and there will always be some argument against
>> it.
>>
>> Based on (1) I don't think what 

Re: [DISCUSS] Replacement of SSTable's partition cardinality implementation from stream-lib to Apache Datasketches

2025-03-12 Thread Štefan Miklošovič
Interesting. So, to repeat it back to check I got it right:

current format - serialized Clearspring log
next format - serialized Clearspring log PLUS serialized log from
Datasketches

in case all SSTables are in the legacy format - merge all Clearspring logs
in case some SSTables are in the legacy format and the rest in the new
format - still merge all Clearspring logs
in case all SSTables are in the new format - merge the Datasketches logs

I haven't looked at it this way. I'll play with it.
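
A hypothetical sketch of that selection rule, reusing the mergeClearspring /
mergeDatasketches helpers sketched earlier in this thread (the
SSTableCardinality accessor names are invented):

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.Collection;
    import java.util.List;
    import com.clearspring.analytics.stream.cardinality.CardinalityMergeException;

    interface SSTableCardinality
    {
        boolean isLegacyFormat();
        byte[] clearspringLog();  // present in both formats
        byte[] datasketchesLog(); // present only in the new format
    }

    // Any legacy SSTable forces the Clearspring path, since every SSTable
    // still carries a Clearspring log; otherwise merge only Datasketches.
    static long estimateKeys(Collection<SSTableCardinality> tables)
            throws IOException, CardinalityMergeException
    {
        boolean anyLegacy = tables.stream().anyMatch(SSTableCardinality::isLegacyFormat);
        List<byte[]> logs = new ArrayList<>();
        for (SSTableCardinality t : tables)
            logs.add(anyLegacy ? t.clearspringLog() : t.datasketchesLog());
        return anyLegacy ? mergeClearspring(logs) : mergeDatasketches(logs);
    }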

On Wed, Mar 12, 2025 at 12:55 PM Benedict Elliott Smith 
wrote:

> Hi Stefan,
>
> My reading of this mailing list thread is that they think clearspring is
> junk (probably fair) and so you shouldn’t use it or convert it. I am not
> sure this actually means it cannot be done.
>
> That said, a simpler option might be to produce both sketches until we can
> “upgrade” all of the legacy sstables to the new sketches. This would be
> fine in my book, and probably much simpler.
>
> On 12 Mar 2025, at 11:37, Štefan Miklošovič 
> wrote:
>
> Benedict,
>
> I have reached out to the Datasketches community (1) and asked what they
> think about Clearspring and whether it is convertible to Datasketches, as
> you earlier suggested we might try to convert one to the other.
>
> Based on what they wrote, I do not think that is possible (2); they say
> that Clearspring has "serious error problems" and does not implement
> Google's HLL++ paper correctly, etc.
>
> As I see it, in case we have SSTables in both the old and the new format,
> we might compute keys as in (3). This code would be exercised only as long
> as there are mixed formats. If we upgrade SSTables to the new format, or
> if old SSTables are compacted away into new-format SSTables, we would not
> do it as in (3) anymore.
>
> If we are OK with this, then I would try to spend more time on finishing
> the PR and run some perf tests etc. so we can compare before / after.
>
> How does that sound?
>
> Regards
>
> (1) https://lists.apache.org/thread/4rhbqzqyh1cn0pmbst8som4kvvko8gqp
> (2) https://lists.apache.org/thread/l00yv67wwtztgl5lopdtbw3z9s7fng5b
> (3) https://github.com/apache/cassandra/pull/3767/files#r1989136062
>
> On Fri, Jan 3, 2025 at 1:47 PM Benedict  wrote:
>
>> I’ve had a quick skim of the data sketches library, and it does seem to
>> have made some more efficient decisions in its design than clearspring,
>> appears to maybe support off-heap representations, and has reasonably good
>> documentation about the theoretical properties of the sketches. The chair
>> of the project is a published author on the topic, and the library has
>> newer algorithms for cardinality estimation than HLL.
>>
>> So, honestly, it might not be a bad idea to (carefully) consider a
>> migration, even if the current library isn’t broken for our needs.
>>
>> It would not be high up my priority list for the project, but I would
>> support it if it scratches someone’s itch.
>>
>> On 3 Jan 2025, at 12:16, Štefan Miklošovič 
>> wrote:
>>
>> 
>> Okay ... first problems.
>>
>> These 2000 bytes I mentioned in my response to Chris were indeed correct,
>> but that was with Datasketches and the main parameter for HllSketch
>> (DEFAULT_LG_K) set to 12. When I changed that to 13, to match what we
>> currently have in Cassandra with Clearspring, it doubled the size to
>> ~4000 bytes.
>>
>> When we do not use Datasketches, what Clearspring generates is about
>> ~5000 bytes for the array itself, but that array is wrapped in an
>> ICardinality object of Clearspring, and we need that object in order to
>> merge another ICardinality into it. So we would need to cache this
>> ICardinality object instead of just the array itself. If we don't cache
>> the whole ICardinality, we would then need to do basically what
>> CompactionMetadata.CompactionMetadataSerializer.deserialize is doing,
>> which would allocate a lot / often (ICardinality cardinality =
>> HyperLogLogPlus.Builder.build(that_cached_array)).
>>
>> To avoid the allocations every time we compute, we would just cache that
>> whole ICardinality of Clearspring, but that whole object measures around
>> 11-12 KB. So even 10k SSTables would occupy around 100 MB; 50k SSTables,
>> 500 MB. That is becoming quite a problem.
>>
>> On the other hand, HllSketch of Datasketches, array included, adds
>> minimal overhead: if the array has ~5000 bytes, the whole object is
>> around 5500. You get the idea ...
>>
>> If we are still OK with these sizes, sure ... I am just being transparent
>> about the consequences here.
>>
>> A user would just opt in to this (by default it would be turned off).
>>
>> On the other hand, if we have 10k SSTables, reading that 10+ KB from disk
>> takes around 2-3 ms per SSTable, so we would spend 20-30 seconds reading
>> the disk every time we hit that metric (and we haven't even started to
>> merge the logs).
>>
>> If this is still not something that would sell Datasketches as a viable
>> alternative, then I guess we need to stick with these numbers and cache
>> it all with Clearspring, occupying far more memory.

[RESULT][VOTE][IP CLEARANCE] Cassandra Cluster Manager (CCM)

2025-03-12 Thread Mick Semb Wever
Vote passes with twelve +1s (ten binding).




> On Sun, Mar 9, 2025 at 5:18 AM Mick Semb Wever  wrote:
>>>
 Please vote on the acceptance of the Cassandra Cluster Manager (CCM)
 and its IP Clearance:
 https://incubator.apache.org/ip-clearance/cassandra-ccm.html

 All consent from original authors of the donation, and tracking of
 collected CLAs, is found in:
  - https://github.com/riptano/ccm/issues/773
  -
 https://docs.google.com/spreadsheets/d/1lXDK3c7_-TZh845knVZ8zvJf65x2o03ACqY3pfdXZR8

 These do not require acknowledgement before the vote.

 The code is prepared for donation at https://github.com/riptano/ccm
 (Only `master` and `cassandra-test` refs will be brought over.)

 Once this vote passes we will request ASF Infra to move the
 riptano/ccm as-is to apache/cassandra-ccm. The master branch and the
 cassandra-test tag, with all their history, will be kept. Because
 consent and CLAs were not received from all original authors, the
 NOTICE file keeps additional references to these earlier copyright
 authors.

 PMC members, please check carefully the IP Clearance requirements
 before voting.

 The vote will be open for 72 hours (or longer). Votes by PMC members
 are considered binding. A vote passes if there are at least three
 binding +1s and no -1s.

>>>


Re: Dropwizard/Codahale metrics deprecation in Cassandra server

2025-03-12 Thread Benedict
It sounds like for the original query we have a broad consensus:

1) Deprecate Codahale, but for the next major version publish compatible
metrics
2) After the next release, move to a codahale-like registry that allows us
to be efficient without abusing unsafe, and continue publishing metrics
that implement Codahale interfaces for easy consumption
3) Separately, investigate otel publishing of metrics (and perhaps also
logging and tracing)

Does that sound like a reasonable summary of where things are at?

On 11 Mar 2025, at 23:17, Jon Haddad wrote:

Absolutely, happy to share. All tests were done using easy-cass-stress v9
and easy-cass-lab, with the latest released 5.0 (not including 15452 or
20092). Instructions at the end.

> Regarding allocation rate vs throughput, unfortunately allocation rate
> vs throughput are not connected linearly,

Yes, agreed, they're not linearly related. However, allocation rate does
correlate linearly with GC pause frequency, and does increase GC pause
time. When you increase your write throughput, you put more pressure on
compaction. In order to keep up, you need to increase compaction
throughput. This leads to excess allocation, and the longer pauses. For
teams with a low SLO (say 10ms p99), compaction allocation becomes one of
the factors that prevent them from increasing node density, due to its
effect on GC pause times. Reducing the allocation rate will allow for much
faster compaction with less impact on GC.

> So, while I agree that the mentioned compaction logic (cells
> deserializing) is a subject to improve from an allocation point of view
> I am not sure if we get dramatic improvements in throughput just because
> of reducing it..

I am _quite_ confident that by reducing the total allocation in Cassandra
by almost 50% we will see a _significant_ performance improvement, but
obviously we need hard numbers, not just my gut feelings and unbridled
confidence.

I'll have to dig up the profile. I'm switching between a bunch of tests
and sadly I didn't label all of them; I collected quite a few. The % number
I referenced was from a different load test that I looked up several days
ago earlier in the thread, and I have several hundred of them hanging
around.

Here's the process of setting up the cluster with easy-cass-lab (I have
ecl aliased to easy-cass-lab on my laptop):

mkdir test
cd test
ecl init -i r5d.2xlarge -c 3 -s 1 test
ecl up
ecl use 5.0
cat <<'EOF' >> cassandra.patch.yaml
memtable:
  configurations:
    skiplist:
      class_name: SkipListMemtable
    trie:
      class_name: TrieMemtable
    default:
      inherits: trie
memtable_offheap_space: 8GiB
memtable_allocation_type: offheap_objects
EOF

Then apply these JVM settings to the jvm.options file in the local dir:

### G1 Settings
## Use the Hotspot garbage-first collector.
-XX:+UseG1GC
-XX:+ParallelRefProcEnabled
-XX:MaxTenuringThreshold=2
-XX:G1HeapRegionSize=16m
-XX:+UnlockExperimentalVMOptions
-XX:G1NewSizePercent=50
-Xms30G
-Xmx30G

## Have the JVM do less remembered set work during STW, instead
## preferring concurrent GC. Reduces p99.9 latency.
-XX:G1RSetUpdatingPauseTimePercent=5

## Main G1GC tunable: lowering the pause target will lower throughput and
## vice versa.
## 200ms is the JVM default and lowest viable setting.
## 1000ms increases throughput. Keep it smaller than the timeouts in
## cassandra.yaml.
-XX:MaxGCPauseMillis=200

Then have it update the configs and start the cluster:

ecl uc
ecl start
source env.sh

You can disable compaction on one node:

c0 nodetool disableautocompaction

Connect to the stress instance using the shortcut defined in env.sh:

s0

Running the stress workload is best done with Shenandoah and Java 17 to
avoid long pauses:

sudo update-java-alternatives -s java-1.17.0-openjdk-amd64
export EASY_CASS_STRESS_OPTS="-XX:+UseShenandoahGC"

Here's a workload that's writes only, with very small values:

easy-cass-stress run KeyValue -d 1h --field.keyvalue.value='random(4,8)' --maxwlat 50 --rate 200k -r 0

Let that ramp up for a bit. Then back in your local dir (make sure you
source env.sh first):

cflame cassandra0

It'll take a profile and run for a minute. You can also get an allocation
profile by doing this:

cflame cassandra0 -e alloc

Feel free to ping me directly with questions.

Jon

On Tue, Mar 11, 2025 at 3:20 PM Dmitry Konstantinov wrote:

Jon, thank you for testing! Can you share your CPU profile and test load
details? Have you tested it with CASSANDRA-20092 changes included?

>> Allocations related to codahale were < 1%.

Just to clarify: in the initial mail, by memory footprint I mean the
static amount of memory used to store metric objects, not dynamic
allocation during request processing (that should be almost zero and not a
target to optimize).

>> Once compaction is enabled, it's in the 2-3% realm

What percent of the CPU profile do you see spent on compaction in your
load? (To dilute 7-8% to 2-3% it should be around 50%..., because
compaction does not change the ratio between the total effort spent on
request processing vs the metrics part of it.)
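
To illustrate point 2 of the consensus summary above, here is a minimal,
hypothetical sketch (not existing Cassandra code and not a concrete
proposal) of owning the hot-path counter internally while still publishing
through Codahale interfaces, assuming the com.codahale.metrics API:

    import com.codahale.metrics.Gauge;
    import com.codahale.metrics.MetricRegistry;
    import java.util.concurrent.atomic.LongAdder;

    // Internal counter owned by the database: cheap to update on the hot
    // path, with no Codahale machinery involved in the write.
    final class CheapCounter
    {
        private final LongAdder count = new LongAdder();
        void inc() { count.increment(); }
        long sum() { return count.sum(); }
    }

    public class CodahaleBridge
    {
        public static void main(String[] args)
        {
            MetricRegistry registry = new MetricRegistry();
            CheapCounter writes = new CheapCounter();

            // Existing Codahale consumers (reporters, JMX, etc.) keep
            // working, because what we publish still implements a
            // Codahale interface.
            registry.register("org.apache.cassandra.metrics.writes",
                              (Gauge<Long>) writes::sum);

            writes.inc();
            System.out.println(registry.getGauges()
                                       .get("org.apache.cassandra.metrics.writes")
                                       .getValue());
        }
    }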

Re: [DISCUSS] Replacement of SSTable's partition cardinality implementation from stream-lib to Apache Datasketches

2025-03-12 Thread Benedict Elliott Smith
Hi Stefan,

My reading of this mailing list thread is that they think clearspring is junk 
(probably fair) and so you shouldn’t use it or convert it. I am not sure this 
actually means it cannot be done.

That said, a simpler option might be to produce both sketches until we can 
“upgrade” all of the legacy sstables to the new sketches. This would be fine in 
my book, and probably much simpler.

> On 12 Mar 2025, at 11:37, Štefan Miklošovič  wrote:
> 
> Benedict,
> 
> I have reached out to the Datasketches community (1) and asked what they
> think about Clearspring and whether it is convertible to Datasketches, as
> you earlier suggested we might try to convert one to the other.
>
> Based on what they wrote, I do not think that is possible (2); they say
> that Clearspring has "serious error problems" and does not implement
> Google's HLL++ paper correctly, etc.
> 
> As I see it, in case we have SSTables in both the old and the new format,
> we might compute keys as in (3). This code would be exercised only as long
> as there are mixed formats. If we upgrade SSTables to the new format, or
> if old SSTables are compacted away into new-format SSTables, we would not
> do it as in (3) anymore.
> 
> If we are OK with this, then I would try to spend more time on finishing
> the PR and run some perf tests etc. so we can compare before / after.
> 
> How does that sound?
> 
> Regards
> 
> (1) https://lists.apache.org/thread/4rhbqzqyh1cn0pmbst8som4kvvko8gqp
> (2) https://lists.apache.org/thread/l00yv67wwtztgl5lopdtbw3z9s7fng5b
> (3) https://github.com/apache/cassandra/pull/3767/files#r1989136062
> 
> On Fri, Jan 3, 2025 at 1:47 PM Benedict wrote:
>> I’ve had a quick skim of the data sketches library, and it does seem to have 
>> made some more efficient decisions in its design than clearspring, appears 
>> to maybe support off-heap representations, and has reasonably good 
>> documentation about the theoretical properties of the sketches. The chair of 
>> the project is a published author on the topic, and the library has newer 
>> algorithms for cardinality estimation than HLL.
>> 
>> So, honestly, it might not be a bad idea to (carefully) consider a 
>> migration, even if the current library isn’t broken for our needs.
>> 
>> It would not be high up my priority list for the project, but I would 
>> support it if it scratches someone’s itch.
>> 
>>> On 3 Jan 2025, at 12:16, Štefan Miklošovič wrote:
>>> 
>>> 
>>> Okay ... first problems.
>>> 
>>> These 2000 bytes I mentioned in my response to Chris were indeed correct,
>>> but that was with Datasketches and the main parameter for HllSketch
>>> (DEFAULT_LG_K) set to 12. When I changed that to 13, to match what we
>>> currently have in Cassandra with Clearspring, it doubled the size to
>>> ~4000 bytes.
>>> 
>>> When we do not use Datasketches, what Clearspring generates is about
>>> ~5000 bytes for the array itself, but that array is wrapped in an
>>> ICardinality object of Clearspring, and we need that object in order to
>>> merge another ICardinality into it. So we would need to cache this
>>> ICardinality object instead of just the array itself. If we don't cache
>>> the whole ICardinality, we would then need to do basically what
>>> CompactionMetadata.CompactionMetadataSerializer.deserialize is doing,
>>> which would allocate a lot / often (ICardinality cardinality =
>>> HyperLogLogPlus.Builder.build(that_cached_array)).
>>> 
>>> To avoid the allocations every time we compute, we would just cache that
>>> whole ICardinality of Clearspring, but that whole object measures around
>>> 11-12 KB. So even 10k SSTables would occupy around 100 MB; 50k SSTables,
>>> 500 MB. That is becoming quite a problem.
>>> 
>>> On the other hand, HllSketch of Datasketches, array included, adds
>>> minimal overhead: if the array has ~5000 bytes, the whole object is
>>> around 5500. You get the idea ...
>>> 
>>> If we are still OK with these sizes, sure ... I am just being transparent 
>>> about the consequences here.
>>> 
>>> A user would just opt in to this (by default it would be turned off).
>>> 
>>> On the other hand, if we have 10k SSTables, reading that 10+ KB from disk
>>> takes around 2-3 ms per SSTable, so we would spend 20-30 seconds reading
>>> the disk every time we hit that metric (and we haven't even started to
>>> merge the logs).
>>> 
>>> If this is still not something that would sell Datasketches as a viable
>>> alternative, then I guess we need to stick with these numbers and cache
>>> it all with Clearspring, occupying far more memory.
>>> 
>>> On Thu, Jan 2, 2025 at 10:15 PM Benedict wrote:
 I would like to see somebody who has some experience writing data
 structures, preferably someone we trust as a community to be competent
 at this (i.e. having some experience contributing at this level within
 the project), look at the code as if they were at least

Re: [DISCUSS] Replacement of SSTable's partition cardinality implementation from stream-lib to Apache Datasketches

2025-03-12 Thread Benedict
Basically, but ideally we would stop writing the old sketches once we've
updated all the existing sstables to the new ones. I think ideally this
would be orthogonal to sstable version, so that we can drop new sketches
in place into existing sstables as we are able to produce them, and we can
stop writing the old sketches once we've finished this process. But a
secondary possibility would be to have the new format produce both
versions only until no legacy sstables exist on the replica (or perhaps on
the cluster, depending how we want to handle whole sstable streaming).

On 12 Mar 2025, at 12:06, Štefan Miklošovič wrote:

> Interesting. So, to repeat it back to check I got it right:
>
> current format - serialized Clearspring log
> next format - serialized Clearspring log PLUS serialized log from
> Datasketches
>
> in case all SSTables are in the legacy format - merge all Clearspring logs
> in case some SSTables are in the legacy format and the rest in the new
> format - still merge all Clearspring logs
> in case all SSTables are in the new format - merge the Datasketches logs
>
> I haven't looked at it this way. I'll play with it.
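
A hypothetical sketch of the write-side half of that plan (illustrative
only; the class is invented, and the Clearspring sparse parameter is a
guess, though a precision of 13 matches what is discussed in this thread):

    import java.io.IOException;
    import com.clearspring.analytics.stream.cardinality.HyperLogLogPlus;
    import com.clearspring.analytics.stream.cardinality.ICardinality;
    import org.apache.datasketches.hll.HllSketch;

    // Transitional writer: feed every partition key to both estimators and
    // persist both serialized logs with the sstable metadata; once no
    // legacy sstables remain, stop writing the Clearspring half.
    final class DualCardinalityWriter
    {
        private final ICardinality legacy = new HyperLogLogPlus(13, 25);
        private final HllSketch next = new HllSketch(13);

        void add(String partitionKey)
        {
            legacy.offer(partitionKey); // Clearspring hashes internally
            next.update(partitionKey);  // so does Datasketches
        }

        byte[] legacyLog() throws IOException { return legacy.getBytes(); }
        byte[] nextLog() { return next.toCompactByteArray(); }
    }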