Re: [DISCUSS] Replacement of SSTable's partition cardinality implementation from stream-lib to Apache Datasketches

2025-01-02 Thread Benedict
I would like to see somebody who has some experience writing data structures, preferably someone we trust as a community to be competent at this (i.e. having some experience within the project contributing at this level), look at the code as if they were at least lightly reviewing the feature as a con

Re: [DISCUSS] Replacement of SSTable's partition cardinality implementation from stream-lib to Apache Datasketches

2025-01-02 Thread Štefan Miklošovič
Point 2) is pretty hard to fulfil; I cannot imagine what would be "enough" for you to be persuaded. What should concretely happen? Because whoever comes and says "yeah this is a good lib, it works" is probably not going to be enough given the vague requirements you put under 2). You would like to s

Re: [DISCUSS] Replacement of SSTable's partition cardinality implementation from stream-lib to Apache Datasketches

2025-01-02 Thread Benedict
Your message seemed to be all about the caching proposal, which I have proposed we separate, hence my confusion. To restate my answer to your question, I think that unless the new library actually offers us concrete benefits we can point to that we actually care about then yes it’s a bad idea to inc

Re: [DISCUSS] Replacement of SSTable's partition cardinality implementation from stream-lib to Apache Datasketches

2025-01-02 Thread Dmitry Konstantinov
I think it makes sense to make the options clearer. I would suggest a Google sheet or a table within a JIRA ticket with options and comparison (it looks like the majority of the confusion in this topic is caused by different ways to interpret the suggestion :-) ). I see a table like this: +--

Re: [DISCUSS] Replacement of SSTable's partition cardinality implementation from stream-lib to Apache Datasketches

2025-01-02 Thread Jeff Jirsa
I’m going to type a lot of extra words mostly for people just barely familiar with this part of the codebase, because it may or may not be useful to passive observers (it wasn’t to me, so I’m mostly echo’ing the things I just went and learned this morning): The HLL cardinality is used for bas

Re: [DISCUSS] Replacement of SSTable's partition cardinality implementation from stream-lib to Apache Datasketches

2025-01-02 Thread Štefan Miklošovič
Hi Benedict, you wrote: I am strongly opposed to updating libraries simply for the sake of it. Something like HLL does not need much ongoing maintenance if it works. We’re simply asking for extra work and bugs by switching, and some risk without understanding the quality control for the new libra

[Reviewer required] CASSANDRA-20132 - Add metric and tracing event for scanned purgeable tombstones

2025-01-02 Thread Dmitry Konstantinov
Hi, can somebody help with reviewing https://issues.apache.org/jira/browse/CASSANDRA-20132? When tombstones are expired they become almost invisible from a monitoring point of view: you do not see them in metrics or tracing, except as a latency impact. I have observed such cases in production when co

Re: [DISCUSS] Replacement of SSTable's partition cardinality implementation from stream-lib to Apache Datasketches

2025-01-02 Thread Štefan Miklošovič
-> about 800 live SSTables Well, that would occupy about 1.5 MB of hyperloglogs, each having 2000 bytes. That's peanuts compared with going to the disk 800 times every minute. On Thu, Jan 2, 2025 at 8:18 PM Chris Lohfink wrote: > > Regarding allocation details. The DB host had the following stats at > th
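
A quick check of the arithmetic in the message above, as a minimal sketch. It uses only the two figures quoted there (about 800 live SSTables, roughly 2000 bytes per HyperLogLog); the variable names are illustrative and not taken from the Cassandra codebase, and the 2000-byte size is the poster's estimate rather than a measured value.

```python
# Figures quoted in the thread; the per-sketch size is the poster's estimate.
live_sstables = 800
bytes_per_sketch = 2000

# Memory needed to keep one sketch per live SSTable in memory.
total_bytes = live_sstables * bytes_per_sketch
total_mb = total_bytes / 1_000_000      # decimal megabytes
total_mib = total_bytes / (1024 * 1024)  # binary mebibytes

print(f"{total_bytes} bytes = {total_mb:.2f} MB = {total_mib:.2f} MiB")
```

Under these assumptions the total is 1,600,000 bytes, which matches the "1.5MB" quoted above when read as mebibytes.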

Re: [DISCUSS] Replacement of SSTable's partition cardinality implementation from stream-lib to Apache Datasketches

2025-01-02 Thread Benedict
I’m confused Stefan, in what way do you protest? How is your proposal to cache these collections tied to the topic you started here? This should be a separate proposal, discussed on its own merits independently, should it not? I am not opposed to it happening, only to conflating the two concerns. On

Re: [DISCUSS] Replacement of SSTable's partition cardinality implementation from stream-lib to Apache Datasketches

2025-01-02 Thread Chris Lohfink
> Regarding allocation details. The DB host had the following stats at that time: 5K/sec local reads, 3K/sec local writes, about 800 live SSTables, the profile was collected with duration = 5 minutes. I do not have an allocation rate info for that time period. What was the allocation rate on heap

Re: [DISCUSS] Replacement of SSTable's partition cardinality implementation from stream-lib to Apache Datasketches

2025-01-02 Thread Dmitry Konstantinov
Let me clarify my comment regarding allocation. I am not saying that switching to another implementation will make it better and we need to do it right now :-); any such switch is a subject for pros/cons analysis (and memory allocation I think should be one of the criteria). What I wanted to say: this

Re: [DISCUSS] Replacement of SSTable's partition cardinality implementation from stream-lib to Apache Datasketches

2025-01-02 Thread Benedict
I am strongly opposed to updating libraries simply for the sake of it. Something like HLL does not need much ongoing maintenance if it works. We’re simply asking for extra work and bugs by switching, and some risk without understanding the quality control for the new library project’s releases. That

Re: [DISCUSS] Replacement of SSTable's partition cardinality implementation from stream-lib to Apache Datasketches

2025-01-02 Thread Štefan Miklošovič
Indeed, I plan to measure it and compare, maybe some bench test would be cool to add .. I strongly suspect that the primary reason for the slowness (if it is verified to be true) is us going to the disk every time and reading stats for every SSTable all over again. While datasketches say that it

Re: [DISCUSS] Replacement of SSTable's partition cardinality implementation from stream-lib to Apache Datasketches

2025-01-02 Thread Jon Haddad
Sounds interesting. I took a look at the issue but I'm not seeing any data to back up "expensive". Can this be quantified a bit more? Any time we have a performance-related issue, there should be some data to back it up, even if it seems obvious. Jon On Thu, Jan 2, 2025 at 8:20 AM Štefan Mikloš

[DISCUSS] Replacement of SSTable's partition cardinality implementation from stream-lib to Apache Datasketches

2025-01-02 Thread Štefan Miklošovič
Hello, I just stumbled upon this library we are using for getting estimations of the number of partitions in an SSTable, which are used e.g. in the EstimatedPartitionCount metric. (1) A user reported in (1) that it is an expensive operation. When one looks into what it is doing, it calls SSTableReader.