Some quick thoughts of my own…

=== Performance ===
- I have seen heap dumps with > 1GiB dedicated to metric counters. This patch 
should improve this, while opening up room to cut it much further.
- The performance improvement in relative terms for the metrics being replaced 
is rather dramatic, about 80%. We can also improve this further.
- Cheaper metrics (in terms of both CPU and memory) mean we can readily have 
more of them, exposing finer-grained details. It is hard to overstate the 
value of this.
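
To make the memory point concrete, here is a rough sketch of how many counters 
could share one striped array instead of each owning its own LongAdder. This is 
not the CASSANDRA-20250 patch: the class name PackedCounters, the integer 
counter ids and the stripe-major layout are all assumptions made purely for 
illustration.

```java
import java.util.Objects;
import java.util.concurrent.atomic.AtomicLongArray;

/**
 * Hypothetical sketch: N counters packed into a single shared array.
 * Each counter has one cell per stripe; a thread picks its stripe from its
 * thread id, so concurrent writers mostly touch different cells.
 */
final class PackedCounters {
    private final int counters;
    private final int stripes;
    // Stripe-major layout: cell index = stripe * counters + id
    private final AtomicLongArray cells;

    PackedCounters(int counters, int stripes) {
        if (counters <= 0 || stripes <= 0)
            throw new IllegalArgumentException("counters and stripes must be positive");
        this.counters = counters;
        this.stripes = stripes;
        this.cells = new AtomicLongArray(counters * stripes);
    }

    /** Adds n to counter id, in the calling thread's stripe. */
    void inc(int id, long n) {
        Objects.checkIndex(id, counters);
        int stripe = (int) (Thread.currentThread().getId() % stripes);
        cells.addAndGet(stripe * counters + id, n);
    }

    /** Sums counter id across all stripes (not an atomic snapshot). */
    long sum(int id) {
        Objects.checkIndex(id, counters);
        long total = 0;
        for (int s = 0; s < stripes; s++)
            total += cells.get(s * counters + id);
        return total;
    }
}
```

As with LongAdder.sum(), a read here is a sum over cells rather than an atomic 
snapshot, which is normally acceptable for metrics. The saving in this sketch 
comes from paying for one array and a fixed number of stripes, rather than one 
object graph per counter.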

=== Reporting ===
- We’re already non-standard for our most important metrics, because we had to 
replace the Codahale histogram years ago.
- We can continue implementing the Codahale interfaces, so that exporting 
libraries have minimal work to support us.
- We can probably push patches upstream to a couple of selected libraries we 
consider important.
- I would also support picking a new reporting framework, but I would like us 
to do this with great care to avoid repeating our mistakes. I won’t have 
cycles to implement this myself, so it would be down to others to decide 
whether they are willing to undertake this work.

I think the fallback option for now, however, is to abuse Unsafe to allow us to 
override the implementation details of Codahale metrics. That lets us decouple 
the performance discussion from the deprecation discussion for now, but I think 
we should still have a target of deprecating Codahale/DropWizard for the reasons 
Dmitry outlines, however we decide to do it.

> On 4 Mar 2025, at 21:17, Jon Haddad <j...@rustyrazorblade.com> wrote:
> 
> I've got a few thoughts...
> 
> On the performance side, I took a look at a few CPU profiles from past 
> benchmarks and I'm seeing DropWizard taking ~ 3% of CPU time.  Is there a 
> specific workload you're running where you're seeing it take up a significant 
> % of CPU time?  Could you share some metrics, profile data, or a workload so 
> I can try to reproduce your findings?  In my testing I've found the majority 
> of the overhead from metrics to come from JMX, not DropWizard.
> 
> On the operator side, inventing our own metrics lib risks making it 
> harder to instrument Cassandra.  There are libraries out there that allow you 
> to tap into DropWizard metrics directly.  For example, Sarma Pydipally did a 
> presentation on this last year [1] based on some code I threw together.
> 
> If you're planning on making it easier to instrument C* by supporting sending 
> metrics to the OTel collector [2], then I could see the change being a net 
> win as long as the perf is no worse than the status quo.
> 
> It's hard to know the full extent of what you're planning and the impact, so 
> I'll save any opinions till I know more about the plan.
> 
> Thanks for bringing this up!
> Jon
> 
> [1] 
> https://planetcassandra.org/leaf/apache-cassandra-lunch-62-grafana-dashboard-for-apache-cassandra-business-platform-team/
> [2] https://opentelemetry.io/docs/collector/
> 
> On Tue, Mar 4, 2025 at 12:40 PM Dmitry Konstantinov <netud...@gmail.com> 
> wrote:
>> Hi all,
>> 
>> After a long conversation with Benedict and Maxim in CASSANDRA-20250 
>> <https://issues.apache.org/jira/browse/CASSANDRA-20250> I would like to 
>> raise and discuss a proposal to deprecate Dropwizard/Codahale metrics usage 
>> in the next major release of Cassandra server and drop it in the following 
>> major release.
>> Instead, we would introduce our own Java API and implementation. For the 
>> next major release we still plan to support the Dropwizard/Codahale API by 
>> extending the Codahale implementations, to give potential users of this API 
>> enough time for transition.
>> The proposal does not affect JMX API for metrics, it is only about local 
>> Java API changes within Cassandra server classpath, so it is about the cases 
>> when somebody outside of Cassandra server code relies on Codahale API in 
>> some kind of extensions or agents.
>> 
>> Reasons:
>> 1) Codahale metrics implementation is not very efficient from CPU and memory 
>> usage point of view. In the past we already replaced default Codahale 
>> implementations for Reservoir with our custom one and now in CASSANDRA-20250 
>> <https://issues.apache.org/jira/browse/CASSANDRA-20250> we (Benedict and I) 
>> want to add a more efficient implementation for Counter and Meter logic. So, 
>> in total we do not have so much logic left from the original library (mostly 
>> a MetricRegistry as container for metrics) and the majority of logic is 
>> implemented by ourselves.
>> We use metrics a lot along the read and write paths and they contribute a 
>> visible overhead (for example for plain write load it is about 9-11% 
>> according to async profiler CPU profile), so we want them to be highly 
>> optimized.
>> From memory perspective Counter and Meter are built based on LongAdder and 
>> they are quite heavy for the amounts which we create and use.
>> 
>> 2) Codahale metrics does not provide any way to replace Counter and Meter 
>> implementations. There are no full functional interfaces for these entities 
>> + MetricRegistry has casts/checks to implementations and cannot work with 
>> anything else.
>> I looked through the already reported issues and found the following similar 
>> and unsuccessful attempt to introduce interfaces for metrics: 
>> https://github.com/dropwizard/metrics/issues/2186
>> as well as other older attempts:
>> https://github.com/dropwizard/metrics/issues/252 
>> https://github.com/dropwizard/metrics/issues/264 
>> https://github.com/dropwizard/metrics/issues/703 
>> https://github.com/dropwizard/metrics/pull/487
>> https://github.com/dropwizard/metrics/issues/479
>> https://github.com/dropwizard/metrics/issues/253 
>> 
>> So, requesting extensibility from Codahale metrics does not look 
>> realistic.
>> 
>> 3) It looks like the library is in maintenance mode now, 5.x version is on 
>> hold and many integrations are also not so alive.
>> The main benefit to use Codahale metrics should be a huge amount of 
>> reporters/integrations but if we check carefully the list of reporters 
>> mentioned here: 
>> https://metrics.dropwizard.io/4.2.0/manual/third-party.html#reporters
>> we can see that almost all of them are dead/archived.
>> 
>> 4) In general, exposing other 3rd party libraries as our own public API 
>> frequently creates too many limitations and issues (Guava is another typical 
>> example which I saw previously, it is easy to start but later you struggle 
>> more and more).
>> 
>> Does anyone have any questions or concerns regarding this suggestion?
>> --
>> Dmitry Konstantinov
