I am in favor of the project adopting async-profiler as a library.

My automation is very outdated, so what I am saying may be a legacy thing;
whatever the “new” way is, that is what we should promote. I rely a lot on
the collapsed format and wish to migrate to the JFR format so I can collect
CPU and memory profiles at the same time; it would be great for us to expose
this as a promoted ability (curl cassandra/profile -o result.jfr). One issue
I see with exposing the raw “execute” method is that it ties our API to the
tool's API, so any breaking changes there break our API; I am not against
this, but it is something to consider.
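
To illustrate the coupling concern, here is a rough sketch of a narrower
surface that never accepts a raw command string. Everything in it is
hypothetical (ProfileRequest, startCommand, the enum names); the option
spellings ("start", "event=", "jfr", "file=") reflect my reading of the
async-profiler agent arguments and should be checked against the version we
ship. It also assumes that combining several events is only valid with JFR
output.

```java
import java.util.Locale;
import java.util.Set;
import java.util.StringJoiner;

/** Hypothetical facade: callers choose from fixed options instead of sending raw commands. */
final class ProfileRequest {
    enum Event { CPU, ALLOC, WALL, LOCK }
    enum Format { JFR, COLLAPSED, FLAMEGRAPH }

    private ProfileRequest() {}

    /**
     * Builds the comma-separated string that would be handed to the profiler's
     * execute method. The option spellings are assumptions to verify against
     * the async-profiler version in use.
     */
    static String startCommand(Set<Event> events, Format format, String outputFile) {
        if (events.isEmpty()) {
            throw new IllegalArgumentException("at least one event is required");
        }
        // Assumption: only JFR output can hold more than one event type.
        if (events.size() > 1 && format != Format.JFR) {
            throw new IllegalArgumentException("multiple events are only allowed with JFR output");
        }
        // Options are comma-separated, so a comma in the path could smuggle in extra options.
        if (outputFile.isEmpty() || outputFile.indexOf(',') >= 0) {
            throw new IllegalArgumentException("invalid output file: " + outputFile);
        }
        StringJoiner command = new StringJoiner(",");
        command.add("start");
        for (Event event : events) {
            command.add("event=" + event.name().toLowerCase(Locale.ROOT));
        }
        command.add(format.name().toLowerCase(Locale.ROOT));
        command.add("file=" + outputFile);
        return command.toString();
    }
}
```

With an EnumSet of CPU and ALLOC, JFR output, and /tmp/result.jfr, this
produces "start,event=cpu,event=alloc,jfr,file=/tmp/result.jfr". If the tool
renames an option, only this one method changes and our own API stays stable.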

As Scott has pointed out, there have been stability issues, so we should be 
able to dynamically flag the feature off.
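
A minimal sketch of such a runtime switch, assuming it would be flipped
through whatever mechanism we already use for dynamic settings (JMX, for
example). ProfilerGate and its methods are hypothetical names, not existing
Cassandra code. The second flag only prevents overlapping sessions; it does
not by itself address crashes on successive invocations.

```java
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical gate checked before any profiler call is made. */
final class ProfilerGate {
    // Off by default; an operator turns it on (and off again) at runtime.
    private final AtomicBoolean enabled = new AtomicBoolean(false);
    // Tracks whether a profiling session is currently in progress.
    private final AtomicBoolean running = new AtomicBoolean(false);

    void setEnabled(boolean value) {
        enabled.set(value);
    }

    boolean isEnabled() {
        return enabled.get();
    }

    /** Returns true only if the feature is on and no other session is active. */
    boolean tryStart() {
        if (!enabled.get()) {
            return false;
        }
        return running.compareAndSet(false, true);
    }

    /** Marks the current session as finished so a new one may start. */
    void finish() {
        running.set(false);
    }
}
```

A caller would invoke tryStart() before touching the profiler and finish()
in a finally block afterwards; with the flag off, every request is refused
without loading or calling the native library.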

> On Jun 16, 2025, at 9:26 AM, Jaydeep Chovatia <chovatia.jayd...@gmail.com> 
> wrote:
> 
> >Previous experiences (good or bad)
> I have been using an async-profiler in my project for quite some time to 
> profile the CPU. Additionally, I have wrapped it with an HTTP interface, 
> allowing one to open a browser and view the CPU flame graph in real-time, 
> which further simplifies the process.
> It is integrated as a library, and my preference is to include it as a 
> library, rather than forking processes.
> 
> Jaydeep
> 
> On Sat, Jun 14, 2025 at 8:14 AM Josh McKenzie <jmcken...@apache.org> wrote:
>>> I have seen cases where specific async-profiler/JVM/Cassandra version 
>>> combos (JDK11/4.1-derived source tree) will immediately crash the JVM on 
>>> profile - especially successive profile invocations on the same process
>> This would be a great candidate for testing to ensure that, at least for 
>> provided profiles, this doesn't happen.
>> 
>> On Fri, Jun 13, 2025, at 10:41 PM, C. Scott Andreas wrote:
>>> Supportive of inclusion as well. General preference for invoking as a 
>>> library rather than forking processes.
>>> 
>>> Jon, thanks for the tips on off-CPU profiling - added to my personal cheat 
>>> sheet.
>>> 
>>> I have seen cases where specific async-profiler/JVM/Cassandra version 
>>> combos (JDK11/4.1-derived source tree) will immediately crash the JVM on 
>>> profile - especially successive profile invocations on the same process - 
>>> but have not observed this on JDK21 or trunk-derived source trees. If we 
>>> have user reports of that happening, we’ll need to figure out how to 
>>> reproduce and get to the bottom of it.
>>> 
>>> – Scott
>>> 
>>> > On Jun 13, 2025, at 5:24 PM, Francisco Guerrero <fran...@apache.org>
>>> > wrote:
>>> > 
>>> > Thanks for bringing up this discussion, Doug. I didn't realize that
>>> > async-profiler allows you to bring it in as a dependency. It looks
>>> > pretty neat from what I could tell. I also think bringing this to
>>> > Cassandra as a dependency is a reasonable approach. We need to come up
>>> > with a solid way to expose this via JMX / vtable.
>>> > 
>>> > Best,
>>> > - Francisco
>>> > 
>>> >> On 2025/06/13 21:08:28 Doug Rohrer wrote:
>>> >> The nice thing from what I can tell about using the Java API per [6] 
>>> >> below is that you can literally just get an instance of the profiler and 
>>> >> pass it some commands in the `execute` method… just need to be careful 
>>> >> how much of that surface area we expose. Jon (and others obviously) I’d 
>>> >> love to get your take on how we could make a useful interface to the 
>>> >> async-profiler, maybe exposed via JMX, that doesn’t require someone to 
>>> >> read the entirety of the async-profiler docs and provides some useful 
>>> >> profiles without the rough edges (things like managing temp files so 
>>> >> users don’t have to know the layout of the filesystem C* is running on, 
>>> >> for example, since at least in the Sidecar we’d be executing this on 
>>> >> behalf of a remote user, with all of the constraints that implies).
>>> >> 
>>> >> We can always be more protective in the Sidecar than we are server-side 
>>> >> as well, but it seems like helping operators not do bad things is a good 
>>> >> thing.
>>> >> 
>>> >> Obviously we’d want the ability Cassandra-side to disable this
>>> >> functionality altogether, however we implement it.
>>> >> 
>>> >> Doug
>>> >> 
>>> >>>> On Jun 13, 2025, at 2:38 PM, Jon Haddad <j...@rustyrazorblade.com>
>>> >>>> wrote:
>>> >>> 
>>> >>> I'd be very happy to see async-profiler included with C*.  I've made
>>> >>> extensive use of it in my performance evaluations [1][2], and even
>>> >>> posted a video about it [3] for general Java perf analysis (among
>>> >>> others).  It's part of easy-cass-lab and is easily the most informative
>>> >>> tool I've found for getting to the bottom of anything performance
>>> >>> related.
>>> >>> 
>>> >>> There's probably a good case to be made for including it with the C* 
>>> >>> artifact as well as having it be something you can drop in. I lean 
>>> >>> towards including it all the time, but I haven't run it this way myself 
>>> >>> yet, so there might be some downside I'm unaware of.
>>> >>> 
>>> >>> When you call the asprof executable, it attaches the async-profiler to 
>>> >>> the running jvm using jattach [4].  We could do this as well, if we 
>>> >>> wanted to avoid including it with the release, but I don't know how 
>>> >>> much we really benefit from that.  I've run into issues with it when 
>>> >>> it's unable to detach correctly, then you're unable to reattach it
>>> >>> until after the server is restarted.  On the flip side, I don't know if 
>>> >>> you're able to set up all the same options for arbitrary profiling when 
>>> >>> it's loaded as an agent and turned on/off dynamically.  I think we can, 
>>> >>> based on the integration page [6], but I haven't tried it yet.  It 
>>> >>> would be a bummer if we only had a single mode of profiling available.  
>>> >>> 
>>> >>> The default mode, CPU profiling, is fantastic, but I've also made 
>>> >>> extensive use of allocation profiling [5] to identify perf issues as 
>>> >>> well so having that available is a must, imo. Wall clock / off cpu 
>>> >>> profiling is great for identifying when IO is the root cause, which 
>>> >>> isn't clearly revealed by on-cpu profiling due to the way threads are 
>>> >>> scheduled.  When I look at a system I typically do CPU / Wall / Alloc / 
>>> >>> Off-CPU to be thorough, and the last thing you want to do is have to 
>>> >>> restart between each one.  You can also specify specific Java methods, 
>>> >>> include or exclude frames matching specific regex, and a whole slew of 
>>> >>> other options.  The latest version even supports continuous profiling 
>>> >>> with heatmaps although I haven't tried it yet.  
>>> >>> 
>>> >>> So hopefully the option we go with allows all of that, otherwise the 
>>> >>> limits would impose more of a headache to me as I'd need to remove it 
>>> >>> and continue to bring my own.
>>> >>> 
>>> >>> Under the hood, the async-profiler uses Linux perf events +
>>> >>> asynchronous polling of the Java stack to match them up and generate
>>> >>> its reports.  As a result, it requires certain permissions to run and
>>> >>> get all the details I like.  Specifically these kernel parameters:
>>> >>> 
>>> >>> sudo sysctl kernel.perf_event_paranoid=1
>>> >>> sudo sysctl kernel.kptr_restrict=0
>>> >>> 
>>> >>> You also need to enable some capabilities for off-cpu profiling:
>>> >>> 
>>> >>> sudo find /usr/lib/jvm/ -type f -name 'java' -exec setcap \
>>> >>>   "cap_perfmon,cap_sys_ptrace,cap_syslog=ep" {} \;
>>> >>> 
>>> >>> Then you can do off-cpu with this wild cryptic version (shout out to 
>>> >>> Andrei Pangin for helping me with this [7]):
>>> >>> 
>>> >>> asprof -e kprobe:schedule -i 2 --cstack dwarf -X '*Unsafe.park*' \
>>> >>>   "${@:2}" $PID
>>> >>> 
>>> >>> There are also some subtle issues when it's run in a container, since by
>>> >>> default you don't have access to the perf_event_open syscall.  Just 
>>> >>> something to keep in mind.  This is one of my main grievances with 
>>> >>> container deployments.
>>> >>> 
>>> >>> Indeed Patrick, I am very happy to see this discussion!  Thanks Doug 
>>> >>> for starting the thread.
>>> >>> 
>>> >>> Jon
>>> >>> 
>>> >>> [1] https://issues.apache.org/jira/browse/CASSANDRA-15452
>>> >>> [2] https://issues.apache.org/jira/browse/CASSANDRA-19477
>>> >>> [3] 
>>> >>> https://www.youtube.com/watch?v=yNZtnzjyJRI&t=212s&pp=ygUOYXN5bmMgcHJvZmlsZXI%3D
>>> >>> [4] 
>>> >>> https://github.com/async-profiler/async-profiler/blob/2b556680dc8f5d02c3f26ac119d835dc2381e604/src/jattach/jattach_hotspot.c#L38
>>> >>> [5] https://issues.apache.org/jira/browse/CASSANDRA-20428
>>> >>> [6] 
>>> >>> https://github.com/async-profiler/async-profiler/blob/master/docs/IntegratingAsyncProfiler.md
>>> >>> [7] https://github.com/async-profiler/async-profiler/issues/907
>>> >>> 
>>> >>> 
>>> >>> On Fri, Jun 13, 2025 at 10:18 AM Patrick McFadin <pmcfa...@gmail.com>
>>> >>> wrote:
>>> >>>> The fact o3 used "Bus-factor" as a dimension is just amazing.
>>> >>>> 
>>> >>>> After reading more about the project, the possibilities are pretty 
>>> >>>> interesting. I suspect we'll see this in a Haddad talk soon.
>>> >>>> 
>>> >>>> On Fri, Jun 13, 2025 at 1:57 AM Josh McKenzie <jmcken...@apache.org>
>>> >>>> wrote:
>>> >>>>> I was curious if o3 (model from OpenAI) would be able to do a deep 
>>> >>>>> dive health check on a repo to assist in considering taking it as a 
>>> >>>>> dependency. The results can be found here: 
>>> >>>>> https://chatgpt.com/share/684be703-1d4c-8002-b831-f997f829f4b4
>>> >>>>> 
>>> >>>>> Apparently it can, and can do it quite well. This was a useful time 
>>> >>>>> saver (and honestly did a better job than I usually can in > 10x the 
>>> >>>>> time)
>>> >>>>> 
>>> >>>>> I'm +1 to taking this as a dependency on the lib in core C*. The rest 
>>> >>>>> of the ecosystem can consume it (more easily if we move to a 
>>> >>>>> cassandra-shared regime shared library build as well), and it opens 
>>> >>>>> up some interesting opportunities for us in both how we test core C* 
>>> >>>>> proper and what we expose in tooling.
>>> >>>>> 
>>> >>>>> On Thu, Jun 12, 2025, at 7:36 PM, Paulo Motta wrote:
>>> >>>>>> I'd prefer to avoid calling an external process and use the library 
>>> >>>>>> if possible. Not sure about including it in the project by default, 
>>> >>>>>> but also not against.
>>> >>>>>> 
>>> >>>>>> If there's contention about including it, I wonder if it would make
>>> >>>>>> sense to explore Java's optional module extension[1] to make this
>>> >>>>>> available optionally? I can see this being useful for other
>>> >>>>>> extensions if we haven't explored that option.
>>> >>>>>> 
>>> >>>>>> Then we could have another project cassandra-sidecar-extensions (or 
>>> >>>>>> similar) that would be linked by sidecar/advanced operators to 
>>> >>>>>> enable extended featureset in the main process.
>>> >>>>>> 
>>> >>>>>> 
>>> >>>>>> [1] -
>>> >>>>>> https://openjdk.org/projects/jigsaw/doc/topics/optional.html
>>> >>>>>> 
>>> >>>>>> On Thu, 12 Jun 2025 at 17:57 Doug Rohrer <droh...@apple.com>
>>> >>>>>> wrote:
>>> >>>>>> Hey folks!
>>> >>>>>> 
>>> >>>>>> We're looking into enabling the sidecar to collect async profiles 
>>> >>>>>> from Cassandra and, digging through the async-profiler code and 
>>> >>>>>> usage, it seems like there may be a few different ways to do it. I’m 
>>> >>>>>> curious if other folks have already done this beyond just “run 
>>> >>>>>> asprof with the pid of the Cassandra process”, as I’m a bit hesitant 
>>> >>>>>> to depend on executing an external process from the Sidecar to 
>>> >>>>>> gather the actual profile if we can avoid it.
>>> >>>>>> 
>>> >>>>>> There seem to be some opportunities to integrate the profiler into 
>>> >>>>>> another project (see 
>>> >>>>>> https://github.com/async-profiler/async-profiler/blob/master/docs/IntegratingAsyncProfiler.md#using-java-api)
>>> >>>>>>  but it seems this would end up having to be part of Cassandra, and 
>>> >>>>>> somehow callable via the sidecar (JMX? Some virtual table interface 
>>> >>>>>> where you insert a row to start a profile with the profiler options, 
>>> >>>>>> and it kicks off the profile, dumping the results into the table 
>>> >>>>>> when it’s done?).
>>> >>>>>> 
>>> >>>>>> The benefit in putting this functionality into Cassandra would be 
>>> >>>>>> that other consumers (in-jvm dtests, python dtests, other monitoring 
>>> >>>>>> systems where Sidecar isn’t available, easy-cass-lab) would be able 
>>> >>>>>> to leverage the same interface rather than having to re-invent the 
>>> >>>>>> wheel each time.
>>> >>>>>> 
>>> >>>>>> Drawback is it’s another library, and one with native library 
>>> >>>>>> dependencies, added to the class path and loaded at runtime.
>>> >>>>>> 
>>> >>>>>> Thoughts? Previous experiences (good or bad)?
>>> >>>>>> 
>>> >>>>>> Thanks,
>>> >>>>>> 
>>> >>>>>> Doug
>>> >>>>> 
>>> >> 
>>> >> 
>>> 
>> 
