> Previous experiences (good or bad)

I have been using async-profiler in my project for quite some time to profile the CPU. I have also wrapped it with an HTTP interface, so you can open a browser and view the CPU flame graph in real time, which simplifies the process further. It is integrated as a library, and my preference is to include it as a library rather than forking processes.
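
As a minimal sketch of the library-style approach (hypothetical names, not the actual wrapper): the integration doc referenced in this thread describes getting the profiler with `AsyncProfiler.getInstance()` and passing a command string to `execute`. The helper below only builds such command strings from a small allow-list, so it compiles without async-profiler on the classpath. The class name `ProfilerCommands`, the allow-list, and the file paths are assumptions made for illustration.

```java
import java.util.Set;

/**
 * Hypothetical helper that builds async-profiler command strings
 * (for example "start,event=cpu,file=/tmp/cpu.html") from validated input.
 * With async-profiler on the classpath, the result would be handed to
 * one.profiler.AsyncProfiler.getInstance().execute(command).
 */
final class ProfilerCommands {
    // Events this sketch chooses to expose; async-profiler supports more.
    private static final Set<String> ALLOWED_EVENTS = Set.of("cpu", "alloc", "wall", "lock");

    private ProfilerCommands() {}

    /** Builds a "start" command for one of the allowed events. */
    static String start(String event, String outputFile) {
        if (!ALLOWED_EVENTS.contains(event)) {
            throw new IllegalArgumentException("unsupported event: " + event);
        }
        return "start,event=" + event + ",file=" + checkFile(outputFile);
    }

    /** Builds a "stop" command that writes the profile to the given file. */
    static String stop(String outputFile) {
        return "stop,file=" + checkFile(outputFile);
    }

    // Rejects empty paths and commas, since a comma would start a new option.
    private static String checkFile(String outputFile) {
        if (outputFile == null || outputFile.isEmpty() || outputFile.contains(",")) {
            throw new IllegalArgumentException("invalid output file: " + outputFile);
        }
        return outputFile;
    }
}
```

An HTTP or JMX endpoint could call `ProfilerCommands.start("cpu", path)`, pass the result to the profiler, and later serve the file named in the matching `stop` command. Keeping the allow-list in one place is one way to limit how much of the profiler's surface area is exposed.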

Jaydeep

On Sat, Jun 14, 2025 at 8:14 AM Josh McKenzie <jmcken...@apache.org> wrote:

> I have seen cases where specific async-profiler/JVM/Cassandra version combos (JDK11/4.1-derived source tree) will immediately crash the JVM on profile - especially successive profile invocations on the same process
>
> This would be a great candidate for testing to ensure that, at least for provided profiles, this doesn't happen.
>
> On Fri, Jun 13, 2025, at 10:41 PM, C. Scott Andreas wrote:
>
> Supportive of inclusion as well. General preference for invoking as a library rather than forking processes.
>
> Jon, thanks for the tips on off-CPU profiling - added to my personal cheat sheet.
>
> I have seen cases where specific async-profiler/JVM/Cassandra version combos (JDK11/4.1-derived source tree) will immediately crash the JVM on profile - especially successive profile invocations on the same process - but have not observed this on JDK21 or trunk-derived source trees. If we have user reports of that happening, we’ll need to figure out how to reproduce and get to the bottom of it.
>
> – Scott
>
>
> On Jun 13, 2025, at 5:24 PM, Francisco Guerrero <fran...@apache.org> wrote:
> >
> > Thanks for bringing this discussion Doug. I didn't realize that async-profiler allows you to bring it as a dependency. It looks pretty neat from what I could tell. I also think bringing this to Cassandra as a dependency is a reasonable approach. We need to come up with a solid way to expose this via JMX / vtable.
> >
> > Best,
> > - Francisco
> >
> >> On 2025/06/13 21:08:28 Doug Rohrer wrote:
> >> The nice thing from what I can tell about using the Java API per [6] below is that you can literally just get an instance of the profiler and pass it some commands in the `execute` method… just need to be careful how much of that surface area we expose.
> >> Jon (and others obviously) I’d love to get your take on how we could make a useful interface to the async-profiler, maybe exposed via JMX, that doesn’t require someone to read the entirety of the async-profiler docs and provides some useful profiles without the rough edges (things like managing temp files so users don’t have to know the layout of the filesystem C* is running on, for example, since at least in the Sidecar we’d be executing this on behalf of a remote user, with all of the constraints that implies).
> >>
> >> We can always be more protective in the Sidecar than we are server-side as well, but it seems like helping operators not do bad things is a good thing.
> >>
> >> Obviously we’d want the ability Cassandra-side to disable this functionality altogether however we implement it.
> >>
> >> Doug
> >>
> >>>> On Jun 13, 2025, at 2:38 PM, Jon Haddad <j...@rustyrazorblade.com> wrote:
> >>>
> >>> I'd be very happy to see async-profiler included with C*. I've made extensive use of it in my performance evaluations [1][2], and even posted a video about it [3] for general Java perf analysis (among others). It's part of easy-cass-lab and is easily the most informative tool I've found for getting to the bottom of anything performance related.
> >>>
> >>> There's probably a good case to be made for including it with the C* artifact as well as having it be something you can drop in. I lean towards including it all the time, but I haven't run it this way myself yet, so there might be some downside I'm unaware of.
> >>>
> >>> When you call the asprof executable, it attaches the async-profiler to the running JVM using jattach [4]. We could do this as well, if we wanted to avoid including it with the release, but I don't know how much we really benefit from that. I've run into issues with it when it's unable to detach correctly, then you're unable to reattach it until after the server is restarted.
> >>> On the flip side, I don't know if you're able to set up all the same options for arbitrary profiling when it's loaded as an agent and turned on/off dynamically. I think we can, based on the integration page [6], but I haven't tried it yet. It would be a bummer if we only had a single mode of profiling available.
> >>>
> >>> The default mode, CPU profiling, is fantastic, but I've also made extensive use of allocation profiling [5] to identify perf issues as well so having that available is a must, imo. Wall clock / off-CPU profiling is great for identifying when IO is the root cause, which isn't clearly revealed by on-CPU profiling due to the way threads are scheduled. When I look at a system I typically do CPU / Wall / Alloc / Off-CPU to be thorough, and the last thing you want to do is have to restart between each one. You can also specify specific Java methods, include or exclude frames matching specific regex, and a whole slew of other options. The latest version even supports continuous profiling with heatmaps although I haven't tried it yet.
> >>>
> >>> So hopefully the option we go with allows all of that, otherwise the limits would impose more of a headache to me as I'd need to remove it and continue to bring my own.
> >>>
> >>> Under the hood, the async-profiler uses Linux perf events + asynchronous polling of the Java stack to match them up and generate its reports. As a result, it requires certain permissions to run and get all the details I like.
> >>> Specifically these kernel parameters:
> >>>
> >>> sudo sysctl kernel.perf_event_paranoid=1
> >>> sudo sysctl kernel.kptr_restrict=0
> >>>
> >>> You also need to enable some capabilities for off-CPU profiling:
> >>>
> >>> sudo find /usr/lib/jvm/ -type f -name 'java' -exec setcap "cap_perfmon,cap_sys_ptrace,cap_syslog=ep" {} \;
> >>>
> >>> Then you can do off-CPU with this wild cryptic version (shout out to Andrei Pangin for helping me with this [7]):
> >>>
> >>> asprof -e kprobe:schedule -i 2 --cstack dwarf -X '*Unsafe.park*' "${@:2}" $PID
> >>>
> >>> There are also some subtle issues when it's run in a container, since by default you don't have access to the perf_event_open syscall. Just something to keep in mind. This is one of my main grievances with container deployments.
> >>>
> >>> Indeed Patrick, I am very happy to see this discussion! Thanks Doug for starting the thread.
> >>>
> >>> Jon
> >>>
> >>> [1] https://issues.apache.org/jira/browse/CASSANDRA-15452
> >>> [2] https://issues.apache.org/jira/browse/CASSANDRA-19477
> >>> [3] https://www.youtube.com/watch?v=yNZtnzjyJRI&t=212s&pp=ygUOYXN5bmMgcHJvZmlsZXI%3D
> >>> [4] https://github.com/async-profiler/async-profiler/blob/2b556680dc8f5d02c3f26ac119d835dc2381e604/src/jattach/jattach_hotspot.c#L38
> >>> [5] https://issues.apache.org/jira/browse/CASSANDRA-20428
> >>> [6] https://github.com/async-profiler/async-profiler/blob/master/docs/IntegratingAsyncProfiler.md
> >>> [7] https://github.com/async-profiler/async-profiler/issues/907
> >>>
> >>>
> >>> On Fri, Jun 13, 2025 at 10:18 AM Patrick McFadin <pmcfa...@gmail.com> wrote:
> >>>> The fact o3 used "Bus-factor" as a dimension is just amazing.
> >>>>
> >>>> After reading more about the project, the possibilities are pretty interesting. I suspect we'll see this in a Haddad talk soon.
> >>>>
> >>>> On Fri, Jun 13, 2025 at 1:57 AM Josh McKenzie <jmcken...@apache.org> wrote:
> >>>>> I was curious if o3 (model from OpenAI) would be able to do a deep dive health check on a repo to assist in considering taking it as a dependency. The results can be found here: https://chatgpt.com/share/684be703-1d4c-8002-b831-f997f829f4b4
> >>>>>
> >>>>> Apparently it can, and can do it quite well. This was a useful time saver (and honestly did a better job than I usually can in > 10x the time)
> >>>>>
> >>>>> I'm +1 to taking this as a dependency on the lib in core C*. The rest of the ecosystem can consume it (more easily if we move to a cassandra-shared regime shared library build as well), and it opens up some interesting opportunities for us in both how we test core C* proper and what we expose in tooling.
> >>>>>
> >>>>> On Thu, Jun 12, 2025, at 7:36 PM, Paulo Motta wrote:
> >>>>>> I'd prefer to avoid calling an external process and use the library if possible. Not sure about including it in the project by default, but also not against.
> >>>>>>
> >>>>>> If there's contention about including it, I wonder if it would make sense to explore java's optional module extension[1] to make this available optionally? I can see this being useful for other extensions if we haven't explored that option.
> >>>>>>
> >>>>>> Then we could have another project cassandra-sidecar-extensions (or similar) that would be linked by sidecar/advanced operators to enable extended featureset in the main process.
> >>>>>>
> >>>>>>
> >>>>>> [1] - https://openjdk.org/projects/jigsaw/doc/topics/optional.html
> >>>>>>
> >>>>>> On Thu, 12 Jun 2025 at 17:57 Doug Rohrer <droh...@apple.com> wrote:
> >>>>>> Hey folks!
> >>>>>>
> >>>>>> We're looking into enabling the sidecar to collect async profiles from Cassandra and, digging through the async-profiler code and usage, it seems like there may be a few different ways to do it. I’m curious if other folks have already done this beyond just “run asprof with the pid of the Cassandra process”, as I’m a bit hesitant to depend on executing an external process from the Sidecar to gather the actual profile if we can avoid it.
> >>>>>>
> >>>>>> There seem to be some opportunities to integrate the profiler into another project (see https://github.com/async-profiler/async-profiler/blob/master/docs/IntegratingAsyncProfiler.md#using-java-api) but it seems this would end up having to be part of Cassandra, and somehow callable via the sidecar (JMX? Some virtual table interface where you insert a row to start a profile with the profiler options, and it kicks off the profile, dumping the results into the table when it’s done?).
> >>>>>>
> >>>>>> The benefit in putting this functionality into Cassandra would be that other consumers (in-jvm dtests, python dtests, other monitoring systems where Sidecar isn’t available, easy-cass-lab) would be able to leverage the same interface rather than having to re-invent the wheel each time.
> >>>>>>
> >>>>>> Drawback is it’s another library, and one with native library dependencies, added to the class path and loaded at runtime.
> >>>>>>
> >>>>>> Thoughts? Previous experiences (good or bad)?
> >>>>>>
> >>>>>> Thanks,
> >>>>>>
> >>>>>> Doug