Hi Solr community,

I'm investigating a node running Solr 8.3.1 in cloud mode that appears to have deadlocked. I'm looking for guidance on (a) whether this is a known issue that is already resolved in a later release or whether it needs a bug report, and (b) how to lower the risk of recurrence until it is fixed.

Here is what I've observed:

- strace shows the main process waiting. A spot check of the child processes shows the same, though I have not yet gone through all of the threads (there are over 100).
- The server was idle, apart from the JVM sitting at constant memory usage. Memory, swap, CPU, etc. were neither exhausted nor showing active usage.
- jcmd Thread.print shows some interesting output that suggests a deadlock or some other kind of locking issue.
  - For example, this entry looks unusual to me, because it looks like the thread is trying to lock a null object:

    "Finalizer" #3 daemon prio=8 os_prio=0 cpu=11.11ms elapsed=111111.11s tid=0x0000111111110100 nid=0x1111 in Object.wait() [0x0000111111111000]
       java.lang.Thread.State: WAITING (on object monitor)
        at java.lang.Object.wait(java.base@11.0.7/Native Method)
        - waiting on <no object reference available>
        at java.lang.ref.ReferenceQueue.remove(java.base@11.0.7/ReferenceQueue.java:155)
        - waiting to re-lock in wait() <0x0000000200222220> (a java.lang.ref.ReferenceQueue$Lock)
        at java.lang.ref.ReferenceQueue.remove(java.base@11.0.7/ReferenceQueue.java:176)
        at java.lang.ref.Finalizer$FinalizerThread.run(java.base@11.0.7/Finalizer.java:170)

  - I also see a lot of entries like the one below. Some addresses occur multiple times, and one in particular occurs 31 times. Maybe related?

    "h2sc-1-thread-11" #110 prio=5 os_prio=0 cpu=54.29ms elapsed=111111.11s tid=0x0000111110010100 nid=0x1111 waiting on condition [0x0000111110011000]
       java.lang.Thread.State: WAITING (parking)
        at jdk.internal.misc.Unsafe.park(java.base@11.0.7/Native Method)
        - parking to wait for <0x0000000300333333>

Can anyone tell me whether this is a known issue, or what I could look at next?

Thanks!
Stephen
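
P.S. For anyone who wants to repeat the address tally on their own thread dump, here is a minimal sketch of one way to do it. It is not necessarily how the count of 31 above was produced. It assumes the dump was saved to a text file first (for example with jcmd <pid> Thread.print > threads.txt); the file name and the function names are made up for illustration.

```python
import re
from collections import Counter

# Matches thread-dump lines such as:
#   - parking to wait for  <0x0000000300333333> (a java.util.concurrent...)
PARK_RE = re.compile(r"parking to wait for\s+<(0x[0-9a-fA-F]+)>")


def count_parked_addresses(dump_text):
    """Return a Counter mapping each parked-on address to the number of
    threads in the dump that are parked waiting for it."""
    return Counter(PARK_RE.findall(dump_text))


def print_top(path, limit=10):
    """Print the most frequently parked-on addresses found in a saved dump.

    `path` is a hypothetical file holding the saved thread dump text.
    """
    with open(path, encoding="utf-8", errors="replace") as fh:
        counts = count_parked_addresses(fh.read())
    for address, n in counts.most_common(limit):
        print(f"{n:4d}  {address}")
```

Calling print_top("threads.txt") lists the addresses with the most parked threads first. A high count on a single address only shows that many threads are waiting on the same lock or condition, which is also what an idle thread pool looks like, so it is a starting point for reading the stacks rather than proof of a deadlock.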