Hi SOLR Community,
I'm investigating a node on Solr 8.3.1 running in cloud mode which appears
to have deadlocked. I'd like to understand (a) whether this is a known
issue that is already resolved in a later release or needs a bug report,
and (b) how to lower the risk of recurrence until it is fixed.
Here is what I've observed:
- strace shows the main process waiting. A spot check of child processes
shows the same, though I have not yet gone through all of the threads
(there are over 100).
- The server was otherwise idle, apart from the JVM sitting at constant
memory usage. No resource (memory, swap, CPU, etc.) was exhausted or
showing active usage.
- jcmd Thread.print shows some interesting output which suggests a
deadlock or another type of locking issue.
- For example, this entry looks unusual to me because the thread appears
to be waiting on a null object:
- "Finalizer" #3 daemon prio=8 os_prio=0 cpu=11.11ms elapsed=111111.11s tid=0x0000111111110100 nid=0x1111 in Object.wait() [0x0000111111111000]
     java.lang.Thread.State: WAITING (on object monitor)
      at java.lang.Object.wait([email protected]/Native Method)
      - waiting on <no object reference available>
      at java.lang.ref.ReferenceQueue.remove([email protected]/ReferenceQueue.java:155)
      - waiting to re-lock in wait() <0x0000000200222220> (a java.lang.ref.ReferenceQueue$Lock)
      at java.lang.ref.ReferenceQueue.remove([email protected]/ReferenceQueue.java:176)
      at java.lang.ref.Finalizer$FinalizerThread.run([email protected]/Finalizer.java:170)
- I also see many entries like the one below. Some addresses occur
multiple times, and one in particular occurs 31 times. Could that be
related?
- "h2sc-1-thread-11" #110 prio=5 os_prio=0 cpu=54.29ms elapsed=111111.11s tid=0x0000111110010100 nid=0x1111 waiting on condition [0x0000111110011000]
     java.lang.Thread.State: WAITING (parking)
      at jdk.internal.misc.Unsafe.park([email protected]/Native Method)
      - parking to wait for <0x0000000300333333>
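
In case it helps anyone reproduce the count of repeated addresses, here is a rough sketch of how the tally could be done. It assumes the jcmd output has been saved to a text file with each "parking to wait for" entry on a single line (unwrapped, unlike in this email). The class name ParkTally and the file argument are hypothetical and not part of Solr or the JDK.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: counts how many threads in a saved thread dump
// are parked on each address.
final class ParkTally {
    // Matches lines such as "- parking to wait for <0x0000000300333333>".
    private static final Pattern PARKED =
        Pattern.compile("parking to wait for\\s+<(0x[0-9a-fA-F]+)>");

    // Returns address -> number of "parking to wait for" lines naming it,
    // in order of first appearance.
    static Map<String, Integer> tally(List<String> lines) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String line : lines) {
            Matcher m = PARKED.matcher(line);
            if (m.find()) {
                counts.merge(m.group(1), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java ParkTally.java <thread-dump-file>");
            return;
        }
        // Assumes the dump file is UTF-8 text.
        List<String> lines = Files.readAllLines(Path.of(args[0]));
        // Print the most frequently parked-on addresses first.
        tally(lines).entrySet().stream()
            .sorted(Map.Entry.<String, Integer>comparingByValue().reversed())
            .forEach(e -> System.out.println(e.getValue() + "  " + e.getKey()));
    }
}
```

An address with a high count only shows that many threads are parked on the same object. That can be normal for an idle thread pool, so the count by itself does not prove a deadlock.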
Can anyone tell me whether this is a known issue, or suggest what I should
look at next?
Thanks!
Stephen