Hi SOLR Community,

I'm investigating a node running Solr 8.3.1 in cloud mode that appears
to have deadlocked. I'm trying to find out (a) whether this is a known
issue that is already resolved in a later release or whether it needs a
bug report, and (b) how to lower the risk of recurrence until it is fixed.

Here is what I've observed:

   - strace shows the main process waiting. A spot check of child processes
   shows the same, though I have not yet examined all of the threads (there
   are over 100).
   - The server was otherwise idle, with the JVM sitting at constant memory
   usage. Memory, swap, CPU, and other resources were neither exhausted nor
   showing active usage.
   - jcmd Thread.print shows some interesting output that suggests a
   deadlock or another type of locking issue.
      - For example, this entry looks unusual to me because the thread
      appears to be waiting on a null object:
         - "Finalizer" #3 daemon prio=8 os_prio=0 cpu=11.11ms elapsed=111111.11s tid=0x0000111111110100 nid=0x1111 in Object.wait() [0x0000111111111000]
              java.lang.Thread.State: WAITING (on object monitor)
                 at java.lang.Object.wait(java.base@11.0.7/Native Method)
                 - waiting on <no object reference available>
                 at java.lang.ref.ReferenceQueue.remove(java.base@11.0.7/ReferenceQueue.java:155)
                 - waiting to re-lock in wait() <0x0000000200222220> (a java.lang.ref.ReferenceQueue$Lock)
                 at java.lang.ref.ReferenceQueue.remove(java.base@11.0.7/ReferenceQueue.java:176)
                 at java.lang.ref.Finalizer$FinalizerThread.run(java.base@11.0.7/Finalizer.java:170)
      - I also see many entries like the following. Some addresses occur
      multiple times, and one in particular occurs 31 times. Could this be
      related?
         - "h2sc-1-thread-11" #110 prio=5 os_prio=0 cpu=54.29ms elapsed=111111.11s tid=0x0000111110010100 nid=0x1111 waiting on condition [0x0000111110011000]
              java.lang.Thread.State: WAITING (parking)
                 at jdk.internal.misc.Unsafe.park(java.base@11.0.7/Native Method)
                 - parking to wait for  <0x0000000300333333>
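
For anyone who wants to reproduce the per-address counts, here is a minimal sketch. It assumes the thread dump was saved to a text file first, for example with jcmd <pid> Thread.print > dump.txt. The function name count_wait_addresses and the file argument are only illustrative and are not part of Solr or the JDK. It tallies how many stack-trace lines wait on each lock address.

```python
import re
import sys
from collections import Counter

# Matches dump lines such as:
#   - parking to wait for  <0x0000000300333333>
#   - waiting to re-lock in wait() <0x0000000200222220> (a ...)
# Lines without a hex address (e.g. "<no object reference available>")
# are intentionally not matched.
WAIT_RE = re.compile(
    r"-\s+(?:parking to wait for|waiting to lock|waiting on|"
    r"waiting to re-lock in wait\(\))\s+<(0x[0-9a-fA-F]+)>"
)


def count_wait_addresses(dump_text):
    """Map each lock address to the number of lines waiting on it."""
    return Counter(WAIT_RE.findall(dump_text))


if __name__ == "__main__" and len(sys.argv) > 1:
    # Hypothetical usage: python count_waits.py dump.txt
    with open(sys.argv[1], encoding="utf-8", errors="replace") as fh:
        counts = count_wait_addresses(fh.read())
    # Print the most frequently awaited addresses first.
    for address, n in counts.most_common():
        print(f"{n:4d}  {address}")
```

Sorting by count makes it easy to spot the address that many threads are waiting on, which is how the 31-occurrence address above stood out.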

Can anyone tell me whether this is a known issue, or suggest what I should look at next?

Thanks!
Stephen
