Mathieu Marie created SOLR-15106:
------------------------------------

             Summary: Thread in OverseerTaskProcessor should not "return"
                 Key: SOLR-15106
                 URL: https://issues.apache.org/jira/browse/SOLR-15106
             Project: Solr
          Issue Type: Bug
      Security Level: Public (Default Security Level. Issues are Public)
          Components: SolrCloud
    Affects Versions: 8.6, master (9.0)
            Reporter: Mathieu Marie


I have encountered a scenario where ZK was not accessible for a long time (due 
to a _jute.maxbuffer_ issue, which is unrelated to the rest of this report).
During that time, the ClusterStateUpdater and OC queues from the Overseer got 
filled with 1200+ messages.

Once we restored ZK availability, the ClusterStateUpdater queue got emptied, 
but not the OC one.

The Overseer stopped dequeuing from the OC queue.

After some digging in the code, it seems that a *return* from the overseer 
thread that starts the runners could be the issue.

Code in OverseerTaskProcessor.java 
(https://github.com/apache/lucene-solr/blob/master/solr/core/src/java/org/apache/solr/cloud/OverseerTaskProcessor.java#L357)
The lines of code that immediately follow should also be reviewed carefully, as 
they also return from or interrupt the thread that is responsible for executing 
the runners.
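
To illustrate the suspected failure mode, here is a minimal sketch. It is not 
the Solr code: DispatcherLoopSketch, FlakyQueue, TransientQueueException and 
both drain methods are hypothetical names made up for this example. It only 
shows how a dispatcher loop that returns on a transient error leaves the rest 
of its queue unprocessed, while a loop that keeps going drains it once the 
error clears.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

public class DispatcherLoopSketch {

    /** Hypothetical stand-in for a transient failure of the backing store. */
    static class TransientQueueException extends Exception {
        TransientQueueException(String msg) { super(msg); }
    }

    /** Hypothetical work queue whose first N polls fail, then recover. */
    static class FlakyQueue {
        private final Queue<String> items = new ArrayDeque<>();
        private int failuresLeft;

        FlakyQueue(List<String> initial, int failures) {
            items.addAll(initial);
            failuresLeft = failures;
        }

        String poll() throws TransientQueueException {
            if (failuresLeft > 0) {
                failuresLeft--;
                throw new TransientQueueException("store unavailable");
            }
            return items.poll(); // null once the queue is empty
        }

        int size() { return items.size(); }
    }

    /** Fragile variant: the dispatcher exits on the first error. */
    static List<String> drainReturningOnError(FlakyQueue q) {
        List<String> processed = new ArrayList<>();
        while (true) {
            try {
                String task = q.poll();
                if (task == null) return processed; // queue drained
                processed.add(task);
            } catch (TransientQueueException e) {
                return processed; // loop is gone; remaining tasks are stranded
            }
        }
    }

    /** More robust variant: keep looping, give up only after maxRetries. */
    static List<String> drainRetryingOnError(FlakyQueue q, int maxRetries) {
        List<String> processed = new ArrayList<>();
        int consecutiveFailures = 0;
        while (true) {
            try {
                String task = q.poll();
                if (task == null) return processed; // queue drained
                processed.add(task);
                consecutiveFailures = 0;
            } catch (TransientQueueException e) {
                if (++consecutiveFailures > maxRetries) return processed;
                // a real loop would sleep / back off here before retrying
            }
        }
    }

    public static void main(String[] args) {
        List<String> tasks = List.of("t1", "t2", "t3");

        FlakyQueue a = new FlakyQueue(tasks, 1);
        int doneA = drainReturningOnError(a).size();
        System.out.println("return on error: " + doneA + " processed, " + a.size() + " stranded");

        FlakyQueue b = new FlakyQueue(tasks, 1);
        int doneB = drainRetryingOnError(b, 3).size();
        System.out.println("retry on error:  " + doneB + " processed, " + b.size() + " stranded");
    }
}
```

In this sketch the first variant processes nothing and leaves all three tasks 
queued, which resembles what I observed on the OC queue. Whether retrying is 
the right fix for OverseerTaskProcessor depends on which errors are really 
unrecoverable there.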

Anyhow, if anybody hits the same issue, the quick workaround is to bump the 
overseer instance so that a new overseer is elected on another node.







--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org
