[jira] [Created] (LUCENE-10638) PrimaryNode close waits for replicas to close, but there is no guarantee they ever will

Steven Schlansker (Jira) Fri, 01 Jul 2022 13:08:08 -0700

Steven Schlansker created LUCENE-10638:
------------------------------------------


             Summary: PrimaryNode close waits for replicas to close, but there is no guarantee they ever will
                 Key: LUCENE-10638
                 URL: https://issues.apache.org/jira/browse/LUCENE-10638
             Project: Lucene - Core
          Issue Type: Improvement
          Components: modules/replicator
    Affects Versions: 9.2
            Reporter: Steven Schlansker


We run Lucene Replicator to replicate a single primary to many replicas. In 
production, we have experienced downtime because PrimaryNode.close never 
finished.

For some unknown reason (incorrect exception handling? a replica hung forever? 
a reference counting bug?), the primary's CopyState ref count never reaches 0, 
and so close hangs forever. While we should obviously fix the underlying bug 
that prevents CopyState from being released correctly, in the meantime it is 
quite harmful for PrimaryNode to hang waiting on a condition that may never 
occur. Operational problems could also cause this even without bugs, for 
example a replica that hangs forever.

PrimaryNode.close should offer a way to avoid this situation. One possibility 
is a configurable timeout: give replicas that long to close cleanly, and 
otherwise go forward with closing anyway.
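
To make the timeout idea concrete, here is a minimal sketch of a bounded wait 
on a reference count. It is not Lucene code: the class BoundedRefWait and its 
methods incRef, decRef and awaitReleased are hypothetical names used only for 
illustration, and how PrimaryNode tracks CopyState references internally may 
well differ.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

/** Hypothetical helper (not part of Lucene): counts outstanding references
 *  and lets a closer wait for them for a bounded time only. */
class BoundedRefWait {
  private final ReentrantLock lock = new ReentrantLock();
  private final Condition released = lock.newCondition();
  private int refCount;

  /** Called when a replica starts using a resource, e.g. begins a copy. */
  void incRef() {
    lock.lock();
    try {
      refCount++;
    } finally {
      lock.unlock();
    }
  }

  /** Called when a replica is done; wakes any waiting closer at zero. */
  void decRef() {
    lock.lock();
    try {
      if (refCount <= 0) {
        throw new IllegalStateException("decRef without matching incRef");
      }
      if (--refCount == 0) {
        released.signalAll();
      }
    } finally {
      lock.unlock();
    }
  }

  /** Returns true if every reference was released within the timeout,
   *  false if the wait timed out or the thread was interrupted. */
  boolean awaitReleased(long timeout, TimeUnit unit) {
    long nanos = unit.toNanos(timeout);
    lock.lock();
    try {
      while (refCount > 0) {
        if (nanos <= 0L) {
          return false; // give up; the caller proceeds with close anyway
        }
        try {
          // awaitNanos returns an estimate of the time still remaining
          nanos = released.awaitNanos(nanos);
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt(); // preserve interrupt status
          return false;
        }
      }
      return true;
    } finally {
      lock.unlock();
    }
  }
}
```

Under this sketch, close would call awaitReleased with the configured timeout 
and, on a false return, log a warning and continue shutting down instead of 
blocking indefinitely.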

 

In our case, all replicas must already handle primary failures (e.g. a crash), 
so closing immediately is no more harmful than the other situations we must 
handle anyway. One could argue that replicas must generally expect that a 
primary could disappear at any time for any reason, in which case waiting for 
replicas to close may be unnecessary in the first place.

If we can build consensus around the right approach for a fix here, and 
committers don't have time to implement it themselves, I am happy to assemble 
a PR.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
