Steven Schlansker created LUCENE-10638:
------------------------------------------
Summary: PrimaryNode close waits for replicas to close, but there
is no guarantee they ever will
Key: LUCENE-10638
URL: https://issues.apache.org/jira/browse/LUCENE-10638
Project: Lucene - Core
Issue Type: Improvement
Components: modules/replicator
Affects Versions: 9.2
Reporter: Steven Schlansker
We run Lucene Replicator to replicate a single primary to many replicas. In
production, we have experienced downtime due to PrimaryNode.close never
finishing.
For some unknown reason (incorrect exception handling? a replica hung forever?
a reference counting bug?) the primary's CopyState ref count never reaches 0,
and so close hangs forever. While we should obviously fix the underlying bug
that prevents CopyState from being released correctly, in the meantime it is
quite harmful for PrimaryNode to hang waiting on a condition that may never occur.
There are also operational possibilities that could cause this even without
bugs, for example a replica that hangs forever.
PrimaryNode.close should offer a way to avoid this situation. One
possibility is a timeout: give replicas a configurable amount of time to close
cleanly, and otherwise go forward with closing anyway.
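To make the timeout idea concrete, here is a minimal sketch of a bounded wait on a reference count. It is not the PrimaryNode code: the class TimedCloseSketch and its methods (acquireCopy, releaseCopy, awaitReplicas, close, isClosed) are hypothetical names invented for illustration, and a real change would have to fit the existing close logic.

```java
import java.util.concurrent.TimeUnit;

/** Hypothetical stand-in for a primary that tracks copies still held by replicas. */
class TimedCloseSketch {
  private final Object lock = new Object();
  private int pendingCopies; // guarded by lock; hypothetical ref count
  private boolean closed;    // guarded by lock

  /** A replica starts copying: take a reference. */
  void acquireCopy() {
    synchronized (lock) {
      pendingCopies++;
    }
  }

  /** A replica finished (or failed cleanly): drop the reference and wake any waiter. */
  void releaseCopy() {
    synchronized (lock) {
      pendingCopies--;
      lock.notifyAll();
    }
  }

  /** Waits until no copies are pending or the deadline passes. Returns true if drained. */
  boolean awaitReplicas(long timeout, TimeUnit unit) {
    long deadline = System.nanoTime() + unit.toNanos(timeout);
    synchronized (lock) {
      while (pendingCopies > 0) {
        long remainingNanos = deadline - System.nanoTime();
        if (remainingNanos <= 0) {
          return false; // timed out: some replica never released its copy
        }
        try {
          // Wait at least 1 ms, because Object.wait(0) would block indefinitely.
          lock.wait(Math.max(1L, TimeUnit.NANOSECONDS.toMillis(remainingNanos)));
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt(); // preserve the interrupt for the caller
          return false;
        }
      }
      return true;
    }
  }

  /** Closes regardless of the outcome; the return value reports whether the close was clean. */
  boolean close(long timeout, TimeUnit unit) {
    boolean clean = awaitReplicas(timeout, unit);
    synchronized (lock) {
      closed = true; // resources would be released here even if replicas are still attached
    }
    return clean;
  }

  boolean isClosed() {
    synchronized (lock) {
      return closed;
    }
  }
}
```

In this sketch a caller would invoke something like close(30, TimeUnit.SECONDS) and log a warning when it returns false, so that a leaked reference is still visible without blocking shutdown.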
In our case, all replicas must already handle failures of the primary (e.g. a
crash), so closing immediately is no more harmful than the other situations we
must handle anyway. One could argue that replicas must generally expect that a
primary can disappear at any time for any reason, in which case waiting for
replicas to close may be unnecessary in the first place.
If we can build consensus around the right approach for a fix here, and
committers don't have time to do so themselves, I am happy to assemble a PR.