[ 
https://issues.apache.org/jira/browse/SOLR-13975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Bialecki updated SOLR-13975:
------------------------------------
    Description: 
When a Solr process that hosts replicas of a collection is suspended at the OS 
level (e.g. with {{kill -STOP <pid>}}), a long stall may occur in 
ConcurrentUpdateSolrClient (CUSC), lasting until a socket timeout is reached.

During this stall, updates from the leader are not forwarded to any replica, 
even though the other replicas are still active and can receive updates. If the 
sender uses CUSC (e.g. via {{CloudSolrClient}}), it stalls too, because the 
leader stops processing updates.

This situation is caused by two issues:
* When a process is suspended, its sockets remain open, so there is no 
immediate disconnect as there would be if the process had died; the process 
simply becomes unresponsive. Eventually a socket timeout 
({{distribUpdateSoTimeout}}) is reached, but in the default version of 
{{solr.xml}} it is set to 10 min. During this time all indexing to that shard 
is stuck.
* There are several infinite {{for}} loops in CUSC (e.g. in 
{{blockUntilFinished}}, {{waitForEmptyQueue}} and even in {{request}}) that 
rely on the call either succeeding relatively quickly or throwing an 
exception. In this situation neither happens quickly: the call is stuck 
waiting for the remote end until soTimeout expires.
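
The first point can be reproduced without Solr. The following is a minimal sketch, not Solr code: the class {{SilentPeerDemo}} and its method are hypothetical names, and a listening socket that never accepts or writes stands in for the suspended process. The connection succeeds and stays open, and the reader only gets control back when its own socket timeout fires.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

// Hypothetical demo class: shows that a silent peer is only detected via SO_TIMEOUT.
final class SilentPeerDemo {

    /** Returns roughly how many milliseconds a read blocked before the socket timeout fired. */
    static long blockedMillis(int soTimeoutMs) throws IOException {
        // The listening socket never calls accept() or writes anything. The OS still
        // completes the TCP handshake, so from the client's side the peer looks alive
        // but unresponsive, much like a process stopped with kill -STOP.
        try (ServerSocket silentPeer = new ServerSocket(0, 1, InetAddress.getLoopbackAddress());
             Socket client = new Socket(silentPeer.getInetAddress(), silentPeer.getLocalPort())) {
            client.setSoTimeout(soTimeoutMs);
            long start = System.nanoTime();
            try {
                client.getInputStream().read();
                throw new IllegalStateException("unexpected data or EOF from silent peer");
            } catch (SocketTimeoutException expected) {
                // No disconnect and no data: the timeout is the only way out of the read.
                return (System.nanoTime() - start) / 1_000_000;
            }
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println("read blocked for ~" + blockedMillis(500) + " ms");
    }
}
```

With a 10 min socket timeout (600000 ms) in place of the 500 ms used here, the same read would block for the full ten minutes.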

This issue proposes adding stall-prevention logic that breaks out of these 
infinite loops long before the socket timeout occurs, based on the progress of 
queue processing.
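
One possible shape of such logic is sketched below. This is an illustration under stated assumptions, not the actual patch: {{QueueStallGuard}}, {{recordProcessed}} and {{check}} are hypothetical names, and the stall threshold is an arbitrary parameter. The idea is that a waiting loop calls {{check()}} on every iteration, and the guard throws once the queue has been non-empty with no processed requests for longer than the threshold.

```java
import java.io.IOException;
import java.util.concurrent.TimeUnit;
import java.util.function.IntSupplier;
import java.util.function.LongSupplier;

// Hypothetical helper: detects "queue not empty but nothing processed" for too long.
final class QueueStallGuard {
    private final long maxStallNanos;
    private final IntSupplier queueSize;   // current number of queued requests
    private final LongSupplier nanoClock;  // injectable clock, e.g. System::nanoTime

    private long processed;                // total requests processed so far
    private long lastSeenProcessed = -1;   // value of 'processed' at the previous check
    private boolean stalled;               // true while a stall is being timed
    private long stallStartNanos;

    QueueStallGuard(long maxStallMillis, IntSupplier queueSize, LongSupplier nanoClock) {
        this.maxStallNanos = TimeUnit.MILLISECONDS.toNanos(maxStallMillis);
        this.queueSize = queueSize;
        this.nanoClock = nanoClock;
    }

    /** Called by whatever drains the queue each time a request completes. */
    synchronized void recordProcessed() {
        processed++;
    }

    /** Called from each iteration of a waiting loop; throws if no progress for too long. */
    synchronized void check() throws IOException {
        if (queueSize.getAsInt() == 0 || processed != lastSeenProcessed) {
            // Nothing pending, or progress since the last check: reset the stall timer.
            lastSeenProcessed = processed;
            stalled = false;
            return;
        }
        long now = nanoClock.getAsLong();
        if (!stalled) {
            stalled = true;
            stallStartNanos = now;
        } else if (now - stallStartNanos > maxStallNanos) {
            throw new IOException("Request processing has stalled for "
                + TimeUnit.NANOSECONDS.toMillis(now - stallStartNanos)
                + " ms with " + queueSize.getAsInt() + " queued requests");
        }
    }
}
```

A loop such as {{for (;;) { guard.check(); /* wait briefly for the queue to drain */ }}} then exits with an exception after the configured stall time instead of waiting for soTimeout.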

This is a follow-up to SOLR-13896.

> ConcurrentUpdateSolrClient connection stall prevention
> ------------------------------------------------------
>
>                 Key: SOLR-13975
>                 URL: https://issues.apache.org/jira/browse/SOLR-13975
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public) 
>            Reporter: Andrzej Bialecki
>            Assignee: Andrzej Bialecki
>            Priority: Major


