Sorry, I missed this detail. We are running Solr 8.2.

Thanks

On Tue, Jan 12, 2021 at 16:46, Phill Campbell <sirgilli...@yahoo.com.invalid> wrote:
> Which version of Apache Solr?
>
> > On Jan 12, 2021, at 8:36 AM, Pierre Salagnac <pierre.salag...@gmail.com> wrote:
> >
> > Hello,
> > We had a stuck leader election for a shard.
> >
> > We have collections with 2 shards, and each shard has 5 replicas. We have many
> > collections, but the issue happened for a single shard. Once all host restarts
> > completed, this shard was stuck with one replica in "recovering" state and all
> > the others in "down" state.
> >
> > Here is the state of the shard returned by the CLUSTERSTATUS command.
> >
> > "replicas":{
> >   "core_node3":{
> >     "core":"...._shard1_replica_n1",
> >     "base_url":"https://host1:8983/solr",
> >     "node_name":"host1:8983_solr",
> >     "state":"recovering",
> >     "type":"NRT",
> >     "force_set_state":"false"},
> >   "core_node9":{
> >     "core":"...._shard1_replica_n6",
> >     "base_url":"https://host2:8983/solr",
> >     "node_name":"host2:8983_solr",
> >     "state":"down",
> >     "type":"NRT",
> >     "force_set_state":"false"},
> >   "core_node26":{
> >     "core":"...._shard1_replica_n25",
> >     "base_url":"https://host3:8983/solr",
> >     "node_name":"host3:8983_solr",
> >     "state":"down",
> >     "type":"NRT",
> >     "force_set_state":"false"},
> >   "core_node28":{
> >     "core":"...._shard1_replica_n27",
> >     "base_url":"https://host4:8983/solr",
> >     "node_name":"host4:8983_solr",
> >     "state":"down",
> >     "type":"NRT",
> >     "force_set_state":"false"},
> >   "core_node34":{
> >     "core":"...._shard1_replica_n33",
> >     "base_url":"https://host5:8983/solr",
> >     "node_name":"host5:8983_solr",
> >     "state":"down",
> >     "type":"NRT",
> >     "force_set_state":"false"}}}
> >
> > The workaround was to shut down server host1, which held the replica stuck in
> > "recovering" state. This unblocked leader election, and the 4 other replicas
> > went active.
> >
> > Here is the first error I found in the logs related to this shard. It happened
> > while shutting down server host3, which was the leader at that time:
> >
> > (updateExecutor-5-thread-33908-processing-x:..._shard1_replica_n25 r:core_node26 null n:... s:shard1) [c:... s:shard1 r:core_node26 x:..._shard1_replica_n25] o.a.s.c.s.i.ConcurrentUpdateHttp2SolrClient Error consuming and closing http response stream. => java.nio.channels.AsynchronousCloseException
> >     at org.eclipse.jetty.client.util.InputStreamResponseListener$Input.read(InputStreamResponseListener.java:316)
> > java.nio.channels.AsynchronousCloseException: null
> >     at org.eclipse.jetty.client.util.InputStreamResponseListener$Input.read(InputStreamResponseListener.java:316)
> >     at java.io.InputStream.read(InputStream.java:205) ~[?:?]
> >     at org.eclipse.jetty.client.util.InputStreamResponseListener$Input.read(InputStreamResponseListener.java:287)
> >     at org.apache.solr.client.solrj.impl.ConcurrentUpdateHttp2SolrClient$Runner.sendUpdateStream(ConcurrentUpdateHttp2SolrClient.java:283)
> >     at org.apache.solr.client.solrj.impl.ConcurrentUpdateHttp2SolrClient$Runner.run(ConcurrentUpdateHttp2SolrClient.java:176)
> >     at com.codahale.metrics.InstrumentedExecutorService$InstrumentedRunnable.run(InstrumentedExecutorService.java:181)
> >     at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:209)
> >     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[?:?]
> >     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[?:?]
> >     at java.lang.Thread.run(Thread.java:834) [?:?]
> >
> > My understanding is that, following this error, each server restart ended with
> > the replica on that server in "down" state, but I'm not sure how to confirm that.
> > We then entered a loop where the term is increased because of failed replication.
> >
> > Is this a known issue? I found no similar ticket in Jira.
> > Could you please help us get a better understanding of the issue?
> >
> > Thanks
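
In the CLUSTERSTATUS output quoted above, no replica of the shard carries a "leader":"true" entry, which is the visible symptom of the stuck election. In case it helps to spot this condition across many collections, here is a minimal Python sketch that scans an already parsed CLUSTERSTATUS response for shards with no active leader. It assumes the usual response layout (cluster -> collections -> shards -> replicas). The collection name "mycollection" and the function name are hypothetical, and the sample data only mirrors the replica names and states from the report; it is not a real response.

```python
def shards_without_leader(cluster_status):
    """Return (collection, shard, {replica: state}) for every shard that
    has no replica marked as leader and in "active" state."""
    stuck = []
    # Assumed layout of a parsed CLUSTERSTATUS response.
    collections = cluster_status.get("cluster", {}).get("collections", {})
    for coll_name, coll in collections.items():
        for shard_name, shard in coll.get("shards", {}).items():
            replicas = shard.get("replicas", {})
            has_leader = any(
                r.get("leader") == "true" and r.get("state") == "active"
                for r in replicas.values()
            )
            if not has_leader:
                states = {name: r.get("state") for name, r in replicas.items()}
                stuck.append((coll_name, shard_name, states))
    return stuck


# Hypothetical sample: shard1 mirrors the states reported in this thread,
# shard2 is an invented healthy shard for contrast.
sample = {
    "cluster": {
        "collections": {
            "mycollection": {
                "shards": {
                    "shard1": {
                        "replicas": {
                            "core_node3": {"state": "recovering", "type": "NRT"},
                            "core_node9": {"state": "down", "type": "NRT"},
                            "core_node26": {"state": "down", "type": "NRT"},
                            "core_node28": {"state": "down", "type": "NRT"},
                            "core_node34": {"state": "down", "type": "NRT"},
                        }
                    },
                    "shard2": {
                        "replicas": {
                            "core_node5": {
                                "state": "active",
                                "type": "NRT",
                                "leader": "true",
                            },
                            "core_node7": {"state": "active", "type": "NRT"},
                        }
                    },
                }
            }
        }
    }
}

for collection, shard, states in shards_without_leader(sample):
    print(collection, shard, states)
```

The function works on a plain dict, so the response can come from any HTTP client or from a saved JSON file. It only detects the condition and does not explain why the election is blocked.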