Re: leader election stuck after hosts restarts

Pierre Salagnac Tue, 12 Jan 2021 07:48:32 -0800

Sorry I missed this detail.
We are running Solr 8.2.
Thanks

On Tue, Jan 12, 2021 at 16:46, Phill Campbell <sirgilli...@yahoo.com.invalid>
wrote:


> Which version of Apache Solr?
>
> > On Jan 12, 2021, at 8:36 AM, Pierre Salagnac <pierre.salag...@gmail.com>
> wrote:
> >
> > Hello,
> > We had a stuck leader election for a shard.
> >
> > We have collections with 2 shards, and each shard has 5 replicas. We have
> > many collections, but the issue happened for a single shard. Once all host
> > restarts completed, this shard was stuck with one replica in "recovering"
> > state and all others in "down" state.
> >
> > Here is the state of the shard returned by the CLUSTERSTATUS command:
> >      "replicas":{
> >        "core_node3":{
> >          "core":"...._shard1_replica_n1",
> >          "base_url":"https://host1:8983/solr",
> >          "node_name":"host1:8983_solr",
> >          "state":"recovering",
> >          "type":"NRT",
> >          "force_set_state":"false"},
> >        "core_node9":{
> >          "core":"...._shard1_replica_n6",
> >          "base_url":"https://host2:8983/solr",
> >          "node_name":"host2:8983_solr",
> >          "state":"down",
> >          "type":"NRT",
> >          "force_set_state":"false"},
> >        "core_node26":{
> >          "core":"...._shard1_replica_n25",
> >          "base_url":"https://host3:8983/solr",
> >          "node_name":"host3:8983_solr",
> >          "state":"down",
> >          "type":"NRT",
> >          "force_set_state":"false"},
> >        "core_node28":{
> >          "core":"...._shard1_replica_n27",
> >          "base_url":"https://host4:8983/solr",
> >          "node_name":"host4:8983_solr",
> >          "state":"down",
> >          "type":"NRT",
> >          "force_set_state":"false"},
> >        "core_node34":{
> >          "core":"...._shard1_replica_n33",
> >          "base_url":"https://host5:8983/solr",
> >          "node_name":"host5:8983_solr",
> >          "state":"down",
> >          "type":"NRT",
> >          "force_set_state":"false"}}}
> >
> > The workaround was to shut down server host1, which hosted the replica stuck
> > in "recovering" state. This unblocked leader election, and the 4 other
> > replicas went active.
> >
> > Here is the first error I found in the logs related to this shard. It
> > happened while shutting down server host3, which was the leader at that time:
> > (updateExecutor-5-thread-33908-processing-x:..._shard1_replica_n25
> > r:core_node26 null n:... s:shard1) [c:... s:shard1 r:core_node26
> > x:..._shard1_replica_n25] o.a.s.c.s.i.ConcurrentUpdateHttp2SolrClient Error
> > consuming and closing http response stream. =>
> > java.nio.channels.AsynchronousCloseException
> >     at org.eclipse.jetty.client.util.InputStreamResponseListener$Input.read(InputStreamResponseListener.java:316)
> > java.nio.channels.AsynchronousCloseException: null
> >     at org.eclipse.jetty.client.util.InputStreamResponseListener$Input.read(InputStreamResponseListener.java:316)
> >     at java.io.InputStream.read(InputStream.java:205) ~[?:?]
> >     at org.eclipse.jetty.client.util.InputStreamResponseListener$Input.read(InputStreamResponseListener.java:287)
> >     at org.apache.solr.client.solrj.impl.ConcurrentUpdateHttp2SolrClient$Runner.sendUpdateStream(ConcurrentUpdateHttp2SolrClient.java:283)
> >     at org.apache.solr.client.solrj.impl.ConcurrentUpdateHttp2SolrClient$Runner.run(ConcurrentUpdateHttp2SolrClient.java:176)
> >     at com.codahale.metrics.InstrumentedExecutorService$InstrumentedRunnable.run(InstrumentedExecutorService.java:181)
> >     at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:209)
> >     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[?:?]
> >     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[?:?]
> >     at java.lang.Thread.run(Thread.java:834) [?:?]
> >
> > My understanding is that, following this error, each server restart left
> > the replica on that server in "down" state, but I'm not sure how to
> > confirm that.
> > We then entered a loop where the term is increased because of failed
> > replication.
> >
> > Is this a known issue? I found no similar ticket in Jira.
> > Could you please help us get a better understanding of the issue?
> > Thanks
>
>
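
For anyone hitting the same symptom, the per-shard picture in the quoted CLUSTERSTATUS output can be summarized programmatically. This is a minimal sketch and not part of the original report. It assumes the usual CLUSTERSTATUS JSON layout (cluster > collections > shards > replicas, with "leader":"true" set on the leader replica only). The function name, the collection name "mycoll", and the trimmed sample data are hypothetical.

```python
from collections import Counter


def shards_without_leader(cluster_status):
    """Map (collection, shard) -> Counter of replica states, for leaderless shards.

    `cluster_status` is the parsed JSON body of a Collections API
    CLUSTERSTATUS call (layout assumed as described above).
    """
    stuck = {}
    collections = cluster_status.get("cluster", {}).get("collections", {})
    for coll_name, coll in collections.items():
        for shard_name, shard in coll.get("shards", {}).items():
            replicas = list(shard.get("replicas", {}).values())
            # Assumption: the leader carries "leader":"true"; others omit the key.
            has_leader = any(
                str(r.get("leader", "")).lower() == "true" for r in replicas
            )
            if not has_leader:
                stuck[(coll_name, shard_name)] = Counter(
                    r.get("state") for r in replicas
                )
    return stuck


# Hypothetical sample mirroring the replica states quoted in the message.
SAMPLE = {
    "cluster": {
        "collections": {
            "mycoll": {
                "shards": {
                    "shard1": {
                        "replicas": {
                            "core_node3": {"state": "recovering"},
                            "core_node9": {"state": "down"},
                            "core_node26": {"state": "down"},
                            "core_node28": {"state": "down"},
                            "core_node34": {"state": "down"},
                        }
                    }
                }
            }
        }
    }
}

if __name__ == "__main__":
    for (coll, shard), states in shards_without_leader(SAMPLE).items():
        print(coll, shard, dict(states))
```

The input would typically come from /solr/admin/collections?action=CLUSTERSTATUS&wt=json on any live node. A shard that stays in this report long after all hosts are back up matches the situation described in the thread.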

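On the loop where the term keeps increasing, mentioned in the quoted message: as far as I know, SolrCloud 8.x keeps a per-shard term for each replica in ZooKeeper, and only a replica holding the highest term, and not marked as recovering, is eligible to become leader. The sketch below is one way to check for that condition and is not something from the original report. It assumes the terms are stored as a JSON map of core node name to integer under /collections/<collection>/terms/<shard>, and that a "<name>_recovering" key marks a replica in recovery. The function name and the sample term values are hypothetical.

```python
RECOVERING_SUFFIX = "_recovering"  # assumed marker suffix, see the note above


def leader_candidates(terms):
    """Return the core node names that could become leader, sorted by name.

    `terms` is the parsed JSON map read from the shard's terms znode
    (assumed layout). A replica qualifies when it holds the highest term
    and has no matching "<name>_recovering" entry.
    """
    # Ignore the recovering-marker entries when looking for the highest term.
    real = {n: t for n, t in terms.items() if not n.endswith(RECOVERING_SUFFIX)}
    if not real:
        return []
    highest = max(real.values())
    return sorted(
        name
        for name, term in real.items()
        if term == highest and (name + RECOVERING_SUFFIX) not in terms
    )


# Hypothetical values: the only replica with the highest term is recovering,
# so no replica is eligible and the election cannot complete.
SAMPLE_TERMS = {
    "core_node3": 7,
    "core_node3_recovering": 6,
    "core_node9": 6,
    "core_node26": 6,
    "core_node28": 6,
    "core_node34": 6,
}

if __name__ == "__main__":
    print(leader_candidates(SAMPLE_TERMS))
```

An empty result for a shard whose replicas are all up would be consistent with the stuck election described in the thread, though the sample values here are illustrative and were not taken from the affected cluster.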