Solr version is 7.6.0
autoAddReplicas is set to true
/api/cluster/autoscaling returns this:
{
"responseHeader":{
"status":0,
"QTime":1},
"cluster-preferences":[{
"minimize":"cores",
"precision":1}],
"cluster-policy":[{
"replica":"<2",
"shard":"#EACH",
"node":"#ANY"}],
"triggers":{
".auto_add_replicas":{
"name":".auto_add_replicas",
"event":"nodeLost",
"waitFor":1800,
"enabled":true,
"actions":[{
"name":"auto_add_replicas_plan",
"class":"solr.AutoAddReplicasPlanAction"},
{
"name":"execute_plan",
"class":"solr.ExecutePlanAction"}]},
".scheduled_maintenance":{
"name":".scheduled_maintenance",
"event":"scheduled",
"startTime":"NOW",
"every":"+1DAY",
"enabled":true,
"actions":[{
"name":"inactive_shard_plan",
"class":"solr.InactiveShardPlanAction"},
{
"name":"execute_plan",
"class":"solr.ExecutePlanAction"}]}},
"listeners":{
".auto_add_replicas.system":{
"beforeAction":[],
"afterAction":[],
"stage":["STARTED",
"ABORTED",
"SUCCEEDED",
"FAILED",
"BEFORE_ACTION",
"AFTER_ACTION",
"IGNORED"],
"trigger":".auto_add_replicas",
"class":"org.apache.solr.cloud.autoscaling.SystemLogListener"},
".scheduled_maintenance.system":{
"beforeAction":[],
"afterAction":[],
"stage":["STARTED",
"ABORTED",
"SUCCEEDED",
"FAILED",
"BEFORE_ACTION",
"AFTER_ACTION",
"IGNORED"],
"trigger":".scheduled_maintenance",
"class":"org.apache.solr.cloud.autoscaling.SystemLogListener"}},
"properties":{},
"WARNING":"This response format is experimental. It is likely to change in the future."}
I have two Solr clouds that are set up in the same way. When restarting
the nodes, only one of them showed this behavior.
Ideally I want replicas to be moved when a node is down for a longer
time, but not when I just restart it. I would also like all nodes to end
up with the same number of cores.
On 10.02.2019 05:30, Erick Erickson wrote:
What version of Solr? Do you have any of the autoscaling stuff turned
on? What about autoAddReplicas (which does not need Solr 7x)?
On Sat, Feb 9, 2019 at 4:35 PM Hendrik Haddorp <hendrik.hadd...@gmx.net> wrote:
Hi,
I have two Solr clouds using Version 7.6.0 with 4 nodes each and about
500 collections with one shard and a replication factor of 2 per Solr
cloud. The data is stored in HDFS. I restarted the nodes one by one
and always waited for the replicas to fully recover before I restarted
the next. Once the last node was restarted I noticed that Solr was
starting to move replicas to other nodes. Actually it started to move
all replicas from one node, which is now left empty. Is there any way to
figure out why Solr decided to move all replicas to other nodes?
The only problem that I see is that during the recovery the Solr
instance logged a problem with HDFS, claiming that the filesystem is
closed. The recovery seems to have continued just fine after that, though,
and the logs are clean for the time afterwards.
I have now restarted the node and invoked the UTILIZENODE action, which
moved a few replicas back to the node but then failed with this exception:
{
"responseHeader":{
"status":500,
"QTime":40220},
"Operation utilizenode caused exception:":"java.lang.IllegalArgumentException:java.lang.IllegalArgumentException: Comparison method violates its general contract!",
"exception":{
"msg":"Comparison method violates its general contract!",
"rspCode":-1},
"error":{
"metadata":[
"error-class","org.apache.solr.common.SolrException",
"root-error-class","org.apache.solr.common.SolrException"],
"msg":"Comparison method violates its general contract!",
"trace":"org.apache.solr.common.SolrException: Comparison method
violates its general contract!\n\tat
org.apache.solr.client.solrj.SolrResponse.getException(SolrResponse.java:53)\n\tat
org.apache.solr.handler.admin.CollectionsHandler.invokeAction(CollectionsHandler.java:274)\n\tat
org.apache.solr.handler.admin.CollectionsHandler.handleRequestBody(CollectionsHandler.java:246)\n\tat
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:199)\n\tat
org.apache.solr.servlet.HttpSolrCall.handleAdmin(HttpSolrCall.java:734)\n\tat
org.apache.solr.servlet.HttpSolrCall.handleAdminRequest(HttpSolrCall.java:715)\n\tat
org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:496)\n\tat
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:377)\n\tat
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:323)\n\tat
org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1634)\n\tat
org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:533)\n\tat
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:146)\n\tat
org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:548)\n\tat
org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)\n\tat
org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:257)\n\tat
org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:1595)\n\tat
org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:255)\n\tat
org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1317)\n\tat
org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:203)\n\tat
org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:473)\n\tat
org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:1564)\n\tat
org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:201)\n\tat
org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1219)\n\tat
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:144)\n\tat
org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:219)\n\tat
org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:126)\n\tat
org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)\n\tat
org.eclipse.jetty.rewrite.handler.RewriteHandler.handle(RewriteHandler.java:335)\n\tat
org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)\n\tat
org.eclipse.jetty.server.Server.handle(Server.java:531)\n\tat
org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:352)\n\tat
org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:260)\n\tat
org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:281)\n\tat
org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:102)\n\tat
org.eclipse.jetty.io.ChannelEndPoint$2.run(ChannelEndPoint.java:118)\n\tat
org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.runTask(EatWhatYouKill.java:333)\n\tat
org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.doProduce(EatWhatYouKill.java:310)\n\tat
org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.tryProduce(EatWhatYouKill.java:168)\n\tat
org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.run(EatWhatYouKill.java:126)\n\tat
org.eclipse.jetty.util.thread.ReservedThreadExecutor$ReservedThread.run(ReservedThreadExecutor.java:366)\n\tat
org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:762)\n\tat
org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:680)\n\tat
java.lang.Thread.run(Thread.java:748)\n",
"code":500}}
When I invoked it again it moved a few more replicas but then failed in
the same way. The log has this additional exception:
2019-02-10 00:09:00.539 ERROR (OverseerThreadFactory-1268-thread-38-processing-n:agent2:9151_solr) [ ] o.a.s.c.a.c.OverseerCollectionMessageHandler Operation utilizenode failed:java.lang.IllegalArgumentException: Comparison method violates its general contract!
at java.util.TimSort.mergeLo(TimSort.java:777)
at java.util.TimSort.mergeAt(TimSort.java:514)
at java.util.TimSort.mergeCollapse(TimSort.java:439)
at java.util.TimSort.sort(TimSort.java:245)
at java.util.Arrays.sort(Arrays.java:1512)
at java.util.ArrayList.sort(ArrayList.java:1462)
at org.apache.solr.client.solrj.cloud.autoscaling.MoveReplicaSuggester.tryEachNode(MoveReplicaSuggester.java:50)
at org.apache.solr.client.solrj.cloud.autoscaling.MoveReplicaSuggester.init(MoveReplicaSuggester.java:38)
at org.apache.solr.client.solrj.cloud.autoscaling.Suggester.getSuggestion(Suggester.java:187)
at org.apache.solr.cloud.api.collections.UtilizeNodeCmd.call(UtilizeNodeCmd.java:100)
at org.apache.solr.cloud.api.collections.OverseerCollectionMessageHandler.processMessage(OverseerCollectionMessageHandler.java:259)
at org.apache.solr.cloud.OverseerTaskProcessor$Runner.run(OverseerTaskProcessor.java:478)
at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:209)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Not quite sure what it compares but the comparator should be this one:
https://github.com/apache/lucene-solr/blob/master/solr/solrj/src/java/org/apache/solr/client/solrj/cloud/autoscaling/MoveReplicaSuggester.java#L98
I'm not sure whether this can actually happen, but if both replicas are
leaders the comparator's result looks wrong to me.
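To illustrate the concern, here is a minimal sketch. It is not the actual Solr code: the class LeaderLastSketch, the Replica type and both comparators are hypothetical. It assumes a "leader last" comparator that returns 1 as soon as its first argument is a leader. With two leaders, compare(a, b) and compare(b, a) both return 1, which breaks the rule that sgn(compare(a, b)) must equal -sgn(compare(b, a)). TimSort only notices such a violation for some inputs, which might explain why the call moves a few replicas before failing.

```java
import java.util.Comparator;

public class LeaderLastSketch {
    // Hypothetical stand-in for a replica; not a Solr class.
    static final class Replica {
        final String name;
        final boolean leader;

        Replica(String name, boolean leader) {
            this.name = name;
            this.leader = leader;
        }
    }

    // Assumed shape of the problematic comparator: it decides on the first
    // argument alone, so two leaders each sort "after" the other.
    static final Comparator<Replica> BROKEN = (r1, r2) -> {
        if (r1.leader) return 1;
        if (r2.leader) return -1;
        return 0;
    };

    // Contract-respecting alternative: non-leaders (false) sort before leaders (true),
    // and two leaders compare as equal.
    static final Comparator<Replica> FIXED =
            (r1, r2) -> Boolean.compare(r1.leader, r2.leader);

    // Checks sgn(compare(a, b)) == -sgn(compare(b, a)) for one pair.
    static boolean antisymmetric(Comparator<Replica> c, Replica a, Replica b) {
        return Integer.signum(c.compare(a, b)) == -Integer.signum(c.compare(b, a));
    }

    public static void main(String[] args) {
        Replica a = new Replica("a", true);
        Replica b = new Replica("b", true);
        System.out.println("broken antisymmetric: " + antisymmetric(BROKEN, a, b)); // false
        System.out.println("fixed antisymmetric: " + antisymmetric(FIXED, a, b));   // true
    }
}
```

If the real comparator behaves like BROKEN above, any node hosting more than one leader could trigger the exception during sorting.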
Anyhow, my main issue is that I don't see why Solr suddenly decided to
move all replicas off my node.
regards,
Hendrik