What version of Solr? Do you have any of the autoscaling stuff turned on? What about autoAddReplicas (which does not need Solr 7.x)?

On Sat, Feb 9, 2019 at 4:35 PM Hendrik Haddorp <hendrik.hadd...@gmx.net> wrote:
>
> Hi,
>
> I have two Solr clouds using Version 7.6.0 with 4 nodes each and about
> 500 collections with one shard and a replication factor of 2 per Solr
> cloud. The data is stored in the HDFS. I restarted the nodes one by one
> and always waited for the replicas to fully recover before I restarted
> the next. Once the last node was restarted I noticed that Solr was
> starting to move replicas to other nodes. Actually it started to move
> all replicas from one node, which is now left empty. Is there any way to
> figure out why Solr decided to move all replicas to other nodes?
> The only problem that I see is that during the recovery the Solr
> instance logged a problem with the HDFS, claiming that the filesystem is
> closed. The recovery seems to have continued after that just fine though
> and the logs are clean for the time afterwards.
> I restarted the node now and invoked the UTILIZENODE action that moved a
> few replicas back to the node but then failed with this exception:
>
> {
>   "responseHeader":{
>     "status":500,
>     "QTime":40220},
>   "Operation utilizenode caused exception:":"java.lang.IllegalArgumentException:java.lang.IllegalArgumentException: Comparison method violates its general contract!",
>   "exception":{
>     "msg":"Comparison method violates its general contract!",
>     "rspCode":-1},
>   "error":{
>     "metadata":[
>       "error-class","org.apache.solr.common.SolrException",
>       "root-error-class","org.apache.solr.common.SolrException"],
>     "msg":"Comparison method violates its general contract!",
>     "trace":"org.apache.solr.common.SolrException: Comparison method violates its general contract!\n\tat
> org.apache.solr.client.solrj.SolrResponse.getException(SolrResponse.java:53)\n\tat
> org.apache.solr.handler.admin.CollectionsHandler.invokeAction(CollectionsHandler.java:274)\n\tat
> org.apache.solr.handler.admin.CollectionsHandler.handleRequestBody(CollectionsHandler.java:246)\n\tat
> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:199)\n\tat
> org.apache.solr.servlet.HttpSolrCall.handleAdmin(HttpSolrCall.java:734)\n\tat
> org.apache.solr.servlet.HttpSolrCall.handleAdminRequest(HttpSolrCall.java:715)\n\tat
> org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:496)\n\tat
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:377)\n\tat
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:323)\n\tat
> org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1634)\n\tat
> org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:533)\n\tat
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:146)\n\tat
> org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:548)\n\tat
> org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)\n\tat
> org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:257)\n\tat
> org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:1595)\n\tat
> org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:255)\n\tat
> org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1317)\n\tat
> org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:203)\n\tat
> org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:473)\n\tat
> org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:1564)\n\tat
> org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:201)\n\tat
> org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1219)\n\tat
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:144)\n\tat
> org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:219)\n\tat
> org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:126)\n\tat
> org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)\n\tat
> org.eclipse.jetty.rewrite.handler.RewriteHandler.handle(RewriteHandler.java:335)\n\tat
> org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)\n\tat
> org.eclipse.jetty.server.Server.handle(Server.java:531)\n\tat
> org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:352)\n\tat
> org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:260)\n\tat
> org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:281)\n\tat
> org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:102)\n\tat
> org.eclipse.jetty.io.ChannelEndPoint$2.run(ChannelEndPoint.java:118)\n\tat
> org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.runTask(EatWhatYouKill.java:333)\n\tat
> org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.doProduce(EatWhatYouKill.java:310)\n\tat
> org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.tryProduce(EatWhatYouKill.java:168)\n\tat
> org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.run(EatWhatYouKill.java:126)\n\tat
> org.eclipse.jetty.util.thread.ReservedThreadExecutor$ReservedThread.run(ReservedThreadExecutor.java:366)\n\tat
> org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:762)\n\tat
> org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:680)\n\tat
> java.lang.Thread.run(Thread.java:748)\n",
>     "code":500}}
>
> When I invoke it again it moved a few more replicas but then failed in
> the same way again.
> The log has this additional exception:
> 2019-02-10 00:09:00.539 ERROR
> (OverseerThreadFactory-1268-thread-38-processing-n:agent2:9151_solr) [
> ] o.a.s.c.a.c.OverseerCollectionMessageHandler Operation utilizenode
> failed:java.lang.IllegalArgumentException: Comparison method violates
> its general contract!
>     at java.util.TimSort.mergeLo(TimSort.java:777)
>     at java.util.TimSort.mergeAt(TimSort.java:514)
>     at java.util.TimSort.mergeCollapse(TimSort.java:439)
>     at java.util.TimSort.sort(TimSort.java:245)
>     at java.util.Arrays.sort(Arrays.java:1512)
>     at java.util.ArrayList.sort(ArrayList.java:1462)
>     at org.apache.solr.client.solrj.cloud.autoscaling.MoveReplicaSuggester.tryEachNode(MoveReplicaSuggester.java:50)
>     at org.apache.solr.client.solrj.cloud.autoscaling.MoveReplicaSuggester.init(MoveReplicaSuggester.java:38)
>     at org.apache.solr.client.solrj.cloud.autoscaling.Suggester.getSuggestion(Suggester.java:187)
>     at org.apache.solr.cloud.api.collections.UtilizeNodeCmd.call(UtilizeNodeCmd.java:100)
>     at org.apache.solr.cloud.api.collections.OverseerCollectionMessageHandler.processMessage(OverseerCollectionMessageHandler.java:259)
>     at org.apache.solr.cloud.OverseerTaskProcessor$Runner.run(OverseerTaskProcessor.java:478)
>     at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:209)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>     at java.lang.Thread.run(Thread.java:748)
>
> Not quite sure what it compares but the comparator should be this one:
> https://github.com/apache/lucene-solr/blob/master/solr/solrj/src/java/org/apache/solr/client/solrj/cloud/autoscaling/MoveReplicaSuggester.java#L98
> Not sure if it's possible but if both replicas are leaders the result
> looks wrong to me.
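
On the exception itself: "Comparison method violates its general contract!" is what java.util.TimSort throws when it notices that a Comparator gives inconsistent answers, for example when compare(a, b) and compare(b, a) both claim that the same side sorts first. I have not checked what the linked comparator actually does, so the following is only a sketch with made-up names (Replica, BROKEN, FIXED and antisymmetric are hypothetical, not Solr code) of how a "both are leaders" case could break the contract and how a symmetric comparison avoids it:

```java
import java.util.Comparator;

public class LeaderComparatorSketch {
    // Hypothetical stand-in for a replica; this is not a Solr class.
    static final class Replica {
        final String name;
        final boolean leader;
        Replica(String name, boolean leader) {
            this.name = name;
            this.leader = leader;
        }
    }

    // Broken: "leaders sort last", decided by looking at one side at a time.
    // When both are leaders, compare(a, b) and compare(b, a) both return 1.
    static final Comparator<Replica> BROKEN = (a, b) -> {
        if (a.leader) return 1;
        if (b.leader) return -1;
        return a.name.compareTo(b.name);
    };

    // Consistent: compare the leader flags symmetrically, then tie-break by name.
    static final Comparator<Replica> FIXED = (a, b) -> {
        int byLeader = Boolean.compare(a.leader, b.leader);
        return byLeader != 0 ? byLeader : a.name.compareTo(b.name);
    };

    // Checks sgn(compare(a, b)) == -sgn(compare(b, a)) for one pair,
    // which is one part of the Comparator contract.
    static boolean antisymmetric(Comparator<Replica> c, Replica a, Replica b) {
        return Integer.signum(c.compare(a, b)) == -Integer.signum(c.compare(b, a));
    }

    public static void main(String[] args) {
        Replica x = new Replica("core_node1", true);
        Replica y = new Replica("core_node2", true);
        System.out.println("broken antisymmetric: " + antisymmetric(BROKEN, x, y));
        System.out.println("fixed antisymmetric:  " + antisymmetric(FIXED, x, y));
    }
}
```

Note that TimSort only detects such an inconsistency when the data happens to expose it during a merge, so a sort with a broken comparator can succeed on some inputs and fail on others. That may be why the command got through a few replicas before failing, but that is a guess.
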
>
> Anyhow, my main issue is that I don't see why Solr suddenly decided to
> move all replicas of my node.
>
> regards,
> Hendrik