I opened https://issues.apache.org/jira/browse/SOLR-13240 for the exception.

On 10.02.2019 01:35, Hendrik Haddorp wrote:
Hi,

I have two Solr clouds using Version 7.6.0 with 4 nodes each and about 500 collections with one shard and a replication factor of 2 per Solr cloud. The data is stored in the HDFS. I restarted the nodes one by one and always waited for the replicas to fully recover before I restarted the next. Once the last node was restarted I noticed that Solr was starting to move replicas to other nodes. Actually it started to move all replicas from one node, which is now left empty. Is there any way to figure out why Solr decided to move all replicas to other nodes? The only problem that I see is that during the recovery the Solr instance logged a problem with the HDFS, claiming that the filesystem is closed. The recovery seems to have continued after that just fine though and the logs are clean for the time after wards. I restarted the node now and invoked the UTILIZENODE action that moved a few replicas back to the node but then failed with this exception:

{
  "responseHeader":{
    "status":500,
    "QTime":40220},
  "Operation utilizenode caused exception:":"java.lang.IllegalArgumentException:java.lang.IllegalArgumentException: Comparison method violates its general contract!",
  "exception":{
    "msg":"Comparison method violates its general contract!",
    "rspCode":-1},
  "error":{
    "metadata":[
      "error-class","org.apache.solr.common.SolrException",
      "root-error-class","org.apache.solr.common.SolrException"],
    "msg":"Comparison method violates its general contract!",
    "trace":"org.apache.solr.common.SolrException: Comparison method violates its general contract!\n\tat org.apache.solr.client.solrj.SolrResponse.getException(SolrResponse.java:53)\n\tat org.apache.solr.handler.admin.CollectionsHandler.invokeAction(CollectionsHandler.java:274)\n\tat org.apache.solr.handler.admin.CollectionsHandler.handleRequestBody(CollectionsHandler.java:246)\n\tat org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:199)\n\tat org.apache.solr.servlet.HttpSolrCall.handleAdmin(HttpSolrCall.java:734)\n\tat org.apache.solr.servlet.HttpSolrCall.handleAdminRequest(HttpSolrCall.java:715)\n\tat org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:496)\n\tat org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:377)\n\tat org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:323)\n\tat org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1634)\n\tat org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:533)\n\tat org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:146)\n\tat org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:548)\n\tat org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)\n\tat org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:257)\n\tat org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:1595)\n\tat org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:255)\n\tat org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1317)\n\tat org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:203)\n\tat org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:473)\n\tat org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:1564)\n\tat org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:201)\n\tat org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1219)\n\tat org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:144)\n\tat org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:219)\n\tat org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:126)\n\tat org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)\n\tat org.eclipse.jetty.rewrite.handler.RewriteHandler.handle(RewriteHandler.java:335)\n\tat org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)\n\tat org.eclipse.jetty.server.Server.handle(Server.java:531)\n\tat org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:352)\n\tat org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:260)\n\tat org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:281)\n\tat org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:102)\n\tat org.eclipse.jetty.io.ChannelEndPoint$2.run(ChannelEndPoint.java:118)\n\tat org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.runTask(EatWhatYouKill.java:333)\n\tat org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.doProduce(EatWhatYouKill.java:310)\n\tat org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.tryProduce(EatWhatYouKill.java:168)\n\tat org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.run(EatWhatYouKill.java:126)\n\tat org.eclipse.jetty.util.thread.ReservedThreadExecutor$ReservedThread.run(ReservedThreadExecutor.java:366)\n\tat org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:762)\n\tat org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:680)\n\tat java.lang.Thread.run(Thread.java:748)\n",
    "code":500}}

When I invoke it again it moved a few more replicas but then failed in the same way again. The log has this additional exception: 2019-02-10 00:09:00.539 ERROR (OverseerThreadFactory-1268-thread-38-processing-n:agent2:9151_solr) [   ] o.a.s.c.a.c.OverseerCollectionMessageHandler Operation utilizenode failed:java.lang.IllegalArgumentException: Comparison method violates its general contract!
    at java.util.TimSort.mergeLo(TimSort.java:777)
    at java.util.TimSort.mergeAt(TimSort.java:514)
    at java.util.TimSort.mergeCollapse(TimSort.java:439)
    at java.util.TimSort.sort(TimSort.java:245)
    at java.util.Arrays.sort(Arrays.java:1512)
    at java.util.ArrayList.sort(ArrayList.java:1462)
    at org.apache.solr.client.solrj.cloud.autoscaling.MoveReplicaSuggester.tryEachNode(MoveReplicaSuggester.java:50)     at org.apache.solr.client.solrj.cloud.autoscaling.MoveReplicaSuggester.init(MoveReplicaSuggester.java:38)     at org.apache.solr.client.solrj.cloud.autoscaling.Suggester.getSuggestion(Suggester.java:187)     at org.apache.solr.cloud.api.collections.UtilizeNodeCmd.call(UtilizeNodeCmd.java:100)     at org.apache.solr.cloud.api.collections.OverseerCollectionMessageHandler.processMessage(OverseerCollectionMessageHandler.java:259)     at org.apache.solr.cloud.OverseerTaskProcessor$Runner.run(OverseerTaskProcessor.java:478)     at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:209)     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

Not quite sure what it compares but the comparator should be this one: https://github.com/apache/lucene-solr/blob/master/solr/solrj/src/java/org/apache/solr/client/solrj/cloud/autoscaling/MoveReplicaSuggester.java#L98 Not sure if it's possible but if both replicas are leaders the result looks wrong to me.

Anyhow, my main issue is that I don't see why Solr suddenly decided to move all replicas of my node.

regards,
Hendrik

Reply via email to