To be more specific: I currently have 19 collections, and each node holds exactly one replica of each collection. A newly added node automatically creates a replica on itself for each existing collection (see the set-cluster-policy in my original message below). So when a node is removed, all 19 of its replicas need to be removed as well. This can't be done in one go, because the thread pool handling the parallel synchronous deletions has a fixed size of 10 and does not scale up when necessary.
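As a stopgap, I'm considering simply re-issuing DELETENODE until the node is clean, since each retry only has to delete the replicas left over from the previous attempt. A minimal sketch of that idea (SOLR and NODE are placeholder values for my setup, and grepping CLUSTERSTATUS is just a crude way to check whether any replica still references the lost node):

    #!/usr/bin/env bash
    # Sketch: retry DELETENODE until CLUSTERSTATUS no longer mentions the
    # lost node. With 19 replicas and a pool of 10, this should finish
    # after the second round.
    SOLR="http://localhost:8983/solr"   # placeholder base URL
    NODE="10.0.0.5:8983_solr"           # placeholder node name

    while curl -s "${SOLR}/admin/collections?action=CLUSTERSTATUS" | grep -q "${NODE}"; do
      curl -s "${SOLR}/admin/collections?action=DELETENODE&node=${NODE}"
      sleep 5   # give the overseer time to process the current batch
    done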
On Fri, 29 Mar 2019 at 14:20, Roger Lehmann <roger.lehm...@offerista.com> wrote:

> Situation
>
> I'm currently trying to set up SolrCloud in an AWS Autoscaling Group, so
> that it can scale dynamically.
>
> I've also added the following policy and triggers to Solr, so that each
> node will have 1 (and only one) replica of each collection:
>
> {
>   "set-cluster-policy": [
>     {"replica": "<2", "shard": "#EACH", "node": "#EACH"}
>   ],
>   "set-trigger": [{
>     "name": "node_added_trigger",
>     "event": "nodeAdded",
>     "waitFor": "5s",
>     "preferredOperation": "ADDREPLICA"
>   }, {
>     "name": "node_lost_trigger",
>     "event": "nodeLost",
>     "waitFor": "120s",
>     "preferredOperation": "DELETENODE"
>   }]
> }
>
> This works pretty well. But my problem is that when a node gets removed,
> it doesn't remove all 19 replicas from this node, and I have problems
> when accessing the "nodes" page:
>
> (screenshot: https://i.stack.imgur.com/QyJrY.png)
>
> In the logs, this exception occurs:
>
> Operation deletenode failed:java.util.concurrent.RejectedExecutionException: Task
>     org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor$$Lambda$45/1104948431@467049e2
>     rejected from org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor@773563df
>     [Running, pool size = 10, active threads = 10, queued tasks = 0, completed tasks = 1]
>   at java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2063)
>   at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:830)
>   at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1379)
>   at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.execute(ExecutorUtil.java:194)
>   at java.util.concurrent.AbstractExecutorService.submit(AbstractExecutorService.java:134)
>   at org.apache.solr.cloud.api.collections.DeleteReplicaCmd.deleteCore(DeleteReplicaCmd.java:276)
>   at org.apache.solr.cloud.api.collections.DeleteReplicaCmd.deleteReplica(DeleteReplicaCmd.java:95)
>   at org.apache.solr.cloud.api.collections.DeleteNodeCmd.cleanupReplicas(DeleteNodeCmd.java:109)
>   at org.apache.solr.cloud.api.collections.DeleteNodeCmd.call(DeleteNodeCmd.java:62)
>   at org.apache.solr.cloud.api.collections.OverseerCollectionMessageHandler.processMessage(OverseerCollectionMessageHandler.java:292)
>   at org.apache.solr.cloud.OverseerTaskProcessor$Runner.run(OverseerTaskProcessor.java:496)
>   at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:209)
>   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
>
> Problem description
>
> So the problem is that the executor only has a pool size of 10, all 10
> threads are busy, and nothing gets queued (synchronous execution). In
> fact, it really only removed 10 replicas, and the other 9 replicas stayed
> there. When I manually send the API command to delete this node, it works
> fine, since Solr only needs to remove the remaining 9 replicas and
> everything is good again.
>
> Question
>
> How can I either increase this (small) thread pool size and/or activate
> queueing of the remaining deletion tasks? Another solution might be to
> retry the failed task until it succeeds.
>
> Using Solr 7.7.1 on Ubuntu Server, installed with the installation script
> from Solr (so I guess it's using Jetty?).
>
> Thanks for your help!
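One more thought on the scripting side: the Collections API's async parameter runs DELETENODE as a tracked background request that can be polled with REQUESTSTATUS, which makes a retry loop cleaner to write. Judging by the stack trace, I assume it won't raise the internal pool size of 10, so a failed run would still need to be re-submitted. A rough sketch, again with placeholder host and node name:

    # Submit DELETENODE asynchronously under a caller-chosen request id ...
    curl -s "http://localhost:8983/solr/admin/collections?action=DELETENODE&node=10.0.0.5:8983_solr&async=delete-node-1"

    # ... then poll until the state is "completed" (or "failed", in which
    # case re-submit under a new request id).
    curl -s "http://localhost:8983/solr/admin/collections?action=REQUESTSTATUS&requestid=delete-node-1"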