Hi all, here's my situation.
We run Solr with ZooKeeper in AWS. When I try to spin up additional Solr boxes from an "auto scaling group", I get the failure shown further down. The code used is exactly the same code that successfully spun up the first 3 or 4 Solr boxes in each auto scaling group.

Below is a copy of my email to some colleagues within the company who also use Solr/ZooKeeper. I'm looking for any advice on what _might_ be the cause of this failure. Overload on ZooKeeper in some way is our best guess. I know this isn't a ZooKeeper forum; I'm just hoping someone out there has experience troubleshooting similar issues. Many thanks in advance.

=====

We have 6 ZooKeeper nodes (3 of them are observers). They are not behind a load balancer. How do I check whether the ZooKeeper nodes are under heavy load?

The problem arises when we try to scale up with more Solr nodes. In the current setup we have 160 nodes connected to ZooKeeper, each with 40 cores, so around 6400 cores. When we scale up, 40 to 80 Solr nodes spin up at one time.
We are getting errors like the following, which stop the index distribution process:

    2017-06-05 20:06:34.357 ERROR [pool-3-thread-2] o.a.s.c.CoreContainer - Error creating core [p44_b1_s37]: Could not get shard id for core: p44_b1_s37
    org.apache.solr.common.SolrException: Could not get shard id for core: p44_b1_s37
        at org.apache.solr.cloud.ZkController.waitForShardId(ZkController.java:1496)
        at org.apache.solr.cloud.ZkController.doGetShardIdAndNodeNameProcess(ZkController.java:1438)
        at org.apache.solr.cloud.ZkController.preRegister(ZkController.java:1548)
        at org.apache.solr.core.CoreContainer.create(CoreContainer.java:815)
        at org.apache.solr.core.CoreContainer.create(CoreContainer.java:757)
        at com.ancestry.solr.servlet.AcomServlet.indexTransfer(AcomServlet.java:319)
        at com.ancestry.solr.servlet.AcomServlet.lambda$indexTransferStart$1(AcomServlet.java:303)
        at com.ancestry.solr.service.IndexTransferWorker.run(IndexTransferWorker.java:78)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)

We suspect this has to do with ZooKeeper not responding fast enough.
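
On the question above of how to check whether the ZooKeeper nodes are under heavy load: one option that might help is ZooKeeper's `mntr` four-letter command, which reports per-server counters such as `zk_avg_latency`, `zk_outstanding_requests` and `zk_num_alive_connections`. The sketch below is only a rough starting point, not something we have run against this ensemble. It assumes the client port is 2181, that the servers run ZooKeeper 3.4.0 or later, and that `mntr` is enabled (newer releases require it to be whitelisted). The function names and the host names in the usage comment are made up for illustration.

```python
import socket


def query_four_letter(host, port=2181, command=b"mntr", timeout=5.0):
    """Send a ZooKeeper four-letter command and return the raw text reply.

    Port 2181 is an assumption (the usual client port); adjust as needed.
    """
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(command)
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # the server closes the connection after replying
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")


def parse_mntr(text):
    """Parse tab-separated `mntr` output into a dict.

    Values that look numeric are converted to int or float; anything
    else (for example the version string) is kept as text.
    """
    stats = {}
    for line in text.splitlines():
        key, sep, value = line.partition("\t")
        if not sep:
            continue  # skip lines that are not key<TAB>value
        value = value.strip()
        for convert in (int, float):
            try:
                value = convert(value)
                break
            except ValueError:
                continue
        stats[key.strip()] = value
    return stats


def load_summary(stats):
    """Pick out the counters most relevant to a 'is it overloaded?' check."""
    keys = (
        "zk_server_state",
        "zk_avg_latency",
        "zk_max_latency",
        "zk_outstanding_requests",
        "zk_num_alive_connections",
        "zk_watch_count",
        "zk_znode_count",
    )
    return {key: stats.get(key) for key in keys}


# Example usage (host names are hypothetical):
#   for host in ["zk1.example.internal", "zk2.example.internal"]:
#       print(host, load_summary(parse_mntr(query_four_letter(host))))
```

Sampling these counters on every ZooKeeper node before and during a scale-up would show whether latency or outstanding requests climb when the 40 to 80 new Solr nodes connect. What counts as "too high" depends on the ensemble, so comparing against a quiet-period baseline is probably more useful than any fixed threshold.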