John:

First place I'd look is the ZooKeeper Overseer queue. Prior to 6.6
there were some inefficiencies in how those messages were processed
and that queue would get very, very large when lots of replicas came
up all at once, and that would gum up the works. See: SOLR-10524.

The quick check would be to bring up your nodes a few at a time and
monitor the Overseer work queue(s) in ZK. Bring up, say, 5 nodes, wait
for the Overseer queue to settle down, bring up 5 more. Rinse, repeat.
If that lets you bring everything up and index normally, the Overseer
queue backing up was probably the issue.
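
A rough sketch of that batch-and-wait loop, in Python, in case it helps.
Everything here is an assumption on my part rather than anything Solr
ships: the threshold of 10 queued messages, the poll interval, the
timeout, and the helper names (batches, wait_until_settled,
rolling_start, zk_queue_depth) are all made up for illustration. The
queue paths /overseer/queue and /overseer/queue-work are where I'd
expect the Overseer queues to live, but check your own ZK tree (and any
chroot). zk_queue_depth assumes a started kazoo KazooClient; the rest is
plain standard library.

```python
import time
from typing import Callable, List, Sequence

# Assumed Overseer queue znodes; verify against your own ZK tree/chroot.
OVERSEER_QUEUES = ("/overseer/queue", "/overseer/queue-work")


def batches(items: Sequence[str], size: int) -> List[List[str]]:
    """Split items into consecutive groups of at most `size`."""
    return [list(items[i:i + size]) for i in range(0, len(items), size)]


def wait_until_settled(queue_depth: Callable[[], int],
                       threshold: int = 10,
                       poll_seconds: float = 5.0,
                       timeout_seconds: float = 600.0,
                       sleep: Callable[[float], None] = time.sleep) -> bool:
    """Poll queue_depth() until it is <= threshold. False means timed out."""
    waited = 0.0
    while True:
        if queue_depth() <= threshold:
            return True
        if waited >= timeout_seconds:
            return False
        sleep(poll_seconds)
        waited += poll_seconds


def rolling_start(nodes: Sequence[str],
                  start_node: Callable[[str], None],
                  queue_depth: Callable[[], int],
                  batch_size: int = 5,
                  **wait_kwargs) -> int:
    """Start nodes a batch at a time, letting the queue drain in between.

    Stops early if the queue never settles. Returns how many nodes started.
    """
    started = 0
    for group in batches(nodes, batch_size):
        for node in group:
            start_node(node)
            started += 1
        if not wait_until_settled(queue_depth, **wait_kwargs):
            break
    return started


def zk_queue_depth(zk, paths: Sequence[str] = OVERSEER_QUEUES) -> int:
    """Sum the child counts of the queue znodes using a started kazoo client."""
    total = 0
    for path in paths:
        stat = zk.exists(path)  # kazoo returns a ZnodeStat, or None if absent
        if stat is not None:
            total += stat.numChildren
    return total


# Hypothetical wiring (host name and launch() are placeholders):
#   from kazoo.client import KazooClient
#   zk = KazooClient(hosts="zk1.example.com:2181"); zk.start()
#   rolling_start(new_nodes, launch, lambda: zk_queue_depth(zk))
```

Reading the child count from the znode stat instead of listing the
children keeps the check cheap even when the queue is very large.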

I'm purely keying off your statements "The code used is exactly the
same code that successfully spun up the first 3 or 4 solr boxes...."
and "When we scale up, 40 to 80 solr nodes will spin up at one time",
so I may be way off base.

If I'm guessing correctly, then your options are upgrading to Solr
6.6, applying the patch above (and perhaps associated ones), or
bringing up boxes more slowly. I do know of installations with over
100K replicas, so Solr works at your scale.

Best,
Erick

On Fri, Jun 9, 2017 at 11:03 AM, John Bickerstaff
<j...@johnbickerstaff.com> wrote:
> Hi all,
>
> Here's my situation...
>
> In AWS with zookeeper / solr.
>
> When trying to spin up additional Solr boxes from an "auto scaling group" I
> get this failure.
>
> The code used is exactly the same code that successfully spun up the first
> 3 or 4 solr boxes in each "auto scaling group"
>
> Below is a copy of my email to some of my compatriots within the company
> who also use solr/zookeeper....
>
> I'm looking for any advice on what _might_ be the cause of this failure...
> Overload on Zookeeper in some way is our best guess.
>
> I know this isn't a zookeeper forum - - just hoping someone out there has
> some experience troubleshooting similar issues.
>
> Many thanks in advance...
>
> =====
>
> We have 6 zookeepers. (3 of them are observers).
>
> They are not under a load balancer
>
> How do I check if zookeeper nodes are under heavy load?
>
>
> The problem arises when we try to scale up with more solr nodes. Current
> setup we have 160 nodes connected to zookeeper. Each node with 40 cores, so
> around 6400 cores. When we scale up, 40 to 80 solr nodes will spin up at
> one time.
>
> And we are getting errors like these that stop the index distribution
> process:
>
> 2017-06-05 20:06:34.357 ERROR [pool-3-thread-2] o.a.s.c.CoreContainer - Error creating core [p44_b1_s37]: Could not get shard id for core: p44_b1_s37
>
> org.apache.solr.common.SolrException: Could not get shard id for core: p44_b1_s37
>   at org.apache.solr.cloud.ZkController.waitForShardId(ZkController.java:1496)
>   at org.apache.solr.cloud.ZkController.doGetShardIdAndNodeNameProcess(ZkController.java:1438)
>   at org.apache.solr.cloud.ZkController.preRegister(ZkController.java:1548)
>   at org.apache.solr.core.CoreContainer.create(CoreContainer.java:815)
>   at org.apache.solr.core.CoreContainer.create(CoreContainer.java:757)
>   at com.ancestry.solr.servlet.AcomServlet.indexTransfer(AcomServlet.java:319)
>   at com.ancestry.solr.servlet.AcomServlet.lambda$indexTransferStart$1(AcomServlet.java:303)
>   at com.ancestry.solr.service.IndexTransferWorker.run(IndexTransferWorker.java:78)
>   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
>
>
> Which we predict has to do with zookeeper not responding fast enough.
