Thanks Eric!

It's very likely that the auto scaling groups spinning up new Solr
nodes hit ZooKeeper harder than our initial deploy did, just due to
the way things get staggered during the deploy.

Unfortunately, I don't think there's a built-in way to stagger the
auto scaling group's work of bringing up Solr boxes (although I need
to check). If there isn't, I may drive the scale-up in batches myself,
as in the sketch below.
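
A rough, untested sketch of what I mean - this assumes boto3, and the
group name, batch size, and pause are made-up placeholders:

# Untested sketch: raise an auto scaling group's desired capacity in
# small batches instead of jumping straight to the target size.
import time
import boto3

asg = boto3.client("autoscaling")
GROUP = "solr-asg"   # hypothetical ASG name
TARGET = 80          # desired final node count
BATCH = 5            # nodes to add per step
PAUSE_SECS = 300     # let ZooKeeper/Overseer settle between steps

def in_service_count():
    groups = asg.describe_auto_scaling_groups(AutoScalingGroupNames=[GROUP])
    instances = groups["AutoScalingGroups"][0]["Instances"]
    return sum(1 for i in instances if i["LifecycleState"] == "InService")

current = in_service_count()
while current < TARGET:
    step = min(current + BATCH, TARGET)
    asg.set_desired_capacity(AutoScalingGroupName=GROUP, DesiredCapacity=step)
    while in_service_count() < step:
        time.sleep(30)
    time.sleep(PAUSE_SECS)  # crude stand-in for "Overseer queue is quiet"
    current = step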

I appreciate the hint to check the Overseer queue - I'll be doing that
for sure. Something like the snippet below is what I have in mind.
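
An untested sketch using the kazoo client (the host is a placeholder
for our ensemble). The same connection can also send ZooKeeper's
four-letter "mntr" command for a rough read on server load:

# Untested sketch: poll the Solr Overseer queue depth in ZooKeeper,
# then dump ZK's "mntr" stats. The host below is a placeholder.
import time
from kazoo.client import KazooClient

zk = KazooClient(hosts="zk1.example.com:2181")
zk.start()
try:
    for _ in range(20):
        pending = zk.get_children("/overseer/queue")
        print("overseer queue depth: {}".format(len(pending)))
        time.sleep(15)
    # Four-letter-word command: outstanding requests, latencies, etc.
    print(zk.command(b"mntr"))
finally:
    zk.stop()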



On Fri, Jun 9, 2017 at 12:19 PM, Erick Erickson <erickerick...@gmail.com>
wrote:

> John:
>
> First place I'd look is the ZooKeeper Overseer queue. Prior to 6.6
> there were some inefficiencies in how those messages were processed
> and that queue would get very, very large when lots of replicas came
> up all at once, and that would gum up the works. See: SOLR-10524.
>
> The quick check would be to bring up your nodes a few at a time and
> monitor the Overseer work queue(s) in ZK. Bring up, say, 5 nodes, wait
> for the Overseer queue to settle down, bring up 5 more. Rinse, repeat.
> If you can bring everything up and index and the like, that's probably
> the issue.
>
> I'm purely keying off of your statements "The code used is exactly
> the same code that successfully spun up the first 3 or 4 solr
> boxes...." and "When we scale up, 40 to 80 solr nodes will spin up at
> one time", so I may be way off base.
>
> If I'm guessing correctly, then Solr 6.6, the patch above (and
> perhaps associated ones), or bringing up boxes more slowly are
> indicated. I do know of installations with over 100K replicas, so
> Solr works at your scale.
>
> Best,
> Erick
>
> On Fri, Jun 9, 2017 at 11:03 AM, John Bickerstaff
> <j...@johnbickerstaff.com> wrote:
> > Hi all,
> >
> > Here's my situation...
> >
> > We're in AWS with ZooKeeper / Solr.
> >
> > When trying to spin up additional Solr boxes from an "auto scaling
> > group" I get this failure.
> >
> > The code used is exactly the same code that successfully spun up
> > the first 3 or 4 solr boxes in each "auto scaling group".
> >
> > Below is a copy of my email to some of my compatriots within the company
> > who also use solr/zookeeper....
> >
> > I'm looking for any advice on what _might_ be the cause of this
> > failure...
> > Overload on Zookeeper in some way is our best guess.
> >
> > I know this isn't a zookeeper forum -- just hoping someone out there has
> > some experience troubleshooting similar issues.
> >
> > Many thanks in advance...
> >
> > =====
> >
> > We have 6 zookeepers. (3 of them are observers).
> >
> > They are not under a load balancer.
> >
> > How do I check if zookeeper nodes are under heavy load?
> >
> >
> > The problem arises when we try to scale up with more solr nodes. In
> > the current setup we have 160 nodes connected to zookeeper, each
> > node with 40 cores, so around 6400 cores. When we scale up, 40 to 80
> > solr nodes will spin up at one time.
> >
> > And we are getting errors like these that stop the index
> > distribution process:
> >
> > 2017-06-05 20:06:34.357 ERROR [pool-3-thread-2] o.a.s.c.CoreContainer -
> > Error creating core [p44_b1_s37]: Could not get shard id for core:
> > p44_b1_s37
> >
> > org.apache.solr.common.SolrException: Could not get shard id for core:
> > p44_b1_s37
> >     at org.apache.solr.cloud.ZkController.waitForShardId(ZkController.java:1496)
> >     at org.apache.solr.cloud.ZkController.doGetShardIdAndNodeNameProcess(ZkController.java:1438)
> >     at org.apache.solr.cloud.ZkController.preRegister(ZkController.java:1548)
> >     at org.apache.solr.core.CoreContainer.create(CoreContainer.java:815)
> >     at org.apache.solr.core.CoreContainer.create(CoreContainer.java:757)
> >     at com.ancestry.solr.servlet.AcomServlet.indexTransfer(AcomServlet.java:319)
> >     at com.ancestry.solr.servlet.AcomServlet.lambda$indexTransferStart$1(AcomServlet.java:303)
> >     at com.ancestry.solr.service.IndexTransferWorker.run(IndexTransferWorker.java:78)
> >     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> >     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> >     at java.lang.Thread.run(Thread.java:745)
> >
> >
> > We suspect this has to do with ZooKeeper not responding fast enough.
>
