Dave, there are settings along the lines of MAX_CONNECTIONS and MAX_CONNECTIONS_PER_HOST which control the number of connections.
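For reference: if memory serves, the solr.xml names for those limits are maxConnections and maxConnectionsPerHost on the HttpShardHandlerFactory, next to the shard-handler timeouts Erick lists further down; the ZK-related timeouts sit under <solrcloud>. A rough sketch, with illustrative values rather than recommendations:

<solr>
  <solrcloud>
    <!-- all values in milliseconds -->
    <int name="zkClientTimeout">30000</int>
    <int name="distribUpdateConnTimeout">60000</int>
    <int name="distribUpdateSoTimeout">600000</int>
  </solrcloud>

  <shardHandlerFactory name="shardHandlerFactory" class="HttpShardHandlerFactory">
    <int name="maxConnections">10000</int>
    <int name="maxConnectionsPerHost">100</int>
    <int name="connTimeout">60000</int>
    <int name="socketTimeout">600000</int>
  </shardHandlerFactory>
</solr>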
Are you leaving open the connection to zookeeper after you establish it? Are you using the singleton pattern? (A rough sketch of that approach follows the quoted thread below.)

2016-12-28 14:14 GMT-03:00 Dave Seltzer <dselt...@tveyes.com>:

> Hi Erick,
>
> I'll dig in on these timeout settings and see how changes affect behavior.
>
> One interesting aspect is that we're not indexing any content at the
> moment. The rate of ingress is something like 10 to 20 documents per day.
>
> So my guess is that ZK simply is deciding that these servers are dead based
> on the fact that responses are so very sluggish.
>
> You've mentioned lots of timeouts, but are there any settings which control
> the number of available threads? Or is this something which is largely
> handled automagically?
>
> Many thanks!
>
> -Dave
>
> On Wed, Dec 28, 2016 at 11:56 AM, Erick Erickson <erickerick...@gmail.com>
> wrote:
>
> > Dave:
> >
> > There are at least 4 timeouts (not even including ZK) that can
> > be relevant, defined in solr.xml:
> > socketTimeout
> > connTimeout
> > distribUpdateConnTimeout
> > distribUpdateSoTimeout
> >
> > Plus the ZK timeout
> > zkClientTimeout
> >
> > Plus the ZK configurations.
> >
> > So it would help narrow down what's going on if we knew why the nodes
> > dropped out. There are indeed a lot of messages dumped, but somewhere
> > in the logs there should be a root cause.
> >
> > You might see Leader Initiated Recovery (LIR) which can indicate that
> > an update operation from the leader took too long; the timeouts above
> > can be adjusted in this case.
> >
> > You might see evidence that ZK couldn't get a response from Solr in
> > "too long" and decided it was gone.
> >
> > You might see...
> >
> > One thing I'd look at very closely is GC processing. One of the
> > culprits for this behavior I've seen is a very long GC stop-the-world
> > pause leading to ZK thinking the node is dead and tripping this chain.
> > Depending on the timeouts, "very long" might be a few seconds.
> >
> > Not entirely helpful, but until you pinpoint why the node goes into
> > recovery it's throwing darts at the wall. GC and log messages might
> > give some insight into the root cause.
> >
> > Best,
> > Erick
> >
> > On Wed, Dec 28, 2016 at 8:26 AM, Dave Seltzer <dselt...@tveyes.com>
> > wrote:
> > > Hello Everyone,
> > >
> > > I'm working on a Solr Cloud cluster which is used in a hash matching
> > > application.
> > >
> > > For performance reasons we've opted to batch-execute hash matching
> > > queries. This means that a single query will contain many nested
> > > queries. As you might expect, these queries take a while to execute.
> > > (On the order of 5 to 10 seconds.)
> > >
> > > I've noticed that Solr will act erratically when we send too many
> > > long-running queries. Specifically, heavily-loaded servers will
> > > repeatedly fall out of the cluster and then recover. My theory is that
> > > there's some limit on the number of concurrent connections and that
> > > client queries are preventing zookeeper-related queries... but I'm not
> > > sure. I've increased zkClientTimeout to combat this.
> > >
> > > My question is: What configuration settings should I be looking at in
> > > order to make sure I'm maximizing the ability of Solr to handle
> > > concurrent requests?
> > >
> > > Many thanks!
> > >
> > > -Dave
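To illustrate the singleton question at the top of this reply: a minimal sketch of sharing a single CloudSolrClient (and therefore a single ZooKeeper connection) across the whole application, assuming SolrJ 6.x; the ZK connect string and collection name are placeholders.

import java.io.IOException;

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public final class SolrClientHolder {

    // One shared, thread-safe client for the whole application instead of
    // a new client (and a new ZK session) per request.
    private static final CloudSolrClient CLIENT =
            new CloudSolrClient.Builder()
                    .withZkHost("zk1:2181,zk2:2181,zk3:2181")  // placeholder ZK connect string
                    .build();

    static {
        CLIENT.setDefaultCollection("hashes");  // placeholder collection name
    }

    private SolrClientHolder() { }

    public static CloudSolrClient get() {
        return CLIENT;
    }

    // Convenience wrapper so callers never build their own client.
    public static QueryResponse query(SolrQuery q) throws SolrServerException, IOException {
        return CLIENT.query(q);
    }
}

The idea is that the client is opened once and reused for the life of the process; tearing ZooKeeper connections up and down per request only adds load when the cluster is already struggling.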