Dave:

There are at least four timeouts (not even including ZK) that can be relevant, all defined in solr.xml:

socketTimeout
connTimeout
distribUpdateConnTimeout
distribUpdateSoTimeout

Plus the ZK timeout:

zkClientTimeout

Plus the ZK configurations.

So it would help narrow down what's going on if we knew why the nodes dropped out. There are indeed a lot of messages dumped, but somewhere in the logs there should be a root cause.

You might see Leader Initiated Recovery (LIR), which can indicate that an update operation from the leader took too long. The timeouts above can be adjusted in this case.

You might see evidence that ZK couldn't get a response from Solr in "too long" and decided it was gone.

You might see...

One thing I'd look at very closely is GC processing. One of the culprits I've seen for this behavior is a very long GC stop-the-world pause that leads ZK to think the node is dead, which trips this chain. Depending on the timeouts, "very long" might be a few seconds.

Not entirely helpful, but until you pinpoint why the node goes into recovery it's throwing darts at the wall. GC and log messages might give some insight into the root cause.

Best,
Erick

On Wed, Dec 28, 2016 at 8:26 AM, Dave Seltzer <dselt...@tveyes.com> wrote:
> Hello Everyone,
>
> I'm working on a Solr Cloud cluster which is used in a hash matching
> application.
>
> For performance reasons we've opted to batch-execute hash matching queries.
> This means that a single query will contain many nested queries. As you
> might expect, these queries take a while to execute. (On the order of 5 to
> 10 seconds.)
>
> I've noticed that Solr will act erratically when we send too many
> long-running queries. Specifically, heavily-loaded servers will repeatedly
> fall out of the cluster and then recover. My theory is that there's some
> limit on the number of concurrent connections and that client queries are
> preventing zookeeper related queries... but I'm not sure. I've increased
> ZKClientTimeout to combat this.
>
> My question is: What configuration settings should I be looking at in order
> to make sure I'm maximizing the ability of Solr to handle concurrent
> requests.
>
> Many thanks!
>
> -Dave
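
P.S. One rough way to check the GC theory from the reply above is to scan the GC log for long stop-the-world pauses. The sketch below assumes the JVM is writing "Total time for which application threads were stopped" lines (what HotSpot on Java 8 emits with -XX:+PrintGCApplicationStoppedTime); if your GC log uses a different format, the pattern needs adjusting. The function name and the 2-second default threshold are only illustrative, so pick a threshold that relates to your own zkClientTimeout.

```python
import re

# Assumed log format: the line HotSpot (Java 8) writes when
# -XX:+PrintGCApplicationStoppedTime is enabled. Accepts "." or "," as the
# decimal separator, since some locales print a comma.
STOPPED = re.compile(
    r"Total time for which application threads were stopped: (\d+[.,]\d+) seconds"
)


def long_pauses(lines, threshold_seconds=2.0):
    """Return (line_number, pause_seconds) for each pause >= threshold_seconds.

    `lines` is any iterable of log lines, e.g. an open file handle.
    The name and the default threshold are hypothetical, not Solr settings.
    """
    hits = []
    for number, line in enumerate(lines, start=1):
        match = STOPPED.search(line)
        if match is None:
            continue
        seconds = float(match.group(1).replace(",", "."))
        if seconds >= threshold_seconds:
            hits.append((number, seconds))
    return hits
```

To use it, open the GC log and pass the file handle to long_pauses, then compare the timestamps on the reported lines with the times the node went into recovery. If the two line up, the pause is a likely candidate for the root cause.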