On Feb 21, 2014, at 12:23 PM, Jeff Wartes <jwar...@whitepages.com> wrote:
>
> I’ve been experimenting with SolrCloud configurations in AWS. One issue I’ve
> been plagued with is that during indexing, occasionally a node decides it
> can’t talk to ZK, and this disables updates in the pool. The node usually
> recovers within a second or two. It’s possible this happens when I’m not
> indexing too, but I’m much less likely to notice.
>
> I’ve seen this with multiple sharding configurations and multiple cluster
> sizes. I’ve searched around, and I think I’ve addressed the usual resolutions
> when someone complains about ZK and Solr. I’m using:
>
> * 60-sec ZK connection timeout (although this seems like a pretty terrible
> requirement)

Be aware that it maxes out at like 40 or 45 seconds with the default tickTime of 2000.

> * Independent 3-node ZK cluster, also in AWS.
> * Solr 4.6.1
> * Optimized GC settings (and I’ve confirmed no GC pauses are occurring)
> * 5-min auto-hard-commit with openSearcher=false
>
> I’m indexing some 10K docs/sec using CloudSolrServer, but the CPU usage on
> the nodes doesn’t exceed 20%, typically it’s around 5%.
>
> Here is the relevant section of logs from one of the nodes when this happened:
> http://pastebin.com/K0ZdKmL4
>
> It looks like it had a connection timeout, and tried to re-establish the same
> session on a connection to a new ZK node, except the session had also
> expired. It then closes *that* connection, changes to read-only mode, and
> eventually creates a new connection and new session which allows writes again.
>
> Can anyone familiar with the ZK connection/session stuff comment on whether
> this is a bug? I really know nothing about proper ZK client behaviour.
>
> Thanks.
>

You have to figure out why Solr is not able to talk to ZooKeeper for 40-60 seconds. Perhaps it’s the network, perhaps it’s the…I’m not sure. But for some reason a very simple heartbeat cannot occur for a long time - and for Solr to receive updates, it has to maintain a connection with ZooKeeper.

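As a rough illustration of where the 40-second ceiling mentioned above comes from: the ZooKeeper server clamps whatever session timeout the client requests (zkClientTimeout in Solr) to a range that defaults to 2 x tickTime at the low end and 20 x tickTime at the high end, unless minSessionTimeout / maxSessionTimeout are set explicitly in the server config. The sketch below only models that arithmetic; the function name and its parameters are made up for illustration and are not part of any ZooKeeper or Solr API.

```python
def negotiated_session_timeout_ms(requested_ms, tick_time_ms=2000,
                                  min_ms=None, max_ms=None):
    """Model how a ZooKeeper server bounds a requested session timeout.

    Hypothetical helper for illustration only. When min_ms / max_ms are
    not given, it assumes the documented server defaults:
    minSessionTimeout = 2 * tickTime, maxSessionTimeout = 20 * tickTime.
    """
    if min_ms is None:
        min_ms = 2 * tick_time_ms
    if max_ms is None:
        max_ms = 20 * tick_time_ms
    # Clamp the client's request into the [min_ms, max_ms] range.
    return max(min_ms, min(requested_ms, max_ms))


# A 60-second request against a server with the default tickTime of 2000 ms.
print(negotiated_session_timeout_ms(60_000))  # 40000
```

So under these assumptions a 60-second client setting only takes full effect if the ZK servers are also reconfigured, for example with a larger tickTime or an explicit maxSessionTimeout.
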
You can either raise the timeout, or dig into why the connection heartbeat cannot be maintained (it’s very lightweight).

- Mark

http://about.me/markrmiller