On Feb 21, 2014, at 12:23 PM, Jeff Wartes <jwar...@whitepages.com> wrote:

> 
> I’ve been experimenting with SolrCloud configurations in AWS. One issue I’ve 
> been plagued with is that during indexing, occasionally a node decides it 
> can’t talk to ZK, and this disables updates in the pool. The node usually 
> recovers within a second or two. It’s possible this happens when I’m not 
> indexing too, but I’m much less likely to notice.
> 
> I’ve seen this with multiple sharding configurations and multiple cluster 
> sizes. I’ve searched around, and I think I’ve addressed the usual resolutions 
> when someone complains about ZK and Solr. I’m using:
> 
>  *   60-sec ZK connection timeout (although this seems like a pretty terrible 
> requirement)

Be aware that the ZooKeeper server caps the session timeout at 20 x tickTime 
by default, so with the default tickTime of 2000 ms a 60-second request gets 
negotiated down to 40 seconds.
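
To make the arithmetic concrete, here is a minimal sketch of how the server 
clamps a requested session timeout. It assumes the documented defaults 
(minimum 2 x tickTime, maximum 20 x tickTime); the function and parameter 
names are made up for illustration and are not part of any ZooKeeper or Solr 
API.

```python
def negotiated_session_timeout(requested_ms, tick_time_ms=2000,
                               min_ms=None, max_ms=None):
    """Return the session timeout a ZooKeeper server would likely grant.

    Assumption: unless minSessionTimeout / maxSessionTimeout are set
    explicitly on the server, the bounds default to 2x and 20x tickTime.
    """
    if min_ms is None:
        min_ms = 2 * tick_time_ms
    if max_ms is None:
        max_ms = 20 * tick_time_ms
    # Clamp the client's request into the [min, max] window.
    return max(min_ms, min(requested_ms, max_ms))
```

With the defaults, a 60000 ms request comes back as 40000 ms. Getting a real 
60 seconds would mean raising the maximum on the ZooKeeper side, for example 
by passing max_ms=60000 in this sketch.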

>  *   Independent 3-node ZK cluster, also in AWS.
>  *   Solr 4.6.1
>  *   Optimized GC settings (and I’ve confirmed no GC pauses are occurring)
>  *   5-min auto-hard-commit with openSearcher=false
> 
> I’m indexing some 10K docs/sec using CloudSolrServer, but the CPU usage on 
> the nodes doesn’t exceed 20%, typically it’s around 5%.
> 
> Here is the relevant section of logs from one of the nodes when this happened:
> http://pastebin.com/K0ZdKmL4
> 
> It looks like it had a connection timeout, and tried to re-establish the same 
> session on a connection to a new ZK node, except the session had also 
> expired. It then closes *that* connection, changes to read-only mode, and 
> eventually creates a new connection and new session which allows writes again.
> 
> Can anyone familiar with the ZK connection/session stuff comment on whether 
> this is a bug? I really know nothing about proper ZK client behaviour.
> 
> Thanks.
> 

You have to figure out why Solr is not able to talk to ZooKeeper for 40-60 
seconds. Perhaps it’s the network, perhaps it’s the…I’m not sure. But for some 
reason a very simple heartbeat cannot occur for a long time - and for Solr to 
receive updates, it has to maintain a connection with ZooKeeper. You can either 
raise the timeout, or dig into why the connection heartbeat cannot be 
maintained (it’s very lightweight).
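
One rough way to start digging is to time a trivial round trip from the Solr 
host to each ZooKeeper node. The sketch below sends ZooKeeper's four-letter 
"ruok" command over a plain socket and reports how long the reply took. It 
is not the session heartbeat itself, only a proxy for network and server 
responsiveness; the function name is made up for illustration, and on 
ZooKeeper 3.5 and later the command has to be allowed through 
4lw.commands.whitelist.

```python
import socket
import time


def zk_ruok(host, port=2181, timeout=5.0):
    """Send 'ruok' to a ZooKeeper server and time the reply.

    Returns (reply_text, elapsed_seconds). A healthy server is expected
    to answer 'imok'. Raises OSError on connect/read failure or timeout.
    """
    start = time.monotonic()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(b"ruok")
        reply = sock.recv(64)  # 'imok' is 4 bytes; leave room for errors
    elapsed = time.monotonic() - start
    return reply.decode("ascii", "replace"), elapsed
```

Calling this in a loop against each ZooKeeper node while indexing, and 
logging any reply that is slow or missing, may help tell a network stall 
apart from a problem inside the Solr JVM.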

- Mark

http://about.me/markrmiller
