The clusterstate on Zookeeper shouldn't be changing very often, only when nodes come and go.

bq: At that time I am also running queries (that return millions of docs).

As in rows=millions? This is an anti-pattern; if that's true, then you're probably network saturated and the like. If you mean your numFound is in the millions, then this is unlikely to be a problem.

You say "clusterstate.json", which indicates you're on 4x? This has been changed to make a state.json for each collection, so either you upgraded sometime and didn't transform your ZK (there's a command to do that), or can you upgrade?

What I'm guessing is that you have too much going on somehow, and you're overloading your system and getting a timeout. So increasing the timeout is definitely a possibility, or reducing the ingestion load as a test.

Best,
Erick

On Fri, Nov 18, 2016 at 4:51 PM, Chetas Joshi <chetas.jo...@gmail.com> wrote:
> Hi,
>
> I have a SolrCloud (on HDFS) of 50 nodes and a ZK quorum of 5 nodes. The
> SolrCloud is having difficulties talking to ZK when I am ingesting data
> into the collections. At that time I am also running queries (that return
> millions of docs). The ingest job is crying with the following exception
>
> org.apache.solr.client.solrj.impl.CloudSolrClient$RouteException: Error
> from server at http://xxx/solr/collection1_shard15_replica1: Cannot talk to
> ZooKeeper - Updates are disabled.
>
> I think this is happening when the ingest job is trying to update the
> clusterstate.json file but the query is reading from that file and thus has
> some kind of a lock on that file. Are there any factors that will cause the
> "READ" to acquire lock for a long time? Is my understanding correct? I am
> using the cursor approach using SolrJ to get back results from Solr.
>
> How often is the ZK updated with the latest cluster state and what
> parameter governs that? Should I just increase the ZK client timeout so
> that it retries connecting to the ZK for a longer period of time (right now
> it is 15 seconds)?
>
> Thanks!
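
For reference, on the ZK client timeout discussed in this thread: a minimal sketch of where it is usually set, assuming a new-style solr.xml with a <solrcloud> section. The 30000 ms value is only an illustration (the thread mentions a current setting of 15 seconds), not a recommendation, and a real solr.xml carries other entries that are omitted here.

```xml
<solr>
  <solrcloud>
    <!-- ZooKeeper client (session) timeout in milliseconds.
         30000 is an assumed example value; it can be overridden at
         startup with the zkClientTimeout system property. -->
    <int name="zkClientTimeout">${zkClientTimeout:30000}</int>
  </solrcloud>
</solr>
```

A longer timeout only hides the symptom if the nodes are overloaded, so it is worth pairing with the reduced-ingestion test suggested above.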