Thanks Erick and Shawn. I have reduced the number of rows per page from 500K to 100K. I also increased zkClientTimeout to 30 seconds so that I don't run into ZK timeout issues. The ZK cluster has been deployed on hosts other than the SolrCloud hosts.
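For reference, this is roughly where that timeout lives in my solr.xml (a sketch; the default value and whether you override it via the system property may differ per deployment):

```xml
<solrcloud>
  <!-- ZK session timeout: 30 seconds, overridable with -DzkClientTimeout -->
  <int name="zkClientTimeout">${zkClientTimeout:30000}</int>
</solrcloud>
```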
However, I was trying to increase the number of rows per page for the following reason: running ingestion at the same time as queries has increased the time it takes to read results from Solr using the cursor approach by 5x. I am able to read 1M sorted documents (88 bytes of data per document) in 1 hour. What could be the reason behind the slow query execution?

I am running the Solr servers with heap=16g and off-heap=16g; the off-heap memory is used as the block cache. Do ingestion and query execution both make heavy use of the block cache? Should I increase the block cache size in order to improve query performance? Should I increase slab.count or maxDirectMemorySize?

Thanks!

On Sat, Nov 19, 2016 at 8:13 AM, Erick Erickson <erickerick...@gmail.com> wrote:

> Returning 500K rows is, as Shawn says, not Solr's sweet spot.
>
> My guess: all the work you're doing trying to return that many
> rows, particularly in SolrCloud mode, is simply overloading
> your system to the point that the ZK connection times out. Don't
> do that. If you need that many rows, either Shawn's cursorMark
> option or export/streaming aggregation are much better choices.
>
> Consider what happens on a sharded request:
> - The initial node sends a sub-request to a replica for each shard.
> - Each replica returns its candidate top N (doc ID and sort criteria).
> - The initial node sorts these lists (1M from each replica in your
>   example) to get the true top N.
> - The initial node requests the docs that made it into the true top N
>   from each replica.
> - Each replica goes to disk, decompresses the docs, and pulls out the fields.
> - Each replica sends its portion of the top N to the initial node.
> - An enormous packet containing all 1M final docs is assembled and
>   returned to the client, which sucks up bandwidth and resources.
> - That's bad enough, but especially if your ZK nodes are on the same
>   box as your Solr nodes, they're even more likely to have a timeout issue.
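For context, the cursor approach I'm using follows the standard cursorMark loop, roughly like the sketch below. The fetch function is a stand-in for the actual HTTP call to the collection's /select handler, and the sort on `id` assumes `id` is the uniqueKey (cursorMark requires a sort that includes the uniqueKey as a tiebreaker):

```python
def cursor_pages(fetch_page, rows=100_000):
    """Yield pages of docs, following nextCursorMark until it repeats.

    fetch_page(params) is a stand-in for an HTTP GET against
    /solr/<collection>/select that returns the parsed JSON response.
    """
    cursor = "*"  # "*" is the initial cursor mark
    while True:
        resp = fetch_page({
            "q": "*:*",
            "sort": "id asc",   # must include the uniqueKey field
            "rows": rows,
            "cursorMark": cursor,
        })
        yield resp["response"]["docs"]
        next_cursor = resp["nextCursorMark"]
        if next_cursor == cursor:  # same mark back means no more results
            return
        cursor = next_cursor
```

The loop terminates when Solr returns the same cursor mark it was given, which is how the cursor API signals the end of the result set.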
>
>
> Best,
> Erick
>
> On Fri, Nov 18, 2016 at 8:45 PM, Shawn Heisey <apa...@elyograg.org> wrote:
> > On 11/18/2016 6:50 PM, Chetas Joshi wrote:
> >> The numFound is millions, but I was also trying with rows=1 million. I
> >> will reduce it to 500K.
> >>
> >> I am sorry. It is state.json. I am using Solr 5.5.0.
> >>
> >> One of the things I am not able to understand is why my ingestion job is
> >> complaining about "Cannot talk to ZooKeeper - Updates are disabled."
> >>
> >> I have a Spark streaming job that continuously ingests into Solr. My
> >> shards are always up and running. The moment I start a query on SolrCloud,
> >> it starts running into this exception. However, as you said, ZK will only
> >> update the state of the cluster when the shards go down. Then why is my job
> >> trying to contact ZK when the cluster is up, and why is the exception about
> >> updating ZK?
> >
> > SolrCloud and SolrJ (CloudSolrClient) both maintain constant connections
> > to all the zookeeper servers they are configured to use. If zookeeper
> > quorum is lost, SolrCloud will go read-only -- no updating is possible.
> > That is what is meant by "updates are disabled."
> >
> > Solr and Lucene are optimized for very low rowcounts, typically two or
> > three digits. Asking for hundreds of thousands of rows is problematic.
> > The cursorMark feature is designed for efficient queries when paging
> > deeply into results, but it assumes your rows value is relatively small,
> > and that you will be making many queries to get a large number of
> > results, each of which will be fast and won't overload the server.
> >
> > Since it appears you are having a performance issue, here are a few things
> > I have written on the topic:
> >
> > https://wiki.apache.org/solr/SolrPerformanceProblems
> >
> > Thanks,
> > Shawn
> >
>
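On the export/streaming suggestion: as I understand it, the /export handler streams every matching document sorted on docValues fields, so there is no deep-paging cost at all. A hypothetical sketch of building such a request (host, collection, and field names are placeholders, and every field in fl and sort must have docValues enabled):

```python
from urllib.parse import urlencode

def export_url(base, collection, fl, sort, q="*:*"):
    """Build a request URL for Solr's /export handler.

    Unlike paged /select queries, /export streams the full sorted
    result set in one response; fl and sort are both required, and
    the fields involved need docValues enabled in the schema.
    """
    params = urlencode({"q": q, "sort": sort, "fl": fl})
    return f"{base}/solr/{collection}/export?{params}"
```

For example, `export_url("http://host:8983", "mycollection", "id,timestamp", "id asc")` would produce the URL for streaming all docs sorted by id.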