On 12/14/2016 11:58 AM, Chetas Joshi wrote:
> I am running Solr 5.5.0 on HDFS. It is a solrCloud of 50 nodes and I have
> the following config.
> maxShardsperNode: 1
> replicationFactor: 1
>
> I have been ingesting data into Solr for the last 3 months. With increase
> in data, I am observing increase in the query time. Currently the size of
> my indices is 70 GB per shard (i.e. per node).
Query times will increase as the index size increases, but a significant
jump in query time may be an indication of a performance problem.
Performance problems are usually caused by insufficient resources,
memory in particular.

With HDFS, I am honestly not sure *where* the cache memory is needed.
I would assume that it's needed on the HDFS hosts, and that a lot of
spare memory on the Solr (HDFS client) hosts probably won't make much
difference.  I could be wrong -- I have no idea what kind of caching
HDFS does.  If the HDFS client can cache data, then you probably would
want extra memory on the Solr machines.

> I am using cursor approach (/export handler) using SolrJ client to get back
> results from Solr. All the fields I am querying on and all the fields that
> I get back from Solr are indexed and have docValues enabled as well. What
> could be the reason behind increase in query time?

If actual disk access is required to satisfy a query, Solr is going to
be slow.  Caching is absolutely required for good performance.  If your
query times are really long but used to be short, chances are that your
index size has exceeded your system's ability to cache it effectively.

One thing to keep in mind:  Gigabit Ethernet tops out at roughly 125
megabytes per second, which is comparable to the sustained transfer
rate of a single modern SATA magnetic disk, so if the data has to
traverse a gigabit network, it will probably be nearly as slow as
reading it from a single local disk.  Having a 10gig network for your
storage is probably a good idea ... but current fast memory chips can
leave 10gig in the dust, so if the data can come from cache and the
chips are new enough, then it can be faster than network storage.

Because the network can be a potential bottleneck, I strongly recommend
putting index data on local disks.  If you have enough memory, the disk
doesn't even need to be super-fast.

> Has this got something to do with the OS disk cache that is used for
> loading the Solr indices? When a query is fired, will Solr wait for all
> (70GB) of disk cache being available so that it can load the index file?

Caching the files on the disk is not handled by Solr, so Solr won't
wait for the entire index to be cached unless the underlying storage
waits for some reason.  The caching is usually handled by the OS.  For
HDFS, it might be handled by a combination of the OS and Hadoop, but I
don't know enough about HDFS to comment.

Solr reads only the parts of the index files that it needs to satisfy
each query.  If the underlying system is capable of caching that data,
the feature is enabled, and there is memory available for the purpose,
then it gets cached.

Thanks,
Shawn
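
P.S.  For reference, here is a minimal SolrJ sketch of cursorMark deep
paging against a SolrCloud collection.  Note that the /export handler
is a separate mechanism (it streams sorted docValues), so this may not
match your exact code; the ZooKeeper address, collection name, and the
"id" uniqueKey sort field below are assumptions for illustration only.

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.CloudSolrClient;
    import org.apache.solr.client.solrj.response.QueryResponse;
    import org.apache.solr.common.SolrDocument;
    import org.apache.solr.common.params.CursorMarkParams;

    public class CursorPage {
        public static void main(String[] args) throws Exception {
            // Hypothetical ZooKeeper ensemble and collection name.
            CloudSolrClient client =
                new CloudSolrClient("zk1:2181,zk2:2181,zk3:2181");
            client.setDefaultCollection("mycollection");

            SolrQuery q = new SolrQuery("*:*");
            q.setRows(1000);
            // cursorMark requires a sort that includes the uniqueKey.
            q.setSort(SolrQuery.SortClause.asc("id"));

            String cursorMark = CursorMarkParams.CURSOR_MARK_START;
            boolean done = false;
            while (!done) {
                q.set(CursorMarkParams.CURSOR_MARK_PARAM, cursorMark);
                QueryResponse rsp = client.query(q);
                for (SolrDocument doc : rsp.getResults()) {
                    // process each document here
                }
                String nextCursorMark = rsp.getNextCursorMark();
                // The cursor stops advancing once all matches are read.
                done = cursorMark.equals(nextCursorMark);
                cursorMark = nextCursorMark;
            }
            client.close();
        }
    }

Sorting on the uniqueKey gives the cursor a stable total ordering,
which is what lets each request resume where the previous page left
off instead of re-collecting start+rows documents every time.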