On 9/22/2016 1:01 PM, vsolakhian wrote:
> Our index is in HDFS, but we did not change any configuration after we
> deleted 35% of records and optimized.
>
> The relatively slow commit (soft commit and warming up took 1.5 minutes) is
> OK for our use case (adding hundreds of thousands and even millions of
> records and then committing).
>
> The question is why it takes much longer after optimization, when disk
> caches, network and other configuration remained the same and the index is
> smaller?

When you optimize an index down to one segment, you are reading one entire copy of the index and creating a second copy of the index. This is going to greatly affect the data that is in the disk cache.

Presumably you do not have enough caching memory to hold anywhere near the entire 300GB index. Memory sizes that large are possible, but not common. With HDFS, I think the amount of memory used for caching is configurable. I do not know if both HDFS clients and servers can do caching, or if that's just a server-side option. With a 300GB index, 150 to 250GB of memory should be available for caching if you want to have stellar performance. If you can get the entire 300GB to fit, then you'd nearly be guaranteed good performance.

Assuming I'm right about the amount of caching memory available relative to the index size, when the optimize is finished, chances are very good that the particular data sitting in the disk cache is completely useless for queries, so the first few warming and user queries will need to actually read the *disk*, and put different data in the cache. When enough queries have been processed, eventually the disk cache will be populated with enough relevant data that subsequent queries will be fast.

If there are other programs or Solr indexes competing for the same caching memory, then the problem might be even worse.

You might want to refrain from optimizing indexes this large, at least on a frequent basis, and just rely on normal index merging to handle your deletes.

Optimizing is a special case when it comes to cache memory, and for that, you need even more than in the general case. There's a special note about optimizes here:

https://wiki.apache.org/solr/SolrPerformanceProblems#OS_Disk_Cache

Thanks,
Shawn
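
As a rough illustration of the sizing arithmetic above, here is a minimal Python sketch. The function names are hypothetical, and scaling the 150 to 250GB range linearly with index size is an assumption made for the example; it is not a Solr or HDFS API, and not an official rule.

```python
def cache_target_gb(index_gb):
    """Return (low, high) cache-memory targets in GB.

    Assumption: the 150-250GB range suggested for a 300GB index
    scales linearly with index size.
    """
    low = index_gb * 150 / 300
    high = index_gb * 250 / 300
    return (low, high)


def cache_assessment(cache_gb, index_gb):
    """Classify the available cache memory against the targets above."""
    low, high = cache_target_gb(index_gb)
    if cache_gb >= index_gb:
        return "entire index fits in cache"
    if cache_gb >= low:
        return "within or above the suggested range"
    return "below the suggested range; expect disk reads"


if __name__ == "__main__":
    print(cache_target_gb(300))
    print(cache_assessment(64, 300))
```

With a 300GB index this reproduces the 150 to 250GB range mentioned above. Keep in mind that an optimize rewrites the whole index, so during and right after it the cache has to cope with both the old and the new copy, and whatever it holds at that point may be of little use for queries.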