Here is a quick update based on your questions, along with some additional information that may help.

The additional info is that when we run the test for longer (20 mins) we see better response times; however, when we run a short test (5 mins) and rerun it after an hour or so, we see slow response times again. Note that we don't update the collection during a test or between tests. Does this help to identify the issue?

1. The schema is exactly the same between prod and QA.
2. The Solr configs are exactly the same between prod and QA.
3. We have designed our test to mimic reality, where the filter cache is not hit at all. From Solr, we are seeing ZERO filter cache hits in both QA and prod. There is about a 4% query and document cache hit rate in prod.
4. The GC CPU usage is about 0.2% in prod and about 0.02% in QA. Not sure if that matters.
5. We measure it using New Relic, which has statistical information about the Solr transaction times.

Given that, could this be a warm-up issue, related to keeping the Solr / Lucene memory-mapped files in RAM? Is there any way to measure which collection is using memory? We do have 350GB RAM, but we see it full with buffer cache and are not really sure what is using this memory.

On Sun, May 10, 2020 at 10:37 AM Erick Erickson <erickerick...@gmail.com> wrote:

> Do not, repeat NOT expungeDelete after each deleteByQuery, it is
> a very expensive operation. Perhaps after the nightly batch, but
> I doubt that’ll help much anyway.
>
> 30% deleted docs is quite normal, and should definitely not
> change the response time by a factor of 100! So there’s
> some other issue in your environment.
>
> So the things I’d check:
> 1> the schema is exactly the same. It’s vaguely possible that
> the schema is just a tiny bit different. If that’s the case, you
> need to delete your entire collection’s data and re-index from
> scratch.
> You can index to a new collection and use
> collection aliasing to do this seamlessly
>
> 2> Your solrconfig is exactly the same, especially the filterCache
> cache settings. I call out filterCache because you specifically
> mention filter queries, but check your other caches too.
>
> 3> Check your filterCache usage statistics. If you see drastically
> different hit ratios in the two environments, you need to pursue that.
>
> 4> Once and always, check your GC performance on the two
> environments. It’s a low-probability item, but you may be
> just enough different in prod that GC is an issue.
>
> 5> Take a look at the QTimes recorded in your solr logs to ensure
> that the difference isn’t outside of Solr.
>
> While I can’t say what the exact problem is, I’m 99% sure that the number
> of deleted docs isn’t the culprit.
>
> Best,
> Erick
>
> > On May 9, 2020, at 6:22 PM, Ganesh Sethuraman <ganeshmail...@gmail.com> wrote:
> >
> > Hi Solr Users,
> >
> > We use SolrCloud 7.2.1 with 2 Solr nodes in AWS. The shard size for these
> > collections does not exceed 5G. They have approximately 16 shards
> > with 2 replicas. We do deletes (ByQuery) as well as large updates in some of
> > these Solr collections. We are seeing slower filter queries (95% > 10secs)
> > on these collections in production; for the same collections and same queries in
> > our lower environment, with similar setup and configuration, we are seeing much
> > better performance (<100ms). These are NRT indexes, with daily batch
> > updates only.
> >
> > We see a difference however in the lower environment: we don't see
> > updates or deletes there, and in Segment Info for each of the Solr cores there
> > are ZERO delete percentages. Could this be the reason for the faster query
> > response time in our lower environment? In our production environment, we
> > are seeing about 30-32% of deletes in each core shard/replica pair.
> >
> > Does this segment delete % have any correlation with query response time? We
> > do delete by Query in a loop. Also updates.
> > If it is so, do you suggest trying Optimize or expungeDelete at the
> > end of every day?
> > Do we need to expungeDelete after each deleteByQuery or do it once at the
> > end?
> >
> > Regards,
> > Ganesh
> >
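
On item 5 (QTime versus the New Relic transaction times): a minimal sketch of how the QTimes in the Solr request log could be summarized, so the numbers from prod and QA can be put next to what New Relic reports. It assumes the usual request log lines that contain "path=/select" and end with "status=0 QTime=<ms>"; the function names and the nearest-rank percentile are illustrative choices, not anything taken from Solr itself.

```python
import math
import re

# Assumption: request log lines look like
#   "... webapp=/solr path=/select params={...} hits=10 status=0 QTime=12"
QTIME_RE = re.compile(r"\bQTime=(\d+)")
PATH_RE = re.compile(r"\bpath=(\S+)")


def qtimes(lines, path="/select"):
    """Return the QTime values (ms) of log lines whose path= equals `path`."""
    values = []
    for line in lines:
        path_match = PATH_RE.search(line)
        qtime_match = QTIME_RE.search(line)
        if path_match and qtime_match and path_match.group(1) == path:
            values.append(int(qtime_match.group(1)))
    return values


def percentile(values, pct):
    """Nearest-rank percentile of `values`; pct is in (0, 100]."""
    if not values:
        raise ValueError("no QTime values found")
    ordered = sorted(values)
    rank = math.ceil(pct / 100 * len(ordered))
    return ordered[max(rank, 1) - 1]


def summarize(lines, path="/select"):
    """Count, median, 95th percentile and max QTime for one handler path."""
    values = qtimes(lines, path)
    return {
        "count": len(values),
        "p50": percentile(values, 50),
        "p95": percentile(values, 95),
        "max": max(values),
    }
```

Running summarize() over an open log file from each environment (the file location depends on the install) gives one set of numbers per environment. If the prod QTimes are themselves in the 10 sec range, the time is being spent inside Solr; if they are small while New Relic shows slow transactions, the difference is likely outside of Solr. As far as I know, QTime does not include the time to write the response back to the client, so a gap between the two is expected to some degree.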