Thanks a lot for your inputs and suggestions. I was thinking along similar lines: creating a second collection of the same kind (hot and cold) and moving documents older than a certain age, say 180 days, from the original (hot) collection to the new (cold) collection.
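A rough sketch of that kind of move, assuming a date field such as created_dt, an "id" uniqueKey and placeholder collection names, could look like the following. It relies on cursorMark paging and assumes all fields are stored (or retrievable via docValues) so documents can be copied intact:

    # Rough sketch only: copy documents older than 180 days from a "hot"
    # collection to a "cold" one, then delete them from "hot".
    # Collection names, the date field (created_dt) and the uniqueKey (id)
    # are assumptions, not taken from the actual schema.
    import requests

    SOLR = "http://localhost:8983/solr"
    HOT, COLD = "mycollection_hot", "mycollection_cold"
    OLD_DOCS = "created_dt:[* TO NOW-180DAYS]"

    def migrate(rows=1000):
        cursor = "*"
        while True:
            resp = requests.get(f"{SOLR}/{HOT}/select", params={
                "q": OLD_DOCS,
                "sort": "id asc",        # cursorMark requires a sort on the uniqueKey
                "rows": rows,
                "cursorMark": cursor,
                "wt": "json",
            }).json()
            docs = resp["response"]["docs"]
            if docs:
                for d in docs:
                    d.pop("_version_", None)   # strip internal field before re-adding
                requests.post(f"{SOLR}/{COLD}/update",
                              params={"commitWithin": 60000},
                              json=docs)
            next_cursor = resp["nextCursorMark"]
            if next_cursor == cursor:    # cursor stopped advancing: all pages read
                break
            cursor = next_cursor
        # Only after the cold copy has been verified, delete from the hot collection
        requests.post(f"{SOLR}/{HOT}/update", params={"commit": "true"},
                      json={"delete": {"query": OLD_DOCS}})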
Thanks,
Madhava

Sent from my iPhone

> On 4 Jul 2020, at 14:37, Erick Erickson <erickerick...@gmail.com> wrote:
>
> You need more shards. And, I'm pretty certain, more hardware.
>
> You say you have 13 billion documents and 6 shards. Solr/Lucene has an absolute upper limit of 2B (2^31) docs per shard. I don't quite know how you're running at all unless that 13B is a round number. If you keep adding documents, your installation will shortly, at best, stop accepting new documents for indexing. At worst you'll start seeing weird errors and possibly corrupt indexes and have to re-index everything from scratch.
>
> You've backed yourself into a pretty tight corner here. You either have to re-index to a properly-sized cluster or use SPLITSHARD. The latter will double the index-on-disk size (it creates two child indexes per replica and keeps the old one for safety's sake, which you have to clean up later). I strongly recommend you stop ingesting more data while you do this.
>
> You say you have 6 VMs with 2 nodes running on each. If those VMs are co-located with anything else, the physical hardware is going to be stressed. VMs themselves aren't bad, but somewhere there's physical hardware that runs it…
>
> In fact, I urge you to stop ingesting data immediately and address this issue. You have a cluster that's mis-configured, and you must address that before Bad Things Happen.
>
> Best,
> Erick
>
>> On Jul 4, 2020, at 5:09 AM, Mad have <madhava.a.re...@gmail.com> wrote:
>>
>> Hi Eric,
>>
>> There are 6 VMs in the Solr cluster in total, with 2 nodes running on each VM. The total number of shards is 6, with 3 replicas. I can see the index size is more than 220GB on each node for the collection where we are facing the performance issue.
>>
>> The more documents we add to the collection, the slower indexing becomes, and I also have the impression that the size of the collection is causing this issue. I would appreciate it if you could suggest a solution for this.
>>
>> Regards,
>> Madhava
>> Sent from my iPhone
>>
>>> On 3 Jul 2020, at 23:30, Erick Erickson <erickerick...@gmail.com> wrote:
>>>
>>> Oops, I transposed that. If your index is a terabyte and your RAM is 128M, _that's_ a red flag.
>>>
>>>> On Jul 3, 2020, at 5:53 PM, Erick Erickson <erickerick...@gmail.com> wrote:
>>>>
>>>> You haven't said how many _shards_ are present. Nor how many replicas of the collection you're hosting per physical machine. Nor how large the indexes are on disk. Those are the numbers that count. The latter is somewhat fuzzy, but if your aggregate index size on a machine with, say, 128G of memory is a terabyte, that's a red flag.
>>>>
>>>> Short form, though, is yes. Subject to the questions above, this is what I'd be looking at first.
>>>>
>>>> And, as I said, if you've been steadily increasing the total number of documents, you'll reach a tipping point sometime.
>>>>
>>>> Best,
>>>> Erick
>>>>
>>>>> On Jul 3, 2020, at 5:32 PM, Mad have <madhava.a.re...@gmail.com> wrote:
>>>>>
>>>>> Hi Eric,
>>>>>
>>>>> The collection has almost 13 billion documents, each around 5KB in size, and all of the roughly 150 columns are indexed. Do you think the number of documents in the collection is causing this issue? I would appreciate your response.
>>>>>
>>>>> Regards,
>>>>> Madhava
>>>>>
>>>>> Sent from my iPhone
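As a side note on the SPLITSHARD suggestion above, a rough sketch of the per-shard arithmetic and of an asynchronous Collections API SPLITSHARD call follows; the host, collection and shard names are placeholders, and each child shard still needs to be verified and the parent cleaned up afterwards:

    # Rough sketch: check how close each shard is to Lucene's hard limit and,
    # if needed, ask Solr to split one shard. Names are placeholders.
    import requests

    LUCENE_MAX_DOCS_PER_SHARD = 2**31 - 1       # ~2.147 billion
    total_docs, num_shards = 13_000_000_000, 6
    per_shard = total_docs / num_shards         # ~2.17 billion: already over the limit
    print(f"~{per_shard:,.0f} docs/shard vs limit {LUCENE_MAX_DOCS_PER_SHARD:,}")

    # Asynchronous SPLITSHARD via the Collections API; poll REQUESTSTATUS afterwards.
    resp = requests.get("http://localhost:8983/solr/admin/collections", params={
        "action": "SPLITSHARD",
        "collection": "mycollection",
        "shard": "shard1",
        "async": "split-shard1",
        "wt": "json",
    })
    print(resp.json())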
>>>>>> On 3 Jul 2020, at 12:42, Erick Erickson <erickerick...@gmail.com> wrote:
>>>>>>
>>>>>> If you're seeing low CPU utilization at the same time, you probably just have too much data on too little hardware. Check your swapping; how much of your I/O is just because Lucene can't hold all the parts of the index it needs in memory at once? Lucene uses MMapDirectory to hold the index and you may well be swapping, see:
>>>>>>
>>>>>> https://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html
>>>>>>
>>>>>> But my guess is that you've just reached a tipping point. You say:
>>>>>>
>>>>>> "From last 2-3 weeks we have been noticing either slow indexing or timeout errors while indexing"
>>>>>>
>>>>>> So have you been continually adding more documents to your collections for more than the 2-3 weeks? If so, you may have just put so much data on the same boxes that you've gone over the capacity of your hardware. As Toke says, adding physical memory for the OS to use to hold relevant parts of the index may alleviate the problem (again, refer to Uwe's article for why).
>>>>>>
>>>>>> All that said, if you're going to keep adding documents, you need to seriously think about adding new machines and moving some of your replicas to them.
>>>>>>
>>>>>> Best,
>>>>>> Erick
>>>>>>
>>>>>>> On Jul 3, 2020, at 7:14 AM, Toke Eskildsen <t...@kb.dk> wrote:
>>>>>>>
>>>>>>>> On Thu, 2020-07-02 at 11:16 +0000, Kommu, Vinodh K. wrote:
>>>>>>>> We are performing QA performance testing on a couple of collections which hold 2 billion and 3.5 billion docs respectively.
>>>>>>>
>>>>>>> How many shards?
>>>>>>>
>>>>>>>> 1. Our performance team noticed that read operations far outnumber write operations, at roughly a 100:1 ratio. Is this expected during indexing, or are the Solr nodes doing other operations like syncing?
>>>>>>>
>>>>>>> Are you saying that there are 100 times more read operations when you are indexing? That does not sound too unrealistic, as the disk cache might be filled with the data that the writers are flushing.
>>>>>>>
>>>>>>> In that case, more RAM would help. Okay, more RAM nearly always helps, but such a massive difference in IO-utilization does indicate that you are starved for cache.
>>>>>>>
>>>>>>> I noticed you have at least 18 replicas. That's a lot. Just to sanity check: how many replicas is each physical box handling? If they are sharing resources, fewer replicas would probably be better.
>>>>>>>
>>>>>>>> 3. Our client timeout is set to 2 mins; can it be increased further? Would that help or create any other problems?
>>>>>>>
>>>>>>> It does not hurt the server to increase the client timeout, as the initiated query will keep running until it is finished, independent of whether or not there is a client to receive the result.
>>>>>>>
>>>>>>> If you want a better max time for query processing, you should look at
>>>>>>> https://lucene.apache.org/solr/guide/7_7/common-query-parameters.html#timeallowed-parameter
>>>>>>> but due to its inherent limitations it might not help in your situation.
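For reference, timeAllowed is just a request parameter on the query. A minimal sketch of using it, where the host, collection and query are placeholders:

    # Rough sketch: cap query processing time with timeAllowed (milliseconds).
    # Collection and query are placeholders. If the limit is hit, Solr returns
    # partial results and sets partialResults=true in the response header.
    import requests

    resp = requests.get("http://localhost:8983/solr/mycollection/select", params={
        "q": "status:ACTIVE",
        "rows": 10,
        "timeAllowed": 30000,   # stop collecting results after ~30 seconds
        "wt": "json",
    }).json()
    if resp["responseHeader"].get("partialResults"):
        print("query hit the timeAllowed limit; results are partial")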
>>>>>>>> 4. When we created an empty collection and loaded the same data file, it loaded fine without any issues, so does having more documents in a collection create such problems?
>>>>>>>
>>>>>>> Solr 7 does have a problem with sparse DocValues and many documents, leading to excessive IO-activity, which might be what you are seeing. I can see from an earlier post that you were using streaming expressions for another collection: this is one of the things that are affected by the Solr 7 DocValues issue.
>>>>>>>
>>>>>>> More info about DocValues and streaming:
>>>>>>> https://issues.apache.org/jira/browse/SOLR-13013
>>>>>>>
>>>>>>> Fairly in-depth info on the problem with Solr 7 docValues:
>>>>>>> https://issues.apache.org/jira/browse/LUCENE-8374
>>>>>>>
>>>>>>> If this is your problem, upgrading to Solr 8 and indexing the collection from scratch should fix it.
>>>>>>>
>>>>>>> Alternatively you can port the LUCENE-8374 patch from Solr 7.3 to 7.7, or you can ensure that there are values defined for all DocValues fields in all your documents.
>>>>>>>
>>>>>>>> java.net.SocketTimeoutException: Read timed out
>>>>>>>> at java.net.SocketInputStream.socketRead0(Native Method)
>>>>>>> ...
>>>>>>>> Remote error message: java.util.concurrent.TimeoutException: Idle timeout expired: 600000/600000 ms
>>>>>>>
>>>>>>> There is a default timeout of 10 minutes (distribUpdateSoTimeout?). You should be able to change it in solr.xml.
>>>>>>> https://lucene.apache.org/solr/guide/8_5/format-of-solr-xml.html
>>>>>>>
>>>>>>> BUT if an update takes > 10 minutes to be processed, it indicates that the cluster is overloaded. Increasing the timeout is just a band-aid.
>>>>>>>
>>>>>>> - Toke Eskildsen, Royal Danish Library
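On Toke's point about ensuring values for all DocValues fields, one rough way to spot sparsely populated fields is to compare, per field, how many documents actually have a value against the total document count. The host, collection and field list below are placeholders (the real field list should come from the schema), and on a very large collection these queries will not be fast:

    # Rough sketch: find docValues fields that are only sparsely populated.
    import requests

    SOLR = "http://localhost:8983/solr"
    COLLECTION = "mycollection"
    FIELDS = ["amount_d", "status_s", "created_dt"]   # assumed docValues fields

    total = requests.get(f"{SOLR}/{COLLECTION}/select", params={
        "q": "*:*", "rows": 0, "wt": "json"}).json()["response"]["numFound"]

    for f in FIELDS:
        with_value = requests.get(f"{SOLR}/{COLLECTION}/select", params={
            "q": f"{f}:[* TO *]", "rows": 0, "wt": "json"}).json()["response"]["numFound"]
        pct = 100.0 * with_value / max(total, 1)
        print(f"{f}: {with_value:,}/{total:,} docs ({pct:.1f}%) have a value")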