Oops, I transposed that. If your index is a terabyte and your RAM is 128M, _that’s_ a red flag.
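(Aside: if you want to put a number on that, a rough sketch along these lines
totals the on-disk index size for a node so you can compare it with the RAM
left over for the OS page cache. The data path is illustrative, point it at
your own SOLR_HOME/data, and this is just a sanity check, not an official tool.)

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Stream;

public class IndexSizeCheck {
    public static void main(String[] args) throws IOException {
        // Illustrative default; pass your node's Solr data directory as the first argument.
        Path solrData = Paths.get(args.length > 0 ? args[0] : "/var/solr/data");
        long totalBytes;
        try (Stream<Path> paths = Files.walk(solrData)) {
            totalBytes = paths.filter(Files::isRegularFile)
                              .mapToLong(p -> p.toFile().length())
                              .sum();
        }
        System.out.printf("Aggregate index size on this node: %.1f GB%n", totalBytes / 1e9);
        // Compare against total RAM minus the JVM heap(s); if the index dwarfs
        // what is left for the page cache, that is the red flag described above.
    }
}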
> On Jul 3, 2020, at 5:53 PM, Erick Erickson <erickerick...@gmail.com> wrote:
>
> You haven’t said how many _shards_ are present. Nor how many replicas of the
> collection you’re hosting per physical machine. Nor how large the indexes are
> on disk. Those are the numbers that count. The latter is somewhat fuzzy, but
> if your aggregate index size on a machine with, say, 128G of memory is a
> terabyte, that’s a red flag.
>
> Short form, though, is yes. Subject to the questions above, this is what I’d
> be looking at first.
>
> And, as I said, if you’ve been steadily increasing the total number of
> documents, you’ll reach a tipping point sometime.
>
> Best,
> Erick
>
>> On Jul 3, 2020, at 5:32 PM, Mad have <madhava.a.re...@gmail.com> wrote:
>>
>> Hi Eric,
>>
>> The collection has almost 13 billion documents, each around 5KB in size,
>> and all of the roughly 150 columns are indexed. Do you think the number of
>> documents in the collection is causing this issue? Appreciate your response.
>>
>> Regards,
>> Madhava
>>
>> Sent from my iPhone
>>
>>> On 3 Jul 2020, at 12:42, Erick Erickson <erickerick...@gmail.com> wrote:
>>>
>>> If you’re seeing low CPU utilization at the same time, you probably
>>> just have too much data on too little hardware. Check your
>>> swapping: how much of your I/O is just because Lucene can’t
>>> hold all the parts of the index it needs in memory at once? Lucene
>>> uses MMapDirectory to hold the index and you may well be
>>> swapping, see:
>>>
>>> https://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html
>>>
>>> But my guess is that you’ve just reached a tipping point. You say:
>>>
>>> “From last 2-3 weeks we have been noticing either slow indexing or timeout
>>> errors while indexing”
>>>
>>> So have you been continually adding more documents to your
>>> collections for more than the 2-3 weeks? If so, you may have just
>>> put so much data on the same boxes that you’ve gone over
>>> the capacity of your hardware. As Toke says, adding physical
>>> memory for the OS to use to hold relevant parts of the index may
>>> alleviate the problem (again, refer to Uwe’s article for why).
>>>
>>> All that said, if you’re going to keep adding documents you need to
>>> seriously think about adding new machines and moving some of
>>> your replicas to them.
>>>
>>> Best,
>>> Erick
>>>
>>>> On Jul 3, 2020, at 7:14 AM, Toke Eskildsen <t...@kb.dk> wrote:
>>>>
>>>>> On Thu, 2020-07-02 at 11:16 +0000, Kommu, Vinodh K. wrote:
>>>>> We are performing QA performance testing on a couple of collections
>>>>> which hold 2 billion and 3.5 billion docs respectively.
>>>>
>>>> How many shards?
>>>>
>>>>> 1. Our performance team noticed that read operations are far more
>>>>> frequent than write operations, around a 100:1 ratio. Is this expected
>>>>> during indexing, or are the Solr nodes doing other operations like syncing?
>>>>
>>>> Are you saying that there are 100 times more read operations when you
>>>> are indexing? That does not sound too unrealistic, as the disk cache
>>>> might be filled with the data that the writers are flushing.
>>>>
>>>> In that case, more RAM would help. Okay, more RAM nearly always helps,
>>>> but such a massive difference in IO-utilization does indicate that you
>>>> are starved for cache.
>>>>
>>>> I noticed you have at least 18 replicas. That’s a lot. Just to sanity
>>>> check: How many replicas is each physical box handling? If they are
>>>> sharing resources, fewer replicas would probably be better.
>>>>
>>>>> 3. Our client timeout is set to 2 minutes; can it be increased
>>>>> further? Would that help or create any other problems?
>>>>
>>>> It does not hurt the server to increase the client timeout, as the
>>>> initiated query will keep running until it is finished, independent of
>>>> whether or not there is a client to receive the result.
>>>>
>>>> If you want a better max time for query processing, you should look at
>>>> https://lucene.apache.org/solr/guide/7_7/common-query-parameters.html#timeallowed-parameter
>>>> but due to its inherent limitations it might not help in your
>>>> situation.
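(Aside on the two timeouts Toke mentions: the client-side read timeout and the
server-side timeAllowed live in different places. A minimal SolrJ sketch, with
the URL, collection name and millisecond values purely illustrative, assuming a
7.x/8.x SolrJ client:)

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class TimeoutSketch {
    public static void main(String[] args) throws Exception {
        // Client-side timeouts: how long SolrJ waits, not how long Solr works.
        try (HttpSolrClient client =
                 new HttpSolrClient.Builder("http://localhost:8983/solr/mycollection")
                     .withConnectionTimeout(10_000)   // ms to establish the connection
                     .withSocketTimeout(240_000)      // ms to wait for a response (the "client timeout")
                     .build()) {

            // Server-side cap on query work; Solr returns partial results when it is hit.
            SolrQuery q = new SolrQuery("*:*");
            q.setTimeAllowed(120_000);               // ms

            QueryResponse rsp = client.query(q);
            System.out.println("partialResults: "
                    + rsp.getResponseHeader().get("partialResults"));
            System.out.println("numFound: " + rsp.getResults().getNumFound());
        }
    }
}

Raising the socket timeout only buys patience on the client; timeAllowed is the
knob that actually limits work on the server, with the caveats Toke links above.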
>>>>> 4. When we created an empty collection and loaded the same data file,
>>>>> it loaded fine without any issues. So could having more documents in a
>>>>> collection create such problems?
>>>>
>>>> Solr 7 does have a problem with sparse DocValues and many documents,
>>>> leading to excessive IO-activity, which might be what you are seeing. I
>>>> can see from an earlier post that you were using streaming expressions
>>>> for another collection: this is one of the things that are affected by
>>>> the Solr 7 DocValues issue.
>>>>
>>>> More info about DocValues and streaming:
>>>> https://issues.apache.org/jira/browse/SOLR-13013
>>>>
>>>> Fairly in-depth info on the problem with Solr 7 docValues:
>>>> https://issues.apache.org/jira/browse/LUCENE-8374
>>>>
>>>> If this is your problem, upgrading to Solr 8 and indexing the
>>>> collection from scratch should fix it.
>>>>
>>>> Alternatively, you can port the LUCENE-8374 patch from Solr 7.3 to 7.7,
>>>> or you can ensure that there are values defined for all DocValues
>>>> fields in all your documents.
>>>>
>>>>> java.net.SocketTimeoutException: Read timed out
>>>>>        at java.net.SocketInputStream.socketRead0(Native Method)
>>>> ...
>>>>> Remote error message: java.util.concurrent.TimeoutException: Idle
>>>>> timeout expired: 600000/600000 ms
>>>>
>>>> There is a default timeout of 10 minutes (distribUpdateSoTimeout?). You
>>>> should be able to change it in solr.xml:
>>>> https://lucene.apache.org/solr/guide/8_5/format-of-solr-xml.html
>>>>
>>>> BUT if an update takes > 10 minutes to be processed, it indicates that
>>>> the cluster is overloaded. Increasing the timeout is just a band-aid.
>>>>
>>>> - Toke Eskildsen, Royal Danish Library
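(If it does turn out to be the distributed-update socket timeout Toke is
guessing at, it sits in the <solrcloud> section of solr.xml. The snippet below
is only a sketch of where the knob lives; the 600000 ms value mirrors the
default seen in the error above, and, as Toke says, raising it is a band-aid
rather than a fix.)

<solr>
  <solrcloud>
    <!-- existing zkHost / hostPort settings stay as they are -->
    <int name="distribUpdateConnTimeout">60000</int>
    <!-- socket timeout for distributed updates between nodes, in ms -->
    <int name="distribUpdateSoTimeout">600000</int>
  </solrcloud>
</solr>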