Oops, I transposed that. If your index is a terabyte and your RAM is 128M, 
_that’s_ a red flag.

> On Jul 3, 2020, at 5:53 PM, Erick Erickson <erickerick...@gmail.com> wrote:
> 
> You haven’t said how many _shards_ are present. Nor how many replicas of the 
> collection you’re hosting per physical machine. Nor how large the indexes are 
> on disk. Those are the numbers that count. The latter is somewhat fuzzy, but 
> if your aggregate index size on a machine with, say, 128G of memory is a 
> terabyte, that’s a red flag.
> 
> Short form, though, is yes. Subject to the questions above, this is what I’d 
> be looking at first.
> 
> And, as I said, if you’ve been steadily increasing the total number of 
> documents, you’ll reach a tipping point sometime.
> 
> Best,
> Erick
> 
>> On Jul 3, 2020, at 5:32 PM, Mad have <madhava.a.re...@gmail.com> wrote:
>> 
>> Hi Eric,
>> 
>> The collection has almost 13 billion documents, each around 5KB in size, and 
>> all of the roughly 150 columns are indexed. Do you think the number of 
>> documents in the collection is causing this issue? Appreciate your response.
>> 
>> Regards,
>> Madhava 
>> 
>> Sent from my iPhone
>> 
>>> On 3 Jul 2020, at 12:42, Erick Erickson <erickerick...@gmail.com> wrote:
>>> 
>>> If you’re seeing low CPU utilization at the same time, you probably
>>> just have too much data on too little hardware. Check your
>>> swapping: how much of your I/O is just because Lucene can’t
>>> hold all the parts of the index it needs in memory at once? Lucene
>>> uses MMapDirectory to hold the index and you may well be
>>> swapping; see:
>>> 
>>> https://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html
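>>> 
>>> As a rough sanity check (the path below is just an example, adjust it for
>>> your install), compare the total on-disk index size on a node with the
>>> memory the OS has left over for the page cache:
>>> 
>>>    du -sh /var/solr/data/*/data/index
>>>    free -g
>>> 
>>> If the summed index sizes are many times larger than the “available”
>>> memory reported by free, the OS can’t cache much of the index and you’ll
>>> see heavy read I/O.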
>>> 
>>> But my guess is that you’ve just reached a tipping point. You say:
>>> 
>>> "From last 2-3 weeks we have been noticing either slow indexing or timeout 
>>> errors while indexing”
>>> 
>>> So have you been continually adding more documents to your
>>> collections for more than those 2-3 weeks? If so, you may have just
>>> put so much data on the same boxes that you’ve gone over
>>> the capacity of your hardware. As Toke says, adding physical
>>> memory for the OS to use to hold relevant parts of the index may
>>> alleviate the problem (again, refer to Uwe’s article for why).
>>> 
>>> All that said, if you’re going to keep adding documents, you need to
>>> seriously think about adding new machines and moving some of
>>> your replicas to them.
>>> 
>>> Best,
>>> Erick
>>> 
>>>> On Jul 3, 2020, at 7:14 AM, Toke Eskildsen <t...@kb.dk> wrote:
>>>> 
>>>>> On Thu, 2020-07-02 at 11:16 +0000, Kommu, Vinodh K. wrote:
>>>>> We are performing QA performance testing on couple of collections
>>>>> which holds 2 billion and 3.5 billion docs respectively.
>>>> 
>>>> How many shards?
>>>> 
>>>>> 1.  Our performance team noticed that read operations are far more
>>>>> frequent than write operations, around a 100:1 ratio. Is this expected
>>>>> during indexing, or are the Solr nodes doing other operations like syncing?
>>>> 
>>>> Are you saying that there are 100 times more read operations when you
>>>> are indexing? That does not sound too unrealistic as the disk cache
>>>> might be filled with the data that the writers are flushing.
>>>> 
>>>> In that case, more RAM would help. Okay, more RAM nearly always helps,
>>>> but such massive difference in IO-utilization does indicate that you
>>>> are starved for cache.
>>>> 
>>>> I noticed you have at least 18 replicas. That's a lot. Just to sanity
>>>> check: How many replicas is each physical box handling? If they are
>>>> sharing resources, fewer replicas would probably be better.
>>>> 
>>>>> 3.  Our client timeout is set to 2 mins. Can it be increased further?
>>>>> Would that help or create any other problems?
>>>> 
>>>> It does not hurt the server to increase the client timeout as the
>>>> initiated query will keep running until it is finished, independent of
>>>> whether or not there is a client to receive the result.
>>>> 
>>>> If you want a better max time for query processing, you should look at 
>>>> 
>>>> https://lucene.apache.org/solr/guide/7_7/common-query-parameters.html#timeallowed-parameter
>>>> but due to its inherent limitations it might not help in your
>>>> situation.
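>>>> 
>>>> For illustration (host and collection name below are just placeholders),
>>>> the parameter is passed per request and is in milliseconds:
>>>> 
>>>>    curl 'http://localhost:8983/solr/yourcollection/select?q=*:*&timeAllowed=120000'
>>>> 
>>>> When the limit is hit you may get partial results back, which is part of
>>>> the limitations mentioned above.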
>>>> 
>>>>> 4.  When we created an empty collection and loaded the same data file,
>>>>> it loaded fine without any issues, so would having more documents in a
>>>>> collection create such problems?
>>>> 
>>>> Solr 7 does have a problem with sparse DocValues and many documents,
>>>> leading to excessive IO-activity, which might be what you are seeing. I
>>>> can see from an earlier post that you were using streaming expressions
>>>> for another collection: This is one of the things that are affected by
>>>> the Solr 7 DocValues issue.
>>>> 
>>>> More info about DocValues and streaming:
>>>> https://issues.apache.org/jira/browse/SOLR-13013
>>>> 
>>>> Fairly in-depth info on the problem with Solr 7 docValues:
>>>> https://issues.apache.org/jira/browse/LUCENE-8374
>>>> 
>>>> If this is your problem, upgrading to Solr 8 and indexing the
>>>> collection from scratch should fix it. 
>>>> 
>>>> Alternatively you can port the LUCENE-8374-patch from Solr 7.3 to 7.7
>>>> or you can ensure that there are values defined for all DocValues-
>>>> fields in all your documents.
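>>>> 
>>>> The simplest way to guarantee a value for every document is a default on
>>>> the field in the schema (field name and type here are just an example):
>>>> 
>>>>    <field name="price" type="plong" indexed="true" stored="true"
>>>>           docValues="true" default="0"/>
>>>> 
>>>> With a default in place, the DocValues structure for that field is no
>>>> longer sparse.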
>>>> 
>>>>> java.net.SocketTimeoutException: Read timed out
>>>>>     at java.net.SocketInputStream.socketRead0(Native Method) 
>>>> ...
>>>>> Remote error message: java.util.concurrent.TimeoutException: Idle
>>>>> timeout expired: 600000/600000 ms
>>>> 
>>>> There is a default timeout of 10 minutes (distribUpdateSoTimeout?). You
>>>> should be able to change it in solr.xml.
>>>> https://lucene.apache.org/solr/guide/8_5/format-of-solr-xml.html
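>>>> 
>>>> Something like this in the <solrcloud> section of solr.xml (values are in
>>>> milliseconds; shown raised to 20 minutes purely as an example):
>>>> 
>>>>    <solrcloud>
>>>>      <int name="distribUpdateConnTimeout">60000</int>
>>>>      <int name="distribUpdateSoTimeout">1200000</int>
>>>>    </solrcloud>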
>>>> 
>>>> BUT if an update takes > 10 minutes to be processed, it indicates that
>>>> the cluster is overloaded.  Increasing the timeout is just a band-aid.
>>>> 
>>>> - Toke Eskildsen, Royal Danish Library
>>>> 
>>>> 
>>> 
> 
