Thanks a lot for your input and suggestions. I was thinking along similar 
lines: creating a second collection with the same schema (hot and cold), and 
moving documents older than a certain age, say 180 days, from the original 
(hot) collection to the new (cold) one. 
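
As a rough sketch of what I have in mind (SolrJ; the collection names, the 
"timestamp" field, and the ZooKeeper address are placeholders, and a real run 
would page through with cursorMark rather than a single query):

    import java.util.List;
    import java.util.Optional;
    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.CloudSolrClient;
    import org.apache.solr.client.solrj.response.QueryResponse;
    import org.apache.solr.common.SolrDocument;
    import org.apache.solr.common.SolrInputDocument;

    public class HotColdMove {
        public static void main(String[] args) throws Exception {
            try (CloudSolrClient client = new CloudSolrClient.Builder(
                    List.of("zk1:2181"), Optional.empty()).build()) {
                String cutoff = "timestamp:[* TO NOW-180DAYS]";

                // Read one page of old documents from the hot collection.
                SolrQuery q = new SolrQuery(cutoff);
                q.setRows(1000);
                QueryResponse rsp = client.query("docs_hot", q);

                // Re-add them to the cold collection, dropping _version_.
                for (SolrDocument doc : rsp.getResults()) {
                    SolrInputDocument in = new SolrInputDocument();
                    doc.forEach((field, value) -> {
                        if (!"_version_".equals(field)) in.addField(field, value);
                    });
                    client.add("docs_cold", in);
                }
                client.commit("docs_cold");

                // Delete from hot only after the cold copy is committed.
                client.deleteByQuery("docs_hot", cutoff);
                client.commit("docs_hot");
            }
        }
    }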

Thanks,
Madhava

Sent from my iPhone

> On 4 Jul 2020, at 14:37, Erick Erickson <erickerick...@gmail.com> wrote:
> 
> You need more shards. And, I’m pretty certain, more hardware.
> 
> You say you have 13 billion documents and 6 shards. Solr/Lucene has an 
> absolute upper limit of 2B (2^31) docs per shard. I don’t quite know how 
> you’re running at all unless that 13B is a round number. If you keep adding 
> documents, your installation will shortly, at best, stop accepting new 
> documents for indexing. At worst you’ll start seeing weird errors and 
> possibly corrupt indexes and have to re-index everything from scratch.
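> 
> (Quick arithmetic: 13 billion docs spread evenly over 6 shards is roughly 
> 2.17 billion docs per shard, which is already past that ~2.1 billion (2^31) 
> per-shard ceiling.)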
> 
> You’ve backed yourself into a pretty tight corner here. You either have to 
> re-index to a properly-sized cluster or use SPLITSHARD. The latter will 
> double the index-on-disk size (it creates two child indexes per replica and 
> keeps the old one for safety’s sake, which you have to clean up later). I 
> strongly recommend you stop ingesting more data while you do this.
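> 
> For reference, a split looks roughly like this in SolrJ (collection and 
> shard names are placeholders; repeat per shard, and once the split is 
> verified, DELETESHARD removes the inactive parent):
> 
>    // org.apache.solr.client.solrj.request.CollectionAdminRequest
>    CollectionAdminRequest.SplitShard split =
>        CollectionAdminRequest.splitShard("mycollection")
>            .setShardName("shard1");
>    split.process(client); // client is a CloudSolrClient for the cluster
> 
>    // HTTP equivalent:
>    // /admin/collections?action=SPLITSHARD&collection=mycollection&shard=shard1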
> 
> You say you have 6 VMs with 2 nodes running on each. If those VMs are 
> co-located with anything else, the physical hardware is going to be stressed. 
> VMs themselves aren’t bad, but somewhere there’s physical hardware that runs 
> them…
> 
> In fact, I urge you to stop ingesting data immediately and address this 
> issue. You have a cluster that’s mis-configured, and you must address that 
> before Bad Things Happen.
> 
> Best,
> Erick
> 
>> On Jul 4, 2020, at 5:09 AM, Mad have <madhava.a.re...@gmail.com> wrote:
>> 
>> Hi Erick,
>> 
>> There are 6 VMs in total in the Solr cluster, with 2 nodes running on each 
>> VM. The collection has 6 shards with 3 replicas each. I can see the index 
>> size is more than 220GB on each node for the collection where we are facing 
>> the performance issue.
>> 
>> The more documents we add to the collection, the slower the indexing 
>> becomes, and I have the same impression that the size of the collection is 
>> causing this issue. I would appreciate any solution you can suggest.
>> 
>> 
>> Regards,
>> Madhava 
>> Sent from my iPhone
>> 
>>> On 3 Jul 2020, at 23:30, Erick Erickson <erickerick...@gmail.com> wrote:
>>> 
>>> Oops, I transposed that. If your index is a terabyte and your RAM is 128M, 
>>> _that’s_ a red flag.
>>> 
>>>> On Jul 3, 2020, at 5:53 PM, Erick Erickson <erickerick...@gmail.com> wrote:
>>>> 
>>>> You haven’t said how many _shards_ are present. Nor how many replicas of 
>>>> the collection you’re hosting per physical machine. Nor how large the 
>>>> indexes are on disk. Those are the numbers that count. The latter is 
>>>> somewhat fuzzy, but if your aggregate index size on a machine with, say, 
>>>> 128G of memory is a terabyte, that’s a red flag.
>>>> 
>>>> Short form, though is yes. Subject to the questions above, this is what 
>>>> I’d be looking at first.
>>>> 
>>>> And, as I said, if you’ve been steadily increasing the total number of 
>>>> documents, you’ll reach a tipping point sometime.
>>>> 
>>>> Best,
>>>> Erick
>>>> 
>>>>> On Jul 3, 2020, at 5:32 PM, Mad have <madhava.a.re...@gmail.com> wrote:
>>>>> 
>>>>> Hi Erick,
>>>>> 
>>>>> The collection has almost 13 billion documents, each around 5KB in size, 
>>>>> and all of the roughly 150 fields are indexed. Do you think the number of 
>>>>> documents in the collection is causing this issue? I appreciate your 
>>>>> response.
>>>>> 
>>>>> Regards,
>>>>> Madhava 
>>>>> 
>>>>> Sent from my iPhone
>>>>> 
>>>>>> On 3 Jul 2020, at 12:42, Erick Erickson <erickerick...@gmail.com> wrote:
>>>>>> 
>>>>>> If you’re seeing low CPU utilization at the same time, you probably
>>>>>> just have too much data on too little hardware. Check your
>>>>>> swapping: how much of your I/O is just because Lucene can’t
>>>>>> hold all the parts of the index it needs in memory at once? Lucene
>>>>>> uses MMapDirectory to hold the index and you may well be
>>>>>> swapping, see:
>>>>>> 
>>>>>> https://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html
>>>>>> 
>>>>>> But my guess is that you’ve just reached a tipping point. You say:
>>>>>> 
>>>>>> "From last 2-3 weeks we have been noticing either slow indexing or 
>>>>>> timeout errors while indexing”
>>>>>> 
>>>>>> So have you been continually adding more documents to your
>>>>>> collections for more than the 2-3 weeks? If so you may have just
>>>>>> put so much data on the same boxes that you’ve gone over
>>>>>> the capacity of your hardware. As Toke says, adding physical
>>>>>> memory for the OS to use to hold relevant parts of the index may
>>>>>> alleviate the problem (again, refer to Uwe’s article for why).
>>>>>> 
>>>>>> All that said, if you’re going to keep adding documents you need to
>>>>>> seriously think about adding new machines and moving some of
>>>>>> your replicas to them.
>>>>>> 
>>>>>> Best,
>>>>>> Erick
>>>>>> 
>>>>>>> On Jul 3, 2020, at 7:14 AM, Toke Eskildsen <t...@kb.dk> wrote:
>>>>>>> 
>>>>>>>> On Thu, 2020-07-02 at 11:16 +0000, Kommu, Vinodh K. wrote:
>>>>>>>> We are performing QA performance testing on a couple of collections
>>>>>>>> which hold 2 billion and 3.5 billion docs respectively.
>>>>>>> 
>>>>>>> How many shards?
>>>>>>> 
>>>>>>>> 1.  Our performance team noticed that read operations greatly
>>>>>>>> outnumber write operations, at roughly a 100:1 ratio. Is this
>>>>>>>> expected during indexing, or are the Solr nodes doing other
>>>>>>>> operations like syncing?
>>>>>>> 
>>>>>>> Are you saying that there are 100 times more read operations when you
>>>>>>> are indexing? That does not sound too unrealistic as the disk cache
>>>>>>> might be filled with the data that the writers are flushing.
>>>>>>> 
>>>>>>> In that case, more RAM would help. Okay, more RAM nearly always helps,
>>>>>>> but such a massive difference in IO-utilization does indicate that you
>>>>>>> are starved for cache.
>>>>>>> 
>>>>>>> I noticed you have at least 18 replicas. That's a lot. Just to sanity
>>>>>>> check: How many replicas is each physical box handling? If they are
>>>>>>> sharing resources, fewer replicas would probably be better.
>>>>>>> 
>>>>>>>> 3.  Our client timeout is set to 2 minutes; can it be increased
>>>>>>>> further? Would that help or create any other problems?
>>>>>>> 
>>>>>>> It does not hurt the server to increase the client timeout as the
>>>>>>> initiated query will keep running until it is finished, independent of
>>>>>>> whether or not there is a client to receive the result.
>>>>>>> 
>>>>>>> If you want a better max time for query processing, you should look at 
>>>>>>> 
>>>>>>> https://lucene.apache.org/solr/guide/7_7/common-query-parameters.html#timeallowed-parameter
>>>>>>> but due to its inherent limitations it might not help in your
>>>>>>> situation.
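>>>>>>> 
>>>>>>> From SolrJ it is just another query parameter, e.g. (hypothetical
>>>>>>> query; results may come back partial, flagged in the response header):
>>>>>>> 
>>>>>>>  // org.apache.solr.client.solrj.SolrQuery
>>>>>>>  SolrQuery q = new SolrQuery("body:foo");
>>>>>>>  q.setTimeAllowed(60000); // give up searching after 60 seconds
>>>>>>>  QueryResponse rsp = client.query("mycollection", q);
>>>>>>>  Object partial = rsp.getResponseHeader().get("partialResults");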
>>>>>>> 
>>>>>>>> 4.  When we created an empty collection and loaded the same data
>>>>>>>> file, it loaded fine without any issues. So would having more
>>>>>>>> documents in a collection create such problems?
>>>>>>> 
>>>>>>> Solr 7 does have a problem with sparse DocValues and many documents,
>>>>>>> leading to excessive IO-activity, which might be what you are seeing. I
>>>>>>> can see from an earlier post that you were using streaming expressions
>>>>>>> for another collection: This is one of the things that are affected by
>>>>>>> the Solr 7 DocValues issue.
>>>>>>> 
>>>>>>> More info about DocValues and streaming:
>>>>>>> https://issues.apache.org/jira/browse/SOLR-13013
>>>>>>> 
>>>>>>> Fairly in-depth info on the problem with Solr 7 docValues:
>>>>>>> https://issues.apache.org/jira/browse/LUCENE-8374
>>>>>>> 
>>>>>>> If this is your problem, upgrading to Solr 8 and indexing the
>>>>>>> collection from scratch should fix it. 
>>>>>>> 
>>>>>>> Alternatively you can port the LUCENE-8374-patch from Solr 7.3 to 7.7
>>>>>>> or you can ensure that there are values defined for all DocValues-
>>>>>>> fields in all your documents.
>>>>>>> 
>>>>>>>> java.net.SocketTimeoutException: Read timed out
>>>>>>>>  at java.net.SocketInputStream.socketRead0(Native Method) 
>>>>>>> ...
>>>>>>>> Remote error message: java.util.concurrent.TimeoutException: Idle
>>>>>>>> timeout expired: 600000/600000 ms
>>>>>>> 
>>>>>>> There is a default timeout of 10 minutes (distribUpdateSoTimeout?). You
>>>>>>> should be able to change it in solr.xml.
>>>>>>> https://lucene.apache.org/solr/guide/8_5/format-of-solr-xml.html
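>>>>>>> 
>>>>>>> As a sketch, the setting goes in the <solrcloud> section of solr.xml
>>>>>>> (value in milliseconds):
>>>>>>> 
>>>>>>>  <solr>
>>>>>>>    <solrcloud>
>>>>>>>      <int name="distribUpdateSoTimeout">600000</int>
>>>>>>>    </solrcloud>
>>>>>>>  </solr>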
>>>>>>> 
>>>>>>> BUT if an update takes > 10 minutes to be processed, it indicates that
>>>>>>> the cluster is overloaded.  Increasing the timeout is just a band-aid.
>>>>>>> 
>>>>>>> - Toke Eskildsen, Royal Danish Library
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>> 
>>> 
> 
