Hi Erick,

Thanks for your reply! I have one more question, which I think you missed in my previous email:

*"When our core size reaches ~100G, indexing becomes really slow. Why is this happening? Do we need to put a limit on how large each core can grow?"*
This question is unrelated to segments. I think I missed setting the context properly in my previous email. We have a collection with 20 shards and a replication factor of 2. Basically, we want to hold 500M documents in each shard; given our average doc size (~1KB), each shard will grow up to ~400G. Is this shard size feasible, or should we split it? (If splitting is the way to go, I have put a sketch of what I think the split call would look like at the bottom of this mail.)

On Sat, Jun 6, 2020 at 10:50 PM Erick Erickson <erickerick...@gmail.com> wrote:

> New segments are created when
> 1> the RAMBufferSizeMB is exceeded
> or
> 2> a commit happens.
>
> The maximum segment size defaults to 5G, but TieredMergePolicy can be configured in solrconfig.xml to have larger max sizes by setting maxMergedSegmentMB.
>
> Depending on your indexing rate, requiring commits every 100K records may be too frequent; I have no idea what your indexing rate is. In general I prefer a time-based autocommit policy. Say, for some reason, you stop indexing after 50K records. They’ll never be searchable unless you have a time-based commit. Besides, it’s much easier to explain to users “it may take 60 seconds for your doc to be searchable” than “well, depending on the indexing rate, it may be between 10 seconds and 6 hours for your docs to be searchable”. Of course if you’re indexing at a very fast rate, that may not matter.
>
> There’s no such thing as “low disk read during segment merging”. If 5 segments need to be merged, they all must be read in their entirety and the new segment must be completely written out. At best you can try to cut down on the number of times segment merges happen, but from what you’re describing that may not be feasible.
>
> Attachments are aggressively stripped by the mail server, so your graph did not come through.
>
> Once a segment grows to the max size (5G by default), it is not merged again unless and until it accumulates quite a number of deleted documents. So one question is whether you update existing documents frequently. Is that the case? If not, then the index size really shouldn’t matter and your problem is something else.
>
> And I sincerely hope that part of your indexing does _NOT_ include optimize/forcemerge or expungeDeletes. Those are very expensive operations, and prior to Solr 7.5 they would leave your index in an awkward state, see: https://lucidworks.com/post/segment-merging-deleted-documents-optimize-may-bad/. There’s a link in that article for how this is different in Solr 7.5+.
>
> But something smells fishy about this situation. Segment merging is typically not very noticeable. Perhaps you just have too much data on too little hardware? You’ve got some evidence that segment merging is the root cause, but I wonder if what’s happening is that you’re just swapping instead. Segment merging will certainly increase the I/O pressure, but by and large that shouldn’t really affect search speed if the OS memory space is large enough to hold the important portions of your index. If it isn’t, the additional I/O pressure from merging may be enough to start your system swapping, which is A Bad Thing.
>
> See: https://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html for how Lucene uses MMapDirectory...
>
> Best,
> Erick
>
> > On Jun 6, 2020, at 11:29 AM, Anshuman Singh <singhanshuma...@gmail.com> wrote:
> >
> > Hi Erick,
> >
> > We are looking into TLOG/PULL replicas. But I have some doubts regarding segments. Can you explain what causes the creation of a new segment and how large it can grow?
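Side note, in case it helps others following the thread: below is a rough sketch of how the knobs Erick mentions would sit in solrconfig.xml. The 60-second autocommit and the larger maxMergedSegmentMB are illustrative values only, not something we have tested:

    <!-- Time-based hard commit instead of committing every 100K docs.
         openSearcher=false keeps hard commits cheap; visibility comes
         from the soft commit below. -->
    <updateHandler class="solr.DirectUpdateHandler2">
      <autoCommit>
        <maxTime>60000</maxTime>           <!-- hard commit every 60s -->
        <openSearcher>false</openSearcher>
      </autoCommit>
      <autoSoftCommit>
        <maxTime>60000</maxTime>           <!-- docs searchable within ~60s -->
      </autoSoftCommit>
    </updateHandler>

    <indexConfig>
      <ramBufferSizeMB>512</ramBufferSizeMB>
      <mergePolicyFactory class="org.apache.solr.index.TieredMergePolicyFactory">
        <int name="maxMergeAtOnce">20</int>
        <int name="segmentsPerTier">20</int>
        <!-- default cap is 5000 MB; a larger cap means fewer, bigger merges -->
        <double name="maxMergedSegmentMB">10240.0</double>
      </mergePolicyFactory>
    </indexConfig>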
> > And this is my index config:
> > maxMergeAtOnce - 20
> > segmentsPerTier - 20
> > ramBufferSizeMB - 512 MB
> >
> > Can I configure these settings optimally for low disk read during segment merging? Increasing segmentsPerTier may help, but a large number of segments may impact search. And as per the documentation, ramBufferSizeMB can trigger segment merging, so maybe that can be tweaked.
> >
> > One more question:
> > This graph represents index time wrt core size (0-100G). Commits were happening automatically every 100k records.
> >
> > [graph attachment stripped by the mail server]
> >
> > As you can see, the density of spikes increases as the core size increases. When our core size reaches ~100G, indexing becomes really slow. Why is this happening? Do we need to put a limit on how large each core can grow?
> >
> > On Fri, Jun 5, 2020 at 5:59 PM Erick Erickson <erickerick...@gmail.com> wrote:
> > Have you considered TLOG/PULL replicas rather than NRT replicas? That way, all the indexing happens on a single machine and you can use shards.preference to confine searches to the PULL replicas, see: https://lucene.apache.org/solr/guide/7_7/distributed-requests.html
> >
> > No, you can’t really limit the number of segments. While that seems like a good idea, it quickly becomes counter-productive. Say you require that you have 10 segments, and each one becomes 10G. What happens when the 11th segment is created and it’s 100M? Do you rewrite one of the 10G segments just to add 100M? Your problem gets worse, not better.
> >
> > Best,
> > Erick
> >
> > > On Jun 5, 2020, at 1:41 AM, Anshuman Singh <singhanshuma...@gmail.com> wrote:
> > >
> > > Hi Nicolas,
> > >
> > > Commit happens automatically at 100k documents. We don't commit explicitly. We didn't limit the number of segments; there are 35+ segments in each core.
> > > But unrelated to the question, I would like to know if we can limit the number of segments in a core. I tried it in the past, but the merge policies don't allow that. The TieredMergePolicy has two parameters, maxMergeAtOnce and segmentsPerTier. It seems we cannot control the total number of segments, only the segments per tier. (http://blog.mikemccandless.com/2011/02/visualizing-lucenes-segment-merges.html)
> > >
> > > On Thu, Jun 4, 2020 at 5:48 PM Nicolas Franck <nicolas.fra...@ugent.be> wrote:
> > >
> > >> The real questions are:
> > >>
> > >> * how often do you commit (either explicitly or automatically)?
> > >> * how many segments do you allow? If you only allow 1 segment, then that whole segment is recreated using the old documents and the updates. And yes, that requires reading the old segment. It is common to allow multiple segments when you update often, so updating does not interfere with reading the index too often.
> > >>
> > >>> On 4 Jun 2020, at 14:08, Anshuman Singh <singhanshuma...@gmail.com> wrote:
> > >>>
> > >>> I noticed that while indexing, when a commit happens, there is high disk read by Solr. The problem is that this impacts search performance when the index is read from disk to serve a query, as the disk read speed is not very good and the whole index is not cached in RAM.
> > >>> When no searching is performed, I noticed that the disk is usually read during commit operations, and sometimes even without a commit, at a low rate. I guess it is read due to segment merge operations. Can it be something else?
> > >>> If it is merging, can we limit disk IO during merging?
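Separately, we have started looking at the TLOG/PULL layout Erick suggested earlier in the thread. Below is a rough sketch of what we are planning to try, with a hypothetical collection name ("test") and a default local install; corrections welcome:

    # TLOG leaders do all the indexing; PULL replicas only copy finished
    # segments over the network and serve queries.
    curl "http://localhost:8983/solr/admin/collections?action=CREATE&name=test&numShards=20&tlogReplicas=1&pullReplicas=1"

    # Route searches to the PULL replicas so commit/merge I/O on the TLOG
    # leaders does not compete with query I/O:
    curl "http://localhost:8983/solr/test/select?q=*:*&shards.preference=replica.type:PULL"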
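And regarding my shard-size question at the top: if splitting turns out to be the answer, I assume the tool is the Collections API SPLITSHARD call, something like the sketch below (again with the hypothetical collection name); please correct me if that is the wrong approach:

    # Split shard1 into two sub-shards on the same node, as an async job.
    # Note: this needs enough free disk on that node to hold both sub-shards.
    curl "http://localhost:8983/solr/admin/collections?action=SPLITSHARD&collection=test&shard=shard1&async=split-shard1"

    # Check the status of the async job:
    curl "http://localhost:8983/solr/admin/collections?action=REQUESTSTATUS&requestid=split-shard1"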