Hi Erick,

Thanks for your reply!
I have one more question that I think you missed in my previous email:
*"When our core size becomes ~100 G, indexing becomes really slow. Why is
this happening? Do we need to put a limit on how large each core can grow?"*

This question is unrelated to segments. I think I missed setting the
context properly in my previous email.

We have a collection with 20 shards and a replication factor of 2. We want
to hold 500M documents in each shard. Given our average document size
(~1KB), each shard will grow to around 400G. Is this shard size feasible,
or should we split it?
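
In case we do need to split, am I right that the way to do that is the
Collections API SPLITSHARD call? A rough sketch (collection and shard names
here are placeholders for ours):

    curl "http://localhost:8983/solr/admin/collections?action=SPLITSHARD&collection=mycollection&shard=shard1"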

On Sat, Jun 6, 2020 at 10:50 PM Erick Erickson <erickerick...@gmail.com>
wrote:

> New segments are created when
> 1> the RAMBufferSizeMB is exceeded
> or
> 2> a commit happens.
>
> The maximum segment size defaults to 5G, but TieredMergePolicy can be
> configured in solrconfig.xml to have larger max sizes by setting
> maxMergedSegmentMB
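>
> For reference, both of those knobs live in the <indexConfig> section of
> solrconfig.xml. A rough sketch with illustrative values only (I believe
> the factory takes maxMergedSegmentMB as a double):
>
>     <indexConfig>
>       <ramBufferSizeMB>512</ramBufferSizeMB>
>       <mergePolicyFactory class="org.apache.solr.index.TieredMergePolicyFactory">
>         <double name="maxMergedSegmentMB">10240</double>
>       </mergePolicyFactory>
>     </indexConfig>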
>
> Depending on your indexing rate, requiring commits every 100K records may
> be too frequent; I have no idea what your indexing rate is. In general I
> prefer a time-based autocommit policy. Say, for some reason, you stop
> indexing after 50K records. They’ll never be searchable unless you have a
> time-based commit. Besides, it’s much easier to explain to users “it may
> take 60 seconds for your doc to be searchable” than “well, depending on the
> indexing rate, it may be between 10 seconds and 6 hours for your docs to be
> searchable”. Of course, if you’re indexing at a very fast rate, that may
> not matter.
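>
> A time-based policy in solrconfig.xml looks roughly like this (60 seconds
> just to match the example above; openSearcher=false keeps the hard commit
> from opening a new searcher, while the soft commit controls visibility):
>
>     <autoCommit>
>       <maxTime>60000</maxTime>
>       <openSearcher>false</openSearcher>
>     </autoCommit>
>     <autoSoftCommit>
>       <maxTime>60000</maxTime>
>     </autoSoftCommit>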
>
> There’s no such thing as “low disk read during segment merging”. If 5
> segments need to be read, they all must be read in their entirety and the
> new segment must be completely written out. At best you can try to cut down
> on the number of times segment merges happen, but from what you’re
> describing that may not be feasible.
>
> Attachments are aggressively stripped by the mail server, your graph did
> not come through.
>
> Once a segment grows to the max size (5G by default), it is not merged
> again unless and until it accumulates quite a number of deleted documents.
> So one question is whether you update existing documents frequently. Is
> that the case? If not, then the index size really shouldn’t matter and your
> problem is something else.
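>
> One way to check is the Luke request handler, which reports numDocs,
> maxDoc, and deletedDocs per core; something like:
>
>     curl "http://localhost:8983/solr/yourCore/admin/luke?numTerms=0&wt=json"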
>
> And I sincerely hope that part of your indexing does _NOT_ include
> optimize/forcemerge or expungeDeletes. Those are very expensive operations,
> and prior to Solr 7.5 would leave your index in an awkward state, see:
> https://lucidworks.com/post/segment-merging-deleted-documents-optimize-may-bad/.
> There’s a link for how this is different in Solr 7.5+ in that article.
>
> But something smells fishy about this situation. Segment merging is
> typically not very noticeable. Perhaps you just have too much data on too
> small hardware? You’ve got some evidence that segment merging is the root
> cause, but I wonder if what’s happening is you’re just swapping instead?
> Segment merging will certainly increase the I/O pressure, but by and large
> that shouldn’t really affect search speed if the OS memory space is large
> enough to hold the important portions of your index. If the OS memory
> space isn’t large enough, the additional I/O pressure from merging may be
> enough to start your system swapping, which is A Bad Thing.
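>
> An easy sanity check while indexing heavily is to watch swap activity on
> the Solr nodes, e.g. with vmstat on Linux:
>
>     vmstat 5
>
> If the si/so columns are consistently non-zero, you’re swapping.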
>
> See:
> https://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html
> for how Lucene uses MMapDirectory...
>
> Best,
> Erick
>
> > On Jun 6, 2020, at 11:29 AM, Anshuman Singh <singhanshuma...@gmail.com>
> > wrote:
> >
> > Hi Erick,
> >
> > We are looking into TLOG/PULL replicas. But I have some doubts regarding
> segments. Can you explain what causes creation of a new segment and how
> large it can grow?
> > And this is my index config:
> > maxMergeAtOnce - 20
> > segmentsPerTier - 20
> > ramBufferSizeMB - 512 MB
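> >
> > In solrconfig.xml terms, I believe that corresponds to roughly:
> >
> >     <indexConfig>
> >       <ramBufferSizeMB>512</ramBufferSizeMB>
> >       <mergePolicyFactory class="org.apache.solr.index.TieredMergePolicyFactory">
> >         <int name="maxMergeAtOnce">20</int>
> >         <int name="segmentsPerTier">20</int>
> >       </mergePolicyFactory>
> >     </indexConfig>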
> >
> > Can I configure these settings optimally for low disk read during
> segment merging? Like increasing segmentsPerTier may help but a large
> number of segments may impact search. And as per the documentation,
> ramBufferSizeMB can trigger segment merging so maybe that can be tweaked.
> >
> > One more question:
> > This graph represents indexing time with respect to core size (0-100G).
> > Commits were happening automatically every 100k records.
> >
> > [graph attachment was stripped by the mail server]
> >
> > As you can see, the density of spikes increases as the core size
> > increases. When our core size becomes ~100 G, indexing becomes really
> > slow. Why is this happening? Do we need to put a limit on how large each
> > core can grow?
> >
> >
> > On Fri, Jun 5, 2020 at 5:59 PM Erick Erickson <erickerick...@gmail.com>
> > wrote:
> > Have you considered TLOG/PULL replicas rather than NRT replicas?
> > That way, all the indexing happens on a single machine and you can
> > use shards.preference to confine searches to the PULL replicas.
> > See: https://lucene.apache.org/solr/guide/7_7/distributed-requests.html
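> >
> > For example, adding something like this to your queries (Solr 7.4 or
> > later) keeps reads on the PULL replicas:
> >
> >     /select?q=*:*&shards.preference=replica.type:PULL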
> >
> > No, you can’t really limit the number of segments. While that seems like
> > a good idea, it quickly becomes counter-productive. Say you require that
> > you have 10 segments. Say each one becomes 10G. What happens when the
> > 11th segment is created and it’s 100M? Do you rewrite one of the 10G
> > segments just to add 100M? Your problem gets worse, not better.
> >
> >
> > Best,
> > Erick
> >
> > > On Jun 5, 2020, at 1:41 AM, Anshuman Singh <singhanshuma...@gmail.com>
> > > wrote:
> > >
> > > Hi Nicolas,
> > >
> > > Commit happens automatically at 100k documents. We don't commit
> explicitly.
> > > We didn't limit the number of segments. There are 35+ segments in each
> core.
> > > But unrelated to the question, I would like to know if we can limit the
> > > number of segments in the core. I tried it in the past but the merge
> > > policies don't allow that.
> > > The TieredMergePolicy has two parameters, maxMergeAtOnce and
> > > segmentsPerTier. It seems we cannot control the total number of
> > > segments, only the segments per tier (see
> > > http://blog.mikemccandless.com/2011/02/visualizing-lucenes-segment-merges.html
> > > ).
> > >
> > >
> > > On Thu, Jun 4, 2020 at 5:48 PM Nicolas Franck <nicolas.fra...@ugent.be>
> > > wrote:
> > >
> > >> The real questions are:
> > >>
> > >> * how often do you commit (either explicitly or automatically)?
> > >> * how many segments do you allow? If you only allow 1 segment,
> > >>  then that whole segment is recreated from the old documents plus the
> > >>  updates. And yes, that requires reading the old segment.
> > >>  It is common to allow multiple segments when you update often,
> > >>  so updating does not interfere with reading the index too often.
> > >>
> > >>
> > >>> On 4 Jun 2020, at 14:08, Anshuman Singh <singhanshuma...@gmail.com>
> > >> wrote:
> > >>>
> > >>> I noticed that while indexing, when a commit happens, there is a
> > >>> high rate of disk reads by Solr. The problem is that this impacts
> > >>> search performance, because queries must load index data from disk,
> > >>> our disk read speed is not great, and the whole index is not cached
> > >>> in RAM.
> > >>>
> > >>> When no searching is performed, I noticed that the disk is usually
> > >>> read during commit operations, and sometimes at a low rate even
> > >>> without a commit. I guess this is due to segment merge operations.
> > >>> Could it be something else?
> > >>> If it is merging, can we limit disk I/O during merging?
> > >>
> > >>
> >
>
>
