Yes, I made sure the large test segment had just over 10% deleted documents. 
But all expungeDeletes did was merge that segment with itself, making it 
just 10% smaller. It makes sense, though. Optimizing with maxSegments is not 
an option either; it just merges the cheapest segments to fulfill the 
maxSegments requirement.

But, come to think of it, the production segment is over 75% deleted. Using 
expungeDeletes on production should reduce the segment to about 5 GB, making it 
eligible for regular merging again, right?
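
For anyone who wants to reproduce this, here is roughly what I ran from 
SolrJ (a minimal sketch; the base URL and collection name are placeholders):

    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.client.solrj.request.UpdateRequest;

    public class MergeExperiments {
      public static void main(String[] args) throws Exception {
        try (SolrClient client = new HttpSolrClient.Builder(
            "http://localhost:8983/solr").build()) {
          // What I tried in testing: optimize down to at most 10 segments.
          client.optimize("mycollection", true, true, 10);

          // What I intend for production: a commit with expungeDeletes=true,
          // which rewrites segments that exceed the deleted-docs threshold,
          // dropping their deleted documents in the process.
          UpdateRequest req = new UpdateRequest();
          req.setParam("commit", "true");
          req.setParam("expungeDeletes", "true");
          req.process(client, "mycollection");
        }
      }
    }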

Thanks,
Markus

-----Original message-----
> From:Erick Erickson <erickerick...@gmail.com>
> Sent: Wednesday 10th January 2018 22:41
> To: solr-user <solr-user@lucene.apache.org>
> Subject: Re: Very high number of deleted docs, part 2
> 
> There's some background here:
> https://lucidworks.com/2017/10/13/segment-merging-deleted-documents-optimize-may-bad/
> 
> the 2.5 "live" document limit is really "50% of the max segment size",
> hard-coded in TieredMergePolicy.
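> 
> To make the numbers concrete, here is where they come from in Lucene's
> TieredMergePolicy (a sketch against the Lucene API; 5000 MB is the
> shipped default):
> 
>     import org.apache.lucene.index.TieredMergePolicy;
> 
>     public class TmpDefaults {
>       public static void main(String[] args) {
>         TieredMergePolicy tmp = new TieredMergePolicy();
>         // Shipped default: merged segments are capped at 5000 MB (5 GB).
>         tmp.setMaxMergedSegmentMB(5000.0);
>         // Natural merging skips segments whose live (non-deleted) bytes
>         // already exceed ~50% of that cap, i.e. about 2.5 GB -- which is
>         // where the "2.5G live documents" figure comes from.
>         System.out.println(tmp.getMaxMergedSegmentMB());
>       }
>     }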
> 
> bq: Well, maxSegments with optimize or commit with expungeDeletes did not
> do the job in testing
> 
> Surprising. What actually happened? Do note that expungeDeletes does not
> promise to remove all deleted docs; it only merges segments with more
> than (some percentage) deleted documents.
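> 
> The threshold lives in TieredMergePolicy as forceMergeDeletesPctAllowed,
> which defaults to 10%. A sketch (Lucene API; the call just makes the
> default explicit):
> 
>     import org.apache.lucene.index.TieredMergePolicy;
> 
>     public class ExpungeThreshold {
>       public static void main(String[] args) {
>         TieredMergePolicy tmp = new TieredMergePolicy();
>         // expungeDeletes maps to IndexWriter.forceMergeDeletes(), which
>         // only rewrites segments whose deleted-doc percentage is above
>         // this value.
>         tmp.setForceMergeDeletesPctAllowed(10.0); // 10.0 is the default
>       }
>     }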
> 
> Best,
> Erick
> 
> On Wed, Jan 10, 2018 at 9:45 AM, Markus Jelsma <markus.jel...@openindex.io>
> wrote:
> 
> > Well, maxSegments with optimize or commit with expungeDeletes did not do
> > the job in testing. But tell me more about the 2.5G live documents limit;
> > I have no idea what it is.
> >
> > Thanks,
> > Markus
> >
> > -----Original message-----
> > > From:Erick Erickson <erickerick...@gmail.com>
> > > Sent: Friday 5th January 2018 17:56
> > > To: solr-user <solr-user@lucene.apache.org>
> > > Subject: Re: Very high number of deleted docs, part 2
> > >
> > > I'm not 100% sure that playing with maxSegments will work.
> > >
> > > What will work is to re-index everything. You can re-index into the
> > > existing collection, no need to start with a new collection. Eventually
> > > you'll replace enough docs in the over-sized segments that they'll fall
> > > under the 2.5G live documents limit and be merged away. Not elegant, but
> > > it'd work.
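> > >
> > > In SolrJ terms that loop is just re-adding every document under its
> > > existing uniqueKey (a rough sketch; the field names, document source
> > > and collection name are placeholders):
> > >
> > >     import org.apache.solr.client.solrj.SolrClient;
> > >     import org.apache.solr.client.solrj.impl.HttpSolrClient;
> > >     import org.apache.solr.common.SolrInputDocument;
> > >
> > >     public class Reindex {
> > >       public static void main(String[] args) throws Exception {
> > >         try (SolrClient client = new HttpSolrClient.Builder(
> > >             "http://localhost:8983/solr").build()) {
> > >           // Re-adding a doc with the same uniqueKey deletes the old
> > >           // copy, so live docs gradually drain out of the over-sized
> > >           // segments until they fall under the merge threshold.
> > >           SolrInputDocument doc = new SolrInputDocument();
> > >           doc.addField("id", "doc-1");          // existing uniqueKey
> > >           doc.addField("title", "re-indexed");  // plus all other fields
> > >           client.add("mycollection", doc);
> > >           client.commit("mycollection");
> > >         }
> > >       }
> > >     }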
> > >
> > > Best,
> > > Erick
> > >
> > > On Fri, Jan 5, 2018 at 6:46 AM, Markus Jelsma
> > > <markus.jel...@openindex.io> wrote:
> > >
> > > > It could be that when this index was first reconstructed, it was
> > > > optimized to one segment before being packed and shipped.
> > > >
> > > > How about optimizing it again with maxSegments set to ten? It should
> > > > recover, right?
> > > >
> > > > -----Original message-----
> > > > > From:Shawn Heisey <apa...@elyograg.org>
> > > > > Sent: Friday 5th January 2018 14:34
> > > > > To: solr-user@lucene.apache.org
> > > > > Subject: Re: Very high number of deleted docs, part 2
> > > > >
> > > > > On 1/5/2018 5:33 AM, Markus Jelsma wrote:
> > > > > > Another collection, now on 7.1, also shows this problem and has
> > > > > > default TMP settings. This time the size is different: each shard
> > > > > > of this collection is over 40 GB, and each shard has about 50%
> > > > > > deleted documents. Each shard's largest segment is just under
> > > > > > 20 GB with about 75% deleted documents. After that come a few
> > > > > > five/six GB segments with just under 50% deleted documents.
> > > > > >
> > > > > > What do I need to change to make Lucene believe that at least that
> > > > > > twenty GB, three-month-old segment should be merged away? And what
> > > > > > would the predicted indexing performance penalty be?
> > > > >
> > > > > Quick answer: Erick's statements in the previous thread can be
> > > > > summarized as this: On large indexes that do a lot of deletes or
> > > > > updates, once you do an optimize, you have to continue to do
> > > > > optimizes regularly, or you're going to have this problem.
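> > > > >
> > > > > (The optimize itself is a one-liner from SolrJ -- a sketch, with
> > > > > the base URL and collection name as placeholders:)
> > > > >
> > > > >     import org.apache.solr.client.solrj.SolrClient;
> > > > >     import org.apache.solr.client.solrj.impl.HttpSolrClient;
> > > > >
> > > > >     public class PeriodicOptimize {
> > > > >       public static void main(String[] args) throws Exception {
> > > > >         try (SolrClient client = new HttpSolrClient.Builder(
> > > > >             "http://localhost:8983/solr").build()) {
> > > > >           // Merges down to one segment and drops all deleted docs;
> > > > >           // run it on a schedule once you've started optimizing.
> > > > >           client.optimize("mycollection");
> > > > >         }
> > > > >       }
> > > > >     }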
> > > > >
> > > > > TL;DR:
> > > > >
> > > > > I think Erick covered most of this (possibly all of it) in the
> > > > > previous thread.
> > > > >
> > > > > If you've got a 20GB segment and TMP's settings are default, then
> > > > > that means at some point in the past, you've done an optimize. The
> > > > > default TMP settings have a maximum segment size of 5GB, so if you
> > > > > never optimize, then there will never be a segment larger than 5GB,
> > > > > and the deleted document percentage would be less likely to get out
> > > > > of control. The optimize operation ignores the maximum segment size
> > > > > and reduces the index to a single large segment with zero deleted
> > > > > docs.
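> > > > >
> > > > > (For reference, the cap is configurable in the indexConfig section
> > > > > of solrconfig.xml -- a sketch; 5000 is what you get even without
> > > > > declaring it:)
> > > > >
> > > > >     <mergePolicyFactory class="org.apache.solr.index.TieredMergePolicyFactory">
> > > > >       <double name="maxMergedSegmentMB">5000</double>
> > > > >     </mergePolicyFactory>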
> > > > >
> > > > > TMP's behavior with really big segments is apparently completely as
> > > > > the author intended, but this specific problem wasn't ever
> > > > > addressed.
> > > > >
> > > > > If you do an optimize once and then don't ever do it again, any
> > > > > very large segments are going to be vulnerable to this problem, and
> > > > > the only way (currently) to fix it is to do another optimize.
> > > > >
> > > > > See this issue for a more in-depth discussion and an attempt to
> > > > > figure out how to avoid it:
> > > > >
> > > > > https://issues.apache.org/jira/browse/LUCENE-7976
> > > > >
> > > > > Thanks,
> > > > > Shawn
> > > > >
> > > > >
> > > >
> > >
> >
> 
