Yes, I made sure the large test segment had just over 10 % deleted documents. But all expungeDeletes did was merge that segment with itself, making it just 10 % smaller. It makes sense, though. Optimizing with maxSegments is also not an option: it just merges the cheapest segments to fulfill the maxSegments requirement.
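For concreteness, the two operations under discussion can be issued against Solr's update handler roughly like this (a minimal sketch; the host, port, and core name collection1 are placeholder assumptions, not taken from the thread):

  # Commit with expungeDeletes: merges away segments whose deleted-doc
  # percentage crosses the merge policy's threshold; it does not promise
  # to remove every deleted document.
  curl 'http://localhost:8983/solr/collection1/update?commit=true&expungeDeletes=true'

  # Forced merge ("optimize") down to at most ten segments; the merge
  # policy is free to pick the cheapest merges that satisfy maxSegments.
  curl 'http://localhost:8983/solr/collection1/update?optimize=true&maxSegments=10'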
But, thinking of it, the production segment is over 75 % deleted. Using expungeDeletes on production should reduce the segment to about 5 GB, making it eligible for regular merging again, right?

Thanks,
Markus

-----Original message-----
> From: Erick Erickson <erickerick...@gmail.com>
> Sent: Wednesday 10th January 2018 22:41
> To: solr-user <solr-user@lucene.apache.org>
> Subject: Re: Very high number of deleted docs, part 2
>
> There's some background here:
> https://lucidworks.com/2017/10/13/segment-merging-deleted-documents-optimize-may-bad/
>
> The 2.5 "live" document limit is really "50% of the max segment size",
> hard-coded in TieredMergePolicy.
>
> bq: Well, maxSegments with optimize or commit with expungeDeletes did not
> do the job in testing
>
> Surprising. What actually happened? Do note that expungeDeletes does not
> promise to remove all deleted docs; it merges segments with < (some
> percentage) deleted documents.
>
> Best,
> Erick
>
> On Wed, Jan 10, 2018 at 9:45 AM, Markus Jelsma <markus.jel...@openindex.io> wrote:
>
> > Well, maxSegments with optimize or commit with expungeDeletes did not do
> > the job in testing. But tell me more about the 2.5G live documents limit,
> > no idea what it is.
> >
> > Thanks,
> > Markus
> >
> > -----Original message-----
> > > From: Erick Erickson <erickerick...@gmail.com>
> > > Sent: Friday 5th January 2018 17:56
> > > To: solr-user <solr-user@lucene.apache.org>
> > > Subject: Re: Very high number of deleted docs, part 2
> > >
> > > I'm not 100% sure that playing with maxSegments will work.
> > >
> > > What will work is to re-index everything. You can re-index into the
> > > existing collection, no need to start with a new collection. Eventually
> > > you'll replace enough docs in the over-sized segments that they'll fall
> > > under the 2.5G live documents limit and be merged away. Not elegant, but
> > > it'd work.
> > >
> > > Best,
> > > Erick
> > >
> > > On Fri, Jan 5, 2018 at 6:46 AM, Markus Jelsma <markus.jel...@openindex.io> wrote:
> > >
> > > > It could be that when this index was first reconstructed, it was
> > > > optimized to one segment before being packed and shipped.
> > > >
> > > > How about optimizing it again, with maxSegments set to ten? It should
> > > > recover, right?
> > > >
> > > > -----Original message-----
> > > > > From: Shawn Heisey <apa...@elyograg.org>
> > > > > Sent: Friday 5th January 2018 14:34
> > > > > To: solr-user@lucene.apache.org
> > > > > Subject: Re: Very high number of deleted docs, part 2
> > > > >
> > > > > On 1/5/2018 5:33 AM, Markus Jelsma wrote:
> > > > > > Another collection, now on 7.1, also shows this problem and has
> > > > > > default TMP settings. This time the size is different: each shard
> > > > > > of this collection is over 40 GB, and each shard has about 50 %
> > > > > > deleted documents. Each shard's largest segment is just under
> > > > > > 20 GB with about 75 % deleted documents. After that are a few
> > > > > > five/six GB segments with just under 50 % deleted documents.
> > > > > >
> > > > > > What do I need to change to make Lucene believe that at least that
> > > > > > twenty GB, three month old segment should be merged away? And what
> > > > > > would the predicted indexing performance penalty be?
> > > > >
> > > > > Quick answer: Erick's statements in the previous thread can be
> > > > > summarized as this: on large indexes that do a lot of deletes or
> > > > > updates, once you do an optimize, you have to continue to do
> > > > > optimizes regularly, or you're going to have this problem.
> > > > >
> > > > > TL;DR:
> > > > >
> > > > > I think Erick covered most of this (possibly all of it) in the
> > > > > previous thread.
> > > > >
> > > > > If you've got a 20GB segment and TMP's settings are default, then
> > > > > that means at some point in the past you've done an optimize. The
> > > > > default TMP settings have a maximum segment size of 5GB, so if you
> > > > > never optimize, there will never be a segment larger than 5GB, and
> > > > > the deleted document percentage would be less likely to get out of
> > > > > control. The optimize operation ignores the maximum segment size and
> > > > > reduces the index to a single large segment with zero deleted docs.
> > > > >
> > > > > TMP's behavior with really big segments is apparently completely as
> > > > > the author intended, but this specific problem wasn't ever addressed.
> > > > >
> > > > > If you do an optimize once and then don't ever do it again, any very
> > > > > large segments are going to be vulnerable to this problem, and the
> > > > > only way (currently) to fix it is to do another optimize.
> > > > >
> > > > > See this issue for a more in-depth discussion and an attempt to
> > > > > figure out how to avoid it:
> > > > >
> > > > > https://issues.apache.org/jira/browse/LUCENE-7976
> > > > >
> > > > > Thanks,
> > > > > Shawn
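One way to check whether a shard carries a segment that grew past TMP's 5 GB default during an earlier optimize is the core's segments endpoint (a sketch, again assuming a placeholder host and core name; exact response field names can vary by Solr version):

  # Lists every segment in the core, including its size and deleted-doc
  # count, which makes over-sized optimize leftovers easy to spot.
  curl 'http://localhost:8983/solr/collection1/admin/segments?wt=json'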