Well, maxSegments with optimize, or commit with expungeDeletes, did not do the job in testing. But tell me more about the 2.5G live documents limit; I have no idea what it is.
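For reference, what was tried was roughly of this shape (a rough Python sketch against the stock /update handler, using the standard optimize/maxSegments and commit/expungeDeletes request parameters; host and collection name here are made up):

# Rough sketch of the two operations mentioned above. Host and
# collection name are hypothetical; adjust to your own setup.
import requests

UPDATE_URL = "http://localhost:8983/solr/mycollection/update"

# optimize with maxSegments: forced merge down to at most 10 segments
r = requests.get(UPDATE_URL, params={"optimize": "true", "maxSegments": "10", "wt": "json"})
r.raise_for_status()

# commit with expungeDeletes: asks the merge policy to merge away
# segments carrying deleted documents
r = requests.get(UPDATE_URL, params={"commit": "true", "expungeDeletes": "true", "wt": "json"})
r.raise_for_status()

Neither of these made a dent in the deleted-document counts on the large segments.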
Thanks,
Markus

-----Original message-----
> From: Erick Erickson <erickerick...@gmail.com>
> Sent: Friday 5th January 2018 17:56
> To: solr-user <solr-user@lucene.apache.org>
> Subject: Re: Very high number of deleted docs, part 2
>
> I'm not 100% sure that playing with maxSegments will work.
>
> What will work is to re-index everything. You can re-index into the
> existing collection, no need to start with a new collection. Eventually
> you'll replace enough docs in the over-sized segments that they'll fall
> under the 2.5G live documents limit and be merged away. Not elegant, but
> it'd work.
>
> Best,
> Erick
>
> On Fri, Jan 5, 2018 at 6:46 AM, Markus Jelsma <markus.jel...@openindex.io> wrote:
> >
> > It could be that when this index was first reconstructed, it was optimized
> > to one segment before it was packed and shipped.
> >
> > How about optimizing it again, with maxSegments set to ten? It should
> > recover, right?
> >
> > -----Original message-----
> > > From: Shawn Heisey <apa...@elyograg.org>
> > > Sent: Friday 5th January 2018 14:34
> > > To: solr-user@lucene.apache.org
> > > Subject: Re: Very high number of deleted docs, part 2
> > >
> > > On 1/5/2018 5:33 AM, Markus Jelsma wrote:
> > > > Another collection, now on 7.1, also shows this problem and has
> > > > default TMP settings. This time the size is different: each shard of
> > > > this collection is over 40 GB, and each shard has about 50 % deleted
> > > > documents. Each shard's largest segment is just under 20 GB with about
> > > > 75 % deleted documents. After that there are a few five/six GB segments
> > > > with just under 50 % deleted documents.
> > > >
> > > > What do I need to change to make Lucene believe that at least that
> > > > twenty GB and three month old segment should be merged away? And what
> > > > would the predicted indexing performance penalty be?
> > >
> > > Quick answer: Erick's statements in the previous thread can be
> > > summarized as this: on large indexes that do a lot of deletes or
> > > updates, once you do an optimize, you have to continue to do optimizes
> > > regularly, or you're going to have this problem.
> > >
> > > TL;DR:
> > >
> > > I think Erick covered most of this (possibly all of it) in the previous
> > > thread.
> > >
> > > If you've got a 20GB segment and TMP's settings are default, then that
> > > means at some point in the past you've done an optimize. The default
> > > TMP settings have a maximum segment size of 5GB, so if you never
> > > optimize, there will never be a segment larger than 5GB, and the
> > > deleted document percentage would be less likely to get out of control.
> > > The optimize operation ignores the maximum segment size and reduces the
> > > index to a single large segment with zero deleted docs.
> > >
> > > TMP's behavior with really big segments is apparently completely as the
> > > author intended, but this specific problem wasn't ever addressed.
> > >
> > > If you do an optimize once and then don't ever do it again, any very
> > > large segments are going to be vulnerable to this problem, and the only
> > > way (currently) to fix it is to do another optimize.
> > >
> > > See this issue for a more in-depth discussion and an attempt to figure
> > > out how to avoid it:
> > >
> > > https://issues.apache.org/jira/browse/LUCENE-7976
> > >
> > > Thanks,
> > > Shawn
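P.S. For anyone watching per-segment sizes and deleted-doc counts while experimenting with this, the per-core /admin/segments handler (the same data the Admin UI Segments Info screen shows) can be polled. A rough Python sketch; the core URL is hypothetical and the JSON field names (size, delCount, sizeInBytes) are assumptions taken from that screen, so verify them against your Solr version:

# Rough sketch: print size and deleted-doc count per segment so the
# over-sized, delete-heavy segments are easy to spot.
import requests

CORE_URL = "http://localhost:8983/solr/mycollection_shard1_replica_n1"  # hypothetical core

resp = requests.get(CORE_URL + "/admin/segments", params={"wt": "json"})
resp.raise_for_status()

for name, seg in resp.json()["segments"].items():
    print(name,
          f"{seg['sizeInBytes'] / 2**30:.1f} GiB",
          f"docs={seg['size']}",
          f"deleted={seg.get('delCount', 0)}")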