I'm not 100% sure that playing with maxSegments will work. What will work is to re-index everything. You can re-index into the existing collection; there's no need to start with a new collection. Eventually you'll replace enough docs in the over-sized segments that they'll fall under the 2.5 GB live-data limit (half of TMP's default 5 GB maximum segment size) and be merged away. Not elegant, but it'd work.
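To put rough numbers on that threshold, here's a back-of-the-envelope sketch of the arithmetic (an illustration only, not Lucene's actual code; the constants mirror TMP's documented defaults, and the function names are mine):

```python
# Sketch of the TieredMergePolicy eligibility arithmetic described above.
# With the default 5 GB maxMergedSegmentMB, TMP only reconsiders an
# over-sized segment for merging once its *live* (non-deleted) data
# drops under half that limit, i.e. ~2.5 GB.

DEFAULT_MAX_SEGMENT_GB = 5.0                          # maxMergedSegmentMB default (5000 MB)
MERGE_ELIGIBLE_LIVE_GB = DEFAULT_MAX_SEGMENT_GB / 2   # 2.5 GB of live data

def live_gb(segment_gb, deleted_pct):
    """Approximate live (non-deleted) data in a segment."""
    return segment_gb * (1.0 - deleted_pct)

def merge_eligible(segment_gb, deleted_pct):
    """True once the segment's live data falls under the 2.5 GB threshold."""
    return live_gb(segment_gb, deleted_pct) < MERGE_ELIGIBLE_LIVE_GB

# The 20 GB optimized segment from this thread, at 75% deleted:
# 20 * 0.25 = 5 GB live, still above 2.5 GB, so it is never merged away.
print(merge_eligible(20.0, 0.75))   # False
# It needs more than 87.5% deleted before its live data drops under 2.5 GB:
print(merge_eligible(20.0, 0.90))   # True
```

This is why the 20 GB segment can sit at 75% deleted indefinitely: half its bytes would still exceed the default maximum segment size.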
Best,
Erick

On Fri, Jan 5, 2018 at 6:46 AM, Markus Jelsma <markus.jel...@openindex.io> wrote:
> It could be that when this index was first reconstructed, it was optimized
> to one segment before being packed and shipped.
>
> How about optimizing it again, with maxSegments set to ten? It should
> recover, right?
>
> -----Original message-----
> > From: Shawn Heisey <apa...@elyograg.org>
> > Sent: Friday 5th January 2018 14:34
> > To: solr-user@lucene.apache.org
> > Subject: Re: Very high number of deleted docs, part 2
> >
> > On 1/5/2018 5:33 AM, Markus Jelsma wrote:
> > > Another collection, now on 7.1, also shows this problem and has
> > > default TMP settings. This time the size is different: each shard of
> > > this collection is over 40 GB, and each shard has about 50% deleted
> > > documents. Each shard's largest segment is just under 20 GB with about
> > > 75% deleted documents. After that are a few five/six GB segments with
> > > just under 50% deleted documents.
> > >
> > > What do I need to change to make Lucene believe that at least that
> > > twenty GB, three-month-old segment should be merged away? And what
> > > would the predicted indexing performance penalty be?
> >
> > Quick answer: Erick's statements in the previous thread can be
> > summarized as this: on large indexes that do a lot of deletes or
> > updates, once you do an optimize, you have to continue to do optimizes
> > regularly, or you're going to have this problem.
> >
> > TL;DR:
> >
> > I think Erick covered most of this (possibly all of it) in the previous
> > thread.
> >
> > If you've got a 20 GB segment and TMP's settings are default, then that
> > means at some point in the past you've done an optimize. The default
> > TMP settings have a maximum segment size of 5 GB, so if you never
> > optimize, there will never be a segment larger than 5 GB, and the
> > deleted document percentage would be less likely to get out of control.
> > The optimize operation ignores the maximum segment size and reduces the
> > index to a single large segment with zero deleted docs.
> >
> > TMP's behavior with really big segments is apparently completely as the
> > author intended, but this specific problem wasn't ever addressed.
> >
> > If you do an optimize once and then don't ever do it again, any very
> > large segments are going to be vulnerable to this problem, and the only
> > way (currently) to fix it is to do another optimize.
> >
> > See this issue for a more in-depth discussion and an attempt to figure
> > out how to avoid it:
> >
> > https://issues.apache.org/jira/browse/LUCENE-7976
> >
> > Thanks,
> > Shawn
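[For readers following along: the TMP settings discussed in this thread live in solrconfig.xml. A sketch of the relevant block, with the documented defaults shown; check the reference guide for your Solr version before relying on exact names or values:]

```xml
<!-- solrconfig.xml (Solr 7.x): TieredMergePolicy settings discussed above.
     The values shown are the defaults. Raising maxMergedSegmentMB lets
     normal merging produce larger segments, which can reduce the pressure
     to run periodic optimizes after a one-off optimize created an
     over-sized segment. -->
<indexConfig>
  <mergePolicyFactory class="org.apache.solr.index.TieredMergePolicyFactory">
    <int name="maxMergeAtOnce">10</int>
    <int name="segmentsPerTier">10</int>
    <double name="maxMergedSegmentMB">5000</double>
  </mergePolicyFactory>
</indexConfig>
```

The optimize Markus proposes can be issued against the update handler, e.g. `curl "http://localhost:8983/solr/<collection>/update?optimize=true&maxSegments=10"` (host and collection name are placeholders).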