On 1/5/2018 5:33 AM, Markus Jelsma wrote:
> Another collection, now on 7.1, also shows this problem and has default TMP
> settings. This time the size is different: each shard of this collection is over
> 40 GB, and each shard has about 50% deleted documents. Each shard's largest
> segment is just under 20 GB with about 75% deleted documents. After that are a
> few five/six GB segments with just under 50% deleted documents.
> What do I need to change to make Lucene believe that at least that twenty-GB,
> three-month-old segment should be merged away? And what would the predicted
> indexing performance penalty be?
Quick answer: Erick's statements in the previous thread boil down to this:
on large indexes that see a lot of deletes or updates, once you do an
optimize, you have to keep doing optimizes regularly, or you're going to
have this problem.
TL;DR:
I think Erick covered most of this (possibly all of it) in the previous
thread.
If you've got a 20GB segment and TMP's settings are default, then that
means at some point in the past, you've done an optimize. The default
TMP settings have a maximum segment size of 5GB, so if you never
optimize, then there will never be a segment larger than 5GB, and the
deleted document percentage would be less likely to get out of control.
The optimize operation ignores the maximum segment size and reduces the
index to a single large segment with zero deleted docs.
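If it helps to see where that cap lives in code, here's a rough Java sketch
against the Lucene API. The explicit value just restates the stock default,
and the StandardAnalyzer is only there to make the snippet compile:

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.index.TieredMergePolicy;

    public class TmpDefaults {
        public static void main(String[] args) {
            // TieredMergePolicy's stock cap: merged segments top out around 5GB.
            TieredMergePolicy tmp = new TieredMergePolicy();
            tmp.setMaxMergedSegmentMB(5 * 1024); // restates the default, for clarity

            // An optimize (forceMerge) ignores this cap and can leave one huge segment.
            IndexWriterConfig iwc = new IndexWriterConfig(new StandardAnalyzer());
            iwc.setMergePolicy(tmp);
        }
    }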
TMP's behavior with really big segments is apparently exactly what its
author intended, but this specific problem was never addressed.
If you do an optimize once and then don't ever do it again, any very
large segments are going to be vulnerable to this problem, and the only
way (currently) to fix it is to do another optimize.
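For reference, re-running the optimize from SolrJ looks roughly like this.
The base URL and collection name are placeholders for your setup; hitting
the update handler with optimize=true does the same thing:

    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;

    public class ReOptimize {
        public static void main(String[] args) throws Exception {
            try (SolrClient client =
                    new HttpSolrClient.Builder("http://localhost:8983/solr").build()) {
                // optimize == forceMerge: rewrites the index down to one segment,
                // dropping deleted docs, but recreates the oversized-segment situation.
                client.optimize("mycollection");
            }
        }
    }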
See this issue for a more in-depth discussion and an attempt to figure
out how to avoid it:
https://issues.apache.org/jira/browse/LUCENE-7976
Thanks,
Shawn