Re: [I] Add a timeout for forceMergeDeletes in IndexWriter [lucene]

via GitHub Mon, 21 Apr 2025 12:24:17 -0700


mikemccand commented on issue #14431:
URL: https://github.com/apache/lucene/issues/14431#issuecomment-2819320548


   If we do add this timeout, I don't think the still-running merges kicked off 
during `forceMergeDeletes` should abort -- they should ideally run to 
completion, just in the background.  If the index is not using a 
`ConcurrentMergeScheduler` then the timeout won't do anything (maybe we would 
throw an `IllegalArgumentException` or so in that case).
   
   > E.g. if your target for deletes is less than 10% but you can live with 15% 
deletes, you could run a first forceMergeDeletes call with this 15% target and 
doWait=true and then another call with a 10% target and doWait=false?
   
   That's a neat solution!  It'd let the user roughly approximate the timeout 
if they can pick the incremental thresholds (15%, 10%) properly ...
   
   For now we (Amazon product search team) are just calling `forceMergeDeletes` 
in our own background thread, and the main thread waits on it with a timeout.  
It seems to work fine, it just adds complexity to the user-space code ... one 
could probably factor this out into a `FutureWithTimeout` that wraps any 
`Runnable` or so ...
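
   A minimal sketch of such a wrapper, using only `java.util.concurrent`. The class name `TimeoutRunner` and the method `runWithTimeout` are hypothetical (nothing like this exists in Lucene), and the daemon-thread choice is an assumption so a still-waiting task does not block JVM exit. On timeout the task is deliberately not cancelled, so it keeps running in the background:

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical helper (not part of Lucene): run a task in the background, wait up to a timeout. */
final class TimeoutRunner {
  private TimeoutRunner() {}

  /**
   * Returns true if the task finished within the timeout, false if it is still running.
   * A timed-out task is NOT cancelled; it continues on its background thread.
   * A failure inside the task surfaces as an ExecutionException.
   */
  static boolean runWithTimeout(Runnable task, long timeout, TimeUnit unit)
      throws InterruptedException, ExecutionException {
    // Assumption: a daemon thread, so a still-waiting task never blocks JVM exit.
    ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
      Thread t = new Thread(r, "timeout-runner");
      t.setDaemon(true);
      return t;
    });
    Future<?> future = executor.submit(task);
    executor.shutdown(); // accept no new tasks; the submitted one still runs
    try {
      future.get(timeout, unit);
      return true;
    } catch (TimeoutException e) {
      return false; // caller stops waiting, task keeps going
    }
  }
}
```

   With Lucene, the task would be a lambda that calls `writer.forceMergeDeletes()` and rethrows its `IOException` as an `UncheckedIOException`, since `Runnable` cannot throw checked exceptions.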
   
   > But latency is important for you, so it would likely be better to return 
several small merges that can run in parallel than one large merge that has 
less write amplification but would take longer before the first deletes get 
reclaimed. Maybe you're already doing this?
   
   This is a really important observation about merging ... say you have three 
segments with deletes.  If your merge policy (MP) returns one merge (3 -> 1) 
that's basically single threaded (except KNN merging which is now concurrent 
within a single merge, yay!).  If instead the MP returns three separate merges 
(1 -> 1, 1 -> 1, 1 -> 1), then that's using three merge threads, 3X more 
concurrency (if your CPUs are not saturated during indexing), but yes you still 
have three segments in the end (just with no/fewer deletes), so higher write 
amplification.
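
   A toy model of that trade-off, not Lucene code. It assumes each merge runs on one thread, that its cost is proportional to a made-up per-segment cost number, and that merge threads are otherwise idle. `MergeConcurrencyModel` and its methods are hypothetical names used only for illustration:

```java
import java.util.Arrays;

/** Toy model (not Lucene code): wall-clock cost of one combined merge vs. per-segment merges. */
final class MergeConcurrencyModel {
  private MergeConcurrencyModel() {}

  /** One N -> 1 merge is modeled as single threaded, so its cost is the sum of all segment costs. */
  static long combinedMergeCost(long[] segmentCosts) {
    return Arrays.stream(segmentCosts).sum();
  }

  /**
   * Separate 1 -> 1 merges spread over the given number of merge threads.
   * Each merge (largest first) goes to the least-loaded thread; the wall-clock
   * cost is the load of the busiest thread.
   */
  static long separateMergesCost(long[] segmentCosts, int threads) {
    if (threads < 1) {
      throw new IllegalArgumentException("threads must be >= 1");
    }
    long[] sorted = segmentCosts.clone();
    Arrays.sort(sorted);
    long[] load = new long[threads];
    for (int i = sorted.length - 1; i >= 0; i--) {
      int least = 0;
      for (int t = 1; t < threads; t++) {
        if (load[t] < load[least]) {
          least = t;
        }
      }
      load[least] += sorted[i];
    }
    return Arrays.stream(load).max().getAsLong();
  }
}
```

   For three equal segments of cost 10 and three free merge threads, this model gives 30 for the single 3 -> 1 merge and 10 for three 1 -> 1 merges. It ignores the extra write amplification of ending up with three segments instead of one.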
   
   I don't know if we are already doing this -- is this `TieredMergePolicy`'s 
default behavior (1 -> 1) for `forceMergeDeletes`?  I don't think so?
   
   > Alternatively, maybe it shouldn't be using forceMergeDeletes but natural 
background merges and TieredMergePolicy#setDeletesPctAllowed to reclaim deletes 
since background merges do have this latency constraint and would try to run 
the smallest merges that reclaim the most deletes first.
   
   Yeah, we actually configure natural merging to tolerate a lower deletes 
percentage once `forceMergeDeletes` finishes.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


