mikemccand opened a new issue, #14219:
URL: https://github.com/apache/lucene/issues/14219

   ### Description
   
   At Amazon (product search) we use Lucene's [awesome near-real-time segment 
replication](https://blog.mikemccandless.com/2017/09/lucenes-near-real-time-segment-index.html)
 to efficiently distribute index changes (through S3) to searchers.
   
   This "index once search many" approach works very well for applications 
needing high QPS scale.  It enables physical isolation of indexing and 
searching (so e.g. heavy merges cannot affect searching).  It enables rapid 
failover or proactive cluster load-balancing if an indexer crashes or its node 
is too hot: just promote one of the replicas.  And it means you should 
frontload lots of indexing cost if it make searching even a bit faster.
   
   But recently we've been struggling with too-large checkpoints.  We ask the 
indexer to write a new checkpoint (an `IndexWriter` commit call) every 60 
seconds, and searchers copy down the checkpoint and light the new segments.  
During periods of heavy index updates ("update storms"), combined with our very 
aggressive `TieredMergePolicy` configuration to reclaim deletes, we see big write amplification (bytes copied to replicas vs. bytes written for newly indexed documents), sometimes sending many tens of GB of new segments in a single checkpoint.
   
   When replicas copy these large checkpoints, it can induce heavy page faults on the hot query path for in-flight queries: copying in the large checkpoint puts RAM pressure on the OS, evicting hot searching pages before any of the new segments are even lit.  (We suspect the [`MADV_RANDOM` hint for KNN files](https://github.com/apache/lucene/pull/13267) also exacerbates things for us -- it is good for cold indices, but maybe not for mostly-hot ones?)  We could maybe tune the OS to move dirty pages to disk more aggressively, or maybe try `O_DIRECT` when copying the new checkpoint files.  But even then, when we light the new segments, we'll still hit page faults.
   
   We had an a-ha moment on how to fix this, using APIs Lucene already exposes!  We just need to decouple committing from checkpointing/replicating.  Instead of committing/replicating every 60 seconds, ask Lucene to commit much more frequently (say once per second, like the OpenSearch/Elasticsearch default I think, or maybe "whenever > N GB of segments turn over", though this is harder).  Configure a time-based `IndexDeletionPolicy` so these commit points all live for a long time (say an hour).  Then, every 60 seconds (or whatever your replication interval is), replicate all new commit points (and any segment files referenced by these commit points) out to searchers.
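
   Here is a minimal sketch of what such a time-based policy could look like.  Note there is no built-in time-based `IndexDeletionPolicy` in Lucene; the class name and the `"timestamp"` commit user-data key are just illustrative assumptions:

```java
import java.io.IOException;
import java.util.List;

import org.apache.lucene.index.IndexCommit;
import org.apache.lucene.index.IndexDeletionPolicy;

// Sketch of a time-based deletion policy (not a built-in Lucene class).
// It assumes the indexer stamps each commit with a wall-clock time under
// the user-data key "timestamp" (e.g. via IndexWriter#setLiveCommitData
// just before each IndexWriter#commit).
public class TimeBasedDeletionPolicy extends IndexDeletionPolicy {

  private final long maxAgeMillis;

  public TimeBasedDeletionPolicy(long maxAgeMillis) {
    this.maxAgeMillis = maxAgeMillis;
  }

  @Override
  public void onInit(List<? extends IndexCommit> commits) throws IOException {
    onCommit(commits);
  }

  @Override
  public void onCommit(List<? extends IndexCommit> commits) throws IOException {
    long cutoff = System.currentTimeMillis() - maxAgeMillis;
    // commits are sorted oldest to newest; never delete the newest commit
    for (int i = 0; i < commits.size() - 1; i++) {
      IndexCommit commit = commits.get(i);
      String timestamp = commit.getUserData().get("timestamp");
      if (timestamp == null || Long.parseLong(timestamp) < cutoff) {
        commit.delete();
      }
    }
  }
}
```

   The indexer side would then just set this policy via `IndexWriterConfig.setIndexDeletionPolicy(...)`, stamp each commit with `IndexWriter#setLiveCommitData(...)`, and call `commit()` on its once-per-second (or size-based) schedule.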
   
   The searchers can then carefully pick and choose which commit points they want to switch to, in a bite-sized / stepping-stone manner, ensuring that each commit point they light has < N GB of turnover in its segments, meaning the OS will only ever need "hot pages plus N" GB of working RAM.  This leans nicely on [Lucene's strongly transactional APIs](https://blog.mikemccandless.com/2012/03/transactional-lucene.html), and I think it's largely sugar / utility classes in the NRT replicator that we'd need to add to demonstrate this approach, maybe.
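
   A rough sketch of how a searcher might pick its next stepping-stone commit, assuming it tracks which files its currently lit commit references; the class/method names and the simple byte-budget heuristic are made up for illustration:

```java
import java.io.IOException;
import java.util.List;
import java.util.Set;

import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexCommit;
import org.apache.lucene.store.Directory;

// Hypothetical searcher-side helper (not part of Lucene's replicator module):
// walk the replicated commit points from oldest to newest and return the
// newest one whose not-yet-"lit" files stay under the byte budget.  Lighting
// that commit, then repeating, steps the searcher forward in bounded increments.
public class SteppingStonePicker {

  public static IndexCommit pickNext(Directory dir, Set<String> litFiles, long maxNewBytes)
      throws IOException {
    // listCommits returns commit points sorted oldest to newest
    List<IndexCommit> commits = DirectoryReader.listCommits(dir);
    IndexCommit best = null;
    for (IndexCommit commit : commits) {
      long newBytes = 0;
      for (String fileName : commit.getFileNames()) {
        if (!litFiles.contains(fileName)) {
          newBytes += dir.fileLength(fileName);
        }
      }
      if (newBytes <= maxNewBytes) {
        best = commit;  // newest commit so far that fits the budget
      }
    }
    // may be null if even the oldest new commit exceeds the budget; the
    // caller could then fall back to lighting the oldest commit anyway
    return best;
  }
}
```

   In practice something like this would live alongside the NRT replicator's copy logic; the point is just that listing commit points, their files, and file sizes is all already exposed by Lucene's public APIs.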

