mikemccand opened a new issue, #14219:
URL: https://github.com/apache/lucene/issues/14219

   ### Description
   
   At Amazon (product search) we use Lucene's [awesome near-real-time segment 
replication](https://blog.mikemccandless.com/2017/09/lucenes-near-real-time-segment-index.html)
 to efficiently distribute index changes (through S3) to searchers.
   
   This "index once search many" approach works very well for applications 
needing high QPS scale.  It enables physical isolation of indexing and 
searching (so e.g. heavy merges cannot affect searching).  It enables rapid 
failover or proactive cluster load-balancing if an indexer crashes or its node 
is too hot: just promote one of the replicas.  And it means you should 
frontload lots of indexing cost if it make searching even a bit faster.
   
   But recently we've been struggling with too-large checkpoints.  We ask the 
indexer to write a new checkpoint (an `IndexWriter` commit call) every 60 
seconds, and searchers copy down the checkpoint and light the new segments.  
During periods of heavy index updates ("update storms"), combined with our very 
aggressive `TieredMergePolicy` configuration to reclaim deletes, we see big write amplification (bytes copied to replicas vs. bytes written for newly indexed documents), sometimes sending many tens of GB of new segments in a single checkpoint.
   
   When replicas copy these large checkpoints, it can induce heavy page faults on the hot query path for in-flight queries: copying in the large checkpoint puts RAM pressure on the OS, evicting hot searching pages before any of the new segments are even lit.  (We suspect the [`MADV_RANDOM` hint for KNN files](https://github.com/apache/lucene/pull/13267) also exacerbates things for us -- it is good for cold indices, but maybe not for mostly-hot ones?)  We could maybe tune the OS to move dirty pages to disk more aggressively, or maybe try `O_DIRECT` when copying the new checkpoint files.  But even then, when we light the new segments, we'll still hit page faults.
   
   We had an a-ha moment on how to fix this, using APIs Lucene already exposes!  We just need to decouple committing from checkpointing/replicating.  Instead of committing/replicating every 60 seconds, ask Lucene to commit much more frequently (say once per second, like the OpenSearch/Elasticsearch default I think, or maybe "whenever > N GB of segments turn over", though this is harder).  Configure a time-based `IndexDeletionPolicy` so these commit points all live for a long time (say an hour).  Then, every 60 seconds (or whatever your replication interval is), replicate all new commit points (and any segment files referenced by these commit points) out to searchers.
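
   Here is a minimal sketch of what such a time-based policy could look like.  Note there is no built-in time-based `IndexDeletionPolicy` in Lucene; the class name and the `"timestamp"` commit user-data key are just illustrative assumptions:

```java
import java.io.IOException;
import java.util.List;

import org.apache.lucene.index.IndexCommit;
import org.apache.lucene.index.IndexDeletionPolicy;

// Sketch of a time-based deletion policy (not a built-in Lucene class).
// It assumes the indexer stamps each commit with a wall-clock time under
// the user-data key "timestamp" (e.g. via IndexWriter#setLiveCommitData
// just before each IndexWriter#commit).
public class TimeBasedDeletionPolicy extends IndexDeletionPolicy {

  private final long maxAgeMillis;

  public TimeBasedDeletionPolicy(long maxAgeMillis) {
    this.maxAgeMillis = maxAgeMillis;
  }

  @Override
  public void onInit(List<? extends IndexCommit> commits) throws IOException {
    onCommit(commits);
  }

  @Override
  public void onCommit(List<? extends IndexCommit> commits) throws IOException {
    long cutoff = System.currentTimeMillis() - maxAgeMillis;
    // commits are sorted oldest to newest; never delete the newest commit
    for (int i = 0; i < commits.size() - 1; i++) {
      IndexCommit commit = commits.get(i);
      String timestamp = commit.getUserData().get("timestamp");
      if (timestamp == null || Long.parseLong(timestamp) < cutoff) {
        commit.delete();
      }
    }
  }
}
```

   The indexer side would then just set this policy via `IndexWriterConfig.setIndexDeletionPolicy(...)`, stamp each commit with `IndexWriter#setLiveCommitData(...)`, and call `commit()` on its once-per-second (or size-based) schedule.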
   
   The searchers can then carefully pick and choose which commit points they want to switch to, in a bite-sized / stepping-stone manner, ensuring that each commit point they light has < N GB of turnover in its segments, meaning the OS will only ever need "hot pages plus N" GB of working RAM.  This leans nicely on [Lucene's strongly transactional APIs](https://blog.mikemccandless.com/2012/03/transactional-lucene.html), and I think it's largely sugar / utility classes in the NRT replicator that we'd need to add to demonstrate this approach, maybe.
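
   A rough sketch of how a searcher might pick its next stepping-stone commit, assuming it tracks which files its currently lit commit references; the class/method names and the simple byte-budget heuristic are made up for illustration:

```java
import java.io.IOException;
import java.util.List;
import java.util.Set;

import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexCommit;
import org.apache.lucene.store.Directory;

// Hypothetical searcher-side helper (not part of Lucene's replicator module):
// walk the replicated commit points from oldest to newest and return the
// newest one whose not-yet-"lit" files stay under the byte budget.  Lighting
// that commit, then repeating, steps the searcher forward in bounded increments.
public class SteppingStonePicker {

  public static IndexCommit pickNext(Directory dir, Set<String> litFiles, long maxNewBytes)
      throws IOException {
    // listCommits returns commit points sorted oldest to newest
    List<IndexCommit> commits = DirectoryReader.listCommits(dir);
    IndexCommit best = null;
    for (IndexCommit commit : commits) {
      long newBytes = 0;
      for (String fileName : commit.getFileNames()) {
        if (!litFiles.contains(fileName)) {
          newBytes += dir.fileLength(fileName);
        }
      }
      if (newBytes <= maxNewBytes) {
        best = commit;  // newest commit so far that fits the budget
      }
    }
    // may be null if even the oldest new commit exceeds the budget; the
    // caller could then fall back to lighting the oldest commit anyway
    return best;
  }
}
```

   In practice something like this would live alongside the NRT replicator's copy logic; the point is just that listing commit points, their files, and file sizes is all already exposed by Lucene's public APIs.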

