shbakram opened a new issue, #14655: URL: https://github.com/apache/lucene/issues/14655
### Description

The combination of fast NVMe SSDs, which offer high I/O concurrency, and io_uring is building momentum toward asynchronous request processing. PostgreSQL, for example, is the latest major system to consider io_uring support. Our in-house analysis shows that io_uring delivers better throughput on NVMe SSDs than FileChannel#read/write and mmap (FileChannel#map); others have reached the same conclusion (see, e.g., [JUring](https://www.phoronix.com/news/JUring-IO_uring-Java)). io_uring is a Linux kernel asynchronous I/O interface that lets an application queue multiple I/O operations into a ring buffer and submit them all with a single syscall, enabling better batched I/O than AIO or user-level thread-based approaches.

What does this development mean for Lucene? Talking to Mike recently, there does seem to be room for improving query throughput when the index cannot fit in available DRAM. It is well known that DRAM is a limited resource and that datasets grow faster than DRAM capacity scales. In real-time use cases especially, limited DRAM is pressured by new index updates (the memtable), the query cache, the page cache, and the Java heap (including garbage). The SSD therefore serves as an extension of memory, and optimizing SSD throughput is critical for important use cases.

Taking full advantage of NVMe and io_uring requires submitting large I/O batches. These batches can come from a single request, but that limits I/O concurrency to queries over popular terms. Alternatively, they can come from multiple requests: gather read requests for index pages from several queries, submit them as one batch, and start processing each query as its I/Os complete, in any order (setting aside corner cases and ordering constraints for now). In any case, quoting Mike: “Lucene's cross-segment concurrency ("thread per slice" within a single query) helps some with I/O concurrency, but that's still at the whim of the index geometry, which is horribly leaky abstraction e.g., an "optimized" index (single segment) loses all concurrency!”

I believe taking full advantage of asynchronous I/O requires a shift to a new postings format, so I suggest a radical change to the underlying format: give up the log-structured merge (LSM) layout! Give up segments. LSM aims for sequential I/O, which is no longer relevant for NVMe SSDs. Instead, keep one monolithic index. Each term initially gets a single page for its postings and related data; if the term is popular, it is progressively assigned chunks of multiple pages. Pages/chunks are chained via forward pointers, stored in an on-heap (or in-memory) data structure similar to skip offsets. (Embrace random SSD I/Os.) A single request can be broken down into many random pages on the SSD, and all of those pages become part of one batch for io_submit(). On the write path, similarly, gather all updates in a submit buffer and issue hundreds of random I/Os at once. A rough sketch of this layout and the batched read path follows below.

Terms are no longer sorted on storage, so range queries may suffer. Still, the on-heap FST provides a sorted view of terms; they are just not sorted on disk. On the flip side, no CPU is wasted on sorting and merging, a reasonable trade-off given how rapidly the underlying hardware technology is shifting. The saved CPU can serve more queries in real-time search (an important use case for social media platforms).
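To make the page-chaining and batching idea concrete, here is a minimal, hypothetical sketch in plain Java. Everything in it (class names, the PAGE_SIZE constant, the on-heap term-to-offsets map, the file offsets) is made up for illustration; a real implementation would live behind a codec and submit the batch through io_uring (for example via liburing or a binding such as JUring), whereas this sketch uses AsynchronousFileChannel only as a stand-in for the submission/completion queues.

```java
// Hypothetical sketch of chained-page postings plus a batched random-read path.
// AsynchronousFileChannel stands in for an io_uring submission queue.
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.AsynchronousFileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Future;

public class ChainedPagePostingsSketch {
  static final int PAGE_SIZE = 4096;

  // On-heap "skip-offset-like" structure: for each term, the file offsets of the
  // pages/chunks holding its postings, in chain order. Popular terms have more entries.
  final Map<String, long[]> termToPageOffsets = new HashMap<>();

  // Gather the page reads needed by several in-flight queries and issue them as one
  // batch of random reads; callers consume completions in whatever order they finish.
  List<Future<Integer>> submitBatch(AsynchronousFileChannel chan,
                                    List<String> termsFromPendingQueries,
                                    Map<String, List<ByteBuffer>> resultPages) {
    List<Future<Integer>> inFlight = new ArrayList<>();
    for (String term : termsFromPendingQueries) {
      long[] offsets = termToPageOffsets.get(term);
      if (offsets == null) continue; // term not indexed
      List<ByteBuffer> pages = new ArrayList<>();
      for (long offset : offsets) {
        ByteBuffer page = ByteBuffer.allocateDirect(PAGE_SIZE);
        pages.add(page);
        inFlight.add(chan.read(page, offset)); // one random read per chained page
      }
      resultPages.put(term, pages);
    }
    return inFlight; // completions may arrive in any order
  }

  public static void main(String[] args) throws IOException {
    // Hypothetical monolithic postings file; the offsets below are illustrative only.
    ChainedPagePostingsSketch sketch = new ChainedPagePostingsSketch();
    sketch.termToPageOffsets.put("lucene", new long[] {0, 8192, 40960});
    try (AsynchronousFileChannel chan =
             AsynchronousFileChannel.open(Path.of(args[0]), StandardOpenOption.READ)) {
      Map<String, List<ByteBuffer>> results = new HashMap<>();
      for (Future<Integer> f : sketch.submitBatch(chan, List.of("lucene"), results)) {
        try {
          f.get(); // a real engine would react to completions, not block in order
        } catch (Exception e) {
          throw new IOException(e);
        }
      }
      System.out.println("read " + results.get("lucene").size() + " pages for 'lucene'");
    }
  }
}
```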
There are many challenges, of course. What about (crash) consistency? Segments make it easier to handle failures somewhat gracefully. With some effort, one could maintain logs containing enough information to recover the index after a failure, and having fewer index-management operations such as merging helps offset the overhead of maintaining those logs.

What about real-time search? Currently, the near-real-time API requires substantial work to reopen the index via DirectoryReader.openIfChanged(); my analysis shows its overhead is prohibitive for reasonably sized indexes. The new postings format would require less effort to expose the most recent in-memory changes to index searchers (omitting details).

Is there a better way to utilize io_uring and fully leverage NVMe potential without radical changes to the index format? For example, has anyone observed that the new prefetch API, coupled with many searcher threads (as in the sketch below), can saturate an NVMe device? Even then, I believe it does not increase I/O concurrency on the write path. Large I/Os on the write path help, but large I/Os alone do not fully exploit the internal parallelism of modern SSDs.

I just want to start a discussion on this subject after some initial exchanges with Mike. Thoughts?
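On that last question, here is a minimal, self-contained sketch of the existing way to put many reads in flight from the search side: passing an executor to IndexSearcher so that a single query fans out across segment slices. The index path, field name, and thread count are placeholders; any intra-query prefetching happens inside the directory/codec implementations and is not shown here.

```java
// Minimal sketch: drive read concurrency with searcher threads via the
// IndexSearcher(reader, executor) constructor. Placeholder index path and field.
import java.nio.file.Path;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class ConcurrentSearchSketch {
  public static void main(String[] args) throws Exception {
    Path indexPath = Path.of(args[0]); // assumes an existing index at this path
    ExecutorService executor = Executors.newFixedThreadPool(16); // searcher threads
    try (Directory dir = FSDirectory.open(indexPath);
         DirectoryReader reader = DirectoryReader.open(dir)) {
      // The executor lets one query run its segment slices concurrently, which is
      // the main way to get multiple reads in flight with today's APIs.
      IndexSearcher searcher = new IndexSearcher(reader, executor);
      TopDocs hits = searcher.search(new TermQuery(new Term("body", "lucene")), 10);
      System.out.println("total hits: " + hits.totalHits);
    } finally {
      executor.shutdown();
    }
  }
}
```

Even with this, the achievable concurrency is bounded by the number of slices, which is exactly the index-geometry dependence Mike points out above.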