shbakram opened a new issue, #14655: URL: https://github.com/apache/lucene/issues/14655
### Description

The combination of fast NVMe SSDs, which offer high I/O concurrency, and io_uring is building momentum toward asynchronous request processing. PostgreSQL, for example, is the latest major system to consider io_uring support. Our in-house analysis shows that io_uring delivers better throughput on NVMe SSDs than FileChannel#read/write and mmap (FileChannel#map); others have reached the same conclusion (see, e.g., [JUring](https://www.phoronix.com/news/JUring-IO_uring-Java)). io_uring is a Linux kernel asynchronous I/O interface that lets an application queue multiple I/O operations into a ring buffer and submit them all with a single syscall, enabling better batched I/O than AIO or user-level thread-based approaches.

What does this development mean for Lucene? Talking to Mike recently, there does seem to be room for improving query throughput when the index cannot fit in available DRAM. It is well known that DRAM is a limited resource and that datasets grow faster than DRAM capacity scales. In real-time use cases especially, limited DRAM is pressured by new index updates (the memtable), the query cache, the page cache, and the Java heap (including garbage). The SSD therefore serves as an extension of memory, and optimizing SSD throughput is critical for important use cases.

Taking full advantage of NVMe and io_uring requires submitting large I/O batches. These batches can come from a single request, but that limits I/O concurrency to queries over popular terms. Alternatively, they can come from multiple requests: gather read requests for index pages from several queries, submit them as one batch, and start processing each query as its I/Os complete, in any order (setting aside corner cases and ordering constraints for now). In any case, quoting Mike: “Lucene's cross-segment concurrency ("thread per slice" within a single query) helps some with I/O concurrency, but that's still at the whim of the index geometry, which is horribly leaky abstraction e.g., an "optimized" index (single segment) loses all concurrency!”

I believe taking full advantage of asynchronous I/O requires a shift to a new postings format, so I suggest a radical change to the underlying format: give up the log-structured merge (LSM) layout! Give up segments. LSM aims for sequential I/O, which is no longer relevant for NVMe SSDs. Instead, keep one monolithic index. Each term initially gets a single page for its postings and related data; if the term is popular, it is progressively assigned chunks of multiple pages. Pages/chunks are chained via forward pointers, stored in an on-heap (or in-memory) data structure similar to skip offsets. (Embrace random SSD I/Os.) A single request can be broken down into many random pages on the SSD, and all of those pages become part of one batch for io_submit(). On the write path, similarly, gather all updates in a submit buffer and issue hundreds of random I/Os at once. A rough sketch of this layout and the batched read path follows below.

Terms are no longer sorted on storage, so range queries may suffer. Still, the on-heap FST provides a sorted view of terms; they are just not sorted on disk. On the flip side, no CPU is wasted on sorting and merging, a reasonable trade-off given how rapidly the underlying hardware technology is shifting. The saved CPU can serve more queries in real-time search (an important use case for social media platforms).
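To make the page-chaining and batching idea concrete, here is a minimal, hypothetical sketch in plain Java. Everything in it (class names, the PAGE_SIZE constant, the on-heap term-to-offsets map, the file offsets) is made up for illustration; a real implementation would live behind a codec and submit the batch through io_uring (for example via liburing or a binding such as JUring), whereas this sketch uses AsynchronousFileChannel only as a stand-in for the submission/completion queues.

```java
// Hypothetical sketch of chained-page postings plus a batched random-read path.
// AsynchronousFileChannel stands in for an io_uring submission queue.
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.AsynchronousFileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Future;

public class ChainedPagePostingsSketch {
  static final int PAGE_SIZE = 4096;

  // On-heap "skip-offset-like" structure: for each term, the file offsets of the
  // pages/chunks holding its postings, in chain order. Popular terms have more entries.
  final Map<String, long[]> termToPageOffsets = new HashMap<>();

  // Gather the page reads needed by several in-flight queries and issue them as one
  // batch of random reads; callers consume completions in whatever order they finish.
  List<Future<Integer>> submitBatch(AsynchronousFileChannel chan,
                                    List<String> termsFromPendingQueries,
                                    Map<String, List<ByteBuffer>> resultPages) {
    List<Future<Integer>> inFlight = new ArrayList<>();
    for (String term : termsFromPendingQueries) {
      long[] offsets = termToPageOffsets.get(term);
      if (offsets == null) continue; // term not indexed
      List<ByteBuffer> pages = new ArrayList<>();
      for (long offset : offsets) {
        ByteBuffer page = ByteBuffer.allocateDirect(PAGE_SIZE);
        pages.add(page);
        inFlight.add(chan.read(page, offset)); // one random read per chained page
      }
      resultPages.put(term, pages);
    }
    return inFlight; // completions may arrive in any order
  }

  public static void main(String[] args) throws IOException {
    // Hypothetical monolithic postings file; the offsets below are illustrative only.
    ChainedPagePostingsSketch sketch = new ChainedPagePostingsSketch();
    sketch.termToPageOffsets.put("lucene", new long[] {0, 8192, 40960});
    try (AsynchronousFileChannel chan =
             AsynchronousFileChannel.open(Path.of(args[0]), StandardOpenOption.READ)) {
      Map<String, List<ByteBuffer>> results = new HashMap<>();
      for (Future<Integer> f : sketch.submitBatch(chan, List.of("lucene"), results)) {
        try {
          f.get(); // a real engine would react to completions, not block in order
        } catch (Exception e) {
          throw new IOException(e);
        }
      }
      System.out.println("read " + results.get("lucene").size() + " pages for 'lucene'");
    }
  }
}
```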
There are many challenges, of course. What about (crash) consistency? Segments make it easier to handle failures somewhat gracefully. With some effort, one could maintain logs containing enough information to recover the index after a failure, and having fewer index-management operations such as merging helps offset the overhead of maintaining those logs.

What about real-time search? Currently, the near-real-time API requires substantial work to reopen the index via DirectoryReader.openIfChanged(); my analysis shows its overhead is prohibitive for reasonably sized indexes. The new postings format would require less effort to expose the most recent in-memory changes to index searchers (omitting details).

Is there a better way to utilize io_uring and fully leverage NVMe potential without radical changes to the index format? For example, has anyone observed that the new prefetch API, coupled with many searcher threads (as in the sketch below), can saturate an NVMe device? Even then, I believe it does not increase I/O concurrency on the write path. Large I/Os on the write path help, but large I/Os alone do not fully exploit the internal parallelism of modern SSDs.

I just want to start a discussion on this subject after some initial exchanges with Mike. Thoughts?
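On that last question, here is a minimal, self-contained sketch of the existing way to put many reads in flight from the search side: passing an executor to IndexSearcher so that a single query fans out across segment slices. The index path, field name, and thread count are placeholders; any intra-query prefetching happens inside the directory/codec implementations and is not shown here.

```java
// Minimal sketch: drive read concurrency with searcher threads via the
// IndexSearcher(reader, executor) constructor. Placeholder index path and field.
import java.nio.file.Path;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class ConcurrentSearchSketch {
  public static void main(String[] args) throws Exception {
    Path indexPath = Path.of(args[0]); // assumes an existing index at this path
    ExecutorService executor = Executors.newFixedThreadPool(16); // searcher threads
    try (Directory dir = FSDirectory.open(indexPath);
         DirectoryReader reader = DirectoryReader.open(dir)) {
      // The executor lets one query run its segment slices concurrently, which is
      // the main way to get multiple reads in flight with today's APIs.
      IndexSearcher searcher = new IndexSearcher(reader, executor);
      TopDocs hits = searcher.search(new TermQuery(new Term("body", "lucene")), 10);
      System.out.println("total hits: " + hits.totalHits);
    } finally {
      executor.shutdown();
    }
  }
}
```

Even with this, the achievable concurrency is bounded by the number of slices, which is exactly the index-geometry dependence Mike points out above.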