Dandandan opened a new pull request, #21793:
URL: https://github.com/apache/datafusion/pull/21793

   ## Which issue does this PR close?
   
   - Closes #.
   
   ## Rationale for this change
   
   Profiling DataFusion's local parquet reads under ClickBench shows that 
`object_store::LocalFileSystem::get_ranges` serializes all range reads inside a 
single `spawn_blocking` task:
   
   ```rust
   async fn get_ranges(&self, location: &Path, ranges: &[Range<u64>]) -> Result<Vec<Bytes>> {
       let path = self.path_to_filesystem(location)?;
       let ranges = ranges.to_vec();
       maybe_spawn_blocking(move || {
           // Vectored IO might be faster
           let mut file = File::open(&path).map_err(|e| map_open_error(e, &path))?;
           ranges.into_iter().map(|r| read_range(&mut file, &path, r)).collect()
       }).await
   }
   ```
   
   One blocking thread, N sequential `seek + read_exact` pairs. On NVMe devices 
with meaningful queue-depth capability, and on cold-cache reads, this leaves a 
lot of parallelism unused — the kernel block layer would happily service many 
concurrent reads if we asked it to.
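
To make the contrast concrete, here is a minimal std-only sketch (Unix only) of the two access shapes. It is not code from this PR: `read_ranges_sequential` and `read_ranges_concurrent` are hypothetical names, and scoped threads stand in for `io_uring` purely to show that positional reads do not share a file cursor and can therefore be issued independently.

```rust
use std::fs::File;
use std::io::{self, Read, Seek, SeekFrom};
use std::ops::Range;
use std::os::unix::fs::FileExt;
use std::thread;

/// Sequential shape: a single file cursor, so every range is a seek
/// followed by a read, one after the other.
fn read_ranges_sequential(file: &mut File, ranges: &[Range<u64>]) -> io::Result<Vec<Vec<u8>>> {
    ranges
        .iter()
        .map(|r| {
            let mut buf = vec![0u8; (r.end - r.start) as usize];
            file.seek(SeekFrom::Start(r.start))?;
            file.read_exact(&mut buf)?;
            Ok(buf)
        })
        .collect()
}

/// Concurrent shape: positional reads (pread) never touch the shared cursor,
/// so each range can be in flight at the same time. Scoped threads are an
/// illustrative stand-in for an io_uring submission queue.
fn read_ranges_concurrent(file: &File, ranges: &[Range<u64>]) -> io::Result<Vec<Vec<u8>>> {
    thread::scope(|s| {
        let handles: Vec<_> = ranges
            .iter()
            .map(|r| {
                s.spawn(move || {
                    let mut buf = vec![0u8; (r.end - r.start) as usize];
                    file.read_exact_at(&mut buf, r.start)?;
                    Ok::<_, io::Error>(buf)
                })
            })
            .collect();
        // Join in submission order so results line up with `ranges`.
        handles
            .into_iter()
            .map(|h| h.join().expect("reader thread panicked"))
            .collect()
    })
}
```

Both functions return the same bytes in the same order; only the number of reads the kernel can see at once differs.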
   
   This PR adds a benchmark-only alternative `ObjectStore` (no changes to 
`datafusion/` or `object_store`) that routes the range reads through an 
`io_uring` submission queue, so N preads become N concurrent kernel-side 
operations. It's intended as a tool for A/B measurement rather than a 
production-quality replacement.
   
   ## What changes are included in this PR?
   
   - `benchmarks/src/util/uring_local_fs.rs` (new, ~480 lines): a 
`UringLocalFileSystem` implementing `object_store::ObjectStore`. It owns an 
`inner: LocalFileSystem` for non-read ops and a dedicated `io-uring-driver` OS 
thread that owns the `IoUring` instance and the submission/completion loop.
   - `benchmarks/src/util/mod.rs`: registers the module under `#[cfg(target_os 
= "linux")]`.
   - `benchmarks/src/util/options.rs`: `CommonOpt::build_runtime` registers 
`UringLocalFileSystem` for `file:///` by default on Linux, with 
`DATAFUSION_IO_URING=0` as the opt-out. Layers with `--simulate-latency` as 
expected (`LatencyObjectStore` wraps the uring store).
   - `benchmarks/Cargo.toml`: `io-uring = "0.7"` added under 
`[target.'cfg(target_os = "linux")'.dependencies]`, so non-Linux targets 
don't pull it in.
   
   Driver shape:
   1. Any tokio task calls `submit_read(Arc<File>, offset, len)` — a **sync** 
fn — which sends a `Cmd::Read` over an mpsc and returns a 
`oneshot::Receiver<io::Result<Bytes>>`. This is sync on purpose: `get_ranges` 
enqueues all N ranges before awaiting any of them, so the driver sees the whole 
batch in one `try_recv` drain.
   2. The driver fills the SQ up to the number of free slots, calls 
`submit_and_wait(1)` to flush and, when work is outstanding, block for at least 
one completion, then drains the CQ and fires the oneshots. It idles in 
`blocking_recv()` when nothing is queued or in flight.
   3. Buffers (`Box<[u8]>`) and the keep-alive `Arc<File>` live in the driver's 
`in_flight` map until the corresponding CQE arrives, so the kernel never writes 
into freed memory or reads from a closed fd.
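
The three steps above can be sketched with the standard library alone (Unix only). This is an illustration of the control flow, not the module in this PR. It makes several assumptions: `Driver`, `ReadCmd`, and `run` are hypothetical names, `std::sync::mpsc` channels stand in for the tokio mpsc and oneshot, and a plain `read_exact_at` stands in for the SQ/CQ round trip.

```rust
use std::collections::VecDeque;
use std::fs::File;
use std::io;
use std::os::unix::fs::FileExt;
use std::sync::mpsc::{self, Receiver, Sender};
use std::sync::Arc;
use std::thread;

/// One queued read. `done` plays the role of the oneshot sender.
struct ReadCmd {
    file: Arc<File>, // keeps the fd open until the read has finished
    offset: u64,
    len: usize,
    done: Sender<io::Result<Vec<u8>>>,
}

/// Hypothetical handle to the dedicated driver thread.
struct Driver {
    tx: Sender<ReadCmd>,
}

impl Driver {
    fn spawn() -> Self {
        let (tx, rx) = mpsc::channel::<ReadCmd>();
        thread::Builder::new()
            .name("io-driver-sketch".into())
            .spawn(move || run(rx))
            .expect("failed to spawn driver thread");
        Driver { tx }
    }

    /// Synchronous on purpose: a caller can enqueue every range before
    /// waiting on any of the returned receivers.
    fn submit_read(&self, file: Arc<File>, offset: u64, len: usize) -> Receiver<io::Result<Vec<u8>>> {
        let (done, result) = mpsc::channel();
        // If the driver has exited, `done` is dropped here and the caller's
        // `recv()` on `result` returns an error.
        let _ = self.tx.send(ReadCmd { file, offset, len, done });
        result
    }
}

fn run(rx: Receiver<ReadCmd>) {
    let mut batch = VecDeque::new();
    // Idle: block until there is work; exit once every sender is gone.
    while let Ok(first) = rx.recv() {
        batch.push_back(first);
        // Drain everything already queued so the whole batch is visible.
        while let Ok(cmd) = rx.try_recv() {
            batch.push_back(cmd);
        }
        // Stand-in for "fill SQ, submit_and_wait, drain CQ": the reads are
        // performed inline here instead of being handed to the kernel.
        while let Some(cmd) = batch.pop_front() {
            let mut buf = vec![0u8; cmd.len];
            let res = cmd.file.read_exact_at(&mut buf, cmd.offset).map(|()| buf);
            // A dropped receiver means the caller went away; the read still
            // ran to completion and its result is discarded.
            let _ = cmd.done.send(res);
        }
    }
}
```

A caller would hold one `Driver`, call `submit_read` for every range, and only then wait on the receivers. Dropping a receiver before completion is harmless in this sketch, which mirrors the "result is discarded" behavior noted under the rough edges.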
   
   Known rough edges (documented in the module header):
   
   - No fd cache — one `open(2)` per `get_ranges` call (same as today).
   - No registered buffers / `IORING_OP_READV` — one SQE per range, heap 
allocation per op.
   - No `IORING_OP_ASYNC_CANCEL` on dropped-future cancellation; the submission 
runs to completion and its result is discarded.
   - Metrics / tracing not yet plumbed in.
   
   Not included in this PR: any change to `object_store` or `datafusion` core, 
or any production path. All non-Linux users get the stock `LocalFileSystem` via 
the existing cfg-gated code.
   
   ## Are these changes tested?
   
   - `cargo check -p datafusion-benchmarks` and `cargo clippy -p 
datafusion-benchmarks --all-targets -- -D warnings` pass on macOS (Linux module 
is cfg-d out).
   - The Linux build path has not yet been exercised on a real Linux toolchain 
in this change — please let CI / benchmark runners exercise it before merging. 
Running `./target/release-nonlto/dfbench clickbench --iterations 3 --path 
<hits_partitioned> --queries-path benchmarks/queries/clickbench/queries` with 
and without `DATAFUSION_IO_URING=0` is the expected first validation.
   
   ## Are there any user-facing changes?
   
   Only within `dfbench` on Linux:
   
   - Startup prints `Using io_uring-backed LocalFileSystem` so it's visible 
which backend is in effect.
   - `DATAFUSION_IO_URING=0` in the environment restores the stock 
`LocalFileSystem`.
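
For reference, the opt-out check could look roughly like the sketch below. The helper names are hypothetical and not taken from `options.rs`. The PR only specifies that `DATAFUSION_IO_URING=0` disables the uring store, so treating an unset variable or any other value as "enabled" is an assumption.

```rust
/// Hypothetical helper: decide the backend from the raw environment value.
/// Assumption: only an explicit "0" opts out; unset or anything else keeps
/// the default.
fn uring_enabled(raw: Option<&str>) -> bool {
    !matches!(raw.map(str::trim), Some("0"))
}

/// Hypothetical wrapper: io_uring is only ever considered on Linux.
fn uring_enabled_from_env() -> bool {
    let raw = std::env::var("DATAFUSION_IO_URING").ok();
    cfg!(target_os = "linux") && uring_enabled(raw.as_deref())
}
```

Keeping the string check separate from the environment lookup makes the decision easy to unit test without mutating process-wide state.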
   
   No API changes. No changes to any crate that downstream users depend on.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

