Dandandan opened a new pull request, #21793:
URL: https://github.com/apache/datafusion/pull/21793
## Which issue does this PR close?
- Closes #.
## Rationale for this change
Profiling DataFusion's local Parquet reads under ClickBench shows that
`object_store::LocalFileSystem::get_ranges` serializes all range reads inside a
single `spawn_blocking` task:
```rust
async fn get_ranges(&self, location: &Path, ranges: &[Range<u64>]) -> Result<Vec<Bytes>> {
    let path = self.path_to_filesystem(location)?;
    let ranges = ranges.to_vec();
    maybe_spawn_blocking(move || {
        // Vectored IO might be faster
        let mut file = File::open(&path).map_err(|e| map_open_error(e, &path))?;
        ranges.into_iter().map(|r| read_range(&mut file, &path, r)).collect()
    }).await
}
```
One blocking thread, N sequential `seek + read_exact` pairs. On NVMe devices
with meaningful queue-depth capability, and on cold-cache reads, this leaves a
lot of parallelism unused — the kernel block layer would happily service many
concurrent reads if we asked it to.
This PR adds a benchmark-only alternative `ObjectStore` (no changes to
`datafusion/` or `object_store`) that routes the range reads through an
`io_uring` submission queue, so N preads become N concurrent kernel-side
operations. It's intended as a tool for A/B measurement rather than a
production-quality replacement.
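To make the contrast with the serialized `seek + read_exact` loop concrete, here is a minimal std-only sketch of issuing all range reads at once. It is not the PR's code: it swaps in one scoped thread per range with positional reads (`read_exact_at`, i.e. `pread`) in place of `io_uring`, and `read_ranges_concurrently` is a hypothetical name. It assumes a Unix target.

```rust
use std::fs::File;
use std::io;
use std::ops::Range;
use std::os::unix::fs::FileExt; // Unix-only: pread-style positional reads
use std::thread;

/// Hypothetical helper for illustration; not part of the PR or `object_store`.
/// Reads every range with its own positional read on its own scoped thread.
fn read_ranges_concurrently(file: &File, ranges: &[Range<u64>]) -> io::Result<Vec<Vec<u8>>> {
    thread::scope(|scope| {
        // Spawn every read first so all of them are outstanding at once.
        let handles: Vec<_> = ranges
            .iter()
            .map(|r| {
                let (offset, len) = (r.start, (r.end - r.start) as usize);
                scope.spawn(move || -> io::Result<Vec<u8>> {
                    let mut buf = vec![0u8; len];
                    // read_exact_at does not touch the shared file cursor,
                    // so no seek and no `&mut File` is needed.
                    file.read_exact_at(&mut buf, offset)?;
                    Ok(buf)
                })
            })
            .collect();
        // Join in submission order so results line up with `ranges`.
        handles
            .into_iter()
            .map(|h| h.join().expect("reader thread panicked"))
            .collect()
    })
}
```

A thread per range is only a stand-in for queue depth; the point of the `io_uring` route described above is to get the same concurrency from one driver thread.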
## What changes are included in this PR?
- `benchmarks/src/util/uring_local_fs.rs` (new, ~480 lines): a
`UringLocalFileSystem` implementing `object_store::ObjectStore`. It owns an
`inner: LocalFileSystem` for non-read ops and a dedicated `io-uring-driver` OS
thread that owns the `IoUring` instance and the submission/completion loop.
- `benchmarks/src/util/mod.rs`: registers the module under `#[cfg(target_os
= "linux")]`.
- `benchmarks/src/util/options.rs`: `CommonOpt::build_runtime` registers
`UringLocalFileSystem` for `file:///` by default on Linux, with
`DATAFUSION_IO_URING=0` as the opt-out. Layers with `--simulate-latency` as
expected (`LatencyObjectStore` wraps the uring store).
- `benchmarks/Cargo.toml`: `io-uring = "0.7"` added under
`[target.'cfg(target_os = "linux")'.dependencies]`, so non-Linux targets
don't pull it in.
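The opt-out in `options.rs` could be expressed roughly as below. This is a sketch, not the PR's code: `uring_enabled` and `uring_enabled_from_env` are hypothetical names, and only "`0` opts out" comes from the description. Treating an unset variable or any other value as enabled is an assumption.

```rust
/// Hypothetical helper: should the io_uring-backed store be registered?
/// Assumption: only the value "0" (ignoring surrounding whitespace) opts out.
fn uring_enabled(env_value: Option<&str>) -> bool {
    !matches!(env_value.map(str::trim), Some("0"))
}

/// Reads the environment variable named in the PR description.
/// The `cfg!` check mirrors "by default on Linux"; elsewhere this is always false.
fn uring_enabled_from_env() -> bool {
    cfg!(target_os = "linux")
        && uring_enabled(std::env::var("DATAFUSION_IO_URING").ok().as_deref())
}
```

Keeping the decision in a pure function that takes `Option<&str>` makes it testable without mutating the process environment.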
Driver shape:
1. Any tokio task calls `submit_read(Arc<File>, offset, len)` — a **sync**
fn — which sends a `Cmd::Read` over an mpsc and returns a
`oneshot::Receiver<io::Result<Bytes>>`. This is sync on purpose: `get_ranges`
enqueues all N ranges before awaiting any of them, so the driver sees the whole
batch in one `try_recv` drain.
2. The driver fills the SQ up to the number of free slots, calls
`submit_and_wait(1)` to flush and block for at least one completion when work
is outstanding, then drains the CQ and fires the oneshots. It idles in
`blocking_recv()` when empty.
3. Buffers (`Box<[u8]>`) and the keep-alive `Arc<File>` live in the driver's
`in_flight` map until the corresponding CQE arrives, so the kernel never writes
into freed memory or a closed fd.
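The three steps above can be sketched with std-only stand-ins. This is not the PR's implementation: `std::sync::mpsc` replaces the tokio mpsc and oneshot channels, a plain `read_exact_at` replaces SQE submission and CQE reaping, `Vec<u8>` replaces `Bytes`, and `ReadCmd`, `Driver`, `driver_loop`, and `read_ranges` are hypothetical names. It assumes a Unix target.

```rust
use std::collections::HashMap;
use std::fs::File;
use std::io;
use std::os::unix::fs::FileExt;
use std::sync::mpsc::{self, Receiver, Sender};
use std::sync::Arc;
use std::thread;

/// One read request plus the channel used to hand its result back.
struct ReadCmd {
    file: Arc<File>, // keeps the fd open while the op is in flight
    offset: u64,
    len: usize,
    reply: Sender<io::Result<Vec<u8>>>,
}

/// Handle held by callers; the driver thread exits when every handle is dropped.
struct Driver {
    tx: Sender<ReadCmd>,
}

impl Driver {
    fn spawn() -> Self {
        let (tx, rx) = mpsc::channel::<ReadCmd>();
        thread::spawn(move || driver_loop(rx));
        Driver { tx }
    }

    /// Step 1: a sync enqueue that returns a receiver for the eventual result.
    fn submit_read(&self, file: Arc<File>, offset: u64, len: usize) -> Receiver<io::Result<Vec<u8>>> {
        let (reply, result) = mpsc::channel();
        // If the driver is gone, `reply` is dropped and `result.recv()` errors.
        let _ = self.tx.send(ReadCmd { file, offset, len, reply });
        result
    }
}

fn driver_loop(rx: Receiver<ReadCmd>) {
    // Step 3: ops stay in this map (with their Arc<File>) until completed.
    let mut in_flight: HashMap<u64, ReadCmd> = HashMap::new();
    let mut next_id = 0u64;
    loop {
        // Idle: block until a command arrives, or exit once all senders are gone.
        match rx.recv() {
            Ok(cmd) => { in_flight.insert(next_id, cmd); next_id += 1; }
            Err(_) => return,
        }
        // Step 2: drain everything already queued so the whole batch is visible.
        while let Ok(cmd) = rx.try_recv() {
            in_flight.insert(next_id, cmd);
            next_id += 1;
        }
        // Stand-in for submit + reap: completion order is arbitrary (map order).
        for (_, cmd) in in_flight.drain() {
            let mut buf = vec![0u8; cmd.len];
            let res = cmd.file.read_exact_at(&mut buf, cmd.offset).map(|()| buf);
            let _ = cmd.reply.send(res); // the caller may have dropped its receiver
        }
    }
}

/// Enqueue every range before waiting on any of them, as `get_ranges` is described to do.
fn read_ranges(driver: &Driver, file: &Arc<File>, ranges: &[(u64, usize)]) -> io::Result<Vec<Vec<u8>>> {
    let pending: Vec<_> = ranges
        .iter()
        .map(|&(offset, len)| driver.submit_read(Arc::clone(file), offset, len))
        .collect();
    pending
        .into_iter()
        .map(|rx| match rx.recv() {
            Ok(result) => result,
            Err(_) => Err(io::Error::new(io::ErrorKind::BrokenPipe, "driver thread exited")),
        })
        .collect()
}
```

Because each request carries its own reply channel, results reach the right caller even though the driver completes ops in arbitrary order, which is the property a user-data-keyed `in_flight` map provides in the real design.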
Known rough edges (documented in the module header):
- No fd cache — one `open(2)` per `get_ranges` call (same as today).
- No registered buffers / `IORING_OP_READV` — one SQE per range, heap
allocation per op.
- No `IORING_OP_ASYNC_CANCEL` on dropped-future cancellation; the submission
runs to completion and its result is discarded.
- Metrics / tracing not yet plumbed in.
Not included in this PR: any change to `object_store` or `datafusion` core,
or any production path. All non-Linux users get the stock `LocalFileSystem` via
the existing cfg-gated code.
## Are these changes tested?
- `cargo check -p datafusion-benchmarks` and `cargo clippy -p
datafusion-benchmarks --all-targets -- -D warnings` pass on macOS (Linux module
is cfg-d out).
- The Linux build path has not yet been exercised on a real Linux toolchain
in this change — please let CI / benchmark runners exercise it before merging.
Running `./target/release-nonlto/dfbench clickbench --iterations 3 --path
<hits_partitioned> --queries-path benchmarks/queries/clickbench/queries` with
and without `DATAFUSION_IO_URING=0` is the expected first validation.
## Are there any user-facing changes?
Only within `dfbench` on Linux:
- Startup prints `Using io_uring-backed LocalFileSystem` so it's visible
which backend is in effect.
- `DATAFUSION_IO_URING=0` in the environment restores the stock
`LocalFileSystem`.
No API changes. No changes to any crate that downstream users depend on.