Dandandan opened a new pull request, #21682:
URL: https://github.com/apache/datafusion/pull/21682

   ## Which issue does this PR close?
   
   Stacks on top of #21351.
   
   ## Rationale for this change
   
   PR #21351 enables dynamic work scheduling in FileStream but keeps the same 
single-outstanding-I/O-per-partition property as main. This PR implements the 
follow-on item @alamb listed:
   
   > 2. Trying to issue multiple IOs by the same partition (aka to interleave 
IO and CPU work more)
   
   It lets each partition prefetch upcoming files while the active reader 
decodes the current file, so planner I/O is no longer serialized within a 
partition.
   
   ## What changes are included in this PR?
   
   1. New `FileStreamState::Prefetch` variant and `PrefetchState` that drives 
multiple `PendingMorselPlanner` I/Os concurrently and issues planner I/O for 
upcoming files while the active reader is blocked.
   2. Prefetching is bounded at `MAX_PREFETCH_MORSELS = 20` in-flight 
morsel-producing work items (pending I/O + ready planners + ready morsels + 
active reader) to cap buffering.
   3. Prefetching is enabled by default via `FileStreamBuilder`; the legacy 
single-I/O `ScanState` path is preserved and can be selected with 
`FileStreamBuilder::with_prefetch(false)`.
   4. Two new snapshot tests:
      - `morsel_prefetch_overlaps_io_across_files` — verifies file2's planner 
I/O is issued while file1's I/O is still pending.
      - `morsel_no_prefetch_keeps_files_sequential` — verifies 
`with_prefetch(false)` preserves the legacy single-I/O behavior.
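   
The bound in item 2 can be pictured as a simple budget check. This is a minimal sketch, not the PR's code: only `MAX_PREFETCH_MORSELS = 20` and the four counted categories come from the description above, while `PrefetchBudget`, its fields, and `can_prefetch` are hypothetical names chosen for illustration.

```rust
/// Cap quoted in the PR description.
const MAX_PREFETCH_MORSELS: usize = 20;

/// Hypothetical counters for illustration; the real `PrefetchState`
/// layout is not shown in this description.
#[derive(Debug, Default)]
struct PrefetchBudget {
    pending_io: usize,     // planner I/Os still in flight
    ready_planners: usize, // planners whose I/O has completed
    ready_morsels: usize,  // morsels waiting to become the active reader
    reader_active: bool,   // whether a reader is currently decoding
}

impl PrefetchBudget {
    /// Total morsel-producing work items currently held.
    fn in_flight(&self) -> usize {
        self.pending_io
            + self.ready_planners
            + self.ready_morsels
            + usize::from(self.reader_active)
    }

    /// Whether planner I/O for another file may be started
    /// without exceeding the cap.
    fn can_prefetch(&self) -> bool {
        self.in_flight() < MAX_PREFETCH_MORSELS
    }
}
```

Counting the active reader against the same cap means the total buffered work stays bounded no matter which stage the items are in.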
   
   The reader takes priority over prefetching (step order: poll pending I/O → 
poll reader → plan → promote morsel → morselize next file), so opening new 
files does not delay output from the active reader, and all existing snapshot 
tests pass unchanged.
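   
The step order can be illustrated with a small priority table. This is a sketch under stated assumptions, not the PR's implementation: the five step names follow the order quoted above, but the `Snapshot` flags and the condition attached to each step (for example, that a morsel is only promoted when no reader is active) are guesses made for illustration.

```rust
/// The five steps, named after the order given in the description.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Step {
    PollPendingIo,
    PollReader,
    Plan,
    PromoteMorsel,
    MorselizeNextFile,
}

/// Hypothetical view of the stream state; field names are illustrative.
#[derive(Debug, Default)]
struct Snapshot {
    pending_io: bool,
    reader_active: bool,
    ready_planner: bool,
    ready_morsel: bool,
    files_remaining: bool,
}

/// Returns the applicable steps in the order they would be attempted.
/// The applicability conditions are assumptions, not taken from the PR.
fn steps_in_priority_order(s: &Snapshot) -> Vec<Step> {
    let candidates = [
        (s.pending_io, Step::PollPendingIo),
        (s.reader_active, Step::PollReader),
        (s.ready_planner, Step::Plan),
        // Assumed: a ready morsel is promoted only when the reader slot is free.
        (s.ready_morsel && !s.reader_active, Step::PromoteMorsel),
        (s.files_remaining, Step::MorselizeNextFile),
    ];
    candidates
        .iter()
        .filter(|(applies, _)| *applies)
        .map(|(_, step)| *step)
        .collect()
}
```

Because `MorselizeNextFile` sits last, an active reader is always polled before any new file is opened, which is the property the paragraph above describes.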
   
   ## Are these changes tested?
   
   Yes — 27 file_stream tests pass, including the two new prefetch-specific 
tests. Full `datafusion-datasource` and `datafusion` crate test suites pass 
locally. Clippy is clean on the affected crates.
   
   ## Are there any user-facing changes?
   
   Yes — prefetching is on by default, so multi-file scans may now have 
multiple planner I/Os in flight per partition. Users can opt out via 
`FileStreamBuilder::with_prefetch(false)`.
   
   🤖 Generated with [Claude Code](https://claude.com/claude-code)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

