pantShrey opened a new pull request, #21882: URL: https://github.com/apache/datafusion/pull/21882
## Which issue does this PR close?

- Closes #21215

## Rationale for this change

DataFusion’s spill infrastructure is tightly coupled to OS-level files, with no extension points for alternative storage backends. `DiskManager` cannot be customized for file creation, and `IPCStreamWriter` depends on OS file paths.

This prevents integration in environments where temporary storage must be managed by the host system. For example, Postgres extensions (e.g., ParadeDB) require spill files to go through `BufFile` APIs to respect `temp_tablespaces`, enforce `temp_file_limit`, and integrate with transaction-scoped cleanup. Since `BufFile` has no OS-visible path, it cannot work with the current design.

A secondary motivation raised by @alamb is supporting object storage backends (S3, GCS) for spilling, which require async IO and cannot use `std::io::Write` or `std::io::Read`.

## What changes are included in this PR?

- Introduced `SpillFile`, `SpillWriter`, and `TempFileFactory` traits to abstract spill file handling
- Added `DiskManagerMode::Custom` to allow pluggable backends
- Updated `DiskManager` to return `Arc<dyn SpillFile>` instead of OS-bound types
- Refactored write path using `SpillWriteAdapter` to bridge sync Arrow writers with backend-agnostic writers
- Refactored read path to use async streaming (`Stream<Item = Result<Bytes>>`) instead of blocking state machines
- Updated spill-related components to operate on `Arc<dyn SpillFile>`
- Provided a sync read escape hatch (`open_sync_reader`) for operators not yet migrated to async (SortMergeJoin)

## Are these changes tested?

Yes. Existing spill tests cover the full read/write flow.

- Fixed `test_disk_usage_decreases_as_files_consumed` by correcting a pre-existing off-by-one assumption in file rotation

## Are there any user-facing changes?
Yes, this introduces API changes:

- Spill-related APIs now use `Arc<dyn SpillFile>` instead of `RefCountedTempFile`
- New public traits: `SpillFile`, `SpillWriter`, `TempFileFactory`
- Added `DiskManagerMode::Custom` for custom backends

Custom spill backends can now be implemented and plugged in via `DiskManager`.
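To make the shape of such a backend concrete, here is a minimal, std-only sketch of an in-memory spill backend. Only the names `SpillFile`, `SpillWriter`, `TempFileFactory`, `SpillWriteAdapter`, and `open_sync_reader` come from the description above. Every method signature, the `write_bytes`, `finish`, `writer`, `size`, and `create_spill_file` methods, and the `Mem*` types are assumptions made for illustration and may not match the traits in this PR. The sketch covers only the sync write adapter and the sync read escape hatch; the async `Stream<Item = Result<Bytes>>` read path and the `DiskManagerMode::Custom` wiring are not shown.

```rust
use std::io::{self, Cursor, Read, Write};
use std::sync::{Arc, Mutex};

/// Hypothetical byte sink for one spill file (signatures are assumptions).
trait SpillWriter: Send {
    fn write_bytes(&mut self, buf: &[u8]) -> io::Result<()>;
    fn finish(&mut self) -> io::Result<()>;
}

/// Hypothetical handle to a spill file that has no OS-visible path.
trait SpillFile: Send + Sync {
    fn writer(&self) -> io::Result<Box<dyn SpillWriter>>;
    fn open_sync_reader(&self) -> io::Result<Box<dyn Read + Send>>;
    fn size(&self) -> u64;
}

/// Hypothetical factory that a custom disk manager mode would call.
trait TempFileFactory: Send + Sync {
    fn create_spill_file(&self, description: &str) -> io::Result<Arc<dyn SpillFile>>;
}

/// Illustrative backend: bytes live in a shared in-memory buffer.
#[derive(Default)]
struct MemSpillFile {
    data: Arc<Mutex<Vec<u8>>>,
}

struct MemSpillWriter {
    data: Arc<Mutex<Vec<u8>>>,
}

impl SpillWriter for MemSpillWriter {
    fn write_bytes(&mut self, buf: &[u8]) -> io::Result<()> {
        self.data.lock().unwrap().extend_from_slice(buf);
        Ok(())
    }
    fn finish(&mut self) -> io::Result<()> {
        Ok(()) // nothing to flush for an in-memory buffer
    }
}

impl SpillFile for MemSpillFile {
    fn writer(&self) -> io::Result<Box<dyn SpillWriter>> {
        Ok(Box::new(MemSpillWriter { data: Arc::clone(&self.data) }))
    }
    fn open_sync_reader(&self) -> io::Result<Box<dyn Read + Send>> {
        // Snapshot the buffer so the reader does not hold the lock.
        Ok(Box::new(Cursor::new(self.data.lock().unwrap().clone())))
    }
    fn size(&self) -> u64 {
        self.data.lock().unwrap().len() as u64
    }
}

struct MemFactory;

impl TempFileFactory for MemFactory {
    fn create_spill_file(&self, _description: &str) -> io::Result<Arc<dyn SpillFile>> {
        Ok(Arc::new(MemSpillFile::default()))
    }
}

/// Bridges a sync `std::io::Write` consumer to a backend-agnostic writer.
struct SpillWriteAdapter {
    inner: Box<dyn SpillWriter>,
}

impl Write for SpillWriteAdapter {
    fn write(&mut self, buf: &[u8]) -> io::Result<usize> {
        self.inner.write_bytes(buf)?;
        Ok(buf.len())
    }
    fn flush(&mut self) -> io::Result<()> {
        Ok(())
    }
}

/// Writes `payload` through the adapter, then reads it back synchronously.
fn round_trip(factory: &dyn TempFileFactory, payload: &[u8]) -> io::Result<Vec<u8>> {
    let file = factory.create_spill_file("demo")?;
    let mut adapter = SpillWriteAdapter { inner: file.writer()? };
    adapter.write_all(payload)?;
    adapter.inner.finish()?;
    debug_assert_eq!(file.size(), payload.len() as u64);
    let mut out = Vec::new();
    file.open_sync_reader()?.read_to_end(&mut out)?;
    Ok(out)
}

fn main() -> io::Result<()> {
    let out = round_trip(&MemFactory, b"spilled batch bytes")?;
    println!("read back {} bytes", out.len());
    Ok(())
}
```

In this sketch the adapter is the only piece that knows about `std::io::Write`, so a sync writer can stay unchanged while the backend decides where the bytes go. A Postgres-hosted backend would, under the same assumptions, swap `MemSpillFile` for a type that forwards to `BufFile`.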
