alamb commented on issue #21215:
URL: https://github.com/apache/datafusion/issues/21215#issuecomment-4199997422
>
> Does this approach seem reasonable, or did you have a different
abstraction in mind? To make this concrete, here is the architecture I'm
thinking of before I open a draft PR:
>
> pub trait SpillFile: Send + Sync {
> fn size(&self) -> u64;
> fn path(&self) -> Option<&Path> { None }//kept for the debugging and
observability part for osfiles
> fn open_write(&self) -> Result<Box<dyn std::io::Write + Send>>;
> fn open_read(&self) -> Result<Box<dyn std::io::Read + Send>>;
> }
>
> pub trait TempFileFactory: Send + Sync {
> fn create_temp_file(&self, request_description: &str) ->
Result<std::sync::Arc<dyn SpillFile>>;
> }
One thing to consider is not using `Write` or `Read` as they
1. pretty much require copies of the data
2. Are not `async` and thus couldn't be used for spilling to remote IO
I wonder if we could implement our own abstraction or something
> 2. Planned changes:
>
> * RefCountedTempFile stays as-is but implements SpillFile.
> * Add a `Custom(Arc <dyn TempFileFactory>) `variant to DiskManagerMode.
> * Change IPCStreamWriter to accept Box<dyn Write + Send> instead of File.
> * Update scattered callers currently doing File::open(spill_file.path())
to use spill_file.open_read().
This ceratinly seems better than the current state of affairs
> I also had a question regarding disk quota tracking: Currently,
`update_disk_usage()` updates DataFusion's global tracker to enforce
`max_temp_directory_size`. If we use a custom factory (like Postgres or S3),
how should we handle this?
> * Option A: Add update_disk_usage() to the SpillFile trait so custom
backends are forced into DataFusion's tracker.
> * Option B: Bypass DataFusion's tracking for Custom backends. We assume
the custom backend enforces its own limits (e.g., Postgres temp_file_limit),
and DataFusion only tracks OS-level files.
How about add the tracking into the `Arc <dyn TempFileFactory>` -- so the
default in DataFusion implementation does the global tracker but custom
implementations would have to implement their own tracking as needed
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]