dojiong opened a new pull request, #1941:
URL: https://github.com/apache/iceberg-rust/pull/1941
## Which issue does this PR close?
- Closes #.
## What changes are included in this PR?
Currently, ArrowReader instantiates a new CachingDeleteFileLoader (and
consequently a new DeleteFilter) for each FileScanTask when calling
load_deletes. This results in the DeleteFilter state being isolated per task.
If multiple tasks reference the same delete file (common with positional
deletes), that delete file is re-read and re-parsed for every task, leading to
significant performance overhead and redundant I/O.
Changes
* Shared State: Moved the DeleteFilter instance into the
  CachingDeleteFileLoader struct. Since ArrowReader holds a single
  CachingDeleteFileLoader instance across its lifetime, the DeleteFilter
  state is now effectively shared across all file scan tasks processed by
  that reader.
* Positional Delete Caching: Implemented a state machine for loading
  positional delete files (PosDelState) in DeleteFilter.
  * Added try_start_pos_del_load: Coordinates concurrent access to the
    same positional delete file.
  * Added finish_pos_del_load: Signals completion of loading.
* Synchronization: Introduced a WaitFor state. Unlike equality deletes
  (which are accessed asynchronously), positional deletes are accessed
  synchronously by ArrowReader. Therefore, if a task encounters a file that
  is currently being loaded by another task, it must asynchronously wait
  (notify.notified().await) during the loading phase to ensure the data is
  fully populated before ArrowReader proceeds.
* Refactoring: Updated load_file_for_task and related types in
  CachingDeleteFileLoader to support the new caching logic and carry file
  paths through the loading context.
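
The coordination described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: it swaps the async notify.notified().await for a blocking std::sync::Condvar so that it needs only the standard library, and the PosDelLoadAction enum, the wait_for_pos_del helper, the DeleteVector alias, and all signatures are assumptions made for the example.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Condvar, Mutex};

/// Hypothetical stand-in for a parsed positional delete file (deleted row positions).
pub type DeleteVector = Vec<u64>;

/// Per-file loading state; the name follows the PR text, the variants are assumed.
enum PosDelState {
    Loading,
    Loaded(Arc<DeleteVector>),
}

/// Hypothetical answer returned to a task that wants a positional delete file.
pub enum PosDelLoadAction {
    /// The caller is first: it loads the file, then calls `finish_pos_del_load`.
    Load,
    /// Another task is already loading this file: the caller should wait.
    WaitFor,
    /// The file was loaded earlier: reuse the shared result.
    AlreadyLoaded(Arc<DeleteVector>),
}

/// Simplified stand-in for the shared DeleteFilter state.
#[derive(Default)]
pub struct DeleteFilter {
    states: Mutex<HashMap<String, PosDelState>>,
    loaded: Condvar,
}

impl DeleteFilter {
    /// Decides, under a single lock, whether the caller loads, waits, or reuses.
    pub fn try_start_pos_del_load(&self, path: &str) -> PosDelLoadAction {
        let mut states = self.states.lock().unwrap();
        match states.get(path) {
            Some(PosDelState::Loaded(v)) => PosDelLoadAction::AlreadyLoaded(Arc::clone(v)),
            Some(PosDelState::Loading) => PosDelLoadAction::WaitFor,
            None => {
                states.insert(path.to_string(), PosDelState::Loading);
                PosDelLoadAction::Load
            }
        }
    }

    /// Publishes the parsed deletes and wakes every task waiting on any file.
    pub fn finish_pos_del_load(&self, path: &str, deletes: DeleteVector) -> Arc<DeleteVector> {
        let shared = Arc::new(deletes);
        let mut states = self.states.lock().unwrap();
        states.insert(path.to_string(), PosDelState::Loaded(Arc::clone(&shared)));
        self.loaded.notify_all();
        shared
    }

    /// Blocks until `path` is loaded. Only call this after receiving `WaitFor`;
    /// otherwise nobody is loading the file and the call never returns.
    pub fn wait_for_pos_del(&self, path: &str) -> Arc<DeleteVector> {
        let mut states = self.states.lock().unwrap();
        loop {
            if let Some(PosDelState::Loaded(v)) = states.get(path) {
                return Arc::clone(v);
            }
            // Re-check the state after every wakeup to tolerate spurious wakeups.
            states = self.loaded.wait(states).unwrap();
        }
    }
}
```

In this sketch the first task to ask for a path receives Load, later tasks receive WaitFor until finish_pos_del_load runs, and every task ends up holding a clone of the same Arc. An async version would replace the Condvar with a notification primitive that can be awaited, as the PR describes.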
## Are these changes tested?
Added test_caching_delete_file_loader_caches_results to verify that repeated
loads of the same delete file return the same shared in-memory objects.
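
The property this test checks can be illustrated with a small standalone sketch. SketchLoader, its load and parses methods, and the placeholder positions are hypothetical and exist only for this example; the actual test exercises CachingDeleteFileLoader, whose API is not reproduced here.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

/// Hypothetical caching loader, used only to illustrate the property under test.
#[derive(Default)]
pub struct SketchLoader {
    cache: Mutex<HashMap<String, Arc<Vec<u64>>>>,
    parse_count: Mutex<usize>,
}

impl SketchLoader {
    /// Returns the cached positions for `path`, "parsing" only on the first call.
    pub fn load(&self, path: &str) -> Arc<Vec<u64>> {
        let mut cache = self.cache.lock().unwrap();
        if let Some(hit) = cache.get(path) {
            return Arc::clone(hit);
        }
        *self.parse_count.lock().unwrap() += 1;
        // Placeholder for reading and parsing the delete file.
        let parsed = Arc::new(vec![0, 3, 7]);
        cache.insert(path.to_string(), Arc::clone(&parsed));
        parsed
    }

    /// Number of times a file was actually parsed.
    pub fn parses(&self) -> usize {
        *self.parse_count.lock().unwrap()
    }
}
```

Loading the same path twice and comparing the results with Arc::ptr_eq confirms that both calls return the same allocation, and the parse counter confirms that the file was processed only once.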
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]