sdd opened a new issue, #630:
URL: https://github.com/apache/iceberg-rust/issues/630

   I'm looking to start work on proper handling of delete files in table scans 
and so I'd like to open an issue to discuss some of the design decisions.
   
   A core tenet of our approach so far has been to ensure that the tasks 
produced by the file plan are small, independent and self-contained, so that 
they can be easily distributed in architectures where the service that 
generates the file plan could be on a different machine to the service(s) that 
perform the file reads.
   
   The `FileScanTask` struct represents these individual units of work at 
present. Currently though, its shape is focussed on Data files and it does not 
cater for including information on Delete files that are produced by the scan. 
Here's how it looks now, for reference:
   
   
https://github.com/apache/iceberg-rust/blob/cde35ab0eefffae88c521d4e897ba86ee754861c/crates/iceberg/src/scan.rs#L859-L886
   
   In order to properly process delete files as part of executing a scan task, 
executors will now need to load in any applicable delete files along with the 
data file that they are processing. I'll outline what happens now, and then 
follow that with my proposed approach.
   
   
   ## Current TableScan Synopsis
   
   The current structure pushes all manifest file entries from the manifest 
list into a stream, which we then process concurrently in order to retrieve 
their associated manifests. Once a manifest is retrieved, each of its manifest 
entries is extracted and pushed onto a channel so that the entries can be 
processed in parallel. Each entry is embedded inside a context object that 
contains the information needed to process it. Tokio tasks listening to the 
channel then execute `TableScan::process_manifest_entry` on these objects, 
where we filter out any entries that do not match the scan filter predicate.
   
   At this point, a `FileScanTask` is created for each of those entries that 
match the scan predicate. The `FileScanTask`s are then pushed into a channel 
that produces the stream of `FileScanTask`s that is returned to the original 
caller of `plan_files`.
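
The flow described above can be sketched roughly as follows. This is only an illustration of the channel-based shape, not the real implementation: it uses `std::thread` and `std::sync::mpsc` in place of Tokio tasks and streams, returns a `Vec` rather than a stream, and uses a hypothetical `record_count >= min_records` check as a stand-in for the scan filter predicate. The struct fields and the free function `plan_files` are simplified assumptions.

```rust
use std::sync::mpsc;
use std::thread;

// Simplified stand-ins; the real types carry far more information.
struct ManifestEntryContext {
    data_file_path: String,
    record_count: u64,
}

struct FileScanTask {
    data_file_path: String,
}

// Each inner Vec plays the role of one manifest that has already been fetched.
fn plan_files(manifests: Vec<Vec<ManifestEntryContext>>, min_records: u64) -> Vec<FileScanTask> {
    let (entry_tx, entry_rx) = mpsc::channel::<ManifestEntryContext>();
    let (task_tx, task_rx) = mpsc::channel::<FileScanTask>();

    // Stage 1: one thread per manifest pushes its entries onto the entry channel.
    let producers: Vec<_> = manifests
        .into_iter()
        .map(|entries| {
            let tx = entry_tx.clone();
            thread::spawn(move || {
                for entry in entries {
                    tx.send(entry).expect("entry receiver dropped");
                }
            })
        })
        .collect();
    // Drop the original sender so the entry channel closes once producers finish.
    drop(entry_tx);

    // Stage 2: filter entries (stand-in predicate) and emit a task for each match.
    let worker = thread::spawn(move || {
        for entry in entry_rx {
            if entry.record_count >= min_records {
                let task = FileScanTask { data_file_path: entry.data_file_path };
                task_tx.send(task).expect("task receiver dropped");
            }
        }
    });

    for producer in producers {
        producer.join().expect("producer panicked");
    }
    worker.join().expect("worker panicked");

    // The worker owned the only task sender, so the channel is now closed.
    task_rx.into_iter().collect()
}
```

Because the producers run concurrently, the order of the resulting tasks is not deterministic in this sketch.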
   
   ## Changes to `TableScan`
   
   ### `FileScanTask`
   
   Each `FileScanTask` represents a scan to be performed on a single data file. 
However, multiple delete files may need to be applied to any one data file. 
Additionally, the scope of applicability of a delete file is any data file 
within the same partition as the delete file, i.e. the same delete file may 
need to be applied to multiple data files. Thus an executor needs to know not 
just the data file that it is processing, but all of the delete files that are 
applicable to that data file.
   
   The first part of the set of changes that I'm proposing is to refactor 
`FileScanTask` so that it represents a single data file and zero or more delete 
files.
   
   * The `data_file_content` property would be removed - each task is 
implicitly about a file of type `Data`.
   * A new struct, `DeleteFileEntry`, would be added. It would look something 
like this:
     ```rust
     struct DeleteFileEntry {
         path: String,
         format: DataFileFormat
     }
     ```
   * A `delete_files` property of type `Vec<DeleteFileEntry>` would be added to 
`FileScanTask` to represent the delete files that are applicable to its data 
file.
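
Taken together, the three changes listed above might leave `FileScanTask` looking something like the sketch below. Only the fields relevant to this proposal are shown; `data_file_path` and `data_file_format` are assumed field names, `DataFileFormat` is a local stand-in for the crate's enum so that the snippet compiles on its own, and `has_deletes` is a hypothetical helper added purely for illustration.

```rust
// Local stand-in for the crate's `DataFileFormat`, so the sketch is self-contained.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum DataFileFormat {
    Avro,
    Orc,
    Parquet,
}

/// A delete file that must be applied when scanning the task's data file.
#[derive(Debug, Clone, PartialEq, Eq)]
struct DeleteFileEntry {
    path: String,
    format: DataFileFormat,
}

/// Simplified task: one data file plus zero or more delete files.
#[derive(Debug, Clone, PartialEq, Eq)]
struct FileScanTask {
    data_file_path: String,
    data_file_format: DataFileFormat,
    // No `data_file_content` field: the task is implicitly about a `Data` file.
    delete_files: Vec<DeleteFileEntry>,
}

impl FileScanTask {
    /// Hypothetical helper: true when the executor must load delete files first.
    fn has_deletes(&self) -> bool {
        !self.delete_files.is_empty()
    }
}
```

Carrying only the path and format of each delete file keeps the task small and self-contained, in line with the goal of making tasks easy to distribute.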
   
   ### `TableScan::plan_files` and associated methods
   
   We need to update this logic in order to ensure that we can properly 
populate this new `delete_files` property. Each `ManifestEntryContext` will 
need the list of delete files so that, if the manifest entry that it 
encapsulates passes the filtering steps, it can populate the new `delete_files` 
property when it constructs the `FileScanTask`.
   
   A naive approach may be to simply build a list of all of the delete files 
referred to by the top-level manifest list and give references to this list to 
all `ManifestEntryContext`s so that, if any delete files are present, all of 
them are included in every `FileScanTask`. This would be a good first step - 
code that works inefficiently is better than code that does not work at all! It 
would also permit work to proceed on the execution side.
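
A minimal sketch of this naive first step, assuming the shared list is handed out as an `Arc<Vec<DeleteFileEntry>>`. The field name `all_delete_files`, the use of a `String` for the format, and the conversion method shown here are illustrative assumptions rather than the crate's actual API.

```rust
use std::sync::Arc;

#[derive(Debug, Clone, PartialEq, Eq)]
struct DeleteFileEntry {
    path: String,
    format: String, // stand-in for `DataFileFormat`
}

// Simplified context: the real one carries much more than this.
struct ManifestEntryContext {
    data_file_path: String,
    // Shared, read-only list of every delete file referenced by the manifest list.
    all_delete_files: Arc<Vec<DeleteFileEntry>>,
}

struct FileScanTask {
    data_file_path: String,
    delete_files: Vec<DeleteFileEntry>,
}

impl ManifestEntryContext {
    // Naive: every task receives a copy of every delete file, applicable or not.
    fn into_file_scan_task(self) -> FileScanTask {
        FileScanTask {
            data_file_path: self.data_file_path,
            delete_files: self.all_delete_files.to_vec(),
        }
    }
}
```

Sharing the list behind an `Arc` means the contexts only pay for a reference-count bump; the copy happens once per task that survives filtering.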
   
   Improvements could then be made to refine this approach so that inapplicable 
delete files are filtered out before they go into each `FileScanTask`'s 
`delete_files` property.
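
As one possible refinement, the shared list could be narrowed to the delete files in the same partition as the data file, which is the applicability rule mentioned earlier. The sketch below assumes a hypothetical `partition` field on `DeleteFileEntry` (not part of the struct proposed above) and represents partitions as plain strings. The Iceberg spec has further applicability rules, such as sequence numbers, that are not modelled here.

```rust
#[derive(Debug, Clone, PartialEq, Eq)]
struct DeleteFileEntry {
    path: String,
    // Hypothetical field: a string stand-in for the delete file's partition.
    partition: String,
}

/// Keep only the delete files whose partition matches the data file's partition.
fn applicable_delete_files(
    data_file_partition: &str,
    all_delete_files: &[DeleteFileEntry],
) -> Vec<DeleteFileEntry> {
    all_delete_files
        .iter()
        .filter(|delete_file| delete_file.partition == data_file_partition)
        .cloned()
        .collect()
}
```

The result of such a function could be used to populate `delete_files` in place of the full list.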
   
   How does this sound so far, @liurenjie1024, @Xuanwo, @ZENOTME, @Fokko?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

