Re: [PR] Core: Switch usage to DataFileSet / DeleteFileSet [iceberg]

via GitHub Tue, 08 Oct 2024 19:37:36 -0700


aokolnychyi commented on code in PR #11158:
URL: https://github.com/apache/iceberg/pull/11158#discussion_r1792639080



##########
core/src/main/java/org/apache/iceberg/FastAppend.java:
##########
@@ -215,7 +213,7 @@ private List<ManifestFile> writeNewManifests() throws IOException {
     }
 
     if (newManifests == null && !newFiles.isEmpty()) {
-      this.newManifests = writeDataManifests(newFiles, spec);
+      this.newManifests = writeDataManifests(Lists.newArrayList(newFiles), spec);

Review Comment:
   What about modifying `writeDataManifests` to accept a `Collection` and moving the list creation to `divide`?
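To illustrate the suggestion, here is a minimal sketch in which the caller-facing method takes a `Collection` and the copy into a `List` happens only inside `divide`, where indexed slicing is needed. The class name `DivideSketch`, the method `writeManifests`, and the `groupSize` parameter are hypothetical stand-ins; the real signatures of `writeDataManifests` and `divide` are not shown in this thread.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for illustration only; not the Iceberg implementation.
class DivideSketch {

  // Accepts any Collection and copies it into a List here,
  // because this is the only place that needs index-based slicing.
  static <T> List<List<T>> divide(Collection<T> items, int groupSize) {
    if (groupSize <= 0) {
      throw new IllegalArgumentException("groupSize must be positive");
    }
    List<T> list = new ArrayList<>(items);
    List<List<T>> groups = new ArrayList<>();
    for (int start = 0; start < list.size(); start += groupSize) {
      int end = Math.min(start + groupSize, list.size());
      groups.add(new ArrayList<>(list.subList(start, end)));
    }
    return groups;
  }

  // Takes a Collection, so callers can pass a Set without copying it first.
  // Returns the number of groups as a stand-in for the manifests written.
  static <T> int writeManifests(Collection<T> files, int groupSize) {
    return divide(files, groupSize).size();
  }

  public static void main(String[] args) {
    Set<String> newFiles =
        new LinkedHashSet<>(Arrays.asList("a.parquet", "b.parquet", "c.parquet"));
    System.out.println(writeManifests(newFiles, 2));
  }
}
```

With this shape, the call site in `FastAppend` would not need its own `Lists.newArrayList(newFiles)` copy.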



##########
core/src/main/java/org/apache/iceberg/ManifestFilterManager.java:
##########
@@ -533,4 +531,51 @@ private Pair<InclusiveMetricsEvaluator, StrictMetricsEvaluator> metricsEvaluator
       return metricsEvaluators.get(partition);
     }
   }
+
+  private class FilesToDeleteHolder {

Review Comment:
   Is there any way we can do this differently? In theory, we can add another 
abstract method, similar to how we handle manifest writers.
   
   ```
   protected abstract Set<F> newFileSet();
   
   protected abstract ManifestWriter<F> newManifestWriter(PartitionSpec spec);
   
   protected abstract ManifestReader<F> newManifestReader(ManifestFile manifest);
   ```
   
   One caveat is that we would call this method to initialize an instance field. Calling an overridable method during construction is considered bad practice, but implementations will be stateless, so it will work. We could pass a `Supplier<Set<F>>` instead, but I am not sure that is better. In either case, we need to find a way to avoid having both sets of files here. It will also reduce the number of changes.
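The two options can be sketched side by side. This is only an illustration under stated assumptions: every class name below is hypothetical, `String` stands in for the file type `F`, and `HashSet` and `TreeSet` stand in for `DataFileSet` and `DeleteFileSet`.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.TreeSet;
import java.util.function.Supplier;

// Option 1: an abstract factory method, used to initialize an instance field.
abstract class FilterManagerSketch<F> {
  // This initializer runs before any subclass field is initialized, so
  // newFileSet() implementations must not depend on subclass state.
  private final Set<F> filesToDelete = newFileSet();

  protected abstract Set<F> newFileSet();

  void delete(F file) {
    filesToDelete.add(file);
  }

  Set<F> filesToDelete() {
    return filesToDelete;
  }
}

// Stateless implementations: each one only picks the concrete set type.
class DataFilterManagerSketch extends FilterManagerSketch<String> {
  @Override
  protected Set<String> newFileSet() {
    return new HashSet<>(); // stand-in for DataFileSet.create()
  }
}

class DeleteFilterManagerSketch extends FilterManagerSketch<String> {
  @Override
  protected Set<String> newFileSet() {
    return new TreeSet<>(); // stand-in for DeleteFileSet.create()
  }
}

// Option 2: the set factory is passed in as a Supplier.
class SupplierFilterManagerSketch<F> {
  private final Set<F> filesToDelete;

  SupplierFilterManagerSketch(Supplier<Set<F>> setSupplier) {
    this.filesToDelete = setSupplier.get();
  }

  void delete(F file) {
    filesToDelete.add(file);
  }

  Set<F> filesToDelete() {
    return filesToDelete;
  }
}
```

Either way the base class holds a single `Set<F>` field, which is the point of the comment: no holder that carries both a data file set and a delete file set.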



##########
core/src/main/java/org/apache/iceberg/ManifestFilterManager.java:
##########
@@ -372,8 +367,14 @@ private boolean manifestHasDeletedFiles(
 
     for (ManifestEntry<F> entry : reader.liveEntries()) {
       F file = entry.file();
+
+      // add path-based delete to set of files to be deleted
+      if (deletePaths.contains(CharSequenceWrapper.wrap(file.path()))) {

Review Comment:
   Why do we wrap here? `deletePaths` is a `CharSequenceSet`, so `contains` should accept `file.path()` directly.
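For context, a simplified sketch of why a dedicated character-sequence set removes the need for wrapping at the call site. `PathSetSketch` is a hypothetical class and not Iceberg's `CharSequenceSet`; the assumption is only that such a set normalizes its arguments internally, as the question above implies.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical, simplified set; Iceberg's CharSequenceSet is more elaborate.
class PathSetSketch {
  private final Set<String> paths = new HashSet<>();

  boolean add(CharSequence path) {
    return paths.add(path.toString());
  }

  // Normalization happens inside the set, so callers pass the raw
  // CharSequence and never wrap it themselves.
  boolean contains(CharSequence path) {
    return paths.contains(path.toString());
  }

  public static void main(String[] args) {
    // A plain HashSet<CharSequence> compares by each implementation's equals(),
    // so a String entry is not found when probing with a StringBuilder.
    Set<CharSequence> plain = new HashSet<>();
    plain.add("s3://bucket/a.parquet");
    System.out.println(plain.contains(new StringBuilder("s3://bucket/a.parquet")));

    // The dedicated set compares by content regardless of implementation.
    PathSetSketch deletePaths = new PathSetSketch();
    deletePaths.add("s3://bucket/a.parquet");
    System.out.println(deletePaths.contains(new StringBuilder("s3://bucket/a.parquet")));
  }
}
```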



##########
core/src/main/java/org/apache/iceberg/MergingSnapshotProducer.java:
##########
@@ -81,8 +83,8 @@ abstract class MergingSnapshotProducer<ThisT> extends SnapshotProducer<ThisT> {
 
   // update data
   private final Map<PartitionSpec, List<DataFile>> newDataFilesBySpec = Maps.newHashMap();
-  private final CharSequenceSet newDataFilePaths = CharSequenceSet.empty();
-  private final CharSequenceSet newDeleteFilePaths = CharSequenceSet.empty();
+  private final DataFileSet newDataFiles = DataFileSet.create();
+  private final DeleteFileSet newDeleteFiles = DeleteFileSet.create();

Review Comment:
   Do we need these extra collections? Can't we use sets as the values in `newDataFilesBySpec` and `newDeleteFilesBySpec`?
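A minimal sketch of that idea, with the per-spec map holding sets so that no separate collection is needed for duplicate detection. `FilesBySpecSketch` is a hypothetical class, an `Integer` spec ID stands in for `PartitionSpec`, and a `String` path stands in for `DataFile`. It also assumes a given file is only ever added under one spec, so per-spec deduplication is sufficient.

```java
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in; Integer spec IDs and String paths replace
// PartitionSpec and DataFile for the sake of a self-contained example.
class FilesBySpecSketch {
  private final Map<Integer, Set<String>> newDataFilesBySpec = new HashMap<>();

  // Returns true only if the file was not already tracked for this spec,
  // which is the check the separate set would otherwise provide.
  boolean add(int specId, String file) {
    return newDataFilesBySpec
        .computeIfAbsent(specId, ignored -> new LinkedHashSet<>())
        .add(file);
  }

  int fileCount() {
    int count = 0;
    for (Set<String> files : newDataFilesBySpec.values()) {
      count += files.size();
    }
    return count;
  }

  public static void main(String[] args) {
    FilesBySpecSketch producer = new FilesBySpecSketch();
    producer.add(0, "a.parquet");
    producer.add(0, "a.parquet"); // duplicate, ignored by the set
    producer.add(1, "b.parquet");
    System.out.println(producer.fileCount());
  }
}
```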



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

