danielcweeks commented on code in PR #14264:
URL: https://github.com/apache/iceberg/pull/14264#discussion_r2535808483
##########
core/src/main/java/org/apache/iceberg/BaseIncrementalChangelogScan.java:
##########
@@ -179,5 +735,56 @@ public CloseableIterable<ChangelogScanTask> apply(
}
});
}
+
+ /**
+ * Gets delete files that apply to an ADDED data file. Only includes
deletes added in the same
+ * snapshot as the file.
+ */
+ private DeleteFile[] getDeletesForAddedFile(
+ ManifestEntry<DataFile> entry, long commitSnapshotId) {
+ DeleteFileIndex addedDeleteIndex =
addedDeletesBySnapshot.get(commitSnapshotId);
+ return addedDeleteIndex == null || addedDeleteIndex.isEmpty()
+ ? NO_DELETES
+ : addedDeleteIndex.forEntry(entry);
+ }
+
+ /**
+ * Gets all delete files that were applied to a DELETED data file up to
the point it was
+ * deleted. This includes existing deletes and all deletes added in the
scan range up to (but
+ * not including) the deletion snapshot.
+ */
+ private DeleteFile[] getDeletesForDeletedFile(
+ ManifestEntry<DataFile> entry, long deletionSnapshotId) {
+
+ List<DeleteFile> allDeletes = Lists.newArrayList();
+
+ // Build existing delete index lazily when first DELETED entry is
encountered
+ DeleteFileIndex existingDeleteIndex = existingDeleteIndexSupplier.get();
+ DeleteFile[] existingDeletes =
+ existingDeleteIndex.isEmpty() ? NO_DELETES :
existingDeleteIndex.forEntry(entry);
+ for (DeleteFile df : existingDeletes) {
+ allDeletes.add(df);
+ }
+
+ // Add all deletes from snapshots in the scan range BEFORE the deletion
+ List<DeleteFile> cumulativeDeletes =
cumulativeDeletesMap.get(deletionSnapshotId);
+ if (cumulativeDeletes != null) {
+ // For each cumulative delete, check if it applies to this entry
+ for (DeleteFile df : cumulativeDeletes) {
Review Comment:
This feels super expensive. Don't we know that it applies? If there's an
existing POS/DV against a removed file, it would have to apply. If there's an
equality delete, you would just include it regardless and let the task apply
it? Overall, it feels like any associated delete would apply or could/would be
deconflicted by the task.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]