rdblue commented on code in PR #15006:
URL: https://github.com/apache/iceberg/pull/15006#discussion_r2825087710
##########
core/src/main/java/org/apache/iceberg/MergingSnapshotProducer.java:
##########
@@ -1065,9 +1062,34 @@ private List<ManifestFile> newDeleteFilesAsManifests() {
}
if (cachedNewDeleteManifests.isEmpty()) {
+ for (Map.Entry<String, List<DeleteFile>> entry :
dvsByReferencedFile.entrySet()) {
+ if (entry.getValue().size() > 1) {
+ LOG.warn(
+ "Attempted to commit {} duplicate DVs for data file {} in table
{}. "
+ + "Merging duplicates, and original DVs will be orphaned.",
+ entry.getValue().size(),
+ entry.getKey(),
+ tableName);
+ }
+ }
+
+ List<DeleteFile> finalDVs =
+ DVUtil.mergeAndWriteDvsIfRequired(
+ dvsByReferencedFile,
+ ThreadPools.getDeleteWorkerPool(),
+ ops().locationProvider(),
+ ops().encryption(),
+ ops().io(),
+ ops().current().specsById());
+ // Prevent commiting duplicate V2 deletes by deduping them
+ Map<Integer, List<DeleteFile>> newDeleteFilesBySpec =
+ Streams.stream(Iterables.concat(finalDVs,
DeleteFileSet.of(positionAndEqualityDeletes)))
+ .map(file -> Delegates.pendingDeleteFile(file,
file.dataSequenceNumber()))
+ .collect(Collectors.groupingBy(ContentFile::specId));
newDeleteFilesBySpec.forEach(
(specId, deleteFiles) -> {
PartitionSpec spec = ops().current().spec(specId);
+ deleteFiles.forEach(file -> addedFilesSummary.addedFile(spec,
file));
Review Comment:
If the cache is invalidated, this will double-count all delete files, which
I think we have to fix. That's a bit ugly. The cleanest way I can think of is
to keep a summary builder for data files and then combine the current set of
deletes with data files here with a new builder, or possibly just clear the
current builder and add everything.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]