wypoon commented on code in PR #10935:
URL: https://github.com/apache/iceberg/pull/10935#discussion_r1727409290
##########
core/src/main/java/org/apache/iceberg/BaseIncrementalChangelogScan.java:
##########
@@ -63,33 +60,43 @@ protected CloseableIterable<ChangelogScanTask> doPlanFiles(
       return CloseableIterable.empty();
     }

-    Set<Long> changelogSnapshotIds = toSnapshotIds(changelogSnapshots);
+    Map<Long, Integer> snapshotOrdinals = computeSnapshotOrdinals(changelogSnapshots);

-    Set<ManifestFile> newDataManifests =
-        FluentIterable.from(changelogSnapshots)
-            .transformAndConcat(snapshot -> snapshot.dataManifests(table().io()))
-            .filter(manifest -> changelogSnapshotIds.contains(manifest.snapshotId()))
-            .toSet();
-
-    ManifestGroup manifestGroup =
-        new ManifestGroup(table().io(), newDataManifests, ImmutableList.of())
-            .specsById(table().specs())
-            .caseSensitive(isCaseSensitive())
-            .select(scanColumns())
-            .filterData(filter())
-            .filterManifestEntries(entry -> changelogSnapshotIds.contains(entry.snapshotId()))
-            .ignoreExisting()
-            .columnsToKeepStats(columnsToKeepStats());
-
-    if (shouldIgnoreResiduals()) {
-      manifestGroup = manifestGroup.ignoreResiduals();
-    }
-
-    if (newDataManifests.size() > 1 && shouldPlanWithExecutor()) {
-      manifestGroup = manifestGroup.planWith(planExecutor());
-    }
+    // map of delete file to the snapshot where the delete file is added
+    // the delete file is keyed by its path, and the snapshot is represented by the snapshot ordinal
+    Map<String, Integer> deleteFileToSnapshotOrdinal =
+        computeDeleteFileToSnapshotOrdinal(changelogSnapshots, snapshotOrdinals);

-    return manifestGroup.plan(new CreateDataFileChangeTasks(changelogSnapshots));
+    Iterable<CloseableIterable<ChangelogScanTask>> plans =
+        FluentIterable.from(changelogSnapshots)
+            .transform(
+                snapshot -> {
+                  List<ManifestFile> dataManifests = snapshot.dataManifests(table().io());
+                  List<ManifestFile> deleteManifests = snapshot.deleteManifests(table().io());
+
+                  ManifestGroup manifestGroup =
+                      new ManifestGroup(table().io(), dataManifests, deleteManifests)
+                          .specsById(table().specs())
+                          .caseSensitive(isCaseSensitive())
+                          .select(scanColumns())
+                          .filterData(filter())
+                          .columnsToKeepStats(columnsToKeepStats());
+
+                  if (shouldIgnoreResiduals()) {
+                    manifestGroup = manifestGroup.ignoreResiduals();
+                  }
+
+                  if (dataManifests.size() > 1 && shouldPlanWithExecutor()) {
+                    manifestGroup = manifestGroup.planWith(planExecutor());
+                  }
+
+                  long snapshotId = snapshot.snapshotId();
+                  return manifestGroup.plan(
+                      new CreateDataFileChangeTasks(
+                          snapshotId, snapshotOrdinals, deleteFileToSnapshotOrdinal));

Review Comment:
   Ah, in this case, (b) is the correct behavior. The changelog scan is an incremental scan over multiple snapshots in a range, and should emit the changes for each snapshot. This is the current behavior for the supported case, which is copy-on-write.
   What you are seeking is the net changes, functionality that is also supported by Spark and built on top of the changelog scan; it uses `ChangelogIterator.removeNetCarryovers`. This functionality is exposed in the Spark procedure `create_changelog_view`. (Of course, one can also use it programmatically.)
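   To make the net-changes route concrete, here is a minimal sketch of calling the `create_changelog_view` procedure from a Spark session with `net_changes => true`; the catalog name, table name `db.tbl`, and the snapshot-ID range are hypothetical placeholders:

   ```java
   import org.apache.spark.sql.Dataset;
   import org.apache.spark.sql.Row;
   import org.apache.spark.sql.SparkSession;

   public class NetChangesExample {
     public static void main(String[] args) {
       SparkSession spark =
           SparkSession.builder().appName("changelog-net-changes").getOrCreate();

       // Create a changelog view over a snapshot range; net_changes => true
       // collapses carry-over rows so only the net changes are emitted
       // (internally this relies on ChangelogIterator.removeNetCarryovers).
       // 'db.tbl' and the snapshot IDs below are placeholders.
       spark.sql(
           "CALL spark_catalog.system.create_changelog_view("
               + "table => 'db.tbl', "
               + "options => map('start-snapshot-id', '1', 'end-snapshot-id', '2'), "
               + "net_changes => true)");

       // By default the procedure names the view <table>_changes.
       Dataset<Row> netChanges = spark.sql("SELECT * FROM tbl_changes");
       netChanges.show();
     }
   }
   ```

   Without `net_changes => true`, the same view reports the per-snapshot changes that the changelog scan itself produces, which is the behavior being discussed in this thread.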