aokolnychyi commented on code in PR #10935: URL: https://github.com/apache/iceberg/pull/10935#discussion_r1803692669
##########
core/src/main/java/org/apache/iceberg/BaseIncrementalChangelogScan.java:
##########
@@ -63,33 +61,39 @@ protected CloseableIterable<ChangelogScanTask> doPlanFiles(
       return CloseableIterable.empty();
     }
 
-    Set<Long> changelogSnapshotIds = toSnapshotIds(changelogSnapshots);
+    Map<Long, Integer> snapshotOrdinals = computeSnapshotOrdinals(changelogSnapshots);
 
-    Set<ManifestFile> newDataManifests =
+    Iterable<CloseableIterable<ChangelogScanTask>> plans =
         FluentIterable.from(changelogSnapshots)
-            .transformAndConcat(snapshot -> snapshot.dataManifests(table().io()))
-            .filter(manifest -> changelogSnapshotIds.contains(manifest.snapshotId()))
-            .toSet();
-
-    ManifestGroup manifestGroup =
-        new ManifestGroup(table().io(), newDataManifests, ImmutableList.of())
-            .specsById(table().specs())
-            .caseSensitive(isCaseSensitive())
-            .select(scanColumns())
-            .filterData(filter())
-            .filterManifestEntries(entry -> changelogSnapshotIds.contains(entry.snapshotId()))
-            .ignoreExisting()
-            .columnsToKeepStats(columnsToKeepStats());
-
-    if (shouldIgnoreResiduals()) {
-      manifestGroup = manifestGroup.ignoreResiduals();
-    }
-
-    if (newDataManifests.size() > 1 && shouldPlanWithExecutor()) {
-      manifestGroup = manifestGroup.planWith(planExecutor());
-    }
-
-    return manifestGroup.plan(new CreateDataFileChangeTasks(changelogSnapshots));
+            .transform(
+                snapshot -> {
+                  List<ManifestFile> dataManifests = snapshot.dataManifests(table().io());
+                  List<ManifestFile> deleteManifests = snapshot.deleteManifests(table().io());
+
+                  ManifestGroup manifestGroup =
+                      new ManifestGroup(table().io(), dataManifests, deleteManifests)

Review Comment:
   This will cause a substantial performance hit: for each changelog snapshot we will scan all data and delete manifests that match the filter, instead of opening only the newly added manifests as before. We may have to do that for deletes anyway, but I wonder about data manifests. Let me think a bit.

-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

---------------------------------------------------------------------
For additional commands, e-mail: issues-h...@iceberg.apache.org
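
Editorial illustration of the concern raised in the review comment. This is a minimal sketch and does not use Iceberg's API. The `Manifest` and `Snapshot` classes, the method names, and the three-snapshot history are all hypothetical stand-ins, and partition filtering is ignored. It only counts how many manifests each planning strategy would open, assuming every snapshot still references the manifests added by earlier snapshots.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class ManifestOpenCount {

  // Hypothetical stand-in for a manifest: records only the snapshot that added it.
  static final class Manifest {
    final long addedSnapshotId;

    Manifest(long addedSnapshotId) {
      this.addedSnapshotId = addedSnapshotId;
    }
  }

  // Hypothetical stand-in for a snapshot: its ID plus every manifest reachable from it.
  static final class Snapshot {
    final long id;
    final List<Manifest> manifests;

    Snapshot(long id, List<Manifest> manifests) {
      this.id = id;
      this.manifests = manifests;
    }
  }

  // Strategy resembling the removed code: keep only manifests that were added by
  // one of the changelog snapshots, and open each of them once.
  static int opensWhenOnlyNewManifests(List<Snapshot> changelogSnapshots) {
    Set<Long> changelogSnapshotIds = new HashSet<>();
    for (Snapshot snapshot : changelogSnapshots) {
      changelogSnapshotIds.add(snapshot.id);
    }

    // Manifest does not override equals/hashCode, so this set deduplicates by identity.
    Set<Manifest> toOpen = new HashSet<>();
    for (Snapshot snapshot : changelogSnapshots) {
      for (Manifest manifest : snapshot.manifests) {
        if (changelogSnapshotIds.contains(manifest.addedSnapshotId)) {
          toOpen.add(manifest);
        }
      }
    }
    return toOpen.size();
  }

  // Strategy resembling the added code: plan each snapshot separately, which
  // opens every manifest reachable from that snapshot.
  static int opensWhenPlanningPerSnapshot(List<Snapshot> changelogSnapshots) {
    int opens = 0;
    for (Snapshot snapshot : changelogSnapshots) {
      opens += snapshot.manifests.size();
    }
    return opens;
  }

  // Three append-only snapshots; each one adds a manifest and keeps the older ones.
  static List<Snapshot> exampleHistory() {
    Manifest m1 = new Manifest(1);
    Manifest m2 = new Manifest(2);
    Manifest m3 = new Manifest(3);
    return Arrays.asList(
        new Snapshot(1, Arrays.asList(m1)),
        new Snapshot(2, Arrays.asList(m1, m2)),
        new Snapshot(3, Arrays.asList(m1, m2, m3)));
  }

  public static void main(String[] args) {
    List<Snapshot> history = exampleHistory();
    System.out.println("only new manifests: " + opensWhenOnlyNewManifests(history));
    System.out.println("per-snapshot planning: " + opensWhenPlanningPerSnapshot(history));
  }
}
```

Under these assumptions the first strategy opens 3 manifests and the second opens 6. With n append-only snapshots the gap grows from n to n(n+1)/2, which is the kind of cost the comment is worried about for data manifests.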