talatuyarer commented on code in PR #14264:
URL: https://github.com/apache/iceberg/pull/14264#discussion_r2418312507


##########
core/src/main/java/org/apache/iceberg/BaseIncrementalChangelogScan.java:
##########
@@ -71,6 +80,12 @@ protected CloseableIterable<ChangelogScanTask> doPlanFiles(
             .filter(manifest -> changelogSnapshotIds.contains(manifest.snapshotId()))
             .toSet();
 
+    // Build delete file index for existing deletes (before the start snapshot)
+    DeleteFileIndex existingDeleteIndex = buildExistingDeleteIndex(fromSnapshotIdExclusive);

Review Comment:
   I thought about `existingDeleteIndex`. Currently, all delete manifests are 
loaded and parsed upfront, regardless of whether they are relevant to the scan 
filter. For tables with thousands of delete manifests, this can be extremely 
slow. Even with caching, the first scan still parses all manifests, and the 
memory footprint can also be large for huge tables.
   
   We can optimize planning performance by selecting manifests with partition 
filters. However, there could still be a large memory need if there are many 
delete files for specific partitions.
   
   I could also implement lazy loading for delete files, so that delete files 
are loaded only when they are actually needed during planning.
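   
   The lazy-loading idea could look roughly like this. This is a minimal sketch with hypothetical names (`LazyDeleteFileLookup`, `deleteFilesFor`), not Iceberg's actual `DeleteFileIndex` API: delete files are resolved per partition and the underlying manifests are parsed only on first access to that partition.
   
   ```java
   import java.util.List;
   import java.util.Map;
   import java.util.concurrent.ConcurrentHashMap;
   import java.util.function.Function;
   
   // Hypothetical sketch: memoize delete-file lookups per partition so that
   // manifest parsing happens only for partitions a scan task actually touches.
   class LazyDeleteFileLookup {
     // Cache of partition key -> delete file paths already loaded for it.
     private final Map<String, List<String>> cache = new ConcurrentHashMap<>();
     // Loader that parses the relevant delete manifests for one partition.
     private final Function<String, List<String>> loader;
   
     LazyDeleteFileLookup(Function<String, List<String>> loader) {
       this.loader = loader;
     }
   
     // Invokes the loader at most once per partition; later calls hit the cache.
     List<String> deleteFilesFor(String partition) {
       return cache.computeIfAbsent(partition, loader);
     }
   }
   ```
   
   With something like this, a scan over a few partitions of a table with thousands of delete manifests would only pay the parsing cost for those partitions, at the price of slightly more bookkeeping during planning.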
   
   Do you think we should reduce the scope of this PR to the basic CDC 
functionality? I can then create follow-up PRs for the performance improvements.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

