eshishki commented on issue #12280: URL: https://github.com/apache/iceberg/issues/12280#issuecomment-2661592485
In our scenario each commit adds 1 eq delete file, every 5 minutes, i.e. 12 times an hour. We run compaction, say, every hour, and the number of eq delete files stays within reason.

I think we can trade off a larger number of delete files for more granular bounds, so that we reduce the number of data files and rows we need to recheck for deletes.

So the theoretical eq_delete_rewrite procedure should:
1. try to keep the number of eq delete files constant
2. rewrite so as to minimize the number of data file rows overlapped with eq delete file bounds

This would help Spark too, since it would reduce the number of file references.

Our current situation is:

Record Counts:
- Total Data Records: 124,557,336
- Total Data Files: 217
- Records with no deletes: 0
- Records with only eq deletes: 0
- Records with only pos deletes: 1,164
- Records with both deletes: 124,556,172

Delete Statistics:
- Records with eq deletes total: 124,556,172
- Unique eq delete files: 11
- Eq delete files referenced: 2,204
- Eq delete records: 8,714

Pos Delete Statistics:
- Unique pos delete files: 10
- Pos delete files referenced: 2,059
- Pos delete records: 170

Frankly, I would love to see any improvement that reduces "Records with eq deletes total".
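For context, here is a minimal sketch of how the per-type delete file and record counts above could be pulled from Iceberg's `delete_files` metadata table with Spark SQL. The table name `db.tbl` is a placeholder; the "files referenced" and "Records with eq deletes total" figures come from matching delete bounds against data file entries and are not reproduced by this query.

```sql
-- Sketch: summarize live delete files by type for an Iceberg table (assumes db.tbl).
-- In the delete_files metadata table, content = 1 means position deletes,
-- content = 2 means equality deletes.
SELECT
  CASE content WHEN 1 THEN 'position' WHEN 2 THEN 'equality' END AS delete_type,
  COUNT(*)          AS delete_file_count,   -- e.g. "Unique eq delete files"
  SUM(record_count) AS delete_record_count  -- e.g. "Eq delete records"
FROM db.tbl.delete_files
GROUP BY content;
```

The expensive part that the proposed eq_delete_rewrite would address is not these counts themselves, but how many data file rows fall inside the eq delete files' bounds and therefore have to be rechecked at read time.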