eshishki commented on issue #12280:
URL: https://github.com/apache/iceberg/issues/12280#issuecomment-2661592485

   in our scenario each commit adds 1 eq delete file, every 5 minutes, 12 times an hour
   we run compaction, say, every hour, and the number of eq delete files stays within reason
   
   i think we can trade off a larger number of delete files for more granular bounds, so that we reduce the number of data files and rows we need to recheck for deletes
   
   so a theoretical eq_delete_rewrite procedure should
   1. try to keep the number of eq delete files constant
   2. rewrite them so as to minimize the number of data file rows overlapped by eq delete file bounds (a sketch of that criterion follows below)
   
   this would help spark too, since it would reduce the number of delete file references
   
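   to make goal 2 concrete, here is a minimal sketch of the criterion such a rewrite would optimize. everything in it is hypothetical (plain Java records, not Iceberg API classes): `overlappedRows` is the "Records with eq deletes total" a scan would have to recheck, and `regroup` is one naive way to keep the output file count fixed while tightening each file's bounds.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// hypothetical sketch, not Iceberg API: illustrates the rewrite criterion only
class EqDeleteRewriteSketch {

    // one equality delete record, reduced to a single-column key
    record DeleteRecord(long key) {}

    // a data file, reduced to its key bounds and row count
    record DataFileInfo(long lowerBound, long upperBound, long recordCount) {}

    // key bounds of one rewritten eq delete file
    record DeleteFileBounds(long min, long max) {}

    // cost of a candidate set of rewritten eq delete files: total rows in data
    // files whose bounds overlap at least one delete file's bounds, i.e. the
    // rows a reader still has to recheck against equality deletes
    static long overlappedRows(List<DeleteFileBounds> deleteFiles, List<DataFileInfo> dataFiles) {
        long total = 0;
        for (DataFileInfo f : dataFiles) {
            boolean overlaps = deleteFiles.stream()
                .anyMatch(d -> f.lowerBound() <= d.max() && f.upperBound() >= d.min());
            if (overlaps) {
                total += f.recordCount();
            }
        }
        return total;
    }

    // goal 1: keep the output file count fixed; goal 2: sort delete keys and cut
    // them into contiguous ranges so each output file covers a narrow key slice
    static List<DeleteFileBounds> regroup(List<DeleteRecord> deletes, int targetFileCount) {
        List<DeleteRecord> sorted = new ArrayList<>(deletes);
        sorted.sort(Comparator.comparingLong(DeleteRecord::key));
        int perFile = Math.max(1, (int) Math.ceil(sorted.size() / (double) targetFileCount));
        List<DeleteFileBounds> out = new ArrayList<>();
        for (int start = 0; start < sorted.size(); start += perFile) {
            int end = Math.min(start + perFile, sorted.size());
            out.add(new DeleteFileBounds(sorted.get(start).key(), sorted.get(end - 1).key()));
        }
        return out;
    }
}
```

   splitting on sorted keys is only one heuristic; the point is that whatever grouping the procedure picks should be the one that makes overlappedRows small against the table's actual data file bounds.
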
   now our situation is:
   Record Counts:
   Total Data Records:             124,557,336
   Total Data Files:               217
   Records with no deletes:        0
   Records with only eq deletes:   0
   Records with only pos deletes:  1,164
   Records with both deletes:      124,556,172
   
   Delete Statistics:
   Records with eq deletes total:  124,556,172
   Unique eq delete files:         11
   Eq delete files referenced:     2,204
   Eq delete records:              8,714
   
   Pos Delete Statistics:
   Unique pos delete files:        10
   Pos delete files referenced:    2,059
   Pos delete records:             170
   
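   for anyone who wants to reproduce this kind of breakdown, here is a rough sketch of how it can be derived by walking planned scan tasks; the FileScanTask / DeleteFile calls are real Iceberg APIs, but the aggregation itself is just my illustration, not necessarily how the numbers above were computed.

```java
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

import org.apache.iceberg.DeleteFile;
import org.apache.iceberg.FileContent;
import org.apache.iceberg.FileScanTask;
import org.apache.iceberg.Table;
import org.apache.iceberg.io.CloseableIterable;

// illustrative only: walks planned scan tasks and counts how many data records
// are covered by eq deletes and how many delete file references the scan carries
class DeleteStatsSketch {

    static void printDeleteStats(Table table) throws IOException {
        long totalDataRecords = 0;
        long recordsWithEqDeletes = 0;
        long eqDeleteRefs = 0;
        long posDeleteRefs = 0;
        Set<String> uniqueEqDeleteFiles = new HashSet<>();
        Set<String> uniquePosDeleteFiles = new HashSet<>();

        try (CloseableIterable<FileScanTask> tasks = table.newScan().planFiles()) {
            for (FileScanTask task : tasks) {
                totalDataRecords += task.file().recordCount();
                boolean hasEqDelete = false;
                for (DeleteFile delete : task.deletes()) {
                    if (delete.content() == FileContent.EQUALITY_DELETES) {
                        hasEqDelete = true;
                        eqDeleteRefs++;                                // "Eq delete files referenced"
                        uniqueEqDeleteFiles.add(delete.path().toString());
                    } else if (delete.content() == FileContent.POSITION_DELETES) {
                        posDeleteRefs++;                               // "Pos delete files referenced"
                        uniquePosDeleteFiles.add(delete.path().toString());
                    }
                }
                if (hasEqDelete) {
                    recordsWithEqDeletes += task.file().recordCount(); // "Records with eq deletes total"
                }
            }
        }

        System.out.printf("Total Data Records:             %,d%n", totalDataRecords);
        System.out.printf("Records with eq deletes total:  %,d%n", recordsWithEqDeletes);
        System.out.printf("Unique eq delete files:         %,d%n", uniqueEqDeleteFiles.size());
        System.out.printf("Eq delete files referenced:     %,d%n", eqDeleteRefs);
        System.out.printf("Unique pos delete files:        %,d%n", uniquePosDeleteFiles.size());
        System.out.printf("Pos delete files referenced:    %,d%n", posDeleteRefs);
    }
}
```

   the painful part is already visible in the numbers above: 11 eq delete files holding only 8,714 delete records get referenced 2,204 times and force all ~124M data records to be rechecked.
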
   frankly i would love to see any improvement that reduces "Records with eq deletes total"

