eshishki opened a new issue, #12280: URL: https://github.com/apache/iceberg/issues/12280
### Feature Request / Improvement

We ingest from Debezium into Iceberg via https://github.com/databricks/iceberg-kafka-connect/, which under the hood uses the Flink delta writer. Each batch writes a small number of equality deletes for updates to data from previous commits. Most of our database primary keys are UUIDs, so even a handful of equality-delete rows covers a large portion of the data files (via the lower/upper bounds check), forcing a costly check at query time.

We run a periodic compaction process, but it is inefficient: it forces us to rewrite practically the whole table, and the table becomes "dirty" again within the 5-minute commit interval.

We considered writing multiple equality-delete files to make the bounds more granular and emulate a poor man's bloom filter. But that adds many ranges and only postpones the issue: the table would be dirty in, say, 30 minutes instead of 5.

If, however, a new writer could read the previous handful of equality deletes, it could combine them with the new ones, so that the number of range buckets stays roughly constant.

### Query engine

None

### Willingness to contribute

- [ ] I can contribute this improvement/feature independently
- [ ] I would be willing to contribute this improvement/feature with guidance from the Iceberg community
- [x] I cannot contribute this improvement/feature at this time

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org
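To make the bounds problem described in the request concrete, here is an illustrative sketch (not Iceberg code). It models a UUID primary key as a uniform random 128-bit integer, which is an assumption for illustration: real UUIDv4s fix a few version/variant bits, but that does not change the effect. It shows that the [lower, upper] bounds of a delete file holding just a handful of random keys span most of the key space, and that range-partitioning the same keys into several delete files (the "poor man's bloom filter" idea) narrows each file's bounds but shrinks the combined coverage only modestly:

```python
import random

# Simplification: treat a UUID primary key as a uniform 128-bit integer.
FULL_RANGE = (1 << 128) - 1
rng = random.Random(0)  # fixed seed so the sketch is reproducible

def random_key() -> int:
    return rng.getrandbits(128)

def coverage(keys) -> float:
    """Fraction of the key space spanned by [min(keys), max(keys)] --
    the lower/upper bound range that file-skipping compares against."""
    return (max(keys) - min(keys)) / FULL_RANGE

# One equality-delete file holding just a handful of random UUID keys:
delete_keys = [random_key() for _ in range(8)]
single_file_coverage = coverage(delete_keys)
print(f"one delete file with 8 keys spans {single_file_coverage:.0%} of the key space")

# Range-partitioning the same keys into k delete files narrows each file's
# bounds, but the combined coverage only removes the gaps *between* buckets;
# the spread of random keys within each bucket remains:
k = 4
sorted_keys = sorted(delete_keys)
buckets = [sorted_keys[i * 2:(i + 1) * 2] for i in range(k)]
bucket_total = sum(coverage(b) for b in buckets)
print(f"{k} range-partitioned delete files together span {bucket_total:.0%}")
```

With n uniform random keys in one file the expected coverage is (n − 1)/(n + 1), so even 8 keys are expected to span roughly 78% of the key space; splitting them into 4 buckets of 2 only drops the expected total to about 44%. This is why the issue argues that partitioning alone merely postpones the problem, while merging each commit's deletes into the existing buckets would keep the number of ranges roughly constant.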