eshishki opened a new issue, #12280:
URL: https://github.com/apache/iceberg/issues/12280

   ### Feature Request / Improvement
   
   We ingest data from Debezium into Iceberg via
https://github.com/databricks/iceberg-kafka-connect/,
   which internally uses the Flink delta writer.
   
   Each batch writes a small number of equality deletes for updates to data from 
previous commits.
   Most of our database primary keys are UUIDs, so even a handful of 
equality-delete rows covers a large portion of the data files (via the 
lower/upper bounds check), forcing a costly check at query time.
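   A minimal sketch of why the bounds check degrades with UUID keys (all names 
here are hypothetical, not Iceberg code): with randomly distributed UUIDs, even 
a tiny equality-delete file ends up with lower/upper bounds spanning almost the 
entire key space, so min/max pruning can rarely skip a data file.

```python
import uuid

# Hypothetical illustration: 10 random UUID delete keys already have
# lower/upper bounds covering most of the hex key space.
delete_keys = sorted(str(uuid.uuid4()) for _ in range(10))
lower, upper = delete_keys[0], delete_keys[-1]

def may_contain(file_lower: str, file_upper: str) -> bool:
    """Overlap test analogous to the lower/upper bounds check: a data
    file can be skipped only if its key range falls entirely outside
    the delete file's [lower, upper] range."""
    return not (file_upper < lower or file_lower > upper)

# Typical data-file key ranges overlap these wide bounds, so the
# equality-delete filter must be applied at read time anyway.
print(lower, upper)
```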
   
   We do run a periodic compaction process, but it is inefficient: it forces us 
to rewrite practically the whole table, which becomes "dirty" again within one 
5-minute commit interval.
   
   We considered writing multiple equality-delete files to make the bounds more 
granular and emulate a poor man's Bloom filter.
   But that again adds many ranges and only postpones the issue: the table would 
be dirty in, say, 30 minutes instead of 5.
   
   If, however, a new writer could read the previous handful of equality 
deletes, it could combine them with the new ones, so that the number of range 
buckets stays roughly constant.
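   A rough sketch of the proposed writer behaviour (the function and its 
parameters are hypothetical, not an existing Iceberg API): before committing, 
the writer would merge the previous snapshot's equality-delete keys with the 
new batch's keys and re-partition the union into a fixed number of sorted 
buckets, so each bucket's bounds stay narrow and the bucket count stays 
roughly constant across commits.

```python
def merge_into_buckets(previous_keys, new_keys, num_buckets=8):
    """Merge old and new equality-delete keys into num_buckets sorted
    ranges; each range would be written as one equality-delete file,
    giving tighter per-file lower/upper bounds than a single file."""
    merged = sorted(set(previous_keys) | set(new_keys))
    size = max(1, -(-len(merged) // num_buckets))  # ceiling division
    return [merged[i:i + size] for i in range(0, len(merged), size)]

buckets = merge_into_buckets(["a", "c", "e"], ["b", "f"], num_buckets=2)
# Each bucket's (min, max) is a narrower bound than one file
# spanning "a".."f".
print([(b[0], b[-1]) for b in buckets])  # → [('a', 'c'), ('e', 'f')]
```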
   
   ### Query engine
   
   None
   
   ### Willingness to contribute
   
   - [ ] I can contribute this improvement/feature independently
   - [ ] I would be willing to contribute this improvement/feature with 
guidance from the Iceberg community
   - [x] I cannot contribute this improvement/feature at this time


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

