zhongyujiang opened a new issue, #9149: URL: https://github.com/apache/iceberg/issues/9149
### Feature Request / Improvement

When MoR (merge-on-read) mode is turned on, the Spark-Iceberg connector writes the position deletes of multiple data files into a single pos-delete file. This hurts read performance, because the reader has to iterate over all pos-delete records to find the deletions that apply to the current data file.

The cause is that the current default pos-delete file size is 64 MB, while the row-group size is not explicitly set and defaults to 128 MB. In other words, by default a pos-delete file always has only one row group, which makes row-group filtering impossible.

Iceberg stores pos deletes sorted by `file_path`, so I think that if we set a reasonable row-group size and separate the pos deletes belonging to different data files into different row groups as much as possible, the row-group filter can help us skip delete records that are not relevant to the current data file.

Regarding a reasonable row-group size, I did a simple test: when only `file_path` and `pos` are stored, even 10 million position delete records use only 14 MB of storage (thanks to Parquet's dictionary encoding and good compression). I assume that the number of delete records applied to a single data file will typically not exceed 1 million, in which case 1 MB ~ 2 MB seems to be a reasonable row-group size.

My test and assumptions are fairly simple, and real scenarios can be more complicated, so I would like to hear suggestions from the community.

### Query engine

Spark
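
To make the row-group filtering argument in the feature request concrete, here is a minimal Python sketch. It is a toy model, not Iceberg's or Parquet's actual reader: the names `RowGroup`, `build_row_groups`, and `read_deletes` are hypothetical, row groups are sized by row count instead of bytes, and the min/max check only imitates what a statistics-based row-group filter would do on `file_path`. The last lines redo the size estimate from the test numbers quoted in the issue (14 MB for 10 million records).

```python
from dataclasses import dataclass
from typing import List, Tuple

Delete = Tuple[str, int]  # (file_path, pos)


@dataclass
class RowGroup:
    """Toy stand-in for a Parquet row group with min/max stats on file_path."""
    min_path: str
    max_path: str
    rows: List[Delete]


def build_row_groups(deletes: List[Delete], rows_per_group: int) -> List[RowGroup]:
    # Pos deletes are stored sorted by file_path (then pos), as the issue notes.
    ordered = sorted(deletes)
    groups = []
    for start in range(0, len(ordered), rows_per_group):
        chunk = ordered[start:start + rows_per_group]
        groups.append(RowGroup(chunk[0][0], chunk[-1][0], chunk))
    return groups


def read_deletes(groups: List[RowGroup], data_file: str) -> Tuple[List[int], int]:
    """Return (positions deleted in data_file, number of records scanned)."""
    scanned = 0
    positions = []
    for group in groups:
        # Skip row groups whose file_path range cannot contain data_file.
        if data_file < group.min_path or data_file > group.max_path:
            continue
        scanned += len(group.rows)
        positions.extend(pos for path, pos in group.rows if path == data_file)
    return positions, scanned


# Hypothetical workload: 4 data files with 1000 deleted positions each.
deletes = [(f"data-{i}.parquet", pos) for i in range(4) for pos in range(1000)]

# One big row group (the default situation described in the issue)
# versus one row group per data file.
single = build_row_groups(deletes, rows_per_group=len(deletes))
split = build_row_groups(deletes, rows_per_group=1000)

pos_single, scanned_single = read_deletes(single, "data-2.parquet")
pos_split, scanned_split = read_deletes(split, "data-2.parquet")
print("scanned with one row group:", scanned_single)
print("scanned with split row groups:", scanned_split)

# Size estimate from the issue's test: 14 MB for 10 million records.
bytes_per_record = 14 * 1024 * 1024 / 10_000_000
est_mb_per_million = bytes_per_record * 1_000_000 / (1024 * 1024)
print("estimated MB per 1 million deletes:", est_mb_per_million)
```

In this model both layouts return the same deleted positions, but the split layout scans only the records of the one matching row group. The estimate of roughly 1.4 MB per million records is what places the proposed 1 MB ~ 2 MB row-group size in the right range, under the issue's assumption of at most about 1 million deletes per data file.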