zhongyujiang opened a new issue, #9149:
URL: https://github.com/apache/iceberg/issues/9149

   ### Feature Request / Improvement
   
   When MoR mode is turned on, the Spark-Iceberg connector writes the position 
deletes of multiple data files into one pos delete file. This hurts read 
performance because the reader has to iterate over all pos delete records to 
find the deletes that apply to the current data file. The cause is that the 
current default pos delete file size is 64 MB, while the row-group size is not 
explicitly set and so defaults to 128 MB. In other words, by default a pos 
delete file always has only one row group, which makes row-group filtering 
impossible.
   
   Since Iceberg stores pos deletes sorted by `file_path`, I think that if we 
set a reasonable row-group size and separate pos deletes belonging to 
different data files into different row groups as much as possible, then 
row-group filtering can help us filter out delete records that are not 
relevant to the current data file.
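   
   To illustrate the idea, here is a minimal Python sketch that simulates 
min/max filtering on `file_path`. It is not Iceberg or Parquet code: the 
function names, the in-memory row-group model, and the workload (100 data files 
with 1,000 deletes each) are all assumptions made up for the example.

```python
def build_row_groups(deletes, rows_per_group):
    """Split sorted (file_path, pos) records into row groups with min/max stats."""
    groups = []
    for start in range(0, len(deletes), rows_per_group):
        chunk = deletes[start:start + rows_per_group]
        groups.append({
            "min_path": chunk[0][0],   # records are sorted, so first is the minimum
            "max_path": chunk[-1][0],  # and last is the maximum
            "rows": chunk,
        })
    return groups


def rows_scanned(groups, data_file):
    """Count records in row groups whose [min, max] range may contain data_file."""
    return sum(
        len(g["rows"])
        for g in groups
        if g["min_path"] <= data_file <= g["max_path"]
    )


# Hypothetical workload: 100 data files, 1,000 position deletes each,
# sorted by file_path and then pos.
deletes = sorted(
    (f"data-{i:03d}.parquet", pos) for i in range(100) for pos in range(1000)
)

one_group = build_row_groups(deletes, rows_per_group=len(deletes))
small_groups = build_row_groups(deletes, rows_per_group=1000)

target = "data-042.parquet"
print(rows_scanned(one_group, target))     # 100000: the single row group cannot be skipped
print(rows_scanned(small_groups, target))  # 1000: only the matching row group is read
```

   In this idealized case the row-group boundaries line up exactly with the 
data-file boundaries. In practice they would not, so a reader would still 
scan some records belonging to neighboring data files.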
   
   Regarding a reasonable row-group size, I did a simple test: when only 
`file_path` and `pos` are stored, even 10 million position delete records 
use only 14 MB of storage space (thanks to Parquet's dictionary encoding and 
good compression).
   I assume that the number of delete records applied to a single data file 
will typically not exceed 1 million, in which case 1 MB ~ 2 MB seems to be a 
reasonable row-group size. My test and assumption are fairly simple, and real 
scenarios can be more complicated, so I would like to hear suggestions from the 
community.
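   
   The arithmetic behind that estimate can be sketched as follows. It assumes 
the size scales linearly with the number of records, which is a simplification: 
dictionary encoding and compression ratios depend on the actual data.

```python
MB = 1024 * 1024

# Figures from the simple test above.
measured_bytes = 14 * MB
measured_records = 10_000_000

# Average stored size of one (file_path, pos) record.
bytes_per_record = measured_bytes / measured_records


def estimated_size_mb(records):
    """Estimate storage in MB, assuming size scales linearly with record count."""
    return records * bytes_per_record / MB


print(round(bytes_per_record, 2))              # 1.47 bytes per record
print(round(estimated_size_mb(1_000_000), 2))  # 1.4 MB for 1 million records
```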
   
   ### Query engine
   
   Spark


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

