zhongyujiang opened a new pull request, #9177:
URL: https://github.com/apache/iceberg/pull/9177

   This adds a table property 
`write.delete.path-pos-delete.parquet.row-group-size-bytes` to control the 
Parquet row-group size of position delete files.
   
   When a position delete file has only the `file_path` and `pos` columns, even 
one million position delete records occupy only 1-2 MB of storage 
(assuming they all share the same `file_path`):
   ```text
   data-file-row-count: 100000000, pos-delete-row-count: 10000, delete-file-size-bytes: 24617
   data-file-row-count: 100000000, pos-delete-row-count: 100000, delete-file-size-bytes: 209195
   data-file-row-count: 100000000, pos-delete-row-count: 500000, delete-file-size-bytes: 922813
   data-file-row-count: 100000000, pos-delete-row-count: 1000000, delete-file-size-bytes: 1625049
   ```
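
To make the measurements above easier to compare, here is a small sketch that derives the on-disk bytes per delete record and the size of the largest file relative to a 128 MB row group. The figures are copied from the listing; the variable and function names are only illustrative. The comparison is rough, since the numbers are final file sizes after encoding and compression.

```python
# (pos-delete-row-count, delete-file-size-bytes), copied from the listing above.
measurements = [
    (10_000, 24_617),
    (100_000, 209_195),
    (500_000, 922_813),
    (1_000_000, 1_625_049),
]

ROW_GROUP_128_MB = 128 * 1024 * 1024


def bytes_per_record(rows, size_bytes):
    """Average on-disk bytes used by one position delete record."""
    return size_bytes / rows


per_record = [bytes_per_record(rows, size) for rows, size in measurements]

for (rows, size), bpr in zip(measurements, per_record):
    print(f"{rows:>9} deletes -> {size:>9} bytes ({bpr:.2f} bytes/record)")

# Share of a 128 MB row group taken up by the largest measured delete file.
largest_share = measurements[-1][1] / ROW_GROUP_128_MB
print(f"largest file is {largest_share:.1%} of 128 MB")
```

Every measured file costs fewer than 3 bytes per record, so a delete file would need far more records than shown here before a second row group is started.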
   However, the current default Parquet row-group size for position delete 
files is 128 MB, which is far too large for files this small. Because all 
delete records end up in a single row group, a reader must read every position 
delete in the file to find the ones relevant to a given data file.
   
   So this PR adds a property to control the row-group size of position delete 
files. It is used only when the position delete file has just the `file_path` 
and `pos` columns. The default is 2 MB, so that position deletes applying to 
different data files are separated into different row groups as much as 
possible. Readers can then use the row-group filter to skip unrelated position 
delete records.
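
The effect of smaller row groups on reads can be sketched with a toy model. This is not Iceberg or Parquet code: `RowGroup`, `write_row_groups`, and `read_deletes_for` are hypothetical names, the row-group limit is expressed as a row count rather than bytes for simplicity, and the filter is modelled as a min/max check on `file_path`, assuming the records are sorted by `file_path` and then `pos`.

```python
from dataclasses import dataclass


@dataclass
class RowGroup:
    min_path: str  # smallest file_path in the group (stand-in for column statistics)
    max_path: str  # largest file_path in the group
    rows: list     # (file_path, pos) tuples


def write_row_groups(deletes, max_rows_per_group):
    """Sort records by (file_path, pos) and cut them into bounded row groups."""
    ordered = sorted(deletes)
    groups = []
    for start in range(0, len(ordered), max_rows_per_group):
        chunk = ordered[start:start + max_rows_per_group]
        groups.append(RowGroup(chunk[0][0], chunk[-1][0], chunk))
    return groups


def read_deletes_for(groups, data_file):
    """Return (positions deleted in data_file, number of records scanned)."""
    scanned = 0
    positions = []
    for group in groups:
        if not (group.min_path <= data_file <= group.max_path):
            continue  # statistics rule this row group out, so skip it entirely
        scanned += len(group.rows)
        positions.extend(pos for path, pos in group.rows if path == data_file)
    return positions, scanned


# 10 data files with 1000 position deletes each.
deletes = [(f"data-{i:02d}.parquet", pos) for i in range(10) for pos in range(1000)]

one_big = write_row_groups(deletes, max_rows_per_group=len(deletes))
many_small = write_row_groups(deletes, max_rows_per_group=1000)

_, scanned_big = read_deletes_for(one_big, "data-03.parquet")
positions, scanned_small = read_deletes_for(many_small, "data-03.parquet")

print(f"single row group: scanned {scanned_big} records")
print(f"small row groups: scanned {scanned_small} records")
```

With a single row group the reader scans every record in the file. With smaller groups it scans only the groups whose `file_path` range can contain the target data file.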
   
   closes #9149 

