zhongyujiang opened a new pull request, #9177: URL: https://github.com/apache/iceberg/pull/9177
This adds a table property `write.delete.path-pos-delete.parquet.row-group-size-bytes` to control the Parquet row-group size of position delete files.

When a position delete file has only the `file_path` and `pos` columns, even one million position delete records occupy only 1~2 MB of storage (assuming they all share the same `file_path`):

```text
data-file-row-count: 100000000, pos-delete-row-count: 10000, delete-file-size-bytes: 24617
data-file-row-count: 100000000, pos-delete-row-count: 100000, delete-file-size-bytes: 209195
data-file-row-count: 100000000, pos-delete-row-count: 500000, delete-file-size-bytes: 922813
data-file-row-count: 100000000, pos-delete-row-count: 1000000, delete-file-size-bytes: 1625049
```

However, the current default Parquet row-group size for position deletes is 128 MB, which is too large. Because all delete records are written to a single row group, the reader must read every position delete to find the relevant ones.

So this adds a property to control the row-group size of position delete files; it is used when the position deletes have only the `file_path` and `pos` columns. The default is set to 2 MB so that position deletes applying to different data files are separated into different row groups as much as possible, which lets the reader use the row-group filter to skip unrelated position delete records.

closes #9149
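The effect of smaller row groups can be illustrated with a minimal, self-contained sketch. This is a simplified model and not Iceberg or Parquet code: the class `RowGroupSketch`, its methods, and the sample file names are hypothetical, and a row-count cap stands in for the byte-size threshold that the property controls. It assumes delete records are sorted by `file_path` and that a reader can skip any row group whose min/max `file_path` range excludes the data file it is reading.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of row-group filtering on file_path; not Iceberg or Parquet code.
final class RowGroupSketch {

  // Summary of one row group: the min/max file_path statistics a reader can filter on.
  static final class RowGroup {
    final String minPath;
    final String maxPath;
    final int rowCount;

    RowGroup(String minPath, String maxPath, int rowCount) {
      this.minPath = minPath;
      this.maxPath = maxPath;
      this.rowCount = rowCount;
    }
  }

  // Builds delete records for `files` data files, sorted by file_path.
  // Only the file_path of each record is kept, since that is what the filter uses.
  static List<String> sampleDeletes(int files, int deletesPerFile) {
    List<String> paths = new ArrayList<>();
    for (int f = 0; f < files; f++) {
      String path = String.format("data-%03d.parquet", f);
      for (int i = 0; i < deletesPerFile; i++) {
        paths.add(path);
      }
    }
    return paths;
  }

  // Splits sorted records into row groups of at most rowsPerGroup rows.
  // Assumption: a row cap approximates the byte-size limit of a real writer.
  static List<RowGroup> split(List<String> sortedPaths, int rowsPerGroup) {
    if (rowsPerGroup <= 0) {
      throw new IllegalArgumentException("rowsPerGroup must be positive");
    }
    List<RowGroup> groups = new ArrayList<>();
    for (int start = 0; start < sortedPaths.size(); start += rowsPerGroup) {
      int end = Math.min(start + rowsPerGroup, sortedPaths.size());
      groups.add(new RowGroup(sortedPaths.get(start), sortedPaths.get(end - 1), end - start));
    }
    return groups;
  }

  // Counts the rows in row groups whose [min, max] range may contain targetPath.
  // Row groups outside that range are skipped without being read.
  static long rowsScanned(List<RowGroup> groups, String targetPath) {
    long rows = 0;
    for (RowGroup group : groups) {
      boolean mayContain =
          group.minPath.compareTo(targetPath) <= 0 && group.maxPath.compareTo(targetPath) >= 0;
      if (mayContain) {
        rows += group.rowCount;
      }
    }
    return rows;
  }

  public static void main(String[] args) {
    List<String> deletes = sampleDeletes(10, 100);
    String target = "data-003.parquet";
    System.out.println("one large row group: " + rowsScanned(split(deletes, 1000), target));
    System.out.println("small row groups:    " + rowsScanned(split(deletes, 100), target));
  }
}
```

With ten data files and 100 deletes each, the single large row group forces the reader to scan all 1000 records, while 100-row groups let it scan only the 100 records for the target file. In a real file the boundaries rarely align this neatly, so a reader may still scan a neighbouring file's deletes that share a row group.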