pvary opened a new pull request, #8803: URL: https://github.com/apache/iceberg/pull/8803
Based on our discussion on the dev list, I have created this PR, which makes it possible to narrow down the column statistics retained in the `ScanTask`s returned from planning. For reference, the discussion: https://lists.apache.org/thread/pcfpztld5gfpdvm1dy4l84xfl6odxhw8

The PR makes it possible to set `includeColumnStats` for a `Scan`. The resulting `ScanTask`s will contain column statistics only for the requested columnIds, omitting statistics that are present in the metadata files but were not specifically requested by the user.

The PR consists of 3 main parts:

1. Interface changes:
   - `Scan.includeColumnStats` to set the required columnIds
   - `ContentFile.copyWithSpecificStats` to provide an interface for the stat removal when copying the file objects
2. Core changes:
   - Implementation of the `BaseFile` constructor, which takes care of the statistics filtering, and making sure that the other implementations use this method as well.
   - Propagating the `columnStatsToInclude` field through the different scan implementations, and putting it into the `TableScanContext`.
   - Adding a new property to the `ManifestGroup` builder to store the `columnStatsToKeep`. This class is responsible for the final copy of the `DataFiles`, where we remove the statistics that are not needed.
   - Added tests to check that the statistics removal works as expected.
3. Flink changes:
   - Adding a new `FlinkReadOption` to set which column stats we should keep: `column-stats-to-keep`
   - Minimal Flink `ScanContext` and Planner changes to propagate the values
   - Updated the documentation for the Flink Source
   - Added tests to check that the statistics removal works as expected.

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
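For readers who want a concrete picture of the statistics filtering described in the PR, here is a minimal, self-contained sketch of the idea: keep only the per-column entries whose column id was requested. It is not the code from the PR. The class `ColumnStatsFilterSketch` and the helper `filterByColumnId` are hypothetical names made up for illustration, and treating a null id set as "keep everything" is an assumption, not behavior confirmed by the PR.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class ColumnStatsFilterSketch {

  // Hypothetical helper, not part of the Iceberg API.
  // Returns a new map holding only the entries whose column id is in
  // columnIdsToKeep. Assumption: a null stats map is returned as-is, and a
  // null id set means "no filtering requested", so the input map is returned.
  public static <V> Map<Integer, V> filterByColumnId(
      Map<Integer, V> stats, Set<Integer> columnIdsToKeep) {
    if (stats == null || columnIdsToKeep == null) {
      return stats;
    }

    Map<Integer, V> filtered = new HashMap<>();
    for (Map.Entry<Integer, V> entry : stats.entrySet()) {
      if (columnIdsToKeep.contains(entry.getKey())) {
        filtered.put(entry.getKey(), entry.getValue());
      }
    }
    return filtered;
  }

  public static void main(String[] args) {
    // Stand-in for one per-column statistics map (column id -> value count).
    Map<Integer, Long> valueCounts = new HashMap<>();
    valueCounts.put(1, 100L);
    valueCounts.put(2, 100L);
    valueCounts.put(3, 97L);

    // Only the statistics for column ids 1 and 3 were requested.
    Set<Integer> requested = new HashSet<>(Arrays.asList(1, 3));
    Map<Integer, Long> kept = filterByColumnId(valueCounts, requested);

    System.out.println("kept column ids: " + kept.keySet());
    System.out.println("contains column 2: " + kept.containsKey(2));
  }
}
```

In this sketch the same filter would be applied to each per-column statistics map of a file when it is copied, so the copy carries only the requested entries while the original map is left untouched.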
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org