pvary opened a new pull request, #8803:
URL: https://github.com/apache/iceberg/pull/8803

   Based on our discussion on the dev list, I have created this PR, which makes 
it possible to narrow down the column statistics retained in the `ScanTask`s 
returned from planning.
   
   For reference, the discussion: 
https://lists.apache.org/thread/pcfpztld5gfpdvm1dy4l84xfl6odxhw8
   
   The PR makes it possible to set `includeColumnStats` on a `Scan`. The 
resulting `ScanTask`s will contain column statistics only for the specified 
column IDs, omitting statistics that are present in the metadata files but were 
not specifically requested by the user.
   
   The PR consists of 3 main parts:
   1. Interface changes:
      - `Scan.includeColumnStats` to set the required columnIds
      - `ContentFile.copyWithSpecificStats` to provide an interface for the 
stat removal when copying the file objects
   2. Core changes:
      - Implementation of the `BaseFile` constructor which takes care of the 
statistics filtering, and making sure that the other implementations are using 
this method as well.
      - Propagating the `columnStatsToInclude` field through the different scan 
implementations, and putting it into the `TableScanContext`.
      - Adding a new property to the `ManifestGroup` builder to store the 
`columnStatsToKeep`. This class is responsible for the final copy of the 
`DataFiles` where we remove the statistics which are not needed.
      - Added tests to check that the statistics removal is working as expected.
   3. Flink changes:
      - Adding a new `FlinkReadOption` to set which column stats we should 
keep: `column-stats-to-keep`
      - Minimal Flink `ScanContext` and Planner changes to propagate the values
      - Updated the documentation for the Flink Source
      - Added tests to check that the statistics removal is working as expected.
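
To illustrate the idea behind the copy-time filtering described above, here is a minimal, self-contained sketch. It does not use the Iceberg classes: `FileStats`, its fields, and the `filter` helper are hypothetical stand-ins, and only the method name `copyWithSpecificStats` is borrowed from the PR description. The actual signatures in the PR may differ.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

public class StatsFilterSketch {

  // Hypothetical stand-in for a file object carrying per-column statistics.
  // This is NOT Iceberg's BaseFile; it only mirrors the idea.
  static final class FileStats {
    final String path;
    final Map<Integer, Long> valueCounts;     // column id -> value count
    final Map<Integer, Long> nullValueCounts; // column id -> null count

    FileStats(String path, Map<Integer, Long> valueCounts, Map<Integer, Long> nullValueCounts) {
      this.path = path;
      this.valueCounts = valueCounts;
      this.nullValueCounts = nullValueCounts;
    }

    // Returns a copy that keeps statistics only for the requested column ids.
    FileStats copyWithSpecificStats(Set<Integer> columnIds) {
      return new FileStats(
          path, filter(valueCounts, columnIds), filter(nullValueCounts, columnIds));
    }
  }

  // Keeps only the entries whose key is one of the requested column ids.
  static <V> Map<Integer, V> filter(Map<Integer, V> stats, Set<Integer> columnIds) {
    if (stats == null) {
      return null; // statistics may be absent altogether
    }
    Map<Integer, V> kept = new HashMap<>();
    for (Map.Entry<Integer, V> entry : stats.entrySet()) {
      if (columnIds.contains(entry.getKey())) {
        kept.put(entry.getKey(), entry.getValue());
      }
    }
    return kept;
  }

  public static void main(String[] args) {
    Map<Integer, Long> valueCounts = new HashMap<>();
    valueCounts.put(1, 100L);
    valueCounts.put(2, 100L);
    Map<Integer, Long> nullCounts = new HashMap<>();
    nullCounts.put(1, 0L);
    nullCounts.put(2, 7L);

    FileStats full = new FileStats("data/file-a.parquet", valueCounts, nullCounts);
    FileStats slim = full.copyWithSpecificStats(Set.of(2));

    System.out.println(slim.valueCounts);
    System.out.println(slim.nullValueCounts);
  }
}
```

In this sketch the original object is left untouched and the filtering happens only on the copy, which matches the description of the final copy step being the place where unneeded statistics are dropped.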


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

