Re: [PR] Core: Enable column statistics filtering after planning [iceberg]

via GitHub Wed, 18 Oct 2023 09:40:23 -0700


stevenzwu commented on code in PR #8803:
URL: https://github.com/apache/iceberg/pull/8803#discussion_r1364171669



##########
core/src/main/java/org/apache/iceberg/BaseFile.java:
##########
@@ -174,8 +176,9 @@ public PartitionData copy() {
    *
    * @param toCopy a generic data file to copy.
    * @param fullCopy whether to copy all fields or to drop column-level stats
+   * @param statsToKeep the collection of the column ids for the columns which 
stats are kept
    */
-  BaseFile(BaseFile<F> toCopy, boolean fullCopy) {
+  BaseFile(BaseFile<F> toCopy, boolean fullCopy, Collection<Integer> 
statsToKeep) {

Review Comment:
   nit: arg can be `columnsToKeepStats`



##########
api/src/main/java/org/apache/iceberg/Scan.java:
##########
@@ -77,6 +77,20 @@ public interface Scan<ThisT, T extends ScanTask, G extends 
ScanTaskGroup<T>> {
    */
   ThisT includeColumnStats();
 
+  /**
+   * Create a new scan from this that loads the column stats for the specific 
columns with each data
+   * file.
+   *
+   * <p>Column stats include: value count, null value count, lower bounds, and 
upper bounds.
+   *
+   * @param columns column ids from the table's schema
+   * @return a new scan based on this that loads column stats for specific 
columns.
+   */
+  default ThisT includeColumnStats(Collection<Integer> columns) {

Review Comment:
   Set is probably the better type



##########
core/src/main/java/org/apache/iceberg/BaseFile.java:
##########
@@ -186,12 +189,23 @@ public PartitionData copy() {
     this.recordCount = toCopy.recordCount;
     this.fileSizeInBytes = toCopy.fileSizeInBytes;
     if (fullCopy) {
-      this.columnSizes = SerializableMap.copyOf(toCopy.columnSizes);
-      this.valueCounts = SerializableMap.copyOf(toCopy.valueCounts);
-      this.nullValueCounts = SerializableMap.copyOf(toCopy.nullValueCounts);
-      this.nanValueCounts = SerializableMap.copyOf(toCopy.nanValueCounts);
-      this.lowerBounds = 
SerializableByteBufferMap.wrap(SerializableMap.copyOf(toCopy.lowerBounds));
-      this.upperBounds = 
SerializableByteBufferMap.wrap(SerializableMap.copyOf(toCopy.upperBounds));
+      if (statsToKeep == null || statsToKeep.isEmpty()) {

Review Comment:
   to simplify the code, we can probably move the if-else inside the 
`filterColumnStats` method below. so we can always do 
`...copyOf(filterColumnsStats(..., columnsToKeepStats)`



##########
core/src/main/java/org/apache/iceberg/TableScanContext.java:
##########
@@ -59,6 +60,11 @@ public boolean returnColumnStats() {
     return false;
   }
 
+  @Value.Default
+  public Collection<Integer> returnColumnStatsToInclude() {

Review Comment:
   I would call this `columnsToIncludeStats`. return here is not intuitive 
here, while it makes sense for the method above.



##########
core/src/main/java/org/apache/iceberg/BaseScan.java:
##########
@@ -121,6 +121,10 @@ protected boolean shouldReturnColumnStats() {
     return context().returnColumnStats();
   }
 
+  protected Collection<Integer> columnStatsToInclude() {

Review Comment:
   nit: columnsToIncludeStats?



##########
api/src/main/java/org/apache/iceberg/ContentFile.java:
##########
@@ -177,4 +191,26 @@ default Long fileSequenceNumber() {
   default F copy(boolean withStats) {
     return withStats ? copy() : copyWithoutStats();
   }
+
+  /**
+   * Copies this file (potentially with or without specific column stats). 
Manifest readers can
+   * reuse file instances; use this method to copy data when collecting files 
from tasks.
+   *
+   * @param withStats Will copy this file without file stats if set to 
<code>false</code>.
+   * @param statsToKeep Will keep stats only for these columns. Not used if 
<code>withStats</code>
+   *     is set to <code>false</code>.
+   * @return a copy of this data file. If "withStats" is set to 
<code>false</code> the file will not
+   *     contain lower bounds, upper bounds, value counts, null value counts, 
or nan value counts.
+   *     If "withStats" is set to <code>true</code> and the "statsToKeep" is 
not empty then only
+   *     specific column stats will be kept.
+   */
+  default F copy(boolean withStats, Collection<Integer> statsToKeep) {

Review Comment:
   yeah. I think it is more important to keep the interface lean. I am 
wondering if we can move them into a util class like `ContentFileUtil` that is 
similar to `ManifestFileUtil`.



##########
api/src/main/java/org/apache/iceberg/ContentFile.java:
##########
@@ -165,6 +166,19 @@ default Long fileSequenceNumber() {
    */
   F copyWithoutStats();
 
+  /**
+   * Copies this file with only specific column stats. Manifest readers can 
reuse file instances;
+   * use this method to copy data and only copy specific stats when collecting 
files.
+   *
+   * @param statsToKeep the collection of the column ids for the columns which 
stats are kept
+   * @return a copy of this data file, with stats lower bounds, upper bounds, 
value counts, null
+   *     value counts, and nan value counts for only specific columns.
+   */
+  default F copyWithSpecificStats(Collection<Integer> statsToKeep) {

Review Comment:
   nit: `columnsToKeepStats` arg name is probably more accurate.  the type 
probably should be a `Set` to be accurate.
   
   Regarding method name, `Specific` is not conveying much in this context. I 
am wondering if we can simply name it `copyWithStats(Set<Integer> 
columnsToKeepStats)`.



##########
core/src/main/java/org/apache/iceberg/BaseFile.java:
##########
@@ -504,6 +518,40 @@ private static Map<Integer, ByteBuffer> 
toReadableByteBufferMap(Map<Integer, Byt
     }
   }
 
+  private static Map<Integer, Long> filteredLongMap(

Review Comment:
   we probably can combine these two util method into one with generic, like
   ```
   private static <T> Map<Integer, T> filterColumnsStats(Map<Integer, T> stats, 
Set<Integer> columnsToKeepStats)
   ```



##########
core/src/main/java/org/apache/iceberg/TableScanContext.java:
##########
@@ -125,6 +131,16 @@ TableScanContext shouldReturnColumnStats(boolean 
returnColumnStats) {
         .build();
   }
 
+  TableScanContext shouldReturnSpecificColumnStats(Collection<Integer> 
columStatsToInclude) {

Review Comment:
   `should` is typically used for boolean arg. how about just 
`columnsToIncludeStats(Set<Integer> columns)`? similar to the `selectColumns` 
method below.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org

Re: [PR] Core: Enable column statistics filtering after planning [iceberg]

Reply via email to