Re: [PR] Core: Enable column statistics filtering after planning [iceberg]

via GitHub Tue, 24 Oct 2023 09:13:45 -0700


stevenzwu commented on code in PR #8803:
URL: https://github.com/apache/iceberg/pull/8803#discussion_r1370442734



##########
api/src/main/java/org/apache/iceberg/ContentFile.java:
##########
@@ -165,6 +166,20 @@ default Long fileSequenceNumber() {
    */
   F copyWithoutStats();
 
+  /**
+   * Copies this file with only specific column stats. Manifest readers can 
reuse file instances;
+   * use this method to copy data and only copy specific stats when collecting 
files. If the
+   * columnsToKeepStats set is empty or <code>null</code>, then all column 
stats will be kept.
+   *
+   * @param columnsToKeepStats the set of the column ids for the columns which 
stats are kept.
+   * @return a copy of this data file, with stats lower bounds, upper bounds, 
value counts, null
+   *     value counts, and nan value counts for only specific columns.
+   */
+  default F copyWithStats(Set<Integer> columnsToKeepStats) {
+    throw new UnsupportedOperationException(
+        this.getClass().getName() + " doesn't implement 
copyWithSpecificStats");

Review Comment:
   copyWithSpecificStats needs to be updated to `copyWithStats(Set<Integer>)`



##########
core/src/main/java/org/apache/iceberg/BaseFile.java:
##########
@@ -174,8 +177,10 @@ public PartitionData copy() {
    *
    * @param toCopy a generic data file to copy.
    * @param fullCopy whether to copy all fields or to drop column-level stats
+   * @param columnsToKeepStats a set of column ids to keep stats. If empty or 
<code>null</code> then
+   *     every column stat is kept.
    */
-  BaseFile(BaseFile<F> toCopy, boolean fullCopy) {
+  BaseFile(BaseFile<F> toCopy, boolean fullCopy, Set<Integer> 
columnsToKeepStats) {

Review Comment:
   the arg name `fullCopy` is not accurate anymore. maybe rename it to 
`copyStats`?



##########
core/src/main/java/org/apache/iceberg/BaseFile.java:
##########
@@ -504,6 +508,27 @@ private static Map<Integer, ByteBuffer> 
toReadableByteBufferMap(Map<Integer, Byt
     }
   }
 
+  private static <TypeT> Map<Integer, TypeT> filterColumnsStats(
+      Map<Integer, TypeT> map, Set<Integer> columnIds) {
+    if (columnIds == null || columnIds.isEmpty()) {
+      return SerializableMap.copyOf(map);
+    }
+
+    if (map == null) {
+      return null;
+    }
+
+    Map<Integer, TypeT> filtered = 
Maps.newHashMapWithExpectedSize(columnIds.size());

Review Comment:
   I meant `columnIds.stream().map(columnId -> map.get(columnId))...`



##########
core/src/main/java/org/apache/iceberg/BaseFile.java:
##########
@@ -504,6 +508,27 @@ private static Map<Integer, ByteBuffer> 
toReadableByteBufferMap(Map<Integer, Byt
     }
   }
 
+  private static <TypeT> Map<Integer, TypeT> filterColumnsStats(

Review Comment:
   got it. this is good then



##########
core/src/main/java/org/apache/iceberg/BaseFile.java:
##########
@@ -504,6 +508,27 @@ private static Map<Integer, ByteBuffer> 
toReadableByteBufferMap(Map<Integer, Byt
     }
   }
 
+  private static <TypeT> Map<Integer, TypeT> filterColumnsStats(
+      Map<Integer, TypeT> map, Set<Integer> columnIds) {
+    if (columnIds == null || columnIds.isEmpty()) {
+      return SerializableMap.copyOf(map);
+    }
+
+    if (map == null) {

Review Comment:
   ok. I am fine with it.
   
   It is either check the map or the columnIds first. not sure about extra null 
check.



##########
api/src/main/java/org/apache/iceberg/ContentFile.java:
##########
@@ -165,6 +166,20 @@ default Long fileSequenceNumber() {
    */
   F copyWithoutStats();
 
+  /**
+   * Copies this file with only specific column stats. Manifest readers can 
reuse file instances;
+   * use this method to copy data and only copy specific stats when collecting 
files. If the

Review Comment:
   should `If the columnsToKeepStats set is empty ...` be put in the `@param` 
below?



##########
core/src/main/java/org/apache/iceberg/BaseFile.java:
##########
@@ -504,6 +508,27 @@ private static Map<Integer, ByteBuffer> 
toReadableByteBufferMap(Map<Integer, Byt
     }
   }
 
+  private static <TypeT> Map<Integer, TypeT> filterColumnsStats(
+      Map<Integer, TypeT> map, Set<Integer> columnIds) {
+    if (columnIds == null || columnIds.isEmpty()) {
+      return SerializableMap.copyOf(map);
+    }
+
+    if (map == null) {
+      return null;
+    }
+
+    Map<Integer, TypeT> filtered = 
Maps.newHashMapWithExpectedSize(columnIds.size());
+    for (Integer columnId : columnIds) {
+      TypeT value = map.get(columnId);
+      if (value != null) {

Review Comment:
   If there are null values in lower_bound or upper_bound, why do we want to 
filter them out? shouldn't we keep the same behavior when copying stats for 
selected columns.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org

Re: [PR] Core: Enable column statistics filtering after planning [iceberg]

Reply via email to