Re: [PR] Core: Enable column statistics filtering after planning [iceberg]

via GitHub Tue, 07 Nov 2023 18:15:26 -0800


aokolnychyi commented on code in PR #8803:
URL: https://github.com/apache/iceberg/pull/8803#discussion_r1385865215



##########
api/src/main/java/org/apache/iceberg/Scan.java:
##########
@@ -77,6 +77,21 @@ public interface Scan<ThisT, T extends ScanTask, G extends 
ScanTaskGroup<T>> {
    */
   ThisT includeColumnStats();
 
+  /**
+   * Create a new scan from this that loads the column stats for the specific 
columns with each data
+   * file.
+   *
+   * <p>Column stats include: value count, null value count, lower bounds, and 
upper bounds.
+   *
+   * @param requestedColumns column names for which to keep the stats. If 
<code>null</code> then all

Review Comment:
   I am not sure how I feel about supporting `null` here. Let me think.
   Also, won't the current implementation throw an exception if we pass `null`?



##########
core/src/main/java/org/apache/iceberg/BaseFile.java:
##########
@@ -185,13 +190,25 @@ public PartitionData copy() {
     this.partitionType = toCopy.partitionType;
     this.recordCount = toCopy.recordCount;
     this.fileSizeInBytes = toCopy.fileSizeInBytes;
-    if (fullCopy) {
-      this.columnSizes = SerializableMap.copyOf(toCopy.columnSizes);
-      this.valueCounts = SerializableMap.copyOf(toCopy.valueCounts);
-      this.nullValueCounts = SerializableMap.copyOf(toCopy.nullValueCounts);
-      this.nanValueCounts = SerializableMap.copyOf(toCopy.nanValueCounts);
-      this.lowerBounds = 
SerializableByteBufferMap.wrap(SerializableMap.copyOf(toCopy.lowerBounds));
-      this.upperBounds = 
SerializableByteBufferMap.wrap(SerializableMap.copyOf(toCopy.upperBounds));
+    if (copyStats) {

Review Comment:
   I wonder about the performance and generated extra objects given that it is 
such a critical place that's invoked in a tight loop. We can have millions of 
files. Will need to check this with fresh eyes tomorrow.
   
   We should run our JMH benchmarks before merging. If the difference is 
visible, we can offer `copyOf` methods that would copy only selected keys in 
our custom maps.



##########
api/src/main/java/org/apache/iceberg/ContentFile.java:
##########
@@ -165,6 +166,20 @@ default Long fileSequenceNumber() {
    */
   F copyWithoutStats();
 
+  /**
+   * Copies this file with only specific column stats. Manifest readers can 
reuse file instances;
+   * use this method to copy data and only copy specific stats when collecting 
files.
+   *
+   * @param requestedColumnIds column ids for which to keep stats. If 
<code>null</code> then every
+   *     column stat is kept.
+   * @return a copy of this data file, with stats lower bounds, upper bounds, 
value counts, null
+   *     value counts, and nan value counts for only specific columns.
+   */
+  default F copyWithStats(Set<Integer> requestedColumnIds) {

Review Comment:
   Also, if we want to support `null` as a valid value, do we still need 
`ContentFileUtil`?
   
   ```
   copyWithStats(null) -> copy with all stats
   copyWithStats(empty) -> copy without stats
   copyWithStats(someIds) -> copy with some stats
   ```
   
   It seems we need to either support `null` here or have `ContenFileUtil`, is 
there a reason to have both? I am not sure what I prefer more, though.



##########
api/src/main/java/org/apache/iceberg/ContentFile.java:
##########
@@ -165,6 +166,20 @@ default Long fileSequenceNumber() {
    */
   F copyWithoutStats();
 
+  /**
+   * Copies this file with only specific column stats. Manifest readers can 
reuse file instances;
+   * use this method to copy data and only copy specific stats when collecting 
files.
+   *
+   * @param requestedColumnIds column ids for which to keep stats. If 
<code>null</code> then every
+   *     column stat is kept.
+   * @return a copy of this data file, with stats lower bounds, upper bounds, 
value counts, null
+   *     value counts, and nan value counts for only specific columns.
+   */
+  default F copyWithStats(Set<Integer> requestedColumnIds) {

Review Comment:
   I think it is fine to use IDs at this level as `ContenFile` references IDs 
everywhere. I agree with converting column names to IDs in scans.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] Core: Enable column statistics filtering after planning [iceberg]

Reply via email to