aokolnychyi commented on code in PR #8803: URL: https://github.com/apache/iceberg/pull/8803#discussion_r1385865215
########## api/src/main/java/org/apache/iceberg/Scan.java: ########## @@ -77,6 +77,21 @@ public interface Scan<ThisT, T extends ScanTask, G extends ScanTaskGroup<T>> { */ ThisT includeColumnStats(); + /** + * Create a new scan from this that loads the column stats for the specific columns with each data + * file. + * + * <p>Column stats include: value count, null value count, lower bounds, and upper bounds. + * + * @param requestedColumns column names for which to keep the stats. If <code>null</code> then all Review Comment: I am not sure how I feel about supporting `null` here. Let me think. Also, won't the current implementation throw an exception if we pass `null`? ########## core/src/main/java/org/apache/iceberg/BaseFile.java: ########## @@ -185,13 +190,25 @@ public PartitionData copy() { this.partitionType = toCopy.partitionType; this.recordCount = toCopy.recordCount; this.fileSizeInBytes = toCopy.fileSizeInBytes; - if (fullCopy) { - this.columnSizes = SerializableMap.copyOf(toCopy.columnSizes); - this.valueCounts = SerializableMap.copyOf(toCopy.valueCounts); - this.nullValueCounts = SerializableMap.copyOf(toCopy.nullValueCounts); - this.nanValueCounts = SerializableMap.copyOf(toCopy.nanValueCounts); - this.lowerBounds = SerializableByteBufferMap.wrap(SerializableMap.copyOf(toCopy.lowerBounds)); - this.upperBounds = SerializableByteBufferMap.wrap(SerializableMap.copyOf(toCopy.upperBounds)); + if (copyStats) { Review Comment: I wonder about the performance and generated extra objects given that it is such a critical place that's invoked in a tight loop. We can have millions of files. Will need to check this with fresh eyes tomorrow. We should run our JMH benchmarks before merging. If the difference is visible, we can offer `copyOf` methods that would copy only selected keys in our custom maps. ########## api/src/main/java/org/apache/iceberg/ContentFile.java: ########## @@ -165,6 +166,20 @@ default Long fileSequenceNumber() { */ F copyWithoutStats(); + /** + * Copies this file with only specific column stats. Manifest readers can reuse file instances; + * use this method to copy data and only copy specific stats when collecting files. + * + * @param requestedColumnIds column ids for which to keep stats. If <code>null</code> then every + * column stat is kept. + * @return a copy of this data file, with stats lower bounds, upper bounds, value counts, null + * value counts, and nan value counts for only specific columns. + */ + default F copyWithStats(Set<Integer> requestedColumnIds) { Review Comment: Also, if we want to support `null` as a valid value, do we still need `ContentFileUtil`? ``` copyWithStats(null) -> copy with all stats copyWithStats(empty) -> copy without stats copyWithStats(someIds) -> copy with some stats ``` It seems we need to either support `null` here or have `ContenFileUtil`, is there a reason to have both? I am not sure what I prefer more, though. ########## api/src/main/java/org/apache/iceberg/ContentFile.java: ########## @@ -165,6 +166,20 @@ default Long fileSequenceNumber() { */ F copyWithoutStats(); + /** + * Copies this file with only specific column stats. Manifest readers can reuse file instances; + * use this method to copy data and only copy specific stats when collecting files. + * + * @param requestedColumnIds column ids for which to keep stats. If <code>null</code> then every + * column stat is kept. + * @return a copy of this data file, with stats lower bounds, upper bounds, value counts, null + * value counts, and nan value counts for only specific columns. + */ + default F copyWithStats(Set<Integer> requestedColumnIds) { Review Comment: I think it is fine to use IDs at this level as `ContenFile` references IDs everywhere. I agree with converting column names to IDs in scans. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org