aokolnychyi commented on code in PR #8803: URL: https://github.com/apache/iceberg/pull/8803#discussion_r1388756877
########## api/src/main/java/org/apache/iceberg/ContentFile.java: ########## @@ -165,6 +166,20 @@ default Long fileSequenceNumber() { */ F copyWithoutStats(); + /** + * Copies this file with only specific column stats. Manifest readers can reuse file instances; Review Comment: Optional: I think `with only specific column stats` may be interpreted as with some types of metrics, rather than with all types of metrics but for specific columns. I'd consider something like this but it is up to you. ``` Copies this file with column stats only for specific columns. Manifest readers can reuse file instances; use this method to copy data with stats only for specific columns when collecting files. ``` ########## api/src/main/java/org/apache/iceberg/ContentFile.java: ########## @@ -165,6 +166,20 @@ default Long fileSequenceNumber() { */ F copyWithoutStats(); + /** + * Copies this file with only specific column stats. Manifest readers can reuse file instances; + * use this method to copy data and only copy specific stats when collecting files. + * + * @param requestedColumnIds column ids for which to keep stats. If <code>null</code> then every Review Comment: Minor: `column ids` -> `column IDs` to follow the existing Javadoc in this class. ########## core/src/main/java/org/apache/iceberg/BaseScan.java: ########## @@ -165,6 +170,19 @@ public ThisT includeColumnStats() { return newRefinedScan(table, schema, context.shouldReturnColumnStats(true)); } + @Override + public ThisT includeColumnStats(Collection<String> requestedColumns) { + return newRefinedScan( + table, + schema, + context + .shouldReturnColumnStats(true) Review Comment: Supporting both `shouldReturnColumnStats()` and `columnsToKeepStats()` will only be needed if we decide to go with the utility for copying. If `null` is a valid value in `ContentFile$copyWithStats`, I think having just a set of field IDs to keep stats for would be sufficient. ########## core/src/main/java/org/apache/iceberg/BaseFile.java: ########## @@ -185,13 +190,25 @@ public PartitionData copy() { this.partitionType = toCopy.partitionType; this.recordCount = toCopy.recordCount; this.fileSizeInBytes = toCopy.fileSizeInBytes; - if (fullCopy) { - this.columnSizes = SerializableMap.copyOf(toCopy.columnSizes); - this.valueCounts = SerializableMap.copyOf(toCopy.valueCounts); - this.nullValueCounts = SerializableMap.copyOf(toCopy.nullValueCounts); - this.nanValueCounts = SerializableMap.copyOf(toCopy.nanValueCounts); - this.lowerBounds = SerializableByteBufferMap.wrap(SerializableMap.copyOf(toCopy.lowerBounds)); - this.upperBounds = SerializableByteBufferMap.wrap(SerializableMap.copyOf(toCopy.upperBounds)); + if (copyStats) { Review Comment: In my view, this piece should not use streams, closures and must do as few extra copies as possible. It is called in a tight loop and our previous profiling showed that all this cost adds up at scale. I would suggest extending `SerializableMap` with `filteredCopyOf`. ``` public static <K, V> SerializableMap<K, V> filteredCopyOf(Map<K, V> map, Set<K> keys) { return map == null ? null : new SerializableMap<>(map, keys); } private SerializableMap(Map<K, V> map, Set<K> keys) { Map<K, V> filteredMap = Maps.newHashMapWithExpectedSize(keys.size()); for (K key : keys) { V value = map.get(key); if (value != null) { filteredMap.put(key, value); } } this.copiedMap = filteredMap; } ``` Then we can define some extra helper methods in `BaseFile` to have reasonable formatting. ``` private static <K, V> Map<K, V> copyMap(Map<K, V> map, Set<K> keys) { if (keys == null) { return SerializableMap.copyOf(map); } else { return SerializableMap.filteredCopyOf(map, keys); } } private static Map<Integer, ByteBuffer> copyByteBufferMap( Map<Integer, ByteBuffer> map, Set<Integer> keys) { return SerializableByteBufferMap.wrap(copyMap(map, keys)); } ``` I understand we are trying to be generic here but I think the performance is critical. ########## api/src/main/java/org/apache/iceberg/ContentFile.java: ########## @@ -165,6 +166,20 @@ default Long fileSequenceNumber() { */ F copyWithoutStats(); + /** + * Copies this file with only specific column stats. Manifest readers can reuse file instances; + * use this method to copy data and only copy specific stats when collecting files. + * + * @param requestedColumnIds column ids for which to keep stats. If <code>null</code> then every + * column stat is kept. + * @return a copy of this data file, with stats lower bounds, upper bounds, value counts, null Review Comment: Minor: `a copy of this data file` -> `a copy of this file`, it is not necessary a data file anymore. We are gradually fixing the docs. ########## api/src/main/java/org/apache/iceberg/Scan.java: ########## @@ -77,6 +77,21 @@ public interface Scan<ThisT, T extends ScanTask, G extends ScanTaskGroup<T>> { */ ThisT includeColumnStats(); + /** + * Create a new scan from this that loads the column stats for the specific columns with each data + * file. + * + * <p>Column stats include: value count, null value count, lower bounds, and upper bounds. + * + * @param requestedColumns column names for which to keep the stats. If <code>null</code> then all Review Comment: I think we should not support null here as a valid value independently of what we will do in `ContentFile`. I would drop the doc. ########## api/src/main/java/org/apache/iceberg/ContentFile.java: ########## @@ -165,6 +166,20 @@ default Long fileSequenceNumber() { */ F copyWithoutStats(); + /** + * Copies this file with only specific column stats. Manifest readers can reuse file instances; + * use this method to copy data and only copy specific stats when collecting files. + * + * @param requestedColumnIds column ids for which to keep stats. If <code>null</code> then every + * column stat is kept. + * @return a copy of this data file, with stats lower bounds, upper bounds, value counts, null Review Comment: Question: Should `with stats lower bounds, ...` be `with lower bounds, ...`? ########## api/src/main/java/org/apache/iceberg/ContentFile.java: ########## @@ -165,6 +166,20 @@ default Long fileSequenceNumber() { */ F copyWithoutStats(); + /** + * Copies this file with only specific column stats. Manifest readers can reuse file instances; + * use this method to copy data and only copy specific stats when collecting files. + * + * @param requestedColumnIds column ids for which to keep stats. If <code>null</code> then every + * column stat is kept. + * @return a copy of this data file, with stats lower bounds, upper bounds, value counts, null + * value counts, and nan value counts for only specific columns. + */ + default F copyWithStats(Set<Integer> requestedColumnIds) { Review Comment: I spend a bit more time on this one. I do think we either drop support for `null` as a valid value here or drop the utility to copy. It is up to you which approach to pick. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org