stevenzwu commented on code in PR #8803: URL: https://github.com/apache/iceberg/pull/8803#discussion_r1364171669
########## core/src/main/java/org/apache/iceberg/BaseFile.java: ########## @@ -174,8 +176,9 @@ public PartitionData copy() { * * @param toCopy a generic data file to copy. * @param fullCopy whether to copy all fields or to drop column-level stats + * @param statsToKeep the collection of the column ids for the columns which stats are kept */ - BaseFile(BaseFile<F> toCopy, boolean fullCopy) { + BaseFile(BaseFile<F> toCopy, boolean fullCopy, Collection<Integer> statsToKeep) { Review Comment: nit: arg can be `columnsToKeepStats` ########## api/src/main/java/org/apache/iceberg/Scan.java: ########## @@ -77,6 +77,20 @@ public interface Scan<ThisT, T extends ScanTask, G extends ScanTaskGroup<T>> { */ ThisT includeColumnStats(); + /** + * Create a new scan from this that loads the column stats for the specific columns with each data + * file. + * + * <p>Column stats include: value count, null value count, lower bounds, and upper bounds. + * + * @param columns column ids from the table's schema + * @return a new scan based on this that loads column stats for specific columns. + */ + default ThisT includeColumnStats(Collection<Integer> columns) { Review Comment: Set is probably the better type ########## core/src/main/java/org/apache/iceberg/BaseFile.java: ########## @@ -186,12 +189,23 @@ public PartitionData copy() { this.recordCount = toCopy.recordCount; this.fileSizeInBytes = toCopy.fileSizeInBytes; if (fullCopy) { - this.columnSizes = SerializableMap.copyOf(toCopy.columnSizes); - this.valueCounts = SerializableMap.copyOf(toCopy.valueCounts); - this.nullValueCounts = SerializableMap.copyOf(toCopy.nullValueCounts); - this.nanValueCounts = SerializableMap.copyOf(toCopy.nanValueCounts); - this.lowerBounds = SerializableByteBufferMap.wrap(SerializableMap.copyOf(toCopy.lowerBounds)); - this.upperBounds = SerializableByteBufferMap.wrap(SerializableMap.copyOf(toCopy.upperBounds)); + if (statsToKeep == null || statsToKeep.isEmpty()) { Review Comment: to simplify the code, we can probably move the if-else inside the `filterColumnStats` method below. so we can always do `...copyOf(filterColumnsStats(..., columnsToKeepStats)` ########## core/src/main/java/org/apache/iceberg/TableScanContext.java: ########## @@ -59,6 +60,11 @@ public boolean returnColumnStats() { return false; } + @Value.Default + public Collection<Integer> returnColumnStatsToInclude() { Review Comment: I would call this `columnsToIncludeStats`. return here is not intuitive here, while it makes sense for the method above. ########## core/src/main/java/org/apache/iceberg/BaseScan.java: ########## @@ -121,6 +121,10 @@ protected boolean shouldReturnColumnStats() { return context().returnColumnStats(); } + protected Collection<Integer> columnStatsToInclude() { Review Comment: nit: columnsToIncludeStats? ########## api/src/main/java/org/apache/iceberg/ContentFile.java: ########## @@ -177,4 +191,26 @@ default Long fileSequenceNumber() { default F copy(boolean withStats) { return withStats ? copy() : copyWithoutStats(); } + + /** + * Copies this file (potentially with or without specific column stats). Manifest readers can + * reuse file instances; use this method to copy data when collecting files from tasks. + * + * @param withStats Will copy this file without file stats if set to <code>false</code>. + * @param statsToKeep Will keep stats only for these columns. Not used if <code>withStats</code> + * is set to <code>false</code>. + * @return a copy of this data file. If "withStats" is set to <code>false</code> the file will not + * contain lower bounds, upper bounds, value counts, null value counts, or nan value counts. + * If "withStats" is set to <code>true</code> and the "statsToKeep" is not empty then only + * specific column stats will be kept. + */ + default F copy(boolean withStats, Collection<Integer> statsToKeep) { Review Comment: yeah. I think it is more important to keep the interface lean. I am wondering if we can move them into a util class like `ContentFileUtil` that is similar to `ManifestFileUtil`. ########## api/src/main/java/org/apache/iceberg/ContentFile.java: ########## @@ -165,6 +166,19 @@ default Long fileSequenceNumber() { */ F copyWithoutStats(); + /** + * Copies this file with only specific column stats. Manifest readers can reuse file instances; + * use this method to copy data and only copy specific stats when collecting files. + * + * @param statsToKeep the collection of the column ids for the columns which stats are kept + * @return a copy of this data file, with stats lower bounds, upper bounds, value counts, null + * value counts, and nan value counts for only specific columns. + */ + default F copyWithSpecificStats(Collection<Integer> statsToKeep) { Review Comment: nit: `columnsToKeepStats` arg name is probably more accurate. the type probably should be a `Set` to be accurate. Regarding method name, `Specific` is not conveying much in this context. I am wondering if we can simply name it `copyWithStats(Set<Integer> columnsToKeepStats)`. ########## core/src/main/java/org/apache/iceberg/BaseFile.java: ########## @@ -504,6 +518,40 @@ private static Map<Integer, ByteBuffer> toReadableByteBufferMap(Map<Integer, Byt } } + private static Map<Integer, Long> filteredLongMap( Review Comment: we probably can combine these two util method into one with generic, like ``` private static <T> Map<Integer, T> filterColumnsStats(Map<Integer, T> stats, Set<Integer> columnsToKeepStats) ``` ########## core/src/main/java/org/apache/iceberg/TableScanContext.java: ########## @@ -125,6 +131,16 @@ TableScanContext shouldReturnColumnStats(boolean returnColumnStats) { .build(); } + TableScanContext shouldReturnSpecificColumnStats(Collection<Integer> columStatsToInclude) { Review Comment: `should` is typically used for boolean arg. how about just `columnsToIncludeStats(Set<Integer> columns)`? similar to the `selectColumns` method below. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org