stevenzwu commented on code in PR #8803: URL: https://github.com/apache/iceberg/pull/8803#discussion_r1370442734
########## api/src/main/java/org/apache/iceberg/ContentFile.java: ########## @@ -165,6 +166,20 @@ default Long fileSequenceNumber() { */ F copyWithoutStats(); + /** + * Copies this file with only specific column stats. Manifest readers can reuse file instances; + * use this method to copy data and only copy specific stats when collecting files. If the + * columnsToKeepStats set is empty or <code>null</code>, then all column stats will be kept. + * + * @param columnsToKeepStats the set of the column ids for the columns which stats are kept. + * @return a copy of this data file, with stats lower bounds, upper bounds, value counts, null + * value counts, and nan value counts for only specific columns. + */ + default F copyWithStats(Set<Integer> columnsToKeepStats) { + throw new UnsupportedOperationException( + this.getClass().getName() + " doesn't implement copyWithSpecificStats"); Review Comment: copyWithSpecificStats needs to be updated to `copyWithStats(Set<Integer>)` ########## core/src/main/java/org/apache/iceberg/BaseFile.java: ########## @@ -174,8 +177,10 @@ public PartitionData copy() { * * @param toCopy a generic data file to copy. * @param fullCopy whether to copy all fields or to drop column-level stats + * @param columnsToKeepStats a set of column ids to keep stats. If empty or <code>null</code> then + * every column stat is kept. */ - BaseFile(BaseFile<F> toCopy, boolean fullCopy) { + BaseFile(BaseFile<F> toCopy, boolean fullCopy, Set<Integer> columnsToKeepStats) { Review Comment: the arg name `fullCopy` is not accurate anymore. maybe rename it to `copyStats`? ########## core/src/main/java/org/apache/iceberg/BaseFile.java: ########## @@ -504,6 +508,27 @@ private static Map<Integer, ByteBuffer> toReadableByteBufferMap(Map<Integer, Byt } } + private static <TypeT> Map<Integer, TypeT> filterColumnsStats( + Map<Integer, TypeT> map, Set<Integer> columnIds) { + if (columnIds == null || columnIds.isEmpty()) { + return SerializableMap.copyOf(map); + } + + if (map == null) { + return null; + } + + Map<Integer, TypeT> filtered = Maps.newHashMapWithExpectedSize(columnIds.size()); Review Comment: I meant `columnIds.stream().map(columnId -> map.get(columnId))...` ########## core/src/main/java/org/apache/iceberg/BaseFile.java: ########## @@ -504,6 +508,27 @@ private static Map<Integer, ByteBuffer> toReadableByteBufferMap(Map<Integer, Byt } } + private static <TypeT> Map<Integer, TypeT> filterColumnsStats( Review Comment: got it. this is good then ########## core/src/main/java/org/apache/iceberg/BaseFile.java: ########## @@ -504,6 +508,27 @@ private static Map<Integer, ByteBuffer> toReadableByteBufferMap(Map<Integer, Byt } } + private static <TypeT> Map<Integer, TypeT> filterColumnsStats( + Map<Integer, TypeT> map, Set<Integer> columnIds) { + if (columnIds == null || columnIds.isEmpty()) { + return SerializableMap.copyOf(map); + } + + if (map == null) { Review Comment: ok. I am fine with it. It is either check the map or the columnIds first. not sure about extra null check. ########## api/src/main/java/org/apache/iceberg/ContentFile.java: ########## @@ -165,6 +166,20 @@ default Long fileSequenceNumber() { */ F copyWithoutStats(); + /** + * Copies this file with only specific column stats. Manifest readers can reuse file instances; + * use this method to copy data and only copy specific stats when collecting files. If the Review Comment: should `If the columnsToKeepStats set is empty ...` be put in the `@param` below? ########## core/src/main/java/org/apache/iceberg/BaseFile.java: ########## @@ -504,6 +508,27 @@ private static Map<Integer, ByteBuffer> toReadableByteBufferMap(Map<Integer, Byt } } + private static <TypeT> Map<Integer, TypeT> filterColumnsStats( + Map<Integer, TypeT> map, Set<Integer> columnIds) { + if (columnIds == null || columnIds.isEmpty()) { + return SerializableMap.copyOf(map); + } + + if (map == null) { + return null; + } + + Map<Integer, TypeT> filtered = Maps.newHashMapWithExpectedSize(columnIds.size()); + for (Integer columnId : columnIds) { + TypeT value = map.get(columnId); + if (value != null) { Review Comment: If there are null values in lower_bound or upper_bound, why do we want to filter them out? shouldn't we keep the same behavior when copying stats for selected columns. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org