aokolnychyi commented on code in PR #8502:
URL: https://github.com/apache/iceberg/pull/8502#discussion_r1427180249


##########
core/src/main/java/org/apache/iceberg/TableMetadata.java:
##########
@@ -946,15 +955,25 @@ private Builder(TableMetadata base) {
       this.previousFileLocation = base.metadataFileLocation;
       this.previousFiles = base.previousFiles;
       this.refs = Maps.newHashMap(base.refs);
-      this.statisticsFiles =

Review Comment:
   @ajantha-bhat, just to make sure I understand. We try to replace the 
partition stats for each snapshot but it is not required by the spec so it is 
technically possible to have multiple files for one snapshot?



##########
core/src/main/java/org/apache/iceberg/TableMetadata.java:
##########
@@ -1288,6 +1307,24 @@ public Builder suppressHistoricalSnapshots() {
       return this;
     }
 
+    public Builder setPartitionStatistics(PartitionStatisticsFile 
partitionStatisticsFile) {
+      Preconditions.checkNotNull(partitionStatisticsFile, "partition 
statistics file is null");
+      partitionStatisticsFiles.put(
+          partitionStatisticsFile.snapshotId(), 
ImmutableList.of(partitionStatisticsFile));
+      changes.add(new 
MetadataUpdate.SetPartitionStatistics(partitionStatisticsFile));
+      return this;
+    }
+
+    public Builder removePartitionStatistics(long snapshotId) {
+      Preconditions.checkNotNull(snapshotId, "snapshotId is null");

Review Comment:
   How can it be null if it is primitive?



##########
core/src/main/java/org/apache/iceberg/TableMetadataParser.java:
##########
@@ -561,4 +576,21 @@ private static List<StatisticsFile> 
statisticsFilesFromJson(JsonNode statisticsF
 
     return statisticsFilesBuilder.build();
   }
+
+  private static List<PartitionStatisticsFile> 
partitionStatisticsFilesFromJson(
+      JsonNode partitionStatisticsFilesList) {

Review Comment:
   I don't mind using longer variables but I wonder whether a shorter version 
and staying on one line in some places would be more readable in this method.



##########
core/src/main/java/org/apache/iceberg/SetPartitionStatistics.java:
##########
@@ -0,0 +1,68 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.iceberg;
+
+import java.util.List;
+import java.util.Set;
+import org.apache.iceberg.relocated.com.google.common.base.Preconditions;
+import org.apache.iceberg.relocated.com.google.common.collect.Sets;
+
+public class SetPartitionStatistics implements UpdatePartitionStatistics {
+  private final TableOperations ops;
+  private final Set<PartitionStatisticsFile> partitionStatisticsToSet = 
Sets.newHashSet();
+  private final Set<Long> partitionStatisticsToRemove = Sets.newHashSet();
+
+  public SetPartitionStatistics(TableOperations ops) {
+    this.ops = ops;
+  }
+
+  @Override
+  public UpdatePartitionStatistics setPartitionStatistics(
+      PartitionStatisticsFile partitionStatisticsFile) {

Review Comment:
   I feel the context in this class is pretty clear and we can call it `stats`, 
`partitionStats` or `file` to shorten the lines. Then we can also call other 
variables `partitionStatsToSet` or `statsToSet`.



##########
core/src/main/java/org/apache/iceberg/SetPartitionStatistics.java:
##########
@@ -0,0 +1,68 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.iceberg;
+
+import java.util.List;
+import java.util.Set;
+import org.apache.iceberg.relocated.com.google.common.base.Preconditions;
+import org.apache.iceberg.relocated.com.google.common.collect.Sets;
+
+public class SetPartitionStatistics implements UpdatePartitionStatistics {
+  private final TableOperations ops;
+  private final Set<PartitionStatisticsFile> partitionStatisticsToSet = 
Sets.newHashSet();

Review Comment:
   What about using `Map<Long, PartitionStatisticsFile>` to make sure the files 
are overridden if multiple files are passed for the same snapshot? We can then 
simply call `statsToSet.values().forEach(...)` below?



##########
api/src/main/java/org/apache/iceberg/PartitionStatisticsFile.java:
##########
@@ -0,0 +1,38 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.iceberg;
+
+import java.io.Serializable;
+
+/**
+ * Represents a partition statistics file that can be used to read table data 
more efficiently.
+ *
+ * <p>Statistics are informational. A reader can choose to ignore statistics 
information. Statistics
+ * support is not required to read the table correctly.
+ */
+public interface PartitionStatisticsFile extends Serializable {

Review Comment:
   Are we sure it is a good idea to make `PartitionStatisticsFile` 
serializable? I would probably not do that unless there is a good reason right 
now. None of our existing files are serializable by contract (they may be in 
practice but not by API).



##########
core/src/main/java/org/apache/iceberg/TableMetadata.java:
##########
@@ -946,15 +955,25 @@ private Builder(TableMetadata base) {
       this.previousFileLocation = base.metadataFileLocation;
       this.previousFiles = base.previousFiles;
       this.refs = Maps.newHashMap(base.refs);
-      this.statisticsFiles =
-          
base.statisticsFiles.stream().collect(Collectors.groupingBy(StatisticsFile::snapshotId));
-
+      this.statisticsFiles = statsFileBySnapshotID(base);
+      this.partitionStatisticsFiles = partitionStatsFileBySnapshotID(base);
       this.snapshotsById = Maps.newHashMap(base.snapshotsById);
       this.schemasById = Maps.newHashMap(base.schemasById);
       this.specsById = Maps.newHashMap(base.specsById);
       this.sortOrdersById = Maps.newHashMap(base.sortOrdersById);
     }
 
+    private static Map<Long, List<StatisticsFile>> 
statsFileBySnapshotID(TableMetadata base) {

Review Comment:
   Question: Is there a pattern in this class to have static methods at the 
end? If so, can we put these methods together with other static methods? If 
not, it is OK to keep them here.



##########
core/src/main/java/org/apache/iceberg/TableMetadata.java:
##########
@@ -946,15 +955,25 @@ private Builder(TableMetadata base) {
       this.previousFileLocation = base.metadataFileLocation;
       this.previousFiles = base.previousFiles;
       this.refs = Maps.newHashMap(base.refs);
-      this.statisticsFiles =
-          
base.statisticsFiles.stream().collect(Collectors.groupingBy(StatisticsFile::snapshotId));
-
+      this.statisticsFiles = statsFileBySnapshotID(base);
+      this.partitionStatisticsFiles = partitionStatsFileBySnapshotID(base);
       this.snapshotsById = Maps.newHashMap(base.snapshotsById);
       this.schemasById = Maps.newHashMap(base.schemasById);
       this.specsById = Maps.newHashMap(base.specsById);
       this.sortOrdersById = Maps.newHashMap(base.sortOrdersById);
     }
 
+    private static Map<Long, List<StatisticsFile>> 
statsFileBySnapshotID(TableMetadata base) {
+      return base.statisticsFiles.stream()
+          .collect(Collectors.groupingBy(StatisticsFile::snapshotId));
+    }
+
+    private static Map<Long, List<PartitionStatisticsFile>> 
partitionStatsFileBySnapshotID(

Review Comment:
   I think it is common for this class to call such methods `indexSmth` and 
pass actual elements. If so, what about something like below? We can use 
shorter variables to stay on line but that's totally optional. Up to you, 
method names are examples.
   
   ```
   private static Map<Long, List<StatisticsFile>> 
indexStatistics(List<StatisticsFile> files) {
     return 
files.stream().collect(Collectors.groupingBy(StatisticsFile::snapshotId));
   }
   
   private static Map<Long, List<PartitionStatisticsFile>> 
indexPartitionStatistics(
       List<PartitionStatisticsFile> files) {
     return 
files.stream().collect(Collectors.groupingBy(PartitionStatisticsFile::snapshotId));
   }
   ```



##########
core/src/main/java/org/apache/iceberg/TableMetadata.java:
##########
@@ -1288,6 +1307,24 @@ public Builder suppressHistoricalSnapshots() {
       return this;
     }
 
+    public Builder setPartitionStatistics(PartitionStatisticsFile 
partitionStatisticsFile) {

Review Comment:
   Question: I wonder whether using just `file` or `partitionStats` to shorten 
the lines would make sense.
   Completely up to you.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org

Reply via email to