aokolnychyi commented on code in PR #8502:
URL: https://github.com/apache/iceberg/pull/8502#discussion_r1412672873


##########
.palantir/revapi.yml:
##########
@@ -873,6 +873,10 @@ acceptedBreaks:
       new: "method void org.apache.iceberg.encryption.Ciphers::<init>()"
       justification: "Static utility class - should not have public 
constructor"
   "1.4.0":
+    org.apache.iceberg:iceberg-api:
+    - code: "java.method.addedToInterface"

Review Comment:
   I know we did a similar breaking change when adding table stats. I wonder 
whether that was the correct decision, however. Why not return an empty list by 
default?



##########
api/src/main/java/org/apache/iceberg/PartitionStatisticsFile.java:
##########
@@ -0,0 +1,41 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.iceberg;
+
+import java.io.Serializable;
+
+/**
+ * Represents a partition statistics file in the table default format, that 
can be used to read

Review Comment:
   I wonder whether `in the table default format` will always be accurate. What 
about just this?
   
   ```
   Represents a partition statistics file that can be used to read table data 
more efficiently.
   ```



##########
core/src/main/java/org/apache/iceberg/SetPartitionStatistics.java:
##########
@@ -0,0 +1,74 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.iceberg;
+
+import java.util.List;
+import java.util.Map;
+import java.util.Optional;
+import org.apache.iceberg.relocated.com.google.common.base.Preconditions;
+import org.apache.iceberg.relocated.com.google.common.collect.Maps;
+
+public class SetPartitionStatistics implements UpdatePartitionStatistics {
+  private final TableOperations ops;
+  private final Map<Long, Optional<PartitionStatisticsFile>> 
partitionStatisticsToSet =

Review Comment:
   I am not sure how I feel about using a map and giving `Optional` a special 
meaning. Would having a map of stats to set and a separate set of snapshot IDs 
to remove easier to read/understand? What do you think, @ajantha-bhat?
   
   ```
   private TableMetadata internalApply(TableMetadata base) {
     TableMetadata.Builder builder = TableMetadata.buildFrom(base);
     toSet.forEach(builder::setPartitionStatistics);
     toRemove.forEach(builder::removePartitionStatistics);
     return builder.build();
   }
   ```



##########
api/src/main/java/org/apache/iceberg/PartitionStatisticsFile.java:
##########
@@ -0,0 +1,41 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.iceberg;
+
+import java.io.Serializable;
+
+/**
+ * Represents a partition statistics file in the table default format, that 
can be used to read
+ * table data more efficiently.
+ *
+ * <p>Statistics are informational. A reader can choose to ignore statistics 
information. Statistics
+ * support is not required to read the table correctly.
+ */
+public interface PartitionStatisticsFile extends Serializable {
+  /** ID of the Iceberg table's snapshot the partition statistics file is 
associated with. */
+  long snapshotId();
+
+  /**
+   * Returns fully qualified path to the file, suitable for constructing a 
Hadoop Path. Never null.

Review Comment:
   I wouldn't encourage using Hadoop `Path` given that we have our own `FileIO` 
and `InputFile`.
   What about dropping `suitable for constructing a Hadoop Path`?



##########
core/src/main/java/org/apache/iceberg/TableMetadata.java:
##########
@@ -948,6 +957,9 @@ private Builder(TableMetadata base) {
       this.refs = Maps.newHashMap(base.refs);
       this.statisticsFiles =
           
base.statisticsFiles.stream().collect(Collectors.groupingBy(StatisticsFile::snapshotId));
+      this.partitionStatisticsFiles =
+          base.partitionStatisticsFiles.stream()

Review Comment:
   I wonder whether a helper method would make it easier to read.



##########
core/src/test/resources/TableMetadataPartitionStatisticsFiles.json:
##########
@@ -0,0 +1,61 @@
+{
+  "format-version": 2,
+  "table-uuid": "9c12d441-03fe-4693-9a96-a0705ddf69c1",
+  "location": "s3://bucket/test/location",
+  "last-sequence-number": 34,
+  "last-updated-ms": 1602638573590,
+  "last-column-id": 3,
+  "current-schema-id": 0,
+  "schemas": [
+    {
+      "type": "struct",
+      "schema-id": 0,
+      "fields": [
+        {
+          "id": 1,
+          "name": "x",
+          "required": true,
+          "type": "long"
+        }
+      ]
+    }
+  ],
+  "default-spec-id": 0,
+  "partition-specs": [
+    {
+      "spec-id": 0,
+      "fields": []
+    }
+  ],
+  "last-partition-id": 1000,
+  "default-sort-order-id": 0,
+  "sort-orders": [
+    {
+      "order-id": 0,
+      "fields": []
+    }
+  ],
+  "properties": {},
+  "current-snapshot-id": 3055729675574597004,
+  "snapshots": [
+    {
+      "snapshot-id": 3055729675574597004,
+      "timestamp-ms": 1555100955770,
+      "sequence-number": 1,
+      "summary": {
+        "operation": "append"
+      },
+      "manifest-list": "s3://a/b/2.avro",
+      "schema-id": 0
+    }
+  ],
+  "partition-statistics": [
+    {
+      "snapshot-id": 3055729675574597004,
+      "statistics-path": "s3://a/b/partition-stats.parquet",
+      "file-size-in-bytes": 43
+    }
+  ],
+  "snapshot-log": [],
+  "metadata-log": []
+}

Review Comment:
   Missing empty line?



##########
core/src/main/java/org/apache/iceberg/IncrementalFileCleanup.java:
##########
@@ -262,7 +262,8 @@ public void cleanFiles(TableMetadata beforeExpiration, 
TableMetadata afterExpira
     deleteFiles(manifestsToDelete, "manifest");
     deleteFiles(manifestListsToDelete, "manifest list");
 
-    if (!beforeExpiration.statisticsFiles().isEmpty()) {
+    if (!beforeExpiration.statisticsFiles().isEmpty()

Review Comment:
   What about a helper method like `hasStatsFiles` or something to simplify the 
condition and stay on 1 line?



##########
core/src/main/java/org/apache/iceberg/ReachableFileCleanup.java:
##########
@@ -82,7 +82,8 @@ public void cleanFiles(TableMetadata beforeExpiration, 
TableMetadata afterExpira
 
     deleteFiles(manifestListsToDelete, "manifest list");
 
-    if (!beforeExpiration.statisticsFiles().isEmpty()) {
+    if (!beforeExpiration.statisticsFiles().isEmpty()

Review Comment:
   A helper method here too?



##########
api/src/main/java/org/apache/iceberg/UpdatePartitionStatistics.java:
##########
@@ -0,0 +1,40 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.iceberg;
+
+import java.util.List;
+
+/** API for updating partition statistics files in a table. */
+public interface UpdatePartitionStatistics extends 
PendingUpdate<List<PartitionStatisticsFile>> {
+  /**
+   * Set the table's partition statistics file for given snapshot, replacing 
the previous partition
+   * statistics file for the snapshot if any exists.
+   *
+   * @return this for method chaining
+   */
+  UpdatePartitionStatistics setPartitionStatistics(
+      long snapshotId, PartitionStatisticsFile partitionStatisticsFile);

Review Comment:
   Hm, `PartitionStatisticsFile` already returns `snapshotId()`.
   Why also pass it explicitly and validate they are equal in the 
implementation?
   Seems to match what we do for `UpdateStatistics`, though.



##########
core/src/main/java/org/apache/iceberg/ReachableFileUtil.java:
##########
@@ -137,7 +137,9 @@ public static List<String> manifestListLocations(Table 
table, Set<Long> snapshot
    *
    * @param table table for which statistics files needs to be listed
    * @return the location of statistics files
+   * @deprecated use the {@code allStatisticsFilesLocations(table)} instead.
    */
+  @Deprecated

Review Comment:
   Given the purpose of this method and how it is used, what about simply 
extending it to return all statistics files? Then we won't need changes in 
consumers and won't need another method. We can clarify that this returns all 
stats files in Javadoc.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org

Reply via email to