szehon-ho commented on code in PR #6779:
URL: https://github.com/apache/iceberg/pull/6779#discussion_r1110474171


##########
spark/v3.2/spark-extensions/src/test/java/org/apache/iceberg/spark/extensions/TestAddFilesProcedure.java:
##########
@@ -911,6 +935,14 @@ public void testPartitionedImportFromEmptyPartitionDoesNotThrow() {
     new StructField("ts", DataTypes.DateType, true, Metadata.empty())
   };
 
+  private static final StructField[] dateHourStruct = {
+    new StructField("id", DataTypes.IntegerType, true, Metadata.empty()),
+    new StructField("name", DataTypes.StringType, true, Metadata.empty()),
+    new StructField("dept", DataTypes.StringType, true, Metadata.empty()),
+    new StructField("ts", DataTypes.DateType, true, Metadata.empty()),
+    new StructField("hour", DataTypes.StringType, true, Metadata.empty())

Review Comment:
   OK, in that case I'd prefer not to add extra structs that aren't strictly necessary, to keep the change smaller. I don't think a string hour value like 01 is much more readable than a dept named 01, so it doesn't justify a new struct.
   
   I think we can still make a separate DF if we need to.
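   
   For illustration, a rough sketch of what a separate DF could look like while reusing the existing dateStruct (this assumes the test class's spark session, dateStruct, and toDate helper; the deptAsStringDF name and the row values are made up):
   
   ```java
   // Hypothetical sketch only: reuses the existing dateStruct (id, name, dept, ts)
   // instead of adding a new dateHourStruct; row values are illustrative.
   private static final Dataset<Row> deptAsStringDF =
       spark
           .createDataFrame(
               ImmutableList.of(
                   RowFactory.create(1, "John Doe", "01", toDate("2021-01-01")),
                   RowFactory.create(2, "Jane Doe", "02", toDate("2021-01-01"))),
               new StructType(dateStruct))
           .repartition(2);
   ```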



##########
spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/Spark3Util.java:
##########
@@ -815,9 +816,30 @@ public static String quotedFullIdentifier(String catalogName, Identifier identif
    * @param format format of the file
    * @param partitionFilter partitionFilter of the file
    * @return all table's partitions
+   * @deprecated use {@link Spark3Util#getPartitions(SparkSession, Path, String, Map, Option)}
    */
+  @Deprecated
   public static List<SparkPartition> getPartitions(
       SparkSession spark, Path rootPath, String format, Map<String, String> partitionFilter) {
+    return getPartitions(spark, rootPath, format, partitionFilter, Optional.empty());
+  }
+
+  /**
+   * Use Spark to list all partitions in the table.
+   *
+   * @param spark a Spark session
+   * @param rootPath a table identifier
+   * @param format format of the file
+   * @param partitionFilter partitionFilter of the file
+   * @param partitionSpec partitionSpec of the table
+   * @return all table's partitions
+   */
+  public static List<SparkPartition> getPartitions(
+      SparkSession spark,
+      Path rootPath,
+      String format,
+      Map<String, String> partitionFilter,
+      Optional<PartitionSpec> partitionSpec) {

Review Comment:
   As I was saying in the original comment, the Optional javadoc mentions it is intended primarily for return values, and in Java it isn't commonly used for method arguments. So I would say let's just take a PartitionSpec argument that can be null. We still have the other version that takes four arguments for users.
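   
   A rough sketch of that shape (illustrative only, not the final implementation; the listing logic is elided):
   
   ```java
   // Sketch: take a plain PartitionSpec that may be null instead of an Optional.
   public static List<SparkPartition> getPartitions(
       SparkSession spark,
       Path rootPath,
       String format,
       Map<String, String> partitionFilter,
       PartitionSpec partitionSpec) {
     // Listing logic elided; a null partitionSpec would play the role that
     // Optional.empty() plays in the current diff.
     throw new UnsupportedOperationException("sketch only");
   }

   // The existing four-argument overload stays for users and delegates with null.
   public static List<SparkPartition> getPartitions(
       SparkSession spark, Path rootPath, String format, Map<String, String> partitionFilter) {
     return getPartitions(spark, rootPath, format, partitionFilter, null);
   }
   ```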



##########
spark/v3.2/spark-extensions/src/test/java/org/apache/iceberg/spark/extensions/TestAddFilesProcedure.java:
##########
@@ -926,6 +955,17 @@ private static java.sql.Date toDate(String value) {
               new StructType(dateStruct))
           .repartition(2);
 
+  private static final Dataset<Row> dateHourDF =

Review Comment:
   I think dateDF is specifically there for the date partition test, and dateHourDF doesn't indicate that it is testing the case where hour is modeled as a string. Maybe something more descriptive like testPartitionTypeDF.



##########
spark/v3.2/spark-extensions/src/test/java/org/apache/iceberg/spark/extensions/TestAddFilesProcedure.java:
##########
@@ -418,6 +418,27 @@ public void addDataPartitionedByDateToPartitioned() {
         sql("SELECT id, name, dept, date FROM %s ORDER BY id", tableName));
   }
 
+  @Test
+  public void addDataPartitionedByDateHourToPartitioned() {

Review Comment:
   I think the test name doesn't capture the problem it's solving; it should be something like 'testPartitionType'.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
