[GitHub] [iceberg] szehon-ho commented on a diff in pull request #6779: Spark 3.2, 3.3: use table partition schema in add_files for getPartitions to avoid data corruption

via GitHub Thu, 16 Feb 2023 18:06:54 -0800


szehon-ho commented on code in PR #6779:
URL: https://github.com/apache/iceberg/pull/6779#discussion_r1109199713



##########
spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/Spark3Util.java:
##########
@@ -836,9 +836,30 @@ public static String quotedFullIdentifier(String 
catalogName, Identifier identif
    * @param format format of the file
    * @param partitionFilter partitionFilter of the file
    * @return all table's partitions
+   * @deprecated use {@link Spark3Util#getPartitions(SparkSession, Path, 
String, Map, Option)}
    */
+  @Deprecated
   public static List<SparkPartition> getPartitions(
       SparkSession spark, Path rootPath, String format, Map<String, String> 
partitionFilter) {
+    return getPartitions(spark, rootPath, format, partitionFilter, 
Option.empty());
+  }
+
+  /**
+   * Use Spark to list all partitions in the table.
+   *
+   * @param spark a Spark session
+   * @param rootPath a table identifier
+   * @param format format of the file
+   * @param partitionFilter partitionFilter of the file
+   * @param partitionSpec partitionSpec of the table
+   * @return all table's partitions
+   */
+  public static List<SparkPartition> getPartitions(
+      SparkSession spark,
+      Path rootPath,
+      String format,
+      Map<String, String> partitionFilter,
+      Option<PartitionSpec> partitionSpec) {

Review Comment:
   We shouldnt take in scala Option as argument to java method.  I would vote 
to just take a regular PartitionSpec, and check null in code, the equivalent is 
java.util.Optional and javadoc suggests to use it primarily for return types. 



##########
spark/v3.2/spark-extensions/src/test/java/org/apache/iceberg/spark/extensions/TestAddFilesProcedure.java:
##########
@@ -911,6 +935,14 @@ public void 
testPartitionedImportFromEmptyPartitionDoesNotThrow() {
     new StructField("ts", DataTypes.DateType, true, Metadata.empty())
   };
 
+  private static final StructField[] dateHourStruct = {
+    new StructField("id", DataTypes.IntegerType, true, Metadata.empty()),
+    new StructField("name", DataTypes.StringType, true, Metadata.empty()),
+    new StructField("dept", DataTypes.StringType, true, Metadata.empty()),
+    new StructField("ts", DataTypes.DateType, true, Metadata.empty()),
+    new StructField("hour", DataTypes.StringType, true, Metadata.empty())

Review Comment:
   Question, (maybe I am not understanding original problem entirely), can we 
re-use existing test struct, ie, using "dept=01" as partition field?



##########
spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/SparkSchemaUtil.java:
##########
@@ -354,4 +354,15 @@ public static Map<Integer, String> 
indexQuotedNameById(Schema schema) {
     Function<String, String> quotingFunc = name -> String.format("`%s`", 
name.replace("`", "``"));
     return TypeUtil.indexQuotedNameById(schema.asStruct(), quotingFunc);
   }
+
+  /**
+   * Convert a {@link PartitionSpec} to a {@link DataType Spark type}.
+   *
+   * @param spec iceberg PartitionSpec
+   * @return {@link StructType}
+   * @throws IllegalArgumentException if the type cannot be converted
+   */
+  public static StructType convert(PartitionSpec spec) {
+    return convert(new 
Schema(spec.partitionType().asNestedType().asStructType().fields()));

Review Comment:
   I think the general project direction has been to avoid adding public method 
unless we need it in many places.  This is converting from a partition spec 
type to spark type, and should be fairly limited, so would suggest inline.
   
   I personally think , if anything, adding public method convert(StructType) 
to spark StructType would be more useful, instead of forcing constructing of 
Schema in existing convert method, and will cover this case here.  like to see 
what @aokolnychyi @RussellSpitzer think as well.
   
   Also, is it more convoluted than needed?  Could it be?
   ```
   convert(new Schema(partType.partitionType().fields())))
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org

[GitHub] [iceberg] szehon-ho commented on a diff in pull request #6779: Spark 3.2, 3.3: use table partition schema in add_files for getPartitions to avoid data corruption

Reply via email to