Re: [PR] Spark 3.5: Parallelize file listing in add_files procedure [iceberg]

via GitHub Mon, 11 Dec 2023 15:35:52 -0800


amogh-jahagirdar commented on code in PR #9274:
URL: https://github.com/apache/iceberg/pull/9274#discussion_r1423238036



##########
spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/SparkTableUtil.java:
##########
@@ -530,14 +515,16 @@ private static void importUnpartitionedSparkTable(
    * @param spec a partition spec
    * @param stagingDir a staging directory to store temporary manifest files
    * @param checkDuplicateFiles if true, throw exception if import results in a duplicate data file
+   * @param listingParallelism the parallelism to use when listing files
    */
   public static void importSparkPartitions(
       SparkSession spark,
       List<SparkPartition> partitions,
       Table targetTable,
       PartitionSpec spec,
       String stagingDir,
-      boolean checkDuplicateFiles) {
+      boolean checkDuplicateFiles,
+      int listingParallelism) {

Review Comment:
   Important: This is a public utility, so we can't just add a parameter without 
breaking users who may already be calling it (and this API has existed for a 
while). It's better to add a new overloaded method that takes the extra 
parameter and have the original method delegate to the parallel implementation.
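
   A minimal sketch of that overload-and-delegate pattern, for illustration 
only. The class name `ImportUtilSketch`, the simplified parameter list, the 
String return value, and the default of 1 are all assumptions made up for this 
example; none of them are the actual SparkTableUtil code.

```java
import java.util.List;

// Hypothetical stand-in for a public utility class; this is not SparkTableUtil.
final class ImportUtilSketch {
  // Assumption: 1 is treated as "no extra parallelism", i.e. the previous behavior.
  static final int DEFAULT_LISTING_PARALLELISM = 1;

  private ImportUtilSketch() {}

  // Original public signature, left untouched so existing callers keep working.
  public static String importPartitions(List<String> partitions, boolean checkDuplicateFiles) {
    return importPartitions(partitions, checkDuplicateFiles, DEFAULT_LISTING_PARALLELISM);
  }

  // New overload that carries the extra parameter and holds the single implementation.
  public static String importPartitions(
      List<String> partitions, boolean checkDuplicateFiles, int listingParallelism) {
    if (listingParallelism < 1) {
      throw new IllegalArgumentException("listingParallelism must be >= 1");
    }
    // Placeholder for the real work; the sketch only reports what it was asked to do.
    return String.format(
        "imported %d partition(s), checkDuplicateFiles=%b, listingParallelism=%d",
        partitions.size(), checkDuplicateFiles, listingParallelism);
  }
}
```

   Existing two-argument call sites compile and link unchanged, while new 
callers can opt in to the three-argument form.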



##########
spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/SparkTableUtil.java:
##########
@@ -417,33 +427,6 @@ public static void importSparkTable(
     }
   }
 
-  /**
-   * Import files from an existing Spark table to an Iceberg table.
-   *
-   * <p>The import uses the Spark session to get table metadata. It assumes no operation is going on
-   * the original and target table and thus is not thread-safe.
-   *
-   * @param spark a Spark session
-   * @param sourceTableIdent an identifier of the source Spark table
-   * @param targetTable an Iceberg table where to import the data
-   * @param stagingDir a staging directory to store temporary manifest files
-   * @param checkDuplicateFiles if true, throw exception if import results in a duplicate data file
-   */
-  public static void importSparkTable(
-      SparkSession spark,

Review Comment:
   Important: This is a public utility. We want to maintain compatibility, so 
it should not be removed. If there is a reason to deprecate it, it should go 
through a deprecation path, but I don't see why we would want to deprecate this.
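
   For reference, a deprecation path in Java usually means keeping the old 
method in place, marking it with `@Deprecated` and a `@deprecated` Javadoc tag, 
and delegating to its replacement until a later release removes it. A minimal 
sketch, assuming a hypothetical `DeprecationSketch` class with made-up 
signatures (this is not the SparkTableUtil API):

```java
// Hypothetical stand-in for a public utility class; this is not SparkTableUtil.
final class DeprecationSketch {
  private DeprecationSketch() {}

  /**
   * Old entry point, kept so existing callers are not broken.
   *
   * @deprecated hypothetical notice: use {@link #importTable(String, int)} instead
   */
  @Deprecated
  public static String importTable(String sourceTable) {
    // Delegate to the replacement with an assumed default parallelism of 1.
    return importTable(sourceTable, 1);
  }

  // Replacement entry point that holds the single implementation.
  public static String importTable(String sourceTable, int listingParallelism) {
    return "imported " + sourceTable + " with parallelism " + listingParallelism;
  }
}
```

   Callers of the old method get a compiler warning rather than a build 
failure, which gives them time to migrate.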



##########
data/src/main/java/org/apache/iceberg/data/TableMigrationUtil.java:
##########
@@ -215,11 +215,11 @@ private static DataFile buildDataFile(
         .build();
   }
 
-  private static ExecutorService migrationService(int concurrentDeletes) {
+  private static ExecutorService migrationService(int numThreads) {

Review Comment:
   Nit: This seems like an unnecessary name change to me. There are other 
places in the code where concurrentX refers to the parallelism we want, and I 
don't see a need to change that convention.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

