RussellSpitzer commented on code in PR #6624:
URL: https://github.com/apache/iceberg/pull/6624#discussion_r1087917320


##########
spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/SparkTableUtil.java:
##########
@@ -442,14 +444,51 @@ public static void importSparkTable(
             "Cannot find any partitions in table %s",
             sourceTableIdent);
         importSparkPartitions(
-            spark, sourceTablePartitions, targetTable, spec, stagingDir, checkDuplicateFiles);
+            spark,
+            sourceTablePartitions,
+            targetTable,
+            spec,
+            stagingDir,
+            checkDuplicateFiles,
+            parallelism);
       }
     } catch (AnalysisException e) {
       throw SparkExceptionUtil.toUncheckedException(
           e, "Unable to get partition spec for table: %s", sourceTableIdentWithDB);
     }
   }
 
+  /**
+   * Import files from an existing Spark table to an Iceberg table.
+   *
+   * <p>The import uses the Spark session to get table metadata. It assumes no other operation is
+   * running on the original or target table and thus is not thread-safe.
+   *
+   * @param spark a Spark session
+   * @param sourceTableIdent an identifier of the source Spark table
+   * @param targetTable an Iceberg table where to import the data
+   * @param stagingDir a staging directory to store temporary manifest files
+   * @param partitionFilter only import partitions whose values match those in the map, can be
+   *     partially defined
+   * @param checkDuplicateFiles if true, throw exception if import results in a duplicate data file
+   */
+  public static void importSparkTable(
+      SparkSession spark,
+      TableIdentifier sourceTableIdent,
+      Table targetTable,
+      String stagingDir,
+      Map<String, String> partitionFilter,
+      boolean checkDuplicateFiles) {
+    importSparkTable(
+        spark,
+        sourceTableIdent,
+        targetTable,
+        stagingDir,
+        partitionFilter,
+        checkDuplicateFiles,
+        1);

Review Comment:
   This is just a performance change. Unless we think this will somehow break something, we can 
change the implementation of APIs, just not the APIs themselves or their outputs.
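
   For reference, a minimal sketch of how callers might use the existing overload next to the 
new parallelism-aware one. The seven-argument signature is inferred from the delegation shown 
in the hunk above, and the table names, staging path, and filter values are hypothetical, 
purely for illustration:

      import java.util.Collections;
      import java.util.Map;

      import org.apache.iceberg.Table;
      import org.apache.iceberg.spark.Spark3Util;
      import org.apache.iceberg.spark.SparkTableUtil;
      import org.apache.spark.sql.SparkSession;
      import org.apache.spark.sql.catalyst.TableIdentifier;

      public class ImportExample {
        public static void main(String[] args) throws Exception {
          SparkSession spark =
              SparkSession.builder().master("local[*]").appName("import-example").getOrCreate();

          // Hypothetical source Spark table and target Iceberg table, only for illustration.
          TableIdentifier sourceTableIdent =
              new TableIdentifier("source_table", scala.Option.apply("source_db"));
          Table targetTable = Spark3Util.loadIcebergTable(spark, "db.target_table");

          String stagingDir = "/tmp/iceberg-import-staging";
          Map<String, String> partitionFilter = Collections.singletonMap("dt", "2023-01-01");

          // Existing overload: behavior unchanged, now delegates with a parallelism of 1.
          SparkTableUtil.importSparkTable(
              spark, sourceTableIdent, targetTable, stagingDir, partitionFilter, false);

          // New overload per this PR: same result, with the import work spread across 8 threads.
          SparkTableUtil.importSparkTable(
              spark, sourceTableIdent, targetTable, stagingDir, partitionFilter, false, 8);
        }
      }

   Keeping the old overload as a thin delegation with a parallelism of 1 preserves the existing 
signature and its default behavior, so callers only opt into the extra threads explicitly.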



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

