szehon-ho commented on code in PR #7499:
URL: https://github.com/apache/iceberg/pull/7499#discussion_r1183093526


##########
docs/spark-writes.md:
##########
@@ -339,74 +331,55 @@ USING iceberg
 PARTITIONED BY (days(ts), category)
 ```
 
-To write data to the sample table, your data needs to be sorted by `days(ts), category`.
-
-If you're inserting data with SQL statement, you can use `ORDER BY` to achieve it, like below:
+To write data to the sample table, your data needs to be sorted by `days(ts), category` but this is taken care
+of automatically by the default `hash` distribution.
 
 ```sql
 INSERT INTO prod.db.sample
 SELECT id, data, category, ts FROM another_table
-ORDER BY ts, category
-```
-
-If you're inserting data with DataFrame, you can use either `orderBy`/`sort` to trigger global sort, or `sortWithinPartitions`
-to trigger local sort. Local sort for example:
-
-```scala
-data.sortWithinPartitions("ts", "category")
-    .writeTo("prod.db.sample")
-    .append()
 ```
 
-You can simply add the original column to the sort condition for the most partition transformations, except `bucket`.
-
-For `bucket` partition transformation, you need to register the Iceberg transform function in Spark to specify it during sort.
-
-Let's go through another sample table having bucket partition:
-
-```sql
-CREATE TABLE prod.db.sample (
-    id bigint,
-    data string,
-    category string,
-    ts timestamp)
-USING iceberg
-PARTITIONED BY (bucket(16, id))
-```
-
-You need to register the function to deal with bucket, like below:
-
-```scala
-import org.apache.iceberg.spark.IcebergSpark
-import org.apache.spark.sql.types.DataTypes
-
-IcebergSpark.registerBucketUDF(spark, "iceberg_bucket16", DataTypes.LongType, 16)
-```
-
-{{< hint info >}}
-Explicit registration of the function is necessary because Spark doesn't allow Iceberg to provide functions.
-[SPARK-27658](https://issues.apache.org/jira/browse/SPARK-27658) is filed to enable Iceberg to provide functions
-which can be used in query.
-{{< /hint >}}
-
-Here we just registered the bucket function as `iceberg_bucket16`, which can be used in sort clause.
-
-If you're inserting data with SQL statement, you can use the function like below:
-
-```sql
-INSERT INTO prod.db.sample
-SELECT id, data, category, ts FROM another_table
-ORDER BY iceberg_bucket16(id)
-```
-
-If you're inserting data with DataFrame, you can use the function like below:
-
-```scala
-data.sortWithinPartitions(expr("iceberg_bucket16(id)"))
-    .writeTo("prod.db.sample")
-    .append()
-```
 
+There are 3 options for `write.distribution-mode`
+
+* `none` - This is the previous default for Iceberg.<p> This mode does not require any shuffles or sort to be performed

Review Comment:
   We say 'request' earlier.  Should we keep that here, or use 'require'?
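For context, a rough sketch (not from this PR) of the manual sort that `none` leaves to the user, assuming the `days(ts), category` sample table above and using `date_trunc` in place of the `days` transform:

```scala
import org.apache.spark.sql.functions.{col, date_trunc}

// Under write.distribution-mode=none, Spark performs no shuffle or sort for Iceberg,
// so the rows must be clustered by partition values before the append.
data.sortWithinPartitions(date_trunc("day", col("ts")), col("category"))
    .writeTo("prod.db.sample")
    .append()
```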



##########
docs/spark-writes.md:
##########
@@ -312,20 +312,12 @@ data.writeTo("prod.db.table")
     .createOrReplace()
 ```
 
-## Writing to partitioned tables
+## Writing Distribution Modes
 
-Iceberg requires the data to be sorted according to the partition spec per task (Spark partition) in prior to write
-against partitioned table. This applies both Writing with SQL and Writing with DataFrames.
-
-{{< hint info >}}
-Explicit sort is necessary because Spark doesn't allow Iceberg to request a sort before writing as of Spark 3.0.
-[SPARK-23889](https://issues.apache.org/jira/browse/SPARK-23889) is filed to enable Iceberg to require specific
-distribution & sort order to Spark.
-{{< /hint >}}
-
-{{< hint info >}}
-Both global sort (`orderBy`/`sort`) and local sort (`sortWithinPartitions`) work for the requirement.
-{{< /hint >}}
+Iceberg's default Spark writers require that the data in each spark task is clustered by partition values. This
+distribution is required to minimize the number of file handles that are held open while writing. By default, starting
+in Iceberg 1.2.0, Iceberg now also requests that Spark pre-sort data to be written to fit this distribution. The

Review Comment:
   Nit: now may be a bit redundant with 'starting in Iceberg 1.2.0'
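It might also help to note that the table-level default can be overridden per write; a sketch, assuming Iceberg's `distribution-mode` Spark write option:

```scala
// Override the table's write.distribution-mode for this append only
// ("distribution-mode" is assumed to be the Iceberg Spark write option).
data.writeTo("prod.db.sample")
    .option("distribution-mode", "none")
    .append()
```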



##########
docs/spark-writes.md:
##########
@@ -339,74 +331,55 @@ USING iceberg
 PARTITIONED BY (days(ts), category)
 ```
 
-To write data to the sample table, your data needs to be sorted by `days(ts), category`.
-
-If you're inserting data with SQL statement, you can use `ORDER BY` to achieve it, like below:
+To write data to the sample table, your data needs to be sorted by `days(ts), category` but this is taken care
+of automatically by the default `hash` distribution.
 
 ```sql
 INSERT INTO prod.db.sample
 SELECT id, data, category, ts FROM another_table
-ORDER BY ts, category
-```
-
-If you're inserting data with DataFrame, you can use either `orderBy`/`sort` to trigger global sort, or `sortWithinPartitions`
-to trigger local sort. Local sort for example:
-
-```scala
-data.sortWithinPartitions("ts", "category")
-    .writeTo("prod.db.sample")
-    .append()
 ```
 
-You can simply add the original column to the sort condition for the most partition transformations, except `bucket`.
-
-For `bucket` partition transformation, you need to register the Iceberg transform function in Spark to specify it during sort.
-
-Let's go through another sample table having bucket partition:
-
-```sql
-CREATE TABLE prod.db.sample (
-    id bigint,
-    data string,
-    category string,
-    ts timestamp)
-USING iceberg
-PARTITIONED BY (bucket(16, id))
-```
-
-You need to register the function to deal with bucket, like below:
-
-```scala
-import org.apache.iceberg.spark.IcebergSpark
-import org.apache.spark.sql.types.DataTypes
-
-IcebergSpark.registerBucketUDF(spark, "iceberg_bucket16", DataTypes.LongType, 16)
-```
-
-{{< hint info >}}
-Explicit registration of the function is necessary because Spark doesn't allow Iceberg to provide functions.
-[SPARK-27658](https://issues.apache.org/jira/browse/SPARK-27658) is filed to enable Iceberg to provide functions
-which can be used in query.
-{{< /hint >}}
-
-Here we just registered the bucket function as `iceberg_bucket16`, which can be used in sort clause.
-
-If you're inserting data with SQL statement, you can use the function like below:
-
-```sql
-INSERT INTO prod.db.sample
-SELECT id, data, category, ts FROM another_table
-ORDER BY iceberg_bucket16(id)
-```
-
-If you're inserting data with DataFrame, you can use the function like below:
-
-```scala
-data.sortWithinPartitions(expr("iceberg_bucket16(id)"))
-    .writeTo("prod.db.sample")
-    .append()
-```
 
+There are 3 options for `write.distribution-mode`
+
+* `none` - This is the previous default for Iceberg.<p> This mode does not require any shuffles or sort to be performed
+automatically by Spark. Because no work is done automatically by Spark, the data must be either locally or globally
+sorted manually by partition value. To reduce the number of files produced during writing, using a global sort is recommended.<p>
+A local sort can be avoided by using the Spark [write fanout](#write-properties) property but this will cause all file handles to
+remain open until each write task has completed.
+* `hash` - This mode is the new default and requests that Spark uses a hash-based exchange to shuffle the incoming
+write data before writing. Practically, this means that each row is hashed based on the row's partition value and then placed
+in a corresponding Spark task based upon that value. Further division and coalescing of tasks may take place based on
+the [Spark's Adaptive Query planning](#controlling-file-sizes).
+* `range` - This mode requests that Spark perform a range based exchanged to shuffle the data before writing. This is
+a two stage procedure which is more expensive than the `hash` mode. The first stage samples the data to be written based
+on the partition and sort columns, this information is then used in the second stage to shuffle data into tasks. Each
+task gets an exclusive range of the input data which clusters the data by partition and also globally sorts it.
+While this is more expensive than the hash distribution, the global ordering can be beneficial for read performance if
+sorted columns are used during queries. Further division and coalescing of tasks may take place based on
+  the [Spark's Adaptive Query planning](#controlling-file-sizes).
+
+
+## Controlling File Sizes
+
+When writing data to Iceberg with Spark, it's important to note that Spark cannot write a file larger than a Spark
+task. This means although Iceberg will always roll over a file when it grows to
+[`write.target-file-size-bytes`](../configuration/#write-properties), a file
+will not be able to grow to that size if the task is not large enough. The
+on disk file size will also be much smaller than the Spark task size since the on disk data will be both compressed
+and in columnar format as opposed to Spark's uncompressed row representation. This means a 100 megabyte task will
+always corrospond to on an on disk file of much less than 100 megabytes even when writing to a single Iceberg partition.

Review Comment:
   Some extra words here.
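Maybe worth pairing this paragraph with the knobs it implies; a sketch with illustrative values for growing the write tasks and the Iceberg target file size:

```scala
// Larger advisory partition sizes under AQE give each write task more data,
// so data files can actually approach the Iceberg target size (values are illustrative).
spark.conf.set("spark.sql.adaptive.advisoryPartitionSizeInBytes", "256MB")
spark.sql(
  "ALTER TABLE prod.db.sample SET TBLPROPERTIES ('write.target-file-size-bytes' = '536870912')")
```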



##########
docs/spark-writes.md:
##########
@@ -339,74 +331,55 @@ USING iceberg
 PARTITIONED BY (days(ts), category)
 ```
 
-To write data to the sample table, your data needs to be sorted by `days(ts), category`.
-
-If you're inserting data with SQL statement, you can use `ORDER BY` to achieve it, like below:
+To write data to the sample table, your data needs to be sorted by `days(ts), category` but this is taken care
+of automatically by the default `hash` distribution.
 
 ```sql
 INSERT INTO prod.db.sample
 SELECT id, data, category, ts FROM another_table
-ORDER BY ts, category
-```
-
-If you're inserting data with DataFrame, you can use either `orderBy`/`sort` to trigger global sort, or `sortWithinPartitions`
-to trigger local sort. Local sort for example:
-
-```scala
-data.sortWithinPartitions("ts", "category")
-    .writeTo("prod.db.sample")
-    .append()
 ```
 
-You can simply add the original column to the sort condition for the most partition transformations, except `bucket`.
-
-For `bucket` partition transformation, you need to register the Iceberg transform function in Spark to specify it during sort.
-
-Let's go through another sample table having bucket partition:
-
-```sql
-CREATE TABLE prod.db.sample (
-    id bigint,
-    data string,
-    category string,
-    ts timestamp)
-USING iceberg
-PARTITIONED BY (bucket(16, id))
-```
-
-You need to register the function to deal with bucket, like below:
-
-```scala
-import org.apache.iceberg.spark.IcebergSpark
-import org.apache.spark.sql.types.DataTypes
-
-IcebergSpark.registerBucketUDF(spark, "iceberg_bucket16", DataTypes.LongType, 16)
-```
-
-{{< hint info >}}
-Explicit registration of the function is necessary because Spark doesn't allow Iceberg to provide functions.
-[SPARK-27658](https://issues.apache.org/jira/browse/SPARK-27658) is filed to enable Iceberg to provide functions
-which can be used in query.
-{{< /hint >}}
-
-Here we just registered the bucket function as `iceberg_bucket16`, which can be used in sort clause.
-
-If you're inserting data with SQL statement, you can use the function like below:
-
-```sql
-INSERT INTO prod.db.sample
-SELECT id, data, category, ts FROM another_table
-ORDER BY iceberg_bucket16(id)
-```
-
-If you're inserting data with DataFrame, you can use the function like below:
-
-```scala
-data.sortWithinPartitions(expr("iceberg_bucket16(id)"))
-    .writeTo("prod.db.sample")
-    .append()
-```
 
+There are 3 options for `write.distribution-mode`
+
+* `none` - This is the previous default for Iceberg.<p> This mode does not require any shuffles or sort to be performed
+automatically by Spark. Because no work is done automatically by Spark, the data must be either locally or globally
+sorted manually by partition value. To reduce the number of files produced during writing, using a global sort is recommended.<p>
+A local sort can be avoided by using the Spark [write fanout](#write-properties) property but this will cause all file handles to
+remain open until each write task has completed.
+* `hash` - This mode is the new default and requests that Spark uses a hash-based exchange to shuffle the incoming
+write data before writing. Practically, this means that each row is hashed based on the row's partition value and then placed
+in a corresponding Spark task based upon that value. Further division and coalescing of tasks may take place based on
+the [Spark's Adaptive Query planning](#controlling-file-sizes).
+* `range` - This mode requests that Spark perform a range based exchanged to shuffle the data before writing. This is
+a two stage procedure which is more expensive than the `hash` mode. The first stage samples the data to be written based
+on the partition and sort columns, this information is then used in the second stage to shuffle data into tasks. Each
+task gets an exclusive range of the input data which clusters the data by partition and also globally sorts it.
+While this is more expensive than the hash distribution, the global ordering can be beneficial for read performance if
+sorted columns are used during queries. Further division and coalescing of tasks may take place based on
+  the [Spark's Adaptive Query planning](#controlling-file-sizes).
+
+
+## Controlling File Sizes
+
+When writing data to Iceberg with Spark, it's important to note that Spark cannot write a file larger than a Spark
+task. This means although Iceberg will always roll over a file when it grows to

Review Comment:
   This is a great section.  While we are at it, would it also help new users to explicitly mention partitions, ie,
   
   `it's important to note that Spark cannot write a file larger than a Spark task, and files cannot span across Iceberg partitions`



##########
docs/spark-writes.md:
##########
@@ -339,74 +331,55 @@ USING iceberg
 PARTITIONED BY (days(ts), category)
 ```
 
-To write data to the sample table, your data needs to be sorted by `days(ts), category`.
-
-If you're inserting data with SQL statement, you can use `ORDER BY` to achieve it, like below:
+To write data to the sample table, your data needs to be sorted by `days(ts), category` but this is taken care
+of automatically by the default `hash` distribution.
 
 ```sql
 INSERT INTO prod.db.sample
 SELECT id, data, category, ts FROM another_table
-ORDER BY ts, category
-```
-
-If you're inserting data with DataFrame, you can use either `orderBy`/`sort` to trigger global sort, or `sortWithinPartitions`
-to trigger local sort. Local sort for example:
-
-```scala
-data.sortWithinPartitions("ts", "category")
-    .writeTo("prod.db.sample")
-    .append()
 ```
 
-You can simply add the original column to the sort condition for the most partition transformations, except `bucket`.
-
-For `bucket` partition transformation, you need to register the Iceberg transform function in Spark to specify it during sort.
-
-Let's go through another sample table having bucket partition:
-
-```sql
-CREATE TABLE prod.db.sample (
-    id bigint,
-    data string,
-    category string,
-    ts timestamp)
-USING iceberg
-PARTITIONED BY (bucket(16, id))
-```
-
-You need to register the function to deal with bucket, like below:
-
-```scala
-import org.apache.iceberg.spark.IcebergSpark
-import org.apache.spark.sql.types.DataTypes
-
-IcebergSpark.registerBucketUDF(spark, "iceberg_bucket16", DataTypes.LongType, 16)
-```
-
-{{< hint info >}}
-Explicit registration of the function is necessary because Spark doesn't allow Iceberg to provide functions.
-[SPARK-27658](https://issues.apache.org/jira/browse/SPARK-27658) is filed to enable Iceberg to provide functions
-which can be used in query.
-{{< /hint >}}
-
-Here we just registered the bucket function as `iceberg_bucket16`, which can be used in sort clause.
-
-If you're inserting data with SQL statement, you can use the function like below:
-
-```sql
-INSERT INTO prod.db.sample
-SELECT id, data, category, ts FROM another_table
-ORDER BY iceberg_bucket16(id)
-```
-
-If you're inserting data with DataFrame, you can use the function like below:
-
-```scala
-data.sortWithinPartitions(expr("iceberg_bucket16(id)"))
-    .writeTo("prod.db.sample")
-    .append()
-```
 
+There are 3 options for `write.distribution-mode`
+
+* `none` - This is the previous default for Iceberg.<p> This mode does not require any shuffles or sort to be performed
+automatically by Spark. Because no work is done automatically by Spark, the data must be either locally or globally
+sorted manually by partition value. To reduce the number of files produced during writing, using a global sort is recommended.<p>
+A local sort can be avoided by using the Spark [write fanout](#write-properties) property but this will cause all file handles to
+remain open until each write task has completed.
+* `hash` - This mode is the new default and requests that Spark uses a hash-based exchange to shuffle the incoming
+write data before writing. Practically, this means that each row is hashed based on the row's partition value and then placed
+in a corresponding Spark task based upon that value. Further division and coalescing of tasks may take place based on
+the [Spark's Adaptive Query planning](#controlling-file-sizes).
+* `range` - This mode requests that Spark perform a range based exchanged to shuffle the data before writing. This is
+a two stage procedure which is more expensive than the `hash` mode. The first stage samples the data to be written based
+on the partition and sort columns, this information is then used in the second stage to shuffle data into tasks. Each
+task gets an exclusive range of the input data which clusters the data by partition and also globally sorts it.
+While this is more expensive than the hash distribution, the global ordering can be beneficial for read performance if
+sorted columns are used during queries. Further division and coalescing of tasks may take place based on
+  the [Spark's Adaptive Query planning](#controlling-file-sizes).
+
+
+## Controlling File Sizes
+
+When writing data to Iceberg with Spark, it's important to note that Spark cannot write a file larger than a Spark
+task. This means although Iceberg will always roll over a file when it grows to
+[`write.target-file-size-bytes`](../configuration/#write-properties), a file
+will not be able to grow to that size if the task is not large enough. The
+on disk file size will also be much smaller than the Spark task size since the on disk data will be both compressed
+and in columnar format as opposed to Spark's uncompressed row representation. This means a 100 megabyte task will
+always corrospond to on an on disk file of much less than 100 megabytes even when writing to a single Iceberg partition.
+
+To control what data ends up in each task the user must either use a [`write distribution mode`](#writing-distribution-modes)
+or manually repartition the data.
+

Review Comment:
   Do we need `<p>` here and above or not?  As we had some above
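Since this sentence mentions manual repartitioning, a short sketch of that alternative could help; it assumes the same sample table and uses `date_trunc` to approximate the `days` transform:

```scala
import org.apache.spark.sql.functions.{col, date_trunc}

// Manual alternative to the built-in distribution modes: shuffle rows into
// write tasks by partition value, then append.
data.repartition(date_trunc("day", col("ts")), col("category"))
    .writeTo("prod.db.sample")
    .append()
```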



##########
docs/spark-writes.md:
##########
@@ -312,20 +312,12 @@ data.writeTo("prod.db.table")
     .createOrReplace()
 ```
 
-## Writing to partitioned tables
+## Writing Distribution Modes
 
-Iceberg requires the data to be sorted according to the partition spec per task (Spark partition) in prior to write
-against partitioned table. This applies both Writing with SQL and Writing with DataFrames.
-
-{{< hint info >}}
-Explicit sort is necessary because Spark doesn't allow Iceberg to request a sort before writing as of Spark 3.0.
-[SPARK-23889](https://issues.apache.org/jira/browse/SPARK-23889) is filed to enable Iceberg to require specific
-distribution & sort order to Spark.
-{{< /hint >}}
-
-{{< hint info >}}
-Both global sort (`orderBy`/`sort`) and local sort (`sortWithinPartitions`) work for the requirement.
-{{< /hint >}}
+Iceberg's default Spark writers require that the data in each spark task is clustered by partition values. This
+distribution is required to minimize the number of file handles that are held open while writing. By default, starting
+in Iceberg 1.2.0, Iceberg now also requests that Spark pre-sort data to be written to fit this distribution. The
+request to spark is done through the parameter `write.distribution-mode` with the default value being `hash`.

Review Comment:
   Nit: capital Spark?
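Could also be worth showing how the parameter is set; a minimal sketch using a table property:

```scala
// Request a different distribution for future writes by changing the table property.
spark.sql(
  "ALTER TABLE prod.db.sample SET TBLPROPERTIES ('write.distribution-mode' = 'range')")
```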



##########
docs/spark-writes.md:
##########
@@ -339,74 +331,55 @@ USING iceberg
 PARTITIONED BY (days(ts), category)
 ```
 
-To write data to the sample table, your data needs to be sorted by `days(ts), category`.
-
-If you're inserting data with SQL statement, you can use `ORDER BY` to achieve it, like below:
+To write data to the sample table, your data needs to be sorted by `days(ts), category` but this is taken care
+of automatically by the default `hash` distribution.
 
 ```sql
 INSERT INTO prod.db.sample
 SELECT id, data, category, ts FROM another_table
-ORDER BY ts, category
-```
-
-If you're inserting data with DataFrame, you can use either `orderBy`/`sort` to trigger global sort, or `sortWithinPartitions`
-to trigger local sort. Local sort for example:
-
-```scala
-data.sortWithinPartitions("ts", "category")
-    .writeTo("prod.db.sample")
-    .append()
 ```
 
-You can simply add the original column to the sort condition for the most partition transformations, except `bucket`.
-
-For `bucket` partition transformation, you need to register the Iceberg transform function in Spark to specify it during sort.
-
-Let's go through another sample table having bucket partition:
-
-```sql
-CREATE TABLE prod.db.sample (
-    id bigint,
-    data string,
-    category string,
-    ts timestamp)
-USING iceberg
-PARTITIONED BY (bucket(16, id))
-```
-
-You need to register the function to deal with bucket, like below:
-
-```scala
-import org.apache.iceberg.spark.IcebergSpark
-import org.apache.spark.sql.types.DataTypes
-
-IcebergSpark.registerBucketUDF(spark, "iceberg_bucket16", DataTypes.LongType, 16)
-```
-
-{{< hint info >}}
-Explicit registration of the function is necessary because Spark doesn't allow Iceberg to provide functions.
-[SPARK-27658](https://issues.apache.org/jira/browse/SPARK-27658) is filed to enable Iceberg to provide functions
-which can be used in query.
-{{< /hint >}}
-
-Here we just registered the bucket function as `iceberg_bucket16`, which can be used in sort clause.
-
-If you're inserting data with SQL statement, you can use the function like below:
-
-```sql
-INSERT INTO prod.db.sample
-SELECT id, data, category, ts FROM another_table
-ORDER BY iceberg_bucket16(id)
-```
-
-If you're inserting data with DataFrame, you can use the function like below:
-
-```scala
-data.sortWithinPartitions(expr("iceberg_bucket16(id)"))
-    .writeTo("prod.db.sample")
-    .append()
-```
 
+There are 3 options for `write.distribution-mode`
+
+* `none` - This is the previous default for Iceberg.<p> This mode does not require any shuffles or sort to be performed

Review Comment:
   Also, is it possible to put the `<p>` on new line so it more accurately reflects the doc?
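On the fanout mention in this bullet, a sketch of enabling it for a single write, assuming Iceberg's `fanout-enabled` write option:

```scala
// Fanout writers avoid the local sort under none mode, at the cost of keeping
// one open file per partition seen by the task until the task finishes.
data.writeTo("prod.db.sample")
    .option("fanout-enabled", "true")
    .append()
```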



##########
docs/spark-writes.md:
##########
@@ -339,74 +331,55 @@ USING iceberg
 PARTITIONED BY (days(ts), category)
 ```
 
-To write data to the sample table, your data needs to be sorted by `days(ts), category`.
-
-If you're inserting data with SQL statement, you can use `ORDER BY` to achieve it, like below:
+To write data to the sample table, your data needs to be sorted by `days(ts), category` but this is taken care
+of automatically by the default `hash` distribution.
 
 ```sql
 INSERT INTO prod.db.sample
 SELECT id, data, category, ts FROM another_table
-ORDER BY ts, category
-```
-
-If you're inserting data with DataFrame, you can use either `orderBy`/`sort` to trigger global sort, or `sortWithinPartitions`
-to trigger local sort. Local sort for example:
-
-```scala
-data.sortWithinPartitions("ts", "category")
-    .writeTo("prod.db.sample")
-    .append()
 ```
 
-You can simply add the original column to the sort condition for the most partition transformations, except `bucket`.
-
-For `bucket` partition transformation, you need to register the Iceberg transform function in Spark to specify it during sort.
-
-Let's go through another sample table having bucket partition:
-
-```sql
-CREATE TABLE prod.db.sample (
-    id bigint,
-    data string,
-    category string,
-    ts timestamp)
-USING iceberg
-PARTITIONED BY (bucket(16, id))
-```
-
-You need to register the function to deal with bucket, like below:
-
-```scala
-import org.apache.iceberg.spark.IcebergSpark
-import org.apache.spark.sql.types.DataTypes
-
-IcebergSpark.registerBucketUDF(spark, "iceberg_bucket16", DataTypes.LongType, 16)
-```
-
-{{< hint info >}}
-Explicit registration of the function is necessary because Spark doesn't allow Iceberg to provide functions.
-[SPARK-27658](https://issues.apache.org/jira/browse/SPARK-27658) is filed to enable Iceberg to provide functions
-which can be used in query.
-{{< /hint >}}
-
-Here we just registered the bucket function as `iceberg_bucket16`, which can be used in sort clause.
-
-If you're inserting data with SQL statement, you can use the function like below:
-
-```sql
-INSERT INTO prod.db.sample
-SELECT id, data, category, ts FROM another_table
-ORDER BY iceberg_bucket16(id)
-```
-
-If you're inserting data with DataFrame, you can use the function like below:
-
-```scala
-data.sortWithinPartitions(expr("iceberg_bucket16(id)"))
-    .writeTo("prod.db.sample")
-    .append()
-```
 
+There are 3 options for `write.distribution-mode`
+
+* `none` - This is the previous default for Iceberg.<p> This mode does not require any shuffles or sort to be performed
+automatically by Spark. Because no work is done automatically by Spark, the data must be either locally or globally
+sorted manually by partition value. To reduce the number of files produced during writing, using a global sort is recommended.<p>
+A local sort can be avoided by using the Spark [write fanout](#write-properties) property but this will cause all file handles to
+remain open until each write task has completed.
+* `hash` - This mode is the new default and requests that Spark uses a hash-based exchange to shuffle the incoming
+write data before writing. Practically, this means that each row is hashed based on the row's partition value and then placed
+in a corresponding Spark task based upon that value. Further division and coalescing of tasks may take place based on
+the [Spark's Adaptive Query planning](#controlling-file-sizes).
+* `range` - This mode requests that Spark perform a range based exchanged to shuffle the data before writing. This is
+a two stage procedure which is more expensive than the `hash` mode. The first stage samples the data to be written based
+on the partition and sort columns, this information is then used in the second stage to shuffle data into tasks. Each

Review Comment:
   Nit: run-on, add 'and' before this?
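While here, it may help to show how this mode is usually triggered; a sketch assuming the Iceberg SQL extensions are loaded:

```scala
// Declaring a write order makes Iceberg request a range distribution
// (plus an in-task sort) for subsequent writes to the table.
spark.sql("ALTER TABLE prod.db.sample WRITE ORDERED BY category, ts")
```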


