This is an automated email from the ASF dual-hosted git repository.
github-bot pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/datafusion-comet.git
The following commit(s) were added to refs/heads/asf-site by this push:
     new 6abbe1899 Publish built docs triggered by 73a713a5c2a9ed6c9ff8148a3f7144050d7813d6
6abbe1899 is described below
commit 6abbe1899d0e5d1d7a62210b4cdcac1964282ddb
Author: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
AuthorDate: Fri Mar 13 20:15:09 2026 +0000
Publish built docs triggered by 73a713a5c2a9ed6c9ff8148a3f7144050d7813d6
---
_sources/user-guide/latest/iceberg.md.txt | 258 +++++++++++++++---------------
searchindex.js | 2 +-
user-guide/latest/iceberg.html | 249 ++++++++++++++--------------
3 files changed, 259 insertions(+), 250 deletions(-)
diff --git a/_sources/user-guide/latest/iceberg.md.txt b/_sources/user-guide/latest/iceberg.md.txt
index f29de73d9..516fcaf18 100644
--- a/_sources/user-guide/latest/iceberg.md.txt
+++ b/_sources/user-guide/latest/iceberg.md.txt
@@ -20,11 +20,139 @@
 # Accelerating Apache Iceberg Parquet Scans using Comet (Experimental)
 **Note: Iceberg integration is a work-in-progress. Comet currently has two distinct Iceberg
-code paths: 1) a hybrid reader (native Parquet decoding, JVM otherwise) that requires
-building Iceberg from source rather than using available artifacts in Maven, and 2) fully-native
-reader (based on [iceberg-rust](https://github.com/apache/iceberg-rust)). Directions for both
+code paths: 1) fully-native
+reader (based on [iceberg-rust](https://github.com/apache/iceberg-rust)), and 2) a hybrid reader (native Parquet decoding, JVM otherwise) that requires
+building Iceberg from source rather than using available artifacts in Maven. Directions for both
 designs are provided below.**
+## Native Reader
+
+Comet's fully-native Iceberg integration does not require modifying Iceberg source
+code. Instead, Comet relies on reflection to extract `FileScanTask`s from Iceberg, which are
+then serialized to Comet's native execution engine (see
+[PR #2528](https://github.com/apache/datafusion-comet/pull/2528)).
+
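As a rough sketch of that reflection pattern in plain Java (the `StandInScan` class and `planFiles` method below are illustrative stand-ins, not Comet's or Iceberg's actual APIs):

```java
import java.lang.reflect.Method;
import java.util.List;

public class ReflectiveExtract {
    // Invoke a method by name via reflection, in the spirit of how Comet pulls
    // FileScanTasks out of Iceberg without a compile-time dependency on the
    // concrete (possibly non-public) implementation class.
    @SuppressWarnings("unchecked")
    static List<String> extractTasks(Object scan) throws Exception {
        Method m = scan.getClass().getDeclaredMethod("planFiles");
        m.setAccessible(true); // tolerate non-public methods or classes
        return (List<String>) m.invoke(scan);
    }

    public static void main(String[] args) throws Exception {
        // prints [data/f1.parquet, data/f2.parquet]
        System.out.println(extractTasks(new StandInScan()));
    }
}

// Illustrative stand-in for an Iceberg scan; NOT Iceberg's real API.
class StandInScan {
    private List<String> planFiles() {
        return List.of("data/f1.parquet", "data/f2.parquet");
    }
}
```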
+The example below uses Spark's package downloader to retrieve Comet 0.14.0 and Iceberg
+1.8.1, but Comet has been tested with Iceberg 1.5, 1.7, 1.8, 1.9, and 1.10. The key configuration
+to enable fully-native Iceberg is `spark.comet.scan.icebergNative.enabled=true`. This
+configuration should **not** be used with the hybrid Iceberg configuration
+`spark.sql.iceberg.parquet.reader-type=COMET` from below.
+
+```shell
+$SPARK_HOME/bin/spark-shell \
+ --packages org.apache.datafusion:comet-spark-spark3.5_2.12:0.14.0,org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.8.1,org.apache.iceberg:iceberg-core:1.8.1 \
+ --repositories https://repo1.maven.org/maven2/ \
+ --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions \
+ --conf spark.sql.catalog.spark_catalog=org.apache.iceberg.spark.SparkCatalog \
+ --conf spark.sql.catalog.spark_catalog.type=hadoop \
+ --conf spark.sql.catalog.spark_catalog.warehouse=/tmp/warehouse \
+ --conf spark.plugins=org.apache.spark.CometPlugin \
+ --conf spark.shuffle.manager=org.apache.spark.sql.comet.execution.shuffle.CometShuffleManager \
+ --conf spark.sql.extensions=org.apache.comet.CometSparkSessionExtensions \
+ --conf spark.comet.scan.icebergNative.enabled=true \
+ --conf spark.comet.explainFallback.enabled=true \
+ --conf spark.memory.offHeap.enabled=true \
+ --conf spark.memory.offHeap.size=2g
+```
+
+### Tuning
+
+Comet's native Iceberg reader supports fetching multiple files in parallel to hide I/O latency with the
+config `spark.comet.scan.icebergNative.dataFileConcurrencyLimit`. This value defaults to 1 to
+maintain test behavior on Iceberg Java tests without `ORDER BY` clauses, but we suggest increasing it to
+values between 2 and 8 based on your workload.
+
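For example, to try a limit of 4 (an illustrative value, not a universal recommendation), add one more `--conf` to the spark-shell invocation above:

```shell
 --conf spark.comet.scan.icebergNative.dataFileConcurrencyLimit=4
```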
+### Supported features
+
+The native Iceberg reader supports the following features:
+
+**Table specifications:**
+
+- Iceberg table spec v1 and v2 (v3 will fall back to Spark)
+
+**Schema and data types:**
+
+- All primitive types including UUID
+- Complex types: arrays, maps, and structs
+- Schema evolution (adding and dropping columns)
+
+**Time travel and branching:**
+
+- `VERSION AS OF` queries to read historical snapshots
+- Branch reads for accessing named branches
+
+**Delete handling (Merge-On-Read tables):**
+
+- Positional deletes
+- Equality deletes
+- Mixed delete types
+
+**Filter pushdown:**
+
+- Equality and comparison predicates (`=`, `!=`, `>`, `>=`, `<`, `<=`)
+- Logical operators (`AND`, `OR`)
+- NULL checks (`IS NULL`, `IS NOT NULL`)
+- `IN` and `NOT IN` list operations
+- `BETWEEN` operations
+
+**Partitioning:**
+
+- Standard partitioning with partition pruning
+- Date partitioning with `days()` transform
+- Bucket partitioning
+- Truncate transform
+- Hour transform
+
+**Storage:**
+
+- Local filesystem
+- Hadoop Distributed File System (HDFS)
+- S3-compatible storage (AWS S3, MinIO)
+
+### REST Catalog
+
+Comet's native Iceberg reader also supports REST catalogs. The following example shows how to
+configure Spark to use a REST catalog with Comet's native Iceberg scan:
+
+```shell
+$SPARK_HOME/bin/spark-shell \
+ --packages org.apache.datafusion:comet-spark-spark3.5_2.12:0.14.0,org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.8.1,org.apache.iceberg:iceberg-core:1.8.1 \
+ --repositories https://repo1.maven.org/maven2/ \
+ --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions \
+ --conf spark.sql.catalog.rest_cat=org.apache.iceberg.spark.SparkCatalog \
+ --conf spark.sql.catalog.rest_cat.catalog-impl=org.apache.iceberg.rest.RESTCatalog \
+ --conf spark.sql.catalog.rest_cat.uri=http://localhost:8181 \
+ --conf spark.sql.catalog.rest_cat.warehouse=/tmp/warehouse \
+ --conf spark.plugins=org.apache.spark.CometPlugin \
+ --conf spark.shuffle.manager=org.apache.spark.sql.comet.execution.shuffle.CometShuffleManager \
+ --conf spark.sql.extensions=org.apache.comet.CometSparkSessionExtensions \
+ --conf spark.comet.scan.icebergNative.enabled=true \
+ --conf spark.comet.explainFallback.enabled=true \
+ --conf spark.memory.offHeap.enabled=true \
+ --conf spark.memory.offHeap.size=2g
+```
+
+Note that REST catalogs require explicit namespace creation before creating tables:
+
+```scala
+scala> spark.sql("CREATE NAMESPACE rest_cat.db")
+scala> spark.sql("CREATE TABLE rest_cat.db.test_table (id INT, name STRING) USING iceberg")
+scala> spark.sql("INSERT INTO rest_cat.db.test_table VALUES (1, 'Alice'), (2, 'Bob')")
+scala> spark.sql("SELECT * FROM rest_cat.db.test_table").show()
+```
+
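To confirm the native scan is actually used, inspect the physical plan; the node to look for is `CometIcebergNativeScan` (and with `spark.comet.explainFallback.enabled=true`, Comet reports why a scan fell back to Spark):

```scala
scala> spark.sql("SELECT * FROM rest_cat.db.test_table").explain()
```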
+### Current limitations
+
+The following scenarios will fall back to Spark's native Iceberg reader:
+
+- Iceberg table spec v3 scans
+- Iceberg writes (reads are accelerated, writes use Spark)
+- Tables backed by Avro or ORC data files (only Parquet is accelerated)
+- Tables partitioned on `BINARY` or `DECIMAL` (with precision >28) columns
+- Scans with residual filters using `truncate`, `bucket`, `year`, `month`, `day`, or `hour`
+  transform functions (partition pruning still works, but row-level filtering of these
+  transforms falls back)
+
## Hybrid Reader
### Build Comet
@@ -149,127 +277,3 @@ scala> spark.sql(s"SELECT * from t1").explain()
 - Spark Runtime Filtering isn't [working](https://github.com/apache/datafusion-comet/issues/2116)
   - You can bypass the issue by either setting `spark.sql.adaptive.enabled=false` or `spark.comet.exec.broadcastExchange.enabled=false`
-
-## Native Reader
-
-Comet's fully-native Iceberg integration does not require modifying Iceberg source
-code. Instead, Comet relies on reflection to extract `FileScanTask`s from Iceberg, which are
-then serialized to Comet's native execution engine (see
-[PR #2528](https://github.com/apache/datafusion-comet/pull/2528)).
-
-The example below uses Spark's package downloader to retrieve Comet 0.12.0 and Iceberg
-1.8.1, but Comet has been tested with Iceberg 1.5, 1.7, 1.8, and 1.10. The key configuration
-to enable fully-native Iceberg is `spark.comet.scan.icebergNative.enabled=true`. This
-configuration should **not** be used with the hybrid Iceberg configuration
-`spark.sql.iceberg.parquet.reader-type=COMET` from above.
-
-```shell
-$SPARK_HOME/bin/spark-shell \
- --packages org.apache.datafusion:comet-spark-spark3.5_2.12:0.12.0,org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.8.1,org.apache.iceberg:iceberg-core:1.8.1 \
- --repositories https://repo1.maven.org/maven2/ \
- --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions \
- --conf spark.sql.catalog.spark_catalog=org.apache.iceberg.spark.SparkCatalog \
- --conf spark.sql.catalog.spark_catalog.type=hadoop \
- --conf spark.sql.catalog.spark_catalog.warehouse=/tmp/warehouse \
- --conf spark.plugins=org.apache.spark.CometPlugin \
- --conf spark.shuffle.manager=org.apache.spark.sql.comet.execution.shuffle.CometShuffleManager \
- --conf spark.sql.extensions=org.apache.comet.CometSparkSessionExtensions \
- --conf spark.comet.scan.icebergNative.enabled=true \
- --conf spark.comet.explainFallback.enabled=true \
- --conf spark.memory.offHeap.enabled=true \
- --conf spark.memory.offHeap.size=2g
-```
-
-The same sample queries from above can be used to test Comet's fully-native Iceberg integration,
-however the scan node to look for is `CometIcebergNativeScan`.
-
-### Supported features
-
-The native Iceberg reader supports the following features:
-
-**Table specifications:**
-
-- Iceberg table spec v1 and v2 (v3 will fall back to Spark)
-
-**Schema and data types:**
-
-- All primitive types including UUID
-- Complex types: arrays, maps, and structs
-- Schema evolution (adding and dropping columns)
-
-**Time travel and branching:**
-
-- `VERSION AS OF` queries to read historical snapshots
-- Branch reads for accessing named branches
-
-**Delete handling (Merge-On-Read tables):**
-
-- Positional deletes
-- Equality deletes
-- Mixed delete types
-
-**Filter pushdown:**
-
-- Equality and comparison predicates (`=`, `!=`, `>`, `>=`, `<`, `<=`)
-- Logical operators (`AND`, `OR`)
-- NULL checks (`IS NULL`, `IS NOT NULL`)
-- `IN` and `NOT IN` list operations
-- `BETWEEN` operations
-
-**Partitioning:**
-
-- Standard partitioning with partition pruning
-- Date partitioning with `days()` transform
-- Bucket partitioning
-- Truncate transform
-- Hour transform
-
-**Storage:**
-
-- Local filesystem
-- Hadoop Distributed File System (HDFS)
-- S3-compatible storage (AWS S3, MinIO)
-
-### REST Catalog
-
-Comet's native Iceberg reader also supports REST catalogs. The following example shows how to
-configure Spark to use a REST catalog with Comet's native Iceberg scan:
-
-```shell
-$SPARK_HOME/bin/spark-shell \
- --packages org.apache.datafusion:comet-spark-spark3.5_2.12:0.12.0,org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.8.1,org.apache.iceberg:iceberg-core:1.8.1 \
- --repositories https://repo1.maven.org/maven2/ \
- --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions \
- --conf spark.sql.catalog.rest_cat=org.apache.iceberg.spark.SparkCatalog \
- --conf spark.sql.catalog.rest_cat.catalog-impl=org.apache.iceberg.rest.RESTCatalog \
- --conf spark.sql.catalog.rest_cat.uri=http://localhost:8181 \
- --conf spark.sql.catalog.rest_cat.warehouse=/tmp/warehouse \
- --conf spark.plugins=org.apache.spark.CometPlugin \
- --conf spark.shuffle.manager=org.apache.spark.sql.comet.execution.shuffle.CometShuffleManager \
- --conf spark.sql.extensions=org.apache.comet.CometSparkSessionExtensions \
- --conf spark.comet.scan.icebergNative.enabled=true \
- --conf spark.comet.explainFallback.enabled=true \
- --conf spark.memory.offHeap.enabled=true \
- --conf spark.memory.offHeap.size=2g
-```
-
-Note that REST catalogs require explicit namespace creation before creating tables:
-
-```scala
-scala> spark.sql("CREATE NAMESPACE rest_cat.db")
-scala> spark.sql("CREATE TABLE rest_cat.db.test_table (id INT, name STRING) USING iceberg")
-scala> spark.sql("INSERT INTO rest_cat.db.test_table VALUES (1, 'Alice'), (2, 'Bob')")
-scala> spark.sql("SELECT * FROM rest_cat.db.test_table").show()
-```
-
-### Current limitations
-
-The following scenarios will fall back to Spark's native Iceberg reader:
-
-- Iceberg table spec v3 scans
-- Iceberg writes (reads are accelerated, writes use Spark)
-- Tables backed by Avro or ORC data files (only Parquet is accelerated)
-- Tables partitioned on `BINARY` or `DECIMAL` (with precision >28) columns
-- Scans with residual filters using `truncate`, `bucket`, `year`, `month`, `day`, or `hour`
-  transform functions (partition pruning still works, but row-level filtering of these
-  transforms falls back)
diff --git a/searchindex.js b/searchindex.js
index be2879c0f..62215ea49 100644
--- a/searchindex.js
+++ b/searchindex.js
@@ -1 +1 @@
-Search.setIndex({"alltitles": {"1. Format Your Code": [[12, "format-your-code"]], "1. Install Comet": [[22, "install-comet"]], "1. Native Operators (nativeExecs map)": [[4, "native-operators-nativeexecs-map"]], "2. Build and Verify": [[12, "build-and-verify"]], "2. Clone Spark and Apply Diff": [[22, "clone-spark-and-apply-diff"]], "2. Sink Operators (sinks map)": [[4, "sink-operators-sinks-map"]], "3. Comet JVM Operators": [[4, "comet-jvm-operators"]], "3. Run Clippy (Recommended)": [[12 [...]
\ No newline at end of file
+Search.setIndex({"alltitles": {"1. Format Your Code": [[12, "format-your-code"]], "1. Install Comet": [[22, "install-comet"]], "1. Native Operators (nativeExecs map)": [[4, "native-operators-nativeexecs-map"]], "2. Build and Verify": [[12, "build-and-verify"]], "2. Clone Spark and Apply Diff": [[22, "clone-spark-and-apply-diff"]], "2. Sink Operators (sinks map)": [[4, "sink-operators-sinks-map"]], "3. Comet JVM Operators": [[4, "comet-jvm-operators"]], "3. Run Clippy (Recommended)": [[12 [...]
\ No newline at end of file
diff --git a/user-guide/latest/iceberg.html b/user-guide/latest/iceberg.html
index c409200b3..058f5b573 100644
--- a/user-guide/latest/iceberg.html
+++ b/user-guide/latest/iceberg.html
@@ -461,135 +461,23 @@ under the License.
 <section id="accelerating-apache-iceberg-parquet-scans-using-comet-experimental">
 <h1>Accelerating Apache Iceberg Parquet Scans using Comet (Experimental)<a class="headerlink" href="#accelerating-apache-iceberg-parquet-scans-using-comet-experimental" title="Link to this heading">#</a></h1>
 <p><strong>Note: Iceberg integration is a work-in-progress. Comet currently has two distinct Iceberg
-code paths: 1) a hybrid reader (native Parquet decoding, JVM otherwise) that requires
-building Iceberg from source rather than using available artifacts in Maven, and 2) fully-native
-reader (based on <a class="reference external" href="https://github.com/apache/iceberg-rust">iceberg-rust</a>). Directions for both
+code paths: 1) fully-native
+reader (based on <a class="reference external" href="https://github.com/apache/iceberg-rust">iceberg-rust</a>), and 2) a hybrid reader (native Parquet decoding, JVM otherwise) that requires
+building Iceberg from source rather than using available artifacts in Maven. Directions for both
 designs are provided below.</strong></p>
-<section id="hybrid-reader">
-<h2>Hybrid Reader<a class="headerlink" href="#hybrid-reader" title="Link to this heading">#</a></h2>
-<section id="build-comet">
-<h3>Build Comet<a class="headerlink" href="#build-comet" title="Link to this heading">#</a></h3>
-<p>Run a Maven install so that we can compile Iceberg against latest Comet:</p>
-<div class="highlight-shell notranslate"><div class="highlight"><pre><span></span>mvn<span class="w"> </span>install<span class="w"> </span>-DskipTests
-</pre></div>
-</div>
-<p>Build the release JAR to be used from Spark:</p>
-<div class="highlight-shell notranslate"><div class="highlight"><pre><span></span>make<span class="w"> </span>release
-</pre></div>
-</div>
-<p>Set <code class="docutils literal notranslate"><span class="pre">COMET_JAR</span></code> env var:</p>
-<div class="highlight-shell notranslate"><div class="highlight"><pre><span></span><span class="nb">export</span><span class="w"> </span><span class="nv">COMET_JAR</span><span class="o">=</span><span class="sb">`</span><span class="nb">pwd</span><span class="sb">`</span>/spark/target/comet-spark-spark3.5_2.12-0.14.0-SNAPSHOT.jar
-</pre></div>
-</div>
-</section>
-<section id="build-iceberg">
-<h3>Build Iceberg<a class="headerlink" href="#build-iceberg" title="Link to this heading">#</a></h3>
-<p>Clone the Iceberg repository and apply code changes needed by Comet</p>
-<div class="highlight-shell notranslate"><div class="highlight"><pre><span></span>git<span class="w"> </span>clone<span class="w"> </span>git@github.com:apache/iceberg.git
-<span class="nb">cd</span><span class="w"> </span>iceberg
-git<span class="w"> </span>checkout<span class="w"> </span>apache-iceberg-1.8.1
-git<span class="w"> </span>apply<span class="w"> </span>../datafusion-comet/dev/diffs/iceberg/1.8.1.diff
-</pre></div>
-</div>
-<p>Perform a clean build</p>
-<div class="highlight-shell notranslate"><div class="highlight"><pre><span></span>./gradlew<span class="w"> </span>clean<span class="w"> </span>build<span class="w"> </span>-x<span class="w"> </span><span class="nb">test</span><span class="w"> </span>-x<span class="w"> </span>integrationTest
-</pre></div>
-</div>
-</section>
-<section id="test">
-<h3>Test<a class="headerlink" href="#test" title="Link to this heading">#</a></h3>
-<p>Set <code class="docutils literal notranslate"><span class="pre">ICEBERG_JAR</span></code> environment variable.</p>
-<div class="highlight-shell notranslate"><div class="highlight"><pre><span></span><span class="nb">export</span><span class="w"> </span><span class="nv">ICEBERG_JAR</span><span class="o">=</span><span class="sb">`</span><span class="nb">pwd</span><span class="sb">`</span>/spark/v3.5/spark-runtime/build/libs/iceberg-spark-runtime-3.5_2.12-1.9.0-SNAPSHOT.jar
-</pre></div>
-</div>
-<p>Launch Spark Shell:</p>
-<div class="highlight-shell notranslate"><div class="highlight"><pre><span></span><span class="nv">$SPARK_HOME</span>/bin/spark-shell<span class="w"> </span><span class="se">\</span>
-<span class="w"> </span>--driver-class-path<span class="w"> </span><span class="nv">$COMET_JAR</span>:<span class="nv">$ICEBERG_JAR</span><span class="w"> </span><span class="se">\</span>
-<span class="w"> </span>--conf<span class="w"> </span>spark.executor.extraClassPath<span class="o">=</span><span class="nv">$COMET_JAR</span>:<span class="nv">$ICEBERG_JAR</span><span class="w"> </span><span class="se">\</span>
-<span class="w"> </span>--conf<span class="w"> </span>spark.sql.extensions<span class="o">=</span>org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions<span class="w"> </span><span class="se">\</span>
-<span class="w"> </span>--conf<span class="w"> </span>spark.sql.catalog.spark_catalog<span class="o">=</span>org.apache.iceberg.spark.SparkCatalog<span class="w"> </span><span class="se">\</span>
-<span class="w"> </span>--conf<span class="w"> </span>spark.sql.catalog.spark_catalog.type<span class="o">=</span>hadoop<span class="w"> </span><span class="se">\</span>
-<span class="w"> </span>--conf<span class="w"> </span>spark.sql.catalog.spark_catalog.warehouse<span class="o">=</span>/tmp/warehouse<span class="w"> </span><span class="se">\</span>
-<span class="w"> </span>--conf<span class="w"> </span>spark.plugins<span class="o">=</span>org.apache.spark.CometPlugin<span class="w"> </span><span class="se">\</span>
-<span class="w"> </span>--conf<span class="w"> </span>spark.shuffle.manager<span class="o">=</span>org.apache.spark.sql.comet.execution.shuffle.CometShuffleManager<span class="w"> </span><span class="se">\</span>
-<span class="w"> </span>--conf<span class="w"> </span>spark.sql.iceberg.parquet.reader-type<span class="o">=</span>COMET<span class="w"> </span><span class="se">\</span>
-<span class="w"> </span>--conf<span class="w"> </span>spark.comet.explainFallback.enabled<span class="o">=</span><span class="nb">true</span><span class="w"> </span><span class="se">\</span>
-<span class="w"> </span>--conf<span class="w"> </span>spark.memory.offHeap.enabled<span class="o">=</span><span class="nb">true</span><span class="w"> </span><span class="se">\</span>
-<span class="w"> </span>--conf<span class="w"> </span>spark.memory.offHeap.size<span class="o">=</span>2g<span class="w"> </span><span class="se">\</span>
-<span class="w"> </span>--conf<span class="w"> </span>spark.comet.use.lazyMaterialization<span class="o">=</span><span class="nb">false</span><span class="w"> </span><span class="se">\</span>
-<span class="w"> </span>--conf<span class="w"> </span>spark.comet.schemaEvolution.enabled<span class="o">=</span><span class="nb">true</span>
-</pre></div>
-</div>
-<p>Create an Iceberg table. Note that Comet will not accelerate this part.</p>
-<div class="highlight-default notranslate"><div class="highlight"><pre><span></span><span class="n">scala</span><span class="o">></span> <span class="n">spark</span><span class="o">.</span><span class="n">sql</span><span class="p">(</span><span class="n">s</span><span class="s2">"CREATE TABLE IF NOT EXISTS t1 (c0 INT, c1 STRING) USING iceberg"</span><span class="p">)</span>
-<span class="n">scala</span><span class="o">></span> <span class="n">spark</span><span class="o">.</span><span class="n">sql</span><span class="p">(</span><span class="n">s</span><span class="s2">"INSERT INTO t1 VALUES ${(0 until 10000).map(i => (i, i)).mkString("</span><span class="p">,</span><span class="s2">")}"</span><span class="p">)</span>
-</pre></div>
-</div>
-<p>Comet should now be able to accelerate reading the table:</p>
-<div class="highlight-default notranslate"><div class="highlight"><pre><span></span><span class="n">scala</span><span class="o">></span> <span class="n">spark</span><span class="o">.</span><span class="n">sql</span><span class="p">(</span><span class="n">s</span><span class="s2">"SELECT * from t1"</span><span class="p">)</span><span class="o">.</span><span class="n">show</span><span class="p">()</span>
-</pre></div>
-</div>
-<p>This should produce the following output:</p>
-<div class="highlight-default notranslate"><div class="highlight"><pre><span></span><span class="n">scala</span><span class="o">></span> <span class="n">spark</span><span class="o">.</span><span class="n">sql</span><span class="p">(</span><span class="n">s</span><span class="s2">"SELECT * from t1"</span><span class="p">)</span><span class="o">.</span><span class="n">show</span><span class="p">()</span>
-<span class="o">+---+---+</span>
-<span class="o">|</span> <span class="n">c0</span><span class="o">|</span> <span class="n">c1</span><span class="o">|</span>
-<span class="o">+---+---+</span>
-<span class="o">|</span> <span class="mi">0</span><span class="o">|</span> <span class="mi">0</span><span class="o">|</span>
-<span class="o">|</span> <span class="mi">1</span><span class="o">|</span> <span class="mi">1</span><span class="o">|</span>
-<span class="o">|</span> <span class="mi">2</span><span class="o">|</span> <span class="mi">2</span><span class="o">|</span>
-<span class="o">|</span> <span class="mi">3</span><span class="o">|</span> <span class="mi">3</span><span class="o">|</span>
-<span class="o">|</span> <span class="mi">4</span><span class="o">|</span> <span class="mi">4</span><span class="o">|</span>
-<span class="o">|</span> <span class="mi">5</span><span class="o">|</span> <span class="mi">5</span><span class="o">|</span>
-<span class="o">|</span> <span class="mi">6</span><span class="o">|</span> <span class="mi">6</span><span class="o">|</span>
-<span class="o">|</span> <span class="mi">7</span><span class="o">|</span> <span class="mi">7</span><span class="o">|</span>
-<span class="o">|</span> <span class="mi">8</span><span class="o">|</span> <span class="mi">8</span><span class="o">|</span>
-<span class="o">|</span> <span class="mi">9</span><span class="o">|</span> <span class="mi">9</span><span class="o">|</span>
-<span class="o">|</span> <span class="mi">10</span><span class="o">|</span> <span class="mi">10</span><span class="o">|</span>
-<span class="o">|</span> <span class="mi">11</span><span class="o">|</span> <span class="mi">11</span><span class="o">|</span>
-<span class="o">|</span> <span class="mi">12</span><span class="o">|</span> <span class="mi">12</span><span class="o">|</span>
-<span class="o">|</span> <span class="mi">13</span><span class="o">|</span> <span class="mi">13</span><span class="o">|</span>
-<span class="o">|</span> <span class="mi">14</span><span class="o">|</span> <span class="mi">14</span><span class="o">|</span>
-<span class="o">|</span> <span class="mi">15</span><span class="o">|</span> <span class="mi">15</span><span class="o">|</span>
-<span class="o">|</span> <span class="mi">16</span><span class="o">|</span> <span class="mi">16</span><span class="o">|</span>
-<span class="o">|</span> <span class="mi">17</span><span class="o">|</span> <span class="mi">17</span><span class="o">|</span>
-<span class="o">|</span> <span class="mi">18</span><span class="o">|</span> <span class="mi">18</span><span class="o">|</span>
-<span class="o">|</span> <span class="mi">19</span><span class="o">|</span> <span class="mi">19</span><span class="o">|</span>
-<span class="o">+---+---+</span>
-<span class="n">only</span> <span class="n">showing</span> <span class="n">top</span> <span class="mi">20</span> <span class="n">rows</span>
-</pre></div>
-</div>
-<p>Confirm that the query was accelerated by Comet:</p>
-<div class="highlight-default notranslate"><div class="highlight"><pre><span></span><span class="n">scala</span><span class="o">></span> <span class="n">spark</span><span class="o">.</span><span class="n">sql</span><span class="p">(</span><span class="n">s</span><span class="s2">"SELECT * from t1"</span><span class="p">)</span><span class="o">.</span><span class="n">explain</span><span class="p">()</span>
-<span class="o">==</span> <span class="n">Physical</span> <span class="n">Plan</span> <span class="o">==</span>
-<span class="o">*</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span> <span class="n">CometColumnarToRow</span>
-<span class="o">+-</span> <span class="n">CometBatchScan</span> <span class="n">spark_catalog</span><span class="o">.</span><span class="n">default</span><span class="o">.</span><span class="n">t1</span><span class="p">[</span><span class="n">c0</span><span class="c1">#26, c1#27] spark_catalog.default.t1 (branch=null) [filters=, groupedBy=] RuntimeFilters: []</span>
-</pre></div>
-</div>
-</section>
-<section id="known-issues">
-<h3>Known issues<a class="headerlink" href="#known-issues" title="Link to this heading">#</a></h3>
-<ul class="simple">
-<li><p>Spark Runtime Filtering isn’t <a class="reference external" href="https://github.com/apache/datafusion-comet/issues/2116">working</a></p>
-<ul>
-<li><p>You can bypass the issue by either setting <code class="docutils literal notranslate"><span class="pre">spark.sql.adaptive.enabled=false</span></code> or <code class="docutils literal notranslate"><span class="pre">spark.comet.exec.broadcastExchange.enabled=false</span></code></p></li>
-</ul>
-</li>
-</ul>
-</section>
-</section>
 <section id="native-reader">
 <h2>Native Reader<a class="headerlink" href="#native-reader" title="Link to this heading">#</a></h2>
 <p>Comet’s fully-native Iceberg integration does not require modifying Iceberg source
 code. Instead, Comet relies on reflection to extract <code class="docutils literal notranslate"><span class="pre">FileScanTask</span></code>s from Iceberg, which are
 then serialized to Comet’s native execution engine (see
 <a class="reference external" href="https://github.com/apache/datafusion-comet/pull/2528">PR #2528</a>).</p>
-<p>The example below uses Spark’s package downloader to retrieve Comet 0.12.0 and Iceberg
-1.8.1, but Comet has been tested with Iceberg 1.5, 1.7, 1.8, and 1.10. The key configuration
+<p>The example below uses Spark’s package downloader to retrieve Comet 0.14.0 and Iceberg
+1.8.1, but Comet has been tested with Iceberg 1.5, 1.7, 1.8, 1.9, and 1.10. The key configuration
 to enable fully-native Iceberg is <code class="docutils literal notranslate"><span class="pre">spark.comet.scan.icebergNative.enabled=true</span></code>. This
 configuration should <strong>not</strong> be used with the hybrid Iceberg configuration
-<code class="docutils literal notranslate"><span class="pre">spark.sql.iceberg.parquet.reader-type=COMET</span></code> from above.</p>
+<code class="docutils literal notranslate"><span class="pre">spark.sql.iceberg.parquet.reader-type=COMET</span></code> from below.</p>
 <div class="highlight-shell notranslate"><div class="highlight"><pre><span></span><span class="nv">$SPARK_HOME</span>/bin/spark-shell<span class="w"> </span><span class="se">\</span>
-<span class="w"> </span>--packages<span class="w"> </span>org.apache.datafusion:comet-spark-spark3.5_2.12:0.12.0,org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.8.1,org.apache.iceberg:iceberg-core:1.8.1<span class="w"> </span><span class="se">\</span>
+<span class="w"> </span>--packages<span class="w"> </span>org.apache.datafusion:comet-spark-spark3.5_2.12:0.14.0,org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.8.1,org.apache.iceberg:iceberg-core:1.8.1<span class="w"> </span><span class="se">\</span>
 <span class="w"> </span>--repositories<span class="w"> </span>https://repo1.maven.org/maven2/<span class="w"> </span><span class="se">\</span>
 <span class="w"> </span>--conf<span class="w"> </span>spark.sql.extensions<span class="o">=</span>org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions<span class="w"> </span><span class="se">\</span>
 <span class="w"> </span>--conf<span class="w"> </span>spark.sql.catalog.spark_catalog<span class="o">=</span>org.apache.iceberg.spark.SparkCatalog<span class="w"> </span><span class="se">\</span>
@@ -604,8 +492,13 @@ configuration should <strong>not</strong> be used with the hybrid Iceberg config
 <span class="w"> </span>--conf<span class="w"> </span>spark.memory.offHeap.size<span class="o">=</span>2g
 </pre></div>
 </div>
-<p>The same sample queries from above can be used to test Comet’s fully-native Iceberg integration,
-however the scan node to look for is <code class="docutils literal notranslate"><span class="pre">CometIcebergNativeScan</span></code>.</p>
+<section id="tuning">
+<h3>Tuning<a class="headerlink" href="#tuning" title="Link to this heading">#</a></h3>
+<p>Comet’s native Iceberg reader supports fetching multiple files in parallel to hide I/O latency with the
+config <code class="docutils literal notranslate"><span class="pre">spark.comet.scan.icebergNative.dataFileConcurrencyLimit</span></code>. This value defaults to 1 to
+maintain test behavior on Iceberg Java tests without <code class="docutils literal notranslate"><span class="pre">ORDER</span> <span class="pre">BY</span></code> clauses, but we suggest increasing it to
+values between 2 and 8 based on your workload.</p>
+</section>
<section id="supported-features">
 <h3>Supported features<a class="headerlink" href="#supported-features" title="Link to this heading">#</a></h3>
<p>The native Iceberg reader supports the following features:</p>
@@ -658,7 +551,7 @@ however the scan node to look for is <code class="docutils literal notranslate">
 <p>Comet’s native Iceberg reader also supports REST catalogs. The following example shows how to
 configure Spark to use a REST catalog with Comet’s native Iceberg scan:</p>
 <div class="highlight-shell notranslate"><div class="highlight"><pre><span></span><span class="nv">$SPARK_HOME</span>/bin/spark-shell<span class="w"> </span><span class="se">\</span>
-<span class="w"> </span>--packages<span class="w"> </span>org.apache.datafusion:comet-spark-spark3.5_2.12:0.12.0,org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.8.1,org.apache.iceberg:iceberg-core:1.8.1<span class="w"> </span><span class="se">\</span>
+<span class="w"> </span>--packages<span class="w"> </span>org.apache.datafusion:comet-spark-spark3.5_2.12:0.14.0,org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.8.1,org.apache.iceberg:iceberg-core:1.8.1<span class="w"> </span><span class="se">\</span>
 <span class="w"> </span>--repositories<span class="w"> </span>https://repo1.maven.org/maven2/<span class="w"> </span><span class="se">\</span>
 <span class="w"> </span>--conf<span class="w"> </span>spark.sql.extensions<span class="o">=</span>org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions<span class="w"> </span><span class="se">\</span>
 <span class="w"> </span>--conf<span class="w"> </span>spark.sql.catalog.rest_cat<span class="o">=</span>org.apache.iceberg.spark.SparkCatalog<span class="w"> </span><span class="se">\</span>
@@ -696,6 +589,118 @@ transforms falls back)</p></li>
</ul>
</section>
</section>
+<section id="hybrid-reader">
+<h2>Hybrid Reader<a class="headerlink" href="#hybrid-reader" title="Link to
this heading">#</a></h2>
+<section id="build-comet">
+<h3>Build Comet<a class="headerlink" href="#build-comet" title="Link to this
heading">#</a></h3>
+<p>Run a Maven install so that Iceberg can be compiled against the latest Comet:</p>
+<div class="highlight-shell notranslate"><div
class="highlight"><pre><span></span>mvn<span class="w"> </span>install<span
class="w"> </span>-DskipTests
+</pre></div>
+</div>
+<p>Build the release JAR to be used from Spark:</p>
+<div class="highlight-shell notranslate"><div
class="highlight"><pre><span></span>make<span class="w"> </span>release
+</pre></div>
+</div>
+<p>Set the <code class="docutils literal notranslate"><span
class="pre">COMET_JAR</span></code> environment variable:</p>
+<div class="highlight-shell notranslate"><div
class="highlight"><pre><span></span><span class="nb">export</span><span
class="w"> </span><span class="nv">COMET_JAR</span><span
class="o">=</span><span class="sb">`</span><span class="nb">pwd</span><span
class="sb">`</span>/spark/target/comet-spark-spark3.5_2.12-0.14.0-SNAPSHOT.jar
+</pre></div>
+</div>
+</section>
+<section id="build-iceberg">
+<h3>Build Iceberg<a class="headerlink" href="#build-iceberg" title="Link to
this heading">#</a></h3>
+<p>Clone the Iceberg repository and apply the code changes needed by Comet:</p>
+<div class="highlight-shell notranslate"><div
class="highlight"><pre><span></span>git<span class="w"> </span>clone<span
class="w"> </span>[email protected]:apache/iceberg.git
+<span class="nb">cd</span><span class="w"> </span>iceberg
+git<span class="w"> </span>checkout<span class="w"> </span>apache-iceberg-1.8.1
+git<span class="w"> </span>apply<span class="w">
</span>../datafusion-comet/dev/diffs/iceberg/1.8.1.diff
+</pre></div>
+</div>
+<p>Perform a clean build:</p>
+<div class="highlight-shell notranslate"><div
class="highlight"><pre><span></span>./gradlew<span class="w"> </span>clean<span
class="w"> </span>build<span class="w"> </span>-x<span class="w"> </span><span
class="nb">test</span><span class="w"> </span>-x<span class="w">
</span>integrationTest
+</pre></div>
+</div>
+</section>
+<section id="test">
+<h3>Test<a class="headerlink" href="#test" title="Link to this
heading">#</a></h3>
+<p>Set the <code class="docutils literal notranslate"><span
class="pre">ICEBERG_JAR</span></code> environment variable:</p>
+<div class="highlight-shell notranslate"><div
class="highlight"><pre><span></span><span class="nb">export</span><span
class="w"> </span><span class="nv">ICEBERG_JAR</span><span
class="o">=</span><span class="sb">`</span><span class="nb">pwd</span><span
class="sb">`</span>/spark/v3.5/spark-runtime/build/libs/iceberg-spark-runtime-3.5_2.12-1.9.0-SNAPSHOT.jar
+</pre></div>
+</div>
+<p>Launch Spark Shell:</p>
+<div class="highlight-shell notranslate"><div
class="highlight"><pre><span></span><span
class="nv">$SPARK_HOME</span>/bin/spark-shell<span class="w"> </span><span
class="se">\</span>
+<span class="w"> </span>--driver-class-path<span class="w"> </span><span
class="nv">$COMET_JAR</span>:<span class="nv">$ICEBERG_JAR</span><span
class="w"> </span><span class="se">\</span>
+<span class="w"> </span>--conf<span class="w">
</span>spark.executor.extraClassPath<span class="o">=</span><span
class="nv">$COMET_JAR</span>:<span class="nv">$ICEBERG_JAR</span><span
class="w"> </span><span class="se">\</span>
+<span class="w"> </span>--conf<span class="w">
</span>spark.sql.extensions<span
class="o">=</span>org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions<span
class="w"> </span><span class="se">\</span>
+<span class="w"> </span>--conf<span class="w">
</span>spark.sql.catalog.spark_catalog<span
class="o">=</span>org.apache.iceberg.spark.SparkCatalog<span class="w">
</span><span class="se">\</span>
+<span class="w"> </span>--conf<span class="w">
</span>spark.sql.catalog.spark_catalog.type<span class="o">=</span>hadoop<span
class="w"> </span><span class="se">\</span>
+<span class="w"> </span>--conf<span class="w">
</span>spark.sql.catalog.spark_catalog.warehouse<span
class="o">=</span>/tmp/warehouse<span class="w"> </span><span
class="se">\</span>
+<span class="w"> </span>--conf<span class="w"> </span>spark.plugins<span
class="o">=</span>org.apache.spark.CometPlugin<span class="w"> </span><span
class="se">\</span>
+<span class="w"> </span>--conf<span class="w">
</span>spark.shuffle.manager<span
class="o">=</span>org.apache.spark.sql.comet.execution.shuffle.CometShuffleManager<span
class="w"> </span><span class="se">\</span>
+<span class="w"> </span>--conf<span class="w">
</span>spark.sql.iceberg.parquet.reader-type<span class="o">=</span>COMET<span
class="w"> </span><span class="se">\</span>
+<span class="w"> </span>--conf<span class="w">
</span>spark.comet.explainFallback.enabled<span class="o">=</span><span
class="nb">true</span><span class="w"> </span><span class="se">\</span>
+<span class="w"> </span>--conf<span class="w">
</span>spark.memory.offHeap.enabled<span class="o">=</span><span
class="nb">true</span><span class="w"> </span><span class="se">\</span>
+<span class="w"> </span>--conf<span class="w">
</span>spark.memory.offHeap.size<span class="o">=</span>2g<span class="w">
</span><span class="se">\</span>
+<span class="w"> </span>--conf<span class="w">
</span>spark.comet.use.lazyMaterialization<span class="o">=</span><span
class="nb">false</span><span class="w"> </span><span class="se">\</span>
+<span class="w"> </span>--conf<span class="w">
</span>spark.comet.schemaEvolution.enabled<span class="o">=</span><span
class="nb">true</span>
+</pre></div>
+</div>
+<p>Create an Iceberg table. Note that Comet will not accelerate this part.</p>
+<div class="highlight-default notranslate"><div
class="highlight"><pre><span></span><span class="n">scala</span><span
class="o">></span> <span class="n">spark</span><span class="o">.</span><span
class="n">sql</span><span class="p">(</span><span class="n">s</span><span
class="s2">"CREATE TABLE IF NOT EXISTS t1 (c0 INT, c1 STRING) USING
iceberg"</span><span class="p">)</span>
+<span class="n">scala</span><span class="o">></span> <span
class="n">spark</span><span class="o">.</span><span class="n">sql</span><span
class="p">(</span><span class="n">s</span><span class="s2">"INSERT INTO t1
VALUES ${(0 until 10000).map(i => (i, i)).mkString("</span><span
class="p">,</span><span class="s2">")}"</span><span class="p">)</span>
+</pre></div>
+</div>
+<p>Comet should now be able to accelerate reading the table:</p>
+<div class="highlight-default notranslate"><div
class="highlight"><pre><span></span><span class="n">scala</span><span
class="o">></span> <span class="n">spark</span><span class="o">.</span><span
class="n">sql</span><span class="p">(</span><span class="n">s</span><span
class="s2">"SELECT * from t1"</span><span class="p">)</span><span
class="o">.</span><span class="n">show</span><span class="p">()</span>
+</pre></div>
+</div>
+<p>This should produce the following output:</p>
+<div class="highlight-default notranslate"><div
class="highlight"><pre><span></span><span class="n">scala</span><span
class="o">></span> <span class="n">spark</span><span class="o">.</span><span
class="n">sql</span><span class="p">(</span><span class="n">s</span><span
class="s2">"SELECT * from t1"</span><span class="p">)</span><span
class="o">.</span><span class="n">show</span><span class="p">()</span>
+<span class="o">+---+---+</span>
+<span class="o">|</span> <span class="n">c0</span><span class="o">|</span>
<span class="n">c1</span><span class="o">|</span>
+<span class="o">+---+---+</span>
+<span class="o">|</span> <span class="mi">0</span><span class="o">|</span>
<span class="mi">0</span><span class="o">|</span>
+<span class="o">|</span> <span class="mi">1</span><span class="o">|</span>
<span class="mi">1</span><span class="o">|</span>
+<span class="o">|</span> <span class="mi">2</span><span class="o">|</span>
<span class="mi">2</span><span class="o">|</span>
+<span class="o">|</span> <span class="mi">3</span><span class="o">|</span>
<span class="mi">3</span><span class="o">|</span>
+<span class="o">|</span> <span class="mi">4</span><span class="o">|</span>
<span class="mi">4</span><span class="o">|</span>
+<span class="o">|</span> <span class="mi">5</span><span class="o">|</span>
<span class="mi">5</span><span class="o">|</span>
+<span class="o">|</span> <span class="mi">6</span><span class="o">|</span>
<span class="mi">6</span><span class="o">|</span>
+<span class="o">|</span> <span class="mi">7</span><span class="o">|</span>
<span class="mi">7</span><span class="o">|</span>
+<span class="o">|</span> <span class="mi">8</span><span class="o">|</span>
<span class="mi">8</span><span class="o">|</span>
+<span class="o">|</span> <span class="mi">9</span><span class="o">|</span>
<span class="mi">9</span><span class="o">|</span>
+<span class="o">|</span> <span class="mi">10</span><span class="o">|</span>
<span class="mi">10</span><span class="o">|</span>
+<span class="o">|</span> <span class="mi">11</span><span class="o">|</span>
<span class="mi">11</span><span class="o">|</span>
+<span class="o">|</span> <span class="mi">12</span><span class="o">|</span>
<span class="mi">12</span><span class="o">|</span>
+<span class="o">|</span> <span class="mi">13</span><span class="o">|</span>
<span class="mi">13</span><span class="o">|</span>
+<span class="o">|</span> <span class="mi">14</span><span class="o">|</span>
<span class="mi">14</span><span class="o">|</span>
+<span class="o">|</span> <span class="mi">15</span><span class="o">|</span>
<span class="mi">15</span><span class="o">|</span>
+<span class="o">|</span> <span class="mi">16</span><span class="o">|</span>
<span class="mi">16</span><span class="o">|</span>
+<span class="o">|</span> <span class="mi">17</span><span class="o">|</span>
<span class="mi">17</span><span class="o">|</span>
+<span class="o">|</span> <span class="mi">18</span><span class="o">|</span>
<span class="mi">18</span><span class="o">|</span>
+<span class="o">|</span> <span class="mi">19</span><span class="o">|</span>
<span class="mi">19</span><span class="o">|</span>
+<span class="o">+---+---+</span>
+<span class="n">only</span> <span class="n">showing</span> <span
class="n">top</span> <span class="mi">20</span> <span class="n">rows</span>
+</pre></div>
+</div>
+<p>Confirm that the query was accelerated by Comet:</p>
+<div class="highlight-default notranslate"><div
class="highlight"><pre><span></span><span class="n">scala</span><span
class="o">></span> <span class="n">spark</span><span class="o">.</span><span
class="n">sql</span><span class="p">(</span><span class="n">s</span><span
class="s2">"SELECT * from t1"</span><span class="p">)</span><span
class="o">.</span><span class="n">explain</span><span class="p">()</span>
+<span class="o">==</span> <span class="n">Physical</span> <span
class="n">Plan</span> <span class="o">==</span>
+<span class="o">*</span><span class="p">(</span><span class="mi">1</span><span
class="p">)</span> <span class="n">CometColumnarToRow</span>
+<span class="o">+-</span> <span class="n">CometBatchScan</span> <span
class="n">spark_catalog</span><span class="o">.</span><span
class="n">default</span><span class="o">.</span><span class="n">t1</span><span
class="p">[</span><span class="n">c0</span><span class="c1">#26, c1#27]
spark_catalog.default.t1 (branch=null) [filters=, groupedBy=] RuntimeFilters:
[]</span>
+</pre></div>
+</div>
+</section>
+<section id="known-issues">
+<h3>Known issues<a class="headerlink" href="#known-issues" title="Link to this
heading">#</a></h3>
+<ul class="simple">
+<li><p>Spark Runtime Filtering isn’t <a class="reference external"
href="https://github.com/apache/datafusion-comet/issues/2116">working</a></p>
+<ul>
+<li><p>You can bypass the issue by either setting <code class="docutils
literal notranslate"><span
class="pre">spark.sql.adaptive.enabled=false</span></code> or <code
class="docutils literal notranslate"><span
class="pre">spark.comet.exec.broadcastExchange.enabled=false</span></code></p></li>
+</ul>
+</li>
+</ul>
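+<p>As a sketch, either workaround can be applied when launching the shell by adding one (not both) of these flags to
+the <code class="docutils literal notranslate"><span class="pre">spark-shell</span></code> invocation shown above:</p>
+<div class="highlight-shell notranslate"><div class="highlight"><pre><span></span># Option 1: disable adaptive query execution
+  --conf spark.sql.adaptive.enabled=false
+# Option 2: disable Comet's broadcast exchange
+  --conf spark.comet.exec.broadcastExchange.enabled=false
+</pre></div>
+</div>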
+</section>
+</section>
</section>
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]