hsiang-c commented on code in PR #193:
URL: https://github.com/apache/datafusion-site/pull/193#discussion_r3350485314


##########
content/blog/2026-06-10-comet-eks.md:
##########
@@ -0,0 +1,111 @@
+---
+layout: post
+title: What two months with the Comet community got our Spark workload on 
Amazon EKS
+date: 2026-06-10
+author: Manabu McCloskey (AWS), Vara Bonthu (AWS)
+categories: [performance]
+summary: Apache DataFusion Comet now runs a 3TB TPC-DS workload significantly 
faster than vanilla Spark 3.5.8 on Amazon EKS. This post covers what changed in 
Comet over the past two releases, and how collaboration between EKS users and 
maintainers shaped that work. <br></br> <img 
src="/blog/images/comet-eks/version-arc-slope.png" width="60%" 
class="img-fluid" alt="TPC-DS performance arc across Comet versions"/>
+---
+
+
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements.  See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License.  You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+[TOC]
+
+
+## Introduction
+
+Apache Spark workloads are some of the most common and cost-intensive jobs 
running on Kubernetes. Our customers at [AWS](https://aws.amazon.com/) are 
always on the lookout for meaningful speedups, since any speedup translates 
directly into lower bills. That's what got us interested in Apache DataFusion 
Comet. The local TPC-DS and TPC-H benchmarks looked very good, and we wanted to 
see how Comet would hold up on a more realistic setup.
+
+When we ran a 3TB TPC-DS benchmark on Spark 3.5.8 on [Amazon 
EKS](https://aws.amazon.com/pm/eks/), Comet was 11% slower than vanilla Spark, 
and we hit several operational issues along the way that made it hard to keep 
the cluster running smoothly. We built a reproducible benchmark kit and worked 
closely with the Comet maintainers to validate fixes as they landed. The same 
benchmark now runs **32% faster** than vanilla Spark on the same setup. The 
rest of this post walks through the issues we hit and what changed in Comet to 
address them.
+
+## Setup
+
+We ran TPC-DS at 3TB scale on Spark 3.5.8 on Amazon EKS, comparing vanilla 
Spark against Apache DataFusion Comet. Each executor pod was sized at 58 GB of 
RAM. We ran the same benchmark on three Comet versions, 0.14.0, 0.15.0, and 
0.16.0, so we could see how the project progressed across releases. The full 
cluster topology, Spark configurations, and benchmark scripts we used are 
documented on the [Data on EKS benchmark 
page](https://awslabs.github.io/data-on-eks/docs/benchmarks/spark-datafusion-comet-benchmark).
+
+## What we ran into
+
+We hit four classes of issues running Comet at scale on Amazon EKS. The first 
three were operational and the fourth was the source of most of the regression. 
The [Data on EKS benchmark 
page](https://github.com/awslabs/data-on-eks/blob/05d0590e019d14ed0d058f7c314db74a9b161599/website/docs/benchmarks/spark-datafusion-comet-benchmark.md)
 has full configurations and error traces for each one.
+
+### Excessive DNS queries
+
+Comet executors were generating up to 5,000 DNS queries per second per pod, 
roughly 500x what vanilla Spark issued, which pushed us against the Route 53 
Resolver per-ENI limit of 1,024 queries per second and triggered intermittent 
`UnknownHostException` failures. The root cause was that Comet's native Rust 
layer was creating a fresh object store instance for every Parquet file read, 
each with its own HTTP connection pool. The fix in 
[#3802](https://github.com/apache/datafusion-comet/pull/3802) added a 
process-wide cache for object stores, which collapsed DNS volume back to 
vanilla-Spark levels.
+
+### Unreliable S3 region detection
+
+Without an explicitly configured endpoint region, jobs failed intermittently 
with `Generic S3 error: Failed to resolve region`. The cause was the same 
caching gap as DNS: Comet called the S3 `HeadBucket` API for every Parquet file 
read to resolve the region, and those calls were getting throttled under 
concurrent load. The same fix in 
[#3802](https://github.com/apache/datafusion-comet/pull/3802) caches the 
resolved region per bucket, so `HeadBucket` runs once per bucket and the 
`endpoint.region` workaround is no longer required.
+
+### High memory footprint
+
+Comet consistently used about 67% more memory than vanilla Spark and required 
a 32 GB off-heap pool to run reliably. The root cause was in shuffle memory 
sizing: for each Spark task, Comet spun up two concurrent native execution 
contexts (pre-shuffle and shuffle writer), each allocating its own memory pool 
at the per-task limit. The fix in 
[#3924](https://github.com/apache/datafusion-comet/pull/3924) makes the two 
contexts share a single pool, bringing Comet's memory footprint much closer to 
vanilla Spark.
+
+### Missing DPP support
+
+The biggest performance hit came from how Comet handled Dynamic Partition 
Pruning (DPP). DPP is a Spark optimization that prunes fact-table partitions at 
runtime based on filters from broadcast dimensions. For star-schema workloads 
like TPC-DS, DPP is often the difference between scanning 1% of a fact table 
and scanning all of it.
+
+In Comet 0.14.0, the native Parquet scan didn't support DPP, so any query with 
a DPP filter would fall back to vanilla Spark for that scan. Falling back was 
expected, but the fallback path had a planning bug that caused dramatic 
regressions. The resulting plan ran a Spark scan with a Comet shuffle on top, 
with the DPP filter effectively dropped, so the Spark scan read every 
partition. Queries like q25 regressed dramatically as a result.
+
+The fix in [#3982](https://github.com/apache/datafusion-comet/pull/3982) makes 
shuffle fallback decisions sticky across the two planning passes, removing the 
regression. Comet 0.16.0 then closes the broader gap by adding native DPP 
support to the Parquet scan itself, including broadcast exchange reuse so that 
the dimension table is broadcast only once. The 78 queries that previously fell 
back went from 30-50% native execution to 80-97% native. The full set of 
changes is described in the [Comet 0.16.0 release 
announcement](https://datafusion.apache.org/blog/output/2026/05/07/datafusion-comet-0.16.0/)
 on the DataFusion blog.
+
+## How we worked with the Comet community
+
+The improvement came out of a feedback loop with the Comet maintainers. We 
brought something they didn't have ready access to: a 3TB workload running on 
real Amazon EKS infrastructure, with people who could reproduce issues at scale 
and validate fixes against production-like conditions. The full set of issues 
we worked through is tracked in 
[#3799](https://github.com/apache/datafusion-comet/issues/3799).
+
+Some of our setup choices made the reports useful early on. We ran TPC-DS 
sequentially against vanilla Spark and Comet on the same 12-node cluster 
(r8gd.12xlarge instances), with a Grafana dashboard tracking DNS query rate, 
executor memory, network bandwidth, and storage throughput. That's how we 
caught the DNS spike and the elevated memory baseline before they blocked any 
jobs. We also used AI-assisted analysis of Spark's event logs and driver logs 
to narrow down which queries and stages were regressing, which made the bug 
reports concrete enough that maintainers could act on them without a long 
back-and-forth.
+
+Each issue we filed bundled those logs together with Comet's `explainFallback` 
output, stack traces, and a minimal reproduction. The maintainers had enough to 
work with from the first message most of the time, which kept the 
back-and-forth short.
+
+The other side of the loop was experimental builds. When Comet maintainers 
pushed a candidate fix to a branch, we would build a Comet image off it and run 
a subset of TPC-DS against our 3TB workload to validate. Most coordination 
happened through a shared Slack channel and was tracked through GitHub issues, 
with both sides often responding within minutes during business hours.
+
+Three examples give a sense of the cadence. We reported the DNS query volume 
and S3 region detection issues on March 25, 2026. Comet maintainers had a fix 
proposed the next day and merged on March 31. We reported the memory footprint 
issue the same day, and the shared-pool fix landed on April 13.
+
+The DPP work was the deepest collaboration of the three. While testing 
experimental builds containing the earlier DPP changes, we hit a separate 
problem with DPP under AQE that crashed the driver on specific plans. 
Reproducing it reliably took several iterations, and the fix on the maintainer 
side took similar care. We reported it on April 9 
([#3870](https://github.com/apache/datafusion-comet/issues/3870)) and the fix 
landed on April 17.
+
+Two things kept the loop fast. The maintainers knew the Comet and DataFusion 
codebases well enough to turn around fixes within hours of a clean 
reproduction. We had a stable test cluster and could quickly deploy unreleased 
branches into it, so validation was never the bottleneck.
+
+<img
+src="/blog/images/comet-eks/version-arc-slope.png"
+width="80%"
+class="img-fluid"
+alt="TPC-DS 3TB performance vs. vanilla Spark across Comet versions 0.14.0, 
0.15.0, and 0.16.0"
+/>
+
+Across roughly two months, every issue we surfaced was fixed upstream. The 
0.15.0 release picked up the DNS, S3 region, and memory fixes, which moved the 
same benchmark from 11% slower than vanilla Spark to 11% faster. The 0.16.0 
release added the DPP work and pushed it to 32% faster.
+
+## Results
+
+Full per-query results live on the [Data on EKS benchmark 
page](https://awslabs.github.io/data-on-eks/docs/benchmarks/spark-datafusion-comet-benchmark).
 On Parquet, the workload as a whole runs 32% faster than vanilla Spark 3.5.8 
on the same cluster. Of the TPC-DS queries, 66% improve by 20% or more, with 
the largest speedup on q86 at 3.47× faster (+71%).
+
+The DPP fixes drove most of the gains, with the shared memory pool fix 
improving how reliably Comet runs under tight per-task memory limits.
+
+We also ran the same benchmark against Iceberg tables, where Comet was 37% 
faster overall and 90% of queries saw 20% or more improvement.

Review Comment:
   Are these tables of Iceberg V2 spec?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to