nabuskey commented on code in PR #193:
URL: https://github.com/apache/datafusion-site/pull/193#discussion_r3350370882
##########
content/blog/2026-06-10-comet-eks.md:
##########
@@ -0,0 +1,111 @@
+---
+layout: post
+title: What two months with the Comet community got our Spark workload on
Amazon EKS
+date: 2026-06-10
+author: Manabu McCloskey (AWS), Vara Bonthu (AWS), Andy Grove
+categories: [performance]
+summary: Apache DataFusion Comet now runs a 3TB TPC-DS workload significantly
faster than vanilla Spark 3.5.8 on Amazon EKS. This post covers what changed in
Comet over the past two releases, and how collaboration between EKS users and
maintainers shaped that work. <br></br> <img
src="/blog/images/comet-eks/version-arc-slope.png" width="60%"
class="img-fluid" alt="TPC-DS performance arc across Comet versions"/>
+---
+
+
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements. See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+[TOC]
+
+
+## Introduction
+
+Apache Spark workloads are some of the most common and cost-intensive jobs
running on Kubernetes. Our customers at AWS are always on the lookout for
meaningful speedups, since any speedup translates directly into lower bills.
That's what got us interested in Apache DataFusion Comet. The local TPC-DS and
TPC-H benchmarks looked very good, and we wanted to see how Comet would hold up
on a more realistic setup.
+
+When we ran a 3TB TPC-DS benchmark on Spark 3.5.8 on Amazon EKS, Comet was 11%
slower than vanilla Spark, and we hit several operational issues along the way
that made it hard to keep the cluster running smoothly. We built a reproducible
benchmark kit and worked closely with the Comet maintainers to validate fixes
as they landed. The same benchmark now runs **32% faster** than vanilla Spark
on the same setup. The rest of this post walks through the issues we hit and
what changed in Comet to address them.
+
+## Setup
+
+We ran TPC-DS at 3TB scale on Spark 3.5.8 on Amazon EKS, comparing vanilla
Spark against Apache DataFusion Comet. Each executor pod was sized at 58 GB of
RAM. We ran the same benchmark on three Comet versions, 0.14.0, 0.15.0, and
0.16.0, so we could see how the project progressed across releases. The full
cluster topology, Spark configurations, and benchmark scripts we used are
documented on the [Data on EKS benchmark
page](https://awslabs.github.io/data-on-eks/docs/benchmarks/spark-datafusion-comet-benchmark).
+
+## What we ran into
+
+We hit four classes of issues running Comet at scale on Amazon EKS. The first
three were operational and the fourth was the source of most of the regression.
The [Data on EKS benchmark
page](https://github.com/awslabs/data-on-eks/blob/05d0590e019d14ed0d058f7c314db74a9b161599/website/docs/benchmarks/spark-datafusion-comet-benchmark.md)
has full configurations and error traces for each one.
+
+### Excessive DNS queries
+
+Comet executors were generating up to 5,000 DNS queries per second per pod,
roughly 500x what vanilla Spark issued, which pushed us against the Route 53
Resolver per-ENI limit of 1,024 queries per second and triggered intermittent
`UnknownHostException` failures. The root cause was that Comet's native Rust
layer was creating a fresh object store instance for every Parquet file read,
each with its own HTTP connection pool. The fix in
[#3802](https://github.com/apache/datafusion-comet/pull/3802) added a
process-wide cache for object stores, which collapsed DNS volume back to
vanilla-Spark levels.
+
+### Unreliable S3 region detection
+
+Without an explicitly configured endpoint region, jobs failed intermittently
with `Generic S3 error: Failed to resolve region`. The cause was the same
caching gap as DNS: Comet called the S3 `HeadBucket` API for every Parquet file
read to resolve the region, and those calls were getting throttled under
concurrent load. The same fix in
[#3802](https://github.com/apache/datafusion-comet/pull/3802) caches the
resolved region per bucket, so `HeadBucket` runs once per bucket and the
`endpoint.region` workaround is no longer required.
+
+### High memory footprint
+
+Comet consistently used about 67% more memory than vanilla Spark and required
a 32 GB off-heap pool to run reliably. The root cause was in shuffle memory
sizing: for each Spark task, Comet spun up two concurrent native execution
contexts (pre-shuffle and shuffle writer), each allocating its own memory pool
at the per-task limit. The fix in
[#3924](https://github.com/apache/datafusion-comet/pull/3924) makes the two
contexts share a single pool, bringing Comet's memory footprint much closer to
vanilla Spark.
Review Comment:
The memory figure is reported through `container_memory_working_set_bytes`, so
it should be the working set size (WSS). We can add disk usage and spill
metrics to the benchmark results going forward.
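
For readers less familiar with the metric, here is a minimal sketch of how the working set is commonly derived from cgroup counters: total usage minus inactive file-backed pages, floored at zero. The function name and the sample numbers are hypothetical and not taken from the benchmark; this only illustrates why WSS excludes reclaimable page cache.

```python
GIB = 1024 ** 3


def working_set_bytes(usage_bytes: int, inactive_file_bytes: int) -> int:
    """Approximate the working set: usage minus inactive file cache, never negative."""
    return max(usage_bytes - inactive_file_bytes, 0)


if __name__ == "__main__":
    # Hypothetical counters for a single executor pod (illustrative only).
    usage = 40 * GIB
    inactive_file = 6 * GIB
    wss = working_set_bytes(usage, inactive_file)
    print(f"working set: {wss / GIB:.1f} GiB")
```

Because inactive file pages are subtracted, heavy Parquet scans that fill the page cache should not inflate this number much, which is one reason separate disk and spill metrics are still worth collecting.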
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]