This is an automated email from the ASF dual-hosted git repository.
zanmato pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/arrow-site.git
The following commit(s) were added to refs/heads/main by this push:
new 4bd3e71f2db Blog post: Recent Improvements to Hash Join in Arrow C++
(#667)
4bd3e71f2db is described below
commit 4bd3e71f2dbf332a4c48858930045418c3d80725
Author: Rossi Sun <[email protected]>
AuthorDate: Fri Jul 18 16:22:51 2025 +0800
Blog post: Recent Improvements to Hash Join in Arrow C++ (#667)
Co-authored-by: David Li <[email protected]>
Co-authored-by: Bryce Mecum <[email protected]>
---
_data/contributors.yml | 3 +
.../2025-07-18-recent-improvements-to-hash-join.md | 148 +++++++++++++++++++++
.../a-nice-cup-of-flame-graph.png | Bin 0 -> 466480 bytes
.../duckdb-bench-oom-baseline.png | Bin 0 -> 110229 bytes
.../duckdb-bench-oom-opt.png | Bin 0 -> 109256 bytes
.../duckdb-bench-perf-baseline.png | Bin 0 -> 354086 bytes
.../duckdb-bench-perf-opt.png | Bin 0 -> 362188 bytes
.../internal-benchmark.png | Bin 0 -> 156310 bytes
.../memory-profile-baseline.png | Bin 0 -> 35321 bytes
.../memory-profile-opt.png | Bin 0 -> 34667 bytes
10 files changed, 151 insertions(+)
diff --git a/_data/contributors.yml b/_data/contributors.yml
index 5e05ad908a9..408dd19c372 100644
--- a/_data/contributors.yml
+++ b/_data/contributors.yml
@@ -67,4 +67,7 @@
- name: Loïc Alleyne
apacheId: loicalleyne # Not a real apacheId
githubId: loicalleyne
+- name: Rossi Sun
+ apacheId: zanmato
+ githubId: zanmato1984
# End contributors.yml
diff --git a/_posts/2025-07-18-recent-improvements-to-hash-join.md
b/_posts/2025-07-18-recent-improvements-to-hash-join.md
new file mode 100644
index 00000000000..a2a0b043036
--- /dev/null
+++ b/_posts/2025-07-18-recent-improvements-to-hash-join.md
@@ -0,0 +1,148 @@
+---
+layout: post
+title: "Recent Improvements to Hash Join in Arrow C++"
+description: "A deep dive into recent improvements to Apache Arrow’s hash join
implementation — enhancing stability, memory efficiency, and parallel
performance for modern analytic workloads."
+date: "2025-07-18 00:00:00"
+author: zanmato
+categories: [application]
+---
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements. See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+*Editor’s Note: Apache Arrow is an expansive project, spanning the Arrow columnar format itself, its numerous specifications, and a long list of implementations. Arrow is also an expansive project in terms of its community
of contributors. In this blog post, we’d like to highlight recent work by
Apache Arrow Committer Rossi Sun on improving the performance and stability of
Arrow’s embeddable query execution engine: Acero.*
+
+# Introduction
+
+Hash join is a fundamental operation in analytical processing engines — it
matches rows from two tables based on key values using a hash table for fast
lookup. In Apache Arrow C++, the hash join is implemented in Acero, the C++ query execution engine that powers bindings like PyArrow and the R Arrow package. Even if you haven't used Acero directly, your code may
already be benefiting from it under the hood.
+
+For example, this simple PyArrow example uses Acero:
+```python
+import pyarrow as pa
+
+t1 = pa.table({'id': [1, 2, 3],
+ 'year': [2020, 2022, 2019]})
+t2 = pa.table({'id': [3, 4],
+ 'n_legs': [5, 100],
+ 'animal': ["Brittle stars", "Centipede"]})
+
+t1.join(t2, 'id').combine_chunks().sort_by('year')
+```
+
+Acero was originally created in 2019 to demonstrate that the ever-growing
library of compute kernels in Arrow C++ could be linked together into realistic
workflows and also to take advantage of the emerging Datasets API to give these
workflows access to data. Rather than aiming to compete with full query engines
like DuckDB, Acero focuses on enabling flexible, composable, and embeddable
query execution — serving as a building block for tools and systems that need
fast, modular analytics [...]
+
+Across several recent Arrow C++ releases, we've made substantial improvements
to the hash join implementation to address common user pain points. These
changes improve stability, memory efficiency, and parallel performance, with a
focus on making joins more usable and scalable out of the box. If you've had
trouble using Arrow’s hash join in the past, now is a great time to try again.
+
+# Scaling Safely: Improvements to Stability
+
+In earlier versions of Arrow C++, the hash join implementation used internal
data structures that weren’t designed for very large datasets and lacked
safeguards in some of the underlying memory operations. These limitations
rarely surfaced in small to medium workloads but became problematic at scale,
manifesting as crashes or subtle correctness issues.
+
+At the core of Arrow’s join implementation is a compact, row-oriented
structure known as the “row table”. While Arrow’s data model is columnar, its
hash join implementation operates in a row-wise fashion — similar to modern
engines like DuckDB and Meta’s Velox. This layout minimizes CPU cache misses
during hash table lookups by collocating keys, payloads, and null bits in
memory so they can be accessed together.
+
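+To make that layout concrete, here is a heavily simplified, hypothetical sketch of a row-table-like structure (the names and fields are illustrative only, not Arrow's actual row table code):
+
+```cpp
+#include <cstdint>
+#include <vector>
+
+// Hypothetical sketch: fixed-length rows packed back to back, with the null
+// bits for each row stored alongside, so probing touches contiguous memory.
+struct RowTableSketch {
+  int64_t row_length = 0;           // bytes per packed row (keys + payload)
+  int64_t null_mask_num_bytes = 0;  // bytes of null bits per row
+  std::vector<uint8_t> rows;        // row 0 | row 1 | row 2 | ...
+  std::vector<uint8_t> null_masks;  // null bits of row 0 | row 1 | ...
+
+  const uint8_t* row(int64_t row_id) const {
+    return rows.data() + row_length * row_id;
+  }
+};
+```
+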
+In previous versions, the row table used 32-bit offsets to reference packed
rows. This capped each table’s size at 4GB and introduced risks of overflow
when working with large datasets or wide rows. Several reported issues —
[GH-34474](https://github.com/apache/arrow/issues/34474),
[GH-41813](https://github.com/apache/arrow/issues/41813), and
[GH-43202](https://github.com/apache/arrow/issues/43202) — highlighted the
limitations of this design. In response, PR [GH-43389](https://github.co [...]
+
+Besides the offset limitation, earlier versions of Arrow C++ also included
overflow-prone logic in the buffer indexing paths used throughout the hash join
implementation. Many internal calculations assumed that 32-bit integers were
sufficient for addressing memory — a fragile assumption when working with large
datasets or wide rows. These issues appeared not only in conventional C++
indexing code but also in Arrow’s SIMD-accelerated paths — Arrow includes heavy
SIMD specializations, used [...]
+
+Two representative examples:
+
+- Row-wise buffer access in C++
+
+The aforementioned row table stores fixed-length data in tightly packed
buffers. Accessing a particular row (and optionally a column within it)
typically involves [pointer
arithmetic](https://github.com/apache/arrow/blob/12f62653c825fbf305bfde61c112d2aa69203c62/cpp/src/arrow/acero/swiss_join_internal.h#L120):
+```cpp
+const uint8_t* row_ptr = row_ptr_base + row_length * row_id;
+```
+When both `row_length` and `row_id` are large 32-bit integers, their product
can overflow.
+
+Similarly, accessing null masks involves [null-bit indexing
arithmetic](https://github.com/apache/arrow/blob/12f62653c825fbf305bfde61c112d2aa69203c62/cpp/src/arrow/acero/swiss_join_internal.h#L150):
+```cpp
+int64_t bit_id = row_id * null_mask_num_bytes * 8 + pos_after_encoding;
+```
+The intermediate multiplication is performed using 32-bit arithmetic and can
overflow even though the final result is stored in a 64-bit variable.
+
+- SIMD gathers with 32-bit offsets
+
+One essential SIMD instruction is the AVX2 intrinsic `__m256i
_mm256_i32gather_epi32(int const * base, __m256i vindex, const int scale);`,
which performs a parallel memory gather of eight 32-bit integers based on eight
32-bit signed offsets. It was extensively used in Arrow for hash table
operations, for example, [fetching 8 group
IDs](https://github.com/apache/arrow/blob/0a00e25f2f6fb927fb555b69038d0be9b9d9f265/cpp/src/arrow/compute/key_map_internal_avx2.cc#L404)
(hash table slots) in p [...]
+```cpp
+__m256i group_id = _mm256_i32gather_epi32(elements, pos, 1);
+```
+and [loading 8 corresponding key
values](https://github.com/apache/arrow/blob/69e8a78c018da88b60f9eb2b3b45703f81f3c93d/cpp/src/arrow/compute/row/compare_internal_avx2.cc#L284)
from the right-side input in parallel for comparison:
+```cpp
+__m256i right = _mm256_i32gather_epi32((const int*)right_base, offset_right, 1);
+```
+If any of the computed offsets exceed `2^31 - 1`, they wrap into the negative
range, which can lead to invalid memory access (i.e., a crash) or, more subtly,
fetch data from a valid but incorrect location — producing silently wrong
results (trust me, you don’t want to debug that).
+
+To mitigate these risks, PRs
[GH-45108](https://github.com/apache/arrow/pull/45108),
[GH-45336](https://github.com/apache/arrow/pull/45336), and
[GH-45515](https://github.com/apache/arrow/pull/45515) promoted critical
arithmetic to 64-bit and reworked SIMD logic to use safer indexing. Buffer
access logic was also encapsulated in safer abstractions to avoid repeated
manual casting or unchecked offset math. These examples are not unique to Arrow
— they reflect common pitfalls in building da [...]
+
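+To give a flavor of those changes, here is a hedged sketch of the two patterns: promoting indexing arithmetic to 64-bit, and splitting a 32-bit gather into 64-bit-indexed gathers. The helper names are hypothetical and the code is illustrative only, not the actual Arrow implementation:
+
+```cpp
+#include <cstdint>
+#include <immintrin.h>
+
+// Widening one operand forces the whole multiplication into 64 bits, so the
+// products that previously wrapped in 32-bit arithmetic stay exact.
+const uint8_t* GetRowPtr(const uint8_t* row_ptr_base, uint32_t row_length,
+                         uint32_t row_id) {
+  return row_ptr_base + static_cast<int64_t>(row_length) * row_id;
+}
+
+int64_t GetBitId(uint32_t row_id, uint32_t null_mask_num_bytes,
+                 uint32_t pos_after_encoding) {
+  return static_cast<int64_t>(row_id) * null_mask_num_bytes * 8 +
+         pos_after_encoding;
+}
+
+// One way to make the 8-lane gather safe: widen the eight 32-bit byte offsets
+// to 64 bits and issue two 4-lane gathers that use 64-bit indices.
+__m256i Gather8x32Safe(const int* base, __m256i byte_offsets32) {
+  __m256i lo64 = _mm256_cvtepu32_epi64(_mm256_castsi256_si128(byte_offsets32));
+  __m256i hi64 =
+      _mm256_cvtepu32_epi64(_mm256_extracti128_si256(byte_offsets32, 1));
+  __m128i lo = _mm256_i64gather_epi32(base, lo64, 1);
+  __m128i hi = _mm256_i64gather_epi32(base, hi64, 1);
+  return _mm256_set_m128i(hi, lo);
+}
+```
+
+In this sketch the widened indices cover byte offsets up to 4GB; offsets beyond that would need to be computed as 64-bit values from the start.
+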
+Together, these changes make Arrow’s hash join implementation significantly
more robust and better equipped for modern data workloads. These foundations
not only resolve known issues but also reduce the risk of similar bugs in
future development.
+
+# Leaner Memory Usage
+
+While refining overflow-prone parts of the hash join implementation, I ended
up examining most of the code path for potential pitfalls. When doing this kind
of work, one sits down quietly and interrogates every line — asking not just
whether an intermediate value might overflow, but whether it even needs to
exist at all. And during that process, I came across something unrelated to
overflow — but even more impactful.
+
+In a textbook hash join algorithm, once the right-side table (the build-side)
is fully accumulated, a hash table is constructed to support probing the
left-side table (the probe-side) for matches. To parallelize this build step,
Arrow C++’s implementation partitions the build-side into `N` partitions —
typically matching the number of available CPU cores — and builds a separate
hash table for each partition in parallel. These are then merged into a final,
unified hash table used during t [...]
+
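+As a rough illustration of that flow, here is a hypothetical sketch using standard containers as stand-ins (Acero's actual row table and hash table are far more compact than this):
+
+```cpp
+#include <cstddef>
+#include <cstdint>
+#include <functional>
+#include <thread>
+#include <unordered_map>
+#include <vector>
+
+// Hypothetical sketch of the partitioned build: each partition gets its own
+// hash table, built in parallel, and the partitions are merged at the end.
+// Requires C++17 for unordered_multimap::merge.
+using HashTableSketch = std::unordered_multimap<int64_t, int64_t>;  // key -> row id
+
+HashTableSketch PartitionedBuild(const std::vector<int64_t>& keys,
+                                 int num_partitions) {
+  std::vector<HashTableSketch> partitions(num_partitions);
+  std::vector<std::thread> workers;
+  for (int p = 0; p < num_partitions; ++p) {
+    workers.emplace_back([&, p] {
+      for (int64_t row = 0; row < static_cast<int64_t>(keys.size()); ++row) {
+        if (std::hash<int64_t>{}(keys[row]) % num_partitions ==
+            static_cast<size_t>(p)) {
+          partitions[p].emplace(keys[row], row);
+        }
+      }
+    });
+  }
+  for (auto& w : workers) w.join();
+
+  // Merge the per-partition hash tables into the final, unified one.
+  HashTableSketch final_table;
+  for (auto& part : partitions) final_table.merge(part);
+  return final_table;
+}
+```
+
+In this sketch the per-partition tables are locals that vanish when the function returns; in the actual implementation they were part of longer-lived join state.
+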
+The issue? The memory footprint. The total size of the partitioned hash tables
is roughly equal to that of the final hash table, but they were being held in
memory even after merging. Once the final hash table was built, these temporary
structures had no further use — yet they persisted through the entire join
operation. There were no crashes, no warnings, no visible red flags — just
silent overhead.
+
+Once spotted, the fix was straightforward: restructure the join process to
release these buffers immediately after the merge. The change was implemented
in PR [GH-45552](https://github.com/apache/arrow/issues/45552). The memory
profiles below illustrate its impact.
+
+<div style="display: flex; gap: 16px; justify-content: center; align-items:
flex-start;">
+ <img src="{{ site.baseurl
}}/img/recent-improvements-to-hash-join/memory-profile-baseline.png"
width="50%" class="img-responsive" alt="Memory profile before"
aria-hidden="true">
+ <img src="{{ site.baseurl
}}/img/recent-improvements-to-hash-join/memory-profile-opt.png" width="50%"
class="img-responsive" alt="Memory profile after" aria-hidden="true">
+</div>
+
+At `A`, memory usage rises steadily as the join builds partitioned hash tables
in parallel. `B` marks the merge point, where these partitions are combined
into a final, unified hash table. `C` represents the start of the probe phase,
where the left-side table is scanned and matched against the final hash table.
Memory begins to rise again as join results are materialized. `D` is the peak
of the join operation, just before memory begins to drop as processing
completes. The “leap of faith” [...]
+
+This improvement already benefits real-world scenarios — for example, the
[DuckDB Labs DB Benchmark](https://duckdblabs.github.io/db-benchmark/). Some
benchmark queries that previously failed with out-of-memory (OOM) errors can
now complete successfully — as shown in the comparison below.
+
+<div style="display: flex; gap: 16px; justify-content: center; align-items:
flex-start;">
+ <img src="{{ site.baseurl
}}/img/recent-improvements-to-hash-join/duckdb-bench-oom-baseline.png"
width="50%" class="img-responsive" alt="DuckDB benchmark OOM before"
aria-hidden="true">
+ <img src="{{ site.baseurl
}}/img/recent-improvements-to-hash-join/duckdb-bench-oom-opt.png" width="50%"
class="img-responsive" alt="DuckDB benchmark OOM after" aria-hidden="true">
+</div>
+
+As one reviewer noted in the PR, this was a “low-hanging fruit.” And
sometimes, meaningful performance gains don’t come from tuning hot loops or
digging through flame graphs — they come from noticing something that doesn’t
feel right and asking: why are we still keeping this around?
+
+# Faster Execution Through Better Parallelism
+
+Not every improvement comes from poring over flame graphs — but some
definitely do. Performance is, after all, the most talked-about aspect of any
query engine. So, how about a nice cup of flame graph?
+
+<img src="{{ site.baseurl
}}/img/recent-improvements-to-hash-join/a-nice-cup-of-flame-graph.png"
width="100%" class="img-responsive" alt="A nice cup of flame graph"
aria-hidden="true">
+
+It’s hard not to notice the long, flat bar dominating the middle — especially
with the rather alarming word “Lock” in it. That’s our red flag.
+
+We’ve mentioned that in the build phase, we build partitioned hash tables in
parallel. In earlier versions of Arrow C++, this parallelism was implemented on
a batch basis — each thread processed a build-side batch concurrently. Since
each batch contained arbitrary data that could fall into any partition, threads
had to synchronize when accessing shared partitions. This was managed through
[locks on
partitions](https://github.com/apache/arrow/blob/196cde38c112d32a944afe978b6da9c7ce935ef7/
[...]
+
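+Conceptually, the old scheme looked something like this hedged sketch (illustrative only; the real code manages packed row data, not plain vectors):
+
+```cpp
+#include <cstddef>
+#include <cstdint>
+#include <functional>
+#include <mutex>
+#include <vector>
+
+// Hypothetical sketch of batch-parallel insertion: any thread can hit any
+// partition, so every partition needs a lock, and threads that collide on a
+// hot partition serialize behind it.
+struct SharedPartitions {
+  // Both vectors are assumed to be sized to the partition count up front,
+  // e.g. std::vector<std::mutex> locks(num_partitions).
+  std::vector<std::vector<int64_t>> rows;
+  std::vector<std::mutex> locks;
+};
+
+void InsertBatch(SharedPartitions& parts, const std::vector<int64_t>& batch_keys) {
+  const size_t num_partitions = parts.rows.size();
+  for (int64_t key : batch_keys) {
+    size_t p = std::hash<int64_t>{}(key) % num_partitions;
+    std::lock_guard<std::mutex> guard(parts.locks[p]);  // the contention point
+    parts.rows[p].push_back(key);
+  }
+}
+```
+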
+To mitigate this contention, we restructured the build phase in PR
[GH-45612](https://github.com/apache/arrow/issues/45612). Instead of having all
threads partition and insert at once — each thread touching every hash table —
we split the work into two distinct stages. In the first partition stage, `M`
threads take their assigned batches and only partition them, recording which
rows belong to which partition. No insertion happens yet — just classification.
Then comes the second, newly se [...]
+
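+To make the two-stage idea concrete, here is a hedged sketch of one way to structure it, where each partition is owned by exactly one thread in the second stage so that ownership replaces locking (names and structure are illustrative, not Acero's actual code):
+
+```cpp
+#include <cstddef>
+#include <cstdint>
+#include <functional>
+#include <thread>
+#include <vector>
+
+// Hypothetical two-stage build: stage 1 only classifies rows into partitions
+// (thread-local output, no shared writes); stage 2 gives each partition to a
+// single owning thread, so no locks are needed in either stage.
+std::vector<std::vector<int64_t>> TwoStageBuild(const std::vector<int64_t>& keys,
+                                                int num_partitions) {
+  const int num_threads = num_partitions;  // assume M == N for simplicity
+  // Stage 1: each thread partitions its slice of rows into local buckets.
+  std::vector<std::vector<std::vector<int64_t>>> local(
+      num_threads, std::vector<std::vector<int64_t>>(num_partitions));
+  std::vector<std::thread> workers;
+  for (int t = 0; t < num_threads; ++t) {
+    workers.emplace_back([&, t] {
+      for (size_t row = t; row < keys.size(); row += num_threads) {
+        size_t p = std::hash<int64_t>{}(keys[row]) % num_partitions;
+        local[t][p].push_back(static_cast<int64_t>(row));
+      }
+    });
+  }
+  for (auto& w : workers) w.join();
+  workers.clear();
+
+  // Stage 2: the owning thread of partition p collects everything the stage-1
+  // threads classified into p. No other thread touches partition p.
+  std::vector<std::vector<int64_t>> partitions(num_partitions);
+  for (int p = 0; p < num_partitions; ++p) {
+    workers.emplace_back([&, p] {
+      for (int t = 0; t < num_threads; ++t) {
+        partitions[p].insert(partitions[p].end(), local[t][p].begin(),
+                             local[t][p].end());
+      }
+    });
+  }
+  for (auto& w : workers) w.join();
+  return partitions;
+}
+```
+
+The trade-off is an extra pass over the data, but that pass is contention-free, whereas the lock-based scheme serializes threads exactly where we want them to scale.
+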
+<img src="{{ site.baseurl
}}/img/recent-improvements-to-hash-join/internal-benchmark.png" width="100%"
class="img-responsive" alt="Internal benchmark" aria-hidden="true">
+
+We've also observed similar gains in real-world scenarios like the [DuckDB Labs DB Benchmark](https://duckdblabs.github.io/db-benchmark/). The comparison below shows around a 2x improvement in query performance after this change was applied.
+
+<div style="display: flex; gap: 16px; justify-content: center; align-items:
flex-start;">
+ <img src="{{ site.baseurl
}}/img/recent-improvements-to-hash-join/duckdb-bench-perf-baseline.png"
width="50%" class="img-responsive" alt="DuckDB benchmark perf before"
aria-hidden="true">
+ <img src="{{ site.baseurl
}}/img/recent-improvements-to-hash-join/duckdb-bench-perf-opt.png" width="50%"
class="img-responsive" alt="DuckDB benchmark perf after" aria-hidden="true">
+</div>
+
+Additional improvements include
[GH-43832](https://github.com/apache/arrow/pull/43832), which extends AVX2
acceleration to more probing code paths, and
[GH-45918](https://github.com/apache/arrow/pull/45918), which introduces
parallelism to a previously sequential task phase. These target more
specialized scenarios and edge cases.
+
+# Closing
+
+These improvements reflect ongoing investment in Arrow C++’s execution engine
and a commitment to delivering fast, robust building blocks for analytic
workloads. They are available in recent Arrow C++ releases and exposed through
higher-level bindings like PyArrow and the Arrow R package — starting from
version 18.0.0, with the most significant improvements landing in 20.0.0. If
joins were a blocker for you before — due to memory, scale, or correctness —
recent changes may offer a very d [...]
+
+The Arrow C++ engine is not just alive — it’s improving in meaningful,
user-visible ways. We’re also actively monitoring for further issues and open
to expanding the design based on user feedback and real-world needs. If you’ve
tried joins in the past and run into performance or stability issues, we encourage you to give them another try and file an [issue on GitHub](https://github.com/apache/arrow/issues) if anything still falls short.
+
+If you have any questions about this blog post, please feel free to contact
the author, [Rossi Sun](mailto:[email protected]).
diff --git a/img/recent-improvements-to-hash-join/a-nice-cup-of-flame-graph.png
b/img/recent-improvements-to-hash-join/a-nice-cup-of-flame-graph.png
new file mode 100644
index 00000000000..5fb67848f40
Binary files /dev/null and
b/img/recent-improvements-to-hash-join/a-nice-cup-of-flame-graph.png differ
diff --git a/img/recent-improvements-to-hash-join/duckdb-bench-oom-baseline.png
b/img/recent-improvements-to-hash-join/duckdb-bench-oom-baseline.png
new file mode 100644
index 00000000000..a046b57223c
Binary files /dev/null and
b/img/recent-improvements-to-hash-join/duckdb-bench-oom-baseline.png differ
diff --git a/img/recent-improvements-to-hash-join/duckdb-bench-oom-opt.png
b/img/recent-improvements-to-hash-join/duckdb-bench-oom-opt.png
new file mode 100644
index 00000000000..ae5ad8c9cd0
Binary files /dev/null and
b/img/recent-improvements-to-hash-join/duckdb-bench-oom-opt.png differ
diff --git
a/img/recent-improvements-to-hash-join/duckdb-bench-perf-baseline.png
b/img/recent-improvements-to-hash-join/duckdb-bench-perf-baseline.png
new file mode 100644
index 00000000000..bb7cb45cc24
Binary files /dev/null and
b/img/recent-improvements-to-hash-join/duckdb-bench-perf-baseline.png differ
diff --git a/img/recent-improvements-to-hash-join/duckdb-bench-perf-opt.png
b/img/recent-improvements-to-hash-join/duckdb-bench-perf-opt.png
new file mode 100644
index 00000000000..3b544a490c0
Binary files /dev/null and
b/img/recent-improvements-to-hash-join/duckdb-bench-perf-opt.png differ
diff --git a/img/recent-improvements-to-hash-join/internal-benchmark.png
b/img/recent-improvements-to-hash-join/internal-benchmark.png
new file mode 100644
index 00000000000..c7b31b44ea3
Binary files /dev/null and
b/img/recent-improvements-to-hash-join/internal-benchmark.png differ
diff --git a/img/recent-improvements-to-hash-join/memory-profile-baseline.png
b/img/recent-improvements-to-hash-join/memory-profile-baseline.png
new file mode 100644
index 00000000000..e039d8f33d1
Binary files /dev/null and
b/img/recent-improvements-to-hash-join/memory-profile-baseline.png differ
diff --git a/img/recent-improvements-to-hash-join/memory-profile-opt.png
b/img/recent-improvements-to-hash-join/memory-profile-opt.png
new file mode 100644
index 00000000000..9bd4e5148ea
Binary files /dev/null and
b/img/recent-improvements-to-hash-join/memory-profile-opt.png differ