This is an automated email from the ASF dual-hosted git repository.
zanmato pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/arrow-site.git
The following commit(s) were added to refs/heads/main by this push:
new 4bd3e71f2db Blog post: Recent Improvements to Hash Join in Arrow C++
(#667)
4bd3e71f2db is described below
commit 4bd3e71f2dbf332a4c48858930045418c3d80725
Author: Rossi Sun <[email protected]>
AuthorDate: Fri Jul 18 16:22:51 2025 +0800
Blog post: Recent Improvements to Hash Join in Arrow C++ (#667)
Co-authored-by: David Li <[email protected]>
Co-authored-by: Bryce Mecum <[email protected]>
---
_data/contributors.yml | 3 +
.../2025-07-18-recent-improvements-to-hash-join.md | 148 +++++++++++++++++++++
.../a-nice-cup-of-flame-graph.png | Bin 0 -> 466480 bytes
.../duckdb-bench-oom-baseline.png | Bin 0 -> 110229 bytes
.../duckdb-bench-oom-opt.png | Bin 0 -> 109256 bytes
.../duckdb-bench-perf-baseline.png | Bin 0 -> 354086 bytes
.../duckdb-bench-perf-opt.png | Bin 0 -> 362188 bytes
.../internal-benchmark.png | Bin 0 -> 156310 bytes
.../memory-profile-baseline.png | Bin 0 -> 35321 bytes
.../memory-profile-opt.png | Bin 0 -> 34667 bytes
10 files changed, 151 insertions(+)
diff --git a/_data/contributors.yml b/_data/contributors.yml
index 5e05ad908a9..408dd19c372 100644
--- a/_data/contributors.yml
+++ b/_data/contributors.yml
@@ -67,4 +67,7 @@
- name: Loïc Alleyne
apacheId: loicalleyne # Not a real apacheId
githubId: loicalleyne
+- name: Rossi Sun
+ apacheId: zanmato
+ githubId: zanmato1984
# End contributors.yml
diff --git a/_posts/2025-07-18-recent-improvements-to-hash-join.md
b/_posts/2025-07-18-recent-improvements-to-hash-join.md
new file mode 100644
index 00000000000..a2a0b043036
--- /dev/null
+++ b/_posts/2025-07-18-recent-improvements-to-hash-join.md
@@ -0,0 +1,148 @@
+---
+layout: post
+title: "Recent Improvements to Hash Join in Arrow C++"
+description: "A deep dive into recent improvements to Apache Arrow’s hash join
implementation — enhancing stability, memory efficiency, and parallel
performance for modern analytic workloads."
+date: "2025-07-18 00:00:00"
+author: zanmato
+categories: [application]
+---
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements. See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+*Editor’s Note: Apache Arrow is an expansive project, spanning the Arrow columnar format itself, its numerous specifications, and a long list of implementations. Arrow is also an expansive project in terms of its community
of contributors. In this blog post, we’d like to highlight recent work by
Apache Arrow Committer Rossi Sun on improving the performance and stability of
Arrow’s embeddable query execution engine: Acero.*
+
+# Introduction
+
+Hash join is a fundamental operation in analytical processing engines — it
matches rows from two tables based on key values using a hash table for fast
lookup. In Apache Arrow C++, the hash join is implemented in Acero, the C++ query execution engine that powers bindings like PyArrow and the R Arrow package. Even if you haven't used Acero directly, your code may
already be benefiting from it under the hood.
+
+For example, this simple PyArrow example uses Acero:
+```python
+import pyarrow as pa
+
+t1 = pa.table({'id': [1, 2, 3],
+ 'year': [2020, 2022, 2019]})
+t2 = pa.table({'id': [3, 4],
+ 'n_legs': [5, 100],
+ 'animal': ["Brittle stars", "Centipede"]})
+
+t1.join(t2, 'id').combine_chunks().sort_by('year')
+```
+
+Acero was originally created in 2019 to demonstrate that the ever-growing
library of compute kernels in Arrow C++ could be linked together into realistic
workflows and also to take advantage of the emerging Datasets API to give these
workflows access to data. Rather than aiming to compete with full query engines
like DuckDB, Acero focuses on enabling flexible, composable, and embeddable
query execution — serving as a building block for tools and systems that need
fast, modular analytics [...]
+
+Across several recent Arrow C++ releases, we've made substantial improvements
to the hash join implementation to address common user pain points. These
changes improve stability, memory efficiency, and parallel performance, with a
focus on making joins more usable and scalable out of the box. If you've had
trouble using Arrow’s hash join in the past, now is a great time to try again.
+
+# Scaling Safely: Improvements to Stability
+
+In earlier versions of Arrow C++, the hash join implementation used internal
data structures that weren’t designed for very large datasets and lacked
safeguards in some of the underlying memory operations. These limitations
rarely surfaced in small to medium workloads but became problematic at scale,
manifesting as crashes or subtle correctness issues.
+
+At the core of Arrow’s join implementation is a compact, row-oriented
structure known as the “row table”. While Arrow’s data model is columnar, its
hash join implementation operates in a row-wise fashion — similar to modern
engines like DuckDB and Meta’s Velox. This layout minimizes CPU cache misses
during hash table lookups by collocating keys, payloads, and null bits in
memory so they can be accessed together.
+
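+To make that layout concrete, here is a heavily simplified, hypothetical sketch of a row-table-like structure (the names and fields are illustrative only, not Arrow's actual row table code):
+
+```cpp
+#include <cstdint>
+#include <vector>
+
+// Hypothetical sketch: fixed-length rows packed back to back, with the null
+// bits for each row stored alongside, so probing touches contiguous memory.
+struct RowTableSketch {
+  int64_t row_length = 0;           // bytes per packed row (keys + payload)
+  int64_t null_mask_num_bytes = 0;  // bytes of null bits per row
+  std::vector<uint8_t> rows;        // row 0 | row 1 | row 2 | ...
+  std::vector<uint8_t> null_masks;  // null bits of row 0 | row 1 | ...
+
+  const uint8_t* row(int64_t row_id) const {
+    return rows.data() + row_length * row_id;
+  }
+};
+```
+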
+In previous versions, the row table used 32-bit offsets to reference packed
rows. This capped each table’s size at 4GB and introduced risks of overflow
when working with large datasets or wide rows. Several reported issues —
[GH-34474](https://github.com/apache/arrow/issues/34474),
[GH-41813](https://github.com/apache/arrow/issues/41813), and
[GH-43202](https://github.com/apache/arrow/issues/43202) — highlighted the
limitations of this design. In response, PR [GH-43389](https://github.co [...]
+
+Besides the offset limitation, earlier versions of Arrow C++ also included
overflow-prone logic in the buffer indexing paths used throughout the hash join
implementation. Many internal calculations assumed that 32-bit integers were
sufficient for addressing memory — a fragile assumption when working with large
datasets or wide rows. These issues appeared not only in conventional C++
indexing code but also in Arrow’s SIMD-accelerated paths — Arrow includes heavy
SIMD specializations, used [...]
+
+Two representative examples:
+
+- Row-wise buffer access in C++
+
+The aforementioned row table stores fixed-length data in tightly packed
buffers. Accessing a particular row (and optionally a column within it)
typically involves [pointer
arithmetic](https://github.com/apache/arrow/blob/12f62653c825fbf305bfde61c112d2aa69203c62/cpp/src/arrow/acero/swiss_join_internal.h#L120):
+```cpp
+const uint8_t* row_ptr = row_ptr_base + row_length * row_id;
+```
+When both `row_length` and `row_id` are large 32-bit integers, their product
can overflow.
+
+Similarly, accessing null masks involves [null-bit indexing
arithmetic](https://github.com/apache/arrow/blob/12f62653c825fbf305bfde61c112d2aa69203c62/cpp/src/arrow/acero/swiss_join_internal.h#L150):
+```cpp
+int64_t bit_id = row_id * null_mask_num_bytes * 8 + pos_after_encoding;
+```
+The intermediate multiplication is performed using 32-bit arithmetic and can
overflow even though the final result is stored in a 64-bit variable.
+
+- SIMD gathers with 32-bit offsets
+
+One essential SIMD instruction is the AVX2 intrinsic `__m256i
_mm256_i32gather_epi32(int const * base, __m256i vindex, const int scale);`,
which performs a parallel memory gather of eight 32-bit integers based on eight
32-bit signed offsets. It was extensively used in Arrow for hash table
operations, for example, [fetching 8 group
IDs](https://github.com/apache/arrow/blob/0a00e25f2f6fb927fb555b69038d0be9b9d9f265/cpp/src/arrow/compute/key_map_internal_avx2.cc#L404)
(hash table slots) in p [...]
+```cpp
+__m256i group_id = _mm256_i32gather_epi32(elements, pos, 1);
+```
+and [loading 8 corresponding key
values](https://github.com/apache/arrow/blob/69e8a78c018da88b60f9eb2b3b45703f81f3c93d/cpp/src/arrow/compute/row/compare_internal_avx2.cc#L284)
from the right-side input in parallel for comparison:
+```cpp
+__m256i right = _mm256_i32gather_epi32((const int*)right_base, offset_right, 1);
+```
+If any of the computed offsets exceed `2^31 - 1`, they wrap into the negative
range, which can lead to invalid memory access (i.e., a crash) or, more subtly,
fetch data from a valid but incorrect location — producing silently wrong
results (trust me, you don’t want to debug that).
+
+To mitigate these risks, PRs
[GH-45108](https://github.com/apache/arrow/pull/45108),
[GH-45336](https://github.com/apache/arrow/pull/45336), and
[GH-45515](https://github.com/apache/arrow/pull/45515) promoted critical
arithmetic to 64-bit and reworked SIMD logic to use safer indexing. Buffer
access logic was also encapsulated in safer abstractions to avoid repeated
manual casting or unchecked offset math. These examples are not unique to Arrow
— they reflect common pitfalls in building da [...]
+
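+To give a flavor of those changes, here is a hedged sketch of the two patterns: promoting indexing arithmetic to 64-bit, and splitting a 32-bit gather into 64-bit-indexed gathers. The helper names are hypothetical and the code is illustrative only, not the actual Arrow implementation:
+
+```cpp
+#include <cstdint>
+#include <immintrin.h>
+
+// Widening one operand forces the whole multiplication into 64 bits, so the
+// products that previously wrapped in 32-bit arithmetic stay exact.
+const uint8_t* GetRowPtr(const uint8_t* row_ptr_base, uint32_t row_length,
+                         uint32_t row_id) {
+  return row_ptr_base + static_cast<int64_t>(row_length) * row_id;
+}
+
+int64_t GetBitId(uint32_t row_id, uint32_t null_mask_num_bytes,
+                 uint32_t pos_after_encoding) {
+  return static_cast<int64_t>(row_id) * null_mask_num_bytes * 8 +
+         pos_after_encoding;
+}
+
+// One way to make the 8-lane gather safe: widen the eight 32-bit byte offsets
+// to 64 bits and issue two 4-lane gathers that use 64-bit indices.
+__m256i Gather8x32Safe(const int* base, __m256i byte_offsets32) {
+  __m256i lo64 = _mm256_cvtepu32_epi64(_mm256_castsi256_si128(byte_offsets32));
+  __m256i hi64 =
+      _mm256_cvtepu32_epi64(_mm256_extracti128_si256(byte_offsets32, 1));
+  __m128i lo = _mm256_i64gather_epi32(base, lo64, 1);
+  __m128i hi = _mm256_i64gather_epi32(base, hi64, 1);
+  return _mm256_set_m128i(hi, lo);
+}
+```
+
+In this sketch the widened indices cover byte offsets up to 4GB; offsets beyond that would need to be computed as 64-bit values from the start.
+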
+Together, these changes make Arrow’s hash join implementation significantly
more robust and better equipped for modern data workloads. These foundations
not only resolve known issues but also reduce the risk of similar bugs in
future development.
+
+# Leaner Memory Usage
+
+While refining overflow-prone parts of the hash join implementation, I ended
up examining most of the code path for potential pitfalls. When doing this kind
of work, one sits down quietly and interrogates every line — asking not just
whether an intermediate value might overflow, but whether it even needs to
exist at all. And during that process, I came across something unrelated to
overflow — but even more impactful.
+
+In a textbook hash join algorithm, once the right-side table (the build-side)
is fully accumulated, a hash table is constructed to support probing the
left-side table (the probe-side) for matches. To parallelize this build step,
Arrow C++’s implementation partitions the build-side into `N` partitions —
typically matching the number of available CPU cores — and builds a separate
hash table for each partition in parallel. These are then merged into a final,
unified hash table used during t [...]
+
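+As a rough illustration of that flow, here is a hypothetical sketch using standard containers as stand-ins (Acero's actual row table and hash table are far more compact than this):
+
+```cpp
+#include <cstddef>
+#include <cstdint>
+#include <functional>
+#include <thread>
+#include <unordered_map>
+#include <vector>
+
+// Hypothetical sketch of the partitioned build: each partition gets its own
+// hash table, built in parallel, and the partitions are merged at the end.
+// Requires C++17 for unordered_multimap::merge.
+using HashTableSketch = std::unordered_multimap<int64_t, int64_t>;  // key -> row id
+
+HashTableSketch PartitionedBuild(const std::vector<int64_t>& keys,
+                                 int num_partitions) {
+  std::vector<HashTableSketch> partitions(num_partitions);
+  std::vector<std::thread> workers;
+  for (int p = 0; p < num_partitions; ++p) {
+    workers.emplace_back([&, p] {
+      for (int64_t row = 0; row < static_cast<int64_t>(keys.size()); ++row) {
+        if (std::hash<int64_t>{}(keys[row]) % num_partitions ==
+            static_cast<size_t>(p)) {
+          partitions[p].emplace(keys[row], row);
+        }
+      }
+    });
+  }
+  for (auto& w : workers) w.join();
+
+  // Merge the per-partition hash tables into the final, unified one.
+  HashTableSketch final_table;
+  for (auto& part : partitions) final_table.merge(part);
+  return final_table;
+}
+```
+
+In this sketch the per-partition tables are locals that vanish when the function returns; in the actual implementation they were part of longer-lived join state.
+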
+The issue? The memory footprint. The total size of the partitioned hash tables
is roughly equal to that of the final hash table, but they were being held in
memory even after merging. Once the final hash table was built, these temporary
structures had no further use — yet they persisted through the entire join
operation. There were no crashes, no warnings, no visible red flags — just
silent overhead.
+
+Once spotted, the fix was straightforward: restructure the join process to
release these buffers immediately after the merge. The change was implemented
in PR [GH-45552](https://github.com/apache/arrow/issues/45552). The memory
profiles below illustrate its impact.
+
+<div style="display: flex; gap: 16px; justify-content: center; align-items:
flex-start;">
+ <img src="{{ site.baseurl
}}/img/recent-improvements-to-hash-join/memory-profile-baseline.png"
width="50%" class="img-responsive" alt="Memory profile before"
aria-hidden="true">
+ <img src="{{ site.baseurl
}}/img/recent-improvements-to-hash-join/memory-profile-opt.png" width="50%"
class="img-responsive" alt="Memory profile after" aria-hidden="true">
+</div>
+
+At `A`, memory usage rises steadily as the join builds partitioned hash tables
in parallel. `B` marks the merge point, where these partitions are combined
into a final, unified hash table. `C` represents the start of the probe phase,
where the left-side table is scanned and matched against the final hash table.
Memory begins to rise again as join results are materialized. `D` is the peak
of the join operation, just before memory begins to drop as processing
completes. The “leap of faith” [...]
+
+This improvement already benefits real-world scenarios — for example, the
[DuckDB Labs DB Benchmark](https://duckdblabs.github.io/db-benchmark/). Some
benchmark queries that previously failed with out-of-memory (OOM) errors can
now complete successfully — as shown in the comparison below.
+
+<div style="display: flex; gap: 16px; justify-content: center; align-items:
flex-start;">
+ <img src="{{ site.baseurl
}}/img/recent-improvements-to-hash-join/duckdb-bench-oom-baseline.png"
width="50%" class="img-responsive" alt="DuckDB benchmark OOM before"
aria-hidden="true">
+ <img src="{{ site.baseurl
}}/img/recent-improvements-to-hash-join/duckdb-bench-oom-opt.png" width="50%"
class="img-responsive" alt="DuckDB benchmark OOM after" aria-hidden="true">
+</div>
+
+As one reviewer noted in the PR, this was a “low-hanging fruit.” And
sometimes, meaningful performance gains don’t come from tuning hot loops or
digging through flame graphs — they come from noticing something that doesn’t
feel right and asking: why are we still keeping this around?
+
+# Faster Execution Through Better Parallelism
+
+Not every improvement comes from poring over flame graphs — but some
definitely do. Performance is, after all, the most talked-about aspect of any
query engine. So, how about a nice cup of flame graph?
+
+<img src="{{ site.baseurl
}}/img/recent-improvements-to-hash-join/a-nice-cup-of-flame-graph.png"
width="100%" class="img-responsive" alt="A nice cup of flame graph"
aria-hidden="true">
+
+It’s hard not to notice the long, flat bar dominating the middle — especially
with the rather alarming word “Lock” in it. That’s our red flag.
+
+We’ve mentioned that in the build phase, we build partitioned hash tables in
parallel. In earlier versions of Arrow C++, this parallelism was implemented on
a batch basis — each thread processed a build-side batch concurrently. Since
each batch contained arbitrary data that could fall into any partition, threads
had to synchronize when accessing shared partitions. This was managed through
[locks on
partitions](https://github.com/apache/arrow/blob/196cde38c112d32a944afe978b6da9c7ce935ef7/
[...]
+
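+Conceptually, the old scheme looked something like this hedged sketch (illustrative only; the real code manages packed row data, not plain vectors):
+
+```cpp
+#include <cstddef>
+#include <cstdint>
+#include <functional>
+#include <mutex>
+#include <vector>
+
+// Hypothetical sketch of batch-parallel insertion: any thread can hit any
+// partition, so every partition needs a lock, and threads that collide on a
+// hot partition serialize behind it.
+struct SharedPartitions {
+  // Both vectors are assumed to be sized to the partition count up front,
+  // e.g. std::vector<std::mutex> locks(num_partitions).
+  std::vector<std::vector<int64_t>> rows;
+  std::vector<std::mutex> locks;
+};
+
+void InsertBatch(SharedPartitions& parts, const std::vector<int64_t>& batch_keys) {
+  const size_t num_partitions = parts.rows.size();
+  for (int64_t key : batch_keys) {
+    size_t p = std::hash<int64_t>{}(key) % num_partitions;
+    std::lock_guard<std::mutex> guard(parts.locks[p]);  // the contention point
+    parts.rows[p].push_back(key);
+  }
+}
+```
+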
+To mitigate this contention, we restructured the build phase in PR
[GH-45612](https://github.com/apache/arrow/issues/45612). Instead of having all
threads partition and insert at once — each thread touching every hash table —
we split the work into two distinct stages. In the first partition stage, `M`
threads take their assigned batches and only partition them, recording which
rows belong to which partition. No insertion happens yet — just classification.
Then comes the second, newly se [...]
+
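+To make the two-stage idea concrete, here is a hedged sketch of one way to structure it, where each partition is owned by exactly one thread in the second stage so that ownership replaces locking (names and structure are illustrative, not Acero's actual code):
+
+```cpp
+#include <cstddef>
+#include <cstdint>
+#include <functional>
+#include <thread>
+#include <vector>
+
+// Hypothetical two-stage build: stage 1 only classifies rows into partitions
+// (thread-local output, no shared writes); stage 2 gives each partition to a
+// single owning thread, so no locks are needed in either stage.
+std::vector<std::vector<int64_t>> TwoStageBuild(const std::vector<int64_t>& keys,
+                                                int num_partitions) {
+  const int num_threads = num_partitions;  // assume M == N for simplicity
+  // Stage 1: each thread partitions its slice of rows into local buckets.
+  std::vector<std::vector<std::vector<int64_t>>> local(
+      num_threads, std::vector<std::vector<int64_t>>(num_partitions));
+  std::vector<std::thread> workers;
+  for (int t = 0; t < num_threads; ++t) {
+    workers.emplace_back([&, t] {
+      for (size_t row = t; row < keys.size(); row += num_threads) {
+        size_t p = std::hash<int64_t>{}(keys[row]) % num_partitions;
+        local[t][p].push_back(static_cast<int64_t>(row));
+      }
+    });
+  }
+  for (auto& w : workers) w.join();
+  workers.clear();
+
+  // Stage 2: the owning thread of partition p collects everything the stage-1
+  // threads classified into p. No other thread touches partition p.
+  std::vector<std::vector<int64_t>> partitions(num_partitions);
+  for (int p = 0; p < num_partitions; ++p) {
+    workers.emplace_back([&, p] {
+      for (int t = 0; t < num_threads; ++t) {
+        partitions[p].insert(partitions[p].end(), local[t][p].begin(),
+                             local[t][p].end());
+      }
+    });
+  }
+  for (auto& w : workers) w.join();
+  return partitions;
+}
+```
+
+The trade-off is an extra pass over the data, but that pass is contention-free, whereas the lock-based scheme serializes threads exactly where we want them to scale.
+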
+<img src="{{ site.baseurl
}}/img/recent-improvements-to-hash-join/internal-benchmark.png" width="100%"
class="img-responsive" alt="Internal benchmark" aria-hidden="true">
+
+We've also observed similar gains in real-world scenarios like the [DuckDB Labs DB Benchmark](https://duckdblabs.github.io/db-benchmark/). The comparison below shows around a 2x improvement in query performance after this change was applied.
+
+<div style="display: flex; gap: 16px; justify-content: center; align-items:
flex-start;">
+ <img src="{{ site.baseurl
}}/img/recent-improvements-to-hash-join/duckdb-bench-perf-baseline.png"
width="50%" class="img-responsive" alt="DuckDB benchmark perf before"
aria-hidden="true">
+ <img src="{{ site.baseurl
}}/img/recent-improvements-to-hash-join/duckdb-bench-perf-opt.png" width="50%"
class="img-responsive" alt="DuckDB benchmark perf after" aria-hidden="true">
+</div>
+
+Additional improvements include
[GH-43832](https://github.com/apache/arrow/pull/43832), which extends AVX2
acceleration to more probing code paths, and
[GH-45918](https://github.com/apache/arrow/pull/45918), which introduces
parallelism to a previously sequential task phase. These target more
specialized scenarios and edge cases.
+
+# Closing
+
+These improvements reflect ongoing investment in Arrow C++’s execution engine
and a commitment to delivering fast, robust building blocks for analytic
workloads. They are available in recent Arrow C++ releases and exposed through
higher-level bindings like PyArrow and the Arrow R package — starting from
version 18.0.0, with the most significant improvements landing in 20.0.0. If
joins were a blocker for you before — due to memory, scale, or correctness —
recent changes may offer a very d [...]
+
+The Arrow C++ engine is not just alive — it’s improving in meaningful,
user-visible ways. We’re also actively monitoring for further issues and open
to expanding the design based on user feedback and real-world needs. If you’ve
tried joins in the past and run into performance or stability issues, we encourage you to give them another try and file an [issue on GitHub](https://github.com/apache/arrow/issues) if anything still falls short.
+
+If you have any questions about this blog post, please feel free to contact
the author, [Rossi Sun](mailto:[email protected]).
diff --git a/img/recent-improvements-to-hash-join/a-nice-cup-of-flame-graph.png
b/img/recent-improvements-to-hash-join/a-nice-cup-of-flame-graph.png
new file mode 100644
index 00000000000..5fb67848f40
Binary files /dev/null and
b/img/recent-improvements-to-hash-join/a-nice-cup-of-flame-graph.png differ
diff --git a/img/recent-improvements-to-hash-join/duckdb-bench-oom-baseline.png
b/img/recent-improvements-to-hash-join/duckdb-bench-oom-baseline.png
new file mode 100644
index 00000000000..a046b57223c
Binary files /dev/null and
b/img/recent-improvements-to-hash-join/duckdb-bench-oom-baseline.png differ
diff --git a/img/recent-improvements-to-hash-join/duckdb-bench-oom-opt.png
b/img/recent-improvements-to-hash-join/duckdb-bench-oom-opt.png
new file mode 100644
index 00000000000..ae5ad8c9cd0
Binary files /dev/null and
b/img/recent-improvements-to-hash-join/duckdb-bench-oom-opt.png differ
diff --git
a/img/recent-improvements-to-hash-join/duckdb-bench-perf-baseline.png
b/img/recent-improvements-to-hash-join/duckdb-bench-perf-baseline.png
new file mode 100644
index 00000000000..bb7cb45cc24
Binary files /dev/null and
b/img/recent-improvements-to-hash-join/duckdb-bench-perf-baseline.png differ
diff --git a/img/recent-improvements-to-hash-join/duckdb-bench-perf-opt.png
b/img/recent-improvements-to-hash-join/duckdb-bench-perf-opt.png
new file mode 100644
index 00000000000..3b544a490c0
Binary files /dev/null and
b/img/recent-improvements-to-hash-join/duckdb-bench-perf-opt.png differ
diff --git a/img/recent-improvements-to-hash-join/internal-benchmark.png
b/img/recent-improvements-to-hash-join/internal-benchmark.png
new file mode 100644
index 00000000000..c7b31b44ea3
Binary files /dev/null and
b/img/recent-improvements-to-hash-join/internal-benchmark.png differ
diff --git a/img/recent-improvements-to-hash-join/memory-profile-baseline.png
b/img/recent-improvements-to-hash-join/memory-profile-baseline.png
new file mode 100644
index 00000000000..e039d8f33d1
Binary files /dev/null and
b/img/recent-improvements-to-hash-join/memory-profile-baseline.png differ
diff --git a/img/recent-improvements-to-hash-join/memory-profile-opt.png
b/img/recent-improvements-to-hash-join/memory-profile-opt.png
new file mode 100644
index 00000000000..9bd4e5148ea
Binary files /dev/null and
b/img/recent-improvements-to-hash-join/memory-profile-opt.png differ