This is an automated email from the ASF dual-hosted git repository.
github-bot pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/datafusion.git
The following commit(s) were added to refs/heads/main by this push:
new 0dfcd97a37 Replace ahash with foldhash for faster hashing in datafusion-common (#20958)
0dfcd97a37 is described below
commit 0dfcd97a37e083e48aefc5267539ac453cc07b44
Author: Daniël Heres <[email protected]>
AuthorDate: Tue Mar 17 06:26:30 2026 +0100
Replace ahash with foldhash for faster hashing in datafusion-common (#20958)
## Summary
- Replace `ahash` with `foldhash` for hashing (`with_hashes`/`create_hashes`); `foldhash` seems to auto-vectorize much better because it doesn't rely on special CPU instructions.
- Use `SeedableRandomState` for rehash paths: fold existing hash into
hasher's initial state, eliminating the separate `combine_hashes` step
- Add `hash_write` method to `HashValue` trait for writing values into
an existing hasher
- Use `valid_indices()` iterator for null paths instead of per-element
`is_null()` checks
- Update some code to be deterministic. Notably, `RandomState::default()` now creates a random seed for every instance, whereas with `ahash` a single seed was reused. Some group-by results also changed because the hash function is different, so `rowsort` was added to those tests.
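The rehash idea in the bullets above can be sketched with std's `DefaultHasher` standing in for foldhash's `SeedableRandomState` (the names `rehash`, `col_a`, and `col_b` are hypothetical; this only illustrates the technique, not DataFusion's actual code):

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Rehash: fold the previous column's hash into the hasher state, then
/// hash only the new value on top of it. This replaces the old pattern
/// `combine_hashes(hash_one(value), prev)` with a single hashing pass.
fn rehash(prev: u64, value: i64) -> u64 {
    let mut hasher = DefaultHasher::new();
    prev.hash(&mut hasher); // seed the state with the existing hash
    value.hash(&mut hasher); // analogous to HashValue::hash_write
    hasher.finish()
}

fn main() {
    let col_a = [1i64, 2, 3];
    let col_b = [10i64, 20, 30];
    let mut hashes = vec![0u64; 3];
    // First column: plain per-value hash.
    for (h, &v) in hashes.iter_mut().zip(col_a.iter()) {
        let mut hasher = DefaultHasher::new();
        v.hash(&mut hasher);
        *h = hasher.finish();
    }
    // Subsequent columns: rehash, folding the existing hash in.
    for (h, &v) in hashes.iter_mut().zip(col_b.iter()) {
        *h = rehash(*h, v);
    }
    println!("{hashes:?}");
}
```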
## Benchmark results (int64, 8192 rows, Apple M1)
| Benchmark | Before (ahash) | After (foldhash) | Improvement |
|---|---|---|---|
| single array, no nulls | 5.65 µs | 3.30 µs | **-42%** |
| multiple arrays, no nulls | 22.15 µs | 11.19 µs | **-49%** |
| single array, nulls | 11.94 µs | 9.47 µs | **-21%** |
| multiple arrays, nulls | 36.92 µs | 29.80 µs | **-19%** |
String view improvements (utf8_view, 8192 rows):
| Benchmark | Improvement |
|---|---|
| single, no nulls | **-13%** |
| multiple, no nulls | **-28%** |
| small strings, single | **-55%** |
| small strings, multiple | **-60%** |
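A minimal, std-only sketch of how a micro-benchmark like the ones above can be structured (the real benchmarks use Criterion in `datafusion/common/benches/with_hashes.rs`; this stand-in just shows the hash-a-batch-in-a-loop shape and is not a substitute for Criterion's statistics):

```rust
use std::collections::hash_map::RandomState;
use std::hash::BuildHasher;
use std::time::Instant;

fn main() {
    let state = RandomState::new();
    let values: Vec<i64> = (0..8192).collect(); // 8192 rows, as in the tables above
    let mut hashes = vec![0u64; values.len()];

    let iters: u32 = 1000;
    let start = Instant::now();
    for _ in 0..iters {
        for (h, v) in hashes.iter_mut().zip(values.iter()) {
            *h = state.hash_one(v);
        }
        std::hint::black_box(&hashes); // keep the work from being optimized away
    }
    println!("avg per 8192-row batch: {:?}", start.elapsed() / iters);
}
```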
## Test plan
- [x] All 36 `hash_utils` unit tests pass
- [x] Run full CI suite
In some hash-heavy benchmarks (clickbench_extended) we clearly see that
foldhash is faster!
```
│ QQuery 1 │ 227.74 / 228.63 ±0.76 / 229.71 ms │ 205.48 / 207.44 ±1.72 / 210.35 ms │ +1.10x faster │
│ QQuery 2 │ 541.63 / 543.65 ±1.11 / 544.82 ms │ 499.61 / 502.19 ±1.76 / 504.94 ms │ +1.08x faster │
│ QQuery 3 │ 334.85 / 336.04 ±1.16 / 337.89 ms │ 316.43 / 317.67 ±1.02 / 319.48 ms │ +1.06x faster │
```
## Are there any user-facing changes?
Yes: `RandomState::with_seeds` was replaced by `RandomState::with_seed`, and there were protobuf changes for the same function. The new function also generates different hashes, so in distributed environments the same query must not be run across different binary versions.
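To illustrate why mismatched binaries matter: hash repartitioning routes a row to `hash(key) % num_partitions`, so every node must compute identical hashes for the same key. A sketch using std's randomly seeded `RandomState` as a stand-in for two binaries whose hash functions disagree (illustrative only; `partition` is a hypothetical helper):

```rust
use std::collections::hash_map::RandomState;
use std::hash::BuildHasher;

/// Which partition a join/aggregation key is routed to.
fn partition<S: BuildHasher>(state: &S, key: i64, num_partitions: u64) -> u64 {
    state.hash_one(key) % num_partitions
}

fn main() {
    // Two differently seeded states stand in for two binaries whose hash
    // functions disagree (e.g. one built with ahash, one with foldhash).
    let node_a = RandomState::new();
    let node_b = RandomState::new();
    for key in [1i64, 42, 9000] {
        println!(
            "key {key}: node_a -> partition {}, node_b -> partition {}",
            partition(&node_a, key, 12),
            partition(&node_b, key, 12)
        );
    }
}
```

When the two nodes disagree on the partition for a key, rows with equal keys land on different workers and joins/aggregations silently produce wrong results, which is why the version constraint above matters.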
🤖 Generated with [Claude Code](https://claude.com/claude-code)
---------
Co-authored-by: Claude Opus 4.6 (1M context) <[email protected]>
---
Cargo.lock | 8 +-
Cargo.toml | 3 -
datafusion/common/Cargo.toml | 2 +-
datafusion/common/benches/with_hashes.rs | 6 +-
datafusion/common/src/hash_utils.rs | 147 ++++++++++++---------
datafusion/common/src/scalar/mod.rs | 2 +-
.../tests/physical_optimizer/filter_pushdown.rs | 2 +-
datafusion/core/tests/sql/unparser.rs | 16 +--
datafusion/functions-aggregate-common/Cargo.toml | 1 -
.../src/aggregate/count_distinct/native.rs | 2 +-
datafusion/functions-aggregate-common/src/utils.rs | 2 +-
datafusion/functions-aggregate/Cargo.toml | 2 +-
.../functions-aggregate/src/bit_and_or_xor.rs | 2 +-
datafusion/functions-aggregate/src/count.rs | 2 +-
datafusion/functions-aggregate/src/hyperloglog.rs | 11 +-
datafusion/functions-aggregate/src/sum.rs | 2 +-
datafusion/physical-expr-common/Cargo.toml | 1 -
datafusion/physical-expr-common/src/binary_map.rs | 4 +-
.../physical-expr-common/src/binary_view_map.rs | 4 +-
datafusion/physical-expr/Cargo.toml | 1 -
.../physical-expr/src/expressions/in_list.rs | 6 +-
datafusion/physical-plan/Cargo.toml | 1 -
.../aggregates/group_values/multi_group_by/mod.rs | 2 +-
.../src/aggregates/group_values/row.rs | 2 +-
.../group_values/single_group_by/primitive.rs | 3 +-
datafusion/physical-plan/src/aggregates/mod.rs | 5 +-
.../src/aggregates/topk/hash_table.rs | 3 +-
.../physical-plan/src/joins/hash_join/exec.rs | 8 +-
.../src/joins/hash_join/partitioned_hash_eval.rs | 84 ++++++------
.../physical-plan/src/joins/hash_join/stream.rs | 2 +-
.../physical-plan/src/joins/symmetric_hash_join.rs | 4 +-
datafusion/physical-plan/src/joins/utils.rs | 2 +-
datafusion/physical-plan/src/repartition/mod.rs | 3 +-
.../src/windows/bounded_window_agg_exec.rs | 2 +-
datafusion/proto/proto/datafusion.proto | 3 -
datafusion/proto/src/generated/pbjson.rs | 63 ---------
datafusion/proto/src/generated/prost.rs | 6 -
datafusion/proto/src/physical_plan/from_proto.rs | 7 +-
datafusion/proto/src/physical_plan/to_proto.rs | 6 +-
.../proto/tests/cases/roundtrip_physical_plan.rs | 6 +-
datafusion/sqllogictest/test_files/aggregate.slt | 24 ++--
41 files changed, 193 insertions(+), 269 deletions(-)
diff --git a/Cargo.lock b/Cargo.lock
index 273620c53f..9b49918a87 100644
--- a/Cargo.lock
+++ b/Cargo.lock
@@ -1893,12 +1893,12 @@ dependencies = [
name = "datafusion-common"
version = "52.3.0"
dependencies = [
- "ahash",
"apache-avro",
"arrow",
"arrow-ipc",
"chrono",
"criterion",
+ "foldhash 0.2.0",
"half",
"hashbrown 0.16.1",
"hex",
@@ -2252,7 +2252,6 @@ dependencies = [
name = "datafusion-functions-aggregate"
version = "52.3.0"
dependencies = [
- "ahash",
"arrow",
"criterion",
"datafusion-common",
@@ -2263,6 +2262,7 @@ dependencies = [
"datafusion-macros",
"datafusion-physical-expr",
"datafusion-physical-expr-common",
+ "foldhash 0.2.0",
"half",
"log",
"num-traits",
@@ -2273,7 +2273,6 @@ dependencies = [
name = "datafusion-functions-aggregate-common"
version = "52.3.0"
dependencies = [
- "ahash",
"arrow",
"criterion",
"datafusion-common",
@@ -2383,7 +2382,6 @@ dependencies = [
name = "datafusion-physical-expr"
version = "52.3.0"
dependencies = [
- "ahash",
"arrow",
"criterion",
"datafusion-common",
@@ -2422,7 +2420,6 @@ dependencies = [
name = "datafusion-physical-expr-common"
version = "52.3.0"
dependencies = [
- "ahash",
"arrow",
"chrono",
"criterion",
@@ -2460,7 +2457,6 @@ dependencies = [
name = "datafusion-physical-plan"
version = "52.3.0"
dependencies = [
- "ahash",
"arrow",
"arrow-ord",
"arrow-schema",
diff --git a/Cargo.toml b/Cargo.toml
index b55d8b5a72..08d585d3ef 100644
--- a/Cargo.toml
+++ b/Cargo.toml
@@ -87,9 +87,6 @@ version = "52.3.0"
# for the inherited dependency but cannot do the reverse (override from true to false).
#
# See for more details: https://github.com/rust-lang/cargo/issues/11329
-ahash = { version = "0.8", default-features = false, features = [
- "runtime-rng",
-] }
apache-avro = { version = "0.21", default-features = false }
arrow = { version = "58.0.0", features = [
"prettyprint",
diff --git a/datafusion/common/Cargo.toml b/datafusion/common/Cargo.toml
index 36435580b2..92dd76aa97 100644
--- a/datafusion/common/Cargo.toml
+++ b/datafusion/common/Cargo.toml
@@ -66,7 +66,6 @@ harness = false
name = "stats_merge"
[dependencies]
-ahash = { workspace = true }
apache-avro = { workspace = true, features = [
"bzip",
"snappy",
@@ -76,6 +75,7 @@ apache-avro = { workspace = true, features = [
arrow = { workspace = true }
arrow-ipc = { workspace = true }
chrono = { workspace = true }
+foldhash = "0.2"
half = { workspace = true }
hashbrown = { workspace = true }
hex = { workspace = true, optional = true }
diff --git a/datafusion/common/benches/with_hashes.rs b/datafusion/common/benches/with_hashes.rs
index 9ee31d9c4b..0e9c53c896 100644
--- a/datafusion/common/benches/with_hashes.rs
+++ b/datafusion/common/benches/with_hashes.rs
@@ -17,7 +17,6 @@
//! Benchmarks for `with_hashes` function
-use ahash::RandomState;
use arrow::array::{
Array, ArrayRef, ArrowPrimitiveType, DictionaryArray, GenericStringArray,
Int32Array,
Int64Array, ListArray, MapArray, NullBufferBuilder, OffsetSizeTrait,
PrimitiveArray,
@@ -28,6 +27,7 @@ use arrow::datatypes::{
ArrowDictionaryKeyType, DataType, Field, Fields, Int32Type, Int64Type,
UnionFields,
};
use criterion::{Bencher, Criterion, criterion_group, criterion_main};
+use datafusion_common::hash_utils::RandomState;
use datafusion_common::hash_utils::with_hashes;
use rand::Rng;
use rand::SeedableRng;
@@ -143,7 +143,7 @@ fn criterion_benchmark(c: &mut Criterion) {
}
fn do_hash_test(b: &mut Bencher, arrays: &[ArrayRef]) {
- let state = RandomState::new();
+ let state = RandomState::default();
b.iter(|| {
with_hashes(arrays, &state, |hashes| {
assert_eq!(hashes.len(), BATCH_SIZE); // make sure the result is used
@@ -358,7 +358,7 @@ fn sliced_array_benchmark(c: &mut Criterion) {
}
fn do_hash_test_with_len(b: &mut Bencher, arrays: &[ArrayRef], expected_len: usize) {
- let state = RandomState::new();
+ let state = RandomState::default();
b.iter(|| {
with_hashes(arrays, &state, |hashes| {
assert_eq!(hashes.len(), expected_len);
diff --git a/datafusion/common/src/hash_utils.rs b/datafusion/common/src/hash_utils.rs
index 3be6118c55..255525b92e 100644
--- a/datafusion/common/src/hash_utils.rs
+++ b/datafusion/common/src/hash_utils.rs
@@ -17,15 +17,19 @@
//! Functionality used both on logical and physical plans
-use ahash::RandomState;
use arrow::array::types::{IntervalDayTime, IntervalMonthDayNano};
use arrow::array::*;
use arrow::compute::take;
use arrow::datatypes::*;
#[cfg(not(feature = "force_hash_collisions"))]
use arrow::{downcast_dictionary_array, downcast_primitive_array};
+use foldhash::fast::FixedState;
use itertools::Itertools;
use std::collections::HashMap;
+use std::hash::{BuildHasher, Hash, Hasher};
+
+/// The hash random state used throughout DataFusion for hashing.
+pub type RandomState = FixedState;
#[cfg(not(feature = "force_hash_collisions"))]
use crate::cast::{
@@ -83,7 +87,7 @@ thread_local! {
/// use std::sync::Arc;
///
/// let array: ArrayRef = Arc::new(Int32Array::from(vec![1, 2, 3]));
-/// let random_state = RandomState::new();
+/// let random_state = RandomState::default();
///
/// let result = with_hashes([&array], &random_state, |hashes| {
/// // Use the hashes here
@@ -149,12 +153,17 @@ fn hash_null(random_state: &RandomState, hashes_buffer: &'_ mut [u64], mul_col:
pub trait HashValue {
fn hash_one(&self, state: &RandomState) -> u64;
+ /// Write this value into an existing hasher (same data as `hash_one`).
+ fn hash_write(&self, hasher: &mut impl Hasher);
}
impl<T: HashValue + ?Sized> HashValue for &T {
fn hash_one(&self, state: &RandomState) -> u64 {
T::hash_one(self, state)
}
+ fn hash_write(&self, hasher: &mut impl Hasher) {
+ T::hash_write(self, hasher)
+ }
}
macro_rules! hash_value {
@@ -163,6 +172,9 @@ macro_rules! hash_value {
fn hash_one(&self, state: &RandomState) -> u64 {
state.hash_one(self)
}
+ fn hash_write(&self, hasher: &mut impl Hasher) {
+ Hash::hash(self, hasher)
+ }
})+
};
}
@@ -175,14 +187,28 @@ macro_rules! hash_float_value {
fn hash_one(&self, state: &RandomState) -> u64 {
state.hash_one(<$i>::from_ne_bytes(self.to_ne_bytes()))
}
+ fn hash_write(&self, hasher: &mut impl Hasher) {
+ hasher.write(&self.to_ne_bytes())
+ }
})+
};
}
hash_float_value!((half::f16, u16), (f32, u32), (f64, u64));
+/// Create a `SeedableRandomState` whose per-hasher seed incorporates `seed`.
+/// This folds the previous hash into the hasher's initial state so only the
+/// new value needs to pass through the hash function — same cost as `hash_one`.
+#[inline]
+fn seeded_state(seed: u64) -> foldhash::fast::SeedableRandomState {
+ foldhash::fast::SeedableRandomState::with_seed(
+ seed,
+ foldhash::SharedSeed::global_fixed(),
+ )
+}
+
/// Builds hash values of PrimitiveArray and writes them into `hashes_buffer`
-/// If `rehash==true` this combines the previous hash value in the buffer
-/// with the new hash using `combine_hashes`
+/// If `rehash==true` this folds the existing hash into the hasher state
+/// and hashes only the new value (avoiding a separate combine step).
#[cfg(not(feature = "force_hash_collisions"))]
fn hash_array_primitive<T>(
array: &PrimitiveArray<T>,
@@ -201,7 +227,9 @@ fn hash_array_primitive<T>(
if array.null_count() == 0 {
if rehash {
for (hash, &value) in
hashes_buffer.iter_mut().zip(array.values().iter()) {
- *hash = combine_hashes(value.hash_one(random_state), *hash);
+ let mut hasher = seeded_state(*hash).build_hasher();
+ value.hash_write(&mut hasher);
+ *hash = hasher.finish();
}
} else {
for (hash, &value) in
hashes_buffer.iter_mut().zip(array.values().iter()) {
@@ -209,18 +237,16 @@ fn hash_array_primitive<T>(
}
}
} else if rehash {
- for (i, hash) in hashes_buffer.iter_mut().enumerate() {
- if !array.is_null(i) {
- let value = unsafe { array.value_unchecked(i) };
- *hash = combine_hashes(value.hash_one(random_state), *hash);
- }
+ for i in array.nulls().unwrap().valid_indices() {
+ let value = unsafe { array.value_unchecked(i) };
+ let mut hasher = seeded_state(hashes_buffer[i]).build_hasher();
+ value.hash_write(&mut hasher);
+ hashes_buffer[i] = hasher.finish();
}
} else {
- for (i, hash) in hashes_buffer.iter_mut().enumerate() {
- if !array.is_null(i) {
- let value = unsafe { array.value_unchecked(i) };
- *hash = value.hash_one(random_state);
- }
+ for i in array.nulls().unwrap().valid_indices() {
+ let value = unsafe { array.value_unchecked(i) };
+ hashes_buffer[i] = value.hash_one(random_state);
}
}
}
@@ -257,18 +283,15 @@ fn hash_array<T>(
}
}
} else if rehash {
- for (i, hash) in hashes_buffer.iter_mut().enumerate() {
- if !array.is_null(i) {
- let value = unsafe { array.value_unchecked(i) };
- *hash = combine_hashes(value.hash_one(random_state), *hash);
- }
+ for i in array.nulls().unwrap().valid_indices() {
+ let value = unsafe { array.value_unchecked(i) };
+ hashes_buffer[i] =
+ combine_hashes(value.hash_one(random_state), hashes_buffer[i]);
}
} else {
- for (i, hash) in hashes_buffer.iter_mut().enumerate() {
- if !array.is_null(i) {
- let value = unsafe { array.value_unchecked(i) };
- *hash = value.hash_one(random_state);
- }
+ for i in array.nulls().unwrap().valid_indices() {
+ let value = unsafe { array.value_unchecked(i) };
+ hashes_buffer[i] = value.hash_one(random_state);
}
}
}
@@ -317,7 +340,9 @@ fn hash_string_view_array_inner<
// all views are inlined, no need to access external buffers
if !HAS_BUFFERS || view_len <= 12 {
if REHASH {
- *hash = combine_hashes(v.hash_one(random_state), *hash);
+ let mut hasher = seeded_state(*hash).build_hasher();
+ v.hash_write(&mut hasher);
+ *hash = hasher.finish();
} else {
*hash = v.hash_one(random_state);
}
@@ -326,7 +351,9 @@ fn hash_string_view_array_inner<
// view is not inlined, so we need to hash the bytes as well
let value = view_bytes(view_len, v);
if REHASH {
- *hash = combine_hashes(value.hash_one(random_state), *hash);
+ let mut hasher = seeded_state(*hash).build_hasher();
+ value.hash_write(&mut hasher);
+ *hash = hasher.finish();
} else {
*hash = value.hash_one(random_state);
}
@@ -358,7 +385,9 @@ fn hash_generic_byte_view_array<T: ByteViewType>(
}
(false, false, true) => {
for (hash, &view) in
hashes_buffer.iter_mut().zip(array.views().iter()) {
- *hash = combine_hashes(view.hash_one(random_state), *hash);
+ let mut hasher = seeded_state(*hash).build_hasher();
+ view.hash_write(&mut hasher);
+ *hash = hasher.finish();
}
}
(false, true, false) => hash_string_view_array_inner::<T, false, true,
false>(
@@ -1098,7 +1127,7 @@ mod tests {
.with_precision_and_scale(20, 3)
.unwrap();
let array_ref: ArrayRef = Arc::new(array);
- let random_state = RandomState::with_seeds(0, 0, 0, 0);
+ let random_state = RandomState::with_seed(0);
let hashes_buff = &mut vec![0; array_ref.len()];
let hashes = create_hashes(&[array_ref], &random_state, hashes_buff)?;
assert_eq!(hashes.len(), 4);
@@ -1108,7 +1137,7 @@ mod tests {
#[test]
fn create_hashes_for_empty_fixed_size_lit() -> Result<()> {
let empty_array = FixedSizeListBuilder::new(StringBuilder::new(),
1).finish();
- let random_state = RandomState::with_seeds(0, 0, 0, 0);
+ let random_state = RandomState::with_seed(0);
let hashes_buff = &mut [0; 0];
let hashes = create_hashes(
&[Arc::new(empty_array) as ArrayRef],
@@ -1126,7 +1155,7 @@ mod tests {
let f64_arr: ArrayRef =
Arc::new(Float64Array::from(vec![0.12, 0.5, 1f64, 444.7]));
- let random_state = RandomState::with_seeds(0, 0, 0, 0);
+ let random_state = RandomState::with_seed(0);
let hashes_buff = &mut vec![0; f32_arr.len()];
let hashes = create_hashes(&[f32_arr], &random_state, hashes_buff)?;
assert_eq!(hashes.len(), 4,);
@@ -1155,7 +1184,7 @@ mod tests {
let binary_array: ArrayRef =
Arc::new(binary.iter().cloned().collect::<$ARRAY>());
- let random_state = RandomState::with_seeds(0, 0, 0, 0);
+ let random_state = RandomState::with_seed(0);
let mut binary_hashes = vec![0; binary.len()];
create_hashes(&[binary_array], &random_state, &mut
binary_hashes)
@@ -1189,7 +1218,7 @@ mod tests {
let fixed_size_binary_array: ArrayRef =
Arc::new(FixedSizeBinaryArray::try_from_iter(input_arg.into_iter()).unwrap());
- let random_state = RandomState::with_seeds(0, 0, 0, 0);
+ let random_state = RandomState::with_seed(0);
let hashes_buff = &mut vec![0; fixed_size_binary_array.len()];
let hashes =
create_hashes(&[fixed_size_binary_array], &random_state,
hashes_buff)?;
@@ -1222,7 +1251,7 @@ mod tests {
.collect::<DictionaryArray<Int8Type>>(),
);
- let random_state = RandomState::with_seeds(0, 0, 0, 0);
+ let random_state = RandomState::with_seed(0);
let mut string_hashes = vec![0; strings.len()];
create_hashes(&[string_array], &random_state, &mut
string_hashes)
@@ -1264,7 +1293,7 @@ mod tests {
let run_ends = Arc::new(Int32Array::from(vec![2, 5, 7]));
let array = Arc::new(RunArray::try_new(&run_ends,
values.as_ref()).unwrap());
- let random_state = RandomState::with_seeds(0, 0, 0, 0);
+ let random_state = RandomState::with_seed(0);
let hashes_buff = &mut vec![0; array.len()];
let hashes = create_hashes(
&[Arc::clone(&array) as ArrayRef],
@@ -1292,7 +1321,7 @@ mod tests {
let run_ends = Arc::new(Int32Array::from(vec![2, 5, 7]));
let run_array = Arc::new(RunArray::try_new(&run_ends,
values.as_ref()).unwrap());
- let random_state = RandomState::with_seeds(0, 0, 0, 0);
+ let random_state = RandomState::with_seed(0);
let mut one_col_hashes = vec![0; int_array.len()];
create_hashes(
&[Arc::clone(&int_array) as ArrayRef],
@@ -1340,7 +1369,7 @@ mod tests {
.collect::<DictionaryArray<Int8Type>>(),
);
- let random_state = RandomState::with_seeds(0, 0, 0, 0);
+ let random_state = RandomState::with_seed(0);
let mut string_hashes = vec![0; strings.len()];
create_hashes(&[string_array], &random_state, &mut
string_hashes).unwrap();
@@ -1385,7 +1414,7 @@ mod tests {
];
let list_array =
Arc::new(ListArray::from_iter_primitive::<Int32Type, _, _>(data))
as ArrayRef;
- let random_state = RandomState::with_seeds(0, 0, 0, 0);
+ let random_state = RandomState::with_seed(0);
let mut hashes = vec![0; list_array.len()];
create_hashes(&[list_array], &random_state, &mut hashes).unwrap();
assert_eq!(hashes[0], hashes[5]);
@@ -1411,7 +1440,7 @@ mod tests {
let list_array =
Arc::new(ListArray::from_iter_primitive::<Int32Type, _, _>(data))
as ArrayRef;
let list_array = list_array.slice(2, 3);
- let random_state = RandomState::with_seeds(0, 0, 0, 0);
+ let random_state = RandomState::with_seed(0);
let mut hashes = vec![0; list_array.len()];
create_hashes(&[list_array], &random_state, &mut hashes).unwrap();
assert_eq!(hashes[0], hashes[1]);
@@ -1453,7 +1482,7 @@ mod tests {
Arc::new(ListViewArray::new(field, offsets, sizes, values, nulls))
as ArrayRef;
- let random_state = RandomState::with_seeds(0, 0, 0, 0);
+ let random_state = RandomState::with_seed(0);
let mut hashes = vec![0; list_view_array.len()];
create_hashes(&[list_view_array], &random_state, &mut hashes).unwrap();
@@ -1503,7 +1532,7 @@ mod tests {
field, offsets, sizes, values, nulls,
)) as ArrayRef;
- let random_state = RandomState::with_seeds(0, 0, 0, 0);
+ let random_state = RandomState::with_seed(0);
let mut hashes = vec![0; large_list_view_array.len()];
create_hashes(&[large_list_view_array], &random_state, &mut
hashes).unwrap();
@@ -1534,7 +1563,7 @@ mod tests {
Arc::new(FixedSizeListArray::from_iter_primitive::<Int32Type, _,
_>(
data, 3,
)) as ArrayRef;
- let random_state = RandomState::with_seeds(0, 0, 0, 0);
+ let random_state = RandomState::with_seed(0);
let mut hashes = vec![0; list_array.len()];
create_hashes(&[list_array], &random_state, &mut hashes).unwrap();
assert_eq!(hashes[0], hashes[5]);
@@ -1584,7 +1613,7 @@ mod tests {
let array = Arc::new(struct_array) as ArrayRef;
- let random_state = RandomState::with_seeds(0, 0, 0, 0);
+ let random_state = RandomState::with_seed(0);
let mut hashes = vec![0; array.len()];
create_hashes(&[array], &random_state, &mut hashes).unwrap();
assert_eq!(hashes[0], hashes[1]);
@@ -1621,7 +1650,7 @@ mod tests {
assert!(struct_array.is_valid(1));
let array = Arc::new(struct_array) as ArrayRef;
- let random_state = RandomState::with_seeds(0, 0, 0, 0);
+ let random_state = RandomState::with_seed(0);
let mut hashes = vec![0; array.len()];
create_hashes(&[array], &random_state, &mut hashes).unwrap();
assert_eq!(hashes[0], hashes[1]);
@@ -1674,7 +1703,7 @@ mod tests {
let array = Arc::new(builder.finish()) as ArrayRef;
- let random_state = RandomState::with_seeds(0, 0, 0, 0);
+ let random_state = RandomState::with_seed(0);
let mut hashes = vec![0; array.len()];
create_hashes(&[array], &random_state, &mut hashes).unwrap();
assert_eq!(hashes[0], hashes[1]); // same value
@@ -1701,7 +1730,7 @@ mod tests {
.collect::<DictionaryArray<Int32Type>>(),
);
- let random_state = RandomState::with_seeds(0, 0, 0, 0);
+ let random_state = RandomState::with_seed(0);
let mut one_col_hashes = vec![0; strings1.len()];
create_hashes(
@@ -1731,7 +1760,7 @@ mod tests {
let float_array: ArrayRef =
Arc::new(Float64Array::from(vec![1.0, 2.0, 3.0, 4.0]));
- let random_state = RandomState::with_seeds(0, 0, 0, 0);
+ let random_state = RandomState::with_seed(0);
let hashes_buff = &mut vec![0; int_array.len()];
let hashes =
create_hashes(&[int_array, float_array], &random_state,
hashes_buff).unwrap();
@@ -1746,7 +1775,7 @@ mod tests {
// Verify that we can call create_hashes with only &dyn Array
fn test(arr1: &dyn Array, arr2: &dyn Array) {
- let random_state = RandomState::with_seeds(0, 0, 0, 0);
+ let random_state = RandomState::with_seed(0);
let hashes_buff = &mut vec![0; arr1.len()];
let hashes = create_hashes([arr1, arr2], &random_state,
hashes_buff).unwrap();
assert_eq!(hashes.len(), 4,);
@@ -1757,7 +1786,7 @@ mod tests {
#[test]
fn test_create_hashes_equivalence() {
let array: ArrayRef = Arc::new(Int32Array::from(vec![1, 2, 3, 4]));
- let random_state = RandomState::with_seeds(0, 0, 0, 0);
+ let random_state = RandomState::with_seed(0);
let mut hashes1 = vec![0; array.len()];
create_hashes(
@@ -1776,7 +1805,7 @@ mod tests {
#[test]
fn test_with_hashes() {
let array: ArrayRef = Arc::new(Int32Array::from(vec![1, 2, 3, 4]));
- let random_state = RandomState::with_seeds(0, 0, 0, 0);
+ let random_state = RandomState::with_seed(0);
// Test that with_hashes produces the same results as create_hashes
let mut expected_hashes = vec![0; array.len()];
@@ -1799,7 +1828,7 @@ mod tests {
fn test_with_hashes_multi_column() {
let int_array: ArrayRef = Arc::new(Int32Array::from(vec![1, 2, 3]));
let str_array: ArrayRef = Arc::new(StringArray::from(vec!["a", "b",
"c"]));
- let random_state = RandomState::with_seeds(0, 0, 0, 0);
+ let random_state = RandomState::with_seed(0);
// Test multi-column hashing
let mut expected_hashes = vec![0; int_array.len()];
@@ -1820,7 +1849,7 @@ mod tests {
#[test]
fn test_with_hashes_empty_arrays() {
- let random_state = RandomState::with_seeds(0, 0, 0, 0);
+ let random_state = RandomState::with_seed(0);
// Test that passing no arrays returns an error
let empty: [&ArrayRef; 0] = [];
@@ -1839,7 +1868,7 @@ mod tests {
fn test_with_hashes_reentrancy() {
let array: ArrayRef = Arc::new(Int32Array::from(vec![1, 2, 3]));
let array2: ArrayRef = Arc::new(Int32Array::from(vec![4, 5, 6]));
- let random_state = RandomState::with_seeds(0, 0, 0, 0);
+ let random_state = RandomState::with_seed(0);
// Test that reentrant calls return an error instead of panicking
let result = with_hashes([&array], &random_state, |_hashes| {
@@ -1878,7 +1907,7 @@ mod tests {
let array = UnionArray::try_new(union_fields, type_ids, None,
children).unwrap();
let array_ref = Arc::new(array) as ArrayRef;
- let random_state = RandomState::with_seeds(0, 0, 0, 0);
+ let random_state = RandomState::with_seed(0);
let mut hashes = vec![0; array_ref.len()];
create_hashes(&[array_ref], &random_state, &mut hashes).unwrap();
@@ -1913,7 +1942,7 @@ mod tests {
let array = UnionArray::try_new(union_fields, type_ids, None,
children).unwrap();
let array_ref = Arc::new(array) as ArrayRef;
- let random_state = RandomState::with_seeds(0, 0, 0, 0);
+ let random_state = RandomState::with_seed(0);
let mut hashes = vec![0; array_ref.len()];
create_hashes(&[array_ref], &random_state, &mut hashes).unwrap();
@@ -1954,7 +1983,7 @@ mod tests {
UnionArray::try_new(union_fields, type_ids, Some(offsets),
children).unwrap();
let array_ref = Arc::new(array) as ArrayRef;
- let random_state = RandomState::with_seeds(0, 0, 0, 0);
+ let random_state = RandomState::with_seed(0);
let mut hashes = vec![0; array_ref.len()];
create_hashes(&[array_ref], &random_state, &mut hashes).unwrap();
@@ -1977,7 +2006,7 @@ mod tests {
let run_ends = Arc::new(Int32Array::from(vec![2, 5, 7]));
let array = Arc::new(RunArray::try_new(&run_ends,
values.as_ref()).unwrap());
- let random_state = RandomState::with_seeds(0, 0, 0, 0);
+ let random_state = RandomState::with_seed(0);
let mut full_hashes = vec![0; array.len()];
create_hashes(
&[Arc::clone(&array) as ArrayRef],
@@ -2010,7 +2039,7 @@ mod tests {
let run_ends = Arc::new(Int32Array::from(vec![2, 4, 6]));
let array = Arc::new(RunArray::try_new(&run_ends,
values.as_ref()).unwrap());
- let random_state = RandomState::with_seeds(0, 0, 0, 0);
+ let random_state = RandomState::with_seed(0);
let mut hashes = vec![0; array.len()];
create_hashes(
&[Arc::clone(&array) as ArrayRef],
@@ -2039,7 +2068,7 @@ mod tests {
Arc::new(RunArray::try_new(&run_ends,
run_values.as_ref()).unwrap());
let second_col = Arc::new(Int32Array::from(vec![100, 200, 300]));
- let random_state = RandomState::with_seeds(0, 0, 0, 0);
+ let random_state = RandomState::with_seed(0);
let mut primitive_hashes = vec![0; 3];
create_hashes(
diff --git a/datafusion/common/src/scalar/mod.rs b/datafusion/common/src/scalar/mod.rs
index 95d8a8511b..ebed41e9d8 100644
--- a/datafusion/common/src/scalar/mod.rs
+++ b/datafusion/common/src/scalar/mod.rs
@@ -996,7 +996,7 @@ impl Hash for ScalarValue {
fn hash_nested_array<H: Hasher>(arr: ArrayRef, state: &mut H) {
let len = arr.len();
let hashes_buffer = &mut vec![0; len];
- let random_state = ahash::RandomState::with_seeds(0, 0, 0, 0);
+ let random_state = crate::hash_utils::RandomState::with_seed(0);
let hashes = create_hashes(&[arr], &random_state, hashes_buffer)
.expect("hash_nested_array: failed to create row hashes");
// Hash back to std::hash::Hasher
diff --git a/datafusion/core/tests/physical_optimizer/filter_pushdown.rs b/datafusion/core/tests/physical_optimizer/filter_pushdown.rs
index 64d1c7bf10..8f430f7753 100644
--- a/datafusion/core/tests/physical_optimizer/filter_pushdown.rs
+++ b/datafusion/core/tests/physical_optimizer/filter_pushdown.rs
@@ -1253,7 +1253,7 @@ async fn test_hashjoin_dynamic_filter_pushdown_partitioned() {
- RepartitionExec: partitioning=Hash([a@0, b@1], 12),
input_partitions=1
- DataSourceExec: file_groups={1 group: [[test.parquet]]},
projection=[a, b, c], file_type=test, pushdown_supported=true
- RepartitionExec: partitioning=Hash([a@0, b@1], 12),
input_partitions=1
- - DataSourceExec: file_groups={1 group: [[test.parquet]]},
projection=[a, b, e], file_type=test, pushdown_supported=true,
predicate=DynamicFilter [ CASE hash_repartition % 12 WHEN 2 THEN a@0 >= ab AND
a@0 <= ab AND b@1 >= bb AND b@1 <= bb AND struct(a@0, b@1) IN (SET)
([{c0:ab,c1:bb}]) WHEN 4 THEN a@0 >= aa AND a@0 <= aa AND b@1 >= ba AND b@1 <=
ba AND struct(a@0, b@1) IN (SET) ([{c0:aa,c1:ba}]) ELSE false END ]
+ - DataSourceExec: file_groups={1 group: [[test.parquet]]},
projection=[a, b, e], file_type=test, pushdown_supported=true,
predicate=DynamicFilter [ CASE hash_repartition % 12 WHEN 5 THEN a@0 >= ab AND
a@0 <= ab AND b@1 >= bb AND b@1 <= bb AND struct(a@0, b@1) IN (SET)
([{c0:ab,c1:bb}]) WHEN 8 THEN a@0 >= aa AND a@0 <= aa AND b@1 >= ba AND b@1 <=
ba AND struct(a@0, b@1) IN (SET) ([{c0:aa,c1:ba}]) ELSE false END ]
"
);
diff --git a/datafusion/core/tests/sql/unparser.rs b/datafusion/core/tests/sql/unparser.rs
index ab1015b2d1..e9bad71843 100644
--- a/datafusion/core/tests/sql/unparser.rs
+++ b/datafusion/core/tests/sql/unparser.rs
@@ -43,7 +43,6 @@ use datafusion::common::Result;
use datafusion::prelude::{ParquetReadOptions, SessionContext};
use datafusion_common::Column;
use datafusion_expr::Expr;
-use datafusion_physical_plan::ExecutionPlanProperties;
use datafusion_sql::unparser::Unparser;
use datafusion_sql::unparser::dialect::DefaultDialect;
use itertools::Itertools;
@@ -323,16 +322,6 @@ async fn collect_results(ctx: &SessionContext, original: &str) -> TestCaseResult
}
};
- let is_sorted = match
ctx.state().create_physical_plan(df.logical_plan()).await {
- Ok(plan) => plan.equivalence_properties().output_ordering().is_some(),
- Err(e) => {
- return TestCaseResult::ExecutionError {
- original: original.to_string(),
- error: e.to_string(),
- };
- }
- };
-
// Collect results from original query
let mut expected = match df.collect().await {
Ok(batches) => batches,
@@ -368,8 +357,9 @@ async fn collect_results(ctx: &SessionContext, original: &str) -> TestCaseResult
}
};
- // Sort if needed for comparison
- if !is_sorted {
+ // Always sort for deterministic comparison — even "sorted" results can have
+ // tied rows in different order between original and unparsed SQL.
+ {
expected = match sort_batches(ctx, expected).await {
Ok(batches) => batches,
Err(e) => {
diff --git a/datafusion/functions-aggregate-common/Cargo.toml b/datafusion/functions-aggregate-common/Cargo.toml
index 1d4fb29d9c..1714e1800a 100644
--- a/datafusion/functions-aggregate-common/Cargo.toml
+++ b/datafusion/functions-aggregate-common/Cargo.toml
@@ -41,7 +41,6 @@ workspace = true
name = "datafusion_functions_aggregate_common"
[dependencies]
-ahash = { workspace = true }
arrow = { workspace = true }
datafusion-common = { workspace = true }
datafusion-expr-common = { workspace = true }
diff --git a/datafusion/functions-aggregate-common/src/aggregate/count_distinct/native.rs b/datafusion/functions-aggregate-common/src/aggregate/count_distinct/native.rs
index aa86052bcb..e506b4acb1 100644
--- a/datafusion/functions-aggregate-common/src/aggregate/count_distinct/native.rs
+++ b/datafusion/functions-aggregate-common/src/aggregate/count_distinct/native.rs
@@ -26,11 +26,11 @@ use std::hash::Hash;
use std::mem::size_of_val;
use std::sync::Arc;
-use ahash::RandomState;
use arrow::array::ArrayRef;
use arrow::array::PrimitiveArray;
use arrow::array::types::ArrowPrimitiveType;
use arrow::datatypes::DataType;
+use datafusion_common::hash_utils::RandomState;
use datafusion_common::ScalarValue;
use datafusion_common::cast::{as_list_array, as_primitive_array};
diff --git a/datafusion/functions-aggregate-common/src/utils.rs b/datafusion/functions-aggregate-common/src/utils.rs
index 78b8c52490..256d80a67b 100644
--- a/datafusion/functions-aggregate-common/src/utils.rs
+++ b/datafusion/functions-aggregate-common/src/utils.rs
@@ -15,7 +15,6 @@
// specific language governing permissions and limitations
// under the License.
-use ahash::RandomState;
use arrow::array::{
Array, ArrayRef, ArrowNativeTypeOp, ArrowPrimitiveType, PrimitiveArray,
};
@@ -24,6 +23,7 @@ use arrow::datatypes::{
ArrowNativeType, DataType, DecimalType, Field, FieldRef, ToByteSlice,
};
use datafusion_common::cast::{as_list_array, as_primitive_array};
+use datafusion_common::hash_utils::RandomState;
use datafusion_common::utils::SingleRowListArrayBuilder;
use datafusion_common::utils::memory::estimate_memory_size;
use datafusion_common::{
diff --git a/datafusion/functions-aggregate/Cargo.toml b/datafusion/functions-aggregate/Cargo.toml
index 1ca494e38e..869c7e2b4a 100644
--- a/datafusion/functions-aggregate/Cargo.toml
+++ b/datafusion/functions-aggregate/Cargo.toml
@@ -41,7 +41,6 @@ workspace = true
name = "datafusion_functions_aggregate"
[dependencies]
-ahash = { workspace = true }
arrow = { workspace = true }
datafusion-common = { workspace = true }
datafusion-doc = { workspace = true }
@@ -51,6 +50,7 @@ datafusion-functions-aggregate-common = { workspace = true }
datafusion-macros = { workspace = true }
datafusion-physical-expr = { workspace = true }
datafusion-physical-expr-common = { workspace = true }
+foldhash = "0.2"
half = { workspace = true }
log = { workspace = true }
num-traits = { workspace = true }
diff --git a/datafusion/functions-aggregate/src/bit_and_or_xor.rs b/datafusion/functions-aggregate/src/bit_and_or_xor.rs
index 734a916e2a..48edbd5d4c 100644
--- a/datafusion/functions-aggregate/src/bit_and_or_xor.rs
+++ b/datafusion/functions-aggregate/src/bit_and_or_xor.rs
@@ -23,12 +23,12 @@ use std::fmt::{Display, Formatter};
use std::hash::Hash;
use std::mem::{size_of, size_of_val};
-use ahash::RandomState;
use arrow::array::{Array, ArrayRef, AsArray, downcast_integer};
use arrow::datatypes::{
ArrowNativeType, ArrowNumericType, DataType, Field, FieldRef, Int8Type, Int16Type,
Int32Type, Int64Type, UInt8Type, UInt16Type, UInt32Type, UInt64Type,
};
+use datafusion_common::hash_utils::RandomState;
use datafusion_common::cast::as_list_array;
use datafusion_common::{Result, ScalarValue, not_impl_err};
diff --git a/datafusion/functions-aggregate/src/count.rs b/datafusion/functions-aggregate/src/count.rs
index ebe3c60a4d..67e799d489 100644
--- a/datafusion/functions-aggregate/src/count.rs
+++ b/datafusion/functions-aggregate/src/count.rs
@@ -15,7 +15,6 @@
// specific language governing permissions and limitations
// under the License.
-use ahash::RandomState;
use arrow::{
array::{Array, ArrayRef, AsArray, BooleanArray, Int64Array, PrimitiveArray},
buffer::BooleanBuffer,
@@ -29,6 +28,7 @@ use arrow::{
UInt8Type, UInt16Type, UInt32Type, UInt64Type,
},
};
+use datafusion_common::hash_utils::RandomState;
use datafusion_common::{
HashMap, Result, ScalarValue, downcast_value, internal_err, not_impl_err,
stats::Precision, utils::expr::COUNT_STAR_EXPANSION,
diff --git a/datafusion/functions-aggregate/src/hyperloglog.rs b/datafusion/functions-aggregate/src/hyperloglog.rs
index 3074889eab..3a467a8111 100644
--- a/datafusion/functions-aggregate/src/hyperloglog.rs
+++ b/datafusion/functions-aggregate/src/hyperloglog.rs
@@ -34,7 +34,7 @@
//!
//! This module also borrows some code structure from [pdatastructs.rs](https://github.com/crepererum/pdatastructs.rs/blob/3997ed50f6b6871c9e53c4c5e0f48f431405fc63/src/hyperloglog.rs).
-use ahash::RandomState;
+use std::hash::BuildHasher;
use std::hash::Hash;
use std::marker::PhantomData;
@@ -61,12 +61,7 @@ where
/// shared across cluster, this SEED will have to be consistent across all
/// parties otherwise we might have corruption. So ideally for later this seed
/// shall be part of the serialized form (or stay unchanged across versions).
-const SEED: RandomState = RandomState::with_seeds(
- 0x885f6cab121d01a3_u64,
- 0x71e4379f2976ad8f_u64,
- 0xbf30173dd28a8816_u64,
- 0x0eaea5d736d733a4_u64,
-);
+const SEED: foldhash::quality::FixedState = foldhash::quality::FixedState::with_seed(0);
impl<T> Default for HyperLogLog<T>
where
@@ -97,7 +92,7 @@ where
}
}
- /// choice of hash function: ahash is already an dependency
+ /// choice of hash function: foldhash is already a dependency
/// and it fits the requirements of being a 64bit hash with
/// reasonable performance.
#[inline]
diff --git a/datafusion/functions-aggregate/src/sum.rs b/datafusion/functions-aggregate/src/sum.rs
index 5cced80d99..4f638f2cb0 100644
--- a/datafusion/functions-aggregate/src/sum.rs
+++ b/datafusion/functions-aggregate/src/sum.rs
@@ -17,7 +17,6 @@
//! Defines `SUM` and `SUM DISTINCT` aggregate accumulators
-use ahash::RandomState;
use arrow::array::{Array, ArrayRef, ArrowNativeTypeOp, ArrowNumericType, AsArray};
use arrow::datatypes::Field;
use arrow::datatypes::{
@@ -27,6 +26,7 @@ use arrow::datatypes::{
DurationMillisecondType, DurationNanosecondType, DurationSecondType, FieldRef,
Float64Type, Int64Type, TimeUnit, UInt64Type,
};
+use datafusion_common::hash_utils::RandomState;
use datafusion_common::internal_err;
use datafusion_common::types::{
NativeType, logical_float64, logical_int8, logical_int16, logical_int32,
diff --git a/datafusion/physical-expr-common/Cargo.toml b/datafusion/physical-expr-common/Cargo.toml
index 453c8a0cb4..119be36234 100644
--- a/datafusion/physical-expr-common/Cargo.toml
+++ b/datafusion/physical-expr-common/Cargo.toml
@@ -41,7 +41,6 @@ workspace = true
name = "datafusion_physical_expr_common"
[dependencies]
-ahash = { workspace = true }
arrow = { workspace = true }
chrono = { workspace = true }
datafusion-common = { workspace = true }
diff --git a/datafusion/physical-expr-common/src/binary_map.rs b/datafusion/physical-expr-common/src/binary_map.rs
index 95d085ddfd..ad184d6500 100644
--- a/datafusion/physical-expr-common/src/binary_map.rs
+++ b/datafusion/physical-expr-common/src/binary_map.rs
@@ -18,7 +18,6 @@
//! [`ArrowBytesMap`] and [`ArrowBytesSet`] for storing maps/sets of values from
//! StringArray / LargeStringArray / BinaryArray / LargeBinaryArray.
-use ahash::RandomState;
use arrow::array::{
Array, ArrayRef, BufferBuilder, GenericBinaryArray, GenericStringArray,
NullBufferBuilder, OffsetSizeTrait,
@@ -27,6 +26,7 @@ use arrow::array::{
};
use arrow::buffer::{NullBuffer, OffsetBuffer, ScalarBuffer};
use arrow::datatypes::DataType;
+use datafusion_common::hash_utils::RandomState;
use datafusion_common::hash_utils::create_hashes;
use datafusion_common::utils::proxy::{HashTableAllocExt, VecAllocExt};
use std::any::type_name;
@@ -250,7 +250,7 @@ where
map_size: 0,
buffer: BufferBuilder::new(INITIAL_BUFFER_CAPACITY),
offsets: vec![O::default()], // first offset is always 0
- random_state: RandomState::new(),
+ random_state: RandomState::default(),
hashes_buffer: vec![],
null: None,
}
diff --git a/datafusion/physical-expr-common/src/binary_view_map.rs b/datafusion/physical-expr-common/src/binary_view_map.rs
index aa0d186f9e..89a97e18be 100644
--- a/datafusion/physical-expr-common/src/binary_view_map.rs
+++ b/datafusion/physical-expr-common/src/binary_view_map.rs
@@ -18,12 +18,12 @@
//! [`ArrowBytesViewMap`] and [`ArrowBytesViewSet`] for storing maps/sets of values from
//! `StringViewArray`/`BinaryViewArray`.
use crate::binary_map::OutputType;
-use ahash::RandomState;
use arrow::array::NullBufferBuilder;
use arrow::array::cast::AsArray;
use arrow::array::{Array, ArrayRef, BinaryViewArray, ByteView, make_view};
use arrow::buffer::{Buffer, ScalarBuffer};
use arrow::datatypes::{BinaryViewType, ByteViewType, DataType, StringViewType};
+use datafusion_common::hash_utils::RandomState;
use datafusion_common::hash_utils::create_hashes;
use datafusion_common::utils::proxy::{HashTableAllocExt, VecAllocExt};
use std::fmt::Debug;
@@ -163,7 +163,7 @@ where
in_progress: Vec::new(),
completed: Vec::new(),
nulls: NullBufferBuilder::new(0),
- random_state: RandomState::new(),
+ random_state: RandomState::default(),
hashes_buffer: vec![],
null: None,
}
diff --git a/datafusion/physical-expr/Cargo.toml b/datafusion/physical-expr/Cargo.toml
index d6cb212737..3a3224545c 100644
--- a/datafusion/physical-expr/Cargo.toml
+++ b/datafusion/physical-expr/Cargo.toml
@@ -44,7 +44,6 @@ name = "datafusion_physical_expr"
recursive_protection = ["dep:recursive"]
[dependencies]
-ahash = { workspace = true }
arrow = { workspace = true }
datafusion-common = { workspace = true }
datafusion-expr = { workspace = true }
diff --git a/datafusion/physical-expr/src/expressions/in_list.rs b/datafusion/physical-expr/src/expressions/in_list.rs
index e30f256352..e6ef7e7620 100644
--- a/datafusion/physical-expr/src/expressions/in_list.rs
+++ b/datafusion/physical-expr/src/expressions/in_list.rs
@@ -39,8 +39,8 @@ use datafusion_common::{
};
use datafusion_expr::{ColumnarValue, expr_vec_fmt};
-use ahash::RandomState;
use datafusion_common::HashMap;
+use datafusion_common::hash_utils::RandomState;
use hashbrown::hash_map::RawEntryMut;
/// Trait for InList static filters
@@ -189,12 +189,12 @@ impl ArrayStaticFilter {
if in_array.data_type() == &DataType::Null {
return Ok(ArrayStaticFilter {
in_array,
- state: RandomState::new(),
+ state: RandomState::default(),
map: HashMap::with_hasher(()),
});
}
- let state = RandomState::new();
+ let state = RandomState::default();
let mut map: HashMap<usize, (), ()> = HashMap::with_hasher(());
with_hashes([&in_array], &state, |hashes| -> Result<()> {
diff --git a/datafusion/physical-plan/Cargo.toml b/datafusion/physical-plan/Cargo.toml
index 6a28486cca..7acb21b8f3 100644
--- a/datafusion/physical-plan/Cargo.toml
+++ b/datafusion/physical-plan/Cargo.toml
@@ -47,7 +47,6 @@ tokio_coop_fallback = []
name = "datafusion_physical_plan"
[dependencies]
-ahash = { workspace = true }
arrow = { workspace = true }
arrow-ord = { workspace = true }
arrow-schema = { workspace = true }
diff --git a/datafusion/physical-plan/src/aggregates/group_values/multi_group_by/mod.rs b/datafusion/physical-plan/src/aggregates/group_values/multi_group_by/mod.rs
index 479bff001e..cc4576eabd 100644
--- a/datafusion/physical-plan/src/aggregates/group_values/multi_group_by/mod.rs
+++ b/datafusion/physical-plan/src/aggregates/group_values/multi_group_by/mod.rs
@@ -29,7 +29,6 @@ use crate::aggregates::group_values::multi_group_by::{
boolean::BooleanGroupValueBuilder, bytes::ByteGroupValueBuilder,
bytes_view::ByteViewGroupValueBuilder,
primitive::PrimitiveGroupValueBuilder,
};
-use ahash::RandomState;
use arrow::array::{Array, ArrayRef};
use arrow::compute::cast;
use arrow::datatypes::{
@@ -40,6 +39,7 @@ use arrow::datatypes::{
TimestampNanosecondType, TimestampSecondType, UInt8Type, UInt16Type, UInt32Type,
UInt64Type,
};
+use datafusion_common::hash_utils::RandomState;
use datafusion_common::hash_utils::create_hashes;
use datafusion_common::{Result, internal_datafusion_err, not_impl_err};
use datafusion_execution::memory_pool::proxy::{HashTableAllocExt, VecAllocExt};
diff --git a/datafusion/physical-plan/src/aggregates/group_values/row.rs b/datafusion/physical-plan/src/aggregates/group_values/row.rs
index dd794c9573..3dcf4e1240 100644
--- a/datafusion/physical-plan/src/aggregates/group_values/row.rs
+++ b/datafusion/physical-plan/src/aggregates/group_values/row.rs
@@ -16,12 +16,12 @@
// under the License.
use crate::aggregates::group_values::GroupValues;
-use ahash::RandomState;
use arrow::array::{Array, ArrayRef, ListArray, StructArray};
use arrow::compute::cast;
use arrow::datatypes::{DataType, SchemaRef};
use arrow::row::{RowConverter, Rows, SortField};
use datafusion_common::Result;
+use datafusion_common::hash_utils::RandomState;
use datafusion_common::hash_utils::create_hashes;
use datafusion_execution::memory_pool::proxy::{HashTableAllocExt, VecAllocExt};
use datafusion_expr::EmitTo;
diff --git a/datafusion/physical-plan/src/aggregates/group_values/single_group_by/primitive.rs b/datafusion/physical-plan/src/aggregates/group_values/single_group_by/primitive.rs
index 2b8a2cfa68..4686648fb1 100644
--- a/datafusion/physical-plan/src/aggregates/group_values/single_group_by/primitive.rs
+++ b/datafusion/physical-plan/src/aggregates/group_values/single_group_by/primitive.rs
@@ -16,7 +16,6 @@
// under the License.
use crate::aggregates::group_values::GroupValues;
-use ahash::RandomState;
use arrow::array::types::{IntervalDayTime, IntervalMonthDayNano};
use arrow::array::{
ArrayRef, ArrowNativeTypeOp, ArrowPrimitiveType, NullBufferBuilder, PrimitiveArray,
@@ -24,10 +23,12 @@ use arrow::array::{
};
use arrow::datatypes::{DataType, i256};
use datafusion_common::Result;
+use datafusion_common::hash_utils::RandomState;
use datafusion_execution::memory_pool::proxy::VecAllocExt;
use datafusion_expr::EmitTo;
use half::f16;
use hashbrown::hash_table::HashTable;
+use std::hash::BuildHasher;
use std::mem::size_of;
use std::sync::Arc;
diff --git a/datafusion/physical-plan/src/aggregates/mod.rs b/datafusion/physical-plan/src/aggregates/mod.rs
index e573a7f85d..42df1a8b07 100644
--- a/datafusion/physical-plan/src/aggregates/mod.rs
+++ b/datafusion/physical-plan/src/aggregates/mod.rs
@@ -87,8 +87,9 @@ pub fn topk_types_supported(key_type: &DataType, value_type: &DataType) -> bool
}
/// Hard-coded seed for aggregations to ensure hash values differ from `RepartitionExec`, avoiding collisions.
-const AGGREGATION_HASH_SEED: ahash::RandomState =
- ahash::RandomState::with_seeds('A' as u64, 'G' as u64, 'G' as u64, 'R' as u64);
+const AGGREGATION_HASH_SEED: datafusion_common::hash_utils::RandomState =
+ // This seed is chosen to be a large 64-bit number
+    datafusion_common::hash_utils::RandomState::with_seed(15395726432021054657);
/// Whether an aggregate stage consumes raw input data or intermediate
/// accumulator state from a previous aggregation stage.
diff --git a/datafusion/physical-plan/src/aggregates/topk/hash_table.rs b/datafusion/physical-plan/src/aggregates/topk/hash_table.rs
index 418ec49ddd..694780f085 100644
--- a/datafusion/physical-plan/src/aggregates/topk/hash_table.rs
+++ b/datafusion/physical-plan/src/aggregates/topk/hash_table.rs
@@ -19,7 +19,6 @@
use crate::aggregates::group_values::HashValue;
use crate::aggregates::topk::heap::Comparable;
-use ahash::RandomState;
use arrow::array::types::{IntervalDayTime, IntervalMonthDayNano};
use arrow::array::{
Array, ArrayRef, ArrowPrimitiveType, LargeStringArray, PrimitiveArray, StringArray,
@@ -28,9 +27,11 @@ use arrow::array::{
use arrow::datatypes::{DataType, i256};
use datafusion_common::Result;
use datafusion_common::exec_datafusion_err;
+use datafusion_common::hash_utils::RandomState;
use half::f16;
use hashbrown::hash_table::HashTable;
use std::fmt::Debug;
+use std::hash::BuildHasher;
use std::sync::Arc;
/// A "type alias" for Keys which are stored in our map
diff --git a/datafusion/physical-plan/src/joins/hash_join/exec.rs b/datafusion/physical-plan/src/joins/hash_join/exec.rs
index c66123facb..a03cc36958 100644
--- a/datafusion/physical-plan/src/joins/hash_join/exec.rs
+++ b/datafusion/physical-plan/src/joins/hash_join/exec.rs
@@ -89,7 +89,7 @@ use datafusion_physical_expr::expressions::{Column, DynamicFilterPhysicalExpr, l
use datafusion_physical_expr::projection::{ProjectionRef, combine_projections};
use datafusion_physical_expr::{PhysicalExpr, PhysicalExprRef};
-use ahash::RandomState;
+use datafusion_common::hash_utils::RandomState;
use datafusion_physical_expr_common::physical_expr::fmt_sql;
use datafusion_physical_expr_common::utils::evaluate_expressions_to_arrays;
use futures::TryStreamExt;
@@ -99,7 +99,7 @@ use super::partitioned_hash_eval::SeededRandomState;
/// Hard-coded seed to ensure hash values from the hash join differ from `RepartitionExec`, avoiding collisions.
pub(crate) const HASH_JOIN_SEED: SeededRandomState =
- SeededRandomState::with_seeds('J' as u64, 'O' as u64, 'I' as u64, 'N' as u64);
+ SeededRandomState::with_seed(12210250226015887276);
const ARRAY_MAP_CREATED_COUNT_METRIC_NAME: &str = "array_map_created_count";
@@ -4304,7 +4304,7 @@ mod tests {
("y", &vec![200, 300]),
);
- let random_state = RandomState::with_seeds(0, 0, 0, 0);
+ let random_state = RandomState::with_seed(0);
let hashes_buff = &mut vec![0; left.num_rows()];
let hashes = create_hashes([&left.columns()[0]], &random_state, hashes_buff)?;
@@ -4371,7 +4371,7 @@ mod tests {
("y", &vec![200, 300]),
);
- let random_state = RandomState::with_seeds(0, 0, 0, 0);
+ let random_state = RandomState::with_seed(0);
let hashes_buff = &mut vec![0; left.num_rows()];
let hashes = create_hashes([&left.columns()[0]], &random_state, hashes_buff)?;
diff --git a/datafusion/physical-plan/src/joins/hash_join/partitioned_hash_eval.rs b/datafusion/physical-plan/src/joins/hash_join/partitioned_hash_eval.rs
index e3d432643c..f6c305fba6 100644
--- a/datafusion/physical-plan/src/joins/hash_join/partitioned_hash_eval.rs
+++ b/datafusion/physical-plan/src/joins/hash_join/partitioned_hash_eval.rs
@@ -19,13 +19,13 @@
use std::{any::Any, fmt::Display, hash::Hash, sync::Arc};
-use ahash::RandomState;
use arrow::{
array::{ArrayRef, UInt64Array},
datatypes::{DataType, Schema},
record_batch::RecordBatch,
};
use datafusion_common::Result;
+use datafusion_common::hash_utils::RandomState;
use datafusion_common::hash_utils::{create_hashes, with_hashes};
use datafusion_expr::ColumnarValue;
use datafusion_physical_expr_common::physical_expr::{
@@ -34,22 +34,22 @@ use datafusion_physical_expr_common::physical_expr::{
use crate::joins::Map;
-/// RandomState wrapper that preserves the seeds used to create it.
+/// RandomState wrapper that preserves the seed used to create it.
///
-/// This is needed because ahash's `RandomState` doesn't expose its seeds after creation,
+/// This is needed because `RandomState` doesn't expose its seed after creation,
/// but we need them for serialization (e.g., protobuf serde).
#[derive(Clone, Debug)]
pub struct SeededRandomState {
random_state: RandomState,
- seeds: (u64, u64, u64, u64),
+ seed: u64,
}
impl SeededRandomState {
- /// Create a new SeededRandomState with the given seeds.
- pub const fn with_seeds(k0: u64, k1: u64, k2: u64, k3: u64) -> Self {
+ /// Create a new SeededRandomState with the given seed.
+ pub const fn with_seed(k: u64) -> Self {
Self {
- random_state: RandomState::with_seeds(k0, k1, k2, k3),
- seeds: (k0, k1, k2, k3),
+ random_state: RandomState::with_seed(k),
+ seed: k,
}
}
@@ -58,9 +58,9 @@ impl SeededRandomState {
&self.random_state
}
- /// Get the seeds used to create this RandomState.
- pub fn seeds(&self) -> (u64, u64, u64, u64) {
- self.seeds
+ /// Get the seed used to create this RandomState.
+ pub fn seed(&self) -> u64 {
+ self.seed
}
}
@@ -105,9 +105,9 @@ impl HashExpr {
&self.on_columns
}
- /// Get the seeds used for hashing.
- pub fn seeds(&self) -> (u64, u64, u64, u64) {
- self.random_state.seeds()
+ /// Get the seed used for hashing.
+ pub fn seed(&self) -> u64 {
+ self.random_state.seed()
}
/// Get the description.
@@ -124,8 +124,8 @@ impl std::fmt::Debug for HashExpr {
.map(|e| e.to_string())
.collect::<Vec<_>>()
.join(", ");
- let (s1, s2, s3, s4) = self.seeds();
- write!(f, "{}({cols}, [{s1},{s2},{s3},{s4}])", self.description)
+ let seed = self.seed();
+ write!(f, "{}({cols}, [{seed}])", self.description)
}
}
@@ -133,7 +133,7 @@ impl Hash for HashExpr {
fn hash<H: std::hash::Hasher>(&self, state: &mut H) {
self.on_columns.dyn_hash(state);
self.description.hash(state);
- self.seeds().hash(state);
+ self.seed().hash(state);
}
}
@@ -141,7 +141,7 @@ impl PartialEq for HashExpr {
fn eq(&self, other: &Self) -> bool {
self.on_columns == other.on_columns
&& self.description == other.description
- && self.seeds() == other.seeds()
+ && self.seed() == other.seed()
}
}
@@ -254,8 +254,8 @@ impl std::fmt::Debug for HashTableLookupExpr {
.map(|e| e.to_string())
.collect::<Vec<_>>()
.join(", ");
- let (s1, s2, s3, s4) = self.random_state.seeds();
- write!(f, "{}({cols}, [{s1},{s2},{s3},{s4}])", self.description)
+ let seed = self.random_state.seed();
+ write!(f, "{}({cols}, [{seed}])", self.description)
}
}
@@ -263,7 +263,7 @@ impl Hash for HashTableLookupExpr {
fn hash<H: std::hash::Hasher>(&self, state: &mut H) {
self.on_columns.dyn_hash(state);
self.description.hash(state);
- self.random_state.seeds().hash(state);
+ self.random_state.seed().hash(state);
// Note that we compare hash_map by pointer equality.
// Actually comparing the contents of the hash maps would be expensive.
// The way these hash maps are used in actuality is that HashJoinExec creates
@@ -286,7 +286,7 @@ impl PartialEq for HashTableLookupExpr {
// but that seems unlikely and not worth paying the cost of deep comparison all the time.
self.on_columns == other.on_columns
&& self.description == other.description
- && self.random_state.seeds() == other.random_state.seeds()
+ && self.random_state.seed() == other.random_state.seed()
&& Arc::ptr_eq(&self.map, &other.map)
}
}
@@ -383,13 +383,13 @@ mod tests {
let expr1 = HashExpr::new(
vec![Arc::clone(&col_a), Arc::clone(&col_b)],
- SeededRandomState::with_seeds(1, 2, 3, 4),
+ SeededRandomState::with_seed(1),
"test_hash".to_string(),
);
let expr2 = HashExpr::new(
vec![Arc::clone(&col_a), Arc::clone(&col_b)],
- SeededRandomState::with_seeds(1, 2, 3, 4),
+ SeededRandomState::with_seed(1),
"test_hash".to_string(),
);
@@ -404,13 +404,13 @@ mod tests {
let expr1 = HashExpr::new(
vec![Arc::clone(&col_a), Arc::clone(&col_b)],
- SeededRandomState::with_seeds(1, 2, 3, 4),
+ SeededRandomState::with_seed(1),
"test_hash".to_string(),
);
let expr2 = HashExpr::new(
vec![Arc::clone(&col_a), Arc::clone(&col_c)],
- SeededRandomState::with_seeds(1, 2, 3, 4),
+ SeededRandomState::with_seed(1),
"test_hash".to_string(),
);
@@ -423,13 +423,13 @@ mod tests {
let expr1 = HashExpr::new(
vec![Arc::clone(&col_a)],
- SeededRandomState::with_seeds(1, 2, 3, 4),
+ SeededRandomState::with_seed(1),
"hash_one".to_string(),
);
let expr2 = HashExpr::new(
vec![Arc::clone(&col_a)],
- SeededRandomState::with_seeds(1, 2, 3, 4),
+ SeededRandomState::with_seed(1),
"hash_two".to_string(),
);
@@ -442,13 +442,13 @@ mod tests {
let expr1 = HashExpr::new(
vec![Arc::clone(&col_a)],
- SeededRandomState::with_seeds(1, 2, 3, 4),
+ SeededRandomState::with_seed(1),
"test_hash".to_string(),
);
let expr2 = HashExpr::new(
vec![Arc::clone(&col_a)],
- SeededRandomState::with_seeds(5, 6, 7, 8),
+ SeededRandomState::with_seed(5),
"test_hash".to_string(),
);
@@ -462,13 +462,13 @@ mod tests {
let expr1 = HashExpr::new(
vec![Arc::clone(&col_a), Arc::clone(&col_b)],
- SeededRandomState::with_seeds(1, 2, 3, 4),
+ SeededRandomState::with_seed(1),
"test_hash".to_string(),
);
let expr2 = HashExpr::new(
vec![Arc::clone(&col_a), Arc::clone(&col_b)],
- SeededRandomState::with_seeds(1, 2, 3, 4),
+ SeededRandomState::with_seed(1),
"test_hash".to_string(),
);
@@ -485,14 +485,14 @@ mod tests {
let expr1 = HashTableLookupExpr::new(
vec![Arc::clone(&col_a)],
- SeededRandomState::with_seeds(1, 2, 3, 4),
+ SeededRandomState::with_seed(1),
Arc::clone(&hash_map),
"lookup".to_string(),
);
let expr2 = HashTableLookupExpr::new(
vec![Arc::clone(&col_a)],
- SeededRandomState::with_seeds(1, 2, 3, 4),
+ SeededRandomState::with_seed(1),
Arc::clone(&hash_map),
"lookup".to_string(),
);
@@ -510,14 +510,14 @@ mod tests {
let expr1 = HashTableLookupExpr::new(
vec![Arc::clone(&col_a)],
- SeededRandomState::with_seeds(1, 2, 3, 4),
+ SeededRandomState::with_seed(1),
Arc::clone(&hash_map),
"lookup".to_string(),
);
let expr2 = HashTableLookupExpr::new(
vec![Arc::clone(&col_b)],
- SeededRandomState::with_seeds(1, 2, 3, 4),
+ SeededRandomState::with_seed(1),
Arc::clone(&hash_map),
"lookup".to_string(),
);
@@ -533,14 +533,14 @@ mod tests {
let expr1 = HashTableLookupExpr::new(
vec![Arc::clone(&col_a)],
- SeededRandomState::with_seeds(1, 2, 3, 4),
+ SeededRandomState::with_seed(1),
Arc::clone(&hash_map),
"lookup_one".to_string(),
);
let expr2 = HashTableLookupExpr::new(
vec![Arc::clone(&col_a)],
- SeededRandomState::with_seeds(1, 2, 3, 4),
+ SeededRandomState::with_seed(1),
Arc::clone(&hash_map),
"lookup_two".to_string(),
);
@@ -559,14 +559,14 @@ mod tests {
Arc::new(Map::HashMap(Box::new(JoinHashMapU32::with_capacity(10))));
let expr1 = HashTableLookupExpr::new(
vec![Arc::clone(&col_a)],
- SeededRandomState::with_seeds(1, 2, 3, 4),
+ SeededRandomState::with_seed(1),
hash_map1,
"lookup".to_string(),
);
let expr2 = HashTableLookupExpr::new(
vec![Arc::clone(&col_a)],
- SeededRandomState::with_seeds(1, 2, 3, 4),
+ SeededRandomState::with_seed(1),
hash_map2,
"lookup".to_string(),
);
@@ -583,14 +583,14 @@ mod tests {
let expr1 = HashTableLookupExpr::new(
vec![Arc::clone(&col_a)],
- SeededRandomState::with_seeds(1, 2, 3, 4),
+ SeededRandomState::with_seed(1),
Arc::clone(&hash_map),
"lookup".to_string(),
);
let expr2 = HashTableLookupExpr::new(
vec![Arc::clone(&col_a)],
- SeededRandomState::with_seeds(1, 2, 3, 4),
+ SeededRandomState::with_seed(1),
Arc::clone(&hash_map),
"lookup".to_string(),
);
diff --git a/datafusion/physical-plan/src/joins/hash_join/stream.rs b/datafusion/physical-plan/src/joins/hash_join/stream.rs
index 57218244ba..ab63092018 100644
--- a/datafusion/physical-plan/src/joins/hash_join/stream.rs
+++ b/datafusion/physical-plan/src/joins/hash_join/stream.rs
@@ -54,7 +54,7 @@ use datafusion_common::{
};
use datafusion_physical_expr::PhysicalExprRef;
-use ahash::RandomState;
+use datafusion_common::hash_utils::RandomState;
use datafusion_physical_expr_common::utils::evaluate_expressions_to_arrays;
use futures::{Stream, StreamExt, ready};
diff --git a/datafusion/physical-plan/src/joins/symmetric_hash_join.rs b/datafusion/physical-plan/src/joins/symmetric_hash_join.rs
index 4429d1e3fb..f31cd8d446 100644
--- a/datafusion/physical-plan/src/joins/symmetric_hash_join.rs
+++ b/datafusion/physical-plan/src/joins/symmetric_hash_join.rs
@@ -80,7 +80,7 @@ use datafusion_physical_expr::intervals::cp_solver::ExprIntervalGraph;
use datafusion_physical_expr_common::physical_expr::{PhysicalExprRef, fmt_sql};
use datafusion_physical_expr_common::sort_expr::{LexOrdering, OrderingRequirements};
-use ahash::RandomState;
+use datafusion_common::hash_utils::RandomState;
use datafusion_physical_expr_common::utils::evaluate_expressions_to_arrays;
use futures::{Stream, StreamExt, ready};
use parking_lot::Mutex;
@@ -239,7 +239,7 @@ impl SymmetricHashJoinExec {
build_join_schema(&left_schema, &right_schema, join_type);
// Initialize the random state for the join operation:
- let random_state = RandomState::with_seeds(0, 0, 0, 0);
+ let random_state = RandomState::with_seed(0);
let schema = Arc::new(schema);
let cache = Self::compute_properties(&left, &right, schema, *join_type, &on)?;
Ok(SymmetricHashJoinExec {
diff --git a/datafusion/physical-plan/src/joins/utils.rs b/datafusion/physical-plan/src/joins/utils.rs
index cf4bf2cd16..3130134e25 100644
--- a/datafusion/physical-plan/src/joins/utils.rs
+++ b/datafusion/physical-plan/src/joins/utils.rs
@@ -39,7 +39,6 @@ pub use super::join_filter::JoinFilter;
pub use super::join_hash_map::JoinHashMapType;
pub use crate::joins::{JoinOn, JoinOnRef};
-use ahash::RandomState;
use arrow::array::{
Array, ArrowPrimitiveType, BooleanBufferBuilder, NativeAdapter, PrimitiveArray,
RecordBatch, RecordBatchOptions, UInt32Array, UInt32Builder, UInt64Array,
@@ -61,6 +60,7 @@ use arrow::datatypes::{
use arrow_ord::cmp::not_distinct;
use arrow_schema::{ArrowError, DataType, SortOptions, TimeUnit};
use datafusion_common::cast::as_boolean_array;
+use datafusion_common::hash_utils::RandomState;
use datafusion_common::hash_utils::create_hashes;
use datafusion_common::stats::Precision;
use datafusion_common::{
diff --git a/datafusion/physical-plan/src/repartition/mod.rs b/datafusion/physical-plan/src/repartition/mod.rs
index 21f5bd3729..342b2f5035 100644
--- a/datafusion/physical-plan/src/repartition/mod.rs
+++ b/datafusion/physical-plan/src/repartition/mod.rs
@@ -436,8 +436,7 @@ enum BatchPartitionerState {
/// Fixed RandomState used for hash repartitioning to ensure consistent behavior across
/// executions and runs.
-pub const REPARTITION_RANDOM_STATE: SeededRandomState =
- SeededRandomState::with_seeds(0, 0, 0, 0);
+pub const REPARTITION_RANDOM_STATE: SeededRandomState = SeededRandomState::with_seed(0);
impl BatchPartitioner {
/// Create a new [`BatchPartitioner`] for hash-based repartitioning.
diff --git a/datafusion/physical-plan/src/windows/bounded_window_agg_exec.rs b/datafusion/physical-plan/src/windows/bounded_window_agg_exec.rs
index 33570c2a21..d0c44c659c 100644
--- a/datafusion/physical-plan/src/windows/bounded_window_agg_exec.rs
+++ b/datafusion/physical-plan/src/windows/bounded_window_agg_exec.rs
@@ -67,7 +67,7 @@ use datafusion_physical_expr_common::sort_expr::{
};
use crate::execution_plan::CardinalityEffect;
-use ahash::RandomState;
+use datafusion_common::hash_utils::RandomState;
use futures::stream::Stream;
use futures::{StreamExt, ready};
use hashbrown::hash_table::HashTable;
diff --git a/datafusion/proto/proto/datafusion.proto b/datafusion/proto/proto/datafusion.proto
index 197be2631a..e422ce7bed 100644
--- a/datafusion/proto/proto/datafusion.proto
+++ b/datafusion/proto/proto/datafusion.proto
@@ -1041,9 +1041,6 @@ message PhysicalExtensionExprNode {
message PhysicalHashExprNode {
repeated PhysicalExprNode on_columns = 1;
uint64 seed0 = 2;
- uint64 seed1 = 3;
- uint64 seed2 = 4;
- uint64 seed3 = 5;
string description = 6;
}
diff --git a/datafusion/proto/src/generated/pbjson.rs b/datafusion/proto/src/generated/pbjson.rs
index 7a4964c4c4..eb86afe3d6 100644
--- a/datafusion/proto/src/generated/pbjson.rs
+++ b/datafusion/proto/src/generated/pbjson.rs
@@ -17081,15 +17081,6 @@ impl serde::Serialize for PhysicalHashExprNode {
if self.seed0 != 0 {
len += 1;
}
- if self.seed1 != 0 {
- len += 1;
- }
- if self.seed2 != 0 {
- len += 1;
- }
- if self.seed3 != 0 {
- len += 1;
- }
if !self.description.is_empty() {
len += 1;
}
@@ -17102,21 +17093,6 @@ impl serde::Serialize for PhysicalHashExprNode {
#[allow(clippy::needless_borrows_for_generic_args)]
struct_ser.serialize_field("seed0", ToString::to_string(&self.seed0).as_str())?;
}
- if self.seed1 != 0 {
- #[allow(clippy::needless_borrow)]
- #[allow(clippy::needless_borrows_for_generic_args)]
- struct_ser.serialize_field("seed1", ToString::to_string(&self.seed1).as_str())?;
- }
- if self.seed2 != 0 {
- #[allow(clippy::needless_borrow)]
- #[allow(clippy::needless_borrows_for_generic_args)]
- struct_ser.serialize_field("seed2", ToString::to_string(&self.seed2).as_str())?;
- }
- if self.seed3 != 0 {
- #[allow(clippy::needless_borrow)]
- #[allow(clippy::needless_borrows_for_generic_args)]
- struct_ser.serialize_field("seed3", ToString::to_string(&self.seed3).as_str())?;
- }
if !self.description.is_empty() {
struct_ser.serialize_field("description", &self.description)?;
}
@@ -17133,9 +17109,6 @@ impl<'de> serde::Deserialize<'de> for PhysicalHashExprNode {
"on_columns",
"onColumns",
"seed0",
- "seed1",
- "seed2",
- "seed3",
"description",
];
@@ -17143,9 +17116,6 @@ impl<'de> serde::Deserialize<'de> for PhysicalHashExprNode {
enum GeneratedField {
OnColumns,
Seed0,
- Seed1,
- Seed2,
- Seed3,
Description,
}
impl<'de> serde::Deserialize<'de> for GeneratedField {
@@ -17170,9 +17140,6 @@ impl<'de> serde::Deserialize<'de> for PhysicalHashExprNode {
match value {
"onColumns" | "on_columns" => Ok(GeneratedField::OnColumns),
"seed0" => Ok(GeneratedField::Seed0),
- "seed1" => Ok(GeneratedField::Seed1),
- "seed2" => Ok(GeneratedField::Seed2),
- "seed3" => Ok(GeneratedField::Seed3),
"description" => Ok(GeneratedField::Description),
_ => Err(serde::de::Error::unknown_field(value, FIELDS)),
}
@@ -17195,9 +17162,6 @@ impl<'de> serde::Deserialize<'de> for PhysicalHashExprNode {
{
let mut on_columns__ = None;
let mut seed0__ = None;
- let mut seed1__ = None;
- let mut seed2__ = None;
- let mut seed3__ = None;
let mut description__ = None;
while let Some(k) = map_.next_key()? {
match k {
@@ -17215,30 +17179,6 @@ impl<'de> serde::Deserialize<'de> for PhysicalHashExprNode {
Some(map_.next_value::<::pbjson::private::NumberDeserialize<_>>()?.0)
;
}
- GeneratedField::Seed1 => {
- if seed1__.is_some() {
- return Err(serde::de::Error::duplicate_field("seed1"));
- }
- seed1__ =
- Some(map_.next_value::<::pbjson::private::NumberDeserialize<_>>()?.0)
- ;
- }
- GeneratedField::Seed2 => {
- if seed2__.is_some() {
- return Err(serde::de::Error::duplicate_field("seed2"));
- }
- seed2__ =
- Some(map_.next_value::<::pbjson::private::NumberDeserialize<_>>()?.0)
- ;
- }
- GeneratedField::Seed3 => {
- if seed3__.is_some() {
- return
Err(serde::de::Error::duplicate_field("seed3"));
- }
- seed3__ =
-
Some(map_.next_value::<::pbjson::private::NumberDeserialize<_>>()?.0)
- ;
- }
GeneratedField::Description => {
if description__.is_some() {
return Err(serde::de::Error::duplicate_field("description"));
@@ -17250,9 +17190,6 @@ impl<'de> serde::Deserialize<'de> for PhysicalHashExprNode {
Ok(PhysicalHashExprNode {
on_columns: on_columns__.unwrap_or_default(),
seed0: seed0__.unwrap_or_default(),
- seed1: seed1__.unwrap_or_default(),
- seed2: seed2__.unwrap_or_default(),
- seed3: seed3__.unwrap_or_default(),
description: description__.unwrap_or_default(),
})
}
diff --git a/datafusion/proto/src/generated/prost.rs b/datafusion/proto/src/generated/prost.rs
index c708c61328..e0a0c636fb 100644
--- a/datafusion/proto/src/generated/prost.rs
+++ b/datafusion/proto/src/generated/prost.rs
@@ -1565,12 +1565,6 @@ pub struct PhysicalHashExprNode {
pub on_columns: ::prost::alloc::vec::Vec<PhysicalExprNode>,
#[prost(uint64, tag = "2")]
pub seed0: u64,
- #[prost(uint64, tag = "3")]
- pub seed1: u64,
- #[prost(uint64, tag = "4")]
- pub seed2: u64,
- #[prost(uint64, tag = "5")]
- pub seed3: u64,
#[prost(string, tag = "6")]
pub description: ::prost::alloc::string::String,
}
diff --git a/datafusion/proto/src/physical_plan/from_proto.rs b/datafusion/proto/src/physical_plan/from_proto.rs
index e424be1626..c7a4bd8226 100644
--- a/datafusion/proto/src/physical_plan/from_proto.rs
+++ b/datafusion/proto/src/physical_plan/from_proto.rs
@@ -486,12 +486,7 @@ pub fn parse_physical_expr_with_converter(
)?;
Arc::new(HashExpr::new(
on_columns,
- SeededRandomState::with_seeds(
- hash_expr.seed0,
- hash_expr.seed1,
- hash_expr.seed2,
- hash_expr.seed3,
- ),
+ SeededRandomState::with_seed(hash_expr.seed0),
hash_expr.description.clone(),
))
}
diff --git a/datafusion/proto/src/physical_plan/to_proto.rs b/datafusion/proto/src/physical_plan/to_proto.rs
index de2f36e81e..990a54cf94 100644
--- a/datafusion/proto/src/physical_plan/to_proto.rs
+++ b/datafusion/proto/src/physical_plan/to_proto.rs
@@ -490,7 +490,6 @@ pub fn serialize_physical_expr_with_converter(
))),
})
} else if let Some(expr) = expr.downcast_ref::<HashExpr>() {
- let (s0, s1, s2, s3) = expr.seeds();
Ok(protobuf::PhysicalExprNode {
expr_id: None,
expr_type: Some(protobuf::physical_expr_node::ExprType::HashExpr(
@@ -500,10 +499,7 @@ pub fn serialize_physical_expr_with_converter(
codec,
proto_converter,
)?,
- seed0: s0,
- seed1: s1,
- seed2: s2,
- seed3: s3,
+ seed0: expr.seed(),
description: expr.description().to_string(),
},
)),
diff --git a/datafusion/proto/tests/cases/roundtrip_physical_plan.rs b/datafusion/proto/tests/cases/roundtrip_physical_plan.rs
index 31626ef9fd..0a5ed766e6 100644
--- a/datafusion/proto/tests/cases/roundtrip_physical_plan.rs
+++ b/datafusion/proto/tests/cases/roundtrip_physical_plan.rs
@@ -2501,7 +2501,7 @@ fn roundtrip_hash_table_lookup_expr_to_lit() -> Result<()> {
let on_columns = vec![datafusion::physical_plan::expressions::col("col", &schema)?];
let lookup_expr: Arc<dyn PhysicalExpr> = Arc::new(HashTableLookupExpr::new(
on_columns,
- datafusion::physical_plan::joins::SeededRandomState::with_seeds(0, 0, 0, 0),
+ datafusion::physical_plan::joins::SeededRandomState::with_seed(0),
hash_map,
"test_lookup".to_string(),
));
@@ -2545,7 +2545,7 @@ fn roundtrip_hash_expr() -> Result<()> {
let on_columns = vec![col("a", &schema)?, col("b", &schema)?];
let hash_expr: Arc<dyn PhysicalExpr> = Arc::new(HashExpr::new(
on_columns,
- SeededRandomState::with_seeds(0, 1, 2, 3), // arbitrary random seeds for testing
+ SeededRandomState::with_seed(0), // arbitrary random seed for testing
"test_hash".to_string(),
));
@@ -2559,7 +2559,7 @@ fn roundtrip_hash_expr() -> Result<()> {
// Confirm that the debug string contains the random state seeds
assert!(
- format!("{filter:?}").contains("test_hash(a@0, b@1, [0,1,2,3])"),
+ format!("{filter:?}").contains("test_hash(a@0, b@1, [0])"),
"Debug string missing seeds: {filter:?}"
);
roundtrip_test(filter)
diff --git a/datafusion/sqllogictest/test_files/aggregate.slt b/datafusion/sqllogictest/test_files/aggregate.slt
index 517467110f..cf894a494a 100644
--- a/datafusion/sqllogictest/test_files/aggregate.slt
+++ b/datafusion/sqllogictest/test_files/aggregate.slt
@@ -1776,7 +1776,7 @@ SELECT approx_distinct(null)
query II
SELECT approx_distinct(c9) AS a, approx_distinct(c9) AS b FROM aggregate_test_100
----
-100 100
+99 99
# csv_query_approx_count_date_timestamp
query IIIII
@@ -8050,21 +8050,21 @@ group0 -14
group1 100
# group median i16 non-nullable
-query TI
+query TI rowsort
SELECT col_group, median(col_i16) FROM group_median_table_non_nullable GROUP BY col_group
----
group0 -16334
group1 100
# group median i32 non-nullable
-query TI
+query TI rowsort
SELECT col_group, median(col_i32) FROM group_median_table_non_nullable GROUP BY col_group
----
group0 -1073741774
group1 100
# group median i64 non-nullable
-query TI
+query TI rowsort
SELECT col_group, median(col_i64) FROM group_median_table_non_nullable GROUP BY col_group
----
group0 -4611686018427387854
@@ -8078,56 +8078,56 @@ group0 50
group1 100
# group median u16 non-nullable
-query TI
+query TI rowsort
SELECT col_group, median(col_u16) FROM group_median_table_non_nullable GROUP BY col_group
----
group0 50
group1 100
# group median u32 non-nullable
-query TI
+query TI rowsort
SELECT col_group, median(col_u32) FROM group_median_table_non_nullable GROUP BY col_group
----
group0 50
group1 100
# group median u64 non-nullable
-query TI
+query TI rowsort
SELECT col_group, median(col_u64) FROM group_median_table_non_nullable GROUP BY col_group
----
group0 50
group1 100
# group median f32 non-nullable
-query TR
+query TR rowsort
SELECT col_group, median(col_f32) FROM group_median_table_non_nullable GROUP BY col_group
----
group0 2.75
group1 3.2
# group median f64 non-nullable
-query TR
+query TR rowsort
SELECT col_group, median(col_f64) FROM group_median_table_non_nullable GROUP BY col_group
----
group0 2.75
group1 3.3
# group median f64_nan non-nullable
-query TR
+query TR rowsort
SELECT col_group, median(col_f64_nan) FROM group_median_table_non_nullable GROUP BY col_group
----
group0 NaN
group1 NaN
# group median decimal128 non-nullable
-query TR
+query TR rowsort
SELECT col_group, median(col_decimal128) FROM group_median_table_non_nullable GROUP BY col_group
----
group0 0.0002
group1 0.0003
# group median decimal256 non-nullable
-query TR
+query TR rowsort
SELECT col_group, median(col_decimal256) FROM group_median_table_non_nullable GROUP BY col_group
----
group0 0.0002
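For readers following the proto changes above: the patch collapses the four ahash seeds (`seed0`..`seed3`) into a single `u64` seed, and on rehash paths the prior hash is folded into the hasher's initial state instead of going through a separate `combine_hashes` step. The sketch below illustrates that pattern only; `SeededState` and `hash_one_seeded` are hypothetical names, and std's `DefaultHasher` stands in for foldhash, so this is not the actual DataFusion or foldhash API.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::Hasher;

// Hypothetical stand-in for a single-u64-seeded random state, mirroring the
// shape of `SeededRandomState::with_seed(seed0)` in this patch.
struct SeededState {
    seed: u64,
}

impl SeededState {
    fn with_seed(seed: u64) -> Self {
        Self { seed }
    }

    // Rehash path: the existing hash is written into the hasher up front,
    // so no separate combine_hashes(old, new) call is needed afterwards.
    fn hash_one_seeded(&self, existing: u64, value: u64) -> u64 {
        let mut h = DefaultHasher::new();
        h.write_u64(self.seed);
        h.write_u64(existing); // prior hash becomes part of the initial state
        h.write_u64(value);
        h.finish()
    }
}

fn main() {
    let state = SeededState::with_seed(0);
    let h1 = state.hash_one_seeded(0, 42);
    let h2 = state.hash_one_seeded(h1, 7); // fold h1 in when rehashing
    // Same seed and inputs reproduce the same hash, which is why
    // serializing only one seed suffices for plan round-trips.
    assert_eq!(h1, state.hash_one_seeded(0, 42));
    assert_ne!(h1, h2);
    println!("ok");
}
```

Because the hash is fully determined by one seed, the proto message only needs `seed0` to rebuild an equivalent state on deserialization.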
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]