kosiew opened a new pull request, #21313:
URL: https://github.com/apache/datafusion/pull/21313
## Which issue does this PR close?
* Closes #20966.
---
## Rationale for this change
The existing NDV merge implementation (`estimate_ndv_with_overlap`) is not
associative, meaning that merging statistics in different orders can produce
different results. This happens because intermediate merges "smear" min/max/NDV
values, which then influence subsequent computations.
This lack of order invariance leads to inconsistent NDV estimates depending
on merge order, especially when combining more than two inputs.
To address this, this PR introduces a multi-input NDV estimation approach
that computes overlap using the original (unsmeared) statistics across all
inputs simultaneously. This ensures stable and deterministic results regardless
of merge order.
---
## What changes are included in this PR?
* Refactored `Statistics::try_merge_iter` to:
* Perform a single pass for mergeable statistics (null counts, min/max,
sum).
* Defer NDV computation until after all inputs are collected.
* Replaced pairwise NDV merging with a new multi-input algorithm:
* Introduced `estimate_ndv_with_overlap_many` to compute NDV across all
inputs simultaneously.
* Uses segment-based partitioning of value ranges and selects the maximum
density contribution per segment.
* Handles constant (point) values separately to avoid over-counting.
* Ensured order invariance:
* NDV is now computed from original inputs instead of intermediate merged
states.
* Added helper:
* `max_input_ndv` as a fallback when estimation is not possible.
* Code cleanup:
* Simplified iteration using `zip` instead of index-based access.
* Improved clarity and separation of concerns in merge logic.
---
## Are these changes tested?
Yes.
New and updated tests include:
* Reusable helpers for test setup (`int32_test_schema`, `int32_ndv_stats`).
* Validation of NDV behavior for:
* Identical ranges (full overlap).
* Partial overlaps.
* Constant columns (same and different values).
* **New test ensuring order invariance across permutations**, verifying that
all permutations of inputs produce identical NDV results.
These tests ensure correctness, stability, and regression protection for the
new algorithm.
---
## Are there any user-facing changes?
No direct user-facing API changes.
However, NDV estimates are now:
* More stable
* Deterministic across merge orders
* Potentially slightly different (more accurate) compared to previous
behavior
This improves query planning consistency without requiring user intervention.
---
## LLM-generated code disclosure
This PR includes LLM-generated code and comments. All LLM-generated content
has been manually reviewed and tested.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]