SubhamSinghal opened a new pull request, #21859:
URL: https://github.com/apache/datafusion/pull/21859
## Which issue does this PR close?
- Closes [#8393](https://github.com/apache/datafusion/issues/8398)
- Related to [#318](https://github.com/apache/datafusion/issues/318)
## Rationale for this change
Point-in-interval range joins (`e.time >= w.start AND e.time < w.end`)
currently fall through to NestedLoopJoin which
is O(N×M). This adds a specialized `IntervalJoinExec` operator that uses a
sort-merge approach — O(N log N + M log M +
scan) — with sequential memory access and no intermediate JoinFilter batch
construction overhead.
Common use cases: temporal joins (event → time window), financial tier
lookups (amount → price range), SCD joins.
## What changes are included in this PR?
- **`IntervalJoinExec`** (`physical-plan/src/joins/interval_join.rs`):
sort-merge range join operator for inner joins
with point-in-interval pattern. Collects + sorts build side by low bound,
sorts each probe batch internally, uses a
monotonic boundary pointer for amortized O(1) per-probe-row matching.
- **Pattern detection** (`core/src/physical_planner.rs`): detects
`probe_col {>= | >} build_low AND probe_col {< | <=}
build_high` in the non-equi join path. Handles flipped operands and
reordered conditions. Falls through to NLJ if
pattern doesn't match.
- **Config flag** (`common/src/config.rs`):
`datafusion.optimizer.enable_interval_join` (default `true`).
- **SLT tests** (`interval_join.slt`): basic matching, edge cases (empty
sides, NULLs, overlapping intervals, boundary
inclusivity), EXPLAIN plan verification, timestamp + interval arithmetic.
- **Fuzz tests** (`join_fuzz.rs`): compares IntervalJoinExec output
against NLJ for correctness with random data.
## Are these changes tested?
Yes:
- 273 lines of SQL logic tests covering basic joins, edge cases, NULL
handling, overlapping intervals, boundary
operators, EXPLAIN output, and timestamp types
- Fuzz tests comparing IntervalJoin results against NLJ on random data
- Unit tests within the operator
## Are there any user-facing changes?
Yes — new `IntervalJoinExec` physical operator automatically selected for
point-in-interval range joins. Controlled by
`datafusion.optimizer.enable_interval_join` config (default `true`). No
SQL syntax changes.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]