viirya commented on PR #22038:
URL: https://github.com/apache/datafusion/pull/22038#issuecomment-4770746715
@kosiew Thanks for the careful review — both points were spot on.
**Multi-chunk coverage.** You're right that the existing SLT cases only
exercised the single-chunk path: `generate_series(1, 5000)` arrives as one
8192-row batch, so `load_one_chunk` accepts the single over-budget batch and
the coordinator never loads a second chunk. The `spill_count` there comes from
the right side, not from left chunking. I couldn't reliably target multi-chunk
from SLT (a memory limit tight enough to split the left side OOMs other
operators first), so I added Rust regression tests instead:
`test_nlj_memory_limited_multi_partition_multi_chunk_{left,full}_join`. The
left input is one row per batch under a tight memory limit, which I verified
loads multiple distinct chunks with `spill_count > 0`. They assert exact row
sets, so a regression in per-chunk `JoinLeftData` sharing (duplicate unmatched
rows) or in the cross-chunk global right-unmatched accumulation (FULL) would
fail.
One thing worth noting: these tests collect the four output partitions
**concurrently**. Sequential collection deadlocks on the multi-chunk path — a
chunk isn't released until every partition finishes probing it (the
`probe_threads_counter` reaching zero), and no partition can advance to the
next chunk until the current one is released. So if only one partition is
driven at a time, it blocks waiting for a release that can never happen.
Concurrent collection mirrors how partitions actually run under the runtime; I
documented this in the helper.
**Leader-only stream init.** Good catch — moved the lazy
stream/schema/reservation initialization into the leader-claim branch (under
the `loader_in_flight` guard) so waiters no longer open a throwaway spill
stream that the leader overwrites.
Pushed both as separate commits. Ready for another look when you have a
chance.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]