LukFury commented on issue #19876: URL: https://github.com/apache/datafusion/issues/19876#issuecomment-4346020636
Hi, I'd like to take a stab at this if it's still open. After digging into the code, here's my understanding of the plan: FilterExecStream::poll_next currently does filter_record_batch then push_batch separately — two copies per batch. There's already a // TODO: support push_batch_with_filter in LimitedBatchCoalescer comment marking exactly this spot (filter.rs:1053). The fix requires adding push_batch_with_filter to LimitedBatchCoalescer (in coalesce/mod.rs) before wiring it into FilterExecStream. The main wrinkle is that LimitedBatchCoalescer tracks a fetch row limit on post-filter rows, so the new method needs to call filter.true_count() to check the limit before delegating to the inner BatchCoalescer. One thing worth flagging: BatchCoalescer::push_batch_with_filter in arrow-rs 58.1.0 still materializes an intermediate batch internally (it has its own TODO). So there's no perf win from arrow-rs's side yet — the value is cleaner DataFusion code and automatic benefit once arrow-rs ships that optimization. Happy to proceed on that basis, but wanted to confirm that's the intent before starting. Let me know if the approach looks right or if there's a different shape you had in mind. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
