LukFury commented on issue #19876:
URL: https://github.com/apache/datafusion/issues/19876#issuecomment-4346020636

   Hi, I'd like to take a stab at this if it's still open.
   After digging into the code, here's my understanding of the plan:
   
   FilterExecStream::poll_next currently does filter_record_batch then 
push_batch separately — two copies per batch. There's already a // TODO: 
support push_batch_with_filter in LimitedBatchCoalescer comment marking exactly 
this spot (filter.rs:1053).
   The fix requires adding push_batch_with_filter to LimitedBatchCoalescer (in 
coalesce/mod.rs) before wiring it into FilterExecStream. The main wrinkle is 
that LimitedBatchCoalescer tracks a fetch row limit on post-filter rows, so the 
new method needs to call filter.true_count() to check the limit before 
delegating to the inner BatchCoalescer.
   One thing worth flagging: BatchCoalescer::push_batch_with_filter in arrow-rs 
58.1.0 still materializes an intermediate batch internally (it has its own 
TODO). So there's no perf win from arrow-rs's side yet — the value is cleaner 
DataFusion code and automatic benefit once arrow-rs ships that optimization. 
Happy to proceed on that basis, but wanted to confirm that's the intent before 
starting.
   
   Let me know if the approach looks right or if there's a different shape you 
had in mind.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to