2010YOUY01 commented on PR #21851:
URL: https://github.com/apache/datafusion/pull/21851#issuecomment-4321635395

   Thank you—this is an exciting optimization!
   
   I am working on general infrastructure for nested loop join (NLJ) dynamic 
filters and custom build-side indices that could help simplify this 
implementation. Would you (and other reviewers) be open to waiting until I 
submit that PR in the next 1-2 weeks, so we can coordinate and collaborate on 
this? I’d appreciate any thoughts on this direction!
   
   Here is a preview of the design; the WIP draft is linked at the end:
   
   The core idea is that most specialized joins (e.g., Piecewise Merge Join, 
IEJoin, Spatial Join, Array Set Joins) follow a standard pattern:
   1. Buffer: Collect all build-side data.
   2. Probe: Iterate over the probe side row by row.
   
   Specialization typically only requires:
   - Custom Dynamic Filters: To reduce probe-side size (as seen in this PR).
   - Custom Indices: To accelerate the probing process.
   
   Taking this PR as an example: beyond the dynamic filter it implements, if we 
know a window range has a fixed maximum span, we could sort the build side and 
use a custom index to accelerate the probe further. So I'm hoping to add a 
common trait that supports both custom dynamic filters and custom runtime 
indices.
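
   As a rough illustration of the sort-then-index idea (not code from this PR 
or from the draft branch), here is a minimal sketch. It assumes a band-style 
condition `probe - span <= build <= probe + span` on `i64` keys; the name 
`SortedBandIndex` and its methods are hypothetical.

```rust
/// Hypothetical build-side index for a band join in which a probe row
/// matches build rows with `probe - span <= build <= probe + span`.
pub struct SortedBandIndex {
    /// (key, original build row id), sorted by key.
    entries: Vec<(i64, usize)>,
    /// Fixed maximum span of the window (assumed non-negative).
    span: i64,
}

impl SortedBandIndex {
    /// Buffer phase: collect the build keys and sort them once.
    pub fn build(keys: &[i64], span: i64) -> Self {
        assert!(span >= 0, "span must be non-negative");
        let mut entries: Vec<(i64, usize)> = keys
            .iter()
            .enumerate()
            .map(|(row, &key)| (key, row))
            .collect();
        entries.sort_unstable();
        Self { entries, span }
    }

    /// Dynamic filter: probe keys outside these inclusive bounds cannot
    /// match any build row. Returns `None` for an empty build side.
    pub fn probe_bounds(&self) -> Option<(i64, i64)> {
        let min = self.entries.first()?.0;
        let max = self.entries.last()?.0;
        Some((min.saturating_sub(self.span), max.saturating_add(self.span)))
    }

    /// Probe phase: two binary searches replace a full scan of the build side.
    pub fn probe(&self, probe_key: i64) -> Vec<usize> {
        let lo = probe_key.saturating_sub(self.span);
        let hi = probe_key.saturating_add(self.span);
        let start = self.entries.partition_point(|&(key, _)| key < lo);
        let end = self.entries.partition_point(|&(key, _)| key <= hi);
        self.entries[start..end].iter().map(|&(_, row)| row).collect()
    }
}
```

   With this shape, each probe costs two binary searches plus the size of its 
result, and `probe_bounds` gives a min/max range that could be pushed to the 
probe side as a dynamic filter.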
   
   Introducing a common extension point would make adding similar optimizations 
easier: each specialization only needs to implement a small trait that 
specifies how to build and probe the index and how to build dynamic filters, 
and we won't need to touch the core join state machine each time.
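
   To make the extension-point idea concrete, here is a hedged sketch of what 
such a trait could look like. Everything in it is an assumption for 
illustration: `JoinAccelerator`, `LessThanAccelerator`, and `accelerated_join` 
are hypothetical names, keys are plain `i64` values rather than Arrow arrays, 
and the actual API in the draft may differ substantially.

```rust
/// Hypothetical extension point: one implementation per join specialization.
pub trait JoinAccelerator: Sized {
    /// Buffer phase: build an index over all build-side keys.
    fn build(build_keys: &[i64]) -> Self;
    /// Dynamic filter: cheap check used to skip probe rows that cannot match.
    fn may_match(&self, probe_key: i64) -> bool;
    /// Probe phase: build-side row ids that match `probe_key`.
    fn probe(&self, probe_key: i64) -> Vec<usize>;
}

/// Example specialization for the inequality join `build_key < probe_key`.
pub struct LessThanAccelerator {
    /// (key, original build row id), sorted by key.
    sorted: Vec<(i64, usize)>,
}

impl JoinAccelerator for LessThanAccelerator {
    fn build(build_keys: &[i64]) -> Self {
        let mut sorted: Vec<(i64, usize)> = build_keys
            .iter()
            .enumerate()
            .map(|(row, &key)| (key, row))
            .collect();
        sorted.sort_unstable();
        Self { sorted }
    }

    fn may_match(&self, probe_key: i64) -> bool {
        // Only probe keys strictly above the smallest build key can match.
        self.sorted.first().map_or(false, |&(min, _)| probe_key > min)
    }

    fn probe(&self, probe_key: i64) -> Vec<usize> {
        // Every build row in the sorted prefix with key < probe_key matches.
        let end = self.sorted.partition_point(|&(key, _)| key < probe_key);
        self.sorted[..end].iter().map(|&(_, row)| row).collect()
    }
}

/// Stand-in for the join core: it is generic over the accelerator, so a new
/// specialization does not require changing this loop.
pub fn accelerated_join<A: JoinAccelerator>(
    build_keys: &[i64],
    probe_keys: &[i64],
) -> Vec<(usize, usize)> {
    let accelerator = A::build(build_keys);
    let mut pairs = Vec::new();
    for (probe_row, &key) in probe_keys.iter().enumerate() {
        if !accelerator.may_match(key) {
            continue; // pruned by the dynamic filter
        }
        for build_row in accelerator.probe(key) {
            pairs.push((build_row, probe_row));
        }
    }
    pairs
}
```

   The driver only calls the three trait methods, so supporting another join 
type in this sketch would mean adding one more `impl JoinAccelerator` block 
and nothing else.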
   
   I have a WIP draft of this infrastructure here (only the refactor and the 
rough API shape are done; I'm still working on adding an example implementation 
for both the custom index and the dynamic filter): 
   https://github.com/2010YOUY01/arrow-datafusion/tree/join-accelerator


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

