Re: [PR] chore: Add existence (semi / anti ) benchmarks for hashjoinexec [datafusion]

via GitHub Sat, 25 Apr 2026 00:13:04 -0700


2010YOUY01 commented on PR #21821:
URL: https://github.com/apache/datafusion/pull/21821#issuecomment-4318427731


   Thank you for working on this! I have some suggestions for you to consider.
   
   ## High-level issue
   I think the main issue is using `density` as a primary axis when evaluating 
equi-join performance. It was introduced in 
https://github.com/apache/datafusion/pull/21821 for perfect hash join 
experiments, but it does not seem to be a good axis for designing 
representative benchmarks.
   
   A good benchmark should reflect realistic workloads. To achieve that, we 
should define a set of core axes and vary them systematically. For 
equi-joins, I think they could be:
   
   ```
   Equi-join benchmark key axes:
   - Build/probe side size
   - Join type (inner, outer, semi, etc.)
   - Number of join keys
   - Join key data type
   - Probe hit rate
   - Fanout (average number of matches per probe key)
   ```
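
   To make the axes above concrete, here is a minimal sketch of how they could 
be written down as a parameter grid. It is plain Python, independent of the 
DataFusion benchmark harness; `JoinCase`, `AXES`, `enumerate_cases`, and every 
value in the grid are hypothetical names and numbers chosen only for 
illustration.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class JoinCase:
    """One point in the (hypothetical) equi-join benchmark grid."""
    build_rows: int
    probe_rows: int
    join_type: str
    num_keys: int
    key_type: str
    hit_rate: float  # fraction of probe rows that find a match
    fanout: int      # average matches per matching probe key


# Illustrative values only; a real suite would pick these deliberately.
AXES = {
    "sizes": [(10_000, 1_000_000), (1_000_000, 1_000_000)],
    "join_type": ["inner", "left", "right_semi", "right_anti"],
    "num_keys": [1, 2],
    "key_type": ["Int32", "Utf8"],
    "hit_rate": [0.1, 1.0],
    "fanout": [1, 10],
}


def enumerate_cases(axes=AXES):
    """Return the full cross product of all axes as JoinCase values."""
    cases = []
    for (build, probe), join_type, num_keys, key_type, hit_rate, fanout in product(
        axes["sizes"],
        axes["join_type"],
        axes["num_keys"],
        axes["key_type"],
        axes["hit_rate"],
        axes["fanout"],
    ):
        cases.append(
            JoinCase(build, probe, join_type, num_keys, key_type, hit_rate, fanout)
        )
    return cases


cases = enumerate_cases()
print(len(cases))
```

   Even this small grid yields 128 cases, so in practice we would likely 
benchmark a hand-picked subset rather than the full cross product.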
   
   In contrast, `density` (i.e., key range span divided by key count) is not 
representative of typical workloads. It is primarily useful for evaluating 
specific fast paths (e.g., perfect hash join), but making it a primary axis 
complicates the benchmark design, and may mislead future optimization efforts.
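
   As a rough illustration of why `density` is a separate concern from the 
axes above, the sketch below computes it next to hit rate and fanout for two 
toy key sets. It follows the definition given here (key range span divided by 
key count) and assumes "span" means `max - min + 1` and "key count" means the 
number of distinct build keys; `key_stats` and `KeyStats` are hypothetical 
helpers, not DataFusion APIs.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class KeyStats:
    density: float   # key range span / distinct key count (assumed definition)
    hit_rate: float  # fraction of probe keys with at least one build match
    fanout: float    # average build matches per matching probe key


def key_stats(build_keys, probe_keys):
    """Compute density, hit rate, and fanout for integer join keys."""
    counts = Counter(build_keys)           # build key -> number of build rows
    span = max(counts) - min(counts) + 1   # assumption: inclusive key range
    density = span / len(counts)
    matches = [counts[k] for k in probe_keys if k in counts]
    hit_rate = len(matches) / len(probe_keys)
    fanout = sum(matches) / len(matches) if matches else 0.0
    return KeyStats(density, hit_rate, fanout)


# Two build sides with the same number of unique keys but very different spans.
dense_build = list(range(1_000))
sparse_build = list(range(0, 1_000_000, 1_000))

# Each probe side hits exactly half of the time.
dense = key_stats(dense_build, range(2_000))
sparse = key_stats(sparse_build, sparse_build + [k + 1 for k in sparse_build])

print(dense)
print(sparse)
```

   Both key sets have the same hit rate and fanout, while their density 
differs by roughly three orders of magnitude. That suggests density mostly 
matters for paths that depend on the key range, not for the general workload 
shape.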
   
   I think we should remove `density` from the key axes in the future. For 
fast paths like perfect hash join and semi/anti join, we could simply add a 
few queries where the fast path wins.
   
   ## For this PR
   For this PR, I suggest keeping the end-to-end `hj` benchmark simple. We 
don’t need to enumerate all density combinations here; a smaller set of 
representative queries should be enough to evaluate the optimization.
   
   For the Criterion micro-benchmarks, it would be better to first focus on a 
few representative workloads (e.g., varying join size and join type), and then 
optionally add a small number of targeted cases for specific fast paths, such 
as right semi/anti joins with `Int32` keys. Otherwise the suite would be hard 
to extend and maintain.
   
   In short, fewer end-to-end queries should be sufficient for this PR. We 
could add Criterion micro-benchmarks later based on the above design.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

