2010YOUY01 commented on PR #21821: URL: https://github.com/apache/datafusion/pull/21821#issuecomment-4318427731
Thank you for working on this! I have some suggestions for you to consider.

## High-level issue

I think the main issue is using `density` as a primary axis when evaluating equi-join performance. Although it was introduced in https://github.com/apache/datafusion/pull/21821 for perfect hash join experiments, it does not seem to be a good axis for designing representative benchmarks.

A good benchmark should reflect realistic workloads. To achieve that, we should define a set of core axes and vary them systematically. For equi-joins, I think they could be:

```
Equi-join benchmark key axes:
- Build/probe side size
- Join type (inner, outer, semi, etc.)
- Number of join keys
- Join key data type
- Probe hit rate
- Fanout (average number of matches per probe key)
```

In contrast, `density` (i.e., key range span divided by key count) is not representative of typical workloads. It is primarily useful for evaluating specific fast paths (e.g., perfect hash join), but making it a primary axis complicates the benchmark design and may mislead future optimization efforts. I believe we should remove `density` from the key axes in the future. For fast paths like perfect hash join and semi/anti join, we could simply add a few queries where the fast path wins.

## For this PR

For this PR, I suggest keeping the end-to-end `hj` benchmark simple. We don't need to enumerate all density combinations here; a smaller set of representative queries should be enough to evaluate the optimization.

For the Criterion micro-benchmarks, it would be better to first focus on a few representative workloads (e.g., varying join size and join type), and then optionally add a small number of targeted cases for specific fast paths, such as right semi/anti joins with `Int32` keys. Otherwise, the benchmarks would be hard to extend and maintain.

In short, fewer end-to-end queries should be sufficient for this PR. We could add Criterion micro-benchmarks later based on the above design.
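To make the distinction between the axes concrete, here is a minimal Python sketch (not DataFusion code) of how `density`, probe hit rate, and fanout could be computed for a pair of key columns. The function names are hypothetical, and two details are assumptions on my part: the key range span is taken as inclusive (`max - min + 1`) over distinct keys, and fanout is averaged only over probe rows that find a match.

```python
from collections import Counter


def density(build_keys):
    """Key range span divided by key count.

    Assumption: the span is inclusive (max - min + 1) and the count is
    the number of distinct build-side keys.
    """
    distinct = set(build_keys)
    return (max(distinct) - min(distinct) + 1) / len(distinct)


def probe_hit_rate(build_keys, probe_keys):
    """Fraction of probe rows that find at least one build-side match."""
    build = set(build_keys)
    hits = sum(1 for k in probe_keys if k in build)
    return hits / len(probe_keys)


def fanout(build_keys, probe_keys):
    """Average number of build rows matched per matching probe row.

    Assumption: non-matching probe rows are excluded from the average,
    so fanout can be varied independently of the hit rate.
    """
    counts = Counter(build_keys)
    matches = [counts[k] for k in probe_keys if k in counts]
    return sum(matches) / len(matches) if matches else 0.0


# Two build sides with the same hit rate and fanout but different density.
dense_build = [1, 2, 3, 4]
sparse_build = [1, 100, 200, 400]
dense_probe = [1, 2, 9, 9]
sparse_probe = [1, 100, 9, 9]

for name, build, probe in [
    ("dense", dense_build, dense_probe),
    ("sparse", sparse_build, sparse_probe),
]:
    print(
        name,
        density(build),
        probe_hit_rate(build, probe),
        fanout(build, probe),
    )
```

Under these assumptions, the two inputs differ only in `density`, while hit rate and fanout are identical. This is why `density` mostly matters for fast paths that depend on the key range, rather than for the general shape of the workload.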
