2010YOUY01 commented on PR #21817: URL: https://github.com/apache/datafusion/pull/21817#issuecomment-4311633479
> Thank you for your comments . I plan to add benchmarks for hashjoin in a separate pr and rebase this feature once that is merged to main > > 1. Re o(1) claim, the roaringbitmap given its size should be CPU cache friendly effectively making 'contains' checks o(1) (although theoretically they are log(n) to fetch containers and log(k) to find element . Given that the number of containers is small and fits in CPU cache I believe it is close to o(1) in most practical cases > 2. True, per my benchmarks (which I plan to add in separate PR) for HJ in semi / anti join cases, the dense checks were as fast as 2x while the sparse data was almost not effected. That said, we could definitely make this more intelligent by probing stats during planning time and only use roaring bitmaps if we could (albeit heuristically) decide there is dense probing required) According to the physical design of roaring bitmaps, the index building and bulk membership testing should still be slower than an ideal hash map implementation. Roaring bitmaps are primarily designed for large, sparse bitmaps where they can provide compact memory usage, fast index iteration, and fast set operations such as union and intersection. However, index construction and `contains` checks seems not their primary strengths. I suspect that the workloads where roaring bitmaps perform better has a lot of duplications on the build side, and that the current hash map implementation may be hitting some pathological cases 🤔 A recent PR (https://github.com/apache/datafusion/pull/21775) addresses one potential inefficiency, so we could compare the result with it. This is a really interesting idea and observation, I'll try to dive deeper into hash join executor later to investigate further. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
