2010YOUY01 commented on PR #21817:
URL: https://github.com/apache/datafusion/pull/21817#issuecomment-4311633479

   > Thank you for your comments . I plan to add benchmarks for hashjoin in a 
separate pr and rebase this feature once that is merged to main
   > 
   > 1. Re o(1) claim, the roaringbitmap given its size should be CPU cache 
friendly effectively making 'contains' checks o(1) (although theoretically they 
are log(n) to fetch containers and log(k) to find element . Given that the 
number of containers is small and fits in CPU cache I believe it is close to 
o(1) in most practical cases
   > 2. True, per my benchmarks (which I plan to add in separate PR) for HJ in 
semi / anti join cases, the dense checks were as fast as 2x while the sparse 
data was almost not effected. That said, we could definitely make this more 
intelligent by probing stats during planning time and only use roaring bitmaps 
if we could (albeit heuristically) decide there is dense probing required)
   
   According to the physical design of roaring bitmaps, the index building and 
bulk membership testing should still be slower than an ideal hash map 
implementation. Roaring bitmaps are primarily designed for large, sparse 
bitmaps where they can provide compact memory usage, fast index iteration, and 
fast set operations such as union and intersection. However, index construction 
and `contains` checks seems not their primary strengths.
   
   I suspect that the workloads where roaring bitmaps perform better has a lot 
of duplications on the build side, and that the current hash map implementation 
may be hitting some pathological cases 🤔 
   
   A recent PR (https://github.com/apache/datafusion/pull/21775) addresses one 
potential inefficiency, so we could compare the result with it.
   
   This is a really interesting idea and observation, I'll try to dive deeper 
into hash join executor later to investigate further.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to