adriangb commented on PR #21931:
URL: https://github.com/apache/datafusion/pull/21931#issuecomment-4361813807

   > > > I'm quite in favor of this change. It also avoids blowing up the 
expression based on number of partitions, which can happen when partition count 
is high.
   > > 
   > > 
   > > Well it doesn't avoid it completely, and in some ways it makes it worse. 
We still have 1 hash map per partition (cannot be avoided unless we pay the 
memory and build time cost of combining them). And we now scale our probes with 
the number of partitions, they used to be constant with number of partitions. 
But probes are much faster than hashes which is why I think unless the 
partition count is high this will likely be faster.
   > 
   > I am not 100% following this.
   
   The point is: it is probably still slower at some very high partition count. 
But it seems to not matter in reasonable workloads.
   
   > 
   > > A possible structural follow-up — re-introducing partition routing 
inside `MultiMapLookupExpr` (1 routing hash + 1 probe, so per-row cost matches 
legacy CASE) — would close the regression at any N, with or without #20363.
   > 
   > I think this would be worthwhile to add so we avoid both the expression 
blow up as not having to probe each of them?
   > 
   > Perhaps we can as well use #21900 here as for primitive columns the cost 
of `%` is _much_ higher than hashing.
   
   That brings back coupling to hash routing, which IMO would be nice to avoid. 
I will try incorporating 
   #21900


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to