shauryachats opened a new issue, #14685:
URL: https://github.com/apache/pinot/issues/14685

   While running high-volume multi-stage engine queries on Pinot with a 
high-cardinality join key, we recently observed a disproportionate latency 
increase as the data volume grew on both sides of the join, for the 
following query shape:
   
   ```
   SELECT
     count(*)
   FROM
     table_A
   WHERE (
       user_uuid IN (
         SELECT
           user_uuid
         FROM
           table_B
       )
     )
     AND (
       user_uuid NOT IN (
         SELECT
           user_uuid
         FROM
           table_B
       )
     )
   LIMIT
     100 option(useMultistageEngine=true, timeoutMs=120000, useColocatedJoin=true, maxRowsInJoin=40000000)
   ```
   
   We profiled one of the servers while the query was running:
   <img width="1800" alt="Screenshot 2024-12-18 at 4 36 12 PM" src="https://github.com/user-attachments/assets/e5ccea53-baa1-4851-84a5-31532ddc4ddb" />
   
   The profile shows that the latency increase comes mainly from 
inefficient groupId generation in 
`org/apache/pinot/query/runtime/operator/MultistageGroupByExecutor.generateGroupByKeys`,
 for the following reasons:
   - `Object2IntOpenHashMap` resolves collisions with open addressing, which 
performs poorly for high-cardinality use cases.
   - The low default initial size of 16, combined with the default load factor 
of 0.75, causes many resizes for high-cardinality use cases. Each resize 
rehashes all existing keys, which contributes significantly to the overall 
query runtime.
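
To give a rough sense of the resize cost, here is a minimal sketch that counts how many times a doubling hash table has to grow before it can hold a given number of keys. It is a simplified model, not fastutil's exact resize logic: it assumes the table doubles whenever the entry count would exceed `capacity * loadFactor`. The class and method names (`ResizeEstimate`, `countResizes`) are made up for illustration, and the key count is borrowed from the `maxRowsInJoin` value in the query above.

```java
final class ResizeEstimate {
    // Simplified model (an assumption, not fastutil's exact behavior):
    // the table doubles until capacity * loadFactor can hold all keys.
    static int countResizes(long keys, long initialCapacity, double loadFactor) {
        long capacity = initialCapacity;
        int resizes = 0;
        while ((long) (capacity * loadFactor) < keys) {
            capacity *= 2; // every doubling rehashes all keys inserted so far
            resizes++;
        }
        return resizes;
    }

    public static void main(String[] args) {
        // Default-like settings: initial size 16, load factor 0.75.
        System.out.println(countResizes(40_000_000L, 16, 0.75));       // prints 22
        // Table sized up front (2^26 slots): no growth needed.
        System.out.println(countResizes(40_000_000L, 1L << 26, 0.75)); // prints 0
    }
}
```

Under this model, each of the 22 doublings re-inserts every key seen so far, which is consistent with rehashing showing up prominently in the profile.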
   
   We are considering a few strategies, such as choosing a better hash map 
(avoiding open addressing for high-cardinality keys) and generating groupIds 
in batches. We plan to use benchmarks to select the strategy with the best 
ROI.
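
As one possible starting point for such benchmarks, the sketch below assigns dense group ids from a map that is sized up front from an expected key count. It is not Pinot's implementation: `GroupIdSketch`, `getGroupId`, `getGroupIds`, and the `expectedKeys` hint are hypothetical names, and `java.util.HashMap` stands in for `Object2IntOpenHashMap` so that the example runs without extra dependencies (fastutil's map also has a constructor that takes an expected size). The batch method only shows the shape of a batched entry point and makes no claim about speedup.

```java
import java.util.HashMap;
import java.util.Map;

final class GroupIdSketch {
    private final Map<Object, Integer> groupIds;

    // expectedKeys is a hypothetical cardinality hint supplied by the caller.
    GroupIdSketch(int expectedKeys) {
        // HashMap grows once size exceeds capacity * 0.75, so request enough
        // capacity up front to avoid resizing while keys are inserted.
        long wanted = (long) Math.ceil(expectedKeys / 0.75);
        groupIds = new HashMap<>((int) Math.min(1L << 30, wanted));
    }

    // Returns the existing id for a key, or assigns the next dense id.
    int getGroupId(Object key) {
        Integer existing = groupIds.get(key);
        if (existing != null) {
            return existing;
        }
        int id = groupIds.size();
        groupIds.put(key, id);
        return id;
    }

    // Batched entry point: resolves ids for a whole block of keys in one call.
    int[] getGroupIds(Object[] keys) {
        int[] ids = new int[keys.length];
        for (int i = 0; i < keys.length; i++) {
            ids[i] = getGroupId(keys[i]);
        }
        return ids;
    }

    public static void main(String[] args) {
        GroupIdSketch sketch = new GroupIdSketch(1_000);
        int[] ids = sketch.getGroupIds(new Object[] {"a", "b", "a", "c"});
        System.out.println(java.util.Arrays.toString(ids)); // prints [0, 1, 0, 2]
    }
}
```

Whether a size hint is available at this point in the operator, and how it compares with switching map implementations, is exactly what the benchmarks would need to show.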
   
   This optimization should improve performance for both the Pinot v1 and v2 
engines, since both rely on this logic. cc: @Jackie-Jiang


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@pinot.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

