shauryachats opened a new issue, #14685: URL: https://github.com/apache/pinot/issues/14685
While running high-volume multi-stage engine queries on Pinot with a high-cardinality join key, we recently observed a disproportionate latency increase as data grew on both sides of the join, for the following query shape:

```
SELECT count(*)
FROM table_A
WHERE (
  user_uuid IN ( SELECT user_uuid FROM table_B )
) AND (
  user_uuid NOT IN ( SELECT user_uuid FROM table_B )
)
LIMIT 100
option(useMultistageEngine=true, timeoutMs=120000, useColocatedJoin = true, maxRowsInJoin = 40000000)
```

Profiling conducted on a server:

<img width="1800" alt="Screenshot 2024-12-18 at 4 36 12 PM" src="https://github.com/user-attachments/assets/e5ccea53-baa1-4851-84a5-31532ddc4ddb" />

It turns out that the major cause of the latency increase is inefficient groupId generation in `org/apache/pinot/query/runtime/operator/MultistageGroupByExecutor.generateGroupByKeys`, for a few reasons:

- `Object2IntOpenHashMap` currently resolves collisions with open addressing, which performs poorly for high-cardinality use cases.
- The low default initial size of 16 and the default load factor of 0.75 cause many resizes, each rehashing the existing keys, which contributes significantly to the overall query runtime for high-cardinality use cases.

We are considering a few different strategies, such as better hash-map selection (avoid open addressing for high cardinality) and generating groupIds in batches. We will use benchmarks to select the strategy with the best RoI.

This optimization can help boost performance for both the Pinot v1 and v2 engines simultaneously, since both engines rely on this logic.

cc: @Jackie-Jiang
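To make the resizing cost described above concrete, here is a minimal, self-contained sketch. It is not Pinot's or fastutil's code: `GroupIdTableSketch` is a hypothetical linear-probing table, keys are modeled as `long` values rather than objects, and the growth rule (double the capacity once the size reaches capacity times load factor) is a simplified assumption. It assigns dense group ids and counts how many times the table has to be rebuilt.

```java
import java.util.Arrays;

/** Hypothetical sketch: a linear-probing key -> groupId table that counts its rehashes. */
final class GroupIdTableSketch {
    private static final int EMPTY = -1;   // marks an unused slot in groupIds

    private long[] keys;
    private int[] groupIds;
    private int size;
    private int rehashCount;
    private final float loadFactor;

    GroupIdTableSketch(int initialCapacity, float loadFactor) {
        if (loadFactor <= 0f || loadFactor >= 1f) {
            throw new IllegalArgumentException("loadFactor must be in (0, 1)");
        }
        int capacity = 2;
        while (capacity < initialCapacity) {
            capacity <<= 1;                 // round up to a power of two
        }
        this.loadFactor = loadFactor;
        this.keys = new long[capacity];
        this.groupIds = new int[capacity];
        Arrays.fill(groupIds, EMPTY);
    }

    /** Returns the existing group id for the key, or assigns the next dense id. */
    int getGroupId(long key) {
        int slot = findSlot(keys, groupIds, key);
        if (groupIds[slot] != EMPTY) {
            return groupIds[slot];
        }
        if (size >= (int) (keys.length * loadFactor)) {
            rehash();                       // table is too full: double it and re-probe
            slot = findSlot(keys, groupIds, key);
        }
        keys[slot] = key;
        groupIds[slot] = size;
        return size++;
    }

    int rehashCount() {
        return rehashCount;
    }

    int size() {
        return size;
    }

    /** Linear probing: walk forward until the key or an empty slot is found. */
    private static int findSlot(long[] keys, int[] ids, long key) {
        int mask = keys.length - 1;
        long h = key * 0x9E3779B97F4A7C15L; // cheap multiplicative scrambling of the key
        int slot = (int) (h ^ (h >>> 32)) & mask;
        while (ids[slot] != EMPTY && keys[slot] != key) {
            slot = (slot + 1) & mask;
        }
        return slot;
    }

    /** Doubles the capacity and re-inserts every existing key. */
    private void rehash() {
        long[] newKeys = new long[keys.length * 2];
        int[] newIds = new int[newKeys.length];
        Arrays.fill(newIds, EMPTY);
        for (int i = 0; i < keys.length; i++) {
            if (groupIds[i] != EMPTY) {
                int slot = findSlot(newKeys, newIds, keys[i]);
                newKeys[slot] = keys[i];
                newIds[slot] = groupIds[i];
            }
        }
        keys = newKeys;
        groupIds = newIds;
        rehashCount++;
    }

    public static void main(String[] args) {
        int distinctKeys = 1_000_000;
        GroupIdTableSketch defaults = new GroupIdTableSketch(16, 0.75f);
        GroupIdTableSketch presized =
            new GroupIdTableSketch((int) Math.ceil(distinctKeys / 0.75), 0.75f);
        for (long key = 0; key < distinctKeys; key++) {
            defaults.getGroupId(key);
            presized.getGroupId(key);
        }
        System.out.println("rehashes from capacity 16: " + defaults.rehashCount());
        System.out.println("rehashes when pre-sized:   " + presized.rehashCount());
    }
}
```

Under this simplified model, one million distinct keys starting from a capacity of 16 force 17 rebuilds of the table, while sizing the table up front from an expected cardinality avoids them entirely. Whether a similar pre-sizing hint is practical in `generateGroupByKeys` depends on how well the cardinality can be estimated, which is what the benchmarks mentioned above would need to show.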