ankitsultana commented on PR #15698: URL: https://github.com/apache/pinot/pull/15698#issuecomment-2849388177
Hey Siddharth, this is the same workstream I had discussed with you in our 1:1 a few months ago, and the design doc is here: https://docs.google.com/document/d/17ApZbvNphKgEdSAOlZwTwAnnL_dAt9QbjcpzjHb4M0w/edit?tab=t.0#heading=h.txhamf1ek7k The overall goal is to restructure the query optimizer to improve Exchange Simplification and make it possible to add the MSE Lite Mode I had proposed here: https://github.com/apache/pinot/issues/14640 Adding a cost model and doing join reordering is a separate topic and this effort doesn't impede that. We might actually contribute the same ourselves in a few months. > I am very interested to know what are the kinds of queries for which you are seeing performance boost with the physical optimizer. Key optimizations that this brings are (there are more.. I'll highlight them in the doc once we can run more tests on our clusters): 1. Full query plans for eligible queries can be run without any cross worker exchange, regardless of the number of joins involved in the query. (see examples below) 2. Tables involved in a join will no longer need to have the same number of partitions. E.g. you can have a fact table with 256 partitions and a smaller table with 16 partitions, and we'll still be able to avoid cross worker exchange whenever possible. 3. For our internal use-cases, we see anywhere between 2-5x improvement for query performance. I'll try to share detailed results once I propose removing the query option and the old query optimizer path (ETA: June). 4. All of the above optimizations will run without any requirement for hints. Some examples: ``` -- this entire query will run without any cross worker data transfer. SELECT deviceOS, userUUID FROM userAttributes WHERE userUUID IN (SELECT userUUID FROM userGroups WHERE groupUUID = 'group-1') AND userUUID NOT IN (SELECT userUUID FROM userGroups WHERE groupUUID = 'group-2') ... any number of IN/Not-In clauses ``` ``` -- this query will also run without any shuffles, except the last Singleton exchange for computing COUNT(*) WITH tmp AS ( SELECT userUUID FROM userAttributes WHERE deviceOS = '' UNION ALL SELECT userUUID FROM userAttributes WHERE deviceOS IN ('ios') ) SELECT COUNT(*) FROM userAttributes WHERE userUUID IN ( SELECT userUUID FROM userAttributes WHERE deviceOS IN ('ios') ) AND userUUID NOT IN ( SELECT userUUID FROM userAttributes WHERE deviceOS IN ('foo') ) ... any number of joins on the partitioning key AND userUUID IN ( SELECT userUUID FROM tmp ) ``` > Why is this needed? I am hoping that even in the first working version of the optimizer, there is no query option otherwise we are just going to create a lot of tech debt / conditional paths where people will inevitably bypass it and create their own niche optimizations in the OSS code. Completely agree. The reason this is needed is because the existing query optimizer has a lot of features that we need to port to the new optimizer flow. The idea is that we'll soon backport all features of the existing query optimizer, and then remove the old code path. I also want to give other companies time to test the new optimizer path out and run their own A/B tests in case they want to test the perf difference that the new optimizer brings to their workloads. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@pinot.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: commits-unsubscr...@pinot.apache.org For additional commands, e-mail: commits-h...@pinot.apache.org