imay opened a new issue #2498: [PROPOSAL] Refactor query analysis process URL: https://github.com/apache/incubator-doris/issues/2498 ## Background Now, the implementation of Doris query optimizer is a little bit complicated, below is some of problems. * Excessive coupling between analysis and optimization. Doris does a lot of work which optmizer should do when doing semantic analysis. For example, tables' join path is selected when statement is analyzed according some rule, which leads that Doris may get a bad query plan in some complex query scenarios. For another example, when analyzing the GROUP BY clause, analzyer do a lot of work for later distributed execution which makes this work very complicated and very error prone. * The structure is reused from semantic analysis to the optimizer stage, which makes things more complex and hard to change. For example, tuple and slot id is used to do name resolving in the analysis stage, but the tuple and slot is something related with query execution. Because tuple and slot is introduced too early, it is difficult to do some plan transform. The purpose of this proposal is to reduce the coupling between the parts of our query planning. This is the frist step of a series of work. After the refactor job is done, it will be easy to add new feature to Doris, such as supporting more complex SQL operators, and providing a better query optimizer. The main work of this proposal is to introduce QueryBlock between analysis and planner to decouple these two work. Through this work, we can decouple the parsed structure and the structure used by the optimizer, and reduce the complexity of each module. ## Overview The entire of Query analysis and plan is the user's SQL statement, and the output of it is a executable plantree. The main process of current Doris is as follows: ``` SQL -- SyntaxParse --> Statement -- SemanticAnalyze --> Statement -- SingleNodePlan --> PlanTree -- DistributedPlan --> PlanTree ``` The process after this proposal is as follows ``` SQL -- SyntaxParse --> Statement -- **Translate** --> **QueryBlock** -- **QueryBlockSingleNodePlan** --> PlanTree -- DistributedPlan --> PalnTree ``` This work will reuses the previous work of `SyntaxParse` and `DistributedPlan` as well as relevant data structures, such as `Statement` and `PlanNode`. After this work has been finished, Doris optimizer will continue to be refactored. Among them, **QueryBlock** and related data structures are the new data structures introduced in this work, mainly to decouple SQL Parse and Plan, so as to avoid mutual influence, reduce the complexity of the system and facilitate the subsequent addition of new functions. **Translate** and **QueryBlockPlanner** are the main processes introduced in this work. Translate converts Statement into corresponding QueryBlock, while QueryBlockPlanner converts QueryBlock into PlanTree, so as to reuse the logic of distributed plan. ## Key process ### Translate In this work, the input query semantics is checked including resolving name, checking type and so on. The output of this work is QueryBlock, which can be converted to a logical plan very easily. ### QueryBlockPlanner In order to avoid too many changes in one operation, it is necessary to reuse the existing plannode nodes. Then we need to implement a QueryBlockPlanner to generate the corresponding plantree from QueryBlock. ## Key Class ### QueryBlock QueryBlock is used to represent the representation of a query in the memory structure. Any query unit in a SQL statement can be analyzed to a QueryBlock. For example, the outermost query will be converted into a QueryBlock, each CTE can be converted into one, each subquery can be converted to one. The main contents of queryblock are as follows. ``` class QueryBlock { // All CTE(common table expression) in this QueryBlock,Each CTE contains a QuerBlock for its statement List<CommonTableExpr> cteLists; // Relations referenced in From clause. The Relation can be a CTE/table reference // a subquery or two relation joined together. List<Relation> relations; // target list for this QueryBlock. If this Query is the top query // this is the target returned to user. List<Expr> targetList; // Relation referenced by FROM clause List<RelationRef> fromList; // Where clause Expr whereClause; // TODO: add below items later // List<Expr> groupByClause; // List<GroupingSet> groupingSet; // Expr havingClause; // List<Expr> sortClause; // List<WindowClause> windowClause; } ``` The first step is to do POC, first to support simple query of single table, then to support join, group by, sort by and other functions. ### ParseState Similar to the current Analyzer, which is used to record some temporary variables that may be used during translate. For example, temporary variables for namespace resolution. ### QueryBlockPlanner Inroder to reuse the existing plannodes, this class is used to convert a QueryBlock to the current PlanNode tree. And then the existing DistributedPlanner can be used to complete the subsequent distributed plan. ### Relation This represents each logical relation in the query, such as those appearing in the FROM clause. For example, a single table reference in FROM clause will generate a relation, and a subquery will also generate a relation. Each relation mainly includes the corresponding column name, column type, etc. ### Expr Because we need to reuse the later calculation expressions, we should reuse the existing Expr. But new interfaces need to be introduced into expr to avoid influent the current parse logic. ``` class Expr { public // Return a new expr after translating. Expr translate(ParseState state); }; ``` The principle of translate in this issue is that the returned object should be obtained through clone without changing its structure SlotRef: used to complete the resolution of field references. The previous field resolutions are all based on tupledescriptors. This causes tupledescriptors to run through the planner stage from the beginning of semantic analysis. On the one hand, it leads to high coupling, and is not easy to expand functions. On the other hand, it leads to frequent errors. And it's full of a lot of check code, which makes it hard to understand. In this implementation, we are going to introduce a new namespace in the analysis phase, which is specifically used to complete the reference resolution work in the analysis phase. ### Statement For the existing statement, the new interface is roughly as follows ``` class Statement { public: // Each statement will be translated to a QueryBlock whether it is a outermost statement // or a subquery. // when parentState is null, it means that this is a outermost statement QueryBlock translate(ParseState parentState); } ``` For this work, only the SelectStmt is overridden, and for other types of statements, the state is left as it is. After the work of this phase is completed, the interface is gradually implemented with other types of statements. ## Keep compatible To keep compatible with current Parser, the main analysis process is as follows * Try to use the new framework to translate and plan * If 1 succeeds, the plan result of 1 will be returned * Rollback uses the original method for analysis and returns the result.
---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services --------------------------------------------------------------------- To unsubscribe, e-mail: commits-unsubscr...@doris.apache.org For additional commands, e-mail: commits-h...@doris.apache.org