[GitHub] [incubator-doris] imay opened a new issue #2498: [PROPOSAL] Refactor query analysis process

GitBox Wed, 18 Dec 2019 00:37:13 -0800

imay opened a new issue #2498: [PROPOSAL] Refactor query analysis process
URL: https://github.com/apache/incubator-doris/issues/2498
 
 
   ## Background
   
   Now, the implementation of Doris query optimizer is a little bit 
complicated, below is some of problems.
   
   * Excessive coupling between analysis and optimization. Doris does a lot of 
work which optmizer should do when doing semantic analysis. For example, 
tables' join path is selected when statement is analyzed according some rule, 
which leads that Doris may get a bad query plan in some complex query 
scenarios. For another example, when analyzing the GROUP BY clause, analzyer do 
a lot of work for later distributed execution which makes this work very 
complicated and very error prone.
   * The structure is reused from semantic analysis to the optimizer stage, 
which makes things more complex and hard to change. For example, tuple and slot 
id is used to do name resolving in the analysis stage, but the tuple and slot 
is something related with query execution. Because tuple and slot is introduced 
too early, it is difficult to do some plan transform.
   
   The purpose of this proposal is to reduce the coupling between the parts of 
our query planning. This is the frist step of a series of work. After the 
refactor job is done, it will be easy to add new feature to Doris, such as 
supporting more complex SQL operators, and providing a better query optimizer.
   
   The main work of this proposal is to introduce QueryBlock between analysis 
and planner to decouple these two work. Through this work, we can decouple the 
parsed structure and the structure used by the optimizer, and reduce the 
complexity of each module.
   
   ## Overview
   
   The entire of Query analysis and plan is the user's SQL statement, and the 
output of it is a executable plantree.
   The main process of current Doris is as follows:
   
   ```
   SQL -- SyntaxParse --> Statement -- SemanticAnalyze --> Statement -- 
SingleNodePlan --> PlanTree -- DistributedPlan --> PlanTree
   ```
   
   The process after this proposal is as follows
   
   ```
   SQL -- SyntaxParse --> Statement -- **Translate** --> **QueryBlock** -- 
**QueryBlockSingleNodePlan** --> PlanTree -- DistributedPlan --> PalnTree
   ```
   
   This work will reuses the previous work of `SyntaxParse` and 
`DistributedPlan` as well as relevant data structures, such as `Statement` and 
`PlanNode`. After this work has been finished, Doris optimizer will continue to 
be refactored.
   
   Among them, **QueryBlock** and related data structures are the new data 
structures introduced in this work, mainly to decouple SQL Parse and Plan, so 
as to avoid mutual influence, reduce the complexity of the system and 
facilitate the subsequent addition of new functions.
   
   **Translate** and **QueryBlockPlanner** are the main processes introduced in 
this work. Translate converts Statement into corresponding QueryBlock, while 
QueryBlockPlanner converts QueryBlock into PlanTree, so as to reuse the logic 
of distributed plan.
   
   ## Key process
   
   ### Translate
   
   In this work, the input query semantics is checked including resolving name, 
checking type and so on. The output of this work is QueryBlock, which can be 
converted to a logical plan very easily.
   
   ### QueryBlockPlanner
   
   In order to avoid too many changes in one operation, it is necessary to 
reuse the existing plannode nodes. Then we need to implement a 
QueryBlockPlanner to generate the corresponding plantree from QueryBlock.
   
   ## Key Class
   
   ### QueryBlock
   
   QueryBlock is used to represent the representation of a query in the memory 
structure. Any query unit in a SQL statement can be analyzed to a QueryBlock. 
For example, the outermost query will be converted into a QueryBlock, each CTE 
can be converted into one, each subquery can be converted to one.
   
   The main contents of queryblock are as follows.
   
   ```
   class QueryBlock {
       // All CTE(common table expression) in this QueryBlock，Each CTE contains 
a QuerBlock for its statement
       List<CommonTableExpr> cteLists;
       
       // Relations referenced in From clause. The Relation can be a CTE/table 
reference
       // a subquery or two relation joined together.
       List<Relation> relations;
       
       // target list for this QueryBlock. If this Query is the top query
       // this is the target returned to user.
       List<Expr> targetList;
       
       // Relation referenced by FROM clause
       List<RelationRef> fromList;
       
       // Where clause
       Expr whereClause;
       
       // TODO: add below items later
       // List<Expr> groupByClause;
       // List<GroupingSet> groupingSet;
       // Expr havingClause;
       // List<Expr> sortClause;
       // List<WindowClause> windowClause;
   }
   ```
   
   The first step is to do POC, first to support simple query of single table, 
then to support join, group by, sort by and other functions.
   
   ### ParseState
   
   Similar to the current Analyzer, which is used to record some temporary 
variables that may be used during translate. For example, temporary variables 
for namespace resolution.
   
   ### QueryBlockPlanner
   
   Inroder to reuse the existing plannodes, this class is used to convert a 
QueryBlock to the current PlanNode tree. And then the existing 
DistributedPlanner can be used to complete the subsequent distributed plan.
   
   ### Relation
   
   This represents each logical relation in the query, such as those appearing 
in the FROM clause. For example, a single table reference in FROM clause will 
generate a relation, and a subquery will also generate a relation.
   
   Each relation mainly includes the corresponding column name, column type, 
etc.
   
   ### Expr
   
   Because we need to reuse the later calculation expressions, we should reuse 
the existing Expr. But new interfaces need to be introduced into expr to avoid 
influent the current parse logic.
   
   ```
   class Expr {
   public
       // Return a new expr after translating.
       Expr translate(ParseState state);
   };
   ```
   
   The principle of translate in this issue is that the returned object should 
be obtained through clone without changing its structure
   
   SlotRef: used to complete the resolution of field references. The previous 
field resolutions are all based on tupledescriptors. This causes 
tupledescriptors to run through the planner stage from the beginning of 
semantic analysis. On the one hand, it leads to high coupling, and is not easy 
to expand functions. On the other hand, it leads to frequent errors. And it's 
full of a lot of check code, which makes it hard to understand. In this 
implementation, we are going to introduce a new namespace in the analysis 
phase, which is specifically used to complete the reference resolution work in 
the analysis phase.
   
   ### Statement
   
   For the existing statement, the new interface is roughly as follows
   
   ```
   class Statement {
   public:
       // Each statement will be translated to a QueryBlock whether it is a 
outermost statement
       // or a subquery.
       // when parentState is null, it means that this is a outermost statement
       QueryBlock translate(ParseState parentState);
   }
   ```
   
   For this work, only the SelectStmt is overridden, and for other types of 
statements, the state is left as it is. After the work of this phase is 
completed, the interface is gradually implemented with other types of 
statements.
   
   ## Keep compatible
   
   To keep compatible with current Parser, the main analysis process is as 
follows
   
   * Try to use the new framework to translate and plan
   * If 1 succeeds, the plan result of 1 will be returned
   * Rollback uses the original method for analysis and returns the result.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@doris.apache.org
For additional commands, e-mail: commits-h...@doris.apache.org

[GitHub] [incubator-doris] imay opened a new issue #2498: [PROPOSAL] Refactor query analysis process

Reply via email to