[GitHub] [doris] weizhengte opened a new pull request, #14910: [feature-wip](Nereids)(histogram) add aggregate function histogram and collect histogram statistics

GitBox Wed, 07 Dec 2022 06:23:04 -0800


weizhengte opened a new pull request, #14910:
URL: https://github.com/apache/doris/pull/14910


   # Proposed changes
   
   Currently doris collects statistics, but no histogram data, and by default 
the optimizer assumes that the different values of the columns are evenly 
distributed. This calculation can be problematic when the data distribution is 
skewed. So this pr implements the collection of histogram statistics.
   
   For columns containing data skew columns (columns with unevenly distributed 
data in the column), histogram statistics enable the optimizer to generate more 
accurate estimates of cardinality for filtering or join predicates involving 
these columns, resulting in a more precise execution plan.
   
   The optimization of the execution plan by histogram is mainly in two 
aspects: the selection of where condition and the selection of join order. The 
selection principle of the where condition is relatively simple: the histogram 
is used to calculate the selection rate of each predicate, and the filter with 
higher selection rate is preferred.
   
   The selection of join order is based on the estimation of the number of rows 
in the join result. In the case of uneven data distribution in the join 
condition columns, histogram can greatly improve the accuracy of the prediction 
of the number of rows in the join result. At the same time, if the number of 
rows of a bucket in one of the columns is 0, you can mark it and directly skip 
the bucket in the subsequent join process to improve efficiency.
   
   Histogram statistics are mainly collected by the histogram aggregation 
function, which is used as follows:
   
   **Syntax**
   
   ```SQL
   histogram(expr)
   ```
   
   > The histogram function is used to describe the distribution of the data. 
It uses an "equal height" bucking strategy, and divides the data into buckets 
according to the value of the data. It describes each bucket with some simple 
data, such as the number of values that fall in the bucket. It is mainly used 
by the optimizer to estimate the range query.
   
   **example**
   
   ```
   MySQL [test]> select histogram(login_time) from dev_table;
   
+------------------------------------------------------------------------------------------------------------------------------+
   | histogram(`login_time`)                                                    
                                                  |
   
+------------------------------------------------------------------------------------------------------------------------------+
   | {"bucket_size":5,"buckets":[{"lower":"2022-09-21 
17:30:29","upper":"2022-09-21 22:30:29","count":9,"pre_sum":0,"ndv":1},...]}|
   
+------------------------------------------------------------------------------------------------------------------------------+
   ```
   **description**
   
   ```
   {
       "bucket_size": 5, 
       "buckets": [
           {
               "lower": "2022-09-21 17:30:29", 
               "upper": "2022-09-21 22:30:29", 
               "count": 9, 
               "pre_sum": 0, 
               "ndv": 1
           }, 
           {
               "lower": "2022-09-22 17:30:29", 
               "upper": "2022-09-22 22:30:29", 
               "count": 10, 
               "pre_sum": 9, 
               "ndv": 1
           }, 
           {
               "lower": "2022-09-23 17:30:29", 
               "upper": "2022-09-23 22:30:29", 
               "count": 9, 
               "pre_sum": 19, 
               "ndv": 1
           }, 
           {
               "lower": "2022-09-24 17:30:29", 
               "upper": "2022-09-24 22:30:29", 
               "count": 9, 
               "pre_sum": 28, 
               "ndv": 1
           }, 
           {
               "lower": "2022-09-25 17:30:29", 
               "upper": "2022-09-25 22:30:29", 
               "count": 9, 
               "pre_sum": 37, 
               "ndv": 1
           }
       ]
   }
   ```
   
   TODO:
   - histogram func supports parameter and sample statistics (It's got another 
pr)
   - histogram func requires only one stage agg
   - use histogram statistics
   
   Issue Number: close #xxx
   
   ## Problem summary
   
   Describe your changes.
   
   ## Checklist(Required)
   
   1. Does it affect the original behavior: 
       - [ ] Yes
       - [x] No
       - [ ] I don't know
   2. Has unit tests been added:
       - [x] Yes
       - [ ] No
       - [ ] No Need
   3. Has document been added or modified:
       - [x] Yes
       - [ ] No
       - [ ] No Need
   4. Does it need to update dependencies:
       - [ ] Yes
       - [x] No
   5. Are there any changes that cannot be rolled back:
       - [ ] Yes (If Yes, please explain WHY)
       - [x] No
   
   ## Further comments
   
   If this is a relatively large or complex change, kick off the discussion at 
[d...@doris.apache.org](mailto:d...@doris.apache.org) by explaining why you 
chose the solution you did and what alternatives you considered, etc...
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@doris.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@doris.apache.org
For additional commands, e-mail: commits-h...@doris.apache.org

[GitHub] [doris] weizhengte opened a new pull request, #14910: [feature-wip](Nereids)(histogram) add aggregate function histogram and collect histogram statistics

Reply via email to