This is an automated email from the ASF dual-hosted git repository.

englefly pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/doris.git


The following commit(s) were added to refs/heads/master by this push:
     new e7de4f34c07 [feat](nereids)estimate selected partition row count 
(#43270)
e7de4f34c07 is described below

commit e7de4f34c07d79bd11ccdece62d9c3f9480e9b44
Author: minghong <engle...@gmail.com>
AuthorDate: Fri Nov 8 15:53:05 2024 +0800

    [feat](nereids)estimate selected partition row count (#43270)
    
    ### What problem does this PR solve?
    estimate selected partition rowcount more accurate.
    suppose table T has N partitions, and R rows.
    we estimate M partitions row count, where M1 partitions row count is
    avaliable and sum is R1. M2 partitions row count is not avaliable.
    (M=M1+M2)
    
    before:
    est_row=R * M/N
    after:
    est_row=R1 + R*M2/N
    
    <!--
    You need to clearly describe your PR in this part:
    
    1. What problem was fixed (it's best to include specific error reporting
    information). How it was fixed.
    2. Which behaviors were modified. What was the previous behavior, what
    is it now, why was it modified, and what possible impacts might there
    be.
    3. What features were added. Why this function was added.
    4. Which codes were refactored and why this part of the code was
    refactored.
    5. Which functions were optimized and what is the difference before and
    after the optimization.
    
    The description of the PR needs to enable reviewers to quickly and
    clearly understand the logic of the code modification.
    -->
    
    <!--
    If there are related issues, please fill in the issue number.
    - If you want the issue to be closed after the PR is merged, please use
    "close #12345". Otherwise, use "ref #12345"
    -->
    Issue Number: close #xxx
    
    <!--
    If this PR is followup a preivous PR, for example, fix the bug that
    introduced by a related PR,
    link the PR here
    -->
    Related PR: #xxx
    
    Problem Summary:
    
    ### Check List (For Committer)
    
    - Test <!-- At least one of them must be included. -->
    
        - [ ] Regression test
        - [ ] Unit Test
        - [ ] Manual test (add detailed scripts or steps below)
        - [ ] No need to test or manual test. Explain why:
    - [ ] This is a refactor/code format and no logic has been changed.
            - [ ] Previous test can cover this change.
            - [ ] No colde files have been changed.
            - [ ] Other reason <!-- Add your reason?  -->
    
    - Behavior changed:
    
        - [ ] No.
        - [ ] Yes. <!-- Explain the behavior change -->
    
    - Does this need documentation?
    
        - [ ] No.
    - [ ] Yes. <!-- Add document PR link here. eg:
    https://github.com/apache/doris-website/pull/1214 -->
    
    - Release note
    
        <!-- bugfix, feat, behavior changed need a release note -->
        <!-- Add one line release note for this PR. -->
        None
    
    ### Check List (For Reviewer who merge this PR)
    
    - [ ] Confirm the release note
    - [ ] Confirm test cases
    - [ ] Confirm document
    - [ ] Add branch pick label <!-- Add branch pick label that this PR
    should merge into -->
    
    ---------
    
    Co-authored-by: minghong <zhoumingh...@selectdb.com>
---
 .../doris/nereids/stats/StatsCalculator.java       | 126 +++++++++++++--------
 ...tionRowCunt.groovy => partitionRowCount.groovy} |   0
 .../nereids_p0/stats/partition_key_minmax.groovy   |   4 +-
 3 files changed, 79 insertions(+), 51 deletions(-)

diff --git 
a/fe/fe-core/src/main/java/org/apache/doris/nereids/stats/StatsCalculator.java 
b/fe/fe-core/src/main/java/org/apache/doris/nereids/stats/StatsCalculator.java
index a3c117bb46f..c975a4704fb 100644
--- 
a/fe/fe-core/src/main/java/org/apache/doris/nereids/stats/StatsCalculator.java
+++ 
b/fe/fe-core/src/main/java/org/apache/doris/nereids/stats/StatsCalculator.java
@@ -368,23 +368,34 @@ public class StatsCalculator extends 
DefaultPlanVisitor<Statistics, Void> {
         return getColumnStatistic(catalogRelation.getTable(), slot.getName(), 
idxId);
     }
 
-    private ColumnStatistic getColumnStatsFromPartitionCache(OlapScan 
catalogRelation, SlotReference slot,
+    /**
+     * if get partition col stats failed, then return table level col stats
+     */
+    private ColumnStatistic 
getColumnStatsFromPartitionCacheOrTableCache(OlapScan catalogRelation, 
SlotReference slot,
             List<String> partitionNames) {
         long idxId = catalogRelation.getSelectedIndexId();
 
         return getColumnStatistic(catalogRelation.getTable(), slot.getName(), 
idxId, partitionNames);
     }
 
-    private long getSelectedPartitionRowCount(OlapScan olapScan) {
-        long partRowCountSum = 0;
+    private double getSelectedPartitionRowCount(OlapScan olapScan, double 
tableRowCount) {
+        // the number of partitions whose row count is not available
+        double unknownPartitionCount = 0;
+        double partRowCountSum = 0;
         for (long id : olapScan.getSelectedPartitionIds()) {
             long partRowCount = olapScan.getTable()
                     .getRowCountForPartitionIndex(id, 
olapScan.getSelectedIndexId(), true);
-            // if we cannot get any partition's rowCount, return -1 to 
fallback to table level stats
             if (partRowCount == -1) {
-                return -1;
+                unknownPartitionCount++;
+            } else {
+                partRowCountSum += partRowCount;
             }
-            partRowCountSum += partRowCount;
+        }
+        // estimate row count for unknownPartitionCount
+        if (unknownPartitionCount > 0) {
+            // each selected partition has at least one row
+            partRowCountSum += Math.max(unknownPartitionCount,
+                    tableRowCount * unknownPartitionCount / 
olapScan.getTable().getPartitionNum());
         }
         return partRowCountSum;
     }
@@ -464,10 +475,7 @@ public class StatsCalculator extends 
DefaultPlanVisitor<Statistics, Void> {
             Optional<Statistics> optStats = 
cascadesContext.getStatementContext()
                     .getStatistics(((Relation) olapScan).getRelationId());
             if (optStats.isPresent()) {
-                double selectedPartitionsRowCount = 
getSelectedPartitionRowCount(olapScan);
-                if (selectedPartitionsRowCount == -1) {
-                    selectedPartitionsRowCount = tableRowCount;
-                }
+                double selectedPartitionsRowCount = 
getSelectedPartitionRowCount(olapScan, tableRowCount);
                 // if estimated mv rowCount is more than actual row count, 
fall back to base table stats
                 if (selectedPartitionsRowCount >= 
optStats.get().getRowCount()) {
                     Statistics derivedStats = optStats.get();
@@ -521,38 +529,28 @@ public class StatsCalculator extends 
DefaultPlanVisitor<Statistics, Void> {
             }
         }
 
-        boolean useTableLevelStats = true;
         if (olapScan.getSelectedPartitionIds().size() < 
olapScan.getTable().getPartitionNum()) {
             // partition pruned
             // try to use selected partition stats, if failed, fall back to 
table stats
-            double selectedPartitionsRowCount = 
getSelectedPartitionRowCount(olapScan);
-            if (selectedPartitionsRowCount >= 0) {
-                useTableLevelStats = false;
-                List<String> selectedPartitionNames = new 
ArrayList<>(olapScan.getSelectedPartitionIds().size());
-                olapScan.getSelectedPartitionIds().forEach(id -> {
-                    
selectedPartitionNames.add(olapScan.getTable().getPartition(id).getName());
-                });
-                for (SlotReference slot : visibleOutputSlots) {
-                    ColumnStatistic cache = 
getColumnStatsFromPartitionCache(olapScan, slot, selectedPartitionNames);
-                    if (slot.getColumn().isPresent()) {
-                        cache = updateMinMaxForPartitionKey(olapTable, 
selectedPartitionNames, slot, cache);
-                    }
-                    ColumnStatisticBuilder colStatsBuilder = new 
ColumnStatisticBuilder(cache,
-                            selectedPartitionsRowCount);
-                    colStatsBuilder.normalizeAvgSizeByte(slot);
-                    builder.putColumnStatistics(slot, colStatsBuilder.build());
+            double selectedPartitionsRowCount = 
getSelectedPartitionRowCount(olapScan, tableRowCount);
+            List<String> selectedPartitionNames = new 
ArrayList<>(olapScan.getSelectedPartitionIds().size());
+            olapScan.getSelectedPartitionIds().forEach(id -> {
+                
selectedPartitionNames.add(olapScan.getTable().getPartition(id).getName());
+            });
+            for (SlotReference slot : visibleOutputSlots) {
+                ColumnStatistic cache = 
getColumnStatsFromPartitionCacheOrTableCache(
+                        olapScan, slot, selectedPartitionNames);
+                if (slot.getColumn().isPresent()) {
+                    cache = updateMinMaxForPartitionKey(olapTable, 
selectedPartitionNames, slot, cache);
                 }
-                checkIfUnknownStatsUsedAsKey(builder);
-                builder.setRowCount(selectedPartitionsRowCount);
-            } else {
-                // estimate table rowCount according to pruned partition num
-                tableRowCount = tableRowCount * 
olapScan.getSelectedPartitionIds().size()
-                        / olapScan.getTable().getPartitionNum();
+                ColumnStatisticBuilder colStatsBuilder = new 
ColumnStatisticBuilder(cache,
+                        selectedPartitionsRowCount);
+                colStatsBuilder.normalizeAvgSizeByte(slot);
+                builder.putColumnStatistics(slot, colStatsBuilder.build());
             }
-        }
-        // 1. no partition is pruned, or
-        // 2. fall back to table stats
-        if (useTableLevelStats) {
+            checkIfUnknownStatsUsedAsKey(builder);
+            builder.setRowCount(selectedPartitionsRowCount);
+        } else {
             // get table level stats
             for (SlotReference slot : visibleOutputSlots) {
                 ColumnStatistic cache = 
getColumnStatsFromTableCache((CatalogRelation) olapScan, slot);
@@ -566,6 +564,9 @@ public class StatsCalculator extends 
DefaultPlanVisitor<Statistics, Void> {
         return builder.build();
     }
 
+    /**
+     * Determine whether it is a partition key inside the function.
+     */
     private ColumnStatistic updateMinMaxForPartitionKey(OlapTable olapTable,
             List<String> selectedPartitionNames,
             SlotReference slot, ColumnStatistic cache) {
@@ -616,12 +617,7 @@ public class StatsCalculator extends 
DefaultPlanVisitor<Statistics, Void> {
                     }
                 }
                 if (minExpr != null) {
-                    cache = new ColumnStatisticBuilder(cache)
-                            .setMinExpr(minExpr)
-                            .setMinValue(minValue)
-                            .setMaxExpr(maxExpr)
-                            .setMaxValue(maxValue)
-                            .build();
+                    cache = updateMinMax(cache, minValue, minExpr, maxValue, 
maxExpr);
                 }
             } catch (AnalysisException e) {
                 LOG.debug(e.getMessage());
@@ -671,12 +667,7 @@ public class StatsCalculator extends 
DefaultPlanVisitor<Statistics, Void> {
                     }
                 }
                 if (minExpr != null) {
-                    cache = new ColumnStatisticBuilder(cache)
-                            .setMinExpr(minExpr)
-                            .setMinValue(minValue)
-                            .setMaxExpr(maxExpr)
-                            .setMaxValue(maxValue)
-                            .build();
+                    cache = updateMinMax(cache, minValue, minExpr, maxValue, 
maxExpr);
                 }
             } catch (AnalysisException e) {
                 LOG.debug(e.getMessage());
@@ -685,6 +676,43 @@ public class StatsCalculator extends 
DefaultPlanVisitor<Statistics, Void> {
         return cache;
     }
 
+    private ColumnStatistic updateMinMax(ColumnStatistic cache, double 
minValue, LiteralExpr minExpr,
+            double maxValue, LiteralExpr maxExpr) {
+        boolean shouldUpdateCache = false;
+        if (!cache.isUnKnown) {
+            // merge the min/max with cache.
+            // example: min/max range in cache is [10-20]
+            // range from partition def is [15-30]
+            // the final range is [15-20]
+            if (cache.minValue > minValue) {
+                minValue = cache.minValue;
+                minExpr = cache.minExpr;
+            } else {
+                shouldUpdateCache = true;
+            }
+            if (cache.maxValue < maxValue) {
+                maxValue = cache.maxValue;
+                maxExpr = cache.maxExpr;
+            } else {
+                shouldUpdateCache = true;
+            }
+            // if min/max is invalid, do not update cache
+            if (minValue > maxValue) {
+                shouldUpdateCache = false;
+            }
+        }
+
+        if (shouldUpdateCache) {
+            cache = new ColumnStatisticBuilder(cache)
+                    .setMinExpr(minExpr)
+                    .setMinValue(minValue)
+                    .setMaxExpr(maxExpr)
+                    .setMaxValue(maxValue)
+                    .build();
+        }
+        return cache;
+    }
+
     @Override
     public Statistics visitLogicalOlapScan(LogicalOlapScan olapScan, Void 
context) {
         return computeOlapScan(olapScan);
diff --git a/regression-test/suites/nereids_p0/stats/partitionRowCunt.groovy 
b/regression-test/suites/nereids_p0/stats/partitionRowCount.groovy
similarity index 100%
rename from regression-test/suites/nereids_p0/stats/partitionRowCunt.groovy
rename to regression-test/suites/nereids_p0/stats/partitionRowCount.groovy
diff --git 
a/regression-test/suites/nereids_p0/stats/partition_key_minmax.groovy 
b/regression-test/suites/nereids_p0/stats/partition_key_minmax.groovy
index 6c48b597c92..3a58d1b6f23 100644
--- a/regression-test/suites/nereids_p0/stats/partition_key_minmax.groovy
+++ b/regression-test/suites/nereids_p0/stats/partition_key_minmax.groovy
@@ -36,8 +36,8 @@ suite("partition_key_minmax") {
         sql """memo plan
             select * from rangetable where a < 250;
             """
-        containsAny("a#0 -> ndv=3.0000, min=1.000000(1), max=30.000000(30), 
count=3.0000")
-        containsAny("a#0 -> ndv=2.6667, min=5.000000(5), max=333.000000(333), 
count=2.6667")
+        containsAny("a#0 -> ndv=2.6667, min=5.000000(5), max=30.000000(30), 
count=2.6667")
+        containsAny("a#0 -> ndv=3, min=5.000000(5), max=30.000000(30), 
count=2.6667")
     }
 
     sql """


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@doris.apache.org
For additional commands, e-mail: commits-h...@doris.apache.org

Reply via email to