morrySnow commented on code in PR #64892:
URL: https://github.com/apache/doris/pull/64892#discussion_r3489278981


##########
fe/fe-core/src/main/java/org/apache/doris/nereids/properties/RequestPropertyDeriver.java:
##########
@@ -547,35 +547,24 @@ public Void visitPhysicalBucketedHashAggregate(
         return null;
     }
 
-    private boolean shouldUseParent(List<ExprId> parentHashExprIds, 
PhysicalHashAggregate<? extends Plan> agg,
+    private boolean shouldUseParent(List<Expression> parentHashExprs, 
PhysicalHashAggregate<? extends Plan> agg,
             PlanContext context) {
         if 
(!context.getConnectContext().getSessionVariable().aggShuffleUseParentKey) {
             return false;
         }

Review Comment:
   Good call — if there is no group expression at all, we cannot derive stats 
and should not gamble on the parent subset key.



##########
fe/fe-core/src/main/java/org/apache/doris/nereids/properties/RequestPropertyDeriver.java:
##########
@@ -547,35 +547,24 @@ public Void visitPhysicalBucketedHashAggregate(
         return null;
     }
 
-    private boolean shouldUseParent(List<ExprId> parentHashExprIds, 
PhysicalHashAggregate<? extends Plan> agg,
+    private boolean shouldUseParent(List<Expression> parentHashExprs, 
PhysicalHashAggregate<? extends Plan> agg,
             PlanContext context) {
         if 
(!context.getConnectContext().getSessionVariable().aggShuffleUseParentKey) {
             return false;
         }
         Optional<GroupExpression> groupExpression = agg.getGroupExpression();
         if (!groupExpression.isPresent()) {
-            return true;
+            return false;
         }
         if (agg.hasSourceRepeat()) {
             return false;
         }

Review Comment:
   This is the core fix — previously returning `true` here meant the optimizer 
would use the narrower parent hash key even with no stats at all, which could 
lead to severe data skew and OOM. Returning `false` (fall through to the full 
group-by key) is the conservative and correct choice.
   
   Consider adding a brief comment here explaining the rationale, e.g.:
   ```java
   // Without stats we cannot assess whether the parent subset key has enough
   // NDV to avoid skew; fall back to the safe full group-by distribution.
   ```



##########
fe/fe-core/src/test/java/org/apache/doris/nereids/properties/RequestPropertyDeriverTest.java:
##########
@@ -446,14 +450,108 @@ public org.apache.doris.statistics.Statistics 
childStatistics(int idx) {
         List<List<PhysicalProperties>> actual
                 = 
requestPropertyDeriver.getRequestChildrenPropertyList(groupExpression);
 
-        // When aggShuffleUseParentKey is true, shouldUseParent may return true
-        // If shouldUseParent returns true, it will add parent key (key1) 
first, then all groupByExpressions (key1, key2)
-        Assertions.assertEquals(2, actual.size(), "Should have at least one 
property request");
+        List<List<PhysicalProperties>> expected = Lists.newArrayList();
+        expected.add(Lists.newArrayList(PhysicalProperties.createHash(
+                Lists.newArrayList(key1.getExprId(), key2.getExprId()), 
ShuffleType.REQUIRE)));
+        Assertions.assertEquals(expected, actual);
+    }

Review Comment:
   The `childStatistics` override returns `null` here, which exercises the 
`aggChildStats == null` → `return false` path. The test now correctly expects 
only the full group-by key distribution. Consider updating the comment above to 
reflect the new behavior (e.g., `// When stats are null, parent subset should 
NOT be used`).



##########
fe/fe-core/src/main/java/org/apache/doris/nereids/properties/RequestPropertyDeriver.java:
##########
@@ -547,35 +547,24 @@ public Void visitPhysicalBucketedHashAggregate(
         return null;
     }
 
-    private boolean shouldUseParent(List<ExprId> parentHashExprIds, 
PhysicalHashAggregate<? extends Plan> agg,
+    private boolean shouldUseParent(List<Expression> parentHashExprs, 
PhysicalHashAggregate<? extends Plan> agg,
             PlanContext context) {
         if 
(!context.getConnectContext().getSessionVariable().aggShuffleUseParentKey) {
             return false;
         }
         Optional<GroupExpression> groupExpression = agg.getGroupExpression();
         if (!groupExpression.isPresent()) {
-            return true;
+            return false;
         }
         if (agg.hasSourceRepeat()) {
             return false;
         }
         Statistics aggChildStats = groupExpression.get().childStatistics(0);
         if (aggChildStats == null) {
-            return true;
-        }
-        List<Slot> aggChildOutput = agg.child().getOutput();
-        Map<ExprId, Slot> exprIdSlotMap = new HashMap<>();
-        for (Slot slot : aggChildOutput) {
-            exprIdSlotMap.put(slot.getExprId(), slot);
-        }
-        List<Expression> parentHashExprs = new 
ArrayList<>(parentHashExprIds.size());
-        for (ExprId exprId : parentHashExprIds) {
-            if (exprIdSlotMap.containsKey(exprId)) {
-                parentHashExprs.add(exprIdSlotMap.get(exprId));
-            }
+            return false;

Review Comment:
   Same pattern as above — `hasUnknownStatistics` returning `true` now 
correctly causes us to skip the parent subset optimization instead of blindly 
trying it.



##########
fe/fe-core/src/test/java/org/apache/doris/nereids/rules/rewrite/SplitMultiDistinctTest.java:
##########
@@ -184,12 +184,16 @@ void countMultiColumnsWithGby() {
                                                     physicalHashJoin(
                                                             physicalProject(
                                                             
physicalHashAggregate(
-                                                                    
physicalHashAggregate(
-                                                                            
physicalDistribute(any())))),
+                                                                    
physicalDistribute(
+                                                                            
physicalHashAggregate(

Review Comment:
   The `physicalDistribute` wrappers are now expected in the plan shape because 
`shouldUseParent` no longer returns `true` when stats are unknown (which is the 
case in this unit test). Previously, the parent subset key was blindly adopted, 
which could eliminate the distribute node. This test change correctly reflects 
the stricter stats gate — the full group-by key distribution is used, and the 
distribute is preserved.
   
   This is an intended side effect of the fix, but worth confirming: is the 
plan shape here what you would expect to see in production queries after this 
change?



##########
fe/fe-core/src/main/java/org/apache/doris/nereids/properties/RequestPropertyDeriver.java:
##########
@@ -547,35 +547,24 @@ public Void visitPhysicalBucketedHashAggregate(
         return null;
     }
 
-    private boolean shouldUseParent(List<ExprId> parentHashExprIds, 
PhysicalHashAggregate<? extends Plan> agg,
+    private boolean shouldUseParent(List<Expression> parentHashExprs, 
PhysicalHashAggregate<? extends Plan> agg,
             PlanContext context) {
         if 
(!context.getConnectContext().getSessionVariable().aggShuffleUseParentKey) {
             return false;
         }
         Optional<GroupExpression> groupExpression = agg.getGroupExpression();
         if (!groupExpression.isPresent()) {
-            return true;
+            return false;
         }
         if (agg.hasSourceRepeat()) {
             return false;
         }
         Statistics aggChildStats = groupExpression.get().childStatistics(0);
         if (aggChildStats == null) {
-            return true;
-        }
-        List<Slot> aggChildOutput = agg.child().getOutput();
-        Map<ExprId, Slot> exprIdSlotMap = new HashMap<>();
-        for (Slot slot : aggChildOutput) {
-            exprIdSlotMap.put(slot.getExprId(), slot);
-        }
-        List<Expression> parentHashExprs = new 
ArrayList<>(parentHashExprIds.size());
-        for (ExprId exprId : parentHashExprIds) {
-            if (exprIdSlotMap.containsKey(exprId)) {
-                parentHashExprs.add(exprIdSlotMap.get(exprId));
-            }
+            return false;
         }
         if (AggregateUtils.hasUnknownStatistics(parentHashExprs, 
aggChildStats)) {
-            return true;
+            return false;

Review Comment:
   Note: NDV exactly equal to `LOW_NDV_THRESHOLD` (1024) is treated as 
insufficient — this is consistent with how `SplitAggMultiPhase` also uses `>` 
(strictly greater), so the threshold boundary is uniform across callers. :+1:



##########
fe/fe-core/src/test/java/org/apache/doris/nereids/properties/RequestPropertyDeriverTest.java:
##########
@@ -446,14 +450,108 @@ public org.apache.doris.statistics.Statistics 
childStatistics(int idx) {
         List<List<PhysicalProperties>> actual
                 = 
requestPropertyDeriver.getRequestChildrenPropertyList(groupExpression);
 
-        // When aggShuffleUseParentKey is true, shouldUseParent may return true
-        // If shouldUseParent returns true, it will add parent key (key1) 
first, then all groupByExpressions (key1, key2)
-        Assertions.assertEquals(2, actual.size(), "Should have at least one 
property request");
+        List<List<PhysicalProperties>> expected = Lists.newArrayList();
+        expected.add(Lists.newArrayList(PhysicalProperties.createHash(
+                Lists.newArrayList(key1.getExprId(), key2.getExprId()), 
ShuffleType.REQUIRE)));
+        Assertions.assertEquals(expected, actual);
+    }
+
+    @Test
+    void testAggregateWithAggShuffleUseParentKeyEnabledAndLowNdvStats() {
+        ConnectContext testConnectContext = 
MemoTestUtils.createConnectContext();
+        testConnectContext.getSessionVariable().aggShuffleUseParentKey = true;
+        testConnectContext.getSessionVariable().setBeNumberForTest(3);
+
+        SlotReference key1 = new SlotReference(new ExprId(0), "col1", 
IntegerType.INSTANCE, true, ImmutableList.of());
+        SlotReference key2 = new SlotReference(new ExprId(1), "col2", 
IntegerType.INSTANCE, true, ImmutableList.of());
+        GroupPlan childPlan = new GroupPlan(new 
Group(GroupId.createGenerator().getNextId(),
+                new GroupExpression(new LogicalOneRowRelation(new 
RelationId(6), ImmutableList.of(key1, key2)))
+                        .getPlan().getLogicalProperties()));
+        PhysicalHashAggregate<GroupPlan> aggregate = new 
PhysicalHashAggregate<>(
+                Lists.newArrayList(key1, key2),
+                Lists.newArrayList(key1, key2),
+                new AggregateParam(AggPhase.GLOBAL, AggMode.BUFFER_TO_RESULT),
+                true,
+                logicalProperties,
+                false,
+                childPlan
+        );
+        Statistics childStats = new StatisticsBuilder()
+                .setRowCount(10000)
+                .putColumnStatistics(key1,
+                        new 
ColumnStatisticBuilder(10000).setNdv(AggregateUtils.LOW_NDV_THRESHOLD).build())
+                .build();
+        GroupExpression groupExpression = new GroupExpression(aggregate) {
+            @Override
+            public Statistics childStatistics(int idx) {
+                return childStats;

Review Comment:
   Nice boundary test — `setNdv(AggregateUtils.LOW_NDV_THRESHOLD)` (1024) and 
correctly expecting the parent key NOT to be used, since `combinedNdv > 
LOW_NDV_THRESHOLD` is false when NDV is exactly at the threshold.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to