konstantinb commented on code in PR #6359:
URL: https://github.com/apache/hive/pull/6359#discussion_r3030869220


##########
ql/src/java/org/apache/hadoop/hive/ql/stats/StatsUtils.java:
##########
@@ -2109,7 +2111,10 @@ private static List<Long> 
extractNDVGroupingColumns(List<ColStatistics> colStats
     for (ColStatistics cs : colStats) {
       if (cs != null) {
         long ndv = cs.getCountDistint();
-        if (cs.getNumNulls() > 0) {
+        // NDV needs to be adjusted if a column has a known NDV along with 
NULL values
+        // or if a column happens to be "const NULL"
+        if ((ndv > 0 && cs.getNumNulls() > 0) ||
+            (ndv == 0 && !cs.isEstimated() && cs.getNumNulls() == 
parentStats.getNumRows())) {

Review Comment:
   @zabetak thank you for your feedback; I fully understand your concerns and 
the goal of reducing code changes in one PR.
   
   The main problem with keeping this code 100% intact is that **any** 
ColStatistics object coming into extractNDVGroupingColumns with an NDV of 0 and 
numNulls > 0 would be estimated with **1** GROUP BY row count; and this "1" 
estimate has proven to be the root cause of many severe under-estimations. 
   
   It makes this method logic directly correlated with the proposed changes to 
PessimisticStatCombiner
   
   Technically, the following two alternatives are feasible alternatives to the 
current PR code:
   1.
   `        if ((ndv > 0 && cs.getNumNulls() > 0) {
             ndv = StatsUtils.safeAdd(ndv, 1);
           }
   `
   ^^ This one adjusts NDV estimations except for columns with (potentially) 
unknown NDVs. Applying this change did not cause changes to any already 
modified .out files **of this branch**. I believe this is the safest possible 
change for this logic
   
   2. Completely removing the +1 adjustment for NULLs:
   ` ndvValues.add(cs.getCountDistint());`
   ^^ This code edit modifies only three files on the current branch
   
   If you firmly oppose any changes to **extractNDVGroupingColumns**, then we 
will likely need to get back to the drawing board on how to approach this whole 
problem in a different manner.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to