shubhluck commented on code in PR #6382:
URL: https://github.com/apache/hive/pull/6382#discussion_r3016885593


##########
ql/src/java/org/apache/hadoop/hive/ql/stats/StatsUtils.java:
##########
@@ -2030,34 +2034,41 @@ public static void updateStats(Statistics stats, long 
newNumRows,
 
     if (useColStats) {
       List<ColStatistics> colStats = stats.getColumnStats();
-      for (ColStatistics cs : colStats) {
-        long oldDV = cs.getCountDistint();
-        if (affectedColumns.contains(cs.getColumnName())) {
-          long newDV = oldDV;
-
-          // if ratio is greater than 1, then number of rows increases. This 
can happen
-          // when some operators like GROUPBY duplicates the input rows in 
which case
-          // number of distincts should not change. Update the distinct count 
only when
-          // the output number of rows is less than input number of rows.
-          if (ratio <= 1.0) {
-            newDV = (long) Math.ceil(ratio * oldDV);
+      if (colStats != null && !colStats.isEmpty()) {

Review Comment:
   You're right that throwing IllegalArgumentException would enforce cleaner 
contracts. 
   However, after implementing it, CI tests failed because several existing 
callers 
   (e.g., StatsRulesProcFactory$GroupByStatsRule, TestHiveIcebergStatistics) 
pass 
   useColStats=true expecting graceful fallback when stats are unavailable.
   The existing codebase treats useColStats=true as "use column stats if 
available" 
   rather than "column stats must be present". Enforcing the stricter contract 
would 
   require updating all these callers to check column stats availability before 
calling 
   updateStats - a larger refactoring beyond the scope of this NPE fix.
   Options:
   1. **Graceful fallback (current approach)** - Fall back to ratio-based 
estimation 
      with a debug log. Minimal change, fixes the NPE.
      
   2. **Fix all callers** - Update every call site to check column stats before 
passing 
      useColStats=true. Larger scope but enforces cleaner contract.
   3. **Add strict method variant** - Keep existing method lenient, add new 
      updateStatsStrict() that throws exception. Callers can opt-in.
      
   I've reverted to option 1 to unblock CI. Please let me know if you prefer 
option 2 
   or 3 - happy to implement either, though they have larger scope.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to