shubhluck commented on code in PR #6382:
URL: https://github.com/apache/hive/pull/6382#discussion_r3016885593
##########
ql/src/java/org/apache/hadoop/hive/ql/stats/StatsUtils.java:
##########
@@ -2030,34 +2034,41 @@ public static void updateStats(Statistics stats, long
newNumRows,
if (useColStats) {
List<ColStatistics> colStats = stats.getColumnStats();
- for (ColStatistics cs : colStats) {
- long oldDV = cs.getCountDistint();
- if (affectedColumns.contains(cs.getColumnName())) {
- long newDV = oldDV;
-
- // if ratio is greater than 1, then number of rows increases. This
can happen
- // when some operators like GROUPBY duplicates the input rows in
which case
- // number of distincts should not change. Update the distinct count
only when
- // the output number of rows is less than input number of rows.
- if (ratio <= 1.0) {
- newDV = (long) Math.ceil(ratio * oldDV);
+ if (colStats != null && !colStats.isEmpty()) {
Review Comment:
You're right that throwing IllegalArgumentException would enforce cleaner
contracts.
However, after implementing it, CI tests failed because several existing
callers
(e.g., StatsRulesProcFactory$GroupByStatsRule, TestHiveIcebergStatistics)
pass
useColStats=true expecting graceful fallback when stats are unavailable.
The existing codebase treats useColStats=true as "use column stats if
available"
rather than "column stats must be present". Enforcing the stricter contract
would
require updating all these callers to check column stats availability before
calling
updateStats - a larger refactoring beyond the scope of this NPE fix.
Options:
1. **Graceful fallback (current approach)** - Fall back to ratio-based
estimation
with a debug log. Minimal change, fixes the NPE.
2. **Fix all callers** - Update every call site to check column stats before
passing
useColStats=true. Larger scope but enforces cleaner contract.
3. **Add strict method variant** - Keep existing method lenient, add new
updateStatsStrict() that throws exception. Callers can opt-in.
I've reverted to option 1 to unblock CI. Please let me know if you prefer
option 2
or 3 - happy to implement either, though they have larger scope.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]