shubhluck commented on code in PR #6382:
URL: https://github.com/apache/hive/pull/6382#discussion_r3006807875
##########
ql/src/java/org/apache/hadoop/hive/ql/plan/Statistics.java:
##########
@@ -328,6 +329,19 @@ public List<ColStatistics> getColumnStats() {
return null;
}
+ /**
+ * Returns the column statistics as a list, or an empty list if column statistics are unavailable.
+ * This method is useful to avoid null checks when iterating over column statistics.
+ *
+ * @return list of column statistics, or empty list if unavailable
+ */
+ public List<ColStatistics> getColumnStatsOrEmpty() {
+ if (columnStats != null) {
+ return Lists.newArrayList(columnStats.values());
+ }
+ return Collections.emptyList();
+ }
+
Review Comment:
Thanks for the feedback! I agree that adding a new method increases API
complexity. I've updated the PR to implement Option 2 + Option 3 together:
**Option 3 (Root cause fix):** Added a precondition check in
removeSemijoinOptimizationByBenefit():
```
if (filterStats != null && filterStats.getColumnStats() != null) {
```
This prevents the semijoin optimization from proceeding when column
statistics are unavailable, which was the root cause of the NPE in the original
TPC-DS workload.
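For illustration, the guard above can be sketched in isolation. This is a minimal sketch, not the code in the PR: `StatsSketch` is a hypothetical stand-in for `Statistics` that models only the null-returning contract of `getColumnStats()`, and `canEvaluateBenefit` is an assumed helper name.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for org.apache.hadoop.hive.ql.plan.Statistics;
// only the null-returning behaviour of getColumnStats() is modelled.
class StatsSketch {
    private final Map<String, Long> columnStats; // null means "no column stats"

    StatsSketch(Map<String, Long> columnStats) {
        this.columnStats = columnStats;
    }

    // Mirrors the contract visible in the diff: a list when stats exist, otherwise null.
    List<Long> getColumnStats() {
        return columnStats == null ? null : new ArrayList<>(columnStats.values());
    }
}

class SemijoinGuardSketch {
    // Assumed helper: true only when both the Statistics object and its
    // column statistics are present, i.e. the same condition as the guard above.
    static boolean canEvaluateBenefit(StatsSketch filterStats) {
        return filterStats != null && filterStats.getColumnStats() != null;
    }

    public static void main(String[] args) {
        Map<String, Long> ndv = new LinkedHashMap<>();
        ndv.put("ss_item_sk", 1000L); // illustrative column name and value

        System.out.println(canEvaluateBenefit(new StatsSketch(ndv)));  // stats present
        System.out.println(canEvaluateBenefit(new StatsSketch(null))); // column stats missing
        System.out.println(canEvaluateBenefit(null));                  // no stats object at all
    }
}
```

Checking the object first and its column stats second means the benefit computation is never reached with a null list, so callers further down do not each need their own check for this path.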
**Option 2 (Defensive null checks):** Added null checks in:
- StatsUtils.updateStats(), which now logs a warning via LOG.warn when stats are unavailable
- StatsUtils.getColStatisticsUpdatingTableAlias()
- StatsRulesProcFactory.updateColStats()
- SemanticAnalyzer.getMaterializedTableStats()
**Removed:** the getColumnStatsOrEmpty() method from Statistics.java
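The shape of the defensive checks in Option 2 is roughly the following. This is a sketch under assumptions rather than the actual Hive code: `scaleDistinctCounts` is a hypothetical method, column statistics are reduced to plain distinct-value counts, and `java.util.logging` is used only to keep the example self-contained.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.logging.Logger;

class NullSafeColStatsSketch {
    // java.util.logging is an assumption made for self-containment.
    private static final Logger LOG =
        Logger.getLogger(NullSafeColStatsSketch.class.getName());

    // Hypothetical caller-side check: scales each column's distinct-value count
    // by a ratio, and returns an empty list (with a warning) when the caller
    // passes null because column statistics are unavailable.
    static List<Long> scaleDistinctCounts(List<Long> colStats, double ratio) {
        if (colStats == null) {
            LOG.warning("Column statistics unavailable; skipping column stats update");
            return new ArrayList<>();
        }
        List<Long> scaled = new ArrayList<>(colStats.size());
        for (long ndv : colStats) {
            // Keep at least one distinct value per column after scaling.
            scaled.add(Math.max(1L, Math.round(ndv * ratio)));
        }
        return scaled;
    }

    public static void main(String[] args) {
        System.out.println(scaleDistinctCounts(Arrays.asList(100L, 2L), 0.1));
        System.out.println(scaleDistinctCounts(null, 0.1));
    }
}
```

Keeping the check at each call site, instead of adding a null-safe accessor to `Statistics`, leaves the public API unchanged while still making the missing-stats case visible in the logs.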
**Added .q test:** semijoin_stats_missing_colstats.q, a regression test
that verifies queries execute successfully when basic table stats exist but
column stats are unavailable. Note that reproducing the exact NPE requires the
original TPC-DS workload, where the semijoin optimization is actually triggered.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]