Re: [PR] HIVE-29432: Autogather column statistics missing for tables containing a column with an unsupported type [hive]

via GitHub Thu, 02 Apr 2026 08:13:40 -0700


thomasrebele commented on code in PR #6374:
URL: https://github.com/apache/hive/pull/6374#discussion_r3027902499



##########
ql/src/java/org/apache/hadoop/hive/ql/parse/ColumnStatsSemanticAnalyzer.java:
##########
@@ -103,11 +102,27 @@ private boolean shouldRewrite(ASTNode tree) {
     return rwt;
   }
 
+  /**
+   * Get the names of the columns that support column statistics.
+   */
+  private static List<String> getColumnNames(Table tbl) {

Review Comment:
   I'll rename it to `getColumnNamesSupportingStats`.



##########
ql/src/test/queries/clientpositive/stats_unsupported_type.q:
##########
@@ -0,0 +1,11 @@
+set hive.stats.autogather=true;

Review Comment:
   Indeed, we can avoid making the tests take longer. As you mentioned it, 
there are other q file tests that cover this, e.g., 
https://github.com/apache/hive/pull/6374/changes#r3027738989.



##########
ql/src/test/results/clientpositive/llap/cast_null_to_complex.q.out:
##########
@@ -87,7 +87,7 @@ Retention:            0
 #### A masked pattern was here ####
 Table Type:            MANAGED_TABLE            
 Table Parameters:               
-       COLUMN_STATS_ACCURATE   {\"BASIC_STATS\":\"true\"}
+       COLUMN_STATS_ACCURATE   
{\"BASIC_STATS\":\"true\",\"COLUMN_STATS\":{\"_c2\":\"true\"}}

Review Comment:
   Indeed, I'll remove `stats_unsupported_type.q`.



##########
ql/src/java/org/apache/hadoop/hive/ql/parse/ColumnStatsAutoGatherContext.java:
##########
@@ -256,14 +256,32 @@ private void replaceSelectOperatorProcess(SelectOperator 
operator, Operator<? ex
     //                                |
 
     // 1. deal with non-partition columns
+    Map<String, Integer> columnNameToIndex = new HashMap<>();
+    List<ColumnInfo> selRSSig = selRS.getSignature();
+    for (int i = 0; i < selRSSig.size(); i++) {
+      columnNameToIndex.putIfAbsent(selRSSig.get(i).getAlias(), i);
+    }
     for (int i = 0; i < this.columns.size(); i++) {
       ColumnInfo col = columns.get(i);
+      ObjectInspector objectInspector = col.getObjectInspector();
+      if (objectInspector == null) {
+        continue;
+      }
+      boolean columnSupported = 
isColumnSupported(objectInspector.getCategory(), col::getType);
+      if (!columnSupported) {
+        continue;
+      }
+
+      Integer selRSIdx = columnNameToIndex.get(this.columns.get(i).getName());
+      if (selRSIdx == null) {
+        continue;
+      }

Review Comment:
   Originally if [any column did not support 
statistics](https://github.com/apache/hive/blob/ef2c646bd4788b79e56fe2c901e2b5cfbfdd4779/ql/src/java/org/apache/hadoop/hive/ql/parse/ColumnStatsAutoGatherContext.java#L322),
 the whole statistics pipeline [would not be added to the 
plan](https://github.com/apache/hive/blob/ef2c646bd4788b79e56fe2c901e2b5cfbfdd4779/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java#L8264).
 So if I understand the code correctly, this method would not have run at all.
   
   Btw, if you know about any other class or method that is affected by the 
change, please let me know. I've fixed all test cases, but maybe there's 
something that's not covered by the tests?



##########
ql/src/test/results/clientpositive/llap/lateral_view.q.out:
##########
@@ -671,23 +671,23 @@ STAGE PLANS:
             Map Operator Tree:
                 TableScan
                   alias: tmp_pyang_src_rcfile
-                  Statistics: Num rows: 20 Data size: 42080 Basic stats: 
COMPLETE Column stats: NONE
+                  Statistics: Num rows: 20 Data size: 40140 Basic stats: 
COMPLETE Column stats: PARTIAL

Review Comment:
   The columns of the tmp_pyang_src_rcfile table are `key string, value 
array<string>`. Afaik, statistics on arrays are currently not supported. I 
confirmed it in the debugger, getColumnNamesSupportingStats only returns `key`.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] HIVE-29432: Autogather column statistics missing for tables containing a column with an unsupported type [hive]

Reply via email to