thomasrebele commented on code in PR #6374:
URL: https://github.com/apache/hive/pull/6374#discussion_r3027902499
##########
ql/src/java/org/apache/hadoop/hive/ql/parse/ColumnStatsSemanticAnalyzer.java:
##########
@@ -103,11 +102,27 @@ private boolean shouldRewrite(ASTNode tree) {
return rwt;
}
+ /**
+ * Get the names of the columns that support column statistics.
+ */
+ private static List<String> getColumnNames(Table tbl) {
Review Comment:
I'll rename it to `getColumnNamesSupportingStats`.
##########
ql/src/test/queries/clientpositive/stats_unsupported_type.q:
##########
@@ -0,0 +1,11 @@
+set hive.stats.autogather=true;
Review Comment:
Indeed, we can avoid making the tests take longer. As you mentioned it,
there are other q file tests that cover this, e.g.,
https://github.com/apache/hive/pull/6374/changes#r3027738989.
##########
ql/src/test/results/clientpositive/llap/cast_null_to_complex.q.out:
##########
@@ -87,7 +87,7 @@ Retention: 0
#### A masked pattern was here ####
Table Type: MANAGED_TABLE
Table Parameters:
- COLUMN_STATS_ACCURATE {\"BASIC_STATS\":\"true\"}
+ COLUMN_STATS_ACCURATE
{\"BASIC_STATS\":\"true\",\"COLUMN_STATS\":{\"_c2\":\"true\"}}
Review Comment:
Indeed, I'll remove `stats_unsupported_type.q`.
##########
ql/src/java/org/apache/hadoop/hive/ql/parse/ColumnStatsAutoGatherContext.java:
##########
@@ -256,14 +256,32 @@ private void replaceSelectOperatorProcess(SelectOperator
operator, Operator<? ex
// |
// 1. deal with non-partition columns
+ Map<String, Integer> columnNameToIndex = new HashMap<>();
+ List<ColumnInfo> selRSSig = selRS.getSignature();
+ for (int i = 0; i < selRSSig.size(); i++) {
+ columnNameToIndex.putIfAbsent(selRSSig.get(i).getAlias(), i);
+ }
for (int i = 0; i < this.columns.size(); i++) {
ColumnInfo col = columns.get(i);
+ ObjectInspector objectInspector = col.getObjectInspector();
+ if (objectInspector == null) {
+ continue;
+ }
+ boolean columnSupported =
isColumnSupported(objectInspector.getCategory(), col::getType);
+ if (!columnSupported) {
+ continue;
+ }
+
+ Integer selRSIdx = columnNameToIndex.get(this.columns.get(i).getName());
+ if (selRSIdx == null) {
+ continue;
+ }
Review Comment:
Originally if [any column did not support
statistics](https://github.com/apache/hive/blob/ef2c646bd4788b79e56fe2c901e2b5cfbfdd4779/ql/src/java/org/apache/hadoop/hive/ql/parse/ColumnStatsAutoGatherContext.java#L322),
the whole statistics pipeline [would not be added to the
plan](https://github.com/apache/hive/blob/ef2c646bd4788b79e56fe2c901e2b5cfbfdd4779/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java#L8264).
So if I understand the code correctly, this method would not have run at all.
Btw, if you know about any other class or method that is affected by the
change, please let me know. I've fixed all test cases, but maybe there's
something that's not covered by the tests?
##########
ql/src/test/results/clientpositive/llap/lateral_view.q.out:
##########
@@ -671,23 +671,23 @@ STAGE PLANS:
Map Operator Tree:
TableScan
alias: tmp_pyang_src_rcfile
- Statistics: Num rows: 20 Data size: 42080 Basic stats:
COMPLETE Column stats: NONE
+ Statistics: Num rows: 20 Data size: 40140 Basic stats:
COMPLETE Column stats: PARTIAL
Review Comment:
The columns of the tmp_pyang_src_rcfile table are `key string, value
array<string>`. Afaik, statistics on arrays are currently not supported. I
confirmed it in the debugger, getColumnNamesSupportingStats only returns `key`.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]