zabetak commented on code in PR #6382:
URL: https://github.com/apache/hive/pull/6382#discussion_r3026635324
##########
ql/src/java/org/apache/hadoop/hive/ql/stats/StatsUtils.java:
##########
@@ -2030,34 +2034,41 @@ public static void updateStats(Statistics stats, long newNumRows,
if (useColStats) {
List<ColStatistics> colStats = stats.getColumnStats();
- for (ColStatistics cs : colStats) {
- long oldDV = cs.getCountDistint();
- if (affectedColumns.contains(cs.getColumnName())) {
- long newDV = oldDV;
-
- // if ratio is greater than 1, then number of rows increases. This can happen
- // when some operators like GROUPBY duplicates the input rows in which case
- // number of distincts should not change. Update the distinct count only when
- // the output number of rows is less than input number of rows.
- if (ratio <= 1.0) {
- newDV = (long) Math.ceil(ratio * oldDV);
+ if (colStats != null && !colStats.isEmpty()) {
Review Comment:
As mentioned elsewhere, let's avoid the changes in `StatsUtils` altogether and
just focus on fixing the problem in `removeSemijoinOptimizationByBenefit`. We can
follow up with other enhancements afterwards if it's really necessary.
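
For context, the lines removed in the hunk above scale a column's distinct count by the row-count ratio, and only when the row count does not grow. A minimal standalone sketch of that rule follows; the class `NdvScalingSketch` and the helper `scaleDistinctCount` are hypothetical names used for illustration and are not part of `StatsUtils`:

```java
final class NdvScalingSketch {

    // Hypothetical helper (an assumption for illustration, not a Hive API).
    // oldDV is the current distinct count; ratio is newNumRows / oldNumRows.
    static long scaleDistinctCount(long oldDV, double ratio) {
        // A ratio above 1 means the number of rows increased, e.g. because
        // an operator duplicated input rows. The distinct count is kept as-is.
        if (ratio > 1.0) {
            return oldDV;
        }
        // Otherwise shrink the distinct count proportionally, rounding up.
        return (long) Math.ceil(ratio * oldDV);
    }

    public static void main(String[] args) {
        System.out.println(scaleDistinctCount(100, 0.5));  // prints 50
        System.out.println(scaleDistinctCount(10, 0.25));  // prints 3
        System.out.println(scaleDistinctCount(100, 2.0));  // prints 100
    }
}
```

Rounding up with `Math.ceil` keeps a non-zero distinct count from collapsing to zero when the ratio is small.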
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]