MilanTyagi2004 opened a new pull request, #64206:
URL: https://github.com/apache/doris/pull/64206
### What problem does this PR solve?
Issue Number: close #64122
Problem Summary:
`ColStatsData.isValid()` applies a validation rule that assumes all
statistics originate from the same source. However, during SAMPLE ANALYZE this
assumption does not always hold:
* `min/max` values may come from a full-table scan.
* `ndv` may be estimated from sampled data.
* `nullCount` may be a scaled estimate and therefore differ from `count`.
As a result, valid sampled column statistics can be incorrectly rejected when the following condition holds:
```java
ndv == 0
&& (!isNull(minLit) || !isNull(maxLit))
&& nullCount != count
```
This PR introduces sample-aware handling by adding an `isSample` flag to
`ColStatsData`. For sampled statistics, when `ndv == 0` but non-null `min/max`
values exist, `ndv` is normalized to `1` before validation. This prevents valid
sampled statistics from being rejected while preserving the existing validation
behavior for full analyze jobs.
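
The sample-aware handling described above can be sketched as follows. This is a minimal, self-contained illustration and not the code from this PR: the class `SampleAwareColStats`, the method `normalizeForSample()`, and the string-based `isNull` helper are hypothetical stand-ins, and only the field names (`count`, `ndv`, `nullCount`, `minLit`, `maxLit`, `isSample`) follow the description.

```java
// Hypothetical stand-in for ColStatsData; not the actual Doris class.
class SampleAwareColStats {
    final long count;
    long ndv;
    final long nullCount;
    final String minLit;
    final String maxLit;
    final boolean isSample;

    SampleAwareColStats(long count, long ndv, long nullCount,
                        String minLit, String maxLit, boolean isSample) {
        this.count = count;
        this.ndv = ndv;
        this.nullCount = nullCount;
        this.minLit = minLit;
        this.maxLit = maxLit;
        this.isSample = isSample;
    }

    // Assumption: a literal counts as null when it is absent or the text "NULL".
    static boolean isNull(String lit) {
        return lit == null || lit.equalsIgnoreCase("NULL");
    }

    boolean hasMinOrMax() {
        return !isNull(minLit) || !isNull(maxLit);
    }

    // For sampled stats only: a non-null min or max implies at least one
    // distinct value, so an estimated ndv of 0 is raised to 1.
    void normalizeForSample() {
        if (isSample && ndv == 0 && hasMinOrMax()) {
            ndv = 1;
        }
    }

    // Normalizes first, then applies the rejection rule quoted in the summary.
    boolean isValid() {
        normalizeForSample();
        boolean inconsistent = ndv == 0 && hasMinOrMax() && nullCount != count;
        return !inconsistent;
    }

    public static void main(String[] args) {
        SampleAwareColStats full =
                new SampleAwareColStats(100, 0, 10, "1", "9", false);
        SampleAwareColStats sampled =
                new SampleAwareColStats(100, 0, 10, "1", "9", true);
        System.out.println("full analyze valid: " + full.isValid());
        System.out.println("sample analyze valid: " + sampled.isValid()
                + ", ndv=" + sampled.ndv);
    }
}
```

In this sketch the normalization runs only when `isSample` is set, so statistics from a full analyze job still go through the unchanged rule.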
### Release note
None
### Check List (For Author)
* Test
    * [x] Unit Test
* Behavior changed:
    * [x] No.
* Does this need documentation?
    * [x] No.
### Test Details
Added unit tests in `ColStatsDataTest` covering:
1. Full analyze statistics:
    * `ndv = 0`
    * non-null `min/max`
    * `nullCount != count`

   Expected result: validation fails.
2. Sample analyze statistics:
    * `ndv = 0`
    * non-null `min/max`
    * `nullCount != count`

   Expected result:
    * `ndv` is normalized to `1`
    * validation succeeds.