[PR] [Bug] Fix invalid validation for sampled column statistics with ndv=0 and non-null min/max [doris]

via GitHub Sun, 07 Jun 2026 23:46:38 -0700


MilanTyagi2004 opened a new pull request, #64206:
URL: https://github.com/apache/doris/pull/64206


   ### What problem does this PR solve?
   
   Issue Number: close #64122
   
   Problem Summary:
   
   `ColStatsData.isValid()` applies a validation rule that assumes all
   statistics originate from the same source. However, during SAMPLE ANALYZE
   this assumption does not always hold:
   
   * `min/max` values may come from a full-table scan.
   * `ndv` may be estimated from sampled data.
   * `nullCount` may be a scaled estimate and therefore differ from `count`.
   
   As a result, sampled column statistics can be incorrectly rejected when:
   
   ```java
   ndv == 0
   && (!isNull(minLit) || !isNull(maxLit))
   && nullCount != count
   ```
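   
   The rejecting condition above can be sketched as a standalone predicate.
   This is only an illustration: the class `StrictColStatsRule`, its method
   signatures, and the treatment of the string `"NULL"` as a null literal are
   assumptions made for the example, not the actual `ColStatsData` code.
   
   ```java
   // Hypothetical stand-in for the rule described above; not the Doris source.
   final class StrictColStatsRule {
       private StrictColStatsRule() {
       }
   
       // Assumption: a missing literal or the text "NULL" counts as null.
       static boolean isNull(String literal) {
           return literal == null || literal.equalsIgnoreCase("NULL");
       }
   
       // Returns false only for the combination quoted in the PR description:
       // ndv == 0, at least one non-null min/max literal, and nullCount != count.
       static boolean isValid(long count, long ndv, long nullCount,
                              String minLit, String maxLit) {
           boolean hasMinOrMax = !isNull(minLit) || !isNull(maxLit);
           boolean contradictory = ndv == 0 && hasMinOrMax && nullCount != count;
           return !contradictory;
       }
   }
   ```
   
   With this sketch, `StrictColStatsRule.isValid(100, 0, 40, "1", "9")` is
   rejected, while the same input with `nullCount == 100` passes.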
   
   This PR introduces sample-aware handling by adding an `isSample` flag to
   `ColStatsData`. For sampled statistics, when `ndv == 0` but a non-null `min`
   or `max` value exists, `ndv` is normalized to `1` before validation. This
   prevents valid sampled statistics from being rejected while preserving the
   existing validation behavior for full analyze jobs.
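   
   A minimal sketch of the sample-aware behavior described above, assuming the
   normalization happens inside the validity check. The class
   `SampleAwareColStats`, its fields, and its constructor are hypothetical and
   exist only for illustration; the real change lives in `ColStatsData` and
   may be structured differently.
   
   ```java
   // Hypothetical model of sample-aware validation; not the Doris source.
   final class SampleAwareColStats {
       final long count;
       final long nullCount;
       final String minLit;
       final String maxLit;
       final boolean isSample;
       long ndv; // mutable so it can be normalized for sampled statistics
   
       SampleAwareColStats(long count, long ndv, long nullCount,
                           String minLit, String maxLit, boolean isSample) {
           this.count = count;
           this.ndv = ndv;
           this.nullCount = nullCount;
           this.minLit = minLit;
           this.maxLit = maxLit;
           this.isSample = isSample;
       }
   
       // Assumption: a missing literal or the text "NULL" counts as null.
       private static boolean isNull(String literal) {
           return literal == null || "NULL".equalsIgnoreCase(literal);
       }
   
       boolean isValid() {
           boolean hasMinOrMax = !isNull(minLit) || !isNull(maxLit);
           // Sampled stats: a non-null min or max implies at least one distinct value.
           if (isSample && ndv == 0 && hasMinOrMax) {
               ndv = 1;
           }
           // Full-analyze stats still hit the original strict rule.
           return !(ndv == 0 && hasMinOrMax && nullCount != count);
       }
   }
   ```
   
   In this sketch the same inputs (`ndv = 0`, non-null `min/max`,
   `nullCount != count`) fail validation when `isSample` is `false` and pass,
   with `ndv` set to `1`, when `isSample` is `true`.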
   
   ### Release note
   
   None
   
   ### Check List (For Author)
   
   * Test
   
     * [x] Unit Test
   
   * Behavior changed:
   
     * [x] No.
   
   * Does this need documentation?
   
     * [x] No.
   
   ### Test Details
   
   Added unit tests in `ColStatsDataTest` covering:
   
   1. Full analyze statistics:
   
      * `ndv = 0`
      * non-null `min/max`
      * `nullCount != count`
   
      Expected result: validation fails.
   
   2. Sample analyze statistics:
   
      * `ndv = 0`
      * non-null `min/max`
      * `nullCount != count`
   
      Expected result:
   
      * `ndv` is normalized to `1`
      * validation succeeds.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

