neilconway opened a new pull request, #22718: URL: https://github.com/apache/datafusion/pull/22718
## Which issue does this PR close?

- Closes #22716

## Rationale for this change

#21081 capped the NDV at the row count when computing statistics for several operators. This PR extends that work and ensures that per-column statistics for filter operators are consistent with the estimated output row count. In particular:

* Null count is also capped at the row count
* Byte size is scaled down by the estimated selectivity

We also extend the analysis to consider null-rejecting predicates; for example, the clause `a = 10` as a top-level conjunct implies that the null count of the surviving rows is exactly 0.

## What changes are included in this PR?

* Ensure per-column statistics (null count, byte size) are consistent with filtered row count
* Check for null-rejecting predicates to estimate a more accurate null count of 0
* Update SLT expected plans
* Add unit tests for new behavior
* Various refactoring and comment improvements

## Are these changes tested?

Yes; new tests added.

## Are there any user-facing changes?

No.
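
For readers who want a concrete picture of the adjustments described above, here is a minimal sketch in Rust. It is not the code in this PR: `ColStats`, `adjust_for_filter`, and the `null_rejected` flag are hypothetical names invented for illustration, and the sketch uses plain integers rather than DataFusion's statistics types. It assumes selectivity is approximated by the ratio of estimated output rows to input rows.

```rust
/// Hypothetical, simplified per-column statistics.
/// This is NOT DataFusion's `ColumnStatistics`; it only illustrates the idea.
#[derive(Debug, Clone, PartialEq)]
struct ColStats {
    null_count: usize,
    distinct_count: usize,
    byte_size: usize,
}

/// Adjust a filter input's column statistics so that they are consistent
/// with the estimated number of rows that survive the filter.
///
/// `null_rejected` is an assumed flag meaning "a top-level conjunct such as
/// `a = 10` rejects NULLs in this column".
fn adjust_for_filter(
    input: &ColStats,
    input_rows: usize,
    output_rows: usize,
    null_rejected: bool,
) -> ColStats {
    // A filter never produces more rows than it consumes.
    let output_rows = output_rows.min(input_rows);

    // A null-rejecting predicate leaves no NULLs among the surviving rows;
    // otherwise the null count can be at most the output row count.
    let null_count = if null_rejected {
        0
    } else {
        input.null_count.min(output_rows)
    };

    // Scale byte size by the selectivity (output_rows / input_rows), using
    // integer arithmetic; an empty input yields a byte size of zero.
    let byte_size = if input_rows == 0 {
        0
    } else {
        (input.byte_size as u128 * output_rows as u128 / input_rows as u128) as usize
    };

    ColStats {
        null_count,
        // The distinct count (NDV) is capped at the row count as well.
        distinct_count: input.distinct_count.min(output_rows),
        byte_size,
    }
}
```

With an input of 1000 rows (500 nulls, 800 distinct values, 8000 bytes) and an estimated 100 output rows, this sketch caps both the null count and the NDV at 100 and scales the byte size to 800. If the column is constrained by a null-rejecting predicate, the null count becomes 0 instead.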
