xudong963 commented on PR #21637: URL: https://github.com/apache/datafusion/pull/21637#issuecomment-4394407082
> > @alamb thanks for the review. Before getting the PR in, I think it would be better for you to look at the comment [#21637 (comment)](https://github.com/apache/datafusion/pull/21637#discussion_r3156327107) and its fix commit: [da7db27](https://github.com/apache/datafusion/commit/da7db27a6b51345991d67907b9985a0d67224153) (this is the lowest cost way I found to fix the metric. Let me know if you have other thoughts)
>
> Maybe we should just add a new metric on ParquetScanMetrics 🤔
>
> https://github.com/apache/datafusion/blob/4c909bafc5c50749884fdd80a06235d7bd72dbde/datafusion/datasource-parquet/src/metrics.rs#L30

Thanks @alamb, I agree that adding a separate metric is cleaner.

I changed the PR in https://github.com/apache/datafusion/pull/21637/commits/3f2401e0b422e2ddb590660626fc1716c84a22ae to keep `page_index_pages_pruned` reporting only pages that were actually evaluated by page-index pruning, and added `page_index_pages_skipped_by_fully_matched` for pages where page-index pruning was skipped because row-group statistics already proved the row group was fully matched.

For example, the metrics can now look like:

```text
row_groups_pruned_statistics=4 total → 3 matched -> 1 fully matched,
page_index_pages_pruned=2 total → 2 matched,
page_index_pages_skipped_by_fully_matched=1
```

I would read this as:

1. row-group statistics evaluated 4 row groups, 3 matched, and 1 of those was fully matched;
2. page-index pruning actually evaluated 2 pages, and both matched;
3. 1 additional page belonged to the fully matched row group, so page-index pruning was skipped for that page. The page is still scanned; only page-index predicate evaluation was skipped.

This avoids counting statistics-derived fully matched pages as page-index matched pages.
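
To make the accounting above concrete, here is a minimal, self-contained sketch of how the two page counters relate. It is an illustration only and not the code in the PR: `PageMetricsSketch`, `RowGroupSketch`, `account_pages`, and `example_metrics` are hypothetical names, and the real counters live in DataFusion's Parquet metrics types with a different shape. The sketch assumes each row group that survives row-group pruning is either fully matched (page-index evaluation skipped) or has every page evaluated by the page index.

```rust
/// Hypothetical, simplified counters mirroring the metric names in this comment.
/// These are NOT the real DataFusion metric types.
#[derive(Debug, Default, PartialEq)]
struct PageMetricsSketch {
    /// Pages actually evaluated by page-index pruning.
    page_index_pages_pruned_total: usize,
    /// Evaluated pages that the page index could not rule out.
    page_index_pages_pruned_matched: usize,
    /// Pages whose page-index evaluation was skipped because the
    /// enclosing row group was already proven fully matched.
    page_index_pages_skipped_by_fully_matched: usize,
}

/// Hypothetical view of a row group that survived row-group pruning.
enum RowGroupSketch {
    /// Row-group statistics proved every row matches the predicate.
    FullyMatched { page_count: usize },
    /// Page-index pruning must look at each page; `true` means "may match".
    NeedsPageIndex { page_matches: Vec<bool> },
}

/// Attribute every page to exactly one of the two metrics.
fn account_pages(row_groups: &[RowGroupSketch]) -> PageMetricsSketch {
    let mut m = PageMetricsSketch::default();
    for rg in row_groups {
        match rg {
            RowGroupSketch::FullyMatched { page_count } => {
                // Still scanned later; only predicate evaluation is skipped.
                m.page_index_pages_skipped_by_fully_matched += *page_count;
            }
            RowGroupSketch::NeedsPageIndex { page_matches } => {
                for &matched in page_matches {
                    m.page_index_pages_pruned_total += 1;
                    if matched {
                        m.page_index_pages_pruned_matched += 1;
                    }
                }
            }
        }
    }
    m
}

/// The three matched row groups from the example, assuming one page each.
fn example_metrics() -> PageMetricsSketch {
    account_pages(&[
        RowGroupSketch::NeedsPageIndex { page_matches: vec![true] },
        RowGroupSketch::NeedsPageIndex { page_matches: vec![true] },
        RowGroupSketch::FullyMatched { page_count: 1 },
    ])
}
```

With one page per row group (an assumption made only for this sketch), `example_metrics()` yields 2 evaluated pages, 2 matched pages, and 1 skipped page, which lines up with the example metrics line. The point of keeping the counters separate is that the fully matched page never inflates the page-index matched count.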
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at: [email protected]
