xudong963 commented on PR #21637:
URL: https://github.com/apache/datafusion/pull/21637#issuecomment-4394407082

   > > @alamb thanks for the review. Before getting the PR in, I think it's better for you to take a look at the comment [#21637 (comment)](https://github.com/apache/datafusion/pull/21637#discussion_r3156327107) and its fix commit [da7db27](https://github.com/apache/datafusion/commit/da7db27a6b51345991d67907b9985a0d67224153) (this is the lowest-cost way I found to fix the metric. Let me know if you have other thoughts)
   > 
   > Maybe we should just add a new metric on ParquetScanMetrics 🤔
   > 
   > https://github.com/apache/datafusion/blob/4c909bafc5c50749884fdd80a06235d7bd72dbde/datafusion/datasource-parquet/src/metrics.rs#L30
   
   Thanks @alamb, I agree that adding a separate metric is cleaner.
   
   I updated the PR in commit https://github.com/apache/datafusion/pull/21637/commits/3f2401e0b422e2ddb590660626fc1716c84a22ae so that `page_index_pages_pruned` reports only pages that page-index pruning actually evaluated, and added `page_index_pages_skipped_by_fully_matched` for pages where page-index pruning was skipped because row-group statistics had already proved the row group fully matched.
   
   For example, the metrics can now look like:
   
   ```text
   row_groups_pruned_statistics=4 total → 3 matched -> 1 fully matched,
   page_index_pages_pruned=2 total → 2 matched,
   page_index_pages_skipped_by_fully_matched=1
   ```
   
   I would read this as:
   1. row-group statistics evaluated 4 row groups, 3 matched, and 1 of those 
was fully matched;
   2. page-index pruning actually evaluated 2 pages, and both matched;
   3. 1 additional page belonged to the fully matched row group, so page-index 
pruning was skipped for that page. The page is still scanned; only page-index 
predicate evaluation was skipped.
   
   This avoids counting statistics-derived fully matched pages as page-index 
matched pages.
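
To make the accounting concrete, here is a minimal sketch of the counting rule described above. It is not the code from the PR: `RowGroup`, `PageIndexCounts`, `count_page_index_metrics`, and `example_row_groups` are hypothetical names used only for illustration, and the sketch assumes one page per row group so that it reproduces the numbers in the example.

```rust
/// Hypothetical per-row-group summary; this is NOT a DataFusion type.
#[derive(Debug, Clone, Copy)]
struct RowGroup {
    /// Number of data pages in the row group.
    pages: usize,
    /// Row-group statistics could not rule this row group out.
    matched: bool,
    /// Row-group statistics proved that every row satisfies the predicate.
    fully_matched: bool,
    /// Pages kept by page-index pruning (only meaningful when it actually runs).
    pages_kept_by_page_index: usize,
}

/// Hypothetical counters mirroring the two metrics discussed above.
#[derive(Debug, Default, PartialEq, Eq)]
struct PageIndexCounts {
    /// "total" part of `page_index_pages_pruned`: pages actually evaluated.
    pages_total: usize,
    /// "matched" part of `page_index_pages_pruned`.
    pages_matched: usize,
    /// `page_index_pages_skipped_by_fully_matched`.
    pages_skipped_by_fully_matched: usize,
}

/// Applies the counting rule: fully matched row groups bypass page-index
/// evaluation, so their pages go into the separate "skipped" counter.
fn count_page_index_metrics(row_groups: &[RowGroup]) -> PageIndexCounts {
    let mut counts = PageIndexCounts::default();
    for rg in row_groups {
        if !rg.matched {
            // Pruned by row-group statistics: its pages never reach the page index.
            continue;
        }
        if rg.fully_matched {
            // The pages are still scanned, but no page-index predicate is evaluated.
            counts.pages_skipped_by_fully_matched += rg.pages;
        } else {
            counts.pages_total += rg.pages;
            counts.pages_matched += rg.pages_kept_by_page_index;
        }
    }
    counts
}

/// The scenario from the example: 4 row groups, 3 matched, 1 of those fully matched.
fn example_row_groups() -> Vec<RowGroup> {
    vec![
        RowGroup { pages: 1, matched: false, fully_matched: false, pages_kept_by_page_index: 0 },
        RowGroup { pages: 1, matched: true, fully_matched: false, pages_kept_by_page_index: 1 },
        RowGroup { pages: 1, matched: true, fully_matched: false, pages_kept_by_page_index: 1 },
        RowGroup { pages: 1, matched: true, fully_matched: true, pages_kept_by_page_index: 0 },
    ]
}
```

Calling `count_page_index_metrics(&example_row_groups())` gives 2 evaluated pages, 2 matched pages, and 1 page skipped because its row group was fully matched, which lines up with the metrics shown above.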
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

