morningman commented on PR #60197:
URL: https://github.com/apache/doris/pull/60197#issuecomment-3793531983
⏺ Code Review: PR #60197 - Fix Parquet Reader Cannot Push Down Conjuncts for
Min-Max Filter
Overview
This PR refactors the predicate pushdown mechanism for Parquet files,
specifically addressing how min-max filters and runtime predicates are handled.
The main changes include:
1. Simplified predicate handling: Removes the complex late runtime filter
callback mechanism from ParquetReader
2. Enhanced predicate evaluation: Adds missing evaluate_and methods for
Parquet-specific column statistics in multiple predicate classes
3. Runtime predicate refactoring: Removes tablet_schema dependency from
RuntimePredicate, replacing it with direct DataType storage
4. Improved filter handling: Adds proper handling for null statistics and
row group ranges
Analysis
Strengths
1. Architectural improvement: Removing the late RF callback mechanism
(_call_late_rf_func) simplifies the code flow and reduces complexity
2. Consistent null handling: Properly handles is_all_null cases in
comparison and in-list predicates (comparison_predicate.h:201,
in_list_predicate.h:285)
3. Better abstraction: Using DataTypePtr instead of TabletSchema in
RuntimePredicate is cleaner and more appropriate for external tables
Critical Issues
1. Missing Row Group Range Addition
Severity: High
In multiple files, when get_stat_func returns false (indicating no page
index stats), the code now adds row_group_range to include all rows:
```
if (!(statistic->get_stat_func)(&stat, column_id())) {
row_ranges->add(statistic->row_group_range);
return true;
}
```
Files affected: comparison_predicate.h:229, in_list_predicate.h:313,
null_predicate.h:99, shared_predicate.h:184
Issue: The original behavior when stats are unavailable was to return true
(conservative - read everything). Now it explicitly adds the row group range.
While semantically similar, this changes behavior and could have performance
implications if the calling code doesn't expect row ranges to be modified when
stats are missing.
Recommendation: Verify this behavioral change is intentional and document
why row ranges should be explicitly added when stats are unavailable.
2. Removed Code in tablet_reader.cpp
Severity: Medium
Lines 527-533 in tablet_reader.cpp removed the registration of TopN filter
tablet schemas:
```
- for (int id : read_params.topn_filter_source_node_ids) {
- auto& runtime_predicate =
-
read_params.runtime_state->get_query_ctx()->get_runtime_predicate(id);
-
RETURN_IF_ERROR(runtime_predicate.set_tablet_schema(read_params.topn_filter_target_node_id,
-
_tablet_schema));
- }
```
Issue: This code was called for internal tablet reads (OLAP engine). The
removal suggests TopN filters for internal tables might be broken, or this was
dead code. The PR description doesn't explain this change.
Recommendation: Clarify whether this affects internal table TopN
filtering. If it does, add test coverage.
3. AcceptNullPredicate Row Range Logic
Severity: Medium
In accept_null_predicate.h:110-121, the new evaluate_and implementation
adds rows where has_null[page_id] is true:
```
for (int page_id = 0; page_id < stat->num_of_pages; page_id++) {
if (stat->has_null[page_id]) {
row_ranges->add(stat->ranges[page_id]);
}
}
```
Issue: This only adds pages with nulls but doesn't call the nested
predicate's evaluate_and. The logic should be:
- Add pages that match the nested predicate OR have nulls
- Currently, it only adds null pages regardless of nested predicate result
Recommendation: The implementation appears incomplete. It should combine
results from the nested predicate with null page ranges.
4. Removed Lazy Read Context Update
Severity: High
vparquet_reader.cpp removed significant logic including:
- Runtime filter pushdown (lines 450-478)
- Late arrival RF handling (lines 714-718)
- The entire _update_lazy_read_ctx method refactoring
Issue: The PR claims to "fix parquet reader cannot push down conjuncts"
but actually removes pushdown logic for:
- Binary predicates (GE, LE) from runtime filters
- IN predicates from runtime filters
- Late-arriving runtime filters
Recommendation: This appears contradictory to the PR title. Need
clarification on whether this is:
a) Moving pushdown logic elsewhere (where?)
b) Intentionally removing broken logic (why?)
c) A regression
Minor Issues
1. Inconsistent Error Handling: comparison_predicate.h:198 and
in_list_predicate.h:282 now check is_all_null before parsing min/max, but this
check wasn't in the original code. Verify this is correct for all Parquet files.
2. Missing Documentation: The PR description is incomplete - no problem
summary, no release note justification, no test plan.
3. Unused Code Path: file_scan_operator.h:69-73 adds _push_down_topn that
always returns true, with a comment saying the reader will determine if it can
be pushed. But the reader logic was just removed.
Missing Elements
1. No Tests: PR checklist shows no test coverage mentioned
2. No Problem Statement: The PR description is empty
3. No Verification: How was this tested? What queries benefit?
4. Breaking Changes: Potential behavior changes for internal tables (TopN
filters) not addressed
Recommendations
Must Fix Before Merge
1. Add comprehensive test coverage including:
- Parquet files with/without page indexes
- Null handling in various predicate types
- TopN filter behavior for both internal and external tables
- Runtime filter pushdown scenarios
2. Fill out PR description explaining:
- What bug this fixes (with example)
- Why removing pushdown logic is correct
- Impact on existing functionality
3. Fix AcceptNullPredicate logic - the row range addition appears incorrect
4. Verify tablet_reader.cpp changes - ensure TopN filters still work for
internal tables
Should Consider
1. Add comments explaining why row group ranges are added when stats are
unavailable
2. Document the behavior change from callback-based to upfront predicate
evaluation
3. Consider whether _push_down_topn returning true unconditionally is safe
Risk Assessment
Overall Risk: High
- Significant removal of existing functionality (RF pushdown, late arrival
handling)
- Potential regression in query performance for Parquet files
- Possible breaking change for internal table TopN filters
- No test coverage to validate correctness
- PR description provides no context for reviewing safety
Recommendation: Request changes - This PR needs substantial clarification,
testing, and potentially rework before merge.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]