comphead commented on PR #21075: URL: https://github.com/apache/datafusion/pull/21075#issuecomment-4305772733
I used Claude to make another round of review and there are some findings
worth to address
```
Performance Issues
P1: plan.exists() pre-check — tradeoff, not a clear win
File: unions_to_filter.rs:62-68
The pre-check walks the entire tree to look for Distinct::All, then
rewrite_with_subqueries walks it again. For plans with Distinct::All (the
rewrite target), this doubles traversal cost. For plans
without it, the pre-check is cheaper than spinning up the full rewriter.
Net value depends on workload mix. Worth benchmarking with/without.
P2: HashMap hashing entire LogicalPlan subtrees
File: unions_to_filter.rs:109-128
GroupKey contains a full LogicalPlan + Vec<Wrapper>. Every HashMap
insert/lookup recursively hashes the entire plan tree. For 128 union branches
(as in the benchmarks), this is O(N * tree_size) hash
work. For typical 2-10 branches, a Vec with linear scan + PartialEq would
be faster and avoids the hashing cost entirely.
P3: Possible double-projection from coerce + align
File: unions_to_filter.rs:147-148
coerce_plan_expr_for_schema may insert a Projection, then
align_plan_to_schema may insert another one on top. In practice,
align_plan_to_schema short-circuits if the schema already matches (line 299),
so this is likely benign. But it's worth verifying that coercion always
produces a matching schema, or merging the two into one pass.
P4: Recursive strip_passthrough_nodes (line 283)
Should be a loop to avoid stack overflow on deeply nested plans and to
reduce frame overhead. Simple mechanical fix.
P5: inner.clone() (line 83) and other.clone() (line 209)
Already discussed — both are unnecessary clones.
---
Design / Correctness Concerns
D1: Early bail-out abandons entire UNION on any failed branch
File: unions_to_filter.rs:114-116
If one branch can't be extracted (has LIMIT, SORT, volatile), the entire
UNION is left untouched — even if other branches could be merged. Example: 10
branches from same table with filters + 1 branch
with LIMIT = no optimization at all. A partial-merge strategy would be
more effective, though more complex.
D2: wrapper_projections_are_safe may miss window functions
File: unions_to_filter.rs:332-339
Checks for volatility and subqueries but not window functions. However, in
DataFusion's plan representation, window functions get their own Window node —
they don't appear as expressions inside
Projection. So this is likely not a real issue in practice, but worth
confirming with a test.
D3: SubqueryAlias schema recomputed in wrap_branch
File: unions_to_filter.rs:275-277
wrap_branch uses SubqueryAlias::try_new() which recomputes the schema from
the new input, discarding the stored schema from peel_wrappers. Since the
source table is the same and only the filter
changed, the recomputed schema should match. Likely benign but a subtle
coupling worth documenting.
D4: GroupKey fragility with LogicalPlan as HashMap key
If any upstream optimizer rule mutates plan internals (e.g., pushdown
flags) between branches, two "same" sources could hash differently. Currently
safe since the rule runs early in the pipeline, but
this is a latent risk.
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
