comphead commented on PR #21075:
URL: https://github.com/apache/datafusion/pull/21075#issuecomment-4305772733

   I used Claude to make another round of review and there are some findings 
worth to address
   
   ```
     Performance Issues                                                         
                                                                                
                                              
                                                                                
                                                                                
                                              
     P1: plan.exists() pre-check — tradeoff, not a clear win                    
                                                                                
                                              
                                                                                
                                                                                
                                              
     File: unions_to_filter.rs:62-68                                            
                                                                                
                                              
                                                                                
                                                                                
                                              
     The pre-check walks the entire tree to look for Distinct::All, then 
rewrite_with_subqueries walks it again. For plans with Distinct::All (the 
rewrite target), this doubles traversal cost. For plans    
     without it, the pre-check is cheaper than spinning up the full rewriter. 
Net value depends on workload mix. Worth benchmarking with/without.
                                                                                
                                                                                
                                              
     P2: HashMap hashing entire LogicalPlan subtrees                            
                                                                                
                                              
      
     File: unions_to_filter.rs:109-128                                          
                                                                                
                                              
                     
     GroupKey contains a full LogicalPlan + Vec<Wrapper>. Every HashMap 
insert/lookup recursively hashes the entire plan tree. For 128 union branches 
(as in the benchmarks), this is O(N * tree_size) hash   
     work. For typical 2-10 branches, a Vec with linear scan + PartialEq would 
be faster and avoids the hashing cost entirely.
                                                                                
                                                                                
                                              
     P3: Possible double-projection from coerce + align                         
                                                                                
                                              
      
     File: unions_to_filter.rs:147-148                                          
                                                                                
                                              
                     
     coerce_plan_expr_for_schema may insert a Projection, then 
align_plan_to_schema may insert another one on top. In practice, 
align_plan_to_schema short-circuits if the schema already matches (line 299), 
     so this is likely benign. But it's worth verifying that coercion always 
produces a matching schema, or merging the two into one pass.
                                                                                
                                                                                
                                              
     P4: Recursive strip_passthrough_nodes (line 283)                           
                                                                                
                                              
      
     Should be a loop to avoid stack overflow on deeply nested plans and to 
reduce frame overhead. Simple mechanical fix.                                   
                                                  
                     
     P5: inner.clone() (line 83) and other.clone() (line 209)                   
                                                                                
                                              
                     
     Already discussed — both are unnecessary clones.                           
                                                                                
                                              
                     
     ---                                                                        
                                                                                
                                              
     Design / Correctness Concerns
                                                                                
                                                                                
                                              
     D1: Early bail-out abandons entire UNION on any failed branch
                                                                                
                                                                                
                                              
     File: unions_to_filter.rs:114-116                                          
                                                                                
                                              
                                                                                
                                                                                
                                              
     If one branch can't be extracted (has LIMIT, SORT, volatile), the entire 
UNION is left untouched — even if other branches could be merged. Example: 10 
branches from same table with filters + 1 branch  
     with LIMIT = no optimization at all. A partial-merge strategy would be 
more effective, though more complex.
                                                                                
                                                                                
                                              
     D2: wrapper_projections_are_safe may miss window functions                 
                                                                                
                                              
      
     File: unions_to_filter.rs:332-339                                          
                                                                                
                                              
                     
     Checks for volatility and subqueries but not window functions. However, in 
DataFusion's plan representation, window functions get their own Window node — 
they don't appear as expressions inside        
     Projection. So this is likely not a real issue in practice, but worth 
confirming with a test.
                                                                                
                                                                                
                                              
     D3: SubqueryAlias schema recomputed in wrap_branch                         
                                                                                
                                              
      
     File: unions_to_filter.rs:275-277                                          
                                                                                
                                              
                     
     wrap_branch uses SubqueryAlias::try_new() which recomputes the schema from 
the new input, discarding the stored schema from peel_wrappers. Since the 
source table is the same and only the filter        
     changed, the recomputed schema should match. Likely benign but a subtle 
coupling worth documenting.
                                                                                
                                                                                
                                              
     D4: GroupKey fragility with LogicalPlan as HashMap key                     
                                                                                
                                              
      
     If any upstream optimizer rule mutates plan internals (e.g., pushdown 
flags) between branches, two "same" sources could hash differently. Currently 
safe since the rule runs early in the pipeline, but  
     this is a latent risk.
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to