mpurins-coralogix opened a new issue, #22173:
URL: https://github.com/apache/datafusion/issues/22173

   ### Describe the bug
   
   
   There seems to be a regression in physical planning for nested CASE projections in DataFusion v51.
   
   The issue shows up when planning projections that repeatedly rewrite the 
same alias using self-referential CASE expressions, for example:
   
   ```
    label = trans_type
    label = CASE WHEN trans_type = '0' THEN 'label' ELSE label END
    label = CASE WHEN trans_type = '1' THEN 'label' ELSE label END
    ...
    label = CASE WHEN trans_type = '17' THEN 'label' ELSE label END
   ```
   
   After the CASE refactoring in DataFusion 51, `CaseExpr` hashes both the original `CaseBody` and the derived `ProjectedCaseBody` stored in `EvalMethod`. For nested CASE chains this duplicates the recursive hash work at every level.
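   To see why this compounds, here is a toy model (not DataFusion's real types; `Node`, `body`, and `projected` are stand-ins for a nested `CaseExpr`, its original `CaseBody`, and the derived `ProjectedCaseBody`): when a derived `Hash` visits both the original child and a derived copy of it, every extra nesting level doubles the hashing work.

   ```rust
    use std::hash::{Hash, Hasher};

    // Toy stand-ins: `body` plays the role of the original CaseBody (the
    // ELSE branch holding the nested CASE) and `projected` the derived copy
    // that EvalMethod stores. The derived Hash impl visits both.
    #[derive(Clone, Hash)]
    struct Node {
        body: Option<Box<Node>>,
        projected: Option<Box<Node>>,
    }

    // Hasher that only counts how often `write` is called.
    struct CountingHasher {
        write_calls: usize,
    }

    impl Hasher for CountingHasher {
        fn finish(&self) -> u64 {
            0
        }
        fn write(&mut self, _bytes: &[u8]) {
            self.write_calls += 1;
        }
    }

    // Build a chain of `depth` nested nodes where every level keeps the
    // previous level twice: once as the body, once as the derived copy.
    fn chain(depth: usize) -> Node {
        let mut node = Node { body: None, projected: None };
        for _ in 0..depth {
            node = Node {
                body: Some(Box::new(node.clone())),
                projected: Some(Box::new(node)),
            };
        }
        node
    }

    fn count_writes(node: &Node) -> usize {
        let mut hasher = CountingHasher { write_calls: 0 };
        node.hash(&mut hasher);
        hasher.write_calls
    }

    fn main() {
        let w4 = count_writes(&chain(4));
        let w8 = count_writes(&chain(8));
        // Four extra levels multiply the writes by roughly 2^4.
        println!("depth 4: {w4} writes, depth 8: {w8} writes");
        assert!(w8 > 10 * w4);
    }
   ```

   This is the same doubling-per-level pattern the depth-18 repro below runs into.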
   
   ### To Reproduce
   
   I added this test in datafusion/physical-expr/src/projection.rs
   
   ```rust
   #[derive(Default)]
   struct CountingHasher {
       write_calls: usize,
       bytes_written: usize,
   }
   
   impl Hasher for CountingHasher {
       fn finish(&self) -> u64 {
           0
       }
   
       fn write(&mut self, bytes: &[u8]) {
           self.write_calls += 1;
           self.bytes_written += bytes.len();
       }
   }
   
   fn nested_self_referential_case_expr(
       depth: usize,
       schema: &SchemaRef,
   ) -> Result<Arc<dyn PhysicalExpr>> {
       let kind = col("kind", schema)?;
       let mut label = Arc::clone(&kind);
   
       for idx in 0..depth {
           let predicate = Arc::new(BinaryExpr::new(
               Arc::clone(&kind),
               Operator::Eq,
               lit(idx.to_string()),
           )) as Arc<dyn PhysicalExpr>;
   
           label = case(None, vec![(predicate, lit("label"))], Some(label))?;
       }
   
       Ok(label)
   }
   
   #[test]
    fn nested_self_referential_case_projection_hash_stays_bounded() -> Result<()> {
       let schema = Arc::new(Schema::new(vec![Field::new(
           "kind",
           DataType::Utf8,
           true,
       )]));
   
       // Mirrors repeated projections of the same alias:
       // label = CASE WHEN kind = 'N' THEN 'label' ELSE label END.
    // ProjectionMapping keys projection expressions by Arc<dyn PhysicalExpr>,
       // so this hash cost is paid while planning ProjectionExec properties.
       let expr = nested_self_referential_case_expr(18, &schema)?;
   
       let mut hasher = CountingHasher::default();
       expr.hash(&mut hasher);
   
       assert!(
           hasher.write_calls < 50_000,
        "hashing nested CASE projection expression took {} hasher writes and {} bytes",
           hasher.write_calls,
           hasher.bytes_written
       );
   
       Ok(())
   }
   ```
   
   Local comparison at depth 18:
   ```
   branch                hasher writes        bytes
   upstream/branch-51       11,534,296   72,089,862
   upstream/branch-50              364        2,267
   ```
   
   ### Expected behavior
   
   Planning / hashing nested case projection expressions should not blow up 
with exponential complexity.
   
   A small 16-18 layer CASE chain should stay bounded and scale roughly with 
expression size, not perform millions of recursive hash calls.
   
   ### Additional context
   
   This seems to have been introduced by https://github.com/apache/datafusion/pull/18329
   
   That PR made `EvalMethod` store a `ProjectedCaseBody` while `CaseExpr` still derives `Hash`.
   
   Hashing a `CaseExpr` therefore traverses both the original `CaseBody` and the derived `ProjectedCaseBody`. For nested CASE trees, every level repeats the recursive hash work, which compounds exponentially.
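   One possible direction, as a sketch only (the field and type layout below is an assumption for illustration, not `CaseExpr`'s actual definition), is to replace the derived `Hash` with a manual impl that skips the derived state, since two expressions with equal bodies always derive equal `EvalMethod` state.

   ```rust
    use std::hash::{Hash, Hasher};

    // Illustrative stand-ins; the real DataFusion fields differ.
    #[derive(Hash)]
    struct CaseBody {
        when_then: Vec<(String, String)>, // placeholder for WHEN/THEN pairs
        else_expr: Option<String>,        // placeholder for the ELSE branch
    }

    struct EvalMethod {
        _projected: Option<CaseBody>, // stands in for the derived ProjectedCaseBody
    }

    struct CaseExpr {
        body: CaseBody,
        eval_method: EvalMethod,
    }

    // Manual Hash impl: hash only the original body. Because the eval method
    // is derived from the body, equal bodies still hash equally, and the
    // hashing cost scales with the expression size once instead of doubling
    // at every nesting level.
    impl Hash for CaseExpr {
        fn hash<H: Hasher>(&self, state: &mut H) {
            self.body.hash(state);
            // self.eval_method deliberately not hashed
        }
    }
   ```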


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

