mpurins-coralogix opened a new issue, #22173:
URL: https://github.com/apache/datafusion/issues/22173
### Describe the bug
There seems to be a regression in physical planning for nested CASE
projections with datafusion v51.
The issue shows up when planning projections that repeatedly rewrite the
same alias using self-referential CASE expressions, for example:
```
label = trans_type
label = CASE WHEN trans_type = '0' THEN 'label' ELSE label END
label = CASE WHEN trans_type = '1' THEN 'label' ELSE label END
...
label = CASE WHEN trans_type = '17' THEN 'label' ELSE label END
```
After some DataFusion 51 case refactors,`CaseExpr` hashes both the original
`CaseBody` and the derived `ProjectedCaseBody` stored in `EvalMethod`. For
nested case chains, this repeats recursive hash work heavily.
### To Reproduce
I added this test in datafusion/physical-expr/src/projection.rs
```
#[derive(Default)]
struct CountingHasher {
write_calls: usize,
bytes_written: usize,
}
impl Hasher for CountingHasher {
fn finish(&self) -> u64 {
0
}
fn write(&mut self, bytes: &[u8]) {
self.write_calls += 1;
self.bytes_written += bytes.len();
}
}
fn nested_self_referential_case_expr(
depth: usize,
schema: &SchemaRef,
) -> Result<Arc<dyn PhysicalExpr>> {
let kind = col("kind", schema)?;
let mut label = Arc::clone(&kind);
for idx in 0..depth {
let predicate = Arc::new(BinaryExpr::new(
Arc::clone(&kind),
Operator::Eq,
lit(idx.to_string()),
)) as Arc<dyn PhysicalExpr>;
label = case(None, vec![(predicate, lit("label"))], Some(label))?;
}
Ok(label)
}
#[test]
fn nested_self_referential_case_projection_hash_stays_bounded() ->
Result<()> {
let schema = Arc::new(Schema::new(vec![Field::new(
"kind",
DataType::Utf8,
true,
)]));
// Mirrors repeated projections of the same alias:
// label = CASE WHEN kind = 'N' THEN 'label' ELSE label END.
// ProjectionMapping keys projection expressions by Arc<dyn
PhysicalExpr>,
// so this hash cost is paid while planning ProjectionExec properties.
let expr = nested_self_referential_case_expr(18, &schema)?;
let mut hasher = CountingHasher::default();
expr.hash(&mut hasher);
assert!(
hasher.write_calls < 50_000,
"hashing nested CASE projection expression took {} hasher writes and
{} bytes",
hasher.write_calls,
hasher.bytes_written
);
Ok(())
}
```
Local comparison at depth 18:
```
branch hasher writes bytes
upstream/branch-51 11,534,296 72,089,862
upstream/branch-50 364 2,267
```
### Expected behavior
Planning / hashing nested case projection expressions should not blow up
with exponential complexity.
A small 16-18 layer CASE chain should stay bounded and scale roughly with
expression size, not perform millions of recursive hash calls.
### Additional context
Seems to be introduced by https://github.com/apache/datafusion/pull/18329
That commit made EvalMethod store ProjectedCaseBody while CaseExpr still
derives Hash.
So hashing CaseExpr now includes both the original CaseBody and a derived
ProjectedCaseBody. For nested CASE trees, that repeats recursive hash work.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]