Dandandan opened a new issue, #22185:
URL: https://github.com/apache/datafusion/issues/22185
### Describe the bug
A SQL statement containing `regexp_like(..., 'a{5}{5}{5}{5}{5}{5}{5}{5}')`
(8+ nested counted-repetition quantifiers) takes >1 second to *plan* (logical +
physical planning, no execution required). At higher nesting, planning still
takes ~1 s because the regex engine hits a built-in size limit and bails — but
the limit itself takes ~1 s to reach.
This is a planning-time DoS: a 60-byte SQL string from an untrusted source
can stall a query node for a second or more before any data is touched.
### To Reproduce
Using the public `SessionContext::sql` API (no execution required):
```rust
use datafusion::prelude::SessionContext;
#[tokio::main]
async fn main() {
let ctx = SessionContext::new();
let sql = "SELECT regexp_like('aaaaa', 'a{5}{5}{5}{5}{5}{5}{5}{5}');";
let t0 = std::time::Instant::now();
let df = ctx.sql(sql).await.unwrap();
let _ = df.create_physical_plan().await;
println!("planning took {:?}", t0.elapsed());
}
```
Output on M-series Mac, release build:
```
planning took ~1.0 s
```
Scaling with nesting count `N` (measured via the fuzz target's
`DATAFUSION_FUZZ_PLANNING_TIMEOUT_MS`):
| `{5}` nestings | planning time |
|---:|---:|
| 4 | 6 ms |
| 6 | 17 ms |
| 8 | **1008 ms** (regex size limit hit) |
| 10 | 1005 ms |
| 12 | 757 ms |
| 18 | ~1 s (original fuzzer find) |
Note that `a{5}{5}` is treated as `(a{5}){5}` (or similar) — the regex crate
expands the inner counted repetition before noticing the size, so the cost is
roughly exponential up to the crate's `size_limit` (~10 MB by default).
### Expected behavior
Planning should either:
- reject the pattern in <1 ms with a "regex too complex" error, or
- defer regex compilation entirely to execution time, leaving planning fast.
A 60-byte SQL string should not be able to consume a CPU-second of planning
work.
### Additional context
Discovered by a `cargo fuzz` target
(`fuzz/fuzz_targets/sql_physical_plan.rs`) seeded with SQL extracted from
`datafusion/sqllogictest/test_files/`. The fuzz harness panics if planning
exceeds 1 s, which is how this surfaced:
```
physical planning exceeded 1000 ms after 1041 ms for SQL:
SELECT regexp_like('aaaaa',
'a{5}{5}{5}{5}{5}{5}{5}{5}{5}{5}{5}{5}{5}{5}{5}{5}{5}{5});
```
Probable root cause is the simplification pass constant-folding
`regexp_like` literals via `regex::Regex::new`, which is
O(product-of-quantifiers) on patterns like `a{5}{5}…`. The size limit in the
`regex` crate eventually trips, but at a wall-time cost of ~1 s for the
smallest triggering input.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]