Dandandan opened a new issue, #22187:
URL: https://github.com/apache/datafusion/issues/22187

   ### Describe the bug
   
   Setting any `datafusion.runtime.*` runtime variable to a string value that 
ends in a non-ASCII (multi-byte UTF-8) character panics inside `core::str` 
instead of returning a planner error. Found via libFuzzer.
   
   ### To Reproduce
   
   ```rust
   use datafusion::prelude::SessionContext;
   
   #[tokio::main]
   async fn main() {
       let ctx = SessionContext::new();
       let _ = ctx.sql("SET datafusion.runtime.max_temp_directory_size = 
'é'").await;
   }
   ```
   
   Panic:
   
   ```
   thread 'main' panicked at .../core/src/str/mod.rs:846:21:
   end byte index 1 is not a char boundary; it is inside 'é' (bytes 0..2 of 
string)
   ```
   
   All five reachable runtime config keys panic the same way (one for each 
callsite below). The single byte `0xCD` (or any other UTF-8 continuation byte) 
is enough to trigger it, but a real non-ASCII char like `é`, `Δ`, or `字` works 
just as well.
   
   | SET key | min repro value |
   |---|---|
   | `memory_limit` | `'é'` |
   | `max_temp_directory_size` | `'é'` |
   | `metadata_cache_limit` | `'é'` |
   | `list_files_cache_limit` | `'é'` |
   | `list_files_cache_ttl` | `'1mé'` |
   
   ### Expected behavior
   
   A `plan_err!` (or similar `Err`) explaining the unsupported unit / malformed 
value. The public SQL API should never panic on user-supplied SQL.
   
   ### Root cause
   
   
[`datafusion/core/src/execution/context/mod.rs`](https://github.com/apache/datafusion/blob/main/datafusion/core/src/execution/context/mod.rs)
 — three callers of `split_at(s.len() - 1)` that assume the last byte is its 
own char:
   
   ```rust
   // line 1259  parse_memory_limit  (deprecated but pub, still has the bug)
   // line 1302  parse_capacity_limit
   // line 1336  parse_duration
   let (number, unit) = limit.split_at(limit.len() - 1);
   ```
   
   `str::split_at` requires the index to fall on a UTF-8 char boundary, 
otherwise it panics. For any string ending in a non-ASCII char, `limit.len() - 
1` is in the middle of a multi-byte sequence.
   
   `parse_capacity_limit` is reached from `set_runtime_variable` for 
`memory_limit`, `max_temp_directory_size`, `metadata_cache_limit`, 
`list_files_cache_limit`. `parse_duration` is reached for 
`list_files_cache_ttl`.
   
   ### Fix
   
   Either:
   - Slice by `char` instead of bytes: e.g. `if !limit.is_ascii() { return 
plan_err!(...) }` before the `split_at`, **or**
   - Use `limit.char_indices().next_back().map(|(i, _)| 
limit.split_at(i)).ok_or_else(|| ...)`, **or**
   - Compare via `chars().last()` and slice safely.
   
   The unit set is `{K, M, G, m, s}` — all ASCII — so rejecting non-ASCII 
values up front is sufficient.
   
   ### Additional context
   
   Discovered by a `cargo fuzz` target seeded with SQL from 
`datafusion/sqllogictest/test_files/`. The fuzzer reached the panic within the 
first INITED pass on the seeded corpus after a random mutation flipped a 
trailing byte of a SET value to a UTF-8 continuation byte 
(`String::from_utf8_lossy` in the fuzz harness replaced it with `U+FFFD`, a 
3-byte char, which is still enough to reach the bug).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to