youichi-uda opened a new issue, #21998: URL: https://github.com/apache/datafusion/issues/21998
# Add cargo-fuzz / OSS-Fuzz coverage for DataFusion's parser and analyzer ## Background DataFusion already has excellent property-based fuzzers for runtime correctness — `sort_fuzz`, `aggregation_fuzzer`, `join_fuzz`, and friends in `datafusion/core/tests/fuzz_cases/`. They generate random queries against valid data and cross-check results, which is exactly the right tool for catching aggregation/sort/join semantic bugs. What we don't yet have is **byte-level (libFuzzer / cargo-fuzz)** coverage. That's a different attack surface: feeding arbitrary bytes into entry points like `DFParser::parse_sql` exercises the SQL tokenizer/parser, the dialect-extension code, and any error-recovery paths the parser walks while building the AST. These are the kinds of bugs that show up as panics or pathological allocations rather than wrong results. This is the same delta arrow-rs#5332 is closing on the format-parser side; DataFusion benefits from byte-level fuzzing for its own reasons. ## Concrete proposal 1. **Add per-crate `fuzz/` directories** with cargo-fuzz harnesses, starting with the smallest blast radius and growing outward as we land. Two are already prototyped on my fork at [`fuzz/initial-harnesses`](https://github.com/masumi-ryugo/datafusion/tree/fuzz/initial-harnesses): - `datafusion/sql/fuzz/parse_sql` — `DFParser::parse_sql` over UTF-8-validated `&[u8]`. 22 s sanity run reaches 3,520 edges / 12,788 features / 2,342 corpus entries from an empty start, RSS peak 117 MiB, no crashes. - `datafusion/sql/fuzz/parse_expr` — `DFParser::parse_sql_into_expr` (different production set). 22 s sanity reaches 3,980 edges / 14,500 features / 2,492 corpus entries at 9,090 runs/s, no crashes. Then in subsequent PRs, depending on what shakes out: a `LogicalPlanBuilder` harness in `datafusion/expr`, a `SessionContext::sql` end-to-end harness, and possibly per-format datasource readers (csv/json/parquet/arrow/avro), though those overlap with the apache/arrow-rs#5332 work and we'd want to dedupe. 2. **Wire ClusterFuzzLite into GitHub Actions** so PR-time fuzz + nightly batch + corpus pruning run automatically. Catches regressions before they merge instead of relying on me to remember to run cargo-fuzz locally. 3. **Open the OSS-Fuzz integration PR** (`projects/datafusion/`) once the upstream `fuzz/` lands and the harnesses are stable. Mirrors the layout I'm proposing for arrow-rs in [google/oss-fuzz](https://github.com/google/oss-fuzz). Coverage target ≥80% on harness-touched modules. ## Why this is worth doing - DataFusion is widely used as a SQL engine inside other projects (InfluxDB, Ballista, Coralogix, DataFusion Comet, ...). A panic or pathological allocation in `DFParser::parse_sql` can DoS any service that exposes a SQL endpoint to untrusted input. - The SQL parser walks an attacker-controlled token stream through DataFusion's extensions on top of `sqlparser-rs`. The `sqlparser` crate itself does not appear to be fuzzed in OSS-Fuzz, so DataFusion sits behind two unfuzzed parser layers right now. - Coverage cost is modest: cargo-fuzz harnesses are typically tens of lines each, and the OSS-Fuzz/ClusterFuzzLite plumbing is a one-time setup. ## Status I'm prototyping this in my fork at [masumi-ryugo/datafusion@fuzz/initial-harnesses](https://github.com/masumi-ryugo/datafusion/tree/fuzz/initial-harnesses) and have a working `parse_sql` harness building under nightly cargo-fuzz. Happy to send the upstream PR once the layout question is settled. **Two questions for the maintainers** (cc @alamb @andygrove @comphead @ozankabak): 1. **Layout preference**: per-crate `fuzz/` dirs (gitoxide-style — independent buildable units, easy to add per-crate seed corpora) vs. a single top-level `fuzz/` workspace? I'd lean per-crate but happy to do whatever you prefer. 2. **`primary_contact` / `auto_ccs` for OSS-Fuzz `project.yaml`**: OSS-Fuzz wants Google-account emails on file for crash notifications. Who from the PMC would want to be on that list? I can default to a single contact if a PMC alias isn't preferred. xref apache/arrow-rs#5332 (sibling work for arrow-rs side, same author, same OSS-Fuzz integration pattern). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
