youichi-uda opened a new issue, #21998:
URL: https://github.com/apache/datafusion/issues/21998

   # Add cargo-fuzz / OSS-Fuzz coverage for DataFusion's parser and analyzer
   
   ## Background
   
   DataFusion already has excellent property-based fuzzers for runtime 
correctness — `sort_fuzz`, `aggregation_fuzzer`, `join_fuzz`, and friends in 
`datafusion/core/tests/fuzz_cases/`. They generate random queries against valid 
data and cross-check results, which is exactly the right tool for catching 
aggregation/sort/join semantic bugs.
   
   What we don't yet have is **byte-level (libFuzzer / cargo-fuzz)** coverage. 
That's a different attack surface: feeding arbitrary bytes into entry points 
like `DFParser::parse_sql` exercises the SQL tokenizer/parser, the 
dialect-extension code, and any error-recovery paths the parser walks while 
building the AST. These are the kinds of bugs that show up as panics or 
pathological allocations rather than wrong results.
   
   This is the same delta arrow-rs#5332 is closing on the format-parser side; 
DataFusion benefits from byte-level fuzzing for its own reasons.
   
   ## Concrete proposal
   
   1. **Add per-crate `fuzz/` directories** with cargo-fuzz harnesses, starting 
with the smallest blast radius and growing outward as we land. Two are already 
prototyped on my fork at 
[`fuzz/initial-harnesses`](https://github.com/masumi-ryugo/datafusion/tree/fuzz/initial-harnesses):
      - `datafusion/sql/fuzz/parse_sql` — `DFParser::parse_sql` over 
UTF-8-validated `&[u8]`. 22 s sanity run reaches 3,520 edges / 12,788 features 
/ 2,342 corpus entries from an empty start, RSS peak 117 MiB, no crashes.
      - `datafusion/sql/fuzz/parse_expr` — `DFParser::parse_sql_into_expr` 
(different production set). 22 s sanity reaches 3,980 edges / 14,500 features / 
2,492 corpus entries at 9,090 runs/s, no crashes.
      
      Then in subsequent PRs, depending on what shakes out: a 
`LogicalPlanBuilder` harness in `datafusion/expr`, a `SessionContext::sql` 
end-to-end harness, and possibly per-format datasource readers 
(csv/json/parquet/arrow/avro), though those overlap with the 
apache/arrow-rs#5332 work and we'd want to dedupe.
   
   2. **Wire ClusterFuzzLite into GitHub Actions** so PR-time fuzz + nightly 
batch + corpus pruning run automatically. Catches regressions before they merge 
instead of relying on me to remember to run cargo-fuzz locally.
   
   3. **Open the OSS-Fuzz integration PR** (`projects/datafusion/`) once the 
upstream `fuzz/` lands and the harnesses are stable. Mirrors the layout I'm 
proposing for arrow-rs in 
[google/oss-fuzz](https://github.com/google/oss-fuzz). Coverage target ≥80% on 
harness-touched modules.
   
   ## Why this is worth doing
   
   - DataFusion is widely used as a SQL engine inside other projects (InfluxDB, 
Ballista, Coralogix, DataFusion Comet, ...). A panic or pathological allocation 
in `DFParser::parse_sql` can DoS any service that exposes a SQL endpoint to 
untrusted input.
   - The SQL parser walks an attacker-controlled token stream through 
DataFusion's extensions on top of `sqlparser-rs`. The `sqlparser` crate itself 
does not appear to be fuzzed in OSS-Fuzz, so DataFusion sits behind two 
unfuzzed parser layers right now.
   - Coverage cost is modest: cargo-fuzz harnesses are typically tens of lines 
each, and the OSS-Fuzz/ClusterFuzzLite plumbing is a one-time setup.
   
   ## Status
   
   I'm prototyping this in my fork at 
[masumi-ryugo/datafusion@fuzz/initial-harnesses](https://github.com/masumi-ryugo/datafusion/tree/fuzz/initial-harnesses)
 and have a working `parse_sql` harness building under nightly cargo-fuzz. 
Happy to send the upstream PR once the layout question is settled.
   
   **Two questions for the maintainers** (cc @alamb @andygrove @comphead 
@ozankabak):
   
   1. **Layout preference**: per-crate `fuzz/` dirs (gitoxide-style — 
independent buildable units, easy to add per-crate seed corpora) vs. a single 
top-level `fuzz/` workspace? I'd lean per-crate but happy to do whatever you 
prefer.
   
   2. **`primary_contact` / `auto_ccs` for OSS-Fuzz `project.yaml`**: OSS-Fuzz 
wants Google-account emails on file for crash notifications. Who from the PMC 
would want to be on that list? I can default to a single contact if a PMC alias 
isn't preferred.
   
   xref apache/arrow-rs#5332 (sibling work for arrow-rs side, same author, same 
OSS-Fuzz integration pattern).
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to