youichi-uda commented on issue #21998:
URL: https://github.com/apache/datafusion/issues/21998#issuecomment-4367772092
Thanks @2010YOUY01 — that's a fair read of where byte-level fuzzing pays off
and where it doesn't, and I agree with most of it. A couple of clarifications
and a smaller proposal.
**On the existing `apache/datafusion-sqlparser-rs/fuzz`**
I missed that — sorry. It uses honggfuzz with a single `fuzz_parse_sql`
target that round-robins across all 14 dialects, which is a perfectly
reasonable design for the parser and is more comprehensive than the `parse_sql`
harness I'd prototyped. I don't think layering a second cargo-fuzz target on
top of that crate buys much.
**On byte-level fuzzing for SQL specifically**
Your read matches mine. For comparison — the binary-format harnesses I've
been prototyping in parallel (apache/arrow-rs#5332 / pola-rs/polars#27488)
found 6 distinct DoS-class bugs over the past two days (fix PRs:
arrow-rs#9883/9884/9886/9887, polars#27489/27490). For the SQL parser side, a
5-minute libFuzzer run on `DFParser::parse_sql` reached 4,377 edges and 20,378
features but produced no crashes — entirely consistent with what you'd expect
from a parser that's been under property-based + honggfuzz coverage already.
So the leverage isn't comparable, and I don't think DataFusion needs the
kind of harness-density push the binary-parser side does. Withdrawing the broad
proposal.
**Smaller scope I'd actually suggest**
Two narrower things, both fine to decline:
1. **Bring `apache/datafusion-sqlparser-rs/fuzz` under OSS-Fuzz.** The
honggfuzz harness already exists; OSS-Fuzz would just give it persistent
corpus, 24/7 compute, and an automated crash-notification path. ~zero net work
in this repo or in `datafusion-sqlparser-rs` itself; net change is a new
`projects/datafusion-sqlparser-rs/` directory in google/oss-fuzz that drives
the existing harness. Happy to send that as a standalone proposal in
`datafusion-sqlparser-rs` if it's useful.
2. **Corpus-mutation harness only.** As you suggested, a small target that
takes a short byte slice and uses it to mutate the existing test SQL corpus
already in `datafusion/sql/tests`, then re-parses. Stays in this repo, tiny
surface, no big harness fleet. Sketch:
```rust
fuzz_target!(|data: &[u8]| {
let base = TEST_SQL_CORPUS[(data.first().copied().unwrap_or(0) as
usize) % TEST_SQL_CORPUS.len()];
let s = mutate_corpus_string(base, &data[1..]); // small bit-flip
mutator
let _ = DFParser::parse_sql(&s);
});
```
This addresses your "valid SQL or close to it" point — every input is a
near-valid query rather than random bytes.
Either or neither — happy to follow whatever suits the project.
**On how OSS-Fuzz works**
Good question. Short version:
- It's basically Google-hosted ClusterFuzz, free for OSS projects.
- You ship a `projects/<name>/` directory in `google/oss-fuzz` with a
`project.yaml` (contacts, sanitizers, fuzzing engines), a `Dockerfile` that
fetches the source, and a `build.sh` that builds the harnesses into `$OUT/`.
- Their infra rebuilds your harnesses daily, runs them on a fleet 24/7 with
persistent corpus, and reports new crashes to the email addresses in
`project.yaml` (`primary_contact` + `auto_ccs`). Crashes get auto-bisected,
deduped, and filed as private issues that flip public after the standard 90-day
disclosure window.
- Local workflow is `python infra/helper.py {build_image, build_fuzzers,
check_build, run_fuzzer, coverage} <project>`, which mirrors what their infra
does so you can iterate on the harness before pushing.
- Apache projects that already use it for comparison: `projects/arrow`
(Arrow C++), `projects/librdkafka`, `projects/parquet-mr`, etc. Same shape,
just pointed at a Rust crate instead.
The closest precedent for the Rust workspace shape we'd want here is
`projects/gitoxide` (auto-discovers per-crate `fuzz/` dirs in a workspace and
builds each) or `projects/fontations`. Both are referenced from my prototype
`projects/datafusion/` if you want to look — though as above, I'd actually
recommend pointing it at `datafusion-sqlparser-rs` instead and leaving
DataFusion proper alone.
---
Net: I'd close this issue or downscope it to "OSS-Fuzz integration of the
existing `datafusion-sqlparser-rs` honggfuzz harness, tracked there." Whichever
you prefer. cc @alamb @andygrove @comphead @ozankabak in case anyone has a
different read.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]