timsaucer opened a new issue, #1394:
URL: https://github.com/apache/datafusion-python/issues/1394
## Problem
More and more users reach for LLMs to generate DataFusion Python code.
Today, agents are excellent at writing SQL but struggle to produce
idiomatic DataFrame API code: they either transliterate SQL literally
or invent patterns that go against the library's design. Nothing the
project currently ships reliably reaches the agent's context at the
moment it is writing code.
## Goals
1. Establish a single, authoritative guide for writing idiomatic
DataFusion Python code, usable by both humans and agents.
2. Make that guide discoverable through every channel agents actually
use — not just the channels we wish they used.
3. Validate the guide against a reference corpus (TPC-H) so it stays
honest as the API evolves.
4. Extend the same pattern across the wider DataFusion family
(Ballista, Comet, Ray, etc.) via an upstream `llms.txt` hub.
## Where idiomatic code is defined
**Single source of truth: `python/datafusion/AGENTS.md`.**
This one file — kept inside the repo, shipped inside the wheel, and
included verbatim on the docs site — is the canonical guide. It
contains:
- Core abstractions (`SessionContext` / `DataFrame` / `Expr` /
`functions`) and import conventions.
- A quick-start example that works end-to-end.
- SQL-to-DataFrame reference table (for users who think in SQL first).
- Migration sections for users coming from **Spark**, **Pandas**, and
**Polars**, laid out like the SQL table: each API's idioms in one
column, the DataFusion equivalents in the next.
- Common pitfalls caught in real agent sessions: `&`/`|`/`~` vs
Python `and`/`or`/`not`, `lit()` wrapping, decimal/float literal
interactions, `F.substring` vs `F.substr` arity, join-key
disambiguation, absence of `how="cross"`, etc.
- Idiomatic patterns: fluent chaining, window functions in place of
correlated subqueries, semi/anti joins in place of `EXISTS`/`NOT
EXISTS`, `aggregate().filter()` for `HAVING`, variable assignment
for CTEs.
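The first pitfall in the list follows from Python's own semantics: `and`, `or`, and `not` cannot be overloaded, so an expression library can only hook `&`, `|`, and `~`. The sketch below uses a toy `Expr` class (a hypothetical stand-in written for this illustration, not DataFusion's actual `Expr`) to show how `a and b` can silently drop a predicate while `a & b` builds the combined expression.

```python
class Expr:
    """Toy expression node; a hypothetical stand-in, not datafusion.Expr."""

    def __init__(self, text: str) -> None:
        self.text = text

    def __and__(self, other: "Expr") -> "Expr":
        # Invoked for `self & other`
        return Expr(f"({self.text} AND {other.text})")

    def __or__(self, other: "Expr") -> "Expr":
        # Invoked for `self | other`
        return Expr(f"({self.text} OR {other.text})")

    def __invert__(self) -> "Expr":
        # Invoked for `~self`
        return Expr(f"(NOT {self.text})")

    def __repr__(self) -> str:
        return f"Expr({self.text!r})"


a = Expr("a > 1")
b = Expr("b < 5")

combined = a & b    # builds a new expression containing both predicates
negated = ~(a | b)  # unary ~ binds tighter than |, so the parentheses are needed
dropped = a and b   # plain truthiness: in this toy, evaluates to `b`; `a` is lost
```

Because `&` and `|` also bind more tightly than comparison operators in Python, each comparison in a real predicate generally needs its own parentheses, as in `(x > 1) & (y < 5)`.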
The **TPC-H example suite** (`examples/tpch/`) is the reference
corpus: every query is written as idiomatic DataFrame code,
validated by answer-file comparison, and where the optimized logical
plan differs from the SQL version, the difference is documented in a
comment. This gives the AGENTS.md guidance a continuously verified
ground truth.
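As a rough illustration of answer-file comparison, the sketch below checks query output rows against expected rows with a numeric tolerance. The function names, the pipe-delimited layout with a header row, and the sample values are all assumptions made for this example; they are not taken from the `examples/tpch/` code.

```python
import csv
import io
import math


def load_answer_rows(text: str, delimiter: str = "|") -> list[list[str]]:
    """Parse delimited answer text into rows of stripped strings, skipping the header."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    rows = [[cell.strip() for cell in row] for row in reader if row]
    return rows[1:]


def cells_match(actual: object, expected: str, rel_tol: float = 1e-6) -> bool:
    """Compare numerically when both sides parse as floats, otherwise as strings."""
    try:
        return math.isclose(float(actual), float(expected), rel_tol=rel_tol)
    except (TypeError, ValueError):
        return str(actual).strip() == expected


def rows_match(actual_rows, expected_rows, rel_tol: float = 1e-6) -> bool:
    """Return True when both row sets have the same shape and matching cells."""
    if len(actual_rows) != len(expected_rows):
        return False
    return all(
        len(a) == len(e) and all(cells_match(x, y, rel_tol) for x, y in zip(a, e))
        for a, e in zip(actual_rows, expected_rows)
    )


# Made-up sample data, for illustration only.
answer_text = "flag|status|total\nA|F|10.50\nN|O|20.00\n"
expected = load_answer_rows(answer_text)
actual = [("A", "F", 10.5), ("N", "O", 20.0)]
ok = rows_match(actual, expected)
```

The tolerance matters because decimal and float results can differ in their last digits even when the query is correct; this sketch compares rows in order, so an unordered query would need its rows sorted first.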
## How agents discover it
Discovery is layered. Each layer catches agents the prior ones
missed, so no single channel is load-bearing.
| Layer | Mechanism | Target audience |
|-------|-----------|-----------------|
| 1 | `datafusion-init` writes a short pointer block into the user's project-root `AGENTS.md` / `CLAUDE.md` / `.cursorrules` | Any agent working in the user's repo; project-root files are loaded into context automatically |
| 2 | `https://datafusion.apache.org/python/llms.txt` published on the docs site (llmstxt.org convention) | Agents that auto-fetch `/llms.txt` from documentation sites |
| 3 | `AGENTS.md` inside the installed wheel + pointer in `datafusion.__doc__` | Agents that introspect the installed package |
| 4 | Docs site page that `{include}`s `AGENTS.md` | Humans and WebSearch-capable agents browsing the docs |
| 5 | `https://datafusion.apache.org/llms.txt` upstream hub (separate PR to `apache/datafusion`) pointing at each subproject's `llms.txt` | Agents that land anywhere in the DataFusion ecosystem |
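For Layer 3, an agent or tool that introspects the environment could locate the bundled guide roughly as follows. This is a minimal sketch using only the standard library; `find_bundled_guide` is a hypothetical helper, and it assumes the file sits at the top level of the package directory, as the `python/datafusion/AGENTS.md` path suggests.

```python
import importlib.util
from pathlib import Path
from typing import Optional


def find_bundled_guide(package: str, filename: str = "AGENTS.md") -> Optional[Path]:
    """Return the path to `filename` inside an installed package, or None."""
    spec = importlib.util.find_spec(package)
    if spec is None or not spec.submodule_search_locations:
        return None  # not installed, or a plain module rather than a package
    for location in spec.submodule_search_locations:
        candidate = Path(location) / filename
        if candidate.is_file():
            return candidate
    return None
```

Calling `find_bundled_guide("datafusion")` returns `None` when the package is not installed or the installed wheel does not contain the file, so callers can fall back to another layer.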
Layer 1 is the highest-leverage item — in an empirical test, an agent
with AGENTS.md present in five different in-package locations still
missed all of them because nothing pointed the agent at them from the
project root. The ~200-byte pointer solves that without embedding the
full guide in user repos.
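One way such a pointer writer could behave is sketched below: it appends a small marked block to a project-root file and leaves the file alone when the block is already there. The marker comments, the pointer wording, and the `ensure_pointer` name are assumptions for illustration; the actual `datafusion-init` output may differ.

```python
from pathlib import Path

# Hypothetical markers and wording; the real `datafusion-init` output may differ.
BEGIN = "<!-- datafusion-agents:begin -->"
END = "<!-- datafusion-agents:end -->"
POINTER = (
    f"{BEGIN}\n"
    "When writing DataFusion Python code, read the AGENTS.md file shipped\n"
    "inside the installed `datafusion` package before generating code.\n"
    f"{END}\n"
)


def ensure_pointer(project_root: Path, filename: str = "AGENTS.md") -> bool:
    """Append the pointer block to `filename` unless it is already present.

    Returns True if the file was written, False if it was left unchanged.
    """
    target = project_root / filename
    existing = target.read_text(encoding="utf-8") if target.exists() else ""
    if BEGIN in existing:
        return False  # already initialized; repeated runs are no-ops
    if existing and not existing.endswith("\n"):
        existing += "\n"  # keep the user's last line intact
    target.write_text(existing + POINTER, encoding="utf-8")
    return True
```

Wrapping the block in markers keeps the operation idempotent and leaves any existing user content in the file untouched.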
## Task list
- [x] PR 1a — `AGENTS.md` + package entry point
- [ ] PR 1b — Module docstrings + doctest examples
- [ ] PR 1c — `datafusion-init` project-root pointer
- [ ] PR 2 — TPC-H reference SQL + plan comparison diagnostic
- [ ] PR 3 — Rewrite TPC-H non-idiomatic queries
- [ ] PR 4 — Docs site (`{include}` + `llms.txt`) + AI skills + CLAUDE.md
- [ ] PR 5 — Upstream sync process documentation
- [ ] PR 6 — `apache/datafusion` `llms.txt` hub (separate repo)
Detailed plan to follow as a comment.