timsaucer opened a new issue, #1394:
URL: https://github.com/apache/datafusion-python/issues/1394

   ## Problem
   
   Users increasingly rely on LLMs to generate DataFusion Python code.
   Today, agents write SQL well but struggle to produce idiomatic
   DataFrame API code: they either transliterate SQL literally or invent
   patterns that don't match the library's grain. Nothing the project
   currently ships reliably reaches the agent at the moment it is
   writing code.
   
   ## Goals
   
   1. Establish a single, authoritative guide for writing idiomatic
      DataFusion Python code, usable by both humans and agents.
   2. Make that guide discoverable through every channel agents actually
      use — not just the channels we wish they used.
   3. Validate the guide against a reference corpus (TPC-H) so it stays
      honest as the API evolves.
   4. Extend the same pattern across the wider DataFusion family
      (Ballista, Comet, Ray, etc.) via an upstream `llms.txt` hub.
   
   ## Where idiomatic code is defined
   
   **Single source of truth: `python/datafusion/AGENTS.md`.**
   
   This one file — kept inside the repo, shipped inside the wheel, and
   included verbatim on the docs site — is the canonical guide. It
   contains:
   
   - Core abstractions (`SessionContext` / `DataFrame` / `Expr` /
     `functions`) and import conventions.
   - A quick-start example that works end-to-end.
   - SQL-to-DataFrame reference table (for users who think in SQL first).
   - Migration sections for users coming from **Spark**, **Pandas**, and
     **Polars**, each laid out like the SQL table: one column for the
     source API's idiom, one for the DataFusion equivalent.
   - Common pitfalls caught in real agent sessions: `&`/`|`/`~` vs
     Python `and`/`or`/`not`, `lit()` wrapping, decimal/float literal
     interactions, `F.substring` vs `F.substr` arity, join-key
     disambiguation, absence of `how="cross"`, etc.
   - Idiomatic patterns: fluent chaining, window functions in place of
     correlated subqueries, semi/anti joins in place of `EXISTS`/`NOT
     EXISTS`, `aggregate().filter()` for `HAVING`, variable assignment
     for CTEs.
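
To illustrate the first pitfall in the list above, here is a minimal sketch of why `&`/`|`/`~` and `and`/`or`/`not` are not interchangeable on expression objects. `ToyExpr` is a hypothetical stand-in written for this example, not the DataFusion `Expr` class, and it assumes an expression type that overloads the bitwise operators without defining `__bool__`.

```python
class ToyExpr:
    """Toy expression node for illustration only; NOT datafusion.Expr."""

    def __init__(self, text: str) -> None:
        self.text = text

    def __and__(self, other: "ToyExpr") -> "ToyExpr":
        # `a & b` is dispatched here, so it can build a combined node.
        return ToyExpr(f"({self.text} AND {other.text})")

    def __or__(self, other: "ToyExpr") -> "ToyExpr":
        return ToyExpr(f"({self.text} OR {other.text})")

    def __invert__(self) -> "ToyExpr":
        return ToyExpr(f"(NOT {self.text})")

    def __repr__(self) -> str:
        return self.text


a = ToyExpr("price > 10")
b = ToyExpr("qty < 5")

# Operator form: both operands end up in the resulting expression.
combined = a & b

# Keyword form: Python cannot overload `and`. It tests the truthiness of
# `a` (truthy by default for plain objects) and evaluates to `b` unchanged,
# so the first condition is silently lost.
short_circuit = a and b

print(combined)       # (price > 10 AND qty < 5)
print(short_circuit)  # qty < 5
```

Because `&` and `|` bind more tightly than comparison operators in Python, each comparison also needs its own parentheses when conditions are combined with the operator form.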
   
   The **TPC-H example suite** (`examples/tpch/`) is the reference
   corpus: every query is written as idiomatic DataFrame code,
   validated by answer-file comparison, and where the optimized logical
   plan differs from the SQL version, the difference is documented in a
   comment. This gives the AGENTS.md guidance a continuously verified
   ground truth.
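
The answer-file comparison mentioned above could look roughly like the sketch below. It assumes a pipe-delimited answer file with a single header row and compares numeric cells with a relative tolerance; the function names, the tolerance, and the sample values are made up for illustration and are not taken from `examples/tpch/`.

```python
import math
from typing import Sequence

# Made-up sample in the assumed pipe-delimited layout (header + rows).
SAMPLE_ANSWER = """\
l_returnflag|l_linestatus|sum_qty
A|F|100.00
N|O|250.50
"""


def parse_answer_file(text: str) -> list[list[str]]:
    """Split pipe-delimited rows into cells, skipping the header and blank lines."""
    lines = [line for line in text.splitlines() if line.strip()]
    return [[cell.strip() for cell in line.split("|")] for line in lines[1:]]


def cells_match(expected: str, actual: object, rel_tol: float = 1e-6) -> bool:
    """Compare numerically when both sides parse as floats, else as strings."""
    try:
        return math.isclose(float(expected), float(actual), rel_tol=rel_tol)
    except (TypeError, ValueError):
        return expected == str(actual).strip()


def rows_match(expected: Sequence[Sequence[str]],
               actual: Sequence[Sequence[object]]) -> bool:
    """True when both results have the same shape and every cell matches."""
    if len(expected) != len(actual):
        return False
    return all(
        len(exp_row) == len(act_row)
        and all(cells_match(e, a) for e, a in zip(exp_row, act_row))
        for exp_row, act_row in zip(expected, actual)
    )


expected_rows = parse_answer_file(SAMPLE_ANSWER)
# Rows as a query result might return them (hypothetical values).
query_rows = [("A", "F", 100.0), ("N", "O", 250.5)]
print(rows_match(expected_rows, query_rows))
```

The comparison is order-sensitive, which suits queries that end in a sort; a query without a defined ordering would need both sides sorted first.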
   
   ## How agents discover it
   
   Discovery is layered. Each layer catches agents the prior ones
   missed, so no single channel is load-bearing.
   
   | Layer | Mechanism | Target audience |
   |-------|-----------|-----------------|
   | 1 | `datafusion-init` writes a short pointer block into the user's project-root `AGENTS.md` / `CLAUDE.md` / `.cursorrules` | Any agent working in the user's repo; project-root files are loaded into context automatically |
   | 2 | `https://datafusion.apache.org/python/llms.txt` published on the docs site (llmstxt.org convention) | Agents that auto-fetch `/llms.txt` from documentation sites |
   | 3 | `AGENTS.md` inside the installed wheel + pointer in `datafusion.__doc__` | Agents that introspect the installed package |
   | 4 | Docs site page that `{include}`s `AGENTS.md` | Humans and WebSearch-capable agents browsing the docs |
   | 5 | `https://datafusion.apache.org/llms.txt` upstream hub (separate PR to `apache/datafusion`) pointing at each subproject's `llms.txt` | Agents that land anywhere in the DataFusion ecosystem |
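
As a rough sketch of how layer 3 might be consumed, the helper below looks for a guide file sitting next to an installed package's `__init__.py`, without importing the package. The function name `find_packaged_guide` is hypothetical, and the assumption that the file sits at the package's top level is inferred from the `python/datafusion/AGENTS.md` path given earlier.

```python
import importlib.util
from pathlib import Path
from typing import Optional


def find_packaged_guide(package: str, filename: str = "AGENTS.md") -> Optional[Path]:
    """Return the path to `filename` inside an installed package, or None.

    Uses the import machinery to locate the package directory without
    executing the package's code.
    """
    try:
        spec = importlib.util.find_spec(package)
    except (ImportError, ValueError):
        return None
    if spec is None or not spec.submodule_search_locations:
        # Not installed, or a plain module rather than a package.
        return None
    for location in spec.submodule_search_locations:
        candidate = Path(location) / filename
        if candidate.is_file():
            return candidate
    return None


# Assumed usage: prints a path if the installed wheel ships the guide
# at the package's top level, otherwise None.
print(find_packaged_guide("datafusion"))
```

Returning `None` rather than raising lets a caller fall through to the next discovery layer when the file is absent.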
   
   Layer 1 is the highest-leverage item — in an empirical test, an agent
   with AGENTS.md present in five different in-package locations still
   missed all of them because nothing pointed the agent at them from the
   project root. The ~200-byte pointer solves that without embedding the
   full guide in user repos.
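
A pointer writer along these lines could be sketched as follows. This is an assumption about how such a tool might behave, not the `datafusion-init` implementation: the marker comments, the pointer wording, and the `ensure_pointer` name are all hypothetical. The point of the sketch is idempotence, so that running it twice leaves the file unchanged.

```python
from pathlib import Path

# Hypothetical markers so a later run can detect an existing block.
BEGIN = "<!-- datafusion-pointer:begin -->"
END = "<!-- datafusion-pointer:end -->"

# Hypothetical wording; the real pointer text is not specified here.
POINTER_BLOCK = (
    f"{BEGIN}\n"
    "When writing DataFusion Python code, first read the AGENTS.md file\n"
    "shipped inside the installed `datafusion` package.\n"
    f"{END}\n"
)


def ensure_pointer(project_root: Path, filename: str = "AGENTS.md") -> bool:
    """Append the pointer block to `filename` unless it is already present.

    Returns True if the file was created or modified, False otherwise.
    """
    target = project_root / filename
    existing = target.read_text(encoding="utf-8") if target.exists() else ""
    if BEGIN in existing:
        return False  # already initialised; leave the user's file alone

    # Keep one blank line between existing content and the new block.
    if not existing or existing.endswith("\n\n"):
        separator = ""
    elif existing.endswith("\n"):
        separator = "\n"
    else:
        separator = "\n\n"

    target.write_text(existing + separator + POINTER_BLOCK, encoding="utf-8")
    return True
```

Existing content is preserved and the block is only ever appended, which keeps the change small and easy for a user to review or remove.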
   
   ## Task list
   
   - [x] PR 1a — `AGENTS.md` + package entry point
   - [ ] PR 1b — Module docstrings + doctest examples
   - [ ] PR 1c — `datafusion-init` project-root pointer
   - [ ] PR 2  — TPC-H reference SQL + plan comparison diagnostic
   - [ ] PR 3  — Rewrite TPC-H non-idiomatic queries
   - [ ] PR 4  — Docs site (`{include}` + `llms.txt`) + AI skills + CLAUDE.md
   - [ ] PR 5  — Upstream sync process documentation
   - [ ] PR 6  — `apache/datafusion` `llms.txt` hub (separate repo)
   
   Detailed plan to follow as a comment.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

