GitHub user mergisi added a comment to the discussion: [KPIP] Data Agent Engine 
— AI-Powered Autonomous Data Analysis for Kyuubi

Strong proposal. A few observations from working in the text-to-SQL space:

**Schema discovery as the critical path.** The multi-signal approach (FK 
metadata + naming conventions + MinHash value overlap) is the right call. In 
practice, explicit FKs cover maybe 40% of real-world schemas — the rest rely on 
naming convention inference. One addition worth considering: column-level 
statistics (cardinality, NULL ratio) fed to the LLM alongside schema metadata. 
This helps the agent avoid generating queries that join on high-cardinality 
columns without filters, which is one of the most common causes of 
cartesian-join-like blowups that the ResultVerification middleware would catch 
too late.
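
As a rough illustration of the column-statistics idea, here is a minimal Python sketch that computes distinct counts and NULL ratios from a row sample and renders them as text that could sit next to the schema metadata in the prompt. The function names, the sample table, and the prompt layout are all hypothetical and not part of the proposal. A real implementation would more likely read these numbers from the engine's own table statistics.

```python
from typing import Any, Dict, List


def column_stats(rows: List[Dict[str, Any]], column: str) -> Dict[str, float]:
    """Distinct count and NULL ratio for one column of a row sample."""
    values = [row.get(column) for row in rows]
    total = len(values)
    non_null = [v for v in values if v is not None]
    return {
        "distinct": len(set(non_null)),
        "null_ratio": (total - len(non_null)) / total if total else 0.0,
    }


def render_stats_for_prompt(table: str, rows: List[Dict[str, Any]],
                            columns: List[str]) -> str:
    """Render per-column statistics as plain text for an LLM prompt (layout is an assumption)."""
    lines = [f"Table {table} (sample of {len(rows)} rows):"]
    for col in columns:
        s = column_stats(rows, col)
        lines.append(
            f"- {col}: distinct={s['distinct']}, null_ratio={s['null_ratio']:.2f}"
        )
    return "\n".join(lines)


# Hypothetical sample data, for illustration only.
sample = [
    {"order_id": 1, "customer_id": 10, "coupon": None},
    {"order_id": 2, "customer_id": 10, "coupon": "SPRING"},
    {"order_id": 3, "customer_id": 11, "coupon": None},
    {"order_id": 4, "customer_id": 12, "coupon": None},
]
print(render_stats_for_prompt("orders", sample,
                              ["order_id", "customer_id", "coupon"]))
```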

**The 60% BIRD target is realistic but worth nuancing.** BIRD includes 
questions that require domain knowledge not present in schema metadata (e.g., 
knowing that "big city" means population > 1M). The self-correction loop should 
help here — the agent can inspect result distributions and re-query when 
something looks off. Tracking accuracy separately for schema-answerable vs. 
domain-knowledge questions would give a clearer signal during development.
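
The split tracking could be as simple as the following Python sketch. It assumes each evaluated question has already been tagged with a category by hand. The category labels and the result rows are made up for illustration and are not real benchmark numbers.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def accuracy_by_category(results: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Return the fraction of correct answers per question category."""
    correct: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for category, is_correct in results:
        total[category] += 1
        correct[category] += int(is_correct)
    return {category: correct[category] / total[category] for category in total}


# Hypothetical (category, was_correct) pairs; the labels are an assumption.
results = [
    ("schema_answerable", True),
    ("schema_answerable", True),
    ("schema_answerable", False),
    ("domain_knowledge", True),
    ("domain_knowledge", False),
    ("schema_answerable", True),
]
for category, accuracy in accuracy_by_category(results).items():
    print(f"{category}: {accuracy:.2f}")
```

Reporting the two numbers side by side makes it easier to tell whether a regression comes from schema understanding or from missing domain knowledge.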

**Middleware ordering matters more than it seems.** Guardrails before 
ResultVerification means the agent can't accidentally run a write query during 
self-correction. But Compaction before ResultVerification means the agent might 
lose context about why a previous attempt failed. Documenting a recommended 
middleware ordering (or making it configurable per deployment) would help 
operators avoid subtle bugs.
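
One way to surface ordering mistakes early is a startup check on the configured chain. The Python sketch below is only an assumption about how that could look: the middleware names come from the proposal, but the rule list, the function, and the idea that the chain is a plain ordered list are hypothetical.

```python
from typing import List, Sequence, Tuple

# Each pair (a, b) means "a should run before b". These rules encode the
# ordering suggested in this comment, not anything the proposal defines.
ORDER_RULES: List[Tuple[str, str]] = [
    ("Guardrails", "ResultVerification"),
    ("ResultVerification", "Compaction"),
]


def ordering_violations(chain: Sequence[str],
                        rules: Sequence[Tuple[str, str]] = ORDER_RULES) -> List[str]:
    """List every rule that the configured middleware chain breaks."""
    position = {name: index for index, name in enumerate(chain)}
    problems = []
    for before, after in rules:
        # Rules only apply when both middlewares are present in the chain.
        if before in position and after in position:
            if position[before] > position[after]:
                problems.append(f"{before} must run before {after}")
    return problems


print(ordering_violations(["Guardrails", "ResultVerification", "Compaction"]))
print(ordering_violations(["Guardrails", "Compaction", "ResultVerification"]))
```

An empty list means the chain satisfies every rule. A non-empty list could be logged as a warning or turned into a hard failure, depending on how strict a deployment wants to be.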

**Dialect-aware SQL generation.** The proposal mentions Spark SQL, Trino, and 
Hive — these have meaningful syntax differences (e.g., `DATE_TRUNC` vs `TRUNC` 
vs `date_trunc`, lateral views, array handling). Is dialect awareness handled 
at the LLM prompt level (system prompt per engine type), or is there a SQL 
rewrite layer? A rewrite layer would be more reliable but adds complexity.
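
For the prompt-level option, a minimal Python sketch might keep a small set of dialect hints per engine and prepend them to the system prompt. The dictionary, the function name, and the prompt wording are hypothetical. The hints cover only a couple of well-known differences and are not a complete dialect description.

```python
from typing import Dict, List

# Hypothetical per-engine hints; keys and wording are assumptions.
DIALECT_HINTS: Dict[str, List[str]] = {
    "spark": [
        "Truncate timestamps with date_trunc('month', ts).",
        "Expand arrays with LATERAL VIEW explode(arr) t AS x.",
    ],
    "trino": [
        "Truncate timestamps with date_trunc('month', ts).",
        "Expand arrays with CROSS JOIN UNNEST(arr) AS t(x).",
    ],
    "hive": [
        "Truncate dates with trunc(dt, 'MM').",
        "Expand arrays with LATERAL VIEW explode(arr) t AS x.",
    ],
}


def build_system_prompt(engine: str) -> str:
    """Build a system prompt that carries the hints for one engine type."""
    key = engine.strip().lower()
    if key not in DIALECT_HINTS:
        raise ValueError(f"No dialect hints configured for engine: {engine!r}")
    hints = "\n".join(f"- {hint}" for hint in DIALECT_HINTS[key])
    return f"Generate SQL for the {key} engine. Follow these dialect rules:\n{hints}"


print(build_system_prompt("Trino"))
```

This stays cheap to maintain but relies on the model following the hints, which is the reliability gap a rewrite layer would close.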

Disclosure: I work on [ai2sql.io](https://ai2sql.io), a natural language to SQL 
tool focused on the simpler end of this spectrum (single-turn query generation 
for learning and ad-hoc analysis). The agentic multi-turn approach described 
here is the natural next step for production-grade systems where single-shot 
accuracy isn't sufficient.

GitHub link: 
https://github.com/apache/kyuubi/discussions/7373#discussioncomment-16438004


