GitHub user mergisi added a comment to the discussion: [KPIP] Data Agent Engine — AI-Powered Autonomous Data Analysis for Kyuubi
Strong proposal. A few observations from working in the text-to-SQL space:

**Schema discovery as the critical path.** The multi-signal approach (FK metadata + naming conventions + MinHash value overlap) is the right call. In practice, explicit FKs cover maybe 40% of real-world schemas; the rest rely on naming convention inference. One addition worth considering: column-level statistics (cardinality, NULL ratio) fed to the LLM alongside schema metadata. This helps the agent avoid generating queries that join on high-cardinality columns without filters, which is one of the most common causes of cartesian-join-like blowups that the ResultVerification middleware would catch too late.

**The 60% BIRD target is realistic but worth nuancing.** BIRD includes questions that require domain knowledge not present in schema metadata (e.g., knowing that "big city" means population > 1M). The self-correction loop should help here: the agent can inspect result distributions and re-query when something looks off. Tracking accuracy separately for schema-answerable vs. domain-knowledge questions would give a clearer signal during development.

**Middleware ordering matters more than it seems.** Guardrails before ResultVerification means the agent can't accidentally run a write query during self-correction. But Compaction before ResultVerification means the agent might lose context about why a previous attempt failed. Documenting a recommended middleware ordering (or making it configurable per deployment) would help operators avoid subtle bugs.

**Dialect-aware SQL generation.** The proposal mentions Spark SQL, Trino, and Hive, which have meaningful syntax differences (e.g., `DATE_TRUNC` vs `TRUNC` vs `date_trunc`, lateral views, array handling). Is dialect awareness handled at the LLM prompt level (system prompt per engine type), or is there a SQL rewrite layer? A rewrite layer would be more reliable but adds complexity.

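
To make the prompt-level option in the dialect question concrete, here is a minimal sketch of what "system prompt per engine type" could look like. Everything in it is an assumption for illustration: `BASE_PROMPT`, `DIALECT_HINTS`, and `build_system_prompt` are hypothetical names, not part of the KPIP or of Kyuubi, and the hint lists are deliberately incomplete.

```python
# Hypothetical sketch: none of these names come from the KPIP or from Kyuubi.
BASE_PROMPT = "You translate analyst questions into a single read-only SQL query."

# Illustrative, deliberately incomplete dialect notes keyed by engine type.
DIALECT_HINTS = {
    "spark": [
        "Truncate timestamps with date_trunc('month', ts).",
        "Flatten arrays with LATERAL VIEW explode(arr) t AS item.",
    ],
    "trino": [
        "Truncate timestamps with date_trunc('month', ts).",
        "Flatten arrays with CROSS JOIN UNNEST(arr) AS t(item).",
    ],
    "hive": [
        "Truncate dates with trunc(dt, 'MM').",
        "Flatten arrays with LATERAL VIEW explode(arr) t AS item.",
    ],
}


def build_system_prompt(engine: str) -> str:
    """Return the base prompt plus the dialect notes for one engine type."""
    key = engine.strip().lower()
    if key not in DIALECT_HINTS:
        # Fail loudly rather than silently falling back to a generic dialect.
        raise ValueError(f"no dialect hints registered for engine {engine!r}")
    hints = "\n".join(f"- {hint}" for hint in DIALECT_HINTS[key])
    return f"{BASE_PROMPT}\nTarget dialect: {key}\nDialect notes:\n{hints}"


if __name__ == "__main__":
    print(build_system_prompt("trino"))
```

This only steers generation; nothing here validates that the emitted SQL actually parses on the target engine, which is the gap a rewrite layer would close.
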
Disclosure: I work on [ai2sql.io](https://ai2sql.io), a natural language to SQL tool focused on the simpler end of this spectrum (single-turn query generation for learning and ad-hoc analysis). The agentic multi-turn approach described here is the natural next step for production-grade systems where single-shot accuracy isn't sufficient.

GitHub link: https://github.com/apache/kyuubi/discussions/7373#discussioncomment-16438004
