GitHub user wangzhigang1999 edited a discussion: [KPIP] Data Agent Engine — AI-Powered Autonomous Data Analysis for Kyuubi
# [KPIP] Data Agent Engine — AI-Powered Autonomous Data Analysis for Kyuubi
## Abstract
This proposal introduces a **Data Agent Engine** to Apache Kyuubi, enabling
users to perform data analysis through natural language. Unlike the existing
Chat Engine, which provides stateless LLM Q&A without data access, the Data
Agent Engine bridges LLMs with Kyuubi's multi-engine SQL execution capabilities
— allowing an AI agent to autonomously explore schemas, generate SQL, execute
queries, verify results, and self-correct through multi-turn reasoning.
---
## Q1. What are you trying to do?
Add a new engine type `DATA_AGENT` to Kyuubi that provides **agentic,
multi-turn data analysis** via natural language. Users connect through standard
JDBC/REST interfaces, ask questions in plain language, and the agent
autonomously:
1. Explores table schemas and discovers join relationships
2. Generates SQL queries tailored to the underlying engine's dialect (Spark
SQL, Trino, Hive, etc.)
3. Executes queries through Kyuubi's existing engine infrastructure
4. Verifies results for anomalies (empty results, cartesian joins, NULL
explosions)
5. Self-corrects when issues are detected — without user intervention
6. Presents conclusions with supporting data
This enables **business users, analysts, and non-SQL-proficient engineers** to
query data warehouses and lakehouses through Kyuubi without writing SQL.
## Q2. How is it done today, and what are the limits of current practice?
### Current state in Kyuubi
Kyuubi has a **Chat Engine** (`externals/kyuubi-chat-engine/`) that integrates
LLMs (ChatGPT, ErnieBot) via a pluggable `ChatProvider` interface.
**Limitations of the current Chat Engine:**
| Limitation | Impact |
|---|---|
| No data access — LLM cannot execute SQL or inspect schemas | Cannot answer data questions; purely conversational |
| Single-turn `ask()` interface — no iterative reasoning | Cannot decompose complex questions into multi-step analysis |
| No result verification — no awareness of query correctness | Cannot detect or fix wrong joins, empty results, or data anomalies |
| No schema awareness — LLM has no knowledge of the user's tables | Generates hallucinated SQL if users try to ask data questions |
### Current state in the industry
Most text-to-SQL tools today generate a single SQL query from a natural
language question. This works for simple lookups but fails on complex
analytical questions that require:
- Multi-step reasoning (explore → hypothesize → query → verify → refine)
- Schema understanding (which table owns which field, how tables join)
- Error recovery (fixing wrong JOINs, adjusting WHERE clauses)
## Q3. What is new in your approach, and why do you think it will be successful?
### Core innovation: Agentic SQL analysis within Kyuubi's multi-tenant gateway
Instead of single-shot text-to-SQL, we introduce a **ReAct (Reasoning + Acting)
agent loop** that iterates toward correct answers:
```
User question
  → Agent: inspect schema            (Tool: describe_schema)
  → Agent: discover join paths       (Tool: find_relationships)
  → Agent: generate and execute SQL  (Tool: sql_query → JDBC → Kyuubi Server → Compute Engine)
  → Middleware: verify results       (empty? NULL explosion? cartesian join?)
  → Agent: self-correct if needed, or present final answer
```
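The loop above can be sketched as plain control flow. Everything below (the `LlmClient`, `Tool`, and `Verifier` interfaces, the `Step` record) is an illustrative placeholder, not the proposed Kyuubi API:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Minimal ReAct-style loop sketch. All names here are hypothetical;
// the real engine would use LangChain4j's tool-calling abstractions.
public class ReactLoopSketch {
    interface LlmClient { Step next(List<String> history); }
    interface Tool { String run(String input); }
    interface Verifier { String check(String toolOutput); } // null = OK, else a correction hint

    record Step(boolean done, String toolName, String input, String answer) {}

    public static String run(LlmClient llm, Map<String, Tool> tools,
                             Verifier verifier, String question, int maxTurns) {
        List<String> history = new ArrayList<>();
        history.add("user: " + question);
        for (int turn = 0; turn < maxTurns; turn++) {
            Step step = llm.next(history);                  // reasoning
            if (step.done()) return step.answer();          // final answer reached
            String out = tools.get(step.toolName()).run(step.input()); // acting
            history.add("tool:" + step.toolName() + ": " + out);
            String hint = verifier.check(out);              // result verification middleware
            if (hint != null) history.add("hint: " + hint); // inject correction hint, loop again
        }
        return "Gave up after " + maxTurns + " turns";
    }
}
```

The key property is that verification output is fed back into the conversation history, so the next LLM turn can self-correct without user intervention.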
### Key design decisions
**1. Reuse Kyuubi's engine infrastructure for SQL execution**
The Data Agent does NOT connect to databases directly. Instead, its `sql_query`
tool uses **Kyuubi's JDBC driver to connect back to Kyuubi Server**, which then
routes the SQL to the appropriate compute engine (Spark SQL, Trino, Hive,
etc.). This is similar to the existing JDBC Engine pattern, where an engine
acts as a JDBC client to a backend data source — except here the "backend" is
Kyuubi Server itself.
This design ensures that every SQL query the agent executes goes through the
Server gateway, inheriting multi-tenant resource isolation,
authentication/authorization (Kyuubi AuthZ / Apache Ranger), query auditing,
and engine lifecycle management. The agent connects using the original user's
credentials, so the same ACL rules are enforced.
```
┌───────────┐      ┌───────────────┐      ┌──────────────────────┐
│  Client   │─────▶│ Kyuubi Server │─────▶│  Data Agent Engine   │
│(JDBC/REST)│      │   (Gateway)   │      │                      │
└───────────┘      └──┬─────▲──────┘      │   ReAct Loop         │
                      │     │             │   LLM ←→ Tools       │
                      │     │             └─────────┬────────────┘
                      │     │                       │
                      │     │  JDBC (user creds)    │
                      │     └───────────────────────┘
                      │        sql_query tool connects
                      │        back to Kyuubi Server
                      ▼
              ┌─────────────────┐
              │ Compute Engine  │
              │ (Spark/Trino/   │
              │  Hive/...)      │
              └─────────────────┘
```
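The connect-back path can be sketched as below. The `jdbc:kyuubi://` URL scheme follows Kyuubi's Hive-compatible JDBC driver; the helper class itself is hypothetical and only illustrates how the original user's identity would flow to the Server:

```java
import java.util.Properties;

// Sketch of how a sql_query tool might build its connection back to
// Kyuubi Server. Class and method names are illustrative, not proposed API.
public class ConnectBackSketch {
    public static String kyuubiUrl(String host, int port, String database) {
        return "jdbc:kyuubi://" + host + ":" + port + "/" + database;
    }

    public static Properties userProps(String user, String password) {
        Properties p = new Properties();
        p.setProperty("user", user);         // original user's credentials,
        p.setProperty("password", password); // so Server-side AuthZ/Ranger rules still apply
        return p;
    }

    // In the real tool this would continue roughly as:
    //   try (Connection c = DriverManager.getConnection(url, props);
    //        Statement s = c.createStatement();
    //        ResultSet rs = s.executeQuery(sql)) { ... }
    // Omitted here because it requires a running Kyuubi Server.
}
```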
**2. Pluggable middleware pipeline for reasoning control**
An onion-model middleware system allows operators to customize agent behavior
without modifying core logic:
- **ResultVerification** — detects anomalies and injects correction hints
- **Guardrails** — enforces SQL safety policies (SELECT-only)
- **Deadline** — enforces time budgets for multi-turn analysis
- **Compaction** — compresses conversation context to stay within LLM token
limits
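The onion model means each middleware wraps the next handler and can act both before and after the inner call. A minimal sketch, with illustrative names only:

```java
import java.util.List;

// Onion-model middleware sketch: the first middleware in the list is the
// outermost layer. Handler/Middleware are placeholder names, not proposed API.
public class MiddlewareSketch {
    interface Handler { String handle(String request); }
    interface Middleware { Handler wrap(Handler next); }

    // Fold right-to-left so layers.get(0) ends up outermost.
    public static Handler compose(List<Middleware> layers, Handler core) {
        Handler h = core;
        for (int i = layers.size() - 1; i >= 0; i--) {
            h = layers.get(i).wrap(h);
        }
        return h;
    }
}
```

With this shape, a Deadline middleware can start a timer before calling `next` and abort after, while ResultVerification inspects the inner result on the way out, all without touching the core agent loop.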
**3. Multi-signal schema understanding**
The agent actively discovers schema structure through:
- Explicit foreign keys from database metadata
- Naming conventions (`*_id` columns matched to primary keys)
- Value overlap detection (MinHash sketches)
- Type compatibility filtering
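The naming-convention signal alone can be sketched as follows; a real implementation would combine it with FK metadata, MinHash value overlap, and type checks. The helper and its assumption that primary keys are named `id` are purely illustrative:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Sketch of join inference from naming conventions only: match "<table>_id"
// columns to another table's primary key. Assumes PKs are named "id";
// all names are illustrative, not the proposed Kyuubi API.
public class JoinInferenceSketch {
    /** tables: table name -> column names */
    public static List<String> inferJoins(Map<String, List<String>> tables) {
        List<String> joins = new ArrayList<>();
        for (Map.Entry<String, List<String>> e : tables.entrySet()) {
            for (String col : e.getValue()) {
                if (!col.endsWith("_id")) continue;
                String target = col.substring(0, col.length() - 3); // "user_id" -> "user"
                if (tables.containsKey(target) && tables.get(target).contains("id")) {
                    joins.add(e.getKey() + "." + col + " = " + target + ".id");
                }
            }
        }
        return joins;
    }
}
```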
**4. Java/Scala implementation on the JVM**
Implemented in Java/Scala, consistent with all existing Kyuubi engines. Uses
[LangChain4j](https://docs.langchain4j.dev/) for LLM interaction (tool calling,
chat memory, streaming). This allows direct reuse of Kyuubi's `Serverable`,
`SessionManager`, `OperationManager`, and Thrift service infrastructure.
### Why it will be successful
1. **Natural extension of Kyuubi's existing architecture** — follows the same
engine model as Chat, Spark, Flink, and JDBC engines
2. **Validated approach** — the underlying agent design has been tested against
the [BIRD benchmark](https://bird-bench.github.io/) (a standard text-to-SQL
evaluation with 500 questions across 11 databases)
3. **Enterprise-ready from day one** — inherits Kyuubi's authentication,
authorization, multi-tenancy, and audit capabilities
4. **Addresses real demand** — natural language data access is a top request in
the data platform space
## Q4. Who cares? If you are successful, what difference will it make?
| User type | Current pain point | How Data Agent helps |
|---|---|---|
| **Business analysts** | Must learn SQL or wait for engineers to write queries | Ask questions in natural language, get verified results |
| **Data engineers** | Spend time on ad-hoc query requests from business teams | Offload routine queries to the agent |
| **Platform teams** | Need to provide self-service analytics while maintaining governance | Data Agent respects Kyuubi's AuthZ — users can only query what they're authorized to access |
### Strategic value for Kyuubi
- **Differentiator** — one of the first open-source SQL gateways with built-in
agentic AI analysis
- **Broader adoption** — lowers the barrier to entry for non-technical users
- **Ecosystem growth** — attracts AI/ML community contributors to the project
## Q5. How will you measure success?
| Metric | Target | How to measure |
|---|---|---|
| **Text-to-SQL accuracy** | ≥ 60% execution accuracy (EX) on the BIRD benchmark | Automated benchmark suite |
| **Self-correction rate** | ≥ 50% recovery from initial wrong queries | Log analysis of multi-turn sessions |
| **Query safety** | 0 unauthorized writes or data leaks | SQL validator + AuthZ integration tests |
| **Latency** | < 30s for simple questions | End-to-end timing |
| **Community adoption** | ≥ 3 external contributors within 6 months | GitHub activity tracking |
## Q6. What are the mid-term and final "exams" to check for success?
### Phase 1: Foundation (Mid-term)
- [ ] `DATA_AGENT` engine type registered
- [ ] `DataAgentProcessBuilder` launches the agent engine process
- [ ] Agent can execute simple single-turn text-to-SQL through Kyuubi's engine
- [ ] SQL validation enforces SELECT-only policy
- [ ] Basic authentication pass-through works
**Success criterion:** A user connects via beeline/JDBC, types a natural
language question, and receives a SQL-backed answer.
### Phase 2: Intelligent Analysis (Final)
- [ ] Multi-turn ReAct reasoning with schema exploration
- [ ] Relationship discovery (explicit FK + inferred)
- [ ] Result verification and self-correction
- [ ] Streaming output via REST API
- [ ] Context compaction for long conversations
- [ ] Documentation and user guide
**Success criterion:** Agent correctly answers multi-step analytical questions
that require schema exploration, multi-table joins, and result verification.
## Q7. What are the risks?
| Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|
| LLM hallucination generates wrong SQL | High | Medium | Result verification + SQL validation + max 3 retries |
| LLM API latency causes poor UX | Medium | Medium | Streaming output + deadline middleware + async execution |
| Agent generates unauthorized queries | Low | High | SQL validator (SELECT-only) + Kyuubi AuthZ integration |
| Token cost for complex analysis | Medium | Low | Context compaction + configurable iteration limits |
| LLM provider lock-in | Low | Medium | Pluggable provider interface (OpenAI-compatible API as standard) |
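The SELECT-only guardrail that mitigates the unauthorized-query risk can be approximated with a statement-type check. This string-based sketch is illustrative only; a production validator should parse the SQL with the target engine's own parser, since keyword matching can both over-reject (e.g. a column literally named `delete`) and be bypassed:

```java
import java.util.Locale;
import java.util.Set;

// Naive SELECT-only guardrail sketch. Shown only to illustrate the policy;
// not a substitute for parser-based validation plus Kyuubi AuthZ.
public class SqlGuardrailSketch {
    private static final Set<String> FORBIDDEN = Set.of(
        "insert", "update", "delete", "drop", "alter", "create",
        "truncate", "grant", "revoke", "merge");

    public static boolean isSelectOnly(String sql) {
        String s = sql.strip().toLowerCase(Locale.ROOT);
        // Must start as a query (plain SELECT or a WITH ... SELECT CTE).
        if (!(s.startsWith("select") || s.startsWith("with"))) return false;
        // Reject if any write/DDL keyword appears anywhere in the statement.
        for (String word : s.split("\\W+")) {
            if (FORBIDDEN.contains(word)) return false;
        }
        return true;
    }
}
```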
---
## Implementation Roadmap
| Phase | Scope | Estimated PRs |
|---|---|---|
| **Phase 0** | Community discussion + design review | This KPIP |
| **Phase 1** | Engine skeleton: EngineType, ProcessBuilder, Thrift service | 2-3 PRs |
| **Phase 2** | Core agent: ReAct loop, basic tools (schema + sql_query) | 3-4 PRs |
| **Phase 3** | Advanced: relationship discovery, result verification, middleware | 3-5 PRs |
| **Phase 4** | Streaming, documentation, benchmark CI | 2-3 PRs |
---
## Rejected Alternatives
### Alternative 1: Extend the existing Chat Engine
The Chat Engine's `ChatProvider.ask()` interface is fundamentally single-turn
and synchronous. Adding multi-turn reasoning, tool execution, and streaming
would require rewriting most of the internals. A new engine type is cleaner and
avoids breaking existing Chat Engine users.
### Alternative 2: Implement as a Kyuubi Server plugin / extension
The agent needs its own process lifecycle (LLM client, conversation memory,
middleware pipeline). Running inside the Server JVM would compete for resources
and complicate failure isolation.
### Alternative 3: External service with REST integration
Would bypass Kyuubi's authentication, authorization, and audit pipeline. Users
would need to manage a separate service.
### Alternative 4: Implement in Python
While the AI/ML ecosystem is richer in Python, the agent's core operations (LLM
API calls, SQL orchestration, schema introspection) are well-supported in Java
via LangChain4j. A Python implementation would introduce a technology stack
inconsistency, require re-implementing Kyuubi infrastructure, and add a runtime
dependency that the existing community cannot effectively review or maintain.
---
## References
- [BIRD Benchmark](https://bird-bench.github.io/) — Text-to-SQL evaluation
standard
- [ReAct: Synergizing Reasoning and Acting in Language
Models](https://arxiv.org/abs/2210.03629)
- [LangChain4j](https://docs.langchain4j.dev/) — Java LLM framework
- Kyuubi Chat Engine — `externals/kyuubi-chat-engine/`
GitHub link: https://github.com/apache/kyuubi/discussions/7373