GitHub user wangzhigang1999 edited a discussion: [KPIP] Data Agent Engine — 
AI-Powered Autonomous Data Analysis for Kyuubi

# [KPIP] Data Agent Engine — AI-Powered Autonomous Data Analysis for Kyuubi

## Abstract

This proposal introduces a **Data Agent Engine** to Apache Kyuubi, enabling 
users to perform data analysis through natural language. Unlike the existing 
Chat Engine, which provides stateless LLM Q&A without data access, the Data 
Agent Engine bridges LLMs with Kyuubi's multi-engine SQL execution capabilities 
— allowing an AI agent to autonomously explore schemas, generate SQL, execute 
queries, verify results, and self-correct through multi-turn reasoning.

---

## Q1. What are you trying to do?

Add a new engine type `DATA_AGENT` to Kyuubi that provides **agentic, 
multi-turn data analysis** via natural language. Users connect through standard 
JDBC/REST interfaces, ask questions in plain language, and the agent 
autonomously:

1. Explores table schemas and discovers join relationships
2. Generates SQL queries tailored to the underlying engine's dialect (Spark 
SQL, Trino, Hive, etc.)
3. Executes queries through Kyuubi's existing engine infrastructure
4. Verifies results for anomalies (empty results, cartesian joins, NULL 
explosions)
5. Self-corrects when issues are detected — without user intervention
6. Presents conclusions with supporting data

This enables **business users, analysts, and non-SQL-proficient engineers** to 
query data warehouses and lakehouses through Kyuubi without writing SQL.
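For illustration, a session might look like the following (host, port, and the question are hypothetical; the config key follows Kyuubi's existing `kyuubi.engine.type` convention, assuming `DATA_AGENT` is added as a value):

```
# connect through beeline, asking the gateway to launch a Data Agent engine
beeline -u "jdbc:kyuubi://kyuubi-server:10009/default;#kyuubi.engine.type=DATA_AGENT"

# then ask in plain language instead of SQL
> Which product category had the highest revenue last quarter?
```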

## Q2. How is it done today, and what are the limits of current practice?

### Current state in Kyuubi

Kyuubi has a **Chat Engine** (`externals/kyuubi-chat-engine/`) that integrates 
LLMs (ChatGPT, ErnieBot) via a pluggable `ChatProvider` interface.

**Limitations of the current Chat Engine:**

| Limitation | Impact |
|---|---|
| No data access — LLM cannot execute SQL or inspect schemas | Cannot answer data questions; purely conversational |
| Single-turn `ask()` interface — no iterative reasoning | Cannot decompose complex questions into multi-step analysis |
| No result verification — no awareness of query correctness | Cannot detect or fix wrong joins, empty results, or data anomalies |
| No schema awareness — LLM has no knowledge of the user's tables | Generates hallucinated SQL if users try to ask data questions |

### Current state in the industry

Tools like text-to-SQL assistants generate a single SQL query from a natural 
language question. This works for simple lookups but fails on complex 
analytical questions that require:
- Multi-step reasoning (explore → hypothesize → query → verify → refine)
- Schema understanding (which table owns which field, how tables join)
- Error recovery (fixing wrong JOINs, adjusting WHERE clauses)

## Q3. What is new in your approach, and why do you think it will be successful?

### Core innovation: Agentic SQL analysis within Kyuubi's multi-tenant gateway

Instead of single-shot text-to-SQL, we introduce a **ReAct (Reasoning + Acting) 
agent loop** that iterates toward correct answers:

```
User question
    → Agent: inspect schema (Tool: describe_schema)
    → Agent: discover join paths (Tool: find_relationships)
    → Agent: generate and execute SQL (Tool: sql_query → JDBC → Kyuubi Server → Compute Engine)
    → Middleware: verify results (empty? NULL explosion? cartesian join?)
    → Agent: self-correct if needed, or present final answer
```
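To make the loop concrete, here is a minimal, self-contained sketch in Java. The `Llm` interface, the `CALL:`/`FINAL:` message convention, and the tool names are illustrative stand-ins for this KPIP, not the proposed engine API:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Minimal ReAct loop: each turn, the model either calls a tool
// ("CALL:<tool>:<args>") or emits a final answer ("FINAL: ...").
// Tool observations are appended to the transcript so the next
// turn can reason over them — this is what enables self-correction.
public class ReActLoop {
    interface Llm { String next(List<String> transcript); }

    static String run(Llm llm, Map<String, Function<String, String>> tools, int maxTurns) {
        List<String> transcript = new ArrayList<>();
        for (int turn = 0; turn < maxTurns; turn++) {
            String step = llm.next(transcript);
            if (step.startsWith("FINAL:")) {
                return step.substring("FINAL:".length()).trim();  // answer found
            }
            String[] parts = step.split(":", 3);                  // "CALL", tool name, args
            String observation = tools
                .getOrDefault(parts[1], args -> "unknown tool: " + parts[1])
                .apply(parts.length > 2 ? parts[2] : "");
            transcript.add(step);
            transcript.add("OBSERVATION: " + observation);        // feed back for the next turn
        }
        return "iteration budget exhausted";                      // Deadline-style cutoff
    }
}
```

The iteration cap plays the role of the Deadline middleware described later: the loop always terminates even if the model never converges on an answer.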

### Key design decisions

**1. Reuse Kyuubi's engine infrastructure for SQL execution**

The Data Agent does NOT connect to databases directly. Instead, its `sql_query` 
tool uses **Kyuubi's JDBC driver to connect back to Kyuubi Server**, which then 
routes the SQL to the appropriate compute engine (Spark SQL, Trino, Hive, 
etc.). This is similar to the existing JDBC Engine pattern, where an engine 
acts as a JDBC client to a backend data source — except here the "backend" is 
Kyuubi Server itself.

This design ensures that every SQL query the agent executes goes through the 
Server gateway, inheriting multi-tenant resource isolation, 
authentication/authorization (Kyuubi AuthZ / Apache Ranger), query auditing, 
and engine lifecycle management. The agent connects using the original user's 
credentials, so the same ACL rules are enforced.
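As a sketch, the connect-back path might look like the following. The class name, host/port, and the exact URL layout are illustrative assumptions; the real tool would reuse Kyuubi's JDBC driver and the session's negotiated configuration:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.ArrayList;
import java.util.List;

// The agent's sql_query tool never talks to a database directly: it opens a
// plain JDBC connection back to the Kyuubi Server gateway as the original
// session user, so AuthZ, auditing, and engine routing all still apply.
public class SqlQueryTool {
    private final String serverHost;
    private final int serverPort;
    private final String user;   // original session user's identity

    public SqlQueryTool(String serverHost, int serverPort, String user) {
        this.serverHost = serverHost;
        this.serverPort = serverPort;
        this.user = user;
    }

    // Build the connect-back URL; the Server then routes the SQL onward.
    public String jdbcUrl() {
        return String.format("jdbc:kyuubi://%s:%d/default;user=%s", serverHost, serverPort, user);
    }

    public List<Object[]> execute(String sql) throws SQLException {
        List<Object[]> rows = new ArrayList<>();
        try (Connection conn = DriverManager.getConnection(jdbcUrl());
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(sql)) {
            int cols = rs.getMetaData().getColumnCount();
            while (rs.next()) {
                Object[] row = new Object[cols];
                for (int i = 0; i < cols; i++) row[i] = rs.getObject(i + 1);
                rows.add(row);
            }
        }
        return rows;  // materialized so a bounded sample can be shown to the LLM
    }
}
```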

```
┌───────────┐     ┌───────────────┐     ┌────────────────────────┐
│  Client   │────▶│ Kyuubi Server │────▶│  Data Agent Engine     │
│(JDBC/REST)│     │   (Gateway)   │     │                        │
└───────────┘     └──┬────────────┘     │  ReAct Loop            │
                     │      ▲           │  LLM ←→ Tools          │
                     │      │           │           │            │
                     │      │           └───────────┼────────────┘
                     │      │                       │
                     │      │  JDBC (user creds)    │
                     │      └───────────────────────┘
                     │         sql_query tool connects
                     │         back to Kyuubi Server
                     ▼
              ┌─────────────────┐
              │ Compute Engine  │
              │ (Spark/Trino/   │
              │  Hive/...)      │
              └─────────────────┘
```

**2. Pluggable middleware pipeline for reasoning control**

An onion-model middleware system allows operators to customize agent behavior 
without modifying core logic:
- **ResultVerification** — detects anomalies and injects correction hints
- **Guardrails** — enforces SQL safety policies (SELECT-only)
- **Deadline** — enforces time budgets for multi-turn analysis
- **Compaction** — compresses conversation context to stay within LLM token 
limits
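The onion model can be sketched as ordinary function composition. The interface and middleware names below are illustrative, not the proposed API:

```java
import java.util.List;
import java.util.Locale;

// Onion-model pipeline: each middleware wraps the next handler, so it can
// act before the inner layers (pre-checks) and after them (post-checks).
public class MiddlewarePipeline {
    interface Handler { String handle(String request); }
    interface Middleware { Handler wrap(Handler next); }

    // Wrap inside-out so the first middleware in the list is the outermost layer.
    static Handler compose(List<Middleware> layers, Handler core) {
        Handler h = core;
        for (int i = layers.size() - 1; i >= 0; i--) {
            h = layers.get(i).wrap(h);
        }
        return h;
    }

    // Example Guardrails layer: reject anything that is not a SELECT.
    // (A real guardrail would parse the SQL; a prefix check is only a sketch.)
    static Middleware guardrails() {
        return next -> request -> {
            String sql = request.trim().toUpperCase(Locale.ROOT);
            if (!sql.startsWith("SELECT")) {
                return "rejected: SELECT-only policy";
            }
            return next.handle(request);
        };
    }
}
```

Because each layer only sees a `Handler`, operators can add or reorder middleware without touching the agent core.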

**3. Multi-signal schema understanding**

The agent actively discovers schema structure through:
- Explicit foreign keys from database metadata
- Naming conventions (`*_id` columns matched to primary keys)
- Value overlap detection (MinHash sketches)
- Type compatibility filtering
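A simplified scoring sketch combining two of these signals (record names, the singularization heuristic, and the weights are illustrative; value-overlap via MinHash and explicit FK metadata are omitted for brevity):

```java
// Combine naming-convention and type-compatibility signals into a join score.
// A real implementation would also fold in MinHash value overlap and FK metadata.
public class JoinCandidateScorer {
    record Column(String table, String name, String type) {}

    // Crude singularization heuristic: "users" -> "user".
    static String singular(String table) {
        return table.endsWith("s") ? table.substring(0, table.length() - 1) : table;
    }

    // "orders.user_id" pointing at "users.id" suggests a foreign key.
    static boolean nameSuggestsFk(Column fk, Column pk) {
        return fk.name().endsWith("_id")
            && fk.name().equals(singular(pk.table()) + "_id")
            && pk.name().equals("id");
    }

    static double score(Column a, Column b) {
        double s = 0.0;
        if (a.type().equalsIgnoreCase(b.type())) s += 0.3;           // types must be comparable
        if (nameSuggestsFk(a, b) || nameSuggestsFk(b, a)) s += 0.7;  // naming convention
        return s;
    }
}
```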

**4. Java/Scala implementation on the JVM**

Implemented in Java/Scala, consistent with all existing Kyuubi engines. Uses 
[LangChain4j](https://docs.langchain4j.dev/) for LLM interaction (tool calling, 
chat memory, streaming). This allows direct reuse of Kyuubi's `Serverable`, 
`SessionManager`, `OperationManager`, and Thrift service infrastructure.

### Why it will be successful

1. **Natural extension of Kyuubi's existing architecture** — follows the same 
engine model as Chat, Spark, Flink, and JDBC engines
2. **Validated approach** — the underlying agent design has been tested against 
the [BIRD benchmark](https://bird-bench.github.io/) (a standard text-to-SQL 
evaluation with 500 questions across 11 databases)
3. **Enterprise-ready from day one** — inherits Kyuubi's authentication, 
authorization, multi-tenancy, and audit capabilities
4. **Addresses real demand** — natural language data access is a top request in 
the data platform space

## Q4. Who cares? If you are successful, what difference will it make?

| User type | Current pain point | How Data Agent helps |
|---|---|---|
| **Business analysts** | Must learn SQL or wait for engineers to write queries | Ask questions in natural language, get verified results |
| **Data engineers** | Spend time on ad-hoc query requests from business teams | Offload routine queries to the agent |
| **Platform teams** | Need to provide self-service analytics while maintaining governance | Data Agent respects Kyuubi's AuthZ — users can only query what they're authorized to access |

### Strategic value for Kyuubi

- **Differentiator** — one of the first open-source SQL gateways with built-in 
agentic AI analysis
- **Broader adoption** — lowers the barrier to entry for non-technical users
- **Ecosystem growth** — attracts AI/ML community contributors to the project

## Q5. How will you measure success?

| Metric | Target | How to measure |
|---|---|---|
| **Text-to-SQL accuracy** | ≥ 60% EX on BIRD benchmark | Automated benchmark suite |
| **Self-correction rate** | ≥ 50% recovery from initial wrong queries | Log analysis of multi-turn sessions |
| **Query safety** | 0 unauthorized writes or data leaks | SQL validator + AuthZ integration tests |
| **Latency** | < 30s for simple questions | End-to-end timing |
| **Community adoption** | ≥ 3 external contributors within 6 months | GitHub activity tracking |

## Q6. What are the mid-term and final "exams" to check for success?

### Phase 1: Foundation (Mid-term)

- [ ] `DATA_AGENT` engine type registered
- [ ] `DataAgentProcessBuilder` launches the agent engine process
- [ ] Agent can execute simple single-turn text-to-SQL through Kyuubi's engine
- [ ] SQL validation enforces SELECT-only policy
- [ ] Basic authentication pass-through works

**Success criterion:** A user connects via beeline/JDBC, types a natural 
language question, and receives a SQL-backed answer.

### Phase 2: Intelligent Analysis (Final)

- [ ] Multi-turn ReAct reasoning with schema exploration
- [ ] Relationship discovery (explicit FK + inferred)
- [ ] Result verification and self-correction
- [ ] Streaming output via REST API
- [ ] Context compaction for long conversations
- [ ] Documentation and user guide

**Success criterion:** Agent correctly answers multi-step analytical questions 
that require schema exploration, multi-table joins, and result verification.

## Q7. What are the risks?

| Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|
| LLM hallucination generates wrong SQL | High | Medium | Result verification + SQL validation + max 3 retries |
| LLM API latency causes poor UX | Medium | Medium | Streaming output + deadline middleware + async execution |
| Agent generates unauthorized queries | Low | High | SQL validator (SELECT-only) + Kyuubi AuthZ integration |
| Token cost for complex analysis | Medium | Low | Context compaction + configurable iteration limits |
| LLM provider lock-in | Low | Medium | Pluggable provider interface (OpenAI-compatible API as standard) |

---

## Implementation Roadmap

| Phase | Scope | Estimated PRs |
|---|---|---|
| **Phase 0** | Community discussion + design review | This KPIP |
| **Phase 1** | Engine skeleton: EngineType, ProcessBuilder, Thrift service | 2-3 PRs |
| **Phase 2** | Core agent: ReAct loop, basic tools (schema + sql_query) | 3-4 PRs |
| **Phase 3** | Advanced: relationship discovery, result verification, middleware | 3-5 PRs |
| **Phase 4** | Streaming, documentation, benchmark CI | 2-3 PRs |

---

## Rejected Alternatives

### Alternative 1: Extend the existing Chat Engine

The Chat Engine's `ChatProvider.ask()` interface is fundamentally single-turn 
and synchronous. Adding multi-turn reasoning, tool execution, and streaming 
would require rewriting most of the internals. A new engine type is cleaner and 
avoids breaking existing Chat Engine users.

### Alternative 2: Implement as a Kyuubi Server plugin / extension

The agent needs its own process lifecycle (LLM client, conversation memory, 
middleware pipeline). Running inside the Server JVM would compete for resources 
and complicate failure isolation.

### Alternative 3: External service with REST integration

Would bypass Kyuubi's authentication, authorization, and audit pipeline. Users 
would need to manage a separate service.

### Alternative 4: Implement in Python

While the AI/ML ecosystem is richer in Python, the agent's core operations (LLM 
API calls, SQL orchestration, schema introspection) are well-supported in Java 
via LangChain4j. A Python implementation would introduce a technology stack 
inconsistency, require re-implementing Kyuubi infrastructure, and add a runtime 
dependency that the existing community cannot effectively review or maintain.

---

## References

- [BIRD Benchmark](https://bird-bench.github.io/) — Text-to-SQL evaluation 
standard
- [ReAct: Synergizing Reasoning and Acting in Language 
Models](https://arxiv.org/abs/2210.03629)
- [LangChain4j](https://docs.langchain4j.dev/) — Java LLM framework
- Kyuubi Chat Engine — `externals/kyuubi-chat-engine/`


GitHub link: https://github.com/apache/kyuubi/discussions/7373
