wangzhigang1999 opened a new issue, #7379: URL: https://github.com/apache/kyuubi/issues/7379
### Code of Conduct

- [X] I agree to follow this project's [Code of Conduct](https://www.apache.org/foundation/policies/conduct)

### Search before asking

- [X] I have searched in the [issues](https://github.com/apache/kyuubi/issues?q=is%3Aissue) and found no similar issues.

### Describe the feature

Umbrella issue tracking the implementation of the `DATA_AGENT` engine type.

KPIP: https://github.com/apache/kyuubi/discussions/7373

The Data Agent enables users to perform data analysis through natural language: an AI agent autonomously explores schemas, generates SQL, executes queries via Kyuubi's existing multi-engine infrastructure, and self-corrects through multi-turn ReAct reasoning.

#### Architecture

```
Client (JDBC/REST/Web UI)
        │
        ▼
Kyuubi Server (Gateway) ◄─── JDBC (user creds) ───┐
        │                                         │
        ▼                                         │
Data Agent Engine                                 │
┌────────────────────────────────────┐            │
│  ReAct Loop                        │            │
│  LLM ←→ Tools ─────────────────────┼────────────┘
│    └─ sql_query (via Kyuubi JDBC)  │
│  Middleware Pipeline               │
│    ├─ ApprovalMiddleware           │
│    └─ LoggingMiddleware            │
└────────────────────────────────────┘
```

#### Sub-tasks

- [ ] **PR 1: Module skeleton, configuration, and engine core**: New module `externals/kyuubi-data-agent-engine` with the engine fully runnable via the Echo provider. Includes the Thrift frontend, session/operation management, `IncrementalFetchIterator` for streaming, the event system, and all `kyuubi.engine.data.agent.*` configuration entries.
- [ ] **PR 2a: Tool system, data source, and prompt templates**: `SqlQueryTool` with `maxRows` enforcement and output truncation, `ToolRegistry` with JSON schema generation for LLM function calling, a data source abstraction with dialect auto-detection (Spark/SQLite/MySQL/Trino), and a composable system prompt builder with per-dialect templates.
- [ ] **PR 2b: Agent runtime, middleware, and OpenAI provider**: ReAct loop agent with streaming LLM interaction, `ConversationMemory` for multi-turn context, a middleware pipeline (`ApprovalMiddleware` with STRICT/NORMAL/AUTO_APPROVE modes, `LoggingMiddleware`), and an OpenAI-compatible provider. Integration tests with `MockLlmProvider` validate the complete tool-call pipeline without a real LLM.
- [ ] **PR 3: REST API and Web UI**: SSE streaming chat endpoint (`POST /api/v1/data-agent/{sessionHandle}/chat`), tool approval endpoint (`POST /api/v1/data-agent/{sessionHandle}/approve`), and a complete Vue web interface with session management, real-time message streaming, tool call visualization, and an approval workflow UI.

### Motivation

See [KPIP-7373](https://github.com/apache/kyuubi/discussions/7373) for the full motivation. In short: Kyuubi's existing Chat Engine is stateless and has no data access. The Data Agent Engine bridges LLMs with Kyuubi's multi-engine SQL execution, enabling business users and analysts to query data warehouses through natural language without writing SQL.

### Describe the solution

See [KPIP-7373](https://github.com/apache/kyuubi/discussions/7373) for the detailed design. Key decisions:

1. **SQL routes through Kyuubi Server**: The agent's `sql_query` tool connects back to Kyuubi Server via JDBC with the original user's credentials, inheriting AuthZ/Ranger policies, audit, and resource isolation.
2. **Pluggable LLM providers**: OpenAI-compatible API as the default via the official OpenAI Java SDK; extensible through the `DataAgentProvider` interface.
3. **Java for business logic, Scala for framework wrappers**: Agent runtime, tools, providers, events, and middleware are all in Java; Scala is used only for thin integration with Kyuubi's Session/Operation/Engine infrastructure.
4. **Streaming-first**: `IncrementalFetchIterator` enables real-time event streaming to both JDBC and REST/SSE clients.
5. **Human-in-the-loop approval**: Configurable approval workflow (AUTO_APPROVE / NORMAL / STRICT) for controlling tool execution risk.

### Additional context

Test strategy:

| Layer | Approach | LLM Required? |
|---|---|---|
| Unit tests (Java) | JUnit 4: tools, events, memory, middleware, prompts, data source | No |
| Integration tests (Scala) | MockLlmProvider drives full engine pipeline against SQLite | No |
| JDBC tests (Scala) | HiveJDBCTestHelper + Echo/Mock engine | No |
| Live tests (Java/Scala) | Real LLM API + SQLite test database | Yes (CI-optional) |
| Web UI tests (TypeScript) | Vitest: API client mocking | No |

### Are you willing to submit PR?

- [X] Yes. I would be willing to submit a PR with guidance from the Kyuubi community to improve.
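
To make the approval decision (key decision 5) more concrete, here is a minimal Java sketch of how the three modes might gate a `sql_query` tool call. The issue only names the modes, so the semantics used here are assumptions: AUTO_APPROVE never asks, STRICT always asks, and NORMAL asks only for statements that do not look read-only. The names `ApprovalMode`, `ApprovalSketch`, `isReadOnly`, and `requiresApproval` are hypothetical and not taken from the Kyuubi code base or the KPIP.

```java
import java.util.Locale;

/** Hypothetical enum mirroring the mode names listed in the issue. */
enum ApprovalMode { AUTO_APPROVE, NORMAL, STRICT }

/** Illustrative only: not the actual ApprovalMiddleware implementation. */
final class ApprovalSketch {
    private ApprovalSketch() {}

    /**
     * Assumption: a statement is treated as read-only when it starts with
     * one of a few well-known keywords. Anything else (including CTEs that
     * start with WITH) is conservatively treated as potentially mutating.
     */
    static boolean isReadOnly(String sql) {
        String head = sql.trim().toUpperCase(Locale.ROOT);
        return head.startsWith("SELECT")
            || head.startsWith("SHOW")
            || head.startsWith("DESCRIBE")
            || head.startsWith("EXPLAIN");
    }

    /** Returns true when a human would have to approve the tool call first. */
    static boolean requiresApproval(ApprovalMode mode, String sql) {
        switch (mode) {
            case AUTO_APPROVE:
                return false;             // assumed: never ask
            case STRICT:
                return true;              // assumed: always ask
            case NORMAL:
                return !isReadOnly(sql);  // assumed: ask only for non-read-only SQL
            default:
                throw new IllegalArgumentException("Unknown mode: " + mode);
        }
    }
}
```

A keyword prefix check is deliberately crude; a real implementation would more likely rely on a SQL parser or on the engine's own statement classification, and the actual mode semantics are defined in the KPIP and the PRs rather than here.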
