wangzhigang1999 commented on issue #7379: URL: https://github.com/apache/kyuubi/issues/7379#issuecomment-4200073491
### Preliminary Benchmark Results

> **Note:** This is a quick feasibility test to validate the approach, not a rigorous evaluation. The results below were obtained using an internal production data agent system. The same core idea (ReAct + tool-calling architecture, with data-agent-specific tools and mechanisms) has been contributed to Kyuubi as the Data Agent Engine, but these numbers were not benchmarked on the Kyuubi implementation itself. We plan to integrate these optimizations into Kyuubi incrementally.

We ran against the [BIRD-SQL Mini-Dev set](https://huggingface.co/datasets/birdsql/bird_mini_dev) (500 examples), a lite version of the [BIRD benchmark](https://bird-bench.github.io/) covering multiple professional domains.

<img width="3376" height="1334" alt="Image" src="https://github.com/user-attachments/assets/4f53c25e-d41b-4d99-8d61-bd36c34ab1e0" />

<img width="3376" height="1460" alt="Image" src="https://github.com/user-attachments/assets/82abd567-cec7-4920-8c8f-8badffc918f4" />

| Metric | Result |
|---|---|
| EX (Execution Accuracy, overall) | **~70%** |
| Soft-F1 (simple difficulty) | **~83–84%** |

For reference, frontier LLM baselines on the BIRD full test set (not directly comparable due to different eval splits, but provides context):

| Model | EX (test) |
|---|---|
| Claude Opus 4.6 | 70.15% |
| Qwen3-Coder-480B-A35B | 68.14% |
| Claude 4.5 Sonnet | 66.85% |
| GLM-4.7 | 62.94% |
| DeepSeek-R1 | 60.93% |

And on the BIRD Mini-Dev leaderboard (SQLite, with oracle knowledge):

| Model | Size | EX |
|---|---|---|
| TA + GPT-4 | UNK | 58.00% |
| GPT-4 | UNK | 47.80% |
| GPT-4-turbo | UNK | 45.80% |
| Llama3-70b-instruct | 70B | 40.80% |

Our ~70% EX is achieved with Qwen3.5/3.6-plus, a much smaller model compared to the frontier baselines above. Unlike vanilla LLM baselines that rely on single-pass prompting, our system uses a ReAct agent with data-agent-specific tools (schema exploration, SQL execution, result validation).
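For readers unfamiliar with the EX metric quoted above, here is a minimal sketch of how execution accuracy is commonly computed for text-to-SQL: run the predicted query and the gold query against the same database and count a hit when the two result sets are equal. This is an illustration only, not the evaluation harness used for the numbers in this comment. The toy `orders` table, the helper names `execution_match` and `execution_accuracy`, and the sample query pairs are all hypothetical. The sketch also omits details a real harness would need, such as per-query timeouts, and it does not cover Soft-F1.

```python
import sqlite3
from typing import List, Tuple


def execution_match(conn: sqlite3.Connection, predicted_sql: str, gold_sql: str) -> bool:
    """Return True when both queries yield the same set of rows (order ignored)."""
    gold_rows = set(conn.execute(gold_sql).fetchall())
    try:
        predicted_rows = set(conn.execute(predicted_sql).fetchall())
    except sqlite3.Error:
        # A prediction that fails to execute is counted as a miss.
        return False
    return predicted_rows == gold_rows


def execution_accuracy(conn: sqlite3.Connection, pairs: List[Tuple[str, str]]) -> float:
    """Fraction of (predicted, gold) pairs whose result sets match."""
    if not pairs:
        return 0.0
    hits = sum(execution_match(conn, pred, gold) for pred, gold in pairs)
    return hits / len(pairs)


# Hypothetical toy database, used only to make the sketch runnable.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, amount REAL);
    INSERT INTO orders VALUES (1, 'alice', 10.0), (2, 'bob', 25.0), (3, 'alice', 5.0);
    """
)

pairs = [
    # Different SQL text, same result set: counted as correct.
    ("SELECT customer FROM orders WHERE amount > 8 ORDER BY id",
     "SELECT customer FROM orders WHERE amount >= 10"),
    # Wrong filter, different rows: counted as a miss.
    ("SELECT id FROM orders WHERE customer = 'bob'",
     "SELECT id FROM orders WHERE customer = 'alice'"),
    # Invalid SQL: counted as a miss.
    ("SELEC id FROM orders",
     "SELECT id FROM orders"),
]

ex = execution_accuracy(conn, pairs)
print(f"EX = {ex:.2%}")  # EX = 33.33%
```

Because the comparison is on result sets rather than SQL text, two syntactically different queries can both count as correct. This is part of why an agent that can execute and check its own SQL may score higher than single-pass prompting.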
The domain-specific optimizations in our system appear to let a smaller model match or exceed the accuracy of much larger frontier models, although the eval splits differ, so the comparison is indicative only.

While these numbers come from an internal system rather than the Kyuubi engine, the architectural approach is identical and will be progressively integrated into the Kyuubi Data Agent Engine. At a minimum, the results demonstrate that the agent-based approach is viable and effective.

-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
