wangzhigang1999 commented on issue #7379:
URL: https://github.com/apache/kyuubi/issues/7379#issuecomment-4200073491

   ### Preliminary Benchmark Results
   
   > **Note:** This is a quick feasibility test to validate the approach, not a 
rigorous evaluation. The results below were obtained using an internal 
production data agent system. The same core idea (ReAct + tool-calling 
architecture, with data-agent-specific tools and mechanisms) has been 
contributed to Kyuubi as the Data Agent Engine, but these numbers were not 
benchmarked on the Kyuubi implementation itself. We plan to integrate these 
optimizations into Kyuubi incrementally.
   
   We ran against the [BIRD-SQL Mini-Dev 
set](https://huggingface.co/datasets/birdsql/bird_mini_dev) (500 examples), a 
lite version of the [BIRD benchmark](https://bird-bench.github.io/) covering 
multiple professional domains.
   
   <img width="3376" height="1334" alt="Image" src="https://github.com/user-attachments/assets/4f53c25e-d41b-4d99-8d61-bd36c34ab1e0" />
   
   <img width="3376" height="1460" alt="Image" src="https://github.com/user-attachments/assets/82abd567-cec7-4920-8c8f-8badffc918f4" />
   
   | Metric | Result |
   |---|---|
   | EX (Execution Accuracy, overall) | **~70%** |
   | Soft-F1 (simple difficulty) | **~83–84%** |
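
   For readers unfamiliar with the metric: Execution Accuracy (EX) counts a prediction as correct when the predicted SQL, once executed, returns the same result set as the gold SQL. Below is a minimal sketch of that idea, assuming SQLite and order-insensitive row comparison; the function name and scoring details are ours, not the BIRD evaluation harness:

```python
import sqlite3

def execution_accuracy(pairs, db_path):
    """Fraction of (predicted_sql, gold_sql) pairs whose executed
    result sets match; rows are compared order-insensitively."""
    conn = sqlite3.connect(db_path)
    correct = 0
    for pred_sql, gold_sql in pairs:
        try:
            pred_rows = sorted(conn.execute(pred_sql).fetchall())
            gold_rows = sorted(conn.execute(gold_sql).fetchall())
            correct += pred_rows == gold_rows
        except sqlite3.Error:
            pass  # a predicted query that fails to execute scores 0
    conn.close()
    return correct / len(pairs)
```

   The official harness adds more machinery (per-database timeouts, difficulty buckets, the Soft-F1 partial-credit variant), but the core pass/fail comparison is of this shape.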
   
   For reference, frontier LLM baselines on the BIRD full test set (not directly comparable due to different eval splits, but included for context):
   
   | Model | EX (test) |
   |---|---|
   | Claude Opus 4.6 | 70.15% |
   | Qwen3-Coder-480B-A35B | 68.14% |
   | Claude 4.5 Sonnet | 66.85% |
   | GLM-4.7 | 62.94% |
   | DeepSeek-R1 | 60.93% |
   
   And on the BIRD Mini-Dev leaderboard (SQLite, with oracle knowledge):
   
   | Model | Size | EX |
   |---|---|---|
   | TA + GPT-4 | UNK | 58.00% |
   | GPT-4 | UNK | 47.80% |
   | GPT-4-turbo | UNK | 45.80% |
   | Llama3-70b-instruct | 70B | 40.80% |
   
   Our ~70% EX is achieved with Qwen3.5/3.6-plus, a much smaller model than the frontier baselines above. Unlike vanilla LLM baselines that rely on single-pass prompting, our system uses a ReAct agent with data-agent-specific tools (schema exploration, SQL execution, result validation). These domain-specific optimizations let a smaller model match or exceed the accuracy of much larger frontier models. While these numbers come from an internal system rather than the Kyuubi engine, the architectural approach is identical and will be progressively integrated into the Kyuubi Data Agent Engine. At minimum, this demonstrates that the agent-based approach is viable and effective.
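
   To make the architecture concrete, the ReAct loop described above can be sketched in a few lines. Everything here is a hypothetical stand-in for illustration (the `llm` callable's protocol, the tool names), not the Kyuubi Data Agent Engine API:

```python
def react_loop(llm, tools, question, max_steps=8):
    """Alternate model reasoning with tool execution until a final answer.

    `llm` maps the transcript so far to either
    ("call", tool_name, args) or ("final", answer).
    `tools` maps tool names (e.g. schema exploration, SQL execution,
    result validation) to plain callables.
    """
    transcript = [("question", question)]
    for _ in range(max_steps):
        action = llm(transcript)
        if action[0] == "final":
            return action[1]
        _, name, args = action
        observation = tools[name](**args)   # execute the requested tool
        transcript.append((name, args, observation))  # feed result back
    return None  # step budget exhausted without a final answer
```

   The key difference from single-pass prompting is the feedback edge: each tool observation (schema, query result, validation error) is appended to the transcript, so the model can correct its SQL before committing to an answer.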
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
