Hello everyone,

Since I first opened the thread discussing the MCP server, I've been thinking for a long while about how we can practically bring AI-assisted debugging and operational insights directly into Airflow. As orchestration environments grow more complex, the cost of troubleshooting (navigating across Dag code, task instances, scheduler logs, and configurations) translates directly into lost on-call time and delayed pipelines.
Today, organizations that want AI-assisted debugging are forced to build custom, ad-hoc integrations or rely on external paid solutions. This leads to fragmented user experiences, duplicated effort, and, most critically, inconsistent security controls that risk exposing sensitive metadata or bypassing Airflow's native Role-Based Access Control (RBAC).

I think we can do better, so I'm proposing AIP-101: Airflow AI Assistant - Phase 1 (Read-only assistance). This AIP introduces an official, opt-in plugin that provides a conversational UI directly within Airflow to answer user questions about their instances, explain errors, and help troubleshoot failures.

To ensure this is done safely and securely, Phase 1 is strictly read-only. The assistant does not modify Airflow state, nor does it operate autonomously. Instead, it relies on the newly proposed Airflow MCP Server (AIP-91) as its data-retrieval engine. By leveraging the MCP standard, the assistant grounds its answers in live system state while strictly enforcing the authenticated user's RBAC permissions, so the AI never accesses data the user cannot see.

tl;dr of the proposed implementation:
- Packaging: Delivered as an opt-in, standalone plugin package within the apache/airflow monorepo (with an independent release cycle).
- Frontend: A conversational UI embedded directly in the Airflow web interface.
- Backend: A FastAPI-based plugin backend using pydantic-ai to safely orchestrate external LLM calls.
- Data Access: Relies entirely on the Airflow MCP Server (AIP-91) to fetch read-only state.
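To make the read-only, RBAC-scoped data access above a bit more concrete, here is a minimal, hypothetical Python sketch. It assumes that the assistant backend forwards every LLM-requested tool call with the authenticated user's own token (never a shared service account), that it exposes only an allowlist of read-only tools, and that the MCP server, not the LLM, performs the permission check. All names here (`FakeMcpServer`, `READ_ONLY_TOOLS`, `assistant_tool_call`, the tool names and tokens) are illustrative assumptions, not the interfaces defined in AIP-91 or AIP-101, and the real backend would go through pydantic-ai and an actual MCP client instead of a plain function call.

```python
"""Hypothetical sketch: read-only, RBAC-scoped tool calls from an assistant backend.

Every name below is an illustrative assumption, not an API from AIP-91 or AIP-101.
"""
from dataclasses import dataclass

# Hypothetical allowlist: Phase 1 only exposes tools that read state.
READ_ONLY_TOOLS = frozenset({"get_dag_run_state", "get_task_log"})


class ToolRejected(Exception):
    """Raised when the LLM requests a tool outside the read-only allowlist."""


@dataclass
class FakeMcpServer:
    """Stand-in for the MCP server: it, not the LLM, enforces per-user permissions."""

    # user token -> Dag ids that user is allowed to read (stands in for Airflow RBAC)
    permissions: dict[str, set[str]]
    # Dag id -> latest run state (stands in for live Airflow metadata)
    dag_states: dict[str, str]

    def call(self, token: str, tool: str, dag_id: str) -> str:
        # Permission check happens server-side, using the caller's own token.
        if dag_id not in self.permissions.get(token, set()):
            raise PermissionError(f"token cannot read Dag {dag_id!r}")
        if tool == "get_dag_run_state":
            return self.dag_states[dag_id]
        return f"<log lines for {dag_id}>"


def assistant_tool_call(server: FakeMcpServer, user_token: str, tool: str, dag_id: str) -> str:
    """Forward an LLM-requested tool call with the *user's* token, after the read-only check."""
    if tool not in READ_ONLY_TOOLS:
        raise ToolRejected(f"{tool!r} is not a read-only tool")
    return server.call(user_token, tool, dag_id)


# Usage: alice can read etl_daily but not payroll, and cannot trigger anything.
server = FakeMcpServer(
    permissions={"alice-token": {"etl_daily"}},
    dag_states={"etl_daily": "failed", "payroll": "success"},
)
print(assistant_tool_call(server, "alice-token", "get_dag_run_state", "etl_daily"))  # failed

for tool, dag_id in [("get_dag_run_state", "payroll"), ("trigger_dag_run", "etl_daily")]:
    try:
        assistant_tool_call(server, "alice-token", tool, dag_id)
    except (PermissionError, ToolRejected) as exc:
        print(f"blocked: {exc}")
```

The point of the split is that the allowlist keeps the assistant read-only even if the model asks for a write action, while the user-scoped token means the assistant can never see more than the user could see in the UI.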
---

Because the assistant is tightly coupled to the secure tool-calling execution provided by the MCP server, which is covered in a separate AIP (AIP-91), please note the ongoing discussion thread as well as AIP-91 itself:
https://lists.apache.org/thread/xgd66v6s7zf0xkvy3c7ysqvn4csgmw06
https://cwiki.apache.org/confluence/x/G4q3FQ

---

AIP-101 is available here: https://cwiki.apache.org/confluence/x/8Ic8G

A quick warning before you read: the AIP is quite long! (sorry Jarek) Because integrating AI into an orchestrator opens up a lot of potential pitfalls, [ChatGPT and] I tried to be extremely thorough in covering everything that could go wrong :) If you find a specific section overly detailed or repetitive, please comment on the AIP and I'll try to address it.

I've managed to build a very initial POC; screenshots are available in this section:
https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=406620144#AIP101AirflowAIAssistantPhase1(Readonlyassistance)-BehavioralModel

I would love to hear your thoughts. Please comment on the AIP and/or reply to this thread.

Thank you,
Shahar
