Copilot commented on code in PR #64407:
URL: https://github.com/apache/airflow/pull/64407#discussion_r3025334456


##########
providers/common/ai/docs/toolsets.rst:
##########
@@ -24,17 +24,23 @@ Airflow's 350+ provider hooks already have typed methods, rich docstrings,
 and managed credentials. Toolsets expose them as pydantic-ai tools so that
 LLM agents can call them during multi-turn reasoning.
 
-Three toolsets are included:
+Five toolsets are included:
 
+- :class:`~airflow.providers.common.ai.toolsets.datafusion.DataFusionToolset`
+  — curated SQL toolset for querying file and object-store data via Apache
+  DataFusion.
 - :class:`~airflow.providers.common.ai.toolsets.hook.HookToolset` — generic
   adapter for any Airflow Hook.
+- :class:`~airflow.providers.common.ai.toolsets.multimodal.MultimodalToolset`
+  — curated read-only file and object-store inspection toolset for text,
+  image, and PDF inputs.
 - :class:`~airflow.providers.common.ai.toolsets.sql.SQLToolset` — curated
   4-tool database toolset.
 - :class:`~airflow.providers.common.ai.toolsets.mcp.MCPToolset` — connect to
   `MCP servers <https://modelcontextprotocol.io/>`__ configured via Airflow
   connections.
 
-All three implement pydantic-ai's
+All five implement pydantic-ai's
 `AbstractToolset <https://ai.pydantic.dev/toolsets/>`__ interface and can be

Review Comment:
   This doc now says "Five toolsets are included" and lists 5, but the same 
page also documents `LoggingToolset` later (which is also shipped in 
`airflow.providers.common.ai.toolsets.logging`). Either include it in the 
count/list, or clarify that the list is only the primary/curated toolsets and 
that `LoggingToolset` is an additional wrapper toolset.



##########
providers/common/ai/src/airflow/providers/common/ai/toolsets/multimodal.py:
##########
@@ -0,0 +1,203 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+"""Curated multimodal toolset for file and object-store inspection."""
+
+from __future__ import annotations
+
+import json
+from typing import TYPE_CHECKING, Any
+
+from pydantic_ai.tools import ToolDefinition
+from pydantic_ai.toolsets.abstract import AbstractToolset, ToolsetTool
+from pydantic_core import SchemaValidator, core_schema
+
+from airflow.providers.common.ai.utils.file_analysis import (
+    _infer_partitions,
+    _resolve_paths,
+    build_file_analysis_request,
+    detect_file_format,
+)
+from airflow.providers.common.compat.sdk import ObjectStoragePath
+
+if TYPE_CHECKING:
+    from pydantic_ai._run_context import RunContext
+
+_PASSTHROUGH_VALIDATOR = SchemaValidator(core_schema.any_schema())
+
+_LIST_FILES_SCHEMA: dict[str, Any] = {
+    "type": "object",
+    "properties": {
+        "file_path": {
+            "type": "string",
+            "description": "Optional file or prefix to inspect. Defaults to the configured file_path.",
+        },
+    },
+}
+
+_LOAD_FILES_SCHEMA: dict[str, Any] = {
+    "type": "object",
+    "properties": {
+        "file_path": {
+            "type": "string",
+            "description": "Optional file or prefix to inspect. Defaults to the configured file_path.",
+        },
+    },
+}
+
+
+class MultimodalToolset(AbstractToolset[Any]):
+    """
+    Curated toolset that lets an LLM agent inspect files from a configured file path or prefix.
+
+    ``list_files`` returns JSON metadata for files under ``file_path`` or an
+    overridden tool argument. ``load_files`` returns normalized text context for
+    text-like inputs and text plus ``BinaryContent`` attachments for images and
+    PDFs. The toolset reuses the same safety limits and format handling as
+    :class:`~airflow.providers.common.ai.operators.llm_file_analysis.LLMFileAnalysisOperator`.
+
+    :param file_path: File, directory, or object-storage prefix to expose.
+    :param file_conn_id: Optional Airflow connection ID for the storage backend.
+    :param max_files: Maximum number of files to resolve from a directory or
+        prefix. Default ``20``.
+    :param max_file_size_bytes: Maximum size of any single input file. Default
+        ``5 MiB``.
+    :param max_total_size_bytes: Maximum cumulative size across all resolved
+        files. Default ``20 MiB``.
+    :param max_text_chars: Maximum normalized text context returned from
+        ``load_files``. Default ``100000``.
+    :param sample_rows: Maximum sampled rows or records for structured file
+        previews. Default ``10``.
+    """
+
+    def __init__(
+        self,
+        file_path: str,
+        *,
+        file_conn_id: str | None = None,
+        max_files: int = 20,
+        max_file_size_bytes: int = 5 * 1024 * 1024,
+        max_total_size_bytes: int = 20 * 1024 * 1024,
+        max_text_chars: int = 100_000,
+        sample_rows: int = 10,
+    ) -> None:
+        if max_files <= 0:
+            raise ValueError("max_files must be greater than zero.")
+        if max_file_size_bytes <= 0:
+            raise ValueError("max_file_size_bytes must be greater than zero.")
+        if max_total_size_bytes <= 0:
+            raise ValueError("max_total_size_bytes must be greater than zero.")
+        if max_text_chars <= 0:
+            raise ValueError("max_text_chars must be greater than zero.")
+        if sample_rows <= 0:
+            raise ValueError("sample_rows must be greater than zero.")
+
+        self._file_path = file_path
+        self._file_conn_id = file_conn_id
+        self._max_files = max_files
+        self._max_file_size_bytes = max_file_size_bytes
+        self._max_total_size_bytes = max_total_size_bytes
+        self._max_text_chars = max_text_chars
+        self._sample_rows = sample_rows
+
+    @property
+    def id(self) -> str:
+        return f"multimodal-{self._file_path}"
+
+    def _resolve_target(self, file_path: str = "") -> ObjectStoragePath:
+        return ObjectStoragePath(file_path or self._file_path, conn_id=self._file_conn_id)
+

Review Comment:
   The tool argument `file_path` fully overrides the configured `file_path` 
(`_resolve_target(file_path or self._file_path)`), which lets the agent 
read/list arbitrary local/object-store paths (e.g. `/etc/passwd`) rather than 
being constrained to the configured root/prefix. Consider enforcing that any 
override is a subpath of the configured `file_path` (e.g. treat the tool 
argument as a relative path and/or validate with 
`ObjectStoragePath(...).is_relative_to(base)` before resolving).
   ```suggestion
        """
        Resolve the effective target path, constraining overrides to the configured base.

        The optional ``file_path`` argument is treated as a *relative* path segment
        under the configured ``self._file_path``. Absolute paths or fully-qualified
        URIs are rejected to prevent escaping the configured root/prefix.
        """
        base = ObjectStoragePath(self._file_path, conn_id=self._file_conn_id)
        if not file_path:
            return base

        # Disallow absolute filesystem paths and fully-qualified URIs, which
        # would override the configured base and potentially expose arbitrary
        # locations (e.g. "/etc/passwd" or "s3://other-bucket").
        if file_path.startswith("/") or "://" in file_path:
            raise ValueError(
                "file_path override must be a relative path under the configured file_path."
            )

        return base / file_path
   ```
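
   For reference, the constraint described above can be sketched standalone. This is a minimal illustration, not the PR's implementation: it uses `pathlib.PurePosixPath` as a stand-in for `ObjectStoragePath` (so it ignores `conn_id` and storage schemes), and the function name `resolve_target` is hypothetical. It also rejects `..` segments, which the suggestion above does not cover:

   ```python
   from pathlib import PurePosixPath


   def resolve_target(base: str, override: str = "") -> PurePosixPath:
       """Hypothetical stand-in for MultimodalToolset._resolve_target.

       Uses plain pathlib instead of ObjectStoragePath to illustrate the
       subpath constraint; real code must also handle conn_id and URI schemes.
       """
       root = PurePosixPath(base)
       if not override:
           return root
       # Reject absolute paths and fully-qualified URIs, which would replace
       # the configured base entirely (e.g. "/etc/passwd", "s3://other-bucket").
       if override.startswith("/") or "://" in override:
           raise ValueError("file_path override must be relative to the configured file_path.")
       # Reject ".." segments that could traverse out of the base prefix.
       if ".." in PurePosixPath(override).parts:
           raise ValueError("file_path override must not escape the configured file_path.")
       return root / override
   ```

   With this shape, `resolve_target("data/reports", "q1/summary.csv")` resolves under the base, while `/etc/passwd`, `s3://other-bucket/x`, and `../secrets.txt` all raise `ValueError`.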



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
