Re: [PR] Add MultimodalToolset to Common AI [airflow]

via GitHub Tue, 31 Mar 2026 17:23:55 -0700


kaxil commented on code in PR #64407:
URL: https://github.com/apache/airflow/pull/64407#discussion_r3019109604



##########
providers/common/ai/src/airflow/providers/common/ai/toolsets/multimodal.py:
##########
@@ -0,0 +1,203 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+"""Curated multimodal toolset for file and object-store inspection."""
+
+from __future__ import annotations
+
+import json
+from typing import TYPE_CHECKING, Any
+
+from pydantic_ai.tools import ToolDefinition
+from pydantic_ai.toolsets.abstract import AbstractToolset, ToolsetTool
+from pydantic_core import SchemaValidator, core_schema
+
+from airflow.providers.common.ai.utils.file_analysis import (
+    _infer_partitions,
+    _resolve_paths,
+    build_file_analysis_request,
+    detect_file_format,
+)
+from airflow.providers.common.compat.sdk import ObjectStoragePath
+
+if TYPE_CHECKING:
+    from pydantic_ai._run_context import RunContext
+
+_PASSTHROUGH_VALIDATOR = SchemaValidator(core_schema.any_schema())
+
+_LIST_FILES_SCHEMA: dict[str, Any] = {
+    "type": "object",
+    "properties": {
+        "file_path": {

Review Comment:
   Both `list_files` and `load_files` accept an unrestricted `file_path` 
override from the LLM. The agent can pass `"/etc/shadow"` or 
`"s3://other-bucket/secrets/"` and the toolset reads it without validating the 
path is under the configured root. `SQLToolset` has `allowed_tables` for 
analogous scoping. Should this validate that overrides stay under 
`self._file_path`, or at minimum make the tool descriptions explicit that the 
LLM can read arbitrary paths accessible to the worker?
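
   A minimal sketch of the containment check suggested above, assuming 
   overrides arrive as plain strings or URL-style paths. The helper name 
   `is_under_root` is hypothetical and not part of the provider. It compares 
   paths lexically only, so it does not resolve symlinks.

```python
import posixpath
from urllib.parse import urlsplit


def is_under_root(candidate: str, root: str) -> bool:
    """Return True if ``candidate`` is ``root`` itself or lexically nested below it.

    Hypothetical helper for illustration; not part of the Airflow provider.
    """
    cand, base = urlsplit(candidate), urlsplit(root)
    # Scheme and bucket/host must match exactly (e.g. "s3" and "bucket").
    if (cand.scheme, cand.netloc) != (base.scheme, base.netloc):
        return False
    # Normalize with exactly one leading slash so ".." segments collapse
    # before the prefix comparison.
    cand_path = posixpath.normpath("/" + cand.path.lstrip("/"))
    base_path = posixpath.normpath("/" + base.path.lstrip("/"))
    # Compare on a path-segment boundary so "/database" is not under "/data".
    return cand_path == base_path or cand_path.startswith(base_path.rstrip("/") + "/")
```

   With a configured root of `s3://bucket/data/`, this accepts 
   `s3://bucket/data/2024/a.csv` and rejects `/etc/shadow`, 
   `s3://other-bucket/secrets/`, and `s3://bucket/data/../secrets/x`.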



##########
providers/common/ai/src/airflow/providers/common/ai/toolsets/multimodal.py:
##########
@@ -0,0 +1,203 @@
+"""Curated multimodal toolset for file and object-store inspection."""
+
+from __future__ import annotations
+
+import json
+from typing import TYPE_CHECKING, Any
+
+from pydantic_ai.tools import ToolDefinition
+from pydantic_ai.toolsets.abstract import AbstractToolset, ToolsetTool
+from pydantic_core import SchemaValidator, core_schema
+
+from airflow.providers.common.ai.utils.file_analysis import (
+    _infer_partitions,
+    _resolve_paths,
+    build_file_analysis_request,
+    detect_file_format,
+)
+from airflow.providers.common.compat.sdk import ObjectStoragePath
+
+if TYPE_CHECKING:
+    from pydantic_ai._run_context import RunContext
+
+_PASSTHROUGH_VALIDATOR = SchemaValidator(core_schema.any_schema())
+
+_LIST_FILES_SCHEMA: dict[str, Any] = {
+    "type": "object",
+    "properties": {
+        "file_path": {
+            "type": "string",
+            "description": "Optional file or prefix to inspect. Defaults to the configured file_path.",
+        },
+    },
+}
+
+_LOAD_FILES_SCHEMA: dict[str, Any] = {
+    "type": "object",
+    "properties": {
+        "file_path": {
+            "type": "string",
+            "description": "Optional file or prefix to inspect. Defaults to the configured file_path.",
+        },
+    },
+}
+
+
+class MultimodalToolset(AbstractToolset[Any]):

Review Comment:
   "Multimodal" describes the LLM's capability, not what this toolset does. The 
other toolsets are named after what they operate on: `SQLToolset`, 
`HookToolset`, `DataFusionToolset`. This one reads files. Something like 
`FileToolset` would be more consistent and self-descriptive, and it aligns with 
the `file_analysis.py` module it wraps.



##########
providers/common/ai/src/airflow/providers/common/ai/toolsets/__init__.py:
##########
@@ -40,4 +40,20 @@ def __getattr__(name: str):
 
             raise AirflowOptionalProviderFeatureException(e)
         return MCPToolset
+    if name == "DataFusionToolset":
+        try:
+            from airflow.providers.common.ai.toolsets.datafusion import DataFusionToolset
+        except ImportError as e:
+            from airflow.providers.common.compat.sdk import AirflowOptionalProviderFeatureException
+
+            raise AirflowOptionalProviderFeatureException(e)
+        return DataFusionToolset
+    if name == "MultimodalToolset":

Review Comment:
   Unlike `MCPToolset` (needs `mcp`) and `DataFusionToolset` (needs 
`datafusion`), `MultimodalToolset` has no optional dependencies -- 
`file_analysis` and `ObjectStoragePath` are always available. The `try/except 
ImportError` here is dead code. Should this just be imported eagerly like 
`HookToolset`?



##########
providers/common/ai/src/airflow/providers/common/ai/toolsets/multimodal.py:
##########
@@ -0,0 +1,203 @@
+"""Curated multimodal toolset for file and object-store inspection."""
+
+from __future__ import annotations
+
+import json
+from typing import TYPE_CHECKING, Any
+
+from pydantic_ai.tools import ToolDefinition
+from pydantic_ai.toolsets.abstract import AbstractToolset, ToolsetTool
+from pydantic_core import SchemaValidator, core_schema
+
+from airflow.providers.common.ai.utils.file_analysis import (
+    _infer_partitions,
+    _resolve_paths,
+    build_file_analysis_request,
+    detect_file_format,
+)
+from airflow.providers.common.compat.sdk import ObjectStoragePath
+
+if TYPE_CHECKING:
+    from pydantic_ai._run_context import RunContext
+
+_PASSTHROUGH_VALIDATOR = SchemaValidator(core_schema.any_schema())
+
+_LIST_FILES_SCHEMA: dict[str, Any] = {
+    "type": "object",
+    "properties": {
+        "file_path": {
+            "type": "string",
+            "description": "Optional file or prefix to inspect. Defaults to the configured file_path.",
+        },
+    },
+}
+
+_LOAD_FILES_SCHEMA: dict[str, Any] = {
+    "type": "object",
+    "properties": {
+        "file_path": {
+            "type": "string",
+            "description": "Optional file or prefix to inspect. Defaults to the configured file_path.",
+        },
+    },
+}
+
+
+class MultimodalToolset(AbstractToolset[Any]):
+    """
+    Curated toolset that lets an LLM agent inspect files from a configured file path or prefix.
+
+    ``list_files`` returns JSON metadata for files under ``file_path`` or an
+    overridden tool argument. ``load_files`` returns normalized text context for
+    text-like inputs and text plus ``BinaryContent`` attachments for images and
+    PDFs. The toolset reuses the same safety limits and format handling as
+    :class:`~airflow.providers.common.ai.operators.llm_file_analysis.LLMFileAnalysisOperator`.
+
+    :param file_path: File, directory, or object-storage prefix to expose.
+    :param file_conn_id: Optional Airflow connection ID for the storage backend.
+    :param max_files: Maximum number of files to resolve from a directory or
+        prefix. Default ``20``.
+    :param max_file_size_bytes: Maximum size of any single input file. Default
+        ``5 MiB``.
+    :param max_total_size_bytes: Maximum cumulative size across all resolved
+        files. Default ``20 MiB``.
+    :param max_text_chars: Maximum normalized text context returned from
+        ``load_files``. Default ``100000``.
+    :param sample_rows: Maximum sampled rows or records for structured file
+        previews. Default ``10``.
+    """
+
+    def __init__(
+        self,
+        file_path: str,
+        *,
+        file_conn_id: str | None = None,
+        max_files: int = 20,
+        max_file_size_bytes: int = 5 * 1024 * 1024,
+        max_total_size_bytes: int = 20 * 1024 * 1024,
+        max_text_chars: int = 100_000,
+        sample_rows: int = 10,
+    ) -> None:
+        if max_files <= 0:
+            raise ValueError("max_files must be greater than zero.")
+        if max_file_size_bytes <= 0:
+            raise ValueError("max_file_size_bytes must be greater than zero.")
+        if max_total_size_bytes <= 0:
+            raise ValueError("max_total_size_bytes must be greater than zero.")
+        if max_text_chars <= 0:
+            raise ValueError("max_text_chars must be greater than zero.")
+        if sample_rows <= 0:
+            raise ValueError("sample_rows must be greater than zero.")
+
+        self._file_path = file_path
+        self._file_conn_id = file_conn_id
+        self._max_files = max_files
+        self._max_file_size_bytes = max_file_size_bytes
+        self._max_total_size_bytes = max_total_size_bytes
+        self._max_text_chars = max_text_chars
+        self._sample_rows = sample_rows
+
+    @property
+    def id(self) -> str:
+        return f"multimodal-{self._file_path}"
+
+    def _resolve_target(self, file_path: str = "") -> ObjectStoragePath:
+        return ObjectStoragePath(file_path or self._file_path, conn_id=self._file_conn_id)

Review Comment:
   When the LLM provides a local path override like `/tmp/local-file`, this 
still passes `self._file_conn_id` (e.g. `aws_default`) to `ObjectStoragePath`. 
What happens when a local path gets an S3 connection attached? Worth either 
validating the scheme matches or documenting that overrides must use the same 
storage backend.
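
   One way the scheme check could look, as a rough sketch: it assumes a path 
   without a URL scheme means local storage and labels it `file`. The helper 
   names `scheme_of` and `check_same_backend` are hypothetical, and how 
   `ObjectStoragePath` itself treats a mismatched `conn_id` is not covered 
   here.

```python
from urllib.parse import urlsplit


def scheme_of(path: str) -> str:
    """Return the URL scheme of ``path``, treating scheme-less paths as local."""
    # Assumption: no scheme means a local filesystem path.
    return urlsplit(path).scheme or "file"


def check_same_backend(override: str, configured: str) -> None:
    """Raise ValueError if ``override`` uses a different scheme than ``configured``.

    Hypothetical helper for illustration; not part of the Airflow provider.
    """
    override_scheme = scheme_of(override)
    configured_scheme = scheme_of(configured)
    if override_scheme != configured_scheme:
        raise ValueError(
            f"Override uses scheme {override_scheme!r} but the toolset is "
            f"configured for {configured_scheme!r}."
        )
```

   Called with `/tmp/local-file` against a configured `s3://bucket/data/`, 
   this raises `ValueError` before any connection is attached to the path.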



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
