Copilot commented on code in PR #63661:
URL: https://github.com/apache/airflow/pull/63661#discussion_r3069726432


##########
scripts/ci/prek/extract_agent_skills.py:
##########
@@ -0,0 +1,226 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+from __future__ import annotations
+
+"""
+Extract agent skill definitions from SKILL.md marker comments and write 
skills.json.
+
+This script mirrors the pattern used by update-breeze-cmd-output (line 909 of
+.pre-commit-config.yaml): a prek hook reads a source file, generates a derived
+artifact, and fails CI if the committed artifact diverges from what would be 
generated.
+
+Markers in SKILL.md look like:
+    <!-- agent-skill-sync: workflow=run-tests host="uv run..." 
breeze="pytest..." -->
+
+This script:
+1. Parses those markers from SKILL.md
+2. Builds a structured skills dict
+3. Writes .github/skills/breeze-contribution/skills.json
+4. Exits 1 if the committed skills.json differs from what was just generated
+   (drift detection — same principle as update-breeze-cmd-output)
+
+Usage:
+    # Generate/update skills.json:
+    python3 scripts/ci/prek/extract_agent_skills.py
+
+    # Check for drift only (CI mode — exits 1 if drift detected):
+    python3 scripts/ci/prek/extract_agent_skills.py --check
+"""
+
+import argparse
+import json
+import re
+import sys
+from pathlib import Path
+
+# Canonical paths — relative to repo root
+SKILL_MD = Path(".github/skills/breeze-contribution/SKILL.md")
+SKILLS_JSON = Path(".github/skills/breeze-contribution/skills.json")
+
+# Regex to match agent-skill-sync marker lines
+# Handles both quoted and unquoted values:
+#   workflow=run-tests
+#   host="uv run --project {dist} pytest {path} -xvs"
+#   fallback=missing_system_deps
+MARKER_RE = re.compile(r"<!--\s*agent-skill-sync:\s*(.+?)\s*-->")
+FIELD_RE = re.compile(r'(\w+)=(?:"([^"]*?)"|(\S+))')
+
+
+def parse_marker(line: str) -> dict[str, str] | None:
+    """
+    Parse a single agent-skill-sync marker line into a dict.
+
+    Returns None if the line does not contain a valid marker.
+    """
+    match = MARKER_RE.search(line)
+    if not match:
+        return None
+
+    fields: dict[str, str] = {}
+    for field_match in FIELD_RE.finditer(match.group(1)):
+        key = field_match.group(1)
+        # group(2) = quoted value, group(3) = unquoted value
+        value = field_match.group(2) if field_match.group(2) is not None else 
field_match.group(3)
+        fields[key] = value
+
+    if "workflow" not in fields:
+        return None
+
+    return fields
+
+
+RST_SKILL_RE = re.compile(r"[.][.] agent-skill::[ ]*\n(?P<fields>(?:[ 
]{3}:[^:]+: .+\n)+)", re.MULTILINE)
+RST_FIELD_RE = re.compile(r"[ ]{3}:([^:]+): (.+)")
+
+
+def extract_skills_from_rst(rst_path: Path) -> list[dict[str, str]]:
+    """Extract agent-skill directives from RST contributing docs."""
+    if not rst_path.exists():
+        return []
+    skills = []
+    for match in RST_SKILL_RE.finditer(rst_path.read_text(encoding="utf-8")):
+        fields = dict(RST_FIELD_RE.findall(match.group("fields")))
+        if "id" in fields:
+            fields["workflow"] = fields.pop("id")
+            skills.append(fields)
+    return skills
+
+
+def extract_skills(skill_md_path: Path) -> list[dict[str, str]]:
+    """
+    Read SKILL.md and extract all agent-skill-sync markers.
+
+    Returns a list of skill dicts, one per marker found.
+    """
+    if not skill_md_path.exists():
+        print(f"ERROR: {skill_md_path} not found. Run from repo root.", 
file=sys.stderr)
+        sys.exit(1)
+
+    skills = []
+    for line in skill_md_path.read_text(encoding="utf-8").splitlines():
+        parsed = parse_marker(line)
+        if parsed:
+            skills.append(parsed)
+
+    return skills
+
+
+def build_skills_json(skills: list[dict[str, str]]) -> dict:
+    """
+    Build the full skills.json structure from extracted skill dicts.
+    """
+    return {
+        "$schema": "breeze-agent-skills/v1",
+        "source": str(SKILL_MD),
+        "description": (
+            "Auto-generated from agent-skill-sync markers in SKILL.md. "
+            "Do not edit manually — update SKILL.md markers instead."
+        ),
+        "skills": [
+            {
+                "workflow": s["workflow"],
+                "host": s.get("host", ""),
+                "breeze": s.get("breeze", ""),
+                "fallback_condition": s.get("fallback", 
s.get("fallback_condition", "never")),
+            }
+            for s in skills
+        ],

Review Comment:
   `extract_skills_from_rst()` returns fields like `local` (host command) and 
`breeze`, but `build_skills_json()` only reads `host`. As a result, the 
generated `skills.json` loses the host command (it is currently written as an 
empty string). Consider normalizing field names (e.g., map `local` -> `host`) 
before building the JSON so host commands are preserved.



##########
contributing-docs/03_contributors_quick_start.rst:
##########
@@ -745,6 +754,16 @@ All Tests are inside ./tests directory.
 - Running Unit tests inside Breeze environment.
 
   Just run ``pytest filepath+filename`` to run the tests.
+.. agent-skill::
+   :id: run-tests
+   :context: host
+   :local: uv run --project {distribution_folder} pytest {test_path} -xvs
+   :breeze: pytest {test_path} -xvs
+   :prereqs: run-static-checks

Review Comment:
   `:prereqs: run-static-checks` doesn’t correspond to any `:id:` defined in 
this file (the static checks skill is `static-checks`). This makes the 
dependency reference inconsistent for downstream tooling. Consider changing it 
to `static-checks` (or rename the referenced skill id).
   



##########
.github/skills/breeze-contribution/DX_REPORT.md:
##########
@@ -0,0 +1,72 @@
+ <!-- SPDX-License-Identifier: Apache-2.0
+      https://www.apache.org/licenses/LICENSE-2.0 -->
+
+# Developer Experience Report: With vs Without Breeze Agent Skill
+
+## The Problem This Skill Solves
+
+Without this skill, AI agents treat Airflow like a generic Python project.
+This report documents the exact failure modes and how the skill fixes them.
+
+## Failure Mode 1: Wrong test command on host
+
+WITHOUT skill:
+    Agent runs: pytest airflow-core/tests/unit/test_dag.py
+    Result: ModuleNotFoundError: No module named 'airflow'
+    Agent is stuck. No context for what to do next.
+
+WITH skill:
+    Agent calls: get_command('run-tests', 
test_path='airflow-core/tests/unit/test_dag.py')

Review Comment:
   The example `get_command('run-tests', test_path=...)` is missing 
`distribution_folder`, but the current `run-tests` host template includes 
`{distribution_folder}` and will raise `KeyError` if it’s not provided. Update 
the example to include `distribution_folder` (or adjust `get_command` to 
provide a default).
   



##########
.github/skills/breeze-contribution/skills.json:
##########
@@ -0,0 +1,19 @@
+{
+  "$schema": "breeze-agent-skills/v1",
+  "source": ".github/skills/breeze-contribution/SKILL.md",
+  "description": "Auto-generated from agent-skill-sync markers in SKILL.md. Do 
not edit manually — update SKILL.md markers instead.",
+  "skills": [
+    {
+      "workflow": "static-checks",
+      "host": "",
+      "breeze": "prek",
+      "fallback_condition": "never"
+    },
+    {
+      "workflow": "run-tests",
+      "host": "",

Review Comment:
   Both skills have an empty `host` command, which makes the generated spec 
unusable on host. This appears to be caused by the generator expecting a `host` 
key while the RST directives use `:local:`. After fixing the generator mapping, 
please regenerate and commit `skills.json` so the host commands are populated 
(e.g., `prek`, `uv run --project ... pytest ...`).
   



##########
.pre-commit-config.yaml:
##########
@@ -960,6 +960,16 @@ repos:
           ^generated/provider_dependencies\.json$
         require_serial: true
         pass_filenames: false
+      - id: check-agent-skills-drift
+        name: Check agent skills are in sync with SKILL.md
+        description: Fails if skills.json has drifted from agent-skill-sync 
markers in SKILL.md
+        entry: python3 scripts/ci/prek/extract_agent_skills.py --check
+        language: python
+        files: >
+          (?x)
+          ^\.github/skills/breeze-contribution/SKILL\.md$|
+          ^\.github/skills/breeze-contribution/skills\.json$

Review Comment:
   This hook only triggers on changes to `SKILL.md` and `skills.json`, but 
`extract_agent_skills.py` also reads 
`contributing-docs/03_contributors_quick_start.rst`. As a result, updates to 
the RST directives can cause `skills.json` to drift without this hook running. 
Add the contributing-docs RST path(s) to `files:` (or switch to 
`always_run: true`) so drift enforcement covers all generation inputs.
   



##########
scripts/ci/prek/breeze_context_detect.py:
##########
@@ -0,0 +1,160 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+from __future__ import annotations
+
+"""
+Breeze context detector for agent skills.
+
+Zero-dependency (stdlib only). Called by AI tools to determine whether
+they are running on the HOST or inside a Breeze container, then returns
+the correct command for the requested workflow.
+
+Usage:
+    python3 scripts/ci/prek/breeze_context_detect.py
+    python3 scripts/ci/prek/breeze_context_detect.py --workflow run-tests
+"""
+
+import os
+from pathlib import Path
+
+
+def is_inside_breeze() -> bool:
+    """
+    Returns True if running inside a Breeze container.
+
+    Detection priority (most reliable first):
+    1. AIRFLOW_BREEZE_CONTAINER=true  (explicit env marker)
+    2. /.dockerenv exists             (Docker sets in all containers)
+    3. /opt/airflow exists            (Breeze canonical mount point)
+    """
+    if os.getenv("AIRFLOW_BREEZE_CONTAINER") == "true":
+        return True
+    if Path("/.dockerenv").exists():
+        return True
+    if Path("/.containerenv").exists():
+        return True
+    if Path("/opt/airflow").exists():
+        return True
+    return False
+
+
+def get_context() -> str:
+    return "breeze" if is_inside_breeze() else "host"
+
+
+WORKFLOWS: dict[str, dict[str, str]] = {
+    "static-checks": {
+        "host": "prek",
+        "breeze": "prek",
+        "note": "Runs identically on host and inside container",
+    },
+    "run-tests": {
+        "host": "uv run --project {distribution_folder} pytest {test_path} 
-xvs",
+        "breeze": "pytest {test_path} -xvs",
+        "note": "Local-first. Fall back to breeze exec if missing system deps",
+    },
+    "enter-breeze": {
+        "host": "breeze shell",
+        "breeze": "ERROR: already inside Breeze container",
+        "note": "Only valid on host",
+    },
+    "start-airflow": {
+        "host": "breeze start-airflow",
+        "breeze": "ERROR: already inside Breeze container",
+        "note": "Starts full Airflow stack for system verification",
+    },
+    "exec-command": {
+        "host": "breeze exec -- {command}",
+        "breeze": "{command}",
+        "note": "Run single command in container from host, or directly if 
already inside",
+    },
+    "build-docs": {
+        "host": "breeze build-docs",
+        "breeze": "ERROR: build-docs runs on host only",
+        "note": "Doc builds always run on host",
+    },
+    "git-operations": {
+        "host": "git {args}",
+        "breeze": "ERROR: git operations must run on host",
+        "note": "All git operations run on host only",
+    },
+    "stage-changes": {
+        "host": "git add {file_path}",
+        "breeze": "ERROR: git operations must run on host",
+        "note": "Stage files for commit — always runs on host before prek",
+    },
+    "system-verify": {
+        "host": "breeze start-airflow --integration {integration}",
+        "breeze": "pytest {dag_test_path} -xvs",
+        "note": "Full system verification — start Airflow on host, run DAG 
tests inside Breeze",
+    },
+    "force-context-override": {
+        "host": "AIRFLOW_BREEZE_CONTAINER=true {command}",
+        "breeze": "{command}",
+        "note": "Override context detection for testing — sets explicit env 
marker",
+    },
+}
+
+
+def get_command(workflow: str, **params: str) -> dict[str, str]:
+    if workflow not in WORKFLOWS:
+        raise ValueError(f"Unknown workflow '{workflow}'. Available: 
{sorted(WORKFLOWS.keys())}")
+    context = get_context()
+    template = WORKFLOWS[workflow][context]
+    command = template.format(**params) if params else template
+    return {
+        "context": context,
+        "workflow": workflow,
+        "command": command,
+        "note": WORKFLOWS[workflow].get("note", ""),
+    }
+
+
+def _main() -> None:
+    import argparse
+
+    parser = argparse.ArgumentParser(
+        description="Detect Breeze context and get recommended agent skill 
commands"
+    )
+    parser.add_argument("--workflow", choices=sorted(WORKFLOWS.keys()))
+    parser.add_argument("--test-path", default="{test_path}")
+    parser.add_argument("--distribution-folder", 
default="{distribution_folder}")
+    args = parser.parse_args()
+
+    context = get_context()
+    print(f"Context: {context.upper()}")
+    print()
+
+    if args.workflow:
+        result = get_command(
+            args.workflow,
+            test_path=args.test_path,
+            distribution_folder=args.distribution_folder,
+        )
+        print(f"Workflow : {result['workflow']}")
+        print(f"Command  : {result['command']}")
+        print(f"Note     : {result['note']}")
+    else:
+        print("All workflows for this context:")
+        print("-" * 60)
+        for wf_name in sorted(WORKFLOWS.keys()):
+            result = get_command(wf_name)
+            print(f"  {wf_name:<20} {result['command']}")

Review Comment:
   `_main()` always passes only `test_path` and `distribution_folder` into 
`get_command()`. Selecting workflows like `system-verify` (requires 
`integration`/`dag_test_path`) or `stage-changes` (requires `file_path`) will 
raise a `KeyError` during `.format()`. Either restrict `--workflow` choices in 
the CLI to the ones it can parameterize, or add CLI args for all placeholders 
(or make formatting tolerant of missing params).



##########
.github/skills/breeze-contribution/DX_REPORT.md:
##########
@@ -0,0 +1,72 @@
+ <!-- SPDX-License-Identifier: Apache-2.0
+      https://www.apache.org/licenses/LICENSE-2.0 -->
+
+# Developer Experience Report: With vs Without Breeze Agent Skill
+
+## The Problem This Skill Solves
+
+Without this skill, AI agents treat Airflow like a generic Python project.
+This report documents the exact failure modes and how the skill fixes them.
+
+## Failure Mode 1: Wrong test command on host
+
+WITHOUT skill:
+    Agent runs: pytest airflow-core/tests/unit/test_dag.py
+    Result: ModuleNotFoundError: No module named 'airflow'
+    Agent is stuck. No context for what to do next.
+
+WITH skill:
+    Agent calls: get_command('run-tests', 
test_path='airflow-core/tests/unit/test_dag.py')
+    Context detected: HOST
+    Agent runs: uv run --project airflow-core pytest 
airflow-core/tests/unit/test_dag.py -xvs
+    Result: Tests run correctly
+
+## Failure Mode 2: Running breeze inside breeze
+
+WITHOUT skill:
+    Agent inside container runs: breeze shell
+    Result: Hangs or throws Docker-in-Docker error
+    Agent cannot recover
+
+WITH skill:
+    Agent calls: get_command('enter-breeze')
+    Context detected: BREEZE
+    Agent receives: ERROR: already inside Breeze container
+    Agent stops and runs pytest directly instead
+
+## Failure Mode 3: Git operations inside container
+
+WITHOUT skill:
+    Agent inside container runs: git push origin main
+    Result: SSH key not available, credential error
+    Agent cannot push
+
+WITH skill:
+    Agent calls: get_command('git-operations')
+    Context detected: BREEZE
+    Agent receives: ERROR: git operations must run on host
+    Agent exits container first, then pushes
+
+## Failure Mode 4: Skill drift (the maintenance problem)
+
+WITHOUT sync mechanism:
+    Breeze adds new command -> SKILL.md not updated -> agents use wrong 
commands
+    Detected only when agent fails in production
+
+WITH extract_agent_skills.py --check as prek hook:
+    Contributor changes SKILL.md without updating skills.json -> prek fails
+    Drift detected at commit time, not runtime
+
+## Verification
+
+Run context detection on host:
+    python3 scripts/ci/prek/breeze_context_detect.py
+    Expected: Context: HOST
+
+Run drift check:
+    python3 scripts/ci/prek/extract_agent_skills.py --check
+    Expected: OK: skills.json is in sync with SKILL.md
+
+Run full test suite:
+    python3 -m pytest scripts/ci/prek/test_breeze_agent_skills.py -v
+    Expected: 20 passed

Review Comment:
   The stated test expectation (`Expected: 20 passed`) is out of date relative 
to the added test suite (the PR description and the current test file indicate 
29 tests). Update this count or avoid hard-coding the number of passing tests 
to prevent future drift.
   



##########
scripts/ci/prek/test_breeze_agent_skills.py:
##########
@@ -0,0 +1,324 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+from __future__ import annotations
+
+# ruff: noqa: S101
+
+"""
+Tests for Breeze agent skill extraction and context detection.
+
+Run with:
+    python3 -m pytest scripts/ci/prek/test_breeze_agent_skills.py -v
+"""
+
+import json
+import os
+import sys
+from pathlib import Path
+from unittest import mock
+
+import pytest
+
+# Make scripts importable when running from repo root
+sys.path.insert(0, str(Path(__file__).parent))
+
+from breeze_context_detect import get_command, is_inside_breeze
+from extract_agent_skills import build_skills_json, check_drift, 
extract_skills, parse_marker

Review Comment:
   This test module mixes two import strategies: it prepends `scripts/ci/prek` 
to `sys.path` (enabling `import breeze_context_detect`), but later imports 
`ci.prek.*`, which requires `scripts/` to be on `PYTHONPATH` (per 
`scripts/pyproject.toml`). As written, running pytest from the repo root as 
suggested in the docstring will likely fail to resolve `ci.prek`. Use a single 
consistent import approach (e.g., add `scripts` to `sys.path` and import 
everything via `ci.prek.*`, or avoid `ci.prek.*` here).
   



##########
scripts/ci/prek/extract_agent_skills.py:
##########
@@ -0,0 +1,226 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+from __future__ import annotations
+
+"""
+Extract agent skill definitions from SKILL.md marker comments and write 
skills.json.
+
+This script mirrors the pattern used by update-breeze-cmd-output (line 909 of
+.pre-commit-config.yaml): a prek hook reads a source file, generates a derived
+artifact, and fails CI if the committed artifact diverges from what would be 
generated.
+
+Markers in SKILL.md look like:
+    <!-- agent-skill-sync: workflow=run-tests host="uv run..." 
breeze="pytest..." -->
+
+This script:
+1. Parses those markers from SKILL.md
+2. Builds a structured skills dict
+3. Writes .github/skills/breeze-contribution/skills.json
+4. Exits 1 if the committed skills.json differs from what was just generated
+   (drift detection — same principle as update-breeze-cmd-output)
+
+Usage:
+    # Generate/update skills.json:
+    python3 scripts/ci/prek/extract_agent_skills.py
+
+    # Check for drift only (CI mode — exits 1 if drift detected):
+    python3 scripts/ci/prek/extract_agent_skills.py --check
+"""
+
+import argparse
+import json
+import re
+import sys
+from pathlib import Path
+
+# Canonical paths — relative to repo root
+SKILL_MD = Path(".github/skills/breeze-contribution/SKILL.md")
+SKILLS_JSON = Path(".github/skills/breeze-contribution/skills.json")
+
+# Regex to match agent-skill-sync marker lines
+# Handles both quoted and unquoted values:
+#   workflow=run-tests
+#   host="uv run --project {dist} pytest {path} -xvs"
+#   fallback=missing_system_deps
+MARKER_RE = re.compile(r"<!--\s*agent-skill-sync:\s*(.+?)\s*-->")
+FIELD_RE = re.compile(r'(\w+)=(?:"([^"]*?)"|(\S+))')
+
+
+def parse_marker(line: str) -> dict[str, str] | None:
+    """
+    Parse a single agent-skill-sync marker line into a dict.
+
+    Returns None if the line does not contain a valid marker.
+    """
+    match = MARKER_RE.search(line)
+    if not match:
+        return None
+
+    fields: dict[str, str] = {}
+    for field_match in FIELD_RE.finditer(match.group(1)):
+        key = field_match.group(1)
+        # group(2) = quoted value, group(3) = unquoted value
+        value = field_match.group(2) if field_match.group(2) is not None else 
field_match.group(3)
+        fields[key] = value
+
+    if "workflow" not in fields:
+        return None
+
+    return fields
+
+
+RST_SKILL_RE = re.compile(r"[.][.] agent-skill::[ ]*\n(?P<fields>(?:[ 
]{3}:[^:]+: .+\n)+)", re.MULTILINE)
+RST_FIELD_RE = re.compile(r"[ ]{3}:([^:]+): (.+)")
+
+
+def extract_skills_from_rst(rst_path: Path) -> list[dict[str, str]]:
+    """Extract agent-skill directives from RST contributing docs."""
+    if not rst_path.exists():
+        return []
+    skills = []
+    for match in RST_SKILL_RE.finditer(rst_path.read_text(encoding="utf-8")):
+        fields = dict(RST_FIELD_RE.findall(match.group("fields")))
+        if "id" in fields:
+            fields["workflow"] = fields.pop("id")
+            skills.append(fields)
+    return skills

Review Comment:
   `extract_skills_from_rst()` uses `RST_SKILL_RE` which only matches option 
lines with a value (`:key: <value>`). In `03_contributors_quick_start.rst`, the 
`static-checks` directive has an empty `:prereqs:` line, which causes the regex 
match to stop early and drop subsequent fields like `:description:` from 
extraction. If those fields are meant to be part of the generated spec, 
consider updating the parser to handle empty option values (or parse directives 
line-by-line rather than with a single regex).



##########
.pre-commit-config.yaml:
##########
@@ -960,6 +960,16 @@ repos:
           ^generated/provider_dependencies\.json$
         require_serial: true
         pass_filenames: false
+      - id: check-agent-skills-drift
+        name: Check agent skills are in sync with SKILL.md
+        description: Fails if skills.json has drifted from agent-skill-sync 
markers in SKILL.md
+        entry: python3 scripts/ci/prek/extract_agent_skills.py --check

Review Comment:
   The hook `entry` hard-codes `python3 ...` while other local hooks use an 
executable script path (and `language: python` already provisions an 
interpreter). Hard-coding `python3` can break on platforms/environments where 
only `python` is available. Consider switching to `entry: 
./scripts/ci/prek/extract_agent_skills.py --check` (add a shebang) or `entry: 
python scripts/ci/prek/extract_agent_skills.py --check` for portability.
   



##########
scripts/ci/prek/extract_agent_skills.py:
##########
@@ -0,0 +1,226 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+from __future__ import annotations
+
+"""
+Extract agent skill definitions from SKILL.md marker comments and write 
skills.json.
+
+This script mirrors the pattern used by update-breeze-cmd-output (line 909 of
+.pre-commit-config.yaml): a prek hook reads a source file, generates a derived
+artifact, and fails CI if the committed artifact diverges from what would be 
generated.
+
+Markers in SKILL.md look like:
+    <!-- agent-skill-sync: workflow=run-tests host="uv run..." 
breeze="pytest..." -->
+
+This script:
+1. Parses those markers from SKILL.md
+2. Builds a structured skills dict
+3. Writes .github/skills/breeze-contribution/skills.json
+4. Exits 1 if the committed skills.json differs from what was just generated
+   (drift detection — same principle as update-breeze-cmd-output)
+
+Usage:
+    # Generate/update skills.json:
+    python3 scripts/ci/prek/extract_agent_skills.py
+
+    # Check for drift only (CI mode — exits 1 if drift detected):
+    python3 scripts/ci/prek/extract_agent_skills.py --check
+"""
+
+import argparse
+import json
+import re
+import sys
+from pathlib import Path
+
+# Canonical paths — relative to repo root
+SKILL_MD = Path(".github/skills/breeze-contribution/SKILL.md")
+SKILLS_JSON = Path(".github/skills/breeze-contribution/skills.json")
+
+# Regex to match agent-skill-sync marker lines
+# Handles both quoted and unquoted values:
+#   workflow=run-tests
+#   host="uv run --project {dist} pytest {path} -xvs"
+#   fallback=missing_system_deps
+MARKER_RE = re.compile(r"<!--\s*agent-skill-sync:\s*(.+?)\s*-->")
+FIELD_RE = re.compile(r'(\w+)=(?:"([^"]*?)"|(\S+))')
+
+
+def parse_marker(line: str) -> dict[str, str] | None:
+    """
+    Parse a single agent-skill-sync marker line into a dict.
+
+    Returns None if the line does not contain a valid marker.
+    """
+    match = MARKER_RE.search(line)
+    if not match:
+        return None
+
+    fields: dict[str, str] = {}
+    for field_match in FIELD_RE.finditer(match.group(1)):
+        key = field_match.group(1)
+        # group(2) = quoted value, group(3) = unquoted value
+        value = field_match.group(2) if field_match.group(2) is not None else 
field_match.group(3)
+        fields[key] = value
+
+    if "workflow" not in fields:
+        return None
+
+    return fields
+
+
+RST_SKILL_RE = re.compile(r"[.][.] agent-skill::[ ]*\n(?P<fields>(?:[ 
]{3}:[^:]+: .+\n)+)", re.MULTILINE)
+RST_FIELD_RE = re.compile(r"[ ]{3}:([^:]+): (.+)")
+
+
+def extract_skills_from_rst(rst_path: Path) -> list[dict[str, str]]:
+    """Extract agent-skill directives from RST contributing docs."""
+    if not rst_path.exists():
+        return []
+    skills = []
+    for match in RST_SKILL_RE.finditer(rst_path.read_text(encoding="utf-8")):
+        fields = dict(RST_FIELD_RE.findall(match.group("fields")))
+        if "id" in fields:
+            fields["workflow"] = fields.pop("id")
+            skills.append(fields)
+    return skills
+
+
+def extract_skills(skill_md_path: Path) -> list[dict[str, str]]:
+    """
+    Read SKILL.md and extract all agent-skill-sync markers.
+
+    Returns a list of skill dicts, one per marker found.
+    """
+    if not skill_md_path.exists():
+        print(f"ERROR: {skill_md_path} not found. Run from repo root.", 
file=sys.stderr)
+        sys.exit(1)
+
+    skills = []
+    for line in skill_md_path.read_text(encoding="utf-8").splitlines():
+        parsed = parse_marker(line)
+        if parsed:
+            skills.append(parsed)
+
+    return skills
+
+
+def build_skills_json(skills: list[dict[str, str]]) -> dict:
+    """
+    Build the full skills.json structure from extracted skill dicts.
+    """
+    return {
+        "$schema": "breeze-agent-skills/v1",
+        "source": str(SKILL_MD),
+        "description": (
+            "Auto-generated from agent-skill-sync markers in SKILL.md. "
+            "Do not edit manually — update SKILL.md markers instead."
+        ),
+        "skills": [
+            {
+                "workflow": s["workflow"],
+                "host": s.get("host", ""),
+                "breeze": s.get("breeze", ""),
+                "fallback_condition": s.get("fallback", 
s.get("fallback_condition", "never")),
+            }
+            for s in skills
+        ],
+    }
+
+
+def write_skills_json(data: dict, output_path: Path) -> None:
+    """Write skills dict to JSON file with stable formatting."""
+    output_path.parent.mkdir(parents=True, exist_ok=True)
+    output_path.write_text(
+        json.dumps(data, indent=2, ensure_ascii=False) + "\n",
+        encoding="utf-8",
+    )
+
+
+def check_drift(generated: dict, existing_path: Path) -> bool:
+    """
+    Returns True if drift is detected (committed file differs from generated).
+    Returns False if they match.
+    """
+    if not existing_path.exists():
+        print(f"DRIFT: {existing_path} does not exist but should be 
generated.", file=sys.stderr)
+        return True
+
+    committed = json.loads(existing_path.read_text(encoding="utf-8"))
+
+    # Compare only the skills list — ignore metadata fields like description
+    if committed.get("skills") != generated.get("skills"):
+        print("DRIFT DETECTED: committed skills.json does not match SKILL.md 
markers.", file=sys.stderr)
+        print("Run: python3 scripts/ci/prek/extract_agent_skills.py", 
file=sys.stderr)
+        print("Then commit the updated skills.json.", file=sys.stderr)
+        return True

Review Comment:
   `extract_agent_skills.py` currently reads skills from both `SKILL.md` 
markers and `contributing-docs/03_contributors_quick_start.rst`, but the drift 
messaging and metadata still claim `skills.json` is derived only from 
`SKILL.md` markers (e.g., the DRIFT message and `description`/`source`). This 
will be confusing when drift is introduced by the RST inputs; consider updating 
the messages/metadata to reflect all sources (or making SKILL.md the actual 
single source of truth).



##########
scripts/ci/prek/breeze_context_detect.py:
##########
@@ -0,0 +1,160 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+from __future__ import annotations
+
+"""
+Breeze context detector for agent skills.
+
+Zero-dependency (stdlib only). Called by AI tools to determine whether
+they are running on the HOST or inside a Breeze container, then returns
+the correct command for the requested workflow.
+
+Usage:
+    python3 scripts/ci/prek/breeze_context_detect.py
+    python3 scripts/ci/prek/breeze_context_detect.py --workflow run-tests
+"""
+
+import os
+from pathlib import Path
+
+
def is_inside_breeze() -> bool:
    """
    Return True if running inside a Breeze container.

    Detection priority (most reliable first):
    1. AIRFLOW_BREEZE_CONTAINER=true  (explicit env marker)
    2. /.dockerenv exists             (Docker sets in all containers)
    3. /.containerenv exists          (Podman sets in all containers)
    4. /opt/airflow exists            (Breeze canonical mount point)
    """
    # Explicit env marker wins: it is the only signal a caller can force
    # (see the "force-context-override" workflow).
    if os.getenv("AIRFLOW_BREEZE_CONTAINER") == "true":
        return True
    if Path("/.dockerenv").exists():
        return True
    if Path("/.containerenv").exists():
        return True
    # Weakest signal: the canonical Breeze mount point. Could false-positive
    # on a host that happens to have /opt/airflow, hence checked last.
    if Path("/opt/airflow").exists():
        return True
    return False
+
+
def get_context() -> str:
    """Return the current execution context: ``"breeze"`` or ``"host"``."""
    if is_inside_breeze():
        return "breeze"
    return "host"
+
+
# Mapping of workflow name -> per-context command templates.
# Each entry provides:
#   "host":   command to run on the developer's machine
#   "breeze": command to run when already inside the Breeze container
#             (an "ERROR: ..." sentinel string marks workflows that are
#             invalid in that context — get_command() returns it verbatim)
#   "note":   human-readable guidance surfaced alongside the command
# Templates may contain {placeholders} that get_command() fills from **params.
WORKFLOWS: dict[str, dict[str, str]] = {
    "static-checks": {
        "host": "prek",
        "breeze": "prek",
        "note": "Runs identically on host and inside container",
    },
    "run-tests": {
        "host": "uv run --project {distribution_folder} pytest {test_path} -xvs",
        "breeze": "pytest {test_path} -xvs",
        "note": "Local-first. Fall back to breeze exec if missing system deps",
    },
    "enter-breeze": {
        "host": "breeze shell",
        "breeze": "ERROR: already inside Breeze container",
        "note": "Only valid on host",
    },
    "start-airflow": {
        "host": "breeze start-airflow",
        "breeze": "ERROR: already inside Breeze container",
        "note": "Starts full Airflow stack for system verification",
    },
    "exec-command": {
        "host": "breeze exec -- {command}",
        "breeze": "{command}",
        "note": "Run single command in container from host, or directly if already inside",
    },
    "build-docs": {
        "host": "breeze build-docs",
        "breeze": "ERROR: build-docs runs on host only",
        "note": "Doc builds always run on host",
    },
    "git-operations": {
        "host": "git {args}",
        "breeze": "ERROR: git operations must run on host",
        "note": "All git operations run on host only",
    },
    "stage-changes": {
        "host": "git add {file_path}",
        "breeze": "ERROR: git operations must run on host",
        "note": "Stage files for commit — always runs on host before prek",
    },
    "system-verify": {
        "host": "breeze start-airflow --integration {integration}",
        "breeze": "pytest {dag_test_path} -xvs",
        "note": "Full system verification — start Airflow on host, run DAG tests inside Breeze",
    },
    "force-context-override": {
        "host": "AIRFLOW_BREEZE_CONTAINER=true {command}",
        "breeze": "{command}",
        "note": "Override context detection for testing — sets explicit env marker",
    },
}
+
+
def get_command(workflow: str, **params: str) -> dict[str, str]:
    """
    Resolve *workflow* to the concrete command for the current context.

    Any keyword *params* are substituted into the command template via
    ``str.format``. Returns a dict with ``context``, ``workflow``,
    ``command`` and ``note`` keys. Raises ValueError for unknown workflows.
    """
    if workflow not in WORKFLOWS:
        raise ValueError(f"Unknown workflow '{workflow}'. Available: {sorted(WORKFLOWS.keys())}")
    resolved_context = get_context()
    entry = WORKFLOWS[workflow]
    command = entry[resolved_context]
    # Only format when params were supplied, so templates whose placeholders
    # were intentionally left unfilled pass through untouched.
    if params:
        command = command.format(**params)
    return {
        "context": resolved_context,
        "workflow": workflow,
        "command": command,
        "note": entry.get("note", ""),
    }
+
+
+def _main() -> None:
+    import argparse
+

Review Comment:
   There’s an `import argparse` inside `_main()`. In this codebase we generally 
keep imports at module level unless there’s a specific reason (circular import 
avoidance, lazy loading, TYPE_CHECKING). Consider moving it to the top of the 
file for consistency and to satisfy the project’s import rules.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


Reply via email to