shivaam opened a new issue, #64534: URL: https://github.com/apache/airflow/issues/64534
**Follow-up from:** #61375 and PR #62561 ## Background When a user creates a backfill, Airflow does it in two steps: 1. **Step 1:** Create and commit the `Backfill` row to the database (so it gets an ID) 2. **Step 2:** Create all the `DagRun` and `BackfillDagRun` rows linked to that backfill The scheduler runs `_mark_backfills_complete()` every ~30 seconds. This method looks for backfills where all associated DagRuns have finished. The problem is: if the scheduler runs between Step 1 and Step 2, it sees a backfill with zero DagRuns, concludes "all zero runs are done", and marks the backfill as complete. When Step 2 then creates the DagRuns, they're orphaned -- the backfill is already marked done. This was reported in #61375 and fixed in PR #62561 with a pragmatic guard: require at least one `BackfillDagRun` to exist before marking complete, with a 2-minute timeout to clean up orphaned backfills that failed during initialization. ## What still needs to be done The current fix works but relies on a time-based heuristic (2-minute cutoff) rather than addressing the root cause. During [PR #62561 review](https://github.com/apache/airflow/pull/62561#pullrequestreview-4030269060), Daniel Standish suggested two more robust approaches as follow-ups. Either one (or both combined) would replace the heuristic with a proper solution. ## Possible approaches ### Approach 1: Atomic transaction (wrap creation in a single commit) Make the backfill creation atomic so the `Backfill` row and all its `DagRun`/`BackfillDagRun` rows appear in the database at the same time. The scheduler would never see a "half-created" backfill. **How:** Change `session.commit()` to `session.flush()` in `_create_backfill()` at [`backfill.py L605`](https://github.com/apache/airflow/blob/main/airflow-core/src/airflow/models/backfill.py#L605). `flush()` assigns the `Backfill.id` (needed as FK) without committing. The `create_session()` context manager commits everything at the end, so all rows appear atomically. **Pros:** - Eliminates the race window at the root cause - Clean rollback on failure (no orphaned Backfill rows) - No database migration needed **Cons / things to solve:** - The [`AlreadyRunningBackfill` check](https://github.com/apache/airflow/blob/main/airflow-core/src/airflow/models/backfill.py#L585-L589) relies on the early commit to prevent concurrent duplicate backfills for the same DAG. If the commit is delayed, a concurrent request won't see the in-progress backfill and could create a duplicate. This would need database-level locking or an advisory lock. (Originally suggested by both [Daniel](https://github.com/apache/airflow/pull/62561#pullrequestreview-4030269060) and [Kaxil](https://github.com/apache/airflow/pull/62561#discussion_r2880363898)) ### Approach 2: Add a state field or datetime to the Backfill model Add an explicit signal to distinguish "still initializing" from "running" backfills, rather than inferring it from the presence of `BackfillDagRun` rows + a time window. **Option A -- Datetime field (e.g. `started_at` or `initialized_at`):** Nullable, set to `NULL` on creation, set to `utcnow()` after all DagRuns are created. The scheduler filters on `Backfill.started_at.is_not(None)`. **Option B -- State enum (e.g. `QUEUED` / `RUNNING` / `COMPLETED` / `FAILED`):** More expressive. Could eventually replace the `completed_at`-based completion check and provide a proper lifecycle model for backfills. **Pros:** - Explicit signal instead of a time-based heuristic - Option B improves observability and lifecycle management **Cons:** - Requires a database migration - Doesn't fix the root cause (non-atomic creation), only makes initialization state explicit (Suggested by [Daniel](https://github.com/apache/airflow/pull/62561#pullrequestreview-4030269060)) ### These approaches can be combined Approach 1 fixes the root cause (race window). Approach 2 adds observability (explicit state). They're not mutually exclusive, but either one alone is an improvement over the current heuristic. ## Code locations - [`airflow-core/src/airflow/models/backfill.py`](https://github.com/apache/airflow/blob/main/airflow-core/src/airflow/models/backfill.py) -- `Backfill` model and `_create_backfill()` method - [`airflow-core/src/airflow/jobs/scheduler_job_runner.py`](https://github.com/apache/airflow/blob/main/airflow-core/src/airflow/jobs/scheduler_job_runner.py) -- `_mark_backfills_complete()` (has the guard from PR #62561) - `airflow-core/src/airflow/migrations/` -- migration needed for Approach 2 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
