shivaam opened a new issue, #64534:
URL: https://github.com/apache/airflow/issues/64534

   **Follow-up from:** #61375 and PR #62561
   
   ## Background
   
   When a user creates a backfill, Airflow performs the creation in two steps:
   
   1. **Step 1:** Create and commit the `Backfill` row to the database (so it 
gets an ID)
   2. **Step 2:** Create all the `DagRun` and `BackfillDagRun` rows linked to 
that backfill
   
   The scheduler runs `_mark_backfills_complete()` every ~30 seconds. This 
method looks for backfills where all associated DagRuns have finished. The 
problem is: if the scheduler runs between Step 1 and Step 2, it sees a backfill 
with zero DagRuns, concludes "all zero runs are done", and marks the backfill 
as complete. When Step 2 then creates the DagRuns, they're orphaned -- the 
backfill is already marked done.
   
   This was reported in #61375 and fixed in PR #62561 with a pragmatic guard: 
require at least one `BackfillDagRun` to exist before marking complete, with a 
2-minute timeout to clean up orphaned backfills that failed during 
initialization.
   
   ## What still needs to be done
   
   The current fix works but relies on a time-based heuristic (2-minute cutoff) 
rather than addressing the root cause. During [PR #62561 
review](https://github.com/apache/airflow/pull/62561#pullrequestreview-4030269060),
 Daniel Standish suggested two more robust approaches as follow-ups. Either one 
(or both combined) would replace the heuristic with a proper solution.
   
   ## Possible approaches
   
   ### Approach 1: Atomic transaction (wrap creation in a single commit)
   
   Make the backfill creation atomic so the `Backfill` row and all its 
`DagRun`/`BackfillDagRun` rows appear in the database at the same time. The 
scheduler would never see a "half-created" backfill.
   
   **How:** Change `session.commit()` to `session.flush()` in 
`_create_backfill()` at [`backfill.py 
L605`](https://github.com/apache/airflow/blob/main/airflow-core/src/airflow/models/backfill.py#L605).
 `flush()` assigns `Backfill.id` (needed as the foreign key on the 
`BackfillDagRun` rows) without committing. The `create_session()` context 
manager commits everything at the end, so all rows appear atomically.
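   The visibility difference between flush-then-commit and an early commit can be demonstrated with plain `sqlite3` standing in for the metadata database (a minimal sketch, not Airflow code): rows inserted but not yet committed already have an id on the writer's side, yet remain invisible to a second connection, which is exactly what the scheduler should see during initialization.

```python
import os
import sqlite3
import tempfile

path = os.path.join(tempfile.mkdtemp(), "demo.db")
writer = sqlite3.connect(path)
writer.execute("CREATE TABLE backfill (id INTEGER PRIMARY KEY)")
writer.commit()

scheduler = sqlite3.connect(path)  # separate connection, like the scheduler

# Insert without committing -- analogous to session.flush():
cur = writer.execute("INSERT INTO backfill DEFAULT VALUES")
backfill_id = cur.lastrowid  # the id is already assigned on the writer side...
visible_before = scheduler.execute(
    "SELECT COUNT(*) FROM backfill").fetchone()[0]  # ...but not visible here

writer.commit()  # single commit at the end of the whole creation
visible_after = scheduler.execute(
    "SELECT COUNT(*) FROM backfill").fetchone()[0]

print(backfill_id, visible_before, visible_after)  # 1 0 1
```

With the early commit (the current code), the scheduler's connection would already see the half-created backfill at the `visible_before` read.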
   
   **Pros:**
   - Eliminates the race window at the root cause
   - Clean rollback on failure (no orphaned Backfill rows)
   - No database migration needed
   
   **Cons / things to solve:**
   - The [`AlreadyRunningBackfill` 
check](https://github.com/apache/airflow/blob/main/airflow-core/src/airflow/models/backfill.py#L585-L589)
 relies on the early commit to prevent concurrent duplicate backfills for the 
same DAG. If the commit is delayed, a concurrent request won't see the 
in-progress backfill and could create a duplicate. This would need 
database-level locking or an advisory lock.
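   On Postgres, one way to solve this is a transaction-scoped advisory lock keyed on the DAG id, taken before the duplicate check. `pg_advisory_xact_lock` takes a signed 64-bit key, so a stable key has to be derived from the string id. A minimal sketch of the key derivation (the helper name is hypothetical, not existing Airflow code):

```python
import hashlib


def backfill_lock_key(dag_id: str) -> int:
    """Derive a stable signed 64-bit key for pg_advisory_xact_lock
    from the DAG id (hypothetical helper, not in Airflow)."""
    digest = hashlib.sha256(dag_id.encode()).digest()
    # Interpret the first 8 bytes as a signed 64-bit integer,
    # matching PostgreSQL's bigint range.
    return int.from_bytes(digest[:8], "big", signed=True)


key = backfill_lock_key("example_dag")
# The lock would then be taken inside the creation transaction, e.g.:
#   SELECT pg_advisory_xact_lock(:key)
# It is released automatically when the transaction commits or rolls back,
# so two concurrent requests for the same DAG serialize on the key.
print(-(2**63) <= key < 2**63)  # True -- always within bigint range
```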
   
   (Originally suggested by both 
[Daniel](https://github.com/apache/airflow/pull/62561#pullrequestreview-4030269060)
 and 
[Kaxil](https://github.com/apache/airflow/pull/62561#discussion_r2880363898))
   
   ### Approach 2: Add a state field or datetime to the Backfill model
   
   Add an explicit signal to distinguish "still initializing" from "running" 
backfills, rather than inferring it from the presence of `BackfillDagRun` rows 
+ a time window.
   
   **Option A -- Datetime field (e.g. `started_at` or `initialized_at`):** 
Nullable, set to `NULL` on creation, set to `utcnow()` after all DagRuns are 
created. The scheduler filters on `Backfill.started_at.is_not(None)`.
   
   **Option B -- State enum (e.g. `QUEUED` / `RUNNING` / `COMPLETED` / 
`FAILED`):** More expressive. Could eventually replace the `completed_at`-based 
completion check and provide a proper lifecycle model for backfills.
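   Option B could look roughly like this (a hypothetical sketch; the enum name, members, and helper are assumptions, not Airflow code). The scheduler's completion check would then filter on state instead of the presence of `BackfillDagRun` rows:

```python
import enum


class BackfillState(str, enum.Enum):
    """Hypothetical lifecycle states for Option B."""
    QUEUED = "queued"        # Backfill row committed, DagRuns not yet created
    RUNNING = "running"      # all DagRuns created, backfill in progress
    COMPLETED = "completed"
    FAILED = "failed"


def eligible_for_completion(state: BackfillState) -> bool:
    """Only fully initialized backfills are candidates for completion."""
    return state is BackfillState.RUNNING


print(eligible_for_completion(BackfillState.QUEUED))   # False
print(eligible_for_completion(BackfillState.RUNNING))  # True
```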
   
   **Pros:**
   - Explicit signal instead of a time-based heuristic
   - Option B improves observability and lifecycle management
   
   **Cons:**
   - Requires a database migration
   - Doesn't fix the root cause (non-atomic creation), only makes 
initialization state explicit
   
   (Suggested by 
[Daniel](https://github.com/apache/airflow/pull/62561#pullrequestreview-4030269060))
   
   ### These approaches can be combined
   
   Approach 1 fixes the root cause (race window). Approach 2 adds observability 
(explicit state). They're not mutually exclusive, but either one alone is an 
improvement over the current heuristic.
   
   ## Code locations
   
   - 
[`airflow-core/src/airflow/models/backfill.py`](https://github.com/apache/airflow/blob/main/airflow-core/src/airflow/models/backfill.py)
 -- `Backfill` model and `_create_backfill()` method
   - 
[`airflow-core/src/airflow/jobs/scheduler_job_runner.py`](https://github.com/apache/airflow/blob/main/airflow-core/src/airflow/jobs/scheduler_job_runner.py)
 -- `_mark_backfills_complete()` (has the guard from PR #62561)
   - `airflow-core/src/airflow/migrations/` -- migration needed for Approach 2

