skymensch opened a new pull request, #64743: URL: https://github.com/apache/airflow/pull/64743
When a task pushes a large XCom payload (e.g. 300 MB), the supervisor's single-threaded event loop blocks on the synchronous HTTP POST to the API server. During that time no heartbeats can be sent, and the scheduler eventually marks the task instance as failed with a heartbeat timeout, even though the task itself is still running successfully.

**Root cause:** `ActivitySubprocess._handle_request()` calls `self.client.xcoms.set()` synchronously. Because the supervisor uses a `selectors`-based event loop, any blocking call inside a handler stalls the entire loop, including `_send_heartbeat_if_needed()`.

**Fix:** Offload the `SetXCom` API call to a single-worker `ThreadPoolExecutor`. The handler submits the future and returns immediately, so the event loop keeps ticking and heartbeats continue uninterrupted. A new `_drain_pending_requests()` helper is called on every loop iteration; it inspects completed futures and sends the appropriate response (or error) back to the task process.

- `max_workers=1` preserves the ordering of concurrent XCom writes from the same task.
- `httpx.Client` is thread-safe, so sharing the existing client with the worker thread is safe.
- On process cleanup, `shutdown(wait=False)` discards any in-flight upload because the task process is already gone.

closes: #64628

---

##### Was generative AI tooling used to co-author this PR?

- [X] Yes — Claude Code (claude-sonnet-4-6)

Generated-by: Claude Code (claude-sonnet-4-6) following [the guidelines](https://github.com/apache/airflow/blob/main/contributing-docs/05_pull_requests.rst#gen-ai-assisted-contributions)
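The submit-and-drain pattern described above can be sketched in isolation. This is a minimal illustration, not the actual Airflow implementation: the class name `SupervisorSketch`, the `(future, request_id)` bookkeeping, and the `_blocking_upload` stand-in are all hypothetical, chosen only to show how a single-worker executor keeps the event loop responsive while preserving write order.

```python
from concurrent.futures import ThreadPoolExecutor
import time


class SupervisorSketch:
    """Hypothetical sketch of the offload pattern from this PR."""

    def __init__(self):
        # max_workers=1 keeps concurrent XCom writes from one task ordered.
        self._executor = ThreadPoolExecutor(max_workers=1)
        self._pending = []  # (future, request_id) pairs awaiting completion

    def handle_set_xcom(self, request_id, payload):
        # Submit the blocking upload and return immediately, so the event
        # loop (and thus heartbeats) are never stalled by a large payload.
        fut = self._executor.submit(self._blocking_upload, payload)
        self._pending.append((fut, request_id))

    def _blocking_upload(self, payload):
        time.sleep(0.05)  # stand-in for the synchronous HTTP POST
        return len(payload)

    def drain_pending_requests(self):
        # Called on every loop iteration: collect finished futures and
        # build the response (or error) to send back to the task process.
        done = [p for p in self._pending if p[0].done()]
        self._pending = [p for p in self._pending if not p[0].done()]
        results = []
        for fut, request_id in done:
            try:
                results.append((request_id, "ok", fut.result()))
            except Exception as exc:
                results.append((request_id, "error", exc))
        return results

    def shutdown(self):
        # wait=False discards any in-flight upload; by cleanup time the
        # task process is already gone, so nobody needs the response.
        self._executor.shutdown(wait=False)
```

A caller drives it like the supervisor's loop would: submit from the request handler, then poll `drain_pending_requests()` on each tick alongside the heartbeat check.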
