andygrove opened a new issue, #1908:
URL: https://github.com/apache/datafusion-ballista/issues/1908
## Describe the bug
When an executor fails to **launch/decode** a task (a non-transient,
deterministic error such as a plan deserialization failure), the scheduler
treats it the same as losing the executor: it removes the executor, rolls back
its stages, and — with a single executor — then reports "There are no alive
executors to bind tasks" and the job hangs forever instead of failing.
A task that cannot be decoded will fail the same way on every executor, so
this is a query-level error, not an executor-health problem. Marking the
executor dead both hides the real error and (with few executors) wedges the
whole job.
## Observed sequence
Submitting TPC-H q15 (an uncorrelated scalar subquery) on DataFusion 54 with
1 scheduler + 1 executor:
```
ERROR ballista_scheduler::state: Failed to launch new task: Internal
Ballista error:
Failed to connect to executor <id>: Status { code: InvalidArgument,
message:
"DataFusion error: Internal error: ScalarSubqueryExpr can only be
deserialized as part
of a surrounding ScalarSubqueryExec. ..." }
INFO ballista_scheduler::state::executor_manager: Removing executor <id>:
Some("Failed to launch new task: ...")
WARN ballista_scheduler::state::execution_graph: Reset N tasks ... on lost
Executor <id>
WARN ballista_scheduler::state::executor_manager: There are no alive
executors to bind tasks
DEBUG ballista_scheduler::state: No schedulable tasks found to be launched
```
The client then blocks indefinitely (both client and scheduler idle at ~0%
CPU). The underlying `InvalidArgument` error is never surfaced to the client.
(The specific decode error above is a separate DataFusion-54 issue — filed
separately — but it is a good example: any deterministic per-task decode/launch
failure produces this hang.)
## Expected behavior
A deterministic task launch/decode failure (e.g. gRPC `InvalidArgument`, or
an error that recurs across attempts) should **fail the query** and propagate
the error to the client, rather than marking the executor dead. Executor
removal should be reserved for genuine connectivity/liveness failures
(transport errors, timeouts, heartbeat loss).
## Suggested direction
- Distinguish "executor unreachable / lost" from "executor rejected this
task" at the task-launch site (`ballista/scheduler/src/state/` task launching).
`InvalidArgument` (and similar deterministic statuses) should mark the
**task/stage/job** failed, not the executor.
- Respect the task retry limit for launch failures so a
deterministically-undecodable task fails the job after its attempts instead of
looping or wedging.
## Impact
With a single executor this turns any undecodable task into an indefinite
hang with no error reported. In CI this presents as a timeout rather than a
clear failure, and one bad query can take down the whole suite.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]