andygrove opened a new pull request, #1589: URL: https://github.com/apache/datafusion-ballista/pull/1589
# Which issue does this PR close?

Closes #1587.

# Rationale for this change

When the scheduler launches tasks it embeds its own advertised address in `LaunchTaskParams.scheduler_id`, and executors use that string to dial back for task status and heartbeats (`executor_server.rs::get_scheduler_client` literally does `format!("http://{scheduler_id}")`). The advertised address is built from `SchedulerConfig::scheduler_name()` = `format!("{external_host}:{bind_port}")`, and `external_host` defaults to `"localhost"` with no env-var binding.

The Kubernetes deployment example in the user guide and the in-tree `docker-compose.yml` both omit `--external-host`, so any cluster deployed from those templates inherits the default. The executor's outgoing connection works because it is configured separately on the executor side via `--scheduler-host`, but every status report and heartbeat tries to dial `localhost:50050` inside the executor's own pod and fails with `Fail to connect to scheduler localhost:50050`. The failure mode is undocumented today.

# What changes are included in this PR?

- **Kubernetes deployment guide**: scheduler args now include `--external-host=ballista-scheduler`, with a paragraph explaining why and what the symptom looks like if it is omitted.
- **`docker-compose.yml`**: scheduler `command:` now sets `--external-host ballista-scheduler` to match the Compose service name.
- **Docker Compose deployment guide**: short note added describing why the scheduler advertises its Compose service name to executors.
- **`ballista/scheduler/src/scheduler_process.rs`**: emit a `WARN` at scheduler startup when `bind_host` is non-loopback but `external_host` is still the default `"localhost"`. This fires only on misconfigured cluster deploys (single-machine runs with the all-defaults `bind_host=127.0.0.1` are unaffected) and gives operators a clear diagnostic instead of waiting for the first task-status callback to fail.

# Are there any user-facing changes?
No API changes. Operators of misconfigured clusters will see a new `WARN` log at scheduler startup; deployments using the updated example manifests will work without status-callback failures.
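
For readers who want to see the shape of the startup check described under "What changes are included in this PR?", here is a minimal, self-contained sketch. It is not the code from `scheduler_process.rs`: the `SchedulerAddr` struct, the `is_loopback` helper, the method `advertises_unreachable_default`, and the warning text are all hypothetical names chosen for illustration. Only the `format!("{external_host}:{bind_port}")` shape, the `"localhost"` default, and the "non-loopback bind host" condition come from the description above.

```rust
use std::net::IpAddr;

/// Hypothetical stand-in for the scheduler settings discussed in this PR.
struct SchedulerAddr {
    bind_host: String,
    bind_port: u16,
    external_host: String,
}

impl SchedulerAddr {
    /// Same shape as the advertised address described in the PR:
    /// `format!("{external_host}:{bind_port}")`.
    fn scheduler_name(&self) -> String {
        format!("{}:{}", self.external_host, self.bind_port)
    }

    /// True when the scheduler listens on a non-loopback address but would
    /// still advertise the default "localhost" to executors.
    fn advertises_unreachable_default(&self) -> bool {
        self.external_host == "localhost" && !is_loopback(&self.bind_host)
    }
}

/// Assumption: a bind host counts as loopback if it is the literal name
/// "localhost" or parses as a loopback IP (127.0.0.0/8 or ::1).
fn is_loopback(host: &str) -> bool {
    if host.eq_ignore_ascii_case("localhost") {
        return true;
    }
    host.parse::<IpAddr>()
        .map(|ip| ip.is_loopback())
        .unwrap_or(false)
}

fn main() {
    // Cluster-style deploy that forgot --external-host.
    let cluster = SchedulerAddr {
        bind_host: "0.0.0.0".to_string(),
        bind_port: 50050,
        external_host: "localhost".to_string(),
    };

    if cluster.advertises_unreachable_default() {
        // Illustrative message only; the real log text may differ.
        eprintln!(
            "WARN: scheduler binds to {} but advertises {}; executors on other hosts cannot dial back",
            cluster.bind_host,
            cluster.scheduler_name()
        );
    }
}
```

With the all-defaults single-machine setup (`bind_host=127.0.0.1`), the predicate returns `false` and nothing is logged, which matches the "unaffected" case described above. Setting `external_host` to a resolvable name such as `ballista-scheduler` also silences it.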
