andygrove opened a new pull request, #1589:
URL: https://github.com/apache/datafusion-ballista/pull/1589

   # Which issue does this PR close?
   
   Closes #1587.
   
   # Rationale for this change
   
   When the scheduler launches tasks it embeds its own advertised address in 
`LaunchTaskParams.scheduler_id`, and executors use that string to dial back for 
task status and heartbeats (`executor_server.rs::get_scheduler_client` 
literally does `format!(\"http://{scheduler_id}\";)`). The advertised address is 
built from `SchedulerConfig::scheduler_name()` = 
`format!(\"{external_host}:{bind_port}\")`, and `external_host` defaults to 
`\"localhost\"` with no env-var binding.
   
   The Kubernetes deployment example in the user guide and the in-tree 
`docker-compose.yml` both omit `--external-host`, so any cluster deployed from 
those templates inherits the default. The executor's outgoing connection works 
because it is configured separately on the executor side via 
`--scheduler-host`, but every status report and heartbeat tries to dial 
`localhost:50050` inside the executor's own pod and fails with `Fail to connect 
to scheduler localhost:50050`. The failure mode is undocumented today.
   
   # What changes are included in this PR?
   
   - **Kubernetes deployment guide**: scheduler args now include 
`--external-host=ballista-scheduler`, with a paragraph explaining why and what 
the symptom looks like if it is omitted.
   - **`docker-compose.yml`**: scheduler `command:` now sets `--external-host 
ballista-scheduler` to match the Compose service name.
   - **Docker Compose deployment guide**: short note added describing why the 
scheduler advertises its Compose service name to executors.
   - **`ballista/scheduler/src/scheduler_process.rs`**: emit a `WARN` at 
scheduler startup when `bind_host` is non-loopback but `external_host` is still 
the default `\"localhost\"`. This fires only on misconfigured cluster deploys 
(single-machine runs with the all-defaults `bind_host=127.0.0.1` are 
unaffected) and gives operators a clear diagnostic instead of waiting for the 
first task-status callback to fail.
   
   # Are there any user-facing changes?
   
   No API changes. Operators of misconfigured clusters will see a new `WARN` 
log at scheduler startup; deployments using the updated example manifests will 
work without status-callback failures.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to