fix: warn and document scheduler --external-host default in cluster deploys #1589
Draft
andygrove wants to merge 1 commit into apache:main
…eploys
The scheduler embeds external_host:bind_port in LaunchTaskParams.scheduler_id sent to executors, and executors use that string to dial back for task status and heartbeats. external_host defaults to "localhost", so any deployment that copies the k8s example or docker-compose template silently ends up telling executors to report to localhost:50050 inside their own pod.
- Update the Kubernetes deployment example to pass --external-host=ballista-scheduler on the scheduler and explain why it is required.
- Update docker-compose.yml and the docker-compose deployment guide the same way.
- Emit a WARN at scheduler startup when bind_host is non-loopback but external_host is still "localhost", so misconfigured deployments surface a clear diagnostic instead of failing only at task-status report time.
Closes apache#1587
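The address flow described in the commit message can be sketched roughly as follows. This is a minimal illustration, not the Ballista source: the struct below is a hypothetical stand-in that carries only the two fields discussed here, and `callback_url` is an illustrative helper name. The two format strings are the ones quoted in this PR.

```rust
/// Hypothetical stand-in for the scheduler configuration.
/// Only the two fields discussed in this PR are modelled.
struct SchedulerConfig {
    external_host: String,
    bind_port: u16,
}

impl SchedulerConfig {
    /// Builds the advertised address as "{external_host}:{bind_port}".
    fn scheduler_name(&self) -> String {
        format!("{}:{}", self.external_host, self.bind_port)
    }
}

/// Illustrative helper: the URL an executor would dial for status
/// reports and heartbeats, given the scheduler_id it received.
fn callback_url(scheduler_id: &str) -> String {
    format!("http://{scheduler_id}")
}
```

With `external_host` left at `"localhost"` and port 50050, `callback_url(&cfg.scheduler_name())` resolves inside the executor's own pod or container. Setting `external_host` to a name the executors can resolve, such as the `ballista-scheduler` service name used in the updated examples, gives them a reachable address instead.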
Which issue does this PR close?
Closes #1587.
Rationale for this change
When the scheduler launches tasks it embeds its own advertised address in `LaunchTaskParams.scheduler_id`, and executors use that string to dial back for task status and heartbeats (`executor_server.rs::get_scheduler_client` literally does `format!("http://{scheduler_id}")`). The advertised address is built from `SchedulerConfig::scheduler_name()` = `format!("{external_host}:{bind_port}")`, and `external_host` defaults to `"localhost"` with no env-var binding.
The Kubernetes deployment example in the user guide and the in-tree `docker-compose.yml` both omit `--external-host`, so any cluster deployed from those templates inherits the default. The executor's outgoing connection works because it is configured separately on the executor side via `--scheduler-host`, but every status report and heartbeat tries to dial `localhost:50050` inside the executor's own pod and fails with `Fail to connect to scheduler localhost:50050`. The failure mode is undocumented today.
What changes are included in this PR?
- Kubernetes deployment example in the user guide: the scheduler now passes `--external-host=ballista-scheduler`, with a paragraph explaining why and what the symptom looks like if it is omitted.
- `docker-compose.yml`: scheduler `command:` now sets `--external-host ballista-scheduler` to match the Compose service name.
- `ballista/scheduler/src/scheduler_process.rs`: emit a `WARN` at scheduler startup when `bind_host` is non-loopback but `external_host` is still the default `"localhost"`. This fires only on misconfigured cluster deploys (single-machine runs with the all-defaults `bind_host=127.0.0.1` are unaffected) and gives operators a clear diagnostic instead of waiting for the first task-status callback to fail.
Are there any user-facing changes?
No API changes. Operators of misconfigured clusters will see a new `WARN` log at scheduler startup; deployments using the updated example manifests will work without status-callback failures.
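The condition behind the new startup warning can be approximated as below. This is a sketch under stated assumptions: the actual check in `scheduler_process.rs` is not shown in this conversation, and the function names `is_loopback` and `should_warn` are hypothetical. It treats the literal `localhost` and any loopback IP as loopback.

```rust
use std::net::IpAddr;

/// Hypothetical helper: true when `host` is the literal "localhost"
/// or parses as a loopback IP address (for example 127.0.0.1 or ::1).
fn is_loopback(host: &str) -> bool {
    if host.eq_ignore_ascii_case("localhost") {
        return true;
    }
    host.parse::<IpAddr>()
        .map(|ip| ip.is_loopback())
        .unwrap_or(false)
}

/// Hypothetical predicate: warn when the scheduler binds to a
/// non-loopback address but still advertises the default "localhost".
fn should_warn(bind_host: &str, external_host: &str) -> bool {
    !is_loopback(bind_host) && external_host == "localhost"
}
```

Under these assumptions, a scheduler bound to `0.0.0.0` with the default `external_host` would trigger the warning, while an all-defaults single-machine run on `127.0.0.1` or a deployment that sets `--external-host=ballista-scheduler` would not.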