
fix: warn and document scheduler --external-host default in cluster deploys#1589

Draft
andygrove wants to merge 1 commit intoapache:mainfrom
andygrove:fix/scheduler-external-host-default


@andygrove
Member

Which issue does this PR close?

Closes #1587.

Rationale for this change

When the scheduler launches tasks, it embeds its own advertised address in LaunchTaskParams.scheduler_id, and executors use that string to dial back for task status and heartbeats (executor_server.rs::get_scheduler_client literally does format!("http://{scheduler_id}")). The advertised address is built from SchedulerConfig::scheduler_name() = format!("{external_host}:{bind_port}"), and external_host defaults to "localhost" with no env-var binding.

The Kubernetes deployment example in the user guide and the in-tree docker-compose.yml both omit --external-host, so any cluster deployed from those templates inherits the default. The executor's outgoing connection works because it is configured separately on the executor side via --scheduler-host, but every status report and heartbeat tries to dial localhost:50050 inside the executor's own pod and fails with "Fail to connect to scheduler localhost:50050". This failure mode is currently undocumented.
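The address flow described above can be sketched in a few lines. This is an illustration only, not the Ballista source: SchedulerAddr and callback_url are hypothetical names, and the default port 50050 is taken from the error message quoted above. Only the two format! patterns come from the description.

```rust
/// Hypothetical stand-in for the two scheduler settings involved.
/// This is NOT the real SchedulerConfig, only a sketch of the fields that matter here.
struct SchedulerAddr {
    external_host: String,
    bind_port: u16,
}

impl Default for SchedulerAddr {
    fn default() -> Self {
        // Assumed defaults: "localhost" per the PR description,
        // 50050 per the quoted error message.
        Self {
            external_host: "localhost".to_string(),
            bind_port: 50050,
        }
    }
}

impl SchedulerAddr {
    /// Mirrors format!("{external_host}:{bind_port}") from the description.
    fn scheduler_name(&self) -> String {
        format!("{}:{}", self.external_host, self.bind_port)
    }
}

/// Mirrors the executor-side format!("http://{scheduler_id}").
fn callback_url(scheduler_id: &str) -> String {
    format!("http://{scheduler_id}")
}
```

With the defaults, callback_url(&SchedulerAddr::default().scheduler_name()) yields http://localhost:50050, which an executor resolves inside its own pod or container. Setting external_host to a name the executors can resolve, such as the ballista-scheduler service name used in this PR, yields http://ballista-scheduler:50050 instead.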

What changes are included in this PR?

  • Kubernetes deployment guide: scheduler args now include --external-host=ballista-scheduler, with a paragraph explaining why and what the symptom looks like if it is omitted.
  • docker-compose.yml: scheduler command: now sets --external-host ballista-scheduler to match the Compose service name.
  • Docker Compose deployment guide: short note added describing why the scheduler advertises its Compose service name to executors.
  • ballista/scheduler/src/scheduler_process.rs: emit a WARN at scheduler startup when bind_host is non-loopback but external_host is still the default "localhost". This fires only on misconfigured cluster deploys (single-machine runs with the all-defaults bind_host=127.0.0.1 are unaffected) and gives operators a clear diagnostic instead of waiting for the first task-status callback to fail.
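The startup check in the last item could look roughly like the following. This is a minimal sketch under stated assumptions, not the code in scheduler_process.rs: should_warn_external_host is a hypothetical helper name, and treating an unparseable bind_host as loopback only when it is the literal name "localhost" is an assumption of this sketch.

```rust
use std::net::IpAddr;

/// Hypothetical helper: returns true when the scheduler binds to a
/// non-loopback address but still advertises the default "localhost".
fn should_warn_external_host(bind_host: &str, external_host: &str) -> bool {
    let bind_is_loopback = match bind_host.parse::<IpAddr>() {
        // Covers 127.0.0.0/8 and ::1.
        Ok(ip) => ip.is_loopback(),
        // Not an IP literal: only the name "localhost" counts as loopback here
        // (an assumption of this sketch).
        Err(_) => bind_host.eq_ignore_ascii_case("localhost"),
    };
    !bind_is_loopback && external_host == "localhost"
}
```

Under this sketch, bind_host=0.0.0.0 with the default external_host triggers the warning, while the all-defaults single-machine case (bind_host=127.0.0.1) and any deployment that sets --external-host explicitly do not.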

Are there any user-facing changes?

No API changes. Operators of misconfigured clusters will see a new WARN log at scheduler startup; deployments using the updated example manifests will work without status-callback failures.

…eploys

The scheduler embeds external_host:bind_port in LaunchTaskParams.scheduler_id
sent to executors, and executors use that string to dial back for task status
and heartbeats. external_host defaults to "localhost", so any deployment that
copies the k8s example or docker-compose template silently ends up telling
executors to report to localhost:50050 inside their own pod.

- Update the Kubernetes deployment example to pass --external-host=ballista-scheduler
  on the scheduler and explain why it is required.
- Update docker-compose.yml and the docker-compose deployment guide the same way.
- Emit a WARN at scheduler startup when bind_host is non-loopback but
  external_host is still "localhost", so misconfigured deployments surface a
  clear diagnostic instead of failing only at task-status report time.

Closes apache#1587
@github-actions github-actions Bot added the documentation Improvements or additions to documentation label Apr 26, 2026
@andygrove andygrove marked this pull request as draft April 26, 2026 22:40

Development

Successfully merging this pull request may close these issues.

Executors try to connect with local scheduler instead of remote scheduler
