
[services] Update pod priorities and give grafana a realistic memory request #15357

Open
cjllanwarne wants to merge 2 commits into hail-is:main from cjllanwarne:cjl_batch_driver_resilience

Conversation

@cjllanwarne cjllanwarne commented Mar 24, 2026

Change Description

Fixes #15339

The batch-driver pod has been evicted multiple times from the node it was running on, leaving a trail of Error and ContainerStatusUnknown pods in the cluster.

A few issues together seem likely to have caused this:

  • batch-driver was being colocated on nodes with 3 grafana pods (including ones from test namespaces) and prometheus-0, all of which are marked as infrastructure priority (above batch-driver, which is "only" production).
  • both grafana and prometheus were underestimating their memory requirements (low requests, high limits). Note that k8s bin-packs nodes based on requests, which means nodes can easily become over-provisioned.
  • therefore: prometheus accumulates a large enough WAL, expands its memory usage to replay it, and batch-driver gets evicted.
  • In this case, the WAL was too large for prometheus to replay even with all of the node's memory, so my hypothesis is that prometheus itself got OOM-killed, got restarted on the node batch-driver had recovered to, and the cycle continued.
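The over-provisioning mechanic above can be sketched with a hypothetical container resource spec (the values are illustrative, not the actual grafana or prometheus settings):

```yaml
# Hypothetical container resources: the scheduler bin-packs pods onto
# nodes using `requests`, but each container may grow up to `limits` at
# runtime. A wide request/limit gap lets many such pods land on one node
# and later contend for memory, evicting lower-priority pods.
resources:
  requests:
    memory: "200Mi"   # what the scheduler reserves on the node
  limits:
    memory: "4Gi"     # what the container may actually consume
```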

Therefore this change:

  • Makes batch-driver higher priority than prometheus or grafana
  • Gives grafana a more realistic memory request
  • Bumps prometheus to 2.53.5 (LTS) which includes some attempts to make WAL processing more efficient
  • Moves grafana down to preemptible pods for testing
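A sketch of what the grafana deployment fragment could look like after this change, assuming the new auxiliary PriorityClass from the warning below; the memory figures are assumptions for illustration, not the exact values in the PR:

```yaml
# Hypothetical fragment of grafana/deployment.yaml: place grafana below
# batch-driver's priority and request roughly what it really uses, with
# a limit close to the request so it cannot balloon far past it.
spec:
  priorityClassName: auxiliary
  containers:
    - name: grafana
      resources:
        requests:
          memory: "1Gi"    # realistic steady-state usage (assumed)
        limits:
          memory: "2Gi"    # tightened limit (assumed)
```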

To think about if prometheus starts hitting WAL problems more often:

  • infra/gcp-broad projects are still using n1-standard-2 VMs (unlike gcp/ projects, which are on n1-standard-4). That's slightly cheaper, but each node only gets a maximum of about 5.7GB of allocatable memory to play with, and in this case that wasn't enough for prometheus to catch up with its WAL. Switching to n1-standard-4 would give a higher cap of roughly 10GB to expand into.

Warning

This change requires a manual addition to the PriorityClasses set in Kubernetes before rollout.
The following PriorityClass must be applied manually (alternatively, ci/bootstrap.yaml can be re-applied, but verify with --dry-run=server first):

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: auxiliary
value: 700000
globalDefault: false
description: "For auxiliary services like CI, monitoring, grafana, prometheus."

Security Assessment


  • This change potentially impacts the Hail Batch instance as deployed by Broad Institute in GCP

Impact Rating


  • This change has a low security impact

Impact Description

Just internal pod re-prioritization, plus a prometheus version update.

Appsec Review

  • Required: The impact has been assessed and approved by appsec

Copilot AI left a comment
Pull request overview

Adjusts Kubernetes scheduling priorities and resource sizing for cluster monitoring/CI components to reduce batch-driver evictions and improve node packing behavior based on resource requests.

Changes:

  • Introduces a new auxiliary PriorityClass and assigns it to Prometheus, Grafana, Monitoring, and CI (deploy only).
  • Updates Grafana pod scheduling so test/non-deploy runs on preemptible nodes, and increases Grafana memory requests while tightening limits.
  • Bumps Prometheus to v2.53.5 pinned by digest.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated no comments.

Show a summary per file
File Description
prometheus/prometheus.yaml Moves Prometheus to auxiliary priority (deploy) and bumps Prometheus image to 2.53.5 (digest-pinned).
monitoring/deployment.yaml Assigns Monitoring deployment to auxiliary priority.
grafana/deployment.yaml Moves Grafana to auxiliary, makes deploy/non-deploy node placement explicit, and updates resource requests/limits.
ci/deployment.yaml Changes CI priority class to auxiliary for deploy environments.
ci/bootstrap.yaml Adds the new cluster-scoped auxiliary PriorityClass definition.




Development

Successfully merging this pull request may close these issues.

Batch driver restarts

3 participants