
[services] Update pod priorities and give grafana a realistic memory request #15357

Open
cjllanwarne wants to merge 2 commits into hail-is:main from cjllanwarne:cjl_batch_driver_resilience

Conversation

@cjllanwarne cjllanwarne commented Mar 24, 2026

Change Description

Fixes #15339

The batch-driver pod has been evicted multiple times from the node it was running on, leaving a trail of Error and ContainerStatusUnknown pods in the cluster.

A few issues together seem likely to have caused this:

  • batch-driver was being colocated on nodes with 3 grafana pods (including ones from test namespaces) and prometheus-0, all of which are marked as infrastructure priority (above batch-driver, which is "only" production).
  • both grafana and prometheus were underestimating their memory requirements (low requests, high limits). Note that k8s bin-packs nodes based on requests, which means nodes can easily become over-provisioned.
  • therefore: prometheus accumulates a large enough WAL, expands its memory usage to replay it, and batch-driver gets evicted.
  • In this case, the WAL was too large for prometheus to replay even with all of the node's memory, so my hypothesis is that prometheus itself got OOM-killed, got restarted on the node batch-driver had recovered to, and the cycle continued.
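The over-provisioning mechanic above can be sketched with a hypothetical container resource spec (the values are illustrative, not the actual grafana or prometheus settings):

```yaml
# Hypothetical container resources: the scheduler bin-packs pods onto
# nodes using `requests`, but each container may grow up to `limits` at
# runtime. A wide request/limit gap lets many such pods land on one node
# and later contend for memory, evicting lower-priority pods.
resources:
  requests:
    memory: "200Mi"   # what the scheduler reserves on the node
  limits:
    memory: "4Gi"     # what the container may actually consume
```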

Therefore this change:

  • Makes batch-driver higher priority than prometheus or grafana
  • Gives grafana a more realistic memory request
  • Bumps prometheus to 2.53.5 (LTS) which includes some attempts to make WAL processing more efficient
  • Moves grafana down to preemptible pods for testing
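A sketch of what the grafana deployment fragment could look like after this change, assuming the new auxiliary PriorityClass from the warning below; the memory figures are assumptions for illustration, not the exact values in the PR:

```yaml
# Hypothetical fragment of grafana/deployment.yaml: place grafana below
# batch-driver's priority and request roughly what it really uses, with
# a limit close to the request so it cannot balloon far past it.
spec:
  priorityClassName: auxiliary
  containers:
    - name: grafana
      resources:
        requests:
          memory: "1Gi"    # realistic steady-state usage (assumed)
        limits:
          memory: "2Gi"    # tightened limit (assumed)
```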

To think about if prometheus starts hitting WAL problems more often:

  • infra/gcp-broad projects are still using n1-standard-2 VMs (unlike gcp/ projects, which are on n1-standard-4). That's slightly cheaper, but each node only gets a maximum of about 5.7GB of allocatable memory to play with, and in this case that wasn't enough for prometheus to catch up with its WAL. Switching to n1-standard-4 would give a higher cap of roughly 10GB to expand into.

Warning

This change requires a manual addition to the PriorityClasses set in Kubernetes before rollout.
The following PriorityClass must be applied manually (alternatively, ci/bootstrap.yaml can be re-applied, but verify with --dry-run=server first):

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: auxiliary
value: 700000
globalDefault: false
description: "For auxiliary services like CI, monitoring, grafana, prometheus."

Security Assessment


  • This change potentially impacts the Hail Batch instance as deployed by Broad Institute in GCP

Impact Rating


  • This change has a low security impact

Impact Description

Just internal pod re-prioritization, plus a prometheus version update.

Appsec Review

  • Required: The impact has been assessed and approved by appsec

Copilot AI left a comment
Pull request overview

Adjusts Kubernetes scheduling priorities and resource sizing for cluster monitoring/CI components to reduce batch-driver evictions and improve node packing behavior based on resource requests.

Changes:

  • Introduces a new auxiliary PriorityClass and assigns it to Prometheus, Grafana, Monitoring, and CI (deploy only).
  • Updates Grafana pod scheduling so test/non-deploy runs on preemptible nodes, and increases Grafana memory requests while tightening limits.
  • Bumps Prometheus to v2.53.5 pinned by digest.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated no comments.

Show a summary per file
File Description
prometheus/prometheus.yaml Moves Prometheus to auxiliary priority (deploy) and bumps Prometheus image to 2.53.5 (digest-pinned).
monitoring/deployment.yaml Assigns Monitoring deployment to auxiliary priority.
grafana/deployment.yaml Moves Grafana to auxiliary, makes deploy/non-deploy node placement explicit, and updates resource requests/limits.
ci/deployment.yaml Changes CI priority class to auxiliary for deploy environments.
ci/bootstrap.yaml Adds the new cluster-scoped auxiliary PriorityClass definition.




Development

Successfully merging this pull request may close these issues.

Batch driver restarts

3 participants