50 changes: 50 additions & 0 deletions content/postmortem/2026-03-19.md
@@ -0,0 +1,50 @@
---
title: '2026-03-19'
description: Memory pressure on Paris hypervisors causing service disruptions and deployment pipeline outage
date: 2026-03-27
excludeSearch: true
type: docs
---

{{< hextra/hero-subtitle >}}

A memory pressure spike on ~15 Paris hypervisors, amplified by a placement algorithm threshold effect, caused cascading failures impacting hosted services, the deployment pipeline, and observability from 15:50 to 18:10 UTC.

{{< /hextra/hero-subtitle >}}

The overloaded servers triggered OS memory reclaim under out-of-memory (OOM) conditions, which consumed CPU and destabilized the HAProxy load balancers. This caused connectivity loss for hosted services and paralyzed internal automation (deployments, monitoring). Full recovery was reached at 21:41 UTC.

Status updates were communicated through: https://www.clevercloudstatus.com/

### Timeline

| Time (UTC) | Description |
| ---------------- | ------------------------------------------------------------ |
| 2026-03-19 15:20 | First degradation signals on internal services (VM registration, deployment service database access). No client impact yet. |
| 2026-03-19 15:50 | Degradations and outages visible for some clients. Services hosted on the most impacted hypervisors become unreachable. Deployment pipeline stops. |
| 2026-03-19 16:03 | Incident response team mobilized. Mitigation actions begin: stabilizing hypervisors under pressure, progressive restart of internal clusters (messaging, observability), partial deployment resumption. |
| 2026-03-19 17:30 | Full restoration of the deployment pipeline and observability. Continued recovery of impacted client services. |
| 2026-03-19 18:09 | All impacted client services restored. Residual alert cleanup begins. |
| 2026-03-19 21:41 | Service fully operational. All residual alerts cleared. |

## Analysis

A simultaneous memory consumption spike was observed on approximately fifteen host servers (hypervisors) in the Paris region. The primary suspect identified is the TCP load balancing process (HAProxy), whose instrumentation at the time did not provide sufficient granularity to immediately confirm the cause.

An amplifying effect from the placement algorithm was identified: servers were already running at elevated load due to a threshold effect in the orchestration and resource placement algorithm. During rapid workload growth, the algorithm did not distribute load linearly beyond a certain threshold, causing some servers to be more heavily loaded than they should have been.
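
To make the threshold effect concrete, here is a minimal sketch, not the actual placement algorithm. The host names, capacities, the 0.8 threshold, and both scoring functions are hypothetical and chosen only for illustration. It compares a step-shaped score, where every host below the threshold looks equally attractive, with a score that decreases continuously as a host fills up.

```python
from dataclasses import dataclass


@dataclass
class Host:
    name: str
    used_gb: float
    total_gb: float

    @property
    def utilization(self) -> float:
        return self.used_gb / self.total_gb


def threshold_score(host: Host, threshold: float = 0.8) -> float:
    """Hypothetical step scoring: all hosts below the threshold score the same."""
    return 1.0 if host.utilization < threshold else 0.0


def smooth_score(host: Host) -> float:
    """Hypothetical smooth scoring: the score falls continuously as the host fills."""
    return (1.0 - host.utilization) ** 2


def place(hosts: list[Host], vm_gb: float, score) -> Host:
    """Place one VM on the best-scoring host that still has room for it."""
    candidates = [h for h in hosts if h.used_gb + vm_gb <= h.total_gb]
    best = max(candidates, key=score)  # ties resolve to the first host in the list
    best.used_gb += vm_gb
    return best


def simulate(score, vm_count: int = 12, vm_gb: float = 8.0) -> dict[str, float]:
    """Place a burst of identical VMs and return the memory used per host, in GB."""
    hosts = [
        Host("hv-a", 32.0, 128.0),
        Host("hv-b", 32.0, 128.0),
        Host("hv-c", 32.0, 128.0),
    ]
    for _ in range(vm_count):
        place(hosts, vm_gb, score)
    return {h.name: h.used_gb for h in hosts}


print(simulate(threshold_score))  # {'hv-a': 104.0, 'hv-b': 56.0, 'hv-c': 32.0}
print(simulate(smooth_score))     # {'hv-a': 64.0, 'hv-b': 64.0, 'hv-c': 64.0}
```

In this toy model the step score keeps piling VMs onto the first host until it crosses the threshold, leaving it with little headroom, while the smooth score spreads the same burst evenly.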

The memory saturation triggered the OS memory reclaim mechanism, which is itself highly CPU-intensive. This sudden CPU load spike disrupted the load balancers (HAProxy) on those servers, cascading into connectivity interruptions for hosted services and temporarily paralyzing internal automation services (deployments, monitoring).
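
One way to watch for this kind of reclaim stall on a Linux host is the kernel's pressure stall information (PSI) file `/proc/pressure/memory`. The sketch below is illustrative only: the postmortem does not say which signals were used during the incident, and the 5% alert limit and function names are assumptions.

```python
from pathlib import Path


def parse_psi(text: str) -> dict[str, dict[str, float]]:
    """Parse PSI text: a 'some' line and a 'full' line of key=value pairs."""
    result: dict[str, dict[str, float]] = {}
    for line in text.strip().splitlines():
        kind, *fields = line.split()
        result[kind] = {k: float(v) for k, v in (f.split("=", 1) for f in fields)}
    return result


def memory_under_pressure(psi: dict[str, dict[str, float]],
                          full_avg10_limit: float = 5.0) -> bool:
    """True when the 10-second 'full' stall percentage exceeds the (assumed) limit."""
    return psi["full"]["avg10"] > full_avg10_limit


def read_memory_psi(path: str = "/proc/pressure/memory") -> dict[str, dict[str, float]]:
    """Read and parse the live PSI file (requires a kernel with PSI enabled)."""
    return parse_psi(Path(path).read_text())


sample = (
    "some avg10=12.50 avg60=4.00 avg300=1.00 total=123456\n"
    "full avg10=7.25 avg60=2.00 avg300=0.50 total=65432\n"
)
print(memory_under_pressure(parse_psi(sample)))  # True
```

The `full` line reports the share of time in which all non-idle tasks were stalled on memory at once, which is the situation where reclaim work crowds out everything else on the host.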

### Actions

* Placement scoring function modified to smooth load ramp-up and eliminate the threshold effect
* Dedicated monitoring added to detect any recurrence of this behavior
* Temporary fixes deployed immediately during the incident to mitigate effects
* Partial algorithm redesign to guarantee smooth and optimal resource allocation under high load
* Enhanced observability for non-VM processes running on hypervisors (especially HAProxy instrumentation)
* Memory pressure resilience improvements prioritized and delivered within the following week
* Two new datacenters planned for the Paris region in Q2 2026 to increase capacity and reduce sensitivity to load spikes
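
As an illustration of the observability item above, the following sketch totals resident memory per process name from `/proc/<pid>/status` on a Linux host. It is an assumption about what such instrumentation could look like, not a description of the tooling actually deployed, and the function names are hypothetical.

```python
import os
from typing import Iterable, Iterator, Optional


def status_field(status_text: str, key: str) -> Optional[str]:
    """Return the value of one 'Key: value' line from /proc/<pid>/status text."""
    prefix = key + ":"
    for line in status_text.splitlines():
        if line.startswith(prefix):
            return line[len(prefix):].strip()
    return None


def sum_rss_by_name(status_texts: Iterable[str]) -> dict[str, int]:
    """Sum VmRSS (in kB) per process name; entries without VmRSS are skipped."""
    totals: dict[str, int] = {}
    for text in status_texts:
        name = status_field(text, "Name")
        rss = status_field(text, "VmRSS")  # e.g. "204800 kB"; absent for kernel threads
        if name is None or rss is None:
            continue
        totals[name] = totals.get(name, 0) + int(rss.split()[0])
    return totals


def read_status_texts(proc_root: str = "/proc") -> Iterator[str]:
    """Yield the status text of every process currently listed under proc_root."""
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "status")) as handle:
                yield handle.read()
        except OSError:
            continue  # the process exited between listdir() and open()


samples = [
    "Name:\thaproxy\nVmRSS:\t  204800 kB\n",
    "Name:\thaproxy\nVmRSS:\t  102400 kB\n",
    "Name:\tkworker/0:1\nState:\tI (idle)\n",
]
print(sum_rss_by_name(samples))  # {'haproxy': 307200}
```

On a real host, `sum_rss_by_name(read_status_texts())` gives a per-name total that can be exported as a metric. Summing RSS counts pages shared between processes more than once, so the totals are an upper bound rather than an exact figure.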

### Conclusion

The incident was caused by a combination of a memory spike on HAProxy processes and a placement algorithm threshold effect that left insufficient headroom on impacted hypervisors. The response team was mobilized quickly and worked continuously until full recovery. Corrective measures have been applied to the placement algorithm, and observability improvements have been deployed to accelerate future diagnosis. Additional Paris datacenter capacity is planned for Q2 2026.

1 change: 1 addition & 0 deletions content/postmortem/_index.md
@@ -9,3 +9,4 @@ type: docs
- [2024-08-02](/postmortem/2024-08-02)
- [2025-03-03](/postmortem/2025-03-03)
- [2025-10-09](/postmortem/2025-10-09)
- [2026-03-19](/postmortem/2026-03-19)