feat(prometheus-m7): Alertmanager bi-directional routing #46
Open
LinChuang2008 wants to merge 1 commit into feat/prometheus-sidecar from
Prometheus integration, Milestone 7: Alertmanager joins the sidecar array, completing the forward loop
Prom rule fire → AM (group/dedup/inhibit) → NightMend webhook → AI diagnosis + runbook.
It also exposes a silence management API in the reverse direction (NightMend → AM /api/v2).
Changes (9 files, +460):
Alertmanager deployment:
- alertmanager/alertmanager.yml.template:
  route tree tiered by severity; critical 10s/1m/30min, default 30s/5m/4h
  inhibit_rules: critical suppresses warning on the same instance (noise reduction)
  receiver nightmend-webhook with a Bearer token, send_resolved=true
- alertmanager/entrypoint.sh: envsubst substitutes $NIGHTMEND_WEBHOOK_URL and
  $ALERTMANAGER_WEBHOOK_TOKEN into the template before startup (Alertmanager does not natively support env interpolation)
- alertmanager/Dockerfile: prom/alertmanager:v0.27.0 + gettext (envsubst) + custom entrypoint
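To illustrate the inhibit rule described above (a critical alert suppresses warnings on the same instance), here is a minimal Python sketch of the semantics. It is not Alertmanager's implementation; the `Alert` class and `apply_inhibition` function are hypothetical names used only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    # Hypothetical, simplified alert: only the labels the rule looks at.
    name: str
    severity: str
    instance: str


def apply_inhibition(alerts: list[Alert]) -> list[Alert]:
    """Drop warning alerts whose instance also has a critical alert."""
    critical_instances = {a.instance for a in alerts if a.severity == "critical"}
    return [
        a
        for a in alerts
        if not (a.severity == "warning" and a.instance in critical_instances)
    ]


alerts = [
    Alert("HostDown", "critical", "web-1"),
    Alert("HighLatency", "warning", "web-1"),  # same instance as the critical
    Alert("DiskFilling", "warning", "db-1"),   # different instance, kept
]
remaining = apply_inhibition(alerts)
print([a.name for a in remaining])  # ['HostDown', 'DiskFilling']
```

Only the warning that shares an instance with a critical alert is muted, which is the noise-reduction effect the template aims for.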
Compose integration:
- docker-compose.yml adds an alertmanager service (gated behind the prometheus profile,
  healthcheck /-/healthy, port 9093, alertmanagerdata volume)
Prom wiring:
- prometheus/prometheus.yml enables alerting.alertmanagers → alertmanager:9093,
  api_version v2
Reverse routing (NightMend → AM):
- core/config.py adds alertmanager_url (default http://alertmanager:9093; empty = reverse direction disabled)
- services/alertmanager_client.py:
  _iso RFC3339 conversion (naive → UTC)
  create_silence / delete_silence (404 treated as success) / list_silences / is_healthy
  all errors unified as AlertmanagerUnavailable
- routers/alertmanager_silences.py:
  POST /api/v1/alertmanager/silences: matchers + 60s-7d duration + comment
  DELETE /api/v1/alertmanager/silences/{id}
  GET /api/v1/alertmanager/silences?active_only=true
  GET /api/v1/alertmanager/health
  POST/DELETE require the operator role; GET requires login
  all mutations are audit-logged
- main.py mounts the alertmanager_silences router (noqa E402)
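As a rough sketch of the reverse path above, the following shows one way the RFC 3339 conversion and the silence request body could look. The function names `to_rfc3339` and `build_silence_payload` are hypothetical and not taken from the PR. The payload field names follow the Alertmanager v2 silence schema. The 60s-7d bounds mirror the router's stated validation.

```python
from datetime import datetime, timedelta, timezone

MIN_DURATION = timedelta(seconds=60)
MAX_DURATION = timedelta(days=7)


def to_rfc3339(dt: datetime) -> str:
    """Format a datetime as RFC 3339 in UTC; naive values are assumed to be UTC."""
    if dt.tzinfo is None:
        dt = dt.replace(tzinfo=timezone.utc)
    return dt.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


def build_silence_payload(
    matchers: dict[str, str],
    start: datetime,
    duration: timedelta,
    comment: str,
    created_by: str,
) -> dict:
    """Build a request body for POST /api/v2/silences (exact-match matchers only)."""
    if not MIN_DURATION <= duration <= MAX_DURATION:
        raise ValueError("duration must be between 60s and 7d")
    return {
        "matchers": [
            {"name": name, "value": value, "isRegex": False, "isEqual": True}
            for name, value in matchers.items()
        ],
        "startsAt": to_rfc3339(start),
        "endsAt": to_rfc3339(start + duration),
        "createdBy": created_by,
        "comment": comment,
    }


payload = build_silence_payload(
    {"instance": "web-1"},
    datetime(2025, 1, 1, 12, 0, 0),  # naive, treated as UTC
    timedelta(minutes=30),
    comment="runbook in progress",
    created_by="nightmend",
)
print(payload["startsAt"], payload["endsAt"])
```

Rejecting out-of-range durations before any network call keeps the 422 path independent of Alertmanager's availability.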
Tests (17):
- client unit tests, 10: ISO conversion × 2 / create success + alt key / create 4xx + ConnectError /
  delete 404 treated as success / delete 5xx raises / list active_only filtering / unconfigured URL raises
- router integration tests, 7: POST 204 / viewer 403 / AM unreachable 502 / DELETE 204 /
  list returns / health / duration<60s 422
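Acceptance:
- new tests: 17/17 passed
- regression, full Prometheus suite + alerts: 71/71 passed
- ruff on new files: All checks passed (config.py has 1 pre-existing os import debt)
- docker compose config: valid
This completes the M1-M7 loop; the Prometheus forward and reverse paths are both fully connected.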
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
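Prometheus integration M7: Alertmanager joins the sidecar array, closing the last mile.
Base branch: `feat/prometheus-sidecar` (M1-M5), because M7 depends on the prometheus compose service from M1. Merge order: PR #44 → this PR.
Change list (10 files, +647)
Alertmanager deployment
Compose + Prom wiring
Reverse silence management
Tests: 17
Acceptance evidence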
```
new tests 17/17 passed
full regression suite 71/71 passed (alertmanager + remote_write + file_sd + rules_sync + remote + alerts)
ruff on new files All checks passed (config.py 1 pre-existing)
docker compose config valid
```
Enablement contract
```bash
# Add to .env
ALERTMANAGER_WEBHOOK_TOKEN=$(openssl rand -hex 32)
# Must match ALERTMANAGER_WEBHOOK_TOKEN in NightMend's .env (webhook auth)
# Start the full stack
docker compose --profile prometheus up -d
```
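Because the webhook is authenticated by a shared Bearer token, the receiving side has to compare the incoming `Authorization` header against the configured value. The following is a minimal sketch of such a check, assuming a plain `Bearer <token>` header; `is_authorized` is a hypothetical helper, not code from this PR.

```python
import hmac
from typing import Optional


def is_authorized(authorization_header: Optional[str], expected_token: str) -> bool:
    """Return True only if the header carries the expected Bearer token."""
    if not authorization_header or not expected_token:
        # Missing header or unconfigured token: reject rather than allow.
        return False
    scheme, _, token = authorization_header.partition(" ")
    if scheme.lower() != "bearer":
        return False
    # Constant-time comparison avoids leaking token contents via timing.
    return hmac.compare_digest(token.strip().encode(), expected_token.encode())


print(is_authorized("Bearer s3cret", "s3cret"))  # True
print(is_authorized("Bearer wrong", "s3cret"))   # False
```

Failing closed when the token is empty means a missing `.env` entry cannot silently disable authentication.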
Full loop diagram
```
User exporter (M6) → file_sd (M3) → Prom scrape (M1)
↓
PromQL rule (M2 + M5 UI)
↓ fire
Alertmanager (M7)
↓ webhook (Bearer)
NightMend backend
↓
AI diagnosis + auto-remediation runbook
↓ during execution
NightMend → AM silence (M7 reverse)
↓
Other Prom instances can remote_write (M4) back to NightMend
```
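For the "Alertmanager → webhook → NightMend backend" hop in the diagram, the receiver gets a JSON body with an `alerts` array in which each entry carries its own `status`. Since `send_resolved=true`, both firing and resolved alerts arrive. Below is a minimal sketch of splitting them; `split_by_status` and the sample payload are illustrative assumptions, not the PR's handler.

```python
def split_by_status(payload: dict) -> tuple[list[dict], list[dict]]:
    """Split an Alertmanager webhook payload into (firing, resolved) alerts."""
    firing: list[dict] = []
    resolved: list[dict] = []
    for alert in payload.get("alerts", []):
        if alert.get("status") == "resolved":
            resolved.append(alert)
        else:
            firing.append(alert)
    return firing, resolved


# Hypothetical payload, trimmed to the fields used here.
sample = {
    "version": "4",
    "status": "firing",
    "receiver": "nightmend-webhook",
    "alerts": [
        {"status": "firing", "labels": {"alertname": "HostDown", "instance": "web-1"}},
        {"status": "resolved", "labels": {"alertname": "DiskFilling", "instance": "db-1"}},
    ],
}
firing, resolved = split_by_status(sample)
print(len(firing), len(resolved))  # 1 1
```

Firing alerts would feed the diagnosis step, while resolved ones let the backend close out earlier incidents.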
🤖 Generated with Claude Code