feat(prometheus-m7): Alertmanager bi-directional routing #46

Open
LinChuang2008 wants to merge 1 commit into `feat/prometheus-sidecar` from `feat/prometheus-m7-alertmanager`

Summary

Prometheus integration M7: Alertmanager joins the sidecar stack, closing the last mile.

  • Forward: Prom rule fires → AM (group/dedup/inhibit) → NightMend webhook → AI diagnosis + runbook
  • Reverse: NightMend → AM `/api/v2/silences` to manage silences

Base branch: `feat/prometheus-sidecar` (M1-M5), because M7 depends on the prometheus compose service from M1. Merge order: PR #44 → this PR.

Changes (10 files, +647)

Alertmanager deployment

  • `alertmanager/alertmanager.yml.template`: routing configuration
    • route tiered by severity: critical 10s/1m/30min, default 30s/5m/4h
    • inhibit_rules: a critical alert suppresses warning alerts on the same instance to reduce noise
    • receiver nightmend-webhook carries a Bearer token + send_resolved
  • `alertmanager/entrypoint.sh`: envsubst templating (Alertmanager does not natively support env interpolation)
  • `alertmanager/Dockerfile`: based on prom/alertmanager:v0.27.0 + gettext + custom entrypoint
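
The inhibit rule described above can be illustrated with a small Python sketch. This is not Alertmanager's implementation, only a model of the stated behavior: a firing critical alert suppresses warning alerts that share its `instance` label. The function name `apply_inhibition` and the sample alerts are hypothetical. Following Alertmanager's documented semantics, a missing `instance` label is treated the same as an empty one.

```python
def apply_inhibition(alerts):
    """Drop warning alerts whose instance also has a critical alert.

    Each alert is a dict with a "labels" dict. A missing "instance"
    label is treated as the empty string.
    """
    critical_instances = {
        a["labels"].get("instance", "")
        for a in alerts
        if a["labels"].get("severity") == "critical"
    }
    return [
        a for a in alerts
        if not (
            a["labels"].get("severity") == "warning"
            and a["labels"].get("instance", "") in critical_instances
        )
    ]


# Hypothetical sample data for illustration only.
alerts = [
    {"labels": {"alertname": "DiskFull", "severity": "critical", "instance": "node-1"}},
    {"labels": {"alertname": "DiskHigh", "severity": "warning", "instance": "node-1"}},
    {"labels": {"alertname": "DiskHigh", "severity": "warning", "instance": "node-2"}},
]
remaining = apply_inhibition(alerts)
```

Only the warning on `node-1` is dropped; the warning on `node-2` still reaches the webhook because no critical alert is firing for that instance.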

Compose + Prom wiring

  • `docker-compose.yml`: adds an alertmanager service (gated by profile + healthcheck + port 9093)
  • `prometheus.yml`: enables `alerting.alertmanagers → alertmanager:9093`, api_version v2

Reverse silence management

  • `core/config.py`: adds the `alertmanager_url` setting
  • `services/alertmanager_client.py`: `create/delete/list_silences` + `is_healthy`
    • RFC3339 time conversion
    • errors unified as `AlertmanagerUnavailable`
    • 404 treated as a successful delete (idempotent)
  • `routers/alertmanager_silences.py`: 4 endpoints
    • POST /silences: audited + 60s-7d duration + operator permission
    • DELETE /silences/{id}: operator + audited
    • GET /silences: readable by any user, `?active_only=true` filter
    • GET /health
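
As a rough illustration of the create path, the sketch below builds a request body for Alertmanager's `POST /api/v2/silences`, combining the RFC3339 conversion and the 60s-7d duration bound listed above. The helper names `to_rfc3339` and `build_silence_payload` are assumptions, not the PR's actual functions; the field names (`matchers`, `startsAt`, `endsAt`, `createdBy`, `comment`) are those of the Alertmanager v2 API.

```python
from datetime import datetime, timedelta, timezone

MIN_DURATION = timedelta(seconds=60)
MAX_DURATION = timedelta(days=7)


def to_rfc3339(dt: datetime) -> str:
    """Format a datetime as RFC 3339 in UTC; naive values are assumed to be UTC."""
    if dt.tzinfo is None:
        dt = dt.replace(tzinfo=timezone.utc)
    return dt.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


def build_silence_payload(matchers, start, duration, created_by, comment):
    """Build a silence body with exact-match matchers, enforcing 60s-7d."""
    if not MIN_DURATION <= duration <= MAX_DURATION:
        raise ValueError("duration must be between 60s and 7d")
    return {
        "matchers": [
            {"name": name, "value": value, "isRegex": False, "isEqual": True}
            for name, value in matchers.items()
        ],
        "startsAt": to_rfc3339(start),
        "endsAt": to_rfc3339(start + duration),
        "createdBy": created_by,
        "comment": comment,
    }


# Hypothetical values for illustration only.
payload = build_silence_payload(
    {"instance": "node-1:9100"},
    datetime(2024, 1, 1, 0, 0, 0),  # naive, so treated as UTC
    timedelta(hours=1),
    "operator",
    "runbook in progress",
)
```

Rejecting out-of-range durations before any network call is one way the router could return 422 without contacting Alertmanager; whether the PR validates at this layer is not stated.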

Tests: 17 cases

  • client unit (10): ISO conversion × 2 / create success + fallback key / create 4xx + ConnectError / delete 404 succeeds / delete 5xx raises / list active_only / unconfigured URL raises
  • router integration (7): POST 204 / viewer 403 / AM unreachable 502 / DELETE 204 / list returns / health / duration<60s 422

Acceptance evidence

```
New tests 17/17 passed
Full regression suite 71/71 passed (alertmanager + remote_write + file_sd + rules_sync + remote + alerts)
ruff on new files: All checks passed (config.py has 1 pre-existing issue)
docker compose config valid
```

Enablement contract

```bash
# Add to .env
ALERTMANAGER_WEBHOOK_TOKEN=$(openssl rand -hex 32)
# Must match ALERTMANAGER_WEBHOOK_TOKEN in NightMend's .env (webhook auth)

# Start the full stack
docker compose --profile prometheus up -d
```
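
Because both sides must share the same token, the receiving end has to compare the incoming `Authorization` header against its configured value. A minimal sketch of such a check follows; the PR does not show NightMend's actual webhook handler, so the function name `is_authorized` and its signature are hypothetical. `hmac.compare_digest` is used so the comparison time does not depend on how many leading characters match.

```python
import hmac


def is_authorized(authorization_header, expected_token):
    """Return True only for a 'Bearer <token>' header matching expected_token."""
    if not authorization_header or not expected_token:
        # A missing header or an unconfigured token never authorizes.
        return False
    scheme, _, token = authorization_header.partition(" ")
    if scheme.lower() != "bearer":
        return False
    return hmac.compare_digest(token.strip().encode(), expected_token.encode())
```

Treating an empty configured token as "deny" rather than "allow" avoids accidentally opening the webhook when the environment variable is unset.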

Full closed-loop diagram

```
User exporter (M6) → file_sd (M3) → Prom scrape (M1)
↓
PromQL rule (M2 + M5 UI)
↓ fire
Alertmanager (M7)
↓ webhook (Bearer)
NightMend backend
↓
AI diagnosis + automated remediation runbook
↓ while executing
NightMend → AM silence (M7 reverse)

Other Prom instances can remote_write (M4) back to NightMend
```

🤖 Generated with Claude Code

Prometheus integration Milestone 7: Alertmanager joins the sidecar stack, forming the complete forward loop
    Prom rule fires → AM (group/dedup/inhibit) → NightMend webhook → AI diagnosis + runbook
and, in the reverse direction, providing a silence management API (NightMend → AM /api/v2).

Changes (9 files, +460):

Alertmanager deployment:
- alertmanager/alertmanager.yml.template:
    route tree tiered by severity; critical 10s/1m/30min, default 30s/5m/4h
    inhibit_rules: critical suppresses warning on the same instance (noise reduction)
    receiver nightmend-webhook carries a Bearer token, send_resolved=true
- alertmanager/entrypoint.sh: envsubst inserts $NIGHTMEND_WEBHOOK_URL and
    $ALERTMANAGER_WEBHOOK_TOKEN into the template before startup (Alertmanager does not natively support env interpolation)
- alertmanager/Dockerfile: prom/alertmanager:v0.27.0 + gettext (envsubst) + custom entrypoint
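
To make the templating step concrete, here is a Python sketch of what the envsubst pass accomplishes. The PR's entrypoint is a shell script and the real template is not shown here, so the YAML fragment, the `render_template` helper, and the sample values are all hypothetical; only the two variable names come from the commit message. One deliberate difference: envsubst replaces unset variables with an empty string, whereas this sketch raises `KeyError`, which surfaces a missing token at startup.

```python
from string import Template

# Hypothetical fragment; the actual alertmanager.yml.template may differ.
TEMPLATE = """\
receivers:
  - name: nightmend-webhook
    webhook_configs:
      - url: ${NIGHTMEND_WEBHOOK_URL}
        send_resolved: true
        http_config:
          authorization:
            type: Bearer
            credentials: ${ALERTMANAGER_WEBHOOK_TOKEN}
"""


def render_template(text: str, env: dict) -> str:
    """Substitute $VAR / ${VAR} placeholders; raise KeyError if one is missing."""
    return Template(text).substitute(env)


rendered = render_template(
    TEMPLATE,
    {
        "NIGHTMEND_WEBHOOK_URL": "http://nightmend.example/webhook",  # placeholder URL
        "ALERTMANAGER_WEBHOOK_TOKEN": "example-token",  # placeholder token
    },
)
```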

Compose integration:
- docker-compose.yml adds an alertmanager service (gated by the prometheus profile,
  healthcheck /-/healthy, port 9093, alertmanagerdata volume)

Prom wiring:
- prometheus/prometheus.yml enables alerting.alertmanagers → alertmanager:9093,
  api_version v2

Reverse routing (NightMend → AM):
- core/config.py adds alertmanager_url (default http://alertmanager:9093; empty = reverse direction disabled)
- services/alertmanager_client.py:
    _iso RFC3339 conversion (naive → UTC)
    create_silence / delete_silence (404 treated as success) / list_silences / is_healthy
    errors unified as AlertmanagerUnavailable
- routers/alertmanager_silences.py:
    POST /api/v1/alertmanager/silences: matchers + 60s-7d duration + comment
    DELETE /api/v1/alertmanager/silences/{id}
    GET /api/v1/alertmanager/silences?active_only=true
    GET /api/v1/alertmanager/health
    POST/DELETE require operator; GET requires login
    all mutations are audit-logged
- main.py mounts the alertmanager_silences router (noqa E402)
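
The idempotent delete and the `active_only` filter can be sketched as two pure functions. This is a guess at the shape of the logic, not the PR's code: `AlertmanagerUnavailable` below is a stand-in class, `interpret_delete_status` and `filter_active` are hypothetical names, and treating every non-2xx, non-404 status as an error is an assumption. The `status.state` field is part of the silence objects returned by Alertmanager's v2 API.

```python
class AlertmanagerUnavailable(Exception):
    """Stand-in for the PR's error type; the real definition is not shown."""


def interpret_delete_status(status_code: int) -> bool:
    """Map a DELETE response status to success, treating 404 as already deleted."""
    if status_code == 404 or 200 <= status_code < 300:
        return True
    raise AlertmanagerUnavailable(f"Alertmanager returned HTTP {status_code}")


def filter_active(silences):
    """Keep only silences whose status.state is 'active'."""
    return [s for s in silences if s.get("status", {}).get("state") == "active"]


# Hypothetical sample data for illustration only.
sample = [
    {"id": "a", "status": {"state": "active"}},
    {"id": "b", "status": {"state": "expired"}},
    {"id": "c", "status": {"state": "pending"}},
]
active = filter_active(sample)
```

Treating 404 as success means a retried or duplicated DELETE does not surface as an error to the operator, which matches the "idempotent" note in the PR summary.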

Tests (17 cases):
- client unit (10): ISO conversion × 2 / create success + alt key / create 4xx + ConnectError /
    delete 404 treated as success / delete 5xx raises / list active_only filter / unconfigured URL raises
- router integration (7): POST 204 / viewer 403 / AM unreachable 502 / DELETE 204 /
    list returns / health / duration<60s 422

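Acceptance:
- New tests 17/17 passed
- Regression: full prom suite + alerts = 71/71 passed
- ruff on new files: All checks passed (config.py has 1 pre-existing os import debt)
- docker compose config: valid

With this, M1-M7 form a complete closed loop: the Prometheus forward and reverse paths are both connected.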

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>