feat(sprint1): security hardening + resilience polish #43

Open
LinChuang2008 wants to merge 9 commits into main from feat/sprint1-security-hardening

@LinChuang2008
Owner

Summary

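Sprint 1 closes out six work items: security fixes, a performance index, and robustness polish. No new dependencies, and changes are kept to the smallest possible surface.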

Change list (+338 / -24, 10 files)

🔒 Security (3 items)

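  • Topology tooltip XSS fix (frontend/src/pages/Topology.tsx)
    Adds `escapeHtml()`; all node tooltip fields (name/host/status/type) and edge tooltip fields (source/target/description/labelKey) are now escaped, so a malicious node name or description cannot inject `<script>`.
  • Webhook IP allowlist (backend/app/{core/config.py, routers/webhooks.py})
    Adds `ALERTMANAGER_WEBHOOK_ALLOWED_IPS` and `WEBHOOK_TRUST_FORWARDED`, supporting CIDR, IPv4, and IPv6. The allowlist check runs after token validation, so unauthenticated callers cannot probe for IP misconfiguration.
  • Global log redaction Filter (backend/app/core/log_redaction.py + main.py)
    Attaches a `RedactionFilter` to the root logger that masks `Bearer` / `Authorization` / `api_key` / `token` / `secret` and similar fields, preventing credential leaks into ELK or log files. Installation is idempotent.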
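
To make the redaction item concrete, here is a minimal sketch of what such a filter could look like. The regexes, the `redact()` helper, and `install_redaction_filter()` are hypothetical stand-ins, not the contents of `log_redaction.py`. One detail of the standard library is worth noting: a filter attached to a logger is not consulted for records that propagate up from child loggers, so this sketch attaches the filter to the root logger's handlers instead. How the PR itself installs it is not shown here.

```python
import logging
import re

REDACTED = "[REDACTED]"

# Hypothetical patterns for illustration; the PR's real regexes are not shown.
_BEARER = re.compile(r"(?i)\b(bearer)\s+[A-Za-z0-9._~+/=-]+")
_KEY_VALUE = re.compile(
    r"""(?i)\b(api_key|token|secret)\b(["']?\s*[:=]\s*["']?)[^\s"',&}]+"""
)


def redact(text: str) -> str:
    """Mask bearer tokens and key/value credentials in a log message."""
    text = _BEARER.sub(r"\1 " + REDACTED, text)
    return _KEY_VALUE.sub(r"\1\2" + REDACTED, text)


class RedactionFilter(logging.Filter):
    """Rewrites the record in place; never drops a record."""

    def filter(self, record: logging.LogRecord) -> bool:
        message = record.getMessage()  # merges msg with its % args
        cleaned = redact(message)
        if cleaned != message:
            record.msg = cleaned
            record.args = ()  # args are already merged into msg
        return True


def install_redaction_filter() -> None:
    """Idempotent: each root handler gets at most one RedactionFilter."""
    for handler in logging.getLogger().handlers:
        if not any(isinstance(f, RedactionFilter) for f in handler.filters):
            handler.addFilter(RedactionFilter())
```

Calling `install_redaction_filter()` a second time is a no-op for handlers that already carry the filter. Handlers added to the root logger afterwards would need another call.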

⚡ Performance (1 item)

  • offline_detector composite index (backend/alembic/versions/032_*, models/host.py, tasks/offline_detector.py)
    Adds the composite index `ix_hosts_status_last_heartbeat` (on Postgres it is created with `CONCURRENTLY` to avoid locking the table) and pushes `last_heartbeat < cutoff` down into SQL, reducing the scan from O(online hosts) to O(online AND stale).
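
The pushdown can be illustrated with a small self-contained sketch. It uses the standard-library `sqlite3` module purely so that it runs anywhere; the project targets Postgres through Alembic, and the table layout, `build_demo_db()`, and `find_stale_online_hosts()` are assumptions made for illustration. Only the index name and its `(status, last_heartbeat)` columns come from the PR.

```python
import sqlite3
from datetime import datetime, timedelta, timezone


def build_demo_db() -> sqlite3.Connection:
    """In-memory stand-in for the hosts table (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE hosts ("
        "id INTEGER PRIMARY KEY, status TEXT NOT NULL, last_heartbeat TEXT)"
    )
    # Same column order as the PR's index: equality column first, range second.
    conn.execute(
        "CREATE INDEX ix_hosts_status_last_heartbeat "
        "ON hosts (status, last_heartbeat)"
    )
    return conn


def find_stale_online_hosts(conn, now: datetime, timeout: timedelta) -> list:
    """Let the database apply the cutoff instead of filtering rows in Python."""
    # ISO-8601 strings in one timezone and one precision sort chronologically.
    cutoff = (now - timeout).isoformat()
    rows = conn.execute(
        "SELECT id FROM hosts "
        "WHERE status = 'online' AND last_heartbeat < ? ORDER BY id",
        (cutoff,),
    ).fetchall()
    return [row[0] for row in rows]
```

With the predicate in SQL, only rows that are both online and stale come back, which is the O(online AND stale) behaviour described above. On the Postgres side, `CREATE INDEX CONCURRENTLY` cannot run inside a transaction block, so the migration has to step outside Alembic's default transaction (for example with `autocommit_block()`); how migration 032 handles this is not shown here.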

🛡️ Robustness (2 items)

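  • OpsWebSocket exponential backoff (frontend/src/hooks/useOpsWebSocket.ts)
    Replaces the fixed 3s reconnect with exponential backoff (starting at 1s, doubling each attempt, capped at 30s, with ±20% jitter) for at most 10 attempts; once the limit is exceeded, an `error` event is pushed to notify the user. The counter resets on a successful `onopen`.
  • Unified frontend API error handling (frontend/src/services/api.ts)
    Refactors the axios interceptor: a 401 clears the cache and redirects; toasts for 429/403/5xx are throttled to one per 3s; idempotent GETs that hit a network error or 502/503/504 are retried automatically with exponential backoff, up to 2 times; components can opt out of the global toast via `__noToast`.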
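
The reconnect schedule described for OpsWebSocket can be worked through numerically. The hook itself is TypeScript; the sketch below is a Python rendition of the arithmetic only, and `reconnect_delay()`, `should_give_up()`, and the choice to apply jitter after the 30s cap are assumptions rather than the hook's actual code.

```python
import random

BASE_SECONDS = 1.0   # first retry delay
CAP_SECONDS = 30.0   # upper bound before jitter
JITTER = 0.2         # +/- 20%
MAX_ATTEMPTS = 10    # after this many failures, surface an error


def reconnect_delay(attempt: int, rng=random.random) -> float:
    """Delay before retry number `attempt` (0-based), in seconds."""
    capped = min(CAP_SECONDS, BASE_SECONDS * (2 ** attempt))
    # rng() is in [0, 1); map it onto a factor in [1 - JITTER, 1 + JITTER).
    factor = 1 + JITTER * (2 * rng() - 1)
    return capped * factor


def should_give_up(failed_attempts: int) -> bool:
    """True once the retry budget is spent and the user should be notified."""
    return failed_attempts >= MAX_ATTEMPTS
```

Without jitter the ten delays are 1, 2, 4, 8, 16, 30, 30, 30, 30, 30 seconds, so a client retries for roughly three minutes before giving up. The jitter spreads reconnects from many clients so they do not all hit the server at the same instant after an outage.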

Acceptance evidence

```
backend ruff (new files) ✅ All checks passed
backend tests ✅ 38 passed (test_webhooks + test_hosts + test_alerts)
⚠️ 1 baseline failure (confirmed via git stash as pre-existing debt, unrelated to this PR)
frontend tsc ✅ 0 errors
frontend eslint (new code) ✅ 0 new errors (pre-existing any debt left untouched)
smoke: redaction filter ✅ Bearer / header / query / JSON 4/4 redacted
smoke: IP allowlist parsing ✅ IPv4 / IPv6 / CIDR / bare IP matched, bad entry correctly rejected
smoke: Host.table ✅ ix_hosts_status_last_heartbeat registered
```

Deployment notes

This PR adds two optional environment variables, which should also be added to `.env.example`:

```bash
# AlertManager Bridge IP allowlist (leave empty to disable)
ALERTMANAGER_WEBHOOK_ALLOWED_IPS=10.0.0.0/8,192.168.1.100

# Enable when running behind a reverse proxy
WEBHOOK_TRUST_FORWARDED=false
```

Run `alembic upgrade head` to apply the migration that creates the new index.
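
As a rough illustration of how a value such as `10.0.0.0/8,192.168.1.100` might be interpreted, the sketch below uses the standard-library `ipaddress` module. The function names, the choice to raise on a malformed entry, and the use of the first `X-Forwarded-For` hop are assumptions; the PR's `webhooks.py` may differ on all three.

```python
import ipaddress


def parse_allowed_ips(raw: str) -> list:
    """Parse a comma-separated allowlist of IPs and CIDR blocks.

    A bare IP becomes a /32 (or /128) network. A malformed entry raises
    ValueError, so a bad config fails loudly (assumed behaviour).
    """
    networks = []
    for entry in raw.split(","):
        entry = entry.strip()
        if entry:
            networks.append(ipaddress.ip_network(entry, strict=False))
    return networks


def resolve_client_ip(peer_ip: str, forwarded_for, trust_forwarded: bool) -> str:
    """Pick the address to check. Only honour X-Forwarded-For when told to.

    Assumption: the first hop is used. That value is client-supplied unless
    the reverse proxy overwrites the header, so enable this only behind one.
    """
    if trust_forwarded and forwarded_for:
        return forwarded_for.split(",")[0].strip()
    return peer_ip


def is_allowed(client_ip: str, networks: list) -> bool:
    """An empty allowlist disables the check, matching 'leave empty to disable'."""
    if not networks:
        return True
    try:
        address = ipaddress.ip_address(client_ip)
    except ValueError:
        return False  # unparseable client address: deny
    return any(address in network for network in networks)
```

For example, `is_allowed("10.1.2.3", parse_allowed_ips("10.0.0.0/8,192.168.1.100"))` is true, while `192.168.1.101` is rejected. In the PR this check runs only after the token has been validated.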

Test plan

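  • `alembic upgrade head` successfully creates `ix_hosts_status_last_heartbeat`
  • With `ALERTMANAGER_WEBHOOK_ALLOWED_IPS=127.0.0.1` configured, calling `/api/v1/webhooks/alertmanager` from a non-allowlisted IP returns 403
  • When a node name on the topology page contains `` or similar content, the tooltip shows the escaped characters instead of executing them
  • Disconnect the network to test the OpsAssistant WS; the console should show exponential-backoff logs, followed by an error toast after 10 attempts
  • Open DevTools and trigger a 429; the frontend shows the toast only once (3s throttle)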

🤖 Generated with Claude Code

LinChuang2008 and others added 9 commits April 8, 2026 15:07
Two bugs in the AI Ops Assistant (/ops) page made it unusable on a fresh
install:

1. POST /api/v1/ops/sessions was reusing the user's existing empty
   draft via _cleanup_empty_sessions(), so clicking "新建会话" (New session)
   silently returned the same session id every time and body.title
   was dropped on the floor. Replace the reuse path with explicit
   cleanup of stale empty drafts (no title + no messages) followed
   by always creating a fresh session that respects body.title.

2. OpsInputBar wrapped the entire host-picker row in
   `{hosts.length > 0 && ...}`. On a fresh install with no hosts
   in the database the row vanished, but the send guard still
   required selectedHostId, so users saw "请先选择目标主机" (select a target host first) with
   no way to actually select one. Always render the row and show
   an empty-state hint linking to /hosts when no hosts exist.

Add 4 regression tests covering:
- Two consecutive POSTs return different ids
- body.title is persisted
- Stale empty drafts are cleaned up on create
- Sessions with title or messages are preserved

Verified end-to-end against the running backend (3 POSTs returned
3 distinct ids with titles "first" / "second" / null) and via
git stash to confirm the tests fail without the backend fix.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Backend:
- Add DEMO_MODE and DEMO_FAULT_DELAY_SECONDS config options
- Create demo_orchestrator.py: seed data, fault injection, auto AI diagnosis
- Create demo.py router: GET /api/v1/demo/status endpoint
- Add auto_approve flag to ToolContext for demo command execution
- Register demo flow as background task in main.py lifespan

Frontend:
- Add DEMO badge to OpsAssistant header when demo mode active
- Add global alert bar to AppLayout with auto-redirect to OpsAssistant
- Add getDemoStatus API to opsApi.ts
- Auto-select demo session and poll demo phase

Infrastructure:
- Create docker-compose.demo.yml with DEMO_MODE=true

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
… TLS, host override

- Restore with_for_update() on first-user admin check (race condition fix)
- Demo auto_approve now uses safe command whitelist instead of blanket approve
- Demo state synced to Redis for multi-worker consistency
- Host override only applies when LLM omits host_id (no silent overwrite)
- Draft session cleanup skips recently active sessions
- restore_yum function defined before usage in install script

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
… hardening

- notifications.py: add missing datetime/timezone import (NameError fix)
- settings.py: add missing HTTPException import (NameError fix)
- service_checker.py: remove verify=False on httpx client
- main.py: remove hardcoded production IP from CORS origins
- docker-compose.prod.yml: POSTGRES_PASSWORD now required (no default fallback)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
… cleanup

- ruff --fix: removed 181 unused imports across backend
- requirements.txt: bump cryptography>=46.0.6, asyncssh>=2.14.2, PyJWT>=2.12.0,
  python-multipart==0.0.22, fastmcp>=3.2.0 (5 CVE fixes)
- npm audit fix: resolved 11 vulnerabilities (including critical axios SSRF)
- Removed 5 unused npm deps: @dnd-kit/*, @monaco-editor/react, @types/react-grid-layout

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
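Closes out the six Sprint 1 work items, covering security fixes, performance optimization, and robustness improvements:

1. Topology tooltip XSS fix: adds escapeHtml; all node/edge tooltips are escaped
2. Webhook IP allowlist: the alertmanager endpoint supports a CIDR/IP allowlist plus optional X-Forwarded-For
3. OpsWebSocket exponential backoff: 1s→30s with ±20% jitter, up to 10 attempts; pushes error once the limit is exceeded
4. offline_detector composite index: hosts(status, last_heartbeat) plus SQL cutoff pushdown
5. Global log redaction Filter: masks Bearer/api_key/secret to prevent leaks into ELK
6. Unified frontend API error handling: toast throttling for 401/429/403/5xx plus exponential-backoff retry for idempotent GETs

Acceptance:
- backend ruff (new files) ✅ smoke tests (redaction 4/4, IP allowlist parsing) ✅
- backend tests: 38 passed, 1 pre-existing baseline failure (confirmed via git stash)
- frontend tsc ✅ eslint (zero new errors)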

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>