feat(sprint1): security hardening + resilience polish #43
Open
LinChuang2008 wants to merge 9 commits into main from
Two bugs in the AI Ops Assistant (/ops) page made it unusable on a fresh
install:
1. POST /api/v1/ops/sessions was reusing the user's existing empty
draft via _cleanup_empty_sessions(), so clicking "New Session"
(新建会话) silently returned the same session id every time, and
body.title was discarded. Replace the reuse path with explicit
cleanup of stale empty drafts (no title + no messages), followed
by always creating a fresh session that respects body.title.
2. OpsInputBar wrapped the entire host-picker row in
`{hosts.length > 0 && ...}`. On a fresh install with no hosts
in the database the row vanished, but the send guard still
required selectedHostId, so users saw "Please select a target
host first" (请先选择目标主机) with no way to actually select one.
Always render the row, and show an empty-state hint linking to
/hosts when no hosts exist.
Add 4 regression tests covering:
- Two consecutive POSTs return different ids
- body.title is persisted
- Stale empty drafts are cleaned up on create
- Sessions with title or messages are preserved
Verified end-to-end against the running backend (3 POSTs returned
3 distinct ids with titles "first" / "second" / null), and confirmed
via git stash that the tests fail without the backend fix.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
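The create-session semantics above can be sketched in Python, the backend's language. This is a minimal in-memory stand-in under stated assumptions, not the actual POST /api/v1/ops/sessions handler. `OpsSession`, `SessionStore`, and the five-minute "recently active" window are hypothetical. A later commit in this PR says draft cleanup skips recently active sessions, but it does not state the threshold.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone
from itertools import count
from typing import Dict, List, Optional


@dataclass
class OpsSession:
    # Hypothetical stand-in for a row in the ops session table.
    id: int
    user_id: int
    title: Optional[str]
    updated_at: datetime
    messages: List[str] = field(default_factory=list)


class SessionStore:
    def __init__(self, recent_window: timedelta = timedelta(minutes=5)) -> None:
        self._ids = count(1)
        self.sessions: Dict[int, OpsSession] = {}
        # Assumption: empty drafts touched within this window are "recently active" and kept.
        self.recent_window = recent_window

    def _is_stale_empty_draft(self, s: OpsSession, now: datetime) -> bool:
        return not s.title and not s.messages and now - s.updated_at > self.recent_window

    def create(self, user_id: int, title: Optional[str] = None,
               now: Optional[datetime] = None) -> OpsSession:
        now = now or datetime.now(timezone.utc)
        # 1) Delete this user's stale empty drafts (no title and no messages).
        stale = [sid for sid, s in self.sessions.items()
                 if s.user_id == user_id and self._is_stale_empty_draft(s, now)]
        for sid in stale:
            del self.sessions[sid]
        # 2) Always create a fresh session that keeps the requested title;
        #    never hand back an existing draft.
        session = OpsSession(next(self._ids), user_id, title, now)
        self.sessions[session.id] = session
        return session
```

The key design point is the order of operations: cleanup happens first, then an unconditional insert. Two consecutive calls can therefore never return the same id.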
Backend:
- Add DEMO_MODE and DEMO_FAULT_DELAY_SECONDS config options
- Create demo_orchestrator.py: seed data, fault injection, auto AI diagnosis
- Create demo.py router: GET /api/v1/demo/status endpoint
- Add auto_approve flag to ToolContext for demo command execution
- Register demo flow as background task in main.py lifespan
Frontend:
- Add DEMO badge to OpsAssistant header when demo mode active
- Add global alert bar to AppLayout with auto-redirect to OpsAssistant
- Add getDemoStatus API to opsApi.ts
- Auto-select demo session and poll demo phase
Infrastructure:
- Create docker-compose.demo.yml with DEMO_MODE=true
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
… TLS, host override
- Restore with_for_update() on first-user admin check (race condition fix)
- Demo auto_approve now uses safe command whitelist instead of blanket approve
- Demo state synced to Redis for multi-worker consistency
- Host override only applies when LLM omits host_id (no silent overwrite)
- Draft session cleanup skips recently active sessions
- restore_yum function defined before usage in install script
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
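A command whitelist for demo auto-approval could look roughly like the hedged Python sketch below. The command set, the `systemctl` verb check, and the name `demo_auto_approve` are all assumptions, because the PR text does not show the actual list. Shell metacharacters are rejected up front because a check on the first word alone would approve something like `df -h; rm -rf /`.

```python
import shlex

# Hypothetical read-only allowlist; the PR's actual list is not shown.
SAFE_COMMANDS = frozenset({"uptime", "df", "free", "ps", "systemctl"})
# systemctl is only treated as safe with read-only verbs.
SAFE_SYSTEMCTL_VERBS = frozenset({"status", "is-active"})
# Any of these could chain, pipe, redirect, or substitute another command.
SHELL_METACHARS = frozenset(";&|`$<>\n")


def demo_auto_approve(command: str) -> bool:
    """Return True only if `command` may run in demo mode without human approval."""
    if any(ch in SHELL_METACHARS for ch in command):
        return False
    try:
        argv = shlex.split(command)
    except ValueError:  # e.g. unbalanced quotes
        return False
    if not argv or argv[0] not in SAFE_COMMANDS:
        return False
    if argv[0] == "systemctl":
        return len(argv) >= 2 and argv[1] in SAFE_SYSTEMCTL_VERBS
    return True
```

Anything not explicitly allowed falls through to `False`, so unknown commands still go through the normal approval path.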
… hardening
- notifications.py: add missing datetime/timezone import (NameError fix)
- settings.py: add missing HTTPException import (NameError fix)
- service_checker.py: remove verify=False on httpx client
- main.py: remove hardcoded production IP from CORS origins
- docker-compose.prod.yml: POSTGRES_PASSWORD now required (no default fallback)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
… cleanup
- ruff --fix: removed 181 unused imports across backend
- requirements.txt: bump cryptography>=46.0.6, asyncssh>=2.14.2, PyJWT>=2.12.0, python-multipart==0.0.22, fastmcp>=3.2.0 (5 CVE fixes)
- npm audit fix: resolved 11 vulnerabilities (including critical axios SSRF)
- Removed 5 unused npm deps: @dnd-kit/*, @monaco-editor/react, @types/react-grid-layout
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
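Close out the six Sprint 1 work items, covering security fixes, performance optimization, and robustness improvements:
1. Topology tooltip XSS fix: add escapeHtml; all node/edge tooltips are escaped
2. Webhook IP allowlist: the alertmanager endpoint supports a CIDR/IP allowlist + optional X-Forwarded-For
3. OpsWebSocket exponential backoff: 1s→30s ± 20% jitter × 10 attempts; pushes an error once exhausted
4. offline_detector composite index: hosts(status, last_heartbeat) + SQL cutoff pushdown
5. Global log redaction Filter: masks Bearer/api_key/secret to prevent leaks into ELK
6. Unified frontend API error handling: 401/429/403/5xx toast throttling + exponential retry for idempotent GETs
Acceptance:
- backend ruff (new files) ✅ smoke test (redaction 4/4, IP allowlist parsing) ✅
- backend tests: 38 passed, 1 pre-existing baseline failure (verified via git stash)
- frontend tsc ✅ eslint (zero new errors)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>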
Summary
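Closes out the six Sprint 1 work items: security fixes + a performance index + robustness polish. No new dependencies; changes are minimal and narrowly scoped.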
Changes (+338 / -24, 10 files)
🔒 Security (3 items)
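Topology tooltip XSS fix (`frontend/src/pages/Topology.tsx`): adds `escapeHtml()`. Node tooltips (name/host/status/type) and edge tooltips (source/target/description/labelKey) are all escaped, so a malicious node name or description can no longer inject `<script>`.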
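For illustration, here is a Python mirror of what an HTML-escaping helper like `escapeHtml()` typically does. The PR's version is TypeScript and is not shown here, and `escape_html` and `node_tooltip` are hypothetical names. The point it demonstrates is that every interpolated field is escaped before it is concatenated into tooltip HTML.

```python
# Hypothetical Python mirror; the real escapeHtml() lives in Topology.tsx (TypeScript).
_HTML_ESCAPES = {
    "&": "&amp;",   # escape & too, or text like "&lt;" would render as "<"
    "<": "&lt;",
    ">": "&gt;",
    '"': "&quot;",  # needed when values land inside attribute quotes
    "'": "&#39;",
}


def escape_html(value: object) -> str:
    """Escape the five HTML-significant characters; None becomes an empty string."""
    if value is None:
        return ""
    # Character-by-character mapping, so replacement order cannot double-escape.
    return "".join(_HTML_ESCAPES.get(ch, ch) for ch in str(value))


def node_tooltip(name: object, host: object, status: object, node_type: object) -> str:
    # Every user-controlled field passes through escape_html before entering the markup.
    return (
        f"<b>{escape_html(name)}</b><br/>"
        f"host: {escape_html(host)}<br/>"
        f"status: {escape_html(status)}<br/>"
        f"type: {escape_html(node_type)}"
    )
```

In real Python code, the standard library's `html.escape(s, quote=True)` covers the same five characters; it emits `&#x27;` for the single quote.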
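Webhook IP allowlist (`backend/app/{core/config.py, routers/webhooks.py}`): adds `ALERTMANAGER_WEBHOOK_ALLOWED_IPS` + `WEBHOOK_TRUST_FORWARDED`, supporting CIDR/IPv4/IPv6. The allowlist check runs after token validation, so unauthenticated callers cannot probe for IP misconfiguration.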
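The allowlist behavior can be modeled with Python's standard `ipaddress` module. This is a minimal sketch under assumptions: the function names are hypothetical, and taking the rightmost `X-Forwarded-For` entry assumes exactly one trusted reverse proxy. The PR does not say which entry it uses. The `authorize` helper shows the ordering described above, where the token is checked before the IP.

```python
import ipaddress
from typing import List, Optional, Union

Network = Union[ipaddress.IPv4Network, ipaddress.IPv6Network]


def parse_allowlist(raw: str) -> List[Network]:
    """Parse a comma-separated CIDR/IP list; an empty string disables the allowlist.

    Malformed entries raise ValueError so a bad config fails at startup.
    """
    nets: List[Network] = []
    for entry in (e.strip() for e in raw.split(",")):
        if entry:
            # strict=False tolerates host bits (10.0.0.1/8); a bare IP becomes /32 or /128.
            nets.append(ipaddress.ip_network(entry, strict=False))
    return nets


def client_ip(peer_ip: str, x_forwarded_for: Optional[str], trust_forwarded: bool) -> str:
    # Only honour X-Forwarded-For when explicitly enabled (behind a trusted proxy).
    # Assumption: with one trusted proxy, the rightmost entry is the address that proxy
    # saw; entries to its left are client-supplied and can be spoofed.
    if trust_forwarded and x_forwarded_for:
        return x_forwarded_for.split(",")[-1].strip()
    return peer_ip


def ip_allowed(ip: str, nets: List[Network]) -> bool:
    if not nets:
        return True  # empty allowlist = feature disabled
    try:
        addr = ipaddress.ip_address(ip)
    except ValueError:
        return False
    # Membership across IP versions (IPv4 address vs IPv6 network) is simply False.
    return any(addr in net for net in nets)


def authorize(token_ok: bool, ip: str, nets: List[Network]) -> int:
    # Token first: unauthenticated callers always get 401 and learn nothing about the allowlist.
    if not token_ok:
        return 401
    return 200 if ip_allowed(ip, nets) else 403
```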
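Global log redaction filter (`backend/app/core/log_redaction.py` + `main.py`): installs a `RedactionFilter` on the root logger that masks `Bearer` / `Authorization` / `api_key` / `token` / `secret` fields, preventing credentials from leaking into ELK or log files. Installation is idempotent.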
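A redaction filter of this kind can be built on the standard `logging.Filter` API. The sketch below rests on two assumptions: its regexes are illustrative stand-ins for the PR's unshown rules, and it attaches to the root logger's handlers rather than to the `Logger` object. The handler choice matters because, per the `logging` documentation, a filter attached to a logger is not consulted for records propagated up from descendant loggers.

```python
import logging
import re
from typing import Optional

# Hypothetical patterns; the PR's exact rules are not shown.
_PATTERNS = [
    # "Bearer <token>", including inside "Authorization: Bearer ..." headers.
    (re.compile(r"(?i)(bearer\s+)[A-Za-z0-9._~+/=-]+"), r"\1***"),
    # key=value / key: value, e.g. query strings and plain-text headers.
    (re.compile(r"(?i)(api_key|token|secret|password)(\s*[=:]\s*)[^\s&,]+"), r"\1\2***"),
    # JSON style: "api_key": "value"
    (re.compile(r'(?i)("(?:api_key|token|secret|password)"\s*:\s*")[^"]*(")'), r"\1***\2"),
]


def redact(text: str) -> str:
    for pattern, repl in _PATTERNS:
        text = pattern.sub(repl, text)
    return text


class RedactionFilter(logging.Filter):
    def filter(self, record: logging.LogRecord) -> bool:
        # Format once, scrub, and clear args so handlers don't re-apply %-formatting.
        record.msg = redact(record.getMessage())
        record.args = None
        return True  # scrub the record, never drop it


def install_redaction(logger: Optional[logging.Logger] = None) -> None:
    logger = logger or logging.getLogger()
    # Attach to handlers: a filter on the Logger object itself is not consulted
    # for records propagated up from child loggers.
    for handler in logger.handlers:
        if not any(isinstance(f, RedactionFilter) for f in handler.filters):  # idempotent
            handler.addFilter(RedactionFilter())
```

Handlers added to the root logger after `install_redaction` runs will not be covered. Calling it again is safe, because the idempotency check skips handlers that already have the filter.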
⚡ Performance (1 item)
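offline_detector composite index (`backend/alembic/versions/032_*`, `models/host.py`, `tasks/offline_detector.py`): adds the composite index `ix_hosts_status_last_heartbeat` (built with `CONCURRENTLY` on Postgres to avoid locking the table), and pushes the `last_heartbeat < cutoff` predicate down into SQL. This cuts the scan from O(online hosts) to O(online AND stale).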
🛡️ Robustness (2 items)
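OpsWebSocket exponential backoff (`frontend/src/hooks/useOpsWebSocket.ts`): replaces the fixed 3 s reconnect with exponential backoff (1 s → 30 s cap, ×2 per attempt, ±20% jitter), up to 10 attempts. Once the attempts are exhausted, an `error` event is pushed to notify the user. The counter resets on a successful `onopen`.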
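The reconnect schedule can be sketched in Python, although the hook itself is TypeScript. The constants come from the description above. The function and class names are hypothetical, and applying jitter after the cap is an assumption, because the PR does not say which comes first. The random source is injectable so that the schedule can be checked deterministically.

```python
import random
from typing import Callable, Optional

BASE_DELAY_S = 1.0
MAX_DELAY_S = 30.0
JITTER = 0.2        # ±20%
MAX_ATTEMPTS = 10


def reconnect_delay(attempt: int, rand: Callable[[], float] = random.random) -> Optional[float]:
    """Seconds to wait before reconnect `attempt` (0-based); None once attempts are exhausted."""
    if attempt >= MAX_ATTEMPTS:
        return None
    base = min(BASE_DELAY_S * 2 ** attempt, MAX_DELAY_S)  # 1, 2, 4, 8, 16, 30, 30, ...
    # Uniform jitter in [-20%, +20%) so clients don't reconnect in lockstep.
    # Applied after the cap, so late delays fall in 24-36 s; clamp again if 30 s
    # must be a hard ceiling.
    return base * (1 + JITTER * (2 * rand() - 1))


class Reconnector:
    def __init__(self) -> None:
        self.attempt = 0

    def on_open(self) -> None:
        self.attempt = 0  # a successful connection resets the backoff

    def on_close(self) -> Optional[float]:
        delay = reconnect_delay(self.attempt)
        if delay is None:
            return None  # caller surfaces an `error` event instead of reconnecting
        self.attempt += 1
        return delay
```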
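Unified frontend API error handling (`frontend/src/services/api.ts`): refactors the axios interceptor. A 401 clears the cache and redirects. 429/403/5xx toasts are throttled to one per 3 s. Idempotent GETs that hit a network error or a 502/503/504 are retried automatically with exponential backoff, at most 2 times. Components can opt out of the global toast with `__noToast`.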
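The interceptor's decision logic can be modeled in Python, even though the real code is a TypeScript axios interceptor. In this sketch, the action names, the 0.5 s base retry delay, throttling per status class, and leaving exhausted network errors silent are all hypothetical choices. Only the retry limits, the retryable statuses, and the 3 s window come from the description above. The clock is injectable so that throttling can be tested without sleeping.

```python
import time
from typing import Callable, Dict, Optional

RETRYABLE_STATUS = frozenset({502, 503, 504})
MAX_RETRIES = 2
BASE_RETRY_DELAY_S = 0.5   # hypothetical; the PR doesn't state the base delay


def should_retry(method: str, status: Optional[int], retries_done: int) -> bool:
    """`status` is None for a network error (no HTTP response at all)."""
    if method.upper() != "GET" or retries_done >= MAX_RETRIES:
        return False  # only idempotent GETs are retried, at most twice
    return status is None or status in RETRYABLE_STATUS


def retry_delay(retries_done: int) -> float:
    return BASE_RETRY_DELAY_S * 2 ** retries_done  # 0.5 s, then 1 s


class ToastThrottle:
    """Show at most one toast per key per window."""

    def __init__(self, window_s: float = 3.0, clock: Callable[[], float] = time.monotonic) -> None:
        self.window_s = window_s
        self.clock = clock
        self._last: Dict[str, float] = {}

    def should_show(self, key: str) -> bool:
        now = self.clock()
        last = self._last.get(key)
        if last is not None and now - last < self.window_s:
            return False
        self._last[key] = now
        return True


def handle_error(method: str, status: Optional[int], retries_done: int,
                 throttle: ToastThrottle, no_toast: bool = False) -> str:
    if status == 401:
        return "clear_cache_and_redirect"
    if should_retry(method, status, retries_done):
        return "retry"
    if no_toast or status is None:
        return "silent"  # opted out (cf. `__noToast`), or left to the caller
    if status in (403, 429) or status >= 500:
        key = "5xx" if status >= 500 else str(status)
        return "toast" if throttle.should_show(key) else "silent"
    return "silent"
```

Keying the throttle by status class means a burst of 502/503 errors produces a single toast, while an unrelated 429 in the same window still gets its own.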
Acceptance evidence
```
backend ruff (new files)       ✅ All checks passed
backend tests                  ✅ 38 passed (test_webhooks + test_hosts + test_alerts)
                               ⚠️ 1 baseline failure (confirmed via git stash as pre-existing debt, unrelated to this PR)
frontend tsc                   ✅ 0 errors
frontend eslint (new code)     ✅ 0 new errors (pre-existing `any` debt left untouched)
smoke: redaction filter        ✅ Bearer / header / query / JSON: 4/4 redacted
smoke: IP allowlist parsing    ✅ IPv4 / IPv6 / CIDR / bare IP matched; bad entries correctly rejected
smoke: Host.table              ✅ ix_hosts_status_last_heartbeat registered
```
Deployment notes
Two new optional environment variables; consider adding them to `.env.example`:
```bash
# AlertManager Bridge IP allowlist (leave empty to disable)
ALERTMANAGER_WEBHOOK_ALLOWED_IPS=10.0.0.0/8,192.168.1.100
# Enable when running behind a reverse proxy
WEBHOOK_TRUST_FORWARDED=false
```
Run `alembic upgrade head` to apply the new index.
Test plan
- `alembic upgrade head` successfully creates `ix_hosts_status_last_heartbeat`

🤖 Generated with Claude Code