
feat: Add RAID array controller disk storage monitoring support #74

Closed
firmament2008 wants to merge 3 commits into Mapleawaa:main from firmament2008:feature/raid-storage-support-v2

Conversation


@firmament2008 firmament2008 commented Mar 25, 2026

Key features:

  • Detect and display RAID controllers (e.g. LSI MR9260-8i, DELL PERC H310)
  • Display SMART information for physical disks behind the RAID card
  • Asynchronous SMART data collection to avoid blocking API requests
  • Dynamic JSON array rendering that supports any number of disks

Security fixes:

  • Added an HTML-escaping function to prevent XSS vulnerabilities
  • Fixed empty-object filtering
  • Preserve the original SMART signal instead of forcing a default value
  • Added missing fields to keep the frontend and backend consistent

Technical implementation:

  • Added the smart-cache-collect.py asynchronous collection script
  • Added a systemd timer service to collect SMART data automatically
  • Modified the Nodes.pm backend API to support RAID data queries
  • Modified pvemanagerlib.js to render the RAID disk list in the frontend
  • Added a device_type field to distinguish controllers from physical disks
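To make the collection flow above concrete, here is a minimal sketch of what such an asynchronous collector can look like. The `collect`/`write_cache` names and the injectable `run` hook are illustrative assumptions, not the PR's exact code; they only mirror the behaviors described in this PR (trimmed per-device JSON, `-n standby`, "only write SMART status when the field exists"):

```python
import json
import subprocess

CACHE_PATH = "/run/smart-cache.json"  # path written by the timer service

def collect(devices, run=None):
    """Collect trimmed smartctl JSON for each device into one dict.

    `run` is injectable for testing; by default it shells out to smartctl
    with `-n standby` so sleeping disks are not woken up.
    """
    if run is None:
        def run(dev):
            out = subprocess.run(
                ["smartctl", "-n", "standby", "-a", "-j", dev],
                capture_output=True, text=True, timeout=10,
            )
            return out.stdout
    result = {}
    for key, dev in devices.items():
        try:
            data = json.loads(run(dev) or "{}")
        except (json.JSONDecodeError, subprocess.TimeoutExpired):
            data = {}
        entry = {"model_name": data.get("model_name", "")}
        # Only propagate SMART status when the field actually exists,
        # so "unknown" is never rendered as "passed".
        smart = data.get("smart_status", {})
        if "passed" in smart:
            entry["smart_status"] = {"passed": smart["passed"]}
        result[key] = entry
    return result

def write_cache(result, path=CACHE_PATH):
    """Persist the aggregated result for the WebUI to read."""
    with open(path, "w") as fh:
        json.dump(result, fh)
```

A systemd timer then runs this periodically, and the API path only reads the cached file instead of invoking smartctl synchronously.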

Summary by CodeRabbit

  • New features

    • Adds a SMART cache service that periodically collects storage-device SMART information for the WebUI.
    • Adds cache-based RAID display to the node overview; NVMe/SATA/RAID device lists are rendered uniformly from an aggregated view.
    • Automatically deploys and removes the cache service during device-change flows to keep data in sync.
  • Performance

    • Merges and simplifies the device-info rendering path, reducing synchronous collection overhead during page generation.
    • Computes page height dynamically from the actual device count for more stable display.


coderabbitai bot commented Mar 25, 2026

📝 Walkthrough

A periodic SMART cache collection service is added, writing results to /run/smart-cache.json. The WebUI node overview now reads from this cache, aggregates NVMe/SATA/RAID data, and renders it uniformly. The service is installed and removed in the cpu_add()/cpu_del() flows.

Changes

Cohort / File(s) Summary
SMART cache service
PVE-Tools.sh
Adds exported functions deploy_smart_cache_service() and remove_smart_cache_service(); creates/removes smart-cache.service (oneshot) and smart-cache.timer (30 s interval), and invokes the Python collector to write /run/smart-cache.json.
NVMe aggregated rendering
PVE-Tools.sh
Stops emitting full smartctl -a -j output synchronously for each NVMe device; instead extracts trimmed fields from a single smartctl call, builds an entry per device, and merges them into an nvme_all array for unified frontend rendering.
SATA aggregated rendering
PVE-Tools.sh
Removes the per-sd* real-time smartctl/standby checks at UI render time; reads sd* entries from /run/smart-cache.json, encodes them, and merges them into a unified sata_all array.
RAID support (cache-driven)
PVE-Tools.sh
Adds discovery of non-/dev/sd*, non-/dev/nvme* devices based on smartctl --scan; reads the cache to build raid_all (with an optional controller row) and renders it uniformly.
UI controls and height calculation
PVE-Tools.sh
Removes duplicated per-disk UI blocks in favor of aggregated JSON arrays (nvme_all/sata_all/raid_all); page height is computed from the number of detected display rows (NVMe/SATA/RAID/controller/UPS).
Lifecycle integration
PVE-Tools.sh
cpu_add() calls deploy_smart_cache_service() after enabling kernel/modules/tools; cpu_del() calls remove_smart_cache_service() when restoring the WebUI.

Sequence Diagram(s)

sequenceDiagram
    participant UI as WebUI
    participant Script as PVE-Tools.sh
    participant Cache as /run/smart-cache.json
    participant Py as smart-cache-collect.py
    participant Systemd as systemd
    participant SmartCtl as smartctl

    Systemd->>Systemd: timer fires periodically
    Systemd->>Py: start oneshot service
    Py->>SmartCtl: scan and collect NVMe/SATA/RAID
    SmartCtl-->>Py: return raw SMART data
    Py->>Cache: write/update /run/smart-cache.json

    UI->>Script: request node overview data
    Script->>Cache: read cache
    Cache-->>Script: return nvme_all / sata_all / raid_all
    Script->>Script: aggregate/escape/format render blocks
    Script-->>UI: return formatted overview data
    UI->>UI: render aggregated device list
sequenceDiagram
    participant Admin as Administrator
    participant Script as PVE-Tools.sh
    participant Systemd as systemd
    participant Py as smart-cache-collect.py
    participant Cache as /run/smart-cache.json

    Admin->>Script: call cpu_add()
    Script->>Script: enable kernel/modules/tools
    Script->>Script: deploy_smart_cache_service()
    Script->>Systemd: install timer & oneshot units
    Systemd->>Py: run collection periodically
    Py->>Cache: initialize and update cache

    Admin->>Script: call cpu_del()
    Script->>Script: remove_smart_cache_service()
    Script->>Systemd: uninstall timer & oneshot units
    Systemd->>Systemd: stop periodic collection

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐰 Counting disks in the meadow, I tap out a caching script,
The timer hops on schedule, SMART data on a single page,
NVMe/SATA/RAID line up in ranks, cozy in a little file,
The frontend reads and beams, the panel clean and wide,
The rabbit lights a carrot: deployed and shipped — well done!

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The PR title is highly relevant to the main changes. "Add RAID array controller disk storage monitoring support" clearly and accurately reflects the core of this change: introducing RAID storage monitoring.
Docstring Coverage ✅ Passed Docstring coverage is 80.00% which is sufficient. The required threshold is 80.00%.



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 5

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@PVE-Tools.sh`:
- Around line 2847-2849: The code is defaulting missing smart_status.passed to
true which incorrectly marks unknown/unsupported SMART as healthy; instead only
set result[key]["smart_status"] when the smart_status field is present in the
parsed data. Update the SATA/NVMe handling (the block using
data.get("smart_status", {}).get("passed", True) and any analogous code around
smart_status handling, also the similar section referenced at lines 3233-3246)
to check for the existence of "smart_status" (or that "passed" key exists)
before assigning result[key]["smart_status"], matching the RAID branch semantics
of "only write when field exists."
- Around line 4026-4031: The display_rows calculation misses the RAID controller title row: when a RAID controller is present, the raid-all function unshifts a "Controller: ..." row, but only raidi is added to the height, leaving it one row short. Update the display_rows calculation (involving display_rows, raidi, and the raid-all logic) to add 1 when a RAID controller exists (i.e. when raidi>0 or controller info is detected, count the extra controller title row), while keeping the enable_ups branch unchanged.
- Around line 3746-3785: The SATA renderer crashes because escapeHtml is
undefined in its scope and the sleep flag is misspelled; update the SATA
renderer to either import/define the same escapeHtml helper used by the NVMe
renderer (so calls to escapeHtml(...) resolve) or move escapeHtml to a shared
scope accessible to both renderers, and correct the property check from v.standy
to v.standby so the standby branch executes; ensure you reference the SATA
renderer function, the escapeHtml helper, and the v.standby property when making
the changes.
- Around line 2808-2810: The current smartctl invocation uses "-a", which wakes disks that are in standby; add "-n standby" to avoid this: where cmd is built (variable name cmd), change the smartctl arguments from "smartctl {device} {args} -a -j" to "smartctl -n standby {device} {args} -a -j" (keeping the subprocess.run call and shell/timeout/capture_output/text unchanged), and make the same replacement at every other smartctl call site in the script (the SATA/NVMe detection blocks and the RAID detection block, identified by their smartctl calls) for consistent behavior.
- Around line 3316-3320: Extracting JSON fragments from $cache_data with a regex (matching "sd$sdi" or RAID entries into $sd_data etc.) mismatches when an object contains nested structures, causing $res->{sdX} / $res->{raidX} to fall back to '{}'. Replace the regex with real JSON parsing: when processing the cache, use JSON->new / JSON::XS decode_json to parse the whole $cache_data into a Perl data structure, access $decoded->{"sd$sdi"} or the corresponding RAID key directly, and assign it to $sd_data/$raid_data. Likewise replace the RAID extraction logic at 3396-3400 with reads from the parsed structure, adding the necessary existence and type checks as a safety net (still falling back to '{}' when a key is absent).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: a2ed859d-f7b0-419c-8065-78bd19757996

📥 Commits

Reviewing files that changed from the base of the PR and between 6d4ad15 and 87e6512.

⛔ Files ignored due to path filters (1)
  • images/add-raid-storage.png is excluded by !**/*.png
📒 Files selected for processing (1)
  • PVE-Tools.sh

Comment thread PVE-Tools.sh Outdated
Comment thread PVE-Tools.sh Outdated
Comment on lines +2847 to +2849
# smart_status
passed = data.get("smart_status", {}).get("passed", True)
result[key]["smart_status"] = {"passed": passed}


⚠️ Potential issue | 🟠 Major

Missing SMART results must not be backfilled as passing.

When smart_status.passed is absent, this writes true, so the frontend renders "unknown / read failure / unsupported" as a green "healthy". That contradicts this change's stated goal of preserving the original SMART signal; the RAID branch was already changed to "only write when the field exists", and the SATA/NVMe path here should keep the same semantics.

🛠️ Suggestion: only emit SMART status when the field exists
-    passed = data.get("smart_status", {}).get("passed", True)
-    result[key]["smart_status"] = {"passed": passed}
+    smart_status = data.get("smart_status", {})
+    if "passed" in smart_status:
+        result[key]["smart_status"] = {"passed": smart_status["passed"]}

Also applies to: 3233-3246

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@PVE-Tools.sh` around lines 2847 - 2849, The code is defaulting missing
smart_status.passed to true which incorrectly marks unknown/unsupported SMART as
healthy; instead only set result[key]["smart_status"] when the smart_status
field is present in the parsed data. Update the SATA/NVMe handling (the block
using data.get("smart_status", {}).get("passed", True) and any analogous code
around smart_status handling, also the similar section referenced at lines
3233-3246) to check for the existence of "smart_status" (or that "passed" key
exists) before assigning result[key]["smart_status"], matching the RAID branch
semantics of "only write when field exists."

Comment thread PVE-Tools.sh Outdated
Comment thread PVE-Tools.sh
Comment thread PVE-Tools.sh
@firmament2008 firmament2008 force-pushed the feature/raid-storage-support-v2 branch from 87e6512 to ddc9f37 on March 25, 2026 07:06

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 4

♻️ Duplicate comments (1)
PVE-Tools.sh (1)

3236-3249: ⚠️ Potential issue | 🟠 Major

Do not backfill missing NVMe SMART results as true

Once Line 3236 defaults to 'true', Line 3249 renders "unknown / read failure / unsupported" as a green "healthy". Keep this consistent with the SATA/RAID branches: only emit the field when smart_status.passed actually exists.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@PVE-Tools.sh` around lines 3236 - 3249, The code defaults $smart_passed to
'true' when no "smart_status.passed" is present in $nvme_full, causing
unknown/unsupported reads to appear as healthy; change this by not defaulting
$smart_passed and instead only append the "smart_status" field into $nvme_data
when the regex on $nvme_full actually captures a value (i.e., if $smart_passed
is defined/exists after matching). Update the extraction from $nvme_full so
$smart_passed is undefined on no match, and modify the $nvme_data construction
to conditionally add the "  \"smart_status\": {\"passed\": ...} " entry only
when that capture succeeded.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@PVE-Tools.sh`:
- Around line 3224-3261: Replace the inline smartctl collection (the block that
builds $nvme_full/$nvme_data and assigns $res->{nvme$nvi}) so it reads
precomputed NVMe JSON from the smart-cache-collect.py cache instead of running
smartctl per request: look up the cache file produced by smart-cache-collect.py
for the device, parse/validate its JSON and assign that to $res->{nvme$nvi};
only if the cache is missing or stale, fall back to a non-blocking quick read or
return an empty/placeholder value (do not run timeout-bound smartctl
synchronously). Update the code that references $nvme_full, $nvme_data and
$res->{nvme$nvi} accordingly to use the cached JSON format and preserve the same
keys (model_name, temperature.current, power_on_time.hours, power_cycle_count,
smart_status.passed, nvme_smart_health_information_log) so downstream consumers
remain compatible.
- Line 2926: The device-glob patterns (e.g., the loop using for dev in
sorted(glob.glob("/dev/sd[a-z]")) and the nvme0-9 handling) miss multi-letter
device names like sdaa or nvme10; replace these ad-hoc globs with a robust
whole-disk enumeration using lsblk or /sys/block: call lsblk -dn -o NAME (or
read /sys/block and select entries where /sys/block/<name>/partition does not
exist) then filter names by whole-disk regex (e.g., ^sd[a-z]+$ or
^nvme[0-9]+n[0-9]+$/or simply accept all names returned by lsblk), prepend /dev/
and iterate over that list; update all places referencing the old glob (the for
dev in sorted(glob.glob("/dev/sd[a-z]")) site and the nvme handling) to use the
new enumeration helper so multi-letter device names are collected.
- Around line 4025-4056: The UI/count logic must account for the always-present
"nvme-all", "sata-all", and "raid-all" blocks even when there are zero drives;
update nvme_ui_count, sata_ui_count, and raid_ui_count so they reflect the
presence of those blocks (i.e. count as 1 even when nvi/sdi/raidi == 0) and then
compute addRs using those UI counts and enable_ups as before; also change
display_rows to add one row per block when there are zero drives (use:
display_rows = 2 + (nvi>0 ? nvi : 1) + (sdi>0 ? sdi : 1) + (raidi>0 ? raidi : 1)
and still add the extra RAID controller row when raidi>0 and the UPS row when
enable_ups=true) so empty-state "未检测到 …" rows are included (update references to
nvme_ui_count, sata_ui_count, raid_ui_count, addRs, display_rows, nvi, sdi,
raidi, enable_ups, ups_info).
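The row accounting described in the last item can be written out as a small function. Variable names follow the review comment; the `has_controller` parameter and its default are illustrative assumptions:

```python
def display_rows(nvi, sdi, raidi, enable_ups=False, has_controller=None):
    """Panel height in rows: 2 header rows, one row per disk, one
    empty-state placeholder row per block with zero drives, plus the
    RAID controller title row and the optional UPS row."""
    if has_controller is None:
        has_controller = raidi > 0  # controller row accompanies RAID disks
    rows = 2
    rows += nvi if nvi > 0 else 1      # nvme-all block (or its empty-state row)
    rows += sdi if sdi > 0 else 1      # sata-all block
    rows += raidi if raidi > 0 else 1  # raid-all block
    if has_controller:
        rows += 1                      # "Controller: ..." title row
    if enable_ups:
        rows += 1                      # UPS status row
    return rows
```

With no drives at all this still reserves one row per block for the "not detected" message, matching the suggested formula `2 + (nvi>0 ? nvi : 1) + (sdi>0 ? sdi : 1) + (raidi>0 ? raidi : 1)` plus the conditional controller and UPS rows.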

---

Duplicate comments:
In `@PVE-Tools.sh`:
- Around line 3236-3249: The code defaults $smart_passed to 'true' when no
"smart_status.passed" is present in $nvme_full, causing unknown/unsupported
reads to appear as healthy; change this by not defaulting $smart_passed and
instead only append the "smart_status" field into $nvme_data when the regex on
$nvme_full actually captures a value (i.e., if $smart_passed is defined/exists
after matching). Update the extraction from $nvme_full so $smart_passed is
undefined on no match, and modify the $nvme_data construction to conditionally
add the "  \"smart_status\": {\"passed\": ...} " entry only when that capture
succeeded.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 3d8235c1-814c-4e38-87d7-d380cf32324a

📥 Commits

Reviewing files that changed from the base of the PR and between 87e6512 and ddc9f37.

⛔ Files ignored due to path filters (1)
  • images/add-raid-storage.png is excluded by !**/*.png
📒 Files selected for processing (1)
  • PVE-Tools.sh

Comment thread PVE-Tools.sh
# Collect SATA devices (including physical disks and RAID-controller virtual disks)
sdi = 0
raid_controller_key = None # holds RAID controller info
for dev in sorted(glob.glob("/dev/sd[a-z]")):

@coderabbitai coderabbitai bot Mar 25, 2026


⚠️ Potential issue | 🟠 Major

The device-matching patterns silently miss disks.

Line 2926 and Line 3287 only cover /dev/sd[a-z], so once a host has more than 26 disks, /dev/sdaa, /dev/sdab, and so on are never collected; Line 3221 likewise only covers nvme0 through nvme9. This feature specifically targets multi-disk/HBA/RAID hosts, so whole-disk enumeration based on lsblk or /sys/block would be more robust than single-character patterns.

Also applies to: 3221-3221, 3287-3287

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@PVE-Tools.sh` at line 2926, The device-glob patterns (e.g., the loop using
for dev in sorted(glob.glob("/dev/sd[a-z]")) and the nvme0-9 handling) miss
multi-letter device names like sdaa or nvme10; replace these ad-hoc globs with a
robust whole-disk enumeration using lsblk or /sys/block: call lsblk -dn -o NAME
(or read /sys/block and select entries where /sys/block/<name>/partition does
not exist) then filter names by whole-disk regex (e.g., ^sd[a-z]+$ or
^nvme[0-9]+n[0-9]+$/or simply accept all names returned by lsblk), prepend /dev/
and iterate over that list; update all places referencing the old glob (the for
dev in sorted(glob.glob("/dev/sd[a-z]")) site and the nvme handling) to use the
new enumeration helper so multi-letter device names are collected.
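The whole-disk filter the prompt proposes can be sanity-checked in isolation. The regexes follow the prompt's suggestion; the simulated `lsblk -dn -o NAME` output below is an assumption for illustration:

```python
import re

# Whole-disk names: any-length sdX letters, multi-digit NVMe namespaces.
WHOLE_DISK = re.compile(r"^(sd[a-z]+|nvme[0-9]+n[0-9]+)$")

def whole_disks(lsblk_names):
    """Filter `lsblk -dn -o NAME` output down to whole disks and prepend
    /dev/ — unlike glob('/dev/sd[a-z]'), this keeps sdaa and nvme10n1."""
    return ["/dev/" + n for n in lsblk_names if WHOLE_DISK.match(n)]
```

Partitions (sda1), device-mapper nodes (dm-0), and optical drives (sr0) fall out of the filter, while multi-letter and multi-digit names survive.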

Author


Compute nodes will not have large amounts of storage; this is recorded as a future TODO improvement.



@firmament2008 Understood — recorded as a pending improvement. If you'd like, I can create a GitHub Issue to track this optimization (switching device enumeration from glob("/dev/sd[a-z]") to full enumeration via lsblk or /sys/block, supporting more than 26 SATA disks and multi-digit NVMe names such as nvme10+). Just let me know.


✏️ Learnings added
Learnt from: firmament2008
Repo: Mapleawaa/PVE-Tools-9 PR: 74
File: PVE-Tools.sh:2926-2926
Timestamp: 2026-03-25T07:21:17.071Z
Learning: In PVE-Tools.sh (smart-cache-collect.py), the device glob patterns `/dev/sd[a-z]` (for SATA, line ~2926) and the NVMe glob (nvme0-9) are intentionally limited to single-character/single-digit device names. The author (firmament2008) has accepted this as a known limitation because the target use case (compute nodes) is not expected to have more than 26 SATA disks or more than 10 NVMe devices. This is tracked as a future TODO to improve using lsblk or /sys/block enumeration. Do not flag this as a blocking issue.

Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.


🧠 Learnings used
Learnt from: firmament2008
Repo: Mapleawaa/PVE-Tools-9 PR: 73
File: PVE-Tools.sh:2976-3028
Timestamp: 2026-03-24T04:34:32.616Z
Learning: In PVE-Tools.sh, the RAID card disk smartctl probing (around line 3018) intentionally uses `timeout 1` per device. The existing SATA (`smartctl $sd -a -j`) and NVMe (`smartctl $nvme -a -j`) sections have NO timeout at all, so the RAID `timeout 1` is actually more protective than the existing pattern — not a regression. Do not flag the per-disk `timeout 1` in the RAID section as a blocking concern relative to the other disk types.

Learnt from: firmament2008
Repo: Mapleawaa/PVE-Tools-9 PR: 73
File: PVE-Tools.sh:3471-3482
Timestamp: 2026-03-24T04:32:47.952Z
Learning: In PVE-Tools.sh (Bash/JavaScript), the RAID card disk SMART status rendering logic (in the JavaScript renderer injected for `raidX` panels) is intentionally designed with the following priority: (1) gate on `smart_support.available === false` to show "不支持"; (2) use `smart_status.passed` as the primary SMART pass/fail indicator; (3) fall back to `ata_smart_data.self_test.status.passed` only when `smart_status.passed` is undefined. The `ata_smart_data` fallback is intentional for RAID controllers where `smart_status` may be absent.

Author

@firmament2008 firmament2008 Mar 25, 2026


lsblk and /sys/block cannot enumerate all the devices behind a RAID1 array either. This will be developed once a suitable environment is available.



@firmament2008 Understood — lsblk / /sys/block likewise cannot fully enumerate all physical devices behind a RAID1 array, so this improvement does indeed need a real RAID environment to be designed and verified correctly. This additional constraint has been recorded; development is deferred until you have such an environment.


✏️ Learnings added
Learnt from: firmament2008
Repo: Mapleawaa/PVE-Tools-9 PR: 74
File: PVE-Tools.sh:2926-2926
Timestamp: 2026-03-25T07:43:03.128Z
Learning: In PVE-Tools.sh (smart-cache-collect.py), the proposed improvement of replacing `/dev/sd[a-z]` glob with `lsblk` or `/sys/block` enumeration is also insufficient for full RAID1 array device discovery — lsblk/sys/block cannot enumerate all physical devices behind a RAID1 array either. The author (firmament2008) has deferred this improvement until a proper RAID environment is available for development and testing. Do not suggest lsblk/sys/block as a drop-in replacement for the current glob pattern in the context of RAID device enumeration.

Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.


🧠 Learnings used
Learnt from: firmament2008
Repo: Mapleawaa/PVE-Tools-9 PR: 74
File: PVE-Tools.sh:3224-3261
Timestamp: 2026-03-25T07:26:27.733Z
Learning: In PVE-Tools.sh (cpu_add function), the NVMe smartctl collection (around line 3228) is intentionally run inline within the Nodes.pm API path rather than via the async smart-cache-collect.py cache. The author (firmament2008) has accepted this because NVMe devices are CPU-directly attached (PCIe direct), platform lane limits bound the count to a very small number (typically 1-4), and the cumulative per-device overhead is not significant enough to block the API. Do not flag inline NVMe smartctl in Nodes.pm as a blocking performance concern.

Learnt from: firmament2008
Repo: Mapleawaa/PVE-Tools-9 PR: 74
File: PVE-Tools.sh:2926-2926
Timestamp: 2026-03-25T07:21:17.071Z
Learning: In PVE-Tools.sh (smart-cache-collect.py), the device glob patterns `/dev/sd[a-z]` (for SATA, line ~2926) and the NVMe glob (nvme0-9) are intentionally limited to single-character/single-digit device names. The author (firmament2008) has accepted this as a known limitation because the target use case (compute nodes) is not expected to have more than 26 SATA disks or more than 10 NVMe devices. This is tracked as a future TODO to improve using lsblk or /sys/block enumeration. Do not flag this as a blocking issue.

Learnt from: firmament2008
Repo: Mapleawaa/PVE-Tools-9 PR: 73
File: PVE-Tools.sh:3471-3482
Timestamp: 2026-03-24T04:32:47.952Z
Learning: In PVE-Tools.sh (Bash/JavaScript), the RAID card disk SMART status rendering logic (in the JavaScript renderer injected for `raidX` panels) is intentionally designed with the following priority: (1) gate on `smart_support.available === false` to show "不支持"; (2) use `smart_status.passed` as the primary SMART pass/fail indicator; (3) fall back to `ata_smart_data.self_test.status.passed` only when `smart_status.passed` is undefined. The `ata_smart_data` fallback is intentional for RAID controllers where `smart_status` may be absent.

Learnt from: firmament2008
Repo: Mapleawaa/PVE-Tools-9 PR: 73
File: PVE-Tools.sh:2976-3028
Timestamp: 2026-03-24T04:34:32.616Z
Learning: In PVE-Tools.sh, the RAID card disk smartctl probing (around line 3018) intentionally uses `timeout 1` per device. The existing SATA (`smartctl $sd -a -j`) and NVMe (`smartctl $nvme -a -j`) sections have NO timeout at all, so the RAID `timeout 1` is actually more protective than the existing pattern — not a regression. Do not flag the per-disk `timeout 1` in the RAID section as a blocking concern relative to the other disk types.

Learnt from: firmament2008
Repo: Mapleawaa/PVE-Tools-9 PR: 74
File: PVE-Tools.sh:3221-3265
Timestamp: 2026-03-25T07:25:27.596Z
Learning: In PVE-Tools.sh (cpu_add function), the NVMe/SATA/RAID device topology (sdN, nvmeN, raidN key names) is intentionally captured at cpu_add() install time rather than dynamically at API runtime. The Perl code injected into Nodes.pm reads /run/smart-cache.json at each API call, but the set of sdN/nvmeN/raidN keys to look up is fixed when cpu_add() runs. This is an accepted design trade-off by the author (firmament2008) for efficiency — the async smart-cache-collect.py handles collection, and the static key set avoids runtime device scanning overhead. Do not flag this as a blocking issue.

Learnt from: firmament2008
Repo: Mapleawaa/PVE-Tools-9 PR: 74
File: PVE-Tools.sh:3221-3265
Timestamp: 2026-03-25T07:25:27.596Z
Learning: In PVE-Tools.sh (cpu_add), each sdN and raidN Perl block independently opens and JSON::PP::decode_json's the entire /run/smart-cache.json file. This repeated open+decode pattern per disk is intentional and accepted by the author (firmament2008) — the async collection design is the intended efficiency mechanism, not per-call optimization. Do not flag repeated JSON decodes as a blocking concern.

Comment thread PVE-Tools.sh
Comment on lines 3221 to 3265
for nvme in $(ls /dev/nvme[0-9] 2> /dev/null); do
chmod +s /usr/sbin/smartctl 2>/dev/null

# 使用 -a 获取完整信息(-i 可能无法获取 nvme_smart_health_information_log)
cat >> $tmpf << EOF

\$res->{nvme$nvi} = \`smartctl $nvme -a -j\`;
{
my \$nvme_full = \`timeout 2 smartctl $nvme -a -j 2>/dev/null\`;
my \$nvme_data = '{}';
if (\$nvme_full =~ /^\s*\{/) {
# 提取必要字段,精简 JSON 数据
my \$model = (\$nvme_full =~ /"model_name"\s*:\s*"([^"]+)"/) ? \$1 : '';
my \$temp = (\$nvme_full =~ /"temperature"\s*:\s*\{[^}]*"current"\s*:\s*(\d+)/) ? \$1 : 0;
my \$hours = (\$nvme_full =~ /"power_on_time"\s*:\s*\{[^}]*"hours"\s*:\s*(\d+)/) ? \$1 : 0;
my \$cycles = (\$nvme_full =~ /"power_cycle_count"\s*:\s*(\d+)/) ? \$1 : 0;
my \$smart_passed = (\$nvme_full =~ /"smart_status"\s*:\s*\{[^}]*"passed"\s*:\s*(true|false)/) ? \$1 : 'true';
# NVMe 健康信息日志
my \$pct_used = (\$nvme_full =~ /"percentage_used"\s*:\s*(\d+)/) ? \$1 : '';
my \$media_err = (\$nvme_full =~ /"media_errors"\s*:\s*(\d+)/) ? \$1 : 0;
my \$unsafe_shutdown = (\$nvme_full =~ /"unsafe_shutdowns"\s*:\s*(\d+)/) ? \$1 : 0;
my \$data_read = (\$nvme_full =~ /"data_units_read"\s*:\s*(\d+)/) ? \$1 : 0;
my \$data_written = (\$nvme_full =~ /"data_units_written"\s*:\s*(\d+)/) ? \$1 : 0;
# 构建精简 JSON
\$nvme_data = "{\n";
\$nvme_data .= " \\"model_name\\": \\"\$model\\",\n";
\$nvme_data .= " \\"temperature\\": {\\"current\\": \$temp},\n";
\$nvme_data .= " \\"power_on_time\\": {\\"hours\\": \$hours},\n";
\$nvme_data .= " \\"power_cycle_count\\": \$cycles,\n";
\$nvme_data .= " \\"smart_status\\": {\\"passed\\": \$smart_passed}";
if (\$pct_used ne '') {
\$nvme_data .= ",\n \\"nvme_smart_health_information_log\\": {\n";
\$nvme_data .= " \\"percentage_used\\": \$pct_used,\n";
\$nvme_data .= " \\"media_errors\\": \$media_err,\n";
\$nvme_data .= " \\"unsafe_shutdowns\\": \$unsafe_shutdown,\n";
\$nvme_data .= " \\"data_units_read\\": \$data_read,\n";
\$nvme_data .= " \\"data_units_written\\": \$data_written\n";
\$nvme_data .= " }";
}
\$nvme_data .= "\n}";
}
\$res->{nvme$nvi} = \$nvme_data;
}
EOF
echo "检测到 NVME 硬盘: $nvme (nvme$nvi)"
let nvi++

@coderabbitai coderabbitai bot Mar 25, 2026


⚠️ Potential issue | 🟠 Major

Device topology is frozen at install time.

These for loops expand the current disk set into Nodes.pm when cpu_add() runs; they do not discover devices dynamically at API runtime. NVMe/SATA/RAID disks added or removed later never enter the *_all arrays, and every sdN/raidN block re-opens and decode_json's the entire cache file. A more robust implementation would read /run/smart-cache.json once and build nvme_all / sata_all / raid_all directly from key prefixes.

Also applies to: 3287-3335, 3357-3419
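The single-read, prefix-grouped alternative suggested in this comment could be sketched as follows. The helper name and the skip-empty-placeholder rule are assumptions for illustration; the key naming (sdN/nvmeN/raidN) follows the cache format described in this PR:

```python
import json

def build_all_lists(cache_text):
    """Decode the cache once and bucket entries by key prefix, so newly
    appearing devices show up without re-running cpu_add()."""
    cache = json.loads(cache_text)
    buckets = {"nvme_all": [], "sata_all": [], "raid_all": []}
    for key in sorted(cache):
        entry = cache[key]
        if not entry:  # skip empty {} placeholders
            continue
        if key.startswith("nvme"):
            buckets["nvme_all"].append(entry)
        elif key.startswith("sd"):
            buckets["sata_all"].append(entry)
        elif key.startswith("raid"):
            buckets["raid_all"].append(entry)
    return buckets
```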

Author


Device information is fetched asynchronously for efficiency. The logic is fine.



@firmament2008 Got it — the asynchronous cache-collection design is an efficiency trade-off: device topology is fixed at cpu_add() install time and data is read from /run/smart-cache.json at runtime. This is intentional, so it will not be pursued further.


✏️ Learnings added
Learnt from: firmament2008
Repo: Mapleawaa/PVE-Tools-9 PR: 74
File: PVE-Tools.sh:3221-3265
Timestamp: 2026-03-25T07:25:27.596Z
Learning: In PVE-Tools.sh (cpu_add function), the NVMe/SATA/RAID device topology (sdN, nvmeN, raidN key names) is intentionally captured at cpu_add() install time rather than dynamically at API runtime. The Perl code injected into Nodes.pm reads /run/smart-cache.json at each API call, but the set of sdN/nvmeN/raidN keys to look up is fixed when cpu_add() runs. This is an accepted design trade-off by the author (firmament2008) for efficiency — the async smart-cache-collect.py handles collection, and the static key set avoids runtime device scanning overhead. Do not flag this as a blocking issue.

Learnt from: firmament2008
Repo: Mapleawaa/PVE-Tools-9 PR: 74
File: PVE-Tools.sh:3221-3265
Timestamp: 2026-03-25T07:25:27.596Z
Learning: In PVE-Tools.sh (cpu_add), each sdN and raidN Perl block independently opens and JSON::PP::decode_json's the entire /run/smart-cache.json file. This repeated open+decode pattern per disk is intentional and accepted by the author (firmament2008) — the async collection design is the intended efficiency mechanism, not per-call optimization. Do not flag repeated JSON decodes as a blocking concern.

Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.


🧠 Learnings used
Learnt from: firmament2008
Repo: Mapleawaa/PVE-Tools-9 PR: 73
File: PVE-Tools.sh:2976-3028
Timestamp: 2026-03-24T04:34:32.616Z
Learning: In PVE-Tools.sh, the RAID card disk smartctl probing (around line 3018) intentionally uses `timeout 1` per device. The existing SATA (`smartctl $sd -a -j`) and NVMe (`smartctl $nvme -a -j`) sections have NO timeout at all, so the RAID `timeout 1` is actually more protective than the existing pattern — not a regression. Do not flag the per-disk `timeout 1` in the RAID section as a blocking concern relative to the other disk types.

Learnt from: firmament2008
Repo: Mapleawaa/PVE-Tools-9 PR: 73
File: PVE-Tools.sh:3471-3482
Timestamp: 2026-03-24T04:32:47.952Z
Learning: In PVE-Tools.sh (Bash/JavaScript), the RAID card disk SMART status rendering logic (in the JavaScript renderer injected for `raidX` panels) is intentionally designed with the following priority: (1) gate on `smart_support.available === false` to show "不支持"; (2) use `smart_status.passed` as the primary SMART pass/fail indicator; (3) fall back to `ata_smart_data.self_test.status.passed` only when `smart_status.passed` is undefined. The `ata_smart_data` fallback is intentional for RAID controllers where `smart_status` may be absent.

Learnt from: firmament2008
Repo: Mapleawaa/PVE-Tools-9 PR: 74
File: PVE-Tools.sh:2926-2926
Timestamp: 2026-03-25T07:21:17.071Z
Learning: In PVE-Tools.sh (smart-cache-collect.py), the device glob patterns `/dev/sd[a-z]` (for SATA, line ~2926) and the NVMe glob (nvme0-9) are intentionally limited to single-character/single-digit device names. The author (firmament2008) has accepted this as a known limitation because the target use case (compute nodes) is not expected to have more than 26 SATA disks or more than 10 NVMe devices. This is tracked as a future TODO to improve using lsblk or /sys/block enumeration. Do not flag this as a blocking issue.

Comment thread PVE-Tools.sh
Comment on lines +3224 to +3261
# Use -a for full info (-i may fail to return nvme_smart_health_information_log)
cat >> $tmpf << EOF

\$res->{nvme$nvi} = \`smartctl $nvme -a -j\`;
{
    my \$nvme_full = \`timeout 2 smartctl $nvme -a -j 2>/dev/null\`;
    my \$nvme_data = '{}';
    if (\$nvme_full =~ /^\s*\{/) {
        # Extract only the required fields into a slimmed-down JSON payload
        my \$model = (\$nvme_full =~ /"model_name"\s*:\s*"([^"]+)"/) ? \$1 : '';
        my \$temp = (\$nvme_full =~ /"temperature"\s*:\s*\{[^}]*"current"\s*:\s*(\d+)/) ? \$1 : 0;
        my \$hours = (\$nvme_full =~ /"power_on_time"\s*:\s*\{[^}]*"hours"\s*:\s*(\d+)/) ? \$1 : 0;
        my \$cycles = (\$nvme_full =~ /"power_cycle_count"\s*:\s*(\d+)/) ? \$1 : 0;
        my \$smart_passed = (\$nvme_full =~ /"smart_status"\s*:\s*\{[^}]*"passed"\s*:\s*(true|false)/) ? \$1 : 'true';
        # NVMe health information log
        my \$pct_used = (\$nvme_full =~ /"percentage_used"\s*:\s*(\d+)/) ? \$1 : '';
        my \$media_err = (\$nvme_full =~ /"media_errors"\s*:\s*(\d+)/) ? \$1 : 0;
        my \$unsafe_shutdown = (\$nvme_full =~ /"unsafe_shutdowns"\s*:\s*(\d+)/) ? \$1 : 0;
        my \$data_read = (\$nvme_full =~ /"data_units_read"\s*:\s*(\d+)/) ? \$1 : 0;
        my \$data_written = (\$nvme_full =~ /"data_units_written"\s*:\s*(\d+)/) ? \$1 : 0;
        # Build the slimmed JSON
        \$nvme_data = "{\n";
        \$nvme_data .= " \\"model_name\\": \\"\$model\\",\n";
        \$nvme_data .= " \\"temperature\\": {\\"current\\": \$temp},\n";
        \$nvme_data .= " \\"power_on_time\\": {\\"hours\\": \$hours},\n";
        \$nvme_data .= " \\"power_cycle_count\\": \$cycles,\n";
        \$nvme_data .= " \\"smart_status\\": {\\"passed\\": \$smart_passed}";
        if (\$pct_used ne '') {
            \$nvme_data .= ",\n \\"nvme_smart_health_information_log\\": {\n";
            \$nvme_data .= " \\"percentage_used\\": \$pct_used,\n";
            \$nvme_data .= " \\"media_errors\\": \$media_err,\n";
            \$nvme_data .= " \\"unsafe_shutdowns\\": \$unsafe_shutdown,\n";
            \$nvme_data .= " \\"data_units_read\\": \$data_read,\n";
            \$nvme_data .= " \\"data_units_written\\": \$data_written\n";
            \$nvme_data .= " }";
        }
        \$nvme_data .= "\n}";
    }
    \$res->{nvme$nvi} = \$nvme_data;

@coderabbitai coderabbitai bot Mar 25, 2026


⚠️ Potential issue | 🟠 Major

NVMe collection still blocks the Nodes API.

Line 3228 still runs smartctl per device inside the Nodes.pm request path. Once a host has several NVMe drives, every node-summary refresh keeps consuming this IO/timeout budget; the async cache service does not cover this path. NVMe should also go through the cache generated by smart-cache-collect.py.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@PVE-Tools.sh` around lines 3224 - 3261, Replace the inline smartctl
collection (the block that builds $nvme_full/$nvme_data and assigns
$res->{nvme$nvi}) so it reads precomputed NVMe JSON from the
smart-cache-collect.py cache instead of running smartctl per request: look up
the cache file produced by smart-cache-collect.py for the device, parse/validate
its JSON and assign that to $res->{nvme$nvi}; only if the cache is missing or
stale, fall back to a non-blocking quick read or return an empty/placeholder
value (do not run timeout-bound smartctl synchronously). Update the code that
references $nvme_full, $nvme_data and $res->{nvme$nvi} accordingly to use the
cached JSON format and preserve the same keys (model_name, temperature.current,
power_on_time.hours, power_cycle_count, smart_status.passed,
nvme_smart_health_information_log) so downstream consumers remain compatible.
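As a rough illustration of the cache-first lookup this prompt describes, here is a Python sketch. The cache path matches the PR's /run/smart-cache.json; the staleness bound and the helper name `cached_nvme_entry` are assumptions for illustration, not code from the PR.

```python
import json
import time
from pathlib import Path

CACHE_PATH = Path("/run/smart-cache.json")  # written by smart-cache-collect.py
MAX_AGE_S = 90                              # assumed: three 30 s timer intervals

def cached_nvme_entry(key: str, cache_path: Path = CACHE_PATH) -> dict:
    """Return the cached SMART entry for a key like 'nvme0', or {} when the
    cache is missing, stale, or unparseable -- never run smartctl inline."""
    try:
        if time.time() - cache_path.stat().st_mtime > MAX_AGE_S:
            return {}  # stale cache: treat as absent rather than blocking
        data = json.loads(cache_path.read_text())
    except (OSError, ValueError):
        return {}
    if not isinstance(data, dict):
        return {}
    entry = data.get(key, {})
    return entry if isinstance(entry, dict) else {}
```

Returning `{}` on every failure mode keeps the renderer's behavior uniform: the frontend already skips fields that are absent.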

Author


NVMe devices are CPU-attached, so the supported count is limited and overall performance is not affected.


@firmament2008 Understood, thanks for the explanation. NVMe devices attach directly over CPU PCIe lanes, so platform lane limits keep real deployments very small (typically 1-4 drives); the cumulative smartctl cost per API refresh is manageable and will not cause significant blocking. This design decision has been recorded, and future reviews will no longer flag it as blocking.


✏️ Learnings added
Learnt from: firmament2008
Repo: Mapleawaa/PVE-Tools-9 PR: 74
File: PVE-Tools.sh:3224-3261
Timestamp: 2026-03-25T07:26:27.733Z
Learning: In PVE-Tools.sh (cpu_add function), the NVMe smartctl collection (around line 3228) is intentionally run inline within the Nodes.pm API path rather than via the async smart-cache-collect.py cache. The author (firmament2008) has accepted this because NVMe devices are CPU-directly attached (PCIe direct), platform lane limits bound the count to a very small number (typically 1-4), and the cumulative per-device overhead is not significant enough to block the API. Do not flag inline NVMe smartctl in Nodes.pm as a blocking performance concern.

Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.


🧠 Learnings used
Learnt from: firmament2008
Repo: Mapleawaa/PVE-Tools-9 PR: 74
File: PVE-Tools.sh:3221-3265
Timestamp: 2026-03-25T07:25:27.596Z
Learning: In PVE-Tools.sh (cpu_add), each sdN and raidN Perl block independently opens and JSON::PP::decode_json's the entire /run/smart-cache.json file. This repeated open+decode pattern per disk is intentional and accepted by the author (firmament2008) — the async collection design is the intended efficiency mechanism, not per-call optimization. Do not flag repeated JSON decodes as a blocking concern.

Learnt from: firmament2008
Repo: Mapleawaa/PVE-Tools-9 PR: 73
File: PVE-Tools.sh:3471-3482
Timestamp: 2026-03-24T04:32:47.952Z
Learning: In PVE-Tools.sh (Bash/JavaScript), the RAID card disk SMART status rendering logic (in the JavaScript renderer injected for `raidX` panels) is intentionally designed with the following priority: (1) gate on `smart_support.available === false` to show "不支持"; (2) use `smart_status.passed` as the primary SMART pass/fail indicator; (3) fall back to `ata_smart_data.self_test.status.passed` only when `smart_status.passed` is undefined. The `ata_smart_data` fallback is intentional for RAID controllers where `smart_status` may be absent.

Learnt from: firmament2008
Repo: Mapleawaa/PVE-Tools-9 PR: 73
File: PVE-Tools.sh:2976-3028
Timestamp: 2026-03-24T04:34:32.616Z
Learning: In PVE-Tools.sh, the RAID card disk smartctl probing (around line 3018) intentionally uses `timeout 1` per device. The existing SATA (`smartctl $sd -a -j`) and NVMe (`smartctl $nvme -a -j`) sections have NO timeout at all, so the RAID `timeout 1` is actually more protective than the existing pattern — not a regression. Do not flag the per-disk `timeout 1` in the RAID section as a blocking concern relative to the other disk types.

Learnt from: firmament2008
Repo: Mapleawaa/PVE-Tools-9 PR: 74
File: PVE-Tools.sh:3221-3265
Timestamp: 2026-03-25T07:25:27.596Z
Learning: In PVE-Tools.sh (cpu_add function), the NVMe/SATA/RAID device topology (sdN, nvmeN, raidN key names) is intentionally captured at cpu_add() install time rather than dynamically at API runtime. The Perl code injected into Nodes.pm reads /run/smart-cache.json at each API call, but the set of sdN/nvmeN/raidN keys to look up is fixed when cpu_add() runs. This is an accepted design trade-off by the author (firmament2008) for efficiency — the async smart-cache-collect.py handles collection, and the static key set avoids runtime device scanning overhead. Do not flag this as a blocking issue.

Learnt from: firmament2008
Repo: Mapleawaa/PVE-Tools-9 PR: 74
File: PVE-Tools.sh:2926-2926
Timestamp: 2026-03-25T07:21:17.071Z
Learning: In PVE-Tools.sh (smart-cache-collect.py), the device glob patterns `/dev/sd[a-z]` (for SATA, line ~2926) and the NVMe glob (nvme0-9) are intentionally limited to single-character/single-digit device names. The author (firmament2008) has accepted this as a known limitation because the target use case (compute nodes) is not expected to have more than 26 SATA disks or more than 10 NVMe devices. This is tracked as a future TODO to improve using lsblk or /sys/block enumeration. Do not flag this as a blocking issue.

Comment thread PVE-Tools.sh
Author

@firmament2008 firmament2008 left a comment


Reviewed; no obvious problems. The nvme logic is unverified; sata/raid have been verified successfully. The console displays correctly.

Problem: the RAID detection logic used 'or', so any device with smart_support.available=false was treated as RAID.
But ordinary disks behind an HBA in passthrough (IT mode) may also have SMART unavailable.

Fix: switch to 'and', so both conditions must hold:
1. SMART is unsupported
2. A RAID-related marker is present (vendor/product contains a RAID keyword)

This correctly distinguishes:
- RAID virtual disks: MR/PERC/HPSA etc. + SMART unavailable
- HBA passthrough disks: SMART may be unavailable, but no RAID marker
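A minimal Python sketch of this AND predicate follows. The keyword tuple and the exact field names probed are illustrative assumptions, not the script's actual values.

```python
RAID_KEYWORDS = ("MR", "PERC", "HPSA", "MEGARAID")  # illustrative markers only

def is_raid_virtual_disk(info: dict) -> bool:
    """AND of both conditions: SMART must be unavailable *and* a RAID marker
    must appear in vendor/product, so IT-mode HBA passthrough disks whose
    SMART happens to be unavailable are not misclassified as RAID."""
    smart_unavailable = info.get("smart_support", {}).get("available") is False
    ident = " ".join(
        str(info.get(k, "")) for k in ("vendor", "product", "model_name")
    ).upper()
    has_raid_marker = any(kw in ident for kw in RAID_KEYWORDS)
    return smart_unavailable and has_raid_marker
```

With 'or' semantics the second case below would be misclassified as RAID; with 'and' it correctly falls through as a passthrough disk.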

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
PVE-Tools.sh (1)

3289-3336: ⚠️ Potential issue | 🟠 Major

The sdN numbering is already out of step with the collector's rules.

The collection script skips /dev/sdX devices identified as RAID virtual disks, while this loop increments sdi for every /dev/sdX it sees. Once a host has both RAID virtual disks and real SATA disks, the sdN keys written by smart-cache-collect.py and the sdN keys read here diverge: later real SATA disks read the wrong cache entries, or disappear from sata_all entirely. At minimum, this loop must reuse the collector's RAID-virtual-disk filter so that sdi only increments for SATA devices that actually enter the cache.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@PVE-Tools.sh` around lines 3289 - 3336, The loop increments sdi for every
/dev/sdX even when the device is a RAID virtual disk, causing sdN mismatches
with the collector; update the loop so you first apply the exact same
RAID-virtual-device filter used in smart-cache-collect.py (reuse its
predicate/logic) to decide whether a /dev/sdX is a real SATA device, and only
then proceed to append the Perl JSON block and increment sdi. Concretely,
perform the RAID-check before using sd, sdsn, sdcr and before writing the block
that references "sd$sdi" so sdi only increments for devices that passed the
collector's filter. Ensure the same identifying symbols (sd, sdsn, sdcr, sdi and
the cache key "sd$sdi") are used so keys remain aligned with
smart-cache-collect.py.
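The alignment requirement can be shown with a small sketch (a hypothetical helper for illustration; the real fix lives in the injected Perl): the index must only advance for devices that the collector's own filter admits.

```python
def assign_sd_keys(devices, is_raid_virtual):
    """Map real SATA devices to sd0, sd1, ... using the *same* filter the
    collector applies, so UI keys line up with /run/smart-cache.json keys."""
    keys = {}
    idx = 0
    for dev, info in devices:
        if is_raid_virtual(info):
            continue  # the collector skipped this device, so skip it here too
        keys[dev] = f"sd{idx}"
        idx += 1
    return keys
```

Sharing one predicate between collector and reader is what prevents the off-by-one drift the comment describes.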
♻️ Duplicate comments (2)
PVE-Tools.sh (2)

3229-3263: ⚠️ Potential issue | 🟠 Major

Use JSON::PP to assemble the NVMe entry instead of hand-concatenating JSON.

This block extracts fields from smartctl output with regexes and then hand-builds the JSON string. If model_name ever contains a quote or backslash, nvme_all becomes invalid JSON and the frontend's JSON.parse fails for the whole payload; and when smart_status is absent, it is still backfilled to true, displaying "unknown" as "healthy". Decoding with decode_json, building a Perl hash keyed on field presence, and re-encoding with encode_json would be far more robust.
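The suggested fix is Perl (JSON::PP); the same principle in Python, for illustration only: let a serializer handle escaping, and emit a field only when it exists. The field names follow the smartctl JSON quoted above; the helper itself is hypothetical.

```python
import json

def build_nvme_entry(src: dict) -> str:
    """Serialize with json.dumps instead of string concatenation: quotes or
    backslashes in model_name get escaped, and a missing smart_status stays
    missing instead of being backfilled to passed=true."""
    out = {}
    if "model_name" in src:
        out["model_name"] = src["model_name"]
    if "current" in src.get("temperature", {}):
        out["temperature"] = {"current": src["temperature"]["current"]}
    passed = src.get("smart_status", {}).get("passed")
    if passed is not None:
        out["smart_status"] = {"passed": passed}
    return json.dumps(out)
```

Any model name round-trips cleanly, and an absent smart_status stays absent so the UI can render "unknown".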

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@PVE-Tools.sh` around lines 3229 - 3263, Replace the hand-rolled JSON assembly
around $nvme_full/$nvme_data with JSON::PP-based parsing and encoding: use
decode_json on $nvme_full to get a Perl hashref, then build a result hash only
adding keys when the source fields exist (e.g. model_name,
temperature->{current}, power_on_time->{hours}, power_cycle_count,
smart_status->{passed} — do not default missing smart_status to true), include
nvme_smart_health_information_log only when percentage_used exists, and finally
set $res->{nvme$nvi} = encode_json($result_hashref); this avoids
injection/escaping bugs from manual string concatenation and preserves correct
"unknown" state handling.

4029-4058: ⚠️ Potential issue | 🟡 Minor

The auto-height calculation still does not account for the empty-state blocks.

The nvme-all, sata-all, and raid-all blocks are inserted unconditionally, but height is only counted for a block when nvi/sdi/raidi > 0. Common setups (SATA-only, NVMe-only, or no RAID) therefore undercount by 1-2 rows, and with the default auto-calculated height the left card can still be clipped.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@PVE-Tools.sh` around lines 4029 - 4058, The auto-height calculation ignores
the always-present empty blocks (nvme-all, sata-all, raid-all), so adjust the
counters to account for those UI blocks even when nvi/sdi/raidi are zero: set
nvme_ui_count, sata_ui_count, and raid_ui_count to 1 when their block is always
inserted (i.e., make their initial assignment unconditional or check the
block-presence flag), then use these ui_count values when computing addRs and
display_rows (replace usages of nvi/sdi/raidi in the display_rows base count
with nvme_ui_count/sata_ui_count/raid_ui_count so empty blocks contribute 1
row), and keep the extra RAID controller row logic (display_rows += 1 when raidi
> 0) and UPS adjustments as-is.
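The counting rule in this prompt reduces to something like the sketch below. The one-row-per-empty-block weight and the extra-row constants are assumptions; the real code derives pixel heights from these counts.

```python
def display_rows(nvi: int, sdi: int, raidi: int, has_ups: bool = False) -> int:
    """Count at least one row per *-all block, since the blocks are inserted
    unconditionally and render an empty-state line even with zero devices."""
    rows = max(nvi, 1) + max(sdi, 1) + max(raidi, 1)
    if raidi > 0:
        rows += 1  # extra RAID controller summary row
    if has_ups:
        rows += 1
    return rows
```

Note the base count never drops below three: that is exactly the 1-2 rows the unconditional empty blocks contribute in SATA-only or NVMe-only setups.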
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@PVE-Tools.sh`:
- Around line 2830-2854: The code currently fills missing fields with defaults
(temp=0, hours=0, cycles=0, standby=False) causing the UI to render nonexistent
data; change the logic in the block handling
temperature/power_on_time/power_cycle_count/standby so you only assign
result[key]["temperature"], result[key]["power_on_time"],
result[key]["power_cycle_count"], and result[key]["standby"] when the upstream
data actually contains those fields (e.g., check for "temperature" key before
setting temperature.current, check for "hours" or "raw.value" in power_on_time
before setting power_on_time.hours, check that power_cycle_count exists before
writing cycles, and only set standby when "standby" is present), mirroring how
smart_status is handled so absent fields are left unset.

---

Outside diff comments:
In `@PVE-Tools.sh`:
- Around line 3289-3336: The loop increments sdi for every /dev/sdX even when
the device is a RAID virtual disk, causing sdN mismatches with the collector;
update the loop so you first apply the exact same RAID-virtual-device filter
used in smart-cache-collect.py (reuse its predicate/logic) to decide whether a
/dev/sdX is a real SATA device, and only then proceed to append the Perl JSON
block and increment sdi. Concretely, perform the RAID-check before using sd,
sdsn, sdcr and before writing the block that references "sd$sdi" so sdi only
increments for devices that passed the collector's filter. Ensure the same
identifying symbols (sd, sdsn, sdcr, sdi and the cache key "sd$sdi") are used so
keys remain aligned with smart-cache-collect.py.

---

Duplicate comments:
In `@PVE-Tools.sh`:
- Around line 3229-3263: Replace the hand-rolled JSON assembly around
$nvme_full/$nvme_data with JSON::PP-based parsing and encoding: use decode_json
on $nvme_full to get a Perl hashref, then build a result hash only adding keys
when the source fields exist (e.g. model_name, temperature->{current},
power_on_time->{hours}, power_cycle_count, smart_status->{passed} — do not
default missing smart_status to true), include nvme_smart_health_information_log
only when percentage_used exists, and finally set $res->{nvme$nvi} =
encode_json($result_hashref); this avoids injection/escaping bugs from manual
string concatenation and preserves correct "unknown" state handling.
- Around line 4029-4058: The auto-height calculation ignores the always-present
empty blocks (nvme-all, sata-all, raid-all), so adjust the counters to account
for those UI blocks even when nvi/sdi/raidi are zero: set nvme_ui_count,
sata_ui_count, and raid_ui_count to 1 when their block is always inserted (i.e.,
make their initial assignment unconditional or check the block-presence flag),
then use these ui_count values when computing addRs and display_rows (replace
usages of nvi/sdi/raidi in the display_rows base count with
nvme_ui_count/sata_ui_count/raid_ui_count so empty blocks contribute 1 row), and
keep the extra RAID controller row logic (display_rows += 1 when raidi > 0) and
UPS adjustments as-is.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: e41efc8e-ecbe-455e-a155-af4b6e274845

📥 Commits

Reviewing files that changed from the base of the PR and between ddc9f37 and b19e284.

📒 Files selected for processing (1)
  • PVE-Tools.sh

Comment thread PVE-Tools.sh Outdated
Comment on lines +2830 to +2854
    # temperature
    temp = data.get("temperature", {}).get("current", 0)
    result[key]["temperature"] = {"current": temp}

    # power_on_time
    hours = 0
    power_on = data.get("power_on_time", {})
    if "hours" in power_on:
        hours = power_on["hours"]
    elif "raw" in power_on and "value" in power_on["raw"]:
        hours = power_on["raw"]["value"]
    result[key]["power_on_time"] = {"hours": hours}

    # power_cycle_count
    cycles = data.get("power_cycle_count", 0)
    result[key]["power_cycle_count"] = cycles

    # smart_status - only output when present; do not default to true
    smart_status = data.get("smart_status", {})
    if "passed" in smart_status:
        result[key]["smart_status"] = {"passed": smart_status["passed"]}

    # standby - sleep state
    standby = data.get("standby", False)
    result[key]["standby"] = standby

@coderabbitai coderabbitai bot Mar 25, 2026


⚠️ Potential issue | 🟠 Major

Do not backfill missing fields with 0/false

The frontend now decides whether to render based on field presence. Defaulting temperature.current, power_on_time.hours, power_cycle_count (and SATA's standby) to 0/false here displays "not returned / unsupported" as 0°C, 0 hours, or "not in standby", which conflicts with this change's stated goal of preserving original values. As with smart_status, only write a field when upstream actually returned it.

🛠️ Suggested fix
-    temp = data.get("temperature", {}).get("current", 0)
-    result[key]["temperature"] = {"current": temp}
+    temp = data.get("temperature", {})
+    if "current" in temp:
+        result[key]["temperature"] = {"current": temp["current"]}

-    hours = 0
     power_on = data.get("power_on_time", {})
     if "hours" in power_on:
-        hours = power_on["hours"]
+        result[key]["power_on_time"] = {"hours": power_on["hours"]}
     elif "raw" in power_on and "value" in power_on["raw"]:
-        hours = power_on["raw"]["value"]
-    result[key]["power_on_time"] = {"hours": hours}
+        result[key]["power_on_time"] = {"hours": power_on["raw"]["value"]}

-    cycles = data.get("power_cycle_count", 0)
-    result[key]["power_cycle_count"] = cycles
+    if "power_cycle_count" in data:
+        result[key]["power_cycle_count"] = data["power_cycle_count"]

-    standby = data.get("standby", False)
-    result[key]["standby"] = standby
+    if "standby" in data:
+        result[key]["standby"] = data["standby"]

Also applies to: 2875-2888

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@PVE-Tools.sh` around lines 2830 - 2854, The code currently fills missing
fields with defaults (temp=0, hours=0, cycles=0, standby=False) causing the UI
to render nonexistent data; change the logic in the block handling
temperature/power_on_time/power_cycle_count/standby so you only assign
result[key]["temperature"], result[key]["power_on_time"],
result[key]["power_cycle_count"], and result[key]["standby"] when the upstream
data actually contains those fields (e.g., check for "temperature" key before
setting temperature.current, check for "hours" or "raw.value" in power_on_time
before setting power_on_time.hours, check that power_cycle_count exists before
writing cycles, and only set standby when "standby" is present), mirroring how
smart_status is handled so absent fields are left unset.

Author


The logic has been added in the next version.


@firmament2008 Understood, thanks for the update! Once the next version adds the "only write the field when it exists" logic, the spurious 0°C / 0-hour displays will be resolved as well. Looking forward to the next version 👍


🧠 Learnings used
Learnt from: firmament2008
Repo: Mapleawaa/PVE-Tools-9 PR: 74
File: PVE-Tools.sh:3224-3261
Timestamp: 2026-03-25T07:26:27.733Z
Learning: In PVE-Tools.sh (cpu_add function), the NVMe smartctl collection (around line 3228) is intentionally run inline within the Nodes.pm API path rather than via the async smart-cache-collect.py cache. The author (firmament2008) has accepted this because NVMe devices are CPU-directly attached (PCIe direct), platform lane limits bound the count to a very small number (typically 1-4), and the cumulative per-device overhead is not significant enough to block the API. Do not flag inline NVMe smartctl in Nodes.pm as a blocking performance concern.

Learnt from: firmament2008
Repo: Mapleawaa/PVE-Tools-9 PR: 73
File: PVE-Tools.sh:3471-3482
Timestamp: 2026-03-24T04:32:47.952Z
Learning: In PVE-Tools.sh (Bash/JavaScript), the RAID card disk SMART status rendering logic (in the JavaScript renderer injected for `raidX` panels) is intentionally designed with the following priority: (1) gate on `smart_support.available === false` to show "不支持"; (2) use `smart_status.passed` as the primary SMART pass/fail indicator; (3) fall back to `ata_smart_data.self_test.status.passed` only when `smart_status.passed` is undefined. The `ata_smart_data` fallback is intentional for RAID controllers where `smart_status` may be absent.

Learnt from: firmament2008
Repo: Mapleawaa/PVE-Tools-9 PR: 73
File: PVE-Tools.sh:2976-3028
Timestamp: 2026-03-24T04:34:32.616Z
Learning: In PVE-Tools.sh, the RAID card disk smartctl probing (around line 3018) intentionally uses `timeout 1` per device. The existing SATA (`smartctl $sd -a -j`) and NVMe (`smartctl $nvme -a -j`) sections have NO timeout at all, so the RAID `timeout 1` is actually more protective than the existing pattern — not a regression. Do not flag the per-disk `timeout 1` in the RAID section as a blocking concern relative to the other disk types.

Learnt from: firmament2008
Repo: Mapleawaa/PVE-Tools-9 PR: 74
File: PVE-Tools.sh:3221-3265
Timestamp: 2026-03-25T07:25:27.596Z
Learning: In PVE-Tools.sh (cpu_add function), the NVMe/SATA/RAID device topology (sdN, nvmeN, raidN key names) is intentionally captured at cpu_add() install time rather than dynamically at API runtime. The Perl code injected into Nodes.pm reads /run/smart-cache.json at each API call, but the set of sdN/nvmeN/raidN keys to look up is fixed when cpu_add() runs. This is an accepted design trade-off by the author (firmament2008) for efficiency — the async smart-cache-collect.py handles collection, and the static key set avoids runtime device scanning overhead. Do not flag this as a blocking issue.

Learnt from: firmament2008
Repo: Mapleawaa/PVE-Tools-9 PR: 74
File: PVE-Tools.sh:3221-3265
Timestamp: 2026-03-25T07:25:27.596Z
Learning: In PVE-Tools.sh (cpu_add), each sdN and raidN Perl block independently opens and JSON::PP::decode_json's the entire /run/smart-cache.json file. This repeated open+decode pattern per disk is intentional and accepted by the author (firmament2008) — the async collection design is the intended efficiency mechanism, not per-call optimization. Do not flag repeated JSON decodes as a blocking concern.

Learnt from: firmament2008
Repo: Mapleawaa/PVE-Tools-9 PR: 74
File: PVE-Tools.sh:2926-2926
Timestamp: 2026-03-25T07:43:03.128Z
Learning: In PVE-Tools.sh (smart-cache-collect.py), the proposed improvement of replacing `/dev/sd[a-z]` glob with `lsblk` or `/sys/block` enumeration is also insufficient for full RAID1 array device discovery — lsblk/sys/block cannot enumerate all physical devices behind a RAID1 array either. The author (firmament2008) has deferred this improvement until a proper RAID environment is available for development and testing. Do not suggest lsblk/sys/block as a drop-in replacement for the current glob pattern in the context of RAID device enumeration.

Learnt from: firmament2008
Repo: Mapleawaa/PVE-Tools-9 PR: 74
File: PVE-Tools.sh:2926-2926
Timestamp: 2026-03-25T07:21:17.071Z
Learning: In PVE-Tools.sh (smart-cache-collect.py), the device glob patterns `/dev/sd[a-z]` (for SATA, line ~2926) and the NVMe glob (nvme0-9) are intentionally limited to single-character/single-digit device names. The author (firmament2008) has accepted this as a known limitation because the target use case (compute nodes) is not expected to have more than 26 SATA disks or more than 10 NVMe devices. This is tracked as a future TODO to improve using lsblk or /sys/block enumeration. Do not flag this as a blocking issue.

Handling consistent with smart_status:
- temperature.current - only output when present
- power_on_time.hours - only output when present
- power_cycle_count - only output when present
- standby - only output when present
- ata_smart_attributes.table - only output when present
- smart_support.available - only output when present
- ata_smart_data.self_test.status.passed - only output when present

Avoids displaying "not returned / unsupported" as 0°C, 0 hours, or "not in standby"
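The commit's "only output when present" rule for the remaining fields can be sketched as one helper (a hypothetical name for illustration; the field names are the ones listed above, following the smartctl JSON layout):

```python
def copy_if_present(src: dict, dst: dict) -> dict:
    """Mirror the smart_status handling: write a field into the cache entry
    only when the upstream smartctl JSON actually contains it."""
    if "current" in src.get("temperature", {}):
        dst["temperature"] = {"current": src["temperature"]["current"]}
    power_on = src.get("power_on_time", {})
    if "hours" in power_on:
        dst["power_on_time"] = {"hours": power_on["hours"]}
    elif "value" in power_on.get("raw", {}):
        dst["power_on_time"] = {"hours": power_on["raw"]["value"]}
    for field in ("power_cycle_count", "standby"):
        if field in src:
            dst[field] = src[field]
    return dst
```

A device that returns nothing produces an empty entry, so the frontend renders "unknown" instead of 0°C.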
@Mapleawaa
Owner

Mapleawaa commented Mar 26, 2026

I will not merge this PR, and it will be closed.

To be clear up front: "it already runs well on my own machine" is not a basis for merging into the main repository. That it runs in your environment only shows this implementation suits your usage; it does not show it suits the main repository, let alone other users' environments.

The reason for not merging is not merely code style: the direction of the implementation itself has drifted outside the main repository's current maintenance boundary.

The main repository's design goals are explicit: stay a single script with minimal state that is easy to roll back, and do not add extra background services and new maintenance chains just for a node-summary display feature.
What this version actually does is:

  • Add a Python script
  • Add a systemd service
  • Add a systemd timer
  • Add a /run/smart-cache.json cache protocol
  • Change the original SMART read path to depend on a background cache

This is no longer "adding RAID controller display support"; it is a separate resident collection mechanism, and the main repository does not accept that direction.

Beyond that, this PR brings concrete problems:

  1. The SATA / RAID paths already depend on cache reads, but when the cache fails there is no real fallback to the original direct smartctl logic. If the timer is not started, the first collection fails, or the cache file is corrupted, the page simply shows empty data.
  2. The NVMe path was changed along the way, and it contains logic that treats a missing SMART status as healthy by default. That display result is simply wrong.
  3. Install and uninstall are not clean either. Installation deploys the background components first and then modifies the WebUI; uninstallation relies on later steps succeeding before cleanup. A mid-way failure leaves residual state on the host.
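For reference, the fallback missing in point 1 could look roughly like this: a hypothetical sketch, not code from the PR, with the smartctl invocation mirroring the `timeout 2 smartctl DEV -a -j` pattern used elsewhere in the script.

```python
import json
import subprocess

def read_smart(dev_key: str, device: str,
               cache_path: str = "/run/smart-cache.json") -> dict:
    """Prefer the async cache; on any cache failure fall back to one bounded
    direct smartctl read so the page never renders empty when the timer is down."""
    try:
        with open(cache_path) as fh:
            entry = json.load(fh).get(dev_key)
        if entry:
            return entry
    except (OSError, ValueError, AttributeError):
        pass  # missing, corrupt, or non-dict cache: fall through to direct read
    try:
        out = subprocess.run(
            ["timeout", "2", "smartctl", device, "-a", "-j"],
            capture_output=True, text=True, check=False,
        ).stdout
        return json.loads(out) if out.strip().startswith("{") else {}
    except (OSError, ValueError):
        return {}
```

The cache stays the fast path; the direct read only runs when the cache is unusable, bounded by the same 2-second timeout.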

So this is not a case of "it runs fine on your machine but I don't like the look of it": this implementation means higher maintenance cost, more state, and more risk for the main repository.

If you believe in this approach, you are welcome to keep maintaining a variant in your own fork; that is fine.
But the main repository will not accept this implementation.

One more reminder: this repository is licensed under GPL-3.0.
If you continue to maintain, distribute, or publish derivative versions, please retain the original license and copyright notices, and provide the corresponding source code and license information for modified versions as GPLv3 requires.

@Mapleawaa Mapleawaa closed this Mar 26, 2026