NCCL 2.29 – JSON Output and Realtime Monitoring Support in RAS #2009
gab9talavera announced in Announcements
JSON Output and Realtime Monitoring Support in RAS
NCCL 2.29 enhances the Reliability, Availability, and Serviceability (RAS) subsystem with machine-parsable JSON output for metrics collection. This JSON mode is enabled by invoking the ncclras binary with the -f json argument. Unlike the default text output, which is optimized for human consumption, the JSON output is considerably more verbose and effectively dumps all raw data collected by RAS, allowing developers to analyze and interpret NCCL metrics to meet their specific needs.

RAS is normally a pull-based mechanism, generating information only in response to explicit requests. NCCL 2.29 adds a push-based alternative: a monitoring mode for real-time status updates. This monitoring mode is enabled by invoking the ncclras binary with the -m argument; the client prints a welcome message and then blocks, waiting for important event notifications.

    build/bin/ncclras -m
    RAS Monitor Mode - watching for peer changes (Ctrl+C to exit)...
    ================================================================

Such an event could be, for instance, a process being declared dead.

The monitoring mode supports JSON output as well; a dead-peer event would then look as follows:
    {
      "timestamp": "2025-12-19 13:07:07",
      "group": "LIFECYCLE",
      "event": "PEER_DEAD",
      "peer": {
        "host": "172.16.64.245",
        "pid": 1524345,
        "cuda_devs": [1],
        "nvml_devs": [1]
      },
      "details": ""
    }
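As a rough illustration of how a consumer might handle such events, the Python sketch below decodes the JSON shown above into a typed record. It is not part of NCCL. The field names come from the example event. The `PeerEvent` class, the `parse_event` and `iter_events` helpers, and the assumption that events arrive as JSON objects written back to back are all hypothetical. The exact framing of the monitor's JSON stream should be checked against the NCCL documentation.

```python
import json
from dataclasses import dataclass
from datetime import datetime
from typing import Iterator, List


@dataclass
class PeerEvent:
    """Hypothetical typed view of one monitor event (fields taken from the example)."""
    timestamp: datetime
    group: str
    event: str
    host: str
    pid: int
    cuda_devs: List[int]
    nvml_devs: List[int]
    details: str


def parse_event(obj: dict) -> PeerEvent:
    """Convert one decoded JSON object into a PeerEvent."""
    peer = obj.get("peer", {})
    return PeerEvent(
        # Timestamp layout assumed from the example: "YYYY-MM-DD HH:MM:SS".
        timestamp=datetime.strptime(obj["timestamp"], "%Y-%m-%d %H:%M:%S"),
        group=obj["group"],
        event=obj["event"],
        host=peer.get("host", ""),
        pid=peer.get("pid", -1),
        cuda_devs=list(peer.get("cuda_devs", [])),
        nvml_devs=list(peer.get("nvml_devs", [])),
        details=obj.get("details", ""),
    )


def iter_events(text: str) -> Iterator[PeerEvent]:
    """Yield events from a buffer holding zero or more JSON objects back to back.

    raw_decode is used so the parser does not depend on one-object-per-line
    framing, which is an assumption this sketch avoids making.
    """
    decoder = json.JSONDecoder()
    pos = 0
    while pos < len(text):
        # raw_decode does not skip leading whitespace, so do it here.
        while pos < len(text) and text[pos].isspace():
            pos += 1
        if pos >= len(text):
            break
        obj, pos = decoder.raw_decode(text, pos)
        yield parse_event(obj)


# The example event from the announcement, repeated twice to mimic a stream.
SAMPLE = (
    '{ "timestamp": "2025-12-19 13:07:07", "group": "LIFECYCLE", '
    '"event": "PEER_DEAD", "peer": { "host": "172.16.64.245", "pid": 1524345, '
    '"cuda_devs": [1], "nvml_devs": [1] }, "details": "" }'
)
events = list(iter_events(SAMPLE + "\n" + SAMPLE))
dead_peers = [(e.host, e.pid) for e in events if e.event == "PEER_DEAD"]
```

In practice the text passed to `iter_events` would come from the captured output of the ncclras client; how to capture it is left out because it depends on how the client flushes its output.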
Authored by Kamil Iskra (@kiskra-nvidia)