perf(nsys): reduce CPU-side overhead in profiling defaults #3311

Draft
dingqingy-nv wants to merge 3 commits into NVIDIA-NeMo:main from dingqingy-nv:r26.04/nsys-reduce-overhead
@dingqingy-nv
Contributor

Summary

  • Override nemo_run's default nsys_extra_args to disable CPU context-switch tracing (--cpuctxsw=none) and backtrace collection (--backtrace=none), and to drop CUDA graph node tracing by removing --cuda-graph-trace=node
  • These changes cut CPU-side work that nsys does during GPU profiling, which should reduce profiling overhead
  • Applied to both NsysPlugin implementations (recipes + perf scripts) and the interactive launch script

Changes vs nemo_run defaults

| Flag | nemo_run default | After this PR |
| --- | --- | --- |
| --cpuctxsw | process-tree (nsys default) | none |
| --backtrace | auto (nsys default) | none |
| --cuda-graph-trace | node | removed |

Since nemo_run already sets -s none (which disables CPU sampling), --backtrace=none is defensive: there are no CPU samples to attach backtraces to. --cpuctxsw=none is the main win, because it stops the collection of OS thread scheduling traces, which was active by default.
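
For reference, a minimal sketch of how the flags discussed above would appear on an nsys command line. The helper `build_nsys_command` and the target command `python train.py` are hypothetical and used only for illustration; the actual command is assembled by nemo_run, and its full default argument list is not reproduced here.

```python
import shlex
from typing import List, Sequence


def build_nsys_command(target_cmd: Sequence[str], extra_args: Sequence[str]) -> List[str]:
    """Prefix a target command with `nsys profile` and the given extra args."""
    return ["nsys", "profile", *extra_args, *target_cmd]


# Flags named in this PR: CPU sampling off (already set by nemo_run),
# context-switch tracing off, backtrace collection off.
low_overhead_args = ["-s", "none", "--cpuctxsw=none", "--backtrace=none"]

# "python train.py" is a placeholder workload, not part of this PR.
command = build_nsys_command(["python", "train.py"], low_overhead_args)
command_line = shlex.join(command)
print(command_line)
```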

Test plan

  • Verify nsys profiling still produces valid .nsys-rep files
  • Confirm NVTX ranges and CUDA kernels are still captured
  • Check that nsys_extra_args user override still works (user args take precedence)
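
The last item of the test plan could be checked with something like the sketch below. It assumes "user args take precedence" means that a user-supplied list replaces the plugin defaults entirely; the PR text does not say whether the real plugins replace or merge, so both that semantics and the function name `resolve_nsys_extra_args` are assumptions.

```python
from typing import List, Optional, Sequence


def resolve_nsys_extra_args(
    user_args: Optional[Sequence[str]],
    plugin_defaults: Sequence[str],
) -> List[str]:
    """Return the user's args if any were given, otherwise the plugin defaults.

    Assumption: None means "not set by the user"; an explicit empty list is
    respected as a deliberate override.
    """
    if user_args is not None:
        return list(user_args)
    return list(plugin_defaults)


# Illustrative defaults containing only the flags added by this PR.
defaults = ["--cpuctxsw=none", "--backtrace=none"]

resolved_default = resolve_nsys_extra_args(None, defaults)
resolved_user = resolve_nsys_extra_args(["--cpuctxsw=process-tree"], defaults)
```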

🤖 Generated with Claude Code

Override nemo_run's default nsys_extra_args to disable CPU context-switch
tracing (--cpuctxsw=none), backtrace collection (--backtrace=none), and
CUDA graph node tracing (remove --cuda-graph-trace=node). These flags
eliminate unnecessary CPU-side activity during GPU profiling.

Signed-off-by: Dingqing Yang <dingqingy@nvidia.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Dingqing Yang <dingqingy@nvidia.com>
@copy-pr-bot

copy-pr-bot bot commented Apr 13, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Signed-off-by: Dingqing Yang <dingqingy@nvidia.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Dingqing Yang <dingqingy@nvidia.com>
@dingqingy-nv
Contributor Author

Let's treat this as a test PR. If it shows lower profiling overhead, we can raise a PR directly against nemo_run and close this one.

Instead of hardcoding the full nsys_extra_args list, filter out
--cuda-graph-trace and append --cpuctxsw=none and --backtrace=none
to the existing nemo_run defaults. This preserves forward compatibility
with future nemo_run default changes.
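
The filter-and-append approach described in this commit message can be sketched roughly as follows. This is not the code from the PR: the function name `reduce_overhead_args` and the sample input are hypothetical, and the sketch assumes every option is written in `--name=value` form (a space-separated `--cuda-graph-trace node` would leave a dangling value).

```python
from typing import Iterable, List

# Flags this PR appends to whatever nemo_run provides.
OVERHEAD_FLAGS = ["--cpuctxsw=none", "--backtrace=none"]


def reduce_overhead_args(default_args: Iterable[str]) -> List[str]:
    """Drop --cuda-graph-trace and append the low-overhead flags."""
    # Keep every upstream default except the CUDA graph trace option,
    # so future additions to the defaults pass through untouched.
    kept = [a for a in default_args if not a.startswith("--cuda-graph-trace")]

    # Append each flag only if that option name is not already present.
    for flag in OVERHEAD_FLAGS:
        name = flag.split("=", 1)[0]
        if not any(a.split("=", 1)[0] == name for a in kept):
            kept.append(flag)
    return kept


# Illustrative input only; the real nemo_run default list is longer.
sample_defaults = ["-s", "none", "--cuda-graph-trace=node"]
filtered = reduce_overhead_args(sample_defaults)
```

Because the function only removes one option and appends two, any other defaults supplied upstream are left in place and in order.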

Signed-off-by: Dingqing Yang <dingqingy@nvidia.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Dingqing Yang <dingqingy@nvidia.com>