
Conversation

@yarikoptic
Member

The need came up while working on https://github.com/ReproNim/qr_decode_bm, where the whole purpose is to see whether a GPU would help. Also, there is an increasing number of GPU-using pipelines.

I am surrendering it to you @asmacdo to bring it over the finish line. I see the need to:

  • test on typhon; test within a podman and/or singularity container while giving it access to the GPU (see the --nvidia PR for yolo). If all works minimally:
  • test on discovery gpu node
  • test on discovery gpu node while under slurm execution
  • research opportunities for integration testing on GitHub CI with a GPU, or maybe offload that test run to typhon. @CodyCBakerPhD loves those external runners IIRC
  • if there is an nvidia-smi-like tool under OSX to assess GPU use, maybe @candleindark and @CodyCBakerPhD could contribute

yarikoptic and others added 2 commits January 8, 2026 21:19
Add support for GPU utilization monitoring through nvidia-smi:

- New --instruments option (cpu,mem,gpu or 'all'), defaults to 'cpu,mem'
- New --gpu-sample-interval option for separate GPU sampling rate
- New --gpu-timeout option for nvidia-smi call timeout (default 5.0s)
- New environment variables: DUCT_INSTRUMENTS, DUCT_GPU_SAMPLE_INTERVAL,
  DUCT_GPU_TIMEOUT

GPU monitoring features:
- Tracks GPU utilization %, memory utilization %, and memory usage
- Supports multi-GPU systems (aggregates across all GPUs)
- Graceful degradation when nvidia-smi is unavailable
- Peak and average metrics in execution summary and info.json

Schema version bumped to 0.3.0. GPU fields are optional in output.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
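For context, here is a minimal sketch of what such nvidia-smi sampling could look like. The helper name, dataclass, and aggregation strategy (mean utilization, summed memory) are illustrative assumptions, not the PR's actual code; only the nvidia-smi query flags are standard.

```python
import subprocess
from dataclasses import dataclass
from typing import Optional


@dataclass
class GpuSnapshot:
    """One sample aggregated across all GPUs reported by nvidia-smi (hypothetical)."""
    utilization_pct: float         # mean GPU utilization across devices
    memory_utilization_pct: float  # mean memory-controller utilization across devices
    memory_used_mb: float          # total memory used, summed across devices


def sample_gpus(timeout: float = 5.0) -> Optional[GpuSnapshot]:
    """Query nvidia-smi once; return None if it is missing, hangs, or errors."""
    cmd = [
        "nvidia-smi",
        "--query-gpu=utilization.gpu,utilization.memory,memory.used",
        "--format=csv,noheader,nounits",
    ]
    try:
        out = subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout, check=True
        ).stdout
    except (FileNotFoundError, subprocess.TimeoutExpired, subprocess.CalledProcessError):
        return None  # graceful degradation: GPU metrics simply not collected

    rows = [line.split(",") for line in out.strip().splitlines() if line.strip()]
    if not rows:
        return None
    util = [float(r[0]) for r in rows]
    mem_util = [float(r[1]) for r in rows]
    mem_used = [float(r[2]) for r in rows]
    return GpuSnapshot(
        utilization_pct=sum(util) / len(util),
        memory_utilization_pct=sum(mem_util) / len(mem_util),
        memory_used_mb=sum(mem_used),
    )
```

A separate --gpu-sample-interval is plausible because each GPU sample shells out to nvidia-smi, which is noticeably more expensive than the existing per-process CPU/memory sampling, so GPU polling may want to run less frequently.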
Copilot AI review requested due to automatic review settings January 15, 2026 20:48
=== Do not change lines below ===
{
 "chain": [],
 "cmd": "pre-commit install && pre-commit run --all-files || true && ./.update-readme-help.py",
 "exit": 0,
 "extra_inputs": [],
 "inputs": [],
 "outputs": [
  "."
 ],
 "pwd": "."
}
^^^ Do not change lines above ^^^
Contributor

Copilot AI left a comment


Pull request overview

This PR adds GPU utilization tracking to con-duct via nvidia-smi, enabling monitoring of GPU usage (utilization and memory) alongside existing CPU and memory metrics. The implementation introduces an --instruments CLI argument to control which metrics are collected (cpu, mem, gpu, or all), along with GPU-specific sampling interval and timeout controls.

Changes:

  • Added Instruments enum to control which metrics (CPU, MEM, GPU) are monitored
  • Implemented GPU sampling via nvidia-smi with configurable intervals and timeouts
  • Extended schema to version 0.3.0 with optional GPU fields in execution summaries
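For illustration, a minimal sketch of how an --instruments spec such as 'cpu,mem,gpu' or 'all' could be parsed; the names mirror the description above but are not necessarily the PR's exact implementation.

```python
from enum import Flag, auto


class Instruments(Flag):
    CPU = auto()
    MEM = auto()
    GPU = auto()
    ALL = CPU | MEM | GPU  # composite alias covering every instrument


def parse_instruments(spec: str) -> Instruments:
    """Parse a comma-separated spec like 'cpu,mem,gpu' or the shorthand 'all'."""
    if spec.strip().lower() == "all":
        return Instruments.ALL
    result = Instruments(0)
    for name in spec.split(","):
        # Unknown names raise KeyError, which the CLI would turn into a usage error
        result |= Instruments[name.strip().upper()]
    return result


assert parse_instruments("cpu,mem") == Instruments.CPU | Instruments.MEM
assert Instruments.GPU in parse_instruments("all")
assert Instruments.GPU not in parse_instruments("cpu,mem")
```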

Reviewed changes

Copilot reviewed 8 out of 9 changed files in this pull request and generated 5 comments.

Summary per file:

  • src/con_duct/duct_main.py: Core GPU monitoring implementation, including the GpuStats/GpuSample/GpuAverages classes, nvidia-smi integration, and the Instruments enum
  • src/con_duct/cli.py: Added CLI arguments for --instruments, --gpu-sample-interval, and --gpu-timeout with environment variable support
  • src/con_duct/ls.py: Added GPU field definitions to the value transformation map and marked them as optional
  • test/utils.py: Updated test helper to include default GPU-related parameters
  • test/test_schema.py: Modified schema test to handle optional GPU fields correctly
  • test/test_instruments.py: New test file for Instruments enum parsing logic
  • test/duct_main/test_gpu.py: Comprehensive unit tests for GPU monitoring classes and functions
  • CONTRIBUTING.rst: Whitespace cleanup
Comments suppressed due to low confidence (1)

src/con_duct/ls.py:51

  • The 'gpu' field in NON_TRANSFORMED_FIELDS is undocumented and its purpose is unclear. It appears to be related to GPU device information rather than utilization metrics, but this should be clarified with a comment explaining what this field contains.
    "gpu",


Comment on lines +468 to +469
output.averages = self.averages
output.averages.update(other)

Copilot AI Jan 15, 2026


The averages from self are being assigned to output.averages and then updated with other. This mutates self.averages since it's the same object reference. This can lead to incorrect averages being accumulated over time. Instead, create a deep copy of self.averages before assignment or initialize output.averages separately and update it with both samples.
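A minimal, self-contained illustration of the aliasing problem and the copy-first fix (plain dicts stand in for the PR's averages objects; names are illustrative):

```python
from copy import deepcopy


def merged_averages(current: dict, other: dict) -> dict:
    """Combine two averages mappings without mutating either input."""
    result = deepcopy(current)  # copy first, so `current` is left untouched
    result.update(other)
    return result


a = {"cpu_pct": 10.0}
print(merged_averages(a, {"gpu_pct": 42.0}))  # {'cpu_pct': 10.0, 'gpu_pct': 42.0}
print(a)                                      # {'cpu_pct': 10.0}, unchanged
```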


# GPU sampling state
gpu_enabled = Instruments.GPU in instruments
gpu_interval = gpu_sample_interval if gpu_sample_interval else sample_interval

Copilot AI Jan 15, 2026


This condition treats 0.0 as falsy, which means gpu_sample_interval=0.0 will fall back to sample_interval. However, the execute function (line 1475) already passes gpu_sample_interval as None when the value is <= 0, so the check here should be if gpu_sample_interval is not None, handling 0.0 consistently with that convention.

Suggested change:
- gpu_interval = gpu_sample_interval if gpu_sample_interval else sample_interval
+ gpu_interval = (
+     gpu_sample_interval if gpu_sample_interval is not None else sample_interval
+ )

@@ -0,0 +1,78 @@
"""Tests for Instruments enum and parsing."""

Copilot AI Jan 15, 2026


Missing test coverage for the Instruments enum's actual usage in execution - specifically testing that when GPU is in instruments but nvidia-smi is not available, execution proceeds with a warning but without GPU metrics being collected.
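A hedged sketch of such a test, using a throwaway stand-in for the GPU-sampling helper (the real function and module names in the PR may differ):

```python
import subprocess
from unittest import mock


def sample_gpus(timeout: float = 5.0):
    """Stand-in for a GPU-sampling helper: returns None whenever nvidia-smi fails."""
    try:
        subprocess.run(["nvidia-smi"], capture_output=True, timeout=timeout, check=True)
    except (FileNotFoundError, subprocess.TimeoutExpired, subprocess.CalledProcessError):
        return None
    return {}


def test_gpu_sampling_degrades_when_nvidia_smi_missing():
    # Simulate nvidia-smi not being installed: subprocess.run raises FileNotFoundError
    with mock.patch("subprocess.run", side_effect=FileNotFoundError("nvidia-smi")):
        assert sample_gpus() is None


def test_gpu_sampling_degrades_on_timeout():
    # Simulate a hung nvidia-smi call
    err = subprocess.TimeoutExpired(cmd="nvidia-smi", timeout=5.0)
    with mock.patch("subprocess.run", side_effect=err):
        assert sample_gpus() is None
```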

@asmacdo
Member

asmacdo commented Jan 15, 2026

@yarikoptic ok, I'll pick this up. It's a big one; is this needed quickly?

yarikoptic and others added 2 commits January 15, 2026 18:39
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
=== Do not change lines below ===
{
 "chain": [],
 "cmd": "pre-commit install && pre-commit run --all-files || true && ./.update-readme-help.py",
 "exit": 0,
 "extra_inputs": [],
 "inputs": [],
 "outputs": [
  "."
 ],
 "pwd": "."
}
^^^ Do not change lines above ^^^
@yarikoptic
Member Author

no immediate urgency
