
Conversation

@yarikoptic
Member

The need came up while working on https://github.com/ReproNim/qr_decode_bm, where the whole purpose is to see whether a GPU would help. Also, there is an increasing number of GPU-using pipelines.

I am surrendering it to you @asmacdo to bring it over the finish line. I see the need to:

  • test on typhon; test within a podman and/or singularity container while giving it access to the GPU (see the --nvidia PR for yolo). If all works minimally:
  • test on discovery gpu node
  • test on discovery gpu node while under slurm execution
  • research opportunities for integration testing on GitHub CI with a GPU, or maybe offload that test run to typhon. @CodyCBakerPhD loves those external runners IIRC
  • if there is an nvidia-smi-like tool under OSX to assess GPU use, maybe @candleindark and @CodyCBakerPhD could contribute

yarikoptic and others added 2 commits January 8, 2026 21:19
Add support for GPU utilization monitoring through nvidia-smi:

- New --instruments option (cpu,mem,gpu or 'all'), defaults to 'cpu,mem'
- New --gpu-sample-interval option for separate GPU sampling rate
- New --gpu-timeout option for nvidia-smi call timeout (default 5.0s)
- New environment variables: DUCT_INSTRUMENTS, DUCT_GPU_SAMPLE_INTERVAL,
  DUCT_GPU_TIMEOUT

GPU monitoring features:
- Tracks GPU utilization %, memory utilization %, and memory usage
- Supports multi-GPU systems (aggregates across all GPUs)
- Graceful degradation when nvidia-smi is unavailable
- Peak and average metrics in execution summary and info.json

Schema version bumped to 0.3.0. GPU fields are optional in output.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
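For context, here is a minimal sketch of what such nvidia-smi sampling could look like. The helper name, dataclass, and aggregation strategy (mean utilization, summed memory) are illustrative assumptions, not the PR's actual code; only the nvidia-smi query flags are standard.

```python
import subprocess
from dataclasses import dataclass
from typing import Optional


@dataclass
class GpuSnapshot:
    """One sample aggregated across all GPUs reported by nvidia-smi (hypothetical)."""
    utilization_pct: float         # mean GPU utilization across devices
    memory_utilization_pct: float  # mean memory-controller utilization across devices
    memory_used_mb: float          # total memory used, summed across devices


def sample_gpus(timeout: float = 5.0) -> Optional[GpuSnapshot]:
    """Query nvidia-smi once; return None if it is missing, hangs, or errors."""
    cmd = [
        "nvidia-smi",
        "--query-gpu=utilization.gpu,utilization.memory,memory.used",
        "--format=csv,noheader,nounits",
    ]
    try:
        out = subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout, check=True
        ).stdout
    except (FileNotFoundError, subprocess.TimeoutExpired, subprocess.CalledProcessError):
        return None  # graceful degradation: GPU metrics simply not collected

    rows = [line.split(",") for line in out.strip().splitlines() if line.strip()]
    if not rows:
        return None
    util = [float(r[0]) for r in rows]
    mem_util = [float(r[1]) for r in rows]
    mem_used = [float(r[2]) for r in rows]
    return GpuSnapshot(
        utilization_pct=sum(util) / len(util),
        memory_utilization_pct=sum(mem_util) / len(mem_util),
        memory_used_mb=sum(mem_used),
    )
```

A separate --gpu-sample-interval is plausible because each GPU sample shells out to nvidia-smi, which is noticeably more expensive than the existing per-process CPU/memory sampling, so GPU polling may want to run less frequently.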
Copilot AI review requested due to automatic review settings January 15, 2026 20:48
=== Do not change lines below ===
{
 "chain": [],
 "cmd": "pre-commit install && pre-commit run --all-files || true && ./.update-readme-help.py",
 "exit": 0,
 "extra_inputs": [],
 "inputs": [],
 "outputs": [
  "."
 ],
 "pwd": "."
}
^^^ Do not change lines above ^^^
Contributor

Copilot AI left a comment


Pull request overview

This PR adds GPU utilization tracking to con-duct via nvidia-smi, enabling monitoring of GPU usage (utilization and memory) alongside existing CPU and memory metrics. The implementation introduces an --instruments CLI argument to control which metrics are collected (cpu, mem, gpu, or all), along with GPU-specific sampling interval and timeout controls.

Changes:

  • Added Instruments enum to control which metrics (CPU, MEM, GPU) are monitored
  • Implemented GPU sampling via nvidia-smi with configurable intervals and timeouts
  • Extended schema to version 0.3.0 with optional GPU fields in execution summaries
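For illustration, a minimal sketch of how an --instruments spec such as 'cpu,mem,gpu' or 'all' could be parsed; the names mirror the description above but are not necessarily the PR's exact implementation.

```python
from enum import Flag, auto


class Instruments(Flag):
    CPU = auto()
    MEM = auto()
    GPU = auto()
    ALL = CPU | MEM | GPU  # composite alias covering every instrument


def parse_instruments(spec: str) -> Instruments:
    """Parse a comma-separated spec like 'cpu,mem,gpu' or the shorthand 'all'."""
    if spec.strip().lower() == "all":
        return Instruments.ALL
    result = Instruments(0)
    for name in spec.split(","):
        # Unknown names raise KeyError, which the CLI would turn into a usage error
        result |= Instruments[name.strip().upper()]
    return result


assert parse_instruments("cpu,mem") == Instruments.CPU | Instruments.MEM
assert Instruments.GPU in parse_instruments("all")
assert Instruments.GPU not in parse_instruments("cpu,mem")
```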

Reviewed changes

Copilot reviewed 8 out of 9 changed files in this pull request and generated 5 comments.

Summary per file:

  • src/con_duct/duct_main.py: Core GPU monitoring implementation, including the GpuStats/GpuSample/GpuAverages classes, nvidia-smi integration, and the Instruments enum
  • src/con_duct/cli.py: Added CLI arguments for --instruments, --gpu-sample-interval, and --gpu-timeout with environment variable support
  • src/con_duct/ls.py: Added GPU field definitions to the value transformation map and marked them as optional
  • test/utils.py: Updated test helper to include default GPU-related parameters
  • test/test_schema.py: Modified schema test to handle optional GPU fields correctly
  • test/test_instruments.py: New test file for Instruments enum parsing logic
  • test/duct_main/test_gpu.py: Comprehensive unit tests for GPU monitoring classes and functions
  • CONTRIBUTING.rst: Whitespace cleanup
Comments suppressed due to low confidence (1)

src/con_duct/ls.py:51

  • The 'gpu' field in NON_TRANSFORMED_FIELDS is undocumented and its purpose is unclear. It appears to be related to GPU device information rather than utilization metrics, but this should be clarified with a comment explaining what this field contains.
    "gpu",


Comment on lines +468 to +469
output.averages = self.averages
output.averages.update(other)

Copilot AI Jan 15, 2026


The averages from self are being assigned to output.averages and then updated with other. This mutates self.averages since it's the same object reference. This can lead to incorrect averages being accumulated over time. Instead, create a deep copy of self.averages before assignment or initialize output.averages separately and update it with both samples.
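A minimal, self-contained illustration of the aliasing problem and the copy-first fix (plain dicts stand in for the PR's averages objects; names are illustrative):

```python
from copy import deepcopy


def merged_averages(current: dict, other: dict) -> dict:
    """Combine two averages mappings without mutating either input."""
    result = deepcopy(current)  # copy first, so `current` is left untouched
    result.update(other)
    return result


a = {"cpu_pct": 10.0}
print(merged_averages(a, {"gpu_pct": 42.0}))  # {'cpu_pct': 10.0, 'gpu_pct': 42.0}
print(a)                                      # {'cpu_pct': 10.0}, unchanged
```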


# GPU sampling state
gpu_enabled = Instruments.GPU in instruments
gpu_interval = gpu_sample_interval if gpu_sample_interval else sample_interval

Copilot AI Jan 15, 2026


This condition treats 0.0 as falsy, which means gpu_sample_interval=0.0 will fall back to sample_interval. However, the execute function (line 1475) already passes gpu_sample_interval as None when the value is <= 0, so the check here should be if gpu_sample_interval is not None, handling 0.0 consistently with that convention.

Suggested change:
- gpu_interval = gpu_sample_interval if gpu_sample_interval else sample_interval
+ gpu_interval = (
+     gpu_sample_interval if gpu_sample_interval is not None else sample_interval
+ )

@@ -0,0 +1,78 @@
"""Tests for Instruments enum and parsing."""

Copilot AI Jan 15, 2026


Missing test coverage for the Instruments enum's actual usage in execution - specifically testing that when GPU is in instruments but nvidia-smi is not available, execution proceeds with a warning but without GPU metrics being collected.
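A hedged sketch of such a test, using a throwaway stand-in for the GPU-sampling helper (the real function and module names in the PR may differ):

```python
import subprocess
from unittest import mock


def sample_gpus(timeout: float = 5.0):
    """Stand-in for a GPU-sampling helper: returns None whenever nvidia-smi fails."""
    try:
        subprocess.run(["nvidia-smi"], capture_output=True, timeout=timeout, check=True)
    except (FileNotFoundError, subprocess.TimeoutExpired, subprocess.CalledProcessError):
        return None
    return {}


def test_gpu_sampling_degrades_when_nvidia_smi_missing():
    # Simulate nvidia-smi not being installed: subprocess.run raises FileNotFoundError
    with mock.patch("subprocess.run", side_effect=FileNotFoundError("nvidia-smi")):
        assert sample_gpus() is None


def test_gpu_sampling_degrades_on_timeout():
    # Simulate a hung nvidia-smi call
    err = subprocess.TimeoutExpired(cmd="nvidia-smi", timeout=5.0)
    with mock.patch("subprocess.run", side_effect=err):
        assert sample_gpus() is None
```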

@asmacdo
Member

asmacdo commented Jan 15, 2026

@yarikoptic ok, I'll pick this up. It's a big one; is this needed quickly?

yarikoptic and others added 2 commits January 15, 2026 18:39
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
=== Do not change lines below ===
{
 "chain": [],
 "cmd": "pre-commit install && pre-commit run --all-files || true && ./.update-readme-help.py",
 "exit": 0,
 "extra_inputs": [],
 "inputs": [],
 "outputs": [
  "."
 ],
 "pwd": "."
}
^^^ Do not change lines above ^^^
@yarikoptic
Member Author

no immediate urgency
