Add tracking of GPU utilization #368
base: main
Conversation
Add support for GPU utilization monitoring through nvidia-smi:
- New --instruments option (cpu,mem,gpu or 'all'), defaults to 'cpu,mem'
- New --gpu-sample-interval option for a separate GPU sampling rate
- New --gpu-timeout option for the nvidia-smi call timeout (default 5.0s)
- New environment variables: DUCT_INSTRUMENTS, DUCT_GPU_SAMPLE_INTERVAL, DUCT_GPU_TIMEOUT

GPU monitoring features:
- Tracks GPU utilization %, memory utilization %, and memory usage
- Supports multi-GPU systems (aggregates across all GPUs)
- Graceful degradation when nvidia-smi is unavailable
- Peak and average metrics in the execution summary and info.json

Schema version bumped to 0.3.0. GPU fields are optional in the output.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
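For readers unfamiliar with nvidia-smi's query interface, here is a minimal sketch of how such sampling could work; the function name, returned keys, and aggregation choice are illustrative assumptions, not the code from this PR:

```python
import subprocess
from typing import Optional


def sample_gpus(timeout: float = 5.0) -> Optional[dict]:
    """Query nvidia-smi once and aggregate utilization across all GPUs.

    Returns None (graceful degradation) if nvidia-smi is missing,
    times out, or exits with an error.
    """
    cmd = [
        "nvidia-smi",
        "--query-gpu=utilization.gpu,utilization.memory,memory.used",
        "--format=csv,noheader,nounits",
    ]
    try:
        out = subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout, check=True
        ).stdout
    except (FileNotFoundError, subprocess.TimeoutExpired, subprocess.CalledProcessError):
        return None
    rows = [line.split(",") for line in out.strip().splitlines() if line.strip()]
    if not rows:
        return None
    return {
        # One aggregate across all GPUs; peaks and running averages
        # would be tracked per sample by the caller.
        "gpu_util_pct": sum(float(r[0]) for r in rows),
        "gpu_mem_util_pct": sum(float(r[1]) for r in rows),
        "gpu_mem_used_mib": sum(float(r[2]) for r in rows),
    }
```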
=== Do not change lines below ===
{
"chain": [],
"cmd": "pre-commit install && pre-commit run --all-files || true && ./.update-readme-help.py",
"exit": 0,
"extra_inputs": [],
"inputs": [],
"outputs": [
"."
],
"pwd": "."
}
^^^ Do not change lines above ^^^
Pull request overview
This PR adds GPU utilization tracking to con-duct via nvidia-smi, enabling monitoring of GPU usage (utilization and memory) alongside existing CPU and memory metrics. The implementation introduces an --instruments CLI argument to control which metrics are collected (cpu, mem, gpu, or all), along with GPU-specific sampling interval and timeout controls.
Changes:
- Added Instruments enum to control which metrics (CPU, MEM, GPU) are monitored (a rough sketch follows after this list)
- Implemented GPU sampling via nvidia-smi with configurable intervals and timeouts
- Extended schema to version 0.3.0 with optional GPU fields in execution summaries
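One possible shape for that enum and its parser, shown purely as a sketch under the assumption of a Flag-based design (the PR may structure it differently, e.g. as a plain Enum plus a set):

```python
from enum import Flag, auto


class Instruments(Flag):
    CPU = auto()
    MEM = auto()
    GPU = auto()


def parse_instruments(spec: str) -> Instruments:
    """Parse a comma-separated spec such as 'cpu,mem,gpu' or 'all'."""
    if spec.strip().lower() == "all":
        return Instruments.CPU | Instruments.MEM | Instruments.GPU
    selected = Instruments(0)  # empty flag: nothing selected yet
    for name in spec.split(","):
        selected |= Instruments[name.strip().upper()]  # raises KeyError on unknown names
    return selected


# Example: the default of 'cpu,mem' leaves GPU sampling disabled.
assert Instruments.GPU not in parse_instruments("cpu,mem")
assert Instruments.GPU in parse_instruments("all")
```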
Reviewed changes
Copilot reviewed 8 out of 9 changed files in this pull request and generated 5 comments.
Summary per file:
| File | Description |
|---|---|
| src/con_duct/duct_main.py | Core GPU monitoring implementation including GpuStats/GpuSample/GpuAverages classes, nvidia-smi integration, and Instruments enum |
| src/con_duct/cli.py | Added CLI arguments for --instruments, --gpu-sample-interval, and --gpu-timeout with environment variable support |
| src/con_duct/ls.py | Added GPU field definitions to value transformation map and marked them as optional |
| test/utils.py | Updated test helper to include default GPU-related parameters |
| test/test_schema.py | Modified schema test to handle optional GPU fields correctly |
| test/test_instruments.py | New test file for Instruments enum parsing logic |
| test/duct_main/test_gpu.py | Comprehensive unit tests for GPU monitoring classes and functions |
| CONTRIBUTING.rst | Whitespace cleanup |
Comments suppressed due to low confidence (1)
src/con_duct/ls.py:51
- The 'gpu' field in NON_TRANSFORMED_FIELDS is undocumented and its purpose is unclear. It appears to be related to GPU device information rather than utilization metrics, but this should be clarified with a comment explaining what this field contains.
"gpu",
```python
output.averages = self.averages
output.averages.update(other)
```
Copilot AI (Jan 15, 2026):
The averages from self are being assigned to output.averages and then updated with other. This mutates self.averages since it's the same object reference. This can lead to incorrect averages being accumulated over time. Instead, create a deep copy of self.averages before assignment or initialize output.averages separately and update it with both samples.
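To make the aliasing concrete, a small self-contained illustration (the class and field names are simplified stand-ins, not the PR's actual types):

```python
import copy


class Report:
    """Simplified stand-in for the object holding accumulated averages."""

    def __init__(self, averages: dict) -> None:
        self.averages = averages

    def combined_buggy(self, other: dict) -> dict:
        out = self.averages          # same object, not a copy
        out.update(other)            # BUG: also mutates self.averages
        return out

    def combined_fixed(self, other: dict) -> dict:
        out = copy.deepcopy(self.averages)  # leaves self.averages untouched
        out.update(other)
        return out


report = Report({"gpu_util_pct": 10.0})
report.combined_buggy({"gpu_util_pct": 90.0})
print(report.averages)  # {'gpu_util_pct': 90.0} -- accumulator silently changed

report = Report({"gpu_util_pct": 10.0})
report.combined_fixed({"gpu_util_pct": 90.0})
print(report.averages)  # {'gpu_util_pct': 10.0} -- accumulator preserved
```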
```python
# GPU sampling state
gpu_enabled = Instruments.GPU in instruments
gpu_interval = gpu_sample_interval if gpu_sample_interval else sample_interval
```
Copilot AI (Jan 15, 2026):
This condition treats 0.0 as falsy, which means gpu_sample_interval=0.0 will fall back to sample_interval. However, in the execute function (line 1475), gpu_sample_interval is explicitly passed as None when the value is <= 0, but here the check should be if gpu_sample_interval is not None to properly handle 0.0 as a valid value that should use the sample_interval.
Suggested change:

```diff
-gpu_interval = gpu_sample_interval if gpu_sample_interval else sample_interval
+gpu_interval = (
+    gpu_sample_interval if gpu_sample_interval is not None else sample_interval
+)
```
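To see why the `is not None` check matters, a tiny illustration: 0.0 is falsy, so a truthiness check cannot tell "explicitly zero" apart from "not set".

```python
sample_interval = 1.0
gpu_sample_interval = 0.0  # explicitly configured as zero

print(gpu_sample_interval if gpu_sample_interval else sample_interval)
# 1.0 -- the explicit 0.0 is silently discarded

print(gpu_sample_interval if gpu_sample_interval is not None else sample_interval)
# 0.0 -- only None falls back to sample_interval
```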
```
@@ -0,0 +1,78 @@
"""Tests for Instruments enum and parsing."""
```
Copilot AI (Jan 15, 2026):
Missing test coverage for the Instruments enum's actual usage in execution - specifically testing that when GPU is in instruments but nvidia-smi is not available, execution proceeds with a warning but without GPU metrics being collected.
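A rough sketch of what such a test could look like with pytest's monkeypatch; the sampler below is an inline stand-in (mirroring the hypothetical sketch earlier in this thread), not the PR's actual function:

```python
import subprocess


def sample_gpus(timeout: float = 5.0):
    """Inline stand-in for the PR's GPU sampler."""
    try:
        subprocess.run(
            ["nvidia-smi", "--query-gpu=utilization.gpu", "--format=csv,noheader,nounits"],
            capture_output=True, text=True, timeout=timeout, check=True,
        )
    except (FileNotFoundError, subprocess.TimeoutExpired, subprocess.CalledProcessError):
        return None
    return {}


def test_degrades_gracefully_when_nvidia_smi_is_missing(monkeypatch):
    def fake_run(*args, **kwargs):
        raise FileNotFoundError("nvidia-smi not found")

    monkeypatch.setattr(subprocess, "run", fake_run)
    # Execution should continue with GPU metrics simply absent.
    assert sample_gpus() is None
```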
@yarikoptic ok, I'll pick this up. It's a big one; is this needed quickly?
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
=== Do not change lines below ===
{
"chain": [],
"cmd": "pre-commit install && pre-commit run --all-files || true && ./.update-readme-help.py",
"exit": 0,
"extra_inputs": [],
"inputs": [],
"outputs": [
"."
],
"pwd": "."
}
^^^ Do not change lines above ^^^
No immediate urgency.
The need came up while working on https://github.com/ReproNim/qr_decode_bm, where the whole purpose is to see if GPU would help. Also, there is an increasing number of pipelines
I am surrendering it to you @asmacdo to bring it to/over the finish line. I see the need to