amd_smi: add gpu_metrics events, refine descriptions, and suppress sentinel-valued fields #568

Open
djwoun wants to merge 1 commit into icl-utk-edu:master from djwoun:amd-smi-gpu-metrics-events

Conversation

Contributor

@djwoun djwoun commented Feb 25, 2026

Pull Request Description

Expands metric coverage, updates descriptions for select existing GPU metrics events, and suppresses registration of struct-backed events when AMD SMI returns sentinel values.

In particular, it:

  • adds and expands GPU metrics events
  • updates descriptions for select existing GPU metrics events where better AMD SMI documentation is available
  • filters out sentinel-backed fields at registration time for struct-based event groups so unsupported metrics are not exposed as garbage values such as -1

Author Checklist

  • Description
    Why this PR exists. Reference all relevant information, including background, issues, test failures, etc
  • Commits
    Commits are self contained and only do one thing
    Commits have a header of the form: module: short description
    Commits have a body (whenever relevant) containing a detailed description of the addressed problem and its solution
  • Tests
    The PR needs to pass all the tests

@Treece-Burgess
Contributor

I am reviewing this PR.

Comment thread src/components/amd_smi/amds.c Outdated
if (add_event(&idx, name_buf, descr_buf, d, 10, 0, PAPI_MODE_READ,
              access_amdsmi_gpu_metrics) != PAPI_OK)
    return PAPI_ENOMEM;

Contributor

The new events all check out on Odyssey, but on Illyad with ROCm 7.1.1, the events:

amd_smi:::accumulation_counter
amd_smi:::prochot_residency_acc
amd_smi:::ppt_residency_acc 
amd_smi:::socket_thm_residency_acc
amd_smi:::vr_thm_residency_acc
amd_smi:::hbm_thm_residency_acc

will show a counter value of -1 when run with papi_command_line. Do you know why this is occurring?

Contributor Author

There is a small problem: amdsmi_get_gpu_metrics_info_p reports success as long as at least one of the 20 metrics it queries comes back valid.

I'm thinking maybe I should add additional checks to see if it returns sentinel values?

Contributor

I think that would be a good idea! What were you thinking the check could be?

Contributor Author

AMD SMI calls can return AMDSMI_STATUS_SUCCESS even when only part of the output struct is actually valid, and the unsupported fields are left at sentinel values. For the residency and accumulation counters that sentinel is UINT64_MAX, which PAPI ends up showing as -1. But the same pattern shows up in other struct-returning calls too, with UINT16_MAX, UINT32_MAX, UINT64_MAX, and in a few cases things like UINT8_MAX.

So the fix in this PR is the generic one: zero-init the struct, probe it in amds.c, and only register an event if the specific backing field is not its width-appropriate sentinel. That is why the PR applies the same check to other struct-based events.
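
The registration-time check described above could be sketched roughly as follows. This is a minimal, self-contained illustration and not the code in amds.c: `fake_metrics_t`, `fake_get_metrics`, `count_registrable`, and the `valid_*` helpers are hypothetical stand-ins for an AMD SMI output struct, a struct-returning query, and the registration loop. The field values are made up for the example.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for an AMD SMI output struct. */
typedef struct {
    uint64_t accumulation_counter;
    uint32_t socket_power;
    uint16_t temperature;
} fake_metrics_t;

/* Hypothetical query: reports success but leaves one unsupported
 * field at its all-ones sentinel. */
static int fake_get_metrics(fake_metrics_t *m)
{
    m->accumulation_counter = UINT64_MAX; /* unsupported on this device */
    m->socket_power = 120;
    m->temperature = 55;
    return 0; /* success */
}

/* Width-appropriate sentinel checks, one per field width. */
static int valid_u64(uint64_t v) { return v != UINT64_MAX; }
static int valid_u32(uint32_t v) { return v != UINT32_MAX; }
static int valid_u16(uint16_t v) { return v != UINT16_MAX; }

/* Zero-init, probe once, and count only the fields that would be
 * registered as events. */
static int count_registrable(void)
{
    fake_metrics_t m;
    int n = 0;

    memset(&m, 0, sizeof(m));
    if (fake_get_metrics(&m) != 0)
        return 0;

    if (valid_u64(m.accumulation_counter)) n++;
    if (valid_u32(m.socket_power))         n++;
    if (valid_u16(m.temperature))          n++;
    return n;
}
```

In this sketch the accumulation counter is skipped, so only two of the three fields would be registered.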

Contributor

I saw the use of conditionals checking UINT16_MAX, UINT32_MAX, UINT64_MAX, etc. A follow-up question: how do you know those are the correct sentinels? For example, you have the following conditional check:

if (pinfo.current_socket_power != UINT32_MAX && pinfo.current_socket_power != UINT16_MAX)

According to the documentation, current_socket_power is a uint32_t, yet you also check UINT16_MAX.

Contributor Author

I added the extra UINT16_MAX check because I was seeing some uint32_t AMD SMI fields come back as 65535. After printing the values from amdsmi_get_power_info_p(), that does appear to be what is happening for average_socket_power.
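
A minimal sketch of that double check, assuming the 16-bit sentinel arrives widened into a uint32_t field as observed above. The helper name `u32_field_is_supported` is hypothetical and does not come from the PR:

```c
#include <stdint.h>

/* Hypothetical helper: treat a uint32_t field as unsupported if it holds
 * either the 32-bit sentinel or a 16-bit sentinel that was widened
 * to 32 bits (65535). */
static int u32_field_is_supported(uint32_t v)
{
    return v != UINT32_MAX && v != (uint32_t)UINT16_MAX;
}
```

The trade-off is that a genuine reading of exactly 65535 would also be treated as unsupported, so this only makes sense for fields where that value is not plausible.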

djwoun changed the title from "amd_smi: add gpu_metrics accumulation and and residency counters" to "amd_smi: add gpu_metrics events, refine descriptions, and suppress sentinel-valued fields" on Mar 19, 2026
djwoun force-pushed the amd-smi-gpu-metrics-events branch from a28e459 to 6bc178d on March 19, 2026 21:59
djwoun force-pushed the amd-smi-gpu-metrics-events branch from 83f2477 to f78936b on March 19, 2026 23:07

memset(&dummy_usage, 0, sizeof(dummy_usage));
if (amdsmi_get_gpu_activity_p && amdsmi_get_gpu_activity_p(device_handles[d], &dummy_usage) == AMDSMI_STATUS_SUCCESS) {
    if (dummy_usage.gfx_activity != UINT32_MAX && dummy_usage.gfx_activity != UINT16_MAX) {
        CHECK_EVENT_IDX(idx);

Contributor

Was the CHECK_EVENT_IDX here and the two below originally forgotten when amd_smi was merged into master? From the diff it appears that way.

Contributor Author

Yeah, looks like it was missing before.

if (amdsmi_get_clock_info_p(device_handles[d], clk_types[t], &info) != AMDSMI_STATUS_SUCCESS)
    continue;
for (int f = 0; f < 5; ++f) {
    if (f == 4 && info.clk_deep_sleep == UINT8_MAX)

Contributor

What is the reasoning for continuing here? Could you add a comment above the for loop header to document it?

Contributor Author

That continue is only there to skip adding the deep_sleep event when clk_deep_sleep is UINT8_MAX.
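
A commented version of that loop might look like the sketch below. It is an illustration under assumptions: `fake_clk_info_t` and `count_clock_events` are hypothetical, only `clk_deep_sleep` is named in the snippet under review, and the other field names and the "fields 0..3 are always registered" behavior are assumed for the example.

```c
#include <stdint.h>

/* Hypothetical stand-in for the clock info struct; only clk_deep_sleep
 * appears in the reviewed snippet, the other fields are assumed. */
typedef struct {
    uint32_t clk;
    uint32_t min_clk;
    uint32_t max_clk;
    uint8_t  clk_locked;
    uint8_t  clk_deep_sleep;
} fake_clk_info_t;

/* Returns how many of the five per-clock events would be registered. */
static int count_clock_events(const fake_clk_info_t *info)
{
    int registered = 0;

    /* Field index 4 is the deep_sleep event. When the library leaves
     * clk_deep_sleep at UINT8_MAX the metric is unsupported, so that
     * one event is skipped; the other fields are registered as usual. */
    for (int f = 0; f < 5; ++f) {
        if (f == 4 && info->clk_deep_sleep == UINT8_MAX)
            continue;
        registered++;
    }
    return registered;
}
```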

Comment thread src/components/amd_smi/amds.c Outdated
@@ -1279,128 +1153,121 @@ static int init_event_table(void) {
// PCIe information events
if (amdsmi_get_pcie_info_p) {

Contributor

By my count, 36 new events are added. Of those 36, 7 do not appear on either the MI210 or the MI300A:

pcie_nak_sent_count_acc
pcie_nak_rcvd_count_acc
average_vclk1_frequency
average_dclk1_frequency
vcn_activity_vcn
jpeg_activity_jpeg
xcp_gfx_below_host_limit_acc_xcp

Do you know the architecture these events appear on?

Contributor Author

I am not sure, but I would guess some older generations of AMD GPUs, because this function dates back to ROCm SMI.

@@ -1279,128 +1153,121 @@ static int init_event_table(void) {
// PCIe information events

Contributor

When running ./papi_component_avail on the master branch, amd_smi shows 342 native events on Odyssey at Oregon with ROCm 7.2.0. Doing the same for this PR, I see 390 native events. An increase is expected, but by my count 71 new events show up in the output of papi_native_avail, meaning we should see 413 instead of 390.

Contributor Author

This PR adds new events, but it also adds registration-time sentinel filtering to some existing struct-based events. So the expected total is not just 342 + 71; it is 342 + added - filtered. On Odyssey that appears to be why the total is 390 instead of 413.
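
Taking the counts quoted in this thread at face value, the relation total = master + added - filtered can be solved for the number of filtered events. The sketch below only restates that arithmetic; `implied_filtered` is a hypothetical helper, and the result is implied by the quoted numbers rather than measured.

```c
/* Solve on_pr = on_master + added - filtered for filtered. */
static int implied_filtered(int on_master, int added, int on_pr)
{
    return on_master + added - on_pr;
}

/* Counts quoted in this thread for Odyssey (added is the reviewer's count). */
enum { EVENTS_ON_MASTER = 342, EVENTS_ADDED = 71, EVENTS_ON_PR = 390 };
```

With these inputs, `implied_filtered(EVENTS_ON_MASTER, EVENTS_ADDED, EVENTS_ON_PR)` evaluates to 23, which is the gap between the expected 413 and the observed 390.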

        break;
    }
    case 41: {
        uint32_t xcp_index = event->subvariant >> 16;

Contributor

Q: I am not familiar enough with the workflow, so what will event->subvariant hold? I know from one of the header files that it is a uint32_t, but how do right-shifting and ANDing together give us the correct indices?

Contributor Author

subvariant is being used here to carry the indices needed for the XCP array lookups. These events need two indices, not one: the XCP index and the inner element index. Since native_event_t only has one subvariant field, both are packed into that uint32_t during registration and unpacked in the accessor.
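
The packing and unpacking could look roughly like this. The right shift by 16 comes from the snippet above; the 0xFFFF mask for the inner index and the helper names are assumptions made for illustration, not taken from the PR.

```c
#include <stdint.h>

/* Assumed layout: XCP index in the upper 16 bits, inner element index
 * in the lower 16 bits of the single uint32_t subvariant field. */
static uint32_t pack_subvariant(uint32_t xcp_index, uint32_t inner_index)
{
    return (xcp_index << 16) | (inner_index & 0xFFFFu);
}

/* Shifting right by 16 discards the lower half and leaves the XCP index. */
static uint32_t unpack_xcp_index(uint32_t subvariant)
{
    return subvariant >> 16;
}

/* Masking with 0xFFFF discards the upper half and leaves the inner index. */
static uint32_t unpack_inner_index(uint32_t subvariant)
{
    return subvariant & 0xFFFFu;
}
```

For example, packing XCP index 3 and inner index 5 gives 0x00030005; the shift recovers 3 and the mask recovers 5. Under this layout both indices must fit in 16 bits.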
