amd_smi: add gpu_metrics events, refine descriptions, and suppress sentinel-valued fields#568
amd_smi: add gpu_metrics events, refine descriptions, and suppress sentinel-valued fields#568djwoun wants to merge 1 commit intoicl-utk-edu:masterfrom
Conversation
|
I am reviewing this PR. |
| if (add_event(&idx, name_buf, descr_buf, d, 10, 0, PAPI_MODE_READ, | ||
| access_amdsmi_gpu_metrics) != PAPI_OK) | ||
| return PAPI_ENOMEM; | ||
|
|
There was a problem hiding this comment.
The new events all check out on Odyssey, but on Illyad with ROCm 7.1.1, the events:
amd_smi:::accumulation_counter
amd_smi:::prochot_residency_acc
amd_smi:::ppt_residency_acc
amd_smi:::socket_thm_residency_acc
amd_smi:::vr_thm_residency_acc
amd_smi:::hbm_thm_residency_acc
will show a counter value of -1 when ran with papi_command_line. Do you know why this is occurring?
There was a problem hiding this comment.
There is a small problem where amdsmi_get_gpu_metrics_info_p will return true if it returns just one true metric out of the 20 metrics it queries for.
I'm thinking maybe I should add additional checks to see if it returns sentinel values?
There was a problem hiding this comment.
I think that would be a good idea! What were you thinking the check could be?
There was a problem hiding this comment.
AMD SMI calls can return AMDSMI_STATUS_SUCCESS even when only part of the output struct is actually valid, and the unsupported fields are left at sentinel values. For the residency and accumulation counters that sentinel is UINT64_MAX, which PAPI ends up showing as -1. But the same pattern shows up in other struct-returning calls too, with UINT16_MAX, UINT32_MAX, UINT64_MAX, and in a few cases things like UINT8_MAX.
So the fix in this PR is the generic one: zero-init the struct, probe it in amds.c, and only register an event if the specific backing field is not its width-appropriate sentinel. That is why the PR applies the same check to other struct-based events.
There was a problem hiding this comment.
I saw the use of conditionals checking UINT16_MAX, UINT32_MAX, UINT64_MAX, etc. A follow up question is: How do you know those are the correct? For example, you have the following conditional check:
if (pinfo.current_socket_power != UINT32_MAX && pinfo.current_socket_power != UINT16_MAX)
Looking at documentation current_socket_power only has UINT32_t. However, you also check UINT16_MAX.
There was a problem hiding this comment.
I added the extra UINT16_MAX check because I was seeing some uint32_t AMD SMI fields come back as 65535. After printing the values from amdsmi_get_power_info_p(), that does appear to be what is happening for average_socket_power.
a28e459 to
6bc178d
Compare
…ntinel-valued fields
83f2477 to
f78936b
Compare
| memset(&dummy_usage, 0, sizeof(dummy_usage)); | ||
| if (amdsmi_get_gpu_activity_p && amdsmi_get_gpu_activity_p(device_handles[d], &dummy_usage) == AMDSMI_STATUS_SUCCESS) { | ||
| if (dummy_usage.gfx_activity != UINT32_MAX && dummy_usage.gfx_activity != UINT16_MAX) { | ||
| CHECK_EVENT_IDX(idx); |
There was a problem hiding this comment.
Was the CHECK_EVENT_IDX here and the two below originally forgotten when amd_smi was merged into master? From the diff it appears that way.
There was a problem hiding this comment.
Yeah, looks like it was missing before.
| if (amdsmi_get_clock_info_p(device_handles[d], clk_types[t], &info) != AMDSMI_STATUS_SUCCESS) | ||
| continue; | ||
| for (int f = 0; f < 5; ++f) { | ||
| if (f == 4 && info.clk_deep_sleep == UINT8_MAX) |
There was a problem hiding this comment.
What is the reasoning for continuing here? Could you add a comment above the for loop header for documentation.
There was a problem hiding this comment.
That continue is only there to skip adding the deep_sleep event when clk_deep_sleep is UINT8_MAX.
| if (add_event(&idx, name_buf, descr_buf, d, 10, 0, PAPI_MODE_READ, | ||
| access_amdsmi_gpu_metrics) != PAPI_OK) | ||
| return PAPI_ENOMEM; | ||
|
|
There was a problem hiding this comment.
I saw the use of conditionals checking UINT16_MAX, UINT32_MAX, UINT64_MAX, etc. A follow up question is: How do you know those are the correct? For example, you have the following conditional check:
if (pinfo.current_socket_power != UINT32_MAX && pinfo.current_socket_power != UINT16_MAX)
Looking at documentation current_socket_power only has UINT32_t. However, you also check UINT16_MAX.
| @@ -1279,128 +1153,121 @@ static int init_event_table(void) { | |||
| // PCIe information events | |||
| if (amdsmi_get_pcie_info_p) { | |||
There was a problem hiding this comment.
From my counting 36 new events are added. Of the 36, 7 events do not appear on either the MI210 or MI300A they are:
pcie_nak_sent_count_acc
pcie_nak_rcvd_count_acc
average_vclk1_frequency
average_dclk1_frequency
vcn_activity_vcn
jpeg_activity_jpeg
xcp_gfx_below_host_limit_acc_xcp
Do you know the architecture these events appear on?
There was a problem hiding this comment.
I am not sure but I would guess some older generations of AMD gpus because this function was created with ROCm SMI.
| @@ -1279,128 +1153,121 @@ static int init_event_table(void) { | |||
| // PCIe information events | |||
There was a problem hiding this comment.
When running ./papi_component_avail on the master branch, amd_smi shows 342 native events on Odyssey at Oregon with ROCm 7.2.0. Doing the same for this PR, I see 390 native events. An increase is expected, but from my count 71 new events show in the output of papi_native_avail. Meaning we should see 413 output instead of 390.
There was a problem hiding this comment.
This PR adds new events, but it also adds registration-time sentinel filtering to some existing struct-based events. So the expected total is not just 342 + 71; it is 342 + added - filtered. On Odyssey that appears to be why the total is 390 instead of 413.
| break; | ||
| } | ||
| case 41: { | ||
| uint32_t xcp_index = event->subvariant >> 16; |
There was a problem hiding this comment.
Q: I am not familiar enough with the workflow, so what will event->subvariant hold? I know from one of the header files it is a uint32_t, but how is both right shifting and ANDing going to give us the correct indices?
There was a problem hiding this comment.
subvariant is being used here to carry the indices needed for the XCP array lookups. These events need two indices, not one: the XCP index and the inner element index. Since native_event_t only has one subvariant field, both are packed into that uint32_t during registration and unpacked in the accessor.
Pull Request Description
Expands metric coverage, updates descriptions for select existing GPU metrics events, suppresses registration of struct-backed events when AMD SMI returns sentinel values.
In particular, it:
-1Author Checklist
Why this PR exists. Reference all relevant information, including background, issues, test failures, etc
Commits are self contained and only do one thing
Commits have a header of the form:
module: short descriptionCommits have a body (whenever relevant) containing a detailed description of the addressed problem and its solution
The PR needs to pass all the tests