Switch over to hip-tests by default + HIP7 Fixes#1161
Merged
Conversation
c3ff91c to
c903bc2
Compare
pvelesko
added a commit
that referenced
this pull request
Mar 17, 2026
Update known_failures.yaml to achieve 100% pass rate across all platforms after switching to hip-tests by default. The new hip-tests submodule introduces ~480 additional tests compared to the old HIP/tests suite. Key changes: - Fix syncthreads regex: Unit_syncthreads_* → .*syncthreads_* to match the triple-underscore test names (Unit___syncthreads_*) - Fix Unit_hipMultiThreadStreams2 regex to use .* suffix for new naming - Add cupcake-specific exclusions for Intel GPU failures: - .*_MemoryTest1/2 wildcards (hipArray/hipMalloc3D not supported) - DeviceTest, StreamTest, EventTest, etc. timeouts/failures - Add salami-specific timeout exclusions (ARM Mali iGPU) - Add pastrami-specific PoCL exclusions for macOS: - .*_MemoryTest1/2, .*_UnitDeviceTests, .*_EventTest wildcards - Replace 100+ individual entries with category-level wildcards - Update hip-tests submodule with macOS compatibility fixes ALL.txt exclusion regex size: 18154 bytes (well under 57000 byte limit) Test results (100% pass rate on all platforms): - cupcake dgpu level0: 639 tests passed - cupcake dgpu opencl: 618 tests passed - cupcake igpu opencl: 618 tests passed - cupcake cpu opencl: 663 tests passed - salami igpu opencl: 1517 tests passed - pastrami pocl opencl: 347 tests passed
ecaa5dd to
0ab4702
Compare
74d561b to
b7391a9
Compare
Collaborator
Author
|
/run-aurora-ci |
88afa25 to
4074cd5
Compare
- Switch default test suite from hip-testsuite to hip-tests - Fix device-side __assert_rtn for macOS kernel assert support - Fix cooperative_groups assert with __builtin_trap on macOS - Return correct error codes for invalid kernel block dimensions - Implement missing hipDeviceGetAttribute cases (40+ attributes) - Update known_failures.yaml for all platforms (cupcake, meatloaf, pastrami, salami, pocl-cpu CI) - Move PoCL category wildcards to ANY.OPENCL_POCL for CI coverage
The build-and-test-libraries job sets IGC_EnableDPEmulation/ OverrideDefaultFP64Settings only in the 'Run library tests' step, but rocRAND's double-precision kernels (poisson/xorwow generators) are compiled during the 'Build chipStar + all libraries' step. On the Intel Arc runner (no native fp64) those compiles fail with 'Double type is not supported on this platform'. Set the emulation flags for the build too.
…eonsi The new hip-tests suite introduces test names not covered by the chipstar-rusticl known-failures section, surfacing genuine rusticl/radeonsi (Mesa/ACO) driver gaps: - Unit_kernel_chkConstantViaKernel_KernelTest and Unit_kernel_chkGlobalArrAndGlobalVaribleViaKernelFn_KernelTest abort because Mesa cannot consume program-scope CrossWorkgroup globals (__constant__/global vars) -- the same gap already documented for this runner. - Unit___threadfence_block_Positive_Basic_Shared_ThreadfenceTest hangs the GPU (amdgpu CS cancelled / hard recovery) on radeonsi. All three pass on Level Zero and PoCL; only the experimental rusticl runner is affected.
sync_and_util.hh declares device-side memset()/memcpy() overloads at global
scope but only included <cstddef>/<cstdint>. Host code that reaches these
declarations without <cstring> in scope (e.g. rocPRIM/hipCUB device headers
compiled for the host via the cuspv/CUDA-compat include path) then sees only
the __device__ memset overload -- and no host strcmp at all -- producing:
error: use of undeclared identifier 'strcmp'
error: no matching function for call to 'memset'
(candidate: __device__ void *memset(...) -- call from __host__)
This was a latent fragility: it only compiled when some other header pulled in
<cstring> transitively first. Make the dependency explicit so the host string
functions are always visible alongside the device overloads; HIP host/device
overloading keeps the two distinct (verified: device kernels using memset and
host code using memset/strcmp both compile with both declarations in scope).
Fixes the build-and-test-libraries (rocPRIM/hipCUB/rocThrust/...) failures.
<cstring> adds std::memset/std::memcpy as __host__ functions. Device kernels that call std::memset (e.g. rocPRIM's test_block_radix_sort) then get "reference to __host__ function in __global__ function". Add __device__ overloads of both in namespace std so device code resolves to the device version.
…n rusticl Both Unit___threadfence_Positive_Basic_Shared and Unit___threadfence_block_Positive_Basic_Shared trigger amdgpu CS cancellation / context loss (hard recovery) on the W6400/rusticl runner on the shared-memory path. Extend the existing regex key to cover both variants: Unit___threadfence(_block)?_Positive_Basic_Shared_ThreadfenceTest Root cause is a Mesa/radeonsi driver limitation; not a chipStar regression.
…ames The bumped hip-tests submodule dropped the _ThreadfenceTest suffix from CTest names. Make the suffix optional so the pattern matches both Unit___threadfence_Positive_Basic_Shared (new) and Unit___threadfence_Positive_Basic_Shared_ThreadfenceTest (old).
The bumped hip-tests submodule no longer appends the test executable name as a suffix to CTest test names. Update 238 existing entries to drop suffixes like _DeviceTest, _ErrorHandlingTest, _StreamTest, etc. so they match the new (suffix-free) test names.
Commit 1181a15 stripped suffixes like _PrintfTest, _ComplexTest, _LaunchBoundsTest, _AssertionTest, _synchronizationTests from known_failures entries under the assumption that the hip-tests submodule dropped them from CTest names. That assumption is wrong: the tests still register with the suffix in CTest, so the exclusion patterns no longer matched and 14 tests ran and failed on macOS/pastrami. Restore the suffixed names for the 14 affected entries across ALL sections (ANY/ALL, ANY/OPENCL_POCL, etc.) so the -E exclusion patterns correctly match the actual CTest test names again.
All hip-tests executables still append their name as a CTest suffix via the CATCH2_DISCOVER_TESTS_COMPILE_TIME mechanism. The bulk strip in 1181a15 incorrectly removed 238 suffix patterns, causing masked tests to run and fail. Restore known_failures.yaml to the state at cffa961 (only the threadfence block-vs-device regex broadening).
…-decls The new-hip-tests / HIP7 work needs device-side memset/memcpy plus host string functions to coexist with libstdc++ <cstring> in single-source HIP. Two problems had to be solved together: 1. Host build break. <cstring> does 'using ::memset/::memcpy', so *defining* std::memset/std::memcpy (as the earlier approach did) collides with it: 'target of using declaration conflicts with declaration already in scope'. That broke host compilation on every lane (pocl, macOS, salami, rusticl), and on macOS surfaced as 'call to memcpy is ambiguous' in hip_fp16_gcc.h. Fix: instead of redefining them, declare the __device__ memset/memcpy overloads at global scope and surface them into std:: with using-decls. A duplicate 'using ::memset/::memcpy' is harmless and, placed after the __device__ overloads, makes std::memcpy/std::memset resolve to the device (__chip_*) versions in device code. 2. Library device pass. Library headers (e.g. rocPRIM) reference strcmp/ memset/memcpy in __host__ functions (device-property setup), whose bodies are still parsed during the device pass. Guarding <cstring> out of device compilation (a previous attempt to keep the SPIR-V module small) left those identifiers undeclared and broke the device pass -- hipCUB/rocPRIM/rocRAND failed to build on the x86 Intel GPU libraries lane. Fix: include <cstring> unconditionally. It contributes only declarations, not definitions, so it does not enlarge the module; the using-decls above keep device mem* calls pointing at __chip_*. The DG2 IGC kernel-drop that motivated the device-pass <cstring> guard is tracked/handled separately (PR #1297) and is not reintroduced by this change.
Enabling IGC fp64 emulation in the library build step (so rocRAND's double kernels compile) has a side effect: chip-kernel-verify's ocloc compile now succeeds where on other branches it errors with 'Double type is not supported' and verification is silently skipped. With it running for real, it surfaces IGC's #403 RetryManager kernel drops in rocPRIM/rocRAND -- a known upstream IGC bug, not something this PR introduces or fixes. Set HIPCC_VERIFY=warn so the verifier reports the dropped kernels as warnings (keeping visibility) without failing the library build.
…s suite Validated the new hip-tests suite on Aurora (Intel PVC, llvm-22-native) on both Level Zero and OpenCL. After excluding these, both backends pass 100% (level0 582/582, opencl 530/530): - Stress_* (multi-device memcpy, MallocManaged/HMM oversubscription, prefetch, max-allocation, stream-create, module load/unload): resource/timing-sensitive on a single-tile PVC run; time out, OOM, or exercise unsupported capabilities. - Perf_* (memory benchmarks, dispatch speed, HostNumaAlloc): timing benchmarks not gated in CI; HostNumaAlloc segfaults (NUMA host alloc unsupported). - TestPositiveHasNoIGBAs: IGBA detection is platform-sensitive on Aurora (both backends); already a known failure on macOS PoCL. All under the existing 'x4\\d\\d\\dc\\ds\\db0n0' (aurora) host key.
Collaborator
Author
|
/run-aurora-ci |
Contributor
|
Thanks! It looks like far less tests than before, it was something over 1000 previously. Were a lot of tests changed or dropped? |
Catch2 test discovery (catch_discover_tests) runs each test binary with --list-tests, which triggers the compiler-inserted __hipRegisterFatBinary static initializer -> CHIPInitialize(). On a machine without a GPU this failed, killing discovery before any test could be listed -- so building the hip-tests suite on a GPU-less node registered zero Catch2 tests. __hipRegisterFatBinary already wraps CHIPInitialize() in try/catch to tolerate this (see 'Allow test discovery without devices'), but two things defeated it: - CHIPInitializeCallOnce() reported the explicit-backend 'no device' failure via CHIPERR_LOG_AND_ABORT (std::abort), which escapes the catch. Switch it to CHIPERR_LOG_AND_THROW, matching the default-backend path; the exception is now caught and real HIP API calls still surface hipErrorInitializationError. - std::call_once does not latch when its callable throws, so init was retried for every registered fat binary (one per kernel TU), hanging startup. Guard the call so it is attempted at most once. With both, 'UnitDeviceTests --list-tests' etc. enumerate their cases on a node with no GPU, and the full ~1900-test suite registers as expected.
The '.*_MemoryTest1' / '.*_MemoryTest2' wildcards were hiding ~410 memory tests that actually pass on chipStar now that Device::getAttr handles the new HIP7 attributes, and the make_vector wildcard hid all 48 make_vector tests that now compile and pass. Replace the two memory wildcards with the 79 specific cases that still fail (verified on Aurora Level Zero + OpenCL) and drop the make_vector wildcard entirely. Verified on Aurora: Level Zero 100% (1001 tests), OpenCL 100% (962 tests), with the ~458 recovered tests now actually executing.
Pulls in hip-tests b86c0e6, which clamps the helper-kernel launch block
size in memcpy1d/memcpy2d test commons to the device's maxThreadsPerBlock.
Fixes Unit_hipMemcpy{,2D,DtoH,DtoHAsync}_Positive_Basic failing with
hipErrorInvalidValue / CL_INVALID_WORK_GROUP_SIZE on devices that report a
per-block limit below 1024 (Intel iGPUs, some rusticl/radeonsi configs).
These were previously hidden by the over-broad _MemoryTest exclusion.
This negative test aborts on rusticl/radeonsi with "Destination is nullptr" because rusticl cannot consume program-scope CrossWorkgroup globals (the symbol address resolves to null) — the same root cause already recorded for Unit_hipMemcpyFromToSymbol_Negative. Surfaced by narrowing the over-broad _MemoryTest exclusion. rusticl-specific (passes elsewhere).
rusticl/radeonsi advertises maxThreadsPerBlock=1024 for the W6400, but the buffer-fill helper kernels (VectorSet/Iota) have a lower achievable work-group size, so the fill launch is rejected with CL_INVALID_WORK_GROUP_SIZE even after the test clamps to the reported device max (hip-tests b86c0e6). This is a rusticl/radeonsi advertise-vs-actual mismatch; the same tests pass on devices that report their true per-block limit (Intel iGPU 512, Intel PVC 1024). These were previously hidden by the over-broad _MemoryTest exclusion.
Collaborator
Author
|
/run-aurora-ci |
Collaborator
Author
1.8k tests total. I ran on Aurora and hip-tests had the old way of detecting tests which required compilation on a GPU node. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
No description provided.