Skip to content

Switch over to hip-tests by default + HIP7 Fixes#1161

Merged
pvelesko merged 18 commits into
mainfrom
new-hip-tests
Jun 13, 2026
Merged

Switch over to hip-tests by default + HIP7 Fixes#1161
pvelesko merged 18 commits into
mainfrom
new-hip-tests

Conversation

@pvelesko

Copy link
Copy Markdown
Collaborator

No description provided.

@pvelesko pvelesko force-pushed the new-hip-tests branch 3 times, most recently from c3ff91c to c903bc2 Compare March 1, 2026 07:13
pvelesko added a commit that referenced this pull request Mar 17, 2026
Update known_failures.yaml to achieve 100% pass rate across all platforms
after switching to hip-tests by default. The new hip-tests submodule
introduces ~480 additional tests compared to the old HIP/tests suite.

Key changes:
- Fix syncthreads regex: Unit_syncthreads_* → .*syncthreads_* to match
  the triple-underscore test names (Unit___syncthreads_*)
- Fix Unit_hipMultiThreadStreams2 regex to use .* suffix for new naming
- Add cupcake-specific exclusions for Intel GPU failures:
  - .*_MemoryTest1/2 wildcards (hipArray/hipMalloc3D not supported)
  - DeviceTest, StreamTest, EventTest, etc. timeouts/failures
- Add salami-specific timeout exclusions (ARM Mali iGPU)
- Add pastrami-specific PoCL exclusions for macOS:
  - .*_MemoryTest1/2, .*_UnitDeviceTests, .*_EventTest wildcards
  - Replace 100+ individual entries with category-level wildcards
- Update hip-tests submodule with macOS compatibility fixes

ALL.txt exclusion regex size: 18154 bytes (well under 57000 byte limit)

Test results (100% pass rate on all platforms):
- cupcake dgpu level0: 639 tests passed
- cupcake dgpu opencl: 618 tests passed
- cupcake igpu opencl: 618 tests passed
- cupcake cpu opencl:  663 tests passed
- salami igpu opencl:  1517 tests passed
- pastrami pocl opencl: 347 tests passed
@pvelesko pvelesko force-pushed the new-hip-tests branch 4 times, most recently from ecaa5dd to 0ab4702 Compare March 20, 2026 16:58
@pvelesko pvelesko force-pushed the new-hip-tests branch 5 times, most recently from 74d561b to b7391a9 Compare June 7, 2026 15:45
@pvelesko

pvelesko commented Jun 7, 2026

Copy link
Copy Markdown
Collaborator Author

/run-aurora-ci

@pvelesko pvelesko force-pushed the new-hip-tests branch 2 times, most recently from 88afa25 to 4074cd5 Compare June 8, 2026 06:59
pvelesko added 10 commits June 12, 2026 09:27
- Switch default test suite from hip-testsuite to hip-tests
- Fix device-side __assert_rtn for macOS kernel assert support
- Fix cooperative_groups assert with __builtin_trap on macOS
- Return correct error codes for invalid kernel block dimensions
- Implement missing hipDeviceGetAttribute cases (40+ attributes)
- Update known_failures.yaml for all platforms (cupcake, meatloaf,
  pastrami, salami, pocl-cpu CI)
- Move PoCL category wildcards to ANY.OPENCL_POCL for CI coverage
The build-and-test-libraries job sets IGC_EnableDPEmulation/
OverrideDefaultFP64Settings only in the 'Run library tests' step, but
rocRAND's double-precision kernels (poisson/xorwow generators) are
compiled during the 'Build chipStar + all libraries' step. On the Intel
Arc runner (no native fp64) those compiles fail with 'Double type is not
supported on this platform'. Set the emulation flags for the build too.
…eonsi

The new hip-tests suite introduces test names not covered by the
chipstar-rusticl known-failures section, surfacing genuine rusticl/radeonsi
(Mesa/ACO) driver gaps:
- Unit_kernel_chkConstantViaKernel_KernelTest and
  Unit_kernel_chkGlobalArrAndGlobalVaribleViaKernelFn_KernelTest abort because
  Mesa cannot consume program-scope CrossWorkgroup globals (__constant__/global
  vars) -- the same gap already documented for this runner.
- Unit___threadfence_block_Positive_Basic_Shared_ThreadfenceTest hangs the GPU
  (amdgpu CS cancelled / hard recovery) on radeonsi.
All three pass on Level Zero and PoCL; only the experimental rusticl runner is
affected.
sync_and_util.hh declares device-side memset()/memcpy() overloads at global
scope but only included <cstddef>/<cstdint>. Host code that reaches these
declarations without <cstring> in scope (e.g. rocPRIM/hipCUB device headers
compiled for the host via the cuspv/CUDA-compat include path) then sees only
the __device__ memset overload -- and no host strcmp at all -- producing:

  error: use of undeclared identifier 'strcmp'
  error: no matching function for call to 'memset'
         (candidate: __device__ void *memset(...) -- call from __host__)

This was a latent fragility: it only compiled when some other header pulled in
<cstring> transitively first. Make the dependency explicit so the host string
functions are always visible alongside the device overloads; HIP host/device
overloading keeps the two distinct (verified: device kernels using memset and
host code using memset/strcmp both compile with both declarations in scope).

Fixes the build-and-test-libraries (rocPRIM/hipCUB/rocThrust/...) failures.
<cstring> adds std::memset/std::memcpy as __host__ functions.
Device kernels that call std::memset (e.g. rocPRIM's
test_block_radix_sort) then get "reference to __host__ function
in __global__ function". Add __device__ overloads of both in
namespace std so device code resolves to the device version.
…n rusticl

Both Unit___threadfence_Positive_Basic_Shared and Unit___threadfence_block_Positive_Basic_Shared
trigger amdgpu CS cancellation / context loss (hard recovery) on the W6400/rusticl runner
on the shared-memory path. Extend the existing regex key to cover both variants:
  Unit___threadfence(_block)?_Positive_Basic_Shared_ThreadfenceTest
Root cause is a Mesa/radeonsi driver limitation; not a chipStar regression.
…ames

The bumped hip-tests submodule dropped the _ThreadfenceTest suffix
from CTest names. Make the suffix optional so the pattern matches
both Unit___threadfence_Positive_Basic_Shared (new) and
Unit___threadfence_Positive_Basic_Shared_ThreadfenceTest (old).
The bumped hip-tests submodule no longer appends the test executable
name as a suffix to CTest test names. Update 238 existing entries to
drop suffixes like _DeviceTest, _ErrorHandlingTest, _StreamTest, etc.
so they match the new (suffix-free) test names.
Commit 1181a15 stripped suffixes like _PrintfTest, _ComplexTest,
_LaunchBoundsTest, _AssertionTest, _synchronizationTests from known_failures
entries under the assumption that the hip-tests submodule dropped them from
CTest names. That assumption is wrong: the tests still register with the
suffix in CTest, so the exclusion patterns no longer matched and 14 tests
ran and failed on macOS/pastrami.

Restore the suffixed names for the 14 affected entries across ALL sections
(ANY/ALL, ANY/OPENCL_POCL, etc.) so the -E exclusion patterns correctly
match the actual CTest test names again.
All hip-tests executables still append their name as a CTest suffix via
the CATCH2_DISCOVER_TESTS_COMPILE_TIME mechanism. The bulk strip in
1181a15 incorrectly removed 238 suffix patterns, causing masked
tests to run and fail. Restore known_failures.yaml to the state at
cffa961 (only the threadfence block-vs-device regex broadening).
…-decls

The new-hip-tests / HIP7 work needs device-side memset/memcpy plus host
string functions to coexist with libstdc++ <cstring> in single-source HIP.
Two problems had to be solved together:

1. Host build break. <cstring> does 'using ::memset/::memcpy', so *defining*
   std::memset/std::memcpy (as the earlier approach did) collides with it:
   'target of using declaration conflicts with declaration already in scope'.
   That broke host compilation on every lane (pocl, macOS, salami, rusticl),
   and on macOS surfaced as 'call to memcpy is ambiguous' in hip_fp16_gcc.h.
   Fix: instead of redefining them, declare the __device__ memset/memcpy
   overloads at global scope and surface them into std:: with using-decls.
   A duplicate 'using ::memset/::memcpy' is harmless and, placed after the
   __device__ overloads, makes std::memcpy/std::memset resolve to the device
   (__chip_*) versions in device code.

2. Library device pass. Library headers (e.g. rocPRIM) reference strcmp/
   memset/memcpy in __host__ functions (device-property setup), whose bodies
   are still parsed during the device pass. Guarding <cstring> out of device
   compilation (a previous attempt to keep the SPIR-V module small) left those
   identifiers undeclared and broke the device pass -- hipCUB/rocPRIM/rocRAND
   failed to build on the x86 Intel GPU libraries lane. Fix: include <cstring>
   unconditionally. It contributes only declarations, not definitions, so it
   does not enlarge the module; the using-decls above keep device mem* calls
   pointing at __chip_*.

The DG2 IGC kernel-drop that motivated the device-pass <cstring> guard is
tracked/handled separately (PR #1297) and is not reintroduced by this change.
pvelesko added 2 commits June 12, 2026 13:30
Enabling IGC fp64 emulation in the library build step (so rocRAND's double
kernels compile) has a side effect: chip-kernel-verify's ocloc compile now
succeeds where on other branches it errors with 'Double type is not supported'
and verification is silently skipped. With it running for real, it surfaces
IGC's #403 RetryManager kernel drops in rocPRIM/rocRAND -- a known upstream IGC
bug, not something this PR introduces or fixes.

Set HIPCC_VERIFY=warn so the verifier reports the dropped kernels as warnings
(keeping visibility) without failing the library build.
…s suite

Validated the new hip-tests suite on Aurora (Intel PVC, llvm-22-native) on
both Level Zero and OpenCL. After excluding these, both backends pass 100%
(level0 582/582, opencl 530/530):

- Stress_* (multi-device memcpy, MallocManaged/HMM oversubscription, prefetch,
  max-allocation, stream-create, module load/unload): resource/timing-sensitive
  on a single-tile PVC run; time out, OOM, or exercise unsupported capabilities.
- Perf_* (memory benchmarks, dispatch speed, HostNumaAlloc): timing benchmarks
  not gated in CI; HostNumaAlloc segfaults (NUMA host alloc unsupported).
- TestPositiveHasNoIGBAs: IGBA detection is platform-sensitive on Aurora (both
  backends); already a known failure on macOS PoCL.

All under the existing 'x4\\d\\d\\dc\\ds\\db0n0' (aurora) host key.
@pvelesko

pvelesko commented Jun 12, 2026

Copy link
Copy Markdown
Collaborator Author

/run-aurora-ci

@colleeneb

Copy link
Copy Markdown
Contributor

Thanks! It looks like far less tests than before, it was something over 1000 previously. Were a lot of tests changed or dropped?

pvelesko added 5 commits June 13, 2026 10:05
Catch2 test discovery (catch_discover_tests) runs each test binary with
--list-tests, which triggers the compiler-inserted __hipRegisterFatBinary
static initializer -> CHIPInitialize(). On a machine without a GPU this
failed, killing discovery before any test could be listed -- so building the
hip-tests suite on a GPU-less node registered zero Catch2 tests.

__hipRegisterFatBinary already wraps CHIPInitialize() in try/catch to tolerate
this (see 'Allow test discovery without devices'), but two things defeated it:

- CHIPInitializeCallOnce() reported the explicit-backend 'no device' failure
  via CHIPERR_LOG_AND_ABORT (std::abort), which escapes the catch. Switch it to
  CHIPERR_LOG_AND_THROW, matching the default-backend path; the exception is now
  caught and real HIP API calls still surface hipErrorInitializationError.
- std::call_once does not latch when its callable throws, so init was retried
  for every registered fat binary (one per kernel TU), hanging startup. Guard
  the call so it is attempted at most once.

With both, 'UnitDeviceTests --list-tests' etc. enumerate their cases on a node
with no GPU, and the full ~1900-test suite registers as expected.
The '.*_MemoryTest1' / '.*_MemoryTest2' wildcards were hiding ~410 memory
tests that actually pass on chipStar now that Device::getAttr handles the
new HIP7 attributes, and the make_vector wildcard hid all 48 make_vector
tests that now compile and pass. Replace the two memory wildcards with the
79 specific cases that still fail (verified on Aurora Level Zero + OpenCL)
and drop the make_vector wildcard entirely.

Verified on Aurora: Level Zero 100% (1001 tests), OpenCL 100% (962 tests),
with the ~458 recovered tests now actually executing.
Pulls in hip-tests b86c0e6, which clamps the helper-kernel launch block
size in memcpy1d/memcpy2d test commons to the device's maxThreadsPerBlock.
Fixes Unit_hipMemcpy{,2D,DtoH,DtoHAsync}_Positive_Basic failing with
hipErrorInvalidValue / CL_INVALID_WORK_GROUP_SIZE on devices that report a
per-block limit below 1024 (Intel iGPUs, some rusticl/radeonsi configs).
These were previously hidden by the over-broad _MemoryTest exclusion.
This negative test aborts on rusticl/radeonsi with "Destination is nullptr"
because rusticl cannot consume program-scope CrossWorkgroup globals (the
symbol address resolves to null) — the same root cause already recorded for
Unit_hipMemcpyFromToSymbol_Negative. Surfaced by narrowing the over-broad
_MemoryTest exclusion. rusticl-specific (passes elsewhere).
rusticl/radeonsi advertises maxThreadsPerBlock=1024 for the W6400, but the
buffer-fill helper kernels (VectorSet/Iota) have a lower achievable work-group
size, so the fill launch is rejected with CL_INVALID_WORK_GROUP_SIZE even after
the test clamps to the reported device max (hip-tests b86c0e6). This is a
rusticl/radeonsi advertise-vs-actual mismatch; the same tests pass on devices
that report their true per-block limit (Intel iGPU 512, Intel PVC 1024). These
were previously hidden by the over-broad _MemoryTest exclusion.
@pvelesko

Copy link
Copy Markdown
Collaborator Author

/run-aurora-ci

@pvelesko

Copy link
Copy Markdown
Collaborator Author

Thanks! It looks like far less tests than before, it was something over 1000 previously. Were a lot of tests changed or dropped?

1.8k tests total. I ran on Aurora and hip-tests had the old way of detecting tests which required compilation on a GPU node.

@pvelesko pvelesko merged commit 82fe3dd into main Jun 13, 2026
29 checks passed
@pvelesko pvelesko deleted the new-hip-tests branch June 13, 2026 15:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants