
Conversation

@yhmtsai
Contributor

@yhmtsai yhmtsai commented Mar 3, 2025

This PR creates a rocSPARSE backend and SpMV on AMD GPUs.
Additionally, it makes the tests runnable on the device.

@yhmtsai yhmtsai requested a review from BenBrock March 3, 2025 10:15
@yhmtsai yhmtsai self-assigned this Mar 3, 2025
@yhmtsai yhmtsai changed the title Create rocsparse and corresponding spmv Create rocSPARSE and corresponding spmv Mar 3, 2025
@yhmtsai yhmtsai force-pushed the dev/yhmtsai/rocsparse_spmv branch 2 times, most recently from 85a5f2b to fbb9106 Compare March 6, 2025 14:19
@yhmtsai
Contributor Author

yhmtsai commented Mar 6, 2025

@BenBrock It seems we will merge this once we have a CI to check it as well.
I have also updated the test so that it can handle device data.
Some components help unify the code, such as an array class that handles alloc/free automatically and a copy function between host and device.
In general, I try to keep them out of csr_view and the library itself.
I do not think they are in perfect shape, but I would say we go with this first if there is no fatal issue.

@BenBrock
Collaborator

BenBrock commented Mar 6, 2025

Got it, so I should work on merging this rocSPARSE PR first (before looking at the CUDA/HIP ones)?

@yhmtsai
Contributor Author

yhmtsai commented Mar 6, 2025

Yes. I have also closed the two old PRs.
After this, I can continue adding tests for the corresponding backends rather than relying on device_example, which is not a proper way to show that it works.

Collaborator

@BenBrock BenBrock left a comment

Thanks for the PR, @yhmtsai, this is a good start. There are a couple of changes I think we should make:

  • I'd like to start by sticking to rocThrust for managing memory as much as possible. This would mean using thrust::device_vector instead of a custom array class and thrust::device_allocator instead of our own custom memory allocator.
  • It seems we're currently mixing HIP and ROCm here—@mshanthagit and @YvanMokwinski, can you comment whether it's appropriate to use HIP here, or are there ROCm memory management routines we should use instead?

I have luckily been able to get access to a machine with AMD GPUs, so I will work on integrating rocThrust into the CMake build today if I have time.

* allocator base class. When a user provides an allocator, the
* implementation should inherit from this class.
*/
class allocator {
Collaborator

I think we should try to replace this with thrust::device_allocator from rocThrust.

Contributor Author

Thrust will only work for cuSPARSE and rocSPARSE.
@spencerpatty, correct me if I'm wrong, but I think oneMKL does not have the Thrust library.

Contributor Author

Also, we should not limit how users provide the allocator.
If we restrict it to device_allocator, there is no benefit in letting users provide their own allocator, because they would be forced to use the Thrust device allocator.

Contributor

Yes, that is correct: oneMKL does not have the Thrust library. The closest is oneDPL: https://github.com/uxlfoundation/oneDPL


Let's have a basic allocator and avoid having dependency on the library side. Let's keep it simple.


// A class that manages data through an allocator, with automatic cleanup
template <typename ValueType>
class array {
Collaborator

I think we should try to replace a custom array class with thrust::device_vector from rocThrust.

Contributor Author

It is an internal helper, not meant for users, IMO, so I do not hold the same objection as for the allocator.
Thus, if we are okay with providing a different device_vector for each backend, it should be doable, although it introduces another dependency.


We have 'mature' customized data structures too, if needed, for the client side. The use of rocThrust in the implementation is debatable, it all depends on the final purpose of the code.


namespace spblas::detail {

class rocm_allocator : public spblas::allocator {
Collaborator

I think we can use rocThrust's allocator here.

@YvanMokwinski YvanMokwinski Mar 7, 2025

Why not. I'm not familiar with the rocThrust allocator. The definition of the spblas::allocator is low-level. I'm not sure why alloc and free should be const methods; I think they shouldn't be. Also, think about streams. Will streams be available in this allocator?

Contributor Author

Yes, it is doable, but it will come with the execution policy, so that the routine runs on the same stream as the allocator. (We could also allow the routine to take the stream directly; however, the execution policy is probably the right place to handle this.)

class stream_allocator : public spblas::allocator {
public:
  stream_allocator(hipStream_t stream) : stream_(stream) {}
  void* alloc(size_t size) const override {
    void* ptr;
    hipMallocAsync(&ptr, size, stream_);
    return ptr;
  }
  ...
private:
  hipStream_t stream_;
};

I made them const because I only had simple malloc and free in mind, which do not change internal data.
Given the possibility of a pool memory allocator, I will change them to non-const so that the allocator can modify its internal data structures without the mutable keyword.


Let's have the right things now rather than later.
Having some prototype code is suitable for iterating on the design, but the design is incomplete. I hope we are not developing too much before fixing things that need to be fixed.

Contributor Author

I have added the stream allocator prototype in the comments. Yes, a prototype will need to change when the design changes, but that is exactly the benefit of a prototype for us. I would still say we should not hold this up over the execution policy. After merging this, we can raise the concern in the right place. Otherwise, I feel the reference implementation might make it harder to merge the other pieces that bring the device library into the framework.

@YvanMokwinski

Thanks for the PR, @yhmtsai, this is a good start. There are a couple of changes I think we should make:

  • I'd like to start by sticking to rocThrust for managing memory as much as possible. This would mean using thrust::device_vector instead of a custom array class and thrust::device_allocator instead of our own custom memory allocator.
  • It seems we're currently mixing HIP and ROCm here—@mshanthagit and @YvanMokwinski, can you comment whether it's appropriate to use HIP here, or are there ROCm memory management routines we should use instead?

I have luckily been able to get access to a machine with AMD GPUs, so I will work on integrating rocThrust into the CMake build today if I have time.

I suggest minimizing the dependencies, but I understand using rocThrust would make life easier. So, sure, that's a good idea. Let's not reinvent the wheel.

Yes, hipMemcpy is appropriate to implement your function copy_to_device. Just to let you know, it is a blocking function, so it'll trigger synchronization. Also, if you don't want to bother, you can replace hipMemcpyHostToDevice with hipMemcpyDefault.



@YvanMokwinski

We need to solve the issue of passing the stream to the allocator now, rather than later, and use hipMallocAsync, hipMemcpyAsync etc.

@yhmtsai yhmtsai force-pushed the dev/yhmtsai/rocsparse_spmv branch from 07943b5 to 2de1ee5 Compare March 10, 2025 21:24
@yhmtsai
Contributor Author

yhmtsai commented Mar 10, 2025

@BenBrock @YvanMokwinski I have tried rocThrust a bit. It seems to require hipcc to compile.
Using g++ as the compiler:

target_link_libraries(spblas-tests spblas roc::rocthrust)
# we need to use HIP as the backend when using a non-hipcc compiler
target_compile_definitions(spblas-tests PRIVATE THRUST_DEVICE_SYSTEM=THRUST_DEVICE_SYSTEM_HIP)

rocThrust will try to add --offload-arch=<gfx...>, which is unavailable in g++.

Using hipcc:

  enable_language(HIP)
  add_executable(spblas-tests spmv_test.cpp)
  set_source_files_properties(spmv_test.cpp PROPERTIES LANGUAGE HIP)
  set(CMAKE_HIP_FLAGS "${CMAKE_HIP_FLAGS} -DSPBLAS_ENABLE_ROCSPARSE")

Given thrust::device_ptr<T> a;, a.data() returns device_ptr<T>, not T*, so we will also need to adapt to this difference between backends.

If CUDA Thrust also requires nvcc, we might find that nvcc does not support the C++ standard we use here.
Also, oneMKL does not have the Thrust library.

@YvanMokwinski

@BenBrock @YvanMokwinski I have tried rocThrust a bit. It seems to require hipcc to compile. Using g++ as the compiler:

target_link_libraries(spblas-tests spblas roc::rocthrust)
# we need to use hip as backend when using non-hipcc 
target_compile_definitions(spblas-tests PRIVATE THRUST_DEVICE_SYSTEM=THRUST_DEVICE_SYSTEM_HIP)

rocThrust will try to add --offload-arch=<gfx...> , which is unavailable in g++.

using hipcc

  enable_language(HIP)
  add_executable(spblas-tests spmv_test.cpp)
  set_source_files_properties(spmv_test.cpp PROPERTIES LANGUAGE HIP)
  set(CMAKE_HIP_FLAGS "${CMAKE_HIP_FLAGS} -DSPBLAS_ENABLE_ROCSPARSE")

Given thrust::device_ptr<T> a;, a.data() returns device_ptr<T>, not T*, so we will also need to adapt to this difference between backends.

If CUDA Thrust also requires nvcc, we might find that nvcc does not support the C++ standard we use here. Also, oneMKL does not have the Thrust library.

https://github.com/ROCm/rocThrust#:~:text=Including%20the%20HipCC%20compiler%2C%20which%20must%20be%20set%20as%20your%20C%2B%2B%20compiler%20for%20ROCm

@BenBrock
Collaborator

@YvanMokwinski @mshanthagit Is .hip the expected extension for a file that will be compiled with hipcc, or do you normally use .cpp (or are both acceptable)?

I'm looking at the CMake and have things working locally with rocThrust. However, we need to either explicitly mark .cpp files as being the HIP language or use .hip.

@yhmtsai
Contributor Author

yhmtsai commented Mar 11, 2025

I think the suffix is .hip.cpp, but I think you still need to mark the language as HIP, as I showed above. Do we really want to use rocThrust and hipcc just for device_vector? For me, it might be too much for such a simple thing. We would still need to adapt to the different data type returned by data() in some way, so we do not gain much from device_vector, especially if it is just for handling memory.

@YvanMokwinski

YvanMokwinski commented Mar 11, 2025

@YvanMokwinski @mshanthagit Is .hip the expected extension for a file that will be compiled with hipcc, or do you normally use .cpp (or are both acceptable)?

I'm looking at the CMake and have things working locally with rocThrust. However, we need to either explicitly mark .cpp files as being the HIP language or use .hip.

You can use both. We use .cpp.
.hip.cpp could be used as a CMake trick to indicate that hipcc should be used.

@YvanMokwinski

I think the suffix is .hip.cpp, but I think you still need to mark the language as HIP, as I showed above. Do we really want to use rocThrust and hipcc just for device_vector? For me, it might be too much for such a simple thing. We would still need to adapt to the different data type returned by data() in some way, so we do not gain much from device_vector, especially if it is just for handling memory.

@BenBrock what is the purpose of using thrust on the library side?

@BenBrock
Collaborator

BenBrock commented Mar 11, 2025

@yhmtsai @YvanMokwinski rocThrust is purely for implementing tests and examples. IMO the rocSPARSE backend itself should not require rocThrust or compilation with hipcc.

I think the examples will be much cleaner if we use rocThrust, for two reasons:

  1. We can use rocThrust's device allocator instead of implementing our own.
  2. The rocSPARSE and cuSPARSE backends' tests and examples will be virtually identical if they both use Thrust. That will create less work for us.

When AMD users compile without hipcc (and thus without rocThrust) there is the issue of where they're going to get a C++ memory allocator, since AFAIK AMD does not provide one outside of rocThrust. Users could of course write their own, but we might eventually want to provide one ourselves if there's no other option outside of rocThrust, which locks you into using hipcc. (This is an AMD-only issue; the host portions of NVIDIA's Thrust can be compiled with a host compiler—no need for nvcc to use the basic memory allocator, data structures, data movement.)

@BenBrock
Collaborator

BenBrock commented Mar 11, 2025

I should also acknowledge what @yhmtsai mentioned, which was that Intel doesn't have a public implementation of Thrust. Intel does provide memory allocators, though, which solve most of the issue here. Intel also does have open-source libraries that implement the same API in terms of device vectors, and I believe SYCLomatic also provides a Thrust-like vector in automatically converted code.

@spencerpatty and I will have to figure out what to do there for the Intel GPU backend (we could either provide our own in the vendor backend or use one of the pre-existing ones, even though they're somewhat non-official). Either way, my wish is for the tests/examples for all three GPU backends to look very similar, and I think this should be very possible using Thrust/Thrust-like features.

@yhmtsai
Contributor Author

yhmtsai commented Mar 11, 2025

I have just experimented with different kinds of implementations that do not use the custom array class, such as std::vector with a custom allocator, or using unique_ptr to hold the array pointer directly; these are available in the commits if anyone is interested.
I hope we have the default tests/examples available on CPU and GPU, treating the CPU as just another kind of accelerator. That is why I am okay with creating another copy of the matrix even for a CPU backend.
Anything vendor-specific we can test additionally.

@BenBrock
Collaborator

BenBrock commented Mar 11, 2025

Here's what the tests look like with rocThrust in a slightly modified version of this PR.

It has the nice property that the test/examples for cuSPARSE and rocSPARSE will be identical. For SpGEMM (not implemented yet), we would pass in Thrust's device_allocator<T> to the operation state (although users could of course implement their own allocators, and we could add our own later).

It also has the nice property that there's a clean separation of concerns: the allocation of the data arrays and moving it around is all the user's responsibility, and they can use standard vendor tools (like Thrust) to accomplish this. (They could also not use Thrust and do everything by hand if they so wished.) SparseBLAS only has to worry about the actual computation; we're not implementing our own allocators or developing our own model of how to allocate, deallocate, and move around data.

The disadvantage is you need to compile the tests/examples with hipcc, because rocThrust requires hipcc. If we really don't want to use hipcc, I'd recommend just re-implementing what we need from Thrust using host-only HIP features (I'd be willing to do that myself, but I'd prefer to not re-implement the wheel).

@BenBrock
Collaborator

@yhmtsai Using std::vector is an issue because it won't accept allocators from Thrust or other high-level libraries that return smart device pointers. Standard library data structures require allocators that return pointer types whose dereference operator returns raw references (T&). This means that lots of simple things (e.g. resizing a vector) have undefined behavior if you use an allocator that's actually returning device memory. This is why Thrust, etc. implement their own vectors. (This is not true if you restrict yourself to unified memory, but unified memory has limitations and is not always suitable.)

unique_ptr could work, but it doesn't solve the problem of allocating memory. I think we should just use the C++ infrastructure (like Thrust) that already exists.

@YvanMokwinski

@yhmtsai @YvanMokwinski rocThrust is purely for implementing tests and examples. IMO the rocSPARSE backend itself should not require rocThrust or compilation with hipcc.

I think the examples will be much cleaner if we use rocThrust, for two reasons:

  1. We can use rocThrust's device allocator instead of implementing our own.
  2. The rocSPARSE and cuSPARSE backends' tests and examples will be virtually identical if they both use Thrust. That will create less work for us.

When AMD users compile without hipcc (and thus without rocThrust) there is the issue of where they're going to get a C++ memory allocator, since AFAIK AMD does not provide one outside of rocThrust. Users could of course write their own, but we might eventually want to provide one ourselves if there's no other option outside of rocThrust, which locks you into using hipcc. (This is an AMD-only issue; the host portions of NVIDIA's Thrust can be compiled with a host compiler—no need for nvcc to use the basic memory allocator, data structures, data movement.)

@BenBrock
Yes!
We are on the same page. When I mentioned the library side, it was as opposed to the client side, which contains tests and examples. I fully support the use of Thrust on the tests and examples.

@yhmtsai
Contributor Author

yhmtsai commented Mar 12, 2025

Yes, I know using std::vector like a device_vector is not good practice; for example, the constructor taking a size already causes an issue. I also manage the allocation for unique_ptr: the test always uses allocate_device_ptr<T>(num), or clone_to_device(std::vector) if necessary.

Last question: what do we claim about compiler compatibility?
When we compile the tests and examples with hipcc, we have no way to test them with g++. It would likely work, but it is hard for me to say the ROCm backend supports g++ without testing it.

@BenBrock
Collaborator

@YvanMokwinski and @mshanthagit, please take a look at the current state of this PR and let me know what you think. I've just updated it:

  1. rocSPARSE is added to the CI. This uses a self-hosted runner at ICL.

  2. The examples and tests now use Thrust. These can be shared between all GPU implementations (rocSPARSE, cuSPARSE, oneMKL). I also added a separate set of rocSPARSE examples that use only HIP and are compiled with GCC (or whatever host compiler you're using). That should be enough to ensure we don't add anything hipcc-specific in the rocSPARSE backend implementation.

  3. I also updated the README, updated the hip_allocator used in the backend, and made some other quality-of-life improvements.

Let me know what you think about the current state. I would like to merge this sooner rather than later so we can start working on cuSPARSE and oneMKL backends to make sure everything works together. I think the primary remaining comment @YvanMokwinski had was about the implementation of multiply using an spmv_state_t object under the covers. I don't have a strong opinion there, so I'll let you and @yhmtsai discuss.

@yhmtsai
Contributor Author

yhmtsai commented Mar 19, 2025

Because the tests are separated for CPU and GPU, we do not need to decide this now.
It concerns the default interface supported across backends.
Should users expect both spmv_state_t s() and spmv_state_t s(allocator) to always be available? spmv_state_t(allocator) is required because we need to leave control to users when they need it. Should spmv_state_t s() also be available? It is convenient for users who do not care about memory control and does not hurt the allocator interface, so I do not mind having it. I misremembered it.
Should users expect both multiply(state, ...) and multiply(...) to always be available? I think it should be the one with the state, because other functions will also need it. I do not mind providing a simple interface for users who do not care much about overhead and memory control, but it definitely introduces more inconsistency between one-stage and multi-stage calls.

yhmtsai and others added 3 commits March 19, 2025 17:12
* Use `rocThrust` for examples/tests.

* Add rocSPARSE to CI.

* Add build instructions for rocSPARSE to the repo.

* Re-write `allocator` -> `hip_allocator`

* Separate examples into `device` examples and `rocsparse` examples.
@yhmtsai yhmtsai force-pushed the dev/yhmtsai/rocsparse_spmv branch from e4cfc09 to f53cce5 Compare March 19, 2025 16:12
@YvanMokwinski YvanMokwinski left a comment

I left a bunch of comments you can feel free to ignore. I think we need to make the code more readable.

I am going to be a real contributor to this repo.

namespace spblas {

class spmv_state_t {
public:


using "this->" will significantly improve readability.

Contributor Author

Do you mean for data members as well?
I am used to using this-> only for functions.


Coding style. I use this-> whenever possible.

__detail::has_contiguous_range_base<B> &&
__ranges::contiguous_range<C>
void multiply(A&& a, B&& b, C&& c) {
auto a_base = __detail::get_ultimate_base(a);


this is not nice to read. It looks like a massive block of code.

Collaborator

I definitely think we could create better names for these concepts. For example swap csr_base for has_csr_base and dense_vector_base for has_contiguous_range_base.

However, I'm not sure we can do much better in terms of overall structure without repeating a lot of code. For example, if we were to re-write these without using these concepts, we'd need four implementations, one for each case of (1) CSR and dense vectors, (2) scaled CSR and dense vectors, (3) CSR and scaled input vector, (4) scaled CSR and scaled dense vector.

// a: csr_view<...>
// b: contiguous_range
// c: contiguous_range
template <typename T, typename I, typename O, contiguous_range B, contiguous_range C>
void multiply(csr_view<T, I, O> a, B&& b, C&& c);
// a: scaled_view<csr_view<...>>
// b: contiguous_range
// c: contiguous_range
template <typename T, typename I, typename O, contiguous_range B, contiguous_range C>
void multiply(scaled_view<csr_view<T, I, O>> a, B&& b, C&& c);
// a: csr_view<...>
// b: scaled_view<B>, B is contiguous_range
// c: contiguous_range
template <typename T, typename I, typename O, contiguous_range B, contiguous_range C>
void multiply(csr_view<T, I, O> a, scaled_view<B> b, C&& c);
// a: scaled_view<csr_view<...>>
// b: scaled_view<B>, B is contiguous_range
// c: contiguous_range
template <typename T, typename I, typename O, contiguous_range B, contiguous_range C>
void multiply(scaled_view<csr_view<T, I, O>> a, scaled_view<B> b, C&& c);

The current design allows us to write one implementation that accepts all of these inputs, then extract the scaling factor with inspectors. When we add transpose, skew-symmetric, Hermitian, etc., we can use this same mechanism without adding an additional implementation.

So while I'm all in favor of trying to make this as pretty as possible, I don't think changing the architecture significantly is going to get very far.


I wasn't talking about the concepts but more about the whole function. Reading the code is not smooth.

@@ -0,0 +1,77 @@
#pragma once

#include <complex>


are these headers needed?

Contributor Author

Yes, I also bind std::complex<float> and std::complex<double>.


namespace spblas {

using index_t = std::int64_t;


Are these aliases redefined in every backend?

Contributor Author

Yes, I think so. I would like to bring this up, too. I feel it is quite easy to mess things up.

Contributor

They are meant to be the backend-defined defaults, but other types can also be used.

Collaborator

The point of spblas::index_t and spblas::offset_t is that they provide some integral type that is valid to use with a particular backend. The user can write a code (like the examples/tests) that operates on these types and be guaranteed it will work with any backend. Otherwise, you're stuck wondering what index and offset types are valid to use.

@yhmtsai
Contributor Author

yhmtsai commented Mar 19, 2025

@YvanMokwinski I mainly updated offset_t and now use the functions with the namespace directly.
Regarding the allocator and get_ultimate_base, we can discuss them further in the meeting or create a discussion on GitHub.
I can adapt them quickly once we make a new decision.

@YvanMokwinski

@YvanMokwinski I mainly updated offset_t and now use the functions with the namespace directly. Regarding the allocator and get_ultimate_base, we can discuss them further in the meeting or create a discussion on GitHub. I can adapt them quickly once we make a new decision.

sounds good.
Let's merge.

@yhmtsai yhmtsai force-pushed the dev/yhmtsai/rocsparse_spmv branch 3 times, most recently from c9049ee to 08b3692 Compare March 19, 2025 22:37
@yhmtsai yhmtsai force-pushed the dev/yhmtsai/rocsparse_spmv branch from 08b3692 to 25a94d7 Compare March 19, 2025 22:44
Collaborator

@BenBrock BenBrock left a comment

Thanks for all your hard work, @yhmtsai, and sorry this took so long to start. Great start on the GPU backends!

@BenBrock BenBrock merged commit 168da20 into main Mar 20, 2025
10 checks passed
@BenBrock BenBrock deleted the dev/yhmtsai/rocsparse_spmv branch March 20, 2025 02:26
@YvanMokwinski

Thanks for all your hard work, @yhmtsai, and sorry this took so long to start. Great start on the GPU backends!

Agree!
