Create rocSPARSE and corresponding spmv #41
Conversation
Force-pushed from 85a5f2b to fbb9106.
@BenBrock It seems that we will merge this when we have a CI to check it too.
Got it, so I should work on merging this rocSPARSE PR first (before looking at the CUDA/HIP ones)?
Yes. I have also closed two old PRs.
BenBrock left a comment
Thanks for the PR, @yhmtsai, this is a good start. There are a couple of changes I think we should make:
- I'd like to start by sticking to rocThrust for managing memory as much as possible. This would mean using `thrust::device_vector` instead of a custom `array` class and `thrust::device_allocator` instead of our own custom memory allocator.
- It seems we're currently mixing HIP and ROCm here. @mshanthagit and @YvanMokwinski, can you comment on whether it's appropriate to use HIP here, or are there ROCm memory management routines we should use instead?
I have luckily been able to get access to a machine with AMD GPUs, so I will work on integrating rocThrust into the CMake build today if I have time.
include/spblas/allocator.hpp (Outdated)

 * allocator base class. When the user provides an allocator, its implementation
 * should inherit from this class.
 */
class allocator {
I think we should try to replace this with thrust::device_allocator from rocThrust.
Thrust will only work for cuSPARSE and rocSPARSE. @spencerpatty, correct me if I'm wrong, but I think oneMKL does not have the Thrust library.
Also, we should not limit how the user provides the allocator. If we limit it to `device_allocator`, there is no benefit in letting users supply their own allocator, because they are forced to use the Thrust device allocator.
Yes, that is correct. oneMKL does not have the Thrust library; the closest is oneDPL: https://github.com/uxlfoundation/oneDPL
Let's have a basic allocator and avoid having a dependency on the library side. Let's keep it simple.
include/spblas/array.hpp (Outdated)

// A class that manages the data through an allocator and cleans it up automatically.
template <typename ValueType>
class array {
I think we should try to replace a custom array class with thrust::device_vector from rocThrust.
It is an internal helper, not for the user, IMO, so I do not hold the same concern as for the allocator. Thus, if we are okay with providing a different `device_vector` for each backend, it should be doable, although it introduces another dependency.
We have 'mature' customized data structures too, if needed, for the client side. The use of rocThrust in the implementation is debatable; it all depends on the final purpose of the code.
namespace spblas::detail {

class rocm_allocator : public spblas::allocator {
I think we can use rocThrust's allocator here.
Why not. I'm not familiar with the rocThrust allocator. The definition of the spblas::allocator is low-level. I'm not sure why alloc and free should be const methods; I think they shouldn't be. Also, think about streams. Will streams be available in this allocator?
Yes, it is doable, but it will come with the execution policy, such that the routine runs on the same stream as the allocator. (We could also allow the routine to take the stream directly; however, the execution policy is probably the right place to handle it.)
class stream_allocator : public spblas::allocator {
public:
  stream_allocator(hipStream_t stream) : stream_(stream) {}

  void* alloc(std::size_t size) const override {
    void* ptr;
    hipMallocAsync(&ptr, size, stream_);
    return ptr;
  }
  ...

private:
  hipStream_t stream_;
};
I made them const because I was only thinking of simple malloc and free without changing internal data. Given the possibility of a pool memory allocator, I will change them to non-const so that the allocator can change its internal data structures without the mutable keyword.
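For illustration, a minimal sketch of why a pooling allocator would need non-const methods (or `mutable` members). The `pool_allocator` below is hypothetical, not part of this PR, and the base-class interface is only assumed:

```cpp
#include <cstddef>
#include <map>
#include <hip/hip_runtime.h>

// Hypothetical pooling allocator: alloc()/free() must update the free list,
// so they cannot be const unless the free list is declared mutable.
class pool_allocator /* : public spblas::allocator (interface assumed) */ {
public:
  void* alloc(std::size_t size) /* non-const */ {
    auto it = free_blocks_.lower_bound(size);
    if (it != free_blocks_.end()) {
      void* ptr = it->second;
      free_blocks_.erase(it);  // mutates internal state
      return ptr;
    }
    void* ptr = nullptr;
    hipMalloc(&ptr, size);
    return ptr;
  }

  void free(void* ptr, std::size_t size) /* non-const */ {
    free_blocks_.emplace(size, ptr);  // return the block to the pool
  }

private:
  std::multimap<std::size_t, void*> free_blocks_;
};
```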
Let's have the right things now rather than later.
Having some prototype code is suitable for iterating on the design, but the design is incomplete. I hope we are not developing too much before fixing things that need to be fixed.
I have added the stream allocator prototype in the comments. Yes, the prototype will need to change when the design changes, but that is the benefit of having a prototype. I would still say we should not hold this up because of the execution policy. After merging this, we can raise the concern in the right place. Otherwise, I feel the reference implementation might become harder to merge with the other work needed to bring the device library into the framework.
I suggest minimizing the dependencies, but I understand using rocThrust would make life easier. So, sure, that's a good idea. Let's not reinvent the wheel. Yes, hipMemcpy is appropriate to implement your function copy_to_device. Just to let you know, it is a blocking function, so it'll trigger synchronization. Also, if you don't want to bother, you can replace hipMemcpyHostToDevice with hipMemcpyDefault.
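As a rough sketch of that suggestion (the actual `copy_to_device` signature in this PR may differ):

```cpp
#include <cstddef>
#include <hip/hip_runtime.h>

// Hedged sketch: blocking host-to-device copy. hipMemcpyDefault lets the
// runtime infer the direction from the pointer attributes, so spelling out
// hipMemcpyHostToDevice is optional.
template <typename T>
void copy_to_device(T* device_dst, const T* host_src, std::size_t count) {
  hipMemcpy(device_dst, host_src, count * sizeof(T), hipMemcpyDefault);
  // hipMemcpy is blocking, so it synchronizes before returning.
}
```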
We need to solve the issue of passing the stream to the allocator now, rather than later, and use hipMallocAsync, hipMemcpyAsync, etc.
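For illustration, a minimal sketch (hypothetical helper, not code from this PR) of the stream-ordered style being suggested:

```cpp
#include <cstddef>
#include <hip/hip_runtime.h>

// Hedged sketch: allocation and copy are both enqueued on the same stream,
// so nothing blocks the host until an explicit synchronization point.
void async_upload(const float* host_src, float** device_dst,
                  std::size_t count, hipStream_t stream) {
  hipMallocAsync(reinterpret_cast<void**>(device_dst),
                 count * sizeof(float), stream);
  hipMemcpyAsync(*device_dst, host_src, count * sizeof(float),
                 hipMemcpyHostToDevice, stream);
  // ... enqueue kernels on `stream`, then hipStreamSynchronize(stream) when
  // the results are needed on the host.
}
```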
Force-pushed from 07943b5 to 2de1ee5.
@BenBrock @YvanMokwinski I have tried rocThrust a bit. It seems to require hipcc to compile; rocThrust will try to build with hipcc. If CUDA Thrust also requires nvcc, we might find that nvcc does not support the C++ standard we use here.
@YvanMokwinski @mshanthagit I'm looking at the CMake and have things working locally with rocThrust. However, we need to either explicitly mark …
I think the suffix is …
You can use both. We use .cpp.
@BenBrock what is the purpose of using Thrust on the library side?
@yhmtsai @YvanMokwinski I think the examples will be much cleaner if we use …
When AMD users compile without …
I should also acknowledge what @yhmtsai mentioned, which was that Intel doesn't have a public implementation of Thrust. Intel does provide memory allocators, though, which solve most of the issue here. Intel also does have open-source libraries that implement the same API in terms of device vectors, and I believe SYCLomatic also provides a Thrust-like vector in automatically converted code. @spencerpatty and I will have to figure out what to do there for the Intel GPU backend (we could either provide our own in the vendor backend or use one of the pre-existing ones, even though they're somewhat non-official). Either way, my wish is for the tests/examples for all three GPU backends to look very similar, and I think this should be very possible using Thrust/Thrust-like features.
I have just played around a bit with different kinds of implementations that do not use the custom array class, like std::vector with a custom allocator, or using unique_ptr to hold the array pointer directly; these are available in the commits if anyone is interested.
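For reference, a minimal sketch of the unique_ptr variant mentioned above (the helper name and exact form are assumptions, not the code in those commits):

```cpp
#include <cstddef>
#include <memory>
#include <hip/hip_runtime.h>

// Hedged sketch: the device pointer is owned by a std::unique_ptr with a
// custom deleter, so it is released automatically without a dedicated
// array class.
struct hip_deleter {
  void operator()(void* ptr) const { hipFree(ptr); }
};

template <typename T>
std::unique_ptr<T, hip_deleter> make_device_array(std::size_t count) {
  void* ptr = nullptr;
  hipMalloc(&ptr, count * sizeof(T));
  return std::unique_ptr<T, hip_deleter>(static_cast<T*>(ptr));
}
```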
Here's what the tests look like with … It has the nice property that the tests/examples for cuSPARSE and rocSPARSE will be identical. For SpGEMM (not implemented yet), we would pass in Thrust's … It also has the nice property that there's a clean separation of concerns: the allocation of the data arrays and moving them around is all the user's responsibility, and they can use standard vendor tools (like Thrust) to accomplish this. (They could also not use Thrust and do everything by hand if they so wished.) SparseBLAS only has to worry about the actual computation; we're not implementing our own allocators or developing our own model of how to allocate, deallocate, and move around data. The disadvantage is that you need to compile the tests/examples with …
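For illustration, a hedged sketch of the user-side setup being described; the library call is left as a comment because the exact `csr_view` constructor is an assumption here:

```cpp
#include <thrust/device_vector.h>
#include <vector>

int main() {
  // User side: build a small 2x2 CSR matrix on the host and move it to the
  // device with Thrust; the library never allocates or copies anything.
  std::vector<float> values_h = {1.0f, 2.0f, 3.0f};
  std::vector<int> rowptr_h = {0, 1, 3};
  std::vector<int> colind_h = {0, 0, 1};

  thrust::device_vector<float> values = values_h;  // host -> device copies
  thrust::device_vector<int> rowptr = rowptr_h;
  thrust::device_vector<int> colind = colind_h;
  thrust::device_vector<float> b(2, 1.0f);
  thrust::device_vector<float> c(2, 0.0f);

  // Library side (signature assumed, for illustration only): wrap the raw
  // device pointers in a view and run SpMV, e.g.
  //   spblas::csr_view<float, int, int> a(values.data().get(),
  //       rowptr.data().get(), colind.data().get(), {2, 2}, 3);
  //   spblas::multiply(a, b, c);   // c = a * b
  return 0;
}
```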
@yhmtsai Using …
@BenBrock …
Yes, I know it is not good practice to use std::vector as a device_vector; for example, the constructor taking a size already leads to an issue. I also manage the allocation for the unique_ptr version. The test always uses … Last question: how do we state the compiler compatibility?
@YvanMokwinski and @mshanthagit, please take a look at the current state of this PR and let me know what you think. I've just updated it:
Let me know what you think about the current state. I would like to merge this sooner rather than later so we can start working on the cuSPARSE and oneMKL backends to make sure everything works together. I think the primary remaining comment @YvanMokwinski had was about the implementation of …
Because the tests are separated for CPU and GPU, we do not need to decide it now.
* Use `rocThrust` for examples/tests.
* Add rocSPARSE to CI.
* Add build instructions for rocSPARSE to the repo.
* Re-write `allocator` -> `hip_allocator`
* Separate examples into `device` examples and `rocsparse` examples.
Force-pushed from e4cfc09 to f53cce5.
YvanMokwinski left a comment
I left a bunch of comments you can feel free to ignore. I think we need to make the code more readable.
I am going to be a real contributor to this repo.
namespace spblas {

class spmv_state_t {
public:
using "this->" will significantly improve readability.
Do you mean also for the data members? I am used to using this-> only for functions.
Coding style. I use this-> whenever possible.
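A small sketch of the convention being discussed (the `spmv_state_t` member shown here is made up for illustration):

```cpp
class spmv_state_t {
public:
  void set_buffer(void* buffer) {
    // With the suggested style, member access is spelled out explicitly.
    this->buffer_ = buffer;
  }

private:
  void* buffer_ = nullptr;
};
```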
    __detail::has_contiguous_range_base<B> &&
    __ranges::contiguous_range<C>
void multiply(A&& a, B&& b, C&& c) {
  auto a_base = __detail::get_ultimate_base(a);
this is not nice to read. It looks like a massive block of code.
I definitely think we could create better names for these concepts. For example, swap `csr_base` for `has_csr_base` and `dense_vector_base` for `has_contiguous_range_base`.
However, I'm not sure we can do much better in terms of overall structure without repeating a lot of code. For example, if we were to re-write these without using these concepts, we'd need four implementations, one for each case of (1) CSR and dense vectors, (2) scaled CSR and dense vectors, (3) CSR and scaled input vector, (4) scaled CSR and scaled dense vector.
// a: csr_view<...>
// b: contiguous_range
// c: contiguous_range
template <typename T, typename I, typename O, contiguous_range B, contiguous_range C>
void multiply(csr_view<T, I, O> a, B&& b, C&& c);

// a: scaled_view<csr_view<...>>
// b: contiguous_range
// c: contiguous_range
template <typename T, typename I, typename O, contiguous_range B, contiguous_range C>
void multiply(scaled_view<csr_view<T, I, O>> a, B&& b, C&& c);

// a: csr_view<...>
// b: scaled_view<B>, B is contiguous_range
// c: contiguous_range
template <typename T, typename I, typename O, contiguous_range B, contiguous_range C>
void multiply(csr_view<T, I, O> a, scaled_view<B> b, C&& c);

// a: scaled_view<csr_view<...>>
// b: scaled_view<B>, B is contiguous_range
// c: contiguous_range
template <typename T, typename I, typename O, contiguous_range B, contiguous_range C>
void multiply(scaled_view<csr_view<T, I, O>> a, scaled_view<B> b, C&& c);

The current design allows us to write one implementation that accepts all of these inputs, then extract the scaling factor with inspectors. When we add transpose, skew-symmetric, Hermitian, etc., we can use this same mechanism without adding an additional implementation.
So while I'm all in favor of trying to make this as pretty as possible, I don't think changing the architecture significantly is going to get very far.
I wasn't talking about the concepts but more about the whole function. Reading the code is not smooth.
@@ -0,0 +1,77 @@
#pragma once

#include <complex>
are these headers needed?
Yes, I also bind std::complex<float> and std::complex<double>.
namespace spblas {

using index_t = std::int64_t;
Are these aliases redefined in every backend?
Yes, I think so. I would like to bring this up, too. I feel it is quite easy to mess things up.
They are meant to be the backend-defined defaults, but other types are also available to be used.
The point of spblas::index_t and spblas::offset_t is that they provide some integral type that is valid to use with a particular backend. The user can write code (like the examples/tests) that operates on these types and be guaranteed it will work with any backend. Otherwise, you're stuck wondering what index and offset types are valid to use.
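For illustration, a short sketch of what portable user code looks like with these aliases (the umbrella header name is an assumption; with the rocSPARSE backend in this PR, `index_t` is `std::int64_t`):

```cpp
#include <vector>
#include <spblas/spblas.hpp>  // assumed umbrella header for the chosen backend

// Hedged sketch: using the backend-provided aliases instead of hard-coded
// integer types lets this setup compile unchanged against any backend.
std::vector<spblas::offset_t> rowptr = {0, 1, 3};
std::vector<spblas::index_t> colind = {0, 0, 1};
std::vector<float> values = {1.0f, 2.0f, 3.0f};
// These arrays can then be wrapped in a
// csr_view<float, spblas::index_t, spblas::offset_t> for the backend's multiply().
```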
@YvanMokwinski I mainly updated offset_t and now use the function with the namespace directly.
Sounds good.
Force-pushed from c9049ee to 08b3692.
Force-pushed from 08b3692 to 25a94d7.
BenBrock left a comment
Thanks for all your hard work, @yhmtsai, and sorry this took so long to start. Great start on the GPU backends!
Agree!
This PR creates the rocSPARSE backend and SpMV on AMD GPUs.
Additionally, it makes the tests also available on device.