[WIP][Feature] 2-SM support for TMA, TMEM and TCGEN5MMA on Blackwell #1882
Rachmanino wants to merge 16 commits into tile-ai:main
- Introduced new built-in operations for cluster synchronization: `cluster_arrive_relaxed`, `cluster_arrive`, `cluster_wait`, `cluster_sync`, and `block_rank_in_cluster`.
- Updated the CUDA code generator to handle these new operations.
- Added corresponding Python bindings and documentation for the new cluster functions.
- Included necessary header files for cluster operations in the CUDA code generation process.
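To illustrate how the split arrive/wait operations listed above relate to a full cluster sync, here is a minimal host-side model in plain Python threads. It is only a sketch of the semantics, not TileLang or CUDA code: the class `ClusterBarrierModel`, the function `run_cluster`, and the phase token returned by `arrive` are hypothetical names introduced for illustration, and the model does not capture the weaker memory ordering of the relaxed arrive variant.

```python
import threading


class ClusterBarrierModel:
    """Hypothetical host-side model of a split arrive/wait barrier."""

    def __init__(self, num_blocks):
        self.num_blocks = num_blocks
        self._cond = threading.Condition()
        self._arrived = 0
        self._phase = 0

    def arrive(self):
        """Record arrival without blocking; return the phase to wait on."""
        with self._cond:
            phase = self._phase
            self._arrived += 1
            if self._arrived == self.num_blocks:
                # Last participant: open the barrier and start a new phase.
                self._arrived = 0
                self._phase += 1
                self._cond.notify_all()
            return phase

    def wait(self, phase):
        """Block until every participant has arrived for `phase`."""
        with self._cond:
            self._cond.wait_for(lambda: self._phase > phase)

    def sync(self):
        """Full barrier: arrive immediately followed by wait."""
        self.wait(self.arrive())


def run_cluster(num_blocks=2):
    """Each modeled block publishes a value, then reads its peer's value."""
    barrier = ClusterBarrierModel(num_blocks)
    shared = [None] * num_blocks  # stands in for per-block shared memory
    seen = [None] * num_blocks

    def block(rank):
        shared[rank] = rank * 10       # publish data for the peer
        token = barrier.arrive()       # signal arrival, do not block yet
        local = rank + 1               # independent work overlaps the barrier
        barrier.wait(token)            # peer data is now safe to read
        peer = (rank + 1) % num_blocks
        seen[rank] = shared[peer] + local
        barrier.sync()                 # keep data alive until all reads finish

    threads = [threading.Thread(target=block, args=(r,)) for r in range(num_blocks)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return seen


if __name__ == "__main__":
    print(run_cluster(2))
```

Splitting the barrier into an arrive and a later wait lets each block overlap independent work with the synchronization, while the combined sync is the simpler choice when there is nothing to overlap.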
- Introduced `ptx_arrive_cluster_barrier` built-in for arriving at cluster barriers.
- Added `alloc_cluster_barrier` function for allocating cluster barrier buffers.
- Updated CUDA code generation to handle the new cluster barrier operations.
- Enhanced existing functions to support shared and cluster barrier scopes.
- Added tests for cluster barrier functionality in the TileLang testing suite.
No description provided.