Is your feature request related to a problem? Please describe.
Sparse attention mechanisms such as Native Sparse Attention (NSA), from DeepSeek's paper "Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention", address the quadratic cost of full attention by reducing attention complexity while preserving model quality. However, Megatron-LM currently lacks support for these methods.
Describe the solution you'd like
I will extend the existing DSA implementation under megatron/core/transformer/experimental_attention_variant/dsa.py, adding a design in which token compression, top-k token selection, and sliding-window attention can be configured per layer.
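To make the per-layer configuration idea concrete, here is a minimal sketch of what such a config could look like. All names (NSALayerConfig, compression_block_size, selection_top_k, etc.) are hypothetical and chosen for illustration only; the actual fields would follow Megatron-LM's TransformerConfig conventions.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class NSALayerConfig:
    """Hypothetical per-layer NSA settings (all names illustrative)."""
    compression_block_size: int = 32       # tokens per compressed key/value block
    selection_top_k: int = 16              # number of blocks kept by top-k selection
    sliding_window: Optional[int] = 512    # local window size; None disables the branch


@dataclass
class NSAConfig:
    """Maps transformer layer indices to per-layer NSA settings."""
    default: NSALayerConfig = field(default_factory=NSALayerConfig)
    per_layer_overrides: Dict[int, NSALayerConfig] = field(default_factory=dict)

    def layer(self, idx: int) -> NSALayerConfig:
        # Fall back to the shared default when a layer has no override.
        return self.per_layer_overrides.get(idx, self.default)


# Example: disable the sliding-window branch only in layer 0.
cfg = NSAConfig(per_layer_overrides={0: NSALayerConfig(sliding_window=None)})
assert cfg.layer(0).sliding_window is None
assert cfg.layer(5).selection_top_k == 16
```

This keeps a single shared default while letting individual layers opt out of, or retune, each of the three NSA branches.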
Describe alternatives you've considered
External libraries implement similar sparse attention mechanisms; however, we want native, in-house support in Megatron-LM.
Additional context
- The NSA paper: arxiv.org/abs/2502.11089
- DeepSeek V3.2 reference implementation: github.com/deepseek-ai/DeepSeek-V3.2-Exp
- The existing experimental DSA implementation in this repo (megatron/core/transformer/experimental_attention_variant/dsa.py) already implements key components: top-k indexing, KL-divergence indexer loss, and Hadamard rotation. This provides a strong foundation to build upon.
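For reference, the top-k selection step that the existing DSA indexer and the NSA paper share can be sketched as follows. This is an illustrative NumPy example, not the actual dsa.py code: it assumes block-importance scores have already been computed (e.g. as attention of queries over compressed block representations) and shows only the per-query top-k block selection.

```python
import numpy as np


def topk_block_indices(scores: np.ndarray, k: int) -> np.ndarray:
    """Select, per query, the k key blocks with the highest importance scores.

    scores: (num_queries, num_blocks) block-importance scores.
    Returns: (num_queries, k) block indices, sorted by descending score.
    """
    # argpartition finds the top-k set in O(num_blocks); we then sort only
    # those k entries to get a descending order.
    part = np.argpartition(scores, -k, axis=-1)[:, -k:]
    order = np.argsort(np.take_along_axis(scores, part, axis=-1), axis=-1)[:, ::-1]
    return np.take_along_axis(part, order, axis=-1)


scores = np.array([[0.1, 0.9, 0.3, 0.7],
                   [0.5, 0.2, 0.8, 0.4]])
print(topk_block_indices(scores, 2))  # -> [[1 3], [2 0]]
```

The selected indices would then gather the corresponding uncompressed key/value blocks for the fine-grained attention branch.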