Is your feature request related to a problem? Please describe.
Sparse attention mechanisms such as Native Sparse Attention (NSA), from DeepSeek's paper "Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention", address the quadratic cost of full attention by reducing attention complexity while preserving model quality. However, Megatron-LM currently lacks support for these methods.
Describe the solution you'd like
I will extend the existing DSA implementation under megatron/core/transformer/experimental_attention_variant/dsa.py, adding a design in which token compression, top-k token selection, and sliding-window attention can be configured per layer.
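To make the per-layer configuration idea concrete, here is a minimal sketch of what such a config could look like. All names (NSALayerConfig, compression_block_size, selection_top_k, etc.) are hypothetical and chosen for illustration only; the actual fields would follow Megatron-LM's TransformerConfig conventions.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class NSALayerConfig:
    """Hypothetical per-layer NSA settings (all names illustrative)."""
    compression_block_size: int = 32       # tokens per compressed key/value block
    selection_top_k: int = 16              # number of blocks kept by top-k selection
    sliding_window: Optional[int] = 512    # local window size; None disables the branch


@dataclass
class NSAConfig:
    """Maps transformer layer indices to per-layer NSA settings."""
    default: NSALayerConfig = field(default_factory=NSALayerConfig)
    per_layer_overrides: Dict[int, NSALayerConfig] = field(default_factory=dict)

    def layer(self, idx: int) -> NSALayerConfig:
        # Fall back to the shared default when a layer has no override.
        return self.per_layer_overrides.get(idx, self.default)


# Example: disable the sliding-window branch only in layer 0.
cfg = NSAConfig(per_layer_overrides={0: NSALayerConfig(sliding_window=None)})
assert cfg.layer(0).sliding_window is None
assert cfg.layer(5).selection_top_k == 16
```

This keeps a single shared default while letting individual layers opt out of, or retune, each of the three NSA branches.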
Describe alternatives you've considered
External libraries implement similar sparse attention mechanisms; however, we want native, in-house support in Megatron-LM.
Additional context
- The NSA paper: arxiv.org/abs/2502.11089
- DeepSeek V3.2 reference implementation: github.com/deepseek-ai/DeepSeek-V3.2-Exp
- The existing experimental DSA implementation in this repo (megatron/core/transformer/experimental_attention_variant/dsa.py) already implements key components: top-k indexing, KL-divergence indexer loss, and Hadamard rotation. This provides a strong foundation to build upon.
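For reference, the top-k selection step that the existing DSA indexer and the NSA paper share can be sketched as follows. This is an illustrative NumPy example, not the actual dsa.py code: it assumes block-importance scores have already been computed (e.g. as attention of queries over compressed block representations) and shows only the per-query top-k block selection.

```python
import numpy as np


def topk_block_indices(scores: np.ndarray, k: int) -> np.ndarray:
    """Select, per query, the k key blocks with the highest importance scores.

    scores: (num_queries, num_blocks) block-importance scores.
    Returns: (num_queries, k) block indices, sorted by descending score.
    """
    # argpartition finds the top-k set in O(num_blocks); we then sort only
    # those k entries to get a descending order.
    part = np.argpartition(scores, -k, axis=-1)[:, -k:]
    order = np.argsort(np.take_along_axis(scores, part, axis=-1), axis=-1)[:, ::-1]
    return np.take_along_axis(part, order, axis=-1)


scores = np.array([[0.1, 0.9, 0.3, 0.7],
                   [0.5, 0.2, 0.8, 0.4]])
print(topk_block_indices(scores, 2))  # -> [[1 3], [2 0]]
```

The selected indices would then gather the corresponding uncompressed key/value blocks for the fine-grained attention branch.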