
Feature Request: Adding NSA and other sparse attention mechanisms #4252

@csking101

Description


Is your feature request related to a problem? Please describe.
Sparse attention mechanisms like Native Sparse Attention (NSA) (from DeepSeek's "Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention") address the quadratic cost of full attention by reducing complexity while preserving model quality, but Megatron-LM currently lacks support for these methods.

Describe the solution you'd like
I plan to extend the existing DSA implementation under megatron/core/transformer/experimental_attention_variant/dsa.py with a design in which NSA's three branches, token compression, top-k token selection, and sliding-window attention, can be configured per layer.
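To illustrate the intended shape of the design: NSA combines its three branch outputs with learned sigmoid gates predicted from the query representation. Below is a minimal sketch of that gated combination; the class and argument names are hypothetical, not existing Megatron-LM APIs.

```python
import torch
import torch.nn as nn


class NSAGatedCombine(nn.Module):
    """Hypothetical sketch of NSA's gated merge of its three attention
    branches (compressed, selected top-k, sliding window).

    Each branch produces an output of shape (batch, seq, d_model); a
    per-branch scalar gate in (0, 1) is predicted from the query states.
    """

    def __init__(self, d_model: int):
        super().__init__()
        # one gate per branch, conditioned on the query representation
        self.gate_proj = nn.Linear(d_model, 3)

    def forward(self, query_states, cmp_out, sel_out, win_out):
        gates = torch.sigmoid(self.gate_proj(query_states))  # (batch, seq, 3)
        return (
            gates[..., 0:1] * cmp_out
            + gates[..., 1:2] * sel_out
            + gates[..., 2:3] * win_out
        )
```

A per-layer config could then simply toggle which branches are constructed and what window size / top-k value each layer uses.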

Describe alternatives you've considered
External libraries provide similar sparse attention mechanisms; however, native support within Megatron-LM is preferable so it integrates directly with the existing attention infrastructure.

Additional context

  • The NSA paper: arxiv.org/abs/2502.11089
  • DeepSeek V3.2 reference implementation: github.com/deepseek-ai/DeepSeek-V3.2-Exp
  • The existing experimental DSA implementation in this repo (megatron/core/transformer/experimental_attention_variant/dsa.py) already implements key components: top-k indexing, KL-divergence indexer loss, and Hadamard rotation. This provides a strong foundation to build upon.
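For concreteness, the top-k selection step that both DSA and NSA rely on amounts to picking the highest-scoring key blocks per query and attending only to those. A minimal sketch, with a hypothetical helper name (not the repo's actual API):

```python
import torch


def topk_block_select(scores: torch.Tensor, k: int) -> torch.Tensor:
    """Select indices of the k highest-scoring key blocks per query.

    scores: (batch, num_queries, num_blocks) importance scores, e.g.
            attention weights over compressed block representations.
    Returns: (batch, num_queries, k) block indices to gather for the
             sparse (selected) attention branch.
    """
    k = min(k, scores.size(-1))  # guard against short sequences
    return torch.topk(scores, k, dim=-1).indices
```

The existing DSA code's top-k indexing and KL-divergence indexer loss should make this branch, and its gradient path through the indexer, largely reusable.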
