TileOPs (TOP) is a high-performance operator library built on TileLang. It offers efficient, modular, and composable implementations for AI workloads, with a focus on large language models (LLMs).
⚠️ Status: TileOPs is under active and rapid development. APIs and features may change.
What TileOPs is for:
- Out-of-the-box Operator Library: A growing collection of production-ready operators commonly used in LLM workloads, designed with clear abstractions and modular building blocks. These operators can be used directly or easily extended for custom research and system integration.
- Efficient Attention Kernels for LLMs: Highly optimized attention implementations, including MHA/GQA (implementing FlashAttention-2 on Ampere-class GPUs and FlashAttention-3 on Hopper), DeepSeek-MLA, and DeepSeek-DSA.
- Reference Implementation for TileLang: TileOPs acts as a canonical reference implementation for writing performant and maintainable kernels in TileLang. It demonstrates best practices in tiling strategies, memory hierarchy utilization, and warp-/block-level coordination, making it a practical learning resource for compiler and kernel developers.
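To make the attention bullet concrete, here is a reference (deliberately unoptimized) scaled dot-product attention in pure Python. It only illustrates the computation that fused kernels such as FlashAttention accelerate; TileOPs' actual kernels fuse these steps and tile them across the GPU memory hierarchy instead of materializing the full score matrix, and this sketch is not TileOPs API.

```python
import math

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def attention(q, k, v):
    """Naive attention: q is a list of query vectors; k and v are
    lists of key/value vectors. Returns one output vector per query."""
    d = len(q[0])
    out = []
    for qi in q:
        # scores = q . k / sqrt(d), one score per key
        scores = [sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d) for kj in k]
        weights = softmax(scores)
        # output = weighted sum of value vectors
        out.append([sum(w * vj[t] for w, vj in zip(weights, v))
                    for t in range(len(v[0]))])
    return out

q = [[1.0, 0.0]]
k = [[1.0, 0.0], [0.0, 1.0]]
v = [[1.0, 2.0], [3.0, 4.0]]
print(attention(q, k, v))
```

A fused kernel computes exactly this result but streams K/V tiles through shared memory and keeps running softmax statistics, avoiding the O(sequence²) score matrix in global memory.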
The core features of TileOPs include:
- Auto-Tuning: Built-in auto-tuning support to explore tile sizes, pipelines, and scheduling parameters, enabling kernels to adapt efficiently to different GPU architectures and workload characteristics with minimal manual effort.
- CUDA-Graph and torch.compile Compatibility: TileOPs APIs are fully compatible with CUDA-Graph capture and PyTorch's `torch.compile`, allowing seamless integration into modern training and inference pipelines with reduced launch overhead and improved end-to-end performance.
- Lightweight Dependencies: TileOPs depends only on TileLang and PyTorch, keeping the software stack minimal and easy to integrate.
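The auto-tuning idea above can be sketched in a few lines: benchmark a candidate set of tile sizes for a toy blocked kernel and keep the fastest one. This is a hypothetical, CPU-only illustration of the search loop, not TileOPs' actual tuner, which additionally explores pipeline depths and scheduling parameters on the target GPU.

```python
import timeit

def blocked_matmul(a, b, n, tile):
    """Multiply two n x n matrices (stored as flat lists) using a
    given tile size. The tile size changes performance, not results."""
    c = [0.0] * (n * n)
    for ii in range(0, n, tile):
        for jj in range(0, n, tile):
            for kk in range(0, n, tile):
                for i in range(ii, min(ii + tile, n)):
                    for k in range(kk, min(kk + tile, n)):
                        aik = a[i * n + k]
                        for j in range(jj, min(jj + tile, n)):
                            c[i * n + j] += aik * b[k * n + j]
    return c

def autotune(n, candidate_tiles):
    """Time each candidate tile size and return the fastest one."""
    a = [float(i % 7) for i in range(n * n)]
    b = [float(i % 5) for i in range(n * n)]
    timings = {
        tile: timeit.timeit(lambda: blocked_matmul(a, b, n, tile), number=3)
        for tile in candidate_tiles
    }
    return min(timings, key=timings.get)

best = autotune(n=32, candidate_tiles=[4, 8, 16, 32])
print("best tile size:", best)
```

A real tuner caches the winning configuration per (shape, architecture) key so the search cost is paid once, which is why auto-tuned kernels remain compatible with CUDA-Graph capture after warm-up.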
TODO
TODO
- Python >= 3.8
- Torch >= 2.1
- TileLang >= 0.1.7
Install from PyPI:

```shell
pip install tileops
```

Or install from source:

```shell
git clone https://github.com/tile-ai/TileOPs
cd TileOPs
pip install -e '.[dev]' -v  # remove -e if you don't want an editable install; -v enables verbose output
```

TODO
TileOPs is structured around four key concepts arranged hierarchically, each representing a distinct level of abstraction. Higher-level components are composed from, or delegate execution to, the next lower level.
- Layer: A high-level, user-facing abstraction analogous to `torch.nn.Module`. It manages stateful parameters.
- Function: A stateless, functional abstraction analogous to `torch.nn.functional`. Functions are fully compatible with CUDA-Graph capture, `torch.compile`, and `torch.autograd`.
- Op: Determines the implementation for a given shape and hardware, dispatching to the correct Kernel and providing unit tests and benchmarks. Ops are fully compatible with CUDA-Graph capture and `torch.compile`.
- Kernel: TileLang-based kernels with hardware-specific optimizations.
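The four levels can be sketched with a toy elementwise "scale" operator. All class and function names below are illustrative only, not TileOPs' actual API; the point is how each level delegates to the one beneath it.

```python
class ScaleKernel:
    """Kernel level: in TileOPs this would be a TileLang kernel with
    hardware-specific optimizations; here it is plain Python."""
    def run(self, xs, factor):
        return [x * factor for x in xs]

class ScaleOp:
    """Op level: selects an implementation for a given shape/hardware
    and dispatches to the chosen Kernel."""
    def __init__(self):
        self._kernels = {"generic": ScaleKernel()}

    def __call__(self, xs, factor):
        # A real Op would key on input shape and GPU architecture.
        return self._kernels["generic"].run(xs, factor)

_scale_op = ScaleOp()

def scale(xs, factor):
    """Function level: stateless entry point, analogous to
    torch.nn.functional; delegates to the Op."""
    return _scale_op(xs, factor)

class Scale:
    """Layer level: stateful wrapper, analogous to torch.nn.Module."""
    def __init__(self, factor):
        self.factor = factor          # the stateful parameter

    def __call__(self, xs):
        return scale(xs, self.factor)  # delegates to the Function level

layer = Scale(factor=2.0)
print(layer([1.0, 2.0, 3.0]))  # → [2.0, 4.0, 6.0]
```

Keeping the Function and Op levels stateless is what makes them safe to capture in a CUDA graph or trace with `torch.compile`: all state lives in the Layer, and dispatch decisions depend only on shape and hardware.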
