
Commit 5ba42ad

Merge branch 'main' into awq-smooth-layer-quantization
2 parents: a1c5b96 + 464d000

119 files changed: +6260 −2328 lines changed


.github/mergify.yml

Lines changed: 5 additions & 0 deletions
@@ -47,6 +47,11 @@ pull_request_rules:
       - files~=^[^/]+\.md$
       - files~=^docs/
       - files~=^examples/
+      - -files~=^src/
+      - -files~=^tests/
+      - -files~=^\.github/
+      - -files~=^Makefile$
+      - -files~=^pyproject\.toml$
     actions:
       label:
         add:

.gitignore

Lines changed: 7 additions & 0 deletions
@@ -128,6 +128,13 @@ venv.bak/
 /site
 docs/.cache/*

+# zensical docs build (generated by pre-build scripts)
+docs/api/llmcompressor/
+docs/examples/
+docs/experimental/
+docs/developer/code-of-conduct.md
+docs/developer/contributing.md
+
 # mypy
 .mypy_cache/
 ### Example user template template

.readthedocs.yaml

Lines changed: 10 additions & 16 deletions
@@ -1,22 +1,16 @@
-# Read the Docs configuration file
-# See https://docs.readthedocs.io/en/stable/config-file/v2.html for details
-
-# Required
 version: 2

-# Set the OS, Python version, and other tools you might need
 build:
   os: ubuntu-24.04
   tools:
     python: "3.12"
-
-# Build documentation with Mkdocs
-mkdocs:
-  configuration: mkdocs.yml
-
-python:
-  install:
-    - method: pip
-      path: .
-      extra_requirements:
-        - dev
+  jobs:
+    install:
+      - pip install -e ".[dev]"
+    build:
+      html:
+        - python docs/scripts/zensical_gen_files.py
+        - zensical build
+    post_build:
+      - mkdir -p $READTHEDOCS_OUTPUT/html/
+      - cp --recursive site/* $READTHEDOCS_OUTPUT/html/

README.md

Lines changed: 10 additions & 4 deletions
@@ -37,24 +37,24 @@ Big updates have landed in LLM Compressor! To get a more in-depth look, check ou

 Some of the exciting new features include:

+* **Qwen3.5 Support**: Qwen 3.5 can now be quantized using LLM Compressor. You will need to update your local transformers version using `uv pip install --upgrade transformers` and install LLM Compressor from source if using `<0.11`. Once updated, you should be able to run examples for the [MoE](examples/quantization_w4a4_fp4/qwen3_5_example.py) and [non-MoE](examples/quantization_w4a4_fp4/qwen3_5_example.py) variants of Qwen 3.5 end-to-end. For models quantized and published by the RedHat team, consider using the [NVFP4](https://huggingface.co/RedHatAI/Qwen3.5-122B-A10B-NVFP4) and FP8 checkpoints for [Qwen3.5-122B](https://huggingface.co/RedHatAI/Qwen3.5-122B-A10B-FP8-dynamic) and [Qwen3.5-397B](https://huggingface.co/RedHatAI/Qwen3.5-397B-A17B-FP8-dynamic).
 * **Updated offloading and model loading support**: Loading transformers models that are offloaded to disk and/or offloaded across distributed process ranks is now supported. Disk offloading allows users to load and compress very large models which normally would not fit in CPU memory. Offloading is no longer handled through accelerate but through model loading utilities added to compressed-tensors. For a full summary of updated loading and offloading functionality, for both single-process and distributed flows, see the [Big Models and Distributed Support guide](docs/guides/big_models_and_distributed/model_loading.md).
 * **Distributed GPTQ Support**: GPTQ now supports Distributed Data Parallel (DDP) functionality to significantly improve calibration runtime. An example using DDP with GPTQ can be found [here](examples/quantization_w4a16/llama3_ddp_example.py).
 * **Updated FP4 Microscale Support**: GPTQ now supports FP4 quantization schemes, including both [MXFP4](examples/quantization_w4a16_fp4/mxfp4/llama3_example.py) and [NVFP4](examples/quantization_w4a4_fp4/llama3_gptq_example.py). MXFP4 support has also been improved with updated weight scale generation. Models with weight-only quantization in the MXFP4 format can now run in vLLM as of vLLM v0.14.0. MXFP4 models with activation quantization are not yet supported in vLLM for compressed-tensors models.
 * **New Model-Free PTQ Pathway**: A new model-free PTQ pathway has been added to LLM Compressor, called [`model_free_ptq`](src/llmcompressor/entrypoints/model_free/__init__.py#L36). This pathway allows you to quantize your model without requiring a Hugging Face model definition and is especially useful in cases where `oneshot` may fail. This pathway currently supports data-free flows only (i.e., FP8 quantization) and was leveraged to quantize the [Mistral Large 3 model](https://huggingface.co/mistralai/Mistral-Large-3-675B-Instruct-2512). Additional [examples](examples/model_free_ptq) have been added illustrating how LLM Compressor can be used for Kimi K2.
+* **MXFP8 Microscale Support (Experimental)**: LLM Compressor now supports MXFP8 quantization via PTQ. Both W8A8 ([MXFP8](experimental/mxfp8/qwen3_example_w8a8_mxfp8.py)) and W8A16 weight-only ([MXFP8A16](experimental/mxfp8/qwen3_example_w8a16_mxfp8.py)) modes are available.
 * **Extended KV Cache and Attention Quantization Support**: LLM Compressor now supports attention quantization. KV cache quantization, which previously only supported per-tensor scales, has been extended to support any quantization scheme, including a new `per-head` quantization scheme. Support for these checkpoints is ongoing in vLLM, and scripts to get started have been added to the [experimental folder](experimental/attention).


 ### Supported Formats
-* Activation Quantization: W8A8 (int8 and fp8)
-* Mixed Precision: W4A16, W8A16, NVFP4 (W4A4 and W4A16 support)
-* 2:4 Semi-structured and Unstructured Sparsity
+* Activation Quantization: W8A8 (int8 and fp8), MXFP8 (experimental)
+* Mixed Precision: W4A16, W8A16, MXFP8A16 (experimental), NVFP4 (W4A4 and W4A16 support)

 ### Supported Algorithms
 * Simple PTQ
 * GPTQ
 * AWQ
 * SmoothQuant
-* SparseGPT
 * AutoRound

 ### When to Use Which Optimization
@@ -75,6 +75,8 @@ pip install llmcompressor
 Applying quantization with `llmcompressor`:
 * [Activation quantization to `int8`](examples/quantization_w8a8_int8/README.md)
 * [Activation quantization to `fp8`](examples/quantization_w8a8_fp8/README.md)
+* [Activation quantization to MXFP8 (experimental)](experimental/mxfp8/qwen3_example_w8a8_mxfp8.py)
+* [Weight-only quantization to MXFP8A16 (experimental)](experimental/mxfp8/qwen3_example_w8a16_mxfp8.py)
 * [Activation quantization to `fp4`](examples/quantization_w4a4_fp4/llama3_example.py)
 * [Activation quantization to `fp4` using AutoRound](examples/autoround/quantization_w4a4_fp4/README.md)
 * [Activation quantization to `fp8` and weight quantization to `int4`](examples/quantization_w4a8_fp8/)
@@ -183,3 +185,7 @@ If you find LLM Compressor useful in your research or projects, please consider
   url={https://github.com/vllm-project/llm-compressor},
 }
 ```
+
+
+!!! warning
+    Sparse compression (2:4 sparsity) is no longer supported by LLM Compressor due to lack of hardware support and usage.
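For orientation on the data-free FP8 flow several of the README bullets and example links above refer to, the following is a minimal sketch using the existing `oneshot` entrypoint; the model ID and save directory are illustrative placeholders, and this is not the new `model_free_ptq` pathway itself.

```python
# Minimal data-free FP8 sketch with the existing oneshot entrypoint.
# Model ID and save directory are illustrative placeholders.
from transformers import AutoModelForCausalLM, AutoTokenizer

from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

MODEL_ID = "meta-llama/Meta-Llama-3-8B-Instruct"
SAVE_DIR = MODEL_ID.split("/")[-1] + "-FP8-Dynamic"

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# FP8 dynamic activation quantization needs no calibration data.
recipe = QuantizationModifier(targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"])

oneshot(model=model, recipe=recipe)

# Save in compressed-tensors format so the checkpoint can be served, e.g. by vLLM.
model.save_pretrained(SAVE_DIR, save_compressed=True)
tokenizer.save_pretrained(SAVE_DIR)
```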

docs/.nav.yml

Lines changed: 7 additions & 3 deletions
@@ -1,7 +1,7 @@
 nav:
   - Home: index.md
   - Why use LLM Compressor?: steps/why-llmcompressor.md
-  - Compresssing your model, step-by-step:
+  - Compressing your model, step-by-step:
     - Choosing your model: steps/choosing-model.md
     - Choosing the right compression scheme: steps/choosing-scheme.md
     - Choosing the right compression algorithm: steps/choosing-algo.md
@@ -19,19 +19,23 @@ nav:
   - Qwen3:
     - key-models/qwen3/index.md
     - FP8 Example: key-models/qwen3/fp8-example.md
+  - Qwen3.5:
+    - key-models/qwen3.5/index.md
+    - NVFP4A16 VL Example: key-models/qwen3.5/nvfp4-vl-example.md
+    - NVFP4 MoE Example: key-models/qwen3.5/nvfp4-moe-example.md
   - Kimi-K2:
     - key-models/kimi-k2/index.md
     - FP8 Example: key-models/kimi-k2/fp8-example.md
   - Mistral Large 3:
     - key-models/mistral-large-3/index.md
     - FP8 Example: key-models/mistral-large-3/fp8-example.md
-  - Guides:
+  - User Guides:
     - Big Models and Distributed Support:
       - Model Loading: guides/big_models_and_distributed/model_loading.md
       - Sequential Onloading: guides/big_models_and_distributed/sequential_onloading.md
       - Distributed Oneshot: guides/big_models_and_distributed/distributed_oneshot.md
     - Compression Schemes: guides/compression_schemes.md
-    - Saving a Model: guides/saving_a_model.md
+    - Saving a Compressed Model: guides/saving_a_model.md
     - Observers: guides/observers.md
     - Memory Requirements: guides/memory.md
     - Runtime Performance: guides/runtime.md

docs/DEVELOPMENT.md

Lines changed: 43 additions & 0 deletions
@@ -0,0 +1,43 @@
# Getting started with LLM Compressor docs

```bash
cd docs
```

- Install the dependencies:

```bash
make install
```

- Clean the previous build (optional but recommended):

```bash
make clean
```

- Generate docs content (files, API references, and navigation):

```bash
make gen
```

- Serve the docs locally (runs `gen` automatically):

```bash
make serve
```

This will start a local server. You can now open your browser and view the documentation.

- Build the static site (runs `gen` automatically):

```bash
make build
```

- List all available targets:

```bash
make help
```

docs/Makefile

Lines changed: 16 additions & 10 deletions
@@ -1,26 +1,32 @@
-# Minimal mkdocs makefile
+# Minimal zensical makefile

-PYTHON := python3
-MKDOCS_CMD := mkdocs
-MKDOCS_CONF := ../mkdocs.yml
+ZENSICAL_CMD := zensical
+ZENSICAL_CONF := ../zensical.toml

-.PHONY: help install serve build clean
+.PHONY: help install gen serve build clean

 help:
 	@echo "Available targets:"
 	@echo "  install   Install dependencies globally"
+	@echo "  gen       Generate docs content (files + API + nav)"
 	@echo "  serve     Serve docs locally"
 	@echo "  build     Build static site"
 	@echo "  clean     Remove build artifacts"

 install:
 	pip install -e "../[dev]"

-serve:
-	$(MKDOCS_CMD) serve --livereload -f $(MKDOCS_CONF)
+gen:
+	cd .. && python docs/scripts/zensical_gen_files.py

-build:
-	$(MKDOCS_CMD) build -f $(MKDOCS_CONF)
+serve: gen
+	cd .. && $(ZENSICAL_CMD) serve
+
+build: gen
+	cd .. && $(ZENSICAL_CMD) build

 clean:
-	rm -rf site/ .cache/
+	rm -rf site/ .cache/ api/llmcompressor/
+	rm -rf examples/ experimental/
+	rm -f developer/code-of-conduct.md developer/contributing.md
+	cd .. && python3 docs/scripts/zensical_gen_files.py --clean

docs/README.md

Lines changed: 0 additions & 25 deletions
This file was deleted.

docs/guides/compression_schemes.md

Lines changed: 2 additions & 26 deletions
@@ -8,8 +8,6 @@ A full list of supported schemes can be found [here](https://github.com/vllm-pro
 - [W8A8-INT8](#int8_w8a8)
 - [W4A16 and W8A16](#w4a16-and-w8a16)
 - [NVFP4](#nvfp4)
-- [2:4 Semi-structured Sparsity](#semi-structured)
-- [Unstructured Sparsity](#unstructured)

 ## PTQ Compression Schemes

@@ -63,27 +61,5 @@ A full list of supported schemes can be found [here](https://github.com/vllm-pro
 | Calibration | Requires a calibration dataset to calibrate activation global scales |
 | Use case | Supported on all NVIDIA Blackwell GPUs or later |

-## Sparsification Compression Schemes
-
-Sparsification reduces model complexity by pruning selected weight values to zero while retaining essential weights in a subset of parameters. Supported formats include:
-
-
-### Semi-Structured
-| Feature | Description |
-|---------------|----------------------------------------------------------------------------------------------|
-| 2:4 Semi-structured Sparsity | Uses semi-structured sparsity (SparseGPT), where 2 of every 4 contiguous weights are set to zero. |
-| Weights | 2:4 sparsity |
-| Activations | N/A |
-| Calibration | Requires a calibration dataset |
-| Use case | Fine-grained sparsity for compression and speedups |
-
-
-
-### Unstructured
-| Feature | Description |
-|---------------|----------------------------------------------------------------------------------------------|
-| Unstructured Sparsity | Zeros out individual weights without a regular pattern, removing weights wherever they contribute least. Produces a fine-grained sparse matrix. |
-| Weights | Sparsified individually (no structure) |
-| Activations | N/A |
-| Calibration | Does not require a calibration dataset |
-| Use case | Fine-grained sparsity for compression and speedups |
+!!! warning
+    Sparse compression (including 2:4 sparsity) is no longer supported by LLM Compressor due to lack of hardware support and user interest. Please see https://github.com/vllm-project/vllm/pull/36799 for more information.
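The schemes that remain in this guide are calibration-based where noted; NVFP4, for example, still needs a calibration dataset to fit its activation global scales. A minimal calibrated sketch is shown below; the model ID, dataset name, and sample counts are illustrative assumptions, not values taken from this commit.

```python
# Minimal calibrated NVFP4 sketch; dataset choice and sample counts are illustrative.
from transformers import AutoModelForCausalLM

from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

MODEL_ID = "meta-llama/Meta-Llama-3-8B-Instruct"
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")

recipe = QuantizationModifier(targets="Linear", scheme="NVFP4", ignore=["lm_head"])

# A small calibration set is used to fit the activation global scales.
oneshot(
    model=model,
    recipe=recipe,
    dataset="open_platypus",
    max_seq_length=2048,
    num_calibration_samples=512,
)

model.save_pretrained("Meta-Llama-3-8B-Instruct-NVFP4", save_compressed=True)
```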
