|`--endpoint`| Dynamo endpoint in `dyn://namespace.component.endpoint` format | Auto-generated based on mode | N/A |
- |`--migration-limit`| Max times a request can migrate between workers for fault tolerance. See [Request Migration Architecture](../../fault-tolerance/request_migration). |`0` (disabled) | N/A |
+ |`--migration-limit`| Max times a request can migrate between workers for fault tolerance. See [Request Migration Architecture](/additional-resources/fault-tolerance/request-migration). |`0` (disabled) | N/A |
|`--dyn-tool-call-parser`| Tool call parser for structured outputs (takes precedence over `--tool-call-parser`) |`None`|`--tool-call-parser`|
|`--dyn-reasoning-parser`| Reasoning parser for CoT models (takes precedence over `--reasoning-parser`) |`None`|`--reasoning-parser`|
|`--use-sglang-tokenizer`| Use SGLang's tokenizer instead of Dynamo's |`False`| N/A |
@@ -87,7 +87,7 @@ When a user cancels a request (e.g., by disconnecting from the frontend), the re
⚠️ SGLang backend currently does not support cancellation during remote prefill phase in disaggregated mode.
</Callout>
- For more details, see the [Request Cancellation Architecture](../../fault-tolerance/request_cancellation) documentation.
+ For more details, see the [Request Cancellation Architecture](/additional-resources/fault-tolerance/request-cancellation) documentation.
fern/fern/pages/backends/sglang/prometheus.mdx (5 additions & 5 deletions)
@@ -13,9 +13,9 @@ When running SGLang through Dynamo, SGLang engine metrics are automatically pass
**For the complete and authoritative list of all SGLang metrics**, always refer to the [official SGLang Production Metrics documentation](https://docs.sglang.ai/references/production_metrics.html).
- **For Dynamo runtime metrics**, see the [Dynamo Metrics Guide](../../observability/metrics).
+ **For Dynamo runtime metrics**, see the [Dynamo Metrics Guide](/user-guides/observability-local/metrics).
- **For visualization setup instructions**, see the [Prometheus and Grafana Setup Guide](../../observability/prometheus-grafana).
+ **For visualization setup instructions**, see the [Prometheus and Grafana Setup Guide](/user-guides/observability-local/prometheus-grafana).
## Environment Variables
@@ -29,7 +29,7 @@ This is a single machine example.
### Start Observability Stack
- For visualizing metrics with Prometheus and Grafana, start the observability stack. See [Observability Getting Started](../../observability/README#getting-started-quickly) for instructions.
+ For visualizing metrics with Prometheus and Grafana, start the observability stack. See [Observability Getting Started](/user-guides/observability-local/overview#getting-started-quickly) for instructions.
### Launch Dynamo Components
@@ -117,8 +117,8 @@ For the complete and authoritative list of all SGLang metrics, see the [official
Below we provide some simple shell scripts that run the components for each configuration. Each shell script is simply running the `python3 -m dynamo.frontend <args>` to start up the ingress and using `python3 -m dynamo.trtllm <args>` to start up the workers. You can easily take each command and run them in separate terminals.
</Callout>
- For detailed information about the architecture and how KV-aware routing works, see the [KV Cache Routing documentation](../../router/kv_cache_routing).
+ For detailed information about the architecture and how KV-aware routing works, see the [KV Cache Routing documentation](/additional-resources/router-details/kv-cache-routing).
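As an illustration of the idea behind the KV-aware routing referenced above, the sketch below favors the worker whose cache already holds the longest prefix of the incoming tokens. This is a simplified toy model, not Dynamo's implementation; the worker names, block size, and data structures are all invented for the example:

```python
def longest_cached_prefix(tokens: list[int], cached_blocks: set[tuple[int, ...]],
                          block_size: int = 4) -> int:
    """Count how many leading tokens are covered by already-cached KV blocks."""
    covered = 0
    for i in range(0, len(tokens) - block_size + 1, block_size):
        block = tuple(tokens[i:i + block_size])
        if block not in cached_blocks:
            break  # prefix reuse stops at the first uncached block
        covered += block_size
    return covered


def pick_worker(tokens: list[int], worker_caches: dict[str, set]) -> str:
    """Route to the worker with the most reusable KV cache (ties: first seen)."""
    return max(worker_caches, key=lambda w: longest_cached_prefix(tokens, worker_caches[w]))


caches = {
    "worker-a": {(1, 2, 3, 4)},                # has only the first block cached
    "worker-b": {(1, 2, 3, 4), (5, 6, 7, 8)},  # has the first two blocks cached
}
print(pick_worker([1, 2, 3, 4, 5, 6, 7, 8], caches))  # → worker-b
```

A real router also weighs load and locality, but the prefix-overlap score above is the core signal.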
### Aggregated
```bash
@@ -151,7 +151,7 @@ Below we provide a selected list of advanced examples. Please open up an issue i
### Multinode Deployment
- For comprehensive instructions on multinode serving, see the [multinode-examples.md](./multinode/multinode-examples) guide. It provides step-by-step deployment examples and configuration tips for running Dynamo with TensorRT-LLM across multiple nodes. While the walkthrough uses DeepSeek-R1 as the model, you can easily adapt the process for any supported model by updating the relevant configuration files. You can see [Llama4+eagle](./llama4_plus_eagle) guide to learn how to use these scripts when a single worker fits on the single node.
+ For comprehensive instructions on multinode serving, see the [multinode-examples.md](/additional-resources/backend-details/tensorrt-llm/multinode-examples) guide. It provides step-by-step deployment examples and configuration tips for running Dynamo with TensorRT-LLM across multiple nodes. While the walkthrough uses DeepSeek-R1 as the model, you can easily adapt the process for any supported model by updating the relevant configuration files. You can see [Llama4+eagle](/additional-resources/backend-details/tensorrt-llm/llama-4-eagle) guide to learn how to use these scripts when a single worker fits on the single node.
@@ -162,7 +162,7 @@ For complete Kubernetes deployment instructions, configurations, and troubleshoo
### Client
- See [client](../../backends/sglang/README#testing-the-deployment) section to learn how to send request to the deployment.
+ See [client](/components/backends/sglang#testing-the-deployment) section to learn how to send request to the deployment.
NOTE: To send a request to a multi-node deployment, target the node which is running `python3 -m dynamo.frontend <args>`.
@@ -178,7 +178,7 @@ Dynamo with TensorRT-LLM supports two methods for transferring KV cache in disag
## Request Migration
- You can enable [request migration](../../fault-tolerance/request_migration) to handle worker failures gracefully. Use the `--migration-limit` flag to specify how many times a request can be migrated to another worker:
+ You can enable [request migration](/additional-resources/fault-tolerance/request-migration) to handle worker failures gracefully. Use the `--migration-limit` flag to specify how many times a request can be migrated to another worker:
**Prefill workers do not support request migration** and must use `--migration-limit=0` (the default). Prefill workers only process prompts and return KV cache state - they don't maintain long-running generation requests that would benefit from migration.
</Callout>
- See the [Request Migration Architecture](../../fault-tolerance/request_migration) documentation for details on how this works.
+ See the [Request Migration Architecture](/additional-resources/fault-tolerance/request-migration) documentation for details on how this works.
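The `--migration-limit` accounting described above (a request may move to another worker at most N times before failing, with `0` disabling migration) can be sketched as follows. This is a toy model of the counting rule only; Dynamo's actual bookkeeping lives in the runtime, and all class and exception names here are invented:

```python
class MigrationLimitExceeded(Exception):
    pass


class Request:
    def __init__(self, migration_limit: int = 0):
        self.migration_limit = migration_limit  # 0 disables migration (the default)
        self.migrations = 0

    def migrate(self) -> None:
        """Move the request to another worker, or fail once the limit is reached."""
        if self.migrations >= self.migration_limit:
            raise MigrationLimitExceeded(
                f"request already migrated {self.migrations} time(s)")
        self.migrations += 1


req = Request(migration_limit=3)
for _ in range(3):
    req.migrate()   # three worker failures are tolerated
# a fourth worker failure would exceed --migration-limit=3 and error the request
```

With the default `migration_limit=0`, the very first `migrate()` call raises, matching the "disabled" behavior in the flag table.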
## Request Cancellation
@@ -202,11 +202,11 @@ When a user cancels a request (e.g., by disconnecting from the frontend), the re
|**Aggregated**| ✅ | ✅ |
|**Disaggregated**| ✅ | ✅ |
- For more details, see the [Request Cancellation Architecture](../../fault-tolerance/request_cancellation) documentation.
+ For more details, see the [Request Cancellation Architecture](/additional-resources/fault-tolerance/request-cancellation) documentation.
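The cancellation flow discussed above (client disconnects, the frontend cancels, and the worker stops generating) can be sketched with `asyncio`. This illustrates the pattern only, not Dynamo's code; the token loop and sleep intervals are stand-ins:

```python
import asyncio


async def generate_tokens(n: int) -> list[str]:
    """Toy worker decode loop; cancellation stops generation between tokens."""
    out = []
    try:
        for i in range(n):
            await asyncio.sleep(0.01)  # stand-in for one decode step
            out.append(f"tok{i}")
    except asyncio.CancelledError:
        # a real worker would free KV cache and other per-request state here
        raise
    return out


async def main() -> str:
    task = asyncio.create_task(generate_tokens(1000))
    await asyncio.sleep(0.05)  # client disconnects mid-stream...
    task.cancel()              # ...and the frontend propagates the cancellation
    try:
        await task
    except asyncio.CancelledError:
        return "cancelled"
    return "completed"


print(asyncio.run(main()))  # → cancelled
```

The key property is that cancellation lands between decode steps, so the worker releases resources promptly instead of finishing a generation nobody will read.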
## Client
- See [client](../../backends/sglang/README#testing-the-deployment) section to learn how to send request to the deployment.
+ See [client](/components/backends/sglang#testing-the-deployment) section to learn how to send request to the deployment.
NOTE: To send a request to a multi-node deployment, target the node which is running `python3 -m dynamo.frontend <args>`.
@@ -217,7 +217,7 @@ To benchmark your deployment with AIPerf, see this utility script, configuring t
## Multimodal support
- Dynamo with the TensorRT-LLM backend supports multimodal models, enabling you to process both text and images (or pre-computed embeddings) in a single request. For detailed setup instructions, example requests, and best practices, see the [TensorRT-LLM Multimodal Guide](../../multimodal/trtllm).
+ Dynamo with the TensorRT-LLM backend supports multimodal models, enabling you to process both text and images (or pre-computed embeddings) in a single request. For detailed setup instructions, example requests, and best practices, see the [TensorRT-LLM Multimodal Guide](/additional-resources/multimodal-details/tensorrt-llm).
fern/fern/pages/backends/trtllm/llama4_plus_eagle.mdx (3 additions & 3 deletions)
@@ -7,7 +7,7 @@ title: "Llama 4 Maverick Instruct with Eagle Speculative Decoding on SLURM"
SPDX-License-Identifier: Apache-2.0
*/}
- This guide demonstrates how to deploy Llama 4 Maverick Instruct with Eagle Speculative Decoding on GB200x4 nodes. We will be following the [multi-node deployment instructions](./multinode/multinode-examples) to set up the environment for the following scenarios:
+ This guide demonstrates how to deploy Llama 4 Maverick Instruct with Eagle Speculative Decoding on GB200x4 nodes. We will be following the [multi-node deployment instructions](/additional-resources/backend-details/tensorrt-llm/multinode-examples) to set up the environment for the following scenarios:
- **Aggregated Serving:**
Deploy the entire Llama 4 model on a single GB200x4 node for end-to-end serving.
- See [this](./multinode/multinode-examples#setup) section from multinode guide to learn more about the above options.
+ See [this](/additional-resources/backend-details/tensorrt-llm/multinode-examples#setup) section from multinode guide to learn more about the above options.
fern/fern/pages/backends/trtllm/prometheus.mdx (5 additions & 5 deletions)
@@ -15,9 +15,9 @@ Additional performance metrics are available via non-Prometheus APIs (see [Non-P
As of the date of this documentation, the included TensorRT-LLM version 1.1.0rc5 exposes **5 basic Prometheus metrics**. Note that the `trtllm_` prefix is added by Dynamo.
- **For Dynamo runtime metrics**, see the [Dynamo Metrics Guide](../../observability/metrics).
+ **For Dynamo runtime metrics**, see the [Dynamo Metrics Guide](/user-guides/observability-local/metrics).
- **For visualization setup instructions**, see the [Prometheus and Grafana Setup Guide](../../observability/prometheus-grafana).
+ **For visualization setup instructions**, see the [Prometheus and Grafana Setup Guide](/user-guides/observability-local/prometheus-grafana).
## Environment Variables
@@ -31,7 +31,7 @@ This is a single machine example.
### Start Observability Stack
- For visualizing metrics with Prometheus and Grafana, start the observability stack. See [Observability Getting Started](../../observability/README#getting-started-quickly) for instructions.
+ For visualizing metrics with Prometheus and Grafana, start the observability stack. See [Observability Getting Started](/user-guides/observability-local/overview#getting-started-quickly) for instructions.
### Launch Dynamo Components
@@ -187,8 +187,8 @@ TensorRT-LLM provides extensive performance data beyond the basic Prometheus met
- See the high-level notes in [KV Cache Routing](../../router/kv_cache_routing) on deterministic event IDs.
+ See the high-level notes in [KV Cache Routing](/additional-resources/router-details/kv-cache-routing) on deterministic event IDs.
## Request Migration
- You can enable [request migration](../../fault-tolerance/request_migration) to handle worker failures gracefully. Use the `--migration-limit` flag to specify how many times a request can be migrated to another worker:
+ You can enable [request migration](/additional-resources/fault-tolerance/request-migration) to handle worker failures gracefully. Use the `--migration-limit` flag to specify how many times a request can be migrated to another worker:
```bash
python3 -m dynamo.vllm ... --migration-limit=3
```
- This allows a request to be migrated up to 3 times before failing. See the [Request Migration Architecture](../../fault-tolerance/request_migration) documentation for details on how this works.
+ This allows a request to be migrated up to 3 times before failing. See the [Request Migration Architecture](/additional-resources/fault-tolerance/request-migration) documentation for details on how this works.
## Request Cancellation
@@ -199,4 +199,4 @@ When a user cancels a request (e.g., by disconnecting from the frontend), the re
|**Aggregated**| ✅ | ✅ |
|**Disaggregated**| ✅ | ✅ |
- For more details, see the [Request Cancellation Architecture](../../fault-tolerance/request_cancellation) documentation.
+ For more details, see the [Request Cancellation Architecture](/additional-resources/fault-tolerance/request-cancellation) documentation.
fern/fern/pages/backends/vllm/deepseek-r1.mdx (1 addition & 1 deletion)
@@ -11,7 +11,7 @@ Dynamo supports running Deepseek R1 with data parallel attention and wide expert
## Instructions
- The following script can be adapted to run Deepseek R1 with a variety of different configuration. The current configuration uses 2 nodes, 16 GPUs, and a dp of 16. Follow the [ReadMe](README) Getting Started section on each node, and then run these two commands.
+ The following script can be adapted to run Deepseek R1 with a variety of different configuration. The current configuration uses 2 nodes, 16 GPUs, and a dp of 16. Follow the [vLLM Backend](/components/backends/vllm) Getting Started section on each node, and then run these two commands.
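The topology in the line above (2 nodes, 16 GPUs, data-parallel degree 16) implies one attention DP rank per GPU. A quick sanity check, where 8 GPUs per node is an inference from "2 nodes, 16 GPUs" rather than a value stated in the guide:

```python
nodes = 2
gpus_per_node = 8          # inferred: 16 GPUs across 2 nodes
total_gpus = nodes * gpus_per_node
dp_size = 16               # data-parallel attention degree from the walkthrough
assert total_gpus == dp_size  # one DP rank maps onto exactly one GPU
print(f"{total_gpus} GPUs -> dp={dp_size}: one attention DP rank per GPU")
```

Adapting the script to other topologies means keeping `nodes * gpus_per_node` equal to the chosen `dp` value.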