|`--endpoint`| Dynamo endpoint in `dyn://namespace.component.endpoint` format | Auto-generated based on mode | N/A |
- |`--migration-limit`| Max times a request can migrate between workers for fault tolerance. See [Request Migration Architecture](../../fault-tolerance/request_migration). |`0` (disabled) | N/A |
+ |`--migration-limit`| Max times a request can migrate between workers for fault tolerance. See [Request Migration Architecture](/additional-resources/fault-tolerance/request-migration). |`0` (disabled) | N/A |
|`--dyn-tool-call-parser`| Tool call parser for structured outputs (takes precedence over `--tool-call-parser`) |`None`|`--tool-call-parser`|
|`--dyn-reasoning-parser`| Reasoning parser for CoT models (takes precedence over `--reasoning-parser`) |`None`|`--reasoning-parser`|
|`--use-sglang-tokenizer`| Use SGLang's tokenizer instead of Dynamo's |`False`| N/A |
@@ -87,7 +87,7 @@ When a user cancels a request (e.g., by disconnecting from the frontend), the re
⚠️ SGLang backend currently does not support cancellation during remote prefill phase in disaggregated mode.
</Callout>
- For more details, see the [Request Cancellation Architecture](../../fault-tolerance/request_cancellation) documentation.
+ For more details, see the [Request Cancellation Architecture](/additional-resources/fault-tolerance/request-cancellation) documentation.
fern/fern/pages/backends/sglang/prometheus.mdx (5 additions & 5 deletions)
@@ -13,9 +13,9 @@ When running SGLang through Dynamo, SGLang engine metrics are automatically pass
**For the complete and authoritative list of all SGLang metrics**, always refer to the [official SGLang Production Metrics documentation](https://docs.sglang.ai/references/production_metrics.html).
- **For Dynamo runtime metrics**, see the [Dynamo Metrics Guide](../../observability/metrics).
+ **For Dynamo runtime metrics**, see the [Dynamo Metrics Guide](/user-guides/observability-local/metrics).
- **For visualization setup instructions**, see the [Prometheus and Grafana Setup Guide](../../observability/prometheus-grafana).
+ **For visualization setup instructions**, see the [Prometheus and Grafana Setup Guide](/user-guides/observability-local/prometheus-grafana).
## Environment Variables
@@ -29,7 +29,7 @@ This is a single machine example.
### Start Observability Stack
- For visualizing metrics with Prometheus and Grafana, start the observability stack. See [Observability Getting Started](../../observability/README#getting-started-quickly) for instructions.
+ For visualizing metrics with Prometheus and Grafana, start the observability stack. See [Observability Getting Started](/user-guides/observability-local/overview#getting-started-quickly) for instructions.
### Launch Dynamo Components
@@ -117,8 +117,8 @@ For the complete and authoritative list of all SGLang metrics, see the [official
Below we provide some simple shell scripts that run the components for each configuration. Each shell script is simply running the `python3 -m dynamo.frontend <args>` to start up the ingress and using `python3 -m dynamo.trtllm <args>` to start up the workers. You can easily take each command and run them in separate terminals.
</Callout>
- For detailed information about the architecture and how KV-aware routing works, see the [KV Cache Routing documentation](../../router/kv_cache_routing).
+ For detailed information about the architecture and how KV-aware routing works, see the [KV Cache Routing documentation](/additional-resources/router-details/kv-cache-routing).
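As an illustration of the idea behind the KV-aware routing referenced above, the sketch below favors the worker whose cache already holds the longest prefix of the incoming tokens. This is a simplified toy model, not Dynamo's implementation; the worker names, block size, and data structures are all invented for the example:

```python
def longest_cached_prefix(tokens: list[int], cached_blocks: set[tuple[int, ...]],
                          block_size: int = 4) -> int:
    """Count how many leading tokens are covered by already-cached KV blocks."""
    covered = 0
    for i in range(0, len(tokens) - block_size + 1, block_size):
        block = tuple(tokens[i:i + block_size])
        if block not in cached_blocks:
            break  # prefix reuse stops at the first uncached block
        covered += block_size
    return covered


def pick_worker(tokens: list[int], worker_caches: dict[str, set]) -> str:
    """Route to the worker with the most reusable KV cache (ties: first seen)."""
    return max(worker_caches, key=lambda w: longest_cached_prefix(tokens, worker_caches[w]))


caches = {
    "worker-a": {(1, 2, 3, 4)},                # has only the first block cached
    "worker-b": {(1, 2, 3, 4), (5, 6, 7, 8)},  # has the first two blocks cached
}
print(pick_worker([1, 2, 3, 4, 5, 6, 7, 8], caches))  # → worker-b
```

A real router also weighs load and locality, but the prefix-overlap score above is the core signal.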
### Aggregated
```bash
@@ -151,7 +151,7 @@ Below we provide a selected list of advanced examples. Please open up an issue i
### Multinode Deployment
- For comprehensive instructions on multinode serving, see the [multinode-examples.md](./multinode/multinode-examples) guide. It provides step-by-step deployment examples and configuration tips for running Dynamo with TensorRT-LLM across multiple nodes. While the walkthrough uses DeepSeek-R1 as the model, you can easily adapt the process for any supported model by updating the relevant configuration files. You can see [Llama4+eagle](./llama4_plus_eagle) guide to learn how to use these scripts when a single worker fits on the single node.
+ For comprehensive instructions on multinode serving, see the [multinode-examples.md](/additional-resources/backend-details/tensorrt-llm/multinode-examples) guide. It provides step-by-step deployment examples and configuration tips for running Dynamo with TensorRT-LLM across multiple nodes. While the walkthrough uses DeepSeek-R1 as the model, you can easily adapt the process for any supported model by updating the relevant configuration files. You can see [Llama4+eagle](/additional-resources/backend-details/tensorrt-llm/llama-4-eagle) guide to learn how to use these scripts when a single worker fits on the single node.
@@ -162,7 +162,7 @@ For complete Kubernetes deployment instructions, configurations, and troubleshoo
### Client
- See [client](../../backends/sglang/README#testing-the-deployment) section to learn how to send request to the deployment.
+ See [client](/components/backends/sglang#testing-the-deployment) section to learn how to send request to the deployment.
NOTE: To send a request to a multi-node deployment, target the node which is running `python3 -m dynamo.frontend <args>`.
@@ -178,7 +178,7 @@ Dynamo with TensorRT-LLM supports two methods for transferring KV cache in disag
## Request Migration
- You can enable [request migration](../../fault-tolerance/request_migration) to handle worker failures gracefully. Use the `--migration-limit` flag to specify how many times a request can be migrated to another worker:
+ You can enable [request migration](/additional-resources/fault-tolerance/request-migration) to handle worker failures gracefully. Use the `--migration-limit` flag to specify how many times a request can be migrated to another worker:
**Prefill workers do not support request migration** and must use `--migration-limit=0` (the default). Prefill workers only process prompts and return KV cache state - they don't maintain long-running generation requests that would benefit from migration.
</Callout>
- See the [Request Migration Architecture](../../fault-tolerance/request_migration) documentation for details on how this works.
+ See the [Request Migration Architecture](/additional-resources/fault-tolerance/request-migration) documentation for details on how this works.
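The `--migration-limit` accounting described above (a request may move to another worker at most N times before failing, with `0` disabling migration) can be sketched as follows. This is a toy model of the counting rule only; Dynamo's actual bookkeeping lives in the runtime, and all class and exception names here are invented:

```python
class MigrationLimitExceeded(Exception):
    pass


class Request:
    def __init__(self, migration_limit: int = 0):
        self.migration_limit = migration_limit  # 0 disables migration (the default)
        self.migrations = 0

    def migrate(self) -> None:
        """Move the request to another worker, or fail once the limit is reached."""
        if self.migrations >= self.migration_limit:
            raise MigrationLimitExceeded(
                f"request already migrated {self.migrations} time(s)")
        self.migrations += 1


req = Request(migration_limit=3)
for _ in range(3):
    req.migrate()   # three worker failures are tolerated
# a fourth worker failure would exceed --migration-limit=3 and error the request
```

With the default `migration_limit=0`, the very first `migrate()` call raises, matching the "disabled" behavior in the flag table.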
## Request Cancellation
@@ -202,11 +202,11 @@ When a user cancels a request (e.g., by disconnecting from the frontend), the re
|**Aggregated**| ✅ | ✅ |
|**Disaggregated**| ✅ | ✅ |
- For more details, see the [Request Cancellation Architecture](../../fault-tolerance/request_cancellation) documentation.
+ For more details, see the [Request Cancellation Architecture](/additional-resources/fault-tolerance/request-cancellation) documentation.
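The cancellation flow discussed above (client disconnects, the frontend cancels, and the worker stops generating) can be sketched with `asyncio`. This illustrates the pattern only, not Dynamo's code; the token loop and sleep intervals are stand-ins:

```python
import asyncio


async def generate_tokens(n: int) -> list[str]:
    """Toy worker decode loop; cancellation stops generation between tokens."""
    out = []
    try:
        for i in range(n):
            await asyncio.sleep(0.01)  # stand-in for one decode step
            out.append(f"tok{i}")
    except asyncio.CancelledError:
        # a real worker would free KV cache and other per-request state here
        raise
    return out


async def main() -> str:
    task = asyncio.create_task(generate_tokens(1000))
    await asyncio.sleep(0.05)  # client disconnects mid-stream...
    task.cancel()              # ...and the frontend propagates the cancellation
    try:
        await task
    except asyncio.CancelledError:
        return "cancelled"
    return "completed"


print(asyncio.run(main()))  # → cancelled
```

The key property is that cancellation lands between decode steps, so the worker releases resources promptly instead of finishing a generation nobody will read.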
## Client
- See [client](../../backends/sglang/README#testing-the-deployment) section to learn how to send request to the deployment.
+ See [client](/components/backends/sglang#testing-the-deployment) section to learn how to send request to the deployment.
NOTE: To send a request to a multi-node deployment, target the node which is running `python3 -m dynamo.frontend <args>`.
@@ -217,7 +217,7 @@ To benchmark your deployment with AIPerf, see this utility script, configuring t
## Multimodal support
- Dynamo with the TensorRT-LLM backend supports multimodal models, enabling you to process both text and images (or pre-computed embeddings) in a single request. For detailed setup instructions, example requests, and best practices, see the [TensorRT-LLM Multimodal Guide](../../multimodal/trtllm).
+ Dynamo with the TensorRT-LLM backend supports multimodal models, enabling you to process both text and images (or pre-computed embeddings) in a single request. For detailed setup instructions, example requests, and best practices, see the [TensorRT-LLM Multimodal Guide](/additional-resources/multimodal-details/tensorrt-llm).
fern/fern/pages/backends/trtllm/llama4_plus_eagle.mdx (3 additions & 3 deletions)
@@ -7,7 +7,7 @@ title: "Llama 4 Maverick Instruct with Eagle Speculative Decoding on SLURM"
SPDX-License-Identifier: Apache-2.0
*/}
- This guide demonstrates how to deploy Llama 4 Maverick Instruct with Eagle Speculative Decoding on GB200x4 nodes. We will be following the [multi-node deployment instructions](./multinode/multinode-examples) to set up the environment for the following scenarios:
+ This guide demonstrates how to deploy Llama 4 Maverick Instruct with Eagle Speculative Decoding on GB200x4 nodes. We will be following the [multi-node deployment instructions](/additional-resources/backend-details/tensorrt-llm/multinode-examples) to set up the environment for the following scenarios:
- **Aggregated Serving:**
Deploy the entire Llama 4 model on a single GB200x4 node for end-to-end serving.
- See [this](./multinode/multinode-examples#setup) section from multinode guide to learn more about the above options.
+ See [this](/additional-resources/backend-details/tensorrt-llm/multinode-examples#setup) section from multinode guide to learn more about the above options.
fern/fern/pages/backends/trtllm/prometheus.mdx (5 additions & 5 deletions)
@@ -15,9 +15,9 @@ Additional performance metrics are available via non-Prometheus APIs (see [Non-P
As of the date of this documentation, the included TensorRT-LLM version 1.1.0rc5 exposes **5 basic Prometheus metrics**. Note that the `trtllm_` prefix is added by Dynamo.
- **For Dynamo runtime metrics**, see the [Dynamo Metrics Guide](../../observability/metrics).
+ **For Dynamo runtime metrics**, see the [Dynamo Metrics Guide](/user-guides/observability-local/metrics).
- **For visualization setup instructions**, see the [Prometheus and Grafana Setup Guide](../../observability/prometheus-grafana).
+ **For visualization setup instructions**, see the [Prometheus and Grafana Setup Guide](/user-guides/observability-local/prometheus-grafana).
## Environment Variables
@@ -31,7 +31,7 @@ This is a single machine example.
### Start Observability Stack
- For visualizing metrics with Prometheus and Grafana, start the observability stack. See [Observability Getting Started](../../observability/README#getting-started-quickly) for instructions.
+ For visualizing metrics with Prometheus and Grafana, start the observability stack. See [Observability Getting Started](/user-guides/observability-local/overview#getting-started-quickly) for instructions.
### Launch Dynamo Components
@@ -187,8 +187,8 @@ TensorRT-LLM provides extensive performance data beyond the basic Prometheus met
- See the high-level notes in [KV Cache Routing](../../router/kv_cache_routing) on deterministic event IDs.
+ See the high-level notes in [KV Cache Routing](/additional-resources/router-details/kv-cache-routing) on deterministic event IDs.
## Request Migration
- You can enable [request migration](../../fault-tolerance/request_migration) to handle worker failures gracefully. Use the `--migration-limit` flag to specify how many times a request can be migrated to another worker:
+ You can enable [request migration](/additional-resources/fault-tolerance/request-migration) to handle worker failures gracefully. Use the `--migration-limit` flag to specify how many times a request can be migrated to another worker:
```bash
python3 -m dynamo.vllm ... --migration-limit=3
```
- This allows a request to be migrated up to 3 times before failing. See the [Request Migration Architecture](../../fault-tolerance/request_migration) documentation for details on how this works.
+ This allows a request to be migrated up to 3 times before failing. See the [Request Migration Architecture](/additional-resources/fault-tolerance/request-migration) documentation for details on how this works.
## Request Cancellation
@@ -199,4 +199,4 @@ When a user cancels a request (e.g., by disconnecting from the frontend), the re
|**Aggregated**| ✅ | ✅ |
|**Disaggregated**| ✅ | ✅ |
- For more details, see the [Request Cancellation Architecture](../../fault-tolerance/request_cancellation) documentation.
+ For more details, see the [Request Cancellation Architecture](/additional-resources/fault-tolerance/request-cancellation) documentation.
fern/fern/pages/backends/vllm/deepseek-r1.mdx (1 addition & 1 deletion)
@@ -11,7 +11,7 @@ Dynamo supports running Deepseek R1 with data parallel attention and wide expert
## Instructions
- The following script can be adapted to run Deepseek R1 with a variety of different configuration. The current configuration uses 2 nodes, 16 GPUs, and a dp of 16. Follow the [ReadMe](README) Getting Started section on each node, and then run these two commands.
+ The following script can be adapted to run Deepseek R1 with a variety of different configuration. The current configuration uses 2 nodes, 16 GPUs, and a dp of 16. Follow the [vLLM Backend](/components/backends/vllm) Getting Started section on each node, and then run these two commands.
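The topology in the line above (2 nodes, 16 GPUs, data-parallel degree 16) implies one attention DP rank per GPU. A quick sanity check, where 8 GPUs per node is an inference from "2 nodes, 16 GPUs" rather than a value stated in the guide:

```python
nodes = 2
gpus_per_node = 8          # inferred: 16 GPUs across 2 nodes
total_gpus = nodes * gpus_per_node
dp_size = 16               # data-parallel attention degree from the walkthrough
assert total_gpus == dp_size  # one DP rank maps onto exactly one GPU
print(f"{total_gpus} GPUs -> dp={dp_size}: one attention DP rank per GPU")
```

Adapting the script to other topologies means keeping `nodes * gpus_per_node` equal to the chosen `dp` value.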