 |`--endpoint`| Dynamo endpoint in `dyn://namespace.component.endpoint` format | Auto-generated based on mode | N/A |
-|`--migration-limit`| Max times a request can migrate between workers for fault tolerance. See [Request Migration Architecture](../../fault_tolerance/request_migration). |`0` (disabled) | N/A |
+|`--migration-limit`| Max times a request can migrate between workers for fault tolerance. See [Request Migration Architecture](../../fault-tolerance/request_migration). |`0` (disabled) | N/A |
 |`--dyn-tool-call-parser`| Tool call parser for structured outputs (takes precedence over `--tool-call-parser`) |`None`|`--tool-call-parser`|
 |`--dyn-reasoning-parser`| Reasoning parser for CoT models (takes precedence over `--reasoning-parser`) |`None`|`--reasoning-parser`|
 |`--use-sglang-tokenizer`| Use SGLang's tokenizer instead of Dynamo's |`False`| N/A |
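The `--endpoint` row in the hunk above describes a `dyn://namespace.component.endpoint` format. As a rough illustration of that shape only, here is a minimal Python sketch that splits such a string into its three parts. The function name `parse_endpoint` and the example value are hypothetical, and the sketch assumes that none of the three parts contains a dot; it is not Dynamo's own parser.

```python
def parse_endpoint(uri):
    """Split 'dyn://namespace.component.endpoint' into its three parts.

    Hypothetical helper for illustration; assumes no part contains a dot.
    """
    prefix = "dyn://"
    if not uri.startswith(prefix):
        raise ValueError(f"expected a dyn:// endpoint, got {uri!r}")
    parts = uri[len(prefix):].split(".")
    # Exactly three non-empty parts are required.
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected namespace.component.endpoint, got {uri!r}")
    namespace, component, endpoint = parts
    return namespace, component, endpoint


# The endpoint value below is made up for the example.
parsed = parse_endpoint("dyn://dynamo.backend.generate")
```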
@@ -87,7 +87,7 @@ When a user cancels a request (e.g., by disconnecting from the frontend), the re
 ⚠️ SGLang backend currently does not support cancellation during remote prefill phase in disaggregated mode.
 </Callout>

-For more details, see the [Request Cancellation Architecture](../../fault_tolerance/request_cancellation) documentation.
+For more details, see the [Request Cancellation Architecture](../../fault-tolerance/request_cancellation) documentation.
 |[**Load Based Planner**](../../planner/load_planner)| 🚧 | Planned |
@@ -178,7 +178,7 @@ Dynamo with TensorRT-LLM supports two methods for transferring KV cache in disag
 ## Request Migration

-You can enable [request migration](../../fault_tolerance/request_migration) to handle worker failures gracefully. Use the `--migration-limit` flag to specify how many times a request can be migrated to another worker:
+You can enable [request migration](../../fault-tolerance/request_migration) to handle worker failures gracefully. Use the `--migration-limit` flag to specify how many times a request can be migrated to another worker:
 **Prefill workers do not support request migration** and must use `--migration-limit=0` (the default). Prefill workers only process prompts and return KV cache state - they don't maintain long-running generation requests that would benefit from migration.
 </Callout>

-See the [Request Migration Architecture](../../fault_tolerance/request_migration) documentation for details on how this works.
+See the [Request Migration Architecture](../../fault-tolerance/request_migration) documentation for details on how this works.

 ## Request Cancellation
@@ -202,7 +202,7 @@ When a user cancels a request (e.g., by disconnecting from the frontend), the re
 |**Aggregated**| ✅ | ✅ |
 |**Disaggregated**| ✅ | ✅ |

-For more details, see the [Request Cancellation Architecture](../../fault_tolerance/request_cancellation) documentation.
+For more details, see the [Request Cancellation Architecture](../../fault-tolerance/request_cancellation) documentation.
 |[**Load Based Planner**](../../planner/load_planner)| 🚧 | WIP |
@@ -180,13 +180,13 @@ See the high-level notes in [KV Cache Routing](../../router/kv_cache_routing) on
 ## Request Migration

-You can enable [request migration](../../fault_tolerance/request_migration) to handle worker failures gracefully. Use the `--migration-limit` flag to specify how many times a request can be migrated to another worker:
+You can enable [request migration](../../fault-tolerance/request_migration) to handle worker failures gracefully. Use the `--migration-limit` flag to specify how many times a request can be migrated to another worker:

 ```bash
 python3 -m dynamo.vllm ... --migration-limit=3
 ```

-This allows a request to be migrated up to 3 times before failing. See the [Request Migration Architecture](../../fault_tolerance/request_migration) documentation for details on how this works.
+This allows a request to be migrated up to 3 times before failing. See the [Request Migration Architecture](../../fault-tolerance/request_migration) documentation for details on how this works.

 ## Request Cancellation
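The hunk above says that `--migration-limit=3` lets a request be migrated up to 3 times before failing, and that `0` disables migration. The Python sketch below models only that counting rule. `WorkerFailure`, `run_with_migration`, and `flaky` are hypothetical names invented for the example, and the loop is an assumption about the semantics described in the text, not Dynamo's frontend logic.

```python
class WorkerFailure(Exception):
    """Hypothetical error standing in for a worker dying mid-request."""


def run_with_migration(attempt, migration_limit):
    """Retry `attempt` on failure, migrating at most `migration_limit` times.

    `attempt(n)` receives the number of migrations performed so far.
    Returns (result, migrations_used); re-raises once the limit is reached.
    """
    migrations = 0
    while True:
        try:
            return attempt(migrations), migrations
        except WorkerFailure:
            if migrations >= migration_limit:
                raise  # limit exhausted: the request fails
            migrations += 1  # move the request to another worker


def flaky(n):
    # Simulated request: the first two workers fail, the third succeeds.
    if n < 2:
        raise WorkerFailure(f"worker {n} failed")
    return f"done on worker {n}"


result, used = run_with_migration(flaky, migration_limit=3)
```

With `migration_limit=0` the same call raises `WorkerFailure` on the first failure, which matches the "disabled" default described in the flag table.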
@@ -199,4 +199,4 @@ When a user cancels a request (e.g., by disconnecting from the frontend), the re
 |**Aggregated**| ✅ | ✅ |
 |**Disaggregated**| ✅ | ✅ |

-For more details, see the [Request Cancellation Architecture](../../fault_tolerance/request_cancellation) documentation.
+For more details, see the [Request Cancellation Architecture](../../fault-tolerance/request_cancellation) documentation.
File: fern/fern/pages/design-docs/architecture.mdx
Lines changed: 1 addition & 1 deletion
@@ -41,7 +41,7 @@ To address the growing demands of distributed inference serving, NVIDIA introduc
 The following diagram outlines Dynamo's high-level architecture. To enable large-scale distributed and disaggregated inference serving, Dynamo includes five key features:
File: fern/fern/pages/development/backend-guide.mdx
Lines changed: 4 additions & 4 deletions
@@ -74,7 +74,7 @@ The `model_type` can be:
 - `model_name`: The name to call the model. Your incoming HTTP requests model name must match this. Defaults to the hugging face repo name or the folder name.
 - `context_length`: Max model length in tokens. Defaults to the model's set max. Only set this if you need to reduce KV cache allocation to fit into VRAM.
 - `kv_cache_block_size`: Size of a KV block for the engine, in tokens. Defaults to 16.
-- `migration_limit`: Maximum number of times a request may be [migrated to another Instance](../fault_tolerance/request_migration). Defaults to 0.
+- `migration_limit`: Maximum number of times a request may be [migrated to another Instance](../fault-tolerance/request_migration). Defaults to 0.
 - `user_data`: Optional dictionary containing custom metadata for worker behavior (e.g., LoRA configuration). Defaults to None.

 See `examples/backends` for full code examples.
@@ -116,7 +116,7 @@ In the P/D disaggregated setup you would have `deepseek-distill-llama8b.prefill.
 A Python worker may need to be shut down promptly, for example when the node running the worker is to be reclaimed and there isn't enough time to complete all ongoing requests before the shutdown deadline.

-In such cases, you can signal incomplete responses by raising a `GeneratorExit` exception in your generate loop. This will immediately close the response stream, signaling to the frontend that the stream is incomplete. With request migration enabled (see the [`migration_limit`](../fault_tolerance/request_migration) parameter), the frontend will automatically migrate the partially completed request to another worker instance, if available, to be completed.
+In such cases, you can signal incomplete responses by raising a `GeneratorExit` exception in your generate loop. This will immediately close the response stream, signaling to the frontend that the stream is incomplete. With request migration enabled (see the [`migration_limit`](../fault-tolerance/request_migration) parameter), the frontend will automatically migrate the partially completed request to another worker instance, if available, to be completed.

 <Callout intent="warning">
 We will update the `GeneratorExit` exception to a new Dynamo exception. Please expect minor code breaking change in the near future.
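The hunk above touches the paragraph about raising `GeneratorExit` in a generate loop, but the guide's own example is elided from this diff. As a stand-in, here is a minimal, self-contained Python sketch of the pattern. The `shutting_down` flag, the `consume` helper, and the request shape are assumptions made for the example and are not part of the Dynamo API.

```python
import asyncio


class RequestHandler:
    """Hypothetical worker handler (illustrative; not the Dynamo API)."""

    def __init__(self):
        # Set to True when the worker has been told to shut down.
        self.shutting_down = False

    async def generate(self, request):
        for token in request["tokens"]:
            if self.shutting_down:
                # End the stream early so the caller sees it as incomplete.
                raise GeneratorExit("worker shutting down")
            yield token


async def consume(handler, request, stop_after):
    """Collect tokens and report whether the stream ran to completion."""
    received = []
    try:
        async for token in handler.generate(request):
            received.append(token)
            if len(received) == stop_after:
                handler.shutting_down = True  # simulate a shutdown signal
    except GeneratorExit:
        return received, False  # incomplete stream
    return received, True


tokens, complete = asyncio.run(
    consume(RequestHandler(), {"tokens": ["a", "b", "c"]}, stop_after=2)
)
```

Here `consume` plays the role of whatever reads the stream. It ends up with the two tokens produced before the shutdown and a flag saying the stream was cut short, which is the information a caller would need in order to continue the request elsewhere.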
@@ -140,7 +140,7 @@ class RequestHandler:
 When `GeneratorExit` is raised, the frontend receives the incomplete response and can seamlessly continue generation on another available worker instance, preserving the user experience even during worker shutdowns.

-For more information about how request migration works, see the [Request Migration Architecture](../fault_tolerance/request_migration) documentation.
+For more information about how request migration works, see the [Request Migration Architecture](../fault-tolerance/request_migration) documentation.

 ## Request Cancellation
@@ -162,4 +162,4 @@ class RequestHandler:
 The context parameter is optional - if your generate method doesn't include it in its signature, Dynamo will call your method without the context argument.

-For detailed information about request cancellation, including async cancellation monitoring and context propagation patterns, see the [Request Cancellation Architecture](../fault_tolerance/request_cancellation) documentation.
+For detailed information about request cancellation, including async cancellation monitoring and context propagation patterns, see the [Request Cancellation Architecture](../fault-tolerance/request_cancellation) documentation.
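The context line in the hunk above says the `context` parameter is optional and that the generate method is called without it when its signature omits it. The diff does not show how that check is made. The Python sketch below shows one way such a dispatch could work, using the standard `inspect.signature`. `call_generate` and the two sample functions are hypothetical, and this is an assumption about the behavior, not Dynamo's code.

```python
import inspect


def call_generate(generate, request, context):
    """Pass `context` only if `generate` declares a parameter of that name."""
    params = inspect.signature(generate).parameters
    if "context" in params:
        return generate(request, context)
    return generate(request)


def with_context(request, context):
    # Handler that wants the context (e.g. to watch for cancellation).
    return ("with", request, context)


def without_context(request):
    # Handler that ignores the context entirely.
    return ("without", request)


first = call_generate(with_context, "req", "ctx")
second = call_generate(without_context, "req", "ctx")
```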