This playbook documents the AMD reference profile for a single real ROCm vLLM backend that exposes multiple semantic served-model aliases. The router keeps replay and Insights signal-native: records show matched signals, selected decision, selected alias, and cost/savings, without inventing a separate runtime dimension schema.
- Physical backend model: `Qwen/Qwen3.5-122B-A10B-FP8`
- Docker service name expected by the profile: `vllm:8000`
- Served-model aliases exposed by the backend: `qwen/qwen3.5-rocm`, `google/gemini-2.5-flash-lite`, `google/gemini-3.1-pro`, `openai/gpt5.4`, `anthropic/claude-opus-4.6`
- Reference routing profile: balance.yaml
- Canonical layout:
  - `version` / `listeners` / `providers` / `routing` / `global`
  - `providers.defaults.default_model` points at the SIMPLE tier
  - `providers.models[].pricing` is example pricing for Insights cost comparison
  - `routing.decisions` uses tier-prefixed dual-layer families
  - `global.model_catalog.modules` only tightens learned-signal thresholds for conservative overlays
The active AMD profile contains 23 routing decisions:
- `simple_*` (3): lowest-cost FAQ and general fallback
- `medium_*` (5): low-to-mid-cost domain/scenario refinement
- `verified_*` (5): evidence-sensitive overlays layered just above their base routes
- `feedback_*` (2): explicit correction and clarification recovery lanes
- `complex_*` (3): hard technical, STEM, and agentic synthesis
- `reasoning_*` (3): high-reasoning escalation
- `engaged_general` (1): emotion-aware and urgency-aware general fallback above the cheap default lane
- `premium_*` (1): one premium legal path only
Create the shared Docker network first, then start the single ROCm backend container:
```bash
sudo docker network create vllm-sr-network 2>/dev/null || true

sudo docker run -d \
  --name vllm \
  --network=vllm-sr-network \
  --restart unless-stopped \
  -p "${VLLM_PORT_122B:-8090}:8000" \
  -v "${VLLM_HF_CACHE:-/mnt/data/huggingface-cache}:/root/.cache/huggingface" \
  --device=/dev/kfd \
  --device=/dev/dri \
  --group-add=video \
  --ipc=host \
  --cap-add=SYS_PTRACE \
  --security-opt seccomp=unconfined \
  --shm-size 32G \
  -v /data:/data \
  -v "$HOME:/myhome" \
  -w /myhome \
  -e VLLM_ROCM_USE_AITER=1 \
  -e VLLM_USE_AITER_UNIFIED_ATTENTION=1 \
  -e VLLM_ROCM_USE_AITER_MHA=0 \
  --entrypoint python3 \
  vllm/vllm-openai-rocm:v0.17.0 \
  -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen3.5-122B-A10B-FP8 \
  --host 0.0.0.0 \
  --port 8000 \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --served-model-name qwen/qwen3.5-rocm google/gemini-2.5-flash-lite google/gemini-3.1-pro openai/gpt5.4 anthropic/claude-opus-4.6 \
  --trust-remote-code \
  --reasoning-parser qwen3 \
  --max-model-len 262144 \
  --language-model-only \
  --max-num-seqs 128 \
  --kv-cache-dtype fp8 \
  --gpu-memory-utilization 0.85
```

Then install vLLM Semantic Router on the host:

```bash
curl -fsSL https://vllm-semantic-router.com/install.sh | bash
```

If everything is working, the dashboard is available at:
http://<your-server-ip>:8700
Complete onboarding and import the reference profile from remote:
https://raw.githubusercontent.com/vllm-project/semantic-router/main/deploy/recipes/balance.yaml
Onboarding remote import can apply the full YAML directly. If you import the same file into the DSL editor, the routing surfaces decompile from `routing.modelCards`, `routing.signals`, `routing.projections`, and `routing.decisions`, while `providers` stays YAML-native.
```
Client
  |
  v
vLLM Semantic Router (:8899)
  |
  +-- signal evaluation
  |     - keyword
  |     - embedding
  |     - structure
  |     - fact_check
  |     - user_feedback
  |     - preference
  |     - language
  |     - context
  |     - complexity
  |     - domain
  |
  +-- projection coordination
  |     - domain partition winner
  |     - intent partition winner
  |     - difficulty band
  |     - emotion band
  |     - urgency band
  |     - verification band
  |
  +-- tiered decision selection
  |     - priority and tier choose one route
  |     - route rules combine raw signals with projection outputs
  |
  +-- alias-forwarded OpenAI request
  |     - SIMPLE: qwen/qwen3.5-rocm
  |     - MEDIUM: google/gemini-2.5-flash-lite
  |     - COMPLEX: google/gemini-3.1-pro
  |     - REASONING: openai/gpt5.4
  |     - PREMIUM: anthropic/claude-opus-4.6
  |
  v
Single ROCm vLLM backend on vllm:8000
  |
  v
Qwen/Qwen3.5-122B-A10B-FP8
```
The runtime does not add a separate 15-dimension scorecard. Instead, the profile expresses those routing ideas through native vSR signals and then exposes the matched signals, chosen decision, and chosen alias in replay and Insights.
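The record shape described above can be pictured as a flat structure. The field names and values in this sketch are illustrative assumptions for explanation only, not the router's actual record schema:

```python
# Illustrative shape of one replay/Insights record. Field names and
# the savings figure are assumptions, not the router's real schema.
record = {
    "matched_signals": ["embedding:fast_qa_en", "language:en",
                        "context:short_context", "projection:balance_simple"],
    "selected_decision": "simple_fast_qa_en",
    "selected_alias": "qwen/qwen3.5-rocm",
    "cost_usd": 0.0,        # the SIMPLE tier is free under the example pricing
    "savings_usd": 0.0072,  # illustrative delta versus a paid-tier alias
}
```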
The `fact_check` and `user_feedback` lanes are intentionally conservative:
- they require both the learned signal and explicit lexical confirmation
- they keep most traffic on `qwen/qwen3.5-rocm`
- they do not add any extra route-local plugins beyond the profile's existing replay capture
| Tier | Alias | Example pricing per 1M tokens | Role in the profile |
|---|---|---|---|
| SIMPLE | `qwen/qwen3.5-rocm` | prompt $0.00, completion $0.00 | Free self-hosted default alias for fast QA, broad fallback, and most low-cost traffic |
| MEDIUM | `google/gemini-2.5-flash-lite` | prompt $0.01, completion $0.04 | Low-cost expressive route for creative and softer medium tasks |
| COMPLEX | `google/gemini-3.1-pro` | prompt $0.48, completion $1.92 | Hard STEM and architecture design |
| REASONING | `openai/gpt5.4` | prompt $1.20, completion $4.80 | Multi-step reasoning, proofs, and philosophy |
| PREMIUM | `anthropic/claude-opus-4.6` | prompt $1.80, completion $7.20 | Reserved for legal and high-risk analysis |
Pricing is intentionally exaggerated for Insights demos so savings are easy to see. These values are not intended to mirror real vendor billing.
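Using the example pricing above, the cost comparison Insights displays can be reproduced by hand. A minimal sketch, where the `request_cost` helper is hypothetical and not a router API:

```python
# Reproduce an Insights-style cost comparison using the example
# pricing table above (per-1M-token rates, deliberately exaggerated).
PRICING = {
    "qwen/qwen3.5-rocm":            {"prompt": 0.00, "completion": 0.00},
    "google/gemini-2.5-flash-lite": {"prompt": 0.01, "completion": 0.04},
    "google/gemini-3.1-pro":        {"prompt": 0.48, "completion": 1.92},
    "openai/gpt5.4":                {"prompt": 1.20, "completion": 4.80},
    "anthropic/claude-opus-4.6":    {"prompt": 1.80, "completion": 7.20},
}

def request_cost(alias: str, prompt_tokens: int, completion_tokens: int) -> float:
    """Cost in dollars for one request under the example pricing."""
    rates = PRICING[alias]
    return (prompt_tokens / 1e6) * rates["prompt"] + \
           (completion_tokens / 1e6) * rates["completion"]

# Savings when a 2,000-in / 500-out request lands on the free SIMPLE
# alias instead of the premium one.
premium = request_cost("anthropic/claude-opus-4.6", 2000, 500)
simple = request_cost("qwen/qwen3.5-rocm", 2000, 500)
savings = premium - simple
```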
| Priority | Decision | Alias | What it is for | Match sketch |
|---|---|---|---|---|
| 260 | `premium_legal` | `anthropic/claude-opus-4.6` | Highest-risk legal and compliance analysis | law or explicit legal-risk cues + premium legal embedding, verification overlay, or medium/hard `legal_risk` |
| 250 | `reasoning_math` | `openai/gpt5.4` | Proofs, derivations, and hard math | `domain:math` + `projection:balance_reasoning` |
| 245 | `reasoning_philosophy` | `openai/gpt5.4` | Philosophy prompts that need deep argumentation | `domain:philosophy` + `projection:balance_reasoning` |
| 243 | `complex_agentic` | `google/gemini-3.1-pro` | High-structure execution plans, migrations, and workflow orchestration | agentic embedding / preference / markers + `projection:balance_complex` or `projection:balance_reasoning`, excluding architecture markers |
| 240 | `complex_architecture` | `google/gemini-3.1-pro` | Complex systems and architecture design | CS or engineering + architecture embedding / markers + `projection:balance_complex` or `projection:balance_reasoning` |
| 235 | `complex_stem` | `google/gemini-3.1-pro` | Complex STEM synthesis outside dedicated math | STEM domain + STEM or research embedding, or high routing band |
| 232 | `feedback_wrong_answer_verified` | `google/gemini-3.1-pro` | Explicit correction on evidence-sensitive follow-ups | `user_feedback:wrong_answer` + correction markers + short/medium context + verification pressure or evidence-synthesis escalation |
| 220 | `medium_code_general` | `qwen/qwen3.5-rocm` | Low-medium cost coding, debugging, and technical Q&A | code domain / markers / embedding + `projection:balance_medium` or `projection:balance_complex`, or short urgent code prompts with `projection:balance_simple` + `projection:urgency_elevated` |
| 216 | `verified_business` | `google/gemini-2.5-flash-lite` | Evidence-sensitive business or economics requests | business/economics + `projection:verification_required` or hard evidence synthesis + business embedding or medium/complex routing band |
| 215 | `medium_business` | `qwen/qwen3.5-rocm` | Mid-tier business and economics analysis | business/economics + `embedding:business_analysis` + `projection:balance_medium` or `projection:balance_complex`, excluding verification overlay |
| 214 | `verified_health` | `google/gemini-3.1-pro` | Evidence-sensitive health and medical guidance | `domain:health` + `projection:verification_required` + health embedding or medium/complex/reasoning band |
| 211 | `verified_history` | `google/gemini-2.5-flash-lite` | Source-sensitive history explanation | `domain:history` + `projection:verification_required` or hard evidence synthesis + history embedding or medium/complex routing band |
| 210 | `medium_history` | `qwen/qwen3.5-rocm` | Mid-tier history explanation and comparison | `domain:history` + `embedding:history_explainer` + `projection:balance_medium` or `projection:balance_complex`, excluding verification overlay |
| 205 | `medium_psychology` | `qwen/qwen3.5-rocm` | Psychology and behavior queries with nuanced explanation | `domain:psychology` + `embedding:psychology_support` + `projection:balance_medium` or `projection:balance_complex` |
| 202 | `engaged_general` | `google/gemini-2.5-flash-lite` | General or psychology-adjacent prompts with visible emotion or urgency | `projection:emotion_positive` or `projection:emotion_negative` or `projection:urgency_elevated` + general/psychology cues, excluding specialist and verification-heavy lanes |
| 200 | `medium_creative` | `google/gemini-2.5-flash-lite` | Creative writing, copywriting, and ideation | creative markers / embedding / collaboration preference + `projection:balance_simple` or `projection:balance_medium` |
| 190 | `reasoning_general` | `openai/gpt5.4` | Non-specialist deep analysis and multi-step reasoning | reasoning / research / multi-step cues + `projection:balance_complex` or `projection:balance_reasoning`, excluding specialist embeddings and broad technical markers |
| 185 | `feedback_need_clarification` | `qwen/qwen3.5-rocm` | Cheap clarification follow-up lane | `user_feedback:need_clarification` + clarification markers + short/medium context |
| 181 | `verified_fast_qa_zh` | `qwen/qwen3.5-rocm` | Chinese short FAQ with explicit verification ask | `embedding:fast_qa_zh` + `language:zh` + `context:short_context` + simple/medium routing band + verification cue or fact-check pressure |
| 180 | `simple_fast_qa_zh` | `qwen/qwen3.5-rocm` | Cheapest Chinese factual / definitional answers | `embedding:fast_qa_zh` + `language:zh` + `context:short_context` + `projection:balance_simple`, excluding verification, code, and urgency overlays |
| 176 | `verified_fast_qa_en` | `qwen/qwen3.5-rocm` | English short FAQ with explicit verification ask | `embedding:fast_qa_en` + `language:en` + `context:short_context` + simple/medium routing band + verification cue or fact-check pressure |
| 175 | `simple_fast_qa_en` | `qwen/qwen3.5-rocm` | Cheapest English factual / definitional answers | `embedding:fast_qa_en` + `language:en` + `context:short_context` + `projection:balance_simple`, excluding verification, code, and urgency overlays |
| 170 | `simple_general` | `qwen/qwen3.5-rocm` | Lowest-cost fallback for non-specialized traffic | short simple traffic, or medium-context `domain:other` traffic with simple/medium band, excluding fast-QA embeddings |
This ordering is intentional:
- specialized premium and hard-reasoning routes win first
- explicit correction recovery beats ordinary medium traffic, but only with strong confirmation cues
- factual overlays sit just above their cheap base routes instead of replacing them
- complex technical routes beat generic reasoning routes
- medium routes only accept easy or medium complexity
- simple routes remain the broad default landing zone
The profile uses the standard vSR signal families directly under `routing.signals`:
| Signal family | Role in this profile | Representative names |
|---|---|---|
| `keywords` | explicit lexical confirmation for route style, verification asks, emotion or urgency cues, feedback cues, and task shape | `verification_markers`, `emotion_negative_markers`, `urgency_markers`, `clarification_feedback_markers` |
| `embeddings` | learned intent and specialist boundaries | `fast_qa_en`, `architecture_design`, `business_analysis`, `premium_legal_analysis`, `reasoning_general_en` |
| `structure` | cheap structural overlays for workflow formatting and punctuation emphasis | `ordered_workflow`, `numbered_steps`, `exclamation_emphasis` |
| `fact_check` | evidence-sensitive detection that feeds verification pressure | `needs_fact_check` |
| `user_feedbacks` | explicit correction or clarification overlays | `wrong_answer`, `need_clarification` |
| `preferences` | collaboration style and request framing | `coding_partner`, `creative_collaboration`, `agentic_execution` |
| `language` | language-specific fast-QA split | `en`, `zh` |
| `domains` | subject-area routing and partitioning | `law`, `math`, `history`, `health`, `computer science`, `other` |
| `context` | token-count bands for cheap fallback versus longer tasks | `short_context`, `medium_context`, `long_context` |
| `complexity` | easy / medium / hard difficulty for general, code, math, legal, agentic, and evidence-heavy requests | `general_reasoning`, `code_task`, `math_task`, `legal_risk`, `agentic_delivery`, `evidence_synthesis` |
Notable profile-specific signal details:
- `context` bands are non-overlapping: `short_context` is 0-999, `medium_context` is 1K-7999, and `long_context` is 8K-256K.
- `complexity` signals are reusable across both route predicates and projection scores through sublevels such as `code_task:hard` or `evidence_synthesis:medium`.
- the emotion and urgency overlays stay heuristic on purpose: lexical markers and repeated `!`/`!` are used as secondary coordination signals instead of replacing the learned primary-intent lanes.
- short lexical verification and correction cues are intentionally literal in this profile, so examples that say `verify this`, `answer with citations`, or Chinese `给出处` ("give the source") are more reliable than looser paraphrases.
- `jailbreak` and `pii` signals are still defined in the profile for safety surfaces, but they are not the primary routing predicates for the 23 active decisions.
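The non-overlapping context bands can be sketched as a simple threshold function. This is illustrative only; the real band boundaries live in the profile's `context` signal definitions:

```python
# Map a token count onto the profile's non-overlapping context bands:
# short is 0-999, medium is 1,000-7,999, and long is 8,000 and above
# (up to the backend's 256K window).
def context_band(tokens: int) -> str:
    if tokens < 1000:
        return "short_context"
    if tokens < 8000:
        return "medium_context"
    return "long_context"
```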
The profile uses `routing.projections` as the coordination layer between raw signal detections and final route selection.
| Projection | Kind | Purpose | Outputs or members |
|---|---|---|---|
| `balance_domain_partition` | partition | resolves one domain winner across the supported routing domains | biology, business, chemistry, computer science, economics, engineering, health, history, law, math, other, philosophy, physics, psychology |
| `balance_intent_partition` | partition | resolves one learned-intent winner across the maintained embedding lanes | `agentic_workflows`, `architecture_design`, `code_general`, `creative_tasks`, `fast_qa_en`, `fast_qa_zh`, `general_chat_fallback`, and related specialist embeddings |
| `difficulty_score` | score | blends context, keywords, embeddings, and complexity sublevels into one difficulty signal | source for the difficulty band mapping |
| `difficulty_band` | mapping | converts `difficulty_score` into reusable routing bands | `balance_simple`, `balance_medium`, `balance_complex`, `balance_reasoning` |
| `emotion_valence` | score | blends positive and negative affect markers into one lightweight emotional-overlay score | source for the emotion band mapping |
| `emotion_band` | mapping | converts `emotion_valence` into reusable emotional overlays | `emotion_positive`, `emotion_negative` |
| `urgency_pressure` | score | blends urgency markers with exclamation-count emphasis into one urgency overlay | source for the urgency band mapping |
| `urgency_band` | mapping | converts `urgency_pressure` into reusable urgency overlays | `urgency_standard`, `urgency_elevated` |
| `verification_pressure` | score | blends fact_check, verification cues, high-stakes domains, long-context pressure, and wrong-answer correction pressure | source for the verification mapping |
| `verification_band` | mapping | converts `verification_pressure` into verification routing outputs | `verification_standard`, `verification_required` |
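The score-to-band mappings above all follow the same pattern. A sketch in the style of `difficulty_band`, with invented cut points (the real thresholds are defined in `balance.yaml`, not in this document):

```python
# Sketch of a score-to-band mapping in the style of difficulty_band.
# The cut points below are invented for illustration; the real
# thresholds live in the profile's mapping definition.
def difficulty_band(score: float) -> str:
    if score < 0.25:
        return "balance_simple"
    if score < 0.50:
        return "balance_medium"
    if score < 0.75:
        return "balance_complex"
    return "balance_reasoning"
```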
In practice, the profile routes in three steps:

1. Raw signals fire under `routing.signals`.
2. Projections turn that raw evidence into named outputs such as `balance_complex` or `verification_required`.
3. Decisions combine ordinary signals with those projection outputs.
That lets the profile reuse one difficulty story and one verification story across many routes without repeating the same threshold logic inside every decision.
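The tiered selection step can be sketched as picking the highest-priority decision whose predicate matches the fired signal and projection names. The predicates below are simplified stand-ins for the real match rules in `balance.yaml`:

```python
# Minimal sketch of tiered decision selection. Each decision carries a
# priority, an alias, and a predicate over the set of fired signal and
# projection names; the highest-priority matching decision wins.
DECISIONS = [
    (250, "reasoning_math", "openai/gpt5.4",
     lambda s: "domain:math" in s and "projection:balance_reasoning" in s),
    (205, "medium_psychology", "qwen/qwen3.5-rocm",
     lambda s: "domain:psychology" in s and "projection:balance_medium" in s),
    (170, "simple_general", "qwen/qwen3.5-rocm",
     lambda s: "projection:balance_simple" in s),
]

def select_route(fired: set) -> tuple:
    """Return (decision, alias) for the highest-priority match."""
    for _, name, alias, predicate in sorted(DECISIONS, key=lambda d: d[0],
                                            reverse=True):
        if predicate(fired):
            return name, alias
    # The broad default landing zone mirrors the simple_general fallback.
    return "simple_general", "qwen/qwen3.5-rocm"
```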
Test these in the dashboard playground at http://<your-server-ip>:8700:
The same stable examples are also maintained as machine-readable probes in balance.probes.yaml for live POST /api/v1/eval calibration loops. The maintained suite currently covers all 23 decisions with 58 probe variants, so routing changes are checked against a small robustness set instead of one crafted prompt per route.
Each decision below includes every maintained probe variant from the manifest, so the README stays copy-pasteable for playground checks and aligned with the executable eval suite.
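A calibration loop over such probes can be sketched as follows. The probe pairs and the `route` stub are placeholders standing in for the real `balance.probes.yaml` entries and a live `POST /api/v1/eval` call:

```python
# Sketch of a probe-driven calibration loop. Each probe pairs a prompt
# with the alias it is expected to land on; `route` is a local stub
# standing in for a live POST /api/v1/eval call against the router.
PROBES = [
    ("Prove rigorously that the square root of 2 is irrational.",
     "openai/gpt5.4"),
    ("What is 2 + 2? Answer briefly.",
     "qwen/qwen3.5-rocm"),
]

def route(prompt: str) -> str:
    # Stub: a real loop would send the prompt to the router and read
    # back the selected alias from the eval response.
    return "openai/gpt5.4" if "Prove" in prompt else "qwen/qwen3.5-rocm"

passed = sum(route(prompt) == expected for prompt, expected in PROBES)
```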
**premium_legal** (expected alias: `anthropic/claude-opus-4.6`)
High-stakes legal analysis that should avoid generic business routing.
Provide a legal analysis of the indemnity clause, liability cap, and compliance obligations in this contract.
Assess the legal risk in this agreement by analyzing indemnification, limitation of liability, and the compliance duties each party assumes.
**reasoning_math** (expected alias: `openai/gpt5.4`)
Pure mathematical reasoning that should not collapse into a generic reasoning lane.
Prove rigorously that the square root of 2 is irrational.
Give a formal proof that sqrt(2) cannot be expressed as a ratio of integers.
**reasoning_philosophy** (expected alias: `openai/gpt5.4`)
Philosophy-domain reasoning with explicit argumentative depth.
Compare utilitarianism and deontology, then argue which framework better handles autonomous-vehicle dilemmas.
Compare compatibilism and libertarian free will, then argue which view better explains moral responsibility.
**complex_agentic** (expected alias: `google/gemini-3.1-pro`)
Multi-step agentic planning with execution phases, checkpoints, and rollback structure.
Plan a zero-downtime monolith-to-microservices migration with checkpoints, rollback steps, owners, and validation after each phase.
Create a phased runbook for consolidating two internal platforms, with owners, dependencies, rollback criteria, and verification gates for every phase.
Create a phased incident-recovery runbook with owners, checkpoints, rollback criteria, and verification gates for each stage.
**complex_architecture** (expected alias: `google/gemini-3.1-pro`)
System-design and architecture requests without strong workflow-orchestration cues.
Design the software architecture for a distributed rate limiter in a microservices platform, including service boundaries and consistency tradeoffs.
Design the architecture for a multi-region feature-flag service, including storage boundaries, cache strategy, and consistency tradeoffs.
**complex_stem** (expected alias: `google/gemini-3.1-pro`)
Specialist STEM reasoning with technical explanation plus experiment design.
In electrochemistry terms, compare SEI growth, lithium plating, and cathode cracking as causes of lithium-ion battery degradation, then propose experiments to isolate the dominant mechanism.
Compare dielectric loss, flux noise, and quasiparticle poisoning as causes of superconducting-qubit decoherence, then propose experiments to isolate the dominant source.
**feedback_wrong_answer_verified** (expected alias: `google/gemini-3.1-pro`)
Wrong-answer correction requests that also require verified, sourced answers.
This is wrong. Please correct the explanation of why the Roman Republic collapsed and cite reliable historical sources.
You got this wrong earlier; re-answer why the Roman Republic collapsed and support the correction with sources.
That is incorrect. Please correct the explanation of what caused the Meiji Restoration and support the correction with sources.
**medium_code_general** (expected alias: `qwen/qwen3.5-rocm`)
Mid-tier coding help without architecture-heavy or agentic workflow cues.
Debug this Python stack trace and suggest the most likely fix.
A Java unit test is failing after a refactor; explain the most likely cause and suggest the first fix to try.
After a refactor, an integration test started failing in a Java codebase. Explain the most likely cause and the first code change to inspect.
这太离谱了!!!马上告诉我该怎么处理这个 bug。
**verified_business** (expected alias: `google/gemini-2.5-flash-lite`)
Business analysis with explicit evidence or source requirements.
Verify this claim with evidence: compare two B2B SaaS pricing strategies and cite sources for the market-share claim.
Compare enterprise SaaS churn benchmarks and verify the claim with sources before drawing a conclusion.
Compare B2B SaaS retention benchmarks and support the answer with sources before recommending a pricing model.
**medium_business** (expected alias: `qwen/qwen3.5-rocm`)
Business reasoning without explicit verification requirements.
Explain when a mid-market SaaS company should prefer product-led growth over sales-led growth, and outline the trade-offs.
Explain when a B2B software company should prefer usage-based pricing over seat-based pricing, and outline the trade-offs.
Explain the trade-offs between annual and monthly pricing for a B2B SaaS product.
**verified_health** (expected alias: `google/gemini-3.1-pro`)
Health-domain answers with explicit reliable-source requirements.
What are the early symptoms of iron deficiency? Please cite reliable medical sources.
What are common early signs of sleep apnea? Answer with citations to reliable medical sources.
**verified_history** (expected alias: `google/gemini-2.5-flash-lite`)
History explanations that explicitly demand citations or verification.
Verify the claim and answer with citations: why did the Roman Republic collapse?
Explain what caused the Meiji Restoration and support the answer with reputable historical sources.
**medium_history** (expected alias: `qwen/qwen3.5-rocm`)
History explanations without explicit evidence or verification overlays.
In plain language, explain why the Roman Republic collapsed.
In plain language, explain why the Ming dynasty fell.
In plain language, explain what caused the Meiji Restoration.
**medium_psychology** (expected alias: `qwen/qwen3.5-rocm`)
Psychology explanation and practical intervention lane.
Why do people procrastinate even when the task matters, and what interventions tend to help?
Why do people fall into confirmation bias, and what strategies usually help reduce it?
Why do people procrastinate on important work, and what interventions usually help?
**engaged_general** (expected alias: `google/gemini-2.5-flash-lite`)
Emotion-aware and urgency-aware general lane for prompts that should avoid brittle specialist or fast-QA misroutes.
太好了!!!我终于拿到 offer 了,帮我写一段兴奋但得体的回复。
I am overwhelmed right now!! Help me write a calm text to my roommate and keep it supportive.
This is ridiculous!! Help me write a calm message to reschedule tonight's dinner.
**medium_creative** (expected alias: `google/gemini-2.5-flash-lite`)
Creative ideation without specialist routing or verification cues.
Invent three evocative taglines for a fictional tea house called Moonleaf.
Invent three concise names and taglines for a boutique pottery studio with a calm, modern feel.
**reasoning_general** (expected alias: `openai/gpt5.4`)
Deep reasoning requests with explicit non-specialist framing.
Compare three general approaches to reasoning under uncertainty, without focusing on any specific domain.
Compare deductive, inductive, and abductive reasoning in general terms, without anchoring the discussion to any one domain.
**feedback_need_clarification** (expected alias: `qwen/qwen3.5-rocm`)
Clarification feedback overlays that ask for a simpler restatement.
Explain that more clearly and give one simple example.
That was confusing. Restate it more simply and walk me through one concrete example.
Please explain that more clearly and give one concrete example.
**verified_fast_qa_zh** (expected alias: `qwen/qwen3.5-rocm`)
Short Chinese factual questions with explicit verification or source cues.
法国的首都是巴黎吗?给出处。
太阳系最大的行星是木星吗?请核实并给来源。
澳大利亚的首都是悉尼还是堪培拉?请核实并给出处。
**simple_fast_qa_zh** (expected alias: `qwen/qwen3.5-rocm`)
Short Chinese factual questions without verification overlays.
水的化学式是什么?请简短回答。
一年有几个月?请简短回答。
**verified_fast_qa_en** (expected alias: `qwen/qwen3.5-rocm`)
Short English factual questions with explicit verification or source requirements.
Verify this with a source: Is the capital of Australia Sydney or Canberra?
Verify with a source whether light travels faster than sound.
**simple_fast_qa_en** (expected alias: `qwen/qwen3.5-rocm`)
Very short English factual or identity-style Q&A without verification cues.
Who are you? Answer briefly.
What is 2 + 2? Answer briefly.
**simple_general** (expected alias: `qwen/qwen3.5-rocm`)
Short general explanations that should avoid fast-QA and specialist routes.
In one short paragraph, explain how induction cooktops work for a home kitchen user.
In one short paragraph, explain how a refrigerator keeps food cold.
In one short paragraph, explain how composting works for a first-time apartment resident.
- `sudo docker ps --filter name=vllm` shows the single backend container as healthy.
- `curl -s "http://localhost:${VLLM_PORT_122B:-8090}/v1/models"` lists all five tier-aware alias IDs.
- The router is started with `vllm-sr serve --image-pull-policy never --platform amd`.
- Requests hitting the playground show matched signals, one selected decision, one selected alias, and cost/savings in Insights.
- `deploy/recipes/balance.dsl` remains aligned with the maintained routing authoring story, and `deploy/recipes/balance.yaml` remains aligned with this document's alias catalog, signal summary, projection summary, decision table, and examples.
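The alias check against `/v1/models` can also be automated. In this sketch the sample payload is an inlined stand-in for live curl output; a real check would read the HTTP response body instead:

```python
import json

# Check a /v1/models response body for the five tier-aware alias IDs.
# SAMPLE stands in for the live output of:
#   curl -s "http://localhost:${VLLM_PORT_122B:-8090}/v1/models"
SAMPLE = """{"object": "list", "data": [
  {"id": "qwen/qwen3.5-rocm"},
  {"id": "google/gemini-2.5-flash-lite"},
  {"id": "google/gemini-3.1-pro"},
  {"id": "openai/gpt5.4"},
  {"id": "anthropic/claude-opus-4.6"}
]}"""

EXPECTED_ALIASES = {
    "qwen/qwen3.5-rocm",
    "google/gemini-2.5-flash-lite",
    "google/gemini-3.1-pro",
    "openai/gpt5.4",
    "anthropic/claude-opus-4.6",
}

served = {model["id"] for model in json.loads(SAMPLE)["data"]}
missing = EXPECTED_ALIASES - served  # empty set means all aliases are served
```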