vLLM Semantic Router on AMD ROCm

This playbook documents the AMD reference profile for a single real ROCm vLLM backend that exposes multiple semantic served-model aliases. The router keeps replay and Insights signal-native: records show matched signals, selected decision, selected alias, and cost/savings, without inventing a separate runtime dimension schema.

Overview

Physical backend model: Qwen/Qwen3.5-122B-A10B-FP8
Docker service name expected by the profile: vllm:8000
Served-model aliases exposed by the backend:
- qwen/qwen3.5-rocm
- google/gemini-2.5-flash-lite
- google/gemini-3.1-pro
- openai/gpt5.4
- anthropic/claude-opus-4.6
Reference routing profile: balance.yaml
- Canonical layout: version/listeners/providers/routing/global
- providers.defaults.default_model points at the SIMPLE tier
- providers.models[].pricing is example pricing for Insights cost comparison
- routing.decisions uses tier-prefixed dual-layer families
- global.model_catalog.modules only tightens learned-signal thresholds for conservative overlays

The active AMD profile contains 23 routing decisions:

simple_* (3): lowest-cost FAQ and general fallback
medium_* (5): low-to-mid-cost domain/scenario refinement
verified_* (5): evidence-sensitive overlays layered just above their base routes
feedback_* (2): explicit correction and clarification recovery lanes
complex_* (3): hard technical, STEM, and agentic synthesis
reasoning_* (3): high-reasoning escalation
engaged_general (1): emotion-aware and urgency-aware general fallback above the cheap default lane
premium_* (1): one premium legal path only

Installation

Step 1: Start the AMD vLLM backend

Create the shared Docker network first, then start the single ROCm backend container:

sudo docker network create vllm-sr-network 2>/dev/null || true

sudo docker run -d \
  --name vllm \
  --network=vllm-sr-network \
  --restart unless-stopped \
  -p "${VLLM_PORT_122B:-8090}:8000" \
  -v "${VLLM_HF_CACHE:-/mnt/data/huggingface-cache}:/root/.cache/huggingface" \
  --device=/dev/kfd \
  --device=/dev/dri \
  --group-add=video \
  --ipc=host \
  --cap-add=SYS_PTRACE \
  --security-opt seccomp=unconfined \
  --shm-size 32G \
  -v /data:/data \
  -v "$HOME:/myhome" \
  -w /myhome \
  -e VLLM_ROCM_USE_AITER=1 \
  -e VLLM_USE_AITER_UNIFIED_ATTENTION=1 \
  -e VLLM_ROCM_USE_AITER_MHA=0 \
  --entrypoint python3 \
  vllm/vllm-openai-rocm:v0.17.0 \
  -m vllm.entrypoints.openai.api_server \
    --model Qwen/Qwen3.5-122B-A10B-FP8 \
    --host 0.0.0.0 \
    --port 8000 \
    --enable-auto-tool-choice \
    --tool-call-parser qwen3_coder \
    --served-model-name qwen/qwen3.5-rocm google/gemini-2.5-flash-lite google/gemini-3.1-pro openai/gpt5.4 anthropic/claude-opus-4.6 \
    --trust-remote-code \
    --reasoning-parser qwen3 \
    --max-model-len 262144 \
    --language-model-only \
    --max-num-seqs 128 \
    --kv-cache-dtype fp8 \
    --gpu-memory-utilization 0.85

Step 2: Install vLLM Semantic Router

curl -fsSL https://vllm-semantic-router.com/install.sh | bash

Step 3: Access the dashboard

If everything is working, the dashboard is available at:

http://<your-server-ip>:8700

Complete onboarding and import the reference profile from remote:

https://raw.githubusercontent.com/vllm-project/semantic-router/main/deploy/recipes/balance.yaml

Onboarding remote import can apply the full YAML directly. If you import the same file into the DSL editor, the routing surfaces decompile from routing.modelCards, routing.signals, routing.projections, and routing.decisions, while providers stays YAML-native.

Architecture

Client
  |
  v
vLLM Semantic Router (:8899)
  |
  +-- signal evaluation
  |   - keyword
  |   - embedding
  |   - structure
  |   - fact_check
  |   - user_feedback
  |   - preference
  |   - language
  |   - context
  |   - complexity
  |   - domain
  |
  +-- projection coordination
  |   - domain partition winner
  |   - intent partition winner
  |   - difficulty band
  |   - emotion band
  |   - urgency band
  |   - verification band
  |
  +-- tiered decision selection
  |   - priority and tier choose one route
  |   - route rules combine raw signals with projection outputs
  |
  +-- alias-forwarded OpenAI request
  |   - SIMPLE: qwen/qwen3.5-rocm
  |   - MEDIUM: google/gemini-2.5-flash-lite
  |   - COMPLEX: google/gemini-3.1-pro
  |   - REASONING: openai/gpt5.4
  |   - PREMIUM: anthropic/claude-opus-4.6
  |
  v
Single ROCm vLLM backend on vllm:8000
  |
  v
Qwen/Qwen3.5-122B-A10B-FP8

The runtime does not add a separate 15-dimension scorecard. Instead, the profile expresses those routing ideas through native vSR signals and then exposes the matched signals, chosen decision, and chosen alias in replay and Insights.

The fact_check and user_feedback lanes are intentionally conservative:

they require both the learned signal and explicit lexical confirmation
they keep most traffic on qwen/qwen3.5-rocm
they do not add any extra route-local plugins beyond the profile's existing replay capture

Alias Catalog

Tier	Alias	Example pricing per 1M tokens	Role in the profile
SIMPLE	`qwen/qwen3.5-rocm`	prompt `$0.00`, completion `$0.00`	Free self-hosted default alias for fast QA, broad fallback, and most low-cost traffic
MEDIUM	`google/gemini-2.5-flash-lite`	prompt `$0.01`, completion `$0.04`	Low-cost expressive route for creative and softer medium tasks
COMPLEX	`google/gemini-3.1-pro`	prompt `$0.48`, completion `$1.92`	Hard STEM and architecture design
REASONING	`openai/gpt5.4`	prompt `$1.20`, completion `$4.80`	Multi-step reasoning, proofs, and philosophy
PREMIUM	`anthropic/claude-opus-4.6`	prompt `$1.80`, completion `$7.20`	Reserved for legal and high-risk analysis

Pricing is intentionally exaggerated for Insights demos so savings are easy to see. These values are not intended to mirror real vendor billing.

Active Routing Decisions

Priority	Decision	Alias	What it is for	Match sketch
260	`premium_legal`	`anthropic/claude-opus-4.6`	Highest-risk legal and compliance analysis	law or explicit legal-risk cues + premium legal embedding, verification overlay, or medium/hard `legal_risk`
250	`reasoning_math`	`openai/gpt5.4`	Proofs, derivations, and hard math	`domain:math` + `projection:balance_reasoning`
245	`reasoning_philosophy`	`openai/gpt5.4`	Philosophy prompts that need deep argumentation	`domain:philosophy` + `projection:balance_reasoning`
243	`complex_agentic`	`google/gemini-3.1-pro`	High-structure execution plans, migrations, and workflow orchestration	agentic embedding / preference / markers + `projection:balance_complex` or `projection:balance_reasoning`, excluding architecture markers
240	`complex_architecture`	`google/gemini-3.1-pro`	Complex systems and architecture design	CS or engineering + architecture embedding / markers + `projection:balance_complex` or `projection:balance_reasoning`
235	`complex_stem`	`google/gemini-3.1-pro`	Complex STEM synthesis outside dedicated math	STEM domain + STEM or research embedding, or high routing band
232	`feedback_wrong_answer_verified`	`google/gemini-3.1-pro`	Explicit correction on evidence-sensitive follow-ups	`user_feedback:wrong_answer` + correction markers + short/medium context + verification pressure or evidence-synthesis escalation
220	`medium_code_general`	`qwen/qwen3.5-rocm`	Low-medium cost coding, debugging, and technical Q&A	code domain / markers / embedding + `projection:balance_medium` or `projection:balance_complex`, or short urgent code prompts with `projection:balance_simple` + `projection:urgency_elevated`
216	`verified_business`	`google/gemini-2.5-flash-lite`	Evidence-sensitive business or economics requests	business/economics + `projection:verification_required` or hard evidence synthesis + business embedding or medium/complex routing band
215	`medium_business`	`qwen/qwen3.5-rocm`	Mid-tier business and economics analysis	business/economics + `embedding:business_analysis` + `projection:balance_medium` or `projection:balance_complex`, excluding verification overlay
214	`verified_health`	`google/gemini-3.1-pro`	Evidence-sensitive health and medical guidance	`domain:health` + `projection:verification_required` + health embedding or medium/complex/reasoning band
211	`verified_history`	`google/gemini-2.5-flash-lite`	Source-sensitive history explanation	`domain:history` + `projection:verification_required` or hard evidence synthesis + history embedding or medium/complex routing band
210	`medium_history`	`qwen/qwen3.5-rocm`	Mid-tier history explanation and comparison	`domain:history` + `embedding:history_explainer` + `projection:balance_medium` or `projection:balance_complex`, excluding verification overlay
205	`medium_psychology`	`qwen/qwen3.5-rocm`	Psychology and behavior queries with nuanced explanation	`domain:psychology` + `embedding:psychology_support` + `projection:balance_medium` or `projection:balance_complex`
202	`engaged_general`	`google/gemini-2.5-flash-lite`	General or psychology-adjacent prompts with visible emotion or urgency	`projection:emotion_positive` or `projection:emotion_negative` or `projection:urgency_elevated` + general/psychology cues, excluding specialist and verification-heavy lanes
200	`medium_creative`	`google/gemini-2.5-flash-lite`	Creative writing, copywriting, and ideation	creative markers / embedding / collaboration preference + `projection:balance_simple` or `projection:balance_medium`
190	`reasoning_general`	`openai/gpt5.4`	Non-specialist deep analysis and multi-step reasoning	reasoning / research / multi-step cues + `projection:balance_complex` or `projection:balance_reasoning`, excluding specialist embeddings and broad technical markers
185	`feedback_need_clarification`	`qwen/qwen3.5-rocm`	Cheap clarification follow-up lane	`user_feedback:need_clarification` + clarification markers + short/medium context
181	`verified_fast_qa_zh`	`qwen/qwen3.5-rocm`	Chinese short FAQ with explicit verification ask	`embedding:fast_qa_zh` + `language:zh` + `context:short_context` + simple/medium routing band + verification cue or fact-check pressure
180	`simple_fast_qa_zh`	`qwen/qwen3.5-rocm`	Cheapest Chinese factual / definitional answers	`embedding:fast_qa_zh` + `language:zh` + `context:short_context` + `projection:balance_simple`, excluding verification, code, and urgency overlays
176	`verified_fast_qa_en`	`qwen/qwen3.5-rocm`	English short FAQ with explicit verification ask	`embedding:fast_qa_en` + `language:en` + `context:short_context` + simple/medium routing band + verification cue or fact-check pressure
175	`simple_fast_qa_en`	`qwen/qwen3.5-rocm`	Cheapest English factual / definitional answers	`embedding:fast_qa_en` + `language:en` + `context:short_context` + `projection:balance_simple`, excluding verification, code, and urgency overlays
170	`simple_general`	`qwen/qwen3.5-rocm`	Lowest-cost fallback for non-specialized traffic	short simple traffic, or medium-context `domain:other` traffic with simple/medium band, excluding fast-QA embeddings

This ordering is intentional:

specialized premium and hard-reasoning routes win first
explicit correction recovery beats ordinary medium traffic, but only with strong confirmation cues
factual overlays sit just above their cheap base routes instead of replacing them
complex technical routes beat generic reasoning routes
medium routes only accept easy or medium complexity
simple routes remain the broad default landing zone

Signal Overview

The profile uses the standard vSR signal families directly under routing.signals:

Signal family	Role in this profile	Representative names
`keywords`	explicit lexical confirmation for route style, verification asks, emotion or urgency cues, feedback cues, and task shape	`verification_markers`, `emotion_negative_markers`, `urgency_markers`, `clarification_feedback_markers`
`embeddings`	learned intent and specialist boundaries	`fast_qa_en`, `architecture_design`, `business_analysis`, `premium_legal_analysis`, `reasoning_general_en`
`structure`	cheap structural overlays for workflow formatting and punctuation emphasis	`ordered_workflow`, `numbered_steps`, `exclamation_emphasis`
`fact_check`	evidence-sensitive detection that feeds verification pressure	`needs_fact_check`
`user_feedbacks`	explicit correction or clarification overlays	`wrong_answer`, `need_clarification`
`preferences`	collaboration style and request framing	`coding_partner`, `creative_collaboration`, `agentic_execution`
`language`	language-specific fast-QA split	`en`, `zh`
`domains`	subject-area routing and partitioning	`law`, `math`, `history`, `health`, `computer science`, `other`
`context`	token-count bands for cheap fallback versus longer tasks	`short_context`, `medium_context`, `long_context`
`complexity`	easy / medium / hard difficulty for general, code, math, legal, agentic, and evidence-heavy requests	`general_reasoning`, `code_task`, `math_task`, `legal_risk`, `agentic_delivery`, `evidence_synthesis`

Notable profile-specific signal details:

context bands are non-overlapping: short_context is 0-999, medium_context is 1K-7999, and long_context is 8K-256K.
complexity signals are reusable across both route predicates and projection scores through sublevels such as code_task:hard or evidence_synthesis:medium.
the emotion and urgency overlays stay heuristic on purpose: lexical markers and repeated ! / ！ are used as secondary coordination signals instead of replacing the learned primary-intent lanes.
short lexical verification and correction cues are intentionally literal in this profile, so examples that say verify this, answer with citations, or Chinese 给出处 are more reliable than looser paraphrases.
jailbreak and pii signals are still defined in the profile for safety surfaces, but they are not the primary routing predicates for the 23 active decisions.

Projection Overview

The profile uses routing.projections as the coordination layer between raw signal detections and final route selection.

Projection	Kind	Purpose	Outputs or members
`balance_domain_partition`	partition	resolves one domain winner across the supported routing domains	biology, business, chemistry, computer science, economics, engineering, health, history, law, math, other, philosophy, physics, psychology
`balance_intent_partition`	partition	resolves one learned-intent winner across the maintained embedding lanes	`agentic_workflows`, `architecture_design`, `code_general`, `creative_tasks`, `fast_qa_en`, `fast_qa_zh`, `general_chat_fallback`, and related specialist embeddings
`difficulty_score`	score	blends context, keywords, embeddings, and complexity sublevels into one difficulty signal	source for the difficulty band mapping
`difficulty_band`	mapping	converts `difficulty_score` into reusable routing bands	`balance_simple`, `balance_medium`, `balance_complex`, `balance_reasoning`
`emotion_valence`	score	blends positive and negative affect markers into one lightweight emotional-overlay score	source for the emotion band mapping
`emotion_band`	mapping	converts `emotion_valence` into reusable emotional overlays	`emotion_positive`, `emotion_negative`
`urgency_pressure`	score	blends urgency markers with exclamation-count emphasis into one urgency overlay	source for the urgency band mapping
`urgency_band`	mapping	converts `urgency_pressure` into reusable urgency overlays	`urgency_standard`, `urgency_elevated`
`verification_pressure`	score	blends `fact_check`, verification cues, high-stakes domains, long-context pressure, and wrong-answer correction pressure	source for the verification mapping
`verification_band`	mapping	converts `verification_pressure` into verification routing outputs	`verification_standard`, `verification_required`

In practice, the profile routes in two steps:

Raw signals fire under routing.signals.
Projections turn that raw evidence into named outputs such as balance_complex or verification_required.
Decisions combine ordinary signals with those projection outputs.

That lets the profile reuse one difficulty story and one verification story across many routes without repeating the same threshold logic inside every decision.

Usage Examples

Test these in the dashboard playground at http://<your-server-ip>:8700:

The same stable examples are also maintained as machine-readable probes in balance.probes.yaml for live POST /api/v1/eval calibration loops. The maintained suite currently covers all 23 decisions with 58 probe variants, so routing changes are checked against a small robustness set instead of one crafted prompt per route.

Each decision below includes every maintained probe variant from the manifest, so the README stays copy-pasteable for playground checks and aligned with the executable eval suite.

`premium_legal`

Expected alias: anthropic/claude-opus-4.6

High-stakes legal analysis that should avoid generic business routing.

`contract_clause_analysis`

Provide a legal analysis of the indemnity clause, liability cap, and compliance obligations in this contract.

`regulatory_risk_review`

Assess the legal risk in this agreement by analyzing indemnification, limitation of liability, and the compliance duties each party assumes.

`reasoning_math`

Expected alias: openai/gpt5.4

Pure mathematical reasoning that should not collapse into a generic reasoning lane.

`irrationality_proof`

Prove rigorously that the square root of 2 is irrational.

`integer_ratio_proof`

Give a formal proof that sqrt(2) cannot be expressed as a ratio of integers.

`reasoning_philosophy`

Expected alias: openai/gpt5.4

Philosophy-domain reasoning with explicit argumentative depth.

`av_ethics`

Compare utilitarianism and deontology, then argue which framework better handles autonomous-vehicle dilemmas.

`compatibilism_argument`

Compare compatibilism and libertarian free will, then argue which view better explains moral responsibility.

`complex_agentic`

Expected alias: google/gemini-3.1-pro

Multi-step agentic planning with execution phases, checkpoints, and rollback structure.

`migration_runbook`

Plan a zero-downtime monolith-to-microservices migration with checkpoints, rollback steps, owners, and validation after each phase.

`platform_cutover_plan`

Create a phased runbook for consolidating two internal platforms, with owners, dependencies, rollback criteria, and verification gates for every phase.

`incident_recovery_runbook`

Create a phased incident-recovery runbook with owners, checkpoints, rollback criteria, and verification gates for each stage.

`complex_architecture`

Expected alias: google/gemini-3.1-pro

System-design and architecture requests without strong workflow-orchestration cues.

`distributed_rate_limiter`

Design the software architecture for a distributed rate limiter in a microservices platform, including service boundaries and consistency tradeoffs.

`multi_region_feature_flags`

Design the architecture for a multi-region feature-flag service, including storage boundaries, cache strategy, and consistency tradeoffs.

`complex_stem`

Expected alias: google/gemini-3.1-pro

Specialist STEM reasoning with technical explanation plus experiment design.

`battery_degradation`

In electrochemistry terms, compare SEI growth, lithium plating, and cathode cracking as causes of lithium-ion battery degradation, then propose experiments to isolate the dominant mechanism.

`qubit_decoherence`

Compare dielectric loss, flux noise, and quasiparticle poisoning as causes of superconducting-qubit decoherence, then propose experiments to isolate the dominant source.

`feedback_wrong_answer_verified`

Expected alias: google/gemini-3.1-pro

Wrong-answer correction requests that also require verified, sourced answers.

`roman_republic_correction`

This is wrong. Please correct the explanation of why the Roman Republic collapsed and cite reliable historical sources.

`earlier_answer_wrong`

You got this wrong earlier; re-answer why the Roman Republic collapsed and support the correction with sources.

`meiji_correction`

That is incorrect. Please correct the explanation of what caused the Meiji Restoration and support the correction with sources.

`medium_code_general`

Expected alias: qwen/qwen3.5-rocm

Mid-tier coding help without architecture-heavy or agentic workflow cues.

`python_stack_trace`

Debug this Python stack trace and suggest the most likely fix.

`failing_unit_test`

A Java unit test is failing after a refactor; explain the most likely cause and suggest the first fix to try.

`integration_test_refactor`

After a refactor, an integration test started failing in a Java codebase. Explain the most likely cause and the first code change to inspect.

`urgent_bug_zh`

这太离谱了！！！马上告诉我该怎么处理这个 bug。

`verified_business`

Expected alias: google/gemini-2.5-flash-lite

Business analysis with explicit evidence or source requirements.

`pricing_strategy_evidence`

Verify this claim with evidence: compare two B2B SaaS pricing strategies and cite sources for the market-share claim.

`churn_benchmark_sources`

Compare enterprise SaaS churn benchmarks and verify the claim with sources before drawing a conclusion.

`retention_sources`

Compare B2B SaaS retention benchmarks and support the answer with sources before recommending a pricing model.

`medium_business`

Expected alias: qwen/qwen3.5-rocm

Business reasoning without explicit verification requirements.

`plg_vs_slg`

Explain when a mid-market SaaS company should prefer product-led growth over sales-led growth, and outline the trade-offs.

`pricing_model_tradeoffs`

Explain when a B2B software company should prefer usage-based pricing over seat-based pricing, and outline the trade-offs.

`annual_vs_monthly`

Explain the trade-offs between annual and monthly pricing for a B2B SaaS product.

`verified_health`

Expected alias: google/gemini-3.1-pro

Health-domain answers with explicit reliable-source requirements.

`iron_deficiency`

What are the early symptoms of iron deficiency? Please cite reliable medical sources.

`sleep_apnea_sources`

What are common early signs of sleep apnea? Answer with citations to reliable medical sources.

`verified_history`

Expected alias: google/gemini-2.5-flash-lite

History explanations that explicitly demand citations or verification.

`roman_republic_with_citations`

Verify the claim and answer with citations: why did the Roman Republic collapse?

`meiji_restoration_sources`

Explain what caused the Meiji Restoration and support the answer with reputable historical sources.

`medium_history`

Expected alias: qwen/qwen3.5-rocm

History explanations without explicit evidence or verification overlays.

`roman_republic_plain`

In plain language, explain why the Roman Republic collapsed.

`ming_dynasty_plain`

In plain language, explain why the Ming dynasty fell.

`meiji_plain`

In plain language, explain what caused the Meiji Restoration.

`medium_psychology`

Expected alias: qwen/qwen3.5-rocm

Psychology explanation and practical intervention lane.

`procrastination`

Why do people procrastinate even when the task matters, and what interventions tend to help?

`confirmation_bias`

Why do people fall into confirmation bias, and what strategies usually help reduce it?

`procrastination_important_work`

Why do people procrastinate on important work, and what interventions usually help?

`engaged_general`

Expected alias: google/gemini-2.5-flash-lite

Emotion-aware and urgency-aware general lane for prompts that should avoid brittle specialist or fast-QA misroutes.

`celebratory_reply_zh`

太好了！！！我终于拿到 offer 了，帮我写一段兴奋但得体的回复。

`roommate_text`

I am overwhelmed right now!! Help me write a calm text to my roommate and keep it supportive.

`dinner_reschedule`

This is ridiculous!! Help me write a calm message to reschedule tonight's dinner.

`medium_creative`

Expected alias: google/gemini-2.5-flash-lite

Creative ideation without specialist routing or verification cues.

`tea_house_taglines`

Invent three evocative taglines for a fictional tea house called Moonleaf.

`pottery_branding`

Invent three concise names and taglines for a boutique pottery studio with a calm, modern feel.

`reasoning_general`

Expected alias: openai/gpt5.4

Deep reasoning requests with explicit non-specialist framing.

`uncertainty_frameworks`

Compare three general approaches to reasoning under uncertainty, without focusing on any specific domain.

`inference_modes`

Compare deductive, inductive, and abductive reasoning in general terms, without anchoring the discussion to any one domain.

`feedback_need_clarification`

Expected alias: qwen/qwen3.5-rocm

Clarification feedback overlays that ask for a simpler restatement.

`explain_more_clearly`

Explain that more clearly and give one simple example.

`restate_simply`

That was confusing. Restate it more simply and walk me through one concrete example.

`concrete_example`

Please explain that more clearly and give one concrete example.

`verified_fast_qa_zh`

Expected alias: qwen/qwen3.5-rocm

Short Chinese factual questions with explicit verification or source cues.

`paris_source`

法国的首都是巴黎吗？给出处。

`jupiter_verify`

太阳系最大的行星是木星吗？请核实并给来源。

`australia_capital_verify`

澳大利亚的首都是悉尼还是堪培拉？请核实并给出处。

`simple_fast_qa_zh`

Expected alias: qwen/qwen3.5-rocm

Short Chinese factual questions without verification overlays.

`water_formula`

水的化学式是什么？请简短回答。

`months_in_year`

一年有几个月？请简短回答。

`verified_fast_qa_en`

Expected alias: qwen/qwen3.5-rocm

Short English factual questions with explicit verification or source requirements.

`australia_capital`

Verify this with a source: Is the capital of Australia Sydney or Canberra?

`light_vs_sound`

Verify with a source whether light travels faster than sound.

`simple_fast_qa_en`

Expected alias: qwen/qwen3.5-rocm

Very short English factual or identity-style Q and A without verification cues.

`who_are_you`

Who are you? Answer briefly.

`arithmetic`

What is 2 + 2? Answer briefly.

`simple_general`

Expected alias: qwen/qwen3.5-rocm

Short general explanations that should avoid fast-QA and specialist routes.

`induction_cooktops`

In one short paragraph, explain how induction cooktops work for a home kitchen user.

`refrigerator_cooling`

In one short paragraph, explain how a refrigerator keeps food cold.

`heat_pump_plain`

In one short paragraph, explain how composting works for a first-time apartment resident.

Validation Checklist

sudo docker ps --filter name=vllm shows the single backend container as healthy.
curl -s "http://localhost:${VLLM_PORT_122B:-8090}/v1/models" lists all five tier-aware alias IDs.
The router is started with vllm-sr serve --image-pull-policy never --platform amd.
Requests hitting the playground show matched signals, one selected decision, one selected alias, and cost/savings in Insights.
deploy/recipes/balance.dsl remains aligned with the maintained routing authoring story, and deploy/recipes/balance.yaml remains aligned with this document's alias catalog, signal summary, projection summary, decision table, and examples.

FilesExpand file tree

README.md

Latest commit

History

README.md

File metadata and controls

vLLM Semantic Router on AMD ROCm

Overview

Installation

Step 1: Start the AMD vLLM backend

Step 2: Install vLLM Semantic Router

Step 3: Access the dashboard

Architecture

Alias Catalog

Active Routing Decisions

Signal Overview

Projection Overview

Usage Examples

premium_legal

contract_clause_analysis

regulatory_risk_review

reasoning_math

irrationality_proof

integer_ratio_proof

reasoning_philosophy

av_ethics

compatibilism_argument

complex_agentic

migration_runbook

platform_cutover_plan

incident_recovery_runbook

complex_architecture

distributed_rate_limiter

multi_region_feature_flags

complex_stem

battery_degradation

qubit_decoherence

feedback_wrong_answer_verified

roman_republic_correction

earlier_answer_wrong

meiji_correction

medium_code_general

python_stack_trace

failing_unit_test

integration_test_refactor

urgent_bug_zh

verified_business

pricing_strategy_evidence

churn_benchmark_sources

retention_sources

medium_business

plg_vs_slg

pricing_model_tradeoffs

annual_vs_monthly

verified_health

iron_deficiency

sleep_apnea_sources

verified_history

roman_republic_with_citations

meiji_restoration_sources

medium_history

roman_republic_plain

ming_dynasty_plain

meiji_plain

medium_psychology

procrastination

confirmation_bias

procrastination_important_work

engaged_general

celebratory_reply_zh

roommate_text

dinner_reschedule

medium_creative

tea_house_taglines

pottery_branding

reasoning_general

uncertainty_frameworks

inference_modes

feedback_need_clarification

explain_more_clearly

`premium_legal`

`contract_clause_analysis`

`regulatory_risk_review`

`reasoning_math`

`irrationality_proof`

`integer_ratio_proof`

`reasoning_philosophy`

`av_ethics`

`compatibilism_argument`

`complex_agentic`

`migration_runbook`

`platform_cutover_plan`

`incident_recovery_runbook`

`complex_architecture`

`distributed_rate_limiter`

`multi_region_feature_flags`

`complex_stem`

`battery_degradation`

`qubit_decoherence`

`feedback_wrong_answer_verified`

`roman_republic_correction`

`earlier_answer_wrong`

`meiji_correction`

`medium_code_general`

`python_stack_trace`

`failing_unit_test`

`integration_test_refactor`

`urgent_bug_zh`

`verified_business`

`pricing_strategy_evidence`

`churn_benchmark_sources`

`retention_sources`

`medium_business`

`plg_vs_slg`

`pricing_model_tradeoffs`

`annual_vs_monthly`

`verified_health`

`iron_deficiency`

`sleep_apnea_sources`

`verified_history`

`roman_republic_with_citations`

`meiji_restoration_sources`

`medium_history`

`roman_republic_plain`

`ming_dynasty_plain`

`meiji_plain`

`medium_psychology`

`procrastination`

`confirmation_bias`

`procrastination_important_work`

`engaged_general`

`celebratory_reply_zh`

`roommate_text`

`dinner_reschedule`

`medium_creative`

`tea_house_taglines`

`pottery_branding`

`reasoning_general`

`uncertainty_frameworks`

`inference_modes`

`feedback_need_clarification`

`explain_more_clearly`

`restate_simply`

`concrete_example`

`verified_fast_qa_zh`

`paris_source`

`jupiter_verify`

`australia_capital_verify`

`simple_fast_qa_zh`

`water_formula`

`months_in_year`

`verified_fast_qa_en`

`australia_capital`

`light_vs_sound`

`simple_fast_qa_en`

`who_are_you`

`arithmetic`

`simple_general`

`induction_cooktops`

`refrigerator_cooling`