refactor: Implement modular candle-binding architecture (#254) by rootfs · Pull Request #266 · vllm-project/semantic-router

rootfs · 2025-09-28T12:40:16Z

Restructure codebase into modular layers (core/, ffi/, model_architectures/, classifiers/)
Add unified error handling and configuration loading systems
Implement dual-path architecture for traditional and LoRA models
Add comprehensive FFI layer with memory safety

Maintains backward compatibility while enabling future model integrations.
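
As a rough illustration of what "FFI layer with memory safety" can mean in practice, here is a minimal sketch of one common ownership convention for strings that cross a C boundary. The function names `classify_label` and `free_label` and the placeholder classification logic are hypothetical, not taken from this PR. A real export would also carry a `no_mangle` attribute, omitted here to keep the sketch edition-neutral.

```rust
use std::ffi::{c_char, CStr, CString};

/// Hypothetical FFI entry point: returns a heap-allocated, NUL-terminated
/// label for `text`, or null on invalid input. The caller must release the
/// result with `free_label`.
pub extern "C" fn classify_label(text: *const c_char) -> *mut c_char {
    // Reject null pointers before touching the memory.
    if text.is_null() {
        return std::ptr::null_mut();
    }
    // SAFETY: the caller promises `text` points to a valid NUL-terminated string.
    let input = match unsafe { CStr::from_ptr(text) }.to_str() {
        Ok(s) => s,
        Err(_) => return std::ptr::null_mut(), // not valid UTF-8
    };
    // Placeholder logic standing in for a real classifier (an assumption).
    let label = if input.contains('@') { "pii" } else { "none" };
    match CString::new(label) {
        Ok(c) => c.into_raw(), // ownership moves to the caller
        Err(_) => std::ptr::null_mut(),
    }
}

/// Reclaims a string previously returned by `classify_label`.
pub extern "C" fn free_label(ptr: *mut c_char) {
    if !ptr.is_null() {
        // SAFETY: `ptr` came from `CString::into_raw` and is freed exactly once.
        drop(unsafe { CString::from_raw(ptr) });
    }
}
```

The point of the convention is that memory allocated by Rust is always released by Rust, so the Go side never calls its own allocator's free on a Rust pointer.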

Which issue(s) this PR fixes:

Fixes #485

netlify · 2025-09-28T12:40:22Z

✅ Deploy Preview for vllm-semantic-router ready!

Name	Link
🔨 Latest commit	`9adab4e`
🔍 Latest deploy log	https://app.netlify.com/projects/vllm-semantic-router/deploys/68fb90afaa796b0008b456a5
😎 Deploy Preview	https://deploy-preview-266--vllm-semantic-router.netlify.app

github-actions · 2025-09-28T12:40:26Z

👥 vLLM Semantic Team Notification

The following members have been identified for the changed files in this PR and have been automatically assigned:

📁 `candle-binding`

Owners: @rootfs
Files changed:

candle-binding/src/classifiers/lora/intent_lora.rs
candle-binding/src/classifiers/lora/intent_lora_test.rs
candle-binding/src/classifiers/lora/mod.rs
candle-binding/src/classifiers/lora/parallel_engine.rs
candle-binding/src/classifiers/lora/parallel_engine_test.rs
candle-binding/src/classifiers/lora/pii_lora.rs
candle-binding/src/classifiers/lora/pii_lora_test.rs
candle-binding/src/classifiers/lora/security_lora.rs
candle-binding/src/classifiers/lora/security_lora_test.rs
candle-binding/src/classifiers/lora/token_lora.rs
candle-binding/src/classifiers/lora/token_lora_test.rs
candle-binding/src/classifiers/mod.rs
candle-binding/src/classifiers/traditional/batch_processor.rs
candle-binding/src/classifiers/traditional/batch_processor_test.rs
candle-binding/src/classifiers/traditional/mod.rs
candle-binding/src/classifiers/traditional/modernbert_classifier.rs
candle-binding/src/classifiers/traditional/modernbert_classifier_test.rs
candle-binding/src/classifiers/unified.rs
candle-binding/src/classifiers/unified_test.rs
candle-binding/src/core/config_loader.rs
candle-binding/src/core/config_loader_test.rs
candle-binding/src/core/mod.rs
candle-binding/src/core/similarity.rs
candle-binding/src/core/similarity_test.rs
candle-binding/src/core/tokenization.rs
candle-binding/src/core/tokenization_test.rs
candle-binding/src/core/unified_error.rs
candle-binding/src/core/unified_error_test.rs
candle-binding/src/ffi/classify.rs
candle-binding/src/ffi/classify_test.rs
candle-binding/src/ffi/embedding.rs
candle-binding/src/ffi/embedding_test.rs
candle-binding/src/ffi/init.rs
candle-binding/src/ffi/init_test.rs
candle-binding/src/ffi/memory.rs
candle-binding/src/ffi/memory_safety.rs
candle-binding/src/ffi/memory_safety_test.rs
candle-binding/src/ffi/mod.rs
candle-binding/src/ffi/oncelock_concurrent_test.rs
candle-binding/src/ffi/similarity.rs
candle-binding/src/ffi/state_manager.rs
candle-binding/src/ffi/state_manager_test.rs
candle-binding/src/ffi/tokenization.rs
candle-binding/src/ffi/types.rs
candle-binding/src/ffi/validation.rs
candle-binding/src/ffi/validation_test.rs
candle-binding/src/model_architectures/config.rs
candle-binding/src/model_architectures/embedding/dense_layers.rs
candle-binding/src/model_architectures/embedding/dense_layers_test.rs
candle-binding/src/model_architectures/embedding/gemma3_model.rs
candle-binding/src/model_architectures/embedding/gemma3_model_test.rs
candle-binding/src/model_architectures/embedding/gemma_embedding.rs
candle-binding/src/model_architectures/embedding/gemma_embedding_test.rs
candle-binding/src/model_architectures/embedding/mod.rs
candle-binding/src/model_architectures/embedding/pooling.rs
candle-binding/src/model_architectures/embedding/pooling_test.rs
candle-binding/src/model_architectures/embedding/qwen3_embedding.rs
candle-binding/src/model_architectures/embedding/qwen3_embedding_test.rs
candle-binding/src/model_architectures/lora/bert_lora.rs
candle-binding/src/model_architectures/lora/bert_lora_test.rs
candle-binding/src/model_architectures/lora/lora_adapter.rs
candle-binding/src/model_architectures/lora/lora_adapter_test.rs
candle-binding/src/model_architectures/lora/mod.rs
candle-binding/src/model_architectures/mod.rs
candle-binding/src/model_architectures/model_factory.rs
candle-binding/src/model_architectures/model_factory_test.rs
candle-binding/src/model_architectures/routing.rs
candle-binding/src/model_architectures/routing_test.rs
candle-binding/src/model_architectures/traditional/base_model.rs
candle-binding/src/model_architectures/traditional/base_model_test.rs
candle-binding/src/model_architectures/traditional/bert.rs
candle-binding/src/model_architectures/traditional/bert_test.rs
candle-binding/src/model_architectures/traditional/mod.rs
candle-binding/src/model_architectures/traditional/modernbert.rs
candle-binding/src/model_architectures/traditional/modernbert_test.rs
candle-binding/src/model_architectures/traits.rs
candle-binding/src/model_architectures/unified_interface.rs
candle-binding/src/model_architectures/unified_interface_test.rs
candle-binding/src/test_fixtures.rs
candle-binding/src/utils/memory.rs
candle-binding/src/utils/mod.rs
candle-binding/test_data/gemma_reference_outputs.json
candle-binding/test_data/qwen3_reference_outputs.json
candle-binding/Cargo.lock
candle-binding/Cargo.toml
candle-binding/semantic-router.go
candle-binding/semantic-router_test.go
candle-binding/src/lib.rs

📁 `Root Directory`

Owners: @rootfs, @Xunzhuo
Files changed:

scripts/generate_gemma_reference.py
scripts/generate_qwen3_reference.py
.github/workflows/test-and-build.yml

📁 `config`

Owners: @rootfs
Files changed:

config/config.yaml

📁 `deploy`

Owners: @rootfs, @Xunzhuo
Files changed:

deploy/kubernetes/config.yaml
deploy/openshift/config-openshift.yaml

📁 `src`

Owners: @rootfs, @Xunzhuo, @wangchen615
Files changed:

src/semantic-router/cmd/main.go
src/semantic-router/pkg/api/server.go
src/semantic-router/pkg/apis/vllm.ai/v1alpha1/filter_types.go
src/semantic-router/pkg/cache/cache_factory.go
src/semantic-router/pkg/cache/cache_interface.go
src/semantic-router/pkg/cache/cache_test.go
src/semantic-router/pkg/cache/inmemory_cache.go
src/semantic-router/pkg/config/config.go
src/semantic-router/pkg/config/config_test.go
src/semantic-router/pkg/extproc/caching_test.go
src/semantic-router/pkg/extproc/router.go
src/semantic-router/pkg/extproc/test_utils_test.go

📁 `tools`

Owners: @yuluo-yx, @rootfs, @Xunzhuo
Files changed:

tools/make/build-run-test.mk
tools/make/common.mk
tools/make/golang.mk
tools/make/models.mk
tools/make/rust.mk

rootfs · 2025-09-28T12:41:54Z

@OneZero-Y @Xunzhuo Let's have the following resolved before merging

Add more candle unit tests
Verify API accuracy
Ensure semantic-router uses the right binding API
Remove legacy comments and code

rootfs · 2025-09-30T12:42:49Z

@OneZero-Y now that we are working on the feature branch, how about you use this branch for both the refactoring and the new embedding models?

OneZero-Y · 2025-10-09T14:36:21Z

@rootfs OK, I'll move the embedding model work forward on this branch

rootfs · 2025-10-09T17:16:11Z

@OneZero-Y that's great! I'll switch to this work as soon as I can.

ivarflakstad · 2025-10-11T20:10:50Z

candle-binding/src/classifiers/lora/parallel_engine.rs

+        let handles = vec![
+            self.spawn_intent_task(texts_owned.clone(), Arc::clone(&intent_results)),
+            self.spawn_pii_task(texts_owned.clone(), Arc::clone(&pii_results)),
+            self.spawn_security_task(texts_owned, Arc::clone(&security_results)),
+        ];
+
+        // Wait for all threads to complete
+        for handle in handles {
+            handle.join().map_err(|_| {
+                let unified_err = concurrency_error(
+                    "thread join",
+                    "Failed to join parallel classification thread",
+                );
+                candle_core::Error::from(unified_err)
+            })?;
+        }


This could be simplified a bit. Something like

    let intent_handle = thread::spawn(|| intent_task(texts)); // slice is fine, no need to own the data.
    let pii_handle = ... same
    let security_handle = ... same
    let intent_results = intent_handle.join()?; // map_err omitted
    let pii_results = pii_handle.join()?;
    let security_results = security_handle.join()?;

Since we're on the topic of threads - you may like some of the abstractions that the rayon crate provides
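
One way to make the "slice is fine" idea above compile with only the standard library is `std::thread::scope`, since plain `thread::spawn` requires `'static` data and therefore cannot borrow a local slice. The sketch below is an illustration under that assumption; the three task functions and their return types are hypothetical stand-ins, not the PR's classifiers.

```rust
use std::any::Any;
use std::thread;

// Hypothetical stand-ins for the three classification tasks.
fn intent_task(texts: &[&str]) -> Vec<usize> {
    texts.iter().map(|t| t.len()).collect()
}

fn pii_task(texts: &[&str]) -> Vec<bool> {
    texts.iter().map(|t| t.contains('@')).collect()
}

fn security_task(texts: &[&str]) -> Vec<bool> {
    texts
        .iter()
        .map(|t| t.to_lowercase().contains("ignore previous"))
        .collect()
}

// Converts a panic payload from a joined thread into an error message.
fn join_err(_: Box<dyn Any + Send>) -> String {
    "Failed to join parallel classification thread".to_string()
}

/// Runs the three tasks concurrently on a borrowed slice. Scoped threads are
/// guaranteed to finish before `scope` returns, so no cloning or `Arc` is needed.
fn classify_parallel(
    texts: &[&str],
) -> Result<(Vec<usize>, Vec<bool>, Vec<bool>), String> {
    thread::scope(|s| {
        let intent = s.spawn(|| intent_task(texts));
        let pii = s.spawn(|| pii_task(texts));
        let security = s.spawn(|| security_task(texts));
        // Each task returns its result directly through `join`.
        Ok((
            intent.join().map_err(join_err)?,
            pii.join().map_err(join_err)?,
            security.join().map_err(join_err)?,
        ))
    })
}
```

Returning results through `join` also removes the need for shared result containers such as the `Arc`-wrapped vectors in the diff.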

thank you @ivarflakstad

ivarflakstad · 2025-10-13T17:22:54Z

candle-binding/src/classifiers/lora/pii_lora.rs

+    pub fn parallel_detect(&self, texts: &[&str]) -> Result<Vec<PIIResult>> {
+        let mut results = Vec::new();
+        for text in texts {
+            results.push(self.detect_pii(text)?);
+        }
+        Ok(results)
+    }


If you want this to be in parallel you could do something like

    // add `use rayon::prelude::*;` at top of file
    Ok(texts.par_iter().map(|text| self.detect_pii(text)?).collect())

Though I'm starting to suspect that what you actually want, for the long term, is an async runtime.
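
A detail worth noting about the one-liner above: `?` inside a closure returns from the closure, not from the enclosing function, so the usual pattern is to let the closure return the `Result` and have `collect` gather everything into a single `Result<Vec<_>, _>`. The sketch below shows that pattern with a standard-library iterator so it runs without extra crates. `PiiResult` and `detect_pii` here are simplified, hypothetical stand-ins.

```rust
#[derive(Debug, PartialEq)]
struct PiiResult {
    has_pii: bool,
}

// Hypothetical detector: fails on empty input, otherwise looks for '@'.
fn detect_pii(text: &str) -> Result<PiiResult, String> {
    if text.is_empty() {
        return Err("empty input".to_string());
    }
    Ok(PiiResult { has_pii: text.contains('@') })
}

/// Runs the detector over every text and stops at the first error.
fn detect_all(texts: &[&str]) -> Result<Vec<PiiResult>, String> {
    // The closure returns a Result per item; collecting into
    // `Result<Vec<_>, _>` yields Ok(all items) or the first Err.
    // With rayon in scope, `par_iter()` could replace `iter()` here,
    // since rayon supports the same collect-into-Result pattern.
    texts.iter().map(|text| detect_pii(text)).collect()
}
```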

@ivarflakstad thanks for looking into this. On a separate note, for async to run most efficiently, would you help check whether the locking is done the right way?

Sure :)
Are you thinking about any specific locks in particular? (pr is fairly large 😉 )

thank you @ivarflakstad

classify_text is currently protected by a lock, which could cost us performance. Would you share your ideas? Thanks

Sorry for the delay.
I'd definitely look into using OnceCell instead of lazy_static.

But it depends. Will you actually be updating these static values at runtime? More than once?

If yes, then at the very least you want to use RwLock instead of Mutex because I doubt you're planning to write to the value as much as you read it.

If OnceCell doesn't cut it, perhaps you'll want to give OnceLock, LazyCell, or LazyLock a try :)
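
To make the trade-off above concrete, here is a minimal sketch using only the standard library: `OnceLock` for state that is written once and then read without locking, and `RwLock` for a value that is read often and updated rarely. The `Classifier` struct, the `THRESHOLD` value, and the substring matching are hypothetical, chosen only to keep the example small.

```rust
use std::sync::{OnceLock, RwLock};

// Hypothetical classifier state that is initialized once at startup.
struct Classifier {
    labels: Vec<String>,
}

// Write-once global: after `set` succeeds, reads take no lock.
static CLASSIFIER: OnceLock<Classifier> = OnceLock::new();

/// Returns true if this call performed the initialization,
/// false if the classifier was already set.
fn init_classifier(labels: Vec<String>) -> bool {
    CLASSIFIER.set(Classifier { labels }).is_ok()
}

/// Returns the first known label that appears in `text`, if any.
fn classify_text(text: &str) -> Option<&'static str> {
    let classifier = CLASSIFIER.get()?; // None until initialized
    classifier
        .labels
        .iter()
        .map(String::as_str)
        .find(|label| text.contains(*label))
}

// Read-mostly global: many concurrent readers, occasional writer.
static THRESHOLD: RwLock<f32> = RwLock::new(0.5);

fn threshold() -> f32 {
    *THRESHOLD.read().unwrap()
}

fn set_threshold(value: f32) {
    *THRESHOLD.write().unwrap() = value;
}
```

If the value truly never changes after startup, the `OnceLock` path avoids lock contention on reads entirely; `RwLock` only earns its place when runtime updates are a real requirement.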

thank you @ivarflakstad

can you review #528?

OneZero-Y · 2025-10-21T12:42:49Z

@rootfs
Flash Attention 2 Testing Issues
The self.scale → self.scaling fix checks out in compilation (CUDA 12.9; requires upgrading candle from 0.8.4 to 0.9.1) and in code analysis.

Flash Attention 2 enabled (feature flag active)
   Status: Flash Attention 2 fully integrated (2-3x faster for long sequences)
   Performance: Optimized for 8K-32K token sequences
✅ Qwen3EmbeddingModel loaded successfully:
   - Model: ../models/Qwen3-Embedding-0.6B
   - Layers: 28
   - Hidden size: 1024
   - Attention heads: 16
   - KV heads (GQA): 8
   - Max seq length: 32768
   - RoPE theta: 1000000
   - Padding side: Left (CRITICAL: must be Left)
 ✅ Qwen3-Embedding-0.6B loaded successfully in 22.65s

However, runtime testing is blocked by incompatible local GPU hardware (GT 730, compute capability 3.5, below the 8.0 required for Flash Attention 2). Could you help verify with a compatible GPU if available?

🔍 Attempting to detect CUDA device...
✅ Device::new_cuda(0) succeeded: Cuda(CudaDevice(DeviceId(1)))
✅ Using CUDA GPU for testing
❌ Error: DriverError(CUDA_ERROR_UNSUPPORTED_PTX_VERSION, "the provided PTX was compiled with an unsupported toolchain.")
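
The load log above flags left padding as critical without saying why. One plausible reason, stated here as an assumption rather than something the log confirms, is last-token pooling: with left padding, the final position of every row is always a real token. The sketch below shows left padding of a batch of token ids; `pad_left` and the pad id are hypothetical.

```rust
/// Left-pads each token-id sequence to the batch maximum and returns the
/// padded ids together with attention masks (1 = real token, 0 = padding).
fn pad_left(batch: &[Vec<u32>], pad_id: u32) -> (Vec<Vec<u32>>, Vec<Vec<u8>>) {
    let max_len = batch.iter().map(Vec::len).max().unwrap_or(0);
    let mut ids = Vec::with_capacity(batch.len());
    let mut masks = Vec::with_capacity(batch.len());
    for seq in batch {
        let pad = max_len - seq.len();
        // Padding goes first, so real tokens end at the last position.
        let mut row = vec![pad_id; pad];
        row.extend_from_slice(seq);
        let mut mask = vec![0u8; pad];
        mask.extend(std::iter::repeat(1u8).take(seq.len()));
        ids.push(row);
        masks.push(mask);
    }
    (ids, masks)
}
```

With right padding, the last position of a short sequence would hold a pad token, so a pooling step that reads the final position would need a per-row index instead.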

rootfs · 2025-10-21T12:54:17Z

@OneZero-Y sure, I am using an L4 and the unit tests passed on my end. Let me run them again and post the test log.

rootfs · 2025-10-21T13:09:39Z

@OneZero-Y here are my local test results using PR #489
#489 (comment)

Signed-off-by: Huamin Chen <hchen@redhat.com>

Signed-off-by: carlory <baofa.fan@daocloud.io>

Signed-off-by: Huamin Chen <hchen@redhat.com>

Signed-off-by: OneZero-Y <aukovyps@163.com>

…ntention (#516)
- Remove duplicate UNIFIED_CLASSIFIER global state
- Optimize PARALLEL_LORA_ENGINE lock contention by using Arc clone
Signed-off-by: OneZero-Y <aukovyps@163.com>
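
The "Arc clone" optimization mentioned in this commit message is commonly done as follows: hold the lock only long enough to clone the `Arc`, then run the expensive work on the clone with the lock released. This is a sketch of that general pattern, not the PR's code; `Engine`, `ENGINE`, and the function names are hypothetical.

```rust
use std::sync::{Arc, Mutex};

// Hypothetical engine whose `classify` call is assumed to be slow.
struct Engine {
    name: String,
}

impl Engine {
    fn classify(&self, text: &str) -> String {
        format!("{}:{}", self.name, text.len())
    }
}

// Global slot guarded by a mutex; the engine itself lives behind an Arc.
static ENGINE: Mutex<Option<Arc<Engine>>> = Mutex::new(None);

fn install_engine(engine: Engine) {
    *ENGINE.lock().unwrap() = Some(Arc::new(engine));
}

fn classify(text: &str) -> Option<String> {
    // The guard is a temporary that is dropped at the end of this statement,
    // so the lock is held only for the cheap Arc clone.
    let maybe_engine = ENGINE.lock().unwrap().clone();
    let engine = maybe_engine?;
    // The lock is already released while the (possibly slow) call runs.
    Some(engine.classify(text))
}
```

Callers then contend only for the brief clone, not for the duration of inference.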

* Update test description from Math to General (#483)
  Signed-off-by: carlory <baofa.fan@daocloud.io>
* feat: add HuggingChat support (#477)
  * add chat ui to dashboard and docker compose & refactor dashboard/backend/
    Signed-off-by: JaredforReal <w13431838023@gmail.com>
  * try fix network error
    Signed-off-by: JaredforReal <w13431838023@gmail.com>
  * more
  ---------
  Signed-off-by: JaredforReal <w13431838023@gmail.com>
  Co-authored-by: bitliu <bitliu@tencent.com>
* project: 2025 Q4 roadmap (#487)
  * project: q4 roadmap
  * project: q4 roadmap
  * project: q4 roadmap
  * more
  * more
  * more
  * more
* feat: add shelleck precommit hook (#488)
  * feat: add shelleck precommit hook
    Signed-off-by: yuluo-yx <yuluo08290126@gmail.com>
  * feat: add shelleck precommit hook
    Signed-off-by: yuluo-yx <yuluo08290126@gmail.com>
  * feat: add shelleck precommit hook
    Signed-off-by: yuluo-yx <yuluo08290126@gmail.com>
  ---------
  Signed-off-by: yuluo-yx <yuluo08290126@gmail.com>
* project: add q4 roadmap news (#495)
* fix missing shellcheck in pre-commit image (#497)
  Signed-off-by: carlory <baofa.fan@daocloud.io>
* infra: update tools (#501)
  Signed-off-by: yuluo-yx <yuluo08290126@gmail.com>
* feat(demo): enhance OpenShift demo scripts with improved UX (#478)
  - Reduce model selection test to 4 categories (2×Model-A, 2×Model-B)
  - Add new "Classification Examples" option calling curl-examples.sh
  - Update reasoning examples to avoid cache hits from previous tests
  - Remove benign examples from PII and Jailbreak tests (show only attacks)
  - Enhance live-semantic-router-logs.sh with better color visibility:
  - Fix duplicate "WITH SCORE" text in classification output
  - Fix CACHE HIT background color extending over timestamp
  - Distinguish reasoning enabled vs disabled messages
  - Remove redundant "(standard routing)" text
  - Add background colors for Model-A/Model-B routing display
  These improvements make the live demo clearer and more impactful for presentations and demonstrations.
  🤖 Generated with [Claude Code](https://claude.com/claude-code)
  Signed-off-by: Yossi Ovadia <yovadia@redhat.com>
  Co-authored-by: Claude <noreply@anthropic.com>
* fix: fix precommit Argument list too long error (#502)
  Signed-off-by: yuluo-yx <yuluo08290126@gmail.com>
* feat: enforce milvus dial timeout if set (#503)
  Signed-off-by: cryo <zdtna412@gmail.com>
* Add IETF draft publication: Multi-Provider Extensions for Agentic AI Inference APIs (#506)
  * Initial plan
  * Add new IETF draft publication for Multi-Provider Extensions for Agentic AI Inference APIs
    Co-authored-by: rootfs <7062400+rootfs@users.noreply.github.com>
  ---------
  Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
  Co-authored-by: rootfs <7062400+rootfs@users.noreply.github.com>
* Allow semantic cache similarity threshold to be set at the category level (#493)
  * Initial plan
  * Add category-level cache settings: enabled and similarity_threshold
    Co-authored-by: rootfs <7062400+rootfs@users.noreply.github.com>
  * Add comprehensive tests for category-level cache settings
    Co-authored-by: rootfs <7062400+rootfs@users.noreply.github.com>
  * Update config files and documentation for category-level cache settings
    - Updated 7 config YAML files (development, production, testing, e2e, and 3 recipes) with commented examples of category-level cache settings
    - Added comprehensive documentation section explaining category-level cache configuration
    - Updated semantic cache overview and in-memory cache docs with category-level examples
    - Added best practices for threshold selection and privacy considerations
    Co-authored-by: rootfs <7062400+rootfs@users.noreply.github.com>
  * Remove duplicate code in FindSimilar functions
    Refactored FindSimilar() to delegate to FindSimilarWithThreshold() with default threshold instead of duplicating the entire implementation. This eliminates 226 lines of duplicate code across inmemory_cache.go and milvus_cache.go.
    Co-authored-by: rootfs <7062400+rootfs@users.noreply.github.com>
  * Update src/semantic-router/pkg/extproc/request_handler.go
    Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
  * Revert changes from unsigned commit ae39fe2
    Restored the classificationText empty check that was removed in the previous commit.
    Co-authored-by: rootfs <7062400+rootfs@users.noreply.github.com>
  ---------
  Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
  Co-authored-by: rootfs <7062400+rootfs@users.noreply.github.com>
  Co-authored-by: Huamin Chen <rootfs@users.noreply.github.com>
  Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* Allow jailbreak detection and threshold to be configured at the category level (#508)
  * Initial plan
  * Add category-level jailbreak detection configuration
    Co-authored-by: Xunzhuo <48784001+Xunzhuo@users.noreply.github.com>
  * Add documentation for category-level jailbreak settings
    Co-authored-by: Xunzhuo <48784001+Xunzhuo@users.noreply.github.com>
  * Update documentation for category-level jailbreak detection
    - Add category-level jailbreak configuration to jailbreak-protection.md
    - Update category configuration docs with jailbreak_enabled parameter
    - Add security-focused configuration example
    - Update global configuration docs with category override notes
    - Update README to mention fine-grained security control
    Co-authored-by: Xunzhuo <48784001+Xunzhuo@users.noreply.github.com>
  * Add category-level jailbreak threshold configuration
    - Add JailbreakThreshold field to Category struct
    - Add GetJailbreakThresholdForCategory helper method
    - Create CheckForJailbreakWithThreshold and AnalyzeContentForJailbreakWithThreshold methods
    - Update performSecurityChecks to use category-specific threshold
    - Add 5 comprehensive tests for threshold configuration
    - Update example configs with threshold tuning examples
    - Update documentation with threshold configuration and tuning guidelines
    - Add threshold tuning guide with recommendations for different category types
    Co-authored-by: Xunzhuo <48784001+Xunzhuo@users.noreply.github.com>
  ---------
  Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
  Co-authored-by: Xunzhuo <48784001+Xunzhuo@users.noreply.github.com>
* Allow PII detection threshold to be set at the category level (#510)
  * Initial plan
  * Add category-level PII threshold support
    Co-authored-by: Xunzhuo <48784001+Xunzhuo@users.noreply.github.com>
  * Update documentation with API integration notes
    Co-authored-by: Xunzhuo <48784001+Xunzhuo@users.noreply.github.com>
  * Fix markdown linting issues
    Co-authored-by: Xunzhuo <48784001+Xunzhuo@users.noreply.github.com>
  ---------
  Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
  Co-authored-by: Xunzhuo <48784001+Xunzhuo@users.noreply.github.com>
* Fix: The caller information points to the wrapper function instead of the actual call location (#518)
  Signed-off-by: carlory <baofa.fan@daocloud.io>
* feat: Implement hybrid cache that use in-memory index and milvus based doc store (#504)
  * feat: add HNSW index to inmemory semantic cache and implement hybrid cache that use in-memory index and milvus based doc store
    Signed-off-by: Huamin Chen <hchen@redhat.com>
  * chore: run go mod tidy to clean up module dependencies
    Signed-off-by: Huamin Chen <hchen@redhat.com>
  * conditionally build candle cuda support
    Signed-off-by: Huamin Chen <hchen@redhat.com>
  * rebuild index upon restart
    Signed-off-by: Huamin Chen <hchen@redhat.com>
  * precommit fix
    Signed-off-by: Huamin Chen <hchen@redhat.com>
  * fix precommit
    Signed-off-by: Huamin Chen <hchen@redhat.com>
  * fix precommit
    Signed-off-by: Huamin Chen <hchen@redhat.com>
  * fix precommit
    Signed-off-by: Huamin Chen <hchen@redhat.com>
  * disable cuda build on ci
    Signed-off-by: Huamin Chen <hchen@redhat.com>
  * review feedback
    Signed-off-by: Huamin Chen <hchen@redhat.com>
  * review feedback
    Signed-off-by: Huamin Chen <hchen@redhat.com>
  * review feedback
    Signed-off-by: Huamin Chen <hchen@redhat.com>
  * review feedback
    Signed-off-by: Huamin Chen <hchen@redhat.com>
  ---------
  Signed-off-by: Huamin Chen <hchen@redhat.com>
---------
Signed-off-by: carlory <baofa.fan@daocloud.io>
Signed-off-by: JaredforReal <w13431838023@gmail.com>
Signed-off-by: yuluo-yx <yuluo08290126@gmail.com>
Signed-off-by: Yossi Ovadia <yovadia@redhat.com>
Signed-off-by: cryo <zdtna412@gmail.com>
Signed-off-by: Huamin Chen <hchen@redhat.com>
Co-authored-by: 杨朱 · Kiki <baofa.fan@daocloud.io>
Co-authored-by: Jared <w13431838023@gmail.com>
Co-authored-by: bitliu <bitliu@tencent.com>
Co-authored-by: shown <yuluo08290126@gmail.com>
Co-authored-by: Yossi Ovadia <yovadia@redhat.com>
Co-authored-by: Claude <noreply@anthropic.com>
Co-authored-by: cryo <zdtna412@gmail.com>
Co-authored-by: Copilot <198982749+Copilot@users.noreply.github.com>
Co-authored-by: rootfs <7062400+rootfs@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Xunzhuo <48784001+Xunzhuo@users.noreply.github.com>

* merge main to feat branch Signed-off-by: Huamin Chen <hchen@redhat.com>

* chore: fix unit test
  Signed-off-by: Huamin Chen <hchen@redhat.com>
* fix go vet
  Signed-off-by: Huamin Chen <hchen@redhat.com>
* fix ci
  Signed-off-by: Huamin Chen <hchen@redhat.com>
* fix ci
  Signed-off-by: Huamin Chen <hchen@redhat.com>
* split test-binding to two stages on ci
  Signed-off-by: Huamin Chen <hchen@redhat.com>
* ignore test failure due to embeddinggemma restriction
  Signed-off-by: Huamin Chen <hchen@redhat.com>
* reorder ci test sequences to avoid missing models
  Signed-off-by: Huamin Chen <hchen@redhat.com>
---------
Signed-off-by: Huamin Chen <hchen@redhat.com>

…reads based on review vllm-project#266 (comment) Signed-off-by: Huamin Chen <hchen@redhat.com>

Xunzhuo · 2025-10-24T13:20:18Z

🚀🔥

rootfs · 2025-10-24T13:33:03Z

@OneZero-Y are you on the semantic router Slack, or can you reach me by email at hchen@redhat.com? Let's co-author a blog post on this progress. Thanks.

…reads based on review (#528)
* refactor: Replace lazy_static with OnceLock for zero-cost concurrent reads based on review #266 (comment)
  Signed-off-by: Huamin Chen <hchen@redhat.com>
* update tests
  Signed-off-by: Huamin Chen <hchen@redhat.com>
---------
Signed-off-by: Huamin Chen <hchen@redhat.com>

Signed-off-by: Huamin Chen <hchen@redhat.com>

* chore: fix lint error
  Signed-off-by: Huamin Chen <hchen@redhat.com>
* chore: fix lint error
  Signed-off-by: Huamin Chen <hchen@redhat.com>
---------
Signed-off-by: Huamin Chen <hchen@redhat.com>

rootfs · 2025-10-24T14:58:08Z

Thank you all for the great work on this significant milestone. I am merging this to the main branch to trigger more CI. I will follow up in the next few days on issues and enhancements.

OneZero-Y · 2025-10-25T02:09:04Z

@rootfs I left you a message on Slack

rootfs · 2025-10-25T14:47:03Z

@OneZero-Y @ivarflakstad the blog post PR is here vllm-project/vllm-project.github.io#104

rootfs requested a review from Xunzhuo as a code owner September 28, 2025 12:40

github-actions bot assigned rootfs Sep 28, 2025

rootfs added this to the v0.1 milestone Sep 28, 2025

OneZero-Y mentioned this pull request Sep 30, 2025

feat:unit tests for candle refactoring #296

Merged

github-actions bot assigned Xunzhuo Sep 30, 2025

rootfs mentioned this pull request Oct 5, 2025

Feat/mock #348

Closed

ivarflakstad reviewed Oct 11, 2025

ivarflakstad reviewed Oct 13, 2025

OneZero-Y mentioned this pull request Oct 16, 2025

feat:support for two long-context embedding models (Qwen3 and Gemma) #453

Merged

rootfs requested a review from wangchen615 as a code owner October 16, 2025 12:30

github-actions bot assigned wangchen615 Oct 16, 2025

This was referenced Oct 17, 2025

fix:Implement Comprehensive Rayon Parallelization for LoRA Classifiers #464

Merged

fix:Improve rust unit test and optimize concurrent tests with rayon #471

Merged

rootfs force-pushed the feat-candle-refactoring branch 2 times, most recently from a2fe984 to 0ee291f Compare October 19, 2025 22:44

rootfs mentioned this pull request Oct 21, 2025

[Prompt Classification] Implement In-Tree Embedding Similarity Matching #366

Closed

carlory mentioned this pull request Oct 22, 2025

Bug: TestUnifiedClassifier_Initialize always fails on macOS #481

Closed

This was referenced Oct 22, 2025

make CUDA and Flash Attention 2 optional features #511

Merged

fix: duplicate UNIFIED_CLASSIFIER definition and optimize lock contention #516

Merged

rootfs requested a review from yuezhu1 as a code owner October 23, 2025 16:29

github-actions bot assigned JaredforReal Oct 23, 2025

rootfs and others added 9 commits October 23, 2025 12:52

fix: resolve syntax errors after rebase

fb7b6f9

Signed-off-by: Huamin Chen <hchen@redhat.com>

add additional update

dbb1e64

Signed-off-by: Huamin Chen <hchen@redhat.com>

Change label count params to c_int (#494)

2bee957

Signed-off-by: carlory <baofa.fan@daocloud.io>

update embedding setting in config (#489)

c83d109

Signed-off-by: Huamin Chen <hchen@redhat.com>

make CUDA and Flash Attention 2 optional features (#511)

a81c29c

Signed-off-by: OneZero-Y <aukovyps@163.com>

fix: Fix duplicate UNIFIED_CLASSIFIER definition and optimize lock co…

2b72f27

…ntention (#516)
- Remove duplicate UNIFIED_CLASSIFIER global state
- Optimize PARALLEL_LORA_ENGINE lock contention by using Arc clone
Signed-off-by: OneZero-Y <aukovyps@163.com>
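
For readers unfamiliar with the "Arc clone" pattern named in this commit message, here is a minimal sketch of the general idea, not the code from #516. The `Engine` type, the `ENGINE` static, and the function names are all hypothetical stand-ins. The read lock is held only long enough to clone the `Arc`, so the potentially slow work runs without holding the lock.

```rust
use std::sync::{Arc, RwLock};

// Hypothetical stand-in for a shared inference engine; not a type from the PR.
struct Engine {
    name: String,
}

impl Engine {
    // Placeholder for expensive work; here it just returns the input length.
    fn classify(&self, text: &str) -> usize {
        text.len()
    }
}

// Hypothetical global slot. The Arc lets callers keep the engine alive
// after the lock guard has been released.
static ENGINE: RwLock<Option<Arc<Engine>>> = RwLock::new(None);

fn set_engine(engine: Engine) {
    *ENGINE.write().unwrap() = Some(Arc::new(engine));
}

// Take the read lock, clone the Arc (a reference-count bump), and release
// the lock when the guard goes out of scope at the end of this function.
fn current_engine() -> Option<Arc<Engine>> {
    let guard = ENGINE.read().unwrap();
    (*guard).clone()
}

fn classify(text: &str) -> Option<usize> {
    // The lock is already released here, so concurrent writers are not
    // blocked while classification runs.
    let engine = current_engine()?;
    Some(engine.classify(text))
}
```

The trade-off in this sketch is that a caller may keep using an engine that has since been replaced, which is usually acceptable for read-mostly model state.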

rootfs force-pushed the feat-candle-refactoring branch from 7d84e64 to 3230c35 Compare October 23, 2025 16:57

rootfs added a commit to rootfs/semantic-router.bak that referenced this pull request Oct 23, 2025

refactor: Replace lazy_static with OnceLock for zero-cost concurrent …

fe8d577

…reads based on review vllm-project#266 (comment)

rootfs mentioned this pull request Oct 23, 2025

refactor: Replace lazy_static with OnceLock for zero-cost concurrent reads based on review #528

Merged
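
As a rough illustration of the change the title of #528 describes, the sketch below shows a global held in `std::sync::OnceLock`. It is not the PR's code: the `Classifier` type, the `CLASSIFIER` static, and both functions are hypothetical. The point is that the value is set once and later reads return a plain shared reference without taking a mutex.

```rust
use std::sync::OnceLock;

// Hypothetical stand-in for a loaded model; not a type from the PR.
struct Classifier {
    labels: Vec<String>,
}

// Hypothetical global. OnceLock::new is const, so no macro is needed.
static CLASSIFIER: OnceLock<Classifier> = OnceLock::new();

// Returns true if this call performed the initialization, false if the
// global had already been set (the new value is then discarded).
fn init_classifier(labels: Vec<String>) -> bool {
    CLASSIFIER.set(Classifier { labels }).is_ok()
}

// Reads after initialization yield &Classifier without acquiring a lock;
// before initialization this returns None.
fn label_count() -> Option<usize> {
    CLASSIFIER.get().map(|c| c.labels.len())
}
```

In this sketch a second call to `init_classifier` is rejected rather than replacing the value, so a design that needs to swap models at runtime would need a different container.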

rootfs added a commit to rootfs/semantic-router.bak that referenced this pull request Oct 23, 2025

refactor: Replace lazy_static with OnceLock for zero-cost concurrent …

3440283

…reads based on review vllm-project#266 (comment) Signed-off-by: Huamin Chen <hchen@redhat.com>

rootfs added a commit to rootfs/semantic-router.bak that referenced this pull request Oct 24, 2025

refactor: Replace lazy_static with OnceLock for zero-cost concurrent …

9b36254

…reads based on review vllm-project#266 (comment) Signed-off-by: Huamin Chen <hchen@redhat.com>

Xunzhuo previously approved these changes Oct 24, 2025

Xunzhuo changed the title ~~[WIP] refactor: Implement modular candle-binding architecture (#254)~~ refactor: Implement modular candle-binding architecture (#254) Oct 24, 2025

rootfs dismissed Xunzhuo’s stale review via 88c9d75 October 24, 2025 13:40

rootfs added 3 commits October 24, 2025 08:41

Merge branch 'main' into feat-candle-refactoring

8605045

chore: fix lint error (#530)

a2a20a7

Signed-off-by: Huamin Chen <hchen@redhat.com>

Fix lint error2 (#531)

9adab4e

* chore: fix lint error
Signed-off-by: Huamin Chen <hchen@redhat.com>
* chore: fix lint error
Signed-off-by: Huamin Chen <hchen@redhat.com>
---------
Signed-off-by: Huamin Chen <hchen@redhat.com>

rootfs merged commit 7d55463 into main Oct 24, 2025
20 of 21 checks passed

rootfs mentioned this pull request Oct 24, 2025

chore: upgrade rust version to 1.90 in all related Dockerfiles #499

Merged

Xunzhuo deleted the feat-candle-refactoring branch October 26, 2025 05:02


9 participants
