Semantic Intelligence Layer for NVIDIA Dynamo
1. Executive Summary
This proposal outlines a comprehensive integration strategy between vLLM Semantic Router and NVIDIA Dynamo, combining semantic intelligence with high-performance distributed inference. The integration creates a unified inference stack that leverages:
- Semantic Router's intelligent request classification (14 domain categories), domain-aware system prompts, fusion routing (BERT classification + keyword matching + similarity search), security filtering, and Milvus-based semantic caching
- Dynamo's disaggregated serving, KV-aware routing, and multi-tier memory management
The result is a production-grade LLM serving platform that balances accuracy (routing each request to the right model with optimized prompts) and efficiency (maximizing GPU utilization and minimizing latency) at every layer of the stack.
Key Benefits:
- System-level intelligence that optimally balances accuracy and efficiency across the entire inference stack
- Significant cost reduction through intelligent model selection combined with infrastructure optimization
- Substantial latency improvement via semantic caching + KV cache management with adaptive routing strategies
- Enhanced LLM quality with domain-aware system prompts that improve Chain-of-Thought reasoning, token efficiency, and MoE expert matching
- Adaptive routing intelligence with fusion routing: fast path (keyword) to deep analysis (BERT) based on query complexity, maximizing efficiency without sacrificing accuracy
- Multi-signal decision making combining BERT classification, keyword matching, and similarity search for robust and accurate routing
- Holistic content safety with PII detection and jailbreak prevention before inference
- End-to-end observability across semantic and infrastructure layers for continuous system optimization
2. Motivation: Why Semantic Router for Dynamo?
2.1 Dynamo Router Capabilities (Current State)
NVIDIA Dynamo provides a sophisticated KV-aware router optimized for infrastructure-level efficiency:
Capability | Description | Optimization Target |
---|---|---|
KV Cache-Aware Routing | Routes requests to workers with highest KV cache hit rate | TTFT, throughput |
Load-Based Routing | Balances active decoding blocks across workers | ITL, GPU utilization |
Cost Function Optimization | Minimizes potential_prefill_blocks + potential_active_blocks | Computational cost |
Temperature-Based Selection | Probabilistic routing to prevent worker saturation | Load distribution |
Event-Driven Tracking | Real-time cache state via worker events | Routing accuracy |
Key Characteristics:
- Infrastructure-focused: Optimizes GPU memory and compute utilization
- Cache-aware: Leverages existing KV caches to reduce prefill cost
- Load-balanced: Distributes decoding workload across workers
- Performance-oriented: Minimizes TTFT and ITL through smart scheduling
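The cost function in the table above can be made concrete with a small sketch. The structure below is illustrative only; `WorkerState` and `routing_cost` are hypothetical names, not Dynamo's actual API. The numbers mirror the worked example in Section 2.3.

```python
# Illustrative sketch of the KV-aware cost function described above.
# WorkerState and routing_cost are hypothetical, not Dynamo's actual API.
from dataclasses import dataclass

@dataclass
class WorkerState:
    name: str
    cached_blocks: int  # prompt blocks already resident in this worker's KV cache
    active_blocks: int  # blocks held by requests currently decoding

def routing_cost(worker: WorkerState, prompt_blocks: int) -> int:
    """Cost = blocks still to prefill + blocks already being decoded."""
    potential_prefill_blocks = max(prompt_blocks - worker.cached_blocks, 0)
    potential_active_blocks = worker.active_blocks
    return potential_prefill_blocks + potential_active_blocks

workers = [
    WorkerState("worker-1", cached_blocks=15, active_blocks=25),
    WorkerState("worker-2", cached_blocks=3, active_blocks=20),
    WorkerState("worker-3", cached_blocks=0, active_blocks=18),
]
# For a 100-block prompt: worker-1 costs 85 + 25 = 110 and wins.
best = min(workers, key=lambda w: routing_cost(w, prompt_blocks=100))
```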
2.2 Semantic Router Capabilities (System Intelligence Layer)
vLLM Semantic Router provides system-level intelligence that operates at the request understanding layer, achieving optimal balance between accuracy and efficiency through intelligent decision-making across 14 domain categories:
Capability | Description | Intelligence Focus |
---|---|---|
Intent Classification | BERT-based categorization (14 categories: math, code, business, law, etc.) | Accuracy: Precise domain understanding |
Model Selection | Routes to best-performing model per category | Accuracy: Task-specific quality optimization |
Domain-Aware System Prompts | Auto-injects category-specific system prompts for prompt engineering | Accuracy: LLM CoT quality, token efficiency, MoE expert matching |
Fusion Routing | Multi-signal routing (keyword + similarity + BERT) | Efficiency: Adaptive latency based on query complexity |
Semantic Caching | Milvus-based vector cache with 0.85+ similarity threshold | Efficiency: Inference cost reduction |
PII Detection | Token-level classification (PERSON, EMAIL, SSN, etc.) | System Intelligence: Privacy protection |
Jailbreak Prevention | Binary classification for prompt injection attacks | System Intelligence: Security enforcement |
Tool Selection | Semantic matching of relevant tools to reduce prompt tokens | Efficiency: Context optimization |
Reasoning Control | Auto-enables reasoning mode for complex queries | Accuracy: Quality-aware mode selection |
System Intelligence Characteristics:
- Holistic Intelligence: Understands query intent, complexity, and security implications across 14 domain categories
- Accuracy-Efficiency Balance: Dynamically selects routing strategy (keyword/similarity/BERT) based on query complexity to maximize accuracy while minimizing latency
- Quality Optimization: Selects models and prompts based on task-specific accuracy requirements
- Intelligent Prompt Engineering: Auto-injects domain-specific system prompts to optimize LLM behavior and output quality
- Proactive Security: Blocks malicious or privacy-violating requests before reaching inference layer
- Cost Intelligence: Avoids expensive models for simple queries while ensuring quality for complex tasks
- Adaptive Routing: Multi-signal fusion routing adapts to query characteristics for optimal accuracy-efficiency tradeoff
2.2.1 14 Domain Categories with System Prompts
Semantic Router classifies queries into 14 specialized categories: math, computer science, physics, chemistry, biology, engineering, economics, business, law, psychology, philosophy, history, health, and other. Each category has an optimized system prompt automatically injected based on query classification.
System Prompt Benefits:
- Improved Chain-of-Thought (CoT): Domain-specific prompts guide LLMs to use appropriate reasoning patterns
  - Math: "Provide step-by-step solutions, show your work clearly"
  - Law: "Provide accurate legal information while clearly stating disclaimers"
  - Business: "Provide practical, actionable advice backed by proven methodologies"
- Token Efficiency: Optimized prompts reduce unnecessary verbosity while maintaining quality
  - Shorter, focused prompts for straightforward categories (business, history)
  - Detailed prompts for complex domains requiring specific methodologies (math, physics)
- MoE Expert Matching: Well-crafted system prompts improve expert selection in Mixture-of-Experts models
  - Domain-specific terminology activates relevant experts
  - Consistent prompt structure improves expert routing accuracy
  - Example: "You are a mathematics expert" → activates math-specialized experts in DeepSeek-V3
- Quality Control: Category-specific disclaimers and ethical guidelines
  - Medical/Legal: Explicit disclaimers about professional consultation
  - Psychology: Emphasis on evidence-based approaches
  - Health: Clear boundaries between information and medical advice
Example System Prompt (Math Category):
You are a mathematics expert. Provide step-by-step solutions, show your
work clearly, and explain mathematical concepts in an understandable way.
Example System Prompt (Business Category):
You are a senior business consultant and strategic advisor with expertise
in corporate strategy, operations management, financial analysis, marketing,
and organizational development. Provide practical, actionable business advice
backed by proven methodologies and industry best practices. Consider market
dynamics, competitive landscape, and stakeholder interests in your recommendations.
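To make the injection mechanism concrete, here is a hedged sketch of how a category label could drive system prompt injection into an OpenAI-style request. The `SYSTEM_PROMPTS` table and `inject_system_prompt` helper are hypothetical names; the prompt text comes from the examples above.

```python
# Hypothetical sketch: injecting a category-specific system prompt into an
# OpenAI-style request. SYSTEM_PROMPTS holds the per-category prompts above.
SYSTEM_PROMPTS = {
    "math": (
        "You are a mathematics expert. Provide step-by-step solutions, show "
        "your work clearly, and explain mathematical concepts in an "
        "understandable way."
    ),
    # ... one entry for each of the 14 categories
}

def inject_system_prompt(request: dict, category: str) -> dict:
    """Prepend the category's system prompt, if one is configured."""
    prompt = SYSTEM_PROMPTS.get(category)
    if prompt:
        request["messages"] = [{"role": "system", "content": prompt}] + request["messages"]
    return request

request = {"model": "auto", "messages": [{"role": "user", "content": "Prove that sqrt(2) is irrational."}]}
request = inject_system_prompt(request, "math")
```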
2.2.2 Fusion Routing Strategy
Semantic Router implements a multi-signal fusion routing approach that combines three complementary routing methods (as detailed in the Prompt Classification Routing proposal):
1. Keyword-Based Routing (Fast Path)
- Deterministic routing for technology-specific terms (e.g., "kubernetes", "SQL", "React")
- Latency: Minimal (significantly faster than BERT classification)
- Boolean logic support (AND/OR operators)
- Easy to update without model retraining
- Use case: Exact term matching for known patterns
2. Similarity-Based Routing (Semantic Path)
- Embedding similarity for semantic concept detection
- Robust to paraphrasing ("step-by-step" → "explain thoroughly")
- Configurable similarity thresholds (default: 0.75)
- Latency: Low (faster than full BERT classification)
- Use case: Semantic concept matching beyond exact terms
3. BERT Classification (Deep Understanding Path)
- 14-category classification with ModernBERT
- Highest accuracy for complex queries
- Latency: Moderate (comprehensive analysis)
- Use case: Comprehensive intent understanding
Signal Fusion Layer:
- Policy-driven decision making: Combines signals with configurable priority
- Routing logic (sketched after this list):
  1. Check keyword rules first (fastest)
  2. If no keyword match, check similarity rules
  3. If no similarity match, fall back to BERT classification
- Confidence scoring: Each signal provides confidence score
- Override mechanism: High-confidence signals can override lower-priority signals
- Observability: All signals logged for analysis
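The routing logic above reduces to a three-stage cascade. The sketch below is a minimal illustration, assuming a keyword rule table, a similarity index, and a BERT classifier are provided as callables; it is not Semantic Router's actual implementation.

```python
# Minimal illustration of the fusion cascade; the rule table and classifier
# callables below are stand-ins, not Semantic Router's actual code.
KEYWORD_RULES = {"kubernetes": "computer science", "sql": "computer science"}
SIMILARITY_THRESHOLD = 0.75  # default similarity threshold from above

def fusion_route(query, similarity_index, bert_classifier):
    """Return (category, signal, confidence), trying the cheapest signal first."""
    # 1. Keyword rules: deterministic fast path, no model inference.
    lowered = query.lower()
    for term, category in KEYWORD_RULES.items():
        if term in lowered:
            return category, "keyword", 1.0
    # 2. Similarity rules: embedding match against concept exemplars.
    category, score = similarity_index(query)
    if score >= SIMILARITY_THRESHOLD:
        return category, "similarity", score
    # 3. BERT classification: deepest analysis, used as the fallback.
    category, confidence = bert_classifier(query)
    return category, "bert", confidence
```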
System Intelligence Benefits of Fusion Routing:
- Accuracy-Efficiency Balance: Dynamically selects the routing strategy based on query complexity: the fast path (keyword) achieves minimal latency on deterministic patterns, while deep analysis (BERT) ensures maximum accuracy on complex queries
- Adaptive Intelligence: System automatically chooses the most efficient signal that meets accuracy requirements, avoiding unnecessary computation
- Flexibility: Easy to add new routing rules without model retraining, enabling continuous system optimization
- Robustness: Multiple signals provide redundancy and cross-validation, reducing misclassification risk and improving overall system reliability
- Holistic Optimization: Considers both accuracy and efficiency in every routing decision, maximizing system-level intelligence
2.3 Differentiation Analysis: Complementary Strengths
The two systems operate at different layers of the inference stack with minimal overlap:
Semantic Router: Request Intelligence Layer
User Query → [Semantic Understanding] → Model Selection → Request Enrichment
- What: Understands query semantics, intent, and safety
- Why: Routes to the right model for the task
- When: Before request reaches infrastructure
- Optimization: Accuracy, cost, security
Dynamo Router: Infrastructure Efficiency Layer
Enriched Request → [Worker Selection] → KV Cache Optimization → GPU Scheduling
- What: Optimizes worker selection and resource allocation
- Why: Maximizes GPU utilization and minimizes latency
- When: After model selection, during execution
- Optimization: TTFT, ITL, throughput
Integration Value Proposition
Dimension | Semantic Router Alone | Dynamo Router Alone | Integrated System |
---|---|---|---|
Model Selection | ✅ Semantic accuracy (14 categories) | ❌ No model awareness | ✅ Best model for task |
Worker Selection | ❌ No worker awareness | ✅ KV cache optimization | ✅ Optimal worker for model |
Prompt Engineering | ✅ Domain-aware system prompts | ❌ No prompt optimization | ✅ Optimized CoT & MoE matching |
Fusion Routing | ✅ BERT + keyword + similarity fusion | ❌ KV-aware only | ✅ Multi-signal intelligent routing |
Caching | ✅ Semantic similarity (Milvus) | ✅ KV cache reuse | ✅✅ Dual-layer caching |
Security | ✅ PII + jailbreak | ❌ No security layer | ✅ Pre-inference filtering |
Cost Optimization | ✅ Cross-model level | ✅ Infrastructure-level | ✅✅ End-to-end optimization |
Latency | Adaptive (fusion routing) | Low routing overhead | Parallel execution |
Concrete Example:
Query: "Explain the proof of Fermat's Last Theorem step-by-step"
┌─────────────────────────────────────────────────────────────────┐
│ Semantic Router Layer                                           │
├─────────────────────────────────────────────────────────────────┤
│ 1. Fusion Routing (3-signal analysis):                          │
│    a) Keyword Match: "theorem", "proof" → math (confidence: 0.8)│
│    b) Similarity Search: matches "mathematical proofs" concept  │
│       (similarity: 0.87)                                        │
│    c) BERT Classification: "math" category (confidence: 0.92)   │
│    → Final Decision: "math" (multi-signal consensus)            │
│ 2. Model Selection: deepseek-v31 (best for math reasoning)      │
│ 3. System Prompt Injection:                                     │
│    "You are a mathematics expert. Provide step-by-step          │
│    solutions, show your work clearly, and explain               │
│    mathematical concepts in an understandable way."             │
│ 4. Reasoning Mode: ENABLED (entropy-based decision)             │
│ 5. Security: PASS (no PII, no jailbreak)                        │
│ 6. Semantic Cache: MISS (novel query)                           │
│ 7. Enriched Request:                                            │
│    - model=deepseek-v31                                         │
│    - reasoning_effort=high                                      │
│    - system_prompt=<math expert prompt>                         │
└─────────────────────────────────────────────────────────────────┘
                                ↓
┌─────────────────────────────────────────────────────────────────┐
│ Dynamo Router Layer                                             │
├─────────────────────────────────────────────────────────────────┤
│ 1. Worker Pool: [worker-1, worker-2, worker-3] (deepseek-v31)   │
│ 2. KV Cache Analysis:                                           │
│    - worker-1: 15 cached blocks (math proofs context)           │
│    - worker-2: 3 cached blocks                                  │
│    - worker-3: 0 cached blocks                                  │
│ 3. Cost Calculation:                                            │
│    - worker-1: 85 prefill + 25 active = 110 (BEST)              │
│    - worker-2: 97 prefill + 20 active = 117                     │
│    - worker-3: 100 prefill + 18 active = 118                    │
│ 4. Selection: worker-1 (significant prefill cost reduction)     │
└─────────────────────────────────────────────────────────────────┘
Result:
- Right model (deepseek-v31 for math reasoning)
- Right worker (worker-1 with relevant KV cache)
- Right mode (reasoning enabled)
- Significantly faster TTFT vs. random worker selection
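For clarity, the enriched request handed from the semantic layer to Dynamo's frontend might look like the following; field names beyond those listed above, and the exact transport (headers vs. body fields), are assumptions.

```python
# How the enriched request from the example might look when it reaches
# Dynamo's frontend (exact field names are assumptions).
enriched_request = {
    "model": "deepseek-v31",
    "reasoning_effort": "high",
    "messages": [
        {"role": "system", "content": "You are a mathematics expert. ..."},
        {"role": "user", "content": "Explain the proof of Fermat's Last Theorem step-by-step"},
    ],
}
```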
2.4 Why Integration Matters: Achieving System-Level Intelligence
Challenge 1: Infrastructure without Intelligence
- Dynamo optimizes infrastructure efficiency but lacks semantic understanding
- Cannot distinguish between "2+2=?" and "Prove Fermat's Last Theorem"
- Routes both to the same model pool without understanding complexity or quality requirements
- No ability to select specialized models (math vs. code vs. creative) based on task characteristics
Challenge 2: Intelligence without Infrastructure Awareness
- Semantic Router provides intelligent model selection but lacks infrastructure visibility
- Selects the right model but not the optimal worker
- Cannot leverage KV cache reuse across workers
- No awareness of GPU utilization or worker load for efficiency optimization
Solution: Holistic System Intelligence through Layered Integration
System Intelligence Layer (Semantic Router)
    ↓ [accuracy: model selection, quality optimization, security]
    ↓ [efficiency: semantic cache, adaptive routing, cost control]
Infrastructure Optimization Layer (Dynamo)
    ↓ [efficiency: worker selection, KV cache, GPU scheduling]
    ↓ [accuracy: consistent execution, reliable serving]
Execution Layer (vLLM/SGLang/TRT-LLM)
Result: A holistically intelligent system that optimizes for both accuracy (right model, right prompt, right quality) and efficiency (right worker, right cache, right resource utilization) at every layer.
3. Goals and Non-Goals
3.1 Goals
Primary Goals:
- Seamless Integration: Semantic Router operates as a pre-processing layer before Dynamo's router
- Dual-Layer Caching: Semantic cache (request-level) + KV cache (token-level) work in tandem
- Model-Aware Routing: Dynamo routes to worker pools filtered by Semantic Router's model selection
- Security Enforcement: PII and jailbreak detection before requests reach Dynamo
- Unified Observability: Single trace spans both semantic and infrastructure layers
- Zero Downtime: Hot-reload of semantic routing rules without Dynamo restart
Secondary Goals:
- Performance: Combined latency < 50ms (semantic + infrastructure routing)
- Scalability: Support 10K+ RPS with horizontal scaling
- Flexibility: Support multiple deployment patterns (sidecar, gateway, embedded)
3.2 Non-Goals
- Replacing Dynamo Router: Semantic Router augments, not replaces, Dynamo's KV-aware routing
- Modifying Dynamo Core: Integration via standard APIs, no Dynamo internals changes required
- Unified Configuration: Maintain separate configs for semantic and infrastructure layers
- Synchronous Coupling: Systems can operate independently if needed
4. Proposal Details
4.1 Deep Learning Models
The Semantic Router leverages four specialized deep learning models for intelligent request processing. The system uses a combination of BERT and ModernBERT architectures optimized for different tasks.
4.1.1 Similarity Model (BERT Embeddings)
Purpose: Generate embeddings for semantic similarity comparison
Model: sentence-transformers/all-MiniLM-L12-v2
Key Features:
- Architecture: BERT-based (microsoft/MiniLM-L12-H384-uncased)
  - 12 layers, 384 hidden dimensions, 12 attention heads
  - Fine-tuned on 1B+ sentence pairs using contrastive learning
  - Base model: standard BERT architecture (not ModernBERT)
- Embedding Dimension: 384
- Use Cases:
  - Semantic cache similarity matching (threshold: 0.8)
  - Tool selection via semantic search (threshold: 0.2)
  - Similarity-based routing for semantic concepts
- Deployment: CPU-optimized for cost efficiency
- Model Size: 33.4M parameters (~120 MB)
Configuration:
bert_model:
  model_id: sentence-transformers/all-MiniLM-L12-v2
  threshold: 0.6
  use_cpu: true
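As an illustration of how this model is used for cache similarity matching, the sketch below calls the sentence-transformers library directly; the production router embeds the model natively, so this only mirrors the logic.

```python
# Sketch: similarity scoring with sentence-transformers (mirrors, but is
# not, the router's native implementation).
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sentence-transformers/all-MiniLM-L12-v2")  # 384-dim embeddings

cached_query = "How do I prove the Pythagorean theorem?"
new_query = "Show me a proof of the Pythagorean theorem"

embeddings = model.encode([cached_query, new_query])
score = util.cos_sim(embeddings[0], embeddings[1]).item()
if score >= 0.8:  # semantic-cache threshold from above
    print(f"semantic cache hit (similarity={score:.2f})")
```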
Why BERT (not ModernBERT)?
- Mature, well-tested model with proven performance
- Optimized for sentence embeddings via contrastive learning
- Smaller model size (120 MB) for faster loading
- ModernBERT (released Dec 2024) is used for classification tasks below
4.1.2 Classification Model (Category Detection)
Purpose: Classify queries into 14 domain categories
Model: models/category_classifier_modernbert-base_model
Key Features:
- Architecture: ModernBERT-base (released Dec 2024)
  - Modern replacement for BERT with an improved architecture
  - 8192-token context length (vs. BERT's 512)
  - Rotary Position Embeddings (RoPE) for better long-context handling
  - Flash Attention 2 for faster inference
  - Fine-tuned on the MMLU-Pro dataset for domain classification
- Categories: 14 domains (math, computer_science, physics, chemistry, biology, engineering, economics, business, law, psychology, philosophy, history, health, other)
- Output: Category label + confidence score
- Threshold: 0.6 (configurable)
- Training Data: MMLU-Pro dataset with domain-specific examples
- Model Size: ~149M parameters (ModernBERT-base)
Configuration:
classifier:
  category_model:
    model_id: "models/category_classifier_modernbert-base_model"
    use_modernbert: true
    threshold: 0.6
    use_cpu: true
    category_mapping_path: "models/category_classifier_modernbert-base_model/category_mapping.json"
Model Selection Impact:
- Determines which LLM to route to (e.g., DeepSeek-V3 for math, Qwen3 for business)
- Triggers domain-specific system prompt injection
- Controls reasoning mode activation
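A hedged sketch of category inference using Hugging Face transformers is shown below; it assumes the fine-tuned checkpoint from the configuration above is available locally and only illustrates the classify-then-threshold flow, not the router's native inference path.

```python
# Sketch of category classification with Hugging Face transformers; assumes
# the fine-tuned checkpoint above is available locally.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_PATH = "models/category_classifier_modernbert-base_model"
tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_PATH)

inputs = tokenizer("Prove Fermat's Last Theorem step-by-step", return_tensors="pt")
with torch.no_grad():
    probs = model(**inputs).logits.softmax(dim=-1)
confidence, idx = probs.max(dim=-1)
if confidence.item() >= 0.6:  # configured threshold
    category = model.config.id2label[idx.item()]  # e.g., "math"
```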
4.1.3 PII Detection Model (Privacy Protection)
Purpose: Detect personally identifiable information at token level
Model: models/pii_classifier_modernbert-base_presidio_token_model
Key Features:
- Architecture: ModernBERT-base fine-tuned for token classification
  - Token-level sequence labeling (BIO tagging scheme)
  - Fine-tuned on Microsoft Presidio dataset
  - Optimized for privacy-sensitive entity detection
- PII Types Detected: 17 types, including:
  - Identity: PERSON, AGE, NRP (nationality/religious/political)
  - Contact: EMAIL_ADDRESS, PHONE_NUMBER, STREET_ADDRESS, ZIP_CODE
  - Financial: CREDIT_CARD, IBAN_CODE, US_SSN, US_DRIVER_LICENSE
  - Technical: IP_ADDRESS, DOMAIN_NAME
  - Organizational: ORGANIZATION, GPE (geopolitical entity)
  - Temporal: DATE_TIME
- Granularity: Token-level classification (not just entity-level)
- Threshold: 0.7 (configurable)
- Action: Block requests violating model-specific PII policies
- Model Size: ~149M parameters (ModernBERT-base)
Configuration:
classifier:
  pii_model:
    model_id: "models/pii_classifier_modernbert-base_presidio_token_model"
    use_modernbert: true
    threshold: 0.7
    use_cpu: true
    pii_mapping_path: "models/pii_classifier_modernbert-base_presidio_token_model/pii_type_mapping.json"
Policy Enforcement:
model_config:
  public-model:
    pii_policy:
      allow_by_default: false
      pii_types_allowed: ["PERSON"]  # Only person names allowed
Response Headers (when blocked):
- x-vsr-pii-violation: true
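A minimal sketch of the policy check implied by the configuration above; the function name and policy dictionary shape are assumptions derived from the config keys shown.

```python
# Sketch of the PII policy check; the function and policy shape are
# assumptions based on the config keys above.
def violates_pii_policy(detected_types: set, policy: dict) -> bool:
    """Block when any detected PII type is not allowed by the model's policy."""
    if policy.get("allow_by_default", True):
        return False
    allowed = set(policy.get("pii_types_allowed", []))
    return bool(detected_types - allowed)

policy = {"allow_by_default": False, "pii_types_allowed": ["PERSON"]}
violates_pii_policy({"PERSON"}, policy)            # False -> request passes
violates_pii_policy({"PERSON", "US_SSN"}, policy)  # True  -> blocked
```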
4.1.4 Jailbreak Detection Model (Security)
Purpose: Detect adversarial prompts and jailbreak attempts
Model: Auto-discovered from the models/ directory
Key Features:
- Architecture: Multiple options with automatic selection
  - LoRA models (preferred): fine-tuned adapters on a BERT/RoBERTa/ModernBERT base
    - lora_jailbreak_classifier_bert_model (Priority 1)
    - lora_jailbreak_classifier_roberta_model (Priority 2)
    - lora_jailbreak_classifier_modernbert_model (Priority 3)
  - Legacy model (fallback): jailbreak_classifier_modernbert-base_model
  - LoRA models offer better accuracy at a smaller size (~10-20 MB adapters)
- Model Discovery: Automatic selection with architecture priority: BERT > RoBERTa > ModernBERT
- Detection Types:
  - Prompt injection attacks
  - Instruction override attempts
  - Adversarial prompts
  - Social engineering
- Threshold: 0.7 (configurable)
- Action: Block requests with confidence above threshold
- Model Size:
  - LoRA: ~10-20 MB (adapter only) + base model
  - Legacy: ~149M parameters (ModernBERT-base)
Configuration:
prompt_guard:
  enabled: true
  use_modernbert: true
  threshold: 0.7
  use_cpu: true
  # model_id and jailbreak_mapping_path are auto-discovered
Response Headers (when blocked):
- x-vsr-jailbreak-blocked: true
- x-vsr-jailbreak-type: {type} (e.g., "prompt_injection")
- x-vsr-jailbreak-confidence: {score} (e.g., "0.950")
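The auto-discovery order described above might look like the following sketch; the directory names come from the model list above, while the helper function itself is hypothetical.

```python
# Sketch of the auto-discovery priority described above; the helper itself
# is hypothetical.
import os

CANDIDATES = [
    "lora_jailbreak_classifier_bert_model",        # Priority 1
    "lora_jailbreak_classifier_roberta_model",     # Priority 2
    "lora_jailbreak_classifier_modernbert_model",  # Priority 3
    "jailbreak_classifier_modernbert-base_model",  # legacy fallback
]

def discover_jailbreak_model(models_dir="models"):
    """Return the highest-priority model directory that exists, else None."""
    for name in CANDIDATES:
        path = os.path.join(models_dir, name)
        if os.path.isdir(path):
            return path
    return None
```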
4.1.5 Model Performance Summary
Model | Purpose | Architecture | Parameters | Threshold | CPU/GPU |
---|---|---|---|---|---|
Similarity | Semantic matching | BERT (MiniLM-L12) | 33.4M | 0.6-0.8 | CPU |
Classification | Category detection | ModernBERT-base | 149M | 0.6 | CPU |
PII Detection | Privacy protection | ModernBERT-base | 149M | 0.7 | CPU |
Jailbreak | Security filtering | ModernBERT-base/LoRA | 149M + adapters | 0.7 | CPU |
Architecture Comparison:
Feature | BERT (MiniLM) | ModernBERT |
---|---|---|
Release Date | 2020 | December 2024 |
Context Length | 512 tokens | 8192 tokens |
Position Encoding | Absolute | RoPE (Rotary) |
Attention | Standard | Flash Attention 2 |
Use Case | Embeddings | Classification |
Model Size | 33.4M params | 149M params |
Optimization Strategies:
- Parallel Execution: PII and Jailbreak detection run in parallel (see the sketch after this list)
- Early Exit: Cache hits bypass all model inference
- Keyword Routing: Fast path for deterministic patterns
- CPU Optimization: All models optimized for CPU inference to reduce cost
- LoRA Adapters: Jailbreak model uses lightweight adapters for faster loading
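As a sketch of the parallel-execution strategy in the first item above, the two security classifiers can be dispatched concurrently; the `pii_check` and `jailbreak_check` callables here are placeholders.

```python
# Sketch: dispatch PII and jailbreak checks concurrently; pii_check and
# jailbreak_check are placeholder callables.
from concurrent.futures import ThreadPoolExecutor

def run_security_checks(query, pii_check, jailbreak_check):
    """Run both classifiers in parallel; either violation blocks the request."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        pii_future = pool.submit(pii_check, query)
        jailbreak_future = pool.submit(jailbreak_check, query)
        return {"pii": pii_future.result(), "jailbreak": jailbreak_future.result()}
```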
4.2 Design Principles
- Separation of Concerns: Semantic intelligence and infrastructure optimization remain decoupled
- API-Driven Integration: Use Dynamo's frontend API and worker registration mechanisms
- Fail-Safe Design: Semantic Router failure falls back to Dynamo's default routing (sketched below)
- Observability-First: Every decision (semantic + infrastructure) is traced and logged
- Kubernetes-Native: Designed for cloud-native deployment with CRDs and operators
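A minimal sketch of the fail-safe principle, assuming a `semantic_router` client object with an `enrich` method (a hypothetical interface): on any error or timeout, the request passes through unmodified and Dynamo's default routing takes over.

```python
# Sketch of fail-safe routing: on any semantic-layer error, forward the
# request unchanged so Dynamo's default KV-aware routing takes over.
def route_with_fallback(request, semantic_router, timeout_s=0.05):
    try:
        return semantic_router.enrich(request, timeout=timeout_s)
    except Exception:
        return request  # fail open: Dynamo handles the raw request
```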
4.3 System Architecture
Architecture Layers:
1. Semantic Intelligence Layer (Semantic Router)
   - Envoy Gateway with ExtProc for request interception
   - BERT-based classification and security filtering
   - Semantic caching with Milvus backend
   - Request enrichment with routing metadata
2. Infrastructure Optimization Layer (Dynamo)
   - Dynamo Frontend receives enriched requests
   - KV Router performs model-aware worker selection
   - Planner handles dynamic scaling
   - KVBM manages multi-tier KV cache
3. Execution Layer (vLLM/SGLang/TRT-LLM)
   - Model-specific worker pools
   - Disaggregated prefill/decode workers
   - Backend-agnostic execution
4. Storage Layer
   - Milvus for semantic cache
   - System memory for KV cache offload
   - NVMe for cold KV cache storage
4.4 Request Flow
4.4.1 End-to-End Request Processing
┌─────────────────────────────────────────────────────────────────┐
│ Phase 1: Semantic Intelligence (Semantic Router)                │
├─────────────────────────────────────────────────────────────────┤
│ Step 1: Request Interception                                    │
│   - Envoy Gateway receives OpenAI API request                   │
│   - ExtProc gRPC call to Semantic Router                        │
│   - Extract query from messages array                           │