semantic-cache
Category: cache · Cloud + Local · Status: v1 — production
Embeds each incoming request, searches the vector index for similar past requests, and returns the cached response if the best similarity score meets or exceeds the threshold. Skips the provider call entirely on a hit.
What it does
For workloads with repeated questions in different phrasings, semantic-cache can skip repeated provider calls. Hit rate depends on workload repetition, similarity threshold, cache TTL, and exclusions.
The local benchmark suite includes a repeated-seed synthetic fixture. Treat it as a mechanics test, not a production average. Run prxy bench --remote for your own traffic shape.
When to use it
- ✅ Customer support chatbots (lots of similar questions)
- ✅ Documentation Q&A
- ✅ Repetitive coding queries
- ✅ Any read-heavy workload

- ❌ Creative writing (similar prompts deserve different responses)
- ❌ Real-time data queries (stock price, weather)
- ❌ Prompts with embedded timestamps or unique tokens
Configuration
semantic-cache:
  similarity: 0.85          # 0.0 - 1.0; cosine similarity threshold
  ttlSeconds: 3600          # cache lifetime
  embeddingModel: 'voyage-3'
  excludePaths: []          # request paths to skip caching (regex)
  scope: 'user'             # 'user' | 'global'; share across users?
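As a rough illustration of how these keys could be interpreted, the sketch below mirrors the YAML in a TypeScript type and adds two small helpers. The type name `SemanticCacheConfig`, the helpers `isExpired` and `isExcluded`, and the exact boundary and regex-matching behavior are assumptions made for illustration, not the module's real API.

```typescript
// Hypothetical type mirroring the YAML keys above; not the module's own type.
interface SemanticCacheConfig {
  similarity: number; // cosine similarity threshold, 0.0 - 1.0
  ttlSeconds: number; // cache lifetime
  embeddingModel: string;
  excludePaths: string[]; // regex sources
  scope: "user" | "global";
}

// Values copied from the configuration block above.
const exampleConfig: SemanticCacheConfig = {
  similarity: 0.85,
  ttlSeconds: 3600,
  embeddingModel: "voyage-3",
  excludePaths: [],
  scope: "user",
};

// Assumption: an entry is expired once ttlSeconds have fully elapsed.
function isExpired(storedAtMs: number, nowMs: number, cfg: SemanticCacheConfig): boolean {
  return nowMs - storedAtMs >= cfg.ttlSeconds * 1000;
}

// Assumption: each excludePaths entry is an unanchored regex tested against the request path.
function isExcluded(path: string, cfg: SemanticCacheConfig): boolean {
  return cfg.excludePaths.some((pattern) => new RegExp(pattern).test(path));
}
```

With the empty `excludePaths` shown above, `isExcluded` returns `false` for every path. A pattern such as `'^/v1/admin'` (a made-up path) would skip caching for anything under that prefix.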
Metrics emitted
- cache.semantic.hit (boolean)
- cache.semantic.score (0.0–1.0; the best score found, even on miss)
- cache.semantic.lookup_ms (number)
- cache.semantic.write_ms (number; post-hook only)
Examples
High precision — almost never false-positive, low hit rate:
semantic-cache:
  similarity: 0.95
Balanced default — good for most apps:
semantic-cache:
  similarity: 0.85
  ttlSeconds: 3600
Aggressive — max savings, occasional false positives acceptable:
semantic-cache:
  similarity: 0.75
  ttlSeconds: 86400
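To make the effect of the threshold concrete, here is a minimal sketch that scores one pair of vectors and checks it against the three `similarity` values above. The two-dimensional vectors are toy stand-ins for real embeddings, and `cosineSimilarity` is an illustrative helper, not a function exported by the module.

```typescript
// Cosine similarity between two equal-length vectors.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("vectors must have the same length");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    const x = a[i] ?? 0;
    const y = b[i] ?? 0;
    dot += x * y;
    normA += x * x;
    normB += y * y;
  }
  if (normA === 0 || normB === 0) return 0; // zero vector: treat as no match
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// The three thresholds from the examples above.
const presets = [
  { label: "high precision", similarity: 0.95 },
  { label: "balanced", similarity: 0.85 },
  { label: "aggressive", similarity: 0.75 },
];

// Toy vectors standing in for real embeddings; their cosine similarity is 0.8.
const cachedEmbedding = [1, 0];
const incomingEmbedding = [4, 3];
const score = cosineSimilarity(cachedEmbedding, incomingEmbedding);

// A hit requires score >= similarity, so only the aggressive preset matches here.
const hits = presets.filter((p) => score >= p.similarity).map((p) => p.label);
```

The same pair of requests is a miss at 0.95 and 0.85 but a hit at 0.75. That is the trade-off the presets describe: lowering the threshold raises the hit rate, and the chance of returning a response to a question that was not quite the same.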
How it works
- Pre hook:
  - Serialize the request (system + messages, normalized).
  - Embed it.
  - vectorSearch against the cached embeddings table for this user (or global if scope: global).
  - If best score ≥ similarity: return cached response with cache.semantic.hit = true. Skip provider call.
  - If miss: continue pipeline, attach cache.semantic.score for visibility.
- Post hook (only on miss):
  - Store the request embedding + response with ttlSeconds.
  - Skipped if the response was an error (stop_reason === 'error').
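The two hooks above can be sketched with an in-memory array in place of the vector index. Everything here is an assumption made for illustration: the type and function names, the linear scan, and especially `toyEmbed`, which is a hashed bag-of-words stand-in for the configured embedding model. Because of that stand-in, this version only matches near-identical wording, not true paraphrases.

```typescript
// Hypothetical names throughout; a sketch of the flow, not the module's code.
interface ChatRequest { system: string; messages: string[] }
interface ProviderResponse { text: string; stop_reason: string }
interface Entry { embedding: number[]; response: ProviderResponse; expiresAtMs: number }
interface Lookup { hit: boolean; score: number; response?: ProviderResponse }
interface CacheState { similarity: number; ttlSeconds: number; entries: Entry[] }

// Serialize system + messages, normalizing whitespace and casing.
function serialize(req: ChatRequest): string {
  return [req.system].concat(req.messages).join("\n").toLowerCase().replace(/\s+/g, " ").trim();
}

// Toy stand-in for an embedding model: hashed bag-of-words counts.
function toyEmbed(text: string, dims: number = 32): number[] {
  const vec: number[] = [];
  for (let i = 0; i < dims; i++) vec.push(0);
  for (const word of text.split(" ")) {
    let h = 0;
    for (let i = 0; i < word.length; i++) h = (h * 31 + word.charCodeAt(i)) % dims;
    vec[h] = (vec[h] ?? 0) + 1;
  }
  return vec;
}

function cosine(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    const x = a[i] ?? 0;
    const y = b[i] ?? 0;
    dot += x * y;
    normA += x * x;
    normB += y * y;
  }
  return normA === 0 || normB === 0 ? 0 : dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Pre hook: embed the request and return the best live match if it clears the threshold.
function preHook(state: CacheState, req: ChatRequest, nowMs: number): Lookup {
  const query = toyEmbed(serialize(req));
  let best: Entry | undefined;
  let bestScore = 0;
  for (const entry of state.entries) {
    if (entry.expiresAtMs <= nowMs) continue; // expired entries never match
    const score = cosine(query, entry.embedding);
    if (score > bestScore) {
      bestScore = score;
      best = entry;
    }
  }
  if (best !== undefined && bestScore >= state.similarity) {
    return { hit: true, score: bestScore, response: best.response };
  }
  return { hit: false, score: bestScore }; // score is still reported on a miss
}

// Post hook: called only after a miss; error responses are not stored.
function postHook(state: CacheState, req: ChatRequest, res: ProviderResponse, nowMs: number): void {
  if (res.stop_reason === "error") return;
  state.entries.push({
    embedding: toyEmbed(serialize(req)),
    response: res,
    expiresAtMs: nowMs + state.ttlSeconds * 1000,
  });
}
```

A caller would run `preHook` first, call the provider only when `hit` is `false`, and then pass the provider's response to `postHook`.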
Cache scope
| Setting | Behavior |
|---|---|
| scope: 'user' (default) | Cache hits only match within the same userId. Each user has their own private cache. |
| scope: 'global' | Cache hits match across all users. Use for documentation Q&A or other public-info workloads. |
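One way to picture the two settings is as a partition key attached to every cache entry. The sketch below assumes that model. `partitionFor`, `candidates`, and the `ScopedEntry` shape are hypothetical names used for illustration, and the real storage layout may differ.

```typescript
type Scope = "user" | "global";

interface ScopedEntry {
  partition: string;
  prompt: string;
  response: string;
}

// Hypothetical partition key: one shared partition for 'global', one per userId for 'user'.
function partitionFor(scope: Scope, userId: string): string {
  return scope === "global" ? "global" : "user:" + userId;
}

// Only entries in the caller's partition are candidates for a hit.
function candidates(entries: ScopedEntry[], scope: Scope, userId: string): ScopedEntry[] {
  const partition = partitionFor(scope, userId);
  return entries.filter((e) => e.partition === partition);
}
```

Under `'user'`, an entry written for `alice` is never a candidate for `bob`. Under `'global'`, every user searches the same partition.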
Streaming
Cache hits on streaming requests are replayed as a synthetic SSE stream. Your client sees message_start → content_block_* → message_stop events just as it would from a real stream, only faster and without a provider call.
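A replay of this kind might look like the sketch below, which splits a cached text response into chunks and wraps each in standard SSE framing (`event:` and `data:` lines followed by a blank line). The function names, the chunk size, and the simplified JSON payloads are assumptions. The real replay presumably carries the full event payloads of the original response.

```typescript
// Format one server-sent event: an event name line, a data line, then a blank line.
function sseFrame(event: string, data: object): string {
  return "event: " + event + "\ndata: " + JSON.stringify(data) + "\n\n";
}

// Hypothetical replay: turn a cached text response into a sequence of SSE frames.
// Payloads are simplified. Chunks are cut by UTF-16 code unit, so a production
// version would need to avoid splitting surrogate pairs.
function replayAsSse(text: string, chunkSize: number = 16): string[] {
  const step = Math.max(1, Math.floor(chunkSize)); // guard against a non-advancing loop
  const frames: string[] = [];
  frames.push(sseFrame("message_start", { type: "message_start" }));
  frames.push(sseFrame("content_block_start", { type: "content_block_start", index: 0 }));
  for (let i = 0; i < text.length; i += step) {
    frames.push(
      sseFrame("content_block_delta", {
        type: "content_block_delta",
        index: 0,
        delta: { type: "text_delta", text: text.slice(i, i + step) },
      })
    );
  }
  frames.push(sseFrame("content_block_stop", { type: "content_block_stop", index: 0 }));
  frames.push(sseFrame("message_stop", { type: "message_stop" }));
  return frames;
}
```

Writing the returned frames to the response in order gives the client the same event sequence it would parse from a live stream.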
Cloud vs Local
Cloud cache entries are scoped to your account or workspace. Local cache entries stay in your configured local data volume. Both modes store the request embedding next to the cached response so lookup and retrieval use the same adapter contract.
Don’t set scope: 'global' on workloads that include user-specific data in the prompt. The serializer normalizes whitespace + casing but does not redact user IDs, names, or PII.
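The reason for this warning is easier to see with a small example. The sketch assumes the normalization amounts to lowercasing and collapsing whitespace. `normalizePrompt` is an illustrative stand-in, not the module's serializer.

```typescript
// Assumed normalization: lowercase, collapse whitespace runs, trim.
// Nothing here removes names, IDs, or other personal data.
function normalizePrompt(text: string): string {
  return text.toLowerCase().replace(/\s+/g, " ").trim();
}

const normalized = normalizePrompt("  Show the balance for  Alice Smith (user 4821) ");
// normalized === "show the balance for alice smith (user 4821)"
```

The name and the ID survive normalization, so they are embedded and stored with the response. With `scope: 'global'`, a similar enough prompt from a different user could then be served that response.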