semantic-cache

Category: cache · Cloud + Local · Status: v1 — production

Embeds each incoming request, searches the vector index for similar past requests, and returns the cached response when the best match meets or exceeds the similarity threshold. A hit skips the provider call entirely.

What it does

For workloads with repeated questions in different phrasings, semantic-cache can skip repeated provider calls. Hit rate depends on workload repetition, similarity threshold, cache TTL, and exclusions.

The local benchmark suite includes a repeated-seed synthetic fixture. Treat it as a mechanics test, not a production average. To benchmark against your own traffic shape, run prxy bench --remote.

When to use it

✅ Customer support chatbots (lots of similar questions)
✅ Documentation Q&A
✅ Repetitive coding queries
✅ Any read-heavy workload

❌ Creative writing (similar prompts deserve different responses)
❌ Real-time data queries (stock price, weather)
❌ Prompts with embedded timestamps or unique tokens

Configuration

semantic-cache:
  similarity: 0.85          # 0.0 - 1.0; cosine similarity threshold
  ttlSeconds: 3600          # cache lifetime
  embeddingModel: 'voyage-3'
  excludePaths: []          # request paths to skip caching (regex)
  scope: 'user'             # 'user' | 'global' — share across users?
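
As a rough illustration of how these fields might be consumed, the sketch below types the config and checks a request path against excludePaths. The interface name SemanticCacheConfig, the constant exampleConfig, and the helper isExcluded are hypothetical and not taken from the module's source. It also assumes each pattern is an unanchored regular expression tested against the whole path, which the docs do not confirm.

```typescript
// Hypothetical shape mirroring the YAML keys above; not the module's actual types.
interface SemanticCacheConfig {
  similarity: number;       // cosine similarity threshold, 0.0 - 1.0
  ttlSeconds: number;       // cache lifetime in seconds
  embeddingModel: string;
  excludePaths: string[];   // regex sources for request paths to skip
  scope: "user" | "global";
}

// Values copied from the configuration block above.
const exampleConfig: SemanticCacheConfig = {
  similarity: 0.85,
  ttlSeconds: 3600,
  embeddingModel: "voyage-3",
  excludePaths: [],
  scope: "user",
};

// Returns true when any excludePaths pattern matches the request path.
// Assumption: patterns are unanchored unless they include ^ or $ themselves.
function isExcluded(path: string, config: SemanticCacheConfig): boolean {
  return config.excludePaths.some((pattern) => new RegExp(pattern).test(path));
}
```

With excludePaths: ['^/v1/admin'], a request to /v1/admin/keys would bypass the cache under this reading, while /v1/messages would not.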

Metrics emitted

Examples

High precision — almost never false-positive, low hit rate:

semantic-cache:
  similarity: 0.95

Balanced default — good for most apps:

semantic-cache:
  similarity: 0.85
  ttlSeconds: 3600

Aggressive — max savings, occasional false positives acceptable:

semantic-cache:
  similarity: 0.75
  ttlSeconds: 86400
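
To make the trade-off between the three presets concrete, here is a minimal sketch of the threshold comparison. The function names cosineSimilarity and isHit are illustrative, and the two-dimensional vectors are toy stand-ins for real embeddings. The hit rule (score ≥ similarity) follows the pre hook description in "How it works".

```typescript
// Cosine similarity of two equal-length vectors: dot(a, b) / (|a| * |b|).
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("vectors must have equal length");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0; // zero vector: treat as no similarity
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// A hit requires the best score to meet or exceed the configured threshold.
function isHit(score: number, similarity: number): boolean {
  return score >= similarity;
}

// Toy 2-D "embeddings" whose similarity is roughly 0.90.
const score = cosineSimilarity([1, 0], [0.9, 0.436]);

const highPrecision = isHit(score, 0.95); // miss
const balanced = isHit(score, 0.85);      // hit
const aggressive = isHit(score, 0.75);    // hit
```

The same pair of requests is a miss under the high-precision preset and a hit under the other two, which is the mechanism behind the lower hit rate at 0.95.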

How it works

1. Pre hook:
   • Serialize the request (system + messages, normalized).
   • Embed it.
   • vectorSearch against the cached embeddings table for this user (or global if scope: 'global').
   • If best score ≥ similarity: return cached response with cache.semantic.hit = true. Skip provider call.
   • If miss: continue pipeline, attach cache.semantic.score for visibility.
2. Post hook (only on miss):
   • Store the request embedding + response with ttlSeconds.
   • Skipped if the response was an error (stop_reason === 'error').
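
The two hooks above can be sketched with an in-memory store. This is an illustration under stated assumptions, not the code in src/modules/semantic-cache.ts: the class SemanticCacheSketch, its method signatures, the linear scan standing in for vectorSearch, and the injected synchronous embed function are all hypothetical.

```typescript
type Embedding = number[];

interface Entry {
  scopeKey: string;     // a userId, or a shared key for global scope
  embedding: Embedding;
  response: string;
  expiresAt: number;    // epoch milliseconds
}

interface PreResult {
  hit: boolean;
  score: number;        // best similarity found (0 when nothing matched)
  response?: string;
}

function cosine(a: Embedding, b: Embedding): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return na === 0 || nb === 0 ? 0 : dot / (Math.sqrt(na) * Math.sqrt(nb));
}

class SemanticCacheSketch {
  private entries: Entry[] = [];
  private embed: (text: string) => Embedding; // stand-in for the embedding model
  private similarity: number;
  private ttlSeconds: number;

  constructor(embed: (text: string) => Embedding, similarity = 0.85, ttlSeconds = 3600) {
    this.embed = embed;
    this.similarity = similarity;
    this.ttlSeconds = ttlSeconds;
  }

  // Pre hook: embed, scan this scope's live entries, hit if best score >= threshold.
  pre(scopeKey: string, serialized: string, now: number = Date.now()): PreResult {
    const query = this.embed(serialized);
    let best: Entry | undefined;
    let bestScore = 0;
    for (const entry of this.entries) {
      if (entry.scopeKey !== scopeKey || entry.expiresAt <= now) continue;
      const s = cosine(query, entry.embedding);
      if (s > bestScore) {
        best = entry;
        bestScore = s;
      }
    }
    if (best && bestScore >= this.similarity) {
      return { hit: true, score: bestScore, response: best.response };
    }
    return { hit: false, score: bestScore }; // miss: score kept for visibility
  }

  // Post hook (miss only): store embedding + response, unless the response errored.
  post(scopeKey: string, serialized: string, response: string, stopReason: string, now: number = Date.now()): void {
    if (stopReason === "error") return;
    this.entries.push({
      scopeKey,
      embedding: this.embed(serialized),
      response,
      expiresAt: now + this.ttlSeconds * 1000,
    });
  }
}
```

A real deployment would replace the linear scan with an indexed vector search and would call the embedding model asynchronously. The sketch keeps both synchronous so the control flow of hit, miss, TTL expiry, and error skipping is easy to follow.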

Cache scope

| Setting | Behavior |
| --- | --- |
| scope: 'user' (default) | Cache hits only match within the same userId. Each user has their own private cache. |
| scope: 'global' | Cache hits match across all users. Use for documentation Q&A or other public-info workloads. |
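
One way to picture the two settings is as a partition key shared by lookups and writes. The helper partitionKey and the key format below are hypothetical; the docs do not say how the module partitions entries internally.

```typescript
type Scope = "user" | "global";

// Hypothetical helper: entries are only compared within the same partition key.
function partitionKey(scope: Scope, userId: string): string {
  return scope === "global" ? "global" : `user:${userId}`;
}

const alicePrivate = partitionKey("user", "alice");  // "user:alice"
const bobPrivate = partitionKey("user", "bob");      // "user:bob"
const aliceShared = partitionKey("global", "alice"); // "global"
const bobShared = partitionKey("global", "bob");     // "global"
```

Under 'user' scope the two users never share a key, so neither can hit the other's entries. Under 'global' scope both map to the same key.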

Streaming

Cache hits on streaming requests are replayed as a synthetic SSE stream. Your client sees message_start → content_block_* → message_stop events just like a real stream, only faster and with no provider call behind it.
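
A minimal sketch of such a replay follows, assuming the cached response is plain text and that fixed-size chunks are acceptable. The function names sseFrame and replayAsSse, the chunk size, and the trimmed-down event payloads are assumptions for illustration. The real module may emit additional events and fields.

```typescript
// Format one Server-Sent Events frame: an "event:" line, a "data:" line, a blank line.
function sseFrame(event: string, data: object): string {
  return `event: ${event}\ndata: ${JSON.stringify(data)}\n\n`;
}

// Turn cached text into an ordered list of SSE frames.
// Note: slicing by UTF-16 code units can split a surrogate pair; fine for a sketch.
function replayAsSse(cachedText: string, chunkSize: number = 20): string[] {
  if (chunkSize < 1) throw new Error("chunkSize must be at least 1");
  const frames: string[] = [];
  frames.push(sseFrame("message_start", { type: "message_start" }));
  frames.push(sseFrame("content_block_start", { type: "content_block_start", index: 0 }));
  for (let i = 0; i < cachedText.length; i += chunkSize) {
    frames.push(
      sseFrame("content_block_delta", {
        type: "content_block_delta",
        index: 0,
        delta: { type: "text_delta", text: cachedText.slice(i, i + chunkSize) },
      }),
    );
  }
  frames.push(sseFrame("content_block_stop", { type: "content_block_stop", index: 0 }));
  frames.push(sseFrame("message_stop", { type: "message_stop" }));
  return frames;
}
```

Writing each frame to the response in order gives the client the same event sequence it would parse from a live stream, with the deltas concatenating back to the cached text.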

Cloud vs Local

Cloud cache entries are scoped to your account or workspace. Local cache entries stay in your configured local data volume. Both modes store the request embedding next to the cached response so lookup and retrieval use the same adapter contract.

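Don’t set scope: 'global' on workloads that include user-specific data in the prompt. The serializer normalizes whitespace and casing but does not redact user IDs, names, or PII, so a response cached for one user's prompt can be served to another user whose prompt is similar enough.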
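
The point about normalization can be seen in a small sketch. The normalize function below is a hypothetical stand-in that only collapses whitespace and lowercases, matching the behavior described above. It is not the module's serializer.

```typescript
// Hypothetical normalizer: trims, collapses whitespace runs, lowercases. No redaction.
function normalize(text: string): string {
  return text.trim().replace(/\s+/g, " ").toLowerCase();
}

const first = normalize("  What is   Jane Doe's  balance? ");
const second = normalize("what is jane doe's balance?");

// Both phrasings normalize to the same string, and the name is still in it.
const sameAfterNormalizing = first === second;
const nameSurvives = first.includes("jane doe");
```

Because the name survives normalization, it is embedded and stored along with the rest of the prompt, which is why prompts like this belong in 'user' scope.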

Source

src/modules/semantic-cache.ts