Semantic Caching
The $18,000 FAQ
The data team ran the numbers and sent them to the engineering lead with a single subject line: "You need to see this." Of the monthly LLM bill, $18,000 (nearly 60%) was spent on questions that were semantically identical to questions the system had already answered. The documentation assistant fielded 40,000 queries per day from developers using the company's API platform. When you stripped away surface-level wording variation, the actual unique question types numbered fewer than 2,000. "How do I install the SDK?" and "What's the installation process?" and "Can you show me how to install your library?" were three distinct strings representing a single intent - and all three were generating separate LLM calls at $0.015 each.
The team had a cache. An exact-match Redis cache keyed on the full query string. Its hit rate: under 2%. Because users almost never typed the exact same string twice, the cache was effectively useless. The cache infrastructure existed, but it was solving the wrong problem. String equality is the wrong heuristic for natural language.
One week after deploying semantic caching with a 0.93 similarity threshold, the hit rate was 41%. The monthly bill dropped to $19,000. The P95 latency on cached queries dropped from 1,800ms to 8ms. Finance got a 39% cost reduction. Users got faster responses on repeat question types. Engineering changed exactly zero prompts.
Why String-Match Caching Fails for Natural Language
Traditional caching works by treating the input as an opaque string and returning the stored output when an identical string appears again. This works perfectly for deterministic systems: the same SQL query, the same API parameters, the same cryptographic hash. For these systems, the input space is discrete and the same logical operation has exactly one representation.
Natural language is different. The same meaning can be expressed in hundreds of phrasings:
- "How do I install the SDK?" / "Installation instructions for the SDK?" / "Show me how to set up the SDK"
- "What are the rate limits?" / "How many API calls can I make per minute?" / "What's the request quota?"
- "How do I authenticate?" / "What authentication method does your API use?" / "How do I get an API token?"
These are not different questions - they are the same questions with different surface forms. An exact-match cache misses every variant. A semantic cache recognizes the shared meaning and serves the cached answer for all of them.
The technology that makes this possible is the same technology that powers semantic search: embedding models that convert text into dense vectors where semantically similar texts produce geometrically close vectors. Cosine similarity between vectors measures semantic relatedness.
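To make the geometry concrete, here is a minimal sketch of cosine similarity using only the standard library. The three-dimensional vectors are hand-made stand-ins, not output from a real embedding model (real models produce hundreds or thousands of dimensions), so only the relative ordering of the scores is meaningful.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between a and b: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical toy "embeddings": two install-related queries and one unrelated query.
install_sdk = [0.9, 0.1, 0.0]
setup_sdk = [0.8, 0.2, 0.1]
rate_limits = [0.0, 0.2, 0.9]

close = cosine_similarity(install_sdk, setup_sdk)
far = cosine_similarity(install_sdk, rate_limits)
print(f"install vs setup:       {close:.3f}")
print(f"install vs rate limits: {far:.3f}")
```

The two install-related vectors point in nearly the same direction and score close to 1, while the unrelated vector scores close to 0.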
How Semantic Caching Works
The five steps of semantic caching:
- Embed the query: convert the incoming user query to a dense embedding vector using an embedding model (OpenAI text-embedding-3-small, Cohere embed-english-v3.0, or a local sentence-transformers model). This step costs ~0.02 cents per query and takes ~20ms.
- Search the vector store: perform nearest-neighbor search in the vector index to find the most similar cached query vector. For small caches (under 10k entries), linear scan works. For larger caches, approximate nearest-neighbor (ANN) search with RediSearch or Qdrant is required.
- Check similarity threshold: compute cosine similarity between the incoming vector and the nearest cached vector. If similarity exceeds the configured threshold (typically 0.92–0.95), it is a cache hit.
- Return or call: cache hit means return the stored response immediately (no LLM call, no token cost). Cache miss means call the LLM provider.
- Store on miss: after the LLM responds, store the query vector and response in the vector store for future hits.
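The five steps can be sketched end to end with an in-memory list in place of a vector store. Everything here is illustrative: `InMemorySemanticCache`, `toy_embed` (a keyword counter standing in for a real embedding model), and `fake_llm` are hypothetical names chosen for this sketch, not part of any library.

```python
import math
from typing import Callable, Optional

Vector = list[float]

def cosine(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

class InMemorySemanticCache:
    """Toy cache: a list of (vector, response) pairs searched by linear scan."""

    def __init__(self, embed: Callable[[str], Vector], threshold: float = 0.93):
        self.embed = embed
        self.threshold = threshold
        self.entries: list[tuple[Vector, str]] = []

    def lookup(self, vec: Vector) -> Optional[str]:
        # Steps 2 and 3: find the nearest cached vector, then apply the threshold.
        best = max(self.entries, key=lambda e: cosine(vec, e[0]), default=None)
        if best is not None and cosine(vec, best[0]) >= self.threshold:
            return best[1]
        return None

    def answer(self, query: str, call_llm: Callable[[str], str]) -> tuple[str, bool]:
        vec = self.embed(query)                # Step 1: embed the query
        cached = self.lookup(vec)
        if cached is not None:
            return cached, True                # Step 4: hit, no LLM call
        response = call_llm(query)             # Step 4: miss, call the provider
        self.entries.append((vec, response))   # Step 5: store on miss
        return response, False

def toy_embed(text: str) -> Vector:
    # Hypothetical stand-in for an embedding model: counts a few keywords.
    lowered = text.lower()
    return [float(lowered.count(k)) for k in ("install", "sdk", "rate", "limit")]

calls: list[str] = []

def fake_llm(query: str) -> str:
    calls.append(query)
    return f"answer to: {query}"

cache = InMemorySemanticCache(toy_embed)
first = cache.answer("How do I install the SDK?", fake_llm)
second = cache.answer("Show me how to install your SDK", fake_llm)
third = cache.answer("What are the rate limits?", fake_llm)
print(first[1], second[1], third[1], len(calls))
```

The second query reuses the first query's stored response without calling `fake_llm`, while the unrelated third query misses and triggers a new call.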
The Similarity Threshold: The Most Consequential Knob
The similarity threshold is the single most important configuration decision in semantic caching. Too low, and you serve wrong answers. Too high, and you miss valid hits.
To understand why threshold matters, consider what happens at different values:
- 0.80 threshold: "What is a Python snake?" and "What is Python programming?" have a cosine similarity around 0.82 (they share "Python" and "is" - the embedding model sees lexical overlap). A threshold of 0.80 would serve the snake answer to the programming question. Never deploy below 0.90 without extensive empirical testing.
- 0.93 threshold: hits close paraphrases such as variants of "How do I install the SDK?" (similarity around 0.97), but misses pairs such as "What authentication method does your API use?" and "How do I authenticate?" (0.89), which are semantically identical but use quite different vocabulary. That is a small cost worth the accuracy gain.
- 0.99 threshold: only hits when queries are almost identical character-by-character. Hit rate falls below 5% in most applications. Essentially an expensive fuzzy string match at this point.
How to calibrate empirically: collect a sample of 500–1,000 query pairs from your production logs. Manually label each pair as "should share a response" or "should not share a response." Compute embeddings for each pair. Plot the cosine similarity distribution for both groups. The correct threshold sits at the point that maximizes hits in the "should share" group while minimizing hits in the "should not share" group. This process takes about a half-day and should be repeated when significant domain shift occurs.
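One way to turn the labeled sample into a threshold is a simple sweep, sketched below. The pair similarities are made-up illustrative values, and the `false_hit_weight` parameter is an assumption of this sketch: it encodes the judgment that serving a wrong answer costs more than missing a valid hit.

```python
from typing import NamedTuple

class LabeledPair(NamedTuple):
    similarity: float    # cosine similarity between the two query embeddings
    should_share: bool   # human label: should these queries share a response?

def sweep(pairs: list[LabeledPair], thresholds: list[float]) -> list[tuple[float, int, int]]:
    """For each threshold, count false hits (wrong answer served) and missed hits."""
    rows = []
    for t in thresholds:
        false_hits = sum(1 for p in pairs if p.similarity >= t and not p.should_share)
        missed = sum(1 for p in pairs if p.similarity < t and p.should_share)
        rows.append((t, false_hits, missed))
    return rows

def pick_threshold(
    pairs: list[LabeledPair],
    thresholds: list[float],
    false_hit_weight: float = 5.0,
) -> float:
    """Pick the threshold with the lowest weighted error count."""
    best = min(sweep(pairs, thresholds), key=lambda r: false_hit_weight * r[1] + r[2])
    return best[0]

# Hypothetical labeled sample; in practice this comes from production logs.
pairs = [
    LabeledPair(0.97, True), LabeledPair(0.95, True), LabeledPair(0.94, True),
    LabeledPair(0.89, True), LabeledPair(0.91, False), LabeledPair(0.82, False),
    LabeledPair(0.60, False),
]
thresholds = [0.80, 0.85, 0.90, 0.93, 0.96, 0.99]

for t, false_hits, missed in sweep(pairs, thresholds):
    print(f"threshold={t:.2f} false_hits={false_hits} missed={missed}")
best = pick_threshold(pairs, thresholds)
print(f"selected threshold: {best}")
```

With a real sample of several hundred pairs, the same table makes the trade-off visible before choosing a value.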
Full Implementation: SemanticCache with Redis
The following is a production-quality semantic cache implementation using OpenAI embeddings and Redis for storage.
import anthropic
import hashlib
import json
import time
import numpy as np
import redis
from dataclasses import dataclass, field
from typing import Optional
from openai import OpenAI
# ─────────────────────────────────────────────────────────────────────────────
# Data classes for cache statistics and results
# ─────────────────────────────────────────────────────────────────────────────
@dataclass
class CacheHitResult:
    response: str
    similarity: float
    latency_ms: float
    entry_id: str


@dataclass
class CacheStats:
    hits: int = 0
    misses: int = 0
    total_embed_time_ms: float = 0.0
    total_search_time_ms: float = 0.0
    llm_cost_saved_usd: float = 0.0

    @property
    def hit_rate(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total > 0 else 0.0

    @property
    def avg_embed_ms(self) -> float:
        total = self.hits + self.misses
        return self.total_embed_time_ms / total if total > 0 else 0.0
# ─────────────────────────────────────────────────────────────────────────────
# SemanticCache: the core cache layer
# ─────────────────────────────────────────────────────────────────────────────
class SemanticCache:
    """
    Production semantic cache using Redis for vector and response storage.

    For caches under 10k entries: linear scan (O(n) per query).
    For larger caches: replace _find_similar() with RediSearch vector index
    for O(log n) approximate nearest-neighbor search.

    Thread-safe: Redis operations are atomic.
    """

    def __init__(
        self,
        redis_url: str = "redis://localhost:6379",
        embedding_model: str = "text-embedding-3-small",
        similarity_threshold: float = 0.93,
        ttl_seconds: int = 86400,       # 24 hours default TTL
        namespace: str = "default",     # namespace for multi-tenant isolation
    ):
        self.redis = redis.from_url(redis_url, decode_responses=False)
        self.embedding_model = embedding_model
        self.similarity_threshold = similarity_threshold
        self.ttl_seconds = ttl_seconds
        self.namespace = namespace
        self.openai = OpenAI()
        self.stats = CacheStats()
        # Namespace-specific keys prevent cross-feature cache collisions
        self._set_key = f"scache:{namespace}:entries"
        self._prefix = f"scache:{namespace}:entry:"
    def _embed(self, text: str) -> np.ndarray:
        """
        Generate an embedding vector for a text string.
        Costs ~$0.0001 per query with text-embedding-3-small.
        """
        t0 = time.time()
        response = self.openai.embeddings.create(
            input=text.strip(),
            model=self.embedding_model,
        )
        embed_ms = (time.time() - t0) * 1000
        self.stats.total_embed_time_ms += embed_ms
        return np.array(response.data[0].embedding, dtype=np.float32)

    @staticmethod
    def _cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
        """Cosine similarity in [-1, 1]. For embeddings, values are typically [0.6, 1.0]."""
        norm_a = np.linalg.norm(a)
        norm_b = np.linalg.norm(b)
        if norm_a == 0.0 or norm_b == 0.0:
            return 0.0
        return float(np.dot(a, b) / (norm_a * norm_b))
    def _find_similar(self, query_vector: np.ndarray) -> Optional[tuple[str, float]]:
        """
        Linear scan over all cached entries to find the most similar.
        Returns (entry_id, similarity) if above threshold, else None.

        Performance: ~5ms for 1k entries, ~50ms for 10k entries.
        For 100k+ entries, use RediSearch ANN instead.
        """
        entry_ids = self.redis.smembers(self._set_key)
        best_sim = -1.0
        best_id: Optional[str] = None
        t0 = time.time()
        for entry_id_bytes in entry_ids:
            entry_id = entry_id_bytes.decode("utf-8")
            entry_key = f"{self._prefix}{entry_id}"
            vector_bytes = self.redis.hget(entry_key, "vector")
            if vector_bytes is None:
                # Entry hash expired via TTL but its ID is still in the set
                continue
            cached_vec = np.frombuffer(vector_bytes, dtype=np.float32)
            sim = self._cosine_similarity(query_vector, cached_vec)
            if sim > best_sim:
                best_sim = sim
                best_id = entry_id
        self.stats.total_search_time_ms += (time.time() - t0) * 1000
        if best_id is not None and best_sim >= self.similarity_threshold:
            return best_id, best_sim
        return None
    def get(self, query: str) -> Optional[CacheHitResult]:
        """
        Look up the cache for a semantically similar query.
        Returns CacheHitResult on hit, None on miss.
        """
        t0 = time.time()
        query_vec = self._embed(query)
        result = self._find_similar(query_vec)
        if result is not None:
            entry_id, similarity = result
            entry_key = f"{self._prefix}{entry_id}"
            response_bytes = self.redis.hget(entry_key, "response")
            if response_bytes is not None:
                # Refresh TTL on successful hit (cache entry is "warm")
                self.redis.expire(entry_key, self.ttl_seconds)
                self.stats.hits += 1
                return CacheHitResult(
                    response=response_bytes.decode("utf-8"),
                    similarity=similarity,
                    latency_ms=(time.time() - t0) * 1000,
                    entry_id=entry_id,
                )
        self.stats.misses += 1
        return None
    def set(self, query: str, response: str, query_vector: Optional[np.ndarray] = None) -> str:
        """
        Store a query-response pair in the cache.
        Returns the entry_id for the stored entry.
        """
        if query_vector is None:
            query_vector = self._embed(query)
        # Deterministic entry ID from query text
        entry_id = hashlib.sha256(f"{self.namespace}:{query}".encode()).hexdigest()[:16]
        entry_key = f"{self._prefix}{entry_id}"
        pipe = self.redis.pipeline()
        pipe.hset(entry_key, mapping={
            "query": query.encode("utf-8"),
            "response": response.encode("utf-8"),
            "vector": query_vector.tobytes(),
            "created_at": str(time.time()).encode("utf-8"),
            "namespace": self.namespace.encode("utf-8"),
        })
        pipe.expire(entry_key, self.ttl_seconds)
        pipe.sadd(self._set_key, entry_id)
        pipe.execute()
        return entry_id
    def warm(self, seed_pairs: list[tuple[str, str]]) -> int:
        """
        Pre-populate the cache with known query-response pairs.
        Use before launch to ensure high hit rates from day one.
        Returns the number of entries stored.
        """
        count = 0
        for query, response in seed_pairs:
            self.set(query, response)
            count += 1
        print(f"Cache warm: {count} seed entries stored in namespace '{self.namespace}'")
        return count

    def get_stats(self) -> dict:
        cache_size = self.redis.scard(self._set_key)
        return {
            "namespace": self.namespace,
            "hit_rate": f"{self.stats.hit_rate:.1%}",
            "hits": self.stats.hits,
            "misses": self.stats.misses,
            "cache_size": cache_size,
            "avg_embed_ms": f"{self.stats.avg_embed_ms:.1f}",
            "similarity_threshold": self.similarity_threshold,
            "llm_cost_saved_usd": f"${self.stats.llm_cost_saved_usd:.4f}",
        }
# ─────────────────────────────────────────────────────────────────────────────
# CachedAnthropicClient: transparent caching in front of Claude
# ─────────────────────────────────────────────────────────────────────────────
class CachedAnthropicClient:
    """
    Anthropic client with semantic caching layer.
    Transparently returns cached responses for semantically similar queries.
    Tracks cost savings from cache hits.
    """

    # claude-sonnet-4-6 pricing (March 2026)
    INPUT_COST_PER_TOKEN = 3.0 / 1_000_000
    OUTPUT_COST_PER_TOKEN = 15.0 / 1_000_000

    def __init__(self, cache: SemanticCache):
        self.cache = cache
        self.client = anthropic.Anthropic()
    def complete(
        self,
        messages: list[dict],
        model: str = "claude-sonnet-4-6",
        max_tokens: int = 1024,
        system: Optional[str] = None,
        cache_namespace_prefix: str = "",
    ) -> dict:
        """
        Complete a request with semantic caching.
        The cache is checked before any LLM call is made.

        Returns:
            dict with: response, cache_hit, similarity, cost_usd,
            input_tokens, output_tokens, latency_ms
        """
        # Build a canonical cache query from the user-role messages
        user_content = " ".join(
            m["content"] for m in messages if m["role"] == "user"
        )
        # Optionally prefix the cache key with a domain context
        # to prevent cross-feature cache collisions within the same namespace
        cache_query = f"{cache_namespace_prefix}{user_content}".strip()

        # Step 1: Check semantic cache
        cache_result = self.cache.get(cache_query)
        if cache_result is not None:
            # Estimate cost saving (assume average 300 output tokens)
            estimated_saving = self.OUTPUT_COST_PER_TOKEN * 300
            self.cache.stats.llm_cost_saved_usd += estimated_saving
            return {
                "response": cache_result.response,
                "cache_hit": True,
                "similarity": round(cache_result.similarity, 4),
                "cost_usd": 0.0,
                "latency_ms": cache_result.latency_ms,
                "cache_entry_id": cache_result.entry_id,
            }

        # Step 2: Cache miss - call the LLM
        start = time.time()
        kwargs: dict = {"model": model, "max_tokens": max_tokens, "messages": messages}
        if system:
            kwargs["system"] = system
        api_response = self.client.messages.create(**kwargs)
        llm_latency_ms = (time.time() - start) * 1000
        response_text = api_response.content[0].text
        input_tokens = api_response.usage.input_tokens
        output_tokens = api_response.usage.output_tokens
        cost = (
            input_tokens * self.INPUT_COST_PER_TOKEN
            + output_tokens * self.OUTPUT_COST_PER_TOKEN
        )

        # Step 3: Store in cache for future hits
        self.cache.set(cache_query, response_text)
        return {
            "response": response_text,
            "cache_hit": False,
            "similarity": None,
            "cost_usd": round(cost, 8),
            "input_tokens": input_tokens,
            "output_tokens": output_tokens,
            "latency_ms": llm_latency_ms,
        }
# ─────────────────────────────────────────────────────────────────────────────
# Experiments and demos
# ─────────────────────────────────────────────────────────────────────────────
def run_hit_rate_experiment() -> None:
    """
    Measure hit rate with realistic FAQ query variations.
    Seeds 3 queries, tests 9 variant phrasings.
    """
    cache = SemanticCache(
        redis_url="redis://localhost:6379",
        similarity_threshold=0.93,
        namespace="docs-assistant",
    )
    client = CachedAnthropicClient(cache)
    seed_queries = [
        ("How do I install the Python SDK?",
         "Install with pip: `pip install yourlib`. Full docs at..."),
        ("What are the API rate limits?",
         "Default limits: 60 requests/min, 100k tokens/min. Enterprise plans have higher limits."),
        ("How do I authenticate with the API?",
         "Generate an API key from your dashboard at settings -> API keys. Pass it in the Authorization header."),
    ]
    print("=== Warming cache with seed entries ===\n")
    cache.warm(seed_queries)
    variant_queries = [
        "Show me how to install the SDK for Python",
        "What is the install process for your Python SDK?",
        "How many requests can I make per minute?",
        "What rate limiting applies to my API calls?",
        "How does authentication work with your API?",
        "What's the auth method for the API?",
        "Where can I find my API token?",
        "How do I get a developer API key?",
        "Tell me about the request rate limits",
    ]
    print("=== Testing variant queries ===\n")
    for query in variant_queries:
        result = client.complete(
            messages=[{"role": "user", "content": query}],
            max_tokens=100,
            cache_namespace_prefix="docs:",
        )
        status = "CACHE" if result["cache_hit"] else "LLM  "
        sim_str = f"sim={result['similarity']:.3f}" if result["cache_hit"] else "new LLM call"
        cost_str = f"${result['cost_usd']:.6f}" if not result["cache_hit"] else "$0.000000"
        print(f"  [{status}] ({sim_str}) cost={cost_str}")
        print(f"    Q: {query[:65]}")
    print("\n=== Cache Statistics ===")
    stats = cache.get_stats()
    for key, value in stats.items():
        print(f"  {key}: {value}")
def threshold_calibration_demo() -> None:
    """
    Demonstrate how different thresholds affect the hit/miss boundary
    for a set of known query pairs.
    """
    from openai import OpenAI
    openai_client = OpenAI()

    def embed(text: str) -> np.ndarray:
        response = openai_client.embeddings.create(
            input=text, model="text-embedding-3-small"
        )
        return np.array(response.data[0].embedding, dtype=np.float32)

    def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    pairs = [
        # Should share a response
        ("How do I install the Python SDK?", "What's the install process for your Python library?", True),
        ("What are the API rate limits?", "How many requests per minute can I make?", True),
        ("How do I authenticate?", "What authentication method does the API use?", True),
        # Should NOT share a response
        ("What is Python?", "What is a python snake?", False),
        ("How do I reset my password?", "How do I install the Python SDK?", False),
        ("Show me a code example", "What are the rate limits?", False),
    ]
    print("=== Threshold Calibration Analysis ===\n")
    print(f"{'Query A':<45} {'Query B':<45} {'Should Share':<14} {'Cosine Sim'}")
    print("-" * 120)
    for q_a, q_b, should_share in pairs:
        vec_a = embed(q_a)
        vec_b = embed(q_b)
        sim = cosine_sim(vec_a, vec_b)
        mark = "YES" if should_share else "NO "
        flag = " <-- THRESHOLD BOUNDARY" if 0.88 < sim < 0.96 else ""
        print(f"  {q_a[:42]:<45} {q_b[:42]:<45} {mark:<14} {sim:.4f}{flag}")


if __name__ == "__main__":
    run_hit_rate_experiment()
Cache Invalidation: The Hard Problem
Semantic caches have a fundamentally different invalidation challenge than exact-match caches. There is no unique key to delete - entries are matched by similarity. When the underlying facts change (new API version, updated pricing, changed documentation), related cache entries need to be invalidated.
class SemanticCacheInvalidator:
    """
    Strategies for invalidating semantic cache entries.
    All strategies scan the entry set and operate on matching entries.
    """

    def __init__(self, cache: SemanticCache):
        self.cache = cache

    def invalidate_by_topic(self, topic: str, threshold: float = 0.85) -> int:
        """
        Invalidate all entries semantically related to a topic.
        Lower threshold than cache hits because we want to err on the side
        of over-invalidation (stale entries) rather than under-invalidation (wrong answers).

        Use when: a new API version is released, a product changes its pricing,
        documentation is significantly updated.

        Returns: number of entries invalidated.
        """
        topic_vec = self.cache._embed(topic)
        entry_ids = list(self.cache.redis.smembers(self.cache._set_key))
        invalidated = 0
        for entry_id_bytes in entry_ids:
            entry_id = entry_id_bytes.decode("utf-8")
            entry_key = f"{self.cache._prefix}{entry_id}"
            vec_bytes = self.cache.redis.hget(entry_key, "vector")
            if vec_bytes is None:
                continue
            entry_vec = np.frombuffer(vec_bytes, dtype=np.float32)
            sim = self.cache._cosine_similarity(topic_vec, entry_vec)
            if sim >= threshold:
                self.cache.redis.delete(entry_key)
                self.cache.redis.srem(self.cache._set_key, entry_id)
                invalidated += 1
        print(f"Topic invalidation '{topic}': removed {invalidated} entries (threshold={threshold})")
        return invalidated
    def invalidate_older_than(self, max_age_seconds: float) -> int:
        """
        Invalidate entries older than max_age_seconds.
        Useful for time-sensitive domains (news, prices, status pages).
        Returns: count of invalidated entries.
        """
        now = time.time()
        entry_ids = list(self.cache.redis.smembers(self.cache._set_key))
        invalidated = 0
        for entry_id_bytes in entry_ids:
            entry_id = entry_id_bytes.decode("utf-8")
            entry_key = f"{self.cache._prefix}{entry_id}"
            created_bytes = self.cache.redis.hget(entry_key, "created_at")
            if created_bytes is None:
                continue
            created_at = float(created_bytes.decode("utf-8"))
            if (now - created_at) > max_age_seconds:
                self.cache.redis.delete(entry_key)
                self.cache.redis.srem(self.cache._set_key, entry_id)
                invalidated += 1
        print(f"Age invalidation: removed {invalidated} entries older than {max_age_seconds}s")
        return invalidated

    def full_flush(self) -> int:
        """
        Nuclear option: clear the entire namespace cache.
        Use before a major content overhaul when most entries are stale.
        """
        entry_ids = list(self.cache.redis.smembers(self.cache._set_key))
        for entry_id_bytes in entry_ids:
            entry_id = entry_id_bytes.decode("utf-8")
            self.cache.redis.delete(f"{self.cache._prefix}{entry_id}")
        self.cache.redis.delete(self.cache._set_key)
        print(f"Full flush: removed {len(entry_ids)} entries from namespace '{self.cache.namespace}'")
        return len(entry_ids)
Scaling to Large Caches: RediSearch ANN
The linear scan in the implementation above becomes too slow at more than 10,000 cache entries (~50ms+ per lookup). For larger deployments, use Redis with the RediSearch module and a vector index for approximate nearest-neighbor search.
def create_redisearch_vector_index(redis_client: redis.Redis, index_name: str = "scache_idx") -> None:
    """
    Create a RediSearch vector index for ANN (approximate nearest neighbor) search.
    This replaces the O(n) linear scan with O(log n) HNSW search.
    Requires: Redis Stack or Redis with the RediSearch module installed.
    """
    try:
        # Drop existing index if present (for re-creation).
        # Deliberately omit the DD option: DD also deletes the indexed
        # hashes, which would wipe every cached entry along with the index.
        redis_client.execute_command("FT.DROPINDEX", index_name)
    except redis.exceptions.ResponseError:
        # Index did not exist yet; nothing to drop
        pass
    # Create index on the 'vector' field of hash keys prefixed with 'scache:default:entry:'
    redis_client.execute_command(
        "FT.CREATE", index_name,
        "ON", "HASH",
        "PREFIX", "1", "scache:default:entry:",
        "SCHEMA",
        "vector", "VECTOR", "HNSW", "6",
        "TYPE", "FLOAT32",
        "DIM", "1536",  # text-embedding-3-small dimension
        "DISTANCE_METRIC", "COSINE",
    )
    print(f"RediSearch vector index '{index_name}' created.")
def search_with_redisearch(
    redis_client: redis.Redis,
    query_vector: np.ndarray,
    index_name: str = "scache_idx",
    top_k: int = 1,
    similarity_threshold: float = 0.93,
) -> Optional[tuple[str, float]]:
    """
    Use RediSearch KNN vector search instead of linear scan.
    Returns (entry_id, similarity) or None.
    This scales to millions of entries with millisecond latency.
    """
    vector_bytes = query_vector.tobytes()
    # FT.SEARCH with KNN syntax
    results = redis_client.execute_command(
        "FT.SEARCH", index_name,
        f"*=>[KNN {top_k} @vector $vec AS score]",
        "PARAMS", "2", "vec", vector_bytes,
        "SORTBY", "score",
        "RETURN", "2", "score", "__key",
        "DIALECT", "2",
    )
    if results[0] == 0:
        return None
    # Parse the result: [count, key, [field, value, ...], ...]
    entry_key = results[1].decode("utf-8")
    score_bytes = None
    fields = results[2]
    for i in range(0, len(fields), 2):
        if fields[i] == b"score":
            score_bytes = fields[i + 1]
            break
    if score_bytes is None:
        return None
    # RediSearch returns cosine DISTANCE (not similarity) - convert
    cosine_distance = float(score_bytes)
    similarity = 1.0 - cosine_distance
    if similarity >= similarity_threshold:
        # Extract entry_id from key
        entry_id = entry_key.split(":")[-1]
        return entry_id, similarity
    return None
When Semantic Caching Is and Is Not Appropriate
| Use case | Cache suitable? | Reasoning |
|---|---|---|
| FAQ bot | Yes | High query repetition, answers are stable |
| Documentation assistant | Yes | Common questions, stable content |
| Product description generator | Yes (high threshold) | Templates reuse; 0.97+ threshold |
| Code generation | Partial | Subtle differences matter; threshold 0.97+ |
| User-personalized recommendations | No | Responses must be tailored to the individual |
| Real-time data queries | No | "What is the current BTC price?" cannot be cached |
| Creative writing | No | Users expect fresh responses each time |
| Queries containing PII | No | Privacy violation risk - must bypass entirely |
| Medical or legal advice | No | Liability from stale or slightly wrong answers |
:::danger Never cache queries containing PII
If a user's query contains personal information - their name, account number, medical condition, location - that query-response pair must never be stored in the cache. A future user asking a semantically similar question could receive a cached response containing another user's private data. Implement a PII detection step before the cache lookup (regex for emails, phone numbers, SSNs, or use a dedicated PII classifier) and bypass the cache entirely for flagged queries.
:::
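A minimal sketch of the regex-based bypass check follows. The patterns are illustrative and US-centric, and the names `PII_PATTERNS`, `detect_pii`, and `should_use_cache` are hypothetical; regexes alone will miss names, addresses, and free-text medical details, so treat this as a first filter rather than complete PII detection.

```python
import re

# Illustrative patterns only. They err toward over-matching (any 10-digit
# number looks like a phone number), which is the safe direction here:
# a false positive just means one extra LLM call.
PII_PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "phone": re.compile(r"\b(?:\+?1[-. ]?)?\(?\d{3}\)?[-. ]?\d{3}[-. ]?\d{4}\b"),
}

def detect_pii(text: str) -> list[str]:
    """Return the names of all PII patterns found in the text."""
    return [name for name, pattern in PII_PATTERNS.items() if pattern.search(text)]

def should_use_cache(query: str) -> bool:
    """Skip both cache lookup and cache storage when the query looks sensitive."""
    return not detect_pii(query)

for q in (
    "How do I install the SDK?",
    "Reset the password for jane@example.com",
    "Call me at 415-555-0123 about my account",
):
    print(f"{should_use_cache(q)!s:<5} {q}")
```

When `should_use_cache` returns False, the request should go straight to the LLM and the response should not be written back to the cache.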
:::warning The embedding model must match between writes and reads
If you populate the cache with text-embedding-3-small embeddings (1536 dimensions) and then change to text-embedding-3-large (3072 dimensions), all existing cache entries are incompatible - vector dimensions don't match and cosine similarity comparisons will be incorrect or will error. Version your cache namespace when changing embedding models. Warm the new namespace in parallel before switching traffic to it. Never mix embedding model versions in the same namespace.
:::
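One way to enforce this is to derive the cache namespace from the embedding model itself, so that a model change can never read vectors written by the old model. The helper names below (`versioned_namespace`, `check_dimensions`) are hypothetical; the result would be passed as the `namespace` argument of whatever cache class you use.

```python
def versioned_namespace(base: str, embedding_model: str, dimensions: int) -> str:
    """Build a namespace that changes whenever the embedding model or its size changes."""
    return f"{base}--{embedding_model}--{dimensions}"

def check_dimensions(vector: list[float], expected_dim: int) -> None:
    """Fail loudly instead of comparing vectors of different sizes."""
    if len(vector) != expected_dim:
        raise ValueError(
            f"embedding has {len(vector)} dimensions, cache expects {expected_dim}"
        )

old_ns = versioned_namespace("docs-assistant", "text-embedding-3-small", 1536)
new_ns = versioned_namespace("docs-assistant", "text-embedding-3-large", 3072)
print(old_ns)
print(new_ns)
```

Because the two namespaces differ, the new one can be warmed in parallel while traffic still reads from the old one.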
:::tip Warm the cache before launch for known high-traffic queries
For any application where you know the common queries in advance - FAQ bots, documentation assistants, customer support systems - pre-populate the cache before launch. Collect the top 200–500 most frequent questions from your existing support tickets, docs analytics, or user research. Generate answers for each. Use cache.warm() to store them. This ensures your launch-day traffic hits the cache from the first request, rather than spending the first day populating it through live traffic.
:::
:::info Measure cost savings, not just hit rate
A 40% cache hit rate sounds good in isolation, but the value depends entirely on what kinds of queries are hitting. If your 40% hit rate is all on short, cheap queries while expensive long queries always miss, the cost savings may be only 15%. Report cost savings avoided (estimated cost of cache hits times hit count) alongside hit rate. This is the metric that matters to finance and engineering leadership.
:::
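The gap between hit rate and cost savings is easy to see with a small calculation. The sketch below uses made-up request records in which only cheap queries hit the cache; `RequestRecord` and both helper functions are hypothetical names for this illustration.

```python
from dataclasses import dataclass

@dataclass
class RequestRecord:
    cache_hit: bool
    llm_cost_usd: float  # actual cost on a miss, estimated avoided cost on a hit

def hit_rate(records: list[RequestRecord]) -> float:
    """Fraction of requests served from the cache."""
    return sum(1 for r in records if r.cache_hit) / len(records)

def cost_savings_rate(records: list[RequestRecord]) -> float:
    """Fraction of the would-be LLM spend that cache hits avoided."""
    total = sum(r.llm_cost_usd for r in records)
    saved = sum(r.llm_cost_usd for r in records if r.cache_hit)
    return saved / total if total else 0.0

# Hypothetical traffic: 4 cheap queries hit the cache, 6 expensive ones miss.
records = [RequestRecord(True, 0.002)] * 4 + [RequestRecord(False, 0.02)] * 6

print(f"hit rate:          {hit_rate(records):.1%}")
print(f"cost savings rate: {cost_savings_rate(records):.1%}")
```

In this sample the hit rate is 40% but the share of spend avoided is far lower, which is why both numbers belong on the dashboard.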
Cache Monitoring in Production
Cache monitoring is non-negotiable in production. Track these metrics and their acceptable ranges:
| Metric | How to measure | Acceptable range | Action if outside range |
|---|---|---|---|
| Hit rate | hits / (hits + misses), hourly | Above 25% after 48h for FAQ workloads | Investigate query variety; warm with more seeds |
| Avg similarity on hits | Mean cosine sim for all cache hits | 0.93–0.99 | Below 0.93: threshold too low; raise it |
| P95 embedding latency | Time from query receipt to vector returned | Under 50ms | Switch to faster embedding model or cache embeddings |
| Cache entry count | SCARD scache:namespace:entries | Below 10k for linear scan | Migrate to RediSearch ANN at 10k entries |
| Stale hit rate | Manual sampling of cache hits | Under 5% incorrect | Run topic-based invalidation on changed content |
| Cost savings rate | Cache hits x avg cost per LLM call | Growing over first 2 weeks | Monitor trend; declining means query diversity increasing |
Production Engineering Notes
Notes on interpreting the monitoring metrics:
- Hit rate (hourly): should be above 25% after 48 hours for FAQ-type workloads
- Avg similarity on hits: distribution should be tight near the threshold plus a cluster near 0.99 (near-exact matches); a flat distribution suggests the threshold is too low
- P95 embedding latency: embedding calls add overhead to every request; should be under 50ms
- Cache size: if the entry count grows without bound, prune with TTLs and age-based invalidation
- Stale hit rate: manually sample cache hits periodically and verify the response is still correct
Interview Q&A
Q: How does semantic caching differ from traditional exact-match caching?
Exact-match caching uses the full query string as the cache key and only returns a cached response when an identical string appears again. For natural language, hit rates are near zero because users rarely type exactly the same string twice. Semantic caching converts queries to embedding vectors using a language model, then uses cosine similarity between the incoming query vector and stored vectors to determine whether a new query is "semantically close enough" to a cached query to return the cached response. Hit rates of 30–50% are typical for FAQ and documentation use cases, because semantically equivalent queries - regardless of wording - map to geometrically nearby points in embedding space.
Q: How do you choose the similarity threshold for a semantic cache?
Empirically. Collect 500–1,000 representative query pairs from production logs and manually label which pairs should share a response ("How do I install?" / "Installation instructions?") and which should not ("What is Python?" / "What is a python snake?"). Compute embeddings and cosine similarities for all pairs. Plot two distributions: similarity scores for "should share" pairs and similarity scores for "should not share" pairs. Set the threshold where the "should share" distribution is above it and the "should not share" distribution is below it. For most FAQ use cases, this is 0.92–0.95. For domains where subtle differences in wording matter (code generation, legal, medical), use 0.97+. Recalibrate when the query distribution changes significantly.
Q: What are the risks of setting the threshold too low?
Two distinct risks. First, serving semantically wrong answers: a user asking "What is a Python snake?" and a user asking "What is Python programming?" have embedding similarity around 0.82 - at a threshold of 0.80, the second user gets the snake answer. Second, privacy violations: "What is my account balance?" from two different users may produce similar embeddings; at a low threshold, User B could receive User A's cached balance response. This is a serious data privacy failure. Never deploy below 0.90 without extensive empirical testing on your specific domain, and always implement PII detection to bypass the cache for sensitive queries.
Q: What happens to cache quality when the underlying knowledge base changes?
Cache entries become stale when the facts they represent change. For example, if an API version changes and the installation instructions change, cached responses for "How do I install?" are now wrong. Strategies: (1) topic-based invalidation - embed the "changed topic" and delete all cache entries with similarity above 0.85 to that topic; (2) TTL-based invalidation - set short TTLs (1–7 days) on entries in domains that change frequently; (3) version-based namespace - when deploying a major documentation update, switch to a new cache namespace (e.g., docs-v2:) so old entries don't pollute new traffic. For domains where facts change multiple times per day (prices, status, live data), semantic caching is inappropriate regardless of invalidation strategy.
Q: How would you implement semantic caching at the gateway layer rather than in application code?
The gateway intercepts every request. Before forwarding to the LLM provider, it extracts the user-facing message content, generates an embedding, and queries the vector store. If similarity exceeds the threshold, the gateway returns the cached response directly without touching the provider - the request never leaves the gateway. On a cache miss, the gateway forwards to the provider, receives the response, stores the embedding-response pair, and returns the response to the caller. All application services benefit from this automatically without any application-level code changes. LiteLLM Proxy and Portkey both support semantic caching at the gateway layer with Redis as the backing store, configured via YAML or the dashboard. The advantage over application-level caching is that the cache is shared across all services and all users, maximizing hit rates.
Q: Walk me through how you would scale a semantic cache from 1,000 to 1,000,000 entries.
At 1,000 entries, a linear scan over Redis hashes works fine - an O(n) scan takes under 5ms. At 10,000 entries, the linear scan takes ~50ms - still acceptable but starting to add noticeable overhead. At 100,000 entries, the linear scan takes 500ms+ - unacceptable. The solution is to switch from linear scan to approximate nearest-neighbor (ANN) search using a vector index. Options: Redis Stack with RediSearch (HNSW index, sub-millisecond ANN at millions of entries), Qdrant (dedicated vector database with ANN, filtering, and payload storage), or Pinecone (managed service). Migration path: build the ANN index from existing cache entries, run both in parallel for validation, then switch over. ANN search returns approximate results (it may miss some near-threshold entries), but this tradeoff is acceptable given the massive latency improvement. Also implement cache eviction at this scale: track each entry's last access time and evict least-recently-used entries when the index exceeds a configured maximum size.
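The eviction step can be sketched independently of the vector index. The class below is a hypothetical bookkeeping layer (`LRUCacheIndex` is not part of Redis or any vector database): it tracks recency in memory and reports which keys to delete from the real store once the configured maximum is exceeded.

```python
from collections import OrderedDict


class LRUCacheIndex:
    """Tracks cache keys by recency and evicts the least recently used."""

    def __init__(self, max_entries: int):
        if max_entries < 1:
            raise ValueError("max_entries must be at least 1")
        self.max_entries = max_entries
        self._entries: OrderedDict[str, str] = OrderedDict()

    def touch(self, key: str) -> None:
        """Mark a key as recently used (call this on every cache hit)."""
        if key in self._entries:
            self._entries.move_to_end(key)

    def put(self, key: str, response: str) -> list[str]:
        """Insert or update an entry; return the keys evicted to stay in bounds."""
        self._entries[key] = response
        self._entries.move_to_end(key)
        evicted = []
        while len(self._entries) > self.max_entries:
            old_key, _ = self._entries.popitem(last=False)  # oldest entry first
            evicted.append(old_key)
        return evicted

    def keys(self) -> list[str]:
        """Keys ordered from least to most recently used."""
        return list(self._entries)
```

The caller is responsible for deleting the returned keys from the vector index so the two stay in sync.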
Q: How do you measure whether semantic caching is actually saving money?
Track three numbers: (1) cache hit count per period, (2) average cost per LLM call for the same feature (measured on the cache misses that did reach the LLM), and (3) embedding cost per lookup. Cost savings = (cache_hits * avg_llm_cost_per_miss) - (total_lookups * embedding_cost_per_lookup). If the total embedding cost exceeds the LLM cost avoided, caching is losing you money. The break-even hit rate is embedding_cost_per_lookup / avg_llm_cost_per_miss. For most FAQ scenarios with claude-sonnet-4-6 responses and an embedding cost of about $0.0001 per lookup, you need a hit rate above 2% for caching to be net positive - in practice, FAQ workloads run 30–50% hit rates, making caching extremely profitable.
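The savings formula translates directly into code. The sketch below uses the same variable names as the formula; the $0.005 per-call LLM cost in the usage example is an assumed illustrative figure, not a number taken from this article.

```python
def semantic_cache_net_savings(
    cache_hits: int,
    total_lookups: int,
    avg_llm_cost_per_miss: float,
    embedding_cost_per_lookup: float,
) -> float:
    """LLM cost avoided by hits, minus embedding cost paid on every lookup."""
    avoided = cache_hits * avg_llm_cost_per_miss
    overhead = total_lookups * embedding_cost_per_lookup
    return avoided - overhead


def break_even_hit_rate(
    avg_llm_cost_per_miss: float,
    embedding_cost_per_lookup: float,
) -> float:
    """Hit rate at which net savings are exactly zero."""
    if avg_llm_cost_per_miss <= 0:
        raise ValueError("avg_llm_cost_per_miss must be positive")
    return embedding_cost_per_lookup / avg_llm_cost_per_miss


if __name__ == "__main__":
    # Assumed figures: 1,000 lookups, 400 hits, $0.005 per LLM call,
    # $0.0001 per embedding lookup.
    print(semantic_cache_net_savings(400, 1_000, 0.005, 0.0001))
    print(break_even_hit_rate(0.005, 0.0001))
```

A negative result from `semantic_cache_net_savings` means the embedding overhead outweighs the avoided LLM spend for that period.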
## Semantic Cache Invalidation Strategies
When facts change, cached responses become stale. The right invalidation strategy depends on how frequently the underlying knowledge changes.
| Scenario | Recommended Strategy | Implementation |
|---|---|---|
| Documentation updated weekly | TTL-based (7-day expiry) | EXPIRE key on store |
| Product prices change daily | Very short TTL (4-6 hours) | Aggressive TTL |
| Live data (status, balance) | Bypass cache entirely | PII/dynamic flag |
| Major system change (new version) | Namespace switch | Change key prefix |
| Topic-specific update | Topic-based invalidation | Delete by similarity |
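One way to encode the table is a small policy lookup consulted before each cache write. This is a sketch under assumptions: the category names and the `CachePolicy` type are hypothetical, the TTL values take the lower end of the ranges in the table, and unknown categories default to bypassing the cache as the conservative choice.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CachePolicy:
    cacheable: bool
    ttl_seconds: Optional[int]  # None when the query must bypass the cache


# Hypothetical content categories mapped to the strategies in the table.
POLICIES = {
    "documentation": CachePolicy(cacheable=True, ttl_seconds=7 * 24 * 3600),  # 7 days
    "pricing": CachePolicy(cacheable=True, ttl_seconds=4 * 3600),             # 4 hours
    "live_data": CachePolicy(cacheable=False, ttl_seconds=None),              # bypass
}


def policy_for(category: str) -> CachePolicy:
    """Look up the cache policy; unknown categories are never cached."""
    return POLICIES.get(category, CachePolicy(cacheable=False, ttl_seconds=None))
```

When `policy_for` returns a cacheable policy, the TTL would be applied at write time, for example through the expiry argument of the store's set operation.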
```python
import hashlib
import json

import numpy as np
import redis


class SemanticCacheInvalidator:
    """
    Invalidates semantic cache entries based on topic similarity.

    When a documentation section changes, embed the changed topic
    and delete all cache entries whose stored query is similar to
    the changed topic. This is more precise than a full cache flush
    and more automatic than manual key deletion.
    """

    def __init__(
        self,
        redis_client: redis.Redis,
        invalidation_threshold: float = 0.85,
        cache_key_prefix: str = "scache:",
    ):
        self.redis = redis_client
        self.threshold = invalidation_threshold
        self.prefix = cache_key_prefix

    def _embed(self, text: str) -> list[float]:
        """Embed text for similarity comparison."""
        # PLACEHOLDER ONLY: a SHA-256 digest carries no semantic meaning.
        # In production, call the same embedding model that produced the
        # stored query embeddings, so both vectors have the same dimension.
        h = hashlib.sha256(text.encode()).digest()  # 32 bytes -> 32 values
        return [b / 255.0 for b in h]

    def _cosine_sim(self, a: list[float], b: list[float]) -> float:
        va = np.array(a, dtype=np.float32)
        vb = np.array(b, dtype=np.float32)
        norm_a = np.linalg.norm(va)
        norm_b = np.linalg.norm(vb)
        if norm_a == 0 or norm_b == 0:
            return 0.0
        return float(np.dot(va, vb) / (norm_a * norm_b))

    def invalidate_by_topic(self, changed_topic: str) -> int:
        """
        Delete all cache entries semantically similar to the changed topic.
        Returns the number of entries deleted.
        """
        topic_embedding = self._embed(changed_topic)
        pattern = f"{self.prefix}*"
        keys_deleted = 0
        cursor = 0
        # Iterate over all cache keys with SCAN (non-blocking, unlike KEYS)
        while True:
            cursor, keys = self.redis.scan(cursor, match=pattern, count=100)
            for key in keys:
                # Each cache entry stores the query embedding alongside the response.
                # Field names are bytes because the client is assumed to use
                # decode_responses=False (the redis-py default).
                stored = self.redis.hgetall(key)
                if b"query_embedding" not in stored:
                    continue
                stored_emb = json.loads(stored[b"query_embedding"])
                similarity = self._cosine_sim(topic_embedding, stored_emb)
                if similarity >= self.threshold:
                    self.redis.delete(key)
                    keys_deleted += 1
            if cursor == 0:
                break
        print(f"[CacheInvalidator] Deleted {keys_deleted} entries similar to: '{changed_topic}'")
        return keys_deleted

    def invalidate_by_namespace(self, old_prefix: str) -> int:
        """Delete all cache entries under an old namespace prefix."""
        pattern = f"{old_prefix}*"
        keys = list(self.redis.scan_iter(match=pattern, count=500))
        if keys:
            self.redis.delete(*keys)
        print(f"[CacheInvalidator] Deleted {len(keys)} entries with prefix '{old_prefix}'")
        return len(keys)


# Usage example: documentation v2 released - invalidate v1 cache
# invalidator = SemanticCacheInvalidator(redis_client)
# deleted = invalidator.invalidate_by_topic(
#     "Python SDK installation instructions version 2"
# )
# Also switch to new cache namespace for new traffic:
# cache.prefix = "scache:v2:"
```
## Cache Warming: Pre-Populating Before Launch
After a cache flush or fresh deployment, the cache is empty. For high-traffic features, this "cold start" means every user query is a cache miss until enough unique queries have been answered and stored. Cache warming pre-populates the cache before launch using known high-frequency queries.
```python
import anthropic


def warm_semantic_cache(
    cache: "SemanticCache",
    client: anthropic.Anthropic,
    seed_queries: list[str],
    model: str = "claude-haiku-4-5-20251001",
    max_tokens: int = 512,
) -> dict[str, int]:
    """
    Pre-populate the semantic cache with responses to seed queries.

    Seed queries are the top N queries from production logs (after deduplication)
    or a manually curated list of expected frequent queries.
    """
    results = {"warmed": 0, "already_cached": 0, "failed": 0}
    for query in seed_queries:
        # Check if already cached (e.g., from a previous warm run)
        existing = cache.get(query)
        if existing:
            results["already_cached"] += 1
            continue
        try:
            response = client.messages.create(
                model=model,
                max_tokens=max_tokens,
                messages=[{"role": "user", "content": query}],
            )
            answer = response.content[0].text
            cache.set(query, answer)
            results["warmed"] += 1
            print(f"Warmed: '{query[:60]}...' -> {len(answer)} chars")
        except Exception as e:
            # Keep warming the remaining queries even if one call fails
            results["failed"] += 1
            print(f"Failed to warm '{query[:40]}...': {e}")
    print(f"\nCache warm results: {results}")
    return results


# Typical seed queries for a documentation assistant
SEED_QUERIES = [
    "How do I install the SDK?",
    "What are the pricing plans?",
    "How do I authenticate API requests?",
    "What is the rate limit for my plan?",
    "How do I handle errors in the API?",
    "What models are available?",
    "How do I cancel my subscription?",
    "Where can I find my API key?",
    "Is there a free trial?",
    "What are the supported programming languages?",
]
```
## Semantic Caching at the Gateway Layer vs Application Layer
Semantic caching can be implemented at two layers: within each application service, or centrally at the gateway layer. The choice has significant implications for effectiveness.
| Property | Application-layer cache | Gateway-layer cache |
|---|---|---|
| Scope | One service's queries | All services, all users |
| Hit rate | Lower - only one service's queries | Higher - shared across all users |
| Configuration | Per-service code change | Central config change |
| Invalidation | Per-service operation | One operation covers all |
| PII awareness | Service has context | Gateway needs hints via headers |
| Setup effort | Easy for one service | Harder - requires gateway deployment |
The gateway-layer cache provides dramatically higher hit rates because the cache is shared across all users of all services. When user A asks "How do I install the SDK?" and the response is cached, user B asking "What are the installation steps?" will hit the cache - even though they are different users from different services. An application-layer cache only sees the queries that pass through its own service, so an answer cached by one service never helps another.
For production multi-service architectures, always implement semantic caching at the gateway layer. The per-service approach only makes sense for single-service systems or when gateway-layer caching is not feasible.
## Summary: Semantic Caching in Production
Semantic caching is the highest-ROI LLM optimization available for FAQ and documentation workloads. A correctly tuned cache (threshold 0.92–0.95) achieves 30–50% hit rates, reducing LLM costs by roughly the same proportion (slightly less once embedding costs are counted) with no visible quality degradation. The key operational decisions are:
- **Threshold calibration**: empirical, not intuitive - build a labeled evaluation set
- **Backing store**: Redis for small to medium scale, RediSearch with HNSW index for millions of entries
- **Invalidation strategy**: match the strategy to the knowledge change frequency
- **PII bypass**: non-negotiable - any query containing user-identifying information must bypass the cache
- **Cost measurement**: track cost avoidance (hits × avg_miss_cost) not just hit rate - this is the number finance understands
The semantic cache is a shared infrastructure component, not an application feature. It belongs in the gateway layer where it benefits every service and every user simultaneously.
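Threshold calibration against a labeled evaluation set can be sketched as follows. The function name, the candidate grid, and the precision target are all assumptions for illustration: each labeled pair holds the cosine similarity of two queries and whether a human judged them to share the same intent, and the function picks the lowest candidate threshold whose accepted pairs meet the precision target.

```python
from typing import Iterable, Optional, Sequence


def calibrate_threshold(
    labeled_pairs: Sequence[tuple[float, bool]],  # (similarity, same_intent)
    candidates: Iterable[float],
    min_precision: float = 0.99,
) -> Optional[float]:
    """Return the lowest threshold meeting the precision target, or None."""
    for threshold in sorted(candidates):
        # Labels of every pair the cache would treat as a hit at this threshold
        accepted = [same for sim, same in labeled_pairs if sim >= threshold]
        if not accepted:
            continue  # nothing would be served from cache; precision undefined
        precision = sum(accepted) / len(accepted)
        if precision >= min_precision:
            return threshold  # lowest qualifying threshold keeps hit rate highest
    return None


if __name__ == "__main__":
    pairs = [(0.97, True), (0.95, True), (0.93, True),
             (0.91, False), (0.88, True), (0.82, False)]
    print(calibrate_threshold(pairs, [0.80, 0.85, 0.90, 0.92, 0.95], min_precision=1.0))
```

Choosing the lowest qualifying threshold favors hit rate while holding wrong-answer risk at the chosen precision; a larger evaluation set makes the estimate more trustworthy.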
## Semantic Cache Architecture Diagram
```mermaid
flowchart TD
Q["User query<br/>'How do I install the Python SDK?'"]:::primary
E["Embedding model<br/>text-embedding-3-small<br/>1536-dim vector (default)"]:::neutral
VS["Vector store<br/>Redis with RediSearch<br/>HNSW index"]:::primary
Sim{"Cosine similarity<br/>above 0.95 threshold?"}:::warning
Hit["Cache hit<br/>Return cached response<br/>in milliseconds"]:::success
Miss["Cache miss<br/>Call LLM provider"]:::neutral
LLM["LLM provider call<br/>claude-sonnet-4-6"]:::primary
Store["Store in vector index<br/>embedding + response + metadata"]:::success
Resp["Return response<br/>to caller"]:::success
Q --> E
E --> VS
VS --> Sim
Sim -->|Yes| Hit
Hit --> Resp
Sim -->|No| Miss
Miss --> LLM
LLM --> Store
Store --> Resp
classDef primary fill:#dbeafe,stroke:#2563eb,color:#1e3a5f
classDef success fill:#dcfce7,stroke:#16a34a,color:#14532d
classDef warning fill:#fef9c3,stroke:#ca8a04,color:#713f12
classDef neutral fill:#f3f4f6,stroke:#6b7280,color:#111827
```

The embedding step runs on every request (both hits and misses). For this reason, the embedding model must be fast and cheap - the entire benefit of caching is eliminated if the embedding call is slower than the LLM call it avoids. OpenAI text-embedding-3-small ($0.02/1M tokens, ~10ms latency) is the standard choice for production semantic caches at any reasonable scale.
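The effect of that per-request overhead can be estimated with a simple expected-latency model. This is a back-of-the-envelope sketch: the function name is hypothetical, it treats all inputs as averages, and the 1 ms vector lookup time in the example is an assumed figure.

```python
def expected_latency_ms(
    hit_rate: float,
    embed_ms: float,
    lookup_ms: float,
    llm_ms: float,
) -> float:
    """Average latency per request when embedding and lookup run on every call."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    # Every request pays for embedding and lookup; only misses pay for the LLM.
    return embed_ms + lookup_ms + (1.0 - hit_rate) * llm_ms


if __name__ == "__main__":
    # Illustrative inputs: 41% hit rate, 10 ms embedding, 1 ms lookup (assumed),
    # 1,800 ms LLM call.
    print(expected_latency_ms(0.41, 10, 1, 1800))
```

Under this model, caching lowers average latency only when `embed_ms + lookup_ms` is smaller than `hit_rate * llm_ms`, which is why a slow embedding model can cancel the benefit.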
