What is semantic caching?

Return cached LLM responses for semantically similar queries using embedding-based vector similarity. On repetitive workloads this can cut LLM costs by roughly 40–60%, because you stop paying for the same question each time it is rephrased.



Semantic Caching

The $18,000 FAQ

The data team ran the numbers and sent them to the engineering lead with a single subject line: "You need to see this." Of the $31,000 spent on LLM calls last month, $18,000 - nearly 60% - was spent on questions that were semantically identical to questions the system had already answered. The documentation assistant fielded 40,000 queries per day from developers using the company's API platform. When you stripped away surface-level wording variation, the actual unique question types numbered fewer than 2,000. "How do I install the SDK?" and "What's the installation process?" and "Can you show me how to install your library?" were three distinct strings representing a single intent - and all three were generating separate LLM calls at $0.015 each.

The team had a cache. An exact-match Redis cache keyed on the full query string. Its hit rate: under 2%. Because users almost never typed the exact same string twice, the cache was effectively useless. The cache infrastructure existed, but it was solving the wrong problem. String equality is the wrong heuristic for natural language.

One week after deploying semantic caching with a 0.93 similarity threshold, the hit rate was 41%. The monthly bill dropped from $31,000 to $19,000. The P95 latency on cached queries dropped from 1,800ms to 8ms. Finance got a 39% cost reduction. Users got faster responses on repeat question types. Engineering changed exactly zero prompts.

Why String-Match Caching Fails for Natural Language

Traditional caching works by treating the input as an opaque string and returning the stored output when an identical string appears again. This works perfectly for deterministic systems: the same SQL query, the same API parameters, the same cryptographic hash. For these systems, the input space is discrete and the same logical operation has exactly one representation.

Natural language is different. The same meaning can be expressed in hundreds of phrasings:

"How do I install the SDK?" / "Installation instructions for the SDK?" / "Show me how to set up the SDK"
"What are the rate limits?" / "How many API calls can I make per minute?" / "What's the request quota?"
"How do I authenticate?" / "What authentication method does your API use?" / "How do I get an API token?"

These are not different questions - they are the same questions with different surface forms. An exact-match cache misses every variant. A semantic cache recognizes the shared meaning and serves the cached answer for all of them.
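To make the failure concrete, here is a minimal sketch of an exact-match cache. It uses a plain dict as a stand-in for Redis, and the `normalize` and `exact_key` helpers are illustrative names, not part of the system described above. Even with case and whitespace normalization, the three phrasings produce three different keys:

```python
import hashlib


def normalize(query: str) -> str:
    """Lowercase and collapse whitespace: about the most an exact-match cache can do."""
    return " ".join(query.lower().split())


def exact_key(query: str) -> str:
    """Hash of the normalized query string, used as the cache key."""
    return hashlib.sha256(normalize(query).encode("utf-8")).hexdigest()


queries = [
    "How do I install the SDK?",
    "Installation instructions for the SDK?",
    "Show me how to set up the SDK",
]

cache: dict[str, str] = {}
misses = 0
for q in queries:
    key = exact_key(q)
    if key not in cache:
        misses += 1                       # each paraphrase would trigger a new LLM call
        cache[key] = "<answer from LLM>"  # placeholder response

print(f"{misses} misses for {len(queries)} phrasings of one question")
```

Normalization only helps when two strings differ in case or spacing; it does nothing for different wording.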

The technology that makes this possible is the same technology that powers semantic search: embedding models that convert text into dense vectors where semantically similar texts produce geometrically close vectors. Cosine similarity between vectors measures semantic relatedness.
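As a rough illustration of "geometrically close", the sketch below computes cosine similarity on three toy vectors. The three-dimensional values are invented for the example; real embedding models return hundreds or thousands of dimensions.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Dot product divided by the product of the vector lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Toy "embeddings" with made-up values, for illustration only.
install_a = [0.9, 0.1, 0.0]   # e.g. "How do I install the SDK?"
install_b = [0.8, 0.2, 0.1]   # e.g. "Show me how to set up the SDK"
rate_limit = [0.0, 0.2, 0.9]  # e.g. "What are the rate limits?"

sim_close = cosine_similarity(install_a, install_b)
sim_far = cosine_similarity(install_a, rate_limit)
print(f"paraphrases: {sim_close:.3f}   unrelated: {sim_far:.3f}")
```

Vectors pointing in nearly the same direction score close to 1.0, and vectors pointing in unrelated directions score close to 0.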

How Semantic Caching Works

The five steps of semantic caching:

Embed the query: convert the incoming user query to a dense embedding vector using an embedding model (OpenAI text-embedding-3-small, Cohere embed-english-v3.0, or a local sentence-transformers model). Embedding APIs bill per token, so a short query costs a small fraction of a cent; the call typically takes ~20ms.
Search the vector store: perform nearest-neighbor search in the vector index to find the most similar cached query vector. For small caches (under 10k entries), linear scan works. For larger caches, approximate nearest-neighbor (ANN) search with RediSearch or Qdrant is required.
Check similarity threshold: compute cosine similarity between the incoming vector and the nearest cached vector. If similarity exceeds the configured threshold (typically 0.92–0.95), it is a cache hit.
Return or call: cache hit means return the stored response immediately (no LLM call, no token cost). Cache miss means call the LLM provider.
Store on miss: after the LLM responds, store the query vector and response in the vector store for future hits.
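The five steps above can be sketched without any external services. This is a minimal in-memory version under stated assumptions: `toy_embed` is a hypothetical keyword counter standing in for a real embedding model, `fake_llm` stands in for the provider call, and `InMemorySemanticCache` is an illustrative name rather than a library class.

```python
import math
from typing import Callable, Optional

Vector = list[float]
KEYWORDS = ["install", "sdk", "rate", "limit", "auth"]


def toy_embed(text: str) -> Vector:
    """Hypothetical stand-in for an embedding model: keyword counts."""
    lowered = text.lower()
    return [float(lowered.count(word)) for word in KEYWORDS]


def cosine(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class InMemorySemanticCache:
    def __init__(self, embed: Callable[[str], Vector], threshold: float = 0.93):
        self.embed = embed
        self.threshold = threshold
        self.entries: list[tuple[Vector, str]] = []

    def lookup(self, query_vec: Vector) -> Optional[str]:
        # Step 2: nearest-neighbor search (linear scan).
        best_sim, best_response = -1.0, None
        for vec, response in self.entries:
            sim = cosine(query_vec, vec)
            if sim > best_sim:
                best_sim, best_response = sim, response
        # Step 3: threshold check.
        return best_response if best_sim >= self.threshold else None

    def answer(self, query: str, call_llm: Callable[[str], str]) -> tuple[str, bool]:
        query_vec = self.embed(query)                 # Step 1: embed the query
        cached = self.lookup(query_vec)
        if cached is not None:
            return cached, True                       # Step 4: hit, no LLM call
        response = call_llm(query)                    # Step 4: miss, call the LLM
        self.entries.append((query_vec, response))    # Step 5: store on miss
        return response, False


llm_calls: list[str] = []


def fake_llm(query: str) -> str:
    llm_calls.append(query)
    return f"answer to: {query}"


demo = InMemorySemanticCache(toy_embed)
for q in ["How do I install the SDK?",
          "Show me how to install the SDK",
          "What are the rate limits?"]:
    _, hit = demo.answer(q, fake_llm)
    print("HIT " if hit else "MISS", q)
```

Note that the vector computed in step 1 is reused in step 5, so a miss pays for one embedding call rather than two.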

The Similarity Threshold: The Most Consequential Knob

The similarity threshold is the single most important configuration decision in semantic caching. Too low, and you serve wrong answers. Too high, and you miss valid hits.

To understand why threshold matters, consider what happens at different values:

0.80 threshold: "What is a Python snake?" and "What is Python programming?" have a cosine similarity around 0.82 (they share "Python" and "is" - the embedding model sees lexical overlap). A threshold of 0.80 would serve the snake answer to the programming question. Never deploy below 0.90 without extensive empirical testing.
0.93 threshold: close paraphrases hit (a rewording of "How do I install the SDK?" scores around 0.97), while looser ones miss ("What authentication method does your API use?" scores about 0.89 against "How do I authenticate?"). Missing queries that are semantically identical but use very different vocabulary is a small cost worth the accuracy gain.
0.99 threshold: only hits when queries are almost identical character-by-character. Hit rate falls below 5% in most applications. Essentially an expensive fuzzy string match at this point.

How to calibrate empirically: collect a sample of 500–1,000 query pairs from your production logs. Manually label each pair as "should share a response" or "should not share a response." Compute embeddings for each pair. Plot the cosine similarity distribution for both groups. The correct threshold sits at the point that maximizes hits in the "should share" group while minimizing hits in the "should not share" group. This process takes about a half-day and should be repeated when significant domain shift occurs.
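Once the pairs are labeled and scored, the threshold choice can be made mechanical. The sketch below is one possible selection rule, not the only one: it sweeps candidate thresholds and returns the lowest one whose wrong-hit count stays within a budget. The similarity values are invented sample data, and `sweep` and `pick_threshold` are illustrative names.

```python
from typing import Optional

# (cosine similarity, should_share) - made-up sample data for illustration.
labeled = [
    (0.97, True), (0.95, True), (0.94, True), (0.89, True),
    (0.91, False), (0.82, False), (0.70, False),
]
thresholds = [0.80, 0.85, 0.90, 0.93, 0.95, 0.97]


def sweep(pairs: list[tuple[float, bool]], candidates: list[float]) -> list[dict]:
    """Count valid hits, wrong hits, and missed hits at each candidate threshold."""
    rows = []
    for t in candidates:
        rows.append({
            "threshold": t,
            "valid_hits": sum(1 for sim, share in pairs if share and sim >= t),
            "wrong_hits": sum(1 for sim, share in pairs if not share and sim >= t),
            "missed": sum(1 for sim, share in pairs if share and sim < t),
        })
    return rows


def pick_threshold(pairs: list[tuple[float, bool]], candidates: list[float],
                   max_wrong_hits: int = 0) -> Optional[float]:
    """Lowest threshold within the wrong-hit budget (keeps the most valid hits)."""
    for row in sweep(pairs, sorted(candidates)):
        if row["wrong_hits"] <= max_wrong_hits:
            return row["threshold"]
    return None


for row in sweep(labeled, thresholds):
    print(row)
print("chosen threshold:", pick_threshold(labeled, thresholds))
```

A zero wrong-hit budget is the conservative choice. Raising `max_wrong_hits` trades accuracy for hit rate and should be a deliberate decision.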

Full Implementation: SemanticCache with Redis

The following is a production-quality semantic cache implementation using OpenAI embeddings and Redis for storage.

import anthropic
import hashlib
import json
import time
import numpy as np
import redis
from dataclasses import dataclass, field
from typing import Optional
from openai import OpenAI


# ─────────────────────────────────────────────────────────────────────────────
# Data classes for cache statistics and results
# ─────────────────────────────────────────────────────────────────────────────

@dataclass
class CacheHitResult:
    response: str
    similarity: float
    latency_ms: float
    entry_id: str


@dataclass
class CacheStats:
    hits: int = 0
    misses: int = 0
    total_embed_time_ms: float = 0.0
    total_search_time_ms: float = 0.0
    llm_cost_saved_usd: float = 0.0

    @property
    def hit_rate(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total > 0 else 0.0

    @property
    def avg_embed_ms(self) -> float:
        total = self.hits + self.misses
        return self.total_embed_time_ms / total if total > 0 else 0.0


# ─────────────────────────────────────────────────────────────────────────────
# SemanticCache: the core cache layer
# ─────────────────────────────────────────────────────────────────────────────

class SemanticCache:
    """
    Production semantic cache using Redis for vector and response storage.

    For caches under 10k entries: linear scan (O(n) per query).
    For larger caches: replace _find_similar() with RediSearch vector index
    for O(log n) approximate nearest-neighbor search.

    Concurrency: individual Redis commands are atomic, but the in-process
    CacheStats counters are not synchronized across threads.
    """

    def __init__(
        self,
        redis_url: str = "redis://localhost:6379",
        embedding_model: str = "text-embedding-3-small",
        similarity_threshold: float = 0.93,
        ttl_seconds: int = 86400,          # 24 hours default TTL
        namespace: str = "default",        # namespace for multi-tenant isolation
    ):
        self.redis = redis.from_url(redis_url, decode_responses=False)
        self.embedding_model = embedding_model
        self.similarity_threshold = similarity_threshold
        self.ttl_seconds = ttl_seconds
        self.namespace = namespace
        self.openai = OpenAI()
        self.stats = CacheStats()

        # Namespace-specific keys prevent cross-feature cache collisions
        self._set_key = f"scache:{namespace}:entries"
        self._prefix = f"scache:{namespace}:entry:"

    def _embed(self, text: str) -> np.ndarray:
        """
        Generate an embedding vector for a text string.
        Billed per input token; a short query costs a small fraction of a cent.
        """
        t0 = time.time()
        response = self.openai.embeddings.create(
            input=text.strip(),
            model=self.embedding_model,
        )
        embed_ms = (time.time() - t0) * 1000
        self.stats.total_embed_time_ms += embed_ms
        return np.array(response.data[0].embedding, dtype=np.float32)

    @staticmethod
    def _cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
        """Cosine similarity in [-1, 1]. For embeddings, values are typically [0.6, 1.0]."""
        norm_a = np.linalg.norm(a)
        norm_b = np.linalg.norm(b)
        if norm_a == 0.0 or norm_b == 0.0:
            return 0.0
        return float(np.dot(a, b) / (norm_a * norm_b))

    def _find_similar(self, query_vector: np.ndarray) -> Optional[tuple[str, float]]:
        """
        Linear scan over all cached entries to find the most similar.
        Returns (entry_id, similarity) if above threshold, else None.

        Performance: ~5ms for 1k entries, ~50ms for 10k entries.
        For 100k+ entries, use RediSearch ANN instead.
        """
        entry_ids = self.redis.smembers(self._set_key)
        best_sim = -1.0
        best_id: Optional[str] = None

        t0 = time.time()
        for entry_id_bytes in entry_ids:
            entry_id = entry_id_bytes.decode("utf-8")
            entry_key = f"{self._prefix}{entry_id}"
            vector_bytes = self.redis.hget(entry_key, "vector")
            if vector_bytes is None:
                continue
            cached_vec = np.frombuffer(vector_bytes, dtype=np.float32)
            sim = self._cosine_similarity(query_vector, cached_vec)
            if sim > best_sim:
                best_sim = sim
                best_id = entry_id

        self.stats.total_search_time_ms += (time.time() - t0) * 1000

        if best_id is not None and best_sim >= self.similarity_threshold:
            return best_id, best_sim
        return None

    def get(self, query: str) -> Optional[CacheHitResult]:
        """
        Look up the cache for a semantically similar query.
        Returns CacheHitResult on hit, None on miss.
        """
        t0 = time.time()
        query_vec = self._embed(query)
        result = self._find_similar(query_vec)

        if result is not None:
            entry_id, similarity = result
            entry_key = f"{self._prefix}{entry_id}"
            response_bytes = self.redis.hget(entry_key, "response")

            if response_bytes is not None:
                # Refresh TTL on successful hit (cache entry is "warm")
                self.redis.expire(entry_key, self.ttl_seconds)
                self.stats.hits += 1
                return CacheHitResult(
                    response=response_bytes.decode("utf-8"),
                    similarity=similarity,
                    latency_ms=(time.time() - t0) * 1000,
                    entry_id=entry_id,
                )

        self.stats.misses += 1
        return None

    def set(self, query: str, response: str, query_vector: Optional[np.ndarray] = None) -> str:
        """
        Store a query-response pair in the cache.
        Returns the entry_id for the stored entry.
        """
        if query_vector is None:
            query_vector = self._embed(query)

        # Deterministic entry ID from query text
        entry_id = hashlib.sha256(f"{self.namespace}:{query}".encode()).hexdigest()[:16]
        entry_key = f"{self._prefix}{entry_id}"

        pipe = self.redis.pipeline()
        pipe.hset(entry_key, mapping={
            "query": query.encode("utf-8"),
            "response": response.encode("utf-8"),
            "vector": query_vector.tobytes(),
            "created_at": str(time.time()).encode("utf-8"),
            "namespace": self.namespace.encode("utf-8"),
        })
        pipe.expire(entry_key, self.ttl_seconds)
        pipe.sadd(self._set_key, entry_id)
        pipe.execute()

        return entry_id

    def warm(self, seed_pairs: list[tuple[str, str]]) -> int:
        """
        Pre-populate the cache with known query-response pairs.
        Use before launch to ensure high hit rates from day one.
        Returns the number of entries stored.
        """
        count = 0
        for query, response in seed_pairs:
            self.set(query, response)
            count += 1
        print(f"Cache warm: {count} seed entries stored in namespace '{self.namespace}'")
        return count

    def get_stats(self) -> dict:
        cache_size = self.redis.scard(self._set_key)
        return {
            "namespace": self.namespace,
            "hit_rate": f"{self.stats.hit_rate:.1%}",
            "hits": self.stats.hits,
            "misses": self.stats.misses,
            "cache_size": cache_size,
            "avg_embed_ms": f"{self.stats.avg_embed_ms:.1f}",
            "similarity_threshold": self.similarity_threshold,
            "llm_cost_saved_usd": f"${self.stats.llm_cost_saved_usd:.4f}",
        }


# ─────────────────────────────────────────────────────────────────────────────
# CachedAnthropicClient: transparent caching in front of Claude
# ─────────────────────────────────────────────────────────────────────────────

class CachedAnthropicClient:
    """
    Anthropic client with semantic caching layer.
    Transparently returns cached responses for semantically similar queries.
    Tracks cost savings from cache hits.
    """

    # claude-sonnet-4-6 pricing (March 2026)
    INPUT_COST_PER_TOKEN = 3.0 / 1_000_000
    OUTPUT_COST_PER_TOKEN = 15.0 / 1_000_000

    def __init__(self, cache: SemanticCache):
        self.cache = cache
        self.client = anthropic.Anthropic()

    def complete(
        self,
        messages: list[dict],
        model: str = "claude-sonnet-4-6",
        max_tokens: int = 1024,
        system: Optional[str] = None,
        cache_namespace_prefix: str = "",
    ) -> dict:
        """
        Complete a request with semantic caching.
        The cache is checked before any LLM call is made.

        Returns:
            dict with: response, cache_hit, similarity, cost_usd,
                       input_tokens, output_tokens, latency_ms
        """
        # Build a canonical cache query from the user-role messages
        user_content = " ".join(
            m["content"] for m in messages if m["role"] == "user"
        )
        # Optionally prefix the cache key with a domain context
        # to prevent cross-feature cache collisions within the same namespace
        cache_query = f"{cache_namespace_prefix}{user_content}".strip()

        # Step 1: Check semantic cache
        cache_result = self.cache.get(cache_query)
        if cache_result is not None:
            # Estimate cost saving (assume average 300 output tokens)
            estimated_saving = self.OUTPUT_COST_PER_TOKEN * 300
            self.cache.stats.llm_cost_saved_usd += estimated_saving

            return {
                "response": cache_result.response,
                "cache_hit": True,
                "similarity": round(cache_result.similarity, 4),
                "cost_usd": 0.0,
                "latency_ms": cache_result.latency_ms,
                "cache_entry_id": cache_result.entry_id,
            }

        # Step 2: Cache miss - call the LLM
        start = time.time()
        kwargs: dict = {"model": model, "max_tokens": max_tokens, "messages": messages}
        if system:
            kwargs["system"] = system

        api_response = self.client.messages.create(**kwargs)
        llm_latency_ms = (time.time() - start) * 1000

        response_text = api_response.content[0].text
        input_tokens = api_response.usage.input_tokens
        output_tokens = api_response.usage.output_tokens
        cost = (
            input_tokens * self.INPUT_COST_PER_TOKEN
            + output_tokens * self.OUTPUT_COST_PER_TOKEN
        )

        # Step 3: Store in cache for future hits
        self.cache.set(cache_query, response_text)

        return {
            "response": response_text,
            "cache_hit": False,
            "similarity": None,
            "cost_usd": round(cost, 8),
            "input_tokens": input_tokens,
            "output_tokens": output_tokens,
            "latency_ms": llm_latency_ms,
        }


# ─────────────────────────────────────────────────────────────────────────────
# Experiments and demos
# ─────────────────────────────────────────────────────────────────────────────

def run_hit_rate_experiment() -> None:
    """
    Measure hit rate with realistic FAQ query variations.
    Seeds 3 queries, tests 9 variant phrasings.
    """
    cache = SemanticCache(
        redis_url="redis://localhost:6379",
        similarity_threshold=0.93,
        namespace="docs-assistant",
    )
    client = CachedAnthropicClient(cache)

    seed_queries = [
        ("How do I install the Python SDK?",
         "Install with pip: `pip install yourlib`. Full docs at..."),
        ("What are the API rate limits?",
         "Default limits: 60 requests/min, 100k tokens/min. Enterprise plans have higher limits."),
        ("How do I authenticate with the API?",
         "Generate an API key from your dashboard at settings -> API keys. Pass it in the Authorization header."),
    ]

    print("=== Warming cache with seed entries ===\n")
    cache.warm(seed_queries)

    variant_queries = [
        "Show me how to install the SDK for Python",
        "What is the install process for your Python SDK?",
        "How many requests can I make per minute?",
        "What rate limiting applies to my API calls?",
        "How does authentication work with your API?",
        "What's the auth method for the API?",
        "Where can I find my API token?",
        "How do I get a developer API key?",
        "Tell me about the request rate limits",
    ]

    print("=== Testing variant queries ===\n")
    for query in variant_queries:
        result = client.complete(
            messages=[{"role": "user", "content": query}],
            max_tokens=100,
            cache_namespace_prefix="docs:",
        )
        status = "CACHE" if result["cache_hit"] else "LLM  "
        sim_str = f"sim={result['similarity']:.3f}" if result["cache_hit"] else "new LLM call"
        cost_str = f"${result['cost_usd']:.6f}" if not result["cache_hit"] else "$0.000000"
        print(f"  [{status}] ({sim_str}) cost={cost_str}")
        print(f"           Q: {query[:65]}")

    print("\n=== Cache Statistics ===")
    stats = cache.get_stats()
    for key, value in stats.items():
        print(f"  {key}: {value}")


def threshold_calibration_demo() -> None:
    """
    Demonstrate how different thresholds affect the hit/miss boundary
    for a set of known query pairs.
    """
    from openai import OpenAI
    openai_client = OpenAI()

    def embed(text: str) -> np.ndarray:
        response = openai_client.embeddings.create(
            input=text, model="text-embedding-3-small"
        )
        return np.array(response.data[0].embedding, dtype=np.float32)

    def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    pairs = [
        # Should share a response
        ("How do I install the Python SDK?", "What's the install process for your Python library?", True),
        ("What are the API rate limits?", "How many requests per minute can I make?", True),
        ("How do I authenticate?", "What authentication method does the API use?", True),
        # Should NOT share a response
        ("What is Python?", "What is a python snake?", False),
        ("How do I reset my password?", "How do I install the Python SDK?", False),
        ("Show me a code example", "What are the rate limits?", False),
    ]

    print("=== Threshold Calibration Analysis ===\n")
    print(f"{'Query A':<45} {'Query B':<45} {'Should Share':<14} {'Cosine Sim'}")
    print("-" * 120)

    for q_a, q_b, should_share in pairs:
        vec_a = embed(q_a)
        vec_b = embed(q_b)
        sim = cosine_sim(vec_a, vec_b)
        mark = "YES" if should_share else "NO "
        flag = " <-- THRESHOLD BOUNDARY" if 0.88 < sim < 0.96 else ""
        print(f"  {q_a[:42]:<45} {q_b[:42]:<45} {mark:<14} {sim:.4f}{flag}")


if __name__ == "__main__":
    run_hit_rate_experiment()

Cache Invalidation: The Hard Problem

Semantic caches have a fundamentally different invalidation challenge than exact-match caches. There is no unique key to delete - entries are matched by similarity. When the underlying facts change (new API version, updated pricing, changed documentation), related cache entries need to be invalidated.

class SemanticCacheInvalidator:
    """
    Strategies for invalidating semantic cache entries.
    All strategies scan the entry set and operate on matching entries.
    """

    def __init__(self, cache: SemanticCache):
        self.cache = cache

    def invalidate_by_topic(self, topic: str, threshold: float = 0.85) -> int:
        """
        Invalidate all entries semantically related to a topic.
        Lower threshold than cache hits because we want to err on the side
        of over-invalidation (a few extra cache misses) rather than
        under-invalidation (stale entries that keep serving wrong answers).

        Use when: a new API version is released, a product changes its pricing,
        documentation is significantly updated.

        Returns: number of entries invalidated.
        """
        topic_vec = self.cache._embed(topic)
        entry_ids = list(self.cache.redis.smembers(self.cache._set_key))
        invalidated = 0

        for entry_id_bytes in entry_ids:
            entry_id = entry_id_bytes.decode("utf-8")
            entry_key = f"{self.cache._prefix}{entry_id}"
            vec_bytes = self.cache.redis.hget(entry_key, "vector")
            if vec_bytes is None:
                continue

            entry_vec = np.frombuffer(vec_bytes, dtype=np.float32)
            sim = self.cache._cosine_similarity(topic_vec, entry_vec)

            if sim >= threshold:
                self.cache.redis.delete(entry_key)
                self.cache.redis.srem(self.cache._set_key, entry_id)
                invalidated += 1

        print(f"Topic invalidation '{topic}': removed {invalidated} entries (threshold={threshold})")
        return invalidated

    def invalidate_older_than(self, max_age_seconds: float) -> int:
        """
        Invalidate entries older than max_age_seconds.
        Useful for time-sensitive domains (news, prices, status pages).
        Returns: count of invalidated entries.
        """
        now = time.time()
        entry_ids = list(self.cache.redis.smembers(self.cache._set_key))
        invalidated = 0

        for entry_id_bytes in entry_ids:
            entry_id = entry_id_bytes.decode("utf-8")
            entry_key = f"{self.cache._prefix}{entry_id}"
            created_bytes = self.cache.redis.hget(entry_key, "created_at")
            if created_bytes is None:
                continue

            created_at = float(created_bytes.decode("utf-8"))
            if (now - created_at) > max_age_seconds:
                self.cache.redis.delete(entry_key)
                self.cache.redis.srem(self.cache._set_key, entry_id)
                invalidated += 1

        print(f"Age invalidation: removed {invalidated} entries older than {max_age_seconds}s")
        return invalidated

    def full_flush(self) -> int:
        """
        Nuclear option: clear the entire namespace cache.
        Use before a major content overhaul when most entries are stale.
        """
        entry_ids = list(self.cache.redis.smembers(self.cache._set_key))
        for entry_id_bytes in entry_ids:
            entry_id = entry_id_bytes.decode("utf-8")
            self.cache.redis.delete(f"{self.cache._prefix}{entry_id}")
        self.cache.redis.delete(self.cache._set_key)
        print(f"Full flush: removed {len(entry_ids)} entries from namespace '{self.cache.namespace}'")
        return len(entry_ids)

Scaling to Large Caches: RediSearch ANN

The linear scan in the implementation above becomes too slow at more than 10,000 cache entries (~50ms+ per lookup). For larger deployments, use Redis with the RediSearch module and a vector index for approximate nearest-neighbor search.

def create_redisearch_vector_index(redis_client: redis.Redis, index_name: str = "scache_idx") -> None:
    """
    Create a RediSearch vector index for ANN (approximate nearest neighbor) search.
    This replaces the O(n) linear scan with O(log n) HNSW search.

    Requires: Redis Stack or Redis with the RediSearch module installed.
    """
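    try:
        # Drop the existing index definition if present (for re-creation).
        # The "DD" flag is deliberately omitted: it would also delete the
        # indexed hashes, i.e. every cached entry.
        redis_client.execute_command("FT.DROPINDEX", index_name)
    except redis.ResponseError:
        # Raised when the index does not exist yet; safe to ignore.
        pass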

    # Create index on the 'vector' field of hash keys prefixed with 'scache:default:entry:'
    redis_client.execute_command(
        "FT.CREATE", index_name,
        "ON", "HASH",
        "PREFIX", "1", "scache:default:entry:",
        "SCHEMA",
        "vector", "VECTOR", "HNSW", "6",
            "TYPE", "FLOAT32",
            "DIM", "1536",             # text-embedding-3-small dimension
            "DISTANCE_METRIC", "COSINE",
    )
    print(f"RediSearch vector index '{index_name}' created.")


def search_with_redisearch(
    redis_client: redis.Redis,
    query_vector: np.ndarray,
    index_name: str = "scache_idx",
    top_k: int = 1,
    similarity_threshold: float = 0.93,
) -> Optional[tuple[str, float]]:
    """
    Use RediSearch KNN vector search instead of linear scan.
    Returns (entry_id, similarity) or None.

    This scales to millions of entries with millisecond latency.
    """
    vector_bytes = query_vector.tobytes()

    # FT.SEARCH with KNN syntax
    results = redis_client.execute_command(
        "FT.SEARCH", index_name,
        f"*=>[KNN {top_k} @vector $vec AS score]",
        "PARAMS", "2", "vec", vector_bytes,
        "SORTBY", "score",
        "RETURN", "2", "score", "__key",
        "DIALECT", "2",
    )

    if results[0] == 0:
        return None

    # Parse the result: [count, key, [field, value, ...], ...]
    entry_key = results[1].decode("utf-8")
    score_bytes = None
    fields = results[2]
    for i in range(0, len(fields), 2):
        if fields[i] == b"score":
            score_bytes = fields[i + 1]
            break

    if score_bytes is None:
        return None

    # RediSearch returns cosine DISTANCE (not similarity) - convert
    cosine_distance = float(score_bytes)
    similarity = 1.0 - cosine_distance

    if similarity >= similarity_threshold:
        # Extract entry_id from key
        entry_id = entry_key.split(":")[-1]
        return entry_id, similarity

    return None

When Semantic Caching Is and Is Not Appropriate

| Use case | Cache suitable? | Reasoning |
| --- | --- | --- |
| FAQ bot | Yes | High query repetition, answers are stable |
| Documentation assistant | Yes | Common questions, stable content |
| Product description generator | Yes (high threshold) | Templates reuse; 0.97+ threshold |
| Code generation | Partial | Subtle differences matter; threshold 0.97+ |
| User-personalized recommendations | No | Responses must be tailored to the individual |
| Real-time data queries | No | "What is the current BTC price?" cannot be cached |
| Creative writing | No | Users expect fresh responses each time |
| Queries containing PII | No | Privacy violation risk - must bypass entirely |
| Medical or legal advice | No | Liability from stale or slightly wrong answers |

:::danger Never cache queries containing PII
If a user's query contains personal information - their name, account number, medical condition, location - that query-response pair must never be stored in the cache. A future user asking a semantically similar question could receive a cached response containing another user's private data. Implement a PII detection step before the cache lookup (regex for emails, phone numbers, SSNs, or use a dedicated PII classifier) and bypass the cache entirely for flagged queries.
:::
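A regex pre-check of the kind mentioned above might look like the following. The patterns are deliberately crude, US-centric examples and will miss names, addresses, and most free-text PII, so treat this as a first filter under those assumptions rather than a complete detector. `PII_PATTERNS`, `detect_pii`, and `is_cacheable` are illustrative names.

```python
import re

# Crude illustrative patterns; a production system would likely add a
# dedicated PII classifier on top of these.
PII_PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "us_phone": re.compile(r"\b\d{3}[\s.-]?\d{3}[\s.-]?\d{4}\b"),
    "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def detect_pii(text: str) -> list[str]:
    """Return the names of all patterns that match the text."""
    return [name for name, pattern in PII_PATTERNS.items() if pattern.search(text)]


def is_cacheable(query: str) -> bool:
    """True only when no PII pattern matches; otherwise bypass the cache entirely."""
    return not detect_pii(query)


for q in ["How do I install the SDK?",
          "My email is jane@example.com, please reset my key",
          "My SSN is 123-45-6789"]:
    print(is_cacheable(q), detect_pii(q), q)
```

The check belongs in front of both the lookup and the store, so a flagged query neither reads from nor writes to the cache.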

:::warning The embedding model must match between writes and reads
If you populate the cache with text-embedding-3-small embeddings (1536 dimensions) and then change to text-embedding-3-large (3072 dimensions), all existing cache entries are incompatible - vector dimensions don't match and cosine similarity comparisons will be incorrect or will error. Version your cache namespace when changing embedding models. Warm the new namespace in parallel before switching traffic to it. Never mix embedding model versions in the same namespace.
:::
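One way to version the namespace is to derive it from the embedding model name and dimension, so that a model change cannot silently reuse old vectors. The helper name and the `feature:model:dims` format below are assumptions chosen for illustration, not a convention from the implementation above.

```python
import re


def versioned_namespace(feature: str, embedding_model: str, dimensions: int) -> str:
    """Build a cache namespace that changes whenever the embedding model changes."""
    model_slug = re.sub(r"[^a-z0-9]+", "-", embedding_model.lower()).strip("-")
    return f"{feature}:{model_slug}:{dimensions}d"


old_ns = versioned_namespace("docs-assistant", "text-embedding-3-small", 1536)
new_ns = versioned_namespace("docs-assistant", "text-embedding-3-large", 3072)
print(old_ns)
print(new_ns)
```

The resulting string could be passed as the `namespace` argument of a cache instance, which keeps the old and new entries under separate key prefixes while the new namespace is warmed.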

:::tip Warm the cache before launch for known high-traffic queries
For any application where you know the common queries in advance - FAQ bots, documentation assistants, customer support systems - pre-populate the cache before launch. Collect the top 200–500 most frequent questions from your existing support tickets, docs analytics, or user research. Generate answers for each. Use cache.warm() to store them. This ensures your launch-day traffic hits the cache from the first request, rather than spending the first day populating it through live traffic.
:::

:::info Measure cost savings, not just hit rate
A 40% cache hit rate sounds good in isolation, but the value depends entirely on what kinds of queries are hitting. If your 40% hit rate is all on short, cheap queries while expensive long queries always miss, the cost savings may be only 15%. Report cost avoided (the estimated LLM cost of each request served from cache, summed over all hits) alongside hit rate. This is the metric that matters to finance and engineering leadership.
:::
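The gap between hit rate and cost savings is easy to see with a small calculation. In this sketch, `RequestRecord` and `summarize` are hypothetical names and the per-request costs are invented so that cheap queries hit and expensive ones miss:

```python
from dataclasses import dataclass


@dataclass
class RequestRecord:
    cache_hit: bool
    est_llm_cost_usd: float  # actual cost on a miss, estimated avoided cost on a hit


def summarize(records: list[RequestRecord]) -> dict[str, float]:
    """Hit rate plus the share of would-be LLM spend that the cache avoided."""
    total = len(records)
    cost_without_cache = sum(r.est_llm_cost_usd for r in records)
    cost_avoided = sum(r.est_llm_cost_usd for r in records if r.cache_hit)
    return {
        "hit_rate": sum(r.cache_hit for r in records) / total if total else 0.0,
        "cost_avoided_usd": cost_avoided,
        "cost_savings_rate": cost_avoided / cost_without_cache if cost_without_cache else 0.0,
    }


# Made-up traffic: 4 cheap queries hit the cache, 6 expensive queries miss.
records = [RequestRecord(True, 0.009)] * 4 + [RequestRecord(False, 0.034)] * 6
summary = summarize(records)
print(f"hit rate: {summary['hit_rate']:.0%}, "
      f"cost savings: {summary['cost_savings_rate']:.0%}")
```

With these numbers the hit rate is 40% while only about 15% of the spend is avoided, which is why both figures belong on the dashboard.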

Cache Monitoring in Production

Cache monitoring is non-negotiable in production. Track these metrics and their acceptable ranges:

| Metric | How to measure | Acceptable range | Action if outside range |
| --- | --- | --- | --- |
| Hit rate | hits / (hits + misses), hourly | Above 25% after 48h for FAQ workloads | Investigate query variety; warm with more seeds |
| Avg similarity on hits | Mean cosine sim for all cache hits | 0.93–0.99 | Below 0.93: threshold too low; raise it |
| P95 embedding latency | Time from query receipt to vector returned | Under 50ms | Switch to faster embedding model or cache embeddings |
| Cache entry count | `SCARD scache:namespace:entries` | Below 10k for linear scan | Migrate to RediSearch ANN above 10k entries |
| Stale hit rate | Manual sampling of cache hits | Under 5% incorrect | Run topic-based invalidation on changed content |
| Cost savings rate | Cache hits × avg cost per LLM call | Growing over first 2 weeks | Monitor trend; declining means query diversity increasing |
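The ranges in the table can be turned into an automated check. The sketch below assumes a hypothetical metrics dict whose keys are invented for this example. The limits mirror the table, using the 10k-entry migration point for linear scan.

```python
def check_cache_health(metrics: dict[str, float]) -> list[str]:
    """Return one alert string per metric that falls outside its acceptable range."""
    alerts: list[str] = []
    if metrics["hit_rate"] < 0.25:
        alerts.append("hit rate below 25%: investigate query variety, add seed entries")
    if metrics["avg_hit_similarity"] < 0.93:
        alerts.append("avg similarity on hits below 0.93: threshold is too low")
    if metrics["p95_embed_ms"] > 50:
        alerts.append("P95 embedding latency above 50ms: use a faster embedding model")
    if metrics["entry_count"] > 10_000:
        alerts.append("over 10k entries: migrate from linear scan to ANN search")
    if metrics["stale_hit_rate"] > 0.05:
        alerts.append("stale hit rate above 5%: run topic-based invalidation")
    return alerts


healthy = {
    "hit_rate": 0.41, "avg_hit_similarity": 0.96, "p95_embed_ms": 22.0,
    "entry_count": 4200, "stale_hit_rate": 0.01,
}
degraded = {
    "hit_rate": 0.12, "avg_hit_similarity": 0.96, "p95_embed_ms": 80.0,
    "entry_count": 25_000, "stale_hit_rate": 0.01,
}
print(check_cache_health(healthy))
for alert in check_cache_health(degraded):
    print("ALERT:", alert)
```

A check like this could run hourly against whatever metrics store is in use. The stale hit rate still has to come from manual sampling.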

Production Engineering Notes

A few notes on interpreting the monitoring metrics listed above:

Hit rate (hourly): should be above 25% after 48 hours for FAQ-type workloads
Avg similarity on hits: distribution should be tight near the threshold plus a cluster near 0.99 (near-exact matches); a flat distribution suggests the threshold is too low
P95 embedding latency: embedding calls add overhead to every request; should be under 50ms
Cache size: if the entry count grows without bound, prune with TTLs and age-based invalidation
Stale hit rate: manually sample cache hits periodically and verify the response is still correct
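As a rough illustration of the first two checks, the sketch below keeps hit, miss, and similarity counters in process and flags values outside the ranges given above. `CacheStats` and `check_alerts` are hypothetical names introduced here, not part of the lesson's cache implementation; a real deployment would more likely export these counters to a metrics system.

```python
from dataclasses import dataclass, field
from statistics import mean


@dataclass
class CacheStats:
    """Hypothetical in-process counters for a semantic cache."""
    hits: int = 0
    misses: int = 0
    hit_similarities: list[float] = field(default_factory=list)

    def record_hit(self, similarity: float) -> None:
        self.hits += 1
        self.hit_similarities.append(similarity)

    def record_miss(self) -> None:
        self.misses += 1

    def hit_rate(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total else 0.0

    def avg_hit_similarity(self) -> float:
        return mean(self.hit_similarities) if self.hit_similarities else 0.0


def check_alerts(
    stats: CacheStats,
    min_hit_rate: float = 0.25,   # "above 25%" target from the text
    min_avg_sim: float = 0.93,    # lower bound of the 0.93-0.99 range
) -> list[str]:
    """Return human-readable alerts for metrics outside their target range."""
    alerts = []
    if stats.hit_rate() < min_hit_rate:
        alerts.append("hit rate below target")
    # Average similarity is only meaningful once there is at least one hit.
    if stats.hits and stats.avg_hit_similarity() < min_avg_sim:
        alerts.append("avg hit similarity below target")
    return alerts
```

Resetting the counters every hour would give the hourly hit rate described above.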

Interview Q&A

Q: How does semantic caching differ from traditional exact-match caching?

Exact-match caching uses the full query string as the cache key and only returns a cached response when an identical string appears again. For natural language, hit rates are near zero because users rarely type exactly the same string twice. Semantic caching converts queries to embedding vectors using a language model, then uses cosine similarity between the incoming query vector and stored vectors to determine whether a new query is "semantically close enough" to a cached query to return the cached response. Hit rates of 30–50% are typical for FAQ and documentation use cases, because semantically equivalent queries - regardless of wording - map to geometrically nearby points in embedding space.

Q: How do you choose the similarity threshold for a semantic cache?

Empirically. Collect 500–1,000 representative query pairs from production logs and manually label which pairs should share a response ("How do I install?" / "Installation instructions?") and which should not ("What is Python?" / "What is a python snake?"). Compute embeddings and cosine similarities for all pairs. Plot two distributions: similarity scores for "should share" pairs and similarity scores for "should not share" pairs. Set the threshold where the "should share" distribution is above it and the "should not share" distribution is below it. For most FAQ use cases, this is 0.92–0.95. For domains where subtle differences in wording matter (code generation, legal, medical), use 0.97+. Recalibrate when the query distribution changes significantly.
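A minimal sketch of that calibration step, assuming the cosine similarities for the labeled pairs are already computed. `pick_threshold` and its `false_hit_weight` parameter are illustrative choices, not something the lesson prescribes: the weight encodes the assumption that serving a wrong cached answer is worse than paying for one extra LLM call.

```python
import numpy as np


def pick_threshold(share_sims, no_share_sims, false_hit_weight=5.0):
    """Pick the candidate threshold that minimizes weighted labeling errors.

    false hit:  a "should not share" pair at or above the threshold
    missed hit: a "should share" pair below the threshold
    """
    share = np.asarray(share_sims, dtype=float)
    no_share = np.asarray(no_share_sims, dtype=float)
    candidates = [round(0.80 + 0.01 * i, 2) for i in range(20)]  # 0.80 .. 0.99

    best_t, best_cost = None, float("inf")
    for t in candidates:
        false_hits = int(np.sum(no_share >= t))
        missed_hits = int(np.sum(share < t))
        cost = false_hit_weight * false_hits + missed_hits
        if cost < best_cost:  # on ties, the lowest such threshold is kept
            best_t, best_cost = t, cost
    return best_t, best_cost
```

If the two distributions overlap, no threshold reaches zero cost, and the weight decides how much hit rate to give up for safety.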

Q: What are the risks of setting the threshold too low?

Two distinct risks. First, serving semantically wrong answers: a user asking "What is a Python snake?" and a user asking "What is Python programming?" have embedding similarity around 0.82 - at a threshold of 0.80, the second user gets the snake answer. Second, privacy violations: "What is my account balance?" from two different users may produce similar embeddings; at a low threshold, User B could receive User A's cached balance response. This is a serious data privacy failure. Never deploy below 0.90 without extensive empirical testing on your specific domain, and always implement PII detection to bypass the cache for sensitive queries.
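The cache bypass for sensitive queries could look roughly like the sketch below. The patterns are deliberately simple illustrations and nowhere near a complete PII detector; `should_bypass_cache` is a hypothetical helper name, and a production system would more likely combine a dedicated PII detection service with explicit per-route flags.

```python
import re

# Illustrative patterns only; real PII detection needs far broader coverage.
_SENSITIVE_PATTERNS = [
    # First-person references to user-specific data ("my account", "our invoice")
    re.compile(r"\b(my|our)\s+(account|balance|order|invoice|password)\b", re.IGNORECASE),
    # Email addresses
    re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    # US SSN-like number groups
    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
]


def should_bypass_cache(query: str) -> bool:
    """Return True if the query looks user-specific and must not be cached."""
    return any(p.search(query) for p in _SENSITIVE_PATTERNS)
```

The check has to run before both the cache lookup and the cache write, so a sensitive response is neither served from the cache nor stored in it.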

Q: What happens to cache quality when the underlying knowledge base changes?

Cache entries become stale when the facts they represent change. For example, if an API version changes and the installation instructions change, cached responses for "How do I install?" are now wrong. Strategies: (1) topic-based invalidation - embed the "changed topic" and delete all cache entries with similarity above 0.85 to that topic; (2) TTL-based invalidation - set short TTLs (1–7 days) on entries in domains that change frequently; (3) version-based namespace - when deploying a major documentation update, switch to a new cache namespace (e.g., docs-v2:) so old entries don't pollute new traffic. For domains where facts change multiple times per day (prices, status, live data), semantic caching is inappropriate regardless of invalidation strategy.

Q: How would you implement semantic caching at the gateway layer rather than in application code?

The gateway intercepts every request. Before forwarding to the LLM provider, it extracts the user-facing message content, generates an embedding, and queries the vector store. If similarity exceeds the threshold, the gateway returns the cached response directly without touching the provider - the request never leaves the gateway. On a cache miss, the gateway forwards to the provider, receives the response, stores the embedding-response pair, and returns the response to the caller. All application services benefit from this automatically without any application-level code changes. LiteLLM Proxy and Portkey both support semantic caching at the gateway layer with Redis as the backing store, configured via YAML or the dashboard. The advantage over application-level caching is that the cache is shared across all services and all users, maximizing hit rates.
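The request flow can be sketched with an in-memory store standing in for Redis. `GatewaySemanticCache`, `embed`, and `call_llm` are hypothetical names; the embedding function and the provider call are injected so that nothing here depends on a specific gateway product.

```python
from typing import Callable, Sequence

import numpy as np


class GatewaySemanticCache:
    """In-memory stand-in for a gateway-level semantic cache (sketch only)."""

    def __init__(
        self,
        embed: Callable[[str], Sequence[float]],
        call_llm: Callable[[str], str],
        threshold: float = 0.93,
    ):
        self.embed = embed
        self.call_llm = call_llm
        self.threshold = threshold
        self._vectors: list[np.ndarray] = []  # unit-normalized query vectors
        self._responses: list[str] = []

    @staticmethod
    def _unit(v: Sequence[float]) -> np.ndarray:
        arr = np.asarray(v, dtype=np.float32)
        norm = np.linalg.norm(arr)
        return arr / norm if norm else arr

    def handle(self, user_message: str) -> tuple[str, bool]:
        """Return (response, was_cache_hit) for one incoming request."""
        q = self._unit(self.embed(user_message))
        if self._vectors:
            # Dot product of unit vectors equals cosine similarity.
            sims = np.stack(self._vectors) @ q
            best = int(np.argmax(sims))
            if sims[best] >= self.threshold:
                return self._responses[best], True  # provider never called
        # Cache miss: forward to the provider, then store the pair.
        response = self.call_llm(user_message)
        self._vectors.append(q)
        self._responses.append(response)
        return response, False
```

Because the store lives in the gateway and not in any one service, a response cached for one caller is available to every other caller that sends a semantically similar message.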

Q: Walk me through how you would scale a semantic cache from 1,000 to 1,000,000 entries.

At 1,000 entries, a linear scan over Redis hashes works fine - the O(n) scan takes under 5ms. At 10,000 entries, linear scan takes ~50ms - still acceptable but starting to add noticeable overhead. At 100,000 entries, linear scan takes 500ms+ - unacceptable. The solution is to switch from linear scan to approximate nearest-neighbor (ANN) search using a vector index. Options: Redis Stack with RediSearch (HNSW index, sub-millisecond ANN at millions of entries), Qdrant (dedicated vector database with ANN, filtering, and payload storage), or Pinecone (managed service). Migration path: build the ANN index from existing cache entries, run both in parallel for validation, then switch over. ANN search returns approximate results (it may miss some near-threshold entries), but this tradeoff is acceptable given the massive latency improvement. Also implement cache eviction at this scale: track each entry's last access time and evict least-recently-used entries when the index exceeds a configured maximum size.
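The eviction policy can be sketched independently of the vector index. The bookkeeping class below is a hypothetical helper that tracks recency in process memory; in a real deployment the evicted keys it returns would also have to be deleted from the vector index and the backing store.

```python
from collections import OrderedDict
from typing import Any


class LRUIndex:
    """Tracks cache-entry recency and reports which keys to evict (sketch)."""

    def __init__(self, max_entries: int):
        self.max_entries = max_entries
        self._entries: "OrderedDict[str, Any]" = OrderedDict()

    def touch(self, key: str) -> None:
        """Mark an entry as recently used; call this on every cache hit."""
        if key in self._entries:
            self._entries.move_to_end(key)

    def add(self, key: str, value: Any) -> list[str]:
        """Insert an entry and return the keys evicted to stay within the limit."""
        self._entries[key] = value
        self._entries.move_to_end(key)
        evicted = []
        while len(self._entries) > self.max_entries:
            old_key, _ = self._entries.popitem(last=False)  # oldest first
            evicted.append(old_key)
        return evicted

    def keys(self) -> list[str]:
        """Keys ordered from least to most recently used."""
        return list(self._entries)
```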

Q: How do you measure whether semantic caching is actually saving money?

Track three numbers: (1) cache hit count per period, (2) average cost per LLM call for the same feature (measured on the cache misses, which did reach the LLM), and (3) embedding cost per lookup. Net savings = (cache_hits * avg_llm_cost_per_miss) - (total_lookups * embedding_cost_per_lookup). If the embedding cost exceeds the LLM cost avoided, caching is losing you money. The break-even hit rate is embedding_cost_per_lookup / avg_llm_cost_per_miss. For most FAQ scenarios with claude-sonnet-4-6 responses (avg $0.005/call) and OpenAI embeddings ($0.0001/embedding), that is 0.0001 / 0.005 = 2%, so you need a hit rate above 2% for caching to be net positive - in practice, FAQ workloads run 30–50% hit rates, making caching extremely profitable.
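The arithmetic is small enough to check directly. The sketch below uses the per-call figures quoted in the answer; the function names are illustrative, and the one-million-lookup volume in the usage note is an assumed example.

```python
def net_savings(
    total_lookups: int,
    hit_rate: float,
    avg_llm_cost: float,
    embedding_cost: float,
) -> float:
    """LLM cost avoided by hits, minus the embedding cost paid on every lookup."""
    cache_hits = total_lookups * hit_rate
    return cache_hits * avg_llm_cost - total_lookups * embedding_cost


def break_even_hit_rate(avg_llm_cost: float, embedding_cost: float) -> float:
    """Hit rate at which net savings are exactly zero."""
    return embedding_cost / avg_llm_cost
```

With $0.005 per LLM call and $0.0001 per embedding, `break_even_hit_rate(0.005, 0.0001)` comes to about 0.02. At an assumed one million lookups and a 40% hit rate, `net_savings(1_000_000, 0.40, 0.005, 0.0001)` works out to roughly $1,900: $2,000 of avoided LLM calls minus $100 of embedding cost.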

Semantic Cache Invalidation Strategies

When facts change, cached responses become stale. The right invalidation strategy depends on how frequently the underlying knowledge changes.

| Scenario | Recommended Strategy | Implementation |
|---|---|---|
| Documentation updated weekly | TTL-based (7-day expiry) | `EXPIRE` key on store |
| Product prices change daily | Very short TTL (4-6 hours) | Aggressive TTL |
| Live data (status, balance) | Bypass cache entirely | PII/dynamic flag |
| Major system change (new version) | Namespace switch | Change key prefix |
| Topic-specific update | Topic-based invalidation | Delete by similarity |

```python
import hashlib
import json

import numpy as np
import redis


class SemanticCacheInvalidator:
    """
    Invalidates semantic cache entries based on topic similarity.

    When a documentation section changes, embed the changed topic
    and delete all cache entries whose stored query is similar to
    the changed topic. This is more precise than full cache flush
    and more automatic than manual key deletion.

    Assumes a Redis client created with decode_responses=False, so
    hash fields and key types come back as bytes.
    """

    def __init__(
        self,
        redis_client: redis.Redis,
        invalidation_threshold: float = 0.85,
        cache_key_prefix: str = "scache:",
    ):
        self.redis = redis_client
        self.threshold = invalidation_threshold
        self.prefix = cache_key_prefix

    def _embed(self, text: str) -> list[float]:
        """Embed text for similarity comparison."""
        # PLACEHOLDER: a hash is not a semantic embedding, so similar texts
        # do NOT get similar vectors. In production, replace this with the
        # same embedding model that produced the stored query embeddings.
        h = hashlib.sha256(text.encode()).digest()  # 32 bytes -> 32 values
        return [b / 255.0 for b in h]

    def _cosine_sim(self, a: list[float], b: list[float]) -> float:
        va = np.array(a, dtype=np.float32)
        vb = np.array(b, dtype=np.float32)
        # Vectors from different embedding models cannot be compared.
        if va.shape != vb.shape:
            return 0.0
        norm_a = np.linalg.norm(va)
        norm_b = np.linalg.norm(vb)
        if norm_a == 0 or norm_b == 0:
            return 0.0
        return float(np.dot(va, vb) / (norm_a * norm_b))

    def invalidate_by_topic(self, changed_topic: str) -> int:
        """
        Delete all cache entries semantically similar to the changed topic.

        Returns the number of entries deleted.
        """
        topic_embedding = self._embed(changed_topic)

        # Scan all cache keys under the prefix
        pattern = f"{self.prefix}*"
        keys_deleted = 0

        cursor = 0
        while True:
            cursor, keys = self.redis.scan(cursor, match=pattern, count=100)
            for key in keys:
                # The prefix can also match non-hash keys (for example a set
                # of entry IDs); HGETALL on those raises WRONGTYPE, so skip them.
                if self.redis.type(key) != b"hash":
                    continue

                # Each cache entry stores the query embedding alongside the response
                stored = self.redis.hgetall(key)
                if b"query_embedding" not in stored:
                    continue

                stored_emb = json.loads(stored[b"query_embedding"])
                similarity = self._cosine_sim(topic_embedding, stored_emb)

                if similarity >= self.threshold:
                    self.redis.delete(key)
                    keys_deleted += 1

            if cursor == 0:
                break

        print(f"[CacheInvalidator] Deleted {keys_deleted} entries similar to: '{changed_topic}'")
        return keys_deleted

    def invalidate_by_namespace(self, old_prefix: str) -> int:
        """Delete all cache entries under an old namespace prefix."""
        pattern = f"{old_prefix}*"
        keys = list(self.redis.scan_iter(match=pattern, count=500))
        if keys:
            self.redis.delete(*keys)
        print(f"[CacheInvalidator] Deleted {len(keys)} entries with prefix '{old_prefix}'")
        return len(keys)


# Usage example: documentation v2 released - invalidate v1 cache
# invalidator = SemanticCacheInvalidator(redis_client)
# deleted = invalidator.invalidate_by_topic(
#     "Python SDK installation instructions version 2"
# )
# Also switch to new cache namespace for new traffic:
# cache.prefix = "scache:v2:"
```

Cache Warming: Pre-Populating Before Launch

After a cache flush or fresh deployment, the cache is empty. For high-traffic features, this "cold start" means every user query is a cache miss until enough unique queries have been answered and stored. Cache warming pre-populates the cache before launch using known high-frequency queries.

```python
import anthropic


def warm_semantic_cache(
    cache: "SemanticCache",  # the SemanticCache class from the full implementation section
    client: anthropic.Anthropic,
    seed_queries: list[str],
    model: str = "claude-haiku-4-5-20251001",
    max_tokens: int = 512,
) -> dict:
    """
    Pre-populate the semantic cache with responses to seed queries.

    Seed queries are the top N queries from production logs (after deduplication)
    or a manually curated list of expected frequent queries.
    """
    results = {"warmed": 0, "already_cached": 0, "failed": 0}

    for query in seed_queries:
        # Check if already cached (e.g., from a previous warm run).
        # This is a semantic lookup, so a seed that is close to an
        # earlier seed also counts as already cached.
        existing = cache.get(query)
        if existing:
            results["already_cached"] += 1
            continue

        try:
            response = client.messages.create(
                model=model,
                max_tokens=max_tokens,
                messages=[{"role": "user", "content": query}],
            )
            answer = response.content[0].text
            cache.set(query, answer)
            results["warmed"] += 1
            print(f"Warmed: '{query[:60]}...' -> {len(answer)} chars")
        except Exception as e:
            # Keep going: one failed seed should not abort the whole warm run
            results["failed"] += 1
            print(f"Failed to warm '{query[:40]}...': {e}")

    print(f"\nCache warm results: {results}")
    return results


# Typical seed queries for a documentation assistant
SEED_QUERIES = [
    "How do I install the SDK?",
    "What are the pricing plans?",
    "How do I authenticate API requests?",
    "What is the rate limit for my plan?",
    "How do I handle errors in the API?",
    "What models are available?",
    "How do I cancel my subscription?",
    "Where can I find my API key?",
    "Is there a free trial?",
    "What are the supported programming languages?",
]
```

## Semantic Caching at the Gateway Layer vs Application Layer

Semantic caching can be implemented at two layers: within each application service, or centrally at the gateway layer. The choice has significant implications for effectiveness.

| Property | Application-layer cache | Gateway-layer cache |
|---|---|---|
| Scope | One service's queries | All services, all users |
| Hit rate | Lower - only one service's queries | Higher - shared across all users |
| Configuration | Per-service code change | Central config change |
| Invalidation | Per-service operation | One operation covers all |
| PII awareness | Service has context | Gateway needs hints via headers |
| Setup effort | Easy for one service | Harder - requires gateway deployment |

The gateway-layer cache provides dramatically higher hit rates because the cache is shared across all users of all services. When user A asks "How do I install the SDK?" and the response is cached, user B asking "What are the installation steps?" will hit the cache - even though they are different users from different services. Application-layer caches only serve the same user across sessions or users of the same specific service instance.

For production multi-service architectures, always implement semantic caching at the gateway layer. The per-service approach only makes sense for single-service systems or when gateway-layer caching is not feasible.

## Summary: Semantic Caching in Production

Semantic caching is the highest-ROI LLM optimization available for FAQ and documentation workloads. A correctly tuned cache (threshold 0.92–0.95) achieves 30–50% hit rates, reducing LLM costs by roughly the same proportion when cached and uncached queries cost about the same, with no visible quality degradation. The key operational decisions are:

- **Threshold calibration**: empirical, not intuitive - build a labeled evaluation set
- **Backing store**: Redis for small to medium scale, RediSearch with HNSW index for millions of entries
- **Invalidation strategy**: match the strategy to the knowledge change frequency
- **PII bypass**: non-negotiable - any query containing user-identifying information must bypass the cache
- **Cost measurement**: track cost avoidance (hits × avg_miss_cost) not just hit rate - this is the number finance understands

The semantic cache is a shared infrastructure component, not an application feature. It belongs in the gateway layer where it benefits every service and every user simultaneously.

## Semantic Cache Architecture Diagram

```mermaid
flowchart TD
    Q["User query<br/>'How do I install the Python SDK?'"]:::primary

    E["Embedding model<br/>text-embedding-3-small<br/>1536-dim vector (default)"]:::neutral

    VS["Vector store<br/>Redis with RediSearch<br/>HNSW index"]:::primary

    Sim{"Cosine similarity<br/>above 0.95 threshold?"}:::warning

    Hit["Cache hit<br/>Return cached response<br/>in milliseconds"]:::success
    Miss["Cache miss<br/>Call LLM provider"]:::neutral

    LLM["LLM provider call<br/>claude-sonnet-4-6"]:::primary
    Store["Store in vector index<br/>embedding + response + metadata"]:::success
    Resp["Return response<br/>to caller"]:::success

    Q --> E
    E --> VS
    VS --> Sim
    Sim -->|Yes| Hit
    Hit --> Resp
    Sim -->|No| Miss
    Miss --> LLM
    LLM --> Store
    Store --> Resp

    classDef primary fill:#dbeafe,stroke:#2563eb,color:#1e3a5f
    classDef success fill:#dcfce7,stroke:#16a34a,color:#14532d
    classDef warning fill:#fef9c3,stroke:#ca8a04,color:#713f12
    classDef neutral fill:#f3f4f6,stroke:#6b7280,color:#111827
```

The embedding step runs on every request (both hits and misses). For this reason, the embedding model must be fast and cheap - the entire benefit of caching is eliminated if the embedding call is slower than the LLM call it avoids. OpenAI text-embedding-3-small ($0.02/1M tokens, ~10ms latency) is a common choice for production semantic caches.
