AI Infrastructure

Semantic Cache

Semantic cache is a caching technique where LLM responses are stored and retrieved based on semantic similarity between queries rather than exact string matching — dramatically reducing redundant LLM calls and API costs when users ask questions that mean the same thing in different words.

Every time a user sends a question to an LLM-powered application, the model processes that query from scratch — consuming tokens, spending time, and burning API budget. A user asking “What’s the capital of France?” and another user asking “Which city is France’s capital?” will both trigger separate, full LLM calls, even though these questions are semantically identical and would yield identical answers. Semantic cache is the infrastructure layer that eliminates this redundancy at scale.

Semantic cache is a caching mechanism that stores LLM responses (or any compute-intensive outputs) indexed by the meaning of the query rather than its exact text. When a new query arrives, it is converted into a vector embedding and compared against previously cached query embeddings using cosine similarity. If a match above a configured similarity threshold is found, the stored response is returned instantly — no LLM call is made. Only novel, semantically distinct queries reach the model.

How Semantic Cache Differs from Traditional Cache

Traditional HTTP or key-value caching relies on exact string matching: the incoming request must be byte-for-byte identical to a previously cached request. This works well for static assets, database query results, or API calls with deterministic inputs. For natural language, it fails almost entirely — users rephrase the same questions constantly, and a cache hit rate of 1–2% provides negligible value.

Semantic cache replaces the key lookup with a vector similarity search:

Traditional CacheSemantic Cache
Match criterionExact string equalityCosine similarity ≥ threshold
Key typeRaw string / hashEmbedding vector
Cache hit rate (NL)~1–3%30–70% (domain dependent)
Lookup costO(1) hash lookupANN search (~milliseconds)
Storage backendRedis, MemcachedVector DB + scalar store
Suitable forStatic/deterministic queriesConversational AI, search

In an LLM context, even a 30% cache hit rate can cut API spend by a third while reducing median response latency from multiple seconds to under 50 milliseconds for cached queries.

How Semantic Cache Works

The lifecycle of a query through a semantic cache involves five stages:

flowchart TD
    classDef default fill:#ffffff,stroke:#4338CA,stroke-width:2px,color:#0F172A
    classDef user fill:#EEF0F7,stroke:#0D9488,stroke-width:2px,color:#0F172A
    classDef system fill:#4338CA,stroke:#4338CA,stroke-width:2px,color:#ffffff
    classDef decision fill:#F7F8FC,stroke:#6366F1,stroke-width:2px,color:#0F172A
    classDef hit fill:#0D9488,stroke:#0D9488,stroke-width:2px,color:#ffffff

    A([User Query]):::user
    B(Embedding Model\nQuery → Vector):::system
    C{ANN Search\nin Vector Store}:::decision
    D{Similarity ≥\nThreshold?}:::decision
    E([Return Cached Response\n< 50ms]):::hit
    F(Invoke LLM\nGenerate Response):::system
    G(Store Query Vector\n+ Response in Cache):::system
    H([Return Response\nto User]):::user

    A --> B --> C --> D
    D -->|Yes — Cache HIT| E
    D -->|No — Cache MISS| F --> G --> H
    E --> H

1. Embedding: The incoming query string is passed to an embedding model (e.g., OpenAI text-embedding-3-small, bge-m3, or an ONNX-based model for zero-latency inference). This produces a dense vector of fixed dimensionality representing the query’s semantic meaning.

2. Approximate Nearest Neighbour (ANN) Search: The query vector is searched against all previously cached query vectors using an ANN index — typically FAISS, HNSW (via Qdrant, Weaviate, or pgvector), or Redis’s vector search layer. This returns the closest matching cached queries along with their similarity scores.

3. Threshold Evaluation: The top-ranked result’s similarity score is compared against a configured threshold (commonly in the range of 0.85–0.95 cosine similarity). If the score clears the threshold, the cached response is returned immediately. If not, the query proceeds to the LLM.

4. LLM Invocation (cache miss path): On a miss, the request is sent to the LLM as normal. The generated response is received and displayed to the user.

5. Cache Population: The new query vector and its LLM response are written back to the cache — the vector to the ANN index and the response text to a scalar store (SQLite, PostgreSQL, or Redis). Future semantically similar queries will now find this entry.

The Similarity Threshold — The Critical Tuning Parameter

The similarity threshold is the most consequential hyperparameter in a semantic cache deployment:

  • Threshold too high (e.g., > 0.98): Only near-identical phrasing triggers a cache hit. Hit rate drops toward traditional caching’s ~1–3%. The cache provides minimal benefit.
  • Threshold too low (e.g., < 0.75): Semantically distant queries match each other. A question about “Python the programming language” might return a cached answer about “Python the snake.” Response quality degrades unacceptably.
  • Optimal range (0.85–0.92): Captures genuine paraphrases — “How do transformers work?” and “Explain the transformer architecture” — while rejecting thematically adjacent but distinct queries. This range must be tuned empirically per domain and use case.

The right threshold is application-specific. A customer support bot with a narrow FAQ domain can safely use 0.88. A general-purpose assistant with open-ended queries may need 0.92 to avoid false positives.

Cache Invalidation and TTL

Unlike static caches where staleness is the primary concern, semantic caches face two distinct invalidation challenges:

Time-based invalidation (TTL): Information becomes outdated. A cached response to “What’s the latest version of PyTorch?” is stale the moment a new release drops. Production deployments assign a TTL (time-to-live) to each cache entry — commonly 24 hours to 7 days — after which the entry is evicted and the next matching query re-hits the LLM.

Context-aware invalidation: Responses that depend on user-specific context — account status, conversation history, real-time data — must not be served from a shared semantic cache. Semantic cache is most safely applied to queries whose ideal responses are user-agnostic and temporally stable: definitions, explanations, factual questions, and template-style responses.

Key Implementations

Bifrost (open source): An LLM gateway built around a dual-layer caching architecture — an exact hash lookup runs first (nanosecond cost, zero false positives), and on a miss, a vector similarity search fires against the semantic index. The hash layer catches identical queries instantly; the vector layer catches paraphrases. Bifrost reports an overhead of 11µs for the combined cache lookup path, making it one of the lowest-latency options for high-throughput deployments. It supports Weaviate, Redis, and Qdrant as the backing vector store and works across any LLM provider. The dual-layer design is particularly suited to applications that receive a mix of exact repeats (e.g., a fixed command vocabulary) and near-duplicate natural language queries, since neither layer alone handles both efficiently.

LiteLLM: A lightweight Python proxy and SDK that provides a unified interface across 100+ LLM providers (OpenAI, Anthropic, Bedrock, Azure, Cohere, and more) and includes built-in semantic caching via Redis or Qdrant. Configuring semantic cache in LiteLLM is a one-line addition to the proxy config.yaml — the proxy handles embedding, similarity search, and cache read/write transparently before the request reaches any provider. Because LiteLLM sits at the proxy layer, caching applies uniformly across all providers and all callers without any application-side changes. This makes it the natural choice for platform teams standardising LLM access across multiple teams and models.

Kong AI Gateway (v3.8+, enterprise): Kong’s AI Gateway plugin suite added a semantic caching plugin in version 3.8, using Redis vector search as the backing store. The plugin operates at the API gateway layer — upstream of the application entirely — meaning semantic cache behaviour is enforced as infrastructure policy rather than application code. Cache scope, TTL, similarity threshold, and Redis connection are all configured as Kong plugin settings. For enterprises already running Kong for API management, this is the path of least resistance to organisation-wide semantic caching with centralised observability and no per-application integration work.

Upstash Semantic Cache (2024): The fastest-growing managed semantic cache in the LLM ecosystem as of 2024–2025. Upstash provides a serverless vector database with native LangChain integration via UpstashSemanticCache. Zero infrastructure to manage — the vector index scales automatically and billing is per-request. The score_threshold parameter sets the cosine similarity cutoff directly in the constructor, making it the recommended starting point for teams using LangChain who want production semantic caching without operating a vector database.

GPTCache (Zilliz, open source): The most widely adopted open-source semantic caching library. Supports multiple embedding backends (OpenAI, Hugging Face, ONNX), multiple vector stores (FAISS, Milvus, Qdrant, pgvector), and multiple scalar stores (SQLite, MySQL, PostgreSQL). Integrates as a drop-in wrapper around the OpenAI client. Best suited for teams who need full control over the embedding pipeline and storage layer.

Redis Semantic Cache (via RedisVL): Redis’s vector search capabilities (added in Redis 7.2) enable a semantic cache layer within a single Redis instance. RedisVL’s SemanticCache class manages embedding, ANN search, and TTL-based eviction — minimising infrastructure footprint for teams already running Redis in production. Redis 8.0 (2024) improved HNSW index performance significantly, reducing vector search latency at scale.

LangChain InMemoryCache and SQLiteSemanticCache: LangChain’s caching layer (updated extensively in 2024 with the LangChain v0.2 / v0.3 releases) offers both in-memory semantic caching for development and SQLiteSemanticCache for lightweight persistent caching without a dedicated vector store. LangChain also provides CacheBackedEmbeddings — a wrapper that caches embedding model calls rather than LLM calls, reducing costs when the same documents are embedded repeatedly.

Momento Semantic Cache: A fully managed, serverless semantic cache with a simple set / get API. Momento handles the vector index, similarity search, and TTL automatically. Best for teams who want minimal operational overhead and do not need to control the underlying embedding or ANN strategy.

Zep: An open-source memory and cache layer designed for long-running LLM agents. Zep’s semantic cache stores session-specific LLM interactions indexed by meaning, enabling agents to recall previous answers to similar sub-questions without re-running expensive chains. Zep Cloud (2024) extended this with hosted infrastructure and automatic memory summarisation for long-running agent sessions.

Semantic Router (Aurelio AI, 2024): While not a cache in the traditional sense, Semantic Router is a complementary technology that sits upstream of the LLM and uses vector similarity to route queries to predefined responses or different handlers — functioning as a rule-based semantic cache for structured intents. For FAQ-style queries with known expected answers, Semantic Router can intercept and respond without embedding-to-cache lookup latency.

Architectural Patterns in Production

flowchart LR
    classDef default fill:#ffffff,stroke:#4338CA,stroke-width:2px,color:#0F172A
    classDef user fill:#EEF0F7,stroke:#0D9488,stroke-width:2px,color:#0F172A
    classDef system fill:#4338CA,stroke:#4338CA,stroke-width:2px,color:#ffffff
    classDef store fill:#F7F8FC,stroke:#6366F1,stroke-width:2px,color:#0F172A

    U([User / Client]):::user
    LB[API Gateway /\nLoad Balancer]:::default
    SC[Semantic Cache\nLayer]:::system
    EMB[Embedding\nService]:::system
    VDB[(Vector Store\nFAISS / Qdrant)]:::store
    KV[(Scalar Store\nRedis / Postgres)]:::store
    LLM[LLM Provider\nOpenAI / Anthropic]:::system

    U --> LB --> SC
    SC --> EMB
    EMB --> VDB
    VDB -->|Cache HIT| KV
    KV -->|Response| SC
    SC -->|Cache MISS| LLM
    LLM -->|Store + Respond| SC
    SC --> U

In high-throughput production systems, the semantic cache operates as a sidecar or proxy layer in front of the LLM gateway. The embedding service is typically co-located with the cache layer (or uses a lightweight ONNX model to run in-process) to keep pre-cache latency under 10ms. Vector search on a warm FAISS or HNSW index typically completes in 5–20ms for caches containing up to one million entries. The total overhead of a cache hit is thus well under 50ms — compared to 1–10+ seconds for a cold LLM call.

Benefits and Trade-offs

Benefits:

  • Cost reduction: Every cache hit eliminates 100% of the token cost for that query. At high traffic volumes, this compounds into 30–60% reductions in LLM API spend for typical conversational applications.
  • Latency improvement: Sub-50ms responses for cache hits versus seconds for LLM generation — improving user experience and enabling real-time applications.
  • Throughput scaling: Cache hits bypass rate limits imposed by LLM providers, allowing the application to serve more concurrent users.
  • Consistency: Users asking similar questions receive the same vetted, high-quality answer rather than stochastic LLM outputs.

Trade-offs:

  • Threshold sensitivity: An incorrectly tuned threshold is either too permissive (wrong answers) or too restrictive (no cache benefit). Requires domain-specific calibration.
  • Freshness risk: Cached responses go stale. Requires explicit TTL management and monitoring for domain drift.
  • Context contamination: User-specific or session-specific responses must be excluded from the shared cache, requiring careful routing logic.
  • Cold start: A new deployment has an empty cache. Benefits accumulate gradually as traffic fills the cache — cache warming strategies (pre-populating from FAQ datasets) can mitigate this.

Semantic Cache vs. Provider-Level Prompt Caching

A common source of confusion in 2024–2025 is conflating semantic cache with provider-level prompt caching — two distinct techniques that operate at different layers of the stack and solve different problems.

Provider-level prompt caching (offered by Anthropic since August 2024, and by OpenAI and Google in similar forms) caches the KV (key-value) attention states computed during the prefill phase of LLM inference for a fixed prefix of the prompt. When the same prefix appears in subsequent requests, the model skips recomputing attention for that prefix, saving tokens and reducing latency. Anthropic’s prompt caching charges 10% of input token cost for cache hits versus full price for cache misses, with a 5-minute TTL extended on each read.

Semantic CachePrompt Caching (e.g., Anthropic)
What is cachedThe complete LLM responseKV attention states for a prompt prefix
Match typeSemantic similarity (embedding cosine)Exact byte-level prefix match
Operates atApplication layer (your infrastructure)Provider inference layer
Cache hit resultEntire response returned, no LLM callLLM still runs, but prefix is cheaper
Best forRepeated semantic queries across usersLong system prompts, few-shot examples
Latency on hit< 50ms (no LLM involved)Reduced but still full generation time
RequiresYour own vector DB + embedding modelAPI flag (cache_control in Anthropic)

The two techniques are complementary, not competing. A production deployment might use provider-level prompt caching to reduce the cost of a long system prompt that appears in every request, while simultaneously using semantic cache at the application layer to skip the LLM call entirely for semantically repeated user queries. Together, they address different cost and latency levers in the same pipeline.

When to Use Semantic Cache

Semantic cache delivers the most value in:

  • Customer support chatbots — high query volume, narrow domain, repetitive intent patterns
  • Internal knowledge base assistants — employees ask similar HR, policy, and onboarding questions repeatedly
  • Developer documentation bots — “How do I install X?”, “What does function Y do?” — high overlap between queries
  • RAG pipelines — caching not just LLM responses but also retrieval results, avoiding redundant vector DB queries for common topics

It provides minimal value for personalised, real-time, or highly context-dependent applications where every query is genuinely unique and must reflect individual state.

Semantic Cache with LangChain + Upstash Vector (2025)

python
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_community.cache import UpstashSemanticCache
from langchain_core.globals import set_llm_cache
import os

# Upstash Vector is a serverless managed vector DB — no infra to manage
set_llm_cache(
    UpstashSemanticCache(
        url=os.environ["UPSTASH_VECTOR_REST_URL"],
        token=os.environ["UPSTASH_VECTOR_REST_TOKEN"],
        score_threshold=0.90,   # cosine similarity threshold for a cache hit
    )
)

llm = ChatOpenAI(model="gpt-4o-mini")

# First call — cache MISS, LLM is invoked (~1–3s)
r1 = llm.invoke("What is the transformer architecture in deep learning?")
print(r1.content)

# Second call — semantically similar phrasing, cache HIT (<50ms, zero tokens)
r2 = llm.invoke("Can you explain how transformers work in neural networks?")
print(r2.content)  # Returns cached response from r1

# Check response metadata to confirm cache hit
# LangChain attaches llm_output cache metadata when the cache is active

Ready to build?

Leverage AI technologies to build your product stack

Superteams can help you build, deploy and launch AI application stacks using open source technologies — from architecture through to production.

Talk to Superteams