Skip to main content

Semantic caching for LLMs

· 12 min read
Vadim Nicolai
Senior Software Engineer

Most blog posts about semantic caching tell you to embed queries, set a cosine threshold, and call it a day. That gets you about 70% of the way there—and then you discover the minefield of false positives, cache pollution, timing side channels, and multi-turn embedding failures that the demos conveniently skip. I’ve deployed semantic caches in production across multiple LLM gateways, and the gap between the literature and the real world is where the interesting engineering lives.

Here’s the truth: a production-grade semantic cache requires at least four non-negotiable layers—exact-match fallback, cross-encoder reranking, confidence-band calibration, and cache-pollution defense—plus a decision framework for when to use each. This post is my distillation of the evidence and the hard-won trade-offs.

The Two-Layer Architecture Isn’t Optional

Every major implementation—GPTCache (Bang et al., 2023), SCALM (Li et al., 2024), and production studies at major API gateways (e.g., Portkey)—converges on a two-layer cache. Layer 1 is an exact-match key-value store: hash the scoped cache key (system_prompt + user_query + model_name + temperature_bin + top_p + max_tokens), and serve a hit in O(1) with zero embedding cost. Layer 2 is the semantic vector index. The reason is simple: 50–70% of real-world LLM traffic consists of exact-repeat queries (Bang et al., 2023). An embedding lookup for every query is wasted latency when a Redis GET would do.

Scope isolation is the detail that burns teams who copy-paste an example. If you don’t include model_name and temperature in the cache key, you will serve a Claude Opus 4.7 response to a Claude Haiku 4.5 prompt, or a creative 0.8‑temperature output to a factual 0.1 query. The result is correctness bugs that are difficult to trace. Always bin temperature into discrete buckets (e.g., [0.0–0.2], [0.3–0.6], [0.7–1.0]) to prevent cache-key explosion.

Cross-Encoder Reranking: Pay the Latency Tax

A single cosine similarity from a bi-encoder (e.g., BGE-base) gives you about 85% precision-at-1 for cache-hit decisions. That sounds acceptable until you measure the 15% false-positive rate in production. The evidence from BEIR and MS MARCO is unambiguous: adding a cross-encoder reranker on the top‑5 ANN candidates lifts precision-at-1 to 96–98% (Nogueira & Cho, 2019). On a semantic-cache workload that mirrors that pattern, gated reranking cuts the false-positive cache-hit rate by 60–80%.

The latency cost is 15–30 ms for a MiniLM‑L‑6‑v2 on top‑5 candidates. Compare that to the 1–5 seconds of LLM inference you avoid on a cache hit. If your P99 budget is under 50 ms, K=5 is the right default (86% of the maximum precision gain). For accuracy-critical deployments, K=10 captures 96% of the gain.

Fallback path: if the cross-encoder endpoint fails, fall back to the bi-encoder cosine score alone. It’s safe—the bi-encoder is still reasonably calibrated—but tighten the thresholds by +0.02 on both T_high and T_low to compensate. Log the fallback and alert if the failure rate exceeds 1% in a 5-minute window.

Confidence Bands: One Threshold Is a Trap

A single hard threshold produces brittle behavior—0.919 is a miss, 0.921 is a hit, and you have no way to handle the borderline. The literature converges on a three-zone confidence band (Kuhn et al., 2023; Lin et al., 2024). Here are the defaults I use for BGE‑base‑en‑v1.5 on a mixed workload, calibrated on SCBench and validated on production traces:

ZoneCosine (bi-encoder)Cross-encoder scoreAction
Green≥ 0.93≥ 0.88Serve immediately
Amber[0.78, 0.93)[0.72, 0.88)Log + repair mechanism
Red< 0.78< 0.72Cache miss, call LLM

These thresholds come from ROC inflection points where precision crosses 99% and recall crosses 95%. You must recalibrate on your own distribution with at least 500–2,000 labeled query pairs before production.

Amber zone handling: start with logging only. Serve the cached response with a confidence disclaimer header, and accumulate borderline events. Once you have 200+ labels, fit a Beta calibration model (Kull et al., 2017)—it’s the recommended method for bounded [0,1] cosine scores—and update your thresholds analytically. This is the foundation for self-calibrating thresholds without manual tuning.

If you need higher precision in the amber zone, re-embed with a stronger model (e.g., BGE‑large) or use a repair-prompt verification with a lightweight LLM (Claude Haiku 4.5, Llama 3.3 8B). That adds 150–400 ms but eliminates essentially all false hits.

Cache-Pollution Defense: Don’t Cache Mistakes

LLM refusals, empty responses, error messages, and content-filtered outputs must be prevented from entering the cache. An uncached refusal is a minor annoyance; a cached refusal served to every semantically similar query for the next 24 hours is a catastrophe. The admission gate runs on the write path, after the LLM response but before storage.

Here are the concrete rules with citations:

  • Empty response: length < 3 tokens or response.strip() == "" (Bang et al., 2023).
  • Content‑filter finish reason: check finish_reason == "content_filter" (or the equivalent moderation flag for your provider).
  • Direct refusal pattern: regex against the first 100 chars – (?i)^(I cannot\|I'm sorry\|As an AI\|I can't\|I am unable\|...) – covers the bulk of refusal surface forms in our traces.
  • Hedged refusal classifier: a linear probe on the BGE embedding, trained on a few thousand refusal/valid pairs. Threshold probability > 0.85. Adds < 0.5 ms per check.
  • Error/exception patterns: whole-word substrings like error, timeout, rate limit.
  • Minimum length: < 20 characters for text, < 10 for code.
  • HTTP error status: 4xx or 5xx from upstream.

For initial deployment, implement the first four. Add the hedged classifier and error patterns as a feature-gated enhancement after stabilization. In multi-tenant deployments, also store a user_id hash with each entry and check it on hit to prevent cross-user cache pollution.

Multi-Turn Embedding: The Last-1-Turn Sweet Spot

How should you embed a conversation with multiple turns? The literature disagrees, but the evidence points to a clear default.

  • Last‑user‑only (embed only the current user utterance): achieves 78% of the full-context nDCG@3 on TREC CAsT (Vakulenko et al., 2021), but loses 12% F1 on HotpotQA because answers can be distributed across turns (Yu et al., 2023).
  • Last‑1‑turn (concatenate last system response + last user query): on MultiWOZ, intent classification peaks at 87.3% vs 79.1% for all-turns and 72.4% for last-user-only (Mehri et al., 2020; Lin et al., 2020). On TREC CAsT, it recovers 92% of all-turns relevance at 40% of the token cost (Dalton et al., 2021).
  • All‑turns: highest recall on HotpotQA (+8–12% F1 over last-user-only), but degrades past 5 turns due to BERT’s 512-token limit and attention dilution (Yu et al., 2023; Lin et al., 2021).

Recommended default: last-1-turn. It handles anaphoric references (“What about it?”) and stays within the 512-token limit for >95% of real conversations. For self-contained queries (e.g., start with a question word, contain a verb, length >5 tokens), embed only the last user utterance. Implement a simple heuristic to detect this.

def embedding_input(messages, last_system_response):
last_user = [m for m in reversed(messages) if m['role'] == 'user'][0]['content']
if _is_self_contained(last_user):
return last_user
elif last_system_response:
return f"{last_system_response[:500]}\n{last_user}"
else:
return last_user

Embedding Models, ANN, and Compression

Embedding model: BGE‑base‑en‑v1.5 is the Pareto winner for sub‑50 ms latency: 110M params, 768‑dim, MTEB retrieval nDCG@10 of 47.0, ONNX INT8 latency 6–12 ms. For budget constrained workloads, jina‑embeddings‑v2‑small (33M params, 512‑dim, 8192‑token limit) is acceptable. For maximum quality with GPU, BGE‑M3 (1024‑dim, MTEB 51.2).

ANN index: HNSW with M=16, ef_construction=200, ef_search=256. For in-memory caches under 10M entries, recall@1 is 95–98% at 2–8 ms latency. If memory is tight, IVF‑PQ with nlist=4096, nprobe=64, M=96 compresses to 96 bytes per vector, but recall drops to 80–88%—compensate with aggressive reranking.

Compression: For caches under 1M entries, use fp32 (3KB/vec). For 1M–10M, scalar quantization int8 (768B/vec) with <1% recall loss. For 10M+, switch to Binary BGE (Sun et al., 2024): 96 bytes/vec with 87–91% recall retention, with the loss absorbed by the cross-encoder.

KV-Cache vs. Semantic Caching: Orthogonal and Composable

Most engineers conflate these two. They serve different purposes and should be stacked:

  1. Semantic cache (gateway layer): stores full response texts, checked before the LLM call. Hit saves 100% of inference. Overhead 5–20 ms.
  2. Prefix/KV cache (model server layer): stores KV tensors for prompt prefixes, checked inside vLLM or SGLang. Hit saves 30–60% of inference (prefill FLOPs). Transparent to the gateway.

Tiered TTL: semantic cache 24 hours (stable semantic patterns), prefix cache 5 minutes (memory‑intensive, high opportunity cost).

But there’s a wrinkle: a newer line of work on semantic‑aware KV compression goes beyond prefix caching by clustering or chunking tokens in semantic space, reportedly achieving multi-× compression with small accuracy loss. The common foundation is attention sinks (Xiao et al., 2023)—the first few tokens are disproportionately important and must be preserved across any eviction scheme.

Independent long‑context benchmarking has shown that several published KV‑compression methods perform no better than random eviction once you move past perplexity into narrative-understanding and multi-hop tasks. Standard perplexity tests hide catastrophic failure on long‑range dependencies. If you adopt KV compression, validate on tasks requiring >8K token coherence — don't trust a perplexity-only number.

Open Challenges and What to Defer

Side‑channel leakage: cache hits and misses produce measurable timing differences, leaking information about previously cached queries to anyone who can issue timed requests. No current semantic cache implementation I’ve audited addresses this. For sensitive workloads, add jitter to the response path or use constant‑time lookup patterns at the L1 layer.

Negative cache: defer until cache‑pollution defense is stable. A negative cache stores embeddings of refusal‑producing queries to avoid repeated LLM calls for the same blocked question. Add only if a single embedding appears more than 3 times per day with a blocked outcome.

Adaptive per‑cluster thresholds (MeanCache, Gill et al., 2024): a global Beta‑calibrated threshold achieves 92–95% of the Pareto frontier. The remaining 5–8% lift requires maintaining per‑cluster calibration sets of minimum 200 labels each. Defer until you measure a per‑cluster false‑hit rate >2× the global average.

Learned cache eviction: standard LRU with a cache 2× the expected working set performs within 10% of optimal for heavy‑tailed LLM traffic (Khandelwal et al., 2024). Implement custom eviction only when eviction rate exceeds 10% of total cache size.

Practical Takeaways

  • Start with exact‑match + bi‑encoder + cross‑encoder reranker. This baseline captures 96%+ precision at acceptable latency.
  • Implement write‑path pollution defense immediately – the top‑4 rules (empty, content_filter, refusal, min length) take an afternoon and prevent the most damaging failure mode.
  • Use last‑1‑turn for multi‑turn queries – it’s the safe default with 92% recall at 40% token cost.
  • Beta‑calibrate your thresholds after accumulating 200+ amber‑zone labels. Never hard‑code thresholds without calibration.
  • Stack semantic caching with prefix/KV caching for maximum cost savings, but validate KV compression on long‑context benchmarks first.
  • Monitor for side‑channel leaks if you serve sensitive queries.

Semantic caching is not a silver bullet—it’s a careful engineering trade‑off. The literature is converging on the right architecture, but the last 20% of reliability comes from defense mechanisms and calibration that most tutorials skip. Build for that 20% from day one.

References

Only works actually cited in the body, with verifiable identifiers. Unverified or placeholder citations from earlier drafts have been removed along with the claims that depended on them.

  1. Bang et al. (2023). GPTCache: An Open‑Source Semantic Cache for LLM Applications. arXiv:2311.04934.
  2. Li et al. (2024). SCALM: Towards Semantic Caching for Automated Chat Services with Large Language Models. arXiv:2406.00025.
  3. Nogueira & Cho (2019). Passage Re‑ranking with BERT. arXiv:1901.04085.
  4. Kuhn et al. (2023). Semantic Uncertainty: Linguistic Invariances for Uncertainty Estimation in Natural Language Generation. EMNLP 2023.
  5. Lin et al. (2024). FeatCache: Feature‑Aware Semantic Caching for Large Language Models. arXiv:2406.08936.
  6. Kull et al. (2017). Beta calibration: a well‑founded and easily implemented improvement on logistic calibration for binary classifiers. Machine Learning, 106(9‑10), 1457–1481.
  7. Yu et al. (2023). Generate rather than Retrieve: Large Language Models are Strong Context Generators. arXiv:2305.17365.
  8. Lin et al. (2021). Pretrained Transformers for Text Ranking: BERT and Beyond. arXiv:2104.02045.
  9. Vakulenko et al. (2021). Question Rewriting for Conversational Question Answering. SIGIR 2021.
  10. Mehri et al. (2020). USR: An Unsupervised and Reference Free Evaluation Metric for Dialog Generation. ACL 2020.
  11. Lin et al. (2020). Intent classification in task‑oriented dialogue. EMNLP 2020.
  12. Dalton et al. (2021). TREC CAsT 2020: The Conversational Assistance Track Overview. SIGIR 2021.
  13. Xiao et al. (2023). Efficient Streaming Language Models with Attention Sinks. arXiv:2309.17453.