Sigmoda Blog
Production RAG observability: a retrieval health playbook
RAG makes LLMs useful on private and fast-changing data by retrieving external context at runtime. The catch is that retrieval becomes a production dependency. When retrieval degrades, the model still produces confident answers, and the regression is easy to miss until users complain.
This post is a practical playbook for teams running RAG in production: what to log, which metrics surface failures early, and what to do when retrieval drifts.
RAG turns retrieval into your failure point
RAG combines a generator with a retriever. The retriever supplies relevant passages, and the model uses them to answer. When retrieval is wrong, stale, or empty, the generator has nothing trustworthy to ground on. That is why RAG systems need observability at the retrieval layer, not just at the model layer.
The failure modes to expect
Most RAG issues are not exotic. They are boring, recurring failure modes that you can detect with the right signals:
- Empty retrieval: the retriever returns no chunks or only boilerplate.
- Wrong retrieval: relevant docs exist, but the top-k is off target.
- Stale retrieval: the answer is correct for last month, not today.
- Chunking failures: the right doc exists, but the chunk boundary cuts out the key sentence.
- Context overload: too much context causes truncation or model confusion.
- Index drift: a new embedder or reindex silently changes ranking quality.
Log the retrieval step like a dependency
If you cannot answer "what did we retrieve and why" for any request, you cannot debug RAG. Start by logging a compact retrieval summary with every event.
// Example metadata attached to an LLM event
{
"route": "support.reply",
"env": "prod",
"model": "gpt-5-mini",
"retrieval": {
"index_version": "kb-2026-02-01",
"embedder": "embedding-model-v1",
"k": 6,
"topk_scores": [0.86, 0.83, 0.81, 0.78, 0.74, 0.71],
"doc_ids": [
"kb/returns.md#p12",
"kb/returns.md#p13",
"kb/shipping.md#p2"
],
"context_tokens": 920,
"retrieval_ms": 48
}
}
You do not need to store full documents. IDs, scores, versions, and token counts are usually enough to diagnose issues without bloating storage.
Non-negotiable
Always log the index version and embedder version. Otherwise you will never prove whether a retrieval regression came from a code change or a data change.
The metrics that catch regressions early
Track a small set of retrieval health signals. These are cheap to compute and surface the majority of failures before users do.
- Empty retrieval rate: % of requests with zero chunks or only boilerplate.
- Low-score rate: % of requests where top-1 score is below a threshold.
- Context budget overflow: % of requests where retrieved tokens exceed the budget and are truncated.
- Source diversity: how many unique doc IDs appear in a day per route (drops often signal index issues).
- Stale source rate: % of retrieved docs older than N days for time-sensitive routes.
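These signals can be computed directly from the retrieval metadata logged above. Here is a minimal sketch in Python; the field names match the example event, while the `low_score_threshold` and `context_budget` values are illustrative placeholders you would tune per route.

```python
# Sketch: compute retrieval health signals from logged events.
# Each event is assumed to carry the "retrieval" metadata shown earlier;
# the 0.5 score threshold and 4096-token budget are illustrative, not tuned.

def retrieval_health(events, low_score_threshold=0.5, context_budget=4096):
    """Return the share of events hitting each failure signal."""
    n = len(events)
    if n == 0:
        return {}
    empty = sum(1 for e in events if not e["retrieval"]["doc_ids"])
    low_score = sum(
        1 for e in events
        if e["retrieval"]["topk_scores"]
        and e["retrieval"]["topk_scores"][0] < low_score_threshold
    )
    overflow = sum(
        1 for e in events if e["retrieval"]["context_tokens"] > context_budget
    )
    # Source diversity: unique doc IDs seen across the window.
    unique_docs = {d for e in events for d in e["retrieval"]["doc_ids"]}
    return {
        "empty_retrieval_rate": empty / n,
        "low_score_rate": low_score / n,
        "context_overflow_rate": overflow / n,
        "source_diversity": len(unique_docs),
    }
```

Slice these rates by `index_version` and `embedder` so a regression can be attributed to a data change versus a code change.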
For offline evaluation, add rank-aware retrieval metrics. MRR and nDCG help you see whether relevant docs appear near the top of the ranked list.
Build a tiny evaluation set
You do not need a giant benchmark to catch drift. A small, curated set of queries per route is enough to detect breakage.
- Pick 30 to 50 real user queries per route (redact if needed).
- Label the 1 to 3 most relevant docs for each query.
- Run retrieval nightly and track MRR and nDCG.
- Gate index changes on this set before you deploy.
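The nightly run only needs the two rank-aware metrics mentioned above. This is a minimal sketch with binary relevance labels; the function names and the `(ranked, relevant)` input shape are assumptions for illustration.

```python
import math

# Sketch: MRR and nDCG over a small labeled eval set.
# `ranked` is the retriever's doc IDs in rank order; `relevant` is the
# hand-labeled set of 1-3 relevant doc IDs for that query.

def reciprocal_rank(ranked, relevant):
    """1/rank of the first relevant doc, or 0 if none is retrieved."""
    for i, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            return 1.0 / i
    return 0.0

def ndcg(ranked, relevant, k=10):
    """Binary-relevance nDCG@k: DCG of this ranking over the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(i + 1)
        for i, doc_id in enumerate(ranked[:k], start=1)
        if doc_id in relevant
    )
    ideal = sum(
        1.0 / math.log2(i + 1)
        for i in range(1, min(len(relevant), k) + 1)
    )
    return dcg / ideal if ideal else 0.0

def eval_retrieval(queries):
    """queries: list of (ranked_doc_ids, relevant_doc_ids) pairs."""
    mrr = sum(reciprocal_rank(r, rel) for r, rel in queries) / len(queries)
    mean_ndcg = sum(ndcg(r, rel) for r, rel in queries) / len(queries)
    return {"mrr": mrr, "ndcg": mean_ndcg}
```

Track both numbers per route and per index version; a drop on an unchanged eval set is the earliest reliable sign of index drift.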
Add a lightweight grounding check
Grounding checks do not need to be perfect. The goal is to detect when the answer is untethered from retrieved context.
- Require the model to cite doc IDs or section titles for claims.
- Check for overlap between answer spans and retrieved text.
- If retrieval is empty, force a safe fallback or a clarifying question.
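The overlap check can be as simple as lexical overlap between the answer and the retrieved chunks. This sketch is intentionally crude; the 0.3 threshold and the tiny stopword list are illustrative assumptions, and an embedding-based overlap would be a natural upgrade.

```python
# Sketch: a crude lexical grounding check. Flags answers whose content
# words barely overlap the retrieved text. Threshold and stopwords are
# illustrative, not tuned values.

STOPWORDS = {"the", "a", "an", "is", "are", "to", "of", "and", "in", "for"}

def content_words(text):
    """Lowercased alphabetic tokens minus stopwords."""
    return {w for w in text.lower().split() if w.isalpha() and w not in STOPWORDS}

def grounding_score(answer, retrieved_chunks):
    """Fraction of the answer's content words found in any retrieved chunk."""
    answer_words = content_words(answer)
    if not answer_words:
        return 0.0
    context_words = set()
    for chunk in retrieved_chunks:
        context_words |= content_words(chunk)
    return len(answer_words & context_words) / len(answer_words)

def is_grounded(answer, retrieved_chunks, threshold=0.3):
    # Empty retrieval should never pass: force the fallback path instead.
    if not retrieved_chunks:
        return False
    return grounding_score(answer, retrieved_chunks) >= threshold
```

A check like this will miss paraphrases and flag some fine answers, but it is cheap enough to run on every request and catches the worst untethered outputs.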
When metrics regress, do this first
Treat retrieval regressions like outages. The fastest fixes are usually operational, not magical.
- Rollback the index version or embedder to the last known good.
- Clamp context length and reduce k to stabilize latency.
- Re-embed recent documents if freshness is the issue.
- Adjust chunking or overlap before you touch prompts.
Do not guess
If retrieval metrics are degraded, do not mask it with prompt changes. Fix retrieval first or you will ship a fragile workaround.
A one-page checklist
## RAG observability checklist
- Log: index_version, embedder, k, topk_scores, doc_ids, retrieval_ms
- Monitor: empty_retrieval_rate, low_score_rate, context_overflow_rate
- Eval: MRR + nDCG on a small labeled set
- Guardrail: empty retrieval -> fallback or clarify
- Rollback: keep a last-known-good index for fast recovery## RAG observability checklist
- Log: index_version, embedder, k, topk_scores, doc_ids, retrieval_ms
- Monitor: empty_retrieval_rate, low_score_rate, context_overflow_rate
- Eval: MRR + nDCG on a small labeled set
- Guardrail: empty retrieval -> fallback or clarify
- Rollback: keep a last-known-good index for fast recoveryRAG works because retrieval keeps the model grounded in reality. If you observe retrieval like a first class system, you get predictable quality, faster debugging, and fewer late-night surprises.