Replies: 4 comments 4 replies
@terrywerk One option for this type of monitoring is Langfuse's LLM-as-a-judge evaluation: https://langfuse.com/docs/evaluation/evaluation-methods/llm-as-a-judge For prompt injection, Langflow 1.8 already ships a native Guardrail Component that can help you block prompt-injection attempts.
From my point of view, "hallucination rate" is too broad to be useful unless you break it into narrower failure classes. I usually separate unsupported claims, retrieval misses, wrong transformations, and cases where the model should have abstained but did not. Those behave very differently in production, and a single aggregate number tends to hide the real problem. What has worked best for me is a small human-labeled eval set for calibration, plus online sampling where I log retrieval coverage, citation support, and abstention behavior. That gives a much more stable signal than asking for one global hallucination score.
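A minimal sketch of that per-class breakdown, assuming each sampled response has been hand-labeled with zero or more failure classes (the label names and data shape here are illustrative, not from any particular tool):

```python
from collections import Counter

FAILURE_CLASSES = [
    "unsupported_claim",      # answer asserts something the context doesn't support
    "retrieval_miss",         # the right document never reached the prompt
    "wrong_transformation",   # correct source, botched summarization/calculation
    "missed_abstention",      # model should have said "I don't know" but answered
]

def failure_rates(labeled_samples):
    """Per-class failure rate over a labeled eval sample.

    labeled_samples: list of dicts like {"id": ..., "labels": ["retrieval_miss"]};
    an empty label list means the response was judged fine.
    """
    counts = Counter()
    for sample in labeled_samples:
        for label in sample["labels"]:
            counts[label] += 1
    n = len(labeled_samples)
    return {cls: counts[cls] / n for cls in FAILURE_CLASSES}

samples = [
    {"id": 1, "labels": []},
    {"id": 2, "labels": ["retrieval_miss"]},
    {"id": 3, "labels": ["unsupported_claim", "missed_abstention"]},
    {"id": 4, "labels": ["unsupported_claim"]},
]
rates = failure_rates(samples)
# Here a single aggregate "hallucination rate" would be 0.75 and tell you
# nothing; the per-class rates show unsupported claims dominate.
```

The point of keeping the classes separate is that the fixes differ: retrieval misses are an index/chunking problem, while missed abstentions are usually a prompting problem.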
Hallucination rate depends a lot on how the prompt is structured. One pattern that helps: separate the "retrieval context" from the "task instructions" with explicit XML-tagged blocks. Models like Claude treat tagged sections differently, so mixing retrieved docs into a prose prompt increases hallucination. For measuring it in prod, I log the structured prompt + response, then run a lightweight eval prompt asking "did the answer introduce any claim not present in [context]?" That works well as a cheap automated check before human review. I built flompt.dev for the prompt-structuring side of this: a visual builder that decomposes prompts into semantic blocks and compiles them to Claude-optimized XML. It keeps the prompt shape consistent across runs. github.com/Nyrok/flompt
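To illustrate the shape of that check: the comment describes an LLM eval prompt as the judge, but the same routing logic can be sketched with a naive lexical-overlap stand-in (a deliberate simplification, not the actual method — good enough to show how flagged sentences get routed to human review):

```python
import re

def ungrounded_sentences(context: str, answer: str, threshold: float = 0.5):
    """Flag answer sentences whose content words mostly don't appear in the context.

    A crude lexical stand-in for the eval prompt "did the answer introduce
    any claim not present in [context]?" -- useful for routing responses
    to human review, not for scoring them.
    """
    context_words = set(re.findall(r"[a-z0-9]+", context.lower()))
    flagged = []
    for sentence in re.split(r"(?<=[.!?])\s+", answer.strip()):
        words = re.findall(r"[a-z0-9]+", sentence.lower())
        if not words:
            continue
        support = sum(w in context_words for w in words) / len(words)
        if support < threshold:
            flagged.append(sentence)
    return flagged

context = "The 2023 report says revenue grew 12 percent year over year."
answer = "Revenue grew 12 percent. The CEO also announced a merger with Acme."
print(ungrounded_sentences(context, answer))
# -> ['The CEO also announced a merger with Acme.']
```

In practice you would replace the overlap heuristic with the LLM eval prompt, but the surrounding plumbing (log structured prompt + response, flag, route to review) stays the same.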
This matches what we've seen: prompt structure has a big impact on hallucination behavior, especially when retrieval context and instructions get blended. Explicit separation (XML/sections) tends to reduce unsupported claims and makes downstream evaluation easier. Your production check ("did the answer introduce claims not present in context?") is basically a grounding test; we've been formalizing that as an automated assertion so it can run consistently in CI, not just in post-hoc review. I'm building Veritell CLI to turn these kinds of checks into repeatable tests (e.g., unsupported claims, retrieval coverage, abstention). You define them once and run veritell test to catch regressions when prompts or models change. flompt looks interesting for keeping prompt shape stable; that's often half the battle. Happy to compare notes if you're testing against real RAG workloads.
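To make the "automated assertion in CI" idea concrete, here is roughly what an abstention check could look like as plain assertions over a fixture set (this is not Veritell's actual API; the helper and fixture names are made up for illustration):

```python
def should_abstain(answer: str) -> bool:
    """Heuristic: does the answer decline rather than fabricate?
    Illustrative only; a real suite would use a calibrated judge."""
    markers = ("i don't know", "not enough information", "cannot answer")
    return any(m in answer.lower() for m in markers)

# Fixture cases: when retrieval returns nothing relevant, the model must abstain.
cases = [
    {"context": "", "answer": "I don't know based on the provided documents."},
    {"context": "", "answer": "The answer is 42."},  # fabricated despite empty context
]

failures = [c for c in cases if c["context"] == "" and not should_abstain(c["answer"])]
print(len(failures))  # number of abstention regressions caught
```

Running this kind of fixture set on every prompt or model change is what turns a one-off grounding spot-check into a regression test.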
We've been experimenting with stress-testing LLM systems for hallucinations and prompt injection. Curious how people here measure hallucination rates in production systems?
Thanks!
Terry