Replies: 4 comments 4 replies
@terrywerk One option for this type of monitoring is Langfuse's LLM-as-a-judge evaluation: https://langfuse.com/docs/evaluation/evaluation-methods/llm-as-a-judge For prompt injection, Langflow 1.8 already ships a native Guardrail Component that can help you block prompt-injection attempts.
From my point of view, "hallucination rate" is too broad to be useful unless you break it into narrower failure classes. I usually separate unsupported claims, retrieval misses, wrong transformations, and cases where the model should have abstained but did not. Those behave very differently in production, and a single aggregate number tends to hide the real problem. What has worked best for me is a small human-labeled eval set for calibration, plus online sampling where I log retrieval coverage, citation support, and abstention behavior. That gives a much more stable signal than asking for one global hallucination score.
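A minimal sketch of that per-class breakdown, assuming each sampled response has been hand-labeled with zero or more failure classes (the label names and data shape here are illustrative, not from any particular tool):

```python
from collections import Counter

FAILURE_CLASSES = [
    "unsupported_claim",      # answer asserts something the context doesn't support
    "retrieval_miss",         # the right document never reached the prompt
    "wrong_transformation",   # correct source, botched summarization/calculation
    "missed_abstention",      # model should have said "I don't know" but answered
]

def failure_rates(labeled_samples):
    """Per-class failure rate over a labeled eval sample.

    labeled_samples: list of dicts like {"id": ..., "labels": ["retrieval_miss"]};
    an empty label list means the response was judged fine.
    """
    counts = Counter()
    for sample in labeled_samples:
        for label in sample["labels"]:
            counts[label] += 1
    n = len(labeled_samples)
    return {cls: counts[cls] / n for cls in FAILURE_CLASSES}

samples = [
    {"id": 1, "labels": []},
    {"id": 2, "labels": ["retrieval_miss"]},
    {"id": 3, "labels": ["unsupported_claim", "missed_abstention"]},
    {"id": 4, "labels": ["unsupported_claim"]},
]
rates = failure_rates(samples)
# Here a single aggregate "hallucination rate" would be 0.75 and tell you
# nothing; the per-class rates show unsupported claims dominate.
```

The point of keeping the classes separate is that the fixes differ: retrieval misses are an index/chunking problem, while missed abstentions are usually a prompting problem.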
Hallucination rate depends a lot on how the prompt is structured. One pattern that helps: separate the "retrieval context" from the "task instructions" with explicit XML-tagged blocks. Models like Claude treat tagged sections differently, so mixing retrieved docs into a prose prompt increases hallucination. For measuring it in prod, I log the structured prompt + response, then run a lightweight eval prompt asking "did the answer introduce any claim not present in [context]?" That works well as a cheap automated check before human review. I built flompt.dev for the prompt-structuring side of this: a visual builder that decomposes prompts into semantic blocks and compiles them to Claude-optimized XML. It keeps the prompt shape consistent across runs. github.com/Nyrok/flompt
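To illustrate the shape of that check: the comment describes an LLM eval prompt as the judge, but the same routing logic can be sketched with a naive lexical-overlap stand-in (a deliberate simplification, not the actual method — good enough to show how flagged sentences get routed to human review):

```python
import re

def ungrounded_sentences(context: str, answer: str, threshold: float = 0.5):
    """Flag answer sentences whose content words mostly don't appear in the context.

    A crude lexical stand-in for the eval prompt "did the answer introduce
    any claim not present in [context]?" -- useful for routing responses
    to human review, not for scoring them.
    """
    context_words = set(re.findall(r"[a-z0-9]+", context.lower()))
    flagged = []
    for sentence in re.split(r"(?<=[.!?])\s+", answer.strip()):
        words = re.findall(r"[a-z0-9]+", sentence.lower())
        if not words:
            continue
        support = sum(w in context_words for w in words) / len(words)
        if support < threshold:
            flagged.append(sentence)
    return flagged

context = "The 2023 report says revenue grew 12 percent year over year."
answer = "Revenue grew 12 percent. The CEO also announced a merger with Acme."
print(ungrounded_sentences(context, answer))
# -> ['The CEO also announced a merger with Acme.']
```

In practice you would replace the overlap heuristic with the LLM eval prompt, but the surrounding plumbing (log structured prompt + response, flag, route to review) stays the same.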
This matches what we've seen: prompt structure has a big impact on hallucination behavior, especially when retrieval context and instructions get blended. Explicit separation (XML/sections) tends to reduce unsupported claims and makes downstream evaluation easier. Your production check ("did the answer introduce claims not present in context?") is basically a grounding test; we've been formalizing that as an automated assertion so it can run consistently in CI, not just in post-hoc review. I'm building Veritell CLI to turn these kinds of checks into repeatable tests (e.g., unsupported claims, retrieval coverage, abstention). You define them once and run veritell test to catch regressions when prompts or models change. flompt looks interesting for keeping prompt shape stable; that's often half the battle. Happy to compare notes if you're testing against real RAG workloads.
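To make the "automated assertion in CI" idea concrete, here is roughly what an abstention check could look like as plain assertions over a fixture set (this is not Veritell's actual API; the helper and fixture names are made up for illustration):

```python
def should_abstain(answer: str) -> bool:
    """Heuristic: does the answer decline rather than fabricate?
    Illustrative only; a real suite would use a calibrated judge."""
    markers = ("i don't know", "not enough information", "cannot answer")
    return any(m in answer.lower() for m in markers)

# Fixture cases: when retrieval returns nothing relevant, the model must abstain.
cases = [
    {"context": "", "answer": "I don't know based on the provided documents."},
    {"context": "", "answer": "The answer is 42."},  # fabricated despite empty context
]

failures = [c for c in cases if c["context"] == "" and not should_abstain(c["answer"])]
print(len(failures))  # number of abstention regressions caught
```

Running this kind of fixture set on every prompt or model change is what turns a one-off grounding spot-check into a regression test.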
We've been experimenting with stress-testing LLM systems for hallucinations and prompt injection. Curious how people here measure hallucination rates in production systems?
Thanks!
Terry