Summary
Thanks for maintaining this project. It is one of the few evaluation frameworks that aims at careful and reproducible testing.
I maintain WFGY, an MIT-licensed framework that focuses on failure modes in RAG and multi-agent systems (about 1.5k GitHub stars right now).
The core piece is the WFGY ProblemMap, which catalogs 16 common failure modes for RAG and agents (retrieval drift, ghost matches, semantic collapse, prompt-routing bugs, etc.):
- WFGY ProblemMap (16 failure modes)
https://github.com/onestardao/WFGY/blob/main/ProblemMap/README.md
This map has already been referenced by several research groups:
- Harvard MIMS Lab ToolUniverse (robustness / RAG debugging section)
- QCRI LLM Lab Multimodal RAG Survey
- University of Innsbruck Data Science Group Rankify project
I think this fits the spirit of dioptra, since you already treat evaluations as first-class experiments instead of just metrics.
Why this could help this project
When people run LLM or RAG experiments, they often look only at scores, not at why the system failed. Typical patterns:
- Retrieval looks fine in isolation but fails under slightly shifted queries.
- The model hallucinates because the retrieved chunk is semantically close but logically wrong.
- Multi-step pipelines hide the real failure point.
The 16-problem map gives users a simple checklist:
- “Which failure mode am I seeing?”
- “Is this a retrieval issue, a routing issue, or a reasoning issue?”
- “Where should I place more tests inside the pipeline?”
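As a rough illustration of how this checklist could plug into evaluation tooling, the failure modes can be encoded as data and filtered by pipeline layer. This is only a sketch: the problem names come from this issue's summary, the layer assignments are my own guesses, and every code except No.3 (ghost-match retrieval, cited below) is a placeholder rather than the official WFGY numbering.

```python
# Sketch only: names from this issue's summary; codes other than No.3
# are placeholders, NOT the official WFGY ProblemMap numbering.
from dataclasses import dataclass

@dataclass(frozen=True)
class FailureMode:
    code: str    # e.g. "No.3"
    name: str
    layer: str   # "retrieval", "routing", or "reasoning" (assumed split)

# A small subset of the 16-problem map, arranged for checklist-style triage.
CHECKLIST = [
    FailureMode("No.3", "ghost-match retrieval", "retrieval"),
    FailureMode("No.?", "retrieval drift", "retrieval"),     # placeholder code
    FailureMode("No.?", "prompt-routing bug", "routing"),    # placeholder code
    FailureMode("No.?", "semantic collapse", "reasoning"),   # placeholder code
]

def modes_for_layer(layer: str) -> list[str]:
    """Answer 'is this a retrieval, routing, or reasoning issue?'"""
    return [m.name for m in CHECKLIST if m.layer == layer]
```

With a structure like this, "where should I place more tests?" becomes a query over the map rather than guesswork.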
For a government / standards oriented project, having a named taxonomy of failure modes can also make it easier to write guidance and audit notes.
Concrete proposal
If this sounds useful, I would like to propose:
- Docs section
Add a short page like:
“Diagnosing LLM and RAG failures with the WFGY 16-problem map”
The page can:
- List the 16 problems in a compact table.
- Show how each one maps to typical evaluation setups in this repo.
- Include 1–2 tiny examples of “metric looks OK, but this problem is actually happening”.
- Optional tagging / examples
If you think it fits the roadmap, I can draft:
- A small example notebook or config that runs an evaluation and then labels the outcome with one or more WFGY problem codes, such as No.3 (ghost-match retrieval).
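To make the "optional diagnostic layer" concrete, here is a hypothetical sketch of what the labeling step might look like: an evaluation record is annotated after the run with zero or more problem codes, without touching how the evaluation itself is computed. The record fields and the `label_outcome` helper are illustrative names, not anything from this repo.

```python
# Hypothetical diagnostic layer: annotate an evaluation record with WFGY
# problem codes after the run. Field names are illustrative, not dioptra API.
def label_outcome(record: dict, codes: list[str]) -> dict:
    """Return a copy of `record` with WFGY problem codes attached."""
    out = dict(record)  # non-destructive: the original record is untouched
    existing = set(out.get("wfgy_problems", []))
    out["wfgy_problems"] = sorted(existing | set(codes))
    return out

# Example: retrieval score looks fine, but the answer is wrong --
# a candidate for No.3 (ghost-match retrieval).
run = {"query": "slightly shifted query", "retrieval_score": 0.91,
       "answer_correct": False}
labeled = label_outcome(run, ["No.3"])
```

Because the labels live alongside the metrics instead of replacing them, the "metric looks OK, but this problem is actually happening" cases become queryable.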
This does not require any change to your core logic; it would only add an optional diagnostic layer and documentation.
What I can contribute
If you are open to this, I am happy to:
- Draft the docs page in your existing style (including diagrams or tables if helpful).
- Add a very small, self-contained example that demonstrates the idea.
- Iterate based on your review so that it matches your terminology and standards.
Thank you for considering this. Even a short pointer in your docs like
“for a detailed RAG failure taxonomy, see WFGY ProblemMap”
would already help many users debug their systems in a more systematic way.