Summary
Thanks for maintaining this project. It is one of the few evaluation frameworks that aims at careful and reproducible testing.
I maintain WFGY, an MIT-licensed framework that focuses on failure modes in RAG and multi-agent systems (about 1.5k GitHub stars right now).
The core piece is the WFGY ProblemMap, which catalogs 16 common failure modes for RAG and agents (retrieval drift, ghost matches, semantic collapse, prompt-routing bugs, etc.):
- WFGY ProblemMap (16 failure modes)
https://github.com/onestardao/WFGY/blob/main/ProblemMap/README.md
This map has already been referenced by several research groups:
- Harvard MIMS Lab ToolUniverse (robustness / RAG debugging section)
- QCRI LLM Lab Multimodal RAG Survey
- University of Innsbruck Data Science Group Rankify project
I think this fits the spirit of dioptra, since you already treat evaluations as first-class experiments instead of just metrics.
Why this could help this project
When people run LLM or RAG experiments, they often look only at scores, not at why the system failed. Typical patterns:
- Retrieval looks fine in isolation but fails under slightly shifted queries.
- The model hallucinates because the retrieved chunk is semantically close but logically wrong.
- Multi-step pipelines hide the real failure point.
The 16-problem map gives users a simple checklist:
- “Which failure mode am I seeing?”
- “Is this a retrieval issue, a routing issue, or a reasoning issue?”
- “Where should I place more tests inside the pipeline?”
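As a rough illustration of how this checklist could plug into evaluation tooling, the failure modes can be encoded as data and filtered by pipeline layer. This is only a sketch: the problem names come from this issue's summary, the layer assignments are my own guesses, and every code except No.3 (ghost-match retrieval, cited below) is a placeholder rather than the official WFGY numbering.

```python
# Sketch only: names from this issue's summary; codes other than No.3
# are placeholders, NOT the official WFGY ProblemMap numbering.
from dataclasses import dataclass

@dataclass(frozen=True)
class FailureMode:
    code: str    # e.g. "No.3"
    name: str
    layer: str   # "retrieval", "routing", or "reasoning" (assumed split)

# A small subset of the 16-problem map, arranged for checklist-style triage.
CHECKLIST = [
    FailureMode("No.3", "ghost-match retrieval", "retrieval"),
    FailureMode("No.?", "retrieval drift", "retrieval"),     # placeholder code
    FailureMode("No.?", "prompt-routing bug", "routing"),    # placeholder code
    FailureMode("No.?", "semantic collapse", "reasoning"),   # placeholder code
]

def modes_for_layer(layer: str) -> list[str]:
    """Answer 'is this a retrieval, routing, or reasoning issue?'"""
    return [m.name for m in CHECKLIST if m.layer == layer]
```

With a structure like this, "where should I place more tests?" becomes a query over the map rather than guesswork.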
For a government / standards oriented project, having a named taxonomy of failure modes can also make it easier to write guidance and audit notes.
Concrete proposal
If this sounds useful, I would like to propose:
- Docs section
Add a short page like:
“Diagnosing LLM and RAG failures with the WFGY 16-problem map”
The page can:
- List the 16 problems in a compact table.
- Show how each one maps to typical evaluation setups in this repo.
- Include 1–2 tiny examples of “metric looks OK, but this problem is actually happening”.
- Optional tagging / examples
If you think it fits the roadmap, I can draft:
- A small example notebook or config that runs an evaluation and then labels the outcome with one or more WFGY problem codes, such as No.3 (ghost-match retrieval).
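To make the "optional diagnostic layer" concrete, here is a hypothetical sketch of what the labeling step might look like: an evaluation record is annotated after the run with zero or more problem codes, without touching how the evaluation itself is computed. The record fields and the `label_outcome` helper are illustrative names, not anything from this repo.

```python
# Hypothetical diagnostic layer: annotate an evaluation record with WFGY
# problem codes after the run. Field names are illustrative, not dioptra API.
def label_outcome(record: dict, codes: list[str]) -> dict:
    """Return a copy of `record` with WFGY problem codes attached."""
    out = dict(record)  # non-destructive: the original record is untouched
    existing = set(out.get("wfgy_problems", []))
    out["wfgy_problems"] = sorted(existing | set(codes))
    return out

# Example: retrieval score looks fine, but the answer is wrong --
# a candidate for No.3 (ghost-match retrieval).
run = {"query": "slightly shifted query", "retrieval_score": 0.91,
       "answer_correct": False}
labeled = label_outcome(run, ["No.3"])
```

Because the labels live alongside the metrics instead of replacing them, the "metric looks OK, but this problem is actually happening" cases become queryable.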
This does not require any change to your core logic; it would only add an optional diagnostic layer and documentation.
What I can contribute
If you are open to this, I am happy to:
- Draft the docs page in your existing style (including diagrams or tables if helpful).
- Add a very small, self-contained example that demonstrates the idea.
- Iterate based on your review so that it matches your terminology and standards.
Thank you for considering this. Even a short pointer in your docs like
“for a detailed RAG failure taxonomy, see WFGY ProblemMap”
would already help many users debug their systems in a more systematic way.