This repository starts with explicit, falsifiable hypotheses. If these hypotheses do not hold, the design should change.
**Claim**

For questions that require traversing at least one relationship edge, weighted graph retrieval should outperform plain lexical or embedding-only top-k retrieval.

**Success criteria**

- at least +15% relative improvement in `evidence_recall` on the multi-hop subset
- at least +10% relative improvement in `answer_recall` on the same subset

**Failure signal**

- the graph retriever behaves the same as baseline top-k even when the target evidence is split across linked clauses
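As a minimal sketch of how this criterion could be checked: the function names and the set-based retriever output below are assumptions for illustration, not the repository's actual API.

```python
# Hypothetical evaluation helpers; `evidence_recall` here is a plain
# set-overlap metric, assumed to match the metric named in the criteria.

def evidence_recall(retrieved: set, gold: set) -> float:
    """Fraction of gold evidence ids present in the retrieved set."""
    if not gold:
        return 1.0
    return len(retrieved & gold) / len(gold)

def relative_improvement(candidate: float, baseline: float) -> float:
    """Relative gain of candidate over baseline; 0.15 means +15%."""
    return (candidate - baseline) / baseline

# Toy multi-hop example: gold evidence spans four linked clauses.
gold = {"c1", "c2", "c3", "c4"}
baseline = evidence_recall({"c1", "c3"}, gold)        # 0.50
graph = evidence_recall({"c1", "c2", "c3"}, gold)     # 0.75

# The success criterion for this hypothesis:
assert relative_improvement(graph, baseline) >= 0.15
```

The same `relative_improvement` check applies to `answer_recall` with its +10% threshold.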
**Claim**

Adding risk, importance, or anomaly weight should increase the chance that critical clauses appear in the top result set.

**Success criteria**

- weighted retrieval returns at least one critical evidence node in the top-3 results more often than the unweighted, structure-only variant
- no more than a 20% median latency regression against the structure-only retriever

**Failure signal**

- weighting only reshuffles already obvious hits and does not improve critical evidence recall
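Both criteria above reduce to two small measurements, sketched here under assumed interfaces: each query yields a ranked list of node ids, and per-query latencies are recorded in milliseconds. All names are hypothetical.

```python
import statistics

def critical_hit_rate(ranked_runs: list, critical_ids: set, k: int = 3) -> float:
    """Fraction of queries whose top-k results contain at least one critical node."""
    hits = sum(1 for ranked in ranked_runs if set(ranked[:k]) & critical_ids)
    return hits / len(ranked_runs)

def median_latency_regression(weighted_ms: list, structure_ms: list) -> float:
    """Relative change in median latency; 0.20 means the weighted variant
    is 20% slower at the median than the structure-only variant."""
    base = statistics.median(structure_ms)
    return (statistics.median(weighted_ms) - base) / base

# Toy check: weighted retrieval should win on hit rate without
# exceeding the 20% median latency budget.
weighted_runs = [["a", "crit1", "b"], ["crit2", "x", "y"]]
structure_runs = [["a", "b", "c"], ["crit2", "x", "y"]]
assert critical_hit_rate(weighted_runs, {"crit1", "crit2"}) > \
       critical_hit_rate(structure_runs, {"crit1", "crit2"})
assert median_latency_regression([11, 12, 11], [10, 10, 10]) <= 0.20
```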
**Claim**

Returning a replayable path of nodes and edges should make retrieval behavior auditable while keeping query latency within a practical research-prototype range.

**Success criteria**

- every evaluation result includes a machine-readable `MemoryPath`
- path steps can be traced back to source evidence
- median query latency stays under 200 ms for the current synthetic benchmark dataset on local runs

**Failure signal**

- answer quality only comes from a hidden heuristic and the path object is cosmetic rather than causally useful
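One way to make "machine-readable and traceable" concrete is a serializable path record where every step carries a pointer back to its source evidence. This is a sketch under assumptions; the real `MemoryPath` schema may differ, and the field names below are invented for illustration.

```python
import json
from dataclasses import dataclass, field, asdict
from typing import Optional

@dataclass
class PathStep:
    node_id: str
    edge_type: Optional[str]  # edge taken to reach this node; None for the seed node
    source_span: str          # pointer back to the source evidence, e.g. "doc:char-range"

@dataclass
class MemoryPath:
    query_id: str
    steps: list = field(default_factory=list)

    def to_json(self) -> str:
        """Serialize the full path so a run can be replayed and audited later."""
        return json.dumps(asdict(self))

# Each step is replayable and traceable to evidence, per the criteria above.
path = MemoryPath("q1", [
    PathStep("n1", None, "doc1:0-40"),
    PathStep("n2", "cites", "doc2:5-60"),
])
record = json.loads(path.to_json())
assert all(step["source_span"] for step in record["steps"])
```

The 200 ms median-latency criterion would then be measured over queries that emit these records, so auditability and the latency budget are tested on the same runs.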
- Use `docs/evaluation.md` as the baseline comparison contract.
- Track ablations for:
  - no structure
  - no weights
  - no path expansion
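The three ablations can be tracked as a flag table so each run disables exactly one component against the full system. The flag names below are assumptions, not the repository's configuration schema.

```python
# Hypothetical ablation grid: one retriever component toggled off per variant.
ABLATIONS = {
    "full":              {"structure": True,  "weights": True,  "path_expansion": True},
    "no_structure":      {"structure": False, "weights": True,  "path_expansion": True},
    "no_weights":        {"structure": True,  "weights": False, "path_expansion": True},
    "no_path_expansion": {"structure": True,  "weights": True,  "path_expansion": False},
}

# Sanity check: every ablation differs from "full" in exactly one flag.
for name, flags in ABLATIONS.items():
    disabled = sum(1 for on in flags.values() if not on)
    assert disabled == (0 if name == "full" else 1)
```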