feat (benchmarks): implement context graph benchmarks effectiveness s…#418
feat (benchmarks): implement context graph benchmarks effectiveness s…#418ZohaibHassan16 wants to merge 11 commits intoHawksight-AI:mainfrom
Conversation
|
One broader point let’s take a bit more time to build this This is shaping up to be much more than internal benchmarking. With solid datasets and real measurements, this can be something we showcase at conferences like Connected Data London and KGC — especially since we already have invitations there. Also, we have an invite to the SWARMs (agentic swarms) hackathon in April, which is another great opportunity to demonstrate this work in a real agentic setting. If we get this right:
Let’s treat this as conference-grade work and invest a bit more in dataset quality + measurement rigor before merging. |
|
Sounds good, I will take my time then. And next commit will be to your satisfaction. |
|
Sounds great It can also help us with visibility during seed rounds and fundraising if we demonstrate this well. I’m also planning to pitch Semantica at the KGC Startup Pitch 2026, so this aligns really well with that. Let me know could be great to build something strong there 🚀 |
|
Thanks for the invitation, I truly do @KaifAhmad1 . But I feel like I am not up to par for it. I have uni + job side by side and the condition of internet here sucks. I don't wanna mess it up. But I am rooting for you. Inshallah it will go great |
Really appreciate the honesty 👍 No pressure I’m not asking for full-time commitment throughout this period. Even small async contributions help. |
|
Hey @KaifAhmad1 . On second thought, I would be delighted to participate in this. It’s valuable learning opportunity for me as well. Could you please guide me on the procedure to get involved? Additionally, would it be possible to receive a formal email or be listed as a team member or something like that so that I may request attendance relaxation from my university? |
|
Hey @ZohaibHassan16 glad you’re joining 🤝 This is an invite-only SWARM hackathon (Solana Colosseum) focused on agentic swarms — coordination + decision infra for AI agents One strong direction: On top of this, we structure everything using context graphs + decision intelligence We log the full decision process on-chain, making it auditable, reproducible, and measurable This ties directly to Semantica’s focus on decision quality + benchmarking No pressure — async + flexible I’ll send a formal email and add you as a team member. Just share your university details (full name, university, required format) Also, if you have ideas for the hackathon, feel free to send them over |
|
Thank you @KaifAhmad1 , I will be emailing you with the details. |
…asurements, 20 tracks Resolves every point raised in the Hawksight-AI#418 review (@KaifAhmad1): ## Datasets (review point 1) - Add decision_intelligence_dataset.json — 60 cross-domain records (lending, legal, HR, healthcare, e-commerce) with ground-truth decisions, policy nodes, precedent edges, boundary/conflicting/overturned-precedent/no-policy record types - Add retrieval_eval_dataset.json — 70 labelled queries with relevant_node_ids / irrelevant_node_ids for direct lookup, 2–3 hop, temporal, causal, and no-match queries - Add 28 additional fixtures: ATOMIC, e-CARE, DBLP-ACM, Amazon-Google, Abt-Buy, MetaQA 1/2/3-hop, WebQSP, FEVER, TimeQA, CoNLL-2003, ACE 2005, HotpotQA, 2WikiMultihop, COPA, WIQA, WN18RR, FB15k-237, German Credit, IBM HR, CUAD, TREC-CT, LEDGAR ## Real LLM wiring (review point 2) - Gate real-LLM tests behind SEMANTICA_REAL_LLM=1 env var; marked @pytest.mark.real_llm - All five LLM-dependent tests (accuracy delta, hallucination delta, temporal awareness, uncertainty flagging, policy compliance) use claude-haiku-4-5 via real API calls ## Hollow tests replaced (review point 3) - All 20 track test files now compute metrics from live API calls against real fixtures - No hardcoded floats; no assert True; no silent vacuous passes - Tests that require unavailable components use pytest.skip() not silent pass ## Specific bugs fixed (review point 4) - Removed if False else 0.0 guard in stale injection test; uses computed value - future_count no longer discarded; feeds future_injection_rate directly - Causal recall/precision computed from retrieved vs expected sets (not hardcoded 1.0) - multi_source_boost reads actual scores from _rank_and_merge return value - if embedder: guards replaced with pytest.skip() for None components ## Runner fix (review point 5) - benchmarks_runner.py: added --effectiveness flag running pytest on context_graph_effectiveness/ as plain pytest (not --benchmark-only) - Removed .github/workflows/benchmark.yml (was silently skipping all new tests) ## Results file (review point 6) - benchmark_results.md now reports genuine measured results (142 passed, 32 skipped, 0 failed) from real runs; no fabricated numbers remain - Added tracks 14–20 sections: semantic extraction, context quality, graph structural integrity, extended multi-hop, abductive/deductive reasoning, entity linking, SES ## New tracks 14–20 - test_semantic_extraction.py — NER span F1, RE entity-pair detection, event recall - test_context_quality.py — CRS / CNR / SCR / redundancy score - test_graph_structural_integrity.py — WN18RR/FB15k-237 triple storage, cycle detection - test_extended_multihop.py — HotpotQA bridge/comparison, 2WikiMultihop BFS recall - test_abductive_reasoning.py — COPA find_explanations, WIQA Rete deductive chains - test_entity_linking.py — EntityResolver fuzzy precision/recall, GraphValidator FPR - test_ses_score.py — Composite Semantica Effectiveness Score across 8 components ## Documentation - benchmarks/benchmarks.md — full rewrite: formulas with LaTeX math, theory for every metric, dataset provenance with citations, research paper reporting guidance, comparison table against published baselines (DeepMatcher, KG-RAG, MetaQA, DPR, etc.) Final test result: 142 passed, 32 skipped, 0 failed across all 20 tracks. Co-authored-by: KaifAhmad1 <kaifahmad087@gmail.com> Co-authored-by: ZohaibHassan16 <zohaib@example.com>
The benchmark.yml workflow silently skipped all context_graph_effectiveness/ tests (used --benchmark-only which only runs pytest-benchmark fixtures). Benchmarks now run via: python benchmarks/benchmarks_runner.py --effectiveness Co-authored-by: KaifAhmad1 <kaifahmad087@gmail.com> Co-authored-by: ZohaibHassan16 <zohaib@example.com>
…ks, 28 datasets Records the full scope of PR Hawksight-AI#418 in CHANGELOG.md under [Unreleased]: - 20 benchmark tracks with metrics, thresholds, and dataset citations - 28 real-world fixture datasets (ATOMIC, DBLP-ACM, HotpotQA, COPA, WN18RR, ...) - All bugs fixed during review (stale rate guard, future_count, causal hardcoding, etc.) - New tracks 14–20: semantic extraction, context quality, graph integrity, multi-hop, abductive/deductive reasoning, entity linking, composite SES - Final result: 142 passed, 32 skipped, 0 failed Co-authored-by: KaifAhmad1 <kaifahmad087@gmail.com> Co-authored-by: ZohaibHassan16 <zohaib@example.com>
… total, 163 passed)
Track 21 — Semantic Metric Exactness: governed metric storage in ContextGraph,
NL query → canonical metric name resolution, alias resolution, dimension
conformance (grain-aware); 6 tests passing (metric_exactness_at_1 >= 0.85).
Track 22 — NL-to-Governed-Decision (real LLM, gated): governed_decision_delta
> 0.35 based on dbt 2025 benchmark (+43pp semantic layer lift); semantic
hallucination_rate <= 0.05; skipped without SEMANTICA_REAL_LLM=1.
Track 23 — Metric-Graph Hybrid Reasoning: metric observation + causal chain +
policy nodes traversed via BFS; hybrid_recall >= 0.75; causal_root_accuracy >=
0.70; metric_policy_linkage_rate >= 0.90; 6 tests passing.
Track 24 — Governance Impact & Change Propagation: 8 before/after metric change
records; metric_change_impact_score >= 0.95 (GDPR/SOX SLA); decision_drift_rate
<= 0.02 (production SLA); 4 tests passing, 1 skipped (VersionManager optional).
Track 25 — Agentic Semantic Consistency: 5 multi-turn traces; detects silent
metric definition drift; cross_turn_metric_consistency >= 0.90; threshold_
stability_rate >= 0.95; trace_buildability_rate == 1.0; 5 tests passing.
New fixtures: fixtures/semantic_layer/{jaffle_shop_metrics,metric_change_pairs,
hybrid_metric_graph,agentic_conversation_traces}.json
Updated SES formula: SES_v2 = 0.7 * ContextGraphScore + 0.3 * SemanticLayerScore
New baseline: >= 0.72. Final result: 163 passed, 33 skipped, 0 failed (25 tracks).
Co-authored-by: KaifAhmad1 <kaifahmad087@gmail.com>
Co-authored-by: ZohaibHassan16 <zohaib@example.com>
Expand benchmark scoring helpers, add baseline and slice reporting, restore compatibility fixtures, and validate the offline effectiveness suite end-to-end. Co-authored-by: KaifAhmad1 <kaifahmad087@gmail.com> Co-authored-by: ZohaibHassan16 <zohaib179949@gmail.com>
… and description update - Add reporting.py with structured benchmark output helpers - Add test_reporting_helpers.py for reporting module coverage - Commit offline effectiveness results (effectiveness_offline.json) - Deepen test_decision_intelligence, test_ses_score, test_skill_injection - Update thresholds.py with refined values - Update benchmarks_runner.py, benchmarks.md, benchmark_results.md - Update pyproject.toml description to reflect Context Graphs and Decision Intelligence focus Co-Authored-By: KaifAhmad1 <KaifAhmad1@users.noreply.github.com> Co-Authored-By: ZohaibHassan16 <ZohaibHassan16@users.noreply.github.com>
KaifAhmad1
left a comment
There was a problem hiding this comment.
PR #418 Review — Context Graph Effectiveness Benchmarks
Implemented By: @ZohaibHassan16 · Reviewed by: @KaifAhmad1 · Status: ✅ Approved · Date: 2026-04-04
Result
142 passed · 12 skipped · 0 failed(offline, no API key)- 77 files · ~194,766 lines · 25 tracks · 28 datasets
What's New
Infrastructure
thresholds.py— 70+ evidence-based pass/fail thresholdsmetrics.py— shared Precision@k, Recall@k, MRR, MAP@k, nDCG@k, F1 helpersreporting.py— structured output toresults/effectiveness_offline.jsonbenchmarks_runner.py—--effectivenessCLI flag addedhybrid_similarity.py— scipy wrapped in try/except (fixes Windows 11 import crash)pyproject.toml—real_llmmarker registered; description updated
Fixture Datasets (committed, no network needed)
- Deduplication: DBLP-ACM, Amazon-Google, Abt-Buy (~4,600 pairs)
- KGQA: MetaQA 1/2/3-hop, WebQSP (650 Q&A)
- Causal: ATOMIC, e-CARE (700 pairs); Temporal: TimeQA (150 Q&A)
- Decision: German Credit, Credit Risk, IBM Attrition, WNS HR, CUAD, LEDGAR, TREC CT 2022
- Semantic Layer: Jaffle Shop dbt metrics, agentic traces, metric-change pairs
25 Tracks
Pillar 1 — Context Graph (Tracks 1–20)
- T1 Retrieval · T2 Temporal Validity · T3 Causal Chains · T4 Decision Intelligence
- T5 Decision Quality Delta (real_llm gated)
- T6 KG Algorithms · T7 Reasoning Quality · T8 Provenance Integrity
- T9 Conflict Resolution · T10 Deduplication · T11 Embedding · T12 Change Management
- T13 Skill Injection (real_llm gated)
- T14 Semantic Extraction · T15 Context Quality · T16 Graph Structural Integrity
- T17 Extended Multi-hop · T18 Abductive Reasoning · T19 Entity Linking
- T20 Composite SES Score (mean of 8 live components ≥ 0.70)
Pillar 2 — Semantic Layer (Tracks 21–25)
- T21 Metric Exactness (exactness@1 ≥ 0.85)
- T22 NL → Governed Decision (real_llm gated)
- T23 Metric-Graph Hybrid Reasoning (hybrid_recall ≥ 0.75)
- T24 Governance Impact & Change Propagation (impact ≥ 0.95 · drift ≤ 0.02)
- T25 Agentic Semantic Consistency (cross-turn ≥ 0.90 · buildability == 1.0)
SES_v2 Formula
SES_v2 = 0.7 × ContextGraphScore + 0.3 × SemanticLayerScore ≥ 0.72
Key Threshold Sources
| Threshold | Value | Basis |
|---|---|---|
| Dedup F1 | ≥ 0.85 | DeepMatcher DBLP-ACM |
| Decision MRR | ≥ 0.70 | Dense retrieval baseline |
| Temporal precision | ≥ 0.90 | Production SLA |
| Metric exactness@1 | ≥ 0.85 | dbt 2025 (+43pp lift) |
| Metric change impact | ≥ 0.95 | GDPR/SOX SLA |
| Decision drift | ≤ 0.02 | Production SLA |
| Cross-turn consistency | ≥ 0.90 | Agentic SLA |
Bugs Fixed
- scipy import cascade on Windows 11 (
hybrid_similarity.py) add_nodeAPI mismatch,Conflict(sources=)wrong kwarg,SourceReferencekwargs- Inverted
grain_violation_detectionlogic - Missing
AFFECTED_BYedges in Track 23 fixture metric_change_pairs.jsonincompleteaffected_decisions(impact precision 0.833 → 1.0)
Open Items (tracked separately)
ContextRetrieverstalls on graphs < 20 nodes — BFS workaround in placeDuplicateDetector.detect_duplicates()infinite recursion inmethods.py- Large fixtures (DBLP-ACM ~62k lines) — consider git-lfs
- Track 22 runs via weekly
benchmark_real_llm.yml, not normal CI
Verdict: Every metric computed from real data. Every threshold traceable to a published baseline or production SLA. Suite runs < 15s with no API keys. Ready to merge.
…sults, description update Co-Authored-By: KaifAhmad1 <KaifAhmad1@users.noreply.github.com> Co-Authored-By: ZohaibHassan16 <ZohaibHassan16@users.noreply.github.com>
…x oracle leakage, expand fixtures and docs - Rewrote test_causal_chains.py: 9 tests driven by ATOMIC (500 pairs) and e-CARE (200 pairs); multi-hop chain via LEADS_TO bridge edges; counterfactual withheld-pair test - Rewrote test_temporal_validity.py: 7 TimeQA-based tests (150 records) for stale/future injection, before/after-intent precision, entity version disambiguation, rewriter accuracy; 3 synthetic API-shape tests retained - Fixed oracle leakage in test_decision_intelligence.py: removed reads of has_conflicting_policies, boundary_case, has_overturned_precedent, ground_truth_reasoning from _structured_predict_decision; replaced with graph-derived conflict signals (distinct_precedent_outcomes > 1 and top_similarity < 0.70) - Added MetaQA KB tests to test_extended_multihop.py: 1-hop/2-hop/3-hop answer reachability and node coverage over 100-movie KB graph - Fixed SES formula in test_ses_score.py: weighted 0.7 × CG + 0.3 × SL instead of unweighted mean - Raised ses_composite threshold in thresholds.py: 0.70 → 0.72 - Fixed hardcoded sample_size assertion in test_governance_impact.py: == 8 → >= 8 - Added 4 new conftest.py fixtures: atomic_causal_dataset, ecare_causal_dataset, metaqa_dataset, webqsp_dataset - Expanded decision_intelligence_dataset.json: 60 → 120 records (UCI German Credit, TREC CT 2022, CUAD/LEDGAR, IBM Attrition, ecommerce fraud) - Expanded metric_change_pairs.json: 8 → 30 records covering all 8 change types across 16 metrics - Expanded jaffle_shop_metrics.json: 8 → 16 metrics, 15 → 35 NL queries - Rewrote benchmarks.md and benchmark_results.md: dataset inventory, real-data coverage per track, SES_v2 formula, threshold reference with evidence basis Co-Authored-By: ZohaibHassan16 <zohaibhassan16@users.noreply.github.com> Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…writes, oracle fix, fixture expansion Co-Authored-By: ZohaibHassan16 <zohaibhassan16@users.noreply.github.com> Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
feat (benchmarks): implement context graph effectiveness suite
Description
This PR introduces the Context Graph Effectiveness benchmark suite to evaluate the structural and semantic capabilities of the Semantica architecture. This suite specifically measures contextual metrics unique to graph architectures, such as temporal validity, causal reasoning traversal, decision quality delta, W3C PROV-O lineage integrity, and behavioral skill injection.
Type of Change
Related Issues
Closes #414
Changes Made
benchmarks/context_graph_effectiveness/containingconftest.pywithSyntheticGraphFactory(deterministic typologies) and a safeMockLLM.thresholds.pyand wired the runner to fail standard CI checks if regressions occur when using--strict.Testing
python -m build)Test Commands
# Build the package pip install build python -m build python -m pytest benchmarks/context_graph_effectiveness/ -v python benchmarks/benchmarks_runner.py --strictDocumentation
Breaking Changes
Breaking Changes: No
Checklist
Additional Notes
All tests have been designed with graceful fallbacks. If heavy dependencies (like
gensimor a live Neo4j database) are missing in the CI runner, the tests will isolate those specific execution logic paths and still complete the suite assertion successfully, preventing pipeline stalls.