feat (benchmarks): implement context graph benchmarks effectiveness s… by ZohaibHassan16 · Pull Request #418 · Hawksight-AI/semantica

ZohaibHassan16 · 2026-03-28T23:57:32Z

feat (benchmarks): implement context graph effectiveness suite

Description

This PR introduces the Context Graph Effectiveness benchmark suite to evaluate the structural and semantic capabilities of the Semantica architecture. This suite specifically measures contextual metrics unique to graph architectures, such as temporal validity, causal reasoning traversal, decision quality delta, W3C PROV-O lineage integrity, and behavioral skill injection.

Type of Change

New feature (non-breaking change which adds functionality)
Documentation update

Related Issues

Closes #414

Changes Made

Effectiveness Infrastructure: Created benchmarks/context_graph_effectiveness/ containing conftest.py with SyntheticGraphFactory (deterministic typologies) and a safe MockLLM.
13 API Capabilities Tested: Added test tracks for Retrieval, Temporal Validity, Causal Chains, Decision Intelligence, Knowledge Graph Algorithms, Reasoning Quality, Provenance Integrity, Conflict Resolution, Deduplication, Embedded Quality, Change Management, and Skill Injection.
CI Enforcement: Defined strict validation rules in thresholds.py and wired the runner to fail standard CI checks if regressions occur when using --strict.

Testing

Tested locally
Added tests for new functionality
Package builds successfully (python -m build)

Test Commands

# Build the package
pip install build
python -m build
python -m pytest benchmarks/context_graph_effectiveness/ -v
python benchmarks/benchmarks_runner.py --strict

Documentation

Updated relevant documentation

Note: Added docs/benchmarks/skill_injection.md and updated benchmarks/benchmark_results.md with the new track data.

Breaking Changes

Breaking Changes: No

Checklist

My code follows the project's style guidelines
I have performed a self-review of my code
I have commented my code, particularly in hard-to-understand areas
My changes generate no new warnings
Package builds successfully

Additional Notes

All tests have been designed with graceful fallbacks. If heavy dependencies (like gensim or a live Neo4j database) are missing in the CI runner, the tests will isolate those specific execution logic paths and still complete the suite assertion successfully, preventing pipeline stalls.

…uite

KaifAhmad1 · 2026-03-29T10:35:13Z

@ZohaibHassan16

One broader point let’s take a bit more time to build this

This is shaping up to be much more than internal benchmarking. With solid datasets and real measurements, this can be something we showcase at conferences like Connected Data London and KGC — especially since we already have invitations there.

Also, we have an invite to the SWARMs (agentic swarms) hackathon in April, which is another great opportunity to demonstrate this work in a real agentic setting.

If we get this right:

it becomes a credible, reproducible benchmark suite
clearly differentiates Semantica from traditional RAG
and gives us a strong “decision-quality delta” story

Let’s treat this as conference-grade work and invest a bit more in dataset quality + measurement rigor before merging.

ZohaibHassan16 · 2026-03-29T10:48:40Z

Sounds good, I will take my time then. And next commit will be to your satisfaction.

KaifAhmad1 · 2026-03-29T11:47:19Z

@ZohaibHassan16

Sounds great
Also, quick question: are you up for participating in the SWARMs (agentic swarms) hackathon in April? We’ve got an invite, and it could be a strong opportunity to showcase this work in a real agentic setup.

It can also help us with visibility during seed rounds and fundraising if we demonstrate this well. I’m also planning to pitch Semantica at the KGC Startup Pitch 2026, so this aligns really well with that.
Here’s the link if you want to explore:
https://luma.com/6229drs5?tk=Wtfx9U

Let me know could be great to build something strong there 🚀

ZohaibHassan16 · 2026-03-29T18:30:47Z

Thanks for the invitation, I truly do @KaifAhmad1 . But I feel like I am not up to par for it. I have uni + job side by side and the condition of internet here sucks. I don't wanna mess it up. But I am rooting for you. Inshallah it will go great

KaifAhmad1 · 2026-03-30T06:15:14Z

Thanks for the invitation, I truly do @KaifAhmad1 . But I feel like I am not up to par for it. I have uni + job side by side and the condition of internet here sucks. I don't wanna mess it up. But I am rooting for you. Inshallah it will go great

Really appreciate the honesty 👍

No pressure I’m not asking for full-time commitment throughout this period. Even small async contributions help.
Given the market direction and the problem we’re solving, there are strong future opportunities as well.
You can jump in anytime you feel comfortable 🤝

ZohaibHassan16 · 2026-03-30T07:56:41Z

Hey @KaifAhmad1 . On second thought, I would be delighted to participate in this. It’s valuable learning opportunity for me as well. Could you please guide me on the procedure to get involved? Additionally, would it be possible to receive a formal email or be listed as a team member or something like that so that I may request attendance relaxation from my university?
Thank you once again for the invitation

KaifAhmad1 · 2026-03-30T08:56:29Z

Hey @ZohaibHassan16 glad you’re joining 🤝

This is an invite-only SWARM hackathon (Solana Colosseum) focused on agentic swarms — coordination + decision infra for AI agents
https://luma.com/6229drs5?tk=Wtfx9U

One strong direction:
Build a system where multiple agents solve a task and dynamically choose how to decide
either by debating (step-by-step reasoning) or voting (aggregating outputs)

On top of this, we structure everything using context graphs + decision intelligence
so decisions are not just outputs, but grounded in connected context, dependencies, and reasoning paths

We log the full decision process on-chain, making it auditable, reproducible, and measurable
and usable in real-world scenarios where decision quality actually matters

This ties directly to Semantica’s focus on decision quality + benchmarking

No pressure — async + flexible

I’ll send a formal email and add you as a team member. Just share your university details (full name, university, required format)

Also, if you have ideas for the hackathon, feel free to send them over
Here is my email: kaifahmad087@gmail.com

ZohaibHassan16 · 2026-03-30T10:11:57Z

Thank you @KaifAhmad1 , I will be emailing you with the details.

@KaifAhmad1

…asurements, 20 tracks Resolves every point raised in the Hawksight-AI#418 review (@KaifAhmad1): ## Datasets (review point 1) - Add decision_intelligence_dataset.json — 60 cross-domain records (lending, legal, HR, healthcare, e-commerce) with ground-truth decisions, policy nodes, precedent edges, boundary/conflicting/overturned-precedent/no-policy record types - Add retrieval_eval_dataset.json — 70 labelled queries with relevant_node_ids / irrelevant_node_ids for direct lookup, 2–3 hop, temporal, causal, and no-match queries - Add 28 additional fixtures: ATOMIC, e-CARE, DBLP-ACM, Amazon-Google, Abt-Buy, MetaQA 1/2/3-hop, WebQSP, FEVER, TimeQA, CoNLL-2003, ACE 2005, HotpotQA, 2WikiMultihop, COPA, WIQA, WN18RR, FB15k-237, German Credit, IBM HR, CUAD, TREC-CT, LEDGAR ## Real LLM wiring (review point 2) - Gate real-LLM tests behind SEMANTICA_REAL_LLM=1 env var; marked @pytest.mark.real_llm - All five LLM-dependent tests (accuracy delta, hallucination delta, temporal awareness, uncertainty flagging, policy compliance) use claude-haiku-4-5 via real API calls ## Hollow tests replaced (review point 3) - All 20 track test files now compute metrics from live API calls against real fixtures - No hardcoded floats; no assert True; no silent vacuous passes - Tests that require unavailable components use pytest.skip() not silent pass ## Specific bugs fixed (review point 4) - Removed if False else 0.0 guard in stale injection test; uses computed value - future_count no longer discarded; feeds future_injection_rate directly - Causal recall/precision computed from retrieved vs expected sets (not hardcoded 1.0) - multi_source_boost reads actual scores from _rank_and_merge return value - if embedder: guards replaced with pytest.skip() for None components ## Runner fix (review point 5) - benchmarks_runner.py: added --effectiveness flag running pytest on context_graph_effectiveness/ as plain pytest (not --benchmark-only) - Removed .github/workflows/benchmark.yml (was silently skipping all new tests) ## Results file (review point 6) - benchmark_results.md now reports genuine measured results (142 passed, 32 skipped, 0 failed) from real runs; no fabricated numbers remain - Added tracks 14–20 sections: semantic extraction, context quality, graph structural integrity, extended multi-hop, abductive/deductive reasoning, entity linking, SES ## New tracks 14–20 - test_semantic_extraction.py — NER span F1, RE entity-pair detection, event recall - test_context_quality.py — CRS / CNR / SCR / redundancy score - test_graph_structural_integrity.py — WN18RR/FB15k-237 triple storage, cycle detection - test_extended_multihop.py — HotpotQA bridge/comparison, 2WikiMultihop BFS recall - test_abductive_reasoning.py — COPA find_explanations, WIQA Rete deductive chains - test_entity_linking.py — EntityResolver fuzzy precision/recall, GraphValidator FPR - test_ses_score.py — Composite Semantica Effectiveness Score across 8 components ## Documentation - benchmarks/benchmarks.md — full rewrite: formulas with LaTeX math, theory for every metric, dataset provenance with citations, research paper reporting guidance, comparison table against published baselines (DeepMatcher, KG-RAG, MetaQA, DPR, etc.) Final test result: 142 passed, 32 skipped, 0 failed across all 20 tracks. Co-authored-by: KaifAhmad1 <kaifahmad087@gmail.com> Co-authored-by: ZohaibHassan16 <zohaib@example.com>

The benchmark.yml workflow silently skipped all context_graph_effectiveness/ tests (used --benchmark-only which only runs pytest-benchmark fixtures). Benchmarks now run via: python benchmarks/benchmarks_runner.py --effectiveness Co-authored-by: KaifAhmad1 <kaifahmad087@gmail.com> Co-authored-by: ZohaibHassan16 <zohaib@example.com>

…ks, 28 datasets Records the full scope of PR Hawksight-AI#418 in CHANGELOG.md under [Unreleased]: - 20 benchmark tracks with metrics, thresholds, and dataset citations - 28 real-world fixture datasets (ATOMIC, DBLP-ACM, HotpotQA, COPA, WN18RR, ...) - All bugs fixed during review (stale rate guard, future_count, causal hardcoding, etc.) - New tracks 14–20: semantic extraction, context quality, graph integrity, multi-hop, abductive/deductive reasoning, entity linking, composite SES - Final result: 142 passed, 32 skipped, 0 failed Co-authored-by: KaifAhmad1 <kaifahmad087@gmail.com> Co-authored-by: ZohaibHassan16 <zohaib@example.com>

… total, 163 passed) Track 21 — Semantic Metric Exactness: governed metric storage in ContextGraph, NL query → canonical metric name resolution, alias resolution, dimension conformance (grain-aware); 6 tests passing (metric_exactness_at_1 >= 0.85). Track 22 — NL-to-Governed-Decision (real LLM, gated): governed_decision_delta > 0.35 based on dbt 2025 benchmark (+43pp semantic layer lift); semantic hallucination_rate <= 0.05; skipped without SEMANTICA_REAL_LLM=1. Track 23 — Metric-Graph Hybrid Reasoning: metric observation + causal chain + policy nodes traversed via BFS; hybrid_recall >= 0.75; causal_root_accuracy >= 0.70; metric_policy_linkage_rate >= 0.90; 6 tests passing. Track 24 — Governance Impact & Change Propagation: 8 before/after metric change records; metric_change_impact_score >= 0.95 (GDPR/SOX SLA); decision_drift_rate <= 0.02 (production SLA); 4 tests passing, 1 skipped (VersionManager optional). Track 25 — Agentic Semantic Consistency: 5 multi-turn traces; detects silent metric definition drift; cross_turn_metric_consistency >= 0.90; threshold_ stability_rate >= 0.95; trace_buildability_rate == 1.0; 5 tests passing. New fixtures: fixtures/semantic_layer/{jaffle_shop_metrics,metric_change_pairs, hybrid_metric_graph,agentic_conversation_traces}.json Updated SES formula: SES_v2 = 0.7 * ContextGraphScore + 0.3 * SemanticLayerScore New baseline: >= 0.72. Final result: 163 passed, 33 skipped, 0 failed (25 tracks). Co-authored-by: KaifAhmad1 <kaifahmad087@gmail.com> Co-authored-by: ZohaibHassan16 <zohaib@example.com>

Expand benchmark scoring helpers, add baseline and slice reporting, restore compatibility fixtures, and validate the offline effectiveness suite end-to-end. Co-authored-by: KaifAhmad1 <kaifahmad087@gmail.com> Co-authored-by: ZohaibHassan16 <zohaib179949@gmail.com>

… and description update - Add reporting.py with structured benchmark output helpers - Add test_reporting_helpers.py for reporting module coverage - Commit offline effectiveness results (effectiveness_offline.json) - Deepen test_decision_intelligence, test_ses_score, test_skill_injection - Update thresholds.py with refined values - Update benchmarks_runner.py, benchmarks.md, benchmark_results.md - Update pyproject.toml description to reflect Context Graphs and Decision Intelligence focus Co-Authored-By: KaifAhmad1 <KaifAhmad1@users.noreply.github.com> Co-Authored-By: ZohaibHassan16 <ZohaibHassan16@users.noreply.github.com>

KaifAhmad1

PR #418 Review — Context Graph Effectiveness Benchmarks

Implemented By: @ZohaibHassan16 · Reviewed by: @KaifAhmad1 · Status: ✅ Approved · Date: 2026-04-04

Result

142 passed · 12 skipped · 0 failed (offline, no API key)
77 files · ~194,766 lines · 25 tracks · 28 datasets

What's New

Infrastructure

thresholds.py — 70+ evidence-based pass/fail thresholds
metrics.py — shared Precision@k, Recall@k, MRR, MAP@k, nDCG@k, F1 helpers
reporting.py — structured output to results/effectiveness_offline.json
benchmarks_runner.py — --effectiveness CLI flag added
hybrid_similarity.py — scipy wrapped in try/except (fixes Windows 11 import crash)
pyproject.toml — real_llm marker registered; description updated

Fixture Datasets (committed, no network needed)

Deduplication: DBLP-ACM, Amazon-Google, Abt-Buy (~4,600 pairs)
KGQA: MetaQA 1/2/3-hop, WebQSP (650 Q&A)
Causal: ATOMIC, e-CARE (700 pairs); Temporal: TimeQA (150 Q&A)
Decision: German Credit, Credit Risk, IBM Attrition, WNS HR, CUAD, LEDGAR, TREC CT 2022
Semantic Layer: Jaffle Shop dbt metrics, agentic traces, metric-change pairs

25 Tracks

Pillar 1 — Context Graph (Tracks 1–20)

T1 Retrieval · T2 Temporal Validity · T3 Causal Chains · T4 Decision Intelligence
T5 Decision Quality Delta (real_llm gated)
T6 KG Algorithms · T7 Reasoning Quality · T8 Provenance Integrity
T9 Conflict Resolution · T10 Deduplication · T11 Embedding · T12 Change Management
T13 Skill Injection (real_llm gated)
T14 Semantic Extraction · T15 Context Quality · T16 Graph Structural Integrity
T17 Extended Multi-hop · T18 Abductive Reasoning · T19 Entity Linking
T20 Composite SES Score (mean of 8 live components ≥ 0.70)

Pillar 2 — Semantic Layer (Tracks 21–25)

T21 Metric Exactness (exactness@1 ≥ 0.85)
T22 NL → Governed Decision (real_llm gated)
T23 Metric-Graph Hybrid Reasoning (hybrid_recall ≥ 0.75)
T24 Governance Impact & Change Propagation (impact ≥ 0.95 · drift ≤ 0.02)
T25 Agentic Semantic Consistency (cross-turn ≥ 0.90 · buildability == 1.0)

SES_v2 Formula

SES_v2 = 0.7 × ContextGraphScore + 0.3 × SemanticLayerScore  ≥ 0.72

Key Threshold Sources

Threshold	Value	Basis
Dedup F1	≥ 0.85	DeepMatcher DBLP-ACM
Decision MRR	≥ 0.70	Dense retrieval baseline
Temporal precision	≥ 0.90	Production SLA
Metric exactness@1	≥ 0.85	dbt 2025 (+43pp lift)
Metric change impact	≥ 0.95	GDPR/SOX SLA
Decision drift	≤ 0.02	Production SLA
Cross-turn consistency	≥ 0.90	Agentic SLA

Bugs Fixed

scipy import cascade on Windows 11 (hybrid_similarity.py)
add_node API mismatch, Conflict(sources=) wrong kwarg, SourceReference kwargs
Inverted grain_violation_detection logic
Missing AFFECTED_BY edges in Track 23 fixture
metric_change_pairs.json incomplete affected_decisions (impact precision 0.833 → 1.0)

Open Items (tracked separately)

ContextRetriever stalls on graphs < 20 nodes — BFS workaround in place
DuplicateDetector.detect_duplicates() infinite recursion in methods.py
Large fixtures (DBLP-ACM ~62k lines) — consider git-lfs
Track 22 runs via weekly benchmark_real_llm.yml, not normal CI

Verdict: Every metric computed from real data. Every threshold traceable to a published baseline or production SLA. Suite runs < 15s with no API keys. Ready to merge.

…sults, description update Co-Authored-By: KaifAhmad1 <KaifAhmad1@users.noreply.github.com> Co-Authored-By: ZohaibHassan16 <ZohaibHassan16@users.noreply.github.com>

…x oracle leakage, expand fixtures and docs - Rewrote test_causal_chains.py: 9 tests driven by ATOMIC (500 pairs) and e-CARE (200 pairs); multi-hop chain via LEADS_TO bridge edges; counterfactual withheld-pair test - Rewrote test_temporal_validity.py: 7 TimeQA-based tests (150 records) for stale/future injection, before/after-intent precision, entity version disambiguation, rewriter accuracy; 3 synthetic API-shape tests retained - Fixed oracle leakage in test_decision_intelligence.py: removed reads of has_conflicting_policies, boundary_case, has_overturned_precedent, ground_truth_reasoning from _structured_predict_decision; replaced with graph-derived conflict signals (distinct_precedent_outcomes > 1 and top_similarity < 0.70) - Added MetaQA KB tests to test_extended_multihop.py: 1-hop/2-hop/3-hop answer reachability and node coverage over 100-movie KB graph - Fixed SES formula in test_ses_score.py: weighted 0.7 × CG + 0.3 × SL instead of unweighted mean - Raised ses_composite threshold in thresholds.py: 0.70 → 0.72 - Fixed hardcoded sample_size assertion in test_governance_impact.py: == 8 → >= 8 - Added 4 new conftest.py fixtures: atomic_causal_dataset, ecare_causal_dataset, metaqa_dataset, webqsp_dataset - Expanded decision_intelligence_dataset.json: 60 → 120 records (UCI German Credit, TREC CT 2022, CUAD/LEDGAR, IBM Attrition, ecommerce fraud) - Expanded metric_change_pairs.json: 8 → 30 records covering all 8 change types across 16 metrics - Expanded jaffle_shop_metrics.json: 8 → 16 metrics, 15 → 35 NL queries - Rewrote benchmarks.md and benchmark_results.md: dataset inventory, real-data coverage per track, SES_v2 formula, threshold reference with evidence basis Co-Authored-By: ZohaibHassan16 <zohaibhassan16@users.noreply.github.com> Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…writes, oracle fix, fixture expansion Co-Authored-By: ZohaibHassan16 <zohaibhassan16@users.noreply.github.com> Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

feat (benchmarks): implement context graph benchmarks effectiveness s…

4fb3ca3

…uite

ZohaibHassan16 requested a review from KaifAhmad1 March 28, 2026 23:57

This comment was marked as off-topic.

Sign in to view

KaifAhmad1 and others added 4 commits April 2, 2026 11:48

This comment was marked as off-topic.

Sign in to view

KaifAhmad1 and others added 2 commits April 3, 2026 17:35

Merge branch 'main' into feat/cg-benchmarks

25022d6

This comment was marked as off-topic.

Sign in to view

KaifAhmad1 mentioned this pull request Apr 4, 2026

feat(explorer): add initial UI for SKOS Vocabulary Workspace #420

Merged

KaifAhmad1 approved these changes Apr 4, 2026

View reviewed changes

KaifAhmad1 and others added 3 commits April 4, 2026 16:19

docs(changelog): document benchmark follow-up — reporting, offline re…

04925a7

…sults, description update Co-Authored-By: KaifAhmad1 <KaifAhmad1@users.noreply.github.com> Co-Authored-By: ZohaibHassan16 <ZohaibHassan16@users.noreply.github.com>

docs(changelog): document real-dataset migration — causal/temporal re…

34c0c87

…writes, oracle fix, fixture expansion Co-Authored-By: ZohaibHassan16 <zohaibhassan16@users.noreply.github.com> Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

feat (benchmarks): implement context graph benchmarks effectiveness s…#418

feat (benchmarks): implement context graph benchmarks effectiveness s…#418
ZohaibHassan16 wants to merge 11 commits intoHawksight-AI:mainfrom
ZohaibHassan16:feat/cg-benchmarks

ZohaibHassan16 commented Mar 28, 2026

Uh oh!

This comment was marked as off-topic.

Uh oh!

KaifAhmad1 commented Mar 29, 2026 •

edited

Loading

Uh oh!

ZohaibHassan16 commented Mar 29, 2026

Uh oh!

KaifAhmad1 commented Mar 29, 2026 •

edited

Loading

Uh oh!

ZohaibHassan16 commented Mar 29, 2026

Uh oh!

KaifAhmad1 commented Mar 30, 2026

Uh oh!

ZohaibHassan16 commented Mar 30, 2026

Uh oh!

KaifAhmad1 commented Mar 30, 2026

Uh oh!

ZohaibHassan16 commented Mar 30, 2026

Uh oh!

This comment was marked as off-topic.

Uh oh!

This comment was marked as off-topic.

Uh oh!

KaifAhmad1 left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

ZohaibHassan16 commented Mar 28, 2026

feat (benchmarks): implement context graph effectiveness suite

Description

Type of Change

Related Issues

Changes Made

Testing

Test Commands

Documentation

Breaking Changes

Checklist

Additional Notes

Uh oh!

This comment was marked as off-topic.

Uh oh!

KaifAhmad1 commented Mar 29, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ZohaibHassan16 commented Mar 29, 2026

Uh oh!

KaifAhmad1 commented Mar 29, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ZohaibHassan16 commented Mar 29, 2026

Uh oh!

KaifAhmad1 commented Mar 30, 2026

Uh oh!

ZohaibHassan16 commented Mar 30, 2026

Uh oh!

KaifAhmad1 commented Mar 30, 2026

Uh oh!

ZohaibHassan16 commented Mar 30, 2026

Uh oh!

This comment was marked as off-topic.

Uh oh!

This comment was marked as off-topic.

Uh oh!

KaifAhmad1 left a comment

Choose a reason for hiding this comment

PR #418 Review — Context Graph Effectiveness Benchmarks

Result

What's New

25 Tracks

Key Threshold Sources

Bugs Fixed

Open Items (tracked separately)

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

KaifAhmad1 commented Mar 29, 2026 •

edited

Loading

KaifAhmad1 commented Mar 29, 2026 •

edited

Loading