
feat (benchmarks): implement context graph benchmarks effectiveness s…#418

Open
ZohaibHassan16 wants to merge 11 commits into Hawksight-AI:main from ZohaibHassan16:feat/cg-benchmarks

Conversation

@ZohaibHassan16
Collaborator

feat (benchmarks): implement context graph effectiveness suite


Description

This PR introduces the Context Graph Effectiveness benchmark suite to evaluate the structural and semantic capabilities of the Semantica architecture. This suite specifically measures contextual metrics unique to graph architectures, such as temporal validity, causal reasoning traversal, decision quality delta, W3C PROV-O lineage integrity, and behavioral skill injection.

Type of Change

  • New feature (non-breaking change which adds functionality)
  • Documentation update

Related Issues

Closes #414

Changes Made

  • Effectiveness Infrastructure: Created benchmarks/context_graph_effectiveness/ containing conftest.py with SyntheticGraphFactory (deterministic topologies) and a safe MockLLM.
  • 13 API Capabilities Tested: Added test tracks for Retrieval, Temporal Validity, Causal Chains, Decision Intelligence, Decision Quality Delta, Knowledge Graph Algorithms, Reasoning Quality, Provenance Integrity, Conflict Resolution, Deduplication, Embedding Quality, Change Management, and Skill Injection.
  • CI Enforcement: Defined strict validation rules in thresholds.py and wired the runner to fail standard CI checks if regressions occur when using --strict.
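The strict CI gate described above can be sketched as follows. This is a hypothetical illustration of the pattern only, not the actual contents of thresholds.py; the names THRESHOLDS and check_metric are invented here:

```python
# Illustrative sketch of a strict-threshold gate (assumed pattern;
# THRESHOLDS and check_metric are not the PR's actual API).
import sys

THRESHOLDS = {
    "dedup_f1": 0.85,
    "temporal_precision": 0.90,
}

def check_metric(name: str, measured: float, strict: bool = False) -> bool:
    """Return True if the measured value meets its floor.

    In strict mode a regression is reported; a runner would then exit
    non-zero so the standard CI check fails.
    """
    floor = THRESHOLDS[name]
    ok = measured >= floor
    if not ok and strict:
        print(f"FAIL {name}: {measured:.3f} < {floor:.3f}", file=sys.stderr)
    return ok
```

A runner invoked with --strict would collect all measured metrics, call a check like this for each, and exit non-zero if any fails.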

Testing

  • Tested locally
  • Added tests for new functionality
  • Package builds successfully (python -m build)

Test Commands

# Build the package
pip install build
python -m build

# Run the effectiveness suite
python -m pytest benchmarks/context_graph_effectiveness/ -v

# Run the benchmark runner with strict threshold enforcement
python benchmarks/benchmarks_runner.py --strict

Documentation

  • Updated relevant documentation
Note: Added docs/benchmarks/skill_injection.md and updated benchmarks/benchmark_results.md with the new track data.

Breaking Changes

Breaking Changes: No

Checklist

  • My code follows the project's style guidelines
  • I have performed a self-review of my code
  • I have commented my code, particularly in hard-to-understand areas
  • My changes generate no new warnings
  • Package builds successfully

Additional Notes

All tests have been designed with graceful fallbacks. If heavy dependencies (such as gensim or a live Neo4j database) are missing in the CI runner, the tests skip those dependency-specific execution paths and still complete the suite successfully, preventing pipeline stalls.
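One common way to implement this kind of graceful fallback is to probe optional dependencies up front and skip the dependent paths rather than fail. A minimal sketch of the pattern (illustrative, not the PR's exact code; optional_import and run_embedding_track are invented names):

```python
# Assumed graceful-fallback pattern: probe optional dependencies and
# skip the dependent track instead of crashing the whole suite.
import importlib
import importlib.util

def optional_import(name: str):
    """Return the module if importable, else None so callers can skip."""
    if importlib.util.find_spec(name) is None:
        return None
    return importlib.import_module(name)

gensim = optional_import("gensim")

def run_embedding_track() -> str:
    # Skip cleanly when the heavy dependency is unavailable.
    if gensim is None:
        return "skipped: gensim not installed"
    return "ran embedding track"
```

Inside pytest, the same idea is usually expressed with pytest.importorskip, which marks the test as skipped instead of failed.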


@KaifAhmad1
Contributor

KaifAhmad1 commented Mar 29, 2026

@ZohaibHassan16

One broader point: let’s take a bit more time to build this.

This is shaping up to be much more than internal benchmarking. With solid datasets and real measurements, this can be something we showcase at conferences like Connected Data London and KGC — especially since we already have invitations there.

Also, we have an invite to the SWARMs (agentic swarms) hackathon in April, which is another great opportunity to demonstrate this work in a real agentic setting.

If we get this right:

  • it becomes a credible, reproducible benchmark suite
  • it clearly differentiates Semantica from traditional RAG
  • and it gives us a strong “decision-quality delta” story

Let’s treat this as conference-grade work and invest a bit more in dataset quality + measurement rigor before merging.

@ZohaibHassan16
Collaborator Author

Sounds good, I will take my time then. The next commit will be to your satisfaction.

@KaifAhmad1
Contributor

KaifAhmad1 commented Mar 29, 2026

@ZohaibHassan16

Sounds great
Also, quick question: are you up for participating in the SWARMs (agentic swarms) hackathon in April? We’ve got an invite, and it could be a strong opportunity to showcase this work in a real agentic setup.

It can also help us with visibility during seed rounds and fundraising if we demonstrate this well. I’m also planning to pitch Semantica at the KGC Startup Pitch 2026, so this aligns really well with that.
Here’s the link if you want to explore:
https://luma.com/6229drs5?tk=Wtfx9U

Let me know; it could be great to build something strong there 🚀

@ZohaibHassan16
Collaborator Author

Thanks for the invitation, I truly appreciate it @KaifAhmad1 . But I feel like I am not up to par for it. I have uni and a job side by side, and the internet here sucks. I don't wanna mess it up. But I am rooting for you. Inshallah it will go great.

@KaifAhmad1
Contributor

> Thanks for the invitation, I truly do @KaifAhmad1 . But I feel like I am not up to par for it. I have uni + job side by side and the condition of internet here sucks. I don't wanna mess it up. But I am rooting for you. Inshallah it will go great

Really appreciate the honesty 👍

No pressure; I’m not asking for a full-time commitment throughout this period. Even small async contributions help.
Given the market direction and the problem we’re solving, there are strong future opportunities as well.
You can jump in anytime you feel comfortable 🤝

@ZohaibHassan16
Collaborator Author

Hey @KaifAhmad1 . On second thought, I would be delighted to participate in this. It’s a valuable learning opportunity for me as well. Could you please guide me on the procedure to get involved? Additionally, would it be possible to receive a formal email or be listed as a team member or something like that, so that I may request attendance relaxation from my university?
Thank you once again for the invitation

@KaifAhmad1
Contributor

Hey @ZohaibHassan16 glad you’re joining 🤝

This is an invite-only SWARM hackathon (Solana Colosseum) focused on agentic swarms — coordination + decision infra for AI agents
https://luma.com/6229drs5?tk=Wtfx9U

One strong direction:
Build a system where multiple agents solve a task and dynamically choose how to decide
either by debating (step-by-step reasoning) or voting (aggregating outputs)

On top of this, we structure everything using context graphs + decision intelligence
so decisions are not just outputs, but grounded in connected context, dependencies, and reasoning paths

We log the full decision process on-chain, making it auditable, reproducible, and measurable
and usable in real-world scenarios where decision quality actually matters

This ties directly to Semantica’s focus on decision quality + benchmarking

No pressure — async + flexible

I’ll send a formal email and add you as a team member. Just share your university details (full name, university, required format)

Also, if you have ideas for the hackathon, feel free to send them over
Here is my email: kaifahmad087@gmail.com

@ZohaibHassan16
Collaborator Author

Thank you @KaifAhmad1 , I will be emailing you with the details.

KaifAhmad1 and others added 4 commits April 2, 2026 11:48
…asurements, 20 tracks

Resolves every point raised in the Hawksight-AI#418 review (@KaifAhmad1):

## Datasets (review point 1)
- Add decision_intelligence_dataset.json — 60 cross-domain records (lending, legal, HR,
  healthcare, e-commerce) with ground-truth decisions, policy nodes, precedent edges,
  boundary/conflicting/overturned-precedent/no-policy record types
- Add retrieval_eval_dataset.json — 70 labelled queries with relevant_node_ids /
  irrelevant_node_ids for direct lookup, 2–3 hop, temporal, causal, and no-match queries
- Add 28 additional fixtures: ATOMIC, e-CARE, DBLP-ACM, Amazon-Google, Abt-Buy, MetaQA
  1/2/3-hop, WebQSP, FEVER, TimeQA, CoNLL-2003, ACE 2005, HotpotQA, 2WikiMultihop,
  COPA, WIQA, WN18RR, FB15k-237, German Credit, IBM HR, CUAD, TREC-CT, LEDGAR

## Real LLM wiring (review point 2)
- Gate real-LLM tests behind SEMANTICA_REAL_LLM=1 env var; marked @pytest.mark.real_llm
- All five LLM-dependent tests (accuracy delta, hallucination delta, temporal awareness,
  uncertainty flagging, policy compliance) use claude-haiku-4-5 via real API calls

## Hollow tests replaced (review point 3)
- All 20 track test files now compute metrics from live API calls against real fixtures
- No hardcoded floats; no assert True; no silent vacuous passes
- Tests that require unavailable components use pytest.skip() not silent pass

## Specific bugs fixed (review point 4)
- Removed if False else 0.0 guard in stale injection test; uses computed value
- future_count no longer discarded; feeds future_injection_rate directly
- Causal recall/precision computed from retrieved vs expected sets (not hardcoded 1.0)
- multi_source_boost reads actual scores from _rank_and_merge return value
- if embedder: guards replaced with pytest.skip() for None components

## Runner fix (review point 5)
- benchmarks_runner.py: added --effectiveness flag running pytest on
  context_graph_effectiveness/ as plain pytest (not --benchmark-only)
- Removed .github/workflows/benchmark.yml (was silently skipping all new tests)

## Results file (review point 6)
- benchmark_results.md now reports genuine measured results (142 passed, 32 skipped, 0
  failed) from real runs; no fabricated numbers remain
- Added tracks 14–20 sections: semantic extraction, context quality, graph structural
  integrity, extended multi-hop, abductive/deductive reasoning, entity linking, SES

## New tracks 14–20
- test_semantic_extraction.py — NER span F1, RE entity-pair detection, event recall
- test_context_quality.py — CRS / CNR / SCR / redundancy score
- test_graph_structural_integrity.py — WN18RR/FB15k-237 triple storage, cycle detection
- test_extended_multihop.py — HotpotQA bridge/comparison, 2WikiMultihop BFS recall
- test_abductive_reasoning.py — COPA find_explanations, WIQA Rete deductive chains
- test_entity_linking.py — EntityResolver fuzzy precision/recall, GraphValidator FPR
- test_ses_score.py — Composite Semantica Effectiveness Score across 8 components

## Documentation
- benchmarks/benchmarks.md — full rewrite: formulas with LaTeX math, theory for every
  metric, dataset provenance with citations, research paper reporting guidance,
  comparison table against published baselines (DeepMatcher, KG-RAG, MetaQA, DPR, etc.)

Final test result: 142 passed, 32 skipped, 0 failed across all 20 tracks.

Co-authored-by: KaifAhmad1 <kaifahmad087@gmail.com>
Co-authored-by: ZohaibHassan16 <zohaib@example.com>
The benchmark.yml workflow silently skipped all context_graph_effectiveness/
tests (used --benchmark-only which only runs pytest-benchmark fixtures).
Benchmarks now run via: python benchmarks/benchmarks_runner.py --effectiveness

Co-authored-by: KaifAhmad1 <kaifahmad087@gmail.com>
Co-authored-by: ZohaibHassan16 <zohaib@example.com>
…ks, 28 datasets

Records the full scope of PR Hawksight-AI#418 in CHANGELOG.md under [Unreleased]:
- 20 benchmark tracks with metrics, thresholds, and dataset citations
- 28 real-world fixture datasets (ATOMIC, DBLP-ACM, HotpotQA, COPA, WN18RR, ...)
- All bugs fixed during review (stale rate guard, future_count, causal hardcoding, etc.)
- New tracks 14–20: semantic extraction, context quality, graph integrity,
  multi-hop, abductive/deductive reasoning, entity linking, composite SES
- Final result: 142 passed, 32 skipped, 0 failed

Co-authored-by: KaifAhmad1 <kaifahmad087@gmail.com>
Co-authored-by: ZohaibHassan16 <zohaib@example.com>
… total, 163 passed)

Track 21 — Semantic Metric Exactness: governed metric storage in ContextGraph,
NL query → canonical metric name resolution, alias resolution, dimension
conformance (grain-aware); 6 tests passing (metric_exactness_at_1 >= 0.85).

Track 22 — NL-to-Governed-Decision (real LLM, gated): governed_decision_delta
> 0.35 based on dbt 2025 benchmark (+43pp semantic layer lift); semantic
hallucination_rate <= 0.05; skipped without SEMANTICA_REAL_LLM=1.

Track 23 — Metric-Graph Hybrid Reasoning: metric observation + causal chain +
policy nodes traversed via BFS; hybrid_recall >= 0.75; causal_root_accuracy >=
0.70; metric_policy_linkage_rate >= 0.90; 6 tests passing.

Track 24 — Governance Impact & Change Propagation: 8 before/after metric change
records; metric_change_impact_score >= 0.95 (GDPR/SOX SLA); decision_drift_rate
<= 0.02 (production SLA); 4 tests passing, 1 skipped (VersionManager optional).

Track 25 — Agentic Semantic Consistency: 5 multi-turn traces; detects silent
metric definition drift; cross_turn_metric_consistency >= 0.90;
threshold_stability_rate >= 0.95; trace_buildability_rate == 1.0; 5 tests passing.

New fixtures: fixtures/semantic_layer/{jaffle_shop_metrics,metric_change_pairs,
hybrid_metric_graph,agentic_conversation_traces}.json

Updated SES formula: SES_v2 = 0.7 * ContextGraphScore + 0.3 * SemanticLayerScore
New baseline: >= 0.72. Final result: 163 passed, 33 skipped, 0 failed (25 tracks).

Co-authored-by: KaifAhmad1 <kaifahmad087@gmail.com>
Co-authored-by: ZohaibHassan16 <zohaib@example.com>

KaifAhmad1 and others added 2 commits April 3, 2026 17:35
Expand benchmark scoring helpers, add baseline and slice reporting, restore compatibility fixtures, and validate the offline effectiveness suite end-to-end.

Co-authored-by: KaifAhmad1 <kaifahmad087@gmail.com>
Co-authored-by: ZohaibHassan16 <zohaib179949@gmail.com>

… and description update

- Add reporting.py with structured benchmark output helpers
- Add test_reporting_helpers.py for reporting module coverage
- Commit offline effectiveness results (effectiveness_offline.json)
- Deepen test_decision_intelligence, test_ses_score, test_skill_injection
- Update thresholds.py with refined values
- Update benchmarks_runner.py, benchmarks.md, benchmark_results.md
- Update pyproject.toml description to reflect Context Graphs and Decision Intelligence focus

Co-Authored-By: KaifAhmad1 <KaifAhmad1@users.noreply.github.com>
Co-Authored-By: ZohaibHassan16 <ZohaibHassan16@users.noreply.github.com>
Contributor

@KaifAhmad1 KaifAhmad1 left a comment


PR #418 Review — Context Graph Effectiveness Benchmarks

Implemented By: @ZohaibHassan16 · Reviewed by: @KaifAhmad1 · Status: ✅ Approved · Date: 2026-04-04


Result

  • 142 passed · 12 skipped · 0 failed (offline, no API key)
  • 77 files · ~194,766 lines · 25 tracks · 28 datasets

What's New

Infrastructure

  • thresholds.py — 70+ evidence-based pass/fail thresholds
  • metrics.py — shared Precision@k, Recall@k, MRR, MAP@k, nDCG@k, F1 helpers
  • reporting.py — structured output to results/effectiveness_offline.json
  • benchmarks_runner.py — --effectiveness CLI flag added
  • hybrid_similarity.py — scipy wrapped in try/except (fixes Windows 11 import crash)
  • pyproject.toml — real_llm marker registered; description updated
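The shared ranking helpers attributed to metrics.py are standard IR metrics. A minimal sketch of two of them, under assumed shapes (these are not the PR's actual signatures):

```python
# Illustrative Precision@k and MRR helpers (assumed shapes, not the
# PR's exact metrics.py API).
from typing import Iterable, Sequence

def precision_at_k(ranked: Sequence[str], relevant: Iterable[str], k: int) -> float:
    """Fraction of the top-k ranked ids that are relevant."""
    rel = set(relevant)
    top = ranked[:k]
    return sum(1 for doc in top if doc in rel) / k if k else 0.0

def mrr(ranked: Sequence[str], relevant: Iterable[str]) -> float:
    """Reciprocal rank of the first relevant id; 0.0 if none is retrieved."""
    rel = set(relevant)
    for i, doc in enumerate(ranked, start=1):
        if doc in rel:
            return 1.0 / i
    return 0.0
```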

Fixture Datasets (committed, no network needed)

  • Deduplication: DBLP-ACM, Amazon-Google, Abt-Buy (~4,600 pairs)
  • KGQA: MetaQA 1/2/3-hop, WebQSP (650 Q&A)
  • Causal: ATOMIC, e-CARE (700 pairs); Temporal: TimeQA (150 Q&A)
  • Decision: German Credit, Credit Risk, IBM Attrition, WNS HR, CUAD, LEDGAR, TREC CT 2022
  • Semantic Layer: Jaffle Shop dbt metrics, agentic traces, metric-change pairs

25 Tracks

Pillar 1 — Context Graph (Tracks 1–20)

  • T1 Retrieval · T2 Temporal Validity · T3 Causal Chains · T4 Decision Intelligence
  • T5 Decision Quality Delta (real_llm gated)
  • T6 KG Algorithms · T7 Reasoning Quality · T8 Provenance Integrity
  • T9 Conflict Resolution · T10 Deduplication · T11 Embedding · T12 Change Management
  • T13 Skill Injection (real_llm gated)
  • T14 Semantic Extraction · T15 Context Quality · T16 Graph Structural Integrity
  • T17 Extended Multi-hop · T18 Abductive Reasoning · T19 Entity Linking
  • T20 Composite SES Score (mean of 8 live components ≥ 0.70)

Pillar 2 — Semantic Layer (Tracks 21–25)

  • T21 Metric Exactness (exactness@1 ≥ 0.85)
  • T22 NL → Governed Decision (real_llm gated)
  • T23 Metric-Graph Hybrid Reasoning (hybrid_recall ≥ 0.75)
  • T24 Governance Impact & Change Propagation (impact ≥ 0.95 · drift ≤ 0.02)
  • T25 Agentic Semantic Consistency (cross-turn ≥ 0.90 · buildability == 1.0)

SES_v2 Formula

SES_v2 = 0.7 × ContextGraphScore + 0.3 × SemanticLayerScore  ≥ 0.72
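The composite above is a straight weighted sum; a minimal sketch under the stated weights (the function name and argument names are illustrative):

```python
def ses_v2(context_graph_score: float, semantic_layer_score: float) -> float:
    """Weighted composite: 0.7 * CG + 0.3 * SL; the baseline pass is >= 0.72."""
    return 0.7 * context_graph_score + 0.3 * semantic_layer_score

# e.g. ses_v2(0.80, 0.60) gives 0.74, which clears the 0.72 baseline
```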

Key Threshold Sources

| Threshold | Value | Basis |
| --- | --- | --- |
| Dedup F1 | ≥ 0.85 | DeepMatcher DBLP-ACM |
| Decision MRR | ≥ 0.70 | Dense retrieval baseline |
| Temporal precision | ≥ 0.90 | Production SLA |
| Metric exactness@1 | ≥ 0.85 | dbt 2025 (+43pp lift) |
| Metric change impact | ≥ 0.95 | GDPR/SOX SLA |
| Decision drift | ≤ 0.02 | Production SLA |
| Cross-turn consistency | ≥ 0.90 | Agentic SLA |

Bugs Fixed

  • scipy import cascade on Windows 11 (hybrid_similarity.py)
  • add_node API mismatch, Conflict(sources=) wrong kwarg, SourceReference kwargs
  • Inverted grain_violation_detection logic
  • Missing AFFECTED_BY edges in Track 23 fixture
  • metric_change_pairs.json incomplete affected_decisions (impact precision 0.833 → 1.0)

Open Items (tracked separately)

  • ContextRetriever stalls on graphs < 20 nodes — BFS workaround in place
  • DuplicateDetector.detect_duplicates() infinite recursion in methods.py
  • Large fixtures (DBLP-ACM ~62k lines) — consider git-lfs
  • Track 22 runs via weekly benchmark_real_llm.yml, not normal CI

Verdict: Every metric computed from real data. Every threshold traceable to a published baseline or production SLA. Suite runs < 15s with no API keys. Ready to merge.

KaifAhmad1 and others added 3 commits April 4, 2026 16:19
…sults, description update

Co-Authored-By: KaifAhmad1 <KaifAhmad1@users.noreply.github.com>
Co-Authored-By: ZohaibHassan16 <ZohaibHassan16@users.noreply.github.com>
…x oracle leakage, expand fixtures and docs

- Rewrote test_causal_chains.py: 9 tests driven by ATOMIC (500 pairs) and e-CARE (200 pairs); multi-hop chain via LEADS_TO bridge edges; counterfactual withheld-pair test
- Rewrote test_temporal_validity.py: 7 TimeQA-based tests (150 records) for stale/future injection, before/after-intent precision, entity version disambiguation, rewriter accuracy; 3 synthetic API-shape tests retained
- Fixed oracle leakage in test_decision_intelligence.py: removed reads of has_conflicting_policies, boundary_case, has_overturned_precedent, ground_truth_reasoning from _structured_predict_decision; replaced with graph-derived conflict signals (distinct_precedent_outcomes > 1 and top_similarity < 0.70)
- Added MetaQA KB tests to test_extended_multihop.py: 1-hop/2-hop/3-hop answer reachability and node coverage over 100-movie KB graph
- Fixed SES formula in test_ses_score.py: weighted 0.7 × CG + 0.3 × SL instead of unweighted mean
- Raised ses_composite threshold in thresholds.py: 0.70 → 0.72
- Fixed hardcoded sample_size assertion in test_governance_impact.py: == 8 → >= 8
- Added 4 new conftest.py fixtures: atomic_causal_dataset, ecare_causal_dataset, metaqa_dataset, webqsp_dataset
- Expanded decision_intelligence_dataset.json: 60 → 120 records (UCI German Credit, TREC CT 2022, CUAD/LEDGAR, IBM Attrition, ecommerce fraud)
- Expanded metric_change_pairs.json: 8 → 30 records covering all 8 change types across 16 metrics
- Expanded jaffle_shop_metrics.json: 8 → 16 metrics, 15 → 35 NL queries
- Rewrote benchmarks.md and benchmark_results.md: dataset inventory, real-data coverage per track, SES_v2 formula, threshold reference with evidence basis

Co-Authored-By: ZohaibHassan16 <zohaibhassan16@users.noreply.github.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…writes, oracle fix, fixture expansion

Co-Authored-By: ZohaibHassan16 <zohaibhassan16@users.noreply.github.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Development

Successfully merging this pull request may close these issues.

[FEATURE] Expand Context Graph & Decision Intelligence Benchmarks
