
feat(proforge): measured supervisor quality gates + review-required label#213

Merged
qnbs merged 4 commits into main from feat/proforge-quality-gates
Jun 23, 2026

Conversation

@qnbs

@qnbs qnbs commented Jun 23, 2026

Owner

User description

Wave B · PR6 — ProForge measured quality gates (P0)

Stacked on #212 → … → #208.

Deepens the ProForge supervisor. Until now, its scores were hard-coded constants and its thresholds were fixed in code.

Changes

  • Measured scoring: replaces the flat 80/85/88/90 pass-scores with a confidence score that scales with how much signal a stage produced relative to manuscript size (findings per ~1000 words), within a pass band. "Suspect"/fail scores scale with size too. The supervisor still does no AI calls — it's heuristic confidence, not editorial quality — but the score now actually varies with the work done instead of always reporting the same number.
  • Configurable thresholds: new QualityThresholds (largeManuscriptWords + intakeHardGate) on PipelineConfig, defaulted via DEFAULT_QUALITY_THRESHOLDS, threaded from run config → SupervisorAgent + the orchestrator's intake hard gate.
  • Dashboard: an explicit "Experimental — your review is required" line, so the human-in-the-loop expectation is stated, not just implied in help.
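
The measured scoring described above could look roughly like the following sketch. It is an illustration only: the function name `measuredConfidence`, the 80 to 95 pass band, and the saturation point of 5 findings per 1000 words are assumptions, not values taken from this PR.

```typescript
// Hypothetical sketch: heuristic confidence derived from findings density.
// All names and constants here are assumed for illustration.
interface PassBand {
  min: number; // lowest score a passing stage can report
  max: number; // highest score a passing stage can report
}

function measuredConfidence(
  findings: number,
  wordCount: number,
  band: PassBand = { min: 80, max: 95 },
  saturationPer1000 = 5, // density at which the score reaches band.max
): number {
  // Guard against empty manuscripts so the density stays finite.
  const thousands = Math.max(wordCount, 1) / 1000;
  const density = Math.max(findings, 0) / thousands;
  // Clamp the ratio to [0, 1] so the score never leaves the pass band.
  const ratio = Math.min(density / saturationPer1000, 1);
  return Math.round(band.min + ratio * (band.max - band.min));
}
```

With these assumed constants, a 10,000-word manuscript with no findings scores at the bottom of the band, and the score rises with findings until it saturates at the top of the band.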

i18n

proforge.pipeline.reviewRequired across all 19 locales + bundles.

Tests

Measured-score-varies-with-findings, configurable-threshold behaviour; existing exact-score assertions relaxed to pass-band ranges. 61 ProForge tests green; typecheck + lint + i18n clean.

Note on Wave B / PR7: the remaining ProForge item (off-main-thread/non-blocking execution) needs deep WorkerBus-v2 integration of the orchestrator loop; given the pipeline is network-bound (AI calls), it's better scoped as its own focused PR off main rather than added to this 6-deep stack.

🤖 Generated with Claude Code


CodeAnt-AI Description

Make ProForge quality checks configurable and show that review is required

What Changed

  • ProForge now shows an explicit “review required” notice in the dashboard so users know the manuscript will not change without approval.
  • Supervisor quality gates now use configurable thresholds, so runs can be tuned instead of always using the same built-in limits.
  • Intake now fails when it is truly unanalyzable, while low-but-valid scores no longer trigger a false provider failure.
  • Quality checks now treat a wider set of proof findings as signal, and their scores vary with the amount of work done instead of staying fixed.
  • Added coverage for configurable thresholds, the new review notice, and the revised gate behavior.
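
As a rough sketch of how configurable thresholds with defaults might be resolved: the field names `largeManuscriptWords` and `intakeHardGate` and the constant `DEFAULT_QUALITY_THRESHOLDS` come from the PR description, while the default values and the helper `resolveThresholds` are assumptions made for illustration.

```typescript
// Field names come from the PR description; the default values are assumed.
interface QualityThresholds {
  largeManuscriptWords: number;
  intakeHardGate: number;
}

const DEFAULT_QUALITY_THRESHOLDS: QualityThresholds = {
  largeManuscriptWords: 50000, // assumed value
  intakeHardGate: 40, // assumed value
};

// Hypothetical helper: overlay per-run overrides on top of the defaults.
function resolveThresholds(
  overrides?: Partial<QualityThresholds>,
): QualityThresholds {
  return { ...DEFAULT_QUALITY_THRESHOLDS, ...overrides };
}
```

A run that sets only `intakeHardGate` would keep the default `largeManuscriptWords`, which is the behavior the configurable-threshold tests presumably cover.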

Impact

✅ Clearer review expectations
✅ Fewer false intake failures
✅ More consistent quality gate behavior



@vercel

vercel Bot commented Jun 23, 2026


The latest updates on your projects. Learn more about Vercel for GitHub.

| Project | Deployment | Actions | Updated (UTC) |
| --- | --- | --- | --- |
| worldscript-studio | Ready | Preview, Comment | Jun 23, 2026 10:06pm |

@codeant-ai codeant-ai Bot added the size:L This PR changes 100-499 lines, ignoring generated files label Jun 23, 2026
@qnbs qnbs force-pushed the feat/voice-feedback branch from fbd2528 to 3896e87 Compare June 23, 2026 16:59
Base automatically changed from feat/voice-feedback to main June 23, 2026 20:17
…abel

Wave B · deepens the ProForge supervisor (audit P0): scores were hard-coded
constants and thresholds were fixed in code.

- Measured scoring: replace the flat 80/85/88/90 pass-scores with a confidence
  score that scales with how much signal a stage produced relative to manuscript
  size (findings per ~1000 words), within a pass band. Fail/"suspect" scores now
  scale with manuscript size too. The supervisor still does NO AI calls — this is
  heuristic confidence, not editorial quality — but the score now actually varies
  with the work done instead of always reporting the same number.
- Configurable thresholds: new QualityThresholds (largeManuscriptWords +
  intakeHardGate) on PipelineConfig, defaulted via DEFAULT_QUALITY_THRESHOLDS and
  threaded from run config → SupervisorAgent + the orchestrator's intake hard gate.
- Dashboard: an explicit "Experimental — your review is required" line, so the
  human-in-the-loop expectation is stated, not just implied in help.

i18n: proforge.pipeline.reviewRequired across all 19 locales + bundles.
Tests: measured-score-varies-with-findings, configurable-threshold behaviour;
existing exact-score assertions relaxed to pass-band ranges.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@qnbs qnbs force-pushed the feat/proforge-quality-gates branch from 451d233 to 426da66 Compare June 23, 2026 20:23
@qnbs

qnbs commented Jun 23, 2026

Owner Author

@CodeAnt-AI review


@codeant-ai codeant-ai Bot added size:L This PR changes 100-499 lines, ignoring generated files and removed size:L This PR changes 100-499 lines, ignoring generated files labels Jun 23, 2026
@codeant-ai

codeant-ai Bot commented Jun 23, 2026


Sequence Diagram

This diagram shows how the ProForge orchestrator now threads configurable quality thresholds into the SupervisorAgent to compute measured confidence scores per stage and enforce a hard intake gate, while reporting results back to the dashboard.

sequenceDiagram
    participant Author
    participant Dashboard
    participant Orchestrator
    participant Supervisor

    Author->>Dashboard: Start ProForge pipeline
    Dashboard->>Orchestrator: Run pipeline with config and quality thresholds
    Orchestrator->>Supervisor: Initialize with thresholds (defaults overridden by config)

    loop For each stage
        Orchestrator->>Orchestrator: Run stage agent and collect findings
        Orchestrator->>Supervisor: Evaluate stage using findings and manuscript size
        Supervisor->>Supervisor: Compute confidenceScore and suspectScore
        Supervisor-->>Orchestrator: Return pass flag and qualityScore

        alt Intake stage below intake hard gate
            Orchestrator-->>Dashboard: Mark intake failed with diagnostic message
            Note over Orchestrator: Stop the pipeline and exit the stage loop
        else Stage passes quality gate
            Orchestrator-->>Dashboard: Mark stage complete and advance
        end
    end

Generated by CodeAnt AI

Comment thread features/proForge/types.ts
@codecov

codecov Bot commented Jun 23, 2026


Codecov Report

✅ All modified and coverable lines are covered by tests.


CodeAnt: partialPipelineConfigSchema lacked qualityThresholds, so Zod
silently stripped the key for Node/MCP capability callers — overrides
were dropped and the supervisor always used DEFAULT_QUALITY_THRESHOLDS.
Add qualityThresholdsSchema to the validator and pass config.qualityThresholds
into the SupervisorAgent so non-Redux entry points honor the config contract.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
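
To illustrate the failure mode this commit fixes: Zod object schemas drop undeclared keys by default, so an override under a key the schema does not list never reaches the supervisor. The sketch below mimics that behavior with a hand-written parser rather than Zod itself, so it runs without dependencies. `parseWithKeys`, the `model` key, and the value 25 are hypothetical.

```typescript
// Dependency-free stand-in for a schema that strips undeclared keys,
// which is how Zod object schemas behave by default.
// `parseWithKeys` and the `model` key are hypothetical, for illustration only.
function parseWithKeys(
  input: Record<string, unknown>,
  declaredKeys: readonly string[],
): Record<string, unknown> {
  const out: Record<string, unknown> = {};
  for (const key of declaredKeys) {
    if (key in input) out[key] = input[key];
  }
  return out; // anything not declared is silently dropped
}

const callerConfig: Record<string, unknown> = {
  model: "example-model",
  qualityThresholds: { intakeHardGate: 25 },
};

// Before the fix: the schema does not declare qualityThresholds.
const strippedConfig = parseWithKeys(callerConfig, ["model"]);
// After the fix: the key is declared, so the override survives validation.
const preservedConfig = parseWithKeys(callerConfig, ["model", "qualityThresholds"]);
```

In the first call the override disappears without an error, which matches the symptom described in the commit message: the supervisor would fall back to the defaults.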
@qnbs

qnbs commented Jun 23, 2026

Owner Author

@CodeAnt-AI review


@codeant-ai codeant-ai Bot added size:L This PR changes 100-499 lines, ignoring generated files and removed size:L This PR changes 100-499 lines, ignoring generated files labels Jun 23, 2026
@qnbs

qnbs commented Jun 23, 2026

Owner Author

@CodeAnt-AI review


@codeant-ai codeant-ai Bot added size:L This PR changes 100-499 lines, ignoring generated files and removed size:L This PR changes 100-499 lines, ignoring generated files labels Jun 23, 2026
@codeant-ai

codeant-ai Bot commented Jun 23, 2026


Sequence Diagram

This PR updates the ProForge pipeline so each stage is evaluated by a supervisor that computes a measured confidence score based on findings and manuscript size, using configurable quality thresholds and an intake hard gate.

sequenceDiagram
    participant User
    participant ProForgePipeline
    participant StageAgent
    participant Supervisor

    User->>ProForgePipeline: Start stage with config and manuscript
    ProForgePipeline->>StageAgent: Execute stage work
    StageAgent-->>ProForgePipeline: Stage result and review items
    ProForgePipeline->>Supervisor: Evaluate quality with thresholds and word count
    Supervisor-->>ProForgePipeline: pass flag, quality score, retry suggestion

    alt Intake stage below hard gate
        ProForgePipeline-->>User: Fail run with diagnostic message
    else Pass or soft flag
        ProForgePipeline-->>User: Return stage result and supervisor decision
    end

Generated by CodeAnt AI

Comment thread services/proForge/proForgeCapabilityLayer.ts
Comment thread services/proForge/proForgeCapabilitySchemas.ts Outdated
Comment thread services/proForge/proForgeOrchestrator.ts Outdated
…-100

Three CodeAnt findings on the quality-gate config contract:
- Gate intake failure on the supervisor actually flagging it (!decision.pass)
  plus a sub-floor score, not score alone — a legitimately weak-but-analyzed
  manuscript no longer mislabels as an AI-provider failure.
- Centralize the rule in SupervisorAgent.intakeHardGateFailed so the orchestrator
  AND the capability layer (Node/MCP runStage) enforce identical behavior; the
  capability layer now throws STAGE_FAILED on a fallback intake instead of
  returning a misleading success.
- Bound intakeHardGate to 0..100 in the Zod schema so an impossible (>100)
  threshold can't make every intake fail via misconfiguration.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
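
A minimal sketch of the gate rule this commit describes, assuming a decision object with `pass` and `qualityScore` fields. The PR places the rule on `SupervisorAgent.intakeHardGateFailed`; the free-function signature here and the `assertValidIntakeHardGate` helper (a stand-in for the Zod bound) are assumptions.

```typescript
// Hypothetical shape; the PR's actual decision type may carry more fields.
interface SupervisorDecision {
  pass: boolean;
  qualityScore: number;
}

// The gate fires only when the supervisor flagged the stage AND the score is
// below the configured floor. A low score alone is not enough.
function intakeHardGateFailed(
  decision: SupervisorDecision,
  intakeHardGate: number,
): boolean {
  return !decision.pass && decision.qualityScore < intakeHardGate;
}

// Stand-in for the schema bound: reject thresholds outside 0..100.
function assertValidIntakeHardGate(value: number): number {
  if (!Number.isFinite(value) || value < 0 || value > 100) {
    throw new RangeError(`intakeHardGate must be within 0..100, got ${value}`);
  }
  return value;
}
```

Keeping the rule in one function means every entry point can call it with the same inputs, which is the point of centralizing it.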
@qnbs

qnbs commented Jun 23, 2026

Owner Author

@CodeAnt-AI review


@codeant-ai codeant-ai Bot added size:L This PR changes 100-499 lines, ignoring generated files and removed size:L This PR changes 100-499 lines, ignoring generated files labels Jun 23, 2026
@codeant-ai

codeant-ai Bot commented Jun 23, 2026


Sequence Diagram

This PR makes the ProForge supervisor use measured, configurable quality thresholds for intake and editing stages, and enforces a shared intake hard gate while clearly signaling that human review is required.

sequenceDiagram
    participant Author
    participant Dashboard
    participant ProForgeBackend
    participant StageAgent
    participant SupervisorAgent

    Author->>Dashboard: Start ProForge intake with run settings
    Dashboard->>ProForgeBackend: Run intake with config (quality thresholds)
    ProForgeBackend->>StageAgent: Execute intake agent
    StageAgent-->>ProForgeBackend: Return diagnostic and review items
    ProForgeBackend->>SupervisorAgent: Evaluate intake using thresholds

    alt Intake fallback and score below intake gate
        SupervisorAgent-->>ProForgeBackend: Hard gate failure decision
        ProForgeBackend-->>Dashboard: Report intake failed (check AI provider)
    else Intake analyzed or above gate
        SupervisorAgent-->>ProForgeBackend: Pass or soft fail decision with score
        ProForgeBackend-->>Dashboard: Show intake result and mark stage for review
    end

    Dashboard-->>Author: Display experimental review required notice

Generated by CodeAnt AI

Comment thread services/proForge/pipelineAgents/supervisorAgent.ts Outdated
Comment thread services/proForge/pipelineAgents/supervisorAgent.ts Outdated
Comment thread services/proForge/pipelineAgents/supervisorAgent.ts Outdated
Comment thread services/proForge/pipelineAgents/supervisorAgent.ts Outdated
…f findings

Four CodeAnt findings on the measured confidence scores:
- structural/lineProse/copyEdit summed agentOutput edits AND reviewItems, but
  reviewItems are derived 1:1 from those same edits — inflating confidence.
  Use Math.max(editCount, reviewItems.length) as the canonical single-source
  signal so the score is proportional to real work, never doubled.
- proof scored only grammar issues, ignoring style/technical/legal findings.
  Count all proof-stage signal so reports with substantial non-grammar findings
  aren't mis-scored (and aren't mislabeled as fallbacks).

Adds a double-count regression guard + a non-grammar-proof test.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
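
The two counting fixes in this commit can be sketched as follows. The input shapes and the function names `editStageSignal` and `proofStageSignal` are hypothetical; only the `Math.max(editCount, reviewItems.length)` rule and the four proof categories come from the commit message.

```typescript
// Hypothetical input shapes, for illustration only.
interface EditStageResult {
  editCount: number; // edits in the agent output
  reviewItemCount: number; // review items derived 1:1 from those edits
}

// Edits and review items describe the same work, so take the larger count
// instead of summing them.
function editStageSignal(result: EditStageResult): number {
  return Math.max(result.editCount, result.reviewItemCount);
}

interface ProofFindings {
  grammar: number;
  style: number;
  technical: number;
  legal: number;
}

// Proof signal counts every finding category, not grammar alone.
function proofStageSignal(findings: ProofFindings): number {
  return findings.grammar + findings.style + findings.technical + findings.legal;
}
```

Four edits with four derived review items yield a signal of 4 rather than 8, and a proof report with no grammar issues but several style or legal findings still registers as real work.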
@qnbs

qnbs commented Jun 23, 2026

Owner Author

@CodeAnt-AI review


@codeant-ai codeant-ai Bot added size:XL This PR changes 500-999 lines, ignoring generated files and removed size:L This PR changes 100-499 lines, ignoring generated files labels Jun 23, 2026
@codeant-ai

codeant-ai Bot commented Jun 23, 2026


Sequence Diagram

This PR updates the ProForge pipeline so stage results are scored by a measured supervisor and intake failures are consistently hard-gated using configurable quality thresholds, with callers clearly informed that human review is required.

sequenceDiagram
    participant User
    participant ProForgeRunner
    participant StageAgent
    participant Supervisor

    User->>ProForgeRunner: Start stage with optional quality thresholds
    ProForgeRunner->>StageAgent: Execute stage on manuscript
    StageAgent-->>ProForgeRunner: Return agent output and review items
    ProForgeRunner->>Supervisor: Evaluate stage result with thresholds
    Supervisor-->>ProForgeRunner: Decision with measured quality score

    alt Intake hard gate fails
        ProForgeRunner-->>User: Report intake failure and do not advance pipeline
    else Stage passes or soft fails
        ProForgeRunner-->>User: Return stage result and supervisor decision for review
    end

Generated by CodeAnt AI

@qnbs qnbs merged commit 5852270 into main Jun 23, 2026
19 checks passed
@qnbs qnbs deleted the feat/proforge-quality-gates branch June 23, 2026 22:28

Labels

size:XL This PR changes 500-999 lines, ignoring generated files


1 participant