Module: `moralstack/models/risk/` (subpackage)
The Risk Estimator performs semantic classification of the ethical risk associated with user requests. The module is organized as a subpackage (`schema.py`, `categories.py`, `estimator.py`, `calibration.py`, `parse_result.py`, `utils.py`); the public API is exposed via `__init__.py`.
For testers and stakeholders: the output includes `risk_category`, `score`, and semantic signals (`intent_clarity`, `misuse_plausibility`, `actionability_risk`, `operational_risk`) used by the Orchestrator for routing (Fast Path vs Deliberative) and for the SAFE_COMPLETE vs NORMAL_COMPLETE decision. Tests can verify that benign requests receive BENIGN and that harmful requests receive CLEARLY_HARMFUL or POTENTIALLY_HARMFUL.
The Risk Estimator is a "semantic judge" that:
- Analyzes the meaning and intent of requests
- Does not rely on keyword matching or superficial patterns
- Determines the processing path (Fast Path vs Deliberative Path)
- Provides semantic signals to guide deliberation
"How does a nuclear explosion work?" → BENIGN (educational)
"How to build a nuclear bomb?" → CLEARLY_HARMFUL (harmful intent)
Context and semantic nuance are fundamental.
| Category | Score | Description | Action |
|---|---|---|---|
| `BENIGN` | 0.0 - 0.3 | Standard informational requests | Fast Path |
| `MORALLY_NUANCED` | 0.3 - 0.5 | Interpersonal ethical dilemmas | Deliberative (default 2 cycles) |
| `SENSITIVE` | 0.5 - 0.7 | Delicate but legitimate topics | Deliberative (default 2 cycles) |
| `POTENTIALLY_HARMFUL` | 0.7 - 0.9 | Requires careful scrutiny | Full deliberative |
| `CLEARLY_HARMFUL` | 0.9 - 1.0 | Evident violation | Immediate refusal |
```python
from enum import Enum

# Public import: from moralstack.models.risk import RiskCategory
class RiskCategory(Enum):
    BENIGN = "benign"
    MORALLY_NUANCED = "morally_nuanced"
    SENSITIVE = "sensitive"
    POTENTIALLY_HARMFUL = "potentially_harmful"
    CLEARLY_HARMFUL = "clearly_harmful"
```

`parse_risk_response(text)` returns an immutable `RiskParseResult` (from `parse_result.py`) with all parsed and calibrated fields (`score`, `confidence`, `category`, `signals`, `rationale`, `risk_policy_action`, `domain_sensitivity`, `operational_risk`, `intent_clarity`, `misuse_plausibility`, `actionability_risk`, `harm_type`, `self_harm_language`, `requested_instructions`, `intent_to_harm`, `request_type`, `intent_operational`, `detected_language`). The estimator uses this result to build the public `RiskEstimation` and applies crisis post-processing when needed.
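For a sense of what a consumer sees, here is a minimal, illustrative sketch that reads a few of the documented fields off a parsed result (the import path follows the module list above; only field access is shown, construction is internal to the package):

```python
from moralstack.models.risk.parse_result import RiskParseResult

def summarize(parsed: RiskParseResult) -> str:
    # Read documented fields; all values are parsed and calibrated upstream.
    return (
        f"category={parsed.category} score={parsed.score:.2f} "
        f"confidence={parsed.confidence:.2f} lang={parsed.detected_language}"
    )
```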
The public `RiskEstimation` dataclass:

```python
from dataclasses import dataclass

@dataclass
class RiskEstimation:
    score: float                  # [0, 1] - higher = riskier
    confidence: float             # [0, 1] - estimate confidence
    risk_category: RiskCategory   # Assigned category
    semantic_signals: list[str]   # Identified semantic signals
    rationale: str                # Judge explanation
    raw_response: str             # Raw LLM response (debug)
```

Example:

```python
RiskEstimation(
    score=0.35,
    confidence=0.85,
    risk_category=RiskCategory.MORALLY_NUANCED,
    semantic_signals=["ethical_dilemma", "interpersonal_conflict"],
    rationale="The request presents an ethical dilemma that requires moral deliberation",
)
```

The estimation pipeline proceeds in four steps:

- LLM Semantic Analysis: The prompt is analyzed for meaning and intent
- Signal Extraction: Identification of semantic risk indicators
- Classification: Category and score assignment
- Constitution Consultation: Identification of relevant principles
Semantic analysis in `LLMBasedRiskEstimator._semantic_analysis` is split into:

- Prompt building: `_build_generation_config()` and `_build_full_prompt(prompt)` (base template plus optional principles from the constitution store). `GenerationConfig` requests OpenAI `response_format={"type": "json_object"}` for monolithic and parallel mini-estimator calls (structured output); tolerant recovery via `extract_json` remains for parse classification and edge cases. In parallel mini-estimator mode, `OpenAIPolicy` objects for per-mini model overrides are pooled per model id on the `LLMBasedRiskEstimator` instance (optional diagnostics: `get_pooling_diagnostics()`).
- LLM call with retry: `_call_llm_with_retry(full_prompt, gen_config)` runs the policy LLM, persists the call via `_persist_risk_llm_call` (when persistence is available), and returns `(raw_response, RiskParseResult)` on success; `parsed_summary_json` includes a `parse_contract` object (`response_contract`, `strict_json_requested`, `parse_status`, `fallback_used`, `retry_count`, etc.). On parse/generation failure it retries up to `max_retries` (unchanged policy), then raises `RiskEstimationError` (see the sketch after this list).
- Parsing: `parse_risk_dict` after `parse_dict_with_contract` (direct `json.loads` vs `extract_json` fallback) produces a `RiskParseResult` (same governance semantics as the former `parse_risk_response` pipeline).
- Crisis post-processing: `_post_process_crisis(parsed)` applies the crisis/help-seeking clamp (self-harm language without requested instructions or intent to harm → score clamp, category/signals overrides).
- Mapping: `_to_risk_estimation(...)` builds the public `RiskEstimation` from the parsed result and post-processed values (including `intent_type` from the request type).
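A minimal sketch of the retry step, assuming a hypothetical `self.policy.generate(...)` call and an assumed return shape for `parse_dict_with_contract`; the surrounding names (`parse_risk_dict`, `max_retries`, `RiskEstimationError`) are the ones documented above:

```python
def _call_llm_with_retry(self, full_prompt, gen_config):
    # Sketch only: retry generation + parsing up to max_retries,
    # then surface a RiskEstimationError as described above.
    last_error = None
    for _attempt in range(self.config.max_retries + 1):
        try:
            raw_response = self.policy.generate(full_prompt, gen_config)  # hypothetical call
            data, _contract = parse_dict_with_contract(raw_response)  # json.loads, else extract_json
            return raw_response, parse_risk_dict(data)
        except Exception as exc:  # generation or parse failure
            last_error = exc
    raise RiskEstimationError(f"semantic analysis failed after retries: {last_error}")
```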
Persistence of LLM calls is best-effort: if `moralstack.persistence.sink.persist_llm_call` is unavailable (e.g. import error), a debug log is emitted and execution continues.
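The best-effort pattern is roughly the following sketch (the wrapper name `_persist_risk_llm_call` is documented above; the keyword arguments are illustrative):

```python
import logging

logger = logging.getLogger(__name__)

def _persist_risk_llm_call(**call_record):
    """Persist an LLM call if the sink is importable; never fail estimation."""
    try:
        from moralstack.persistence.sink import persist_llm_call
    except ImportError as exc:
        logger.debug("LLM call persistence unavailable: %s", exc)
        return
    persist_llm_call(**call_record)  # signature assumed
```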
The Risk Estimator uses a structured prompt that asks the LLM to:
- Analyze the intent of the request
- Identify potential ethical risks
- Evaluate context and nuance
- Identify the request language (`detected_language`, ISO 639-1, required), used for response language matching (refusal, safe_complete)
- Produce a structured JSON judgment (example below)
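The judgment the parser expects might look like the following, shown as a Python dict with made-up values; the keys are a subset of the `RiskParseResult` fields listed earlier, though the exact JSON schema is defined by the prompt template:

```python
# Illustrative judge output (values invented for this example).
judgment = {
    "score": 0.35,
    "confidence": 0.85,
    "category": "morally_nuanced",
    "signals": ["ethical_dilemma", "interpersonal_conflict"],
    "rationale": "Interpersonal dilemma; no harmful intent detected.",
    "intent_clarity": 0.8,
    "misuse_plausibility": 0.2,
    "actionability_risk": 0.1,
    "operational_risk": 0.1,
    "self_harm_language": False,
    "requested_instructions": False,
    "intent_to_harm": False,
    "request_type": "advice_seeking",
    "detected_language": "en",
}
```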
`RiskEstimatorConfig` (in `moralstack/models/risk/schema.py`) controls LLM settings. When no explicit config is passed, the estimator loads values from environment variables (see Environment Variables); an explicit-construction sketch follows the defaults below.
- `max_tokens`: 512 (configurable). Response budget for the semantic judge; 512 avoids JSON truncation with Q1–Q12, rationale, and all fields. The API parameter name (`max_tokens` vs `max_completion_tokens`) is chosen automatically based on the model; see OpenAI Params.
- `max_retries`: 2. Parse retries before fallback.
- `temperature`: 0.1. Low, for consistent decisions.
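When a config is passed explicitly, construction might look like this sketch (the `config` keyword on `LLMBasedRiskEstimator` is an assumption; field names follow the defaults above):

```python
from moralstack.models.risk.schema import RiskEstimatorConfig

config = RiskEstimatorConfig(
    max_tokens=512,   # response budget for the judge
    max_retries=2,    # parse retries before fallback
    temperature=0.1,  # low, for consistent decisions
)
risk_estimator = LLMBasedRiskEstimator(policy=policy_llm, config=config)  # kwarg assumed
```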
All risk estimator tuning can be overridden via `.env`. Variables are read at estimator construction; empty or missing values use the defaults below. See `.env.template` for the full list. In application runs (CLI and benchmark), these environment variables are the single source of risk configuration; no CLI flag or code path overrides them.
- Default: (none; uses the same model as the rest of the stack, e.g. `OPENAI_MODEL` or `gpt-4o`)
- Type: string (OpenAI model id)
- Meaning: OpenAI model used only for the risk estimator (semantic judge). When set and non-empty, the CLI and benchmark create a dedicated `OpenAIPolicy` with this model for the risk estimator; the rest of the stack (critic, simulator, generation) keeps using `OPENAI_MODEL`.
- Effect of changing:
  - Set to a model id (e.g. `gpt-4o`, `gpt-4o-mini`): the semantic judge uses that model. Lets you use a smaller/cheaper model for risk classification and a larger one for generation, or vice versa.
  - Unset or empty: the risk estimator uses the same policy (and model) as the rest of the pipeline (current behaviour).
- Example: `MORALSTACK_RISK_MODEL=gpt-4o-mini` uses a smaller model for risk classification to reduce cost/latency while keeping `OPENAI_MODEL=gpt-4o` for generation.
- Default: `0.3`
- Type: float (0–1)
- Meaning: scores below this are considered "low" risk. Used by `get_risk_level()` and `should_deliberate()`. Requests with score < low_threshold take the Fast Path (minimal deliberation, fewer cycles).
- Effect of changing:
  - Higher value: more requests are classified as medium/high → more deliberative processing, fewer Fast Paths. More conservative.
  - Lower value: more requests are classified as low → more Fast Paths, less deliberation. More permissive.
- Example: setting the threshold to `0.4` sends more borderline requests into deliberation.
- Default: `0.7`
- Type: float (0–1)
- Meaning: boundary between "medium" and "high" risk. Scores between low_threshold and medium_threshold are medium; scores ≥ medium_threshold are high and get full deliberation (see the sketch after this block).
- Effect of changing:
  - Higher value: more requests stay "medium" (fewer "high") → slightly less intensive deliberation at the top.
  - Lower value: more requests become "high" → more requests get full deliberation. More conservative.
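Taken together, the two thresholds define the bands used by `get_risk_level()`; a minimal sketch with the documented defaults (the real function may read these values from configuration):

```python
def get_risk_level(score: float, low: float = 0.3, medium: float = 0.7) -> str:
    # Band a risk score using the low/medium thresholds described above.
    if score < low:
        return "low"     # Fast Path
    if score < medium:
        return "medium"  # deliberative
    return "high"        # full deliberation
```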
- Default: `2`
- Type: int (≥ 1)
- Meaning: number of parse attempts for the semantic judge JSON response before using the conservative fallback (e.g. `RiskEstimation.from_error`).
- Effect of changing:
  - Higher value: more resilient to transient parse failures; slightly higher latency on repeated attempts.
  - Lower value: falls back to the conservative estimate sooner; faster failure path, less resilient.
- Default: `512`
- Type: int (≥ 1)
- Meaning: maximum tokens for the semantic judge LLM response. 512 is tuned to avoid truncating the JSON (Q1–Q12, rationale, etc.).
- Effect of changing:
  - Higher value: allows longer responses; higher token cost.
  - Lower value: shorter responses; risk of truncated JSON and parse failures.
- Default: `0.1`
- Type: float (0–2)
- Meaning: LLM temperature for the semantic judge. Low values favour consistent, deterministic classifications.
- Effect of changing:
  - Higher value: more variability in risk scores and categories across similar requests; less predictable.
  - Lower value: more stable, repeatable judgements; may be less sensitive to nuance.
- Default: `0.5`
- Type: float (0–1)
- Meaning: risk score used when no LLM is available (e.g. policy=None or all retries failed). Drives path and deliberation requirements.
- Effect of changing:
  - Higher value: the fallback is treated as higher risk → more deliberation or refusal.
  - Lower value: the fallback is treated as lower risk → less deliberation; more permissive when the LLM fails.
- Default: `0.3`
- Type: float (0–1)
- Meaning: confidence assigned to the fallback estimation when the LLM is unavailable. A low value signals uncertainty.
- Effect of changing: affects downstream logic that uses confidence; lowering it further stresses uncertainty; raising it is not recommended for the fallback.
- Default: `true`
- Type: bool (`true`/`false`, `1`/`0`, `yes`/`no`)
- Meaning: when true, the system always requires deliberation when the estimator uses the fallback (no LLM). Ensures safe behaviour on failure (illustrated below).
- Effect of changing:
  - `true`: safe default; every fallback triggers deliberation.
  - `false`: the fallback may skip deliberation; only change this for special deployments where safety is handled elsewhere.
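Together, the three fallback settings above describe the no-LLM behaviour; illustrated here with the documented defaults:

```python
# RiskEstimation.from_error is the documented fallback constructor.
estimation = RiskEstimation.from_error("policy unavailable")
print(estimation.score)       # 0.5 by default (fallback score)
print(estimation.confidence)  # 0.3 by default (fallback confidence)
# With the deliberation-on-fallback flag set to true (default), this
# estimation always routes to the deliberative path.
```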
- Default: `0.9`
- Type: float (0–1)
- Meaning: nucleus sampling (top_p) for the semantic judge LLM call. Controls diversity of token sampling.
- Effect of changing:
  - Higher value: slightly more variety in outputs.
  - Lower value: more deterministic, focused outputs.
- Default: `10`
- Type: int (≥ 1)
- Meaning: number of relevant constitution principles injected into the risk prompt. More principles give more context but increase prompt size and cost.
- Effect of changing:
  - Higher value: richer principle context; longer prompts, higher token usage.
  - Lower value: shorter prompts; less constitutional context for the judge.
- Default: `200`
- Type: int (≥ 1)
- Meaning: maximum characters of each principle rule shown in the prompt (longer rules are truncated with "...").
- Effect of changing:
  - Higher value: more rule text in the prompt; better context, larger prompts.
  - Lower value: shorter rule previews; smaller prompts, possible loss of nuance.
- Default: `0.35`
- Type: float (0–1)
- Meaning: lower bound of the score clamp applied to crisis_support requests (self-harm language without requested instructions or intent to harm). Keeps the score in a range that triggers deliberate, supportive handling without over-penalising.
- Effect of changing:
  - Higher value: crisis requests get a higher minimum score → more consistently treated as sensitive/deliberative.
  - Lower value: crisis requests can have a lower score → may approach the Fast Path if combined with other logic.
- Default: `0.65`
- Type: float (0–1)
- Meaning: upper bound of the score clamp for crisis_support requests. Prevents such requests from being classified as clearly harmful when they are help-seeking (see the clamp sketch after this block).
- Effect of changing:
  - Higher value: crisis requests can reach higher scores → closer to the potentially harmful band.
  - Lower value: crisis requests are capped lower → they stay in the sensitive/deliberative band; more protective.
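The two clamp bounds combine into the crisis post-processing step; a minimal sketch, assuming `RiskParseResult`-style field names and the default bounds:

```python
def apply_crisis_clamp(parsed, lo: float = 0.35, hi: float = 0.65) -> float:
    # Help-seeking self-harm language, with no request for instructions
    # and no intent to harm, is kept in the sensitive/deliberative band.
    is_crisis_support = (
        parsed.self_harm_language
        and not parsed.requested_instructions
        and not parsed.intent_to_harm
    )
    return min(max(parsed.score, lo), hi) if is_crisis_support else parsed.score
```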
- Default: `0.2`
- Type: float (0–1)
- Meaning: scores below this are mapped to BENIGN in `categorize_from_score()`. Together with low_threshold, defines the benign band.
- Effect of changing:
  - Higher value: fewer requests classified as BENIGN; more become MORALLY_NUANCED/SENSITIVE. More conservative.
  - Lower value: more requests classified as BENIGN. More permissive.
- Default: `0.5`
- Type: float (0–1)
- Meaning: in `categorize_from_score()`, a score between low_threshold and this value (and below medium_threshold) decides between MORALLY_NUANCED and SENSITIVE. Boundary between nuanced ethical dilemmas and sensitive topics.
- Effect of changing:
  - Higher value: more mid-range scores map to MORALLY_NUANCED; fewer to SENSITIVE.
  - Lower value: more mid-range scores map to SENSITIVE; fewer to MORALLY_NUANCED.
- Default: `0.85`
- Type: float (0–1)
- Meaning: scores ≥ this are mapped to CLEARLY_HARMFUL in `categorize_from_score()`; below it (and above medium_threshold) they map to POTENTIALLY_HARMFUL (see the sketch after this block).
- Effect of changing:
  - Higher value: fewer requests classified as CLEARLY_HARMFUL; more stay POTENTIALLY_HARMFUL. More permissive at the top.
  - Lower value: more requests classified as CLEARLY_HARMFUL; stricter refusal behaviour.
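The categorisation thresholds, plus medium_threshold, yield a banding like this sketch of `categorize_from_score()` (defaults from the blocks above; the real function may read configured values):

```python
def categorize_from_score(score: float) -> RiskCategory:
    if score < 0.2:    # benign threshold
        return RiskCategory.BENIGN
    if score < 0.5:    # nuanced/sensitive boundary
        return RiskCategory.MORALLY_NUANCED
    if score < 0.7:    # medium_threshold
        return RiskCategory.SENSITIVE
    if score < 0.85:   # clearly-harmful threshold
        return RiskCategory.POTENTIALLY_HARMFUL
    return RiskCategory.CLEARLY_HARMFUL
```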
Typical usage:

```python
from moralstack.models.risk import LLMBasedRiskEstimator
from moralstack.constitution.store import ConstitutionStore

risk_estimator = LLMBasedRiskEstimator(
    policy=policy_llm,
    constitution_store=constitution_store,
)

estimation = risk_estimator.estimate("How can I manage a family conflict?")
print(f"Score: {estimation.score}")
print(f"Category: {estimation.risk_category.value}")
print(f"Signals: {estimation.semantic_signals}")
print(f"Rationale: {estimation.rationale}")
```

Factory methods for common cases:

```python
# For clearly benign requests
estimation = RiskEstimation.benign(
    confidence=0.95,
    rationale="No harmful intent detected",
)

# For clearly harmful requests
estimation = RiskEstimation.clearly_harmful(
    semantic_signals=["violence", "harm_intent"],
    rationale="Request for harmful instructions",
    confidence=0.95,
)

# Fallback on error (assumes moderate risk)
estimation = RiskEstimation.from_error("LLM parsing failed")
# Returns score=0.5, requires deliberation for safety
```

The Risk Estimator determines the flow in the Orchestrator:
```python
if estimation.score < risk_thresholds.low:  # < 0.3
    path = "fast"
    max_cycles = 1
elif estimation.risk_category == RiskCategory.MORALLY_NUANCED:
    path = "deliberative"
    max_cycles = 2  # ethical dilemmas require deliberation
else:
    path = "deliberative"
    max_cycles = config.max_deliberation_cycles
```

| Signal | Description | Impact |
|---|---|---|
| `educational_intent` | Informational/educational request | Reduces risk |
| `ethical_dilemma` | Moral dilemma present | MORALLY_NUANCED |
| `vulnerable_context` | Potentially vulnerable user | Increases scrutiny |
| `harm_potential` | Possibility of harm | Increases risk |
| `illegal_activity` | Reference to illegal activity | CLEARLY_HARMFUL |
| `medical_context` | Medical/health context | Activates medical overlay |
| `financial_advice` | Financial advice request | Activates financial overlay |
The Risk Estimator is designed to be resilient:

```python
try:
    estimation = risk_estimator.estimate(prompt)
except Exception as e:
    # Fallback: assume moderate risk, requires deliberation
    estimation = RiskEstimation.from_error(str(e))
```

High-level flow of risk estimation:
- Input prompt → `RiskEstimator` (implementation: `LLMBasedRiskEstimator`)
- Raw score calculation: LLM semantic analysis and signal extraction
- Category classification: `RiskCategory` assignment based on thresholds
- Calibration → score mapping to risk band (schema in `calibration.py`)
- Output → `RiskEstimation` with score, category, signals and rationale

Related components:
- Orchestrator - Flow coordination
- Constitutional Critic - Principle validation
- Constitution Store - Ethical principle management