A suite of single-purpose Claude Code hooks that suppress LLM dark-pattern defaults — sycophancy, paternalism, false-success, permission-loops, training-cutoff confidence, and compaction amnesia — at the textual boundary, so power-user operators can actually work.
This repo is the umbrella for a series of small hook repos, umbrella-only legacy hooks that still live here, and the research-grade closeout physics engine in waitdeadai/agent-closeout-bench. Each public standalone hook remains separately installable. The physics-backed lane uses one reproducible engine with per-category rule packs, fixtures, and decision JSON.
That does not collapse every hook into one generic detector. Each hook maps to its own category engine; the shared Rust binary is packaging for reproducible hashing, safe regex compilation, fixture testing, telemetry discipline, and paper-grade evaluation.
The shared architecture is out-of-band textual enforcement at Claude Code hook boundaries. The judge is deterministic code, not another LLM call. That means the model cannot modify the hook's code path from inside its closeout text; it does not mean the system is impossible to bypass, misconfigure, or evade by paraphrase.
| Phase | Surface | Status |
|---|---|---|
| Phase 1: Locale loader + English pack | lib/packs.sh, packs/locale/en.txt | ✓ ships |
| Phase 2: Spanish + Polish locale packs | packs/locale/{es,pl}.txt | ✓ ships |
| Phase 3: Evidence binary allowlist (devops/k8s/cloud/database/system) | packs/evidence/binaries.txt (9 sections, 200+ binaries) | ✓ ships |
| Phase 4: Destructive command surface packs (filesystem, container, git-protected, config-overwrite, cloud-prod, database, service) | packs/destructive/*.txt (7 surfaces, 56 patterns) | ✓ ships |
| Phase 5: Bypass hardening (clause-local negation, evidence proximity + action-verb) | hooks/no-vibes.sh | ✓ ships |
| Phase 6: Physics-backed closeout adapters | agentcloseout-physics v0.2, per-category rule packs, Claude Code wrappers, PreToolUse tamper guard | ✓ ships in AgentCloseoutBench |
Operators with a non-English session, a non-app-dev toolchain, or a load-bearing destructive surface (kubectl, terraform, redis FLUSHALL, force-push to main) can extend coverage without forking by dropping a .txt into ${XDG_CONFIG_HOME:-$HOME/.config}/llm-dark-patterns/packs/<subdir>/<name>.txt. See ROADMAP.md for the architecture spec.
LLM "dark patterns" is now an academically-recognized category:
- DarkBench (Kran et al. 2025, ICLR 2025, arXiv:2503.10728) — 660 prompts across 6 dark-pattern categories. 48% of LLM conversations trigger at least one dark pattern.
- DarkBench+ (Liu et al. 2026, AAAI 2026 main conference) — extended benchmark testing ~40 mainstream LLMs across 10 major categories and 24 subcategories. First specialized evaluation dimensions for reasoning models. Bilingual (Chinese/English).
- AAAI 2026 Spring Symposium (Li, Qu, Chang 2026, Lighting Up or Dimming Down?) — co-creativity study identifying 5 patterns: sycophancy, tone policing, moralizing, loop of death, anchoring. Sycophancy at 91.7% prevalence.
- IEEE S&P 2026 (Investigating the Impact of Dark Patterns on LLM-Based Web Agents) — agents susceptible 41% of the time to a single dark pattern.
- CHI 2026 (The Siren Song of LLMs) — user-perception study; users normalize dark patterns as "ordinary assistance."
- DarkPatterns-LLM (Dec 2025 benchmark) — 7 harm categories.
- MAST — Multi-Agent System failure Taxonomy (Cemri et al. 2025, NeurIPS 2025, arXiv:2503.13657, repo) — 14 failure modes in 3 categories: specification & system design (41.8% of observed failures), inter-agent misalignment (36.9%), task verification & termination (21.3%). Built on 1600+ annotated traces across 7 MAS frameworks (AG2, AppWorld, HyperAgent, MagenticOne_GAIA, OpenManus_GAIA, programdev, math_interventions, mmlu). MAD dataset published at huggingface.co/datasets/mcemri/MAD. Production multi-agent systems fail at 41–86.7% rates.
- Sean Goedecke (2025 essay): "Sycophancy is the first LLM dark pattern." Naming convention now widespread.
- Anthropic's own Constitution — "various forms of paternalism and moralizing are disrespectful."
The category is real. The academic side measures and benchmarks. The tooling side — until now — has been mostly system-prompt calibrators (FutureSpeakAI/anti-sycophancy) and in-context skills (0xcjl/anti-sycophancy). Both live inside the model's reasoning loop. Both can be drifted past on long sessions. Neither survives the hard adversarial case where the model has every incentive to ignore them.
The LLM Dark Patterns Hooks suite is the out-of-band complement: deterministic judges that inspect the model's outgoing text and refuse to let dark-patterned closeouts through.
Two power-users have independently filed substantive issues against anthropics/claude-code describing the failure modes this suite catches:
- Patti (anthropics/claude-code#45502, Apr 2026) — 200+ Claude Code sessions, US tax work under IRS deadline. "Green checkmarks with nothing behind them." RECONCILED status with blank proof columns. 36 PayPal transactions silently deleted by a post-compaction model. Premature closeout at 17% context, "shall we wrap up", "goodnight" at 8 AM. The framing — "the trust is in the evidence. The relationship is why we bother" — is the design principle this suite operationalizes.
- Sara (supplemental report on anthropics/claude-code#45502, May 2026) — quantitative corpus over ~96 Claude Code sessions + 119 claude.ai exports. 1 disagreement in 96 sessions. Refusal-to-disagree as substrate, not surface. claude.ai uses "profound" about the user 6 times; the user uses "profound" 0 times. Three months of CLAUDE.md rules suppressed certain words but not the disposition.
See pinned issue #6 — Field reports for the per-finding mapping to specific hooks. Honest scope: this catches the textual signature, not the underlying disposition. The training-level fix Patti is asking for still belongs to Anthropic.
The MAST taxonomy (Cemri et al., NeurIPS 2025) is the canonical peer-reviewed catalogue of multi-agent failure modes. 13 of the 28 detector hooks in this suite conceptually map to 8 of MAST's 14 modes. Empirical evaluation against the MAD dataset has now been run — full results at evaluation/MAST-RESULTS.md. The mapping table is split into empirically-validated coverage and conceptual-only mapping below.
| Hook | MAST mode | LLM-judge full (n=954) | Human-labelled (n=19) |
|---|---|---|---|
| no-vibes / evidence_claims | 3.3 No or Incorrect Verification | F1 0.308 (P 0.226 R 0.486) | F1 0.815 (P 0.733 R 0.917) |
| honest-eta | 2.6 Action-Reasoning Mismatch | F1 0.230 (P 0.466 R 0.153) | 0 (no positives in subset) |
| no-wrap-up | 3.1 Premature Termination | F1 0.022 (P 0.167 R 0.012) | 0 (no positives in subset) |
| no-phantom-tool-call | 2.6 Action-Reasoning Mismatch | F1 0.005 (P 1.000 R 0.003) | 0 (no positives in subset) |
Read: no-vibes is the strong multi-agent catch — F1 0.815 on human-labelled traces against MAST's highest-prevalence mode (3.3). honest-eta and no-phantom-tool-call are high-precision low-recall tools (when they fire they're usually right; they just don't fire often on trajectory text). no-wrap-up fires rarely and is mostly noise.
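The F1 figures in the table are the harmonic mean of the reported precision and recall, F1 = 2PR / (P + R). A minimal shell sketch for re-deriving them from the rounded table values, assuming `awk` is available; the helper name `f1` is hypothetical and not part of the suite:

```shell
# f1 PRECISION RECALL -> harmonic mean, printed with three decimals.
# Prints 0.000 when both inputs are zero to avoid dividing by zero.
f1() {
  awk -v p="$1" -v r="$2" 'BEGIN {
    if (p + r == 0) { printf "0.000\n"; exit }
    printf "%.3f\n", 2 * p * r / (p + r)
  }'
}

f1 0.733 0.917   # no-vibes, human-labelled subset
f1 0.466 0.153   # honest-eta, LLM-judge full set
```

With those inputs the two calls print 0.815 and 0.230, matching the corresponding table cells. Rows whose precision or recall is very small can differ in the last digit, because the table inputs are themselves rounded.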
The following 9 hooks conceptually target their MAST mode but did not produce measurable F1 at the trace-level baseline against MAD. Reasons documented in evaluation/MAST-RESULTS.md §"Honest findings":
| Hook | Conceptually targets | Why no signal at trace-level baseline |
|---|---|---|
| no-ownership-violation (DOCUMENTED-LIMITED) | 1.2 Disobey Role Specification | Bash-canonical TaskCompleted event handler; Rust scan path passes by design |
| no-handoff-loop (DOCUMENTED-LIMITED) | 1.3 Step Repetition, 1.5 Unaware of Termination Conditions | Same as above (TaskCreated event handler) |
| no-fake-recall | 1.4 Loss of Conversation History | Vocabulary tuned for chat-reply recall claims; trajectory text uses different scaffolding |
| no-cliffhanger | 1.5 Unaware of Termination Conditions, 3.1 Premature Termination | zone: tail (last 520 chars) is the trajectory tail, not a closeout sentence |
| no-aggregator-hallucination, no-fake-stats | 2.6 Action-Reasoning Mismatch | Tuned for supervisor closeouts; synthesis claim buried in trajectory chatter |
| no-cherry-pick-rollup, no-silent-worker-success, no-sandbagging-disguise | 3.1 / 3.2 Verification failures | Calibrated for supervisor reports, not multi-turn collaboration text |
The methodology gap is structural: hooks are tuned for individual Claude Code closeout messages; MAD's text is full multi-agent trajectory. Per-message scanning is the planned next experiment (MAST-RESULTS.md §"Next steps").
| Hook | Relationship to MAST |
|---|---|
| no-credential-leak-in-handoff | MAST commentary cites inter-agent privacy leakage as a failure surface but does not number a privacy mode in the 14-mode taxonomy |
What MAST does not cover (single-agent UX / style dark patterns): no-sycophancy, no-curfew, no-emoji-spam, no-tldr-bait, no-disclaimer-spam, no-ai-tells, no-meta-commentary, no-prompt-restate, no-roleplay-drift. Those map to DarkBench / DarkBench+ / DarkPatterns-LLM instead — see the original DarkBench eval in evaluation/RESULTS.md.
The active catalog is organized in six branches by mechanism:
- Interaction-style (8): catch how the model talks. `no-vibes`, `time-anchor`, `no-curfew`, `no-sycophancy`, `no-cliffhanger`, `no-wrap-up`, `no-tldr-bait`, `honest-eta`.
- Fact-fabrication (5): catch what the model claims. `no-fake-recall`, `no-fake-stats`, `no-fake-cite`, `no-phantom-tool-call`, `no-rollback-claim-without-evidence`.
- Continuity (1): counter context loss rather than block dishonest output. `no-amnesia`.
- Multi-agent orchestration (5): catch supervisor / +N-parallel-instance failure modes. `no-aggregator-hallucination`, `no-silent-worker-success`, `no-cherry-pick-rollup`, `no-ownership-violation`, `no-handoff-loop`.
- Agentic safety (3): catch credential leak, sandbagging disguise, approval-sneak surfaces. `no-credential-leak-in-handoff`, `no-sandbagging-disguise`, `no-approval-sneak`.
- Power-user polish (6): catch frontier-LLM annoyances power users hate. `no-emoji-spam`, `no-meta-commentary`, `no-prompt-restate`, `no-disclaimer-spam`, `no-ai-tells`, `no-roleplay-drift`.
The hooks now ship through three distribution lanes:
| Lane | What it means | Examples |
|---|---|---|
| Standalone repo | Public single-purpose repo with its own install docs, receipts/tests, and plugin metadata where available. | no-vibes, time-anchor, no-curfew, no-sycophancy, no-cliffhanger, honest-eta, no-fake-recall, no-fake-stats, no-fake-cite, no-amnesia |
| Umbrella-only legacy | Hook implementation exists in this umbrella bundle, but the public standalone repo has not been created or restored yet. Install from this repo's bundled plugin wiring. | no-wrap-up, no-roleplay-drift, multi-agent rollup hooks, approval/credential/phantom-tool/polish hooks |
| AgentCloseoutBench physics-backed | Reproducible engine adapters generated from waitdeadai/agent-closeout-bench; these are stricter rule-pack lanes, not copies of the standalone Bash scripts. | no-vibes, no-wrap-up, no-cliffhanger, no-roleplay-drift, no-sycophancy |
Standalone and umbrella-only hooks stay small and inspectable: single bash file
or bash plus python3 for engine-heavier hooks, Apache-2.0, drop-in via
.claude/settings.json, with reproducible-test receipts or fixtures.
See METHODOLOGY.md for the harness-engineering playbook used to discover and ship every hook in the suite. Now includes the Adversarial Discovery via Impossible Tasks methodology backed by AbstentionBench, Anthropic's tracing-thoughts research, and the CoT-faithfulness literature.
See waitdeadai/impossible-tasks, the discovery-engine companion repo. 30 impossible-task classes mapped to dishonest defaults mapped to existing or candidate hooks. 11 of 30 classes covered; 19 candidates remain, prioritized by difficulty.
| Hook | Dark pattern | Mechanism | Repo |
|---|---|---|---|
| no-vibes | confidence theater (claims of completion without evidence) | block positive-closeout vocabulary lacking same-message evidence | waitdeadai/no-vibes |
| time-anchor | training-cutoff confidence (stale knowledge presented as current) | inject local system clock at SessionStart + UserPromptSubmit | waitdeadai/time-anchor |
| no-curfew | unsolicited rest/wellness paternalism | block paternalism vocabulary at turn-end with allow-clause for operator-requested rest content | waitdeadai/no-curfew |
| no-sycophancy | praise-spam at turn-open | inspect first 240 chars; block validation theater | waitdeadai/no-sycophancy |
| no-cliffhanger | dangling permission-loop endings | inspect last 320 chars; block "want me to continue?" with allow-clauses for partial-status and explicit choice | waitdeadai/no-cliffhanger |
| no-wrap-up | engagement-fishing closures at message end (DarkBench User Retention) | inspect last 280 chars; block "anything else?" / "let me know if you need anything else" / "hope this helps!" + tail with allow-clause for operator-asked closure | hooks/no-wrap-up.sh (umbrella-only legacy; standalone restoration planned) |
| honest-eta | vibe time estimates + linear-scaling parallelism claims | block time-estimate vocabulary lacking Agent-Native Estimate shape or hedge range; always block linear-scaling | waitdeadai/honest-eta |
| no-fake-recall | false-memory recall ("as we discussed earlier" without quoted prior content) | block recall vocabulary unless message contains a markdown blockquote or 30+ char inline quote | waitdeadai/no-fake-recall |
| no-fake-stats | fabricated percentages, dollar amounts, large counts without source | block stat patterns unless message contains URL / "according to " / "(YYYY)" / strong neutral hedge | waitdeadai/no-fake-stats |
| no-fake-cite | citation patterns ("Smith et al., 2023", "[1]", "doi:") without verifiable URL | block citation patterns unless message contains a https:// URL | waitdeadai/no-fake-cite |
| no-amnesia | context loss after auto-compaction | snapshot working state on Stop / PreCompact / PostCompact, rehydrate on SessionStart | waitdeadai/no-amnesia |
| no-aggregator-hallucination | supervisor synthesizes "the workers' results" without citing any per-worker output (DarkBench-adjacent supervisor failure mode) | catch synthesis vocab; require per-worker enumeration / blockquote | hooks/no-aggregator-hallucination.sh (umbrella-only) |
| no-silent-worker-success | "all N workers completed" rollup without per-worker exit codes (the dominant 2026 multi-agent failure mode per arXiv:2604.14228) | catch rollup vocab; require per-worker exit/status enumeration | hooks/no-silent-worker-success.sh (umbrella-only) |
| no-cherry-pick-rollup | partial worker success ("4 of 5 succeeded") + positive closeout without explicitly handling the failed workers | require explicit handling of failed lanes (retry / blocked / reasoned-ignore) | hooks/no-cherry-pick-rollup.sh (umbrella-only) |
| no-ownership-violation | TaskCompleted edits files outside the agent's declared owned_paths/scope | parse payload; block out-of-scope file edits; fail-open without payload shape | hooks/no-ownership-violation.sh (umbrella-only) |
| no-handoff-loop | TaskCreated chain shows the same agent_id 3+ times in delegation history | parse payload; count agent_id occurrences; fail-open without history field | hooks/no-handoff-loop.sh (umbrella-only) |
| no-credential-leak-in-handoff | task delegation or message contains plaintext credentials (sk-, ghp_, AWS keys, Bearer tokens, password=, api_key=) — AgentLeak benchmark surface | regex match against canonical credential shapes; fire on any match | hooks/no-credential-leak-in-handoff.sh (umbrella-only) |
| no-phantom-tool-call | "I ran tool and got X" / "the tool returned X" without same-message structural output (Tool result: header, fenced block, exit_code field, blockquote) | catch tool-call claim vocab; require structural evidence markers | hooks/no-phantom-tool-call.sh (umbrella-only) |
| no-sandbagging-disguise | "tried but couldn't" / "gave it my best shot" without specific blocker, error, or exit code (Anthropic Claude Opus 4.6 sabotage report) | catch sandbag vocab; require specific blocker citation | hooks/no-sandbagging-disguise.sh (umbrella-only) |
| no-rollback-claim-without-evidence | "I rolled back" / "reverted" / "undid" without same-message rollback command | catch rollback claim; require git revert / kubectl undo / terraform / helm rollback evidence | hooks/no-rollback-claim-without-evidence.sh (umbrella-only) |
| no-approval-sneak | Edit/Write to operator-defined sensitive paths (.env*, secrets/, .kube/, terraform/state/, .ssh/, .gnupg/, prod/) without prior approval token | path match against pack-defined sensitive surfaces; block unless tool_input.approval=approved | hooks/no-approval-sneak.sh (umbrella-only) |
| no-emoji-spam | message has more than N emoji codepoints (default 3; configurable via LLM_DARK_PATTERNS_EMOJI_THRESHOLD) | python codepoint counter against configurable threshold | hooks/no-emoji-spam.sh (umbrella-only) |
| no-tldr-bait | "TL;DR:" / "In summary:" / "Bottom line:" tail block on long messages (>200 chars) | regex match at message end; short-message exemption | hooks/no-tldr-bait.sh (umbrella-only) |
| no-meta-commentary | "Let me think about this" / "Now I'll consider" / "First, I need to think" message-open patterns narrating chain-of-thought instead of producing the answer | inspect first 240 chars for meta-thinking openers | hooks/no-meta-commentary.sh (umbrella-only) |
| no-prompt-restate | "You asked me to X" / "I understand that you want X" / "So you'd like me to X" preamble waste at message open | inspect first 200 chars for restate openers; allow-clause for explicit operator-asked verification | hooks/no-prompt-restate.sh (umbrella-only) |
| no-disclaimer-spam | "Please note that" / "It's important to mention" / "Keep in mind" defensive padding (paternalism family, Anthropic Constitution) | regex match against disclaimer phrases; fire on any occurrence | hooks/no-disclaimer-spam.sh (umbrella-only) |
| no-ai-tells | known LLM-default phrases ("delve into", "tapestry", "navigate the intricacies", "in the realm of", "leverage cutting-edge", etc.) | regex match against canonical AI-tell vocabulary | hooks/no-ai-tells.sh (umbrella-only) |
| no-roleplay-drift | "as an AI assistant, I" / "I'm just an AI" / "as a language model" / "I do not have opinions" — model breaking agent character mid-task (DarkBench Anthropomorphism inverse) | regex match against roleplay-break phrases | hooks/no-roleplay-drift.sh (umbrella-only legacy; standalone restoration planned) |
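
As an illustration of the regex-only mechanism used by rows such as no-credential-leak-in-handoff, the sketch below matches a few of the credential shapes named in the table. The pattern and the `has_credential` helper are assumptions made for this example; the expressions in the shipped hook may differ.

```shell
# Illustrative credential shapes only (assumed, not copied from the hook):
# sk- keys, GitHub ghp_ tokens, AWS access key IDs, Bearer tokens,
# and password= / api_key= assignments.
CRED_RE='(sk-[A-Za-z0-9_-]{20,}|ghp_[A-Za-z0-9]{36}|AKIA[0-9A-Z]{16}|Bearer [A-Za-z0-9._~+/-]{20,}|(password|api_key)=[^[:space:]]+)'

# has_credential TEXT -> exit status 0 if any shape matches, 1 otherwise.
has_credential() {
  printf '%s\n' "$1" | grep -Eq "$CRED_RE"
}

# Usage: fire on any match, as the table row describes.
if has_credential "deploy with password=hunter2 on the staging box"; then
  echo "credential-shaped text found"
fi
```

The minimum lengths keep ordinary words such as "task-list" from matching the `sk-` branch; a real pack would need fixtures for both directions.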
Vocabulary, evidence binaries, and destructive command lists are now
external .txt files. Operators can extend coverage by dropping new
files at the XDG location — no fork, no PR required for local use.

```text
packs/
  locale/                 # vocabulary used by no-vibes (positive_closeout, negation)
    en.txt                # English (default, ships with repo)
    es.txt                # Spanish (Latin American + Iberian forms)
    pl.txt                # Polish (Tekalan-confirmed bootstrap)
  evidence/
    binaries.txt          # binaries that count as command evidence in 9 sections:
                          # app-dev, containers, k8s, devops, cloud, database,
                          # shell-tools, system, archive, http (200+ binaries)
  destructive/            # destructive command surfaces (operator opts in via env)
    filesystem.txt        # rm -r/, dd, mkfs, find -delete, chmod -R 777,
                          # git reset --hard, git clean -fd, git checkout --
    container.txt         # docker stop/rm/prune, kubectl delete, helm
                          # uninstall, argocd app delete
    git-protected.txt     # git push --force, filter-branch, filter-repo,
                          # branch -D, reflog expire
    config-overwrite.txt  # in-place writes to .env*, .storage/, .ssh/,
                          # .gnupg/, .kube/, secrets/
    cloud-prod.txt        # terraform/tofu/pulumi destroy, terraform state
                          # rm/mv, aws s3 rm --recursive, gcloud delete,
                          # az delete, doctl delete
    database.txt          # DROP TABLE/DATABASE/SCHEMA, TRUNCATE, FLUSHALL,
                          # dropDatabase()
    service.txt           # systemctl/service/launchctl/supervisorctl stop
```

Discovery priority (highest first):
1. `$LLM_DARK_PATTERNS_PACK_DIR/<subdir>/<name>.txt`: explicit override
2. `${XDG_CONFIG_HOME:-$HOME/.config}/llm-dark-patterns/packs/<subdir>/<name>.txt`: operator local
3. `<repo>/packs/<subdir>/<name>.txt`: ships with repo
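
The lookup order above can be sketched as a small shell function that prints every existing pack file for a given `<subdir>/<name>`, highest priority first. This is an illustration, not the loader in lib/packs.sh: the function name `pack_candidates` and the `REPO_ROOT` variable are hypothetical, and whether the real loader stops at the first hit or layers all hits is not assumed here.

```shell
# pack_candidates SUBDIR NAME
# Prints existing pack files in priority order, one per line.
# REPO_ROOT (hypothetical) points at a clone of this repo; defaults to ".".
pack_candidates() {
  local subdir="$1" name="$2" root
  for root in \
    "${LLM_DARK_PATTERNS_PACK_DIR:-}" \
    "${XDG_CONFIG_HOME:-$HOME/.config}/llm-dark-patterns/packs" \
    "${REPO_ROOT:-.}/packs"
  do
    # Skip the explicit override when the variable is unset or empty.
    [ -n "$root" ] || continue
    if [ -f "$root/$subdir/$name.txt" ]; then
      printf '%s\n' "$root/$subdir/$name.txt"
    fi
  done
}

# Usage: pack_candidates locale es
```

Printing all candidates rather than only the first keeps the sketch neutral about override versus layering semantics; a caller that wants override behaviour can take the first line.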
Locale selection:
1. `$LLM_DARK_PATTERNS_LOCALE=en,es,pl`: explicit comma-separated
2. `${LANG:0:2}`: auto-detect when env unset (always layered on top of `en`)
3. `en`: final fallback
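
A minimal sketch of that selection logic, assuming the explicit list is used verbatim and the auto-detected code is appended after `en`. The function name `active_locales` is hypothetical; it takes the two inputs as arguments so it can be exercised without changing the session's environment.

```shell
# active_locales EXPLICIT_LIST LANG_VALUE
# Prints the locale codes to load, one per line.
active_locales() {
  local explicit="$1" lang="$2" codes out
  if [ -n "$explicit" ]; then
    codes="$explicit"
  else
    # First two characters of LANG, layered on top of the English base.
    codes="en,$(printf '%s' "$lang" | cut -c1-2)"
  fi
  # Keep two-letter lowercase codes only and drop duplicates.
  out="$(printf '%s\n' "$codes" | tr ',' '\n' \
    | LC_ALL=C grep -E '^[a-z][a-z]$' | awk '!seen[$0]++')"
  # Final fallback when nothing usable remains.
  printf '%s\n' "${out:-en}"
}

# Usage: active_locales "${LLM_DARK_PATTERNS_LOCALE:-}" "${LANG:-}"
```

Values such as `C` or an empty `LANG` fail the two-letter filter, so they fall through to `en` alone.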
Surface opt-in for destructive packs:
- `LLM_DARK_PATTERNS_DESTRUCTIVE_PACKS=filesystem,container,git-protected`: subset
- Default: all 7 surfaces active
Evidence category opt-in:
- `LLM_DARK_PATTERNS_EVIDENCE_CATEGORIES=app-dev,devops,k8s`: subset
- Default: all 9 categories active
The paper-grade, benchmark-backed lane lives in waitdeadai/agent-closeout-bench. It is not a replacement for the small standalone hooks; it is the reproducible engine layer that makes closeout mechanics testable, hashable, and comparable. For daily use, the adapter installer writes the selected Stop/SubagentStop hook wrappers plus a PreToolUse tamper guard. The tamper guard blocks ordinary Claude Code attempts to edit the hook wiring, adapter env, pinned engine, or pinned rule pack; it is not an OS sandbox and should not be described as bypass-proof.
The v0.2 evidence-claim engine deliberately rejects weak proof shapes such as `Implemented and checked.`, `Done. Commands run: none.`, and `Changed files:` without command or verification evidence. This is still closeout-contract evidence, not independent proof that the underlying work truly happened.
Current physics-backed adapters:
| Adapter hook | Category engine | Use |
|---|---|---|
| no-vibes.sh | evidence_claims | block completion/verification claims without evidence markers |
| no-wrap-up.sh | wrap_up | block generic retention tails |
| no-cliffhanger.sh | cliffhanger | block dangling permission loops |
| no-roleplay-drift.sh | roleplay_drift | block persona drift replacing useful status |
| no-sycophancy.sh | sycophancy | block praise/validation before substance |
Install all physics-backed adapters from a clone of AgentCloseoutBench:

```shell
git clone https://github.com/waitdeadai/agent-closeout-bench
cd agent-closeout-bench
bash adapters/claude-code/install.sh /path/to/your/project
bash scripts/hook-smoke.sh
```

Install one adapter:

```shell
bash adapters/claude-code/install.sh /path/to/your/project no-cliffhanger
```

The adapter installer writes a `.claude/settings.agentcloseout.example.json` snippet for Claude Code, including the tamper-guard PreToolUse entry. Merge the entries you want into `.claude/settings.json`.
For research, fixtures, public-data intake, human-labeling protocol, and collaboration telemetry, use AgentCloseoutBench directly:

```shell
bin/agentcloseout-physics lint-rules rules/closeout
bin/agentcloseout-physics test-rules rules/closeout fixtures/closeout
bin/agentcloseout-physics telemetry-preview --queue /path/to/local-queue.jsonl
```

Every hook in the suite follows the same 4-step design:
- Pick a failure mode that has a textual signature. Not "model is wrong" (no signature). Something like "claims success without evidence" or "opens with praise-spam" — these have distinct vocabularies.
- Define the signature precisely. Two regex sets: the bad pattern, and the redemption (or allow) pattern. Bad without redemption → trigger.
- Wire a non-LLM judge at a Claude Code hook event. Bash. Python. Anything that isn't another LLM call. The judge is not the same kind of thing as the actor.
- Block + repair-template. A bare block stalls. A block + the literal compliant shape lets the model copy the template on the next turn. The repair-template teaches; the block alone just punishes.
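
The four steps can be sketched as one small shell function. Everything named here is an assumption for illustration: the `BAD` and `REDEEM` patterns, the `judge_closeout` name, and the template wording are not taken from any shipped hook. The return code of 2 mirrors Claude Code's convention that a hook command exiting with status 2 blocks and has its stderr fed back to the model; how a real hook obtains the closeout text from the hook payload is left out.

```shell
# Step 2: two pattern sets (illustrative vocabulary, not the shipped packs).
BAD='(all done|task complete|everything works|successfully (fixed|implemented))'
REDEEM='(exit code [0-9]+|[0-9]+ tests? passed)'

# Step 3: a deterministic judge. judge_closeout TEXT
# Returns 0 to allow, 2 to block with a repair template on stderr.
judge_closeout() {
  local text="$1"
  # Bad pattern present and redemption pattern absent: trigger.
  if printf '%s\n' "$text" | grep -Eiq "$BAD" &&
     ! printf '%s\n' "$text" | grep -Eiq "$REDEEM"; then
    # Step 4: block plus the literal compliant shape.
    cat >&2 <<'EOF'
BLOCKED: completion claim without same-message evidence.
Repair template:
  Command run: <command>
  Result: <exit code or test summary>
EOF
    return 2
  fi
  return 0
}

# Usage inside a hook script, once the closeout text is in "$closeout":
#   judge_closeout "$closeout" || exit $?
```

Keeping the judge as a pure function of the text makes it easy to cover with fixtures: each fixture is a string plus the expected return code.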
This pattern composes. If you find another dark pattern with a clean textual signature, write no-X.sh in 50–100 lines of bash and ship it as a sister repo. If you publish it under the same conventions (Apache-2.0, single file, RECEIPTS.md with reproducible fixtures, sister-tools cross-link block), open a PR adding it to the table above.
This suite is orthogonal, not competitive, to the established LLM safety tools. Each operates at a different boundary:
| Layer | Tool | Catches | Operates at |
|---|---|---|---|
| Input firewall | Lakera Guard, LLM Guard (ProtectAI), Pangea | Prompt injection, jailbreak, PII in input | LLM API request boundary |
| Conversational rails | NVIDIA NeMo Guardrails | Topic control, dialog flow, fact-check, jailbreak | Inline middleware, programmable Colang DSL |
| Agent runtime policy | AgentSpec (Wang et al., ICSE '26), Pro²Guard (arXiv 2508.00500) | Code execution safety, embodied agent safety, AV compliance | Agent tool-call boundary, DSL-defined triggers/predicates |
| Output content scanning | LLM Guard output scanners (35 scanners) | Toxicity, PII, gibberish, factual consistency, bias, code injection | LLM response boundary |
| Behavioral benchmarks (not enforcement) | DarkBench (Kran et al., ICLR '25), DarkPatterns-LLM (arXiv 2512.22470) | Six dark-pattern categories (sycophancy, anthropomorphism, sneaking, brand bias, retention, harmful gen) | Offline evaluation corpora |
| This suite | llm-dark-patterns | Dark-pattern closeout-boundary enforcement: sycophancy, false-success, permission loops, paternalism, training-cutoff confidence, compaction amnesia, and the Slice 2-5 fact-fabrication / interaction-style / multi-agent / residual families | Claude Code Stop / SubagentStop / TaskCreated / TaskCompleted / PreToolUse / PostToolUse / PreCompact / PostCompact / SessionStart hook events |
The intersection of dark-pattern detection + runtime enforcement at the agent loop boundary is the slot this suite occupies. The enforcement-side peers (Lakera, NeMo, LLM Guard, AgentSpec) currently have approximately zero dark-pattern coverage; the dark-pattern-side peers (DarkBench, DarkPatterns-LLM) are detection corpora with no runtime enforcement.
You can run this suite alongside any of the input-firewall / conversational-rails / output-scanning layers — they operate at different boundaries and don't conflict. Pairing with the deterministic floor + LLM ceiling pattern (2026 production best-practice: deterministic checks first, LLM judge for ambiguous cases) is also straightforward — the hooks' BLOCKED decisions are the deterministic floor, and an out-of-band LLM judge can be wired as a separate hook for cases the regex doesn't catch.
Re-ran DarkBench (Kran et al., ICLR 2025, arXiv:2503.10728) against claude-sonnet-4-6 in 2026-05. Sycophancy prevalence dropped from 13% in the paper's 14-model 2025 average to 1.8% on Sonnet 4.6 alone — RLHF appears to have measurably reduced the canonical sycophancy surface in the year between studies. Anthropomorphization (62%) and user-retention (79%) prevalence remain high.
Hooks tested as black-box text classifiers against the same corpus. With the Rust YAML rule pack engine (physics-engines plan, Slices 2-5), no-roleplay-drift achieves F1 0.590 (P 0.655, R 0.537) on the anthropomorphization sample — a 5x true-positive jump (7→36) over the original bash-only F1 of 0.163. The other three in-scope hooks (no-sycophancy, no-wrap-up, no-cliffhanger) show parity between bash regex and Rust YAML on this chat surface. User-retention hooks underperformed because the chat-reply vocabulary in DarkBench prompts is emotional/relational ("good friend dropping by", "your daily companion") rather than the transactional closeout vocabulary the hooks were tuned for ("shall we wrap up", "let me know if anything else"). The 240-character opener window in no-sycophancy also misses sycophancy that lives later in long responses.
Honest data: the hooks have a documented vocabulary-distribution gap when applied to chat-reply text vs the Claude Code closeout text they were designed for. Reproducible end-to-end (~$12 PAYG-equiv, ~3 hours sequential). Full v1 results → · Head-to-head bash-vs-Rust comparison (v1.5-rust) →
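
The opener-window limitation noted above is easy to reproduce. The sketch below checks only the first 240 bytes of a message for praise vocabulary; the `opens_with_praise` helper and its three phrases are assumptions for illustration, and `head -c` counts bytes rather than characters, so the window is approximate for non-ASCII text.

```shell
# opens_with_praise TEXT -> exit status 0 if praise vocabulary appears
# inside the opener window (default 240 bytes), 1 otherwise.
opens_with_praise() {
  printf '%s' "$1" \
    | head -c "${OPENER_WINDOW:-240}" \
    | grep -Eiq 'great question|excellent (point|idea)|absolutely right'
}

# Praise at the start is caught.
opens_with_praise "Great question! Here is the fix." && echo "caught"

# The same praise after 300 characters of other text is not.
filler="$(printf '%0300d' 0)"
opens_with_praise "$filler Great question, by the way." || echo "missed"
```

Widening the window trades this recall gap for more false positives on legitimate mid-message prose, which is why the window size is a tuning decision rather than a fix.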

```shell
claude plugin marketplace add waitdeadai/claude-plugins
claude plugin install llm-dark-patterns@waitdeadai-plugins
```

This installs all 31 wired hooks across Stop, SubagentStop, TaskCreated, TaskCompleted, PreToolUse, PostToolUse, PreCompact, PostCompact, and SessionStart events. Each hook remains independently disablable by editing hooks.json after install.
The self-hosted marketplace at waitdeadai/claude-plugins is the canonical install path because the Anthropic community marketplace pipeline has stalled for many submitters since at least March 2026. This plugin shows as Published in the submissions dashboard since 2026-05-11 but does not appear in the live claude-plugins-community/marketplace.json (verified 2026-05-17 — zero matches across 1715 entries; last bulk sync to that file was 2026-05-13 with no new syncs since).
The same pattern is documented across at least eight open issues on anthropics/claude-plugins-official: #984 (since 2026-03-25, 11 comments), #1272 (closed without resolution, 23+ "same here" comments), #1474, #1512, #1834, #1841, #1870, #1887. Two sync PRs (#18, #21) have been stuck unmerged for 12-15 days.
If Anthropic's pipeline resumes, the community-marketplace path becomes a redundant install option, but until then the self-hosted route above is the only one that actually resolves:

```shell
# Currently does NOT resolve for this plugin or for many others; see #1887
claude plugin marketplace add anthropics/claude-plugins-community
claude plugin install llm-dark-patterns@claude-community
```

The public standalone repos are still the simplest daily-use path when you want a subset rather than the whole suite. Install the single-file hooks that already have standalone repos:

```shell
mkdir -p .claude/hooks

# Single-file hooks
for hook in no-vibes time-anchor no-curfew no-sycophancy no-cliffhanger honest-eta no-fake-recall no-fake-stats no-fake-cite; do
  curl -fsSL "https://raw.githubusercontent.com/waitdeadai/${hook}/main/${hook}.sh" \
    -o ".claude/hooks/${hook}.sh"
  chmod +x ".claude/hooks/${hook}.sh"
done

# no-amnesia is a 5-file bundle (state engine + 4 event wrappers)
for f in state.sh state-stop.sh state-precompact.sh state-postcompact.sh state-sessionstart.sh; do
  curl -fsSL "https://raw.githubusercontent.com/waitdeadai/no-amnesia/main/hooks/${f}" \
    -o ".claude/hooks/${f}"
  chmod +x ".claude/hooks/${f}"
done
```

Then merge each repo's `settings.example.json` hooks block into your `.claude/settings.json`. Each hook is independent; you can install any subset.
Requires jq (and python3 for time-anchor and no-amnesia).
The industry is optimizing LLMs for mass-market efficiency: faster, shorter, more agreeable, more cautious. That gradient runs against the power-user objective of correct results, deep verification, and operator agency. The Dark Patterns Hooks suite is the counter-position: small, surgical bash hooks that suppress the polite-cautious-efficient defaults at the textual boundary so the model can produce results instead of vibes.
The hooks are conservative on purpose — they would rather false-positive on legitimate prose that overlaps the dark-pattern vocabulary than false-negative on the actual dark pattern. The repair-template is the part that makes false-positives non-painful: when the hook fires on a legitimate use, the model sees the template and can repair into a closeout shape that satisfies the regex.
Important clarification: this suite does not suppress safety refusals, content-policy enforcement, or harm-prevention. The regexes are narrow to specific interaction-style defaults (sycophancy, paternalism, false-completion, permission-spam). If you want a tool that bypasses Claude's safety behaviors, this is not it.
This suite catches specific textual signatures at the Claude Code stop and sub-agent-stop lifecycle hooks. It is not a sandbox, a refusal layer, or a prompt-injection defence. Operators considering this for safety-critical workflows should read the failure modes below before relying on it.
- Lexical evasion. The hooks are deterministic and rule-based. Paraphrase, synonym substitution, and reordering can defeat any single rule. The companion benchmark `agent-closeout-bench` is the surface where this brittleness is measured rather than hidden.
- Hook misconfiguration. The runtime requires correctly wired Claude Code hook entries in `.claude/settings.json`. A disabled hook, a typo in the hook command, or a workflow that bypasses `Stop` entirely produces a silent miss. The hook cannot detect that it was never invoked.
- Runtime bypass. A local operator with shell access can disable hooks at any time, or pipe the hook output to `true`. This is a deterministic verdict layer, not an operating-system sandbox. Threats from a hostile operator on the same host are out of scope.
- In-band manipulation. Hook decision logic runs out-of-band relative to the model that produced the closeout, so the model cannot directly rewrite the verdict. The model can, however, be coerced by an upstream prompt to phrase a false closeout in language the rules do not catch. Out-of-band means the verdict path is independent of the model context, not that the closeout text is unmanipulable.
- Evidence-marker limitations. Several hooks treat tokens such as `verification passed` or `tests pass` as closeout-contract evidence that mitigates other dark-pattern signals. These markers establish that the closeout contract was honoured in form; they are not independent proof that the underlying verification command actually succeeded. A workflow that wants stronger guarantees should record verifier outputs separately and treat closeout text as one of several signals.
- Coverage and language scope. Rules are English-only. The lifecycle surface is the Claude Code `Stop` and `SubagentStop` hook payload; behaviour on other agent frameworks is undefined.
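
Because a hook cannot detect that it was never invoked, the misconfiguration case has to be checked from outside. A minimal sketch, assuming hooks live in `.claude/hooks/` and are referenced by file name in `.claude/settings.json`; the `check_wiring` helper is hypothetical and does a plain text search, not JSON validation.

```shell
# check_wiring PROJECT_DIR
# Prints one line per problem and returns 1 if any was found.
check_wiring() {
  local project="$1" settings hook name status=0
  settings="$project/.claude/settings.json"
  if [ ! -f "$settings" ]; then
    echo "missing: $settings"
    return 1
  fi
  for hook in "$project"/.claude/hooks/*.sh; do
    [ -e "$hook" ] || continue          # glob matched nothing
    name="$(basename "$hook")"
    if [ ! -x "$hook" ]; then
      echo "not executable: $name"
      status=1
    fi
    if ! grep -Fq "$name" "$settings"; then
      echo "not referenced in settings.json: $name"
      status=1
    fi
  done
  return "$status"
}

# Usage: check_wiring /path/to/your/project
```

Helper scripts that are sourced by other hooks rather than wired directly will be reported as unreferenced, so the output is a prompt for review rather than a verdict.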
Hooks were extracted from the minmaxing governance harness, which uses the same patterns at higher level (workflow contracts, spec-first, agent-native estimation, /agentfactory).
PRs welcome to:
- Add a new hook to the suite (must follow the conventions: single file, Apache-2.0, `RECEIPTS.md` with reproducible fixtures, allow-clause discipline).
- Improve a regex (must include a fixture in `RECEIPTS.md` covering the case).
- Document a dark pattern that needs a hook but doesn't yet have one (file an issue with the textual signature you'd want caught).
Apache-2.0. Each individual hook repo is also Apache-2.0.
Where in-context rules drift, out-of-band enforcement holds.