Extreme semantic steering in small language models (Pythia-1.4B).
We applied α=10.0 steering pressure via a Centroid Repulsion loss. Expected collapse. Got stability.
| α | Semantic Divergence | Final Loss |
|---|---|---|
| 0.0 (baseline) | +0.123% | 3.05 |
| 1.0 | +0.168% | 3.07 |
| 2.0 | +0.183% | 3.10 |
| 5.0 | +0.197% | 3.12 |
| 10.0 | +0.202% | 3.12 |
Result: 64% improvement in sense separation vs data-only baseline. No perplexity collapse. Saturation at α≈2.
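The headline 64% figure follows directly from the table above (α=0.0 vs. α=10.0 rows):

```python
# Relative improvement in semantic divergence: α=10.0 vs. data-only baseline
baseline = 0.123  # α=0.0 row, Semantic Divergence (%)
steered = 0.202   # α=10.0 row, Semantic Divergence (%)
improvement = (steered - baseline) / baseline
print(f"{improvement:.0%}")  # → 64%
```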
Full writeup: *The Sledgehammer Anomaly: Why Pythia-1.4B Refused to Break*
```
sledgehammer-polysemy/
├── README.md
├── LICENSE
├── training/
│   ├── protocol_c_unified_v2.py        # Main training script with GUI
│   └── protocol_e_breaker.py           # Breaker run (α=2, 5, 10)
├── data/
│   ├── talmyan_sense_definitions.json  # 16 polysemous words, 43 senses
│   └── protocol_c_dataset.json         # 5,919 training sequences from The Pile
├── evaluation/
│   ├── rare_sense_test.py              # Behavioral testing script
│   └── statistical_comparison.py       # 400-prompt comparison script
└── results/
    ├── evaluation_results.json         # Protocol C/D numerical results
    └── breaker_results.json            # Protocol E numerical results
```
```bash
# Install dependencies (tkinter ships with Python's standard library; don't pip install it)
pip install torch transformers datasets

# Run training (GUI)
python training/protocol_c_unified_v2.py

# Run breaker experiment
python training/protocol_e_breaker.py

# Run behavioral evaluation
python evaluation/rare_sense_test.py
```

Requirements:
- RTX 4090 24GB (or similar VRAM)
- ~1 hour per condition
- ~5 hours total for all experiments
```
L_total = L_CE + α × L_CR
```

Where:
- `L_CE` = standard cross-entropy (language modeling)
- `L_CR` = Centroid Repulsion (pushes sense-specific activation clusters apart)
- `α` = steering coefficient
The loss acts purely on geometry (cosine similarity). It doesn't know or care about frequency or probability. This created "sticky priors" where rare senses dominated—a limitation we address in Project Titan.
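The actual implementation lives in `training/protocol_c_unified_v2.py`; as a minimal sketch of the idea (function and variable names here are illustrative, not the repo's), a cosine-similarity-based Centroid Repulsion term combined with cross-entropy might look like:

```python
import torch
import torch.nn.functional as F

def centroid_repulsion(hidden, sense_ids):
    """Push per-sense activation centroids apart (illustrative sketch).

    hidden:    (n_tokens, d_model) activations at sense-bearing positions
    sense_ids: (n_tokens,) integer sense label per token
    Returns mean pairwise cosine similarity between sense centroids;
    minimizing it spreads the clusters apart. Purely geometric — it
    never sees token frequency or probability.
    """
    centroids = [hidden[sense_ids == s].mean(dim=0) for s in sense_ids.unique()]
    C = F.normalize(torch.stack(centroids), dim=-1)   # (n_senses, d_model)
    sim = C @ C.T                                     # pairwise cosine similarity
    n = sim.size(0)
    off_diag = sim[~torch.eye(n, dtype=torch.bool)]   # drop self-similarity
    return off_diag.mean()

def total_loss(logits, targets, hidden, sense_ids, alpha):
    """L_total = L_CE + α × L_CR."""
    l_ce = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1))
    l_cr = centroid_repulsion(hidden, sense_ids)
    return l_ce + alpha * l_cr
```

Because `L_CR` only scores centroid geometry, nothing in it penalizes a rare sense's cluster for claiming too much probability mass — which is one way to see how the "sticky priors" limitation arises.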
If you use this work, please cite:
```bibtex
@misc{wade2026sledgehammer,
  author = {Wade, Benjamin},
  title  = {The Sledgehammer Anomaly: Why Pythia-1.4B Refused to Break},
  year   = {2026},
  url    = {LINK_TO_LESSWRONG_POST}
}
```

License: MIT
The experimental design and analysis were developed through iterative dialogue with Claude (Anthropic) and Gemini (Google), which proved valuable for steelmanning hypotheses and identifying failure modes. Code and experiments were executed by the author.
