Extreme semantic steering in small language models (Pythia-1.4B).
We applied α=10.0 steering pressure via a Centroid Repulsion loss. Expected collapse. Got stability.
| α | Semantic Divergence | Final Loss |
|---|---|---|
| 0.0 (baseline) | +0.123% | 3.05 |
| 1.0 | +0.168% | 3.07 |
| 2.0 | +0.183% | 3.10 |
| 5.0 | +0.197% | 3.12 |
| 10.0 | +0.202% | 3.12 |
Result: 64% improvement in sense separation vs data-only baseline. No perplexity collapse. Saturation at α≈2.
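The headline 64% figure follows directly from the table above (α=0.0 vs. α=10.0 rows):

```python
# Relative improvement in semantic divergence: α=10.0 vs. data-only baseline
baseline = 0.123  # α=0.0 row, Semantic Divergence (%)
steered = 0.202   # α=10.0 row, Semantic Divergence (%)
improvement = (steered - baseline) / baseline
print(f"{improvement:.0%}")  # → 64%
```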
Full writeup: *The Sledgehammer Anomaly: Why Pythia-1.4B Refused to Break*
```
sledgehammer-polysemy/
├── README.md
├── LICENSE
├── training/
│   ├── protocol_c_unified_v2.py        # Main training script with GUI
│   └── protocol_e_breaker.py           # Breaker run (α=2, 5, 10)
├── data/
│   ├── talmyan_sense_definitions.json  # 16 polysemous words, 43 senses
│   └── protocol_c_dataset.json         # 5,919 training sequences from The Pile
├── evaluation/
│   ├── rare_sense_test.py              # Behavioral testing script
│   └── statistical_comparison.py       # 400-prompt comparison script
└── results/
    ├── evaluation_results.json         # Protocol C/D numerical results
    └── breaker_results.json            # Protocol E numerical results
```
```bash
# Install dependencies (tkinter ships with Python's standard library; don't pip install it)
pip install torch transformers datasets

# Run training (GUI)
python training/protocol_c_unified_v2.py

# Run breaker experiment
python training/protocol_e_breaker.py

# Run behavioral evaluation
python evaluation/rare_sense_test.py
```

Requirements:
- RTX 4090 24GB (or similar VRAM)
- ~1 hour per condition
- ~5 hours total for all experiments
```
L_total = L_CE + α × L_CR
```

Where:
- `L_CE` = standard cross-entropy (language modeling)
- `L_CR` = Centroid Repulsion (pushes sense-specific activation clusters apart)
- `α` = steering coefficient
The loss acts purely on geometry (cosine similarity). It doesn't know or care about frequency or probability. This created "sticky priors" where rare senses dominated—a limitation we address in Project Titan.
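The actual implementation lives in `training/protocol_c_unified_v2.py`; as a minimal sketch of the idea (function and variable names here are illustrative, not the repo's), a cosine-similarity-based Centroid Repulsion term combined with cross-entropy might look like:

```python
import torch
import torch.nn.functional as F

def centroid_repulsion(hidden, sense_ids):
    """Push per-sense activation centroids apart (illustrative sketch).

    hidden:    (n_tokens, d_model) activations at sense-bearing positions
    sense_ids: (n_tokens,) integer sense label per token
    Returns mean pairwise cosine similarity between sense centroids;
    minimizing it spreads the clusters apart. Purely geometric — it
    never sees token frequency or probability.
    """
    centroids = [hidden[sense_ids == s].mean(dim=0) for s in sense_ids.unique()]
    C = F.normalize(torch.stack(centroids), dim=-1)   # (n_senses, d_model)
    sim = C @ C.T                                     # pairwise cosine similarity
    n = sim.size(0)
    off_diag = sim[~torch.eye(n, dtype=torch.bool)]   # drop self-similarity
    return off_diag.mean()

def total_loss(logits, targets, hidden, sense_ids, alpha):
    """L_total = L_CE + α × L_CR."""
    l_ce = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1))
    l_cr = centroid_repulsion(hidden, sense_ids)
    return l_ce + alpha * l_cr
```

Because `L_CR` only scores centroid geometry, nothing in it penalizes a rare sense's cluster for claiming too much probability mass — which is one way to see how the "sticky priors" limitation arises.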
If you use this work, please cite:
```bibtex
@misc{wade2026sledgehammer,
  author = {Wade, Benjamin},
  title  = {The Sledgehammer Anomaly: Why Pythia-1.4B Refused to Break},
  year   = {2026},
  url    = {LINK_TO_LESSWRONG_POST}
}
```

License: MIT
The experimental design and analysis were developed through iterative dialogue with Claude (Anthropic) and Gemini (Google), which proved valuable for steelmanning hypotheses and identifying failure modes. Code and experiments were executed by the author.
