Steering Through the Storm of Words

The Sledgehammer Protocol

Extreme semantic steering in small language models (Pythia-1.4B).

Key Finding

We applied steering pressure of up to α = 10.0 via a Centroid Repulsion loss. We expected collapse; we got stability.

α               Semantic Divergence   Final Loss
0.0 (baseline)  +0.123%               3.05
1.0             +0.168%               3.07
2.0             +0.183%               3.10
5.0             +0.197%               3.12
10.0            +0.202%               3.12

Result: a 64% improvement in sense separation over the data-only baseline ((0.202 - 0.123) / 0.123 ≈ 64%), no perplexity collapse, and saturation at α ≈ 2.
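
For reference, below is a minimal sketch of how a sense-separation score of this kind could be computed: the mean pairwise cosine distance between per-sense centroids of last-layer hidden states. The checkpoint name, the prompt format, and the choice of the final-token hidden state are assumptions made for illustration; the numbers in the table come from the scripts in evaluation/.

import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "EleutherAI/pythia-1.4b"   # assumed checkpoint name
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL).eval()

@torch.no_grad()
def sense_separation(prompts_by_sense):
    """Mean pairwise cosine distance between per-sense centroids of the
    final-layer hidden state at each prompt's last token (assumed to be
    the polysemous word)."""
    centroids = []
    for prompts in prompts_by_sense:                # one list of prompts per sense
        states = []
        for p in prompts:
            ids = tok(p, return_tensors="pt")
            out = model(**ids, output_hidden_states=True)
            states.append(out.hidden_states[-1][0, -1])   # last layer, last token
        centroids.append(torch.stack(states).mean(dim=0))
    c = F.normalize(torch.stack(centroids), dim=-1)
    sim = c @ c.T
    off_diag = sim[~torch.eye(len(c), dtype=torch.bool)]
    return (1.0 - off_diag).mean().item()

# Example: two senses of "bank", two prompts each.
sep = sense_separation([
    ["She deposited the check at the bank", "He works at the bank"],
    ["They had a picnic on the river bank", "Reeds grew along the bank"],
])
print(f"sense separation: {sep:.4f}")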

Blog Post

Full writeup: The Sledgehammer Anomaly: Why Pythia-1.4B Refused to Break

Repository Structure

sledgehammer-polysemy/
├── README.md
├── LICENSE
├── training/
│   ├── protocol_c_unified_v2.py    # Main training script with GUI
│   └── protocol_e_breaker.py       # Breaker run (α=2, 5, 10)
├── data/
│   ├── talmyan_sense_definitions.json  # 16 polysemous words, 43 senses
│   └── protocol_c_dataset.json         # 5,919 training sequences from The Pile
├── evaluation/
│   ├── rare_sense_test.py          # Behavioral testing script
│   └── statistical_comparison.py   # 400-prompt comparison script
└── results/
    ├── evaluation_results.json     # Protocol C/D numerical results
    └── breaker_results.json        # Protocol E numerical results

Usage

# Install dependencies (tkinter ships with Python itself, not via pip;
# on Debian/Ubuntu it may require the python3-tk system package)
pip install torch transformers datasets

# Run training (GUI)
python training/protocol_c_unified_v2.py

# Run breaker experiment  
python training/protocol_e_breaker.py

# Run behavioral evaluation
python evaluation/rare_sense_test.py

Hardware Requirements

  • GPU with ~24 GB of VRAM (e.g., RTX 4090)
  • ~1 hour per condition
  • ~5 hours total for all experiments

Key Concepts

Centroid Repulsion Loss

L_total = L_CE + α × L_CR

Where:

  • L_CE = Standard cross-entropy (language modeling)
  • L_CR = Centroid Repulsion (pushes sense-specific activation clusters apart)
  • α = Steering coefficient
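
A minimal PyTorch sketch of this combined objective is shown below. It assumes the repulsion term is the mean off-diagonal cosine similarity between per-sense centroids; the function names and exact formulation are illustrative, and the real implementation lives in training/protocol_c_unified_v2.py.

import torch
import torch.nn.functional as F

def centroid_repulsion(hidden_states, sense_ids):
    """Mean pairwise cosine similarity between per-sense activation centroids.

    hidden_states: (N, D) activations taken at the polysemous-word positions.
    sense_ids:     (N,) integer sense label for each activation.
    """
    senses = sense_ids.unique()
    if senses.numel() < 2:
        return hidden_states.new_zeros(())          # nothing to repel
    centroids = torch.stack(
        [hidden_states[sense_ids == s].mean(dim=0) for s in senses])
    centroids = F.normalize(centroids, dim=-1)      # compare directions only
    sim = centroids @ centroids.T                   # (S, S) cosine similarities
    mask = ~torch.eye(len(centroids), dtype=torch.bool, device=sim.device)
    return sim[mask].mean()                         # lower = senses pushed apart

def total_loss(ce_loss, hidden_states, sense_ids, alpha):
    # L_total = L_CE + alpha * L_CR
    return ce_loss + alpha * centroid_repulsion(hidden_states, sense_ids)

In a training step, ce_loss would come from the model's language-modeling head and hidden_states from the layer being steered.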

The Geometry-Probability Gap

The loss acts purely on geometry (cosine similarity); it does not know or care about frequency or probability. This created "sticky priors" where rare senses dominated, a limitation we address in Project Titan.
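
A small self-contained demonstration of that frequency blindness, under the same assumed centroid form as the sketch above: because each sense is reduced to an unweighted centroid, the repulsion term is the same whether a sense appears once or a hundred times in a batch.

import torch
import torch.nn.functional as F

def cr_term(h, sense_ids):
    # Unweighted per-sense centroids: how often a sense occurs is averaged away.
    centroids = F.normalize(torch.stack(
        [h[sense_ids == s].mean(dim=0) for s in sense_ids.unique()]), dim=-1)
    sim = centroids @ centroids.T
    return sim[~torch.eye(len(centroids), dtype=torch.bool)].mean()

torch.manual_seed(0)
rare, common = torch.randn(1, 64), torch.randn(1, 64)

# Batch A: the rare and common senses each appear once.
h_a = torch.cat([rare, common])
ids_a = torch.tensor([0, 1])

# Batch B: the common sense appears 100x as often.
h_b = torch.cat([rare, common.repeat(100, 1)])
ids_b = torch.tensor([0] + [1] * 100)

print(cr_term(h_a, ids_a).item(), cr_term(h_b, ids_b).item())  # identical up to float rounding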

Citation

If you use this work, please cite:

@misc{wade2026sledgehammer,
  author = {Wade, Benjamin},
  title = {The Sledgehammer Anomaly: Why Pythia-1.4B Refused to Break},
  year = {2026},
  url = {LINK_TO_LESSWRONG_POST}
}

License

MIT

Acknowledgments

The experimental design and analysis were developed through iterative dialogue with Claude (Anthropic) and Gemini (Google), which proved valuable for steelmanning hypotheses and identifying failure modes. Code and experiments were executed by the author.
