A 3B model that reports confidence you can act on, and spends inference compute only where it pays. On a laptop.
Small language models earn their keep when they run cheaply: offline, private, on hardware you already own. They also earn the least trust, because they fail with a straight face. Bojador studies two fixes for one of them, SmolLM3-3B. Each was measured under a protocol fixed before the run, then picked apart by a second AI acting as a hostile reviewer.
- A disagreement cascade that matches 6-sample voting accuracy while spending about 47% of the tokens on GSM8K.
- A reporter-only adapter that makes the model's spoken confidence discriminate its own errors (AUROC ≈0.87, against ≈0.75 for the best external use of the raw signal). In-domain, replicated.
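
As a rough illustration of the first bullet, here is a minimal sketch of a disagreement cascade. It assumes the simplest possible design: draw a small probe of samples, stop if they agree, and escalate to the full voting budget only when they disagree. The names `cascade`, `probe`, and `budget` and the two-sample probe are hypothetical choices for this sketch; the measured implementation is `src/cascade_block.py` and may differ.

```python
import random
from collections import Counter
from typing import Callable, List, Tuple


def majority(answers: List[str]) -> str:
    """Most common answer; ties go to the answer seen first."""
    return Counter(answers).most_common(1)[0][0]


def cascade(sample: Callable[[], str], probe: int = 2, budget: int = 6) -> Tuple[str, int]:
    """Draw `probe` samples and escalate to `budget` samples only on disagreement.

    Returns (answer, number_of_samples_used). The probe size is an assumption
    made for this sketch, not a value taken from the repository.
    """
    answers = [sample() for _ in range(probe)]
    if len(set(answers)) == 1:
        # The probe agrees: skip the remaining samples.
        return answers[0], probe
    # The probe disagrees: spend the rest of the budget and vote.
    answers += [sample() for _ in range(budget - probe)]
    return majority(answers), budget


if __name__ == "__main__":
    rng = random.Random(0)

    def noisy_model() -> str:
        # Toy stand-in for the model: answers "42" about 80% of the time.
        return "42" if rng.random() < 0.8 else str(rng.randint(0, 9))

    print(cascade(noisy_model))
```

The sample count returned here is only a proxy for cost. The token and time figures in this README come from the measured pipeline, not from this sketch.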
Don't take the numbers on faith. Run them. Every figure maps to a versioned artifact with a hash.
| result | status | where |
|---|---|---|
| Disagreement cascade ≈ uniform voting at 47% tokens / 46% time | confirmed (pre-registered non-inferiority, GSM8K) | src/cascade_block.py |
| Joint reporter > S6-trained control in error discrimination | confirmed & replicated (synthetic generator) | src/tt4_*.py |
| Per-item teacher information contributes beyond target richness | descriptive support (10 seed-pairs) | src/tt4_mech.py, tt4_final.py |
| Brier superiority over an external Platt calibrator | not confirmed (3 measurements, all marginal) | — |
| Incremental joint advantage cross-domain | not confirmed | src/tt4_final.py |
| Trained reporters transfer where external calibration degrades | descriptive | HellaSwag |
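
For readers unfamiliar with the two metrics in the table, the sketch below shows one common way to compute them from per-item confidences and correctness labels. It is an illustration only: the function names are hypothetical, and the repository's own scripts may compute these quantities differently (for example, with different tie handling).

```python
from typing import Sequence


def auroc_error_discrimination(conf: Sequence[float], correct: Sequence[bool]) -> float:
    """Probability that a correct item gets higher confidence than a wrong one.

    Ties count as 0.5. A value of 0.5 means the confidence carries no
    information about errors; 1.0 means perfect separation.
    """
    right = [c for c, ok in zip(conf, correct) if ok]
    wrong = [c for c, ok in zip(conf, correct) if not ok]
    if not right or not wrong:
        raise ValueError("need at least one correct and one incorrect item")
    wins = sum((r > w) + 0.5 * (r == w) for r in right for w in wrong)
    return wins / (len(right) * len(wrong))


def brier(conf: Sequence[float], correct: Sequence[bool]) -> float:
    """Mean squared gap between stated confidence and the 0/1 outcome (lower is better)."""
    if len(conf) == 0:
        raise ValueError("need at least one item")
    return sum((c - float(ok)) ** 2 for c, ok in zip(conf, correct)) / len(conf)


if __name__ == "__main__":
    # Toy data, not results from this project.
    confidences = [0.9, 0.8, 0.6, 0.3]
    outcomes = [True, True, False, True]
    print(auroc_error_discrimination(confidences, outcomes))
    print(brier(confidences, outcomes))
```

AUROC measures ranking only, while the Brier score also penalizes miscalibrated magnitudes, which is why a reporter can discriminate errors well and still fail to beat an external calibrator on Brier.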
The full, dated lab log — including three retracted over-claims, a fixed
statistical method, and a second AI system auditing every step — is in
docs/JOURNAL.md.
```bash
python -m venv .venv && ./.venv/bin/pip install -r requirements.txt
# capability floor:
./.venv/bin/python -m src.floor_gsm8k --n 200
# the cascade (selective compute):
./.venv/bin/python -m src.cascade_block
```

Apple Silicon (MLX) required. Model pinned to an immutable revision; see
REPRODUCE.md for the full pipeline and exact commands.
- Is: a measured recipe and a public evaluation battery, on one 3B model, with the boundaries drawn honestly. Something you can run on your own small model.
- Isn't: a state-of-the-art claim, a guarantee, or a result that travels further than the tables say. The confirmatory training results live inside a synthetic generator. The cascade is confirmed on GSM8K. The cross-domain increment did not confirm, and the page says so plainly.
- One 3B model (SmolLM3).
- Confirmatory reporter results live in a synthetic MCQ generator.
- Small n, with bootstrap CIs throughout.
- Substantial train-seed heterogeneity (reported per-seed).
- Capability preservation is established only for the reporter-only pipeline (the adapter never alters the answer).
- GSM8K prompts are Portuguese, scored by a tolerant scorer validated against blind annotation.
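
To make "bootstrap CIs" and "non-inferiority" concrete, the following is a minimal sketch of a paired percentile bootstrap on per-item correctness, followed by a non-inferiority check against a margin. Everything here is an assumption for illustration: the function names, the percentile method, the number of resamples, and the margin value are hypothetical and are not taken from the pre-registered protocol, which is documented in REPRODUCE.md and docs/JOURNAL.md.

```python
import random
from typing import Sequence, Tuple


def paired_bootstrap_ci(a: Sequence[int], b: Sequence[int], n_boot: int = 2000,
                        alpha: float = 0.05, seed: int = 0) -> Tuple[float, float]:
    """Percentile CI for mean(a) - mean(b), resampling items with replacement.

    `a` and `b` are 0/1 correctness scores for the same items under two methods.
    """
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("a and b must be non-empty and the same length")
    rng = random.Random(seed)
    n = len(a)
    diffs = []
    for _ in range(n_boot):
        # Resample item indices so each pair (a[i], b[i]) stays together.
        idx = [rng.randrange(n) for _ in range(n)]
        diffs.append(sum(a[i] - b[i] for i in idx) / n)
    diffs.sort()
    lo = diffs[int((alpha / 2) * n_boot)]
    hi = diffs[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi


def non_inferior(ci: Tuple[float, float], margin: float) -> bool:
    """True when the lower CI bound of (new - baseline) stays above -margin."""
    return ci[0] > -margin


if __name__ == "__main__":
    # Toy correctness vectors, not data from this project.
    cascade_scores = [1, 1, 0, 1, 1, 0, 1, 1, 1, 0]
    voting_scores = [1, 1, 0, 1, 0, 0, 1, 1, 1, 1]
    ci = paired_bootstrap_ci(cascade_scores, voting_scores)
    print(ci, non_inferior(ci, margin=0.05))  # margin is a hypothetical value
```

Resampling whole items, not the two score lists independently, keeps the pairing intact, which matters when both methods are evaluated on the same questions.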
I ran Bojador alone, directing two AI systems in adversarial roles: one to execute, one to audit. Every claim was fixed in advance and checked against the artifacts, and the ones that failed their gates were retracted, not buried. The full log is public, as a case study in what AI-audited research can look like.
Code, model adapters, and artifacts: Apache 2.0. Paper text and figures:
CC BY 4.0. Built on
SmolLM3-3B (Apache 2.0).
See CITATION.cff for citation.
Bojador — the cape that marked the edge of the known. Passing it took method, not courage alone.