Bojador

A 3B model that reports confidence you can act on, and spends inference compute only where it pays. On a laptop.

Small language models earn their keep when they run cheaply: offline, private, on hardware you already own. They also earn the least trust, because they fail with a straight face. Bojador studies two fixes for one of them, SmolLM3-3B. Each was measured under a protocol fixed before the run, then picked apart by a second AI acting as a hostile reviewer.

A disagreement cascade that matches 6-sample voting accuracy while spending about 47% of the tokens on GSM8K.
A reporter-only adapter that makes the model's spoken confidence discriminate its own errors (AUROC ≈0.87, against ≈0.75 for the best external use of the raw signal). In-domain, replicated.
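
The cascade idea can be sketched in a few lines. This is a minimal illustration, not the code in `src/cascade_block.py`: the escalation rule (draw two probe samples, escalate to the full budget only when they disagree), the `probe` and `budget` parameters, and the `sample_answer` callable are all assumptions made for the example. Only the 6-sample voting baseline comes from the text above.

```python
from collections import Counter
from typing import Callable, Iterable, List, Tuple

# Hypothetical sampler type: each call returns (final_answer, tokens_spent).
Sampler = Callable[[], Tuple[str, int]]


def disagreement_cascade(sample_answer: Sampler, probe: int = 2, budget: int = 6) -> Tuple[str, int]:
    """Return (voted_answer, total_tokens) under an assumed escalation rule."""
    answers: List[str] = []
    tokens = 0
    for _ in range(probe):
        answer, cost = sample_answer()
        answers.append(answer)
        tokens += cost
    # Spend the remaining budget only when the probe samples disagree.
    if len(set(answers)) > 1:
        for _ in range(budget - probe):
            answer, cost = sample_answer()
            answers.append(answer)
            tokens += cost
    winner, _count = Counter(answers).most_common(1)[0]
    return winner, tokens


def make_stub(outputs: Iterable[Tuple[str, int]]) -> Sampler:
    """Deterministic stand-in for a model call, for demonstration only."""
    it = iter(outputs)
    return lambda: next(it)


easy = make_stub([("42", 100)] * 6)
hard = make_stub([("41", 100), ("42", 100), ("42", 100), ("42", 100), ("7", 100), ("42", 100)])
print(disagreement_cascade(easy))  # probes agree, so only two samples are paid for
print(disagreement_cascade(hard))  # probes disagree, so the full budget is spent
```

Under this rule the token saving depends entirely on how often the probe samples agree, which is why the figure has to be measured per dataset.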

Don't take the numbers on faith. Run them. Every figure maps to a versioned artifact with a hash.
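
Checking an artifact against its recorded hash might look like the sketch below. It assumes SHA-256 and a hex digest; the README does not say which hash function or manifest format the repository uses, and the function names here are illustrative.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream the file in chunks so large artifacts need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path: Path, expected_hex: str) -> bool:
    """Compare against a recorded digest, ignoring case and surrounding whitespace."""
    return sha256_of(path) == expected_hex.strip().lower()
```

A call such as `verify(Path("some_artifact.json"), recorded_digest)` returns `True` only if the file is byte-identical to the one that was hashed; both arguments here are placeholders.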

What's here

| Result | Status | Where |
| --- | --- | --- |
| Disagreement cascade ≈ uniform voting at 47% tokens / 46% time | confirmed (pre-registered non-inferiority, GSM8K) | `src/cascade_block.py` |
| Joint reporter > S6-trained control in error discrimination | confirmed & replicated (synthetic generator) | `src/tt4_*.py` |
| Per-item teacher information contributes beyond target richness | descriptive support (10 seed-pairs) | `src/tt4_mech.py`, `tt4_final.py` |
| Brier superiority over an external Platt calibrator | not confirmed (3 measurements, all marginal) | — |
| Incremental joint advantage cross-domain | not confirmed | `src/tt4_final.py` |
| Trained reporters transfer where external calibration degrades | descriptive | HellaSwag |
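
Two metrics named in the table, AUROC for error discrimination and the Brier score for calibration, can be computed as in the sketch below. These are the standard textbook definitions written in plain Python, not the repository's evaluation code, and the four-item data is made up for illustration.

```python
from typing import Sequence


def auroc(confidence: Sequence[float], correct: Sequence[bool]) -> float:
    """Probability that a correct item gets higher confidence than an incorrect one (ties count half)."""
    pos = [c for c, ok in zip(confidence, correct) if ok]
    neg = [c for c, ok in zip(confidence, correct) if not ok]
    if not pos or not neg:
        raise ValueError("need at least one correct and one incorrect item")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def brier(confidence: Sequence[float], correct: Sequence[bool]) -> float:
    """Mean squared gap between stated confidence and the 0/1 outcome (lower is better)."""
    return sum((c - float(ok)) ** 2 for c, ok in zip(confidence, correct)) / len(confidence)


conf = [0.9, 0.8, 0.6, 0.3]          # illustrative confidences, not project data
right = [True, True, False, True]    # illustrative correctness labels
print(auroc(conf, right), brier(conf, right))
```

AUROC depends only on the ranking of confidences, so any strictly increasing rescaling leaves it unchanged, while the Brier score also depends on the values themselves. That is one general reason a discrimination result and a Brier comparison can come out differently.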

The full, dated lab log — including three retracted over-claims, a fixed statistical method, and a second AI system auditing every step — is in docs/JOURNAL.md.

Quickstart

python -m venv .venv && ./.venv/bin/pip install -r requirements.txt
# capability floor:
./.venv/bin/python -m src.floor_gsm8k --n 200
# the cascade (selective compute):
./.venv/bin/python -m src.cascade_block

Apple Silicon (MLX) required. Model pinned to an immutable revision; see REPRODUCE.md for the full pipeline and exact commands.

What this is, and what it isn't

Is: a measured recipe and a public evaluation battery, on one 3B model, with the boundaries drawn honestly. Something you can run on your own small model.
Isn't: a state-of-the-art claim, a guarantee, or a result that travels further than the tables say. The confirmatory training results live inside a synthetic generator. The cascade is confirmed on GSM8K. The cross-domain increment did not confirm, and the page says so plainly.

Limitations

One 3B model (SmolLM3); confirmatory reporter results live in a synthetic MCQ generator; small n with bootstrap CIs throughout; substantial train-seed heterogeneity (reported per-seed); capability preservation is established only for the reporter-only pipeline (the adapter never alters the answer). GSM8K prompts are Portuguese, scored by a tolerant scorer validated against blind annotation.
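
For readers unfamiliar with bootstrap confidence intervals, a percentile bootstrap for a mean looks roughly like this. It is a generic sketch: the resample count, the seed, and the choice of the percentile method are assumptions, and the project's own statistical procedure (documented in the lab log) may differ.

```python
import random
from typing import Sequence, Tuple


def bootstrap_ci(values: Sequence[float], n_resamples: int = 2000,
                 alpha: float = 0.05, seed: int = 0) -> Tuple[float, float]:
    """Percentile bootstrap interval for the mean of `values`."""
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    n = len(values)
    # Resample with replacement, recompute the mean each time, then sort.
    means = sorted(sum(rng.choices(values, k=n)) / n for _ in range(n_resamples))
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lo, hi


# Illustrative data: 30 correct and 20 incorrect items (accuracy 0.6).
scores = [1.0] * 30 + [0.0] * 20
print(bootstrap_ci(scores))
```

With small n the interval is wide, which is the practical meaning of the caveat above: differences between conditions that fall inside these intervals should not be read as effects.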

How this was built

I ran Bojador alone, directing two AI systems in adversarial roles: one to execute, one to audit. Every claim was fixed in advance and checked against the artifacts, and the ones that failed their gates were retracted, not buried. The full log is public, as a case study in what AI-audited research can look like.

License & citation

Code, model adapters, and artifacts: Apache 2.0. Paper text and figures: CC BY 4.0. Built on SmolLM3-3B (Apache 2.0). See CITATION.cff for citation.

Bojador — the cape that marked the edge of the known. Passing it took method, not courage alone.
