
ajdramos/bojador


Bojador

A 3B model that reports confidence you can act on, and spends inference compute only where it pays. On a laptop.

Small language models earn their keep when they run cheaply: offline, private, on hardware you already own. They also earn the least trust, because they fail with a straight face. Bojador studies two fixes for one of them, SmolLM3-3B. Each was measured under a protocol fixed before the run, then picked apart by a second AI acting as a hostile reviewer.

  1. A disagreement cascade that, on GSM8K, is non-inferior in accuracy to uniform 6-sample voting while spending about 47% of its tokens.
  2. A reporter-only adapter that makes the model's spoken confidence discriminate its own errors (AUROC ≈0.87, against ≈0.75 for the best external use of the raw signal). In-domain, replicated.
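As a rough illustration of the first item, here is a minimal sketch of a disagreement cascade. It assumes the cascade draws a small probe of samples, stops if they agree, and only otherwise spends the full voting budget. The function name, the `probe` size, and the `sample_answer` callable are hypothetical; the repository's actual logic is in src/cascade_block.py and may differ.

```python
from collections import Counter
from typing import Callable, List, Tuple


def disagreement_cascade(
    sample_answer: Callable[[], str],
    probe: int = 2,   # assumed probe size, not taken from the repository
    budget: int = 6,  # matches the 6-sample voting baseline named above
) -> Tuple[str, int]:
    """Return (answer, samples_used) for one question."""
    # Cheap stage: draw a few samples and check whether they agree.
    answers: List[str] = [sample_answer() for _ in range(probe)]
    if len(set(answers)) == 1:
        return answers[0], probe

    # Disagreement: escalate to the full budget and take a majority vote.
    while len(answers) < budget:
        answers.append(sample_answer())
    winner, _ = Counter(answers).most_common(1)[0]
    return winner, len(answers)
```

Questions where the probe agrees cost `probe` samples instead of `budget`, which is where any token saving would come from under these assumptions.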

Don't take the numbers on faith. Run them. Every figure maps to a versioned artifact with a hash.
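Checking an artifact against its recorded hash might look like the sketch below. It assumes the hashes are SHA-256 hex digests, which this page does not state; the helper names are hypothetical, and REPRODUCE.md remains the authority on the actual procedure.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large artifacts need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path: Path, expected_hex: str) -> bool:
    """True if the file's digest equals the recorded one (SHA-256 is an assumption)."""
    return sha256_of(path) == expected_hex.strip().lower()
```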

What's here

| result | status | where |
| --- | --- | --- |
| Disagreement cascade ≈ uniform voting at 47% tokens / 46% time | confirmed (pre-registered non-inferiority, GSM8K) | src/cascade_block.py |
| Joint reporter > S6-trained control in error discrimination | confirmed & replicated (synthetic generator) | src/tt4_*.py |
| Per-item teacher information contributes beyond target richness | descriptive support (10 seed-pairs) | src/tt4_mech.py, tt4_final.py |
| Brier superiority over an external Platt calibrator | not confirmed (3 measurements, all marginal) | |
| Incremental joint advantage cross-domain | not confirmed | src/tt4_final.py |
| Trained reporters transfer where external calibration degrades | descriptive | HellaSwag |
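For readers unfamiliar with the two metrics in these rows, the sketch below shows how error discrimination (AUROC) and the Brier score can be computed from per-item confidence and correctness. It is a plain-Python illustration with hypothetical function names, not the repository's evaluation code.

```python
from typing import Sequence


def auroc(confidence: Sequence[float], correct: Sequence[bool]) -> float:
    """Chance that a correct item has higher confidence than a wrong one (ties count half)."""
    pos = [c for c, ok in zip(confidence, correct) if ok]
    neg = [c for c, ok in zip(confidence, correct) if not ok]
    if not pos or not neg:
        raise ValueError("need at least one correct and one incorrect item")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def brier(confidence: Sequence[float], correct: Sequence[bool]) -> float:
    """Mean squared gap between stated confidence and the 0/1 outcome (lower is better)."""
    return sum((c - float(ok)) ** 2 for c, ok in zip(confidence, correct)) / len(confidence)
```

AUROC depends only on how items are ranked, while the Brier score also penalizes miscalibrated magnitudes, which is one reason a method can separate errors well without showing a Brier advantage.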

The full, dated lab log — including three retracted over-claims, a fixed statistical method, and a second AI system auditing every step — is in docs/JOURNAL.md.

Quickstart

python -m venv .venv && ./.venv/bin/pip install -r requirements.txt
# capability floor:
./.venv/bin/python -m src.floor_gsm8k --n 200
# the cascade (selective compute):
./.venv/bin/python -m src.cascade_block

Apple Silicon (MLX) required. Model pinned to an immutable revision; see REPRODUCE.md for the full pipeline and exact commands.

What this is, and what it isn't

  • Is: a measured recipe and a public evaluation battery, on one 3B model, with the boundaries drawn honestly. Something you can run on your own small model.
  • Isn't: a state-of-the-art claim, a guarantee, or a result that travels further than the tables say. The confirmatory training results live inside a synthetic generator. The cascade is confirmed on GSM8K. The cross-domain increment did not confirm, and the page says so plainly.

Limitations

One 3B model (SmolLM3); confirmatory reporter results live in a synthetic MCQ generator; small n with bootstrap CIs throughout; substantial train-seed heterogeneity (reported per-seed); capability preservation is established only for the reporter-only pipeline (the adapter never alters the answer). GSM8K prompts are Portuguese, scored by a tolerant scorer validated against blind annotation.
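Since the intervals here are bootstrap CIs on small n, a minimal percentile-bootstrap sketch may help when reading them. It assumes a simple percentile interval over per-item 0/1 scores; the resampling scheme, the function name, and the defaults are assumptions, not the repository's method.

```python
import random
from typing import Sequence, Tuple


def bootstrap_ci(
    values: Sequence[float],
    n_resamples: int = 2000,  # assumed default, not from the repository
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile bootstrap interval for the mean of `values`."""
    rng = random.Random(seed)
    n = len(values)
    # Resample items with replacement and record the mean of each resample.
    means = sorted(sum(rng.choices(values, k=n)) / n for _ in range(n_resamples))
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lo, hi
```

With a few hundred items the interval on an accuracy figure is several points wide, so differences smaller than that should be read as marginal.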

How this was built

I ran Bojador alone, directing two AI systems in adversarial roles: one to execute, one to audit. Every claim was fixed in advance and checked against the artifacts, and the ones that failed their gates were retracted, not buried. The full log is public, as a case study in what AI-audited research can look like.

License & citation

Code, model adapters, and artifacts: Apache 2.0. Paper text and figures: CC BY 4.0. Built on SmolLM3-3B (Apache 2.0). See CITATION.cff for citation.


Bojador — the cape that marked the edge of the known. Passing it took method, not courage alone.
