Emre shipbehaves

ships ai in places where it has to behave · post-training, rl + reward design

Pinned

constitutional-cai (Public)

Constitutional AI reproduction (Bai et al. 2022) on a small open model: self-critique/revise SFT + RLAIF DPO, two-axis safety/over-refusal eval, and a failure analysis of the over-refusal regression.

Python
distributed-sft-fsdp (Public)

Genuine multi-GPU FSDP full fine-tune of a 7B model across 4x A100 (closes the scale gap), with the tied-embeddings, collective-save, and checkpoint-consolidation gotchas made concrete.

Python
grpo-gsm8k (Public)

GRPO + RLVR on GSM8K (DeepSeek-R1 / TinyZero recipe): verifiable reward, no reward/value model, with the vLLM-rollout necessity and headroom lessons made concrete.

Python
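
For readers new to the recipe named in the grpo-gsm8k description, here is a minimal sketch of the two ingredients it mentions: a verifiable reward and a group-relative advantage that needs no reward or value model. It is not taken from the repo. The function names, the "last number in the completion" answer rule, and the `eps` constant are all assumptions made for illustration.

```python
import re
from statistics import mean, pstdev


def extract_final_number(text):
    """Return the last number found in text as a float, or None (illustrative rule)."""
    matches = re.findall(r"-?\d[\d,]*(?:\.\d+)?", text)
    if not matches:
        return None
    # Strip thousands separators such as "1,000" before converting.
    return float(matches[-1].replace(",", ""))


def verifiable_reward(completion, gold_answer):
    """1.0 if the completion's final number equals the gold answer, else 0.0."""
    pred = extract_final_number(completion)
    if pred is None:
        return 0.0
    return 1.0 if abs(pred - float(gold_answer)) < 1e-6 else 0.0


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against the mean and std of its own sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical group of four sampled completions for one prompt.
group = ["... so the answer is 18.", "The total is 20", "#### 18", "I am not sure"]
rewards = [verifiable_reward(c, "18") for c in group]
advantages = group_relative_advantages(rewards)
```

In this sketch a group whose completions all receive the same reward gets an advantage of zero everywhere, so it contributes no learning signal. That may be related to the headroom point in the description, but the repo itself is the authority on what the author means.
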
regulated-evals (Public)

Reproducible, regulation-anchored Trustworthy-AI scorecards for frontier and open-weight models in regulated industries (finance first).

Python
reward-model-ppo (Public)

Classic RLHF: train a reward model (0.757 held-out), then train a policy against it with PPO, with an honest teardown of PPO's memory + instability cost vs DPO.

Python
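
As background for the reward-model step in the reward-model-ppo description, here is a minimal sketch of the pairwise (Bradley-Terry) objective commonly used to train RLHF reward models, plus a pairwise accuracy helper. It is not taken from the repo, and the function names are hypothetical. The description does not say which metric the 0.757 refers to. Pairwise accuracy appears here only as one common choice.

```python
import math


def bradley_terry_loss(r_chosen, r_rejected):
    """Pairwise loss -log(sigmoid(r_chosen - r_rejected)), computed stably."""
    d = r_chosen - r_rejected
    if d > 0:
        return math.log1p(math.exp(-d))
    # For d <= 0, rewrite log(1 + exp(-d)) as -d + log(1 + exp(d)) to avoid overflow.
    return -d + math.log1p(math.exp(d))


def pairwise_accuracy(pairs):
    """Fraction of (chosen, rejected) score pairs where chosen scores higher."""
    return sum(1 for chosen, rejected in pairs if chosen > rejected) / len(pairs)


# Hypothetical held-out scores: (score of preferred reply, score of rejected reply).
held_out = [(1.0, 0.0), (0.0, 1.0), (2.0, 1.0), (3.0, 0.0)]
accuracy = pairwise_accuracy(held_out)
mean_loss = sum(bradley_terry_loss(c, r) for c, r in held_out) / len(held_out)
```

The loss shrinks as the preferred reply's score pulls ahead of the rejected one, and it equals log 2 when the two scores tie.
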
self-reward-collapse (Public)

Does a model training on its own judgment collapse? On verifiable math: the reward gets hacked (a brevity reward halves answer length) but capability does not collapse. An honest failure analysis w…

Python