A lightweight sentence-embedding project for measuring sentence similarity. This repository contains:
- A compact attention-pooling model with masked linear projection
- Reproducible training pipelines on token-level inputs and raw text
- A distillation pipeline from a transformer teacher to the lightweight student
- Versioned runs with configs, metrics, and plots

Contributors:
- @dec0dedd
- @tomasz-kielbasa
- @elprofesoriqo
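
The "attention-pooling model with masked linear projection" listed above can be illustrated with a small sketch. This is not the code in `model/model.py`; it is a minimal NumPy version of masked attention pooling followed by a linear projection, written under the assumption that padding tokens are excluded from the softmax. The names `attention_pool`, `w_score`, and `w_proj` are hypothetical.

```python
import numpy as np

def attention_pool(tokens, mask, w_score, w_proj):
    """Pool token embeddings into a single sentence vector.

    tokens:  (seq_len, dim) token embeddings
    mask:    (seq_len,) 1 for real tokens, 0 for padding (at least one 1)
    w_score: (dim,) hypothetical scoring vector
    w_proj:  (dim, out_dim) hypothetical output projection
    """
    scores = tokens @ w_score                     # one score per token
    scores = np.where(mask > 0, scores, -np.inf)  # padding gets zero weight
    scores = scores - scores.max()                # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum()             # softmax over real tokens
    pooled = weights @ tokens                     # weighted average, shape (dim,)
    return pooled @ w_proj                        # sentence vector, shape (out_dim,)

# Toy usage: 4 token slots, the last 2 are padding.
rng = np.random.default_rng(0)
tokens = rng.normal(size=(4, 3))
mask = np.array([1, 1, 0, 0])
w_score = rng.normal(size=3)
w_proj = rng.normal(size=(3, 2))
sentence_vec = attention_pool(tokens, mask, w_score, w_proj)
```

Because masked positions receive a weight of exactly zero, changing the values stored in the padding slots does not change the resulting sentence vector.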
.
├─ model/
│ └─ model.py # The model architecture (get_model)
├─ solution1/ # Artifacts (weights, stats, notes)
├─ plots/ # Generated plots (kept for reference)
├─ data/ # Dataset packs and labels
├─ train.py # Training on token-level embeddings
├─ train_on_text.py # Training on raw text with HF models
├─ distill_pipeline.py # Teacher → student distillation
├─ vis.py # Visualization of stats.json files
├─ requirements.txt
└─ TaskDescription/ # Original challenge docs
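
The tree above notes that `model/model.py` exposes a `get_model` factory, and the training commands pass it via `--model-file` and `--model-factory-name`. How the scripts resolve these flags is not shown here; one plausible way, sketched with the standard library only, is to import the file by path and look up the factory by name. The helper name `load_factory` is hypothetical.

```python
import importlib.util
from pathlib import Path

def load_factory(model_file, factory_name):
    """Import a Python file by path and return the named callable from it."""
    path = Path(model_file)
    spec = importlib.util.spec_from_file_location(path.stem, path)
    if spec is None or spec.loader is None:
        raise ImportError(f"cannot load module from {path}")
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)  # runs the file's top-level code
    factory = getattr(module, factory_name, None)
    if not callable(factory):
        raise AttributeError(f"{path} does not define a callable {factory_name!r}")
    return factory
```

With this helper, `load_factory("model/model.py", "get_model")()` would build the model, assuming `get_model` can be called without arguments.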
python -m venv .venv
.\.venv\Scripts\pip install --upgrade pip
.\.venv\Scripts\pip install -r requirements.txt

- Train on token-level inputs
python train.py --model-file model/model.py --model-factory-name get_model \
--lr 5e-5 --scheduler plateau --epochs 2500 --ema-decay 0.9997 \
--model-path solution1/best_model.bin --stats-file solution1/stats.json

Train Solution 2 (same model, different target paths/hparams)
python train.py --model-file model/model.py --model-factory-name get_model \
--lr 5e-5 --scheduler plateau --epochs 2500 --ema-decay 0.9997 \
--model-path solution2/best_model.bin --stats-file solution2/stats.json

Train Solution 3 (KD-heavy variant example)
python train.py --model-file model/model.py --model-factory-name get_model \
--pairs-file TechArena_FormalDataset_EN_TRAIN.dat --teacher-weights best_text_model.bin \
--lr 5e-5 --scheduler plateau --epochs 2500 --ema-decay 0.9998 --kd-weight 50.0 \
--model-path solution3/best_model.bin --stats-file solution3/stats.json

- Train on raw text (teacher backbone)
python utils/train_on_text.py --data-dir data --pairs-file TechArena_FormalDataset_EN_TRAIN.dat \
--pretrained-model sentence-transformers/all-mpnet-base-v2 \
--epochs 25 --lr 3e-5 --ema-decay 0.999

- Distill teacher → student (token inputs)
python utils/distill_pipeline.py --data-dir data \
--pairs-file TechArena_FormalDataset_EN_TRAIN.dat \
--pack1 sentence_pack_1.dat --pack2 sentence_pack_2.dat \
--pack-tokenizer google-bert/bert-base-uncased \
--texts-file utils/my_texts.txt \
--teacher-pretrained sentence-transformers/all-mpnet-base-v2 \
--teacher-weights best_text_model.bin \
--student-model-file model/model.py \
--student-factory-name get_model \
--epochs 30 --batch-size 256 --lr 5e-3 --onecycle \
--save-path solution1/student_distilled.bin

- Visualize results
python vis.py --files solution1/stats.json solution2/stats.json solution3/stats.json
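
Several of the training commands above pass `--ema-decay` (for example `0.9997`). The scripts' own implementation is not shown here, but an exponential moving average of weights is conventionally updated as `shadow = decay * shadow + (1 - decay) * param` after each optimizer step. A minimal, framework-free sketch, with scalar values standing in for weight tensors and the hypothetical helper name `ema_update`:

```python
def ema_update(shadow, params, decay):
    """Blend the current params into the shadow (averaged) weights in place."""
    for name, value in params.items():
        shadow[name] = decay * shadow[name] + (1.0 - decay) * value
    return shadow

# Toy example: the raw weight sits at 1.0 while the shadow starts at 0.0.
shadow = {"w": 0.0}
for step in range(3):
    params = {"w": 1.0}  # stand-in for the weights after an optimizer step
    ema_update(shadow, params, decay=0.9)
```

As a rough rule of thumb, the average spans about `1 / (1 - decay)` steps, so a decay of `0.9997` smooths over roughly 3,300 updates.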