Techarena Sentence Embeddings

A lightweight sentence-embedding project for scoring sentence similarity. This repository contains:

A compact attention-pooling model with masked linear projection
Reproducible training pipelines on token-level inputs and raw text
A distillation pipeline from a transformer teacher to the lightweight student
Versioned runs with configs, metrics, and plots
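
To make the first item concrete, here is a minimal NumPy sketch of attention pooling over token embeddings with padding masked out, followed by L2 normalization. It is an illustration only, not the code in model/model.py: the function name masked_attention_pool, the scoring vector score_w, the projection matrix proj_w, and the reading of "masked linear projection" as "project each token, then zero the padded positions" are all assumptions.

```python
import numpy as np

def masked_attention_pool(tokens, mask, score_w, proj_w):
    """Pool token embeddings into one unit-length vector per sentence.

    tokens:  (B, T, D) token embeddings
    mask:    (B, T), 1.0 for real tokens and 0.0 for padding
    score_w: (D,) hypothetical attention scoring vector
    proj_w:  (D, E) hypothetical linear projection
    """
    # Project every token, then zero the padded positions.
    projected = (tokens @ proj_w) * mask[..., None]            # (B, T, E)

    # One attention score per token; padding gets a very low score
    # so it receives (numerically) zero weight after the softmax.
    scores = np.where(mask > 0, tokens @ score_w, -1e9)         # (B, T)
    scores = scores - scores.max(axis=1, keepdims=True)         # stable softmax
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)

    # Attention-weighted sum over the sequence, then L2 normalization.
    pooled = (weights[..., None] * projected).sum(axis=1)       # (B, E)
    norm = np.linalg.norm(pooled, axis=1, keepdims=True)
    return pooled / np.maximum(norm, 1e-12)
```

With unit-length outputs, the similarity of two sentences reduces to the dot product of their pooled vectors.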

Team

@dec0dedd
@tomasz-kielbasa
@elprofesoriqo

Project structure

.
├─ model/
│  └─ model.py                    # The model architecture (get_model)
├─ solution1/                     # Artifacts (weights, stats, notes)
├─ solution2/
├─ solution3/
├─ plots/                         # Generated plots (kept for reference)
├─ data/                          # Dataset packs and labels
├─ utils/
│  ├─ train_on_text.py            # Training on raw text with HF models
│  └─ distill_pipeline.py         # Teacher → student distillation
├─ train.py                       # Training on token-level embeddings
├─ vis.py                         # Visualization of stats.json files
├─ requirements.txt
└─ TaskDescription/               # Original challenge docs

Installation

python -m venv .venv
.\.venv\Scripts\pip install --upgrade pip
.\.venv\Scripts\pip install -r requirements.txt

The commands above use Windows paths. On Linux or macOS, replace .\.venv\Scripts\pip with .venv/bin/pip.

Quickstart

Train on token-level inputs

python train.py --model-file model/model.py --model-factory-name get_model \
  --lr 5e-5 --scheduler plateau --epochs 2500 --ema-decay 0.9997 \
  --model-path solution1/best_model.bin --stats-file solution1/stats.json
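
The --ema-decay flag presumably keeps an exponential moving average (EMA) of the model weights during training. The sketch below shows the standard EMA update on plain lists of floats. The helper name ema_update is hypothetical, and how train.py applies the average internally is an assumption.

```python
def ema_update(ema_weights, weights, decay=0.9997):
    """Return the EMA weights after one training step.

    Each averaged value moves a fraction (1 - decay) toward the
    current weight, so a decay close to 1 changes slowly.
    """
    return [decay * e + (1.0 - decay) * w for e, w in zip(ema_weights, weights)]
```

A decay d averages over roughly 1 / (1 - d) recent steps: about 3,333 steps for 0.9997 and 5,000 steps for 0.9998.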

Train Solution 2 (same model; this example changes only the output paths, adjust hyperparameters as needed)

python train.py --model-file model/model.py --model-factory-name get_model \
  --lr 5e-5 --scheduler plateau --epochs 2500 --ema-decay 0.9997 \
  --model-path solution2/best_model.bin --stats-file solution2/stats.json

Train Solution 3 (KD-heavy variant example)

python train.py --model-file model/model.py --model-factory-name get_model \
  --pairs-file TechArena_FormalDataset_EN_TRAIN.dat --teacher-weights best_text_model.bin \
  --lr 5e-5 --scheduler plateau --epochs 2500 --ema-decay 0.9998 --kd-weight 50.0 \
  --model-path solution3/best_model.bin --stats-file solution3/stats.json
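
The --kd-weight flag suggests that a knowledge-distillation term is added to the supervised loss. One common way to combine the two is sketched below. The formula (task loss plus kd_weight times a term pulling the student's pair similarity toward the teacher's) and the names cosine and combined_loss are assumptions for illustration. The loss actually used by train.py may differ.

```python
import numpy as np

def cosine(a, b):
    """Row-wise cosine similarity between two (N, D) arrays."""
    num = (a * b).sum(axis=1)
    den = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1)
    return num / np.maximum(den, 1e-12)

def combined_loss(student_a, student_b, teacher_a, teacher_b, labels, kd_weight=50.0):
    """Supervised similarity loss plus a weighted distillation term."""
    student_sim = cosine(student_a, student_b)
    teacher_sim = cosine(teacher_a, teacher_b)
    # Task term: match the labeled similarity of each sentence pair.
    task = np.mean((student_sim - labels) ** 2)
    # KD term: match the teacher's similarity for the same pair.
    kd = np.mean((student_sim - teacher_sim) ** 2)
    return task + kd_weight * kd
```

Because the distillation term compares similarities rather than raw vectors, the teacher and student embeddings do not need to have the same dimension in this sketch.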

Train on raw text (teacher backbone)

python utils/train_on_text.py --data-dir data --pairs-file TechArena_FormalDataset_EN_TRAIN.dat \
  --pretrained-model sentence-transformers/all-mpnet-base-v2 \
  --epochs 25 --lr 3e-5 --ema-decay 0.999

Distill teacher → student (token inputs)

python utils/distill_pipeline.py --data-dir data \
  --pairs-file TechArena_FormalDataset_EN_TRAIN.dat \
  --pack1 sentence_pack_1.dat --pack2 sentence_pack_2.dat \
  --pack-tokenizer google-bert/bert-base-uncased \
  --texts-file utils/my_texts.txt \
  --teacher-pretrained sentence-transformers/all-mpnet-base-v2 \
  --teacher-weights best_text_model.bin \
  --student-model-file model/model.py \
  --student-factory-name get_model \
  --epochs 30 --batch-size 256 --lr 5e-3 --onecycle \
  --save-path solution1/student_distilled.bin

Visualize results

python vis.py --files solution1/stats.json solution2/stats.json solution3/stats.json
