
elprofesoriqo/SentenceEmbeddings


TechArena Sentence Embeddings

A lightweight sentence embedding project for sentence similarity: the model pools the token-level embeddings of a sentence into a single sentence-level embedding. This repository contains:

  • A compact attention-pooling model with masked linear projection
  • Reproducible training pipelines on token-level inputs and raw text
  • A distillation pipeline from a transformer teacher to the lightweight student
  • Versioned runs with configs, metrics, and plots
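
To make the first bullet concrete, here is a minimal, framework-free sketch of masked attention pooling followed by a cosine-similarity comparison. It is an illustration only and does not reproduce `model/model.py`: the function names, the single scoring vector `score_weights`, and the use of cosine similarity as the similarity measure are all assumptions made for this example.

```python
import math

def attention_pool(token_embeddings, mask, score_weights):
    """Pool token vectors into one sentence vector with masked softmax attention.

    token_embeddings: list of token vectors (lists of floats, equal length)
    mask: list of 0/1 flags, 1 for real tokens and 0 for padding
    score_weights: hypothetical learned vector used to score each token
    """
    if not any(mask):
        raise ValueError("mask must keep at least one token")
    # One scalar score per token: dot product with the scoring vector.
    scores = [sum(w * x for w, x in zip(score_weights, tok)) for tok in token_embeddings]
    # Padding positions get -inf so they end up with zero attention weight.
    scores = [s if m else -math.inf for s, m in zip(scores, mask)]
    # Numerically stable softmax over the scores.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Weighted sum of token vectors, one output value per dimension.
    dim = len(token_embeddings[0])
    return [sum(w * tok[d] for w, tok in zip(weights, token_embeddings)) for d in range(dim)]

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Toy input: two real tokens and one padding token that must be ignored.
tokens = [[1.0, 0.0], [0.0, 1.0], [9.0, 9.0]]
pooled = attention_pool(tokens, mask=[1, 1, 0], score_weights=[0.0, 0.0])
print(pooled)  # [0.5, 0.5]
```

With a zero scoring vector the two real tokens receive equal weight, so the result is their mean; a trained scoring vector would shift weight toward more informative tokens.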

Team

  • @dec0dedd
  • @tomasz-kielbasa
  • @elprofesoriqo

Project structure

.
├─ model/
│  └─ model.py                   # The model architecture (get_model)
├─ solution1/                    # Artifacts (weights, stats, notes)
├─ plots/                        # Generated plots (kept for reference)
├─ data/                         # Dataset packs and labels
├─ train.py                      # Training on token-level embeddings
├─ train_on_text.py              # Training on raw text with HF models
├─ distill_pipeline.py           # Teacher → student distillation
├─ vis.py                        # Visualization of stats.json files
├─ requirements.txt
└─ TaskDescription/              # Original challenge docs

Installation

python -m venv .venv
.\.venv\Scripts\pip install --upgrade pip
.\.venv\Scripts\pip install -r requirements.txt

Quickstart

  1. Train on token-level inputs
python train.py --model-file model/model.py --model-factory-name get_model \
  --lr 5e-5 --scheduler plateau --epochs 2500 --ema-decay 0.9997 \
  --model-path solution1/best_model.bin --stats-file solution1/stats.json

Train Solution 2 (same model; change the output paths and, if desired, the hyperparameters)

python train.py --model-file model/model.py --model-factory-name get_model \
  --lr 5e-5 --scheduler plateau --epochs 2500 --ema-decay 0.9997 \
  --model-path solution2/best_model.bin --stats-file solution2/stats.json

Train Solution 3 (example of a variant with a heavy knowledge-distillation weight)

python train.py --model-file model/model.py --model-factory-name get_model \
  --pairs-file TechArena_FormalDataset_EN_TRAIN.dat --teacher-weights best_text_model.bin \
  --lr 5e-5 --scheduler plateau --epochs 2500 --ema-decay 0.9998 --kd-weight 50.0 \
  --model-path solution3/best_model.bin --stats-file solution3/stats.json
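
The `--ema-decay` flag in the commands above suggests that an exponential moving average (EMA) of the model weights is kept during training. The sketch below shows the standard EMA update rule on a toy list of weights; the function name `ema_update` is hypothetical, and whether `train.py` updates the average per step or per epoch is not stated here.

```python
def ema_update(ema_weights, weights, decay):
    """Return the updated average: decay * ema + (1 - decay) * current."""
    return [decay * e + (1.0 - decay) * w for e, w in zip(ema_weights, weights)]

# Toy run: the raw weight jumps from 0.0 to 1.0 and stays there.
# A small decay is used here only so the effect is visible in three updates.
ema = [0.0]
for _ in range(3):
    ema = ema_update(ema, [1.0], decay=0.5)
print(ema)  # [0.875]
```

As a rule of thumb, an EMA with decay d averages over roughly 1 / (1 - d) updates, so 0.9997 corresponds to about 3,300 updates and 0.9998 to about 5,000.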
  2. Train on raw text (teacher backbone)
python utils/train_on_text.py --data-dir data --pairs-file TechArena_FormalDataset_EN_TRAIN.dat \
  --pretrained-model sentence-transformers/all-mpnet-base-v2 \
  --epochs 25 --lr 3e-5 --ema-decay 0.999
  3. Distill teacher → student (token inputs)
python utils/distill_pipeline.py --data-dir data \
  --pairs-file TechArena_FormalDataset_EN_TRAIN.dat \
  --pack1 sentence_pack_1.dat --pack2 sentence_pack_2.dat \
  --pack-tokenizer google-bert/bert-base-uncased \
  --texts-file utils/my_texts.txt \
  --teacher-pretrained sentence-transformers/all-mpnet-base-v2 \
  --teacher-weights best_text_model.bin \
  --student-model-file model/model.py \
  --student-factory-name get_model \
  --epochs 30 --batch-size 256 --lr 5e-3 --onecycle \
  --save-path solution1/student_distilled.bin
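
Distillation objectives are commonly written as a task loss plus a weighted term that pulls the student's embeddings toward the teacher's, which is how a flag such as `--kd-weight` is typically used. The sketch below assumes a mean-squared-error distillation term and student and teacher embeddings of the same dimension; both are assumptions for illustration, and the actual loss in `distill_pipeline.py` may differ.

```python
def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def combined_loss(task_loss, student_emb, teacher_emb, kd_weight):
    """Task loss plus a weighted distillation term (hypothetical formulation)."""
    return task_loss + kd_weight * mse(student_emb, teacher_emb)

# Toy values: the student is off by 1.0 in one of two dimensions, so MSE = 0.5.
loss = combined_loss(task_loss=0.25, student_emb=[1.0, 0.0],
                     teacher_emb=[0.0, 0.0], kd_weight=2.0)
print(loss)  # 1.25
```

A large weight such as 50.0 makes the distillation term dominate, so the student is trained mostly to imitate the teacher rather than to fit the pair labels directly.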
  4. Visualize results
python vis.py --files solution1/stats.json solution2/stats.json solution3/stats.json

About

Lightweight model that converts token-level embeddings of a sentence into a single, combined sentence-level embedding.
