Description
Title & Overview
Template: Machine Translation (Intro): An Intermediate, End-to-End Analysis Tutorial
Overview (≤2 sentences): Learners will build introductory machine translation pipelines, starting with a sequence-to-sequence model with attention and comparing it to pre-trained transformer MT models. It is intermediate because it highlights alignment, evaluation metrics beyond accuracy, and reproducibility across languages.
Purpose
The value-add is introducing learners to machine translation systems with defensible baselines (seq2seq attention) and modern pre-trained MT models. This stresses evaluation with BLEU/METEOR/chrF, robustness to sentence length and domain shifts, and reproducible training and inference pipelines.
Prerequisites
- Skills: Python, Git, pandas, ML basics.
- NLP: tokenization, embeddings, sequence-to-sequence models, evaluation metrics (BLEU, METEOR).
- Tooling: pandas, scikit-learn, Hugging Face Transformers, MLflow, FastAPI.
Setup Instructions
- Environment: Conda/Poetry (Python 3.11), deterministic seeds.
- Install: pandas, scikit-learn, Hugging Face Transformers + Datasets, sacreBLEU, MLflow, FastAPI.
- Datasets:
  - Small: Tatoeba (English ↔ French pairs).
  - Medium: WMT14 English–German (subset).
- Repo layout:

```
tutorials/t10-machine-translation/
├─ notebooks/
├─ src/
│  ├─ seq2seq.py
│  ├─ transformer_mt.py
│  ├─ eval.py
│  └─ config.yaml
├─ data/README.md
├─ reports/
└─ tests/
```
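The "deterministic seeds" requirement above can be illustrated without any ML tooling. A minimal sketch of a reproducible train/val/test split, assuming sentence pairs are held as (source, target) tuples (the helper name `deterministic_split` is illustrative, not part of the repo):

```python
import random

def deterministic_split(pairs, val_frac=0.1, test_frac=0.1, seed=42):
    """Shuffle with a fixed seed, then cut train/val/test deterministically."""
    pairs = list(pairs)
    random.Random(seed).shuffle(pairs)  # same seed -> same order, every run
    n = len(pairs)
    n_test = int(n * test_frac)
    n_val = int(n * val_frac)
    test = pairs[:n_test]
    val = pairs[n_test:n_test + n_val]
    train = pairs[n_test + n_val:]
    return train, val, test

pairs = [(f"src {i}", f"tgt {i}") for i in range(100)]
train, val, test = deterministic_split(pairs)
print(len(train), len(val), len(test))  # 80 10 10
```

Because the seed is fixed, re-running the split (or sharing the config) reproduces exactly the same partition, which is what makes downstream metric comparisons meaningful.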
Core Concepts
- Seq2Seq with attention: encoder–decoder RNN with alignment mechanism.
- Transformer MT: pre-trained models (MarianMT, mBART) with byte-level BPE tokenization.
- Evaluation: BLEU, METEOR, chrF; interpretability via alignment visualizations.
- Error slicing: sentence length, rare words, domain mismatch.
- Reproducibility: fixed seeds, dataset splits, config logging.
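The alignment idea behind attention can be shown in a few lines of plain Python. This is a toy dot-product attention sketch (not the tutorial's seq2seq code): the decoder state scores each encoder state, the scores are softmaxed into alignment weights, and the context vector is their weighted sum.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]  # subtract max for numerical stability
    s = sum(exps)
    return [e / s for e in exps]

def attend(decoder_state, encoder_states):
    """Dot-product attention: returns (context vector, alignment weights)."""
    scores = [sum(d * e for d, e in zip(decoder_state, enc))
              for enc in encoder_states]
    weights = softmax(scores)
    dim = len(decoder_state)
    context = [sum(w * enc[i] for w, enc in zip(weights, encoder_states))
               for i in range(dim)]
    return context, weights

# Three 2-d encoder states; the query aligns with states sharing its direction.
enc = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
ctx, w = attend([1.0, 0.0], enc)
print([round(x, 3) for x in w])  # weights peak on the aligned encoder states
```

These alignment weights are exactly what the tutorial's alignment visualizations plot: one row of weights per decoded token.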
Step-by-Step Walkthrough
- Data intake & splits: load Tatoeba and WMT subsets, deterministic train/val/test.
- Classical baseline: seq2seq model with attention (PyTorch or HF examples).
- Transformer baseline: pre-trained MarianMT (e.g., Helsinki-NLP/opus-mt-en-de).
- Evaluation: compute BLEU, METEOR, chrF; analyze performance across sentence length buckets.
- Error analysis: mistranslations, gender bias, rare word errors, OOD examples.
- Reporting: metrics tables, example translations, error categories in reports/t10-machine-translation.md.
- (Optional) Serve: FastAPI endpoint for translation; schema validation (source/target language codes).
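The evaluation step relies on sacreBLEU in the actual pipeline. To demystify what a character n-gram metric measures, here is a toy chrF-style F-score in plain Python (a simplified sketch for intuition only, not a replacement for sacreBLEU's implementation):

```python
from collections import Counter

def char_ngrams(text, n):
    """Character n-grams with whitespace removed, as a multiset."""
    text = text.replace(" ", "")
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def toy_chrf(hypothesis, reference, max_n=4, beta=2.0):
    """F-beta over character n-gram precision/recall, averaged over n = 1..max_n."""
    f_scores = []
    for n in range(1, max_n + 1):
        hyp, ref = char_ngrams(hypothesis, n), char_ngrams(reference, n)
        if not hyp or not ref:
            continue
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        prec = overlap / sum(hyp.values())
        rec = overlap / sum(ref.values())
        if prec + rec == 0:
            f_scores.append(0.0)
            continue
        f_scores.append((1 + beta**2) * prec * rec / (beta**2 * prec + rec))
    return sum(f_scores) / len(f_scores) if f_scores else 0.0

print(round(toy_chrf("the cat sat", "the cat sat"), 2))  # 1.0
print(toy_chrf("the cat sat", "a dog ran") < 0.5)        # True
```

Because it works on characters rather than word n-grams, a chrF-style score partially rewards near-miss morphology (e.g. German compound or inflection variants) that BLEU scores as a complete miss.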
Hands-On Exercises
- Ablations: seq2seq w/ and w/o attention; transformer frozen vs fine-tuned.
- Robustness: noisy inputs (typos, code-switching), measure BLEU drop.
- Slice evaluation: compare performance for short vs long sentences.
- Stretch: back-translation for data augmentation.
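The robustness exercise can start from a seeded typo generator like the sketch below (a hypothetical helper; in the exercise you would translate both clean and noised inputs and compare their sacreBLEU scores):

```python
import random

def add_typos(sentence, swap_prob=0.1, seed=0):
    """Swap random adjacent character pairs to simulate typos (seeded, reproducible)."""
    rng = random.Random(seed)
    chars = list(sentence)
    i = 0
    while i < len(chars) - 1:
        # Keep word boundaries intact: never swap across a space.
        if chars[i] != " " and chars[i + 1] != " " and rng.random() < swap_prob:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
            i += 2  # skip past the swapped pair
        else:
            i += 1
    return "".join(chars)

clean = "the quick brown fox jumps over the lazy dog"
noisy = add_typos(clean, swap_prob=0.3)
print(noisy)  # same letters, some adjacent pairs swapped within words
```

Fixing the seed keeps the noised test set identical across runs, so the measured BLEU drop is attributable to the model, not to the noise sampling.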
Common Pitfalls & Troubleshooting
- Tokenization mismatch: source/target vocab drift → bad translations.
- Metrics misuse: BLEU favors n-gram overlap, not semantic adequacy.
- Long sequences: seq2seq RNNs degrade on long inputs; transformers handle them better.
- Biases: gender/region biases in MT models; analyze and document.
- OOM: large datasets; subset or batch.
Best Practices
- Always compare seq2seq vs transformer on identical splits.
- Track tokenizer artifacts, configs, and seeds in MLflow.
- Combine metrics (BLEU + chrF + qualitative review).
- Unit tests: round-trip translation (EN→DE→EN) sanity check.
- Guardrails: max input length, valid language codes in serving.
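The round-trip sanity check above can be written against any translator object exposing a `translate(text, src, tgt)` method. Here it is sketched with a tiny hard-coded stub (the real test would wrap the two MarianMT directions; `StubTranslator` is illustrative only):

```python
class StubTranslator:
    """Hypothetical stand-in for a real MT model, keyed by (src, tgt) language codes."""
    TABLES = {
        ("en", "de"): {"hello": "hallo", "world": "welt"},
        ("de", "en"): {"hallo": "hello", "welt": "world"},
    }

    def translate(self, text, src, tgt):
        table = self.TABLES[(src, tgt)]
        # Word-by-word lookup; unknown words pass through unchanged.
        return " ".join(table.get(w, w) for w in text.lower().split())

def check_round_trip(translator, sentence, src="en", tgt="de"):
    """EN->DE->EN sanity check: the round trip should recover the input."""
    forward = translator.translate(sentence, src, tgt)
    back = translator.translate(forward, tgt, src)
    assert back == sentence.lower(), f"round trip changed the sentence: {back!r}"

mt = StubTranslator()
check_round_trip(mt, "hello world")
print("round-trip check passed")
```

With real models the round trip is rarely exact, so a production version of this test would assert a minimum chrF/BLEU score against the original rather than string equality.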
Reflection & Discussion Prompts
- Why do attention mechanisms improve seq2seq MT?
- What are BLEU’s limitations for translation quality?
- How might civic datasets (e.g., multilingual services) benefit from MT?
Next Steps / Advanced Extensions
- Experiment with multilingual MT models (mBART, NLLB).
- Apply PEFT/LoRA fine-tuning for domain-specific MT.
- Explore constrained decoding (terminology preservation).
- Lightweight monitoring: drift in translation quality over time.
Glossary / Key Terms
Seq2seq, attention, transformer MT, BLEU, METEOR, chrF, byte-level BPE, alignment.
Additional Resources
- [Hugging Face Transformers](https://huggingface.co/docs/transformers)
- [Hugging Face Datasets](https://huggingface.co/datasets)
- [sacreBLEU](https://github.com/mjpost/sacrebleu)
- [MLflow](https://mlflow.org/)
- [FastAPI](https://fastapi.tiangolo.com/)
Contributors
Author(s): TBD
Reviewer(s): TBD
Maintainer(s): TBD
Date updated: 2025-09-20
Dataset licenses: Tatoeba (CC BY), WMT (varies, research use).
Issues Referenced
Epic: HfLA Text Analysis Tutorials (T0–T14).
This sub-issue: T10: Machine Translation (Intro).