Perspectival Language Models is the official codebase accompanying the paper “Pretraining Language Models for Diachronic Linguistic Change Discovery”. As part of this work, we offer a straightforward way to transfer the training pipeline to other text corpora for those interested in adapting our methods.
First, clone the repository:

```bash
git clone https://github.com/comp-int-hum/historical-perspectival-lm.git
```

To install the required dependencies and the evaluation harness, run the setup.sh script:

```bash
./setup.sh
```

Alternatively, follow these steps for a manual setup:
- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```

- Install the Evaluation Harness:

  - Clone the evaluation harness code:

    ```bash
    git clone -b historical-minimal-pairs https://github.com/sabrinaxinli/evaluation-pipeline-2024.git
    ```

  - Install its dependencies:

    ```bash
    cd evaluation-pipeline-2024
    pip install -e .
    pip install minicons
    pip install --upgrade accelerate
    cd ..
    ```
To train models on your own data, place the text files into the custom_data/ directory. For each category, include:

- data.train
- data.dev
- data.test

For example, see custom_data/song_lyrics, which also contains a preprocessing file (preprocessing.ipynb).
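Based on that bundled example, the expected layout looks like this (the song_lyrics directory name comes from the example shipped with the repo; the preprocessing notebook is optional):

```
custom_data/
└── song_lyrics/
    ├── data.train
    ├── data.dev
    ├── data.test
    └── preprocessing.ipynb
```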
Update custom.py to specify where the data is located:

```python
# Data settings
DATA = "LOAD_CUSTOM_DATA"                          # set data loading method
CUSTOM_DATA_DIRECTORY = "custom_data/song_lyrics"  # your custom data directory
```

To run pretraining (following the BabyLlama2 training recipe), set the following in custom.py:

```python
# RUN settings
RUN_PRETRAINING = True
```
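Putting these settings together, a minimal custom.py for a custom-data pretraining run might look as follows; PROJECT_NAME is taken from the replication example further down, and its value here is purely illustrative:

```python
# custom.py -- minimal sketch for a custom-data pretraining run
PROJECT_NAME = "song_lyrics"                       # illustrative project name

# Data settings
DATA = "LOAD_CUSTOM_DATA"                          # load from CUSTOM_DATA_DIRECTORY
CUSTOM_DATA_DIRECTORY = "custom_data/song_lyrics"  # holds data.train/data.dev/data.test

# RUN settings
RUN_PRETRAINING = True                             # pretrain with the BabyLlama2 recipe
```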
To change the model size or other training parameters, modify the relevant configuration files in 1_training/config and reference them in custom_pretraining.py, for example:

```python
# training configs
TRAINER_CONFIG_1 = "1_training/config/llama-smoll-345M.yaml"
TRAINER_CONFIG_2 = "1_training/config/llama-smoll-345M.yaml"
STUDENT_CONFIG = "1_training/config/llama-smoll-345M.yaml"
```

To fine-tune a model using DoRA, set the following in custom.py:

```python
# RUN settings
RUN_FINETUNING = True
```
You also need to specify the base model path and configuration, for example:

```python
DORA_LLAMA_CONFIG = "1_training/config/dora-llama8B.yaml"
MODEL_PATH = "your_model_path"
```

If you use a model other than Llama3-8B, adjust the configuration to target the correct modules; the sketch below illustrates what this means.
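The repository drives DoRA through the YAML config above, so the exact file format is defined there. Purely as an illustration of "targeting the correct modules", here is a minimal sketch (not the repo's code) using the Hugging Face PEFT library, which implements DoRA as a LoRA variant; the module names shown are the attention projections of Llama-style models and will differ for other architectures:

```python
# Minimal sketch, assuming Hugging Face PEFT: DoRA on a Llama-style model.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("your_model_path")

# DoRA is enabled on top of a LoRA config via use_dora=True.
dora_config = LoraConfig(
    r=16,           # illustrative rank
    lora_alpha=32,  # illustrative scaling factor
    use_dora=True,  # weight-decomposed low-rank adaptation
    # Llama-style attention projections; other architectures use other
    # names, e.g. GPT-2 exposes a fused "c_attn" module instead.
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)
model = get_peft_model(model, dora_config)
model.print_trainable_parameters()
```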
Follow these steps to recreate the paper’s experiments.
To run the data preparation pipeline, set the following in custom.py:

```python
DATA = "DATA_PREPARATION"
```
You will need a local copy of the Gutenberg corpus; configure its path in custom_data_preparation.py:

```python
GUTENBERG_PATH = "your_local_gutenberg_repository"
```

A quantized Llama3 70B model was used to identify work dates; the results are stored in gb_authors_dates_1950.jsonl, which is not recomputed by default. To force a complete recomputation, set the following in custom_data_preparation.py:

```python
# Model and prompt settings
USE_DATES_FILE = False
```
Alternatively, skip data preparation by directly loading the paper's training data from custom_data/historical_data:

```python
PROJECT_NAME = "historical"
DATA = "LOAD_CUSTOM_DATA"
CUSTOM_DATA_DIRECTORY = "custom_data/historical_data"
```

To train both the pretrained and finetuned models as in the paper, enable both in custom.py:

```python
RUN_PRETRAINING = True
RUN_FINETUNING = True
```

For finetuning on Llama3 8B, specify the local path in custom_finetuning.py:

```python
MODEL_PATH = "your_model_path"
```

The paper used the meta-llama/Meta-Llama-3-8B model.
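If you do not yet have a local copy, one way to fetch it is via huggingface_hub (this assumes you have accepted the Meta-Llama-3-8B license on Hugging Face and are authenticated; the target directory below is illustrative):

```python
# Sketch: download Meta-Llama-3-8B to a local directory.
from huggingface_hub import snapshot_download

local_path = snapshot_download(
    repo_id="meta-llama/Meta-Llama-3-8B",
    local_dir="models/Meta-Llama-3-8B",  # illustrative; point MODEL_PATH here
)
print(local_path)
```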
To replicate the paper's evaluation on BLiMP and the cloze task, set the following in custom.py:

```python
RUN_EVALUATION = True
EVALUATION_TASKS_LIST = ["blimp", "cloze_task_topk"]
```
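Taken together, a custom.py that reproduces the paper's full run (data loading, both training stages, and evaluation) combines the settings above; every value in this sketch is drawn from the snippets in this section:

```python
# custom.py -- consolidated sketch of the paper-replication settings
PROJECT_NAME = "historical"

# Data settings (use DATA = "DATA_PREPARATION" to rebuild from Gutenberg instead)
DATA = "LOAD_CUSTOM_DATA"
CUSTOM_DATA_DIRECTORY = "custom_data/historical_data"

# RUN settings
RUN_PRETRAINING = True
RUN_FINETUNING = True
RUN_EVALUATION = True
EVALUATION_TASKS_LIST = ["blimp", "cloze_task_topk"]
```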
Once your data is prepared, you can start the training run locally by navigating to the perspectival_language_models directory and running:

```bash
scons -Q
```

To run via Slurm instead, adjust the Slurm variables in custom.py:

```python
STEAMROLLER_ENGINE = 'slurm'
GPU_COUNT = 1
MEMORY = "64GB"
GPU_ACCOUNT = "gpu_account_name"
CPU_ACCOUNT = "cpu_account_name"
GPU_QUEUE = "gpu_queue"
CPU_QUEUE = "cpu_queue"
```

Then start the run with:

```bash
scons -Q STEAMROLLER_ENGINE=slurm
```

For questions or additional information, feel free to reach out:
- Email: elisabeth.fittschen@studium.uni-hamburg.de
- GitHub Issues: open an issue on the repository