Perspectival Language Models is the official codebase accompanying the paper “Pretraining Language Models for Diachronic Linguistic Change Discovery”. As part of this work, we offer a straightforward way to transfer the training pipeline to other text corpora for those interested in adapting our methods.
First, clone the repository:

```bash
git clone https://github.com/comp-int-hum/historical-perspectival-lm.git
```

To install the required dependencies and the evaluation harness, run the setup.sh script:

```bash
./setup.sh
```

Alternatively, follow these steps for a manual setup:
- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```

- Install the Evaluation Harness:

  - Clone the evaluation harness code:

    ```bash
    git clone -b historical-minimal-pairs https://github.com/sabrinaxinli/evaluation-pipeline-2024.git
    ```

  - Install its dependencies:

    ```bash
    cd evaluation-pipeline-2024
    pip install -e .
    pip install minicons
    pip install --upgrade accelerate
    cd ..
    ```
To train models on your own data, place the text files into the custom_data/ directory. For each category, include:

- data.train
- data.dev
- data.test

For example, see custom_data/song_lyrics, which also contains a preprocessing file (preprocessing.ipynb).
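Based on that bundled example, the expected layout looks like this (the song_lyrics directory name comes from the example shipped with the repo; the preprocessing notebook is optional):

```
custom_data/
└── song_lyrics/
    ├── data.train
    ├── data.dev
    ├── data.test
    └── preprocessing.ipynb
```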
Update custom.py to specify where the data is located:

```python
# Data settings
DATA = "LOAD_CUSTOM_DATA"                          # set data loading method
CUSTOM_DATA_DIRECTORY = "custom_data/song_lyrics"  # your custom data directory
```

To run pretraining (following the BabyLlama2 training recipe), set the following in custom.py:

```python
# RUN settings
RUN_PRETRAINING = True
```
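Putting these settings together, a minimal custom.py for a custom-data pretraining run might look as follows; PROJECT_NAME is taken from the replication example further down, and its value here is purely illustrative:

```python
# custom.py -- minimal sketch for a custom-data pretraining run
PROJECT_NAME = "song_lyrics"                       # illustrative project name

# Data settings
DATA = "LOAD_CUSTOM_DATA"                          # load from CUSTOM_DATA_DIRECTORY
CUSTOM_DATA_DIRECTORY = "custom_data/song_lyrics"  # holds data.train/data.dev/data.test

# RUN settings
RUN_PRETRAINING = True                             # pretrain with the BabyLlama2 recipe
```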
To change the model size or other training parameters, modify the relevant configuration files in 1_training/config and reference them in custom_pretraining.py, for example:

```python
# training configs
TRAINER_CONFIG_1 = "1_training/config/llama-smoll-345M.yaml"
TRAINER_CONFIG_2 = "1_training/config/llama-smoll-345M.yaml"
STUDENT_CONFIG = "1_training/config/llama-smoll-345M.yaml"
```

To fine-tune a model using DoRA, set the following in custom.py:

```python
# RUN settings
RUN_FINETUNING = True
```
You also need to specify the base model path and configuration, for example:

```python
DORA_LLAMA_CONFIG = "1_training/config/dora-llama8B.yaml"
MODEL_PATH = "your_model_path"
```

If you use a model other than Llama3-8B, adjust the configuration to target the correct modules; the sketch below illustrates what this means.
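The repository drives DoRA through the YAML config above, so the exact file format is defined there. Purely as an illustration of "targeting the correct modules", here is a minimal sketch (not the repo's code) using the Hugging Face PEFT library, which implements DoRA as a LoRA variant; the module names shown are the attention projections of Llama-style models and will differ for other architectures:

```python
# Minimal sketch, assuming Hugging Face PEFT: DoRA on a Llama-style model.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("your_model_path")

# DoRA is enabled on top of a LoRA config via use_dora=True.
dora_config = LoraConfig(
    r=16,           # illustrative rank
    lora_alpha=32,  # illustrative scaling factor
    use_dora=True,  # weight-decomposed low-rank adaptation
    # Llama-style attention projections; other architectures use other
    # names, e.g. GPT-2 exposes a fused "c_attn" module instead.
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)
model = get_peft_model(model, dora_config)
model.print_trainable_parameters()
```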
Follow these steps to recreate the paper’s experiments.
To run the data preparation pipeline, set the following in custom.py:

```python
DATA = "DATA_PREPARATION"
```
You will need a local copy of the Gutenberg corpus; configure its path in custom_data_preparation.py:

```python
GUTENBERG_PATH = "your_local_gutenberg_repository"
```

A quantized Llama3 70B model was used to identify work dates; the results are stored in gb_authors_dates_1950.jsonl, which is not recomputed by default. To force a complete recomputation, set the following in custom_data_preparation.py:

```python
# Model and prompt settings
USE_DATES_FILE = False
```
Alternatively, skip data preparation by directly loading the paper's training data from custom_data/historical_data:

```python
PROJECT_NAME = "historical"
DATA = "LOAD_CUSTOM_DATA"
CUSTOM_DATA_DIRECTORY = "custom_data/historical_data"
```

To train both the pretrained and finetuned models as in the paper, enable both in custom.py:

```python
RUN_PRETRAINING = True
RUN_FINETUNING = True
```

For finetuning on Llama3 8B, specify the local path in custom_finetuning.py:

```python
MODEL_PATH = "your_model_path"
```

The paper used the meta-llama/Meta-Llama-3-8B model.
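If you do not yet have a local copy, one way to fetch it is via huggingface_hub (this assumes you have accepted the Meta-Llama-3-8B license on Hugging Face and are authenticated; the target directory below is illustrative):

```python
# Sketch: download Meta-Llama-3-8B to a local directory.
from huggingface_hub import snapshot_download

local_path = snapshot_download(
    repo_id="meta-llama/Meta-Llama-3-8B",
    local_dir="models/Meta-Llama-3-8B",  # illustrative; point MODEL_PATH here
)
print(local_path)
```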
To replicate the paper's evaluation on BLiMP and the cloze task, set the following in custom.py:

```python
RUN_EVALUATION = True
EVALUATION_TASKS_LIST = ["blimp", "cloze_task_topk"]
```
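Taken together, a custom.py that reproduces the paper's full run (data loading, both training stages, and evaluation) combines the settings above; every value in this sketch is drawn from the snippets in this section:

```python
# custom.py -- consolidated sketch of the paper-replication settings
PROJECT_NAME = "historical"

# Data settings (use DATA = "DATA_PREPARATION" to rebuild from Gutenberg instead)
DATA = "LOAD_CUSTOM_DATA"
CUSTOM_DATA_DIRECTORY = "custom_data/historical_data"

# RUN settings
RUN_PRETRAINING = True
RUN_FINETUNING = True
RUN_EVALUATION = True
EVALUATION_TASKS_LIST = ["blimp", "cloze_task_topk"]
```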
Once your data is prepared, you can start the training run locally by navigating to the perspectival_language_models directory and running:

```bash
scons -Q
```

To run via Slurm instead, adjust the Slurm variables in custom.py:

```python
STEAMROLLER_ENGINE = 'slurm'
GPU_COUNT = 1
MEMORY = "64GB"
GPU_ACCOUNT = "gpu_account_name"
CPU_ACCOUNT = "cpu_account_name"
GPU_QUEUE = "gpu_queue"
CPU_QUEUE = "cpu_queue"
```

Then start the run with:

```bash
scons -Q STEAMROLLER_ENGINE=slurm
```

For questions or additional information, feel free to reach out:
- Email: elisabeth.fittschen@studium.uni-hamburg.de
- GitHub Issues: open an issue on the repository