Perspectival Language Models

Perspectival Language Models is the official codebase accompanying the paper “Pretraining Language Models for Diachronic Linguistic Change Discovery”. As part of this work, we offer a straightforward way to transfer the training pipeline to other text corpora for those interested in adapting our methods.

Installation

First, clone the repository:

git clone https://github.com/comp-int-hum/historical-perspectival-lm.git

Automatic Setup

To install the required dependencies and the evaluation harness, run the setup.sh script:

./setup.sh

Manual Setup

Alternatively, follow these steps for a manual setup:

  1. Install dependencies:

    pip install -r requirements.txt
  2. Install the Evaluation Harness:

    1. Clone the evaluation harness code:

      git clone -b historical-minimal-pairs https://github.com/sabrinaxinli/evaluation-pipeline-2024.git
    2. Install its dependencies:

      cd evaluation-pipeline-2024
      pip install -e .
      pip install minicons
      pip install --upgrade accelerate
      cd ..

Usage

Training On New Data (configuration)

Prepare Data

To train models on your own data, place the text files into the custom_data/ directory. For each category, include:

  • data.train
  • data.dev
  • data.test

For example, see custom_data/song_lyrics, which also contains a preprocessing file (preprocessing.ipynb).
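The preprocessing itself is corpus-specific (see the notebook above for the song-lyrics example), but a minimal split into the three expected files could be sketched as follows. The 80/10/10 ratios and the shuffling seed are illustrative assumptions, not values prescribed by the paper:

```python
from pathlib import Path
import random

def split_corpus(lines, out_dir, seed=0, ratios=(0.8, 0.1, 0.1)):
    """Shuffle corpus lines and write data.train / data.dev / data.test.

    Ratios are assumptions for illustration; adjust for your corpus.
    """
    lines = list(lines)
    random.Random(seed).shuffle(lines)
    n_train = int(len(lines) * ratios[0])
    n_dev = int(len(lines) * ratios[1])
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    splits = {
        "data.train": lines[:n_train],
        "data.dev": lines[n_train:n_train + n_dev],
        "data.test": lines[n_train + n_dev:],
    }
    for name, chunk in splits.items():
        (out / name).write_text("\n".join(chunk) + "\n", encoding="utf-8")
    return {name: len(chunk) for name, chunk in splits.items()}
```

Running this with your corpus lines and `out_dir="custom_data/your_category"` produces the three files in the layout described above.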

Update custom.py to specify where the data is located:

# Data settings
DATA = "LOAD_CUSTOM_DATA"  # set data loading method
CUSTOM_DATA_DIRECTORY = "custom_data/song_lyrics"  # your custom data directory

Pretraining

To run pretraining (according to the BabyLlama2 training recipe), set:

# RUN settings
RUN_PRETRAINING = True

in custom.py.

To change the model size or other training parameters, modify the relevant configuration files in 1_training/config and reference them in custom_pretraining.py, for example:

# training configs
TRAINER_CONFIG_1 = "1_training/config/llama-smoll-345M.yaml"
TRAINER_CONFIG_2 = "1_training/config/llama-smoll-345M.yaml"
STUDENT_CONFIG   = "1_training/config/llama-smoll-345M.yaml"

Finetuning

To fine-tune a model using DoRA, set:

# RUN settings
RUN_FINETUNING = True

in custom.py.

You also need to specify the base model path and configuration, for example:

DORA_LLAMA_CONFIG = "1_training/config/dora-llama8B.yaml"
MODEL_PATH        = "your_model_path"

If you use a base model other than Llama3 8B, adjust the configuration so it targets the correct modules for that architecture.
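As a rough sketch of what such an adjustment might look like, a DoRA configuration for a Llama-style model typically names the attention projection modules. The key names below are hypothetical and for illustration only; consult the actual 1_training/config/dora-llama8B.yaml for the keys this pipeline expects:

```yaml
# Illustrative only -- key names may differ from dora-llama8B.yaml.
use_dora: true
r: 16
lora_alpha: 32
target_modules:   # module names vary between model architectures
  - q_proj
  - k_proj
  - v_proj
  - o_proj
```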

Recreating Experiments (configuration)

Follow these steps to recreate the paper’s experiments.

Data Preparation

To run the data preparation pipeline, set:

DATA = "DATA_PREPARATION"

in custom.py.

You will need a local copy of the Gutenberg corpus; configure its path in custom_data_preparation.py:

GUTENBERG_PATH = "your_local_gutenberg_repository"

A quantized Llama3 70B model was used to identify work dates; the results are stored in gb_authors_dates_1950.jsonl and are not recomputed by default. To force a complete recomputation, set:

# Model and prompt settings
USE_DATES_FILE = False

in custom_data_preparation.py.
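If you want to inspect the precomputed dates file before deciding whether to recompute it, note that a .jsonl file stores one JSON object per line. A minimal reader is sketched below; the field names referenced in the usage comment (e.g. author or date keys) are assumptions, so check the actual file's schema:

```python
import json

def read_jsonl(path):
    """Read a JSON-lines file into a list of dicts (one object per line)."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

# e.g. records = read_jsonl("gb_authors_dates_1950.jsonl")
# then inspect records[0].keys() to see the actual schema
```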

Alternatively, skip data preparation by directly loading the paper’s training data from custom_data/historical_data:

PROJECT_NAME = "historical"
DATA = "LOAD_CUSTOM_DATA"
CUSTOM_DATA_DIRECTORY = "custom_data/historical_data"

Training

To train both the pretrained and finetuned models as in the paper, enable both in custom.py:

RUN_PRETRAINING = True
RUN_FINETUNING = True

For finetuning on Llama3 8B, specify the local path in custom_finetuning.py:

MODEL_PATH = "your_model_path"

The paper used this Llama model: meta-llama/Meta-Llama-3-8B

Evaluation

To replicate the paper’s evaluation on BLiMP and the cloze task, set:

RUN_EVALUATION = True
EVALUATION_TASKS_LIST = ["blimp", "cloze_task_topk"]

in custom.py.

Start Run

Once your data is prepared, you can start the training run locally by navigating to the perspectival_language_models directory and running:

scons -Q

To run via Slurm, adjust Slurm variables in custom.py:

STEAMROLLER_ENGINE = 'slurm'
GPU_COUNT = 1
MEMORY = "64GB"
GPU_ACCOUNT = "gpu_account_name"
CPU_ACCOUNT = "cpu_account_name"
GPU_QUEUE = "gpu_queue"
CPU_QUEUE = "cpu_queue"

Then start the run with:

scons -Q STEAMROLLER_ENGINE=slurm

Contact

For questions or additional information, feel free to reach out to the paper's authors.
