News Summarizer: Pegasus Fine-Tuning

This project implements a complete pipeline for fine-tuning Google's Pegasus model on the CNN/DailyMail dataset for abstractive text summarization. It is built using the Hugging Face transformers library and PyTorch.

Features

Automated Pipeline: Handles data preprocessing, tokenization, training, and evaluation.
Metric Tracking: Tracks loss and ROUGE scores (ROUGE-1, ROUGE-2, ROUGE-L) per epoch.
Visualization: Automatically generates training/validation loss and ROUGE score plots.
Modular Design: Code is organized into reusable modules within the src/ directory.

Installation

Clone the repository:

git clone https://github.com/sandeepkumaraau/News-summarizer.git
cd News-summarizer

Install dependencies:
```
pip install -r requirements.txt
```
Note: This project requires torch, transformers, evaluate, nltk, and matplotlib.

Usage

Training

To start the fine-tuning process, simply run the main training script:

python train.py

The script will:

Download and preprocess the dataset.
Fine-tune the Pegasus model.
Evaluate using ROUGE metrics after every epoch.
Save the final model and tokenizer to outputs/models/final_model.
Save performance plots to the outputs/ directory.

Configuration

Hyperparameters and model settings are defined in src/config.py. You can modify these values to adjust the training behavior:

MODEL_NAME: The pretrained model checkpoint (default: google/pegasus-cnn_dailymail).
EPOCHS: Number of training epochs.
LEARNING_RATE: Learning rate for the Adafactor optimizer.
BATCH_SIZE: Training and validation batch sizes.
DEVICE: Automatically detects CUDA, MPS (Mac), or CPU.

Project Structure

News-summarizer/
├── outputs/               # To save trained models and plots
│   ├── models/
│   └── plots/
├── src/                   # The core logic of your application
│   ├── __init__.py
│   ├── config.py          # Hyperparameters and constants
│   ├── data.py            # Data loading, preprocessing, and collators
│   ├── model.py           # Model initialization
│   ├── trainer.py         # Training and evaluation loops
│   └── utils.py           # Helper functions (cleaning, metrics, plotting)
├── train.py               # The main entry point script
├── requirements.txt       # List of dependencies
├── .gitignore             # Files to ignore 
└── README.md

Outputs

After training is complete, check the outputs/ directory for:

Saved Model: A Hugging Face compatible model folder.
Metrics: Plots visualizing the training loss and ROUGE score progression.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

News Summarizer: Pegasus Fine-Tuning

Features

Installation

Usage

Training

Configuration

Project Structure

Outputs

About

Uh oh!

Releases

Packages

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 29 Commits
outputs/plots		outputs/plots
src		src
.gitignore		.gitignore
README.md		README.md
requirements.txt		requirements.txt
train.py		train.py

sandeepkumaraau/News-summarizer

Folders and files

Latest commit

History

Repository files navigation

News Summarizer: Pegasus Fine-Tuning

Features

Installation

Usage

Training

Configuration

Project Structure

Outputs

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages