Baseball Pitch Sequence Prediction

A professional-grade Python package for baseball pitch sequence prediction using 7 ML models, with benchmarking, ablation studies, and MLflow experiment tracking.

Overview

This project generates synthetic baseball pitch data with realistic pitcher archetypes, pitch sequence strategies, fatigue modeling, and game situation context — then trains and compares multiple models for predicting the next pitch type.

Models

Model	Type	Description
Logistic Regression	Tabular	Baseline linear classifier
Random Forest	Tabular	Ensemble of decision trees
HMM	Sequence	Hidden Markov Model (hmmlearn)
AutoGluon	Tabular	AutoML with model ensembling
LSTM	Sequence	2-layer LSTM neural network
1D-CNN	Sequence	3-layer convolutional network
Transformer	Sequence	Self-attention encoder

All models share a unified interface (fit, predict, predict_proba) and are benchmarked via k-fold cross-validation with bootstrap confidence intervals and paired statistical tests.

Installation

# From source (development)
pip install -e ".[all,dev]"

# With all optional dependencies (AutoGluon + hmmlearn)
pip install pitch-sequencing[all]

# After install, generate training data:
pitch-generate --output-dir ./data

Quick Start

# Set up environment
python -m venv venv
source venv/bin/activate
make install            # pip install -e ".[all,dev]"

# Generate synthetic data
make data               # or: pitch-generate

# Train a single model
make train MODEL=lstm   # or: pitch-train --model lstm

# Run full benchmark (all 7 models, 5-fold CV)
make benchmark          # or: pitch-benchmark

# Run ablation studies
make ablation           # or: pitch-ablation --type feature --model lstm

# Launch MLflow UI
make mlflow             # opens at http://localhost:5000

# Run tests
make test

CLI Commands

After installation, these commands are available on your PATH:

Command	Description
`pitch-generate`	Generate synthetic pitch datasets
`pitch-train --model <name>`	Train a single model
`pitch-benchmark`	Run full benchmark suite
`pitch-ablation --type <type>`	Run ablation studies

Project Structure

├── pyproject.toml              # PEP 621 packaging + CLI entry points
├── Makefile                    # Common commands
├── src/pitch_sequencing/       # Main package
│   ├── __init__.py             # Public API (get_model, load_pitch_data, etc.)
│   ├── cli.py                  # CLI entry points (pitch-generate, etc.)
│   ├── config.py               # Config loading + dataclasses
│   ├── paths.py                # Config/data path resolution
│   ├── configs/                # Bundled YAML configs (ship with pip install)
│   │   ├── data.yaml
│   │   ├── benchmark.yaml
│   │   ├── ablation.yaml
│   │   └── models/             # Per-model hyperparameters
│   ├── data/                   # Data loading, preprocessing, simulation
│   ├── models/                 # All 7 model implementations
│   └── evaluation/             # Metrics, benchmarking, ablation, visualization
├── scripts/                    # Thin CLI wrappers (for make targets)
├── configs/                    # Dev-time config copies (mirrored in package)
├── notebooks/                  # Original Jupyter notebooks
├── data/                       # Generated datasets (not packaged)
├── experiments/                # MLflow artifacts (gitignored)
└── tests/                      # pytest test suite

Synthetic Data

The simulator generates ~384K pitch rows per run with:

Pitcher archetypes: power, finesse, slider_specialist, balanced — each with distinct pitch distributions
Sequence strategies: 8 multi-pitch patterns (e.g., FB-FB→CH, SL-SL→FB) that create learnable sequential dependencies
Count-dependent outcomes: Hit rates from 5-6% (pitcher's counts) to 19-23% (hitter's counts)
Fatigue modeling: Pitch selection degrades after archetype-specific thresholds (80-95 pitches)
Game situation: Runners on base and score differential affect pitch selection

Dataset Columns

Balls, Strikes, PitchType, Outcome, PitcherType, PitchNumber, AtBatNumber, RunnersOn, ScoreDiff, PreviousPitchType

Configuration

All settings are YAML-driven:

# configs/models/lstm.yaml
model_type: lstm
hidden_size: 64
num_layers: 2
dropout: 0.3
epochs: 20
learning_rate: 0.001
batch_size: 256

Evaluation

Metrics: Accuracy, balanced accuracy, macro precision/recall/F1, log loss, per-class metrics
Benchmarking: k-fold CV with bootstrap 95% confidence intervals
Statistical tests: Paired t-tests and Cohen's d effect sizes between models
Ablation studies: Feature importance, architecture variants, data scaling, hyperparameter sensitivity
MLflow tracking: All experiments logged with parameters, metrics, and artifacts

Notebooks

Interactive tutorials are available in notebooks/ — click to open directly in Google Colab:

Notebook	Description	Colab
Baseball Pitch Sequence Simulator	Data generation with pitcher archetypes, fatigue, and game context
HMM Pitch Predictor	Hidden Markov Model training and evaluation
LSTM Pitch Predictor	2-layer LSTM sequence model training
AutoGluon Pitch Prediction	AutoML pitch type prediction
AutoGluon Outcome Prediction	AutoML pitch outcome prediction

License

See LICENSE file.

Citation

If you use this software in your research, please cite it:

@software{hodge2026pitchsequencing,
  author       = {Hodge, John},
  title        = {Baseball Pitch Sequence Prediction},
  year         = {2026},
  url          = {https://github.com/jman4162/Baseball-Pitch-Sequence-Prediction},
  version      = {0.1.0},
  license      = {MIT}
}

Or use the CITATION.cff file for automatic citation via GitHub.

Documentation

Full documentation is available at jman4162.github.io/Baseball-Pitch-Sequence-Prediction.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Baseball Pitch Sequence Prediction

Overview

Models

Installation

Quick Start

CLI Commands

Project Structure

Synthetic Data

Dataset Columns

Configuration

Evaluation

Notebooks

License

Citation

Documentation

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 16 Commits
.github/workflows		.github/workflows
configs		configs
data		data
docs		docs
notebooks		notebooks
scripts		scripts
src/pitch_sequencing		src/pitch_sequencing
tests		tests
.gitignore		.gitignore
CITATION.cff		CITATION.cff
CLAUDE.md		CLAUDE.md
LICENSE		LICENSE
MANIFEST.in		MANIFEST.in
Makefile		Makefile
README.md		README.md
mkdocs.yml		mkdocs.yml
pyproject.toml		pyproject.toml

Folders and files

Latest commit

History

Repository files navigation

Baseball Pitch Sequence Prediction

Overview

Models

Installation

Quick Start

CLI Commands

Project Structure

Synthetic Data

Dataset Columns

Configuration

Evaluation

Notebooks

License

Citation

Documentation

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages