A professional-grade Python package for baseball pitch sequence prediction using 7 ML models, with benchmarking, ablation studies, and MLflow experiment tracking.
This project generates synthetic baseball pitch data with realistic pitcher archetypes, pitch sequence strategies, fatigue modeling, and game situation context — then trains and compares multiple models for predicting the next pitch type.
| Model | Type | Description |
|---|---|---|
| Logistic Regression | Tabular | Baseline linear classifier |
| Random Forest | Tabular | Ensemble of decision trees |
| HMM | Sequence | Hidden Markov Model (hmmlearn) |
| AutoGluon | Tabular | AutoML with model ensembling |
| LSTM | Sequence | 2-layer LSTM neural network |
| 1D-CNN | Sequence | 3-layer convolutional network |
| Transformer | Sequence | Self-attention encoder |
All models share a unified interface (fit, predict, predict_proba) and are benchmarked via k-fold cross-validation with bootstrap confidence intervals and paired statistical tests.
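To illustrate, a minimal sketch of what an sklearn-style shared interface could look like — the `PitchModel` protocol and `MajorityBaseline` class here are illustrative, not the package's actual classes:

```python
import numpy as np
from typing import Protocol


class PitchModel(Protocol):
    """Interface every model is assumed to implement (sketch)."""

    def fit(self, X: np.ndarray, y: np.ndarray) -> "PitchModel": ...
    def predict(self, X: np.ndarray) -> np.ndarray: ...
    def predict_proba(self, X: np.ndarray) -> np.ndarray: ...


class MajorityBaseline:
    """Toy model satisfying the interface: always predicts the most common pitch type."""

    def fit(self, X: np.ndarray, y: np.ndarray) -> "MajorityBaseline":
        classes, counts = np.unique(y, return_counts=True)
        self.classes_ = classes
        self.probs_ = counts / counts.sum()
        return self

    def predict(self, X: np.ndarray) -> np.ndarray:
        # Predict the majority class for every row
        return np.full(len(X), self.classes_[np.argmax(self.probs_)])

    def predict_proba(self, X: np.ndarray) -> np.ndarray:
        # Same class distribution for every row
        return np.tile(self.probs_, (len(X), 1))
```

Because every model exposes the same three methods, the benchmark harness can loop over models without special-casing tabular vs. sequence architectures.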
# From source (development)
pip install -e ".[all,dev]"
# With all optional dependencies (AutoGluon + hmmlearn)
pip install pitch-sequencing[all]
# After install, generate training data:
pitch-generate --output-dir ./data

# Set up environment
python -m venv venv
source venv/bin/activate
make install # pip install -e ".[all,dev]"
# Generate synthetic data
make data # or: pitch-generate
# Train a single model
make train MODEL=lstm # or: pitch-train --model lstm
# Run full benchmark (all 7 models, 5-fold CV)
make benchmark # or: pitch-benchmark
# Run ablation studies
make ablation # or: pitch-ablation --type feature --model lstm
# Launch MLflow UI
make mlflow # opens at http://localhost:5000
# Run tests
make test

After installation, these commands are available on your PATH:
| Command | Description |
|---|---|
| `pitch-generate` | Generate synthetic pitch datasets |
| `pitch-train --model <name>` | Train a single model |
| `pitch-benchmark` | Run the full benchmark suite |
| `pitch-ablation --type <type>` | Run ablation studies |
├── pyproject.toml # PEP 621 packaging + CLI entry points
├── Makefile # Common commands
├── src/pitch_sequencing/ # Main package
│ ├── __init__.py # Public API (get_model, load_pitch_data, etc.)
│ ├── cli.py # CLI entry points (pitch-generate, etc.)
│ ├── config.py # Config loading + dataclasses
│ ├── paths.py # Config/data path resolution
│ ├── configs/ # Bundled YAML configs (ship with pip install)
│ │ ├── data.yaml
│ │ ├── benchmark.yaml
│ │ ├── ablation.yaml
│ │ └── models/ # Per-model hyperparameters
│ ├── data/ # Data loading, preprocessing, simulation
│ ├── models/ # All 7 model implementations
│ └── evaluation/ # Metrics, benchmarking, ablation, visualization
├── scripts/ # Thin CLI wrappers (for make targets)
├── configs/ # Dev-time config copies (mirrored in package)
├── notebooks/ # Original Jupyter notebooks
├── data/ # Generated datasets (not packaged)
├── experiments/ # MLflow artifacts (gitignored)
└── tests/ # pytest test suite
The simulator generates ~384K pitch rows per run with:
- Pitcher archetypes: power, finesse, slider_specialist, balanced — each with distinct pitch distributions
- Sequence strategies: 8 multi-pitch patterns (e.g., FB-FB→CH, SL-SL→FB) that create learnable sequential dependencies
- Count-dependent outcomes: Hit rates from 5-6% (pitcher's counts) to 19-23% (hitter's counts)
- Fatigue modeling: Pitch selection degrades after archetype-specific thresholds (80-95 pitches)
- Game situation: Runners on base and score differential affect pitch selection
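The count-dependent and fatigue mechanics above can be sketched as follows. All rates and thresholds here are illustrative values drawn loosely from the bullets, not the simulator's actual tables:

```python
# Illustrative hit rates keyed by (balls, strikes): low in pitcher's
# counts, high in hitter's counts, per the ranges quoted above.
HIT_RATE = {(0, 2): 0.05, (1, 2): 0.06, (3, 1): 0.19, (3, 0): 0.23}
DEFAULT_HIT_RATE = 0.12  # assumed rate for neutral counts

# Archetype-specific fatigue thresholds (assumed values in the 80-95 range)
FATIGUE_THRESHOLD = {
    "power": 85,
    "finesse": 95,
    "slider_specialist": 80,
    "balanced": 90,
}


def hit_probability(balls: int, strikes: int) -> float:
    """Look up the count-dependent hit rate, falling back to a neutral rate."""
    return HIT_RATE.get((balls, strikes), DEFAULT_HIT_RATE)


def is_fatigued(archetype: str, pitch_count: int) -> bool:
    """Pitch selection degrades once the archetype's pitch-count threshold is passed."""
    return pitch_count > FATIGUE_THRESHOLD[archetype]
```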
Each generated row includes the columns: Balls, Strikes, PitchType, Outcome, PitcherType, PitchNumber, AtBatNumber, RunnersOn, ScoreDiff, PreviousPitchType
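As a quick illustration of the schema, the sequential feature PreviousPitchType could be derived by shifting PitchType within each at-bat (column names from the list above; the data and the "NONE" sentinel are invented for this sketch):

```python
import pandas as pd

rows = pd.DataFrame({
    "AtBatNumber": [1, 1, 1, 2, 2],
    "PitchNumber": [1, 2, 3, 1, 2],
    "Balls":       [0, 1, 1, 0, 0],
    "Strikes":     [0, 0, 1, 0, 1],
    "PitchType":   ["FB", "FB", "CH", "SL", "FB"],
})

# PreviousPitchType is the prior pitch within the same at-bat
# (a sentinel value for the first pitch of each at-bat)
rows["PreviousPitchType"] = (
    rows.groupby("AtBatNumber")["PitchType"].shift(1).fillna("NONE")
)
```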
All settings are YAML-driven:
# configs/models/lstm.yaml
model_type: lstm
hidden_size: 64
num_layers: 2
dropout: 0.3
epochs: 20
learning_rate: 0.001
batch_size: 256

- Metrics: Accuracy, balanced accuracy, macro precision/recall/F1, log loss, per-class metrics
- Benchmarking: k-fold CV with bootstrap 95% confidence intervals
- Statistical tests: Paired t-tests and Cohen's d effect sizes between models
- Ablation studies: Feature importance, architecture variants, data scaling, hyperparameter sensitivity
- MLflow tracking: All experiments logged with parameters, metrics, and artifacts
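The bootstrap interval and paired comparison above can be sketched as below — an illustrative implementation, not the package's exact one:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)


def bootstrap_ci(scores, n_boot=10_000, alpha=0.05):
    """Percentile-bootstrap 95% CI for the mean of per-fold scores."""
    scores = np.asarray(scores)
    resampled = rng.choice(scores, size=(n_boot, len(scores)), replace=True)
    means = resampled.mean(axis=1)
    return np.quantile(means, [alpha / 2, 1 - alpha / 2])


def paired_comparison(scores_a, scores_b):
    """Paired t-test plus Cohen's d on per-fold score differences."""
    diff = np.asarray(scores_a) - np.asarray(scores_b)
    t, p = stats.ttest_rel(scores_a, scores_b)
    d = diff.mean() / diff.std(ddof=1)  # Cohen's d for paired samples
    return t, p, d
```

Because both models are scored on the same folds, the paired test controls for fold-to-fold difficulty when deciding whether one model reliably beats another.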
Interactive tutorials are available in notebooks/ — click to open directly in Google Colab:
| Notebook | Description | Colab |
|---|---|---|
| Baseball Pitch Sequence Simulator | Data generation with pitcher archetypes, fatigue, and game context | |
| HMM Pitch Predictor | Hidden Markov Model training and evaluation | |
| LSTM Pitch Predictor | 2-layer LSTM sequence model training | |
| AutoGluon Pitch Prediction | AutoML pitch type prediction | |
| AutoGluon Outcome Prediction | AutoML pitch outcome prediction | |
MIT License. See the LICENSE file for details.
If you use this software in your research, please cite it:
@software{hodge2026pitchsequencing,
  author  = {Hodge, John},
  title   = {Baseball Pitch Sequence Prediction},
  year    = {2026},
  url     = {https://github.com/jman4162/Baseball-Pitch-Sequence-Prediction},
  version = {0.1.0},
  license = {MIT}
}

Or use the CITATION.cff file for automatic citation via GitHub.
Full documentation is available at jman4162.github.io/Baseball-Pitch-Sequence-Prediction.