CS244C -- Advanced Topics in Networking (Winter 2026), Stanford University
Team: Jiayu Chang, Stefan Ene, Hyeonggyu Kim, Varun Ramesh
Papers replicated:
- Gavel: Heterogeneity-Aware Cluster Scheduling Policies for Deep Learning Workloads (OSDI 2020)
- Beware of Fragmentation: Scheduling GPU-Sharing Workloads with Fragmentation Gradient Descent (ATC 2023)
GPU cluster schedulers face two complementary challenges: heterogeneity (allocating the right GPU type and quantity to each job) and fragmentation (placing jobs on servers without creating unusable GPU fragments). Gavel solves the first problem with an effective throughput abstraction and max-min fairness LP. FGD solves the second with fragmentation gradient descent for server-level placement.
This project replicates both papers independently, then integrates FGD's placement algorithm into Gavel's scheduling loop. We evaluate the combined system on both the original Philly trace (108 GPUs, 3 types) and Alibaba's production cluster trace (up to 6,200 GPUs, 12 types).
Key findings:
- Gavel replication closely matches the published Figs 9, 10, and 11 for average JCT across load levels
- FGD standalone replication reproduces the correct ordering of placement strategies (FGD < BestFit < Random)
- Cluster topology matters more than algorithm choice: uniform node sizes (Alibaba split) produce only 2-8% fragmentation regardless of placement strategy, while mixed node sizes (Cluster H) produce the expected differentiation between strategies
- The packed max-min fairness policy does not scale beyond ~50 active jobs due to O(n^2) job-pair throughput tensors
When placing a new job on a multi-GPU cluster, FGD computes a fragmentation gradient for each candidate server -- measuring how much cluster-wide fragmentation would increase if the job were placed there -- then selects the server with the lowest gradient.
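The gradient computation can be sketched with a toy single-GPU-type model. This is a hedged simplification, not the paper's implementation (which weights fragments by the workload's task-demand distribution): here a server's fragmentation is just the free GPUs that cannot serve a "typical" job.

```python
def fragmentation(free_gpus, typical_demand):
    """Toy fragmentation measure: free GPUs on a server are fragments
    when there are too few of them to serve a typical job."""
    return free_gpus if free_gpus < typical_demand else 0

def fgd_place(servers, job_demand, typical_demand=2):
    """Place a job on the server with the lowest fragmentation gradient.

    servers: list of free-GPU counts per server (mutated on placement).
    Returns the chosen server index, or None if the job fits nowhere.
    """
    best, best_delta = None, None
    for i, free in enumerate(servers):
        if free < job_demand:
            continue  # job does not fit on this server
        before = fragmentation(free, typical_demand)
        after = fragmentation(free - job_demand, typical_demand)
        delta = after - before  # the fragmentation gradient for this server
        if best_delta is None or delta < best_delta:
            best, best_delta = i, delta
    if best is not None:
        servers[best] -= job_demand
    return best
```

For example, with servers holding 3 and 2 free GPUs and a 2-GPU job, placing on the first server would strand 1 GPU (gradient +1) while the second server empties cleanly (gradient 0), so FGD picks the second server; best-fit would make the same choice here, but FGD generalizes this reasoning to arbitrary demand mixes.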
Setup: Alibaba Cluster H topology -- 1,200 nodes with 5,592 GPUs across mixed node sizes (462 nodes with 8 GPUs, 310 with 4, 228 with 2, and 200 with 1). Evaluation uses an inflation methodology in which tasks arrive one by one and never leave, measuring fragmentation as the cluster fills. We evaluate all six placement strategies from the paper: Random, DotProduct, GPUClustering, GPUPacking, BestFit, and FGD.
Results: Our replication reproduces the correct ordering across all metrics. For fragmentation rate, the ordering is FGD < BestFit < GPUPacking < GPUClustering < DotProduct < Random, matching the paper's Figure 7.
| Figure | Ours | Paper |
|---|---|---|
| Fig. 7 | ![]() | ![]() |
| Fig. 9 | ![]() | ![]() |
| Fig. 11 | ![]() | ![]() |
| Fig. 12 | ![]() | ![]() |
| Fig. 13 | ![]() | ![]() |
| Fig. 14 | ![]() | ![]() |
Gavel formulates GPU scheduling as a max-min fairness linear program. It uses an effective throughput abstraction to express how fast each job runs on each GPU type, then solves for the allocation that maximizes the minimum throughput across all jobs.
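A minimal sketch of such a max-min fairness LP, under assumptions of ours rather than the paper's exact formulation: every job needs one GPU, each job's throughput is normalized by its best single-GPU rate (Gavel normalizes by the job's equal-share allocation), and the LP is solved with SciPy's `linprog` rather than Gavel's solver.

```python
import numpy as np
from scipy.optimize import linprog

def max_min_fair_allocation(throughputs, num_gpus):
    """Simplified Gavel-style max-min fairness LP.

    throughputs: (J, G) array, T[j][g] = throughput of job j on GPU type g.
    num_gpus:    (G,) array of GPUs available per type.
    X[j][g] is the fraction of time job j runs on GPU type g; we maximize
    the minimum normalized effective throughput t across jobs.
    """
    J, G = throughputs.shape
    # Normalize each job by its best rate so jobs share a common scale.
    norm = throughputs / throughputs.max(axis=1, keepdims=True)

    n = J * G + 1               # variables: X flattened, then t
    c = np.zeros(n)
    c[-1] = -1.0                # maximize t == minimize -t

    A, b = [], []
    # t <= sum_g X[j,g] * norm[j,g]  for every job j
    for j in range(J):
        row = np.zeros(n)
        row[j * G:(j + 1) * G] = -norm[j]
        row[-1] = 1.0
        A.append(row)
        b.append(0.0)
    # each job's time fractions sum to at most 1
    for j in range(J):
        row = np.zeros(n)
        row[j * G:(j + 1) * G] = 1.0
        A.append(row)
        b.append(1.0)
    # GPU-type capacity: sum_j X[j,g] <= num_gpus[g]
    for g in range(G):
        row = np.zeros(n)
        row[g::G][:J] = 1.0     # picks X[0,g], X[1,g], ..., X[J-1,g]
        A.append(row)
        b.append(float(num_gpus[g]))

    res = linprog(c, A_ub=np.array(A), b_ub=np.array(b),
                  bounds=[(0, 1)] * (J * G) + [(0, None)])
    X = res.x[:-1].reshape(J, G)
    return X, res.x[-1]
```

With two jobs whose best GPU types differ (e.g. `[[2, 1], [1, 2]]`) and one GPU of each type, the LP assigns each job its preferred type full-time and the max-min objective reaches 1.0.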
Setup: 108-GPU heterogeneous cluster (36 V100 + 36 P100 + 36 K80) using the Microsoft Philly trace. 312 experiments across 3 policies (MMF baseline, MMF Gavel, FTF Gavel), 3 seeds, and full arrival rate sweeps covering Figures 9, 10, and 11 from the paper.
Results: Solid lines show our replication; dashed lines show the paper's reference curves. At moderate load (4.0 jobs/hour), Gavel achieves a mean JCT of 16.4 hours vs. 23.0 hours for the baseline -- a 29% improvement.
Each scheduling round proceeds in two phases: Gavel's LP determines GPU type and count allocations, then FGD selects specific servers to minimize fragmentation.
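The two-phase round can be sketched as follows. Everything here is a toy stand-in with hypothetical interfaces (`Cluster`, `first_fit`, the `lp_solver` callback): the real phase 1 is the Gavel LP in `src/scheduler`, and the real phase 2 is FGD in `src/fgd/fgd_placement.py`.

```python
def first_fit(servers, demand):
    """Toy phase-2 placer: first server with enough free GPUs.
    (The integrated system uses FGD here instead of first-fit.)"""
    for i, free in enumerate(servers):
        if free >= demand:
            servers[i] -= demand
            return i
    return None

class Cluster:
    """Toy cluster: per-GPU-type list of per-server free-GPU counts."""
    def __init__(self, servers_by_type):
        self._servers = servers_by_type

    def gpu_counts(self):
        return {t: sum(s) for t, s in self._servers.items()}

    def servers(self, gpu_type):
        return self._servers[gpu_type]

def scheduling_round(jobs, cluster, lp_solver, placer):
    """One integrated round: phase 1 assigns a (gpu_type, count) to each
    job, phase 2 picks a concrete server for each assignment."""
    placements = {}
    for job_id, gpu_type, count in lp_solver(jobs, cluster.gpu_counts()):
        server = placer(cluster.servers(gpu_type), count)
        if server is not None:
            placements[job_id] = (gpu_type, server)
    return placements
```

For instance, with `Cluster({"V100": [8, 4]})` and a phase-1 solver that assigns job 0 two V100s, `scheduling_round` returns `{0: ("V100", 0)}` and decrements the chosen server's free count.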
Our key finding is that cluster topology -- specifically, whether nodes have uniform or mixed GPU counts -- determines whether fragmentation-aware placement provides any benefit.
Setup: 6,200-GPU heterogeneous cluster with 12 GPU sub-types, uniform node sizes per type. 45 experiments (5 placement policies x 3 arrival rates x 3 seeds).
Results: Fragmentation stays at 2-8% for all placement strategies. With uniform node sizes, any job can always be packed without creating fragments -- the packing problem that FGD solves simply does not arise.
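The intuition can be checked with one line of arithmetic: once every node is packed with d-GPU jobs, each node strands `free % d` GPUs. A toy sketch (node sizes and the 2-GPU demand are illustrative, not drawn from the trace):

```python
def stranded_gpus(node_sizes, job_demand):
    """GPUs stranded after packing each node with as many
    `job_demand`-GPU jobs as fit: the per-node remainder."""
    return sum(size % job_demand for size in node_sizes)

# Uniform nodes (like the Alibaba split): 2-GPU jobs pack perfectly.
print(stranded_gpus([8, 8, 8, 8], 2))     # 0
# Mixed nodes (like Cluster H): 1-GPU nodes are dead capacity for 2-GPU jobs.
print(stranded_gpus([8, 4, 2, 1, 1], 2))  # 2
```

With uniform node sizes that are multiples of every job demand, the remainder is always zero, so no placement strategy can do better than any other; mixed node sizes make the remainder depend on where jobs land, which is exactly the degree of freedom FGD exploits.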
Setup: 5,592-GPU single-type cluster with mixed node sizes (462x8-GPU + 310x4-GPU + 228x2-GPU + 200x1-GPU). 360 experiments (2 policies x 4 placements x 15 arrival rates x 3 seeds).
Results: With mixed node sizes, the expected differentiation emerges: FGD < BestFit < Strided < Random for fragmentation rate. At moderate loads, FGD achieves 15-25% lower fragmentation than random placement.
| FIFO Policy | Max-Min Fairness |
|---|---|
| ![]() | ![]() |
- macOS or Linux (tested on macOS 14, Apple Silicon)
- Python 3.9+
```shell
git clone https://github.com/Turquoise-T/cs244c-GPU_fragment.git
cd cs244c-GPU_fragment
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements-sim.txt

# Enable pre-commit hooks (runs unit + integration tests)
git config core.hooksPath .githooks
```

```shell
# Unit tests (15 tests, ~0.1s)
cd src/scheduler/tests
python -m unittest policies_tests -v

# Integration test (deterministic JCT check, ~3s)
python -m unittest integration_test -v
```

The integration test verifies deterministic output for a fixed config (36:36:36 cluster, 50 jobs, seed=0). Expected average JCT = 57,171.41s. Any deviation indicates a regression.
```shell
cd src/scheduler
python scripts/sweeps/run_sweep_static.py \
    --throughputs-file simulation_throughputs.json \
    --cluster-spec 4:4:4 \
    --policies fifo \
    --num-total-jobs-lower-bound 10 \
    --num-total-jobs-upper-bound 10 \
    --num-data-points 1 \
    --seeds 42 \
    --log-dir /tmp/gavel_test \
    -v
```

For full-scale experiments, use Stanford's FarmShare cluster:
```shell
# Sync code
rsync -avz --exclude='.venv' --exclude='__pycache__' --exclude='results*' \
    ./ farmshare:~/gavel/

# Submit Gavel replication (312 experiments)
ssh farmshare "cd ~/gavel/experiments/gavel-replication && sbatch slurm/submit_full.sbatch"

# Submit FGD replication -- Cluster H (360 experiments)
ssh farmshare "cd ~/gavel/experiments/combined && sbatch slurm/submit_fgd_replication_cluster_h.sbatch"

# Check status
ssh farmshare "squeue -u \$USER"
```

```shell
# Gavel replication overlay (Figs 9/10/11)
cd experiments/gavel-replication
python scripts/plot_results.py

# FGD replication comparison
cd experiments/combined
python scripts/plot_fgd_replication.py
```

```text
.
├── src/
│   ├── scheduler/                  # Core scheduler code
│   │   ├── scheduler.py            # Simulation loop with telemetry + profiling
│   │   ├── policies/               # Scheduling policies (FIFO, LAS, MMF, FTF, etc.)
│   │   ├── job.py                  # Job model with migration time estimation
│   │   ├── utils.py                # Policy registry + helpers
│   │   ├── traces/                 # Philly trace data
│   │   ├── simulation_throughputs.json            # Philly cluster throughputs
│   │   ├── simulation_throughputs_cluster_h.json  # Cluster H throughputs
│   │   └── tests/                  # Unit + integration tests
│   └── fgd/                        # FGD core algorithm
│       ├── fgd_placement.py        # Fragmentation gradient descent placement
│       └── data/                   # Alibaba cluster topology + trace data
│
├── experiments/
│   ├── gavel-replication/          # Gavel paper Figs 9, 10, 11
│   │   ├── configs/                # 312 experiment configurations
│   │   ├── results/                # Per-experiment JSON results
│   │   ├── figures/                # Replication + comparison plots
│   │   ├── scripts/                # Runner, config generators, plotting
│   │   └── slurm/                  # SLURM batch scripts
│   ├── combined/                   # FGD+Gavel integrated experiments
│   │   ├── configs/                # Alibaba split, FIFO, Cluster H configs
│   │   ├── results/                # Per-config result directories
│   │   ├── figures/                # Comparison plots per config
│   │   ├── scripts/                # Plotting + config generation
│   │   ├── telemetry/              # Per-experiment JSONL telemetry
│   │   └── slurm/                  # SLURM batch scripts
│   └── fgd-standalone/             # Standalone FGD evaluation
│       ├── figures/                # Replication plots
│       └── paper_reference_curves.json
│
├── docs/                           # Design documents and plans
├── scripts/                        # Shared utilities (sync, compression)
└── requirements-sim.txt            # Python dependencies
```
- Narayanan et al., "Gavel: Heterogeneity-Aware Cluster Scheduling Policies for Deep Learning Workloads," OSDI 2020
- Weng et al., "Beware of Fragmentation: Scheduling GPU-Sharing Workloads with Fragmentation Gradient Descent," ATC 2023
- Alibaba GPU Cluster Trace (cluster-trace-gpu-v2023)
- Microsoft Philly Trace
- Original Gavel Repository