Summary
The BurstBufferResources and BurstBufferStageIn pending reasons exist as vocabulary (landed in #301) but have no producer, because Spur has no burst-buffer subsystem. Today --bb/burst-buffer is only script-wrapping in the node agent — crates/spurd/src/executor.rs::wrap_with_burst_buffer prepends stage-in/out shell commands to the job script. There is no resource pool, no staging state machine, and no scheduler awareness.
This is a self-contained subsystem spanning the scheduler, the node agent, and storage — its own epic, and a likely-deferred item relative to the rest of the Category-4 lifecycle work.
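To make the limitation concrete, here is a minimal sketch of what script wrapping amounts to. It is not the actual wrap_with_burst_buffer; the function name, signature, and paths are hypothetical and chosen only for illustration. The point is that staging becomes ordinary lines inside the job script, so it only runs once the job has already been dispatched.

```rust
/// Hypothetical illustration of script wrapping. This is NOT the real
/// `wrap_with_burst_buffer`; the name and signature are assumptions.
pub fn wrap_script_sketch(script: &str, stage_in: &[&str], stage_out: &[&str]) -> String {
    // Keep a leading shebang line first so the interpreter is unchanged.
    let (shebang, body) = match script.split_once('\n') {
        Some((first, rest)) if first.starts_with("#!") => (Some(first), rest),
        _ => (None, script),
    };

    let mut out = String::new();
    if let Some(line) = shebang {
        out.push_str(line);
        out.push('\n');
    }
    // Stage-in commands run inside the job, before the user's commands.
    for cmd in stage_in {
        out.push_str(cmd);
        out.push('\n');
    }
    out.push_str(body);
    if !body.ends_with('\n') {
        out.push('\n');
    }
    // Stage-out commands run inside the job, after the user's commands.
    for cmd in stage_out {
        out.push_str(cmd);
        out.push('\n');
    }
    out
}
```

With wrapping of this kind the controller sees one opaque script, so it has nothing from which to derive a pending reason or reserve capacity.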
Gap / work
A real burst-buffer implementation (à la Slurm's burst_buffer plugin) needs:
- Resource pool: track burst-buffer capacity cluster-wide (config + accounting), allocate/free per job.
- Stage-in / stage-out state machine: a job requesting BB enters a staging phase before/after running; the controller models the staging lifecycle.
- Scheduler hold-until-staged: jobs wait in PENDING with Reason=BurstBuffer* while capacity is unavailable (BurstBufferResources) or data is staging in (BurstBufferStageIn), and are not dispatched until staging completes.
- Agent-side data movement: spurd performs the actual stage-in/out (beyond the current script-wrapping passthrough), reporting staging progress/completion to the controller.
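The controller-side pieces above could fit together roughly as follows. This is a sketch under stated assumptions, not a proposed design: BbPool, BbState, admit, and pending_reason are hypothetical names, capacity is modeled as a single byte count, and only the two pending-reason names come from this issue.

```rust
use std::collections::HashMap;

/// The two pending reasons named in this issue.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum PendingReason {
    BurstBufferResources,
    BurstBufferStageIn,
}

/// Hypothetical staging lifecycle for a job that requested burst buffer.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum BbState {
    WaitingForCapacity,
    StagingIn,
    Staged, // stage-in finished; the job may be dispatched
    StagingOut,
    Released,
}

/// Hypothetical cluster-wide pool: fixed capacity plus per-job allocations.
pub struct BbPool {
    capacity_bytes: u64,
    allocated: HashMap<u64, u64>, // job id -> bytes held
}

impl BbPool {
    pub fn new(capacity_bytes: u64) -> Self {
        Self { capacity_bytes, allocated: HashMap::new() }
    }

    /// Bytes not currently held by any job.
    pub fn free_bytes(&self) -> u64 {
        self.capacity_bytes - self.allocated.values().sum::<u64>()
    }

    /// Reserve `bytes` for `job_id`; fails if the job already holds an
    /// allocation or the pool lacks free capacity.
    pub fn try_allocate(&mut self, job_id: u64, bytes: u64) -> bool {
        if self.allocated.contains_key(&job_id) || bytes > self.free_bytes() {
            return false;
        }
        self.allocated.insert(job_id, bytes);
        true
    }

    /// Free the job's allocation and return how many bytes it held.
    pub fn release(&mut self, job_id: u64) -> u64 {
        self.allocated.remove(&job_id).unwrap_or(0)
    }
}

/// Try to move a job into staging; it keeps waiting if capacity is short.
pub fn admit(pool: &mut BbPool, job_id: u64, bytes: u64) -> BbState {
    if pool.try_allocate(job_id, bytes) {
        BbState::StagingIn
    } else {
        BbState::WaitingForCapacity
    }
}

/// Scheduler view: `Some(reason)` keeps the job in PENDING, `None` means
/// burst buffer is not what is holding the job.
pub fn pending_reason(state: BbState) -> Option<PendingReason> {
    match state {
        BbState::WaitingForCapacity => Some(PendingReason::BurstBufferResources),
        BbState::StagingIn => Some(PendingReason::BurstBufferStageIn),
        BbState::Staged | BbState::StagingOut | BbState::Released => None,
    }
}
```

Deriving the reason from the staging state, rather than storing it separately, would keep the two from drifting apart. The transition from StagingIn to Staged is left out because it would be driven by completion reports from spurd.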
Acceptance Criteria
- A job waiting on burst-buffer capacity shows Reason=BurstBufferResources; one staging in shows Reason=BurstBufferStageIn.
Notes
- Effort is large (resource pool + state machine + scheduler integration + agent data movement); realistically a multi-PR epic.
- Likely deferred until prioritized; flagged here so the two vocabulary-only reasons have a tracked home.
Related
- #301: BurstBuffer* reason strings (vocabulary only).