parity(slurm): implement burst-buffer staging subsystem (BurstBuffer* reasons) #309

@yansun1996

Summary

The BurstBufferResources and BurstBufferStageIn pending reasons exist as vocabulary (landed in #301) but nothing produces them, because Spur has no burst-buffer subsystem. Today, --bb/burst-buffer is handled only by script-wrapping in the node agent: crates/spurd/src/executor.rs::wrap_with_burst_buffer prepends stage-in/out shell commands to the job script. There is no resource pool, no staging state machine, and no scheduler awareness.

This is a self-contained subsystem spanning the scheduler, the node agent, and storage. It warrants its own epic and is likely to be deferred relative to the rest of the Category-4 lifecycle work.

Gap / work

A real burst-buffer implementation (à la Slurm's burst_buffer plugin) needs:

  • Resource pool: track burst-buffer capacity cluster-wide (config + accounting), allocate/free per job.
  • Stage-in / stage-out state machine: a job requesting BB enters a staging phase before/after running; the controller models the staging lifecycle.
  • Scheduler hold-until-staged: jobs wait in PENDING with Reason=BurstBuffer* while capacity is unavailable (BurstBufferResources) or data is staging in (BurstBufferStageIn), and are not dispatched until staging completes.
  • Agent-side data movement: spurd performs the actual stage-in/out (beyond the current script-wrapping passthrough), reporting staging progress/completion to the controller.
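
As a rough illustration of the resource-pool item above, here is a minimal sketch of capacity accounting that yields BurstBufferResources when a request does not fit. Everything except the two reason names is an assumption: `BurstBufferPool`, `PendingReason`, byte-based accounting, and `u64` job ids are hypothetical and do not reflect existing Spur types.

```rust
use std::collections::HashMap;

/// The two reasons named in this issue. The enum itself is a hypothetical
/// stand-in for whatever type #301 actually introduced.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum PendingReason {
    BurstBufferResources,
    BurstBufferStageIn,
}

/// Hypothetical cluster-wide burst-buffer pool, accounted in bytes.
#[derive(Debug)]
pub struct BurstBufferPool {
    capacity_bytes: u64,
    /// job id -> bytes currently reserved for that job
    allocations: HashMap<u64, u64>,
}

impl BurstBufferPool {
    pub fn new(capacity_bytes: u64) -> Self {
        Self { capacity_bytes, allocations: HashMap::new() }
    }

    pub fn used_bytes(&self) -> u64 {
        self.allocations.values().sum()
    }

    pub fn free_bytes(&self) -> u64 {
        // Cannot underflow: allocate() only inserts requests that fit.
        self.capacity_bytes - self.used_bytes()
    }

    /// Reserve `bytes` for `job_id`. On failure the job should stay PENDING
    /// with the returned reason. Repeated calls for the same job are no-ops,
    /// so a scheduler pass can retry safely.
    pub fn allocate(&mut self, job_id: u64, bytes: u64) -> Result<(), PendingReason> {
        if self.allocations.contains_key(&job_id) {
            return Ok(());
        }
        if bytes > self.free_bytes() {
            return Err(PendingReason::BurstBufferResources);
        }
        self.allocations.insert(job_id, bytes);
        Ok(())
    }

    /// Free a job's reservation (for example after stage-out) and return the
    /// number of bytes released; 0 if the job held nothing.
    pub fn release(&mut self, job_id: u64) -> u64 {
        self.allocations.remove(&job_id).unwrap_or(0)
    }
}
```

In this sketch, a job whose `allocate` call succeeds would move on to stage-in (and report BurstBufferStageIn), while a failed call leaves it queued until another job's `release` frees enough capacity.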

Acceptance Criteria

  • Burst-buffer capacity is modeled and consumed per job
  • A job blocked on BB capacity shows Reason=BurstBufferResources; one staging in shows Reason=BurstBufferStageIn
  • Jobs are not dispatched until stage-in completes; stage-out runs after the job
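  • Tests exercise the real staging path (capacity allocation, stage-in, dispatch, stage-out) rather than only the script-wrapping passthrough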
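
One way to read the criteria above is as a small state machine that the controller advances on events from the scheduler and the agent. The sketch below is only an assumption about how that could look: `StagingState`, `StagingEvent`, and the method names are hypothetical, and only the two reason strings come from this issue.

```rust
/// Hypothetical staging lifecycle for a job that requested burst buffer.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum StagingState {
    WaitingForCapacity,
    StagingIn,
    Staged,
    Running,
    StagingOut,
    Done,
}

/// Hypothetical events that advance the lifecycle.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum StagingEvent {
    CapacityAllocated,
    StageInComplete,
    Dispatched,
    JobFinished,
    StageOutComplete,
}

impl StagingState {
    /// Apply an event. `None` means the transition is illegal in this state,
    /// e.g. trying to dispatch a job whose stage-in has not finished.
    pub fn next(self, event: StagingEvent) -> Option<StagingState> {
        use StagingEvent::*;
        use StagingState::*;
        match (self, event) {
            (WaitingForCapacity, CapacityAllocated) => Some(StagingIn),
            (StagingIn, StageInComplete) => Some(Staged),
            (Staged, Dispatched) => Some(Running),
            (Running, JobFinished) => Some(StagingOut),
            (StagingOut, StageOutComplete) => Some(Done),
            _ => None,
        }
    }

    /// Reason string to show while the job is PENDING, if any.
    pub fn pending_reason(self) -> Option<&'static str> {
        match self {
            StagingState::WaitingForCapacity => Some("BurstBufferResources"),
            StagingState::StagingIn => Some("BurstBufferStageIn"),
            _ => None,
        }
    }

    /// The scheduler may dispatch only once stage-in has completed.
    pub fn can_dispatch(self) -> bool {
        self == StagingState::Staged
    }
}
```

Keeping the transition table in one `match` makes the "not dispatched until stage-in completes" rule a property of the type rather than of scattered scheduler checks; a test of the real path could then assert on the sequence of states and reasons a job passes through.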

Notes

  • Effort is large (resource pool + state machine + scheduler integration + agent data movement); realistically a multi-PR epic.
  • Likely deferred until prioritized — flagged here so the two vocabulary-only reasons have a tracked home.
