Snakemake must only process one day's data at a time #43

@jeremyestein

Description

By default, when snakemake is re-run it re-executes any step whose upstream data file has been updated (i.e. its timestamp is newer). This means that if a data CSV from several months ago is "touched", or some of the FTPS sentinel files are deleted in a clearout, snakemake might well decide to re-process several months' worth of data, which we don't want to happen unless we really mean it!

Therefore, we need to pass an explicit date as a config parameter to snakemake, so it will only include files with that date in the filename in its processing. The date would normally be yesterday's date, since we intend to run the script daily in the early hours of the morning.
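
As a rough illustration of the date restriction described above, here is a minimal Python sketch (Python being the language Snakefiles are written in). Everything named here is an assumption for illustration: the helpers `default_processing_date`, `validate_date` and `files_for_date` are hypothetical, and the `YYYY-MM-DD` filename convention is a guess, since the actual filename format isn't stated in this issue.

```python
from datetime import date, datetime, timedelta
from pathlib import Path


def default_processing_date(today=None):
    """Return yesterday's date as YYYY-MM-DD (the normal daily-run case)."""
    today = today or date.today()
    return (today - timedelta(days=1)).isoformat()


def validate_date(value):
    """Check that value is a real calendar date and return it as YYYY-MM-DD.

    The value is coerced with str() first, in case the config layer hands
    over something other than a plain string.
    Raises ValueError for anything that is not a valid date.
    """
    return datetime.strptime(str(value), "%Y-%m-%d").date().isoformat()


def files_for_date(filenames, day):
    """Keep only the files whose name contains the given date.

    ASSUMPTION: filenames embed the date as YYYY-MM-DD.
    """
    day = validate_date(day)
    return [f for f in filenames if day in Path(f).name]
```

With something like this, a stale or touched file from another day is never selected, because only names containing the requested date are considered in the first place.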

Definition of done

There is no easy way to process more than one day of data in a single run. It must still be possible in dev, though.

Corollary: Deliberate re-processing will have to be manually invoked, so there must exist documentation for this process, and a wrapper script if needed to make it easy (but still explicit).
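
One possible shape for such a wrapper, sketched in Python. This is only a suggestion under stated assumptions: the config key `date`, the `--run` flag and the function names are all hypothetical, and the only snakemake options used are `--cores` and `--config`. The date argument is mandatory, so re-processing can never happen implicitly, and without `--run` the script only prints the command it would execute.

```python
import argparse
import shlex
import subprocess
from datetime import datetime


def build_command(day, cores=1):
    """Build the snakemake invocation for exactly one day.

    ASSUMPTION: the Snakefile reads the date from config["date"].
    Raises ValueError if day is not a valid YYYY-MM-DD date.
    """
    day = datetime.strptime(day, "%Y-%m-%d").date().isoformat()
    return ["snakemake", "--cores", str(cores), "--config", f"date={day}"]


def parse_args(argv=None):
    parser = argparse.ArgumentParser(
        description="Deliberately re-process one day's data."
    )
    # No default: the operator must always state the date explicitly.
    parser.add_argument("--date", required=True, help="day to process, YYYY-MM-DD")
    # Dry by default: only print the command unless --run is given.
    parser.add_argument("--run", action="store_true", help="actually run snakemake")
    return parser.parse_args(argv)


def main(argv=None):
    args = parse_args(argv)
    cmd = build_command(args.date)
    print(shlex.join(cmd))
    if args.run:
        subprocess.run(cmd, check=True)
    return cmd
```

Saved as a script that calls `main()` under an `if __name__ == "__main__":` guard, it could be invoked as e.g. `python reprocess.py --date 2024-02-29 --run` (the script name is made up). Re-processing several days then means several explicit invocations, which keeps it possible but never accidental.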
