README
Step 1: Develop the R Workflow
Step 2: Define Dependencies
Step 3: Create the Python Runner
Step 4: Local Validation and Testing
Step 5: Create the pyproject.toml file
Step 6: Building and Pushing to ECR
Step 7: Deploy the image to the platform
- Deploy via API
- Installation on the platform (handled by Mass Dynamics)

This repository provides instructions and an example for setting up a new custom R workflow integrated into the Mass Dynamics ecosystem.

The example in this repository demonstrates how to create a workflow that modifies an input intensity dataset and returns a new transformed intensity dataset.

Step 1: Develop the R Workflow

Create an R workflow as either a package or a standalone function. Include:

- a main function that executes the workflow.
- an optional runner script in a separate file (e.g. process.R). The runner script is required for R packages.

Structure it as follows:

- If using an R package: the runner script loads the package and calls the main workflow function. See an example runner here: https://github.com/MassDynamics/MDCustomR/blob/updates-wip/src/md_custom_r/process.R
- If using a script: the runner can be the same as the main workflow function, so no separate file is required.

If building an R package, include a DESCRIPTION with version constraints for dependencies.

The example in this repository is implemented as an R package, with the workflow located at ./R/transformIntensities.R. This package depends on the limma and tidyr R packages.

When to use an R package vs an R script

Use an R package when your workflow has many functions, when you want versioning and explicit dependency management, when you need dev tools (testing, CI/CD), or when you plan to release it for others to install from GitHub.

Use an R script when the workflow is simple (few functions, one main runner), you're iterating quickly, or you don't need package tooling.

The main R function should accept an intensity table and a metadata table as inputs. Additional parameters of your choice can also be included. The input and output tables must adhere to the standard Mass Dynamics format for your chosen dataset type.

Develop an optional runner function to invoke the main workflow function and produce the output as a named list. An example of this function is provided in ./src/md_custom_r/process.R.

Output dataset types

Your workflow can produce any of the dataset types below. Every output must be a data frame, and the structure of the R output list must match the chosen type.

- INTENSITY. Required: intensity, metadata. Optional: runtime_metadata, ptm_sites
- PAIRWISE. Required: results. Optional: runtime_metadata
- ANOVA. Required: results. Optional: runtime_metadata
- ENRICHMENT. Required: results. Optional: runtime_metadata, database_metadata
- DOSE_RESPONSE. Required: output_curves, output_volcanoes, input_drc. Optional: runtime_metadata

See md_dataset models for full details.
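
As a quick pre-flight check, the required and optional names listed above can be encoded in a small lookup. The sketch below is illustrative only: OUTPUT_SPEC and check_output_names are hypothetical helpers, not part of md_dataset, and the authoritative definitions remain the md_dataset models.

```python
# Required and optional output names per dataset type, copied from the list above.
# OUTPUT_SPEC and check_output_names are hypothetical helpers, not md_dataset APIs.
OUTPUT_SPEC = {
    "INTENSITY": ({"intensity", "metadata"}, {"runtime_metadata", "ptm_sites"}),
    "PAIRWISE": ({"results"}, {"runtime_metadata"}),
    "ANOVA": ({"results"}, {"runtime_metadata"}),
    "ENRICHMENT": ({"results"}, {"runtime_metadata", "database_metadata"}),
    "DOSE_RESPONSE": (
        {"output_curves", "output_volcanoes", "input_drc"},
        {"runtime_metadata"},
    ),
}


def check_output_names(dataset_type, names):
    """Return (missing, unexpected) output names for the given dataset type."""
    required, optional = OUTPUT_SPEC[dataset_type]
    names = set(names)
    missing = sorted(required - names)
    unexpected = sorted(names - required - optional)
    return missing, unexpected


# An INTENSITY output that forgot the metadata table:
print(check_output_names("INTENSITY", ["intensity", "runtime_metadata"]))
# An ENRICHMENT output with everything in place:
print(check_output_names("ENRICHMENT", ["results", "database_metadata"]))
```

Running a check like this against the names of the list returned by the R runner can catch a misnamed element before the Docker build.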

Step 2: Define Dependencies

Create dependencies.R to list all R packages — this file is required and runs during the Docker build. For step-by-step instructions on creating DESCRIPTION and dependencies.R (whether you're starting from scratch or have an existing package), see the tutorial.

R packages

- dependencies.R: installs R packages before the R package or script. Use it for base CRAN/Bioconductor packages (e.g. BiocManager::install("limma")) and for packages not listed in DESCRIPTION (if using an R package).
- R package only: define version constraints in DESCRIPTION (e.g. limma (>= 3.42.2)).
- R package only: create install.R (e.g. devtools::install()) to install your package.

System dependencies

- dependencies.sh: optional; installs system libraries (e.g. harfbuzz, libxml2). The need for these is often discovered when the Docker build fails during R package installation.

Step 3: Create the Python Runner

Create a Python runner (process_r.py), as shown in ./src/md_custom_r/process_r.py. This file defines the form that surfaces parameters to users in the MD platform. This script uses the Mass Dynamics md_dataset Python package to prepare the R input and execute the workflow in Prefect.

The Python runner does three things:

- Define parameters: which arguments to expose to the user
- Configure the form: how they appear in the UI (follow md_form guidelines)
- Pass parameters: map form values to the R runner via RFuncArgs

Any parameter the user must specify must be defined in the form. Use the @md_r decorator with the r_file path and r_function from Step 1. In this file you prepare the dataframes and program arguments passed to the R runner.

Step 4: Local Validation and Testing

Before deploying, validate the workflow:

- Test the R workflow: run the main function with representative data
- Test via Python runner: invoke the Python entrypoint (e.g. from a Jupyter notebook) to exercise the full flow
- Test via Docker: build and run the Docker image to mimic the deployment environment

A Jupyter notebook or script that invokes the Python runner with sample data helps validate end-to-end behaviour before submission. See tutorial/test-process-r.ipynb for an example using in-memory data (no AWS or MD platform required).

NOTE: When DataFrames are passed through rpy2, small representation differences (often around the 10th decimal place) can occur. These are far smaller than any biological difference, but they can occasionally affect downstream analysis. When writing tests, use tolerance-based comparisons (np.allclose, or pandas.testing.assert_frame_equal with check_exact=False and rtol/atol) instead of exact equality (==).
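
The note above can be illustrated with a small sketch. The column names and the size of the simulated drift (1e-11) are assumptions chosen for the example, and the tolerances are placeholders to tune for your data; no rpy2 round trip is performed here.

```python
import numpy as np
import pandas as pd

# Expected output of the workflow (illustrative column names and values).
expected = pd.DataFrame(
    {"ProteinId": ["P1", "P2"], "Intensity": [10.123456789, 20.5]}
)

# Simulate the tiny drift a Python <-> R round trip might introduce.
observed = expected.copy()
observed["Intensity"] = observed["Intensity"] + 1e-11

# Exact equality fails on the drifted values...
exact_match = bool((observed["Intensity"] == expected["Intensity"]).all())

# ...while tolerance-based comparisons accept them.
close_match = bool(
    np.allclose(observed["Intensity"], expected["Intensity"], rtol=1e-7, atol=1e-9)
)
pd.testing.assert_frame_equal(
    observed, expected, check_exact=False, rtol=1e-7, atol=1e-9
)

print(f"exact: {exact_match}, within tolerance: {close_match}")
```

assert_frame_equal raises an AssertionError when frames differ beyond the tolerance, so it can be used directly inside a test function.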

Step 5: Create the pyproject.toml file

Create pyproject.toml. This file is required. It provides details about the package, including its version, dependencies, and authors. For reference, see the example pyproject.toml in this repository.

Why is there a Python package with the R code? The Mass Dynamics platform runs Python. The Python package wraps the R workflow via md_dataset and the @md_r decorator — it is required for integration.

In pyproject.toml, specify:

- This package's version
- The latest md_dataset version (unless a specific version is explicitly needed)

Step 6: Building and Pushing to ECR

What is this?

This workflow packages a custom R script into a Docker image — a self-contained bundle that includes R, all dependencies, and the package itself — and pushes it to ECR (Elastic Container Registry), which is AWS's private Docker image registry. From there, the MD platform can pull and run the image in Kubernetes.

Think of it like: build a reproducible computational environment → ship it to AWS → the platform runs it on demand.

Prerequisites

- AWS CLI installed and configured (aws configure); see the AWS CLI install guide
- Docker Desktop installed and running (download for Mac). After installing, make sure Docker Desktop is open before running any docker commands.
- The following AWS IAM permissions on your profile:
  - ecr:GetAuthorizationToken
  - ecr:CreateRepository
  - ecr:BatchCheckLayerAvailability
  - ecr:PutImage
  - ecr:InitiateLayerUpload
  - ecr:UploadLayerPart
  - ecr:CompleteLayerUpload

0. R base image (first time only)

The custom workflow image is built on an R base that provides Python + R on Amazon Linux. Pick one of the following.

A) Build the base locally

Use this if you need an unpublished base or a specific md_dataset revision.

Clone or fork md_dataset.
From that repo:

cd /path/to/your-fork-of-md_dataset

# Step 1 — Python+R base
docker build -t md_dataset_package-linux-base:latest -f base.Dockerfile --platform="linux/amd64" .

# Step 2 — R base (this is what custom R scripts use)
docker build \
  --build-arg BASE_IMAGE=md_dataset_package-linux-base:latest \
  -t md_dataset_package-linux-r-base:latest \
  -f r.base.Dockerfile \
  --platform="linux/amd64" .

These builds take a while — they compile R and install system libraries. Re-run only when the md_dataset base changes.

B) Pull from Docker Hub

Use this for the published R base from Mass Dynamics on Docker Hub (md_dataset_package_r_base).

docker pull massdynamics/md_dataset_package_r_base:latest

Use a specific image tag instead of latest if you need a fixed baseline.

1. Set your variables

export AWS_PROFILE=eb-services-cli
export AWS_REGION=ap-southeast-2
export IMAGE_NAME=<your-repo-name>   # e.g. md_impute_knn_tn
export IMAGE_TAG=<version>-1         # e.g. 0.1.8-1; bump the suffix on each new push

export AWS_ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text --profile $AWS_PROFILE)
export REGISTRY=${AWS_ACCOUNT_ID}.dkr.ecr.${AWS_REGION}.amazonaws.com

IMAGE_TAG follows the pattern <version>-<build_number>. The version comes from pyproject.toml. Increment the build number each time you push a new image for the same version. Each terminal session has its own environment — if you open a new terminal, re-run these exports before continuing.
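
The tag convention above is easy to get wrong by hand, so a small helper can bump the build number. This is a minimal sketch: make_tag and next_tag are hypothetical names, and the version string is passed in directly rather than read from pyproject.toml.

```python
def make_tag(version, build=1):
    """Compose an image tag of the form <version>-<build_number>."""
    return f"{version}-{build}"


def next_tag(tag):
    """Return the tag for the next push of the same version."""
    # Split on the last hyphen so the version part is kept untouched.
    version, _, build = tag.rpartition("-")
    if not version or not build.isdigit():
        raise ValueError(f"expected <version>-<build_number>, got {tag!r}")
    return make_tag(version, int(build) + 1)


print(make_tag("0.1.8"))    # first push of version 0.1.8
print(next_tag("0.1.8-1"))  # second push of the same version
```

When the version in pyproject.toml changes, start again from make_tag with the new version so the build number resets to 1.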

2. Create the ECR repository (first time only)

ECR is where AWS stores your Docker images. This creates a private repository for this workflow.

aws ecr create-repository \
  --repository-name $IMAGE_NAME \
  --region $AWS_REGION \
  --image-tag-mutability IMMUTABLE \
  --encryption-configuration encryptionType=AES256 \
  --profile $AWS_PROFILE

IMMUTABLE means once a tag (e.g. 0.1.8-1) is pushed it cannot be overwritten — this is intentional and good practice for reproducibility. If you need to push a fix, bump the build number (e.g. 0.1.8-2). The repository will be private by default. You can confirm this in the AWS Console under ECR → Repositories. If the repository already exists this command returns an error you can safely ignore, or check first with:

aws ecr describe-repositories --repository-names $IMAGE_NAME --profile $AWS_PROFILE
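
If you prefer to script this step, the same create-or-reuse logic can be sketched with boto3. This is an assumption: the README itself uses only the AWS CLI. ensure_repository is a hypothetical helper that takes an ECR client, mirrors the CLI flags above, and falls back to describe_repositories when the repository already exists.

```python
def ensure_repository(ecr_client, name):
    """Create the ECR repository if it is missing; return its URI either way."""
    try:
        created = ecr_client.create_repository(
            repositoryName=name,
            imageTagMutability="IMMUTABLE",
            encryptionConfiguration={"encryptionType": "AES256"},
        )
        return created["repository"]["repositoryUri"]
    except ecr_client.exceptions.RepositoryAlreadyExistsException:
        # Same outcome as ignoring the CLI error: reuse the existing repository.
        found = ecr_client.describe_repositories(repositoryNames=[name])
        return found["repositories"][0]["repositoryUri"]


# Usage (requires boto3 and valid AWS credentials):
#   import boto3
#   session = boto3.Session(profile_name="eb-services-cli",
#                           region_name="ap-southeast-2")
#   print(ensure_repository(session.client("ecr"), "md_impute_knn_tn"))
```

Passing the client in as an argument keeps the function testable with a stub and leaves profile and region selection to the caller.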

3. Build the image

This builds the Docker image for your custom R workflow on top of the R base from step 0.

If you used 0A (local build): pass the local tag you created (md_dataset_package-linux-r-base:latest). That name is only used for images you built yourself in the md_dataset repo.

If you used 0B (Docker Hub): pass the image you pulled — massdynamics/md_dataset_package_r_base — not md_dataset_package-linux-r-base. After a pull, Docker knows the image by the Hub name; linux-r-base is a different local tag that does not exist unless you built 0A (or retagged manually). Omitting --build-arg BASE_IMAGE=... is equivalent here because the Dockerfile already defaults to massdynamics/md_dataset_package_r_base.

cd /path/to/your-custom-r-repo

# After step 0A (local R base from md_dataset):
docker build \
  --build-arg BASE_IMAGE=md_dataset_package-linux-r-base:latest \
  -t $IMAGE_NAME:$IMAGE_TAG \
  -f Dockerfile \
  --platform="linux/amd64" \
  .

# After step 0B (same R base stack, published on Docker Hub):
docker build \
  --build-arg BASE_IMAGE=massdynamics/md_dataset_package_r_base:latest \
  -t $IMAGE_NAME:$IMAGE_TAG \
  -f Dockerfile \
  --platform="linux/amd64" \
  .

--platform="linux/amd64" is required, and it matters most on Apple Silicon Macs, where Docker would otherwise build an arm64 image. The image must target the Linux/amd64 architecture that runs in the cloud.

4. Authenticate Docker with ECR

Docker needs a temporary token to push to the private registry. This command fetches one from AWS and logs Docker in automatically.

aws ecr get-login-password --region $AWS_REGION --profile $AWS_PROFILE \
  | docker login --username AWS --password-stdin ${REGISTRY}

5. Tag the image for ECR

ECR requires the image name to carry the full registry URI as a prefix before it can be pushed.

docker tag $IMAGE_NAME:$IMAGE_TAG ${REGISTRY}/$IMAGE_NAME:$IMAGE_TAG

6. Push to ECR

docker push ${REGISTRY}/$IMAGE_NAME:$IMAGE_TAG

The full image URI (needed for the deploy step) will be:

<AWS_ACCOUNT_ID>.dkr.ecr.<AWS_REGION>.amazonaws.com/<IMAGE_NAME>:<IMAGE_TAG>

Notes

- Do not push a latest tag. The repository is IMMUTABLE, so latest would be locked to one digest and could not be updated. Always use versioned tags.
- To delete an accidentally pushed tag: aws ecr batch-delete-image --repository-name $IMAGE_NAME --region $AWS_REGION --profile $AWS_PROFILE --image-ids imageTag=<tag>
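
The URI format and the "no latest tag" rule can be combined in one small helper, for example when generating the IMAGE value for the deploy step. This is a sketch only: image_uri is a hypothetical function and the account ID in the example is a placeholder.

```python
def image_uri(account_id, region, image_name, image_tag):
    """Build the full ECR image URI, refusing the mutable-style 'latest' tag."""
    if image_tag == "latest":
        raise ValueError("use a versioned tag such as 0.1.8-1, not 'latest'")
    registry = f"{account_id}.dkr.ecr.{region}.amazonaws.com"
    return f"{registry}/{image_name}:{image_tag}"


# 123456789012 is a placeholder account ID.
print(image_uri("123456789012", "ap-southeast-2", "md_impute_knn_tn", "0.1.8-1"))
```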

Step 7: Deploy the image to the platform

Deploy via API

If you have access to the MD platform API, you can deploy the image to the platform via API. After building and pushing a Docker image to ECR, this step registers it with the MD platform so users can run it. The platform pulls the image, extracts the form schema from process_r.py, and makes the workflow available in the UI.

Requirements

A .env file in the repo root (git-ignored):

MD_API_BASE_URL=https://dev.massdynamics.com/api
MD_AUTH_TOKEN=<your token>

Get MD_AUTH_TOKEN from your MD account settings.

Deploy script

"""
Usage: source .venv/bin/activate && python deploy.py
Requires: .env with MD_API_BASE_URL and MD_AUTH_TOKEN
"""

import os
import time

import requests
from dotenv import load_dotenv

load_dotenv()

BASE_URL  = os.environ["MD_API_BASE_URL"].rstrip("/")
API_TOKEN = os.environ["MD_AUTH_TOKEN"]

IMAGE    = "<account_id>.dkr.ecr.<region>.amazonaws.com/<repo_name>:<version>-<build>"
JOB_NAME = "<Display Name in MD UI>"
RUN_TYPE = "INTENSITY"   # INTENSITY | PAIRWISE | ANOVA | ENRICHMENT | DOSE_RESPONSE
FLOW     = "<function name decorated with @md_r in process_r.py>"
FLOW_PKG = "<python.module.path.to.process_r>"

HEADERS = {
    "Authorization": f"Bearer {API_TOKEN}",
    "Accept": "application/vnd.md-v2+json",
    "Content-Type": "application/json",
}

payload = {
    "name": JOB_NAME,
    "run_type": RUN_TYPE,
    "public": False,
    "job_deploy_request": {
        "image": IMAGE,
        "flow_package": FLOW_PKG,
        "flow": FLOW,
    },
}

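# Register (or update) the job on the platform.
resp = requests.post(
    f"{BASE_URL}/jobs/create_or_update", headers=HEADERS, json=payload, timeout=60
)

if resp.status_code == 201:
    print("Deployed.")
elif resp.status_code == 202:
    # Accepted but not finished yet: poll the URL given in the Location header.
    url = BASE_URL.replace("/api", "") + resp.headers["Location"]
    while True:
        r = requests.get(url, headers=HEADERS, timeout=60)
        if r.status_code == 201:
            print("Deployed.")
            break
        elif r.status_code == 202:
            print("  ... deploying")
            time.sleep(5)
        else:
            # raise_for_status() only raises on 4xx/5xx; stop on anything else too.
            r.raise_for_status()
            raise RuntimeError(f"Unexpected status {r.status_code}")
else:
    resp.raise_for_status()
    raise RuntimeError(f"Unexpected status {resp.status_code}")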
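
The polling loop in the deploy script has no upper bound on how long it waits. If that is a concern, for example in CI, the loop can be factored into a helper with a deadline. The sketch below is one possible shape, not part of the repository: wait_for_deploy is a hypothetical name, and the status codes (201 done, 202 still deploying) are taken from the script above.

```python
import time


def wait_for_deploy(get_status, interval=5.0, max_wait=600.0,
                    sleep=time.sleep, clock=time.monotonic):
    """Poll get_status() until it returns 201, or fail after max_wait seconds."""
    deadline = clock() + max_wait
    while True:
        status = get_status()
        if status == 201:
            return
        if status != 202:
            raise RuntimeError(f"unexpected status {status}")
        if clock() >= deadline:
            raise TimeoutError(f"deploy still pending after {max_wait} seconds")
        sleep(interval)


# Demo with canned statuses instead of real HTTP calls.
canned = iter([202, 202, 201])
wait_for_deploy(lambda: next(canned), sleep=lambda seconds: None)
print("Deployed.")
```

In the deploy script, get_status could be a lambda that wraps the existing requests.get call and returns its status_code.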

Config fields

| Field | Description |
| --- | --- |
| `IMAGE` | Full ECR URI; update this every time you push a new image |
| `JOB_NAME` | Display name in the MD UI |
| `RUN_TYPE` | Output dataset type: `INTENSITY`, `PAIRWISE`, `ANOVA`, `ENRICHMENT`, or `DOSE_RESPONSE` |
| `FLOW` | The `@md_r`-decorated function name in `process_r.py` |
| `FLOW_PKG` | Python module path to `process_r.py` |
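
Because a typo in RUN_TYPE only surfaces once the request is sent, it can help to validate the config fields before building the payload. The sketch below assumes the payload shape used in the deploy script; build_payload and RUN_TYPES are hypothetical names, and the argument values in the example are placeholders.

```python
# Dataset types accepted for RUN_TYPE, as listed in the deploy script comment.
RUN_TYPES = {"INTENSITY", "PAIRWISE", "ANOVA", "ENRICHMENT", "DOSE_RESPONSE"}


def build_payload(image, job_name, run_type, flow, flow_package, public=False):
    """Validate the config fields and return the request payload."""
    if run_type not in RUN_TYPES:
        raise ValueError(
            f"run_type must be one of {sorted(RUN_TYPES)}, got {run_type!r}"
        )
    fields = {"image": image, "job_name": job_name,
              "flow": flow, "flow_package": flow_package}
    for field, value in fields.items():
        if not value:
            raise ValueError(f"{field} must not be empty")
    return {
        "name": job_name,
        "run_type": run_type,
        "public": public,
        "job_deploy_request": {
            "image": image,
            "flow_package": flow_package,
            "flow": flow,
        },
    }


# Placeholder values for illustration only.
payload = build_payload(
    image="123456789012.dkr.ecr.ap-southeast-2.amazonaws.com/my_workflow:0.2.0-1",
    job_name="My Workflow",
    run_type="INTENSITY",
    flow="my_flow",
    flow_package="my_package.process_r",
)
print(payload["job_deploy_request"]["image"])
```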

Image versioning

Tags follow <version>-<build>, e.g. 0.2.0-1. Bump the build number on each push for the same version. The ECR repository is created as IMMUTABLE, so you cannot overwrite an existing tag.

Any change to process_r.py (form fields, descriptions, parameters) requires a new image and a new deploy.

If you do not have API access, Mass Dynamics can install the workflow for you, as described next.

Installation on the platform (handled by Mass Dynamics)

If you do not have access to the MD platform API, you can deploy the image to the platform by contacting MD Member Success, who will coordinate with the engineering team to have your workflow installed on the platform.

Under the hood, a new workflow is registered via the script md-dataset-deploy from the MD Dataset Package. For reference, this project includes ./infra and ./scripts/deploy with example Helm configurations - these may be useful if you need to automate deployment (e.g. via CI/CD), but installation is typically done by the Mass Dynamics team.

Note: the example deployment scripts do not cover IAM or Kubernetes Service Account setup.
