MassDynamics/MDCustomR

This repository provides instructions and an example for setting up a new custom R workflow integrated into the Mass Dynamics ecosystem.

The example in this repository demonstrates how to create a workflow that modifies an input intensity dataset and returns a new transformed intensity dataset.

Step 1: Develop the R Workflow

Create an R workflow as either a package or a standalone function. Include:

  • a main function that executes the workflow.
  • a runner script in a separate file (e.g. process.R); optional for standalone scripts, required for R packages.

Structure it as follows:

If building an R package, include a DESCRIPTION with version constraints for dependencies.

The example in this repository is implemented as an R package, with the workflow located at ./R/transformIntensities.R. This package depends on the limma and tidyr R packages.

When to use an R package vs an R script

Use an R package when your workflow has many functions, you want versioning and explicit dependency management, need dev tools (testing, CI/CD), or plan to release it for others to install from GitHub.

Use an R script when the workflow is simple (few functions, one main runner), you're iterating quickly, or you don't need package tooling.

The main R function should accept an intensity table and a metadata table as inputs. Additional parameters of your choice can also be included. The input and output tables must adhere to the standard Mass Dynamics format for your chosen dataset type.

Develop an optional runner function to invoke the main workflow function and produce the output as a named list. An example of this function is provided in ./src/md_custom_r/process.R.

Output dataset types

Your workflow can produce any of the following dataset types. Every output table must be a data frame, and the structure of the R output list must match the chosen type.

  • INTENSITY — Required: intensity, metadata. Optional: runtime_metadata, ptm_sites
  • PAIRWISE — Required: results. Optional: runtime_metadata
  • ANOVA — Required: results. Optional: runtime_metadata
  • ENRICHMENT — Required: results. Optional: runtime_metadata, database_metadata
  • DOSE_RESPONSE — Required: output_curves, output_volcanoes, input_drc. Optional: runtime_metadata

See md_dataset models for full details.
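
The required and optional keys listed above can be checked before an output is handed back. The following is a minimal sketch, not part of md_dataset: the helper name check_output_keys is hypothetical, it inspects key names only, and flagging unlisted keys is an assumption (the source does not say whether the platform rejects them).

```python
# Required and optional keys per dataset type, transcribed from the list above.
OUTPUT_KEYS = {
    "INTENSITY": ({"intensity", "metadata"}, {"runtime_metadata", "ptm_sites"}),
    "PAIRWISE": ({"results"}, {"runtime_metadata"}),
    "ANOVA": ({"results"}, {"runtime_metadata"}),
    "ENRICHMENT": ({"results"}, {"runtime_metadata", "database_metadata"}),
    "DOSE_RESPONSE": (
        {"output_curves", "output_volcanoes", "input_drc"},
        {"runtime_metadata"},
    ),
}


def check_output_keys(dataset_type, output):
    """Return a list of problems with the output's keys (empty if none).

    Hypothetical helper: it checks key names only, not table contents.
    """
    required, optional = OUTPUT_KEYS[dataset_type]
    keys = set(output)
    problems = [f"missing required key: {k}" for k in sorted(required - keys)]
    problems += [f"unexpected key: {k}" for k in sorted(keys - required - optional)]
    return problems


print(check_output_keys("INTENSITY", {"intensity": [], "metadata": []}))
# []
print(check_output_keys("PAIRWISE", {"result": []}))
# ['missing required key: results', 'unexpected key: result']
```

A check like this is cheap to run in a local test before building the Docker image.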

Step 2: Define Dependencies

Create dependencies.R to list all R packages — this file is required and runs during the Docker build. For step-by-step instructions on creating DESCRIPTION and dependencies.R (whether you're starting from scratch or have an existing package), see the tutorial.

R packages

  • dependencies.R: installs R packages before your own R package or script is installed. Use it for base CRAN/Bioconductor packages (e.g. BiocManager::install("limma")) and, if you are building an R package, for packages not listed in DESCRIPTION
  • R package only: define version constraints in DESCRIPTION (e.g. limma (>= 3.42.2))
  • R package only: create install.R (e.g. devtools::install()) to install your package

System dependencies

  • dependencies.sh — optional; installs system libraries (e.g. harfbuzz, libxml2). Often discovered when the Docker build fails during R package install.

Step 3: Create the Python Runner

Create a Python runner (process_r.py), as shown in ./src/md_custom_r/process_r.py. This file defines the form that surfaces parameters to users in the MD platform. This script uses the Mass Dynamics md_dataset Python package to prepare the R input and execute the workflow in Prefect.

The Python runner does three things:

  1. Define parameters — which arguments to expose to the user
  2. Configure the form — how they appear in the UI (follow md_form guidelines)
  3. Pass parameters — map form values to the R runner via RFuncArgs

Every parameter that the user needs to set must be defined in the form. Use the @md_r decorator with the r_file path and r_function from Step 1. In this file you prepare the dataframes and program arguments passed to the R runner.

Step 4: Local Validation and Testing

Before deploying, validate the workflow:

  1. Test the R workflow — run the main function with representative data
  2. Test via Python runner — invoke the Python entrypoint (e.g. from a Jupyter notebook) to exercise the full flow
  3. Test via Docker — build and run the Docker image to mimic the deployment environment

A Jupyter notebook or script that invokes the Python runner with sample data helps validate end-to-end behaviour before submission. See tutorial/test-process-r.ipynb for an example using in-memory data (no AWS or MD platform required).

NOTE: When DataFrames are passed through rpy2, small representation differences (often around the 10th decimal) can occur. These are far smaller than any biological difference, but they could affect downstream analysis at times. When writing tests, use tolerance‑based comparisons (np.allclose, pandas.testing.assert_frame_equal with check_exact=False, rtol/atol) instead of exact equality (==).
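
The difference between exact and tolerance-based comparison can be seen in a small sketch. The column names ProteinId and Intensity are hypothetical, the 1e-10 offset only simulates the kind of drift described in the note (no rpy2 call is made here), and the rtol/atol values are illustrative rather than recommended.

```python
import numpy as np
import pandas as pd
from pandas.testing import assert_frame_equal

# Hypothetical expected table; column names are illustrative only.
expected = pd.DataFrame({"ProteinId": ["P1", "P2"], "Intensity": [0.1 + 0.2, 1.5]})

# Simulate a round trip that perturbs values around the 10th decimal place.
actual = expected.copy()
actual["Intensity"] = actual["Intensity"] + 1e-10

# Exact equality fails on the perturbed values.
exact_match = bool((expected["Intensity"] == actual["Intensity"]).all())

# Tolerance-based comparison accepts them.
close_match = bool(
    np.allclose(expected["Intensity"], actual["Intensity"], rtol=1e-7, atol=1e-9)
)

# Raises AssertionError if the frames differ beyond the tolerances.
assert_frame_equal(expected, actual, check_exact=False, rtol=1e-7, atol=1e-9)

print(exact_match, close_match)
```

Choose tolerances that are tight enough to catch real regressions in your workflow but loose enough to absorb representation noise.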

Step 5: Create the pyproject.toml file

Create pyproject.toml. This file is required: it describes the package, including its version, dependencies, and authors. For reference, see the example pyproject.toml in this repository.

Why is there a Python package with the R code? The Mass Dynamics platform runs Python. The Python package wraps the R workflow via md_dataset and the @md_r decorator — it is required for integration.

In pyproject.toml, specify:

  • This package's version
  • The latest md_dataset version (unless a specific version is explicitly needed)

Step 6: Building and Pushing to ECR

What is this?

This workflow packages a custom R script into a Docker image — a self-contained bundle that includes R, all dependencies, and the package itself — and pushes it to ECR (Elastic Container Registry), which is AWS's private Docker image registry. From there, the MD platform can pull and run the image in Kubernetes.

Think of it like: build a reproducible computational environment → ship it to AWS → the platform runs it on demand.


Prerequisites

  • AWS CLI installed and configured (aws configure) — install guide
  • Docker Desktop installed and running — download for Mac. After installing, make sure Docker Desktop is open before running any docker commands.
  • The following AWS IAM permissions on your profile:
    • ecr:GetAuthorizationToken
    • ecr:CreateRepository
    • ecr:BatchCheckLayerAvailability
    • ecr:PutImage
    • ecr:InitiateLayerUpload
    • ecr:UploadLayerPart
    • ecr:CompleteLayerUpload
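
If you need to request these permissions from an administrator, the list can be expressed as an IAM policy document. This is a sketch under assumptions: the source does not prescribe a policy, the Sid value is made up, and Resource is set to "*" for simplicity (ecr:GetAuthorizationToken requires it; an administrator may prefer to scope the other actions to specific repositories).

```python
import json

# The ECR actions listed above.
ECR_PUSH_ACTIONS = [
    "ecr:GetAuthorizationToken",
    "ecr:CreateRepository",
    "ecr:BatchCheckLayerAvailability",
    "ecr:PutImage",
    "ecr:InitiateLayerUpload",
    "ecr:UploadLayerPart",
    "ecr:CompleteLayerUpload",
]


def ecr_push_policy():
    """Build an illustrative IAM policy document granting the actions above."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "EcrBuildAndPush",  # hypothetical statement ID
                "Effect": "Allow",
                "Action": ECR_PUSH_ACTIONS,
                "Resource": "*",  # assumption: unscoped for simplicity
            }
        ],
    }


print(json.dumps(ecr_push_policy(), indent=2))
```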

0. R base image (first time only)

The custom workflow image is built on an R base that provides Python + R on Amazon Linux. Pick one of the following.

A) Build the base locally

Use this if you need an unpublished base or a specific md_dataset revision.

  1. Clone or fork md_dataset.
  2. From that repo:
cd /path/to/your-fork-of-md_dataset

# Step 1 — Python+R base
docker build -t md_dataset_package-linux-base:latest -f base.Dockerfile --platform="linux/amd64" .

# Step 2 — R base (this is what custom R scripts use)
docker build \
  --build-arg BASE_IMAGE=md_dataset_package-linux-base:latest \
  -t md_dataset_package-linux-r-base:latest \
  -f r.base.Dockerfile \
  --platform="linux/amd64" .

These builds take a while — they compile R and install system libraries. Re-run only when the md_dataset base changes.

B) Pull from Docker Hub

Use this for the published R base from Mass Dynamics on Docker Hub (md_dataset_package_r_base).

docker pull massdynamics/md_dataset_package_r_base:latest

Use a specific image tag instead of latest if you need a fixed baseline.


1. Set your variables

export AWS_PROFILE=eb-services-cli
export AWS_REGION=ap-southeast-2
export IMAGE_NAME=<your-repo-name>   # e.g. md_impute_knn_tn
export IMAGE_TAG=<version>-1         # e.g. 0.1.8-1; bump the suffix on each new push

export AWS_ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text --profile $AWS_PROFILE)
export REGISTRY=${AWS_ACCOUNT_ID}.dkr.ecr.${AWS_REGION}.amazonaws.com

IMAGE_TAG follows the pattern <version>-<build_number>. The version comes from pyproject.toml. Increment the build number each time you push a new image for the same version. Each terminal session has its own environment — if you open a new terminal, re-run these exports before continuing.
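
The tag pattern can be captured in two small helpers. This is an illustrative sketch: the function names are hypothetical, and it assumes the version is three dot-separated numbers, as in the 0.1.8 example.

```python
import re

# Assumption: versions look like 0.1.8 (three dot-separated numbers).
TAG_PATTERN = re.compile(r"^(?P<version>\d+\.\d+\.\d+)-(?P<build>\d+)$")


def make_tag(version, build=1):
    """Compose an image tag of the form <version>-<build_number>."""
    return f"{version}-{build}"


def bump_build(tag):
    """Return the tag for the next push of the same version."""
    match = TAG_PATTERN.match(tag)
    if match is None:
        raise ValueError(f"tag does not look like <version>-<build_number>: {tag!r}")
    return make_tag(match["version"], int(match["build"]) + 1)


print(make_tag("0.1.8"))      # 0.1.8-1
print(bump_build("0.1.8-1"))  # 0.1.8-2
```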


2. Create the ECR repository (first time only)

ECR is where AWS stores your Docker images. This creates a private repository for this workflow.

aws ecr create-repository \
  --repository-name $IMAGE_NAME \
  --region $AWS_REGION \
  --image-tag-mutability IMMUTABLE \
  --encryption-configuration encryptionType=AES256 \
  --profile $AWS_PROFILE

IMMUTABLE means that once a tag (e.g. 0.1.8-1) is pushed it cannot be overwritten. This is intentional and good practice for reproducibility: if you need to push a fix, bump the build number (e.g. 0.1.8-2). The repository is private by default; you can confirm this in the AWS Console under ECR → Repositories. If the repository already exists, this command returns an error that you can safely ignore. To check first, run:

aws ecr describe-repositories --repository-names $IMAGE_NAME --region $AWS_REGION --profile $AWS_PROFILE
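
The "check first, then create" sequence can be scripted so that the step is safe to re-run. This is a minimal sketch that shells out to the same AWS CLI commands shown above, passing --region explicitly to both; the function names are hypothetical, and it makes the simplifying assumption that any non-zero exit from describe-repositories means the repository does not exist (expired credentials would look the same).

```python
import subprocess


def describe_cmd(name, region, profile):
    """Build the AWS CLI command that checks whether the repository exists."""
    return [
        "aws", "ecr", "describe-repositories",
        "--repository-names", name,
        "--region", region,
        "--profile", profile,
    ]


def create_cmd(name, region, profile):
    """Build the AWS CLI command that creates the repository."""
    return [
        "aws", "ecr", "create-repository",
        "--repository-name", name,
        "--region", region,
        "--image-tag-mutability", "IMMUTABLE",
        "--encryption-configuration", "encryptionType=AES256",
        "--profile", profile,
    ]


def ensure_repository(name, region, profile, run=subprocess.run):
    """Create the repository only if the describe call fails.

    Returns True if a create was issued, False if the repository existed.
    """
    exists = run(describe_cmd(name, region, profile), capture_output=True).returncode == 0
    if exists:
        return False
    run(create_cmd(name, region, profile), check=True)
    return True


# Example call (requires a configured AWS CLI):
# ensure_repository("md_impute_knn_tn", "ap-southeast-2", "eb-services-cli")
```

The run parameter exists only so the logic can be exercised with a stand-in instead of a real AWS call.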

3. Build the image

This builds the Docker image for your custom R workflow on top of the R base from step 0.

If you used 0A (local build): pass the local tag you created (md_dataset_package-linux-r-base:latest). That name is only used for images you built yourself in the md_dataset repo.

If you used 0B (Docker Hub): pass the image you pulled — massdynamics/md_dataset_package_r_base — not md_dataset_package-linux-r-base. After a pull, Docker knows the image by the Hub name; linux-r-base is a different local tag that does not exist unless you built 0A (or retagged manually). Omitting --build-arg BASE_IMAGE=... is equivalent here because the Dockerfile already defaults to massdynamics/md_dataset_package_r_base.

cd /path/to/your-custom-r-repo

# After step 0A (local R base from md_dataset):
docker build \
  --build-arg BASE_IMAGE=md_dataset_package-linux-r-base:latest \
  -t $IMAGE_NAME:$IMAGE_TAG \
  -f Dockerfile \
  --platform="linux/amd64" \
  .

# After step 0B (same R base stack, published on Docker Hub):
docker build \
  --build-arg BASE_IMAGE=massdynamics/md_dataset_package_r_base:latest \
  -t $IMAGE_NAME:$IMAGE_TAG \
  -f Dockerfile \
  --platform="linux/amd64" \
  .

--platform="linux/amd64" is required even on Apple Silicon Macs — the image must target the Linux/amd64 architecture that runs in the cloud.


4. Authenticate Docker with ECR

Docker needs a temporary token to push to the private registry. This command fetches one from AWS and logs Docker in automatically.

aws ecr get-login-password --region $AWS_REGION --profile $AWS_PROFILE \
  | docker login --username AWS --password-stdin ${REGISTRY}

5. Tag the image for ECR

ECR requires the image name to carry the full registry URI as a prefix before it can be pushed.

docker tag $IMAGE_NAME:$IMAGE_TAG ${REGISTRY}/$IMAGE_NAME:$IMAGE_TAG

6. Push to ECR

docker push ${REGISTRY}/$IMAGE_NAME:$IMAGE_TAG

The full image URI (needed for the deploy step) will be:

<AWS_ACCOUNT_ID>.dkr.ecr.<AWS_REGION>.amazonaws.com/<IMAGE_NAME>:<IMAGE_TAG>

Notes

  • Do not push a latest tag — the repo is IMMUTABLE, so latest would be locked to one digest and cannot be updated. Always use versioned tags only.
  • To delete an accidentally pushed tag: aws ecr batch-delete-image --repository-name $IMAGE_NAME --region $AWS_REGION --profile $AWS_PROFILE --image-ids imageTag=<tag>
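
The full image URI from step 6 is assembled from the account ID, region, image name, and tag. A short sketch of that composition follows; the function names are hypothetical and 123456789012 is a placeholder account ID.

```python
def ecr_registry(account_id, region):
    """Return the registry host for an AWS account and region."""
    if not (account_id.isdigit() and len(account_id) == 12):
        raise ValueError("AWS account IDs are 12 digits")
    return f"{account_id}.dkr.ecr.{region}.amazonaws.com"


def ecr_image_uri(account_id, region, image_name, image_tag):
    """Return the full image URI used for docker tag, docker push, and deploy."""
    return f"{ecr_registry(account_id, region)}/{image_name}:{image_tag}"


print(ecr_image_uri("123456789012", "ap-southeast-2", "md_impute_knn_tn", "0.1.8-1"))
# 123456789012.dkr.ecr.ap-southeast-2.amazonaws.com/md_impute_knn_tn:0.1.8-1
```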

Step 7: Deploy the image to the platform

Deploy via API

If you have access to the MD platform API, you can deploy the image yourself. After you build and push a Docker image to ECR, this step registers it with the MD platform so users can run it. The platform pulls the image, extracts the form schema from process_r.py, and makes the workflow available in the UI.


Requirements

A .env file in the repo root (git-ignored):

MD_API_BASE_URL=https://dev.massdynamics.com/api
MD_AUTH_TOKEN=<your token>

Get MD_AUTH_TOKEN from your MD account settings.
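
A missing or empty setting otherwise surfaces as a bare KeyError or a rejected request. The sketch below checks both variables up front; the helper names are hypothetical, and it assumes the .env file has already been loaded into the environment (for example with load_dotenv()).

```python
import os

REQUIRED_VARS = ("MD_API_BASE_URL", "MD_AUTH_TOKEN")


def missing_settings(env=os.environ):
    """Return the names of required variables that are unset or blank."""
    return [name for name in REQUIRED_VARS if not env.get(name, "").strip()]


def require_settings(env=os.environ):
    """Return the required settings, or exit with a readable message."""
    missing = missing_settings(env)
    if missing:
        raise SystemExit("Missing in .env or environment: " + ", ".join(missing))
    return {name: env[name] for name in REQUIRED_VARS}


print(missing_settings({"MD_API_BASE_URL": "https://dev.massdynamics.com/api"}))
# ['MD_AUTH_TOKEN']
```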


Deploy script

"""
Usage: source .venv/bin/activate && python deploy.py
Requires: .env with MD_API_BASE_URL and MD_AUTH_TOKEN
"""

import os
import time

import requests
from dotenv import load_dotenv

load_dotenv()

BASE_URL  = os.environ["MD_API_BASE_URL"].rstrip("/")
API_TOKEN = os.environ["MD_AUTH_TOKEN"]

IMAGE    = "<account_id>.dkr.ecr.<region>.amazonaws.com/<repo_name>:<version>-<build>"
JOB_NAME = "<Display Name in MD UI>"
RUN_TYPE = "INTENSITY"   # INTENSITY | PAIRWISE | ANOVA | ENRICHMENT | DOSE_RESPONSE
FLOW     = "<function name decorated with @md_r in process_r.py>"
FLOW_PKG = "<python.module.path.to.process_r>"

HEADERS = {
    "Authorization": f"Bearer {API_TOKEN}",
    "Accept": "application/vnd.md-v2+json",
    "Content-Type": "application/json",
}

payload = {
    "name": JOB_NAME,
    "run_type": RUN_TYPE,
    "public": False,
    "job_deploy_request": {
        "image": IMAGE,
        "flow_package": FLOW_PKG,
        "flow": FLOW,
    },
}

resp = requests.post(f"{BASE_URL}/jobs/create_or_update", headers=HEADERS, json=payload)

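if resp.status_code == 201:
    print("Deployed.")
elif resp.status_code == 202:
    # Deployment is asynchronous: poll the status URL given in the Location header.
    url = BASE_URL.replace("/api", "") + resp.headers["Location"]
    while True:
        r = requests.get(url, headers=HEADERS)
        if r.status_code == 201:
            print("Deployed.")
            break
        elif r.status_code == 202:
            print("  ... deploying")
            time.sleep(5)
        else:
            # raise_for_status() only raises on 4xx/5xx; fail explicitly on any
            # other status so the loop cannot spin forever.
            r.raise_for_status()
            raise RuntimeError(f"Unexpected status {r.status_code} while polling")
else:
    resp.raise_for_status()
    raise RuntimeError(f"Unexpected status {resp.status_code} from deploy request")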
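
The polling loop in the script has no upper time limit. One way to bound it is sketched below, keeping the script's convention that 202 means "still deploying" and 201 means "done". The function name, the 600-second default, and the injectable sleep and clock parameters are assumptions made for illustration.

```python
import time


def poll_until_deployed(fetch_status, interval=5.0, timeout=600.0,
                        sleep=time.sleep, clock=time.monotonic):
    """Call fetch_status() until it returns 201.

    Raises RuntimeError on any status other than 201 or 202, and
    TimeoutError if the deployment is still pending after `timeout` seconds.
    """
    deadline = clock() + timeout
    while True:
        status = fetch_status()
        if status == 201:
            return
        if status != 202:
            raise RuntimeError(f"unexpected status while polling: {status}")
        if clock() >= deadline:
            raise TimeoutError(f"deployment still pending after {timeout} seconds")
        sleep(interval)


# Example call with the names used in the deploy script:
# poll_until_deployed(lambda: requests.get(url, headers=HEADERS).status_code)
```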

Config fields

| Field | Description |
| --- | --- |
| IMAGE | Full ECR URI; update this every time you push a new image |
| JOB_NAME | Display name in the MD UI |
| RUN_TYPE | Output dataset type: INTENSITY, PAIRWISE, ANOVA, etc. |
| FLOW | The @md_r-decorated function name in process_r.py |
| FLOW_PKG | Python module path to process_r.py |

Image versioning

Tags follow <version>-<build>, e.g. 0.2.0-1. Bump the build number on each push for the same version. The ECR repository uses immutable tags, so you cannot overwrite an existing tag.

Any change to process_r.py (form fields, descriptions, parameters) requires a new image and a new deploy.

Installation on the platform (handled by Mass Dynamics)

If you do not have access to the MD platform API, contact MD Member Success, who will coordinate with the engineering team to install your workflow on the platform.

Under the hood, a new workflow is registered via the script md-dataset-deploy from the MD Dataset Package. For reference, this project includes ./infra and ./scripts/deploy with example Helm configurations - these may be useful if you need to automate deployment (e.g. via CI/CD), but installation is typically done by the Mass Dynamics team.

Note: the example deployment scripts do not cover IAM or Kubernetes Service Account setup.
