Use native ARM runners for Docker multi-arch builds by dpark01 · Pull Request #1020 · broadinstitute/viral-ngs

dpark01 · 2026-02-04T04:20:45Z

Summary

Migrates Docker multi-arch builds from QEMU emulation to native ARM runners for 3-5x faster ARM build times.

Build Infrastructure Changes

Split each build job into 3 parallel jobs: amd64 (ubuntu-latest), arm64 (ubuntu-24.04-arm), manifest creation
Each architecture chain runs independently, only merging at manifest creation
PRs now push to GHCR (but not Quay) for full multi-arch validation
Added OCI annotations to manifests (lesson from qprimer_designer)
Architecture-specific caching to prevent cross-arch cache pollution

Test Infrastructure Changes

Added ARM64 test jobs that run only on PRs when Docker files change
Removed unnecessary GHCR authentication from test jobs (public images do not require auth)
Added ARM skip decorators for x86-only tools (novoalign, mvicuna, bmtagger)

ARM Skip Decorators Added

Tests using x86-only bioconda packages now properly skip on ARM64:

TestMvicuna - uses mvicuna (x86-only)
TestDepleteHuman - uses bmtagger (x86-only)
TestRefineAssemble - uses novoalign (x86-only)
TestRefineAssembly - uses novoalign (x86-only)
test_tool_install - skips NovoalignTool and MvicunaTool on ARM
test_novoalign method - uses novoalign (x86-only)
test_downsample_with_dedup_after - uses mvicuna via dedup (x86-only)

Dependency Version Pinning

Added minimum version constraints to prevent pip from unnecessarily upgrading conda-installed packages during Docker builds:

pyproject.toml: Removed anndata (only needed in classify image), added version constraints to pandas>=2.0.0, pyyaml>=6.0, scipy>=1.10.0, seaborn>=0.12.0
docker/requirements/core.txt: Added version constraints to pandas>=2.0.0, pyyaml>=6.0, scipy>=1.10.0, seaborn>=0.12.0
docker/requirements/baseimage.txt: Added version constraints to awscli>=1.32.0, google-cloud-storage>=2.14.0

Bug Fixes

Fixed test_krona_import_taxonomy to call main_krona() (renamed in 76de9e3)

Job structure change:

Current	New
6 build jobs (QEMU multi-arch)	18 build/manifest jobs (native)
4 x86 test jobs	4 x86 test jobs + 4 ARM64 test jobs (conditional)

Expected benefits:

Native ARM builds are 3-5x faster than QEMU emulation
Better parallelization (AMD64 and ARM64 build chains run simultaneously)
Total billable minutes should decrease despite more jobs
Elimination of transient GHCR auth failures in test jobs

Closes #1019

Test plan

Verify all build jobs complete successfully
Confirm multi-arch manifests are created correctly (docker manifest inspect)
Check ARM64 test jobs run only on PRs with Docker changes
Verify deploy-to-quay copies images correctly on push to main
Compare build times to previous QEMU-based workflow
Confirm ARM skip decorators work (tests skip, not fail, on ARM)

🤖 Generated with Claude Code

Migrates Docker multi-arch builds from QEMU emulation to native ARM runners for significantly faster build times (3-5x improvement expected). Changes: - Split each build job into parallel architecture-specific jobs: - build-{flavor}-amd64 on ubuntu-latest - build-{flavor}-arm64 on ubuntu-24.04-arm - create-manifest-{flavor} to combine into multi-arch manifest - Architecture chains run independently (amd64 and arm64 in parallel) - Each arch build depends on its parent arch-specific image, not manifest - Added OCI annotations (source, description, licenses) to manifests - Architecture-specific cache tags prevent cross-arch cache pollution - PRs now push to GHCR (enables full multi-arch validation) - Only deploy-to-quay is skipped for PRs - Simplified test jobs to always pull from registry - Added ARM64 test jobs (only run on PRs with docker changes) Job count: 14 -> 30 (but faster due to native execution + parallelization) Closes #1019

- test-phylo: Add 'as ncbi' alias to import in test_ncbi.py - test-assemble: Add missing expected.ebov.refine1.new.fasta from viral-assemble - test-classify: Fix corrupted test-reads.revert.bam (was 0 bytes, now 177 bytes) Co-Authored-By: Claude <noreply@anthropic.com>

For PRs, github.ref_name is '1020/merge' which contains '/' - invalid in Docker tags. Added image-tag-prefix output in get-version job that sanitizes the ref: - PRs: pr-<number> (e.g., pr-1020) - Branches: slashes replaced with dashes (e.g., feature/foo -> feature-foo) - Tags: unchanged (e.g., v1.25.0)

codecov · 2026-02-04T04:35:13Z

Welcome to Codecov 🎉

Once you merge this PR into your default branch, you're all set! Codecov will compare coverage reports and display results in all future pull requests.

Thanks for integrating Codecov - We've got you covered ☂️

When building derivative images (core, assemble, classify, phylo, mega), add the parent image's cache as an additional fallback. This helps when building a new branch where the derivative image cache doesn't exist yet but the parent's cache does. Cache hierarchy: - core: falls back to baseimage cache - assemble/classify/phylo/mega: fall back to core cache

- Add ARM skip decorator to TestMvicuna (mvicuna is x86-only) - Add ARM skip decorator to TestDepleteHuman (bmtagger is x86-only) - Add ARM skip decorator to TestRefineAssemble (novoalign is x86-only) - Fix krona patch path: 'krona.Krona' -> 'viral_ngs.metagenomics.krona.Krona' (patch where used, not where defined)

- Skip NovoalignTool and MvicunaTool in test_tool_install on ARM - Skip test_novoalign method in TestAlignAndFix on ARM - Skip test_downsample_with_dedup_after (uses mvicuna internally) on ARM

…tion.py

GHCR public images can be pulled without authentication, so the docker/login-action steps in test jobs are unnecessary and can cause transient failures due to network timeouts. Removed from: test-core, test-core-arm64, test-assemble, test-assemble-arm64, test-classify, test-classify-arm64, test-phylo, test-phylo-arm64

- Document 3-job pattern for each flavor (amd64, arm64, manifest) - Add build job flow diagram showing parallel architecture chains - Document ARM64 test jobs that run on PRs with docker changes - Add ARM Test Skipping section with code examples - Document registry strategy (GHCR for CI, Quay for production) - Remove outdated references to QEMU emulation

…efore - Add retry logic (5 attempts with 10s wait) to all test jobs for docker pull to handle transient GHCR connectivity timeouts - Add ARM skip decorator to test_downsample_with_dedup_before which also uses mvicuna via rmdup_mvicuna_bam (same as test_downsample_with_dedup_after)

The test was using mocker fixture from pytest-mock which is not installed. Convert to use unittest.mock.patch like other tests in the same file. Also fix patch path to 'viral_ngs.metagenomics.kaiju.Kaiju.classify'.

Copilot

Pull request overview

This PR migrates Docker multi-arch builds from QEMU emulation to native ARM runners, significantly improving ARM64 build performance (3-5x faster). The infrastructure now uses separate parallel jobs for AMD64 and ARM64 builds that are later combined into multi-arch manifests.

Changes:

Build infrastructure split into parallel native builds per architecture with manifest creation
ARM64 test jobs added for PRs with Docker file changes
ARM skip decorators added for tests using x86-only bioconda packages (novoalign, mvicuna, bmtagger)

Reviewed changes

Copilot reviewed 10 out of 11 changed files in this pull request and generated 4 comments.

Show a summary per file

File	Description
.github/workflows/docker.yml	Restructured to use native ARM runners with parallel architecture-specific builds and manifest creation jobs
tests/unit/core/test_tools.py	Added platform detection and skip logic for x86-only tools
tests/unit/core/test_read_utils.py	Added ARM skip decorators for mvicuna-dependent tests
tests/unit/classify/test_metagenomics.py	Fixed test to call renamed main_krona function and updated mock paths
tests/unit/classify/test_integration_taxon_filter.py	Added ARM skip decorator for bmtagger-dependent test class
tests/unit/assemble/test_assembly_integration.py	Added ARM skip decorator for novoalign-dependent test class
tests/unit/assemble/test_assembly.py	Added ARM skip decorator for novoalign-dependent test class
tests/unit/phylo/test_ncbi.py	Changed import to use alias for consistency
tests/input/TestRefineAssembly/expected.ebov.refine1.new.fasta	Added expected test output file
AGENTS.md	Updated documentation to reflect new native ARM build architecture and registry strategy

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

tests/unit/classify/test_metagenomics.py

tests/unit/core/test_read_utils.py

.github/workflows/docker.yml

The kaiju functionality was removed from viral_ngs.metagenomics but this test was left behind. The test fails with: AttributeError: module 'viral_ngs.metagenomics' has no attribute 'kaiju' There is no parser_kaiju, no Kaiju class, and no kaiju module anywhere in the codebase - this test was testing dead code.

Instead of fail-fast mode, run all tests to see the full picture of any failures rather than stopping at the first one.

- Add missing `cmd as util_cmd` import to core/file.py (fixes NameError in test_bmtagger_db_build) - Rename local `samtools` variable to `samtools_tool` in 5 test methods to avoid shadowing the imported module (fixes UnboundLocalError)

- Use SKIP_X86_ONLY_REASON constant in test_read_utils.py skip decorators - Update test_krakenuniq to use context manager pattern for consistency

- Remove self-imports (modules importing themselves): - file.py: removed `file as util_file`, use direct function calls - misc.py: removed `misc as util_misc` (unused) - samtools.py: removed `samtools` self-import - picard.py: removed `picard` self-import, fix SamToFastqTool reference - Remove duplicate imports: - gatk.py, novoalign.py: removed duplicate samtools/picard imports - fastqc.py, splitcode.py: removed duplicate samtools import - bbmap.py: removed duplicate samtools/picard imports - Fix circular import: - file.py: use lazy import for cmd module in set_tmp_dir()

Tests in test_tools_kb_python.py and test_tools_kma.py were using the mocker fixture from pytest-mock which is not installed. Convert them to use unittest.mock.patch context managers for consistency with other tests in the codebase. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

The kb.py and kma.py modules import 'file' not 'util_file', so the patch paths need to match.

- Remove anndata from pyproject.toml (only needed in classify image, already in docker/requirements/classify.txt) - Add minimum version constraints to pandas, pyyaml, scipy, seaborn in both pyproject.toml and docker/requirements/core.txt - Add minimum version constraints to awscli, google-cloud-storage in docker/requirements/baseimage.txt This prevents pip from unnecessarily downloading and upgrading packages that are already installed via conda during Docker builds. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

dpark01 · 2026-02-04T14:51:58Z

Docker Build Performance Analysis (Post-Merge)

Analyzed GHA run 21675133318 to verify the build optimizations from this PR are working correctly.

Key Results

✅ pip optimization is working - The pip install step now completes in ~3 seconds instead of re-downloading packages. The version constraints added to docker/requirements/*.txt prevent pip from upgrading conda-installed packages.

✅ Layer caching is active - Build logs show cache manifests being imported successfully from both branch-specific and main caches.

✅ Native ARM runners provide major speedup - Comparing to the old QEMU-based workflow:

Image	Before (QEMU)	After (Native)	Improvement
core	7m 16s	3m 41s	~50% faster
assemble	11m 16s	2m 47s	~75% faster
classify	11m 54s	3m 14s	~73% faster
phylo	9m 49s	3m 22s	~66% faster
mega	15m 57s	5m 29s	~66% faster

Build Time Breakdown (core-amd64)

Cache import: ~1s
Base layer extraction: ~15s
Conda install: ~72s (only runs when requirements change)
pip install: ~3s ✅
Verification: ~6s
Layer export: ~104s (normal for mode=max caching)
Push: ~14s

Summary

Total workflow time is now ~12 minutes (including all architectures and tests), down from 25+ minutes with the QEMU-based approach. The cache export taking ~100s is expected behavior—BuildKit compresses and uploads layers to enable future cache hits.

The build is well-optimized and making efficient use of layer caching.

dpark01 and others added 3 commits February 3, 2026 23:16

dpark01 added 9 commits February 3, 2026 23:40

Add ARM skip decorators for x86-only bioconda packages

543187f

- Skip NovoalignTool and MvicunaTool in test_tool_install on ARM - Skip test_novoalign method in TestAlignAndFix on ARM - Skip test_downsample_with_dedup_after (uses mvicuna internally) on ARM

Add ARM skip decorator to TestRefineAssembly in test_assembly_integra…

c3c6b61

…tion.py

Fix test to call main_krona instead of krona (renamed in 76de9e3)

2f15ff2

Fix test_kaiju to use unittest.mock.patch instead of mocker

a38daa8

The test was using mocker fixture from pytest-mock which is not installed. Convert to use unittest.mock.patch like other tests in the same file. Also fix patch path to 'viral_ngs.metagenomics.kaiju.Kaiju.classify'.

Copilot AI review requested due to automatic review settings February 4, 2026 11:22

Copilot AI reviewed Feb 4, 2026

View reviewed changes

tests/unit/classify/test_metagenomics.py Outdated Show resolved Hide resolved

tests/unit/core/test_read_utils.py Outdated Show resolved Hide resolved

tests/unit/core/test_read_utils.py Outdated Show resolved Hide resolved

.github/workflows/docker.yml Show resolved Hide resolved

dpark01 enabled auto-merge February 4, 2026 11:23

dpark01 and others added 8 commits February 4, 2026 07:23

Remove -x flag from pytest to run all tests comprehensively

42acd3f

Instead of fail-fast mode, run all tests to see the full picture of any failures rather than stopping at the first one.

Address GitHub Copilot review comments

9e3712a

- Use SKIP_X86_ONLY_REASON constant in test_read_utils.py skip decorators - Update test_krakenuniq to use context manager pattern for consistency

Merge branch 'main' into issue-1019-native-arm-runners

bb8dc8a

Fix mock patch paths: util_file -> file

9384924

The kb.py and kma.py modules import 'file' not 'util_file', so the patch paths need to match.

dpark01 disabled auto-merge February 4, 2026 14:17

dpark01 enabled auto-merge February 4, 2026 14:23

dpark01 disabled auto-merge February 4, 2026 14:44

dpark01 merged commit 9d2b85f into main Feb 4, 2026
69 checks passed

dpark01 deleted the issue-1019-native-arm-runners branch February 4, 2026 14:44

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Use native ARM runners for Docker multi-arch builds#1020

Use native ARM runners for Docker multi-arch builds#1020
dpark01 merged 21 commits intomainfrom
issue-1019-native-arm-runners

dpark01 commented Feb 4, 2026 •

edited

Loading

Uh oh!

codecov bot commented Feb 4, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

dpark01 commented Feb 4, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

dpark01 commented Feb 4, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Build Infrastructure Changes

Test Infrastructure Changes

ARM Skip Decorators Added

Dependency Version Pinning

Bug Fixes

Test plan

Uh oh!

codecov bot commented Feb 4, 2026

Welcome to Codecov 🎉

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

dpark01 commented Feb 4, 2026

Docker Build Performance Analysis (Post-Merge)

Key Results

Build Time Breakdown (core-amd64)

Summary

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

dpark01 commented Feb 4, 2026 •

edited

Loading