Skip to content

Use native ARM runners for Docker multi-arch builds#1020

Merged
dpark01 merged 21 commits intomainfrom
issue-1019-native-arm-runners
Feb 4, 2026
Merged

Use native ARM runners for Docker multi-arch builds#1020
dpark01 merged 21 commits intomainfrom
issue-1019-native-arm-runners

Conversation

@dpark01
Copy link
Member

@dpark01 dpark01 commented Feb 4, 2026

Summary

Migrates Docker multi-arch builds from QEMU emulation to native ARM runners for 3-5x faster ARM build times.

Build Infrastructure Changes

  • Split each build job into 3 parallel jobs: amd64 (ubuntu-latest), arm64 (ubuntu-24.04-arm), manifest creation
  • Each architecture chain runs independently, only merging at manifest creation
  • PRs now push to GHCR (but not Quay) for full multi-arch validation
  • Added OCI annotations to manifests (lesson from qprimer_designer)
  • Architecture-specific caching to prevent cross-arch cache pollution

Test Infrastructure Changes

  • Added ARM64 test jobs that run only on PRs when Docker files change
  • Removed unnecessary GHCR authentication from test jobs (public images do not require auth)
  • Added ARM skip decorators for x86-only tools (novoalign, mvicuna, bmtagger)

ARM Skip Decorators Added

Tests using x86-only bioconda packages now properly skip on ARM64:

  • TestMvicuna - uses mvicuna (x86-only)
  • TestDepleteHuman - uses bmtagger (x86-only)
  • TestRefineAssemble - uses novoalign (x86-only)
  • TestRefineAssembly - uses novoalign (x86-only)
  • test_tool_install - skips NovoalignTool and MvicunaTool on ARM
  • test_novoalign method - uses novoalign (x86-only)
  • test_downsample_with_dedup_after - uses mvicuna via dedup (x86-only)

Dependency Version Pinning

Added minimum version constraints to prevent pip from unnecessarily upgrading conda-installed packages during Docker builds:

  • pyproject.toml: Removed anndata (only needed in classify image), added version constraints to pandas>=2.0.0, pyyaml>=6.0, scipy>=1.10.0, seaborn>=0.12.0
  • docker/requirements/core.txt: Added version constraints to pandas>=2.0.0, pyyaml>=6.0, scipy>=1.10.0, seaborn>=0.12.0
  • docker/requirements/baseimage.txt: Added version constraints to awscli>=1.32.0, google-cloud-storage>=2.14.0

Bug Fixes

  • Fixed test_krona_import_taxonomy to call main_krona() (renamed in 76de9e3)

Job structure change:

Current New
6 build jobs (QEMU multi-arch) 18 build/manifest jobs (native)
4 x86 test jobs 4 x86 test jobs + 4 ARM64 test jobs (conditional)

Expected benefits:

  • Native ARM builds are 3-5x faster than QEMU emulation
  • Better parallelization (AMD64 and ARM64 build chains run simultaneously)
  • Total billable minutes should decrease despite more jobs
  • Elimination of transient GHCR auth failures in test jobs

Closes #1019

Test plan

  • Verify all build jobs complete successfully
  • Confirm multi-arch manifests are created correctly (docker manifest inspect)
  • Check ARM64 test jobs run only on PRs with Docker changes
  • Verify deploy-to-quay copies images correctly on push to main
  • Compare build times to previous QEMU-based workflow
  • Confirm ARM skip decorators work (tests skip, not fail, on ARM)

🤖 Generated with Claude Code

dpark01 and others added 3 commits February 3, 2026 23:16
Migrates Docker multi-arch builds from QEMU emulation to native ARM runners
for significantly faster build times (3-5x improvement expected).

Changes:
- Split each build job into parallel architecture-specific jobs:
  - build-{flavor}-amd64 on ubuntu-latest
  - build-{flavor}-arm64 on ubuntu-24.04-arm
  - create-manifest-{flavor} to combine into multi-arch manifest

- Architecture chains run independently (amd64 and arm64 in parallel)
- Each arch build depends on its parent arch-specific image, not manifest
- Added OCI annotations (source, description, licenses) to manifests
- Architecture-specific cache tags prevent cross-arch cache pollution

- PRs now push to GHCR (enables full multi-arch validation)
- Only deploy-to-quay is skipped for PRs

- Simplified test jobs to always pull from registry
- Added ARM64 test jobs (only run on PRs with docker changes)

Job count: 14 -> 30 (but faster due to native execution + parallelization)

Closes #1019
- test-phylo: Add 'as ncbi' alias to import in test_ncbi.py
- test-assemble: Add missing expected.ebov.refine1.new.fasta from viral-assemble
- test-classify: Fix corrupted test-reads.revert.bam (was 0 bytes, now 177 bytes)

Co-Authored-By: Claude <noreply@anthropic.com>
For PRs, github.ref_name is '1020/merge' which contains '/' - invalid in Docker tags.
Added image-tag-prefix output in get-version job that sanitizes the ref:
- PRs: pr-<number> (e.g., pr-1020)
- Branches: slashes replaced with dashes (e.g., feature/foo -> feature-foo)
- Tags: unchanged (e.g., v1.25.0)
@codecov
Copy link

codecov bot commented Feb 4, 2026

Welcome to Codecov 🎉

Once you merge this PR into your default branch, you're all set! Codecov will compare coverage reports and display results in all future pull requests.

Thanks for integrating Codecov - We've got you covered ☂️

When building derivative images (core, assemble, classify, phylo, mega),
add the parent image's cache as an additional fallback. This helps when
building a new branch where the derivative image cache doesn't exist yet
but the parent's cache does.

Cache hierarchy:
- core: falls back to baseimage cache
- assemble/classify/phylo/mega: fall back to core cache
- Add ARM skip decorator to TestMvicuna (mvicuna is x86-only)
- Add ARM skip decorator to TestDepleteHuman (bmtagger is x86-only)
- Add ARM skip decorator to TestRefineAssemble (novoalign is x86-only)
- Fix krona patch path: 'krona.Krona' -> 'viral_ngs.metagenomics.krona.Krona'
  (patch where used, not where defined)
- Skip NovoalignTool and MvicunaTool in test_tool_install on ARM
- Skip test_novoalign method in TestAlignAndFix on ARM
- Skip test_downsample_with_dedup_after (uses mvicuna internally) on ARM
GHCR public images can be pulled without authentication, so the
docker/login-action steps in test jobs are unnecessary and can
cause transient failures due to network timeouts.

Removed from: test-core, test-core-arm64, test-assemble,
test-assemble-arm64, test-classify, test-classify-arm64,
test-phylo, test-phylo-arm64
- Document 3-job pattern for each flavor (amd64, arm64, manifest)
- Add build job flow diagram showing parallel architecture chains
- Document ARM64 test jobs that run on PRs with docker changes
- Add ARM Test Skipping section with code examples
- Document registry strategy (GHCR for CI, Quay for production)
- Remove outdated references to QEMU emulation
…efore

- Add retry logic (5 attempts with 10s wait) to all test jobs for docker pull
  to handle transient GHCR connectivity timeouts
- Add ARM skip decorator to test_downsample_with_dedup_before which also uses
  mvicuna via rmdup_mvicuna_bam (same as test_downsample_with_dedup_after)
The test was using mocker fixture from pytest-mock which is not installed.
Convert to use unittest.mock.patch like other tests in the same file.
Also fix patch path to 'viral_ngs.metagenomics.kaiju.Kaiju.classify'.
Copilot AI review requested due to automatic review settings February 4, 2026 11:22
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR migrates Docker multi-arch builds from QEMU emulation to native ARM runners, significantly improving ARM64 build performance (3-5x faster). The infrastructure now uses separate parallel jobs for AMD64 and ARM64 builds that are later combined into multi-arch manifests.

Changes:

  • Build infrastructure split into parallel native builds per architecture with manifest creation
  • ARM64 test jobs added for PRs with Docker file changes
  • ARM skip decorators added for tests using x86-only bioconda packages (novoalign, mvicuna, bmtagger)

Reviewed changes

Copilot reviewed 10 out of 11 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
.github/workflows/docker.yml Restructured to use native ARM runners with parallel architecture-specific builds and manifest creation jobs
tests/unit/core/test_tools.py Added platform detection and skip logic for x86-only tools
tests/unit/core/test_read_utils.py Added ARM skip decorators for mvicuna-dependent tests
tests/unit/classify/test_metagenomics.py Fixed test to call renamed main_krona function and updated mock paths
tests/unit/classify/test_integration_taxon_filter.py Added ARM skip decorator for bmtagger-dependent test class
tests/unit/assemble/test_assembly_integration.py Added ARM skip decorator for novoalign-dependent test class
tests/unit/assemble/test_assembly.py Added ARM skip decorator for novoalign-dependent test class
tests/unit/phylo/test_ncbi.py Changed import to use alias for consistency
tests/input/TestRefineAssembly/expected.ebov.refine1.new.fasta Added expected test output file
AGENTS.md Updated documentation to reflect new native ARM build architecture and registry strategy

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@dpark01 dpark01 enabled auto-merge February 4, 2026 11:23
dpark01 and others added 8 commits February 4, 2026 07:23
The kaiju functionality was removed from viral_ngs.metagenomics but
this test was left behind. The test fails with:
  AttributeError: module 'viral_ngs.metagenomics' has no attribute 'kaiju'

There is no parser_kaiju, no Kaiju class, and no kaiju module anywhere
in the codebase - this test was testing dead code.
Instead of fail-fast mode, run all tests to see the full picture of
any failures rather than stopping at the first one.
- Add missing `cmd as util_cmd` import to core/file.py (fixes NameError
  in test_bmtagger_db_build)
- Rename local `samtools` variable to `samtools_tool` in 5 test methods
  to avoid shadowing the imported module (fixes UnboundLocalError)
- Use SKIP_X86_ONLY_REASON constant in test_read_utils.py skip decorators
- Update test_krakenuniq to use context manager pattern for consistency
- Remove self-imports (modules importing themselves):
  - file.py: removed `file as util_file`, use direct function calls
  - misc.py: removed `misc as util_misc` (unused)
  - samtools.py: removed `samtools` self-import
  - picard.py: removed `picard` self-import, fix SamToFastqTool reference

- Remove duplicate imports:
  - gatk.py, novoalign.py: removed duplicate samtools/picard imports
  - fastqc.py, splitcode.py: removed duplicate samtools import
  - bbmap.py: removed duplicate samtools/picard imports

- Fix circular import:
  - file.py: use lazy import for cmd module in set_tmp_dir()
Tests in test_tools_kb_python.py and test_tools_kma.py were using
the mocker fixture from pytest-mock which is not installed. Convert
them to use unittest.mock.patch context managers for consistency
with other tests in the codebase.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The kb.py and kma.py modules import 'file' not 'util_file', so the
patch paths need to match.
@dpark01 dpark01 disabled auto-merge February 4, 2026 14:17
- Remove anndata from pyproject.toml (only needed in classify image,
  already in docker/requirements/classify.txt)
- Add minimum version constraints to pandas, pyyaml, scipy, seaborn
  in both pyproject.toml and docker/requirements/core.txt
- Add minimum version constraints to awscli, google-cloud-storage
  in docker/requirements/baseimage.txt

This prevents pip from unnecessarily downloading and upgrading packages
that are already installed via conda during Docker builds.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@dpark01 dpark01 enabled auto-merge February 4, 2026 14:23
@dpark01 dpark01 disabled auto-merge February 4, 2026 14:44
@dpark01 dpark01 merged commit 9d2b85f into main Feb 4, 2026
69 checks passed
@dpark01 dpark01 deleted the issue-1019-native-arm-runners branch February 4, 2026 14:44
@dpark01
Copy link
Member Author

dpark01 commented Feb 4, 2026

Docker Build Performance Analysis (Post-Merge)

Analyzed GHA run 21675133318 to verify the build optimizations from this PR are working correctly.

Key Results

✅ pip optimization is working - The pip install step now completes in ~3 seconds instead of re-downloading packages. The version constraints added to docker/requirements/*.txt prevent pip from upgrading conda-installed packages.

✅ Layer caching is active - Build logs show cache manifests being imported successfully from both branch-specific and main caches.

✅ Native ARM runners provide major speedup - Comparing to the old QEMU-based workflow:

Image Before (QEMU) After (Native) Improvement
core 7m 16s 3m 41s ~50% faster
assemble 11m 16s 2m 47s ~75% faster
classify 11m 54s 3m 14s ~73% faster
phylo 9m 49s 3m 22s ~66% faster
mega 15m 57s 5m 29s ~66% faster

Build Time Breakdown (core-amd64)

  • Cache import: ~1s
  • Base layer extraction: ~15s
  • Conda install: ~72s (only runs when requirements change)
  • pip install: ~3s
  • Verification: ~6s
  • Layer export: ~104s (normal for mode=max caching)
  • Push: ~14s

Summary

Total workflow time is now ~12 minutes (including all architectures and tests), down from 25+ minutes with the QEMU-based approach. The cache export taking ~100s is expected behavior—BuildKit compresses and uploads layers to enable future cache hits.

The build is well-optimized and making efficient use of layer caching.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Use native ARM runners instead of QEMU for multi-arch Docker builds

2 participants