
feat(distributed): MPI/distributed bindings for Ginkgo #105

Draft

rho-novatron wants to merge 10 commits into Helmholtz-AI-Energy:main from rho-novatron:rho/distributed

Conversation

@rho-novatron
Contributor

Stacked on #104 — please review & merge that first. Once #104 lands, I'll rebase this onto main and the diff will shrink to just the MPI-related changes.

PR-B: feat(distributed): MPI/distributed bindings for Ginkgo

Branch (fork): rho-novatron:rho/distributed
Base (upstream): Helmholtz-AI-Energy:refactor/array-helper (i.e.
stacked on PR-A)
Commits: 4

SHA Subject
a46f8cf fix(utils): accept compatible integer dtype-format chars cross-platform
ec40b72 feat(distributed): expose Ginkgo MPI bindings under pyGinkgo.distributed
d440809 docs: README section on the distributed/MPI bindings
75b0816 test(distributed): add 2-rank pytest harness for MPI bindings

Summary

Adds first-class Python bindings for Ginkgo's MPI-distributed types,
exposed under the new pyGinkgo.distributed submodule.

Build is opt-in: setting pyGinkgo_BUILD_MPI=ON (defaults to
OFF) configures Ginkgo with GINKGO_BUILD_MPI=ON and compiles the
new bindings. Serial builds and the existing public API are
completely unchanged.

What's exposed

C++ type → Python binding:

  • gko::experimental::mpi::communicator → pyGinkgo.distributed.Communicator
  • gko::experimental::distributed::Partition → pyGinkgo.distributed.Partition_int32_int64
  • gko::experimental::distributed::Vector → pyGinkgo.distributed.Vector_double, etc.
  • gko::experimental::distributed::Matrix → pyGinkgo.distributed.Matrix_double_int32_int64
  • PyLinOp trampoline for distributed solves → pyGinkgo.distributed.PyLinOp_double

mpi4py is the supported way to construct a Communicator from
Python; bindings accept MPI.Comm directly.
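
A minimal construction from mpi4py might look like the sketch below; the one-argument constructor is an assumption based on the table above, not a confirmed signature:

    # Sketch: constructing a pyGinkgo Communicator from mpi4py.
    # The exact constructor signature is an assumption, not shown in this PR.
    from mpi4py import MPI
    import pyGinkgo.distributed as dist

    comm = dist.Communicator(MPI.COMM_WORLD)  # wraps gko::experimental::mpi::communicator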

Why no new solver bindings?

Existing solvers (gmres_double, cg_double, …) already accept any
gko::LinOp polymorphically through their generated apply paths. A
distributed Matrix is a LinOp, so existing bindings dispatch
correctly without modification — verified in tests/cpp_bindings/distributed/test_solver.py.
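
As an illustration of that dispatch path, a distributed solve might look like the sketch below. The method names come from this PR and its commit messages; the module path, factory signature, and the (logger, x) return value should all be read as assumptions:

    # Sketch only: every signature here is an assumption; exec, comm, local_csr,
    # b and x0 are placeholders set up elsewhere.
    from pyGinkgo import solver                       # module path is an assumption
    from pyGinkgo.pyGinkgoBindings import distributed as dist

    A = dist.Matrix_double_int32_int64.create_from_local_linop(exec, comm, local_csr)
    gmres = solver.gmres_double(exec, A)              # same binding used for serial matrices
    logger, x = gmres.apply(b, x0)                    # (Convergence logger, solution) tuple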

ABI safety

MPI ABI compatibility between Ginkgo (built against MPICH headers in
the conda package) and the user's mpi4py (also built against MPICH)
is checked lazily the first time a Communicator is constructed:

  1. Build time: the MPI implementation string (mpich, openmpi,
    …) is baked into the binding via a CMake-defined macro
    PYGINKGO_MPI_IMPL.
  2. Runtime: on first Communicator.__init__, mpi4py.MPI.Get_library_version() is parsed and compared against the baked
    value. A round-trip C++ check (broadcast a sentinel value from
    rank 0) confirms the two sides actually agree on MPI_COMM_WORLD.

Mismatches raise a clear RuntimeError instead of segfaulting.
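
A rough sketch of that lazy check, using the names the bindings export according to the commit message further down (the control flow here is illustrative, not the actual implementation):

    # Illustrative only: BUILD_MPI_IMPL and verify_abi() are names from this
    # PR's commit notes; the comparison logic is a simplified assumption.
    from mpi4py import MPI
    from pyGinkgo.pyGinkgoBindings import mpi as _mpi

    def _check_mpi_abi(comm):
        runtime = MPI.Get_library_version()          # e.g. "MPICH Version: 4.2 ..."
        if _mpi.BUILD_MPI_IMPL.lower() not in runtime.lower():
            raise RuntimeError(
                f"pyGinkgo was built against {_mpi.BUILD_MPI_IMPL}, but mpi4py "
                f"loaded: {runtime.splitlines()[0]}")
        _mpi.verify_abi(comm)                        # C++-side sentinel broadcast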

What is intentionally deferred

  • Schwarz preconditioner. The non-distributed pyGinkgo
    preconditioners are exposed as already-generated LinOps, but
    Schwarz fundamentally requires a LinOpFactory because its
    generate(distributed_matrix) step constructs the per-rank local
    solver against the local block. Adding factory bindings is a
    cross-cutting design change that deserves its own discussion and PR.
  • Distributed logger callback. Out of scope here; the standard
    Ginkgo logger interface still works.
  • read_distributed matrix-market reader. Removed as a stub; can
    return when there's a tested implementation.

Testing

22 tests, run with both 2 and 4 ranks:

mpirun -n 2 python -m pytest \
    tests/cpp_bindings/distributed/ tests/pyGinkgo/distributed/
mpirun -n 4 python -m pytest \
    tests/cpp_bindings/distributed/ tests/pyGinkgo/distributed/

Coverage:

  • Communicator round-trip + ABI mismatch path
  • Partition construction and global-to-local index mapping
  • Vector: gather/scatter, axpy, dot, norm against analytical answers (see the sketch after this list)
  • Matrix: SpMV against equivalent serial CSR result
  • PyLinOp: distributed Python-implemented operator round-trip
  • GMRES on a distributed Laplacian
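
The Vector sketch referenced above reduces to a check like this on two ranks; the executor and constructor/method names are taken from the commit notes below and should be read as assumptions:

    # 2-rank dot-product sketch; the executor class, from_local_array's signature
    # and compute_dot returning a Python float are assumptions.
    import numpy as np
    import pytest
    from mpi4py import MPI
    import pyGinkgo
    from pyGinkgo.pyGinkgoBindings import distributed as dist, mpi as gmpi

    def test_dot_two_ranks():
        world = MPI.COMM_WORLD
        if world.Get_size() != 2:
            pytest.skip("run with mpirun -n 2")
        exec_ = pyGinkgo.ReferenceExecutor()          # placeholder executor
        comm = gmpi.Communicator(world)
        local = np.full(4, float(world.Get_rank() + 1))
        v = dist.Vector_double.from_local_array(exec_, comm, local)
        # rank 0 holds [1,1,1,1], rank 1 holds [2,2,2,2]:
        # <v, v> = 4*1^2 + 4*2^2 = 20
        assert np.isclose(v.compute_dot(v), 20.0)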

Cross-platform note

The Windows fix in a46f8cf is necessary because long is 4 bytes
on Windows (vs 8 on Linux/macOS). The gko_array_from_pyobject
helper added in PR-A used a hard-coded format-char comparison; the
fix accepts any format char with a matching sizeof() and signedness
so the same Python integer arrays work on all three platforms.
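
The underlying platform difference can be seen directly from Python; this only demonstrates the buffer-protocol facts the fix relies on, the helper itself is C++ and not shown here:

    # 'l' (C long) and 'q' (C long long) describe the same 8-byte layout on
    # LP64 Linux/macOS, but 'l' shrinks to 4 bytes on Windows (LLP64).
    import struct
    import numpy as np

    a = np.zeros(3, dtype=np.int64)
    print(memoryview(a).format)     # 'l' on x86_64 Linux, 'q' on Windows
    print(struct.calcsize('l'))     # 8 on LP64 platforms, 4 on Windows
    print(struct.calcsize('q'))     # 8 everywhere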

Stacked PR

Stacked on top of PR-A (refactor/array-helper). Reviewing PR-A
first is recommended; the bulk of the new code in this PR lives in
src/cpp_bindings/distributed/ and src/cpp_bindings/mpi/, neither
of which is touched by PR-A.

rho-novatron and others added 2 commits April 18, 2026 09:21
…nstructors

Move the duplicated __cuda_array_interface__ / buffer-protocol conversion
logic from array.cpp and matrix.cpp into a single gko_array_from_pyobject<T>()
template in utils.hpp.

Both the gko::array(Executor, py::object) constructor and the sparse-matrix
(Executor, py::object) constructor now delegate to this helper, eliminating
~260 lines of duplicated CUDA/host branching.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
pybind11's `format_descriptor<int64_t>::format()` returns 'q'
(long long) but numpy's `np.int64` reports 'l' (long) on x86_64
Linux, while on Windows the relationship reverses for 32-bit ints.
Same physical layout, different format chars — `check_buffer_dtype`
was rejecting these as 'Incompatible dtypes'.

Treat any pair of integer format chars as compatible when both are
single-character signed (or both unsigned) and itemsize matches the
expected ValueType. This makes buffer-protocol conversions work
uniformly across platforms.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
rho-novatron added a commit to rho-novatron/pyGinkgo that referenced this pull request Apr 18, 2026
These bindings reuse the polymorphic gko::LinOp base, so the same
factories transparently accept both single-process matrices and
distributed::Matrix. Each solver returns a (logger, x) tuple from
apply() so callers can introspect convergence (residual norm,
iteration count) -- the standard Convergence logger pattern matching
the existing GMRES binding.

Addresses NovaPIC PR Helmholtz-AI-Energy#105 audit items 8 (Jacobi preconditioner for
distributed) and 9 (solver introspection).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
rho-novatron added a commit to rho-novatron/pyGinkgo that referenced this pull request Apr 18, 2026
…cations

- Vector_<T>.from_local_array_view: zero-copy variant of from_local_array.
  On a CudaExecutor with a __cuda_array_interface__-backed input, the
  resulting distributed.Vector aliases the caller's buffer instead of
  copying. Uses py::keep_alive<0,4> to tie the input lifetime to the
  returned vector.

- Vector_<T>.gather_on_root(root=0): gather a distributed.Vector onto
  a single rank as a host numpy array; returns None on non-root ranks.
  Uses MPI gather + gather_v with the local Dense slice as the source.

- Matrix_<T,L,G>.create_from_local_and_non_local: clarify in the
  docstring that recv_connections holds GLOBAL column ids and the
  non_local_linop's local column index space is the position into
  recv_connections, ordered by source rank then ascending global id.

- Add 2-rank smoke tests for cg/bicgstab on a distributed.Matrix and
  for the new Vector helpers (view parity + gather correctness).

Addresses NovaPIC PR Helmholtz-AI-Energy#105 audit items 5 (off-diag column convention),
13 (zero-copy CuPy <-> distributed.Vector) and 14 (gather to host).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
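
A usage sketch for the two helpers described above; only the method names come from the commit text, the executor setup and argument order are assumptions:

    # Illustrative only; the CudaExecutor constructor is a placeholder and the
    # argument order of from_local_array_view is assumed.
    import cupy as cp
    from mpi4py import MPI
    import pyGinkgo
    from pyGinkgo.pyGinkgoBindings import distributed as dist, mpi as gmpi

    comm = gmpi.Communicator(MPI.COMM_WORLD)
    gpu = pyGinkgo.CudaExecutor(0)                   # placeholder; not part of this PR's diff
    local = cp.arange(4, dtype=cp.float64)           # exposes __cuda_array_interface__

    v = dist.Vector_double.from_local_array_view(gpu, comm, local)  # aliases `local`, no copy
    full = v.gather_on_root(root=0)                  # numpy array on rank 0, None elsewhere
    if MPI.COMM_WORLD.Get_rank() == 0:
        print(full)
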
rho-novatron and others added 7 commits April 18, 2026 17:05
- Use find_package(Ginkgo) instead of find_package(ginkgo) to match the
  upstream GinkgoConfig.cmake naming on case-sensitive filesystems
- Guard install(IMPORTED_RUNTIME_ARTIFACTS) with if(NOT Ginkgo_FOUND) since
  bare targets (ginkgo, ginkgo_device, ...) only exist when Ginkgo is built
  from source via FetchContent, not when using a pre-installed package
The install(DIRECTORY) used an absolute DESTINATION (${Python_SITELIB})
which bypasses CMAKE_INSTALL_PREFIX. Since py-build-cmake uses a staging
directory as the install prefix to collect files for the wheel, the
absolute path wrote files directly to the real site-packages instead of
the staging area, resulting in wheels missing the compiled .so binding.

- Use relative DESTINATION (${PY_BUILD_CMAKE_MODULE_NAME}) so files are
  installed under CMAKE_INSTALL_PREFIX where py-build-cmake picks them up
- Add trailing / to source DIRECTORY to install contents, not the
  directory itself (avoids pyGinkgo/pyGinkgo/ nesting)
Adds an opt-in MPI/distributed surface to pyGinkgo, gated by a new
`pyGinkgo_BUILD_MPI` CMake option (OFF by default — the serial build
is unchanged). When enabled, the module gains:

  pyGinkgo.pyGinkgoBindings.mpi
    Communicator              wraps gko::experimental::mpi::communicator
                              around an mpi4py.MPI.Comm (no MPI_Comm_dup)
    map_rank_to_device_id, is_gpu_aware
    BUILD_MPI_IMPL / BUILD_MPI_LIBRARY_VERSION
    runtime_mpi_library_version(), verify_abi(comm)

  pyGinkgo.pyGinkgoBindings.distributed
    Partition_<L>_<G>         build_from_global_size_uniform / contiguous
                              / mapping
    Vector_<T>                create / from_local_array(_deduce_size),
                              fill, scale, add_scaled, compute_dot,
                              compute_norm{1,2}, get_local_vector,
                              shape, local_shape
    Matrix_<T>_<L>_<G>        create_empty / create_from_local_linop /
                              create_from_local_and_non_local,
                              get_(non_)local_matrix, shape
    PyLinOp                   pybind11 trampoline so Python subclasses
                              can implement matrix-free LinOps; the
                              alias type is registered correctly so
                              Python-side overrides of apply_impl are
                              invoked from Krylov solvers.

  pyGinkgo.distributed (Python facade)
    Partition / DistributedVector / DistributedMatrix / PyLinOp
    plus a lazy MPI-ABI verification (build_impl vs runtime_impl plus
    a C++-side MPI_Comm_size round-trip on first use of any entry
    point that takes a communicator).

The existing solver bindings (`solver.gmres_<T>`, `solver.direct`,
etc.) accept any `gko::LinOp` polymorphically and therefore work
unchanged with a distributed Matrix or a PyLinOp; no new solver or
preconditioner bindings are introduced in this PR.

Build-time / runtime safety:

* `cmake/FindMpi4py.cmake` locates the active interpreter's mpi4py
  headers; mismatches between the MPI mpi4py was built against and
  the one CMake selected emit a WARNING.
* `cmake/DetectMpiAbi.cmake` bakes `MPI_Get_library_version()` and
  the implementation flavor (MPICH/OpenMPI/IntelMPI) into a generated
  header, so the runtime check can give a precise error if mpi4py
  loads a different MPI.
* `pyGinkgo.distributed` raises ImportError immediately if the C
  extension was built without MPI or if mpi4py is missing; the ABI
  round-trip happens lazily on first communicator use.

`pyproject.toml` adds an optional `mpi` extra pulling in mpi4py>=3.1.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Adds pytest modules exercising pyGinkgo.distributed end-to-end under
mpirun -n >= 2. The shared conftest skips automatically on a
single rank and barriers between tests:

  tests/cpp_bindings/distributed/
    conftest.py             comm/exec/rank/nprocs fixtures
    test_communicator.py    mpi4py.Comm bridging + ABI verification
    test_partition.py       Partition.uniform / from_contiguous /
                            from_mapping
    test_vector.py          from_local_array, fill, norm, dot,
                            get_local_vector
    test_matrix.py          local-only identity CSR + apply
    test_pylinop.py         Python LinOp subclass override is invoked
    test_solver.py          distributed GMRES on a block-diagonal SPD
                            (uses the existing serial GMRES binding,
                             which dispatches polymorphically)
  tests/pyGinkgo/distributed/
    test_facade.py          high-level Python facade

Verified passing on 2 and 4 ranks (22 tests).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
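
Not the conftest from this PR, but a sketch of the skip-on-single-rank and per-test barrier behavior it describes; the fixture names follow the listing above, their bodies are assumptions:

    import pytest
    from mpi4py import MPI

    def pytest_collection_modifyitems(config, items):
        # Skip the whole distributed suite unless launched under mpirun -n >= 2.
        if MPI.COMM_WORLD.Get_size() < 2:
            skip = pytest.mark.skip(reason="requires mpirun with at least 2 ranks")
            for item in items:
                item.add_marker(skip)

    @pytest.fixture(autouse=True)
    def _barrier_between_tests():
        # Keep ranks in lock-step so one rank's failure cannot desynchronize the next test.
        yield
        MPI.COMM_WORLD.Barrier()

    @pytest.fixture
    def comm():
        return MPI.COMM_WORLD

    @pytest.fixture
    def rank(comm):
        return comm.Get_rank()

    @pytest.fixture
    def nprocs(comm):
        return comm.Get_size()
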
rho-novatron marked this pull request as draft April 18, 2026 17:06
… helpers

- PyLinOp docstring + README spell out that the apply_impl callback
  receives distributed.Vector inputs (local block only) and is
  responsible for halo exchange; recommend cupy stream sync.
- README documents the (Convergence, x) tuple returned by *.apply()
  and the new from_local_array_view / gather_on_root helpers.

Addresses NovaPIC PR Helmholtz-AI-Energy#105 audit items 11 (PyLinOp signature) and 12
(stream safety).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@greole
Collaborator

greole commented Apr 24, 2026

I worked on bindings for the distributed backend a while back. Maybe it's worth checking this branch as well: https://github.com/Helmholtz-AI-Energy/pyGinkgo/tree/dist/mpi
