
feat(distributed): MPI/distributed bindings for Ginkgo #105

Draft

rho-novatron wants to merge 10 commits into Helmholtz-AI-Energy:main from rho-novatron:rho/distributed

Conversation

@rho-novatron
Contributor

Stacked on #104 — please review & merge that first. Once #104 lands, I'll rebase this onto main and the diff will shrink to just the MPI-related changes.

PR-B: feat(distributed): MPI/distributed bindings for Ginkgo

Branch (fork): rho-novatron:rho/distributed
Base (upstream): Helmholtz-AI-Energy:refactor/array-helper (i.e.
stacked on PR-A)
Commits: 4

SHA Subject
a46f8cf fix(utils): accept compatible integer dtype-format chars cross-platform
ec40b72 feat(distributed): expose Ginkgo MPI bindings under pyGinkgo.distributed
d440809 docs: README section on the distributed/MPI bindings
75b0816 test(distributed): add 2-rank pytest harness for MPI bindings

Summary

Adds first-class Python bindings for Ginkgo's MPI-distributed types,
exposed under the new pyGinkgo.distributed submodule.

Build is opt-in: setting pyGinkgo_BUILD_MPI=ON (defaults to
OFF) configures Ginkgo with GINKGO_BUILD_MPI=ON and compiles the
new bindings. Serial builds and the existing public API are
completely unchanged.

What's exposed

C++ type → Python binding:

  • gko::experimental::mpi::communicator → pyGinkgo.distributed.Communicator
  • gko::experimental::distributed::Partition → pyGinkgo.distributed.Partition_int32_int64
  • gko::experimental::distributed::Vector → pyGinkgo.distributed.Vector_double, etc.
  • gko::experimental::distributed::Matrix → pyGinkgo.distributed.Matrix_double_int32_int64
  • PyLinOp trampoline for distributed solves → pyGinkgo.distributed.PyLinOp_double

mpi4py is the supported way to construct a Communicator from
Python; bindings accept MPI.Comm directly.
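
A minimal construction from mpi4py might look like the sketch below; the one-argument constructor is an assumption based on the table above, not a confirmed signature:

    # Sketch: constructing a pyGinkgo Communicator from mpi4py.
    # The exact constructor signature is an assumption, not shown in this PR.
    from mpi4py import MPI
    import pyGinkgo.distributed as dist

    comm = dist.Communicator(MPI.COMM_WORLD)  # wraps gko::experimental::mpi::communicator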

Why no new solver bindings?

Existing solvers (gmres_double, cg_double, …) already accept any
gko::LinOp polymorphically through their generated apply paths. A
distributed Matrix is a LinOp, so existing bindings dispatch
correctly without modification — verified in tests/cpp_bindings/distributed/test_solver.py.
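
As an illustration of that dispatch path, a distributed solve might look like the sketch below. The method names come from this PR and its commit messages; the module path, factory signature, and the (logger, x) return value should all be read as assumptions:

    # Sketch only: every signature here is an assumption; exec, comm, local_csr,
    # b and x0 are placeholders set up elsewhere.
    from pyGinkgo import solver                       # module path is an assumption
    from pyGinkgo.pyGinkgoBindings import distributed as dist

    A = dist.Matrix_double_int32_int64.create_from_local_linop(exec, comm, local_csr)
    gmres = solver.gmres_double(exec, A)              # same binding used for serial matrices
    logger, x = gmres.apply(b, x0)                    # (Convergence logger, solution) tuple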

ABI safety

MPI ABI compatibility between Ginkgo (built against MPICH headers in
the conda package) and the user's mpi4py (also built against MPICH)
is checked lazily the first time a Communicator is constructed:

  1. Build time: the MPI implementation string (mpich, openmpi,
    …) is baked into the binding via a CMake-defined macro
    PYGINKGO_MPI_IMPL.
  2. Runtime: on first Communicator.__init__, mpi4py.MPI.Get_library_version() is parsed and compared against the baked
    value. A round-trip C++ check (broadcast a sentinel value from
    rank 0) confirms the two sides actually agree on MPI_COMM_WORLD.

Mismatches raise a clear RuntimeError instead of segfaulting.
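
A rough sketch of that lazy check, using the names the bindings export according to the commit message further down (the control flow here is illustrative, not the actual implementation):

    # Illustrative only: BUILD_MPI_IMPL and verify_abi() are names from this
    # PR's commit notes; the comparison logic is a simplified assumption.
    from mpi4py import MPI
    from pyGinkgo.pyGinkgoBindings import mpi as _mpi

    def _check_mpi_abi(comm):
        runtime = MPI.Get_library_version()          # e.g. "MPICH Version: 4.2 ..."
        if _mpi.BUILD_MPI_IMPL.lower() not in runtime.lower():
            raise RuntimeError(
                f"pyGinkgo was built against {_mpi.BUILD_MPI_IMPL}, but mpi4py "
                f"loaded: {runtime.splitlines()[0]}")
        _mpi.verify_abi(comm)                        # C++-side sentinel broadcast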

What is intentionally deferred

  • Schwarz preconditioner. The non-distributed pyGinkgo
    preconditioners are exposed as already-generated LinOps, but
    Schwarz fundamentally requires a LinOpFactory because its
    generate(distributed_matrix) step constructs the per-rank local
    solver against the local block. Adding factory bindings is a
    cross-cutting design change that deserves its own discussion and PR.
  • Distributed logger callback. Out of scope here; the standard
    Ginkgo logger interface still works.
  • read_distributed matrix-market reader. Removed as a stub; can
    return when there's a tested implementation.

Testing

22 tests, run with both 2 and 4 ranks:

mpirun -n 2 python -m pytest \
    tests/cpp_bindings/distributed/ tests/pyGinkgo/distributed/
mpirun -n 4 python -m pytest \
    tests/cpp_bindings/distributed/ tests/pyGinkgo/distributed/

Coverage:

  • Communicator round-trip + ABI mismatch path
  • Partition construction and global-to-local index mapping
  • Vector: gather/scatter, axpy, dot, norm against analytical answers (see the sketch after this list)
  • Matrix: SpMV against equivalent serial CSR result
  • PyLinOp: distributed Python-implemented operator round-trip
  • GMRES on a distributed Laplacian
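
The Vector sketch referenced above reduces to a check like this on two ranks; the executor and constructor/method names are taken from the commit notes below and should be read as assumptions:

    # 2-rank dot-product sketch; the executor class, from_local_array's signature
    # and compute_dot returning a Python float are assumptions.
    import numpy as np
    import pytest
    from mpi4py import MPI
    import pyGinkgo
    from pyGinkgo.pyGinkgoBindings import distributed as dist, mpi as gmpi

    def test_dot_two_ranks():
        world = MPI.COMM_WORLD
        if world.Get_size() != 2:
            pytest.skip("run with mpirun -n 2")
        exec_ = pyGinkgo.ReferenceExecutor()          # placeholder executor
        comm = gmpi.Communicator(world)
        local = np.full(4, float(world.Get_rank() + 1))
        v = dist.Vector_double.from_local_array(exec_, comm, local)
        # rank 0 holds [1,1,1,1], rank 1 holds [2,2,2,2]:
        # <v, v> = 4*1^2 + 4*2^2 = 20
        assert np.isclose(v.compute_dot(v), 20.0)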

Cross-platform note

The Windows fix in a46f8cf is necessary because long is 4 bytes
on Windows (vs 8 on Linux/macOS). The gko_array_from_pyobject
helper added in PR-A used a hard-coded format-char comparison; the
fix accepts any format char with a matching sizeof() and signedness
so the same Python integer arrays work on all three platforms.
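
The underlying platform difference can be seen directly from Python; this only demonstrates the buffer-protocol facts the fix relies on, the helper itself is C++ and not shown here:

    # 'l' (C long) and 'q' (C long long) describe the same 8-byte layout on
    # LP64 Linux/macOS, but 'l' shrinks to 4 bytes on Windows (LLP64).
    import struct
    import numpy as np

    a = np.zeros(3, dtype=np.int64)
    print(memoryview(a).format)     # 'l' on x86_64 Linux, 'q' on Windows
    print(struct.calcsize('l'))     # 8 on LP64 platforms, 4 on Windows
    print(struct.calcsize('q'))     # 8 everywhere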

Stacked PR

Stacked on top of PR-A (refactor/array-helper). Reviewing PR-A
first is recommended; the bulk of the new code in this PR lives in
src/cpp_bindings/distributed/ and src/cpp_bindings/mpi/, neither
of which is touched by PR-A.

rho-novatron and others added 2 commits April 18, 2026 09:21
…nstructors

Move the duplicated __cuda_array_interface__ / buffer-protocol conversion
logic from array.cpp and matrix.cpp into a single gko_array_from_pyobject<T>()
template in utils.hpp.

Both the gko::array(Executor, py::object) constructor and the sparse-matrix
(Executor, py::object) constructor now delegate to this helper, eliminating
~260 lines of duplicated CUDA/host branching.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
pybind11's `format_descriptor<int64_t>::format()` returns 'q'
(long long) but numpy's `np.int64` reports 'l' (long) on x86_64
Linux, while on Windows the relationship reverses for 32-bit ints.
Same physical layout, different format chars — `check_buffer_dtype`
was rejecting these as 'Incompatible dtypes'.

Treat any pair of integer format chars as compatible when both are
single-character signed (or both unsigned) and itemsize matches the
expected ValueType. This makes buffer-protocol conversions work
uniformly across platforms.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
rho-novatron added a commit to rho-novatron/pyGinkgo that referenced this pull request Apr 18, 2026
These bindings reuse the polymorphic gko::LinOp base, so the same
factories transparently accept both single-process matrices and
distributed::Matrix. Each solver returns a (logger, x) tuple from
apply() so callers can introspect convergence (residual norm,
iteration count) -- the standard Convergence logger pattern matching
the existing GMRES binding.

Addresses NovaPIC PR Helmholtz-AI-Energy#105 audit items 8 (Jacobi preconditioner for
distributed) and 9 (solver introspection).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
rho-novatron added a commit to rho-novatron/pyGinkgo that referenced this pull request Apr 18, 2026
…cations

- Vector_<T>.from_local_array_view: zero-copy variant of from_local_array.
  On a CudaExecutor with a __cuda_array_interface__-backed input, the
  resulting distributed.Vector aliases the caller's buffer instead of
  copying. Uses py::keep_alive<0,4> to tie the input lifetime to the
  returned vector.

- Vector_<T>.gather_on_root(root=0): gather a distributed.Vector onto
  a single rank as a host numpy array; returns None on non-root ranks.
  Uses MPI gather + gather_v with the local Dense slice as the source.

- Matrix_<T,L,G>.create_from_local_and_non_local: clarify in the
  docstring that recv_connections holds GLOBAL column ids and the
  non_local_linop's local column index space is the position into
  recv_connections, ordered by source rank then ascending global id.

- Add 2-rank smoke tests for cg/bicgstab on a distributed.Matrix and
  for the new Vector helpers (view parity + gather correctness).

Addresses NovaPIC PR Helmholtz-AI-Energy#105 audit items 5 (off-diag column convention),
13 (zero-copy CuPy <-> distributed.Vector) and 14 (gather to host).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
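
A usage sketch for the two helpers described above; only the method names come from the commit text, the executor setup and argument order are assumptions:

    # Illustrative only; the CudaExecutor constructor is a placeholder and the
    # argument order of from_local_array_view is assumed.
    import cupy as cp
    from mpi4py import MPI
    import pyGinkgo
    from pyGinkgo.pyGinkgoBindings import distributed as dist, mpi as gmpi

    comm = gmpi.Communicator(MPI.COMM_WORLD)
    gpu = pyGinkgo.CudaExecutor(0)                   # placeholder; not part of this PR's diff
    local = cp.arange(4, dtype=cp.float64)           # exposes __cuda_array_interface__

    v = dist.Vector_double.from_local_array_view(gpu, comm, local)  # aliases `local`, no copy
    full = v.gather_on_root(root=0)                  # numpy array on rank 0, None elsewhere
    if MPI.COMM_WORLD.Get_rank() == 0:
        print(full)
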
rho-novatron and others added 7 commits April 18, 2026 17:05
- Use find_package(Ginkgo) instead of find_package(ginkgo) to match the
  upstream GinkgoConfig.cmake naming on case-sensitive filesystems
- Guard install(IMPORTED_RUNTIME_ARTIFACTS) with if(NOT Ginkgo_FOUND) since
  bare targets (ginkgo, ginkgo_device, ...) only exist when Ginkgo is built
  from source via FetchContent, not when using a pre-installed package
The install(DIRECTORY) used an absolute DESTINATION (${Python_SITELIB})
which bypasses CMAKE_INSTALL_PREFIX. Since py-build-cmake uses a staging
directory as the install prefix to collect files for the wheel, the
absolute path wrote files directly to the real site-packages instead of
the staging area, resulting in wheels missing the compiled .so binding.

- Use relative DESTINATION (${PY_BUILD_CMAKE_MODULE_NAME}) so files are
  installed under CMAKE_INSTALL_PREFIX where py-build-cmake picks them up
- Add trailing / to source DIRECTORY to install contents, not the
  directory itself (avoids pyGinkgo/pyGinkgo/ nesting)
Adds an opt-in MPI/distributed surface to pyGinkgo, gated by a new
`pyGinkgo_BUILD_MPI` CMake option (OFF by default — the serial build
is unchanged). When enabled, the module gains:

  pyGinkgo.pyGinkgoBindings.mpi
    Communicator              wraps gko::experimental::mpi::communicator
                              around an mpi4py.MPI.Comm (no MPI_Comm_dup)
    map_rank_to_device_id, is_gpu_aware
    BUILD_MPI_IMPL / BUILD_MPI_LIBRARY_VERSION
    runtime_mpi_library_version(), verify_abi(comm)

  pyGinkgo.pyGinkgoBindings.distributed
    Partition_<L>_<G>         build_from_global_size_uniform / contiguous
                              / mapping
    Vector_<T>                create / from_local_array(_deduce_size),
                              fill, scale, add_scaled, compute_dot,
                              compute_norm{1,2}, get_local_vector,
                              shape, local_shape
    Matrix_<T>_<L>_<G>        create_empty / create_from_local_linop /
                              create_from_local_and_non_local,
                              get_(non_)local_matrix, shape
    PyLinOp                   pybind11 trampoline so Python subclasses
                              can implement matrix-free LinOps; the
                              alias type is registered correctly so
                              Python-side overrides of apply_impl are
                              invoked from Krylov solvers.

  pyGinkgo.distributed (Python facade)
    Partition / DistributedVector / DistributedMatrix / PyLinOp
    plus a lazy MPI-ABI verification (build_impl vs runtime_impl plus
    a C++-side MPI_Comm_size round-trip on first use of any entry
    point that takes a communicator).

The existing solver bindings (`solver.gmres_<T>`, `solver.direct`,
etc.) accept any `gko::LinOp` polymorphically and therefore work
unchanged with a distributed Matrix or a PyLinOp; no new solver or
preconditioner bindings are introduced in this PR.

Build-time / runtime safety:

* `cmake/FindMpi4py.cmake` locates the active interpreter's mpi4py
  headers; mismatches between the MPI mpi4py was built against and
  the one CMake selected emit a WARNING.
* `cmake/DetectMpiAbi.cmake` bakes `MPI_Get_library_version()` and
  the implementation flavor (MPICH/OpenMPI/IntelMPI) into a generated
  header, so the runtime check can give a precise error if mpi4py
  loads a different MPI.
* `pyGinkgo.distributed` raises ImportError immediately if the C
  extension was built without MPI or if mpi4py is missing; the ABI
  round-trip happens lazily on first communicator use.

`pyproject.toml` adds an optional `mpi` extra pulling in mpi4py>=3.1.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Adds pytest modules exercising pyGinkgo.distributed end-to-end under
mpirun -n >= 2. The shared conftest skips automatically on a
single rank and barriers between tests:

  tests/cpp_bindings/distributed/
    conftest.py             comm/exec/rank/nprocs fixtures
    test_communicator.py    mpi4py.Comm bridging + ABI verification
    test_partition.py       Partition.uniform / from_contiguous /
                            from_mapping
    test_vector.py          from_local_array, fill, norm, dot,
                            get_local_vector
    test_matrix.py          local-only identity CSR + apply
    test_pylinop.py         Python LinOp subclass override is invoked
    test_solver.py          distributed GMRES on a block-diagonal SPD
                            (uses the existing serial GMRES binding,
                             which dispatches polymorphically)
  tests/pyGinkgo/distributed/
    test_facade.py          high-level Python facade

Verified passing on 2 and 4 ranks (22 tests).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
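
Not the conftest from this PR, but a sketch of the skip-on-single-rank and per-test barrier behavior it describes; the fixture names follow the listing above, their bodies are assumptions:

    import pytest
    from mpi4py import MPI

    def pytest_collection_modifyitems(config, items):
        # Skip the whole distributed suite unless launched under mpirun -n >= 2.
        if MPI.COMM_WORLD.Get_size() < 2:
            skip = pytest.mark.skip(reason="requires mpirun with at least 2 ranks")
            for item in items:
                item.add_marker(skip)

    @pytest.fixture(autouse=True)
    def _barrier_between_tests():
        # Keep ranks in lock-step so one rank's failure cannot desynchronize the next test.
        yield
        MPI.COMM_WORLD.Barrier()

    @pytest.fixture
    def comm():
        return MPI.COMM_WORLD

    @pytest.fixture
    def rank(comm):
        return comm.Get_rank()

    @pytest.fixture
    def nprocs(comm):
        return comm.Get_size()
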
rho-novatron marked this pull request as draft April 18, 2026 17:06
… helpers

- PyLinOp docstring + README spell out that the apply_impl callback
  receives distributed.Vector inputs (local block only) and is
  responsible for halo exchange; recommend cupy stream sync.
- README documents the (Convergence, x) tuple returned by *.apply()
  and the new from_local_array_view / gather_on_root helpers.

Addresses NovaPIC PR Helmholtz-AI-Energy#105 audit items 11 (PyLinOp signature) and 12
(stream safety).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@greole
Collaborator

greole commented Apr 24, 2026

I worked on bindings for the distributed backend a while back. Maybe it's worth checking this branch as well: https://github.com/Helmholtz-AI-Energy/pyGinkgo/tree/dist/mpi
