## Description

When running the `opendronemap/odm:gpu` image in a Kubernetes environment (specifically Google Kubernetes Engine) with standard GPU tolerations, the ODM pipeline fails to utilize the GPU and crashes during the `openmvs` stage.

The pipeline reports `[INFO] No nvidia-smi detected`, passes `--cuda-device -2` to OpenMVS, and subsequently crashes with `error while loading shared libraries: libcuda.so.1: cannot open shared object file: No such file or directory`:
```
[2026-03-10, 09:29:09 UTC] [INFO] Estimating depthmaps
[2026-03-10, 09:29:09 UTC] [INFO] No nvidia-smi detected
[2026-03-10, 09:29:09 UTC] [INFO] running "/code/SuperBuild/install/bin/OpenMVS/DensifyPointCloud" [...] -v 0 --cuda-device -2
[2026-03-10, 09:29:09 UTC] /code/SuperBuild/install/bin/OpenMVS/DensifyPointCloud: error while loading shared libraries: libcuda.so.1: cannot open shared object file: No such file or directory
[2026-03-10, 09:29:09 UTC] Child returned 127
```

(Log truncated for brevity)
## To Reproduce

1. Deploy `opendronemap/odm:gpu` in a Kubernetes cluster requesting `nvidia.com/gpu: 1`.
2. Run standard ODM pipeline arguments (e.g., `--dsm --dtm --pc-quality high`).
3. Observe the logs during the `openmvs` stage.
4. The pipeline fails with a `Child returned 127` `SubprocessException`.
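For reference, a minimal Pod spec sketch matching the steps above (the pod/container names, volume layout, and exact ODM arguments are illustrative placeholders, not taken from the project):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: odm-gpu            # placeholder name
spec:
  restartPolicy: Never
  tolerations:             # standard GPU toleration used on GKE GPU node pools
    - key: nvidia.com/gpu
      operator: Exists
      effect: NoSchedule
  containers:
    - name: odm
      image: opendronemap/odm:gpu
      args: ["--dsm", "--dtm", "--pc-quality", "high"]  # illustrative args
      resources:
        limits:
          nvidia.com/gpu: 1
```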
## Expected Behavior

The container should detect the mounted GPU via `nvidia-smi`, correctly load the NVIDIA shared libraries, and execute the OpenMVS stage using `--cuda-device -1` (or the appropriate GPU ID) without crashing.
## Root Cause & Workaround

Unlike `docker run --gpus all` (which actively alters the container's environment variables at runtime to inject NVIDIA paths), Kubernetes device plugins simply mount the driver files into `/usr/local/nvidia` and rely on the image's `ENV` instructions to make them discoverable.

Currently, `gpu.Dockerfile` causes issues in Kubernetes for two reasons:

- **The `$PATH` issue:** `/usr/local/nvidia/bin` is missing from the system `$PATH`. When `run.py` uses `subprocess.run` to call `nvidia-smi` directly, the call fails, causing the pipeline to assume no GPU exists.
- **The `$LD_LIBRARY_PATH` issue:** In `gpu.Dockerfile`, the path is set via `ENV LD_LIBRARY_PATH="$LD_LIBRARY_PATH:/code/SuperBuild/install/lib"`. While this correctly appends the ODM paths to the CUDA base image paths at build time, it leaves the dynamic linker unable to resolve `libcuda.so.1` or `libnvidia-ml.so`, which the Kubernetes device plugin mounts at `/usr/local/nvidia/lib64`.
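To illustrate the `$PATH` issue, here is a toy Python sketch (not ODM's actual code; `cuda_device_flag` is a hypothetical stand-in for the probe in `run.py`) showing how a plain PATH lookup of `nvidia-smi` fails when its directory is absent from `$PATH`, even though the binary exists on disk:

```python
import os
import shutil
import tempfile

def cuda_device_flag(path_env: str) -> int:
    """Hypothetical stand-in for ODM's GPU probe: if nvidia-smi cannot
    be resolved via the given PATH, fall back to CPU (-2)."""
    return -1 if shutil.which("nvidia-smi", path=path_env) else -2

# Simulate the Kubernetes layout: nvidia-smi exists under a directory that
# stands in for /usr/local/nvidia/bin, but that directory is not on PATH.
nvidia_bin = tempfile.mkdtemp()   # stands in for /usr/local/nvidia/bin
system_bin = tempfile.mkdtemp()   # stands in for the default PATH entries
fake_smi = os.path.join(nvidia_bin, "nvidia-smi")
with open(fake_smi, "w") as f:
    f.write("#!/bin/sh\nexit 0\n")
os.chmod(fake_smi, 0o755)

print(cuda_device_flag(system_bin))                     # not found -> CPU fallback
print(cuda_device_flag(nvidia_bin + ":" + system_bin))  # found after prepending
```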
I successfully worked around this by manually overriding the environment variables in the Kubernetes Pod spec to explicitly include the NVIDIA mount paths:

```yaml
env:
  - name: NVIDIA_DRIVER_CAPABILITIES
    value: "compute,utility"
  - name: LD_LIBRARY_PATH
    value: "/usr/local/nvidia/lib64:/usr/local/nvidia/lib:/code/SuperBuild/install/lib"
  - name: PATH
    value: "/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"
```
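The ordering in that `LD_LIBRARY_PATH` override matters: the dynamic linker searches its directories left to right, so the Kubernetes mount path must appear for `libcuda.so.1` to resolve at all. A toy model of that search (an illustration of the lookup order, not `ld.so` itself; the ODM library name is illustrative):

```python
import os
from typing import Optional

def toy_ld_search(lib: str, ld_library_path: str, files_on_disk: set) -> Optional[str]:
    """Toy model of the LD_LIBRARY_PATH portion of ld.so's search:
    the first directory containing the library wins."""
    for d in ld_library_path.split(":"):
        candidate = os.path.join(d, lib)
        if d and candidate in files_on_disk:
            return candidate
    return None

# What is actually on disk in the K8s pod:
disk = {"/usr/local/nvidia/lib64/libcuda.so.1",          # mounted by the device plugin
        "/code/SuperBuild/install/lib/libopensfm.so"}    # illustrative ODM library

# Build-time value only -> libcuda.so.1 is unresolvable (the crash in the log):
print(toy_ld_search("libcuda.so.1", "/code/SuperBuild/install/lib", disk))
# With the workaround's value -> resolved from the mount path:
print(toy_ld_search("libcuda.so.1",
                    "/usr/local/nvidia/lib64:/usr/local/nvidia/lib:/code/SuperBuild/install/lib",
                    disk))
```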
## Proposed Solution

Could the specific NVIDIA runtime paths be explicitly prepended to the `ENV` definitions inside `gpu.Dockerfile`? For example:

```dockerfile
ENV PATH="/usr/local/nvidia/bin:$PATH" \
    LD_LIBRARY_PATH="/usr/local/nvidia/lib64:/usr/local/nvidia/lib:$LD_LIBRARY_PATH:/code/SuperBuild/install/lib"
```

This would make the image compatible out-of-the-box with Kubernetes/cloud deployments, without users having to manually map environment variables.
## A Note on Docker Tags

I noticed that the `opendronemap/odm:gpu` tag effectively acts as a "latest" tag and is automatically updated with commits to the master branch. This recently caused our automated pipelines to break unexpectedly (likely related to commit 44e3ff6, which appears to have changed the underlying CUDA base image, altering the default system paths that previously worked). Would it be possible to publish versioned GPU tags (e.g., `odm:3.6.0-gpu`) on Docker Hub so that we can pin to stable releases in production environments?
Thank you to all the contributors for the incredible work on this project!