
Commit 022d653

fix(pyops): upgrade docker to fair pyops
1 parent 8bed8df commit 022d653

File tree

4 files changed: +123 −77 lines changed

docs/development/k8s.md

Lines changed: 29 additions & 12 deletions
@@ -12,20 +12,19 @@ For GPU support: [nvkind](https://github.com/NVIDIA/nvkind), NVIDIA driver, [nvi
 ```bash
 uv sync --group k8s
 cd infra/dev
-make up               # cluster + infra + port-forwards + seed data + zenml stack
-make run-example      # init -> register -> finetune -> promote -> predict (local orchestrator)
-make teardown         # destroy everything (kills port-forwards, removes cluster)
+make up               # smart: creates cluster if missing, deploys infra, starts port-forwards
+make status           # show cluster, pods, port-forward health
+make down             # stop port-forwards (cluster stays for fast restart)
+make tear             # destroy everything
 ```
 
-To run pipelines on the **k8s orchestrator** (steps execute as pods):
+Run pipelines:
 
 ```bash
-make build-image      # build Docker image + load into kind workers
-make run-example-k8s  # same workflow but steps run as k8s pods
+make run-example      # E2E with local orchestrator
+make run-example-k8s  # E2E with k8s orchestrator (pods pull image from ghcr.io)
 ```
 
-Individual targets: `make help`.
-
 ### Verifying results
 
 After `make run-example` completes, inspect outputs at:
@@ -39,12 +38,12 @@ After `make run-example` completes, inspect outputs at:
 
 ### ZenML Stacks
 
-`make stack-register` creates two stacks:
+`make up` registers two stacks:
 
 | Stack | Orchestrator | S3 Endpoint | MLflow | Use |
 |-------|-------------|-------------|--------|-----|
 | `dev` (active) | `default` (local) | `localhost:9000` | `localhost:5000` | Local runs via port-forward (`make run-example`) |
-| `k8s` | `k8s_orchestrator` | `minio.fair.svc:9000` | `mlflow.fair.svc:80` | In-cluster jobs |
+| `k8s` | `k8s_orchestrator` | `minio.fair.svc:9000` | `mlflow.fair.svc:80` | In-cluster jobs (`make run-example-k8s`) |
 
 ## Architecture
 
@@ -60,7 +59,7 @@ postgres (PG 17 + PostGIS)          zenml (ghcr.io/hotosm/zenml-postgres:0.93.3
 +--- minio (s3://fair-data, s3://mlflow, s3://zenml)
 ```
 
-Port-forwards (via `make port-forward`):
+Port-forwards (managed by `make up` / `make down`):
 
 | Service | Local | Cluster |
 |----------|-----------------|-----------------------------|
@@ -74,14 +73,32 @@ Port-forwards (via `make port-forward`):
 
 Follow the [nvkind prerequisites and setup guide](https://github.com/NVIDIA/nvkind#prerequisites) to install the NVIDIA driver, nvidia-container-toolkit, and nvkind on your host. Once `nvkind` is on `$PATH`, `make up` handles the rest.
 
-**What `make up` does**: `kind-config.yaml` labels workers as `inference` and `train`, with the train node getting `extraMounts` that signal GPU presence to nvkind. `make cluster-up` runs nvkind (installs toolkit inside the node, configures containerd). `make infra-up` creates the `nvidia` RuntimeClass, labels the GPU node, and deploys the device plugin.
+**What `make up` does**: `kind-config.yaml` labels workers as `inference` and `train`, with the train node getting `extraMounts` that signal GPU presence to nvkind. The cluster creation step runs nvkind (installs toolkit inside the node, configures containerd). The infra step creates the `nvidia` RuntimeClass, labels the GPU node, and deploys the device plugin.
 
 **Caveats**:
 
 - `PatchProcDriverNvidia` may fail on non-MIG single-GPU hosts — non-critical, the Makefile tolerates it.
 - nvkind restarts containerd on the GPU node, briefly disrupting colocated pods.
 - Device plugin uses `--set deviceDiscoveryStrategy=nvml` (default `auto` fails inside kind).
 
+## Configuration
+
+### `FAIR_LABEL_DOMAIN`
+
+Node labels and taints use a configurable domain prefix (default `fair-dev.hotosm.org`).
+Override via environment variable:
+
+```bash
+export FAIR_LABEL_DOMAIN=fair-dev.hotosm.org  # dev
+make up
+```
+
+Consumed in three places:
+
+- **`kind-config.yaml`** — node labels (`${FAIR_LABEL_DOMAIN}/role`) and taints (`${FAIR_LABEL_DOMAIN}/workload`), resolved via `envsubst` at cluster creation
+- **`stacks/k8s.yaml`** — pod `node_selectors` and `tolerations`, resolved via `envsubst` at stack registration
+- **`fair/zenml/config.py`** — reads `FAIR_LABEL_DOMAIN` at runtime (default `fair.hotosm.org`) for pipeline pod scheduling
+
 ## Decisions
 
 **kind over minikube/k3s** -- `hotosm/k8s-infra` runs upstream K8s (EKS). kind runs
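The `${FAIR_LABEL_DOMAIN}` placeholders above are resolved with `envsubst` before the files reach kind or ZenML. A minimal Python sketch of that substitution, using `string.Template` (which shares the `${VAR}` syntax); the template line here is illustrative, not the real `kind-config.yaml`:

```python
# Sketch of what `envsubst '$FAIR_LABEL_DOMAIN'` does to a template line;
# string.Template uses the same ${VAR} placeholder syntax.
from string import Template

line = "node-label: ${FAIR_LABEL_DOMAIN}/role=train"
resolved = Template(line).substitute(FAIR_LABEL_DOMAIN="fair-dev.hotosm.org")
print(resolved)  # node-label: fair-dev.hotosm.org/role=train
```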

infra/dev/Makefile

Lines changed: 65 additions & 56 deletions
@@ -1,7 +1,6 @@
 SHELL := /bin/bash
-.DEFAULT_GOAL := help
+.DEFAULT_GOAL := status
 
-# Docker socket auto-detect
 DOCKER_HOST ?= $(if $(wildcard /var/run/docker.sock),,\
 	$(if $(wildcard $(HOME)/.colima/default/docker.sock),unix://$(HOME)/.colima/default/docker.sock,\
 	$(if $(wildcard $(HOME)/.docker/run/docker.sock),unix://$(HOME)/.docker/run/docker.sock,)))
@@ -18,34 +17,74 @@ export FAIR_LABEL_DOMAIN
 
 IMAGE ?= ghcr.io/hotosm/fair-models/example-unet:v1
 
-.PHONY: cluster-up cluster-down infra-up infra-down \
-	port-forward kill-port-forward seed-data \
-	stack-register run-example build-image run-example-k8s \
-	up teardown help
+CLUSTER_EXISTS := $(shell kind get clusters 2>/dev/null | grep -qx $(CLUSTER) && echo 1)
+INFRA_RUNNING := $(shell kubectl get statefulset/postgres -n $(NS) -o jsonpath='{.status.readyReplicas}' 2>/dev/null)
+PF_RUNNING := $(shell test -f $(PID_FILE) && kill -0 $$(head -1 $(PID_FILE)) 2>/dev/null && echo 1)
 
-cluster-up:
-	-envsubst '$$FAIR_LABEL_DOMAIN' < kind-config.yaml > .kind-config-resolved.yaml
-	-$(NVKIND) cluster create --name $(CLUSTER) --config-template .kind-config-resolved.yaml
-	@rm -f .kind-config-resolved.yaml
-	@kubectl apply -f - <<< '{"apiVersion":"node.k8s.io/v1","kind":"RuntimeClass","handler":"nvidia","metadata":{"name":"nvidia"}}' 2>/dev/null || true
+.PHONY: up down tear status run-example run-example-k8s
 
-cluster-down:
-	kind delete cluster --name $(CLUSTER)
+up: _ensure-cluster _ensure-infra _ensure-port-forward _seed-data _stack-register
+	@echo "Ready."
 
-infra-up:
-	@kubectl create ns $(NS) --dry-run=client -o yaml | kubectl apply -f -
-	@kubectl label node -l $(FAIR_LABEL_DOMAIN)/role=train nvidia.com/gpu.present=true --overwrite >/dev/null 2>/dev/null || true
-	envsubst '$$FAIR_LABEL_DOMAIN' < postgres/statefulset.yaml | kubectl apply -n $(NS) -f postgres/service.yaml -f -
-	kubectl rollout status -n $(NS) statefulset/postgres --timeout=120s
-	helmfile apply
-	kubectl rollout status -n $(NS) deployment/stac-stac --timeout=300s
+down: _kill-port-forward
+	@echo "Port-forwards stopped. 'make up' to resume."
 
-infra-down:
+tear: _kill-port-forward
 	-helmfile destroy
 	-kubectl delete ns nvidia --ignore-not-found
 	-kubectl delete -n $(NS) -f postgres/
+	kind delete cluster --name $(CLUSTER)
+
+status:
+	@echo "Cluster: $(if $(CLUSTER_EXISTS),running,not found)"
+	@echo "Infra: $(if $(INFRA_RUNNING),running,not ready)"
+	@echo "Port-forwards: $(if $(PF_RUNNING),active,inactive)"
+	@echo ""
+	@if [ "$(CLUSTER_EXISTS)" = "1" ]; then \
+		echo "Nodes:"; kubectl get nodes -o wide --no-headers 2>/dev/null | sed 's/^/  /'; echo ""; \
+		echo "Pods ($(NS)):"; kubectl get pods -n $(NS) --no-headers 2>/dev/null | sed 's/^/  /'; echo ""; \
+		echo "Ports:"; while read svc ports; do \
+			local_port=$${ports%%:*}; \
+			if nc -z localhost $$local_port 2>/dev/null; then \
+				echo "  $$svc localhost:$$local_port ok"; \
+			else \
+				echo "  $$svc localhost:$$local_port down"; \
+			fi; \
+		done < ports.conf; \
+	fi
+
+run-example:
+	cd ../.. && uv run python examples/unet/run.py all --stac-api-url http://localhost:8082 --dsn $(PGSTAC_DSN)
+
+run-example-k8s:
+	cd ../.. && uv run zenml stack set k8s
+	cd ../.. && AWS_ENDPOINT_URL=http://localhost:9000 uv run python examples/unet/run.py all \
+		--stac-api-url http://localhost:8082 --dsn $(PGSTAC_DSN)
 
-port-forward: kill-port-forward
+_ensure-cluster:
+ifeq ($(CLUSTER_EXISTS),)
+	@envsubst '$$FAIR_LABEL_DOMAIN' < kind-config.yaml > .kind-config-resolved.yaml
+	-$(NVKIND) cluster create --name $(CLUSTER) --config-template .kind-config-resolved.yaml
+	@rm -f .kind-config-resolved.yaml
+	@kubectl apply -f - <<< '{"apiVersion":"node.k8s.io/v1","kind":"RuntimeClass","handler":"nvidia","metadata":{"name":"nvidia"}}' 2>/dev/null || true
+endif
+
+_ensure-infra:
+ifneq ($(INFRA_RUNNING),1)
+	@kubectl create ns $(NS) --dry-run=client -o yaml | kubectl apply -f -
+	@kubectl label node -l $(FAIR_LABEL_DOMAIN)/role=train nvidia.com/gpu.present=true --overwrite >/dev/null 2>/dev/null || true
+	@envsubst '$$FAIR_LABEL_DOMAIN' < postgres/statefulset.yaml | kubectl apply -n $(NS) -f postgres/service.yaml -f -
+	@kubectl rollout status -n $(NS) statefulset/postgres --timeout=120s
+	@helmfile apply
+	@kubectl rollout status -n $(NS) deployment/stac-stac --timeout=300s
+endif
+
+_ensure-port-forward:
+ifneq ($(PF_RUNNING),1)
+	@$(MAKE) --no-print-directory _port-forward
+endif
+
+_port-forward: _kill-port-forward
 	@mkdir -p .pf-logs
 	@while read svc ports; do \
 		( while true; do \
@@ -57,7 +96,7 @@ port-forward: kill-port-forward
 		for i in $$(seq 1 30); do nc -z localhost $${ports%%:*} 2>/dev/null && break; sleep 1; done; \
 	done < ports.conf
 
-kill-port-forward:
+_kill-port-forward:
 	@if test -f $(PID_FILE); then \
 		while read pid; do \
 			kill -- -$$pid 2>/dev/null || kill $$pid 2>/dev/null || true; \
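The `for i in $(seq 1 30); do nc -z ...` loop above waits for a port-forward to become reachable before moving on. A sketch of the same readiness wait in Python; `wait_for_port` is an illustrative helper, not part of the Makefile:

```python
# Poll a TCP port until something accepts connections, or give up after
# a timeout; mirrors the shell loop of thirty one-second `nc -z` probes.
import socket
import time

def wait_for_port(host: str, port: int, timeout: float = 30.0) -> bool:
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=1):
                return True  # port is accepting connections
        except OSError:
            time.sleep(1)  # same pacing as the shell's `sleep 1`
    return False
```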
@@ -66,10 +105,10 @@ kill-port-forward:
 		fi
 	@rm -rf .pf-logs
 
-seed-data:
-	@uv run --with minio python -c "from pathlib import Path; from minio import Minio; c = Minio('localhost:9000', 'minioadmin', 'minioadmin', secure=False); root = Path('../../data/sample'); files = [f for f in root.rglob('*') if f.is_file()]; print(f'Uploading {len(files)} files to fair-data/sample/'); [c.fput_object('fair-data', f'sample/{f.relative_to(root)}', str(f)) for f in files]; print('Done')"
+_seed-data:
+	@uv run --with minio python scripts/seed_data.py
 
-stack-register:
+_stack-register:
 	@for i in 1 2 3 4 5; do \
 		uv run zenml connect --url http://localhost:8080 --username default --password '' --no-verify-ssl && break; \
 		echo "zenml connect: attempt $$i failed, retrying in 5s..." >&2; sleep 5; \
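`_kill-port-forward` tries `kill -- -$pid` first (a negative PID signals the whole process group) and falls back to a plain single-PID `kill`. A Python sketch of that pattern; the `sleep 60` child is purely illustrative:

```python
# A child started with start_new_session=True leads its own process group,
# so os.killpg can signal the group; terminate() is the single-PID fallback.
import os
import signal
import subprocess

proc = subprocess.Popen(["sleep", "60"], start_new_session=True)
try:
    os.killpg(proc.pid, signal.SIGTERM)  # group kill, like `kill -- -$pid`
except ProcessLookupError:
    proc.terminate()  # fall back to the single PID, like `kill $pid`
proc.wait()
print("stopped")
```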
@@ -79,33 +118,3 @@ stack-register:
 	-uv run zenml stack import k8s -f .k8s-stack-resolved.yaml --ignore-version-mismatch
 	@rm -f .k8s-stack-resolved.yaml
 	-uv run zenml stack set dev
-
-run-example:
-	cd ../.. && uv run python examples/unet/run.py all --stac-api-url http://localhost:8082 --dsn $(PGSTAC_DSN)
-
-build-image:
-	cd ../.. && docker buildx build -f models/example_unet/Dockerfile -t $(IMAGE) --platform linux/amd64 --load --no-cache .
-	kind load docker-image $(IMAGE) --name $(CLUSTER) --nodes $(CLUSTER)-worker,$(CLUSTER)-worker2
-
-run-example-k8s:
-	cd ../.. && uv run zenml stack set k8s
-	cd ../.. && AWS_ENDPOINT_URL=http://localhost:9000 uv run python examples/unet/run.py all \
-		--stac-api-url http://localhost:8082 --dsn $(PGSTAC_DSN)
-
-up: cluster-up infra-up port-forward seed-data stack-register
-teardown: kill-port-forward infra-down cluster-down
-
-help:
-	@echo "cluster-up         Create kind cluster"
-	@echo "cluster-down       Delete kind cluster"
-	@echo "infra-up           Deploy postgres + helmfile apply"
-	@echo "infra-down         Helmfile destroy + remove postgres"
-	@echo "port-forward       Forward services to localhost (auto-reconnect)"
-	@echo "kill-port-forward  Stop port-forwards"
-	@echo "seed-data          Upload sample data to MinIO"
-	@echo "stack-register     Register ZenML stacks"
-	@echo "build-image        Build & load Docker image into kind workers"
-	@echo "run-example        Run UNet example E2E (local stack)"
-	@echo "run-example-k8s    Run UNet example E2E (k8s stack)"
-	@echo "up                 Full setup"
-	@echo "teardown           Full teardown"
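The new `CLUSTER_EXISTS` variable probes `kind get clusters` with `grep -qx`, which only succeeds on an exact whole-line match. A sketch of that check in Python; the cluster names are hypothetical:

```python
# `grep -qx $(CLUSTER)` matches a whole line exactly, so a cluster with a
# similar name never produces a false positive for the idempotence guard.
def cluster_exists(kind_output: str, cluster: str) -> bool:
    return any(line == cluster for line in kind_output.splitlines())

print(cluster_exists("kind\nfair-dev\n", "fair-dev"))  # True
print(cluster_exists("fair-dev-gpu\n", "fair-dev"))    # False: no substring match
```

The exact match matters because `make up` skips cluster creation entirely when the probe returns `1`.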

infra/dev/scripts/seed_data.py

Lines changed: 25 additions & 0 deletions
@@ -0,0 +1,25 @@
+"""Upload sample data to MinIO for dev cluster."""
+
+from __future__ import annotations
+
+import sys
+from pathlib import Path
+
+from minio import Minio
+
+
+def main() -> None:
+    root = Path(__file__).resolve().parents[3] / "data" / "sample"
+    if not root.exists():
+        sys.exit(f"Sample data not found at {root}")
+
+    client = Minio("localhost:9000", "minioadmin", "minioadmin", secure=False)
+    files = [f for f in root.rglob("*") if f.is_file()]
+    print(f"Uploading {len(files)} files to fair-data/sample/")
+    for f in files:
+        client.fput_object("fair-data", f"sample/{f.relative_to(root)}", str(f))
+    print("Done")
+
+
+if __name__ == "__main__":
+    main()
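`seed_data.py` derives each MinIO object key with `Path.relative_to`, so the bucket layout mirrors the local tree under `data/sample`. A small sketch; the file paths are hypothetical:

```python
# The path relative to the sample root, prefixed with "sample/", becomes
# the object key inside the fair-data bucket.
from pathlib import PurePosixPath

root = PurePosixPath("/repo/data/sample")
f = root / "images" / "tile_000.tif"
key = f"sample/{f.relative_to(root)}"
print(key)  # sample/images/tile_000.tif
```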

models/example_unet/Dockerfile

Lines changed: 4 additions & 9 deletions
@@ -11,15 +11,10 @@ RUN --mount=type=cache,target=/root/.cache/uv \
     uv pip install \
     torch==2.10.0 \
     torchgeo==0.9.0 \
-    fair-py-ops \
-    "mlflow>=2.1.1,<4" \
-    "universal-pathlib>=0.3.10" \
-    "pypgstac[psycopg]>=0.9" \
-    "pystac-client>=0.9" \
-    "zenml[connectors-aws,connectors-kubernetes,s3fs]>=0.93.3"
-
-# Runtime stage: minimal image
-FROM python:3.13-slim-trixie
+    fair-py-ops==0.0.4 \
+
+# Runtime stage: minimal image
+FROM python:3.13-slim-trixie
 
 WORKDIR /app
 