---
title: Sharing is Caring
sub_title: How We Thought We Solved the GPU Investment Puzzle
author: Marc Herren
event: cloudnativeday 2025
theme:
options:
---
- Platform engineer @ bespinian
- AI / digital media coach @ remmen.io
- Trainer @ letsboot.ch
- Based on an 18-month project with a customer in the financial sector
- Over 30 Kubernetes clusters
- Self-service platform
_imagined with Midjourney_
- More and more applications are including AI services
- Companies are using their own data to create new models
- Regulations and compliance require access to local models (LLMs)
_imagined with Midjourney_
- The foundation for competitive AI is GPUs/TPUs (Tensor Processing Units)
- High demand, limited supply
- Securing GPUs/TPUs is now a strategic priority
_imagined with Midjourney_
A typical AI workload can be categorized into one of two types (see the sketch after this list)

- Batch jobs (hours/days) -> full power required, no high-availability requirement
  - Machine learning
  - Report generation
  - Testing
- Online jobs (24/7) -> high availability, multiple pods
  - Inference
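On Kubernetes, the two types map naturally onto different workload kinds; a minimal sketch, with hypothetical names and images:

```yaml
# Batch job: runs to completion at full power, no HA requirement.
apiVersion: batch/v1
kind: Job
metadata:
  name: training-run            # hypothetical name
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: train
          image: registry.example.com/trainer:latest  # hypothetical image
          resources:
            limits:
              nvidia.com/gpu: "1"   # claim a full GPU for the run
---
# Online job: long-running inference service, multiple pods for HA.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: inference-api           # hypothetical name
spec:
  replicas: 3                   # HA through multiple pods
  selector:
    matchLabels:
      app: inference-api
  template:
    metadata:
      labels:
        app: inference-api
    spec:
      containers:
        - name: serve
          image: registry.example.com/inference:latest  # hypothetical image
          resources:
            limits:
              nvidia.com/gpu: "1"
```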
- Kubernetes is a widely recognized platform for AI/ML workloads
- More and more tools for optimal GPU usage on k8s are emerging (schedulers/resource management)
- Cloud Native AI Working Group
_imagined with Midjourney_
- Companies often operate multiple k8s clusters
- Separating Dev/Test/Prod environments is a basic best practice
- Regulated companies use dedicated platforms for each specific environment
_imagined with Midjourney_
- With the AI rush, the need for more GPUs increases drastically
- GPUs are expensive and scarce resources
- Limited supply due to manufacturing constraints and US export restrictions
_imagined with Midjourney_
Two strategic approaches emerge

GPUs in every platform

- ✅ Deploy GPUs directly into the platforms that require them
- ✅ No change for the end-users
- ⛔ Capacity planning must account for maintenance and redundancy
- ⛔ GPU-specific cluster adaptations apply to each platform
- ⛔ Expensive!

Workloads to dedicated GPU clusters

- ✅ Centralized k8s cluster built with many GPUs
- ✅ Specific exceptions and configurations are limited to a small set of clusters
- ⛔ Limited separation of environments
- ⛔ End-users must adapt to new systems and processes
- GPU drivers & dependencies due to an ESX vGPU environment
- Advanced workload scheduling - the default k8s scheduler is not optimized for ML workloads
- Additional components/configurations (see the pod sketch after this list)
  - NVIDIA GPU Operator
  - nvidia runtime class
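As an illustration, a minimal pod that exercises both pieces - assuming the NVIDIA GPU Operator has installed the `nvidia` runtime class and device plugin (the pod name is a placeholder):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test          # placeholder name
spec:
  runtimeClassName: nvidia      # runtime class set up by the GPU Operator
  restartPolicy: Never
  containers:
    - name: cuda
      image: nvidia/cuda:12.4.1-base-ubuntu22.04
      command: ["nvidia-smi"]   # prints the visible GPU if everything is wired up
      resources:
        limits:
          nvidia.com/gpu: "1"   # resource advertised by the NVIDIA device plugin
```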
_imagined with Midjourney_
Would it be possible to implement both strategies?
- Central cluster with specialized configuration
- While simultaneously having GPUs available in all k8s platforms
_imagined with Midjourney_
Open, Multi-Cloud, Multi-Cluster Kubernetes Orchestration
Enable dynamic and seamless Kubernetes multi-cluster topologies
- 2019: developed at the Politecnico di Torino
- 2023: became a spin-off
Liqo is an open-source project that enables dynamic and seamless Kubernetes multi-cluster topologies, supporting heterogeneous on-premise, cloud and edge infrastructures
Peering is a unidirectional resource and service consumption relationship between two Kubernetes clusters: a consumer and a provider
- Network fabric
- Authentication
- Resource negotiation
- Virtual node
End-to-end peering establishment
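Liqo drives these four building blocks end-to-end with a single command; a minimal sketch, assuming liqoctl 1.x (the kubeconfig path is a placeholder):

```bash
# Run against the consumer cluster; the provider is reached via its kubeconfig.
# Bundles authentication, network fabric setup, and resource negotiation.
liqoctl peer \
  --remote-kubeconfig ./provider-kubeconfig.yaml \
  --gw-server-service-type NodePort   # same gateway type as in the demo below
```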
The virtual node abstraction is based on the Virtual Kubelet project
- Create the virtual node resource
- Offload the local pods
- Propagate and synchronize the accessory artifacts (e.g., Services, ConfigMaps, Secrets, …)
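Once established, the provider appears as a single extra node on the consumer; it can be listed via the `liqo.io/type` label that also shows up later in the demo:

```bash
# Virtual nodes carry the label liqo.io/type=virtual-node
kubectl get nodes -l liqo.io/type=virtual-node
```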
Pod offloading works through a virtual node in the consumer cluster that represents resources from the provider cluster

- Assign resources
- Namespace extension
- Pod offloading
The network fabric extends pod-to-pod connectivity across the consumer and the provider cluster
- IPAM management
- Wireguard VPN Tunnel
- In-cluster overlay network
Storage fabric enables the seamless offloading of stateful workloads to a provider cluster
- Storage binding deferral
- Data gravity
- Locality constraints
Demo Time
Provider

A 2 node cluster with 1 GPU

kubectl get nodes --context=admin@k8s-gpu

Consumer

A 4 node cluster

kubectl get nodes -l liqo.io/type!=virtual-node --context=admin@k8s-lab

Liqo helm chart values
Provider
discovery:
  config:
    clusterID: k8s-gpu
    clusterLabels:
      liqo.io/provider: kubeadm
ipam:
  podCIDR: 10.244.0.0/16
  serviceCIDR: 10.96.0.0/12
telemetry:
  enabled: false
offloading:
  defaultNodeResources:
    cpu: "4"
    memory: 4Gi
    pods: "30"
    ephemeral-storage: 20Gi

Consumer

discovery:
  config:
    clusterID: k8s-lab
    clusterLabels:
      liqo.io/provider: kubeadm
ipam:
  podCIDR: 10.244.0.0/16
  serviceCIDR: 10.96.0.0/12
telemetry:
  enabled: false
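These values are applied with the regular Liqo chart; a minimal sketch, assuming the repo and chart name from the Liqo docs (the values file name is a placeholder):

```bash
helm repo add liqo https://helm.liqo.io/
helm repo update
# Same chart on both clusters, each with its own values file:
helm install liqo liqo/liqo --namespace kube-liqo --create-namespace \
  --values provider-values.yaml
```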
Provider

kubectl get pods -n kube-liqo --context=admin@k8s-gpu

Consumer
kubectl get pods -n kube-liqo --context=admin@k8s-lab

In order to perform pod offloading, we first need to establish peering
liqoctl authenticate -n kube-liqo --context admin@k8s-lab --remote-context admin@k8s-gpu --remote-namespace kube-liqo

and request a resource slice from the provider cluster
bat config/liqo-slice.yaml --color always

If the resource slice is accepted, it will create a virtual node on the consumer cluster
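The slice file itself is not reproduced in this deck; purely as an illustration, a ResourceSlice can look roughly like this (apiVersion and fields per the Liqo 1.x API are an assumption here - compare with `liqoctl create resourceslice --help`):

```yaml
apiVersion: authentication.liqo.io/v1beta1
kind: ResourceSlice
metadata:
  name: gpu-slice                       # placeholder name
  labels:
    liqo.io/remote-cluster-id: k8s-gpu  # provider cluster ID from the values above
spec:
  class: default
  resources:                            # what the consumer asks the provider for
    cpu: "4"
    memory: 16Gi
    nvidia.com/gpu: "1"
```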
kubectl get nodes -o wide --context=admin@k8s-lab

Provider
liqoctl info -n kube-liqo --context=admin@k8s-gpu

Consumer
liqoctl info -n kube-liqo --context=admin@k8s-lab

Set up a WireGuard VPN tunnel to communicate between the consumer and the provider cluster
liqoctl network connect --context admin@k8s-lab --namespace liqo-tenant-gpu --liqo-namespace kube-liqo --remote-liqo-namespace kube-liqo --remote-context admin@k8s-gpu --remote-namespace liqo-tenant-lab --gw-server-service-type NodePort

Provider
liqoctl info -n kube-liqo peer --context=admin@k8s-gpu

Consumer
liqoctl info -n kube-liqo peer --context=admin@k8s-lab

The liqo-demo namespace on the consumer cluster is used
kubectl create ns liqo-demo --context=admin@k8s-lab

to configure pod offloading
liqoctl offload namespace liqo-demo --context=admin@k8s-lab
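By default, offloaded pods may still run locally; liqoctl exposes this through offloading strategies - a sketch of the variant that pins all pods to the provider (flag names per liqoctl 1.x):

```bash
# Force every pod in the namespace onto the virtual node (the GPU cluster)
# and keep the same namespace name on the provider side.
liqoctl offload namespace liqo-demo \
  --context=admin@k8s-lab \
  --pod-offloading-strategy Remote \
  --namespace-mapping-strategy EnforceSameName
```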
Provider

kubectl get ns --context=admin@k8s-gpu | grep demo

Consumer
kubectl get ns --context=admin@k8s-lab | grep demo

Pod configuration
bat config/demo-pod.yaml --color always

Service configuration
bat config/demo-service.yaml --color always

Ingress configuration
bat config/demo-ingress.yaml --color always

Provider
kubectl get pod -o wide -n liqo-demo-k8s-lab --context=admin@k8s-gpu

Consumer
kubectl get pod -o wide -n liqo-demo --context=admin@k8s-lab

kubectl describe pod -n liqo-demo -l app=web-nvidia-smi --context=admin@k8s-lab | head -n 15

kubectl logs -n liqo-demo -l app=web-nvidia-smi --context=admin@k8s-lab

Query the ingress using curl
kubectl get ingress --context=admin@k8s-lab

curl -s https://smi.lab.remmen.io

March 2025
How We Solved the GPU Investment Puzzle
August 2025
How We Thought We Solved the GPU Investment Puzzle
Some issues have emerged
- Networking
- Day-2 operations
Ingress not working
<html>
<head><title>504 Gateway Time-out</title></head>
<body>
<center><h1>504 Gateway Time-out</h1></center>
<hr><center>nginx</center>
</body>
</html>
Cilium's IPAM protects against IP spoofing
➜ hubble observe --follow --to-ip 10.127.66.106 --numeric
Feb 25 15:19:24.674: 10.127.66.173 (ID:50526) <> 10.127.66.106 (ID:8224) Invalid source ip DROPPED (ICMPv4 EchoRequest)
Feb 25 15:19:25.688: 10.127.66.173 (ID:50526) <> 10.127.66.106 (ID:8224) Invalid source ip DROPPED (ICMPv4 EchoRequest)
Feb 25 15:19:26.708: 10.127.66.173 (ID:50526) <> 10.127.66.106 (ID:8224) Invalid source ip DROPPED (ICMPv4 EchoRequest)
-> Deactivating SourceIPVerification is not supported by Cilium Enterprise
Talos' built-in upgrade command fails
talosctl upgrade-k8s --context=k8s-lab --to=v1.33.4 --nodes=10.10.40.90;
automatically detected the lowest Kubernetes version 1.33.1
discovered controlplane nodes ["10.10.40.90"]
discovered worker nodes ["10.244.2.132" "10.10.40.91" "10.10.40.92" "10.10.40.93"]
> "10.10.40.90": pre-pulling registry.k8s.io/kube-apiserver:v1.33.4
> "10.10.40.90": pre-pulling registry.k8s.io/kube-controller-manager:v1.33.4
> "10.10.40.90": pre-pulling registry.k8s.io/kube-scheduler:v1.33.4
> "10.10.40.90": pre-pulling ghcr.io/siderolabs/kubelet:v1.33.4
failed pre-pulling images: error fetching kubelet spec on node 10.244.2.132: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp 10.244.2.132:50000: connect: connection refused"

-> The Liqo virtual node is discovered as a worker node (10.244.2.132 is an address from the pod CIDR), but no Talos API is listening behind it
- Use what we learned to build a better platform
- Consolidate GPUs into a Prod and a Non-Prod cluster
- DRA (Dynamic Resource Allocation) allows granular GPU requests (memory instead of quantity) - see the sketch below
- The HAMi project allows dynamic reconfiguration of GPU profiles for a better fit
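A rough sketch of such a request - assuming Kubernetes 1.32's resource.k8s.io/v1beta1 DRA API and NVIDIA's DRA driver exposing a gpu.nvidia.com device class; the CEL selector is illustrative only, not copied from a real driver:

```yaml
apiVersion: resource.k8s.io/v1beta1
kind: ResourceClaim
metadata:
  name: llm-gpu                         # placeholder name
spec:
  devices:
    requests:
      - name: gpu
        deviceClassName: gpu.nvidia.com # assumed device class from the DRA driver
        selectors:
          - cel:
              # Illustrative: ask for "a GPU with at least 40Gi of memory"
              # instead of "one whole GPU".
              expression: device.capacity['nvidia.com'].memory.compareTo(quantity('40Gi')) >= 0
```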
Using Quotas to reserve GPUs
- Kueue allows you to exceed your quota when you need more GPUs and free resources are available (see the sketch below)
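A hedged sketch of the Kueue side - a ClusterQueue with a nominal GPU quota that can borrow from a cohort when other queues leave GPUs idle (names, flavor, and numbers are placeholders):

```yaml
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: team-ml                   # placeholder name
spec:
  cohort: gpu-cohort              # queues in a cohort lend and borrow unused quota
  resourceGroups:
    - coveredResources: ["nvidia.com/gpu"]
      flavors:
        - name: default-flavor    # assumes an existing ResourceFlavor
          resources:
            - name: "nvidia.com/gpu"
              nominalQuota: 4     # guaranteed share
              borrowingLimit: 4   # up to 4 extra GPUs when idle in the cohort
```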
Questions ?