---
title: Sharing is Caring
sub_title: How We Thought We Solved the GPU Investment Puzzle
author: Marc Herren
event: cloudnativeday 2025
theme:
  name: light
  override:
    default:
      margin:
        percent: 8
      colors:
        foreground: "263151"
        background: "fefcfe"
    code:
      alignment: left
      background: false
    footer:
      style: template
      left:
        image: images/bespinian.png
    slide_title:
      padding_bottom: 3
      padding_top: 1
      separator: false
      font_size: 3
    intro_slide:
      title:
        alignment: center
        font_size: 4
options:
  list_item_newlines: 2
  implicit_slide_ends: true
---

About Me

  • Platform engineer @ bespinian
  • AI / digital media coach @ remmen.io
  • Trainer @ letsboot.ch

image

image

Some Context

  • Based on a customer project in the financial sector over the past 18 months
  • Over 30 Kubernetes clusters
  • Self-service platform

image:width:100%

imagined with midjourney

The AI Rush

  • More and more applications are including AI services
  • Companies are using their own data to create new models
  • Regulations and compliance require access to local models (LLMs)

image:width:100%

imagined with midjourney

The GPU: An Essential Tool for Data Mining

  • GPUs/TPUs (Tensor Processing Units) are the foundation of competitive AI
  • High demand, limited supply
  • Securing GPUs/TPUs is now a strategic priority

image

imagined with midjourney

Characteristics of AI Workloads

A typical AI workload can be categorized into two different types; a sketch of how they map to Kubernetes primitives follows the list

  • Batch jobs (hours/days) -> full power required, no high-availability requirement
    • Machine learning training
    • Report generation
    • Testing
  • Online jobs (24/7) -> high availability, multiple pods
    • Inference
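
Here is that sketch: a batch run as a Job, an online service as a replicated Deployment. All names and images are placeholders, not taken from the project:

  apiVersion: batch/v1
  kind: Job
  metadata:
    name: train-model                        # placeholder name
  spec:
    backoffLimit: 2
    template:
      spec:
        restartPolicy: Never                 # batch: runs to completion, no HA
        containers:
          - name: train
            image: registry.example.com/train:latest   # placeholder image
            resources:
              limits:
                nvidia.com/gpu: "1"          # full power for the duration of the job
  ---
  apiVersion: apps/v1
  kind: Deployment
  metadata:
    name: inference                          # placeholder name
  spec:
    replicas: 3                              # online: multiple pods for high availability
    selector:
      matchLabels:
        app: inference
    template:
      metadata:
        labels:
          app: inference
      spec:
        containers:
          - name: serve
            image: registry.example.com/serve:latest   # placeholder image
            resources:
              limits:
                nvidia.com/gpu: "1"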

Kubernetes: The Platform for AI/ML Workloads

  • Kubernetes is a widely recognized platform for AI/ML workloads
  • More and more tools are emerging for optimal GPU usage on k8s (scheduler/resource management)
  • Cloud Native AI Working Group

image

imagined with midjourney

In the Real World: Multiple Clusters Are the Norm

  • Companies often operate multiple k8s clusters
  • Separating Dev/Test/Prod environments is a basic best practice
  • Regulated companies use dedicated platforms for each specific environment

image

imagined with midjourney

The GPU Becomes the Bottleneck

  • With the AI rush, the need for more GPUs increases drastically
  • GPUs are expensive and scarce resources
  • Limited supply due to manufacturing constraints and US export restrictions

image

imagined with midjourney

How to Make GPUs Available on a Platform

Two strategic approaches emerge

GPUs to existing clusters

image

Workloads to dedicated GPU clusters

image

Strategy 1: GPUs to the Clusters

  • ✅ Deploy GPUs directly into platforms that require them
  • ✅ No change for the end-users
  • ⛔ Capacity planning must account for maintenance and redundancy
  • ⛔ GPU-specific cluster adaptations apply to each platform
  • ⛔ Expensive!

image

Strategy 2: Workloads to the GPU Cluster

  • ✅ Centralized k8s cluster built with many GPUs
  • ✅ GPU-specific exceptions and configurations are confined to a single cluster
  • ⛔ Limited separation of environments
  • ⛔ End-users must adapt to new systems and processes

image

Kubernetes Modifications Required

  • GPU drivers & dependencies due to an ESX vGPU environment
  • Advanced workload scheduling - the default k8s scheduler is not optimized for ML workloads
  • Additional components/configurations (a minimal sketch follows)
    • NVIDIA GPU Operator
    • nvidia RuntimeClass
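
The sketch: the GPU Operator installs a RuntimeClass named nvidia, and a workload opts into it as shown below. The pod name, image and command are illustrative:

  apiVersion: node.k8s.io/v1
  kind: RuntimeClass
  metadata:
    name: nvidia                   # installed by the NVIDIA GPU Operator
  handler: nvidia                  # maps to the nvidia-container-runtime
  ---
  apiVersion: v1
  kind: Pod
  metadata:
    name: gpu-check                # illustrative name
  spec:
    runtimeClassName: nvidia
    containers:
      - name: smi
        image: nvcr.io/nvidia/cuda:12.4.1-base-ubuntu22.04   # example CUDA base image
        command: ["nvidia-smi"]
        resources:
          limits:
            nvidia.com/gpu: "1"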

image

imagined with midjourney

What if You Want to Have Your Cake and Eat It Too?

Would it be possible to implement both strategies?

  • Central cluster with specialized configuration
  • While simultaneously having GPUs available in all k8s platforms

image

imagined with midjourney

Karmada

Open, Multi-Cloud, Multi-Cluster Kubernetes Orchestration

image:width:30%

https://karmada.io

Karmada Overview

image:width:100%

Liqo

Enable dynamic and seamless Kubernetes multi-cluster topologies

image

https://liqo.io

Liqo Origin

  • 2019: developed at the Politecnico di Torino
  • 2023: became a spin-off

image

Liqo Overview

Liqo is an open-source project that enables dynamic and seamless Kubernetes multi-cluster topologies, supporting heterogeneous on-premise, cloud and edge infrastructures

image

Liqo Peering

Peering is a unidirectional resource and service consumption relationship between two Kubernetes clusters: a consumer and a provider

  • Network fabric
  • Authentication
  • Resource negotiation
  • Virtual node

image

End-to-end peering establishment

Virtual Kubelet

The virtual node abstraction is based on the Virtual Kubelet project

  • Create the virtual node resource
  • Offload the local pods
  • Propagate and synchronize the accessory artifacts (e.g., Services, ConfigMaps, Secrets, …)

Liqo Offloading

Pod offloading works through a virtual node in the consumer cluster that represents resources from the provider cluster

  • Assign resources
  • Namespace extension
  • Pod offloading

image

Liqo Network

The network fabric extends pod-to-pod and pod-to-service connectivity between the consumer and the provider cluster

  • IPAM management
  • Wireguard VPN Tunnel
  • In-cluster overlay network

image

Liqo Storage

The storage fabric enables the seamless offloading of stateful workloads to a provider cluster; a hedged PVC sketch follows the list

  • Storage binding deferral
  • Data gravity
  • Locality constraints
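
A minimal sketch of binding deferral from the consumer side, assuming Liqo's virtual storage class (named liqo in the upstream docs); the PVC name is illustrative:

  apiVersion: v1
  kind: PersistentVolumeClaim
  metadata:
    name: model-cache              # illustrative name
    namespace: liqo-demo
  spec:
    storageClassName: liqo         # assumption: the virtual storage class Liqo installs
    accessModes:
      - ReadWriteOnce
    resources:
      requests:
        storage: 10Gi              # bound only once the consuming pod is scheduled,
                                   # in whichever cluster the pod lands in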

Demo Time

The Demo Environment

Provider

A 2-node cluster with 1 GPU

kubectl get nodes --context=admin@k8s-gpu

Consumer

A 4-node cluster

kubectl get nodes -l liqo.io/type!=virtual-node --context=admin@k8s-lab

Liqo Installation

Liqo Helm chart values

Provider

  discovery:
    config:
      clusterID: k8s-gpu
      clusterLabels:
        liqo.io/provider: kubeadm
  ipam:
    podCIDR: 10.244.0.0/16
    serviceCIDR: 10.96.0.0/12
  telemetry:
    enabled: false
  offloading:
    defaultNodeResources:
      cpu: "4"
      memory: 4Gi
      pods: "30"
      ephemeral-storage: 20Gi

Consumer

  discovery:
    config:
      clusterID: k8s-lab
      clusterLabels:
        liqo.io/provider: kubeadm
  ipam:
    podCIDR: 10.244.0.0/16
    serviceCIDR: 10.96.0.0/12
  telemetry:
    enabled: false
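
These values might be applied with the upstream Liqo Helm chart roughly as follows; the release name and values file names are assumptions, and liqoctl install is an equally valid path:

helm repo add liqo https://helm.liqo.io/
helm install liqo liqo/liqo --namespace kube-liqo --create-namespace \
  --values values-provider.yaml --kube-context admin@k8s-gpu
helm install liqo liqo/liqo --namespace kube-liqo --create-namespace \
  --values values-consumer.yaml --kube-context admin@k8s-lab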

Liqo Deployment

Provider

kubectl get pods -n kube-liqo --context=admin@k8s-gpu

Consumer

kubectl get pods -n kube-liqo --context=admin@k8s-lab

Pod Offloading

To perform pod offloading, we first need to establish peering

liqoctl authenticate -n kube-liqo --context admin@k8s-lab --remote-context admin@k8s-gpu --remote-namespace kube-liqo

and request a resource slice from the provider cluster

bat config/liqo-slice.yaml --color always
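
The slice manifest itself is not reproduced here; recent Liqo releases can also generate one via liqoctl. A hedged sketch from memory, with an illustrative slice name (verify the flags with liqoctl create resourceslice --help):

liqoctl create resourceslice gpu-slice \
  --remote-cluster-id k8s-gpu \
  --cpu 4 --memory 16Gi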

Virtual Node

If the resource slice is accepted, a virtual node is created on the consumer cluster

kubectl get nodes -o wide --context=admin@k8s-lab

Liqo Status

Provider

liqoctl info -n kube-liqo --context=admin@k8s-gpu

Consumer

liqoctl info -n kube-liqo --context=admin@k8s-lab

Liqo Networking

Set up a WireGuard VPN tunnel to communicate between the consumer and the provider cluster

liqoctl network connect --context admin@k8s-lab --namespace liqo-tenant-gpu --liqo-namespace kube-liqo --remote-liqo-namespace kube-liqo --remote-context admin@k8s-gpu --remote-namespace liqo-tenant-lab --gw-server-service-type NodePort

Liqo Peer Status

Provider

liqoctl info -n kube-liqo peer --context=admin@k8s-gpu

Consumer

liqoctl info -n kube-liqo peer --context=admin@k8s-lab

Offload a Namespace

The liqo-demo namespace on the consumer cluster is used

kubectl create ns liqo-demo --context=admin@k8s-lab

to configure pod offloading

liqoctl offload namespace liqo-demo --context=admin@k8s-lab
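
By default, pods in an offloaded namespace may be scheduled locally or remotely; this can be constrained with additional flags. A hedged sketch, with the strategy names as I recall them from the Liqo docs:

liqoctl offload namespace liqo-demo \
  --pod-offloading-strategy Remote \
  --namespace-mapping-strategy DefaultName \
  --context=admin@k8s-lab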

Twin Namespace

Provider

kubectl get ns --context=admin@k8s-gpu | grep demo

Consumer

kubectl get ns --context=admin@k8s-lab | grep demo

Offload a Pod

Pod configuration

bat config/demo-pod.yaml --color always
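
The file is not reproduced here; what follows is a hedged reconstruction based on the labels and behaviour visible later in the demo. The image and command are guesses; the real demo image presumably also serves this output over HTTP for the Service/Ingress on the next slide:

  apiVersion: v1
  kind: Pod
  metadata:
    name: web-nvidia-smi
    namespace: liqo-demo
    labels:
      app: web-nvidia-smi
  spec:
    containers:
      - name: web-nvidia-smi
        image: nvcr.io/nvidia/cuda:12.4.1-base-ubuntu22.04   # guess: any CUDA base image
        # guess: print nvidia-smi periodically so `kubectl logs` shows GPU info
        command: ["/bin/sh", "-c", "while true; do nvidia-smi; sleep 30; done"]
        resources:
          limits:
            # only the provider cluster has GPUs, so this request can only be
            # satisfied via the virtual node
            nvidia.com/gpu: "1"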

Service and Ingress Configuration

Service configuration

bat config/demo-service.yaml --color always

Ingress configuration

bat config/demo-ingress.yaml --color always
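
Neither file is reproduced here; below is a hedged sketch consistent with the demo (the hostname comes from the curl step later on, the target port is a guess):

  apiVersion: v1
  kind: Service
  metadata:
    name: web-nvidia-smi
    namespace: liqo-demo
  spec:
    selector:
      app: web-nvidia-smi
    ports:
      - port: 80
        targetPort: 8080           # guess: the container's HTTP port
  ---
  apiVersion: networking.k8s.io/v1
  kind: Ingress
  metadata:
    name: web-nvidia-smi
    namespace: liqo-demo
  spec:
    rules:
      - host: smi.lab.remmen.io    # host taken from the curl step
        http:
          paths:
            - path: /
              pathType: Prefix
              backend:
                service:
                  name: web-nvidia-smi
                  port:
                    number: 80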

Pod Status

Provider

kubectl get pod -o wide -n liqo-demo-k8s-lab --context=admin@k8s-gpu

Consumer

kubectl get pod -o wide -n liqo-demo --context=admin@k8s-lab

Work with the Pod on the Consumer

kubectl describe pod -n liqo-demo -l app=web-nvidia-smi --context=admin@k8s-lab | head -n 15
kubectl logs -n liqo-demo -l app=web-nvidia-smi --context=admin@k8s-lab

Curl

Query the ingress using curl

kubectl get ingress --context=admin@k8s-lab
curl -s https://smi.lab.remmen.io

What Made Me Change the Title

March 2025

How We Solved the GPU Investment Puzzle

August 2025

How We Thought We Solved the GPU Investment Puzzle

Some issues have emerged

  • Networking
  • Day-2 operations

IPAM / Cilium

Ingress not working

<html>
<head><title>504 Gateway Time-out</title></head>
<body>
<center><h1>504 Gateway Time-out</h1></center>
<hr><center>nginx</center>
</body>
</html>

Cilium's IPAM protects against IP spoofing

➜ hubble observe --follow --to-ip 10.127.66.106 --numeric
Feb 25 15:19:24.674: 10.127.66.173 (ID:50526) <> 10.127.66.106 (ID:8224) Invalid source ip DROPPED (ICMPv4 EchoRequest)
Feb 25 15:19:25.688: 10.127.66.173 (ID:50526) <> 10.127.66.106 (ID:8224) Invalid source ip DROPPED (ICMPv4 EchoRequest)
Feb 25 15:19:26.708: 10.127.66.173 (ID:50526) <> 10.127.66.106 (ID:8224) Invalid source ip DROPPED (ICMPv4 EchoRequest)

-> Deactivating SourceIPVerification is unsupported by Cilium Enterprise

Talos

The built-in Talos upgrade command fails

talosctl upgrade-k8s --context=k8s-lab --to=v1.33.4 --nodes=10.10.40.90

automatically detected the lowest Kubernetes version 1.33.1
discovered controlplane nodes ["10.10.40.90"]
discovered worker nodes ["10.244.2.132" "10.10.40.91" "10.10.40.92" "10.10.40.93"]
 > "10.10.40.90": pre-pulling registry.k8s.io/kube-apiserver:v1.33.4
 > "10.10.40.90": pre-pulling registry.k8s.io/kube-controller-manager:v1.33.4
 > "10.10.40.90": pre-pulling registry.k8s.io/kube-scheduler:v1.33.4
 > "10.10.40.90": pre-pulling ghcr.io/siderolabs/kubelet:v1.33.4
failed pre-pulling images: error fetching kubelet spec on node 10.244.2.132: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp 10.244.2.132:50000: connect: connection refused"

-> The Liqo virtual node is discovered as a worker node (10.244.2.132 is a pod-CIDR address), which talosctl cannot reach

Bring Workloads to the Clusters

  • Use what we learned to build a better platform
  • Consolidate GPUs into a Prod and Non-Prod cluster

Dynamic Resource Allocation (DRA)

  • DRA allows granular GPU requests (e.g., an amount of GPU memory instead of a device count); see the sketch after this list
  • The HAMi project allows dynamic reconfiguration of GPU profiles for a better fit
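
A minimal sketch of a DRA-style GPU request, assuming the resource.k8s.io/v1beta1 API (Kubernetes 1.32) and the NVIDIA DRA driver's device class name:

  apiVersion: resource.k8s.io/v1beta1
  kind: ResourceClaimTemplate
  metadata:
    name: single-gpu
  spec:
    spec:
      devices:
        requests:
          - name: gpu
            deviceClassName: gpu.nvidia.com   # assumption: class installed by the NVIDIA DRA driver
  ---
  apiVersion: v1
  kind: Pod
  metadata:
    name: dra-example                # illustrative name
  spec:
    resourceClaims:
      - name: gpu
        resourceClaimTemplateName: single-gpu
    containers:
      - name: app
        image: nvcr.io/nvidia/cuda:12.4.1-base-ubuntu22.04
        command: ["nvidia-smi"]
        resources:
          claims:
            - name: gpu              # consumes the claim instead of counting nvidia.com/gpu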

image

https://project-hami.io/

image

Kueue

Using Quotas to Reserve GPUs

  • Kueue allows a queue to borrow beyond its quota when more GPUs are needed and free resources are available (see the sketch below)
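
A minimal sketch of such a quota with Kueue's v1beta1 API; the queue, cohort and flavor names are illustrative:

  apiVersion: kueue.x-k8s.io/v1beta1
  kind: ClusterQueue
  metadata:
    name: team-a                     # illustrative queue name
  spec:
    namespaceSelector: {}            # admit workloads from any namespace
    cohort: gpu-sharing              # queues in the same cohort can borrow from each other
    resourceGroups:
      - coveredResources: ["nvidia.com/gpu"]
        flavors:
          - name: default-flavor     # assumes a matching ResourceFlavor exists
            resources:
              - name: "nvidia.com/gpu"
                nominalQuota: 4      # the reserved share
                borrowingLimit: 4    # may borrow up to 4 more when the cohort has them free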

image

https://kueue.sigs.k8s.io/

Questions?

image

Link to the Slides