---
title: Sharing is Caring
sub_title: How We Thought We Solved the GPU Investment Puzzle
author: Marc Herren
event: cloudnativeday 2025
theme:
  name: light
  override:
    default:
      margin:
        percent: 8
      colors:
        foreground: "263151"
        background: "fefcfe"
    code:
      alignment: left
      background: false
    footer:
      style: template
      left:
        image: images/bespinian.png
    slide_title:
      padding_bottom: 3
      padding_top: 1
      separator: false
      font_size: 3
    intro_slide:
      title:
        alignment: center
        font_size: 4
options:
  list_item_newlines: 2
  implicit_slide_ends: true
---

About Me

  • Platform engineer @ bespinian
  • AI / digital media coach @ remmen.io
  • Trainer @ letsboot.ch

image

image

Some Context

  • Based on a customer project in the financial sector over the past 18 months
  • Over 30 Kubernetes clusters
  • Self-service platform

image:width:100%

imagined with midjourney

The AI Rush

  • More and more applications are including AI services
  • Companies are using their own data to create new models
  • Regulations and compliance require access to local models (LLMs)

image:width:100%

imagined with midjourney

The GPU: An Essential Tool for Data Mining

  • GPUs/TPUs (Tensor Processing Units) are the foundation of competitive AI
  • High demand, limited supply
  • Securing GPUs/TPUs is now a strategic priority

image

imagined with midjourney

Characteristics of AI Workloads

A typical AI workload can be categorized into two different types; a sketch of how they map to Kubernetes primitives follows the list

  • Batch jobs (hours/days) -> full power required, no high-availability requirement
    • Machine learning training
    • Report generation
    • Testing
  • Online jobs (24/7) -> high availability, multiple pods
    • Inference
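
Here is that sketch: a batch run as a Job, an online service as a replicated Deployment. All names and images are placeholders, not taken from the project:

  apiVersion: batch/v1
  kind: Job
  metadata:
    name: train-model                        # placeholder name
  spec:
    backoffLimit: 2
    template:
      spec:
        restartPolicy: Never                 # batch: runs to completion, no HA
        containers:
          - name: train
            image: registry.example.com/train:latest   # placeholder image
            resources:
              limits:
                nvidia.com/gpu: "1"          # full power for the duration of the job
  ---
  apiVersion: apps/v1
  kind: Deployment
  metadata:
    name: inference                          # placeholder name
  spec:
    replicas: 3                              # online: multiple pods for high availability
    selector:
      matchLabels:
        app: inference
    template:
      metadata:
        labels:
          app: inference
      spec:
        containers:
          - name: serve
            image: registry.example.com/serve:latest   # placeholder image
            resources:
              limits:
                nvidia.com/gpu: "1"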

Kubernetes: The Platform for AI/ML Workloads

  • Kubernetes is a widely recognized platform for AI/ML workloads
  • More and more tools are emerging for optimal GPU usage on k8s (scheduler/resource management)
  • Cloud Native AI Working Group

image

imagined with midjourney

In the Real World: Multiple Clusters Are the Norm

  • Companies often operate multiple k8s clusters
  • Separating Dev/Test/Prod environments is a basic best practice
  • Regulated companies use dedicated platforms for each specific environment

image

imagined with midjourney

The GPU Becomes the Bottleneck

  • With the AI rush, the need for more GPUs increases drastically
  • GPUs are expensive and scarce resources
  • Limited supply due to manufacturing constraints and US export restrictions

image

imagined with midjourney

How to Make GPUs Available on a Platform

Two strategic approaches emerge

GPUs to existing clusters

image

Workloads to dedicated GPU clusters

image

Strategy 1: GPUs to the Clusters

  • ✅ Deploy GPUs directly into platforms that require them
  • ✅ No change for the end-users
  • ⛔ Capacity planning must account for maintenance and redundancy
  • ⛔ GPU-specific cluster adaptations apply to each platform
  • ⛔ Expensive!

image

Strategy 2: Workloads to the GPU Cluster

  • ✅ Centralized k8s cluster built with many GPUs
  • ✅ GPU-specific exceptions and configurations are confined to a single cluster
  • ⛔ Limited separation of environments
  • ⛔ End-users must adapt to new systems and processes

image

Kubernetes Modifications Required

  • GPU drivers & dependencies due to an ESX vGPU environment
  • Advanced workload scheduling - the default k8s scheduler is not optimized for ML workloads
  • Additional components/configurations (a minimal sketch follows)
    • NVIDIA GPU Operator
    • nvidia RuntimeClass
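
The sketch: the GPU Operator installs a RuntimeClass named nvidia, and a workload opts into it as shown below. The pod name, image and command are illustrative:

  apiVersion: node.k8s.io/v1
  kind: RuntimeClass
  metadata:
    name: nvidia                   # installed by the NVIDIA GPU Operator
  handler: nvidia                  # maps to the nvidia-container-runtime
  ---
  apiVersion: v1
  kind: Pod
  metadata:
    name: gpu-check                # illustrative name
  spec:
    runtimeClassName: nvidia
    containers:
      - name: smi
        image: nvcr.io/nvidia/cuda:12.4.1-base-ubuntu22.04   # example CUDA base image
        command: ["nvidia-smi"]
        resources:
          limits:
            nvidia.com/gpu: "1"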

image

imagined with midjourney

What if You Want to Have Your Cake and Eat It Too?

Would it be possible to implement both strategies?

  • Central cluster with specialized configuration
  • While simultaneously having GPUs available in all k8s platforms

image

imagined with midjourney

Karmada

Open, Multi-Cloud, Multi-Cluster Kubernetes Orchestration

image:width:30%

https://karmada.io

Karmada Overview

image:width:100%

Liqo

Enable dynamic and seamless Kubernetes multi-cluster topologies

image

https://liqo.io

Liqo Origin

  • 2019: developed at the Politecnico di Torino
  • 2023: became a spin-off

image

Liqo Overview

Liqo is an open-source project that enables dynamic and seamless Kubernetes multi-cluster topologies, supporting heterogeneous on-premise, cloud and edge infrastructures

image

Liqo Peering

Peering is a unidirectional resource and service consumption relationship between two Kubernetes clusters: a consumer and a provider

  • Network fabric
  • Authentication
  • Resource negotiation
  • Virtual node

image

End-to-end peering establishment

Virtual Kubelet

The virtual node abstraction is based on the Virtual Kubelet project

  • Create the virtual node resource
  • Offload the local pods
  • Propagate and synchronize the accessory artifacts (e.g., Services, ConfigMaps, Secrets, …)

Liqo Offloading

Pod offloading works through a virtual node in the consumer cluster that represents resources from the provider cluster

  • Assign resources
  • Namespace extension
  • Pod offloading

image

Liqo Network

The network fabric extends pod-to-pod and pod-to-service connectivity between the consumer and the provider cluster

  • IPAM management
  • Wireguard VPN Tunnel
  • In-cluster overlay network

image

Liqo Storage

The storage fabric enables the seamless offloading of stateful workloads to a provider cluster; a hedged PVC sketch follows the list

  • Storage binding deferral
  • Data gravity
  • Locality constraints
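
A minimal sketch of binding deferral from the consumer side, assuming Liqo's virtual storage class (named liqo in the upstream docs); the PVC name is illustrative:

  apiVersion: v1
  kind: PersistentVolumeClaim
  metadata:
    name: model-cache              # illustrative name
    namespace: liqo-demo
  spec:
    storageClassName: liqo         # assumption: the virtual storage class Liqo installs
    accessModes:
      - ReadWriteOnce
    resources:
      requests:
        storage: 10Gi              # bound only once the consuming pod is scheduled,
                                   # in whichever cluster the pod lands in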

Demo Time

The Demo Environment

Provider

A 2-node cluster with 1 GPU

kubectl get nodes --context=admin@k8s-gpu

Consumer

A 4-node cluster

kubectl get nodes -l liqo.io/type!=virtual-node --context=admin@k8s-lab

Liqo Installation

Liqo Helm chart values

Provider

  discovery:
    config:
      clusterID: k8s-gpu
      clusterLabels:
        liqo.io/provider: kubeadm
  ipam:
    podCIDR: 10.244.0.0/16
    serviceCIDR: 10.96.0.0/12
  telemetry:
    enabled: false
  offloading:
    defaultNodeResources:
      cpu: "4"
      memory: 4Gi
      pods: "30"
      ephemeral-storage: 20Gi

Consumer

  discovery:
    config:
      clusterID: k8s-lab
      clusterLabels:
        liqo.io/provider: kubeadm
  ipam:
    podCIDR: 10.244.0.0/16
    serviceCIDR: 10.96.0.0/12
  telemetry:
    enabled: false
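
These values might be applied with the upstream Liqo Helm chart roughly as follows; the release name and values file names are assumptions, and liqoctl install is an equally valid path:

helm repo add liqo https://helm.liqo.io/
helm install liqo liqo/liqo --namespace kube-liqo --create-namespace \
  --values values-provider.yaml --kube-context admin@k8s-gpu
helm install liqo liqo/liqo --namespace kube-liqo --create-namespace \
  --values values-consumer.yaml --kube-context admin@k8s-lab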

Liqo Deployment

Provider

kubectl get pods -n kube-liqo --context=admin@k8s-gpu

Consumer

kubectl get pods -n kube-liqo --context=admin@k8s-lab

Pod Offloading

To perform pod offloading, we first need to establish peering

liqoctl authenticate -n kube-liqo --context admin@k8s-lab --remote-context admin@k8s-gpu --remote-namespace kube-liqo

and request a resource slice from the provider cluster

bat config/liqo-slice.yaml --color always
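
The slice manifest itself is not reproduced here; recent Liqo releases can also generate one via liqoctl. A hedged sketch from memory, with an illustrative slice name (verify the flags with liqoctl create resourceslice --help):

liqoctl create resourceslice gpu-slice \
  --remote-cluster-id k8s-gpu \
  --cpu 4 --memory 16Gi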

Virtual Node

If the resource slice is accepted, a virtual node is created on the consumer cluster

kubectl get nodes -o wide --context=admin@k8s-lab

Liqo Status

Provider

liqoctl info -n kube-liqo --context=admin@k8s-gpu

Consumer

liqoctl info -n kube-liqo --context=admin@k8s-lab

Liqo Networking

Set up a WireGuard VPN tunnel to communicate between the consumer and the provider cluster

liqoctl network connect --context admin@k8s-lab --namespace liqo-tenant-gpu --liqo-namespace kube-liqo --remote-liqo-namespace kube-liqo --remote-context admin@k8s-gpu --remote-namespace liqo-tenant-lab --gw-server-service-type NodePort

Liqo Peer Status

Provider

liqoctl info -n kube-liqo peer --context=admin@k8s-gpu

Consumer

liqoctl info -n kube-liqo peer --context=admin@k8s-lab

Offload a Namespace

The liqo-demo namespace on the consumer cluster is used

kubectl create ns liqo-demo --context=admin@k8s-lab

to configure pod offloading

liqoctl offload namespace liqo-demo --context=admin@k8s-lab
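
By default, pods in an offloaded namespace may be scheduled locally or remotely; this can be constrained with additional flags. A hedged sketch, with the strategy names as I recall them from the Liqo docs:

liqoctl offload namespace liqo-demo \
  --pod-offloading-strategy Remote \
  --namespace-mapping-strategy DefaultName \
  --context=admin@k8s-lab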

Twin Namespace

Provider

kubectl get ns --context=admin@k8s-gpu | grep demo

Consumer

kubectl get ns --context=admin@k8s-lab | grep demo

Offload a Pod

Pod configuration

bat config/demo-pod.yaml --color always
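
The file is not reproduced here; what follows is a hedged reconstruction based on the labels and behaviour visible later in the demo. The image and command are guesses; the real demo image presumably also serves this output over HTTP for the Service/Ingress on the next slide:

  apiVersion: v1
  kind: Pod
  metadata:
    name: web-nvidia-smi
    namespace: liqo-demo
    labels:
      app: web-nvidia-smi
  spec:
    containers:
      - name: web-nvidia-smi
        image: nvcr.io/nvidia/cuda:12.4.1-base-ubuntu22.04   # guess: any CUDA base image
        # guess: print nvidia-smi periodically so `kubectl logs` shows GPU info
        command: ["/bin/sh", "-c", "while true; do nvidia-smi; sleep 30; done"]
        resources:
          limits:
            # only the provider cluster has GPUs, so this request can only be
            # satisfied via the virtual node
            nvidia.com/gpu: "1"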

Service and Ingress Configuration

Service configuration

bat config/demo-service.yaml --color always

Ingress configuration

bat config/demo-ingress.yaml --color always
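
Neither file is reproduced here; below is a hedged sketch consistent with the demo (the hostname comes from the curl step later on, the target port is a guess):

  apiVersion: v1
  kind: Service
  metadata:
    name: web-nvidia-smi
    namespace: liqo-demo
  spec:
    selector:
      app: web-nvidia-smi
    ports:
      - port: 80
        targetPort: 8080           # guess: the container's HTTP port
  ---
  apiVersion: networking.k8s.io/v1
  kind: Ingress
  metadata:
    name: web-nvidia-smi
    namespace: liqo-demo
  spec:
    rules:
      - host: smi.lab.remmen.io    # host taken from the curl step
        http:
          paths:
            - path: /
              pathType: Prefix
              backend:
                service:
                  name: web-nvidia-smi
                  port:
                    number: 80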

Pod Status

Provider

kubectl get pod -o wide -n liqo-demo-k8s-lab --context=admin@k8s-gpu

Consumer

kubectl get pod -o wide -n liqo-demo --context=admin@k8s-lab

Work with the Pod on the Consumer

kubectl describe pod -n liqo-demo -l app=web-nvidia-smi --context=admin@k8s-lab | head -n 15
kubectl logs -n liqo-demo -l app=web-nvidia-smi --context=admin@k8s-lab

Curl

Query the ingress using curl

kubectl get ingress --context=admin@k8s-lab
curl -s https://smi.lab.remmen.io

What Made Me Change the Title

March 2025

How We Solved the GPU Investment Puzzle

August 2025

How We Thought We Solved the GPU Investment Puzzle

Some issues have emerged

  • Networking
  • Day-2 operations

IPAM / Cilium

Ingress not working

<html>
<head><title>504 Gateway Time-out</title></head>
<body>
<center><h1>504 Gateway Time-out</h1></center>
<hr><center>nginx</center>
</body>
</html>

Cilium's IPAM protects against IP spoofing

➜ hubble observe --follow --to-ip 10.127.66.106 --numeric
Feb 25 15:19:24.674: 10.127.66.173 (ID:50526) <> 10.127.66.106 (ID:8224) Invalid source ip DROPPED (ICMPv4 EchoRequest)
Feb 25 15:19:25.688: 10.127.66.173 (ID:50526) <> 10.127.66.106 (ID:8224) Invalid source ip DROPPED (ICMPv4 EchoRequest)
Feb 25 15:19:26.708: 10.127.66.173 (ID:50526) <> 10.127.66.106 (ID:8224) Invalid source ip DROPPED (ICMPv4 EchoRequest)

-> Deactivating SourceIPVerification is unsupported by Cilium Enterprise

Talos

The built-in Talos upgrade command fails

talosctl upgrade-k8s --context=k8s-lab --to=v1.33.4 --nodes=10.10.40.90

automatically detected the lowest Kubernetes version 1.33.1
discovered controlplane nodes ["10.10.40.90"]
discovered worker nodes ["10.244.2.132" "10.10.40.91" "10.10.40.92" "10.10.40.93"]
 > "10.10.40.90": pre-pulling registry.k8s.io/kube-apiserver:v1.33.4
 > "10.10.40.90": pre-pulling registry.k8s.io/kube-controller-manager:v1.33.4
 > "10.10.40.90": pre-pulling registry.k8s.io/kube-scheduler:v1.33.4
 > "10.10.40.90": pre-pulling ghcr.io/siderolabs/kubelet:v1.33.4
failed pre-pulling images: error fetching kubelet spec on node 10.244.2.132: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp 10.244.2.132:50000: connect: connection refused"

-> The Liqo virtual node is discovered as a worker node (10.244.2.132 is a pod-CIDR address), which talosctl cannot reach

Bring Workloads to the Clusters

  • Use what we learned to build a better platform
  • Consolidate GPUs into a Prod and Non-Prod cluster

Dynamic Resource Allocation (DRA)

  • DRA allows granular GPU requests (e.g., an amount of GPU memory instead of a device count); see the sketch after this list
  • The HAMi project allows dynamic reconfiguration of GPU profiles for a better fit
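
A minimal sketch of a DRA-style GPU request, assuming the resource.k8s.io/v1beta1 API (Kubernetes 1.32) and the NVIDIA DRA driver's device class name:

  apiVersion: resource.k8s.io/v1beta1
  kind: ResourceClaimTemplate
  metadata:
    name: single-gpu
  spec:
    spec:
      devices:
        requests:
          - name: gpu
            deviceClassName: gpu.nvidia.com   # assumption: class installed by the NVIDIA DRA driver
  ---
  apiVersion: v1
  kind: Pod
  metadata:
    name: dra-example                # illustrative name
  spec:
    resourceClaims:
      - name: gpu
        resourceClaimTemplateName: single-gpu
    containers:
      - name: app
        image: nvcr.io/nvidia/cuda:12.4.1-base-ubuntu22.04
        command: ["nvidia-smi"]
        resources:
          claims:
            - name: gpu              # consumes the claim instead of counting nvidia.com/gpu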

image

https://project-hami.io/

image

Kueue

Using Quotas to Reserve GPUs

  • Kueue allows a queue to borrow beyond its quota when more GPUs are needed and free resources are available (see the sketch below)
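
A minimal sketch of such a quota with Kueue's v1beta1 API; the queue, cohort and flavor names are illustrative:

  apiVersion: kueue.x-k8s.io/v1beta1
  kind: ClusterQueue
  metadata:
    name: team-a                     # illustrative queue name
  spec:
    namespaceSelector: {}            # admit workloads from any namespace
    cohort: gpu-sharing              # queues in the same cohort can borrow from each other
    resourceGroups:
      - coveredResources: ["nvidia.com/gpu"]
        flavors:
          - name: default-flavor     # assumes a matching ResourceFlavor exists
            resources:
              - name: "nvidia.com/gpu"
                nominalQuota: 4      # the reserved share
                borrowingLimit: 4    # may borrow up to 4 more when the cohort has them free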

image

https://kueue.sigs.k8s.io/

Questions?

image

Link to the Slides