Kubernetes for LLM Inference: Scaling AI Workloads

The LLM Inference Scaling Problem

LLM inference is fundamentally different from serving a typical web API. Each request consumes significant GPU memory, latency is high and variable depending on output length, and throughput depends heavily on batching efficiency. Standard horizontal pod autoscaling based on CPU or memory metrics does not translate well to GPU workloads.

Running LLM inference on Kubernetes properly requires purpose-built tooling: GPU-aware scheduling, efficient inference servers like vLLM, custom autoscaling metrics, and a clear cost strategy that accounts for expensive GPU node pools.

Setting Up GPU Node Pools

Create dedicated GPU node pools rather than mixing GPU and CPU workloads on the same nodes. This simplifies scheduling and enables independent scaling of your inference tier.

# GKE example: create a dedicated GPU node pool
gcloud container node-pools create gpu-inference \
  --cluster=prod-cluster \
  --machine-type=n1-standard-8 \
  --accelerator=type=nvidia-tesla-t4,count=1 \
  --num-nodes=2 \
  --enable-autoscaling \
  --min-nodes=1 \
  --max-nodes=10 \
  --node-taints=gpu=true:NoSchedule

Tainting GPU nodes prevents non-GPU workloads from scheduling onto them, keeping your expensive GPU capacity reserved for inference.

Deploying vLLM on Kubernetes

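vLLM is one of the most widely used open-source inference servers for self-hosted models. Its PagedAttention mechanism manages the KV cache in fixed-size blocks, which reduces memory fragmentation and allows larger batches. The vLLM authors report 2-4x higher throughput than earlier serving systems at comparable latency.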

apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-inference
spec:
  replicas: 2
  selector:
    matchLabels:
      app: vllm-inference
  template:
    metadata:
      labels:
        app: vllm-inference   # must match spec.selector.matchLabels
    spec:
      tolerations:            # allow scheduling onto the tainted GPU nodes
        - key: gpu
          operator: Equal
          value: "true"
          effect: NoSchedule
      containers:
        - name: vllm
          image: vllm/vllm-openai:latest   # pin a specific tag in production
          args:
            - --model
            - mistralai/Mistral-7B-Instruct-v0.2
            - --max-model-len
            - "4096"
            - --tensor-parallel-size
            - "1"
            - --dtype         # the T4 does not support bfloat16, so force FP16
            - half
          resources:
            limits:
              nvidia.com/gpu: "1"
              memory: "20Gi"
            requests:
              nvidia.com/gpu: "1"
              memory: "16Gi"
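
To reach the pods from other workloads (or from a Prometheus scrape job), the Deployment is usually fronted by a Service. The manifest below is a minimal sketch, not part of the original setup: the Service name is hypothetical, it assumes the pods carry the app: vllm-inference label that the Deployment selector expects, and it assumes vLLM is listening on its default port 8000.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: vllm-inference          # hypothetical name, chosen to match the Deployment
spec:
  type: ClusterIP               # internal only; put an ingress or gateway in front for external traffic
  selector:
    app: vllm-inference         # assumes the pod template sets this label
  ports:
    - name: http
      port: 80                  # port that in-cluster clients connect to
      targetPort: 8000          # assumes vLLM's default listen port
```

In-cluster clients could then call the OpenAI-compatible API at http://vllm-inference/v1/..., assuming they run in the same namespace.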

Autoscaling Based on GPU Utilization and Queue Depth

The built-in HPA resource metrics are CPU and memory. For LLM inference, better scaling signals are GPU-side saturation (such as KV-cache usage) and request queue depth. KEDA (Kubernetes Event-Driven Autoscaling) with a Prometheus trigger lets you scale on any metric the inference server exports.

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: vllm-scaledobject
spec:
  scaleTargetRef:
    name: vllm-inference
  minReplicaCount: 1
  maxReplicaCount: 8
  triggers:
    - type: prometheus
      metricType: Value   # compare the averaged value directly to the threshold
      metadata:
        serverAddress: http://prometheus:9090
        metricName: vllm_gpu_cache_usage_perc
        threshold: "70"
        # vLLM reports this gauge as a fraction (0 to 1), so convert it to percent
        query: avg(vllm:gpu_cache_usage_perc) * 100

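This configuration scales out when average KV-cache usage across replicas exceeds 70%, a sign that running sequences are close to exhausting cache space and new requests will start to queue. It scales in during low traffic to save cost. Keep minReplicaCount at 1 or higher: a new replica has to pull the image and load model weights before it can serve, so scaling from zero adds a long cold start to the first requests.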
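
Queue depth can be added as a second trigger on the same ScaledObject. The fragment below is a sketch under assumptions: it relies on vLLM exporting the vllm:num_requests_waiting gauge to the same Prometheus server, and the threshold of 5 waiting requests per replica is an illustrative value, not a recommendation from the original setup. When several triggers are defined, the highest replica count they ask for is used.

```yaml
# Fragment: an extra entry for spec.triggers in the ScaledObject
triggers:
  - type: prometheus
    metricType: AverageValue      # the queried total is divided by the replica count
    metadata:
      serverAddress: http://prometheus:9090
      # total requests queued across all replicas
      query: sum(vllm:num_requests_waiting)
      threshold: "5"              # illustrative: target about 5 waiting requests per replica
```

A sum with AverageValue targets a per-replica queue length, which tends to react faster than cache usage when a burst of short prompts arrives.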

Cost Control with Spot Instances

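GPU instances are expensive. Use spot/preemptible instances for interruption-tolerant inference and on-demand instances for workloads with production SLAs. A split of roughly 70% spot and 30% on-demand capacity is a reasonable starting point; on GKE this means separate spot and on-demand node pools, since a single pool is one or the other. Actual savings depend on the provider's spot discount and on how often nodes are reclaimed.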

Use spot instances for batch inference jobs (embeddings, document processing)
Use on-demand for real-time user-facing inference with latency requirements
Implement graceful shutdown handlers to drain in-flight requests before spot termination
Use pod disruption budgets to maintain minimum availability during node drains and upgrades (they limit voluntary evictions and do not prevent a spot preemption itself)
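
The last two points can be sketched as a PodDisruptionBudget plus a pod-spec fragment. This is a minimal illustration rather than a tested configuration: the PDB name is hypothetical, the timing values are placeholders, and the preStop hook assumes a sleep binary exists in the container image. Spot termination notices are short (about 30 seconds on GCP, about 2 minutes on AWS), so the grace period has to fit inside that window.

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: vllm-inference-pdb        # hypothetical name
spec:
  minAvailable: 1                 # never evict the last serving replica voluntarily
  selector:
    matchLabels:
      app: vllm-inference         # assumes the pods carry this label
---
# Fragment of the Deployment: give pods time to drain before they are killed
spec:
  template:
    spec:
      terminationGracePeriodSeconds: 30   # illustrative; keep within the spot notice window
      containers:
        - name: vllm
          lifecycle:
            preStop:
              exec:
                # delay SIGTERM so the pod is removed from Service endpoints first
                command: ["sleep", "10"]
```

With minAvailable: 1, a node drain is blocked while only one replica is running, so this budget works best when the Deployment runs at least two replicas.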

Request Batching and Throughput Optimization

vLLM performs continuous batching: it adds incoming requests to the running batch at each decoding step, maximizing GPU utilization without requiring clients to batch requests themselves. Configure max_num_seqs (the --max-num-seqs flag, the maximum number of concurrent sequences) based on your GPU memory and average context length.

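For a T4 GPU with a 7B model, start with max_num_seqs=32 and tune from there. FP16 weights for a 7B model take roughly 14 GB of the T4's 16 GB, so very little memory is left for KV cache; a quantized checkpoint or a larger GPU may be needed to sustain that concurrency
Monitor the vllm:num_requests_running metric; if it consistently sits at your limit while vllm:num_requests_waiting grows, scale out
Enable prefix caching (--enable-prefix-caching) if your prompts share a long system prompt; when the shared prefix dominates the prompt, it can cut first-token latency by 50% or more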
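
Applied to the Deployment's container spec, these settings might look like the fragment below. Treat it as a sketch: the values repeat the starting points above, and the --gpu-memory-utilization setting is an added assumption that should be tuned against measured throughput.

```yaml
# Fragment of the vllm container spec; values are illustrative starting points
args:
  - --model
  - mistralai/Mistral-7B-Instruct-v0.2
  - --max-model-len
  - "4096"
  - --max-num-seqs              # cap on concurrently scheduled sequences
  - "32"
  - --gpu-memory-utilization    # fraction of GPU memory vLLM may claim (assumed value)
  - "0.90"
  - --enable-prefix-caching     # reuse KV cache for shared prompt prefixes
```

Raising --max-num-seqs only helps if there is KV-cache space to back it, so change one setting at a time and watch cache usage and queue depth after each change.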

Kubernetes-based LLM inference gives you the flexibility to run any open-source model at scale without paying OpenAI's per-token markup. The infrastructure complexity is real but manageable with the right setup. Start with a single GPU node and vLLM, measure your throughput, then scale from that baseline.
