The LLM Inference Scaling Problem
LLM inference is fundamentally different from serving a typical web API. Each request consumes significant GPU memory, latency is high and variable depending on output length, and throughput depends heavily on batching efficiency. Standard horizontal pod autoscaling based on CPU or memory metrics does not translate well to GPU workloads.
Running LLM inference on Kubernetes properly requires purpose-built tooling: GPU-aware scheduling, efficient inference servers like vLLM, custom autoscaling metrics, and a clear cost strategy that accounts for expensive GPU node pools.
Setting Up GPU Node Pools
Create dedicated GPU node pools rather than mixing GPU and CPU workloads on the same nodes. This simplifies scheduling and enables independent scaling of your inference tier.
# GKE example — create a GPU node pool
gcloud container node-pools create gpu-inference --cluster=prod-cluster --machine-type=n1-standard-8 --accelerator=type=nvidia-tesla-t4,count=1 --num-nodes=2 --enable-autoscaling --min-nodes=1 --max-nodes=10 --node-taints=gpu=true:NoSchedule
Tainting GPU nodes prevents non-GPU workloads from scheduling onto them, keeping your expensive GPU capacity reserved for inference.
Deploying vLLM on Kubernetes
vLLM is the best open-source inference server for self-hosted models. Its PagedAttention mechanism dramatically improves GPU memory utilization, enabling 2-4x higher throughput than naive inference implementations.
apiVersion: apps/v1
kind: Deployment
metadata:
name: vllm-inference
spec:
replicas: 2
selector:
matchLabels:
app: vllm-inference
template:
spec:
tolerations:
- key: gpu
operator: Equal
value: "true"
effect: NoSchedule
containers:
- name: vllm
image: vllm/vllm-openai:latest
args:
- --model
- mistralai/Mistral-7B-Instruct-v0.2
- --max-model-len
- "4096"
- --tensor-parallel-size
- "1"
resources:
limits:
nvidia.com/gpu: "1"
memory: "20Gi"
requests:
nvidia.com/gpu: "1"
memory: "16Gi"
Autoscaling Based on GPU Utilization and Queue Depth
Standard HPA uses CPU/memory metrics. For LLM inference, you want to scale on GPU utilization and request queue depth. Use KEDA (Kubernetes Event-Driven Autoscaling) with a Prometheus trigger for more control.
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
name: vllm-scaledobject
spec:
scaleTargetRef:
name: vllm-inference
minReplicaCount: 1
maxReplicaCount: 8
triggers:
- type: prometheus
metadata:
serverAddress: http://prometheus:9090
metricName: vllm_gpu_cache_usage_perc
threshold: "70"
query: avg(vllm:gpu_cache_usage_perc)
Scale out when GPU cache utilization exceeds 70%. Scale in during low traffic to save cost. Always keep a minimum of 1 replica running to avoid cold start latency on first requests.
Cost Control with Spot Instances
GPU instances are expensive. Use spot/preemptible instances for non-critical inference and on-demand for production SLAs. A mixed node pool with 70% spot and 30% on-demand gives substantial savings with acceptable reliability.
- Use spot instances for batch inference jobs (embeddings, document processing)
- Use on-demand for real-time user-facing inference with latency requirements
- Implement graceful shutdown handlers to drain in-flight requests before spot termination
- Use pod disruption budgets to maintain minimum availability during node replacements
Request Batching and Throughput Optimization
vLLM performs continuous batching — it groups incoming requests into batches dynamically, maximizing GPU utilization without requiring clients to batch requests themselves. Configure max_num_seqs (maximum concurrent sequences) based on your GPU memory and average context length.
- For a T4 GPU with a 7B model: set
max_num_seqs=32 - Monitor the
vllm:num_requests_runningmetric — if it consistently hits your limit, scale out - Enable prefix caching (
--enable-prefix-caching) if your prompts share a long system prompt — it can cut first-token latency by 50% or more
Kubernetes-based LLM inference gives you the flexibility to run any open-source model at scale without paying OpenAI's per-token markup. The infrastructure complexity is real but manageable with the right setup. Start with a single GPU node and vLLM, measure your throughput, then scale from that baseline.