DevOps for AI Is Not MLOps
The term "MLOps" typically refers to managing the full lifecycle of custom-trained ML models: data pipelines, training runs, model registries, and serving infrastructure. In 2025, most companies are not training custom models — they are calling OpenAI, Anthropic, or Google APIs and orchestrating those calls with frameworks like LangChain or LlamaIndex.
This creates a new category of operational concerns that sits between traditional DevOps and full MLOps. You still need version control, CI/CD, observability, and incident response — but applied to prompts, context management, and LLM provider dependencies rather than model weights.
Prompt Version Control
Prompts are code. Treat them as such. Store all system prompts in your repository as versioned text files or structured YAML. Never hardcode prompts in application logic where changes cannot be tracked or reviewed.
# prompts/customer-support-v3.yaml
version: 3
model: claude-3-5-sonnet-20241022
system: |
  You are a customer support agent for Acme Corp.
  Tone: Professional, empathetic, concise.
  Always: Acknowledge the issue before offering solutions.
  Never: Promise refunds without checking the refund policy tool.
  Refund policy: Products can be returned within 30 days if unopened.
temperature: 0.3
max_tokens: 500
Every prompt change goes through pull request review. Changes to production prompts require at least one reviewer who understands the use case context. Tag prompt versions and never overwrite — append a new version number.
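One way to enforce the never-overwrite rule in CI is to compare content hashes of the prompt files against a lock mapping committed alongside them. The following is a minimal sketch under that assumption, not a standard tool: the names `hash_prompts` and `find_overwritten`, and the idea of a lock mapping, are hypothetical and introduced here only for illustration.

```python
import hashlib
from pathlib import Path


def hash_prompts(prompt_dir: Path) -> dict[str, str]:
    """Map each prompt file name to the SHA-256 digest of its contents."""
    return {
        path.name: hashlib.sha256(path.read_bytes()).hexdigest()
        for path in sorted(prompt_dir.glob("*.yaml"))
    }


def find_overwritten(locked: dict[str, str], current: dict[str, str]) -> list[str]:
    """Return already-published prompt files that were edited or deleted.

    Files that are new in `current` are ignored, because adding a new
    version file is the allowed way to change a prompt.
    """
    return sorted(
        name for name, digest in locked.items() if current.get(name) != digest
    )
```

A CI job could load the committed lock mapping, call `find_overwritten`, and fail the build when the returned list is non-empty. Adding a file such as customer-support-v4.yaml passes; editing customer-support-v3.yaml in place does not.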
LLM Observability Stack
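Standard APM tools (Datadog, New Relic) are built around requests, hosts, and traces, and a default setup does not capture LLM-specific signals such as token counts, prompt versions, or per-step agent traces. Instrument your AI application with a purpose-built tool: Langfuse (open-source, self-hostable) or LangSmith (LangChain's hosted platform).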
Track these metrics per model and per prompt version:
- Latency (TTFT): Time to first token. User experience degrades sharply above 2 seconds.
- Total latency: Time to complete response. Set P99 alerts.
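- Token usage: Input and output tokens per request. This is your primary cost driver; providers typically price input and output tokens at different rates, so track them separately.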
- Error rate: API errors, timeouts, and content filter rejections.
- Trace completeness: For multi-step agents, track which steps fail most often.
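To make the two latency metrics above concrete, here is a minimal sketch of measuring TTFT and total latency around any streamed response, plus a nearest-rank P99 for alerting. It assumes the response arrives as an iterable of text chunks; `StreamTiming`, `timed_stream`, and `p99` are hypothetical names, and a tool such as Langfuse or LangSmith would normally record these values for you.

```python
import time
from dataclasses import dataclass
from typing import Iterable, Iterator, Optional


@dataclass
class StreamTiming:
    ttft_s: Optional[float] = None   # time to first token, in seconds
    total_s: Optional[float] = None  # time to complete response, in seconds
    chunks: int = 0


def timed_stream(chunks: Iterable[str], timing: StreamTiming) -> Iterator[str]:
    """Yield chunks unchanged while filling in `timing` as a side effect."""
    start = time.perf_counter()
    for chunk in chunks:
        if timing.ttft_s is None:
            timing.ttft_s = time.perf_counter() - start
        timing.chunks += 1
        yield chunk
    timing.total_s = time.perf_counter() - start


def p99(samples: list[float]) -> float:
    """Nearest-rank 99th percentile of a non-empty list of samples."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = -(-99 * len(ordered) // 100)  # integer ceil(0.99 * n)
    return ordered[rank - 1]
```

Wrap the provider's stream with `timed_stream`, then store the resulting `StreamTiming` tagged with the model and prompt version so that regressions can be traced to a specific change.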
Cost Controls and Budget Alerts
LLM API costs can spike unexpectedly. A single bug that triggers excessive retries, or a prompt that generates longer-than-expected outputs, can turn a $200/month AI feature into a $5,000 surprise invoice.
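# Token budget middleware example
from datetime import datetime, timezone
from redis.asyncio import Redis  # async client shipped with redis-py 4.2+

class BudgetExceededError(Exception):
    """Raised when a request would push usage past the daily limit."""

class TokenBudgetMiddleware:
    def __init__(self, daily_limit: int = 1_000_000):
        self.daily_limit = daily_limit
        self.redis = Redis()

    async def check_budget(self, estimated_tokens: int) -> None:
        # Key by UTC date so every app server agrees on the same daily bucket.
        key = f"tokens:{datetime.now(timezone.utc):%Y-%m-%d}"
        # Reserve atomically: a separate GET then INCR lets concurrent requests overshoot.
        used = await self.redis.incrby(key, estimated_tokens)
        await self.redis.expire(key, 86400)
        if used > self.daily_limit:
            await self.redis.decrby(key, estimated_tokens)  # release the reservation
            raise BudgetExceededError('Daily token budget exceeded')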
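- Set budget alerts at 50%, 80%, and 100% of expected monthly spend wherever your LLM usage is billed: AWS/GCP/Azure if you consume models through a cloud platform, otherwise the provider's own billing settings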
- Implement per-user and per-feature token quotas in your application layer
- Use model tiering: route simple queries to cheaper models (Haiku, GPT-4o-mini) automatically
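Model tiering can start as a few explicit rules before you invest in anything smarter. The sketch below is one possible shape, assuming that query length and tool use are reasonable proxies for difficulty. The tier names are placeholders rather than real model IDs, and the thresholds are illustrative values you would tune against your own eval results.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    name: str             # placeholder label, not a real model ID
    max_input_chars: int  # longest query this tier should handle


# Ordered cheapest first. Names and thresholds are illustrative only.
TIERS = (
    Tier(name="small-model", max_input_chars=500),
    Tier(name="large-model", max_input_chars=100_000),
)


def choose_tier(query: str, needs_tools: bool = False) -> str:
    """Pick the cheapest tier that the heuristics consider adequate."""
    if needs_tools:
        # Assumption: tool-using or multi-step requests go to the strongest tier.
        return TIERS[-1].name
    for tier in TIERS:
        if len(query) <= tier.max_input_chars:
            return tier.name
    raise ValueError("query exceeds the largest tier's input limit")
```

Keeping the rules in one function makes the routing decision easy to log next to token usage, which is what lets you check whether the cheaper tier is actually saving money without hurting quality.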
Incident Response for AI Systems
AI incidents have unique characteristics compared to traditional software outages. Quality degradation often happens silently — the system returns 200 OK while producing wrong or harmful outputs. Standard uptime monitoring misses these failures entirely.
Types of AI Incidents
- Provider outage: OpenAI/Anthropic API down. Mitigate with fallback providers.
- Quality regression: Output quality drops after a prompt or model change. Detect with automated eval.
- Cost explosion: Token usage spikes. Detect with real-time cost monitoring.
- Prompt injection attack: Adversarial user inputs manipulate model behavior. Detect with input validation and output monitoring.
Define a runbook for each incident type before it happens. Know in advance: who gets paged, what the rollback procedure is, and when to switch to a fallback provider.
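Of the incident types above, quality regression is the one that uptime checks cannot see, so it helps to have an automated eval gate that runs before a prompt or model change ships. The following is a deliberately simple sketch: `EvalCase`, `pass_rate`, and `gate` are hypothetical names, and substring matching stands in for whatever grading method fits your use case.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class EvalCase:
    user_input: str
    must_contain: str  # simplest possible check; real evals often use graders


def pass_rate(generate: Callable[[str], str], cases: list[EvalCase]) -> float:
    """Fraction of cases whose output contains the expected substring."""
    if not cases:
        raise ValueError("need at least one eval case")
    passed = sum(
        case.must_contain.lower() in generate(case.user_input).lower()
        for case in cases
    )
    return passed / len(cases)


def gate(
    generate: Callable[[str], str],
    cases: list[EvalCase],
    baseline: float,
    tolerance: float = 0.05,
) -> bool:
    """Return True if the candidate stays within `tolerance` of the baseline."""
    return pass_rate(generate, cases) >= baseline - tolerance
```

Here `generate` is any function that takes user input and returns the model's reply for the candidate prompt version. Running the same cases on a schedule against production also catches regressions caused by provider-side model updates.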
Multi-Provider Strategy
Depending on a single LLM provider is an operational risk. Design your AI layer with an abstraction that allows switching providers. When OpenAI has an outage, you want to flip a config flag and route to Anthropic — not rewrite half your codebase.
- Use LiteLLM as a provider-agnostic proxy in front of all LLM calls
- Keep a tested fallback prompt for your secondary provider (models behave differently)
- Validate that your use case is within your secondary provider's usage policies before you need it
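The abstraction described above can be as small as an ordered list of callables. This sketch does not use LiteLLM's API; it only illustrates the fallback pattern, with `complete_with_fallback` and `AllProvidersFailed` as hypothetical names. Each callable is assumed to wrap one provider's client together with the prompt variant tested for that provider.

```python
from typing import Callable, Sequence


class AllProvidersFailed(Exception):
    """Raised when every configured provider raised an error."""


def complete_with_fallback(
    prompt: str,
    providers: Sequence[tuple[str, Callable[[str], str]]],
) -> tuple[str, str]:
    """Try providers in order and return (provider_name, completion)."""
    errors: list[str] = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except Exception as exc:  # a real client would catch narrower error types
            errors.append(f"{name}: {exc}")
    raise AllProvidersFailed("; ".join(errors))
```

Returning the provider name with the completion lets you tag each trace with the provider that actually served it, so a silent failover shows up in your observability data instead of hiding behind a successful response.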
AI DevOps in 2025 is about operating probabilistic systems reliably. The practices are not fundamentally different from good software engineering — version control, observability, testing, incident response — but the implementation details require AI-specific tooling and thinking.