The Three Pillars: Metrics, Logs, and Traces
Observability is not the same as monitoring. Monitoring tells you when something is broken; observability lets you understand why. A complete observability stack covers three pillars: metrics (what is happening), logs (what happened in detail), and traces (how a request moved through the system). This guide focuses on an open-source stack that is production-grade and runs entirely on your own infrastructure: Prometheus for metrics, Loki for logs, Grafana for visualisation, and Alertmanager for alert routing.
Prometheus Setup and Service Discovery
Prometheus scrapes metrics from HTTP endpoints on a configurable interval. In Kubernetes, it uses service discovery to automatically find pods with the correct annotations — no manual target configuration needed.
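# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Keep only pods annotated with prometheus.io/scrape: "true"
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      # Use prometheus.io/path as the metrics path when it is set
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      # Scrape the port named in prometheus.io/port; without this rule the
      # port annotation has no effect
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__

  - job_name: 'node-exporter'
    static_configs:
      - targets: ['node-exporter:9100']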
Annotate your pods with prometheus.io/scrape: "true" and prometheus.io/port: "3000" to opt them into scraping.
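To make the effect of the annotations concrete, the following is a minimal sketch in plain JavaScript of how they translate into a scrape target. It is a simplified model, not Prometheus's actual relabelling code: the function name `scrapeTargetFor` is hypothetical, and the fallback port of 80 is an assumption made only for this illustration.

```javascript
// Simplified model of annotation-driven scraping (illustration only).
// `annotations` mirrors a pod's metadata.annotations; `podIP` is the pod address.
function scrapeTargetFor(podIP, annotations) {
  // "keep" rule: the pod is dropped unless prometheus.io/scrape is exactly "true"
  if (annotations['prometheus.io/scrape'] !== 'true') return null

  // Optional overrides. The port fallback is an assumption for this sketch;
  // /metrics is the default metrics path.
  const port = annotations['prometheus.io/port'] ?? '80'
  const path = annotations['prometheus.io/path'] ?? '/metrics'
  return `http://${podIP}:${port}${path}`
}

const target = scrapeTargetFor('10.0.0.5', {
  'prometheus.io/scrape': 'true',
  'prometheus.io/port': '3000',
})
console.log(target) // http://10.0.0.5:3000/metrics
```

A pod without the scrape annotation yields `null` here, which corresponds to the target being discarded before any scrape is attempted.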
Instrumenting Your Application
Expose a /metrics endpoint from your Node.js API using prom-client. Track the RED metrics — Rate, Errors, Duration — for every service.
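import express from 'express'
import client from 'prom-client'

const app = express()

// Dedicated registry, so only metrics registered here are exposed
const registry = new client.Registry()
client.collectDefaultMetrics({ register: registry })

const httpRequestDuration = new client.Histogram({
  name: 'http_request_duration_seconds',
  help: 'HTTP request duration in seconds',
  labelNames: ['method', 'route', 'status_code'],
  buckets: [0.01, 0.05, 0.1, 0.3, 0.5, 1, 2, 5],
  registers: [registry],
})

// Middleware: time every request and record it once the response has been sent
app.use((req, res, next) => {
  const end = httpRequestDuration.startTimer()
  res.on('finish', () => {
    // Label by route pattern, not raw URL, to keep cardinality bounded.
    // Unmatched requests (404s) have no req.route, so fall back to a fixed value.
    end({
      method: req.method,
      route: req.route?.path ?? 'unmatched',
      status_code: res.statusCode,
    })
  })
  next()
})

// Endpoint scraped by Prometheus
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', registry.contentType)
  res.send(await registry.metrics())
})

app.listen(3000) // same port as the prometheus.io/port annotation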
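It may help to see what the histogram actually stores. The sketch below is a simplified, dependency-free model of cumulative buckets, not prom-client's implementation; the function name `observeAll` and the sample durations are made up for illustration. Each observation increments every bucket whose upper bound covers it, which is why the bucket counts never decrease as the bound grows.

```javascript
// Bucket upper bounds copied from the histogram definition above
const bounds = [0.01, 0.05, 0.1, 0.3, 0.5, 1, 2, 5]

// Returns the cumulative bucket counts, sum, and count for a list of durations
function observeAll(durations) {
  const buckets = bounds.map((le) => ({ le, count: 0 }))
  buckets.push({ le: Infinity, count: 0 }) // the implicit +Inf bucket
  let sum = 0
  for (const d of durations) {
    sum += d
    for (const b of buckets) {
      if (d <= b.le) b.count += 1 // cumulative: counted in every bucket that covers d
    }
  }
  return { buckets, sum, count: durations.length }
}

// Four hypothetical request durations, in seconds
const result = observeAll([0.02, 0.2, 0.2, 7])
console.log(result.buckets.map((b) => b.count)) // [ 0, 1, 1, 3, 3, 3, 3, 3, 4 ]
```

The 7-second request lands only in the `+Inf` bucket, so anything slower than the largest finite bound (5 s here) cannot be resolved further. Choose bounds that bracket your latency targets.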
Log Aggregation with Loki and Promtail
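Loki indexes logs only by their labels, not by their content, which typically makes it much cheaper to store and operate than a full-text index such as Elasticsearch. The trade-off is that content searches scan the matching log streams at query time, so keep labels few and low-cardinality. Promtail ships logs from your pods to Loki and uses Kubernetes metadata to attach labels automatically.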
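# promtail-config.yml
clients:
  - url: http://loki:3100/loki/api/v1/push

scrape_configs:
  - job_name: kubernetes-pods
    kubernetes_sd_configs:
      - role: pod
    pipeline_stages:
      # Unwrap the container runtime's line format first. Assumes a CRI
      # runtime such as containerd; use `docker: {}` for Docker's json-file logs.
      - cri: {}
      # Parse the JSON log line and extract two fields
      - json:
          expressions:
            level: level
            msg: message
      # Promote only the low-cardinality `level` field to a label
      - labels:
          level:
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_name]
        target_label: pod
      - source_labels: [__meta_kubernetes_namespace]
        target_label: namespace
      # Tell Promtail which files to tail. Assumes the node's /var/log/pods
      # directory is mounted into the Promtail pod.
      - source_labels: [__meta_kubernetes_pod_uid, __meta_kubernetes_pod_container_name]
        separator: /
        target_label: __path__
        replacement: /var/log/pods/*$1/*.log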
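The pipeline above expects each log line to be a JSON object with `level` and `message` keys. A minimal sketch of an application-side logger that produces that shape follows; the helper names `formatLogLine` and `log`, the `ts` field, and the `orderId` example are assumptions for illustration, and in practice a structured logging library would usually fill this role.

```javascript
// Hypothetical helper: formats one log entry as a single JSON line.
// Extra fields are spread first so they cannot overwrite `level` or `message`.
function formatLogLine(level, message, fields = {}) {
  return JSON.stringify({
    ...fields,
    level,
    message,
    ts: new Date().toISOString(),
  })
}

// Write to stdout, where the container runtime captures it for Promtail
function log(level, message, fields) {
  process.stdout.write(formatLogLine(level, message, fields) + '\n')
}

log('error', 'payment failed', { orderId: 'A-123' })
```

High-cardinality values such as `orderId` stay inside the log body, where they can still be filtered at query time. Only `level` becomes a label, which keeps the Loki index small.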
Grafana Dashboards: The Metrics That Matter
Build one dashboard per service, and one infrastructure overview dashboard. Every service dashboard should answer four questions at a glance: Is request rate normal? Is error rate acceptable? Is latency within SLA? Are resources (CPU, memory) being exhausted?
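# Key PromQL queries for your dashboards
# Request rate (requests per second, summed over all series)
sum(rate(http_request_duration_seconds_count[5m]))
# Error rate (fraction of requests answered with a 5xx status).
# Both sides are summed: dividing the raw vectors would match each
# 5xx series with itself and always return 1.
sum(rate(http_request_duration_seconds_count{status_code=~"5.."}[5m]))
  / sum(rate(http_request_duration_seconds_count[5m]))
# P99 latency (aggregate the buckets by `le` before taking the quantile)
histogram_quantile(0.99,
  sum by (le) (rate(http_request_duration_seconds_bucket[5m]))
)
# CPU usage per pod, in cores (container!="" skips the pod-level aggregate series)
sum by (pod) (rate(container_cpu_usage_seconds_total{container!=""}[5m]))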
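The P99 query returns an estimate rather than an exact value, because a histogram only records how many requests fell under each bucket bound. The sketch below approximates the linear interpolation that `histogram_quantile` performs; it is a simplified model under stated assumptions (0 < q <= 1, at least one finite bucket, counts already cumulative and sorted by bound), and the function name and sample counts are hypothetical.

```javascript
// buckets: cumulative counts sorted by upper bound `le`, ending with +Inf
function estimateQuantile(q, buckets) {
  const total = buckets[buckets.length - 1].count
  if (total === 0) return NaN // no observations in the window

  const rank = q * total
  // First bucket whose cumulative count reaches the rank
  const i = buckets.findIndex((b) => b.count >= rank)

  // Rank falls beyond the largest finite bound: report that bound
  if (buckets[i].le === Infinity) return buckets[i - 1].le

  const lower = i === 0 ? 0 : buckets[i - 1].le
  const below = i === 0 ? 0 : buckets[i - 1].count
  const inBucket = buckets[i].count - below
  // Assume observations are spread evenly inside the bucket
  return lower + (buckets[i].le - lower) * ((rank - below) / inBucket)
}

// Hypothetical counts: 50 requests <= 0.1s, 90 <= 0.3s, 100 <= 0.5s
const sample = [
  { le: 0.1, count: 50 },
  { le: 0.3, count: 90 },
  { le: 0.5, count: 100 },
  { le: Infinity, count: 100 },
]
console.log(estimateQuantile(0.99, sample).toFixed(2)) // 0.48
```

The estimate can never be more precise than the bucket it falls in. If your SLA threshold is 300 ms, having a bucket bound at exactly 0.3 makes the answer to "are we within SLA?" exact rather than interpolated.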
Alertmanager: On-Call Alerting
Define alerts in Prometheus rules files and route them through Alertmanager to PagerDuty, Slack, or email. Write alerts that page people only when human action is actually required — alert fatigue from noisy alerts is a serious operational hazard.
# alert_rules.yml
groups:
  - name: api.rules
    rules:
      - alert: HighErrorRate
        # Sum both sides so the ratio spans all series; dividing the raw
        # vectors matches each 5xx series with itself and always yields 1
        expr: |
          sum(rate(http_request_duration_seconds_count{status_code=~"5.."}[5m]))
            / sum(rate(http_request_duration_seconds_count[5m])) > 0.01
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Error rate above 1% for 5 minutes"
          runbook: "https://wiki.internal/runbooks/high-error-rate"
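The `for: 5m` clause is what keeps a brief spike from paging anyone: the expression has to stay true across consecutive evaluations for the whole duration before the alert fires. The following is a simplified model of that behaviour in plain JavaScript, not Prometheus's rule evaluator; the function name and sample values are hypothetical.

```javascript
// Each sample is one rule evaluation: t in seconds, value = error ratio
function alertStates(samples, threshold, forSeconds) {
  let activeSince = null
  return samples.map(({ t, value }) => {
    if (!(value > threshold)) {
      activeSince = null // condition cleared: the timer resets
      return 'inactive'
    }
    if (activeSince === null) activeSince = t
    return t - activeSince >= forSeconds ? 'firing' : 'pending'
  })
}

// Error ratio stays above 1% for five minutes straight
const sustained = [0, 150, 300].map((t) => ({ t, value: 0.02 }))
console.log(alertStates(sustained, 0.01, 300)) // [ 'pending', 'pending', 'firing' ]

// A dip below the threshold resets the timer, so nobody is paged
const flapping = [
  { t: 0, value: 0.02 },
  { t: 150, value: 0.005 },
  { t: 300, value: 0.02 },
]
console.log(alertStates(flapping, 0.01, 300)) // [ 'pending', 'inactive', 'pending' ]
```

A longer `for` reduces noise at the cost of slower detection, so it is worth choosing per alert rather than copying one value everywhere.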
Every alert should have a runbook link. An alert without a runbook is a page that will be ignored or resolved incorrectly at 3am.
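One way to enforce the runbook rule is a small check in CI. The sketch below assumes the rules file has already been parsed from YAML into a plain object (parsing is left out), and that the annotation key is `runbook` as in the example above; the function name `alertsMissingRunbook` is hypothetical.

```javascript
// Returns "group/alert" names for every alerting rule without a runbook link
function alertsMissingRunbook(ruleFile) {
  const missing = []
  for (const group of ruleFile.groups ?? []) {
    for (const rule of group.rules ?? []) {
      if (!rule.alert) continue // recording rules have no `alert` key
      const url = rule.annotations?.runbook
      if (typeof url !== 'string' || url.trim() === '') {
        missing.push(`${group.name}/${rule.alert}`)
      }
    }
  }
  return missing
}

// Hypothetical parsed rules file: one compliant alert, one without a runbook
const parsedRules = {
  groups: [
    {
      name: 'api.rules',
      rules: [
        {
          alert: 'HighErrorRate',
          annotations: { runbook: 'https://wiki.internal/runbooks/high-error-rate' },
        },
        { alert: 'HighLatency', annotations: { summary: 'P99 latency is high' } },
      ],
    },
  ],
}
console.log(alertsMissingRunbook(parsedRules)) // [ 'api.rules/HighLatency' ]
```

Failing the build when this list is non-empty catches the missing link at review time instead of during an incident.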