A. Sulaiman · 16 min read

Complete Monitoring with Prometheus and Grafana

Build a complete observability stack with Prometheus, Grafana, and Alertmanager for production Kubernetes clusters

Effective monitoring is critical for maintaining reliable systems. This guide shows you how to build a complete observability stack using Prometheus, Grafana, and Alertmanager.

Monitoring Architecture
Figure 1: Complete monitoring and observability architecture

Why Prometheus + Grafana?

  • Open Source - No vendor lock-in
  • Kubernetes Native - Built for cloud-native apps
  • Powerful Query Language - PromQL for metrics
  • Beautiful Dashboards - Grafana visualization
  • Alert Management - Proactive issue detection

1. Prometheus Setup

Helm Installation

# Add Prometheus Helm repository
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

# Install Prometheus
helm install prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace \
  --set prometheus.prometheusSpec.retention=30d \
  --set prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.resources.requests.storage=50Gi

Custom Values

# prometheus-values.yaml
prometheus:
  prometheusSpec:
    retention: 30d
    retentionSize: "45GB"

    resources:
      requests:
        cpu: 500m
        memory: 2Gi
      limits:
        cpu: 2000m
        memory: 4Gi

    storageSpec:
      volumeClaimTemplate:
        spec:
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 50Gi

    # Also pick up ServiceMonitors/PodMonitors created outside this Helm release
    serviceMonitorSelectorNilUsesHelmValues: false
    podMonitorSelectorNilUsesHelmValues: false

grafana:
  # for production, prefer referencing a Secret (admin.existingSecret) over an inline password
  adminPassword: "secure-password-here"

  persistence:
    enabled: true
    size: 10Gi

  dashboardProviders:
    dashboardproviders.yaml:
      apiVersion: 1
      providers:
      - name: 'default'
        folder: 'General'
        type: file
        options:
          path: /var/lib/grafana/dashboards/default

alertmanager:
  config:
    global:
      resolve_timeout: 5m
    route:
      group_by: ['alertname', 'cluster', 'service']
      group_wait: 10s
      group_interval: 10s
      repeat_interval: 12h
      receiver: 'slack-notifications'

    receivers:
    - name: 'slack-notifications'
      slack_configs:
      - api_url: 'YOUR_SLACK_WEBHOOK_URL'
        channel: '#alerts'
        title: 'Kubernetes Alert'
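The values file above does not apply itself; pass it to Helm with `-f`. The release and namespace names below follow the install command from earlier:

```shell
# Apply the custom values to the existing release (or install fresh)
helm upgrade --install prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace \
  -f prometheus-values.yaml
```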

2. Application Instrumentation

Python Flask Example

# app.py
from flask import Flask, jsonify, request
from prometheus_client import Counter, Histogram, Gauge, generate_latest
import time

app = Flask(__name__)

# Metrics
REQUEST_COUNT = Counter(
    'app_requests_total',
    'Total request count',
    ['method', 'endpoint', 'status']
)

REQUEST_DURATION = Histogram(
    'app_request_duration_seconds',
    'Request duration in seconds',
    ['method', 'endpoint']
)

ACTIVE_REQUESTS = Gauge(
    'app_active_requests',
    'Number of active requests'
)

@app.before_request
def before_request():
    ACTIVE_REQUESTS.inc()
    request.start_time = time.time()

@app.after_request
def after_request(response):
    ACTIVE_REQUESTS.dec()

    request_duration = time.time() - request.start_time
    REQUEST_DURATION.labels(
        method=request.method,
        endpoint=request.endpoint
    ).observe(request_duration)

    REQUEST_COUNT.labels(
        method=request.method,
        endpoint=request.endpoint,
        status=response.status_code
    ).inc()

    return response

@app.route('/metrics')
def metrics():
    # Serve metrics with the Prometheus text exposition content type
    return generate_latest(), 200, {'Content-Type': 'text/plain; version=0.0.4; charset=utf-8'}

@app.route('/api/users')
def get_users():
    # Your application logic
    return jsonify({"users": []})

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=5000)
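Under the hood, `generate_latest()` renders every registered metric in the Prometheus text exposition format: a `HELP` line, a `TYPE` line, then one sample per label combination. A stdlib-only sketch of that format (the sample value here is made up for illustration):

```python
def render(name, help_text, mtype, samples):
    """Render one metric family in the Prometheus text exposition
    format: a HELP line, a TYPE line, then one sample per label set."""
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} {mtype}"]
    for labels, value in samples:
        label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
        lines.append(f"{name}{{{label_str}}} {value}")
    return "\n".join(lines)

print(render(
    "app_requests_total", "Total request count", "counter",
    [({"method": "GET", "endpoint": "/api/users", "status": "200"}, 42.0)],
))
```

This is what Prometheus parses when it scrapes `/metrics`; the client library handles escaping and additional sample types (histogram buckets, created timestamps) that this sketch omits.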

ServiceMonitor for Prometheus

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-app
  namespace: production
  labels:
    app: my-app
spec:
  selector:
    matchLabels:
      app: my-app
  endpoints:
  - port: metrics
    interval: 30s
    path: /metrics
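The ServiceMonitor's `port: metrics` refers to a *named* port on a Service, not a number. A matching Service might look like this sketch (port 5000 matches the Flask example above; the labels are assumptions that must line up with your Deployment):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: my-app
  namespace: production
  labels:
    app: my-app          # matched by the ServiceMonitor's selector
spec:
  selector:
    app: my-app
  ports:
  - name: metrics        # the name the ServiceMonitor endpoint refers to
    port: 5000
    targetPort: 5000
```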

Prometheus Scraping
Figure 2: Prometheus service discovery and scraping

3. Essential PromQL Queries

Request Rate

# Requests per second
rate(app_requests_total[5m])

# By endpoint
sum(rate(app_requests_total[5m])) by (endpoint)

# Error rate (4xx and 5xx)
sum(rate(app_requests_total{status=~"4..|5.."}[5m]))
/
sum(rate(app_requests_total[5m]))
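To build intuition for what `rate()` computes, here is a simplified stdlib sketch: the per-second increase of a counter over a window of (timestamp, value) samples, correcting for counter resets. Real Prometheus additionally extrapolates to the window boundaries, which this sketch skips:

```python
def simple_rate(samples):
    """Approximate PromQL rate(): per-second increase of a counter
    over (timestamp, value) samples. A drop in value means the
    counter restarted, so the post-reset value is counted from zero."""
    increase = 0.0
    for (t0, v0), (t1, v1) in zip(samples, samples[1:]):
        if v1 >= v0:
            increase += v1 - v0
        else:
            increase += v1  # counter reset: v1 counts from zero again
    duration = samples[-1][0] - samples[0][0]
    return increase / duration

# 60s window; the counter resets between t=30 and t=45
samples = [(0, 100), (15, 130), (30, 160), (45, 20), (60, 50)]
print(simple_rate(samples))  # (30 + 30 + 20 + 30) / 60 ≈ 1.83 req/s
```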

Latency

# 95th percentile latency
histogram_quantile(0.95,
  rate(app_request_duration_seconds_bucket[5m])
)

# Average latency
rate(app_request_duration_seconds_sum[5m])
/
rate(app_request_duration_seconds_count[5m])
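`histogram_quantile()` works on the cumulative `_bucket` series: it finds the bucket containing the requested rank and interpolates linearly inside it. A stdlib sketch under that assumption (bucket counts below are illustrative):

```python
def histogram_quantile(q, buckets):
    """Approximate PromQL histogram_quantile(). `buckets` is a sorted
    list of (upper_bound, cumulative_count) pairs ending with the
    +Inf bucket, as exposed by *_bucket series with the `le` label."""
    total = buckets[-1][1]
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if bound == float("inf"):
                return prev_bound  # quantile falls in the open-ended bucket
            # linear interpolation inside this bucket
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count

buckets = [(0.1, 50), (0.25, 80), (0.5, 95), (1.0, 99), (float("inf"), 100)]
print(histogram_quantile(0.95, buckets))  # 0.5 — p95 sits at the 0.5s bound
```

This is also why bucket boundaries matter: the reported quantile can never be more precise than the bucket it lands in.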

Resource Usage

# CPU usage
rate(container_cpu_usage_seconds_total[5m])

# Memory usage
container_memory_usage_bytes

# Disk usage
(1 - (node_filesystem_avail_bytes / node_filesystem_size_bytes)) * 100

4. Grafana Dashboards

Import Community Dashboards

Popular dashboard IDs:
- 315 - Kubernetes cluster monitoring
- 6417 - Kubernetes deployment statefulset daemonset
- 747 - Kubernetes pod monitoring
- 1860 - Node Exporter full

Custom Dashboard Example

{
  "dashboard": {
    "title": "Application Metrics",
    "panels": [
      {
        "title": "Request Rate",
        "targets": [
          {
            "expr": "sum(rate(app_requests_total[5m])) by (endpoint)"
          }
        ],
        "type": "graph"
      },
      {
        "title": "Error Rate",
        "targets": [
          {
            "expr": "sum(rate(app_requests_total{status=~\"5..\"}[5m]))"
          }
        ],
        "type": "stat",
        "thresholds": [
          { "value": 0, "color": "green" },
          { "value": 0.01, "color": "yellow" },
          { "value": 0.05, "color": "red" }
        ]
      }
    ]
  }
}

Grafana Dashboard
Figure 3: Custom Grafana dashboard with key metrics

5. Alerting Rules

PrometheusRule

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: app-alerts
  namespace: production
spec:
  groups:
  - name: app.rules
    interval: 30s
    rules:
    # High error rate
    - alert: HighErrorRate
      expr: |
        sum(rate(app_requests_total{status=~"5.."}[5m]))
        /
        sum(rate(app_requests_total[5m]))
        > 0.05
      for: 5m
      labels:
        severity: critical
      annotations:
        summary: "High error rate detected"
        description: "Error rate is {{ $value | humanizePercentage }} (threshold: 5%)"

    # High latency
    - alert: HighLatency
      expr: |
        histogram_quantile(0.95,
          rate(app_request_duration_seconds_bucket[5m])
        ) > 1
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "High latency detected"
        description: "95th percentile latency is {{ $value }}s"

    # Pod down
    - alert: PodDown
      expr: |
        kube_pod_status_phase{phase="Running"} == 0
      for: 5m
      labels:
        severity: critical
      annotations:
        summary: "Pod {{ $labels.pod }} is down"
        description: "Pod has been down for more than 5 minutes"

    # High memory usage
    - alert: HighMemoryUsage
      expr: |
        (container_memory_usage_bytes / container_spec_memory_limit_bytes) > 0.9
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "High memory usage"
        description: "Container {{ $labels.container }} using {{ $value | humanizePercentage }} of memory limit"
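Rule expressions can be validated offline with promtool before the PrometheusRule is deployed. Note that promtool reads plain Prometheus rule files, so copy the contents of `spec.groups` into a standalone file first (`rules.yaml` and `tests.yaml` below are assumed filenames):

```shell
# Syntax-check the rule file (the groups: section, without the CRD wrapper)
promtool check rules rules.yaml

# Run unit tests against alerting rules (see the promtool "test rules" docs)
promtool test rules tests.yaml
```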

6. Alertmanager Configuration

apiVersion: v1
kind: ConfigMap
metadata:
  name: alertmanager-config
  namespace: monitoring
data:
  alertmanager.yml: |
    global:
      resolve_timeout: 5m
      slack_api_url: 'YOUR_SLACK_WEBHOOK'

    route:
      receiver: 'default-receiver'
      group_by: ['alertname', 'cluster', 'service']
      group_wait: 10s
      group_interval: 10s
      repeat_interval: 12h

      routes:
      # Critical alerts to PagerDuty
      - match:
          severity: critical
        receiver: 'pagerduty'
        continue: true

      # All alerts to Slack
      - match_re:
          severity: (warning|critical)
        receiver: 'slack-notifications'

    receivers:
    - name: 'default-receiver'

    - name: 'slack-notifications'
      slack_configs:
      - channel: '#alerts'
        title: '{{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
        send_resolved: true

    - name: 'pagerduty'
      pagerduty_configs:
      - service_key: 'YOUR_PAGERDUTY_KEY'
        description: '{{ .GroupLabels.alertname }}'

7. Log Aggregation with Loki

# Loki configuration (ConfigMap consumed by the Loki deployment)
apiVersion: v1
kind: ConfigMap
metadata:
  name: loki-config
  namespace: monitoring
data:
  loki.yaml: |
    auth_enabled: false

    server:
      http_listen_port: 3100

    ingester:
      lifecycler:
        ring:
          kvstore:
            store: inmemory
          replication_factor: 1

    schema_config:
      configs:
        - from: 2020-10-24
          store: boltdb-shipper
          object_store: s3  # requires s3 client settings under storage_config (omitted here)
          schema: v11
          index:
            prefix: index_
            period: 24h

Query Logs in Grafana

# All logs from namespace
{namespace="production"}

# Error logs
{namespace="production"} |= "error"

# JSON parsing
{namespace="production"} | json | level="error"

# Rate of errors
rate({namespace="production"} |= "error" [5m])

8. Distributed Tracing

Jaeger Setup

# Install Jaeger operator (recent releases also require cert-manager)
kubectl create namespace observability
kubectl apply -n observability \
  -f https://github.com/jaegertracing/jaeger-operator/releases/download/v1.42.0/jaeger-operator.yaml

# Create Jaeger instance
kubectl apply -f - <<EOF
apiVersion: jaegertracing.io/v1
kind: Jaeger
metadata:
  name: jaeger
  namespace: observability
spec:
  strategy: production
  storage:
    type: elasticsearch
EOF

9. Complete Observability Stack

| Component          | Purpose             | Retention |
|--------------------|---------------------|-----------|
| Prometheus         | Metrics collection  | 30 days   |
| Grafana            | Visualization       | N/A       |
| Alertmanager       | Alert routing       | N/A       |
| Loki               | Log aggregation     | 7 days    |
| Jaeger             | Distributed tracing | 7 days    |
| Node Exporter      | Node metrics        | N/A       |
| kube-state-metrics | Kubernetes metrics  | N/A       |

10. Best Practices

Metric Naming

# Good naming convention
app_request_duration_seconds
app_requests_total
app_active_connections

# Bad naming
requestTime
total_requests
connections

Label Usage

# Good - Cardinality controlled
REQUEST_COUNT.labels(
    method="GET",
    endpoint="/api/users",
    status=200
)

# Bad - High cardinality (user_id changes frequently)
REQUEST_COUNT.labels(
    user_id="12345",  # Don't do this!
    endpoint="/api/users"
)
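To see why unbounded labels hurt, multiply out the label value counts: every unique label combination becomes its own time series. The figures below are illustrative assumptions:

```python
# A metric's cardinality is the product of its label value counts.
methods, endpoints, statuses = 5, 20, 8   # bounded label sets
bounded = methods * endpoints * statuses
print(bounded)    # 800 series: cheap

daily_users = 50_000                      # unbounded label set
unbounded = daily_users * endpoints
print(unbounded)  # 1,000,000 series: enough to degrade Prometheus
```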

Monitoring Checklist

  • [ ] Prometheus installed with persistent storage
  • [ ] Grafana dashboards configured
  • [ ] Application metrics exposed
  • [ ] ServiceMonitors created
  • [ ] Alert rules defined
  • [ ] Alertmanager configured
  • [ ] Log aggregation setup
  • [ ] Tracing implemented (optional)
  • [ ] On-call rotation established

Common Metrics to Track

Golden Signals (SRE)

  • Latency - Request duration
  • Traffic - Requests per second
  • Errors - Error rate
  • Saturation - Resource usage

RED Method

  • Rate - Requests per second
  • Errors - Failed requests
  • Duration - Response time

USE Method

  • Utilization - % time resource busy
  • Saturation - Queue depth
  • Errors - Error count

Troubleshooting

# Check Prometheus targets
kubectl port-forward -n monitoring svc/prometheus-kube-prometheus-prometheus 9090:9090

# View Alertmanager
kubectl port-forward -n monitoring svc/prometheus-kube-prometheus-alertmanager 9093:9093

# Check Grafana
kubectl port-forward -n monitoring svc/prometheus-grafana 3000:80

# Validate ServiceMonitor
kubectl get servicemonitor -A

# Check Prometheus logs
kubectl logs -n monitoring prometheus-prometheus-kube-prometheus-prometheus-0 -c prometheus

Conclusion

A complete observability stack provides:
- ✅ Real-time visibility into system health
- ✅ Proactive alerting before users notice issues
- ✅ Historical data for capacity planning
- ✅ Fast troubleshooting with correlated data
- ✅ Performance insights for optimization

Start with the basics and expand as your monitoring needs grow.


Need help with observability? Contact us for monitoring and SRE consulting.