Self-Host Grafana + Prometheus: Full Metrics Stack 2026

TL;DR

Prometheus (Apache 2.0, ~56K stars, Go) is the industry-standard metrics collection and alerting toolkit. Grafana (AGPL 3.0, ~64K stars, Go + TypeScript) is the visualization layer — beautiful dashboards, alerting, and exploration. Together they form the most deployed self-hosted observability stack. Add Loki for log aggregation and you have a complete replace for DataDog ($15-23/host/month) or New Relic.

Key Takeaways

Prometheus: Time-series metrics database with a powerful query language (PromQL)
Grafana: Dashboards, alerting, and data exploration for Prometheus, Loki, and 50+ other sources
Node Exporter: Collects OS metrics (CPU, RAM, disk, network) from Linux hosts
cAdvisor: Collects Docker container metrics automatically
Loki: Log aggregation with the same label-based model as Prometheus (cheaper than Elasticsearch for logs)
Alertmanager: Routes alerts to Slack, PagerDuty, email with deduplication and silencing

The Full Stack

Your Services → Prometheus Exporters → Prometheus (scrape + store)
                                             ↓
                                        Alertmanager → Slack/PagerDuty/Email
                                             ↓
                                          Grafana (visualize + alert)

Your Apps → Promtail → Loki → Grafana (logs)

Part 1: Docker Compose — Full Stack

# docker-compose.yml
services:
  prometheus:
    image: prom/prometheus:latest
    container_name: prometheus
    restart: unless-stopped
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml:ro
      - ./prometheus/rules:/etc/prometheus/rules:ro
      - prometheus_data:/prometheus
    command:
      - "--config.file=/etc/prometheus/prometheus.yml"
      - "--storage.tsdb.path=/prometheus"
      - "--storage.tsdb.retention.time=30d"
      - "--web.console.libraries=/usr/share/prometheus/console_libraries"
      - "--web.console.templates=/usr/share/prometheus/consoles"
      - "--web.enable-lifecycle"    # Enable config hot-reload

  grafana:
    image: grafana/grafana:latest
    container_name: grafana
    restart: unless-stopped
    ports:
      - "3000:3000"
    volumes:
      - grafana_data:/var/lib/grafana
      - ./grafana/provisioning:/etc/grafana/provisioning
    environment:
      GF_SECURITY_ADMIN_PASSWORD: "${GRAFANA_PASSWORD}"
      GF_USERS_ALLOW_SIGN_UP: "false"
      GF_SERVER_DOMAIN: "grafana.yourdomain.com"
      GF_SERVER_ROOT_URL: "https://grafana.yourdomain.com"

  node-exporter:
    image: prom/node-exporter:latest
    container_name: node-exporter
    restart: unless-stopped
    pid: host
    network_mode: host
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    command:
      - "--path.procfs=/host/proc"
      - "--path.rootfs=/rootfs"
      - "--path.sysfs=/host/sys"
      - "--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)"

  cadvisor:
    image: gcr.io/cadvisor/cadvisor:latest
    container_name: cadvisor
    restart: unless-stopped
    ports:
      - "8080:8080"
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:ro
      - /sys:/sys:ro
      - /var/lib/docker/:/var/lib/docker:ro
      - /dev/disk/:/dev/disk:ro
    privileged: true
    devices:
      - /dev/kmsg

  alertmanager:
    image: prom/alertmanager:latest
    container_name: alertmanager
    restart: unless-stopped
    ports:
      - "9093:9093"
    volumes:
      - ./alertmanager/alertmanager.yml:/etc/alertmanager/alertmanager.yml:ro
      - alertmanager_data:/alertmanager

  loki:
    image: grafana/loki:latest
    container_name: loki
    restart: unless-stopped
    ports:
      - "3100:3100"
    volumes:
      - ./loki/loki-config.yml:/etc/loki/local-config.yaml:ro
      - loki_data:/loki

  promtail:
    image: grafana/promtail:latest
    container_name: promtail
    restart: unless-stopped
    volumes:
      - ./promtail/promtail-config.yml:/etc/promtail/config.yml:ro
      - /var/log:/var/log:ro
      - /var/lib/docker/containers:/var/lib/docker/containers:ro

volumes:
  prometheus_data:
  grafana_data:
  alertmanager_data:
  loki_data:

Part 2: Prometheus Configuration

# prometheus/prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

alerting:
  alertmanagers:
    - static_configs:
        - targets: ["alertmanager:9093"]

rule_files:
  - "rules/*.yml"

scrape_configs:
  # Prometheus self-monitoring:
  - job_name: "prometheus"
    static_configs:
      - targets: ["localhost:9090"]

  # Linux host metrics:
  - job_name: "node-exporter"
    static_configs:
      - targets: ["node-exporter:9100"]

  # Docker container metrics:
  - job_name: "cadvisor"
    static_configs:
      - targets: ["cadvisor:8080"]

  # Additional hosts (add more servers):
  - job_name: "remote-servers"
    static_configs:
      - targets:
          - "server2.yourdomain.com:9100"
          - "server3.yourdomain.com:9100"
    labels:
      env: production

Part 3: Alert Rules

# prometheus/rules/host-alerts.yml
groups:
  - name: host-alerts
    rules:
      - alert: HighCPUUsage
        expr: 100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 90
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
          description: "CPU usage is {{ $value | printf \"%.1f\" }}%"

      - alert: LowDiskSpace
        expr: (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) * 100 < 15
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Low disk space on {{ $labels.instance }}"
          description: "Disk is {{ $value | printf \"%.1f\" }}% free"

      - alert: HighMemoryUsage
        expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 90
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage on {{ $labels.instance }}"

      - alert: ContainerDown
        expr: absent(container_last_seen{name!=""})
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Container {{ $labels.name }} is down"

Part 4: Alertmanager Configuration

# alertmanager/alertmanager.yml
global:
  slack_api_url: "https://hooks.slack.com/services/YOUR/WEBHOOK"

route:
  group_by: ["alertname", "instance"]
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: "slack-notifications"
  routes:
    - match:
        severity: critical
      receiver: "pagerduty"

receivers:
  - name: "slack-notifications"
    slack_configs:
      - channel: "#alerts"
        title: "{{ range .Alerts }}{{ .Annotations.summary }}{{ end }}"
        text: "{{ range .Alerts }}{{ .Annotations.description }}{{ end }}"

  - name: "pagerduty"
    pagerduty_configs:
      - service_key: "your-pagerduty-key"

inhibit_rules:
  - source_match:
      severity: critical
    target_match:
      severity: warning
    equal: ["alertname", "instance"]

Part 5: Loki Configuration

# loki/loki-config.yml
auth_enabled: false

server:
  http_listen_port: 3100

ingester:
  lifecycler:
    ring:
      kvstore:
        store: inmemory
      replication_factor: 1

schema_config:
  configs:
    - from: "2024-01-01"
      store: boltdb-shipper
      object_store: filesystem
      schema: v11
      index:
        prefix: index_
        period: 24h

storage_config:
  boltdb_shipper:
    active_index_directory: /loki/index
    cache_location: /loki/index_cache
  filesystem:
    directory: /loki/chunks

limits_config:
  retention_period: 30d

# promtail/promtail-config.yml
server:
  http_listen_port: 9080

clients:
  - url: http://loki:3100/loki/api/v1/push

scrape_configs:
  - job_name: "docker"
    docker_sd_configs:
      - host: unix:///var/run/docker.sock
        refresh_interval: 5s
    relabel_configs:
      - source_labels: ["__meta_docker_container_name"]
        target_label: "container"
      - source_labels: ["__meta_docker_compose_service"]
        target_label: "service"

Part 6: Grafana Setup

Visit http://your-server:3000
Log in with admin / your GRAFANA_PASSWORD
Configuration → Data Sources → Add data source:
- Prometheus: URL http://prometheus:9090
- Loki: URL http://loki:3100

Import Pre-built Dashboards

In Grafana → + → Import:

Dashboard	ID	What it shows
Node Exporter Full	`1860`	CPU, RAM, disk, network for Linux hosts
Docker / cAdvisor	`14282`	Container CPU, RAM, network
Nginx	`9614`	Nginx requests, errors, latency
PostgreSQL	`9628`	Queries, connections, cache hit rate
Redis	`11835`	Memory, ops/s, hit rate
Alertmanager	`9578`	Active alerts, notification history

Part 7: HTTPS with Caddy

grafana.yourdomain.com {
    reverse_proxy grafana:3000
}

prometheus.yourdomain.com {
    basicauth {
        admin $2a$14$...
    }
    reverse_proxy prometheus:9090
}

PromQL Quick Reference

# CPU usage % per host:
100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# RAM available GB:
node_memory_MemAvailable_bytes / 1024 / 1024 / 1024

# Disk usage %:
100 - ((node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) * 100)

# Container CPU usage:
rate(container_cpu_usage_seconds_total{name!=""}[5m]) * 100

# HTTP requests/s by service:
rate(http_requests_total[5m])

# Error rate %:
rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m]) * 100

Maintenance

# Update all services:
docker compose pull
docker compose up -d

# Reload Prometheus config (no restart):
curl -X POST http://localhost:9090/-/reload

# Backup Grafana:
tar -czf grafana-backup-$(date +%Y%m%d).tar.gz \
  $(docker volume inspect grafana_data --format '{{.Mountpoint}}')

# Check Prometheus targets:
curl http://localhost:9090/api/v1/targets | jq '.data.activeTargets[].health'

See all open source monitoring and observability tools at OSSAlt.com/categories/monitoring.

Comments