Self-Host Grafana + Prometheus: Full Metrics Stack 2026
OSSAlt Team
Tags: grafana, prometheus, monitoring, metrics, alertmanager, loki, self-hosting, 2026
TL;DR
Prometheus (Apache 2.0, ~56K stars, Go) is the industry-standard metrics collection and alerting toolkit. Grafana (AGPL 3.0, ~64K stars, Go + TypeScript) is the visualization layer — beautiful dashboards, alerting, and exploration. Together they form the most widely deployed self-hosted observability stack. Add Loki for log aggregation and you have a complete replacement for Datadog ($15-23/host/month) or New Relic.
Key Takeaways
- Prometheus: Time-series metrics database with a powerful query language (PromQL)
- Grafana: Dashboards, alerting, and data exploration for Prometheus, Loki, and 50+ other sources
- Node Exporter: Collects OS metrics (CPU, RAM, disk, network) from Linux hosts
- cAdvisor: Collects Docker container metrics automatically
- Loki: Log aggregation with the same label-based model as Prometheus (cheaper than Elasticsearch for logs)
- Alertmanager: Routes alerts to Slack, PagerDuty, email with deduplication and silencing
The Full Stack
Your Services → Prometheus Exporters → Prometheus (scrape + store)
                                          ├─→ Alertmanager → Slack/PagerDuty/Email
                                          └─→ Grafana (visualize + alert)
Your Apps → Promtail → Loki → Grafana (logs)
Part 1: Docker Compose — Full Stack
# docker-compose.yml
services:
  prometheus:
    image: prom/prometheus:latest
    container_name: prometheus
    restart: unless-stopped
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml:ro
      - ./prometheus/rules:/etc/prometheus/rules:ro
      - prometheus_data:/prometheus
    command:
      - "--config.file=/etc/prometheus/prometheus.yml"
      - "--storage.tsdb.path=/prometheus"
      - "--storage.tsdb.retention.time=30d"
      - "--web.enable-lifecycle" # Enable config hot-reload via /-/reload
      # Note: the old --web.console.* flags were removed in Prometheus 3.0
      # and will prevent startup on the :latest image.
  grafana:
    image: grafana/grafana:latest
    container_name: grafana
    restart: unless-stopped
    ports:
      - "3000:3000"
    volumes:
      - grafana_data:/var/lib/grafana
      - ./grafana/provisioning:/etc/grafana/provisioning
    environment:
      GF_SECURITY_ADMIN_PASSWORD: "${GRAFANA_PASSWORD}"
      GF_USERS_ALLOW_SIGN_UP: "false"
      GF_SERVER_DOMAIN: "grafana.yourdomain.com"
      GF_SERVER_ROOT_URL: "https://grafana.yourdomain.com"
  node-exporter:
    image: prom/node-exporter:latest
    container_name: node-exporter
    restart: unless-stopped
    pid: host
    # Stays on the compose network so Prometheus can scrape node-exporter:9100;
    # network_mode: host would take it off that network and break name resolution.
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    command:
      - "--path.procfs=/host/proc"
      - "--path.rootfs=/rootfs"
      - "--path.sysfs=/host/sys"
      - "--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)"
  cadvisor:
    image: gcr.io/cadvisor/cadvisor:latest
    container_name: cadvisor
    restart: unless-stopped
    ports:
      - "8080:8080"
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:ro
      - /sys:/sys:ro
      - /var/lib/docker/:/var/lib/docker:ro
      - /dev/disk/:/dev/disk:ro
    privileged: true
    devices:
      - /dev/kmsg
  alertmanager:
    image: prom/alertmanager:latest
    container_name: alertmanager
    restart: unless-stopped
    ports:
      - "9093:9093"
    volumes:
      - ./alertmanager/alertmanager.yml:/etc/alertmanager/alertmanager.yml:ro
      - alertmanager_data:/alertmanager
  loki:
    image: grafana/loki:latest
    container_name: loki
    restart: unless-stopped
    ports:
      - "3100:3100"
    volumes:
      - ./loki/loki-config.yml:/etc/loki/local-config.yaml:ro
      - loki_data:/loki
  promtail:
    image: grafana/promtail:latest
    container_name: promtail
    restart: unless-stopped
    volumes:
      - ./promtail/promtail-config.yml:/etc/promtail/config.yml:ro
      - /var/log:/var/log:ro
      - /var/lib/docker/containers:/var/lib/docker/containers:ro
      # Required by the docker_sd_configs service discovery in Part 5:
      - /var/run/docker.sock:/var/run/docker.sock:ro
volumes:
  prometheus_data:
  grafana_data:
  alertmanager_data:
  loki_data:
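Before starting the stack, the bind-mounted config directories need to exist and GRAFANA_PASSWORD needs a value. A minimal bootstrap, assuming you run it in the same directory as docker-compose.yml (directory names match the volume mounts above):

```shell
# Create the directory layout the volume mounts expect
mkdir -p prometheus/rules grafana/provisioning alertmanager loki promtail

# Generate a Grafana admin password; docker compose reads .env automatically
echo "GRAFANA_PASSWORD=$(head -c 18 /dev/urandom | base64)" > .env

# After dropping in the config files from the sections below:
#   docker compose up -d
#   docker compose ps
```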
Part 2: Prometheus Configuration
# prometheus/prometheus.yml
global:
scrape_interval: 15s
evaluation_interval: 15s
alerting:
alertmanagers:
- static_configs:
- targets: ["alertmanager:9093"]
rule_files:
- "rules/*.yml"
scrape_configs:
# Prometheus self-monitoring:
- job_name: "prometheus"
static_configs:
- targets: ["localhost:9090"]
# Linux host metrics:
- job_name: "node-exporter"
static_configs:
- targets: ["node-exporter:9100"]
# Docker container metrics:
- job_name: "cadvisor"
static_configs:
- targets: ["cadvisor:8080"]
# Additional hosts (add more servers):
- job_name: "remote-servers"
static_configs:
- targets:
- "server2.yourdomain.com:9100"
- "server3.yourdomain.com:9100"
labels:
env: production
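Hard-coding targets means editing prometheus.yml for every new server. An alternative is file-based service discovery, where Prometheus watches a targets file and picks up changes automatically; a sketch (the targets directory and file contents are illustrative, and the directory would need to be added to the Prometheus volume mounts):

```yaml
# Additional scrape_configs entry in prometheus.yml:
- job_name: "file-discovered"
  file_sd_configs:
    - files:
        - /etc/prometheus/targets/*.json
      refresh_interval: 1m

# /etc/prometheus/targets/servers.json would then look like:
# [
#   { "targets": ["server4.yourdomain.com:9100"], "labels": { "env": "production" } }
# ]
```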
Part 3: Alert Rules
# prometheus/rules/host-alerts.yml
groups:
- name: host-alerts
rules:
- alert: HighCPUUsage
expr: 100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 90
for: 5m
labels:
severity: warning
annotations:
summary: "High CPU usage on {{ $labels.instance }}"
description: "CPU usage is {{ $value | printf \"%.1f\" }}%"
- alert: LowDiskSpace
expr: (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) * 100 < 15
for: 5m
labels:
severity: critical
annotations:
summary: "Low disk space on {{ $labels.instance }}"
description: "Disk is {{ $value | printf \"%.1f\" }}% free"
- alert: HighMemoryUsage
expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 90
for: 5m
labels:
severity: warning
annotations:
summary: "High memory usage on {{ $labels.instance }}"
- alert: ContainerDown
expr: absent(container_last_seen{name!=""})
for: 1m
labels:
severity: critical
annotations:
summary: "Container {{ $labels.name }} is down"
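One catch-all rule worth adding to any stack: Prometheus sets the `up` metric to 0 whenever a scrape fails, which catches dead exporters and unreachable hosts regardless of cause. A sketch (the file name is illustrative):

```yaml
# prometheus/rules/availability.yml
groups:
  - name: availability
    rules:
      - alert: TargetDown
        expr: up == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Target {{ $labels.instance }} (job {{ $labels.job }}) is unreachable"
```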
Part 4: Alertmanager Configuration
# alertmanager/alertmanager.yml
global:
slack_api_url: "https://hooks.slack.com/services/YOUR/WEBHOOK"
route:
group_by: ["alertname", "instance"]
group_wait: 30s
group_interval: 5m
repeat_interval: 4h
receiver: "slack-notifications"
routes:
- match:
severity: critical
receiver: "pagerduty"
receivers:
- name: "slack-notifications"
slack_configs:
- channel: "#alerts"
title: "{{ range .Alerts }}{{ .Annotations.summary }}{{ end }}"
text: "{{ range .Alerts }}{{ .Annotations.description }}{{ end }}"
- name: "pagerduty"
pagerduty_configs:
- service_key: "your-pagerduty-key"
inhibit_rules:
- source_match:
severity: critical
target_match:
severity: warning
equal: ["alertname", "instance"]
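If you would rather not depend on Slack or PagerDuty, Alertmanager can also deliver over plain SMTP. A sketch of an extra receiver for alertmanager.yml — the addresses, smarthost, and credentials are placeholders:

```yaml
- name: "email"
  email_configs:
    - to: "ops@yourdomain.com"
      from: "alertmanager@yourdomain.com"
      smarthost: "smtp.yourdomain.com:587"
      auth_username: "alertmanager@yourdomain.com"
      auth_password: "your-smtp-password"
```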
Part 5: Loki Configuration
# loki/loki-config.yml
auth_enabled: false
server:
http_listen_port: 3100
ingester:
lifecycler:
ring:
kvstore:
store: inmemory
replication_factor: 1
schema_config:
configs:
- from: "2024-01-01"
store: boltdb-shipper
object_store: filesystem
schema: v11
index:
prefix: index_
period: 24h
storage_config:
boltdb_shipper:
active_index_directory: /loki/index
cache_location: /loki/index_cache
filesystem:
directory: /loki/chunks
limits_config:
retention_period: 30d
# promtail/promtail-config.yml
server:
  http_listen_port: 9080

clients:
  - url: http://loki:3100/loki/api/v1/push

scrape_configs:
  - job_name: "docker"
    docker_sd_configs:
      - host: unix:///var/run/docker.sock
        refresh_interval: 5s
    relabel_configs:
      # Docker reports container names as "/name" — strip the leading slash:
      - source_labels: ["__meta_docker_container_name"]
        regex: "/(.*)"
        target_label: "container"
      # The compose service name is exposed as a container label:
      - source_labels: ["__meta_docker_container_label_com_docker_compose_service"]
        target_label: "service"
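Once Promtail ships logs, you query them in Grafana's Explore view with LogQL, filtering on the labels attached by the relabeling config above. A few starting points (the `container` and `service` label names match that config):

```
# All logs from one container:
{container="grafana"}

# Only lines containing "error", case-insensitive:
{container="grafana"} |~ "(?i)error"

# Error lines per second across a service, over 5-minute windows:
rate({service="api"} |= "error" [5m])
```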
Part 6: Grafana Setup
- Visit
http://your-server:3000 - Log in with
admin/ yourGRAFANA_PASSWORD - Configuration → Data Sources → Add data source:
- Prometheus: URL
http://prometheus:9090 - Loki: URL
http://loki:3100
- Prometheus: URL
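Since the compose file already mounts ./grafana/provisioning, both data sources can instead be provisioned declaratively, so a fresh container comes up preconfigured; a minimal sketch:

```yaml
# grafana/provisioning/datasources/datasources.yml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
  - name: Loki
    type: loki
    access: proxy
    url: http://loki:3100
```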
Import Pre-built Dashboards
In Grafana → + → Import:
| Dashboard | ID | What it shows |
|---|---|---|
| Node Exporter Full | 1860 | CPU, RAM, disk, network for Linux hosts |
| Docker / cAdvisor | 14282 | Container CPU, RAM, network |
| Nginx | 9614 | Nginx requests, errors, latency |
| PostgreSQL | 9628 | Queries, connections, cache hit rate |
| Redis | 11835 | Memory, ops/s, hit rate |
| Alertmanager | 9578 | Active alerts, notification history |
Part 7: HTTPS with Caddy
grafana.yourdomain.com {
reverse_proxy grafana:3000
}
prometheus.yourdomain.com {
basicauth {
admin $2a$14$...
}
reverse_proxy prometheus:9090
}
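Alertmanager's UI (port 9093) is worth publishing the same way, since it is where you manage silences; a sketch using the same bcrypt hash placeholder (`basic_auth` is the Caddy 2.8+ spelling):

```
alertmanager.yourdomain.com {
    basic_auth {
        admin $2a$14$...
    }
    reverse_proxy alertmanager:9093
}
```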
PromQL Quick Reference
# CPU usage % per host:
100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
# RAM available GB:
node_memory_MemAvailable_bytes / 1024 / 1024 / 1024
# Disk usage %:
100 - ((node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) * 100)
# Container CPU usage:
rate(container_cpu_usage_seconds_total{name!=""}[5m]) * 100
# HTTP requests/s by service:
rate(http_requests_total[5m])
# Error rate %:
rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m]) * 100
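Expressions you graph or alert on repeatedly can be precomputed with recording rules, so dashboards query a cheap stored series instead of re-evaluating the full expression every refresh. A sketch (illustrative file name; the rule name follows the conventional level:metric:operations pattern):

```yaml
# prometheus/rules/recording.yml
groups:
  - name: recording
    rules:
      - record: instance:node_cpu_utilisation:rate5m
        expr: 100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
```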
Maintenance
# Update all services:
docker compose pull
docker compose up -d
# Reload Prometheus config (no restart):
curl -X POST http://localhost:9090/-/reload
# Backup Grafana (note: compose may prefix the volume name with the
# project name, e.g. monitoring_grafana_data — check `docker volume ls`):
tar -czf grafana-backup-$(date +%Y%m%d).tar.gz \
  $(docker volume inspect grafana_data --format '{{.Mountpoint}}')
# Check Prometheus targets:
curl http://localhost:9090/api/v1/targets | jq '.data.activeTargets[].health'
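Routine upkeep is easy to automate. An illustrative crontab for a nightly Grafana backup and a weekly image refresh — the /backups and /opt/monitoring paths, and the unprefixed volume path, are assumptions to adapt (note that % must be escaped as \% inside crontab entries):

```
# crontab -e (as root)
0 3 * * * tar -czf /backups/grafana-$(date +\%Y\%m\%d).tar.gz -C /var/lib/docker/volumes/grafana_data/_data .
0 4 * * 0 cd /opt/monitoring && docker compose pull && docker compose up -d
```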
See all open source monitoring and observability tools at OSSAlt.com/categories/monitoring.