Best Open Source Monitoring Tools in 2026

TL;DR

Datadog costs $15–23/host/month, so a 10-server infrastructure costs $1,800–2,760/year. Uptime Kuma replaces Better Stack and Pingdom with 20+ monitor types and 90+ notification channels on 256 MB RAM. Grafana + Prometheus replaces Datadog for infrastructure metrics and dashboards. For a small fleet, the full open source observability stack runs on a single VPS, and there is no per-host fee as you add servers.

Key Takeaways

Uptime Kuma (MIT, 62K+ stars) is the best uptime/status monitoring tool: 20+ monitor types, polished status pages, and Docker container monitoring
Grafana (AGPL-3.0, 65K+ stars) is the universal visualization layer for metrics, logs, and traces from any data source
Prometheus (Apache-2.0, 56K+ stars) is the dominant open source time-series metrics system with a powerful query language (PromQL)
Netdata (GPL-3.0, 72K+ stars) provides 1-second real-time monitoring with zero configuration and ML-based anomaly detection
Grafana Loki (AGPL-3.0, 24K+ stars) is the lightweight log aggregation system designed to work alongside Prometheus
A complete self-hosted monitoring stack (Uptime Kuma + Grafana + Prometheus + Loki) costs $105–210/year for a VPS (roughly $9–18/month) vs $1,800+/year for Datadog on 10 hosts

Building a Layered Monitoring Strategy

Monitoring isn't one problem — it's four:

Is it up? — Uptime monitoring (Uptime Kuma)
How is it performing? — Metrics collection (Prometheus + Grafana)
What happened? — Log aggregation (Loki or OpenSearch)
What's broken right now? — Real-time monitoring (Netdata)

Commercial tools like Datadog try to solve all four in one platform. Open source tools solve each layer independently and compose well together. A common combination is Grafana Labs' "LGTM stack": Loki (logs), Grafana (visualization), Tempo (traces), and Mimir (metrics), with Prometheus often filling the metrics role in smaller deployments.

Uptime Kuma — Best Uptime Monitoring

Uptime Kuma is one of the most popular self-hosted tools on GitHub — 62K+ stars, ranking among the top 200 repositories globally. The project earns that popularity by nailing a specific job: tell you when something is down, and make the status page beautiful.

Monitor types cover every uptime check you need:

HTTP/HTTPS with expected status codes and keyword matching
TCP port monitoring
Ping (ICMP)
DNS record monitoring
Docker container status via Docker socket
Push monitors for cron jobs and scheduled tasks (heartbeat-style)
Real Browser monitoring via a headless Chromium browser
GameDig (game server status)
MQTT
RDP, RADIUS

Notification integrations span 90+ destinations: Slack, Discord, Telegram, PagerDuty, OpsGenie, email (SMTP), webhook, Pushover, ntfy, Gotify, Matrix, and many others. Each monitor can have multiple notification channels attached, so critical services can page on-call while lower-priority checks post to chat.

# Uptime Kuma Docker Compose
services:
  uptime-kuma:
    image: louislam/uptime-kuma:latest
    restart: unless-stopped
    ports:
      - "3001:3001"
    volumes:
      - uptime_kuma_data:/app/data
      - /var/run/docker.sock:/var/run/docker.sock  # For Docker monitoring
volumes:
  uptime_kuma_data:

Status pages are first-class features. You configure which monitors appear on a public status page, group them by service category, and customize the page with your logo and domain. Companies use Uptime Kuma status pages as their public incident communication pages.

Key features:

20+ monitor types
90+ notification channels
Public and private status pages
Multiple status pages per instance
Maintenance windows (suppress alerts during planned downtime)
Certificate monitoring (TLS/SSL certificate info and expiry alerts)
Docker container monitoring
Push/heartbeat monitors for cron jobs
Two-factor authentication for admin
256 MB RAM footprint

Grafana + Prometheus — Best Metrics Stack

Grafana and Prometheus are designed to work together and form the backbone of most open source observability setups. Prometheus collects metrics; Grafana visualizes them. They're deployed separately but integrate deeply.

Prometheus scrapes metrics from your services at configurable intervals. It discovers targets via static config, Kubernetes service discovery, AWS EC2, Consul, and many other mechanisms. Exporters translate metrics from systems that don't natively expose Prometheus metrics: there are exporters for MySQL, PostgreSQL, Redis, NGINX, HAProxy, and 200+ other services. Application code (Node.js, Python, and other runtimes) is instrumented directly with the Prometheus client libraries.

PromQL (Prometheus Query Language) is one of the most expressive query languages for time-series data:

# 95th percentile request latency over the last 5 minutes, aggregated across all series
histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))

# Memory usage as a percentage of total memory
100 - (100 * node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)

# Alert: Error rate above 1% over 5 minutes
sum(rate(http_requests_total{status=~"5.."}[5m])) /
sum(rate(http_requests_total[5m])) > 0.01

# Prometheus + Grafana + Node Exporter
services:
  prometheus:
    image: prom/prometheus:latest
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.retention.time=30d'
    ports:
      - "9090:9090"
  grafana:
    image: grafana/grafana:latest
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=your-admin-password
      - GF_USERS_ALLOW_SIGN_UP=false
    volumes:
      - grafana_data:/var/lib/grafana
    ports:
      - "3000:3000"
    depends_on:
      - prometheus
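  node-exporter:
    image: prom/node-exporter:latest
    pid: host
    network_mode: host
    command:
      # Point the exporter at the host filesystems mounted below
      - '--path.procfs=/host/proc'
      - '--path.sysfs=/host/sys'
      - '--path.rootfs=/rootfs'
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro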
volumes:
  prometheus_data:
  grafana_data:
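
The Compose file above mounts ./prometheus.yml, but the file itself is not shown. A minimal sketch of what it could contain follows, assuming a 15-second scrape interval and a node exporter reachable on port 9100 of the Docker host. The job names and the 192.0.2.10 address are placeholders, not values from this setup.

```yaml
# prometheus.yml (illustrative sketch; job names and addresses are assumptions)
global:
  scrape_interval: 15s      # how often targets are scraped
  evaluation_interval: 15s  # how often rules are evaluated

scrape_configs:
  # Prometheus scraping its own metrics endpoint
  - job_name: prometheus
    static_configs:
      - targets: ["localhost:9090"]

  # Node Exporter runs with network_mode: host, so it listens on the
  # Docker host's port 9100 rather than on the Compose network.
  - job_name: node
    static_configs:
      - targets: ["192.0.2.10:9100"]  # placeholder: replace with the Docker host's address
```

Additional servers can be monitored by running a node exporter on each and appending its address to the targets list.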

Grafana's dashboard ecosystem is extensive — thousands of community dashboards available at grafana.com/grafana/dashboards cover infrastructure, databases, Kubernetes, cloud providers, and application frameworks. Import a dashboard by ID and it's ready in seconds.

Grafana Alerting fires notifications to Slack, PagerDuty, OpsGenie, email, and webhooks when metrics cross thresholds. Silence rules suppress alerts during maintenance. Contact points and notification policies control routing.

Key features (Grafana):

Universal data source support (Prometheus, Loki, InfluxDB, PostgreSQL, MySQL, Elasticsearch, CloudWatch, and 50+ more)
1,000+ community dashboards
Alert rules with routing and silencing
Annotations for deployment events
User permissions and teams
Embedded dashboards in other applications
Plugin system
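
Data sources can be added in the UI, but Grafana also reads provisioning files at startup. The sketch below assumes the Compose service names prometheus and loki used in this article and a file mounted under /etc/grafana/provisioning/datasources/ (the file name itself is arbitrary).

```yaml
# datasources.yml (illustrative; URLs assume the Compose service names used in this article)
apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    access: proxy            # Grafana's backend proxies the queries
    url: http://prometheus:9090
    isDefault: true

  - name: Loki
    type: loki
    access: proxy
    url: http://loki:3100
```

Keeping this file in version control means a rebuilt Grafana container comes up with its data sources already connected.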

Netdata — Best Real-Time Monitoring

Netdata's value proposition is unique: it shows you what your server is doing right now, at 1-second resolution, with zero configuration. Deploy the Netdata agent on a server, and within 30 seconds you have live dashboards for CPU, memory, disk I/O, network, running processes, Docker containers, and any services it auto-detects.

The auto-discovery is genuinely impressive. Netdata detects and starts monitoring MySQL, PostgreSQL, Redis, MongoDB, NGINX, Apache, HAProxy, and 400+ other services automatically based on what's running — no manual configuration of exporters or scrape configs.

ML-based anomaly detection runs on every metric. Netdata builds a baseline of "normal" behavior for each metric and surfaces anomalies in the UI. This proactive alerting catches unusual patterns before they become incidents.

# Netdata install: download the kickstart script, review it, then run it
curl -fsSL https://get.netdata.cloud/kickstart.sh -o /tmp/netdata-kickstart.sh
sh /tmp/netdata-kickstart.sh

The Netdata agent is lightweight (256 MB RAM) and can stream metrics to a centralized Netdata parent node, or export them to an external TSDB (TimescaleDB, Prometheus, InfluxDB) for long-term retention.
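
For the Prometheus route, one option is to let Prometheus scrape the agent's metrics endpoint directly. The sketch below assumes the agent's default port 19999. The job name and the 192.0.2.20 address are placeholders.

```yaml
# Fragment for the scrape_configs section of prometheus.yml (illustrative)
scrape_configs:
  - job_name: netdata
    metrics_path: /api/v1/allmetrics
    params:
      format: [prometheus]   # ask Netdata for the Prometheus text format
    honor_labels: true
    static_configs:
      - targets: ["192.0.2.20:19999"]  # placeholder Netdata host
```

Prometheus stores these samples at its own scrape interval, not at Netdata's 1-second resolution.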

Grafana Loki — Log Aggregation

Loki is Grafana Labs' log aggregation system, designed to be "like Prometheus, but for logs." The key architectural difference from Elasticsearch/OpenSearch: Loki indexes log labels (metadata) but not log content. You stream logs with labels and search by label first, then filter content with string matching.

This keeps storage costs low — Loki compresses log content efficiently and only indexes the small label set. For log volumes that would be expensive in OpenSearch (hundreds of GB/month), Loki is dramatically cheaper.

# Add Loki to your Prometheus/Grafana stack
services:
  loki:
    image: grafana/loki:latest
    ports:
      - "3100:3100"
    volumes:
      - loki_data:/loki
  promtail:
    image: grafana/promtail:latest
    volumes:
      - /var/log:/var/log:ro
      - /var/run/docker.sock:/var/run/docker.sock
      - ./promtail-config.yml:/etc/promtail/config.yml
volumes:
  loki_data:
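
The Compose file mounts ./promtail-config.yml without showing it. A minimal sketch that tails files under /var/log and pushes them to the loki service could look like this. The job and label names are illustrative choices.

```yaml
# promtail-config.yml (illustrative sketch)
server:
  http_listen_port: 9080
  grpc_listen_port: 0

positions:
  filename: /tmp/positions.yaml   # where Promtail records how far it has read

clients:
  - url: http://loki:3100/loki/api/v1/push

scrape_configs:
  - job_name: system
    static_configs:
      - targets: [localhost]
        labels:
          job: varlogs               # label used to select these logs in Grafana
          __path__: /var/log/*.log   # files to tail inside the container
```

In Grafana, these logs can then be selected with the label query {job="varlogs"}. Because /tmp/positions.yaml lives inside the container, read positions are lost when the container is recreated unless that path is placed on a volume.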

The Complete Self-Hosted Stack

Concern	Tool	Purpose
Is it up?	Uptime Kuma	HTTP, TCP, DNS, Docker monitoring + status page
CPU/memory/disk	Prometheus + Node Exporter	System metrics collection
Visualize everything	Grafana	Dashboards, alerting, annotations
Logs	Loki + Promtail	Log aggregation and search
Real-time	Netdata	1-second granularity, auto-discovery
Public status	OpenStatus	User-facing status page

This stack fits on a Hetzner CPX21 (3 vCPU, 4 GB RAM, €8.79/month) for environments with 5–10 monitored servers.

Cost Comparison

Solution	Cost	Coverage
Datadog (10 hosts)	$1,800–2,760/year	Full observability
Better Stack (Pro)	$1,020/year	Uptime + logs
Grafana Cloud (free tier)	$0 (limited)	10K metrics, 50GB logs
Full self-hosted stack	$105–210/year (VPS)	Unlimited

Alerting Rules and Incident Response

The monitoring stack is only as useful as its alerting. Raw metrics in Grafana dashboards require someone to watch the dashboard — useful for post-incident investigation, but not for catching issues at 3 AM. Effective self-hosted monitoring requires configuring alert rules that fire automatically and routing those alerts to the right people.
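
As a rough sketch of what such a rule can look like in Prometheus, the file below turns the error-rate query shown earlier into an alert. The group name, alert name, and severity label are illustrative, and the file is assumed to be listed under rule_files in prometheus.yml.

```yaml
# alert-rules.yml (illustrative; names and thresholds are assumptions)
groups:
  - name: http-errors
    rules:
      - alert: HighErrorRate
        # Share of 5xx responses across all requests over 5 minutes
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m])) > 0.01
        for: 5m                 # condition must hold this long before firing
        labels:
          severity: critical
        annotations:
          summary: "5xx error rate above 1% for 5 minutes"
```

The for clause keeps a brief spike from paging anyone, and the severity label is what downstream routing can match on.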

Prometheus AlertManager is the standard alerting layer for Prometheus-based stacks. Prometheus evaluates alerting rules and sends the resulting alerts to AlertManager, which deduplicates them (so you don't get 50 identical alerts for the same disk spike), groups related alerts, and routes them to notification channels. Configuration is YAML-based: define inhibition rules (suppress disk space alerts when the server is already alerted for being down), grouping logic (batch all alerts from the same host into one notification), and routing trees that direct different alert severity levels to different receivers.

The most common routing configuration for small teams sends critical alerts to PagerDuty or OpsGenie (for on-call paging) and warning alerts to Slack (for async awareness). AlertManager integrates directly with both. The routing YAML lets you specify matchers — route alerts where severity="critical" and service="database" to the database-team Slack channel, while routing severity="warning" alerts to a general-ops channel.
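
A sketch of such a configuration follows. The receiver names, channels, alert names (InstanceDown, DiskSpaceLow), webhook URLs, and the PagerDuty key are all placeholders chosen for illustration.

```yaml
# alertmanager.yml (illustrative; receivers, channels, and keys are placeholders)
route:
  receiver: ops-slack              # fallback when no child route matches
  group_by: [alertname, instance]  # batch alerts from the same host
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    # Child routes are tried in order; the first match wins.
    - matchers:
        - severity="critical"
        - service="database"
      receiver: database-team-slack
    - matchers:
        - severity="critical"
      receiver: oncall-pagerduty
    - matchers:
        - severity="warning"
      receiver: ops-slack

receivers:
  - name: ops-slack
    slack_configs:
      - api_url: https://hooks.slack.com/services/REPLACE/WITH/WEBHOOK
        channel: "#general-ops"
  - name: database-team-slack
    slack_configs:
      - api_url: https://hooks.slack.com/services/REPLACE/WITH/WEBHOOK
        channel: "#database-team"
  - name: oncall-pagerduty
    pagerduty_configs:
      - routing_key: REPLACE_WITH_INTEGRATION_KEY

inhibit_rules:
  # While a host is reported down, suppress its disk space alerts.
  - source_matchers:
      - alertname="InstanceDown"
    target_matchers:
      - alertname="DiskSpaceLow"
    equal: [instance]
```

Placing the more specific database route first matters here, since a critical database alert would otherwise be caught by the general critical route.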

Grafana's built-in alerting (unified alerting, introduced in Grafana 8) provides an alternative to Prometheus AlertManager that works across data sources: not just Prometheus, but also Loki logs, InfluxDB, and other Grafana-connected databases. Grafana alerts support multi-dimensional alert rules, where a single rule watches every instance of a metric and fires per-instance alerts. This is useful for "alert me when any monitored server's disk is above 90%": one rule covers all servers dynamically. Grafana's alerting also integrates with its on-call routing, which distributes pages based on schedules.

Silence and maintenance windows prevent alert storms during planned maintenance. Before running kernel updates or restarting services, create a silence window in AlertManager or Grafana that suppresses alerts from the affected hosts for the expected maintenance duration. Without silences, every planned maintenance event generates dozens of false-positive alerts that train your team to ignore alert noise — the fastest way to miss a real incident.
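
Silences are usually created ad hoc for a single maintenance event. For maintenance that recurs on a schedule, AlertManager's configuration file also supports time intervals that mute a route, which is a different mechanism from a silence. The sketch below assumes a Sunday 02:00–04:00 UTC patch window and a hypothetical maintenance_group label on the affected alerts.

```yaml
# alertmanager.yml fragment (illustrative; label and names are assumptions)
time_intervals:
  - name: weekly-patching
    time_intervals:
      - weekdays: ["sunday"]
        times:
          - start_time: "02:00"   # UTC unless a location is set
            end_time: "04:00"

route:
  receiver: ops
  routes:
    # Alerts carrying this label are not notified during the window.
    - matchers:
        - maintenance_group="weekly"
      receiver: ops
      mute_time_intervals: [weekly-patching]

receivers:
  - name: ops
```

Unplanned or one-off maintenance is still better handled with a silence, since it expires on its own and does not require a configuration change.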

Uptime Kuma's alert logic is simpler: set a check interval (for example, every 60 seconds), define how many consecutive failures are required before alerting (2–3 failures prevent single-packet-loss false positives), and configure notification channels. Uptime Kuma supports 90+ notification integrations natively, including Telegram, Discord, Slack, email, PagerDuty, and webhooks. For small teams monitoring a handful of services, Uptime Kuma's simple alert model is often sufficient. Complex multi-condition alerts require Prometheus alerting rules routed through AlertManager, or Grafana alerting.

Runbook links in alerts. The most underused alerting feature is runbook integration. An alert that fires and routes to your on-call channel is useful. An alert that fires, routes to your on-call channel, and includes a link to the runbook that describes how to investigate and resolve this specific alert class is transformative. Both Prometheus AlertManager and Grafana alerting support annotations on alert rules where you can embed a runbook URL. Maintaining runbook links requires discipline — they decay as runbooks move and procedures change — but the investment reduces mean time to resolution on repeated alert types. Create a runbook stub for every new alert rule at the same time you write the alert rule.
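
A sketch of a Prometheus rule that carries a runbook link is shown below. The annotation key runbook_url is a common convention rather than a built-in field, and the alert name, threshold, and example.com URL are placeholders.

```yaml
# host-alerts.yml (illustrative; names, threshold, and URL are assumptions)
groups:
  - name: host-alerts
    rules:
      - alert: DiskSpaceLow
        # Percentage of filesystem space in use, per instance and mountpoint
        expr: |
          100 * (1 - node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"}
                   / node_filesystem_size_bytes{fstype!~"tmpfs|overlay"}) > 90
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Disk usage above 90% on {{ $labels.instance }} ({{ $labels.mountpoint }})"
          runbook_url: "https://runbooks.example.com/disk-space-low"
```

Notification templates can then surface the annotation, so the link arrives alongside the alert in Slack or the paging tool.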

Dashboard as documentation. Grafana dashboards serve dual purpose: real-time operational visibility and historical analysis. Design dashboards with this dual purpose in mind. Include text panels with brief explanations of what each graph represents and what normal values look like. Annotate graphs with deployment markers (Grafana's annotation feature lets you mark when a deployment happened, so you can correlate performance changes to specific code changes). A dashboard that a new engineer can interpret without guidance is significantly more valuable than one that requires tribal knowledge.

For the full Grafana + Prometheus + Loki setup guide, see Grafana + Prometheus + Loki self-hosted observability stack 2026.
