Monitoring & Observability

🌱 Seedling — SLOs and alerting stack documented. Loki and Grafana dashboards as code still TODO.

The Problem I Was Solving

A self-hosted stack with no monitoring is a black box. You find out something is broken when a service stops responding — not when it starts degrading.

I wanted to know about problems before users do (even if the only "user" is me).


Stack

Netdata       → realtime metrics, live web UI, zero config
Prometheus    → persistent metrics, 30-day retention, PromQL
node-exporter → host-level metrics scraped by Prometheus
Alertmanager  → routes Prometheus alerts → Telegram

Why both Netdata and Prometheus?

TRIZ contradiction: realtime visibility vs persistent history.

Netdata is great for "what is happening right now": it auto-discovers Docker containers and shows per-container CPU/RAM without any configuration. But in my setup it keeps only ~1 hour of history (retention depends on how Netdata's database is configured).

Prometheus solves the retention problem and handles alerting rules. They don't overlap — both stay.


SLOs for eigenstack

Defined before setting up alerts, not after.

| Metric | Target | Alert |
| --- | --- | --- |
| Uptime | 99.5% | Down > 5 min → CRITICAL |
| Response time | p95 < 500 ms | None (tracked, not alerted) |
| Disk usage | < 80% | > 80% → WARNING, > 90% → CRITICAL |
| Memory usage | < 75% | > 85% → WARNING |
| Backup success | 100% daily | Any failure → CRITICAL |
| SSL expiry | > 7 days | < 7 days → CRITICAL |
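To make the uptime target concrete, here is a small Python sketch of the error-budget arithmetic behind a 99.5% SLO. The 30-day window is my assumption (the table does not state a measurement window), and `error_budget` is a hypothetical helper name, not part of any tool in the stack.

```python
from datetime import timedelta


def error_budget(slo: float, window: timedelta) -> timedelta:
    """Allowed downtime within `window` for an availability target `slo` (0..1)."""
    if not 0.0 < slo <= 1.0:
        raise ValueError("slo must be in (0, 1]")
    return window * (1.0 - slo)


# Assumption: the SLO is measured over a rolling 30-day window.
budget = error_budget(0.995, timedelta(days=30))
minutes = budget.total_seconds() / 60

print(f"30-day budget at 99.5%: {minutes:.0f} min")        # 216 min
print(f"5-minute outages that fit: {int(minutes // 5)}")    # 43
```

Under that assumption the budget is about 3.6 hours per month, so a "down > 5 min" alert fires well before a single outage can eat a meaningful share of it.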

Alerting

Alertmanager routes everything to Telegram. One bot, one chat, immediate mobile notification.

Rules live in prometheus/alerts.yml — stored in git alongside the rest of eigenstack config. Changing an alert threshold is a commit, not a click in a UI.
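The real rules are PromQL expressions in prometheus/alerts.yml; the sketch below is not that file's format. It only illustrates, in Python, the decision the disk and memory rows of the SLO table encode, with thresholds kept as plain data that can be reviewed in a diff. `THRESHOLDS` and `severity` are hypothetical names.

```python
from typing import Optional

# Hypothetical threshold table mirroring the disk and memory SLO rows.
# Each entry is (percent threshold, severity), listed highest first.
THRESHOLDS = {
    "disk": [(90.0, "CRITICAL"), (80.0, "WARNING")],
    "memory": [(85.0, "WARNING")],
}


def severity(metric: str, used_percent: float) -> Optional[str]:
    """Return the highest severity whose threshold is exceeded, else None."""
    for limit, level in THRESHOLDS[metric]:
        if used_percent > limit:
            return level
    return None


print(severity("disk", 84.0))    # WARNING
print(severity("disk", 93.5))    # CRITICAL
print(severity("memory", 80.0))  # None
```

Moving a threshold here is a one-line change, which is the same property the git-tracked rules file gives: the change shows up in a diff with an author and a date.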


What's Still Missing

  • Loki — no log aggregation yet. Currently: docker logs + grep when something breaks. Works, but painful for post-incident review.
  • Grafana dashboards as code — Grafana is running, but dashboards were clicked together manually. They should be JSON files in git.
  • Alerting on backup content, not just backup job — current alert fires if the cron job fails. It does NOT verify the backup is actually restorable.
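One possible first step toward the last item, sketched in Python under the assumption that backups are tar archives (the notes above do not say which format or tool is used). It reads every file in the archive end to end and checks that a set of expected member names is present. `backup_is_restorable` and the file names in the demo are hypothetical.

```python
import tarfile
import tempfile
import zlib
from pathlib import Path


def backup_is_restorable(archive: Path, expected: set[str]) -> bool:
    """Read every file in the archive and confirm the expected members exist."""
    seen = set()
    try:
        with tarfile.open(archive) as tar:
            for member in tar:
                if not member.isfile():
                    continue
                stream = tar.extractfile(member)
                # Read in chunks so truncated or corrupt data raises here.
                while stream.read(1 << 20):
                    pass
                seen.add(member.name)
    except (tarfile.TarError, OSError, EOFError, zlib.error):
        return False
    return expected <= seen


# Demo with a throwaway archive; a real check would point at the latest backup.
with tempfile.TemporaryDirectory() as tmp:
    dump = Path(tmp) / "dump.sql"
    dump.write_text("CREATE TABLE t (id INT);\n")

    good = Path(tmp) / "backup.tar.gz"
    with tarfile.open(good, "w:gz") as tar:
        tar.add(dump, arcname="db/dump.sql")

    bad = Path(tmp) / "broken.tar.gz"
    bad.write_bytes(b"not a tar archive")

    print(backup_is_restorable(good, {"db/dump.sql"}))  # True
    print(backup_is_restorable(bad, {"db/dump.sql"}))   # False
```

This only proves the archive is readable and complete. It does not prove that a database dump inside it loads cleanly; that would need a restore into a scratch instance.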