Monitoring & Observability
🌱 Seedling — SLOs and alerting stack documented. Loki and Grafana dashboards as code still TODO.
The Problem I Was Solving
A self-hosted stack with no monitoring is a black box. You find out something is broken when a service stops responding — not when it starts degrading.
I wanted to know about problems before users do (even if the only "user" is me).
Stack
Netdata → realtime metrics, live web UI, zero config Prometheus → persistent metrics, 30-day retention, PromQL node-exporter → host-level metrics scraped by Prometheus Alertmanager → routes Prometheus alerts → Telegram
text
Why both Netdata and Prometheus?
TRIZ contradiction: realtime visibility vs persistent history.
Netdata is great for "what is happening right now" — auto-discovers Docker containers, shows per-container CPU/RAM without any configuration. But it keeps only ~1 hour of history.
Prometheus solves the retention problem and handles alerting rules. They don't overlap — both stay.
SLOs for eigenstack
Defined before setting up alerts, not after.
| Metric | Target | Alert |
|---|---|---|
| Uptime | 99.5% | Down > 5 min → CRITICAL |
| Response time | < 500ms p95 | — (tracked, not alerted) |
| Disk usage | < 80% | > 80% WARNING, > 90% CRITICAL |
| Memory usage | < 75% | > 85% WARNING |
| Backup success | 100% daily | Any failure → CRITICAL |
| SSL expiry | > 7 days | < 7 days → CRITICAL |
Alerting
Alertmanager routes everything to Telegram. One bot, one chat, immediate mobile notification.
Rules live in prometheus/alerts.yml — stored in git alongside the rest
of eigenstack config. Changing an alert threshold is a commit, not a
click in a UI.
What's Still Missing
- Loki — no log aggregation yet. Currently:
docker logs+ grep when something breaks. Works, but painful for post-incident review. - Grafana dashboards as code — Grafana is running, but dashboards were clicked together manually. They should be JSON files in git.
- Alerting on backup content, not just backup job — current alert fires if the cron job fails. It does NOT verify the backup is actually restorable.