
Monitoring

Prometheus-compatible metrics stack with VictoriaMetrics, Grafana, Loki, and Fluent Bit.

This namespace deploys the full observability stack for the homelab cluster. It combines VictoriaMetrics for metrics storage, Grafana for dashboards, Loki for log aggregation, and Fluent Bit as the log collector DaemonSet. Self-hosting this stack avoids cloud observability costs while providing full access to all cluster telemetry.

Alternatives considered

Cloud Hosted

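| Tool          | Open Source | Free Tier | Monthly Cost  |
|---------------|-------------|-----------|---------------|
| Grafana Cloud | Yes         | Limited   | From $19/mo   |
| Datadog       | No          | No        | From $15/host |
| New Relic     | No          | Limited   | Pay-as-you-go |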

Installation

Architecture

  • HelmReleases: victoria-stack (VictoriaMetrics operator, vmsingle, vmstack + Grafana), loki, fluent-bit
  • DaemonSet: Fluent Bit log collector on all nodes, mounts /var/log
  • Additional: grafana-to-ntfy Deployment, which proxies Grafana alerts to ntfy; OpenTelemetry Collector (otelcol)
  • Storage: Longhorn-encrypted PVCs for VictoriaMetrics and Loki; S3/MinIO for Loki chunk storage
  • Networking: HTTPRoutes for Grafana, VictoriaMetrics UI, and Loki
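
As an illustration of the Networking item, a Gateway API HTTPRoute for Grafana might look roughly like the sketch below. The route name, Gateway reference, hostname, and backend Service name and port are all hypothetical placeholders; the real values live in the cluster manifests and may differ.

```yaml
# Sketch only: every name, hostname, and port here is an assumption.
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: grafana                 # hypothetical route name
  namespace: monitoring
spec:
  parentRefs:
    - name: internal-gateway    # hypothetical Gateway
      namespace: gateway-system # hypothetical namespace
  hostnames:
    - grafana.example.com       # placeholder hostname
  rules:
    - backendRefs:
        - name: grafana         # assumed Service name
          port: 80              # assumed Service port
```

The VictoriaMetrics UI and Loki routes would presumably follow the same shape with their own hostnames and backend Services.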

Security

  • Fluent Bit runs as root (runAsUser: 0) because it must read node logs under /var/log
  • grafana-to-ntfy runs as runAsUser: 1000, runAsNonRoot: true, readOnlyRootFilesystem: true, capabilities dropped
  • All secrets SOPS-encrypted with age
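
The grafana-to-ntfy hardening described above corresponds roughly to a container-level securityContext like the following. This is a sketch reconstructed from the bullet, not a copy of the deployed manifest; in particular, dropping ALL capabilities is an assumption, since the text only says capabilities are dropped.

```yaml
# Container-level fragment (sketch); goes under spec.template.spec.containers[].
securityContext:
  runAsUser: 1000
  runAsNonRoot: true
  readOnlyRootFilesystem: true
  capabilities:
    drop:
      - ALL   # assumption: the source does not list which capabilities are dropped
```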

Updates

Managed by Renovate. grafana-to-ntfy and otelcol images are digest-pinned.

Data Management

  • PVCs: Longhorn-encrypted PVCs for VictoriaMetrics (vmsingle, vlsingle) and Loki data
  • S3: MinIO bucket holding Loki chunks for long-term log retention
  • Backups: no k8up schedule is defined; data durability relies on Longhorn replication alone

User Management

Grafana OIDC is configured through GF_AUTH_GENERIC_OAUTH_* environment variables sourced from a SOPS-encrypted secret. Users authenticate via the cluster's OIDC provider.
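
For orientation, the decrypted form of such a secret could look like the sketch below. The variable names follow Grafana's GF_<SECTION>_<KEY> convention for the [auth.generic_oauth] settings; the secret name, display name, client ID, scopes, and identity-provider URLs are hypothetical placeholders. In the repository the file would be stored SOPS-encrypted, never in this plain form.

```yaml
# Sketch of the decrypted secret; all values are placeholders.
apiVersion: v1
kind: Secret
metadata:
  name: grafana-oidc            # hypothetical secret name
  namespace: monitoring
type: Opaque
stringData:
  GF_AUTH_GENERIC_OAUTH_ENABLED: "true"
  GF_AUTH_GENERIC_OAUTH_NAME: "SSO"                    # label on the login button
  GF_AUTH_GENERIC_OAUTH_CLIENT_ID: "grafana"           # placeholder
  GF_AUTH_GENERIC_OAUTH_CLIENT_SECRET: "<client-secret>"
  GF_AUTH_GENERIC_OAUTH_SCOPES: "openid profile email"
  GF_AUTH_GENERIC_OAUTH_AUTH_URL: "https://idp.example.com/authorize"
  GF_AUTH_GENERIC_OAUTH_TOKEN_URL: "https://idp.example.com/token"
  GF_AUTH_GENERIC_OAUTH_API_URL: "https://idp.example.com/userinfo"
```

A secret of this shape can be loaded into the Grafana container with envFrom, the same mechanism the grafana-to-ntfy Deployment uses for its own secret.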

Configuration Management

  • Helm chart values in ConfigMaps for victoria-stack, Loki, and Fluent Bit
  • Grafana OIDC credentials, SMTP config, and ntfy auth from SOPS-encrypted secrets
  • ntfy-auth secret used by grafana-to-ntfy for push notification delivery

Administration

Usage

Access Grafana to view cluster dashboards, query metrics with PromQL/MetricsQL, and browse logs via Loki. Alerts configured in Grafana are forwarded to ntfy via the grafana-to-ntfy proxy service. Fluent Bit collects container logs from all nodes automatically.

Cluster-specific deviations from the above live in the per-cluster README — see k8s/apps/talos/monitoring/README.md.

Cluster Deployment

Depends on

Monitoring — Talos cluster

Cluster-specific notes only. General product info, "why we use it", and alternatives live in docusaurus/docs/apps/monitoring.mdx.

Deviations from defaults

Defaults live in docusaurus/docs/apps/monitoring.mdx — document anything this cluster does differently here, with a one-line reason.

Kubernetes Metadata
Rendered manifests (kustomize build)
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: grafana-to-ntfy
  name: grafana-to-ntfy
  namespace: monitoring
spec:
  replicas: 1
  selector:
    matchLabels:
      app: grafana-to-ntfy
  strategy:
    rollingUpdate: null
    type: Recreate
  template:
    metadata:
      labels:
        app: grafana-to-ntfy
    spec:
      containers:
      - envFrom:
        - secretRef:
            name: grafana-to-ntfy
        image: kittyandrew/grafana-to-ntfy:latest@sha256:e1386f61db297b37ba4a6a056dfc370e9bee16c0cf394a24b581bdf0cd859a3e
        livenessProbe:
          failureThreshold: 5
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 5
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 5
        name: grafana-to-ntfy
        ports:
        - containerPort: 8080
          name: http
          protocol: TCP
        readinessProbe:
          failureThreshold: 5
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 5
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 5
      terminationGracePeriodSeconds: 60