Disaster recovery drill

This is the documented end-to-end path from "the Talos production cluster is gone" to "every app is running again, every PVC is restored". The trigger doesn't matter (three dead Proxmox nodes, a botched upgrade, ransomware, an exuberant rm): the procedure is the same.

The total wall-clock time, with no surprises, is around half a day. Most of that is waiting for downloads and reconciliation, not active work.

Prerequisites — what survives a full disaster

The recovery only works because four things live outside the production cluster:

  1. The Git repository, which is the source of truth. Gitea itself is hosted on the same cluster, so the copy that survives is the Codeberg mirror. If Gitea is unreachable, you bootstrap Flux against the Codeberg mirror.
  2. The SOPS age key — needed to decrypt every Secret in the repo. The cluster-side copy is in flux-system/; the out-of-cluster copies live encrypted on the cold-storage drives and printed on paper in a safe. See Operations → SOPS and Topics → SOPS / age key rotation.
  3. The Restic repository password — needed to read the warm-tier snapshots. Stored alongside the age key.
  4. The cold-tier drives themselves — for the unlikely case where Hetzner Object Storage is also unrecoverable.

If any one of those four is missing, the recovery is degraded; if more than one is missing, treat it as a fresh start rather than a recovery.
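Before starting, it can help to confirm the recovery material is actually in hand. The following is a minimal sketch, assuming the age key, the Restic password, and the cold-tier drive are reachable as local paths; the file names and the three-argument layout are hypothetical, and the Git mirror is checked separately in step 4. `recovery_mode` simply encodes the rule above.

```shell
#!/bin/sh
# Hypothetical preflight: all paths are placeholders, not the real layout.

# has_age_key FILE: true if FILE is non-empty and holds an age secret key.
has_age_key() {
    [ -s "$1" ] && grep -q '^AGE-SECRET-KEY-1' "$1"
}

# count_missing AGE_KEY_FILE RESTIC_PASSWORD_FILE COLD_DRIVE_DIR
# Prints how many of the three local prerequisites are absent.
count_missing() {
    missing=0
    has_age_key "$1" || missing=$((missing + 1))
    [ -s "$2" ] || missing=$((missing + 1))
    [ -d "$3" ] || missing=$((missing + 1))
    echo "$missing"
}

# recovery_mode COUNT: maps the number of missing items to the rule above.
recovery_mode() {
    case "$1" in
        0) echo "full recovery" ;;
        1) echo "degraded recovery" ;;
        *) echo "fresh start" ;;
    esac
}

# Usage (paths are examples only):
#   recovery_mode "$(count_missing ./age.key ./restic-password.txt /mnt/cold)"
```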

The procedure

1. Bring back compute
   ┌──────────────────────────────────┐
   │ Tofu apply on Proxmox / hcloud   │
   └──────────────────┬───────────────┘
                      ▼
2. Reinstall Talos
   ┌──────────────────────────────────┐
   │ talhelper genconfig + apply      │
   └──────────────────┬───────────────┘
                      ▼
3. Reseed cluster trust
   ┌──────────────────────────────────┐
   │ age key, restic password,        │
   │ netbird setup keys back in       │
   └──────────────────┬───────────────┘
                      ▼
4. Bootstrap Flux
   ┌──────────────────────────────────┐
   │ flux bootstrap git …             │
   │ (against Gitea or Codeberg)      │
   └──────────────────┬───────────────┘
                      ▼
5. Wait for platform to converge
   ┌──────────────────────────────────┐
   │ Cilium · Longhorn · CNPG · k8up  │
   │ etc. reconcile                   │
   └──────────────────┬───────────────┘
                      ▼
6. Restore PVCs from Restic
   ┌──────────────────────────────────┐
   │ per-app: k8up Restore resource   │
   └──────────────────┬───────────────┘
                      ▼
7. Restore Postgres clusters from Restic
   ┌──────────────────────────────────┐
   │ per-app: psql < restored.sql     │
   └──────────────────┬───────────────┘
                      ▼
8. Apps come up green
   ┌──────────────────────────────────┐
   │ Gatus / Prometheus show OK       │
   └──────────────────────────────────┘

1. Compute

OpenTofu reinstantiates the Proxmox VMs (six of them: 3× control-plane, 3× worker) and the Hetzner edge instance. If Proxmox itself is gone, that's a hypervisor rebuild first — outside the scope of this drill; document that on its own.

cd tofu/environment/production
tofu init && tofu plan -out=plan && tofu apply plan

cd ../edge
tofu init && tofu plan -out=plan && tofu apply plan

The VMs come up empty, waiting for a Talos image.
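Once both applies have finished, a second plan should report nothing left to do. A minimal sketch of that check, relying on the `-detailed-exitcode` flag of `tofu plan` (0 = no changes, 1 = error, 2 = changes pending); the helper name `plan_summary` is made up for illustration.

```shell
#!/bin/sh
# plan_summary DIR: runs a plan in DIR and reports whether the state matches
# the configuration. Uses a subshell so the caller's working directory is kept.
plan_summary() {
    rc=0
    (cd "$1" && tofu plan -detailed-exitcode -input=false >/dev/null 2>&1) || rc=$?
    case "$rc" in
        0) echo "$1: in sync" ;;
        2) echo "$1: changes pending" ;;
        *) echo "$1: plan failed"; return 1 ;;
    esac
}

# Usage, from the repository root:
#   plan_summary tofu/environment/production
#   plan_summary tofu/environment/edge
```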

2. Talos

PXE-boot or attach the Talos ISO to each VM, then push the per-node config:

talhelper genconfig
talhelper gencommand apply | sh

# Bootstrap etcd on the first control-plane only
talosctl bootstrap --nodes <cp-01-ip>

# Retrieve the kubeconfig
talhelper gencommand kubeconfig | sh

kubectl get nodes should now show six Talos nodes (or one for the edge cluster). The cluster has no workloads yet.
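Rather than re-running `kubectl get nodes` by hand, the wait can be scripted. This sketch counts registered nodes regardless of their status, on the assumption that nodes may still report NotReady until a CNI is running; the function names and the default retry values are illustrative.

```shell
#!/bin/sh
# registered_nodes: prints how many nodes the API server knows about.
registered_nodes() {
    kubectl get nodes --no-headers 2>/dev/null | awk 'END { print NR }'
}

# wait_for_nodes EXPECTED [TRIES] [SLEEP_SECONDS]
# Polls until at least EXPECTED nodes are registered, or gives up.
wait_for_nodes() {
    expected=$1
    tries=${2:-60}
    pause=${3:-10}
    while [ "$tries" -gt 0 ]; do
        if [ "$(registered_nodes)" -ge "$expected" ]; then
            return 0
        fi
        tries=$((tries - 1))
        sleep "$pause"
    done
    echo "timed out waiting for $expected nodes" >&2
    return 1
}

# Usage: wait_for_nodes 6   (or: wait_for_nodes 1 for the edge cluster)
```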

3. Reseed trust material

Before Flux can reconcile, three secrets must already be on the cluster:

# The target namespaces do not exist on a fresh cluster yet
kubectl create namespace flux-system
kubectl create namespace k8up

# SOPS age key (decrypts every committed Secret)
kubectl -n flux-system create secret generic sops-age \
  --from-file=age.agekey=/path/to/recovered/age.key

# Restic repository password (lets k8up read snapshots)
kubectl -n k8up create secret generic restic-credentials \
  --from-literal=password="$RECOVERED_RESTIC_PASSWORD"

# NetBird setup keys, if the new cluster joins as a fresh peer
# (use the netbird/tofu output or the NetBird UI)

The age key in particular is the load-bearing piece: without it, every SOPS-encrypted manifest fails to decrypt and the cluster idles with its Kustomization resources stuck at Ready: False.
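A quick check that the secrets really landed before moving on to Flux. This is a sketch only; `missing_secrets` is a hypothetical helper that takes `namespace/name` pairs and prints the ones the API server does not return.

```shell
#!/bin/sh
# missing_secrets NS/NAME...: prints every pair that cannot be fetched.
# An empty result means all listed secrets exist.
missing_secrets() {
    for ref in "$@"; do
        ns=${ref%%/*}     # part before the first slash
        name=${ref#*/}    # part after the first slash
        kubectl -n "$ns" get secret "$name" >/dev/null 2>&1 || echo "$ref"
    done
}

# Usage:
#   missing_secrets flux-system/sops-age k8up/restic-credentials
```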

4. Bootstrap Flux

Point Flux at the homelab repo (Gitea, or Codeberg if Gitea is unreachable):

flux bootstrap git \
  --url=ssh://git@gitea.web.kueber.eu/johnny/homelab.git \
  --branch=main \
  --path=k8s/clusters/talos \
  --private-key-file=/path/to/flux/deploy_key

Within a minute the source-controller fetches the repository and the kustomize-controller starts applying: Cilium first (the CNI is needed for everything else), then the rest of the platform layer, then the components, then the apps.
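The choice between Gitea and Codeberg can be made by probing each remote in order. A minimal sketch, assuming `git ls-remote` can reach the remotes with the deploy key already loaded; `pick_repo_url` is a made-up helper and `CODEBERG_URL` is a placeholder, since the mirror's URL is not given here.

```shell
#!/bin/sh
# pick_repo_url URL...: prints the first URL that answers `git ls-remote`,
# or returns non-zero if none of them does.
pick_repo_url() {
    for url in "$@"; do
        if git ls-remote --heads "$url" >/dev/null 2>&1; then
            echo "$url"
            return 0
        fi
    done
    return 1
}

# Usage (CODEBERG_URL is a placeholder for the mirror's SSH URL):
#   REPO_URL=$(pick_repo_url \
#       ssh://git@gitea.web.kueber.eu/johnny/homelab.git "$CODEBERG_URL")
#   flux bootstrap git --url="$REPO_URL" --branch=main \
#       --path=k8s/clusters/talos --private-key-file=/path/to/flux/deploy_key
```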

5. Wait for the platform to converge

flux get all -A --watch

Expect ~5–10 minutes for the full platform layer to reach Ready: True. Image pulls dominate this phase; once Spegel has populated its mirror, subsequent pulls are served locally.

App pods will appear but most will be Pending — waiting for their PVCs, which don't exist yet.
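To see which namespaces are still waiting on storage, the Pending pods can be grouped by namespace. A small sketch; the helper name is illustrative, and it only uses the standard `status.phase` field selector.

```shell
#!/bin/sh
# pending_by_namespace: prints "<namespace> <count>" for every namespace
# that currently has pods in the Pending phase, sorted by namespace.
pending_by_namespace() {
    kubectl get pods -A --field-selector=status.phase=Pending --no-headers 2>/dev/null |
        awk '{ count[$1]++ } END { for (ns in count) print ns, count[ns] }' |
        sort
}

# Usage: pending_by_namespace
```

Namespaces that show up here are the candidates for the Restore resources in step 6.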

6. Restore PVCs from Restic

For each app whose PVC carries k8up.io/backup: "true", a Restore resource pointed at the matching Restic snapshot brings the PVC back. The pattern:

apiVersion: k8up.io/v1
kind: Restore
metadata:
  name: restore-<app>
  namespace: <app>
spec:
  snapshot: latest
  restoreMethod:
    folder:
      claimName: <app>-data # the existing-but-empty PVC
  backend:
    repoPasswordSecretRef:
      name: restic-credentials
      key: password
    s3:
      endpoint: https://<hetzner-objectstorage-endpoint>
      bucket: <repo-bucket>

This can be batched — apply all Restore resources at once and let k8up serialize the work.
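Batching is easier if the manifests are generated rather than hand-edited. The sketch below renders the pattern above for a list of apps. It assumes each app lives in a namespace of the same name and owns a PVC called `<app>-data`, exactly as in the pattern; the function name, the example app names, and the endpoint and bucket arguments are placeholders.

```shell
#!/bin/sh
# render_restore APP ENDPOINT BUCKET: prints one Restore manifest on stdout.
render_restore() {
    app=$1
    endpoint=$2
    bucket=$3
    cat <<EOF
---
apiVersion: k8up.io/v1
kind: Restore
metadata:
  name: restore-${app}
  namespace: ${app}
spec:
  snapshot: latest
  restoreMethod:
    folder:
      claimName: ${app}-data
  backend:
    repoPasswordSecretRef:
      name: restic-credentials
      key: password
    s3:
      endpoint: ${endpoint}
      bucket: ${bucket}
EOF
}

# Usage (app names, endpoint and bucket are examples only):
#   for app in app-one app-two; do
#       render_restore "$app" "$S3_ENDPOINT" "$S3_BUCKET"
#   done | kubectl apply -f -
```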

7. Restore Postgres clusters

CNPG clusters that were annotated with k8up.io/backupcommand: pg_dump have logical dumps in Restic. The postgres-restore runbook is the canonical procedure for each one — kubectl exec into the CNPG primary and stream the dump through pv. With ~10 Postgres-backed apps, this is the longest step (a couple of hours for the bigger ones).
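The shape of that step, as a rough sketch rather than a replacement for the runbook: locate the primary, then pipe the dump through `pv` into `psql`. It assumes the dump is plain SQL already extracted to a local file, that the primary carries the `cnpg.io/instanceRole=primary` label, and that `psql` inside the pod can connect without further credentials; the function names and arguments are hypothetical.

```shell
#!/bin/sh
# cnpg_primary NAMESPACE CLUSTER: prints the primary pod as "pod/<name>".
cnpg_primary() {
    kubectl -n "$1" get pods \
        -l "cnpg.io/cluster=$2,cnpg.io/instanceRole=primary" \
        -o name | head -n 1
}

# restore_dump NAMESPACE CLUSTER DATABASE DUMP_FILE
# Streams DUMP_FILE through pv (for a progress bar) into psql on the primary.
restore_dump() {
    pod=$(cnpg_primary "$1" "$2")
    if [ -z "$pod" ]; then
        echo "no primary found for cluster $2" >&2
        return 1
    fi
    pv "$4" | kubectl -n "$1" exec -i "$pod" -- psql -v ON_ERROR_STOP=1 -d "$3"
}

# Usage (all four arguments are examples):
#   restore_dump my-app my-app-db app ./restored.sql
```

`ON_ERROR_STOP=1` makes `psql` abort on the first failing statement instead of continuing with a half-applied dump.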

8. Convergence

When the last PVC is restored and the last Restore resource hits Completed, apps come up. Gatus starts going green; Monitoring's dashboards stop showing 5xx spikes.

If a specific app doesn't come up:

  • Pod stuck Pending? — PVC not bound yet, or Longhorn is still building replicas.
  • Pod stuck CrashLoopBackOff? — most often a database that hasn't been restored yet (Step 7).
  • HTTPRoute 503? — Envoy Gateway is up but the upstream endpoint isn't healthy. Same root cause as the above.
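The pod-level checks in the list above can be turned into a quick per-namespace triage. A sketch only: it reads the STATUS column of `kubectl get pods` and attaches the matching hint, and both function names are made up.

```shell
#!/bin/sh
# triage_hint STATUS: maps a pod status to the likely cause listed above.
triage_hint() {
    case "$1" in
        Pending) echo "PVC not bound yet, or Longhorn is still building replicas" ;;
        CrashLoopBackOff) echo "database probably not restored yet (step 7)" ;;
        *) echo "no hint for status: $1" ;;
    esac
}

# triage_namespace NAMESPACE: prints one line per pod that is not healthy.
triage_namespace() {
    kubectl -n "$1" get pods --no-headers 2>/dev/null |
        while read -r pod _ready status _rest; do
            case "$status" in
                Running|Completed) ;;
                *) echo "$pod: $(triage_hint "$status")" ;;
            esac
        done
}

# Usage: triage_namespace <app>
```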

What this drill doesn't cover

  • Edge cluster recovery. Same playbook, but smaller: re-create the cx33, push Talos, bootstrap Flux pointed at k8s/clusters/edge. No PVCs to restore (the edge cluster doesn't run stateful workloads).
  • DNS recovery. If the home DNS is down (AdGuard on Maresa unreachable), inbound requests to *.kueber.eu keep working via the public-zone delegation — but internal admin URLs (*.maresa.int.kueber.eu) won't resolve until Maresa is back.
  • Gitea recovery. If the cluster and the Gitea instance are gone, bootstrap Flux against Codeberg first, restore Gitea from its k8up snapshot, then re-point Flux at the recovered Gitea. The mirror direction (Gitea → Codeberg) means Codeberg may be a few minutes behind on the most recent commits.

Practice schedule

The drill above takes most of a working day if it's the first time. The honest advice: practice it before you need it.

  • Quarterly: restore one random PVC into a throwaway namespace and verify the data.
  • Yearly: run the full procedure against a sandbox cluster (a kind cluster + a fresh Hetzner project counts) to catch documentation drift.
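For the quarterly exercise, the "random PVC" can be picked mechanically so that the same convenient volume is not chosen every time. A minimal sketch using awk; the helper name is illustrative, and the seed comes from the current time, so two runs within the same second pick the same line.

```shell
#!/bin/sh
# pick_random_line: reads lines on stdin and prints one of them at random.
# Prints nothing if the input is empty.
pick_random_line() {
    awk -v seed="$(date +%s)" '
        BEGIN { srand(seed) }
        { lines[NR] = $0 }
        END { if (NR > 0) print lines[int(rand() * NR) + 1] }'
}

# Usage:
#   kubectl get pvc -A --no-headers \
#       -o custom-columns=NS:.metadata.namespace,NAME:.metadata.name |
#       pick_random_line
```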

When the actual disaster hits, the documentation you wrote a year ago will not be the documentation you need. The drill is what keeps the gap small enough to close in hours.
