aly badawy/homelab

Disaster recovery

Recovery procedures for the most likely failure modes. Most are self-healing. The worst case — complete node destruction — is a full rebuild with Longhorn NAS backups, which takes under an hour.

Read before you need it: Vault unseal key required
Keep your Vault unseal key offline. The cluster holds a working copy in the vault-unseal-key Secret for the auto-unseal CronJob, but a full rebuild (section 05) needs the key from your offline backup. If you lose it, Vault's data is permanently inaccessible; there is no recovery path. Store it in a password manager and a physical backup, separately.

01 Normal reboot / power failure

This is fully automatic. No action needed. The recovery sequence takes ~6 minutes:

  1. k3s restarts. All pods start immediately except Vault (sealed).
  2. vault-auto-unseal CronJob runs every minute. Longhorn reattaches the PVC at ~4–5 min.
  3. Vault unseals. vault-0 passes readiness probe.
  4. eso-recovery CronJob detects Vault ready + ClusterSecretStore degraded → restarts ESO.
  5. ESO reconnects. All ExternalSecret resources sync. Apps recover.
Walk away. The cluster heals itself. Check back after 6–7 minutes — everything should be healthy.
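
If you would rather have the terminal tell you when the sequence above has finished, a polling helper along these lines could work. It is a minimal sketch, not part of the repo: the function names, the 600-second timeout, and the 15-second interval are assumptions, and it treats "recovered" as vault-0 and the k8s-secrets ClusterSecretStore both reporting a Ready condition of True.

```shell
# check_recovered: succeeds once vault-0 and the ClusterSecretStore both report Ready=True.
# (hypothetical helper; resource names are the ones used elsewhere in this guide)
check_recovered() {
  vault_ready=$(kubectl get pod vault-0 -n security \
    -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}' 2>/dev/null)
  store_ready=$(kubectl get clustersecretstore k8s-secrets \
    -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}' 2>/dev/null)
  [ "$vault_ready" = "True" ] && [ "$store_ready" = "True" ]
}

# wait_for_recovery [TIMEOUT] [INTERVAL]: poll until recovered or TIMEOUT seconds pass.
wait_for_recovery() {
  timeout=${1:-600}
  interval=${2:-15}
  elapsed=0
  until check_recovered; do
    if [ "$elapsed" -ge "$timeout" ]; then
      echo "not recovered after ${timeout}s"
      return 1
    fi
    sleep "$interval"
    elapsed=$((elapsed + interval))
  done
  echo "recovered after ~${elapsed}s"
}
```

Run `wait_for_recovery` after a reboot. If it times out at 10 minutes, continue with section 02.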

02 Vault stuck sealed

If Vault doesn't unseal after 10 minutes, something prevented the CronJob from running or the unseal key is wrong. Diagnose:

vault seal debugging bash
# list unseal CronJob runs, oldest first (the newest is the last row)
$ kubectl get jobs -n security --sort-by=.metadata.creationTimestamp
# Jobs created by a CronJob get a suffixed name, so read logs by Job name
$ kubectl logs -n security job/<job-name> --tail=30

# check Vault pod is actually running
$ kubectl get pod vault-0 -n security

# check the PVC is bound (Longhorn might still be reattaching)
$ kubectl get pvc vault-data-lh -n security

# check if the unseal key is present
$ kubectl get secret vault-unseal-key -n security

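# manually unseal if the CronJob is failing
# (port-forward blocks, so background it with & or run it in a second terminal)
$ kubectl port-forward -n security svc/vault 8200:8200 &
$ export VAULT_ADDR=http://localhost:8200
# prompts for the unseal key
$ vault operator unseal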
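
To confirm the result of a manual unseal, you can read the exit code of `vault status`, which is 0 when unsealed, 2 when sealed, and 1 on error. The wrapper below is a small sketch (the function name is an assumption) and expects VAULT_ADDR to be set with the port-forward still running.

```shell
# vault_seal_state: map the exit code of `vault status` to a word.
# 0 = unsealed, 2 = sealed, anything else = error (for example, Vault unreachable).
vault_seal_state() {
  rc=0
  vault status >/dev/null 2>&1 || rc=$?
  case $rc in
    0) echo "unsealed" ;;
    2) echo "sealed" ;;
    *) echo "error" ;;
  esac
}
```

`vault_seal_state` printing `error` usually means the port-forward is not running or VAULT_ADDR points somewhere else.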

03 ESO ClusterSecretStore degraded

If apps are failing to get secrets even after Vault is unsealed, ESO may be stuck in exponential backoff. The eso-recovery CronJob handles this automatically, but you can force it manually:

force ESO reconnect bash
# check store status
$ kubectl get clustersecretstore k8s-secrets

# restart all three ESO deployments to clear backoff
$ kubectl rollout restart deployment -n security \
    external-secrets external-secrets-webhook external-secrets-cert-controller

# wait for ready, then check store again
$ kubectl rollout status deployment/external-secrets -n security
$ kubectl get clustersecretstore k8s-secrets
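
The decision described in section 01 (Vault ready, store degraded, so restart ESO) can be sketched as a function. This is an illustration of that logic under the names used in this guide, not the actual script inside the eso-recovery CronJob; `ready_status` and `eso_recover` are hypothetical helper names.

```shell
# ready_status KIND NAME [ARGS...]: print the resource's Ready condition ("True", "False", or empty).
ready_status() {
  kubectl get "$@" -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}' 2>/dev/null
}

# eso_recover: restart ESO only when Vault is ready and the store is not.
eso_recover() {
  vault_ready=$(ready_status pod vault-0 -n security) || vault_ready=""
  store_ready=$(ready_status clustersecretstore k8s-secrets) || store_ready=""
  if [ "$vault_ready" != "True" ]; then
    # restarting ESO while Vault is sealed would only start a new backoff
    echo "vault not ready, skipping"
  elif [ "$store_ready" = "True" ]; then
    echo "store healthy, nothing to do"
  else
    kubectl rollout restart deployment -n security \
      external-secrets external-secrets-webhook external-secrets-cert-controller
  fi
}
```

Checking Vault first matters: a restart is only useful once ESO has something to reconnect to.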

04 ArgoCD app stuck / OutOfSync

An app stuck at OutOfSync usually has one of three causes: a resource that ArgoCD can't reconcile, a SyncFailed error, or a dependency that hasn't come up yet.

ArgoCD sync debugging bash
# list all apps and their health/sync status
$ kubectl get applications -n argocd

# get detailed status for a failing app
$ kubectl describe application <app-name> -n argocd

# force a hard refresh (clears ArgoCD's cached state)
$ kubectl patch application <app-name> -n argocd \
    --type merge -p '{"metadata":{"annotations":{"argocd.argoproj.io/refresh":"hard"}}}'
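
With many apps, it helps to list only the ones that need attention. The filter below is a sketch (the function name is an assumption) that reads the sync and health status fields of each Application and prints those that are not both Synced and Healthy.

```shell
# unhealthy_apps: print name, sync status, and health (tab-separated)
# for every Application that is not both Synced and Healthy.
unhealthy_apps() {
  kubectl get applications -n argocd \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.sync.status}{"\t"}{.status.health.status}{"\n"}{end}' |
    awk -F '\t' '$2 != "Synced" || $3 != "Healthy"'
}
```

Empty output means every app is Synced and Healthy. Otherwise, feed each name into the `describe` command above.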

05 Node destruction — full rebuild

If the physical server is unrecoverable (hardware failure, OS corruption), the recovery path is a full rebuild. Longhorn NAS backups contain all stateful data. Nothing in Git is lost — Git is the source of truth.

Recovery steps:

  1. Provision a new Ubuntu 26.04 server (same hardware or a replacement).
  2. Set up SSH key auth for user homelab with NOPASSWD sudo.
  3. Get your Vault unseal key from offline backup and your Cloudflare API token.
  4. Run ./provision/rebuild.sh — answer "no" when asked about fresh cluster (to restore from NAS backups).
  5. Restore all Longhorn volumes from the Longhorn UI (Step 8).
  6. Run ./provision/activate-gitops.sh.
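
Before step 4, it can save time to confirm that step 2 actually took effect. The check below is a sketch, not part of ./provision: the function name is an assumption and the host argument is whatever address your new server has.

```shell
# preflight HOST: verify key-based SSH as user homelab and passwordless sudo.
preflight() {
  host=$1
  # BatchMode=yes makes ssh fail instead of prompting for a password;
  # sudo -n fails instead of prompting when NOPASSWD is not in effect.
  if ssh -o BatchMode=yes -o ConnectTimeout=10 "homelab@${host}" 'sudo -n true'; then
    echo "ok: ${host} is ready for rebuild.sh"
  else
    echo "fail: check SSH key auth and NOPASSWD sudo on ${host}"
    return 1
  fi
}
```

Call it as `preflight <server-address>` from the machine you will run rebuild.sh on.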

All application data (Vault secrets, PostgreSQL databases, Nextcloud files, Immich photos, Authentik users) is restored from Longhorn NAS backups. The one exception is Immich data stored directly on the NAS: it lives on the NFS mount, is not part of the Longhorn backups, and does not need restoring because a node rebuild leaves it untouched.

Expected data loss window. Longhorn backups run on the default recurring schedule. The maximum data loss is the time since the last backup completed. Check the Longhorn UI under "Recurring Jobs" to see the schedule and last run time.
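
The window is simple arithmetic: the failure time minus the completion time of the last backup. A minimal sketch, assuming you copy the last run time from the Longhorn UI and convert it to epoch seconds yourself (the function name is hypothetical):

```shell
# data_loss_minutes LAST_BACKUP_EPOCH [NOW_EPOCH]: minutes of data at risk
# if the node failed at NOW (defaults to the current time).
data_loss_minutes() {
  last=$1
  now=${2:-$(date +%s)}
  echo $(( (now - last) / 60 ))
}
```

For example, a backup that completed at epoch 1000 and a failure at epoch 4600 gives `data_loss_minutes 1000 4600`, which prints 60. With GNU date, `date -d '<timestamp>' +%s` converts a timestamp from the UI into epoch seconds.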

06 Vault unseal key rotation

If you need to rotate the Vault unseal key (e.g. after a security incident or staff change):

rotate Vault unseal key bash
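# 1. generate a new unseal key (authorized by the current unseal key, not the root token)
#    assumes a single key share, as the rest of this guide does
$ vault operator rekey -init -key-shares=1 -key-threshold=1
# supply the current unseal key when prompted; the new key is printed once
$ vault operator rekey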

# 2. update the Kubernetes Secret with the new key
$ kubectl create secret generic vault-unseal-key \
    --namespace=security \
    --from-literal=key="<new-unseal-key>" \
    --dry-run=client -o yaml | kubectl apply -f -

# 3. update your offline backup
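
After step 3, you may want to confirm that the Secret and your offline copy hold the same key without printing the key itself. One way is to compare short SHA-256 fingerprints. This is a sketch with hypothetical function names; it assumes the Secret stores the key under the `key` field, as created in step 2.

```shell
# key_fingerprint: read a key on stdin, print the first 12 hex chars of its SHA-256.
# Newlines are stripped so a trailing newline does not change the result.
key_fingerprint() {
  tr -d '\n' | sha256sum | cut -c1-12
}

# cluster_key_fingerprint: fingerprint of the key stored in the Kubernetes Secret.
cluster_key_fingerprint() {
  kubectl get secret vault-unseal-key -n security -o jsonpath='{.data.key}' |
    base64 -d | key_fingerprint
}
```

Pipe your offline copy into `key_fingerprint` and compare the output with `cluster_key_fingerprint`. If they differ, the CronJob will fail at the next reboot, so fix the Secret now.
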
last updated 2026-06-08