01 Normal reboot / power failure
This is fully automatic. No action needed. The recovery sequence takes ~6 minutes:
- k3s restarts. All pods start immediately except Vault (sealed).
- The vault-auto-unseal CronJob runs every minute. Longhorn reattaches the PVC at ~4–5 min.
- Vault unseals. vault-0 passes its readiness probe.
- The eso-recovery CronJob detects Vault ready + ClusterSecretStore degraded → restarts ESO.
- ESO reconnects. All ExternalSecret resources sync. Apps recover.
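To watch the sequence above from a workstation instead of waiting blindly, a small polling helper can block until Vault reports ready. This is a minimal sketch: wait_until and vault_ready are hypothetical helpers (not part of the repo), and the timeout and interval in the example are arbitrary choices.

```shell
# wait_until TIMEOUT_SECONDS INTERVAL_SECONDS COMMAND [ARGS...]
# Re-runs COMMAND until it succeeds or the timeout elapses.
# Returns 0 on success, 1 on timeout. (Hypothetical helper.)
wait_until() {
  timeout=$1; interval=$2; shift 2
  elapsed=0
  until "$@" >/dev/null 2>&1; do
    if [ "$elapsed" -ge "$timeout" ]; then
      return 1
    fi
    sleep "$interval"
    elapsed=$((elapsed + interval))
  done
  return 0
}

# Succeeds once the vault-0 pod reports Ready=True. Per the sequence above,
# that happens only after Vault has been unsealed.
vault_ready() {
  [ "$(kubectl get pod vault-0 -n security \
      -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}')" = "True" ]
}

# Example: wait up to 10 minutes, polling every 10 seconds.
#   wait_until 600 10 vault_ready && echo "Vault is ready"
```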
02 Vault stuck sealed
If Vault doesn't unseal after 10 minutes, something prevented the CronJob from running or the unseal key is wrong. Diagnose:
# check the last unseal CronJob run
# (Jobs created by the CronJob are named vault-auto-unseal-<suffix>; use the newest name from the list)
$ kubectl get jobs -n security --sort-by=.metadata.creationTimestamp
$ kubectl logs -n security job/<job-name> --tail=30
# check Vault pod is actually running
$ kubectl get pod vault-0 -n security
# check the PVC is bound (Longhorn might still be reattaching)
$ kubectl get pvc vault-data-lh -n security
# check if the unseal key is present
$ kubectl get secret vault-unseal-key -n security
# manually unseal if the CronJob is failing
# (port-forward blocks, so run it in a separate terminal)
$ kubectl port-forward -n security svc/vault 8200:8200
$ export VAULT_ADDR=http://localhost:8200
# prompts for the unseal key
$ vault operator unseal
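With the port-forward up and VAULT_ADDR set, the seal state can be confirmed before and after the manual unseal. The sketch below relies on the documented exit codes of vault status; vault_seal_state is a hypothetical helper name.

```shell
# Interpret the exit code of `vault status`:
#   0 = unsealed, 2 = sealed, anything else = error (e.g. VAULT_ADDR unreachable).
vault_seal_state() {
  vault status >/dev/null 2>&1
  case $? in
    0) echo "unsealed" ;;
    2) echo "sealed" ;;
    *) echo "error" ;;
  esac
}

# Example:
#   vault_seal_state    # run before and after `vault operator unseal`
```

An "error" result usually points at the port-forward or VAULT_ADDR rather than at Vault itself.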
03 ESO ClusterSecretStore degraded
If apps are failing to get secrets even after Vault is unsealed, ESO may
be stuck in exponential backoff. The eso-recovery CronJob
handles this automatically, but you can force it manually:
# check store status
$ kubectl get clustersecretstore k8s-secrets
# restart all three ESO deployments to clear backoff
$ kubectl rollout restart deployment -n security \
external-secrets external-secrets-webhook external-secrets-cert-controller
# wait for ready, then check store again
$ kubectl rollout status deployment/external-secrets -n security
$ kubectl get clustersecretstore k8s-secrets
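The manual steps above can be wrapped so that ESO is restarted only when the store is actually not Ready. This is a sketch under assumptions: the function names are hypothetical, the 120s timeout is arbitrary, and it is not necessarily how the eso-recovery CronJob is implemented.

```shell
# Prints the status of the store's Ready condition: "True", "False",
# or nothing if the condition has not been set yet.
store_ready_status() {
  kubectl get clustersecretstore "$1" \
    -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}'
}

# restart_eso_if_degraded STORE_NAME
# Restarts the three ESO deployments only when the store is not Ready.
restart_eso_if_degraded() {
  if [ "$(store_ready_status "$1")" = "True" ]; then
    echo "store $1 is Ready; no restart needed"
    return 0
  fi
  kubectl rollout restart deployment -n security \
    external-secrets external-secrets-webhook external-secrets-cert-controller
  kubectl rollout status deployment/external-secrets -n security --timeout=120s
}

# Example:
#   restart_eso_if_degraded k8s-secrets
```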
04 ArgoCD app stuck / OutOfSync
An app stuck at OutOfSync usually has one of three causes: a
resource that ArgoCD can't reconcile, a SyncFailed error, or
a dependency that hasn't come up yet.
# list all apps and their health/sync status
$ kubectl get applications -n argocd
# get detailed status for a failing app
$ kubectl describe application <app-name> -n argocd
# force a hard refresh (clears ArgoCD's cached state)
$ kubectl patch application <app-name> -n argocd \
--type merge -p '{"metadata":{"annotations":{"argocd.argoproj.io/refresh":"hard"}}}'
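When many apps are deployed, it helps to list only the ones that are not both Synced and Healthy. A minimal sketch, assuming the standard Application status fields; the function names are hypothetical.

```shell
# Reads "NAME SYNC HEALTH" rows on stdin and prints only the rows
# that are not both Synced and Healthy.
filter_unhealthy_apps() {
  awk '$2 != "Synced" || $3 != "Healthy"'
}

# Lists the ArgoCD applications that need attention.
list_unhealthy_apps() {
  kubectl get applications -n argocd --no-headers \
    -o custom-columns='NAME:.metadata.name,SYNC:.status.sync.status,HEALTH:.status.health.status' \
    | filter_unhealthy_apps
}

# Example:
#   list_unhealthy_apps
```

An empty result means every app is Synced and Healthy.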
05 Node destruction — full rebuild
If the physical server is unrecoverable (hardware failure, OS corruption), the recovery path is a full rebuild. Longhorn NAS backups contain all stateful data. Nothing in Git is lost — Git is the source of truth.
Recovery steps:
- Provision a new Ubuntu 26.04 server (same hardware or a replacement).
- Set up SSH key auth for user homelab with NOPASSWD sudo.
- Get your Vault unseal key from offline backup and your Cloudflare API token.
- Run ./provision/rebuild.sh and answer "no" when asked about a fresh cluster (so that it restores from NAS backups).
- Restore all Longhorn volumes from the Longhorn UI (Step 8).
- Run ./provision/activate-gitops.sh.
All application data (Vault secrets, PostgreSQL databases, Nextcloud files, Immich photos, Authentik users) is restored from Longhorn NAS backups. The one exception is Immich data stored directly on the NAS: it is not part of the Longhorn backups, but it lives on the NFS mount and is unaffected by a node rebuild.
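The SSH and sudo prerequisite can be checked before running rebuild.sh, so that the script does not stall on a password prompt. A sketch only: check_sudo_access is a hypothetical helper, the host argument is a placeholder for the new server's address, and what rebuild.sh itself checks is not covered here.

```shell
# check_sudo_access HOST
# Succeeds only if key-based SSH login as "homelab" works without any
# password prompt (BatchMode) and sudo runs without asking for one (-n).
check_sudo_access() {
  ssh -o BatchMode=yes -o ConnectTimeout=10 "homelab@$1" 'sudo -n true'
}

# Example (<new-server> is a placeholder):
#   check_sudo_access <new-server> && echo "ready for ./provision/rebuild.sh"
```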
06 Vault unseal key rotation
If you need to rotate the Vault unseal key (e.g. after a security incident or staff change):
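# 1. generate a new unseal key (requires the current unseal key, not the root token)
#    -init starts the rekey; a single share with threshold 1 is assumed here to match the single-key Secret
$ vault operator rekey -init -key-shares=1 -key-threshold=1
#    supply the current unseal key when prompted; the new key is printed once
$ vault operator rekey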
# 2. update the Kubernetes Secret with the new key
$ kubectl create secret generic vault-unseal-key \
--namespace=security \
--from-literal=key="<new-unseal-key>" \
--dry-run=client -o yaml | kubectl apply -f -
# 3. update your offline backup
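After step 2 it is worth confirming that the Secret really holds the new key, since the CronJob will keep failing silently if it does not. A minimal sketch with hypothetical helper names; it assumes the key is stored under the data field "key", as in the command above, and a base64 that accepts -d.

```shell
# Prints the unseal key currently stored in the Secret (decoded from base64).
# Avoid running this where the terminal output is logged.
stored_unseal_key() {
  kubectl get secret vault-unseal-key -n security -o jsonpath='{.data.key}' | base64 -d
}

# secret_matches EXPECTED_KEY
# Succeeds if the Secret holds exactly EXPECTED_KEY; prints nothing.
secret_matches() {
  [ "$(stored_unseal_key)" = "$1" ]
}

# Example:
#   secret_matches "<new-unseal-key>" && echo "Secret updated"
```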