Backups — homelab.badawy

01 Backup layers

Layer	What's protected	Destination	Recovery
Longhorn volume backups	All stateful Kubernetes PVCs (Vault, PostgreSQL, Nextcloud, Authentik, Immich)	NAS NFS: `/var/nfs/shared/backups/pvcs`	Restore from Longhorn UI during cluster rebuild
PostgreSQL SQL dumps	All databases in the shared PostgreSQL instance	NAS NFS: `/mnt/nas/backups`	Restore from `.sql` dump file with `psql`
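NAS storage protection	All NAS data, including the backups themselves	RAID 5 array on the NAS, NAS filesystem snapshots, and a remote NAS (off-site)	Replace the failed drive, or restore from the remote NAS backup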

02 Longhorn volume backups

Longhorn's recurring job system automatically snapshots and backs up every labeled PVC. The default schedule applies to any PVC carrying the label `recurring-job-group.longhorn.io/default: enabled`.
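
Opting a PVC into the default group could look like the sketch below. It assumes the label is applied with `kubectl label`, and that this Longhorn release also expects the `recurring-job.longhorn.io/source` label on the PVC before it syncs job labels to the volume (check the Longhorn documentation for the installed version). The helper name `longhorn_label_cmd` is hypothetical; it only prints the command so it can be reviewed first.

```shell
# Print (not run) the kubectl command that opts a PVC into the "default"
# recurring job group. longhorn_label_cmd is a hypothetical helper name.
longhorn_label_cmd() {
  ns="$1"
  pvc="$2"
  printf 'kubectl label pvc %s -n %s %s %s --overwrite\n' \
    "$pvc" "$ns" \
    'recurring-job.longhorn.io/source=enabled' \
    'recurring-job-group.longhorn.io/default=enabled'
}

# Example: review the command, then pipe it to sh to apply it.
# longhorn_label_cmd security vault-data
# longhorn_label_cmd security vault-data | sh
```

Printing the command instead of running it keeps the step reviewable, which matters when the label decides whether a volume is backed up at all.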

PVCs covered by Longhorn backups text

namespace   PVC name                data stored
────────────────────────────────────────────────────────────────
security    vault-data              Vault KV secrets (all app credentials)
db          postgres-data           All PostgreSQL databases
cloud       nextcloud-config        Nextcloud application data
security    authentik-media         Authentik uploaded media
security    authentik-templates     Authentik custom templates
immich      immich-pvc (see note)   Immich app data (not the photo library)

Note: The Immich photo library is NAS-backed, not Longhorn-backed. Immich stores photos at /mnt/nas/immich, a direct NFS mount from the NAS, so that data is protected by the NAS storage layer rather than by Longhorn. A cluster rebuild does not affect the photo library: it lives on the NAS and is remounted on first deploy.

The backup target and polling interval are set in the Longhorn Helm values:

k8s/components/longhorn/helm-values.yaml (backup section) yaml

defaultSettings:
  backupTarget: nfs://172.20.20.2:/var/nfs/shared/backups/pvcs
  backupTargetCredentialSecret: ""  # no auth needed (NAS on trusted Servers VLAN)
  backupPollInterval: 300  # seconds between polls
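
To inspect the backups directly, the target URL can be split into the NFS host and export path and mounted by hand. This is a sketch that assumes the `nfs://HOST:/PATH` form used in the values file; `parse_backup_target` is a hypothetical helper and the mount point `/mnt/longhorn-backups` is illustrative only.

```shell
# Split a Longhorn NFS backup target of the form nfs://HOST:/PATH
# into "HOST PATH". parse_backup_target is a hypothetical helper.
parse_backup_target() {
  target="${1#nfs://}"     # drop the scheme
  host="${target%%:*}"     # text before the first colon
  path="${target#*:}"      # text after the first colon
  printf '%s %s\n' "$host" "$path"
}

# Example (run as root on a host that can reach the NAS):
# set -- $(parse_backup_target "nfs://172.20.20.2:/var/nfs/shared/backups/pvcs")
# mount -t nfs -o ro "$1:$2" /mnt/longhorn-backups
```

Mounting read-only avoids accidentally modifying the backup store while browsing it.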

03 Backup schedule

All layers run on a coordinated schedule. Longhorn snapshots run first, backups to NAS run 30 minutes later, and NAS snapshots run 29 minutes after that — giving Longhorn time to finish before the NAS preserves its own state.

Layer	Event	Times (daily)
Longhorn	Volume snapshot	12:00 AM · 6:00 AM · 12:00 PM · 6:00 PM
Longhorn → NAS	Backup to NFS target	12:30 AM · 6:30 AM · 12:30 PM · 6:30 PM
NAS (UNAS 4)	Filesystem snapshot (RAID 5)	12:59 AM · 6:59 AM · 12:59 PM · 6:59 PM
NAS → Remote NAS	Off-site backup	3:00 AM (daily)
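
The table can also be written as cron expressions, the format a Longhorn RecurringJob uses for its schedule. This is a sketch only: the layer keys are hypothetical labels, and the NAS may express its own schedules differently in its UI.

```shell
# Map each backup layer to the cron expression implied by the schedule table.
# The layer keys are hypothetical labels used only in this sketch.
layer_cron() {
  case "$1" in
    longhorn-snapshot) echo '0 0,6,12,18 * * *' ;;   # on the hour
    longhorn-backup)   echo '30 0,6,12,18 * * *' ;;  # 30 min after the snapshot
    nas-snapshot)      echo '59 0,6,12,18 * * *' ;;  # 29 min after the backup starts
    nas-offsite)       echo '0 3 * * *' ;;           # once a day at 3:00 AM
    *) echo "unknown layer: $1" >&2; return 1 ;;
  esac
}

# Example:
# layer_cron longhorn-backup
```

Keeping the minute offsets (0, 30, 59) next to each other makes it easier to see that each layer starts after the previous one.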

04 PostgreSQL SQL dumps

A pg-dump CronJob in the db namespace runs periodic SQL dumps of all databases and writes them to /mnt/nas/backups. This provides a second, SQL-level backup independent of Longhorn, useful for restoring specific tables or schemas as of the last dump without restoring an entire volume.
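
The dump step inside such a CronJob might look like the sketch below. The actual job definition is not shown here, so the file naming scheme and the helper names `dump_path` and `dump_database` are assumptions; only the destination directory comes from the text above.

```shell
# Build a timestamped dump path under the backup directory.
# dump_path is a hypothetical helper; the naming scheme is an assumption.
dump_path() {
  db="$1"
  stamp="$2"   # e.g. the output of: date -u +%Y%m%dT%H%M%SZ
  printf '/mnt/nas/backups/%s-%s.sql\n' "$db" "$stamp"
}

# Dump one database to a timestamped file. Connection settings come from the
# standard PGHOST / PGUSER / PGPASSWORD environment variables.
dump_database() {
  out="$(dump_path "$1" "$(date -u +%Y%m%dT%H%M%SZ)")"
  pg_dump --dbname="$1" --file="$out"
}

# Example:
# dump_database nextcloud
```

One file per database and per run keeps older dumps available and lets a single database be restored without touching the others.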

check recent pg-dump jobs bash

# list recent dump jobs
$ kubectl get jobs -n db --sort-by=.metadata.creationTimestamp

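# check the last dump succeeded (use a Job name from the list above;
# CronJob-created Jobs are named pg-dump-<suffix>, so job-name=pg-dump matches nothing)
$ kubectl logs -n db job/<pg-dump-job-name> --tail=20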

# list dump files on the NAS
$ ls -lh /mnt/nas/backups/*.sql 2>/dev/null
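
Restoring from one of these files is done with `psql`. The sketch below prints the restore command for review; `restore_cmd` is a hypothetical helper, and the dump file and database names in the example are placeholders.

```shell
# Print the psql command that replays a dump into a target database.
# restore_cmd is a hypothetical helper; review its output before running it.
restore_cmd() {
  dump="$1"
  db="$2"
  # ON_ERROR_STOP makes psql abort on the first failed statement instead of
  # continuing with a half-restored database.
  printf 'psql --set=ON_ERROR_STOP=on --dbname=%s --file=%s\n' "$db" "$dump"
}

# Example:
# restore_cmd /mnt/nas/backups/app.sql app_restore
```

Replaying the dump into a scratch database first makes it possible to copy out individual tables or schemas without overwriting the live database.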

05 NAS storage protection

The NAS itself needs protection: if it fails, all Longhorn backups and SQL dumps are lost. The UNAS 4 uses RAID 5, so a single drive failure causes no data loss. Snapshots and the off-site backup provide further recovery options.

Property	Value
RAID / redundancy	RAID 5 — single drive failure tolerated. NAS self-rebuilds when a replacement drive is inserted.
Snapshot schedule	Every 6 hours at 12:59 AM, 6:59 AM, 12:59 PM, and 6:59 PM, timed 29 min after the Longhorn backups start.
Off-site backup	Backed up to a remote NAS every 24 hours at 3:00 AM.

06 Recovery objectives

Scenario	RPO (max data loss)	RTO (recovery time)	Method
k3s node failure	Since last Longhorn backup (≤6 hours)	~30 min	Full rebuild — rebuild guide
Single app failure	0 (redeploy from Git)	<5 min	ArgoCD re-sync or pod restart
Database corruption	Since last Longhorn backup or pg-dump (≤6 hours)	~15 min	Restore from Longhorn backup (preferred); fall back to `.sql` dump if backup is not recent enough
NAS drive failure	0 — RAID 5, no data loss on single drive failure	RAID rebuild time (varies by drive size)	Replace failed drive; NAS automatically rebuilds RAID array
NAS server failure	Since last off-site backup (≤24 hours)	~1 hour	Restore from remote NAS backup to replacement hardware
Vault data loss	Since last Longhorn backup (≤6 hours)	~15 min	Vault data lives in a Longhorn PVC — restore the PVC, then unseal with the offline unseal key
Vault unseal key lost	N/A	Significantly longer — manual	Without the unseal key, Vault cannot be unsealed. Recovery requires manual re-initialization and re-seeding all secrets. Store the unseal key offline and never commit it to Git.
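
After a Vault PVC restore, the seal state can be checked before and after unsealing. The sketch below relies on the documented exit codes of `vault status` (0 unsealed, 2 sealed, 1 error); the helper name `seal_state` is hypothetical, and the pod name `vault-0` in the example is an assumption, not something confirmed here.

```shell
# Translate the exit code of `vault status` into a readable seal state.
# Documented codes: 0 = unsealed, 2 = sealed, 1 = error.
seal_state() {
  case "$1" in
    0) echo unsealed ;;
    2) echo sealed ;;
    *) echo error ;;
  esac
}

# Example, assuming the Vault pod is named vault-0:
# kubectl exec -n security vault-0 -- vault status >/dev/null; seal_state "$?"
# kubectl exec -it -n security vault-0 -- vault operator unseal   # prompts for the key
```

Running `vault operator unseal` without an argument prompts for the key, which keeps it out of shell history.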

last updated 2026-06-08