aly badawy/homelab

Backups

Three independent backup layers protect the homelab data. Longhorn backs up every stateful PVC to the NAS. PostgreSQL dumps add SQL-level snapshots. The NAS itself is protected at the storage level.


01 Backup layers

| Layer | What's protected | Destination | Recovery |
|---|---|---|---|
| Longhorn volume backups | All stateful Kubernetes PVCs (Vault, PostgreSQL, Nextcloud, Authentik, Immich) | NAS NFS: /var/nfs/shared/backups/pvcs | Restore from Longhorn UI during cluster rebuild |
| PostgreSQL SQL dumps | All databases in the shared PostgreSQL instance | NAS NFS: /mnt/nas/backups | Restore from .sql dump file with psql |
| NAS storage protection | All NAS data including backups themselves | | |

02 Longhorn volume backups

Longhorn's recurring job system automatically snapshots and backs up every labeled PVC. The default schedule applies to any PVC carrying the label recurring-job-group.longhorn.io/default: enabled.
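
As a rough illustration of opting a PVC into the default schedule, the sketch below wraps `kubectl label` in a small helper. The function name `enable_default_backups` is hypothetical and not part of this repo. Some Longhorn versions also document a `recurring-job.longhorn.io/source` label for PVC-sourced jobs, so check the docs for the installed version before relying on this alone.

```shell
#!/usr/bin/env bash
# Sketch only: enable_default_backups is a hypothetical helper, not from this repo.
# Usage: enable_default_backups NAMESPACE PVC_NAME
enable_default_backups() {
  ns="$1"
  pvc="$2"
  # Add (or overwrite) the label that the default recurring job group selects on.
  kubectl label pvc "$pvc" --namespace "$ns" --overwrite \
    recurring-job-group.longhorn.io/default=enabled
}

# Example (uncomment to run against a cluster):
# enable_default_backups db postgres-data-lh

# List every PVC that currently carries the label:
# kubectl get pvc --all-namespaces -l recurring-job-group.longhorn.io/default=enabled
```

Keeping the label key in one helper avoids typos when new stateful apps are added.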

PVCs covered by Longhorn backups text
namespace   PVC name                  data stored
────────────────────────────────────────────────────────────────
security    vault-data-lh             Vault KV secrets (all app credentials)
db          postgres-data-lh          All PostgreSQL databases
cloud       nextcloud-data-lh         Nextcloud application data
security    authentik-media-lh        Authentik uploaded media
security    authentik-templates-lh    Authentik custom templates
immich      immich-pvc (see note)     Immich app data (not the photo library)

Note: The Immich photo library is NAS-backed, not Longhorn-backed. Immich stores photos at /mnt/nas/immich, a direct NFS mount from the NAS. That data is protected by the NAS storage layer, not Longhorn. A cluster rebuild does not affect the photo library: it lives on the NAS and is remounted on first deploy.

The backup target and polling interval are set in the Longhorn Helm values:

k8s/components/longhorn/helm-values.yaml (backup section) yaml
defaultSettings:
  backupTarget: nfs://172.20.20.2:/var/nfs/shared/backups/pvcs
  backupTargetCredentialSecret: ""  # no auth needed (NAS on trusted Servers VLAN)
  backupstorePollInterval: 300  # seconds between polls of the backup target
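
The backup target URL packs the NAS address and the export path into one string. The sketch below splits it so the two parts can be reused for a manual NFS mount check. The helper names `nfs_target_host` and `nfs_target_path` and the mount point in the comment are illustrative assumptions, not part of the Longhorn configuration.

```shell
#!/usr/bin/env bash
# Sketch only: helper names are hypothetical.
# Split a Longhorn-style NFS backup target (nfs://HOST:/EXPORT/PATH) into its parts.

nfs_target_host() {
  rest="${1#nfs://}"            # drop the scheme
  printf '%s\n' "${rest%%:*}"   # keep everything before the first ':'
}

nfs_target_path() {
  rest="${1#nfs://}"
  printf '%s\n' "${rest#*:}"    # keep everything after the first ':'
}

TARGET="nfs://172.20.20.2:/var/nfs/shared/backups/pvcs"
echo "host=$(nfs_target_host "$TARGET") export=$(nfs_target_path "$TARGET")"

# Manual reachability check from a node with NFS client tools (run as root;
# the mount point is an arbitrary example):
#   mount -t nfs "$(nfs_target_host "$TARGET"):$(nfs_target_path "$TARGET")" /mnt/longhorn-backup-check
```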

03 Backup schedule

All layers run on a coordinated schedule. Longhorn snapshots run first, backups to NAS run 30 minutes later, and NAS snapshots run 29 minutes after that — giving Longhorn time to finish before the NAS preserves its own state.

| Layer | Event | Times (daily) |
|---|---|---|
| Longhorn | Volume snapshot | 12:00 AM · 6:00 AM · 12:00 PM · 6:00 PM |
| Longhorn → NAS | Backup to NFS target | 12:30 AM · 6:30 AM · 12:30 PM · 6:30 PM |
| NAS (UNAS 4) | Filesystem snapshot (RAID 5) | 12:59 AM · 6:59 AM · 12:59 PM · 6:59 PM |
| NAS → Remote NAS | Off-site backup | 3:00 AM (daily) |
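
The schedule above can be written as standard five-field cron expressions, and the gaps between stages checked with simple arithmetic. The variable names and the `minutes_between` helper below are illustrative assumptions. The actual schedules live in Longhorn and on the NAS, not in this script.

```shell
#!/usr/bin/env bash
# Sketch only: cron expressions derived from the schedule table, names are hypothetical.
# Fields: minute hour day-of-month month day-of-week
LONGHORN_SNAPSHOT_CRON="0 0,6,12,18 * * *"
LONGHORN_BACKUP_CRON="30 0,6,12,18 * * *"
NAS_SNAPSHOT_CRON="59 0,6,12,18 * * *"
OFFSITE_BACKUP_CRON="0 3 * * *"

# minutes_between START END
# Both arguments are zero-padded 24-hour HH:MM on the same day.
minutes_between() {
  sh="${1%%:*}"; sm="${1##*:}"
  eh="${2%%:*}"; em="${2##*:}"
  # Strip one leading zero so values such as "08" are not parsed as octal.
  sh="${sh#0}"; sm="${sm#0}"; eh="${eh#0}"; em="${em#0}"
  echo $(( (eh * 60 + em) - (sh * 60 + sm) ))
}

echo "Longhorn snapshot: $LONGHORN_SNAPSHOT_CRON"
echo "Longhorn backup:   $LONGHORN_BACKUP_CRON"
echo "NAS snapshot:      $NAS_SNAPSHOT_CRON"
echo "Off-site backup:   $OFFSITE_BACKUP_CRON"
echo "snapshot -> backup:       $(minutes_between 00:00 00:30) min"
echo "backup   -> NAS snapshot: $(minutes_between 00:30 00:59) min"
```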

04 PostgreSQL SQL dumps

A pg-dump CronJob in the db namespace periodically dumps all databases to SQL and writes the files to /mnt/nas/backups. This provides a second, SQL-level backup independent of Longhorn. It is useful for restoring specific tables or schemas as of the dump time without restoring an entire volume.

check recent pg-dump jobs bash
# list recent dump jobs
$ kubectl get jobs -n db --sort-by=.metadata.creationTimestamp

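# check the last dump succeeded
# (CronJob-created Jobs get a generated name suffix, so look up the newest Job first;
#  this assumes the newest Job in the db namespace is a pg-dump run)
$ kubectl logs -n db "$(kubectl get jobs -n db --sort-by=.metadata.creationTimestamp -o name | tail -n 1)" --tail=20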

# list dump files on the NAS
$ ls -lh /mnt/nas/backups/*.sql 2>/dev/null
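
For orientation, a dump and restore cycle along these lines might look like the sketch below. The CronJob's real script is not shown here, so the file naming scheme (`<db>-<UTC timestamp>.sql`) and the helper names are assumptions. Connection details are expected to come from the standard PGHOST, PGUSER, and PGPASSWORD environment variables.

```shell
#!/usr/bin/env bash
# Sketch only: helper names and the file naming scheme are hypothetical.
set -eu

BACKUP_DIR="${BACKUP_DIR:-/mnt/nas/backups}"

# dump_file_for DB STAMP -> full path of the dump file
dump_file_for() {
  printf '%s/%s-%s.sql\n' "$BACKUP_DIR" "$1" "$2"
}

# dump_database DB: write a plain-SQL dump of one database to BACKUP_DIR
dump_database() {
  db="$1"
  stamp="$(date -u +%Y%m%dT%H%M%SZ)"
  pg_dump --dbname="$db" --format=plain --file="$(dump_file_for "$db" "$stamp")"
}

# restore_database DB FILE: replay a plain-SQL dump, stopping at the first error
restore_database() {
  psql --dbname="$1" -v ON_ERROR_STOP=1 --file="$2"
}

# Example (uncomment where PostgreSQL client tools and the NAS mount are available;
# the database name is a placeholder):
# dump_database nextcloud
# restore_database nextcloud "$BACKUP_DIR/nextcloud-<timestamp>.sql"
```

Plain-format dumps are restored with psql, which matches the recovery method listed for this layer. A custom-format dump would need pg_restore instead.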

05 NAS storage protection

The NAS itself needs protection: if it fails, all Longhorn backups and SQL dumps stored on it are lost. The UNAS 4 uses RAID 5, so a single drive failure causes no data loss. Snapshots and off-site replication provide further recovery options.

| Property | Value |
|---|---|
| RAID / redundancy | RAID 5: single drive failure tolerated. NAS self-rebuilds when a replacement drive is inserted. |
| Snapshot schedule | Every 6 hours at 12:59 AM, 6:59 AM, 12:59 PM, and 6:59 PM, timed 29 min after Longhorn backups start. |
| Off-site backup | Backed up to a remote NAS every 24 hours at 3:00 AM. |

06 Recovery objectives

| Scenario | RPO (max data loss) | RTO (recovery time) | Method |
|---|---|---|---|
| k3s node failure | Since last Longhorn backup (≤6 hours) | ~30 min | Full rebuild (see rebuild guide) |
| Single app failure | 0 (redeploy from Git) | <5 min | ArgoCD re-sync or pod restart |
| Database corruption | Since last Longhorn backup or pg-dump (≤6 hours) | ~15 min | Restore from Longhorn backup (preferred); fall back to .sql dump if backup is not recent enough |
| NAS drive failure | 0 (RAID 5, no data loss on single drive failure) | RAID rebuild time (varies by drive size) | Replace failed drive; NAS automatically rebuilds RAID array |
| NAS server failure | Since last off-site backup (≤24 hours) | ~1 hour | Restore from remote NAS backup to replacement hardware |
| Vault data loss | Since last Longhorn backup (≤6 hours) | ~15 min | Vault data lives in a Longhorn PVC. Restore the PVC, then unseal with the offline unseal key |
| Vault unseal key lost | N/A | Significantly longer (manual) | Without the unseal key, Vault cannot be unsealed. Recovery requires manual re-initialization and re-seeding all secrets. Store the unseal key offline and never commit it to Git. |
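
One way to sanity-check the 6-hour RPO for SQL dumps is to confirm that a sufficiently recent .sql file exists on the NAS mount. The sketch below does that with `find`. The helper name `dump_within_rpo` is hypothetical, and the 360-minute default is taken from the table above, not from any monitoring that exists in this repo.

```shell
#!/usr/bin/env bash
# Sketch only: dump_within_rpo is a hypothetical helper.
# Usage: dump_within_rpo DIR [MAX_AGE_MINUTES]
# Succeeds if DIR contains at least one .sql file modified within the window.
dump_within_rpo() {
  dir="$1"
  max_age_min="${2:-360}"   # default window: 6 hours, per the RPO table
  # -mmin -N matches files modified less than N minutes ago (GNU and BSD find).
  recent="$(find "$dir" -maxdepth 1 -name '*.sql' -mmin "-$max_age_min" 2>/dev/null | head -n 1)"
  [ -n "$recent" ]
}

if dump_within_rpo /mnt/nas/backups; then
  echo "SQL dump RPO met: a dump newer than 6 hours exists"
else
  echo "SQL dump RPO missed: no dump in the last 6 hours"
fi
```

A check like this only covers the pg-dump layer. Longhorn backup freshness would need to be read from Longhorn itself.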
last updated 2026-06-08