Backup & Restore Runbook

Classification: CONFIDENTIAL (Internal Use Only)
Document: infrastructure/runbooks/backup-restore.md · v1.1 · 2026-04-17 · GPUS-IT

1. Overview

This runbook covers day-to-day backup operations, backup verification, and step-by-step restore procedures for all four WDC servers (SKY, RAIN, SUN, WIND). It is a companion to the Disaster Recovery Plan.

Backup pipeline summary:

What	How	When	Where
Server configs, services, logs	`/usr/local/bin/gpus-backup.sh`	02:00 UTC daily (cron)	NAS + GCS
Portal source (4 portals)	`/usr/local/bin/gpus-portal-backup.sh` on SKY (root crontab)	02:30 UTC daily	GCS
ESXi VM snapshots	vSphere automated schedule	Daily/Weekly/Monthly	NAS (`vmstorage`)

2. Backup Locations

NAS (primary, fast restore)

vmstorage.wdc.us.gl3:/volume1/backups/
  ├── sky/
  │   └── YYYY-MM-DD/
  │       ├── named-YYYY-MM-DD.tar.gz       # BIND zones + DNSSEC keys
  │       ├── dhcp-YYYY-MM-DD.tar.gz        # DHCP config + leases
  │       ├── etc-YYYY-MM-DD.tar.gz         # /etc (excl. named/dhcp)
  │       ├── aide-YYYY-MM-DD.tar.gz        # AIDE integrity database
  │       ├── logs-YYYY-MM-DD.tar.gz        # /var/log
  │       └── home-YYYY-MM-DD.tar.gz        # /home + /root/.ssh
  ├── rain/  (same structure)
  ├── sun/
  │   └── YYYY-MM-DD/
  │       ├── prometheus-YYYY-MM-DD.tar.gz  # Prometheus TSDB + config
  │       ├── grafana-YYYY-MM-DD.tar.gz     # Grafana dashboards + config
  │       ├── etc-YYYY-MM-DD.tar.gz
  │       ├── aide-YYYY-MM-DD.tar.gz
  │       ├── logs-YYYY-MM-DD.tar.gz
  │       └── home-YYYY-MM-DD.tar.gz
  └── wind/
      └── YYYY-MM-DD/
          ├── elasticsearch-YYYY-MM-DD.tar.gz
          ├── logstash-YYYY-MM-DD.tar.gz
          ├── kibana-YYYY-MM-DD.tar.gz
          ├── etc-YYYY-MM-DD.tar.gz
          ├── aide-YYYY-MM-DD.tar.gz
          ├── logs-YYYY-MM-DD.tar.gz
          └── home-YYYY-MM-DD.tar.gz

Retention: 30 days (auto-pruned by backup script)
Mounted at /mnt/nas-backup on each server
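
The pruning logic inside the backup script is not reproduced in this runbook. As a rough illustration only, the sketch below lists which dated directories would fall outside a retention window without deleting anything. The helper name `list_expired` is hypothetical, GNU `date` is assumed, and whether the real script treats a directory exactly 30 days old as expired is an assumption (here it is kept).

```shell
# list_expired DIR DAYS [TODAY]
# Prints YYYY-MM-DD subdirectories of DIR dated strictly before TODAY minus DAYS.
# Dry run only: nothing is removed. Assumes GNU date (-d).
list_expired() {
    today=${3:-$(date -u +%F)}
    cutoff=$(date -u -d "$today $2 days ago" +%F) || return 1
    ls "$1" \
        | grep -E '^[0-9]{4}-[0-9]{2}-[0-9]{2}$' \
        | sort \
        | awk -v c="$cutoff" '($0 "") < (c "")'   # force string comparison
}
```

For example, `list_expired /mnt/nas-backup/sky 30` shows which SKY restore points are past the NAS retention window as of today.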

GCS (secondary, offsite / site loss)

gs://gpus-infra-backups-wdc/
  ├── sky/YYYY-MM-DD/    (same archives as NAS)
  ├── rain/YYYY-MM-DD/
  ├── sun/YYYY-MM-DD/
  └── wind/YYYY-MM-DD/

Retention: 90 days (GCS lifecycle policy)

3. Checking Backup Status

View last backup log on a server

tail -20 /var/log/gpus-backup.log

Verify today's backup ran successfully

grep "Backup finished\|ERROR" /var/log/gpus-backup.log | tail -5
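
The grep above prints recent matches but leaves the pass/fail judgment to the reader. A minimal sketch of a scripted check follows. It assumes the log marks a completed run with the string "Backup finished" and failures with "ERROR" (the two patterns used in the grep); the real log format is not shown in this runbook, and `last_backup_ok` is a hypothetical helper name.

```shell
# last_backup_ok LOGFILE
# Returns 0 if the most recent marker line is "Backup finished",
# 1 if it is an ERROR line, 2 if the log has no marker at all.
# Assumption: these two strings are how the log records success and failure.
last_backup_ok() {
    last=$(grep -E 'Backup finished|ERROR' "$1" 2>/dev/null | tail -n 1)
    case "$last" in
        *ERROR*)             return 1 ;;
        *"Backup finished"*) return 0 ;;
        *)                   return 2 ;;   # never ran, or log missing
    esac
}
```

Usage: `last_backup_ok /var/log/gpus-backup.log && echo "last run OK"`.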

List available restore points on NAS

ls /mnt/nas-backup/<server>/
# e.g.
ls /mnt/nas-backup/sky/

List available restore points in GCS

gcloud storage ls gs://gpus-infra-backups-wdc/<server>/
# e.g.
gcloud storage ls gs://gpus-infra-backups-wdc/sky/

Check contents of a specific backup

# NAS
ls /mnt/nas-backup/sky/2026-03-16/

# GCS
gcloud storage ls gs://gpus-infra-backups-wdc/sky/2026-03-16/
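
Listing a restore point confirms that the archives exist, not that they are readable. One possible integrity check is sketched below; `verify_archives` is a hypothetical helper, not part of the backup script. It runs `gzip -t` and `tar -tzf` against every `*.tar.gz` in a directory.

```shell
# verify_archives DIR
# Tests every *.tar.gz in DIR. Prints one OK/FAIL line per archive and
# returns non-zero if any archive fails or if DIR holds no archives.
verify_archives() {
    rc=0
    found=0
    for f in "$1"/*.tar.gz; do
        [ -e "$f" ] || continue
        found=1
        if gzip -t "$f" 2>/dev/null && tar -tzf "$f" >/dev/null 2>&1; then
            echo "OK   $f"
        else
            echo "FAIL $f"
            rc=1
        fi
    done
    [ "$found" -eq 1 ] || { echo "no archives in $1"; return 1; }
    return "$rc"
}
```

Usage: `verify_archives /mnt/nas-backup/sky/2026-03-16`. For GCS, the archives would first need to be copied to local disk.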

Run a manual backup immediately

sudo /usr/local/bin/gpus-backup.sh

4. Restore Procedures

Restore source priority:

1. ESXi snapshot (fastest: whole VM, no manual restore needed)
2. NAS backup (fast: on-prem, no VPN needed)
3. GCS backup (last resort: use when the NAS is unavailable or after full site loss)
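
The choice between the NAS and GCS can be sketched as a small check. This is an illustration under stated assumptions: `pick_restore_source` is a hypothetical helper, the NAS is considered usable when the server's directory holds at least one dated restore point, and ESXi snapshots are not checked because they are managed in vSphere.

```shell
# pick_restore_source SERVER [NAS_ROOT]
# Prints "nas" if NAS_ROOT/SERVER holds at least one YYYY-MM-DD directory,
# otherwise "gcs". NAS_ROOT defaults to /mnt/nas-backup.
pick_restore_source() {
    nas_root=${2:-/mnt/nas-backup}
    if ls "$nas_root/$1" 2>/dev/null | grep -Eq '^[0-9]{4}-[0-9]{2}-[0-9]{2}$'; then
        echo nas
    else
        echo gcs
    fi
}
```

Usage: `pick_restore_source sky` prints `gcs` when the NAS is unmounted or empty, which is the cue to follow the GCS variant of the procedure.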

4.1 Restore SKY (DNS/DHCP)

RAIN auto-assumes primary DNS/DHCP when SKY is down. No service interruption to clients.

# --- Set the restore date (latest available) ---
# From NAS:
SERVER=sky
RESTOREDATE=$(ls /mnt/nas-backup/${SERVER}/ | grep -E '^[0-9]{4}-[0-9]{2}-[0-9]{2}$' | sort | tail -1)
echo "Restoring from: ${RESTOREDATE}"

# --- Extract archives ---
cd /mnt/nas-backup/${SERVER}/${RESTOREDATE}/
tar -xzf named-${RESTOREDATE}.tar.gz -C /
tar -xzf dhcp-${RESTOREDATE}.tar.gz -C /
tar -xzf etc-${RESTOREDATE}.tar.gz -C /
# Optional — restore home dirs
tar -xzf home-${RESTOREDATE}.tar.gz -C /

# --- Fix ownership ---
chown -R named:named /var/named/
chmod 600 /var/named/*.key /var/named/*.private 2>/dev/null || true

# --- Restore services ---
systemctl enable --now named dhcpd fail2ban auditd firewalld

# --- Verify DNS ---
rndc reload
sleep 2
dig @192.168.120.1 sky.wdc.us.gl3 A +short

# --- Re-sign DNSSEC ---
rndc sign wdc.us.gl3
rndc sign 120.168.192.in-addr.arpa

# --- Post-restore steps (mandatory) ---
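# aide --update exits 1-7 when it finds changes (expected after a restore); 14 and above are real errors
aide --update; rc=$?
[ "$rc" -lt 8 ] && mv /var/lib/aide/aide.db.new.gz /var/lib/aide/aide.db.gz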
echo "$(date) [RESTORE] SKY restored from ${RESTOREDATE} backup" >> /var/log/asset-inventory.log

If restoring from GCS instead:

SERVER=sky
RESTOREDATE=$(gcloud storage ls gs://gpus-infra-backups-wdc/${SERVER}/ | grep -E '[0-9]{4}-[0-9]{2}-[0-9]{2}' | sort | tail -1 | sed 's|.*/\([^/]*\)/|\1|')
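mkdir -p /tmp/restore/${SERVER}
chmod 700 /tmp/restore   # archives contain DNSSEC keys and /etc; keep them unreadable to other users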
gcloud storage cp "gs://gpus-infra-backups-wdc/${SERVER}/${RESTOREDATE}/*.tar.gz" /tmp/restore/${SERVER}/
cd /tmp/restore/${SERVER}/
tar -xzf named-${RESTOREDATE}.tar.gz -C /
tar -xzf dhcp-${RESTOREDATE}.tar.gz -C /
tar -xzf etc-${RESTOREDATE}.tar.gz -C /
# Then continue from "Fix ownership" above
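
Both variants above derive RESTOREDATE from a listing. If the listing is empty (NAS not mounted, wrong bucket path), the variable ends up blank and the later `tar` commands fail with confusing file names. A hedged sketch of two helpers follows: `latest_gcs_date` parses `gcloud storage ls` output supplied on stdin, and `require_restore_date` rejects anything that is not a `YYYY-MM-DD` date. Both names are hypothetical.

```shell
# latest_gcs_date
# Reads gs:// URLs (one per line) on stdin and prints the newest
# trailing YYYY-MM-DD path component.
latest_gcs_date() {
    grep -Eo '[0-9]{4}-[0-9]{2}-[0-9]{2}/?$' | tr -d '/' | sort | tail -n 1
}

# require_restore_date VALUE
# Succeeds only if VALUE is a well-formed YYYY-MM-DD string.
require_restore_date() {
    echo "$1" | grep -Eq '^[0-9]{4}-[0-9]{2}-[0-9]{2}$'
}
```

Usage: `RESTOREDATE=$(gcloud storage ls gs://gpus-infra-backups-wdc/sky/ | latest_gcs_date)`, followed by `require_restore_date "$RESTOREDATE" || { echo "no restore point found"; exit 1; }` before any extraction.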

4.2 Restore RAIN (DNS/DHCP)

SKY continues serving DNS/DHCP. No service interruption.

SERVER=rain
RESTOREDATE=$(ls /mnt/nas-backup/${SERVER}/ | grep -E '^[0-9]{4}-[0-9]{2}-[0-9]{2}$' | sort | tail -1)
echo "Restoring from: ${RESTOREDATE}"

cd /mnt/nas-backup/${SERVER}/${RESTOREDATE}/
tar -xzf named-${RESTOREDATE}.tar.gz -C /
tar -xzf dhcp-${RESTOREDATE}.tar.gz -C /
tar -xzf etc-${RESTOREDATE}.tar.gz -C /
tar -xzf home-${RESTOREDATE}.tar.gz -C /

chown -R named:named /var/named/
chmod 600 /var/named/*.key /var/named/*.private 2>/dev/null || true

systemctl enable --now named dhcpd fail2ban auditd firewalld

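# Verify RAIN answers queries, then compare SOA serials: matching serials on SKY and RAIN confirm the zone transfer
rndc reload
sleep 2
dig @192.168.120.2 rain.wdc.us.gl3 A +short
dig @192.168.120.1 wdc.us.gl3 SOA +short
dig @192.168.120.2 wdc.us.gl3 SOA +short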

# Verify DHCP failover peer
journalctl -u dhcpd --since "5 minutes ago" | grep -i failover

# Post-restore steps (mandatory)
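# aide --update exits 1-7 when it finds changes (expected after a restore); 14 and above are real errors
aide --update; rc=$?
[ "$rc" -lt 8 ] && mv /var/lib/aide/aide.db.new.gz /var/lib/aide/aide.db.gz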
echo "$(date) [RESTORE] RAIN restored from ${RESTOREDATE} backup" >> /var/log/asset-inventory.log

4.3 Restore SUN (Prometheus/Grafana)

DNS/DHCP unaffected. Only monitoring visibility is lost until SUN is restored.

SERVER=sun
RESTOREDATE=$(ls /mnt/nas-backup/${SERVER}/ | grep -E '^[0-9]{4}-[0-9]{2}-[0-9]{2}$' | sort | tail -1)
echo "Restoring from: ${RESTOREDATE}"

cd /mnt/nas-backup/${SERVER}/${RESTOREDATE}/
tar -xzf prometheus-${RESTOREDATE}.tar.gz -C /
tar -xzf grafana-${RESTOREDATE}.tar.gz -C /
tar -xzf etc-${RESTOREDATE}.tar.gz -C /
tar -xzf home-${RESTOREDATE}.tar.gz -C /

# Fix ownership
chown -R prometheus:prometheus /var/lib/prometheus/ /etc/prometheus/
chown -R grafana:grafana /var/lib/grafana/ /etc/grafana/

# Restore services
systemctl enable --now prometheus node_exporter grafana-server fail2ban auditd firewalld

# Verify Prometheus
sleep 5
curl -s http://192.168.120.3:9090/api/v1/targets | python3 -m json.tool | grep health

# Verify Grafana
curl -s http://192.168.120.3:3000/api/health

# Post-restore steps (mandatory)
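# aide --update exits 1-7 when it finds changes (expected after a restore); 14 and above are real errors
aide --update; rc=$?
[ "$rc" -lt 8 ] && mv /var/lib/aide/aide.db.new.gz /var/lib/aide/aide.db.gz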
echo "$(date) [RESTORE] SUN restored from ${RESTOREDATE} backup" >> /var/log/asset-inventory.log
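
The Prometheus check above prints every `health` field and leaves the counting to the reader. A small sketch that reduces the output to a single number follows. It assumes the `/api/v1/targets` response carries one `"health"` field per target with the value `up`, `down`, or `unknown`; `count_unhealthy_targets` is a hypothetical helper name.

```shell
# count_unhealthy_targets
# Reads /api/v1/targets JSON on stdin and prints how many targets
# report a health value other than "up".
count_unhealthy_targets() {
    grep -Eo '"health" *: *"[a-z]+"' | grep -vc '"up"' || true
}
```

Usage: `curl -s http://192.168.120.3:9090/api/v1/targets | count_unhealthy_targets`. A result of 0 right after a restore is the desired state; targets may need a scrape interval or two before they report `up`.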

4.4 Restore WIND (ELK Stack)

DNS/DHCP unaffected. SKY/RAIN rsyslog queues logs locally — they replay automatically when WIND reconnects.

SERVER=wind
RESTOREDATE=$(ls /mnt/nas-backup/${SERVER}/ | grep -E '^[0-9]{4}-[0-9]{2}-[0-9]{2}$' | sort | tail -1)
echo "Restoring from: ${RESTOREDATE}"

cd /mnt/nas-backup/${SERVER}/${RESTOREDATE}/
tar -xzf elasticsearch-${RESTOREDATE}.tar.gz -C /
tar -xzf logstash-${RESTOREDATE}.tar.gz -C /
tar -xzf kibana-${RESTOREDATE}.tar.gz -C /
tar -xzf etc-${RESTOREDATE}.tar.gz -C /
tar -xzf home-${RESTOREDATE}.tar.gz -C /

# Fix ownership
chown -R elasticsearch:elasticsearch /var/lib/elasticsearch/ /etc/elasticsearch/
chown -R logstash:logstash /etc/logstash/
chown -R kibana:kibana /etc/kibana/

# Restore services (ES must be up before logstash/kibana)
systemctl enable --now elasticsearch
sleep 15  # Wait for ES to initialize
systemctl enable --now logstash kibana fail2ban auditd firewalld

# Verify ES cluster health
curl -s http://192.168.120.4:9200/_cluster/health | python3 -m json.tool

# Verify Kibana
sleep 10
curl -s http://192.168.120.4:5601/api/status | python3 -m json.tool | grep overall

# Post-restore steps (mandatory)
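# aide --update exits 1-7 when it finds changes (expected after a restore); 14 and above are real errors
aide --update; rc=$?
[ "$rc" -lt 8 ] && mv /var/lib/aide/aide.db.new.gz /var/lib/aide/aide.db.gz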
echo "$(date) [RESTORE] WIND restored from ${RESTOREDATE} backup" >> /var/log/asset-inventory.log
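
The fixed `sleep 15` and `sleep 10` pauses in the procedure above are guesses about startup time. One alternative, shown here as a sketch rather than a tested replacement, is to poll until a command succeeds. `wait_until` is a hypothetical helper and the 120-second timeout in the usage line is an arbitrary assumption.

```shell
# wait_until TIMEOUT_SECONDS COMMAND [ARGS...]
# Re-runs COMMAND about once per second until it succeeds.
# Returns 1 if it has not succeeded after TIMEOUT_SECONDS attempts.
wait_until() {
    remaining=$1
    shift
    while ! "$@" >/dev/null 2>&1; do
        remaining=$((remaining - 1))
        [ "$remaining" -gt 0 ] || return 1
        sleep 1
    done
}
```

Usage: `wait_until 120 curl -sf http://192.168.120.4:9200/_cluster/health || echo "Elasticsearch did not come up"`, run after `systemctl enable --now elasticsearch` and before starting logstash and kibana.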

4.5 Restore a Specific File or Directory

If you need to recover a single file rather than a full server restore:

# Example: recover a single BIND zone file from SKY's NAS backup
SERVER=sky
RESTOREDATE=2026-03-16
RESTORE_TMP=/tmp/partial-restore   # not TMPDIR: that name is a standard environment variable read by mktemp and other tools

mkdir -p "${RESTORE_TMP}"
tar -xzf "/mnt/nas-backup/${SERVER}/${RESTOREDATE}/named-${RESTOREDATE}.tar.gz" \
    -C "${RESTORE_TMP}" \
    var/named/wdc.us.gl3.zone   # extract only the specific file

# Review before restoring
cat "${RESTORE_TMP}/var/named/wdc.us.gl3.zone"

# Copy back into place
cp "${RESTORE_TMP}/var/named/wdc.us.gl3.zone" /var/named/wdc.us.gl3.zone
chown named:named /var/named/wdc.us.gl3.zone
rndc reload
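
Partial extraction needs the member path exactly as it is stored in the archive (relative, such as `var/named/...`). When the exact path is not known, listing the archive first avoids a failed extract. A minimal sketch, with `find_in_archive` as a hypothetical helper name:

```shell
# find_in_archive ARCHIVE PATTERN
# Prints archive members whose stored path matches PATTERN (extended regex).
# Returns non-zero if nothing matches.
find_in_archive() {
    tar -tzf "$1" | grep -E -- "$2"
}
```

Usage: `find_in_archive /mnt/nas-backup/sky/2026-03-16/named-2026-03-16.tar.gz '\.zone$'` lists the zone files in that archive; any printed path can be passed to `tar -xzf` as the member to extract.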

5. NAS Mount Reference

The NAS is mounted at /mnt/nas-backup on all four servers. If it is not mounted:

# Check fstab entry
grep nas-backup /etc/fstab

# Verify NFS export is accessible
showmount -e vmstorage.wdc.us.gl3

# Mount
mount /mnt/nas-backup
mountpoint /mnt/nas-backup && df -h /mnt/nas-backup

Expected fstab entry:

vmstorage.wdc.us.gl3:/volume1/backups  /mnt/nas-backup  nfs  defaults,_netdev,nfsvers=4,rsize=65536,wsize=65536,hard,timeo=600,retrans=3  0  0
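
Comparing an fstab line against the expected entry by eye is error-prone. The sketch below checks only the parts most likely to matter for a backup mount: the filesystem type, `_netdev`, and `hard`. Which options are actually essential is an assumption, and `check_nas_fstab` is a hypothetical helper name.

```shell
# check_nas_fstab [FSTAB]
# Verifies that FSTAB (default /etc/fstab) has an uncommented NFS entry
# for /mnt/nas-backup that includes the _netdev and hard options.
check_nas_fstab() {
    awk '
        $1 !~ /^#/ && $2 == "/mnt/nas-backup" {
            found = 1
            opts = "," $4 ","
            if ($3 !~ /^nfs4?$/)               { print "wrong fs type: " $3;     bad = 1 }
            if (index(opts, ",_netdev,") == 0) { print "missing option: _netdev"; bad = 1 }
            if (index(opts, ",hard,") == 0)    { print "missing option: hard";    bad = 1 }
        }
        END {
            if (!found) { print "no entry for /mnt/nas-backup"; exit 1 }
            exit bad
        }
    ' "${1:-/etc/fstab}"
}
```

Usage: `check_nas_fstab && mount /mnt/nas-backup`.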

6. Backup Script Reference

Item	Detail
Script path	`/usr/local/bin/gpus-backup.sh`
Version	2.0 (2026-03-16)
Cron	`0 2 * * * root /usr/local/bin/gpus-backup.sh` (in `/etc/cron.d/`)
Log	`/var/log/gpus-backup.log`
Asset log	`/var/log/asset-inventory.log`
NAS retention	30 days
GCS retention	90 days (lifecycle policy on `gs://gpus-infra-backups-wdc`)
GCS auth	`/etc/gpus-backup-agent-key.json`

To view or edit the script:

cat /usr/local/bin/gpus-backup.sh        # view
sudoedit /usr/local/bin/gpus-backup.sh   # edit

7. Portal Source Backup (SKY)

Portal source trees are archived independently of the server-config backups because they live in the gpus-infra-portals repo on SKY rather than on each server. The script /usr/local/bin/gpus-portal-backup.sh runs from SKY's root crontab and backs up the four portal source directories directly to GCS.

Item	Detail
Host	SKY (root crontab)
Script path	`/usr/local/bin/gpus-portal-backup.sh`
Cron	`30 2 * * * /usr/local/bin/gpus-portal-backup.sh` (02:30 UTC daily; no user field, since this is root's own crontab rather than `/etc/cron.d/`)
Log	`/var/log/gpus-portal-backup.log`
Local retention	30 days
GCS retention	90 days (lifecycle policy on `gs://gpus-infra-backups-wdc`)

Portals backed up

The script archives the four portal source directories and uploads each to its own dated GCS path:

Portal	Source directory	GCS destination
MkDocs portal	`mkdocs/`	`gs://gpus-infra-backups-wdc/portals/mkdocs/<date>/`
Status site	`status-site/`	`gs://gpus-infra-backups-wdc/portals/status-site/<date>/`
Security site	`security-site/`	`gs://gpus-infra-backups-wdc/portals/security-site/<date>/`
SOC site	`soc-site/`	`gs://gpus-infra-backups-wdc/portals/soc-site/<date>/`

Checking portal backup status

# On SKY — view the most recent run
sudo tail -30 /var/log/gpus-portal-backup.log

# Confirm today's archives landed in GCS
gcloud storage ls "gs://gpus-infra-backups-wdc/portals/*/$(date -u +%Y-%m-%d)/"
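
The wildcard listing above shows whatever exists, so a portal with no upload is easy to miss. A sketch that names the missing portals explicitly follows. It reads a GCS listing on stdin and checks for each of the four portal names from the table above; `missing_portals` is a hypothetical helper name.

```shell
# missing_portals DATE
# Reads GCS URLs (one per line) on stdin and prints each portal that has
# nothing listed under portals/<portal>/DATE/.
missing_portals() {
    listing=$(cat)
    for p in mkdocs status-site security-site soc-site; do
        echo "$listing" | grep -Fq "/portals/$p/$1/" || echo "$p"
    done
}
```

Usage: `d=$(date -u +%Y-%m-%d); gcloud storage ls "gs://gpus-infra-backups-wdc/portals/*/$d/" | missing_portals "$d"`. Empty output means all four portals have an entry for that date.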

Manual portal backup

sudo /usr/local/bin/gpus-portal-backup.sh

Ordering note: Portal backup (02:30 UTC) runs 30 minutes after the server-config backup (02:00 UTC) and 30 minutes before the daily Lynis scan (03:00 UTC). This ordering is intentional — portal backup depends on the NAS/GCS credentials loaded by the earlier backup run, and the Lynis scan reads the backup log to verify freshness.

8. Forms Portal: Backup & Restore

The Forms portal (Cloud SQL + GCS attachments) is covered by four independent backup layers. This gives defense in depth: if any single layer fails or is compromised (key rotation error, accidental deletion, region outage), three other recovery paths remain intact.

Backup layers

#	Layer	Source of truth	Retention	Owner
1	Cloud SQL automated backups + PITR	GCP-managed	7 days (backups), continuous (PITR)	GCP
2	Daily SQL export to GCS	MAPLE cron 03:00 UTC via `/usr/local/bin/gpus-forms-db-backup.sh`	90 days	GPUS-IT (script prunes automatically)
3	Attachments bucket	`gs://gpus-forms-attachments` with object versioning + 7-year retention	7 years (retention lock pending — see 90-day review Cowork task)	GPUS-IT
4	YAML form definitions	git history in `gpus-infra-portals/forms/`	Indefinite (git)	GPUS-IT

Layer 2 details: daily Cloud SQL export

Script: /usr/local/bin/gpus-forms-db-backup.sh on MAPLE
Cron: /etc/cron.d/gpus-forms-db-backup — 0 3 * * * root /usr/local/bin/gpus-forms-db-backup.sh
Log: /var/log/gpus-forms-db-backup.log
Destination: gs://gpus-infra-backups-wdc/forms/YYYY-MM-DD/gpus-forms-db.sql.gz
Auth path: MAPLE SA (maple-agent@gpus-infra.iam.gserviceaccount.com) has roles/cloudsql.admin conditional on the gpus-forms-db instance; the per-instance Cloud SQL service agent p1056766133984-umgfbk@gcp-sa-cloud-sql.iam.gserviceaccount.com has roles/storage.objectAdmin on the backup bucket (this is the SA that actually writes the export object)
Retention: 90-day prune at end of each run
Verification: exits non-zero if export < 1024 bytes
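
The size rule in the Verification line can be reproduced by hand against a downloaded export. The sketch below is not the code from `gpus-forms-db-backup.sh`, which is not shown in this runbook; `export_big_enough` is a hypothetical helper that applies the same 1024-byte threshold.

```shell
# export_big_enough FILE [MIN_BYTES]
# Succeeds when FILE exists and is at least MIN_BYTES (default 1024) bytes.
export_big_enough() {
    [ -f "$1" ] || return 1
    size=$(( $(wc -c < "$1") ))   # arithmetic expansion strips any padding from wc
    [ "$size" -ge "${2:-1024}" ]
}
```

Usage: `export_big_enough gpus-forms-db.sql.gz || echo "export is suspiciously small"`. A size check only catches empty or truncated exports; running `gzip -t` on the file is a cheap additional test that the compressed stream is intact.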

Restore procedures

Full playbooks in Forms Portal — DR Playbook:

DR-FP-01 — Cloud SQL restore from PITR (most common; RPO 15 min, RTO ~30 min)
DR-FP-02 — Attachments bucket restore (from GCS object versioning)
DR-FP-03 — KMS key compromise (envelope-encryption re-wrap; requires gpus-forms-dek-wrapper rotation)
DR-FP-04 — Region outage (failover to us-east1 snapshot + Cloud Run redeploy)
DR-FP-05 — Full-stack restore (from GCS export; used when PITR window exhausted)

Target: RTO 4 hours, RPO 15 minutes (PITR granularity). Attachment RPO is zero thanks to GCS versioning + retention.

Test schedule

Quarterly — DR-FP-01 cold-restore test; scheduled via Cowork recurring task forms-portal-backup-restore-verify-quarterly. First run: 2026-07-15.
Annually — DR-FP-03 + DR-FP-04 tabletop exercise with Director of Cyber Security.
On change — re-verify after any major infra change (DB tier, KMS rotation policy, region migration).

Test results

Date	Playbook	Result	Notes
(none yet)	—	—	First scheduled test 2026-07-15