Backup & Restore Runbook
Classification: CONFIDENTIAL - Internal Use Only
Document: infrastructure/runbooks/backup-restore.md · v1.1 · 2026-04-17 · GPUS-IT
1. Overview
This runbook covers day-to-day backup operations, backup verification, and step-by-step restore procedures for all four WDC servers (SKY, RAIN, SUN, WIND). It is a companion to the Disaster Recovery Plan.
Backup pipeline summary:
| What | How | When | Where |
|---|---|---|---|
| Server configs, services, logs | /usr/local/bin/gpus-backup.sh | 02:00 UTC daily (cron) | NAS + GCS |
| Portal source (4 portals) | /usr/local/bin/gpus-portal-backup.sh on SKY (root crontab) | 02:30 UTC daily | GCS |
| ESXi VM snapshots | vSphere automated schedule | Daily/Weekly/Monthly | NAS (vmstorage) |
2. Backup Locations
NAS (primary: fast restore)
vmstorage.wdc.us.gl3:/volume1/backups/
├── sky/
│   └── YYYY-MM-DD/
│       ├── named-YYYY-MM-DD.tar.gz    # BIND zones + DNSSEC keys
│       ├── dhcp-YYYY-MM-DD.tar.gz     # DHCP config + leases
│       ├── etc-YYYY-MM-DD.tar.gz      # /etc (excl. named/dhcp)
│       ├── aide-YYYY-MM-DD.tar.gz     # AIDE integrity database
│       ├── logs-YYYY-MM-DD.tar.gz     # /var/log
│       └── home-YYYY-MM-DD.tar.gz     # /home + /root/.ssh
├── rain/                              (same structure)
├── sun/
│   └── YYYY-MM-DD/
│       ├── prometheus-YYYY-MM-DD.tar.gz    # Prometheus TSDB + config
│       ├── grafana-YYYY-MM-DD.tar.gz       # Grafana dashboards + config
│       ├── etc-YYYY-MM-DD.tar.gz
│       ├── aide-YYYY-MM-DD.tar.gz
│       ├── logs-YYYY-MM-DD.tar.gz
│       └── home-YYYY-MM-DD.tar.gz
└── wind/
    └── YYYY-MM-DD/
        ├── elasticsearch-YYYY-MM-DD.tar.gz
        ├── logstash-YYYY-MM-DD.tar.gz
        ├── kibana-YYYY-MM-DD.tar.gz
        ├── etc-YYYY-MM-DD.tar.gz
        ├── aide-YYYY-MM-DD.tar.gz
        ├── logs-YYYY-MM-DD.tar.gz
        └── home-YYYY-MM-DD.tar.gz
Mounted at /mnt/nas-backup on each server.
GCS (secondary: offsite / site loss)
gs://gpus-infra-backups-wdc/
├── sky/YYYY-MM-DD/ (same archives as NAS)
├── rain/YYYY-MM-DD/
├── sun/YYYY-MM-DD/
└── wind/YYYY-MM-DD/
3. Checking Backup Status
View last backup log on a server
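A minimal sketch for viewing the log, assuming the path listed in the Backup Script Reference (/var/log/gpus-backup.log). `show_backup_log` is a hypothetical helper and the default of 50 lines is an arbitrary choice.

```shell
# Hypothetical helper: print the last N lines of a backup log.
# The default path is the log listed in the Backup Script Reference.
show_backup_log() {
    sbl_log="${1:-/var/log/gpus-backup.log}"
    sbl_lines="${2:-50}"
    if [ ! -r "$sbl_log" ]; then
        echo "cannot read $sbl_log (try from a root shell)" >&2
        return 1
    fi
    tail -n "$sbl_lines" "$sbl_log"
}
```

Run `show_backup_log` for the server-config log, or `show_backup_log /var/log/gpus-portal-backup.log 30` for the portal log on SKY.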
Verify today's backup ran successfully
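One way to confirm today's backup, sketched under the assumption that a successful run leaves a dated directory of non-empty archives under the NAS mount (Section 2). `backup_present` is a hypothetical helper, not part of gpus-backup.sh, and it does not parse the backup log because the log format is not documented here.

```shell
# Hypothetical check: succeed only if <root>/<server>/<date>/ exists,
# holds at least one *.tar.gz, and none of those archives is empty.
backup_present() {
    bp_dir="$1/$2/$3"
    [ -d "$bp_dir" ] || { echo "missing: $bp_dir" >&2; return 1; }
    bp_count=0
    for bp_f in "$bp_dir"/*.tar.gz; do
        [ -e "$bp_f" ] || continue          # glob matched nothing
        [ -s "$bp_f" ] || { echo "empty: $bp_f" >&2; return 1; }
        bp_count=$((bp_count + 1))
    done
    [ "$bp_count" -gt 0 ] || { echo "no archives in $bp_dir" >&2; return 1; }
    echo "$bp_count archives in $bp_dir"
}
```

For example: `backup_present /mnt/nas-backup sky "$(date -u +%Y-%m-%d)"`.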
List available restore points on NAS
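A minimal sketch, assuming the NAS layout from Section 2 (one YYYY-MM-DD directory per restore point). `list_restore_points` is a hypothetical helper that applies the same date filter the restore procedures use.

```shell
# Hypothetical helper: list dated restore points (YYYY-MM-DD directories)
# for one server, oldest first. The root defaults to the NAS mount point.
list_restore_points() {
    lrp_dir="${2:-/mnt/nas-backup}/$1"
    ls -1 "$lrp_dir" 2>/dev/null \
        | grep -E '^[0-9]{4}-[0-9]{2}-[0-9]{2}$' \
        | sort
}
```

For example, `list_restore_points sky` prints one date per line; the last line is the newest restore point.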
List available restore points in GCS
gcloud storage ls gs://gpus-infra-backups-wdc/<server>/
# e.g.
gcloud storage ls gs://gpus-infra-backups-wdc/sky/
Check contents of a specific backup
# NAS
ls /mnt/nas-backup/sky/2026-03-16/
# GCS
gcloud storage ls gs://gpus-infra-backups-wdc/sky/2026-03-16/
Run a manual backup immediately
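A sketch of a manual run, assuming gpus-backup.sh takes no arguments (it is invoked that way from cron) and writes to /var/log/gpus-backup.log. `run_backup_now` is a hypothetical wrapper and is meant to be used from a root shell.

```shell
# Hypothetical wrapper: run a backup script, report its exit status,
# then show the end of its log. Assumes the script needs no arguments.
run_backup_now() {
    rbn_script="${1:-/usr/local/bin/gpus-backup.sh}"
    rbn_log="${2:-/var/log/gpus-backup.log}"
    "$rbn_script"
    rbn_rc=$?
    echo "backup exited with status $rbn_rc"
    [ -r "$rbn_log" ] && tail -n 20 "$rbn_log"
    return "$rbn_rc"
}
```

Without the wrapper, the equivalent is `sudo /usr/local/bin/gpus-backup.sh` followed by `sudo tail -20 /var/log/gpus-backup.log`.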
4. Restore Procedures
Restore source priority:
1. ESXi snapshot (fastest: whole VM, no manual restore needed)
2. NAS backup (fast: on-prem, no VPN needed)
3. GCS backup (last resort: use when the NAS is unavailable or after full site loss)
4.1 Restore SKY (DNS/DHCP)
RAIN auto-assumes primary DNS/DHCP when SKY is down. No service interruption to clients.
# --- Set the restore date (latest available) ---
# From NAS:
SERVER=sky
RESTOREDATE=$(ls /mnt/nas-backup/${SERVER}/ | grep -E '^[0-9]{4}-[0-9]{2}-[0-9]{2}$' | sort | tail -1)
echo "Restoring from: ${RESTOREDATE}"
# --- Extract archives ---
cd /mnt/nas-backup/${SERVER}/${RESTOREDATE}/
tar -xzf named-${RESTOREDATE}.tar.gz -C /
tar -xzf dhcp-${RESTOREDATE}.tar.gz -C /
tar -xzf etc-${RESTOREDATE}.tar.gz -C /
# Optional — restore home dirs
tar -xzf home-${RESTOREDATE}.tar.gz -C /
# --- Fix ownership ---
chown -R named:named /var/named/
chmod 600 /var/named/*.key /var/named/*.private 2>/dev/null || true
# --- Restore services ---
systemctl enable --now named dhcpd fail2ban auditd firewalld
# --- Verify DNS ---
rndc reload
sleep 2
dig @192.168.120.1 sky.wdc.us.gl3 A +short
# --- Re-sign DNSSEC ---
rndc sign wdc.us.gl3
rndc sign 120.168.192.in-addr.arpa
# --- Post-restore steps (mandatory) ---
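# aide --update exits 1-7 when it finds changes, which is expected after a restore; treat only higher codes as errors
aide --update; [ $? -le 7 ] && mv /var/lib/aide/aide.db.new.gz /var/lib/aide/aide.db.gz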
echo "$(date) [RESTORE] SKY restored from ${RESTOREDATE} backup" >> /var/log/asset-inventory.log
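Because the archives are extracted directly over /, it can help to confirm that each one is readable before the first `tar -xzf`. The following is a sketch only; `verify_archives` is a hypothetical helper and not part of the existing tooling.

```shell
# Hypothetical pre-flight check: confirm each named archive exists and
# that tar can list it to the end, before anything is extracted over /.
verify_archives() {
    va_dir="$1"; shift
    for va_name in "$@"; do
        va_file="$va_dir/$va_name"
        if ! tar -tzf "$va_file" >/dev/null 2>&1; then
            echo "unreadable archive: $va_file" >&2
            return 1
        fi
    done
    echo "all archives readable"
}
```

For SKY this could be called as `verify_archives "/mnt/nas-backup/${SERVER}/${RESTOREDATE}" "named-${RESTOREDATE}.tar.gz" "dhcp-${RESTOREDATE}.tar.gz" "etc-${RESTOREDATE}.tar.gz"`, stopping the restore if it fails.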
If restoring from GCS instead:
SERVER=sky
RESTOREDATE=$(gcloud storage ls gs://gpus-infra-backups-wdc/${SERVER}/ | grep -E '[0-9]{4}-[0-9]{2}-[0-9]{2}' | sort | tail -1 | sed 's|.*/\([^/]*\)/|\1|')
mkdir -p /tmp/restore/${SERVER}
gcloud storage cp "gs://gpus-infra-backups-wdc/${SERVER}/${RESTOREDATE}/*.tar.gz" /tmp/restore/${SERVER}/
cd /tmp/restore/${SERVER}/
tar -xzf named-${RESTOREDATE}.tar.gz -C /
tar -xzf dhcp-${RESTOREDATE}.tar.gz -C /
tar -xzf etc-${RESTOREDATE}.tar.gz -C /
# Then continue from "Fix ownership" above
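The date extraction in the GCS variant can be isolated into a small filter, which makes it easier to test without touching the bucket. This is a sketch; `latest_gcs_date` is a hypothetical name, and it assumes `gcloud storage ls` prints one `gs://.../YYYY-MM-DD/` prefix per line.

```shell
# Hypothetical helper: read `gcloud storage ls` output on stdin and print
# the newest YYYY-MM-DD prefix. ISO dates sort correctly as plain text.
latest_gcs_date() {
    grep -oE '[0-9]{4}-[0-9]{2}-[0-9]{2}/?$' | tr -d '/' | sort | tail -n 1
}
```

Usage: `RESTOREDATE=$(gcloud storage ls "gs://gpus-infra-backups-wdc/${SERVER}/" | latest_gcs_date)`. An empty result means no dated prefix was found, so check it before copying.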
4.2 Restore RAIN (DNS/DHCP)
SKY continues serving DNS/DHCP. No service interruption.
SERVER=rain
RESTOREDATE=$(ls /mnt/nas-backup/${SERVER}/ | grep -E '^[0-9]{4}-[0-9]{2}-[0-9]{2}$' | sort | tail -1)
echo "Restoring from: ${RESTOREDATE}"
cd /mnt/nas-backup/${SERVER}/${RESTOREDATE}/
tar -xzf named-${RESTOREDATE}.tar.gz -C /
tar -xzf dhcp-${RESTOREDATE}.tar.gz -C /
tar -xzf etc-${RESTOREDATE}.tar.gz -C /
tar -xzf home-${RESTOREDATE}.tar.gz -C /
chown -R named:named /var/named/
chmod 600 /var/named/*.key /var/named/*.private 2>/dev/null || true
systemctl enable --now named dhcpd fail2ban auditd firewalld
# Verify zone transfer from SKY
rndc reload
sleep 2
dig @192.168.120.2 rain.wdc.us.gl3 A +short
# Verify DHCP failover peer
journalctl -u dhcpd --since "5 minutes ago" | grep -i failover
# Post-restore steps (mandatory)
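# aide --update exits 1-7 when it finds changes, which is expected after a restore; treat only higher codes as errors
aide --update; [ $? -le 7 ] && mv /var/lib/aide/aide.db.new.gz /var/lib/aide/aide.db.gz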
echo "$(date) [RESTORE] RAIN restored from ${RESTOREDATE} backup" >> /var/log/asset-inventory.log
4.3 Restore SUN (Prometheus/Grafana)
DNS/DHCP unaffected. Only monitoring visibility is lost until SUN is restored.
SERVER=sun
RESTOREDATE=$(ls /mnt/nas-backup/${SERVER}/ | grep -E '^[0-9]{4}-[0-9]{2}-[0-9]{2}$' | sort | tail -1)
echo "Restoring from: ${RESTOREDATE}"
cd /mnt/nas-backup/${SERVER}/${RESTOREDATE}/
tar -xzf prometheus-${RESTOREDATE}.tar.gz -C /
tar -xzf grafana-${RESTOREDATE}.tar.gz -C /
tar -xzf etc-${RESTOREDATE}.tar.gz -C /
tar -xzf home-${RESTOREDATE}.tar.gz -C /
# Fix ownership
chown -R prometheus:prometheus /var/lib/prometheus/ /etc/prometheus/
chown -R grafana:grafana /var/lib/grafana/ /etc/grafana/
# Restore services
systemctl enable --now prometheus node_exporter grafana-server fail2ban auditd firewalld
# Verify Prometheus
sleep 5
curl -s http://192.168.120.3:9090/api/v1/targets | python3 -m json.tool | grep health
# Verify Grafana
curl -s http://192.168.120.3:3000/api/health
# Post-restore steps (mandatory)
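# aide --update exits 1-7 when it finds changes, which is expected after a restore; treat only higher codes as errors
aide --update; [ $? -le 7 ] && mv /var/lib/aide/aide.db.new.gz /var/lib/aide/aide.db.gz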
echo "$(date) [RESTORE] SUN restored from ${RESTOREDATE} backup" >> /var/log/asset-inventory.log
4.4 Restore WIND (ELK Stack)
DNS/DHCP unaffected. SKY/RAIN rsyslog queues logs locally — they replay automatically when WIND reconnects.
SERVER=wind
RESTOREDATE=$(ls /mnt/nas-backup/${SERVER}/ | grep -E '^[0-9]{4}-[0-9]{2}-[0-9]{2}$' | sort | tail -1)
echo "Restoring from: ${RESTOREDATE}"
cd /mnt/nas-backup/${SERVER}/${RESTOREDATE}/
tar -xzf elasticsearch-${RESTOREDATE}.tar.gz -C /
tar -xzf logstash-${RESTOREDATE}.tar.gz -C /
tar -xzf kibana-${RESTOREDATE}.tar.gz -C /
tar -xzf etc-${RESTOREDATE}.tar.gz -C /
tar -xzf home-${RESTOREDATE}.tar.gz -C /
# Fix ownership
chown -R elasticsearch:elasticsearch /var/lib/elasticsearch/ /etc/elasticsearch/
chown -R logstash:logstash /etc/logstash/
chown -R kibana:kibana /etc/kibana/
# Restore services (ES must be up before logstash/kibana)
systemctl enable --now elasticsearch
sleep 15 # Wait for ES to initialize
systemctl enable --now logstash kibana fail2ban auditd firewalld
# Verify ES cluster health
curl -s http://192.168.120.4:9200/_cluster/health | python3 -m json.tool
# Verify Kibana
sleep 10
curl -s http://192.168.120.4:5601/api/status | python3 -m json.tool | grep overall
# Post-restore steps (mandatory)
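# aide --update exits 1-7 when it finds changes, which is expected after a restore; treat only higher codes as errors
aide --update; [ $? -le 7 ] && mv /var/lib/aide/aide.db.new.gz /var/lib/aide/aide.db.gz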
echo "$(date) [RESTORE] WIND restored from ${RESTOREDATE} backup" >> /var/log/asset-inventory.log
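The fixed `sleep 15` and `sleep 10` waits in this procedure can be too short on a large Elasticsearch data set. A polling loop is one alternative, sketched here; `wait_until` is a hypothetical helper, and the timeout and interval in the usage note are assumptions, not tested values.

```shell
# Hypothetical helper: retry a command every <interval> seconds until it
# succeeds or <timeout> seconds have elapsed. Returns 0 on success and
# 1 on timeout. Output of the probed command is discarded.
wait_until() {
    wu_timeout="$1"; wu_interval="$2"; shift 2
    wu_elapsed=0
    while ! "$@" >/dev/null 2>&1; do
        if [ "$wu_elapsed" -ge "$wu_timeout" ]; then
            echo "timed out after ${wu_elapsed}s: $*" >&2
            return 1
        fi
        sleep "$wu_interval"
        wu_elapsed=$((wu_elapsed + wu_interval))
    done
}
```

For example, `wait_until 120 5 curl -sf http://192.168.120.4:9200/_cluster/health` could replace `sleep 15` before starting logstash and kibana. It only confirms that the endpoint answers; the cluster status still needs to be read from the health output.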
4.5 Restore a Specific File or Directory
If you need to recover a single file rather than a full server restore:
# Example: recover a single BIND zone file from SKY's NAS backup
SERVER=sky
RESTOREDATE=2026-03-16
RESTORE_TMP=/tmp/partial-restore   # avoid reusing TMPDIR, which mktemp and other tools read
mkdir -p "${RESTORE_TMP}"
tar -xzf /mnt/nas-backup/${SERVER}/${RESTOREDATE}/named-${RESTOREDATE}.tar.gz \
    -C "${RESTORE_TMP}" \
    var/named/wdc.us.gl3.zone # extract only the specific file
# Review before restoring
cat "${RESTORE_TMP}/var/named/wdc.us.gl3.zone"
# Copy back into place
cp "${RESTORE_TMP}/var/named/wdc.us.gl3.zone" /var/named/wdc.us.gl3.zone
chown named:named /var/named/wdc.us.gl3.zone
rndc reload
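Extracting a single member only works if the path matches the name stored in the archive exactly. A sketch for looking the name up first; `find_in_archive` is a hypothetical helper.

```shell
# Hypothetical helper: list archive members whose path matches an
# extended regular expression, so the exact name can be given to tar -x.
find_in_archive() {
    tar -tzf "$1" | grep -E -- "$2"
}
```

For example: `find_in_archive /mnt/nas-backup/sky/2026-03-16/named-2026-03-16.tar.gz 'gl3\.zone$'`. If the listed names start with `./`, the member path passed to `tar -xzf` must include that prefix.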
5. NAS Mount Reference
The NAS is mounted at /mnt/nas-backup on all four servers. If it is not mounted:
# Check fstab entry
grep nas-backup /etc/fstab
# Verify NFS export is accessible
showmount -e vmstorage.wdc.us.gl3
# Mount
mount /mnt/nas-backup
mountpoint /mnt/nas-backup && df -h /mnt/nas-backup
Expected fstab entry:
vmstorage.wdc.us.gl3:/volume1/backups /mnt/nas-backup nfs defaults,_netdev,nfsvers=4,rsize=65536,wsize=65536,hard,timeo=600,retrans=3 0 0
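If the NAS is not mounted, /mnt/nas-backup is just an empty local directory and the "latest restore point" lookups return nothing. A guard like the following can be run first. It is a sketch; `nfs_mounted` is a hypothetical helper that reads /proc/mounts, so it is Linux-specific.

```shell
# Hypothetical check: succeed if <dir> appears as an NFS mount in a
# mounts table (default /proc/mounts). Matches both nfs and nfs4 types.
nfs_mounted() {
    nm_dir="$1"
    nm_table="${2:-/proc/mounts}"
    awk -v dir="$nm_dir" \
        '$2 == dir && $3 ~ /^nfs/ { found = 1 } END { exit !found }' \
        "$nm_table"
}
```

Usage: `nfs_mounted /mnt/nas-backup || mount /mnt/nas-backup`.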
6. Backup Script Reference
| Item | Detail |
|---|---|
| Script path | /usr/local/bin/gpus-backup.sh |
| Version | 2.0 (2026-03-16) |
| Cron | 0 2 * * * root /usr/local/bin/gpus-backup.sh (in /etc/cron.d/) |
| Log | /var/log/gpus-backup.log |
| Asset log | /var/log/asset-inventory.log |
| NAS retention | 30 days |
| GCS retention | 90 days (lifecycle policy on gs://gpus-infra-backups-wdc) |
| GCS auth | /etc/gpus-backup-agent-key.json |
To view or edit the script:
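A minimal sketch for inspecting the script and syntax-checking it after an edit, without running a backup. `inspect_script` is a hypothetical helper; the default interpreter of bash is an assumption about how gpus-backup.sh is written.

```shell
# Hypothetical helper: show a script's first 20 lines and syntax-check it
# with the interpreter's -n flag, which parses without executing.
inspect_script() {
    is_path="${1:-/usr/local/bin/gpus-backup.sh}"
    is_shell="${2:-bash}"
    [ -r "$is_path" ] || { echo "cannot read $is_path" >&2; return 1; }
    head -n 20 "$is_path"
    "$is_shell" -n "$is_path"
}
```

To read the whole file use `less /usr/local/bin/gpus-backup.sh`; to edit it, `sudoedit /usr/local/bin/gpus-backup.sh`, then re-run `inspect_script` before the next scheduled run.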
7. Portal Source Backup (SKY)
Portal source trees are archived independently of the server-config backups because they live in the gpus-infra-portals repo on SKY rather than on each server. The script /usr/local/bin/gpus-portal-backup.sh runs from SKY's root crontab and backs up the four portal source directories directly to GCS.
| Item | Detail |
|---|---|
| Host | SKY (root crontab) |
| Script path | /usr/local/bin/gpus-portal-backup.sh |
| Cron | 30 2 * * * root /usr/local/bin/gpus-portal-backup.sh — 02:30 UTC daily |
| Log | /var/log/gpus-portal-backup.log |
| Local retention | 30 days |
| GCS retention | 90 days (lifecycle policy on gs://gpus-infra-backups-wdc) |
Portals backed up
The script archives the four portal source directories and uploads each to its own dated GCS path:
| Portal | Source directory | GCS destination |
|---|---|---|
| MkDocs portal | mkdocs/ | gs://gpus-infra-backups-wdc/portals/mkdocs/<date>/ |
| Status site | status-site/ | gs://gpus-infra-backups-wdc/portals/status-site/<date>/ |
| Security site | security-site/ | gs://gpus-infra-backups-wdc/portals/security-site/<date>/ |
| SOC site | soc-site/ | gs://gpus-infra-backups-wdc/portals/soc-site/<date>/ |
Checking portal backup status
# On SKY — view the most recent run
sudo tail -30 /var/log/gpus-portal-backup.log
# Confirm today's archives landed in GCS
gcloud storage ls "gs://gpus-infra-backups-wdc/portals/*/$(date -u +%Y-%m-%d)/"
Manual portal backup
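A sketch of a manual run followed by a coverage check, assuming gpus-portal-backup.sh takes no arguments (as in its cron entry). `missing_portals` is a hypothetical helper that reports which of the four portals has no GCS path for a given date.

```shell
# Hypothetical check: read `gcloud storage ls` output on stdin and print
# each of the four portals that has no path for the given date.
# Returns 0 when all four are present, 1 otherwise.
missing_portals() {
    mp_date="$1"
    mp_listing=$(cat)
    mp_rc=0
    for mp_portal in mkdocs status-site security-site soc-site; do
        if ! printf '%s\n' "$mp_listing" | grep -q "/portals/$mp_portal/$mp_date/"; then
            echo "$mp_portal"
            mp_rc=1
        fi
    done
    return "$mp_rc"
}

# Usage on SKY (script invocation without arguments is an assumption):
#   sudo /usr/local/bin/gpus-portal-backup.sh
#   today=$(date -u +%Y-%m-%d)
#   gcloud storage ls "gs://gpus-infra-backups-wdc/portals/*/${today}/" \
#       | missing_portals "$today"
```

No output and a zero exit status mean all four portals have an archive path for that date.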
Ordering note: Portal backup (02:30 UTC) runs 30 minutes after the server-config backup (02:00 UTC) and 30 minutes before the daily Lynis scan (03:00 UTC). This ordering is intentional — portal backup depends on the NAS/GCS credentials loaded by the earlier backup run, and the Lynis scan reads the backup log to verify freshness.
8. Forms Portal: Backup & Restore
The Forms portal (Cloud SQL + GCS attachments) is covered by four independent backup layers. This gives defense in depth: if any single layer is compromised (key rotation error, accidental deletion, region outage), three other recovery paths remain intact.
Backup layers
| # | Layer | Source of truth | Retention | Owner |
|---|---|---|---|---|
| 1 | Cloud SQL automated backups + PITR | GCP-managed | 7 days (backups), continuous (PITR) | GCP |
| 2 | Daily SQL export to GCS | MAPLE cron 03:00 UTC via /usr/local/bin/gpus-forms-db-backup.sh | 90 days | GPUS-IT (script prunes automatically) |
| 3 | Attachments bucket | gs://gpus-forms-attachments with object versioning + 7-year retention | 7 years (retention lock pending; see 90-day review Cowork task) | GPUS-IT |
| 4 | YAML form definitions | git history in gpus-infra-portals/forms/ | Indefinite (git) | GPUS-IT |
Layer 2 details: daily Cloud SQL export
- Script: /usr/local/bin/gpus-forms-db-backup.sh on MAPLE
- Cron: /etc/cron.d/gpus-forms-db-backup, with entry 0 3 * * * root /usr/local/bin/gpus-forms-db-backup.sh
- Log: /var/log/gpus-forms-db-backup.log
- Destination: gs://gpus-infra-backups-wdc/forms/YYYY-MM-DD/gpus-forms-db.sql.gz
- Auth path: MAPLE SA (maple-agent@gpus-infra.iam.gserviceaccount.com) has roles/cloudsql.admin conditional on the gpus-forms-db instance; the per-instance Cloud SQL service agent p1056766133984-umgfbk@gcp-sa-cloud-sql.iam.gserviceaccount.com has roles/storage.objectAdmin on the backup bucket (this is the SA that actually writes the export object)
- Retention: 90-day prune at end of each run
- Verification: exits non-zero if export < 1024 bytes
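The size check in the last bullet can be reproduced by hand against a downloaded export. This is a sketch of that rule only, not the code in gpus-forms-db-backup.sh; `check_export_size` is a hypothetical helper.

```shell
# Hypothetical re-creation of the size rule: fail when the file is
# missing or smaller than the minimum (1024 bytes by default).
check_export_size() {
    ces_file="$1"
    ces_min="${2:-1024}"
    [ -f "$ces_file" ] || { echo "missing: $ces_file" >&2; return 1; }
    ces_size=$(wc -c < "$ces_file")
    ces_size=$((ces_size + 0))          # strip padding some wc builds add
    if [ "$ces_size" -lt "$ces_min" ]; then
        echo "too small: $ces_file is $ces_size bytes (< $ces_min)" >&2
        return 1
    fi
    echo "ok: $ces_file is $ces_size bytes"
}
```

A size threshold only catches empty or truncated exports; running `gzip -t` on the file is a cheap additional check that the compressed stream is intact.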
Restore procedures
Full playbooks in Forms Portal — DR Playbook:
- DR-FP-01: Cloud SQL restore from PITR (most common; RPO 15 min, RTO ~30 min)
- DR-FP-02: Attachments bucket restore (from GCS object versioning)
- DR-FP-03: KMS key compromise (envelope-encryption re-wrap; requires gpus-forms-dek-wrapper rotation)
- DR-FP-04: Region outage (failover to us-east1 snapshot + Cloud Run redeploy)
- DR-FP-05: Full-stack restore (from GCS export; used when PITR window exhausted)
Target: RTO 4 hours, RPO 15 minutes (PITR granularity). Attachment RPO is zero thanks to GCS versioning + retention.
Test schedule
- Quarterly: DR-FP-01 cold-restore test, scheduled via Cowork recurring task forms-portal-backup-restore-verify-quarterly. First run: 2026-07-15.
- Annually: DR-FP-03 + DR-FP-04 tabletop exercise with Director of Cyber Security.
- On change: re-verify after any major infra change (DB tier, KMS rotation policy, region migration).
Test results
| Date | Playbook | Result | Notes |
|---|---|---|---|
| (none yet) | — | — | First scheduled test 2026-07-15 |