Backup & Restore Runbook

Classification: CONFIDENTIAL - Internal Use Only
Document: infrastructure/runbooks/backup-restore.md · v1.1 · 2026-04-17 · GPUS-IT


1. Overview

This runbook covers day-to-day backup operations, backup verification, and step-by-step restore procedures for all four WDC servers (SKY, RAIN, SUN, WIND). It is a companion to the Disaster Recovery Plan.

Backup pipeline summary:

| What | How | When | Where |
|---|---|---|---|
| Server configs, services, logs | /usr/local/bin/gpus-backup.sh | 02:00 UTC daily (cron) | NAS + GCS |
| Portal source (4 portals) | /usr/local/bin/gpus-portal-backup.sh on SKY (root crontab) | 02:30 UTC daily | GCS |
| ESXi VM snapshots | vSphere automated schedule | Daily/Weekly/Monthly | NAS (vmstorage) |

2. Backup Locations

NAS (primary — fast restore)

vmstorage.wdc.us.gl3:/volume1/backups/
  ├── sky/
  │   └── YYYY-MM-DD/
  │       ├── named-YYYY-MM-DD.tar.gz       # BIND zones + DNSSEC keys
  │       ├── dhcp-YYYY-MM-DD.tar.gz        # DHCP config + leases
  │       ├── etc-YYYY-MM-DD.tar.gz         # /etc (excl. named/dhcp)
  │       ├── aide-YYYY-MM-DD.tar.gz        # AIDE integrity database
  │       ├── logs-YYYY-MM-DD.tar.gz        # /var/log
  │       └── home-YYYY-MM-DD.tar.gz        # /home + /root/.ssh
  ├── rain/  (same structure)
  ├── sun/
  │   └── YYYY-MM-DD/
  │       ├── prometheus-YYYY-MM-DD.tar.gz  # Prometheus TSDB + config
  │       ├── grafana-YYYY-MM-DD.tar.gz     # Grafana dashboards + config
  │       ├── etc-YYYY-MM-DD.tar.gz
  │       ├── aide-YYYY-MM-DD.tar.gz
  │       ├── logs-YYYY-MM-DD.tar.gz
  │       └── home-YYYY-MM-DD.tar.gz
  └── wind/
      └── YYYY-MM-DD/
          ├── elasticsearch-YYYY-MM-DD.tar.gz
          ├── logstash-YYYY-MM-DD.tar.gz
          ├── kibana-YYYY-MM-DD.tar.gz
          ├── etc-YYYY-MM-DD.tar.gz
          ├── aide-YYYY-MM-DD.tar.gz
          ├── logs-YYYY-MM-DD.tar.gz
          └── home-YYYY-MM-DD.tar.gz
Retention: 30 days (auto-pruned by backup script)
Mounted at /mnt/nas-backup on each server

GCS (secondary — offsite / site loss)

gs://gpus-infra-backups-wdc/
  ├── sky/YYYY-MM-DD/    (same archives as NAS)
  ├── rain/YYYY-MM-DD/
  ├── sun/YYYY-MM-DD/
  └── wind/YYYY-MM-DD/
Retention: 90 days (GCS lifecycle policy)


3. Checking Backup Status

View last backup log on a server

tail -20 /var/log/gpus-backup.log

Verify today's backup ran successfully

grep "Backup finished\|ERROR" /var/log/gpus-backup.log | tail -5
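
The grep above leaves the operator to read the last few lines by eye. A minimal sketch of turning that into a single status follows. It assumes only what the grep implies, namely that the log contains the literal strings "Backup finished" and "ERROR"; the function name `backup_log_status` is hypothetical, and because the log's timestamp format is not documented here the sketch does not check that the line is from today.

```shell
#!/usr/bin/env bash
# Sketch: report the outcome of the most recent status-bearing log line.
# Assumption: the log uses the literal markers "Backup finished" and "ERROR".
backup_log_status() {
    local log="${1:-/var/log/gpus-backup.log}"
    local last
    # Keep only the final line that carries either marker.
    last=$(grep -E 'Backup finished|ERROR' "$log" 2>/dev/null | tail -n 1)
    case "$last" in
        *ERROR*)             echo "error";   return 1 ;;
        *"Backup finished"*) echo "ok";      return 0 ;;
        *)                   echo "unknown"; return 2 ;;
    esac
}
```

Calling `backup_log_status` with no argument reads /var/log/gpus-backup.log; the exit code makes it usable in a monitoring check.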

List available restore points on NAS

ls /mnt/nas-backup/<server>/
# e.g.
ls /mnt/nas-backup/sky/

List available restore points in GCS

gcloud storage ls gs://gpus-infra-backups-wdc/<server>/
# e.g.
gcloud storage ls gs://gpus-infra-backups-wdc/sky/

Check contents of a specific backup

# NAS
ls /mnt/nas-backup/sky/2026-03-16/

# GCS
gcloud storage ls gs://gpus-infra-backups-wdc/sky/2026-03-16/
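
Listing a restore point shows that files exist, not that they are readable. The sketch below checks each expected archive for presence and integrity. It assumes the `<name>-<date>.tar.gz` naming shown in section 2; the function name `verify_restore_point` is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: confirm every expected archive in a restore point exists and is readable.
# Usage: verify_restore_point <dir> <date> <name>...
verify_restore_point() {
    local dir="$1" date="$2" name archive rc=0
    shift 2
    for name in "$@"; do
        archive="${dir}/${name}-${date}.tar.gz"
        if [ ! -s "$archive" ]; then
            echo "MISSING ${archive}"; rc=1
        elif ! tar -tzf "$archive" >/dev/null 2>&1; then
            # tar -t reads the whole archive, so truncation or corruption shows up here.
            echo "CORRUPT ${archive}"; rc=1
        else
            echo "OK ${archive}"
        fi
    done
    return "$rc"
}
```

For SKY this might be run as `verify_restore_point /mnt/nas-backup/sky/2026-03-16 2026-03-16 named dhcp etc aide logs home`. For a GCS restore point, the same check applies after the archives have been copied to local disk.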

Run a manual backup immediately

sudo /usr/local/bin/gpus-backup.sh

4. Restore Procedures

Restore source priority:

1. ESXi snapshot (fastest: whole VM, no manual restore needed)
2. NAS backup (fast: on-prem, no VPN needed)
3. GCS backup (last resort: use when the NAS is unavailable or after full site loss)
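
Whether the NAS step is viable can be decided mechanically. The sketch below prints the newest dated directory for a server and returns non-zero when there is none (for example because the NAS is not mounted), which is the cue to fall back to GCS. The function name `latest_nas_restore_point` is hypothetical; the date pattern matches the directory layout in section 2.

```shell
#!/usr/bin/env bash
# Sketch: print the newest dated restore point under a NAS server directory.
# Returns non-zero when none is found, which is the cue to fall back to GCS.
latest_nas_restore_point() {
    local server_dir="$1" latest
    latest=$(ls -1 "$server_dir" 2>/dev/null \
        | grep -E '^[0-9]{4}-[0-9]{2}-[0-9]{2}$' | sort | tail -n 1)
    [ -n "$latest" ] || return 1
    echo "$latest"
}
```

Used as `RESTOREDATE=$(latest_nas_restore_point /mnt/nas-backup/sky) || echo "No NAS restore point; use GCS"`, it also guards against running the extraction steps with an empty date.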


4.1 Restore SKY (DNS/DHCP)

RAIN auto-assumes primary DNS/DHCP when SKY is down. No service interruption to clients.

# --- Set the restore date (latest available) ---
# From NAS:
SERVER=sky
RESTOREDATE=$(ls /mnt/nas-backup/${SERVER}/ | grep -E '^[0-9]{4}-[0-9]{2}-[0-9]{2}$' | sort | tail -1)
echo "Restoring from: ${RESTOREDATE}"

# --- Extract archives ---
cd /mnt/nas-backup/${SERVER}/${RESTOREDATE}/
tar -xzf named-${RESTOREDATE}.tar.gz -C /
tar -xzf dhcp-${RESTOREDATE}.tar.gz -C /
tar -xzf etc-${RESTOREDATE}.tar.gz -C /
# Optional — restore home dirs
tar -xzf home-${RESTOREDATE}.tar.gz -C /

# --- Fix ownership ---
chown -R named:named /var/named/
chmod 600 /var/named/*.key /var/named/*.private 2>/dev/null || true

# --- Restore services ---
systemctl enable --now named dhcpd fail2ban auditd firewalld

# --- Verify DNS ---
rndc reload
sleep 2
dig @192.168.120.1 sky.wdc.us.gl3 A +short

# --- Re-sign DNSSEC ---
rndc sign wdc.us.gl3
rndc sign 120.168.192.in-addr.arpa

# --- Post-restore steps (mandatory) ---
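# aide --update exits 1-7 when it finds changes (expected after a restore); 14 and above are errors
aide --update; [ $? -lt 8 ] && mv /var/lib/aide/aide.db.new.gz /var/lib/aide/aide.db.gz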
echo "$(date) [RESTORE] SKY restored from ${RESTOREDATE} backup" >> /var/log/asset-inventory.log

If restoring from GCS instead:

SERVER=sky
RESTOREDATE=$(gcloud storage ls gs://gpus-infra-backups-wdc/${SERVER}/ | grep -E '[0-9]{4}-[0-9]{2}-[0-9]{2}' | sort | tail -1 | sed 's|.*/\([^/]*\)/|\1|')
mkdir -p /tmp/restore/${SERVER}
gcloud storage cp "gs://gpus-infra-backups-wdc/${SERVER}/${RESTOREDATE}/*.tar.gz" /tmp/restore/${SERVER}/
cd /tmp/restore/${SERVER}/
tar -xzf named-${RESTOREDATE}.tar.gz -C /
tar -xzf dhcp-${RESTOREDATE}.tar.gz -C /
tar -xzf etc-${RESTOREDATE}.tar.gz -C /
# Then continue from "Fix ownership" above


4.2 Restore RAIN (DNS/DHCP)

SKY continues serving DNS/DHCP. No service interruption.

SERVER=rain
RESTOREDATE=$(ls /mnt/nas-backup/${SERVER}/ | grep -E '^[0-9]{4}-[0-9]{2}-[0-9]{2}$' | sort | tail -1)
echo "Restoring from: ${RESTOREDATE}"

cd /mnt/nas-backup/${SERVER}/${RESTOREDATE}/
tar -xzf named-${RESTOREDATE}.tar.gz -C /
tar -xzf dhcp-${RESTOREDATE}.tar.gz -C /
tar -xzf etc-${RESTOREDATE}.tar.gz -C /
tar -xzf home-${RESTOREDATE}.tar.gz -C /

chown -R named:named /var/named/
chmod 600 /var/named/*.key /var/named/*.private 2>/dev/null || true

systemctl enable --now named dhcpd fail2ban auditd firewalld

# Verify zone transfer from SKY
rndc reload
sleep 2
dig @192.168.120.2 rain.wdc.us.gl3 A +short

# Verify DHCP failover peer
journalctl -u dhcpd --since "5 minutes ago" | grep -i failover

# Post-restore steps (mandatory)
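# aide --update exits 1-7 when it finds changes (expected after a restore); 14 and above are errors
aide --update; [ $? -lt 8 ] && mv /var/lib/aide/aide.db.new.gz /var/lib/aide/aide.db.gz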
echo "$(date) [RESTORE] RAIN restored from ${RESTOREDATE} backup" >> /var/log/asset-inventory.log
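
Resolving RAIN's own A record shows that named is answering, but not that its zone data matches SKY's. One hedged way to check that is to compare SOA serials on both servers. The sketch below assumes RAIN serves the same zones as SKY and that matching serials indicate a completed transfer; the function names `soa_serial` and `zone_in_sync` are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: compare SOA serials to confirm two servers hold the same zone version.
# `dig +short SOA` prints: mname rname serial refresh retry expire minimum
soa_serial() {
    # Read one SOA answer line on stdin and print its third field (the serial).
    awk 'NF >= 7 { print $3; exit }'
}

zone_in_sync() {
    local zone="$1" primary="$2" secondary="$3" a b
    a=$(dig @"$primary" "$zone" SOA +short | soa_serial)
    b=$(dig @"$secondary" "$zone" SOA +short | soa_serial)
    # Succeed only when a serial was returned and both servers agree.
    [ -n "$a" ] && [ "$a" = "$b" ]
}
```

For example, `zone_in_sync wdc.us.gl3 192.168.120.1 192.168.120.2 && echo "in sync"` uses the SKY and RAIN addresses from the dig commands in sections 4.1 and 4.2.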

4.3 Restore SUN (Prometheus/Grafana)

DNS/DHCP unaffected. Only monitoring visibility is lost until SUN is restored.

SERVER=sun
RESTOREDATE=$(ls /mnt/nas-backup/${SERVER}/ | grep -E '^[0-9]{4}-[0-9]{2}-[0-9]{2}$' | sort | tail -1)
echo "Restoring from: ${RESTOREDATE}"

cd /mnt/nas-backup/${SERVER}/${RESTOREDATE}/
tar -xzf prometheus-${RESTOREDATE}.tar.gz -C /
tar -xzf grafana-${RESTOREDATE}.tar.gz -C /
tar -xzf etc-${RESTOREDATE}.tar.gz -C /
tar -xzf home-${RESTOREDATE}.tar.gz -C /

# Fix ownership
chown -R prometheus:prometheus /var/lib/prometheus/ /etc/prometheus/
chown -R grafana:grafana /var/lib/grafana/ /etc/grafana/

# Restore services
systemctl enable --now prometheus node_exporter grafana-server fail2ban auditd firewalld

# Verify Prometheus
sleep 5
curl -s http://192.168.120.3:9090/api/v1/targets | python3 -m json.tool | grep health

# Verify Grafana
curl -s http://192.168.120.3:3000/api/health

# Post-restore steps (mandatory)
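# aide --update exits 1-7 when it finds changes (expected after a restore); 14 and above are errors
aide --update; [ $? -lt 8 ] && mv /var/lib/aide/aide.db.new.gz /var/lib/aide/aide.db.gz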
echo "$(date) [RESTORE] SUN restored from ${RESTOREDATE} backup" >> /var/log/asset-inventory.log
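
Grepping the pretty-printed targets response yields one line per target, which is hard to scan on a busy server. A small sketch that condenses it into counts per health state follows. It assumes the response carries a `"health"` field per target with lowercase values such as up or down; the function name `summarize_target_health` is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: summarize target health from a Prometheus /api/v1/targets response.
# Reads the raw JSON on stdin and prints one "count state" line per health state.
summarize_target_health() {
    # Tolerate optional spaces around the colon so pretty-printed JSON also works.
    grep -o '"health" *: *"[a-z]*"' | cut -d'"' -f4 | sort | uniq -c \
        | awk '{ print $1, $2 }'
}
```

Usage: `curl -s http://192.168.120.3:9090/api/v1/targets | summarize_target_health`. Any state other than up in the output is worth investigating before declaring the restore complete.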

4.4 Restore WIND (ELK Stack)

DNS/DHCP unaffected. SKY/RAIN rsyslog queues logs locally — they replay automatically when WIND reconnects.

SERVER=wind
RESTOREDATE=$(ls /mnt/nas-backup/${SERVER}/ | grep -E '^[0-9]{4}-[0-9]{2}-[0-9]{2}$' | sort | tail -1)
echo "Restoring from: ${RESTOREDATE}"

cd /mnt/nas-backup/${SERVER}/${RESTOREDATE}/
tar -xzf elasticsearch-${RESTOREDATE}.tar.gz -C /
tar -xzf logstash-${RESTOREDATE}.tar.gz -C /
tar -xzf kibana-${RESTOREDATE}.tar.gz -C /
tar -xzf etc-${RESTOREDATE}.tar.gz -C /
tar -xzf home-${RESTOREDATE}.tar.gz -C /

# Fix ownership
chown -R elasticsearch:elasticsearch /var/lib/elasticsearch/ /etc/elasticsearch/
chown -R logstash:logstash /etc/logstash/
chown -R kibana:kibana /etc/kibana/

# Restore services (ES must be up before logstash/kibana)
systemctl enable --now elasticsearch
sleep 15  # Wait for ES to initialize
systemctl enable --now logstash kibana fail2ban auditd firewalld

# Verify ES cluster health
curl -s http://192.168.120.4:9200/_cluster/health | python3 -m json.tool

# Verify Kibana
sleep 10
curl -s http://192.168.120.4:5601/api/status | python3 -m json.tool | grep overall

# Post-restore steps (mandatory)
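# aide --update exits 1-7 when it finds changes (expected after a restore); 14 and above are errors
aide --update; [ $? -lt 8 ] && mv /var/lib/aide/aide.db.new.gz /var/lib/aide/aide.db.gz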
echo "$(date) [RESTORE] WIND restored from ${RESTOREDATE} backup" >> /var/log/asset-inventory.log
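
The fixed sleeps in this procedure are a guess at how long Elasticsearch and Kibana take to start; on a large index they may be too short. A possible alternative is to poll until the service answers. The helper below is a generic sketch and its name `retry_until` is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: retry a command until it succeeds or the attempts run out.
# Usage: retry_until <attempts> <delay_seconds> <command> [args...]
retry_until() {
    local attempts="$1" delay="$2" i
    shift 2
    for ((i = 1; i <= attempts; i++)); do
        "$@" && return 0
        # Pause between attempts, but not after the last one.
        [ "$i" -lt "$attempts" ] && sleep "$delay"
    done
    return 1
}
```

For example, `retry_until 30 2 curl -sf -o /dev/null http://192.168.120.4:9200/_cluster/health` waits up to about a minute for Elasticsearch to respond before logstash and kibana are started.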

4.5 Restore a Specific File or Directory

If you need to recover a single file rather than a full server restore:

# Example: recover a single BIND zone file from SKY's NAS backup
SERVER=sky
RESTOREDATE=2026-03-16
RESTORE_TMP=/tmp/partial-restore   # not TMPDIR: that name is read by mktemp and other tools

mkdir -p "${RESTORE_TMP}"
tar -xzf /mnt/nas-backup/${SERVER}/${RESTOREDATE}/named-${RESTOREDATE}.tar.gz \
    -C "${RESTORE_TMP}" \
    var/named/wdc.us.gl3.zone   # extract only the specific file

# Review before restoring
cat "${RESTORE_TMP}/var/named/wdc.us.gl3.zone"

# Copy back into place
cp "${RESTORE_TMP}/var/named/wdc.us.gl3.zone" /var/named/wdc.us.gl3.zone
chown named:named /var/named/wdc.us.gl3.zone
rndc reload
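
When the exact member path is uncertain, it helps to confirm it is in the archive before extracting. The sketch below does that and extracts into a fresh scratch directory. It assumes member paths are stored without a leading slash, as in the example above; the function name `extract_member` is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: extract one member from a backup archive into a scratch directory
# and print the path of the extracted copy.
extract_member() {
    local archive="$1" member="$2" scratch
    # Fail early if the exact member path is not in the archive.
    if ! tar -tzf "$archive" | grep -Fx -- "$member" >/dev/null; then
        echo "not in archive: $member" >&2
        return 1
    fi
    scratch=$(mktemp -d) || return 1
    tar -xzf "$archive" -C "$scratch" "$member" || return 1
    echo "$scratch/$member"
}
```

A typical use is `f=$(extract_member /mnt/nas-backup/sky/2026-03-16/named-2026-03-16.tar.gz var/named/wdc.us.gl3.zone) && diff -u /var/named/wdc.us.gl3.zone "$f"`, which shows what would change before anything is copied into place.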

5. NAS Mount Reference

The NAS is mounted at /mnt/nas-backup on all four servers. If it is not mounted:

# Check fstab entry
grep nas-backup /etc/fstab

# Verify NFS export is accessible
showmount -e vmstorage.wdc.us.gl3

# Mount
mount /mnt/nas-backup
mountpoint /mnt/nas-backup && df -h /mnt/nas-backup

Expected fstab entry:

vmstorage.wdc.us.gl3:/volume1/backups  /mnt/nas-backup  nfs  defaults,_netdev,nfsvers=4,rsize=65536,wsize=65536,hard,timeo=600,retrans=3  0  0
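
The fstab check can be made stricter than a grep, which also matches commented-out lines. A minimal sketch, with the hypothetical function name `fstab_nfs_source`, prints the source of the active NFS entry for a mount point and fails when there is none.

```shell
#!/usr/bin/env bash
# Sketch: print the source of the active NFS entry for a mount point in an
# fstab file. Comment lines are ignored; exits non-zero when no entry matches.
fstab_nfs_source() {
    local fstab="$1" mount_point="$2"
    awk -v mp="$mount_point" '
        $1 !~ /^#/ && $2 == mp && $3 ~ /^nfs/ { print $1; found = 1; exit }
        END { exit found ? 0 : 1 }
    ' "$fstab"
}
```

On a correctly configured server, `fstab_nfs_source /etc/fstab /mnt/nas-backup` should print vmstorage.wdc.us.gl3:/volume1/backups.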


6. Backup Script Reference

| Item | Detail |
|---|---|
| Script path | /usr/local/bin/gpus-backup.sh |
| Version | 2.0 (2026-03-16) |
| Cron | 0 2 * * * root /usr/local/bin/gpus-backup.sh (in /etc/cron.d/) |
| Log | /var/log/gpus-backup.log |
| Asset log | /var/log/asset-inventory.log |
| NAS retention | 30 days |
| GCS retention | 90 days (lifecycle policy on gs://gpus-infra-backups-wdc) |
| GCS auth | /etc/gpus-backup-agent-key.json |

To view the script (cat only displays it; open the same path in an editor with sudo to change it):

cat /usr/local/bin/gpus-backup.sh
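
The production script is not reproduced in this runbook. As an illustration only, date-based pruning of the NAS directories could work along the lines below. This is a sketch, not the script's actual logic: the function name `expired_restore_points` is hypothetical, it only lists candidates rather than deleting them, and it relies on GNU date for the `-d` option.

```shell
#!/usr/bin/env bash
# Sketch: list dated backup directories older than a retention window.
# Not the production script; it only illustrates date-based pruning.
# Usage: expired_restore_points <server_dir> <days> [today as YYYY-MM-DD]
expired_restore_points() {
    local server_dir="$1" days="$2" today="${3:-$(date -u +%F)}" cutoff name
    cutoff=$(date -u -d "${today} ${days} days ago" +%F) || return 1
    for name in "$server_dir"/*/; do
        name=$(basename "$name")
        # YYYY-MM-DD names sort lexically in date order, so a string compare works.
        if [[ "$name" =~ ^[0-9]{4}-[0-9]{2}-[0-9]{2}$ && "$name" < "$cutoff" ]]; then
            echo "$name"
        fi
    done
}
```

Running `expired_restore_points /mnt/nas-backup/sky 30` is a quick way to see whether pruning is keeping up with the 30-day NAS retention.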


7. Portal Source Backup (SKY)

Portal source trees are archived independently of the server-config backups because they live in the gpus-infra-portals repo on SKY rather than on each server. The script /usr/local/bin/gpus-portal-backup.sh runs from SKY's root crontab and backs up the four portal source directories directly to GCS.

| Item | Detail |
|---|---|
| Host | SKY (root crontab) |
| Script path | /usr/local/bin/gpus-portal-backup.sh |
| Cron | 30 2 * * * root /usr/local/bin/gpus-portal-backup.sh (02:30 UTC daily) |
| Log | /var/log/gpus-portal-backup.log |
| Local retention | 30 days |
| GCS retention | 90 days (lifecycle policy on gs://gpus-infra-backups-wdc) |

Portals backed up

The script archives the four portal source directories and uploads each to its own dated GCS path:

| Portal | Source directory | GCS destination |
|---|---|---|
| MkDocs portal | mkdocs/ | gs://gpus-infra-backups-wdc/portals/mkdocs/<date>/ |
| Status site | status-site/ | gs://gpus-infra-backups-wdc/portals/status-site/<date>/ |
| Security site | security-site/ | gs://gpus-infra-backups-wdc/portals/security-site/<date>/ |
| SOC site | soc-site/ | gs://gpus-infra-backups-wdc/portals/soc-site/<date>/ |

Checking portal backup status

# On SKY — view the most recent run
sudo tail -30 /var/log/gpus-portal-backup.log

# Confirm today's archives landed in GCS
gcloud storage ls "gs://gpus-infra-backups-wdc/portals/*/$(date -u +%Y-%m-%d)/"
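
A wildcard listing succeeds as long as at least one portal has uploaded, so a partial failure can go unnoticed. The sketch below reads a GCS listing on stdin and names any portal with nothing under the given date. It assumes the `portals/<portal>/<date>/` layout from the table above; the function name `missing_portal_backups` is hypothetical, and archive file names are not assumed.

```shell
#!/usr/bin/env bash
# Sketch: given a GCS listing on stdin, print the portals that have no object
# under the given date (default: today in UTC).
missing_portal_backups() {
    local day="${1:-$(date -u +%F)}" listing portal
    listing=$(cat)
    for portal in mkdocs status-site security-site soc-site; do
        if ! grep -Fq "/portals/${portal}/${day}/" <<<"$listing"; then
            echo "$portal"
        fi
    done
}
```

Usage: `gcloud storage ls --recursive gs://gpus-infra-backups-wdc/portals/ | missing_portal_backups`. Empty output means all four portals have an upload for the day.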

Manual portal backup

sudo /usr/local/bin/gpus-portal-backup.sh

Ordering note: Portal backup (02:30 UTC) runs 30 minutes after the server-config backup (02:00 UTC) and 30 minutes before the daily Lynis scan (03:00 UTC). This ordering is intentional — portal backup depends on the NAS/GCS credentials loaded by the earlier backup run, and the Lynis scan reads the backup log to verify freshness.


8. Forms Portal — Backup & Restore

The Forms portal (Cloud SQL + GCS attachments) is covered by four independent backup layers. This is defense in depth: any single failure (key rotation error, accidental deletion, region outage) leaves three other recovery paths intact.

Backup layers

| # | Layer | Source of truth | Retention | Owner |
|---|---|---|---|---|
| 1 | Cloud SQL automated backups + PITR | GCP-managed | 7 days (backups), continuous (PITR) | GCP |
| 2 | Daily SQL export to GCS | MAPLE cron 03:00 UTC via /usr/local/bin/gpus-forms-db-backup.sh | 90 days | GPUS-IT (script prunes automatically) |
| 3 | Attachments bucket | gs://gpus-forms-attachments with object versioning + 7-year retention | 7 years (retention lock pending; see 90-day review Cowork task) | GPUS-IT |
| 4 | YAML form definitions | git history in gpus-infra-portals/forms/ | Indefinite (git) | GPUS-IT |

Layer 2 details — daily Cloud SQL export

  • Script: /usr/local/bin/gpus-forms-db-backup.sh on MAPLE
  • Cron: /etc/cron.d/gpus-forms-db-backup, entry 0 3 * * * root /usr/local/bin/gpus-forms-db-backup.sh
  • Log: /var/log/gpus-forms-db-backup.log
  • Destination: gs://gpus-infra-backups-wdc/forms/YYYY-MM-DD/gpus-forms-db.sql.gz
  • Auth path: MAPLE SA (maple-agent@gpus-infra.iam.gserviceaccount.com) has roles/cloudsql.admin conditional on the gpus-forms-db instance; the per-instance Cloud SQL service agent p1056766133984-umgfbk@gcp-sa-cloud-sql.iam.gserviceaccount.com has roles/storage.objectAdmin on the backup bucket (this is the SA that actually writes the export object)
  • Retention: 90-day prune at end of each run
  • Verification: exits non-zero if export < 1024 bytes
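
The verification rule in the last bullet amounts to a minimum-size guard. How the production script measures the export is not stated here; the sketch below assumes a local file and uses the hypothetical function name `check_min_size`. A check against the GCS object would need the size from a bucket listing instead.

```shell
#!/usr/bin/env bash
# Sketch: fail when a file is missing or smaller than a minimum size in bytes.
# Usage: check_min_size <file> [min_bytes, default 1024]
check_min_size() {
    local file="$1" min_bytes="${2:-1024}" size
    [ -f "$file" ] || { echo "no such file: ${file}" >&2; return 1; }
    size=$(wc -c < "$file") || return 1
    size=${size//[[:space:]]/}   # some wc implementations pad the count
    if [ "$size" -lt "$min_bytes" ]; then
        echo "export too small: ${size} bytes (minimum ${min_bytes})" >&2
        return 1
    fi
}
```

A 1024-byte floor catches empty or header-only exports; it does not prove the dump is complete, which is what the quarterly restore test is for.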

Restore procedures

Full playbooks in Forms Portal — DR Playbook:

  • DR-FP-01 — Cloud SQL restore from PITR (most common; RPO 15 min, RTO ~30 min)
  • DR-FP-02 — Attachments bucket restore (from GCS object versioning)
  • DR-FP-03 — KMS key compromise (envelope-encryption re-wrap; requires gpus-forms-dek-wrapper rotation)
  • DR-FP-04 — Region outage (failover to us-east1 snapshot + Cloud Run redeploy)
  • DR-FP-05 — Full-stack restore (from GCS export; used when PITR window exhausted)

Target: RTO 4 hours, RPO 15 minutes (PITR granularity). Attachment RPO is zero thanks to GCS versioning + retention.

Test schedule

  • Quarterly — DR-FP-01 cold-restore test; scheduled via Cowork recurring task forms-portal-backup-restore-verify-quarterly. First run: 2026-07-15.
  • Annually — DR-FP-03 + DR-FP-04 tabletop exercise with Director of Cyber Security.
  • On change — re-verify after any major infra change (DB tier, KMS rotation policy, region migration).

Test results

| Date | Playbook | Result | Notes |
|---|---|---|---|
| (none yet) | | | First scheduled test 2026-07-15 |