Disaster Recovery Plan
Classification: CONFIDENTIAL — Internal Use Only
Document: response-plans/drp.md · v1.1 · 2026-03-16 · GPUS-IT
1. Purpose & Scope
This Disaster Recovery Plan (DRP) defines the procedures, roles, and recovery objectives for restoring Greenpeace US IT (GPUS-IT) infrastructure following a disruptive event. It covers all four WDC on-premises servers (SKY, RAIN, SUN, WIND), GCP cloud infrastructure (gpus-infra), the Cloud VPN tunnel, and all dependent services.
Compliance alignment:
| Framework | Reference |
|---|---|
| CIS Controls v8 | CIS 11.1 — Establish and maintain a data recovery process |
| CIS Controls v8 | CIS 11.2 — Perform automated backups |
| CIS Controls v8 | CIS 11.3 — Protect recovery data |
| CIS Controls v8 | CIS 11.4 — Establish and maintain an isolated instance of recovery data |
| CIS Controls v8 | CIS 17.7 — Conduct routine DR exercises |
| PCI-DSS | Requirement 12.10.1 — Incident response plan, including business recovery and continuity procedures and data backup processes |
| NIST SP 800-53 | CP-2 — Contingency Plan |
| NIST SP 800-53 | CP-9 — Information System Backup |
| NIST SP 800-53 | CP-10 — Information System Recovery and Reconstitution |
2. Recovery Objectives
| Server / Service | RTO (Snapshot Restore) | RTO (Config Restore) | RTO (Full Rebuild) | RPO |
|---|---|---|---|---|
| SKY (Primary DNS/DHCP) | < 5 min (RAIN auto-assumes) | < 30 min | < 4 hours | 24 hours |
| RAIN (Secondary DNS/DHCP) | < 30 min | < 1 hour | < 4 hours | 24 hours |
| SUN (Prometheus/Grafana) | < 30 min | < 1 hour | < 3 hours | 24 hours (config) / 0 (metrics) |
| WIND (ELK Stack) | < 30 min | < 1 hour | < 4 hours | 24 hours (config) / 0–15 min (logs) |
| GCP VPN Tunnel | < 15 min | < 1 hour | < 2 hours | N/A (stateless) |
| Cloud Run Services | < 5 min (auto-redeploy) | < 30 min | < 1 hour | N/A (container image) |
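During a drill or a real event, the RTO columns above only help if elapsed time is actually measured against them. The following is a minimal sketch of that arithmetic; `rto_met` is a hypothetical helper name, and the timestamps are assumed to be Unix epoch seconds recorded at incident start and at service verification.

```shell
#!/usr/bin/env bash
# Sketch: compare measured recovery time with an RTO from the table.
# rto_met is a hypothetical helper, not an existing GPUS-IT script.

rto_met() {
  local start_epoch="$1" end_epoch="$2" rto_minutes="$3" elapsed_min
  # Round up so a partial minute counts against the objective
  elapsed_min=$(( (end_epoch - start_epoch + 59) / 60 ))
  echo "elapsed: ${elapsed_min} min (RTO: < ${rto_minutes} min)"
  # The table states strict limits ("< 30 min"), so use a strict comparison
  [ "$elapsed_min" -lt "$rto_minutes" ]
}
```

For example, capture `start=$(date +%s)` when the incident is declared and call `rto_met "$start" "$(date +%s)" 30` once the config restore has been verified.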
Service dependency chain:
Client DNS/DHCP ──► SKY or RAIN (independent of SUN and WIND)
                    ↑ SKY/RAIN operate independently if SUN/WIND are down
                    ↑ SUN/WIND failure = loss of visibility only, NOT loss of DNS/DHCP
Monitoring/Logging ──► SUN → Prometheus/Grafana
                  └──► WIND → Elasticsearch/Logstash/Kibana
GCP Services ──► Cloud VPN ──► WDC network
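The independence of SKY and RAIN in the chain above can be confirmed by querying each resolver directly instead of relying on the client's configured resolver. This is a sketch only: `check_resolver` is a hypothetical helper, and the IP addresses in the usage note are the ones used elsewhere in this plan.

```shell
#!/usr/bin/env bash
# Sketch: confirm that a given DNS server answers on its own.
# check_resolver is a hypothetical helper, not an existing GPUS-IT script.

check_resolver() {
  local server_ip="$1" name="$2" answer
  # +short prints only the answer records; empty output means no answer
  answer="$(dig @"$server_ip" "$name" A +short +time=2 +tries=1)"
  if [ -n "$answer" ]; then
    echo "OK   $server_ip -> $answer"
  else
    echo "FAIL $server_ip returned no answer for $name"
    return 1
  fi
}
```

Running `check_resolver 192.168.120.1 sky.wdc.us.gl3` and `check_resolver 192.168.120.2 sky.wdc.us.gl3` in turn shows whether each server resolves without the other.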
3. Roles & Responsibilities
| Role | DR Responsibility | Contact |
|---|---|---|
| Director of Cyber Security | Plan activation authority, P1/P2 escalation, external comms | On-call phone |
| DNS/DHCP Admin | SKY and RAIN recovery, DNSSEC key management | On-call phone + SSH |
| Monitoring/Logging Admin | SUN and WIND recovery, log pipeline restoration | On-call phone + SSH |
| Security Operations | Threat analysis during incident, AIDE verification | SOC hotline |
| Network Operations | ESXi vSwitch, Meraki MX100, VPN tunnel | NOC hotline |
| Backup Admin | Snapshot management, archive restores, backup verification | On-call phone |
| GCP Admin | Cloud Run, VPN, Terraform state restoration | GCP Console + CLI |
4. Disaster Scenarios & Response Matrix
| Scenario | Severity | Primary Response | Secondary Response |
|---|---|---|---|
| Single server failure (SKY or RAIN) | P2 | RAIN/SKY auto-assumes; rebuild failed server | ESXi snapshot restore |
| Both DNS/DHCP servers down | P1 | Emergency /etc/hosts on critical clients; rebuild both | Bare-metal restore from NFS |
| SUN (monitoring) failure | P2 | DNS/DHCP unaffected; restore from snapshot | Rebuild from config backup |
| WIND (logging) failure | P2 | DNS/DHCP unaffected; rsyslog queues on SKY/RAIN | Restore from snapshot; verify log continuity |
| All four servers down | P1 | Activate full DR drill; ESXi host assessment first | Sequential rebuild: SKY → RAIN → SUN → WIND |
| ESXi host failure | P1 | Assess hardware; restore VMs from NFS backup store | Rebuild ESXi 6.7; deploy VMs; restore configs |
| GCP VPN tunnel down | P2 | Verify Meraki MX100 tunnel config; re-key if needed | Cloud VPN gateway failover |
| Cloud Run service failure | P2 | Redeploy from Artifact Registry with gcloud run deploy | Rebuild image via Cloud Build |
| GCS backup bucket loss | P1 | Restore from on-prem NAS backup | Re-establish GCS pipeline; audit all backups |
| Full site loss (WDC) | P1 | GCP services remain operational; DNS failover planning | Activate remote rebuild from GCS archives |
5. Backup Architecture
5.1 Backup Pipeline Overview
All four servers run an identical backup pipeline, /usr/local/bin/gpus-backup.sh (v2.0), scheduled daily at 02:00 from the root crontab. Each server runs its own backup independently, so the four jobs execute in parallel. The pipeline writes to two destinations:
/usr/local/bin/gpus-backup.sh (02:00 daily, each server)
  │
  ├─► NAS → vmstorage.wdc.us.gl3:/volume1/backups/<server>/<YYYY-MM-DD>/
  │         Mounted at /mnt/nas-backup on each server
  │         Retention: 30 days (auto-pruned by script)
  │
  └─► GCS → gs://gpus-infra-backups-wdc/<server>/<YYYY-MM-DD>/
            Retention: 90 days (GCS lifecycle policy)
Log: /var/log/gpus-backup.log on each server
Asset log: /var/log/asset-inventory.log on each server
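The 30-day NAS retention depends on date-named directories being pruned. The internals of gpus-backup.sh are not reproduced in this plan, so the following is only a sketch of how such pruning could select candidates; `expired_backups` is a hypothetical helper, and it only lists directories without deleting anything.

```shell
#!/usr/bin/env bash
# Sketch of date-based pruning, assuming backup directories are named YYYY-MM-DD.
# expired_backups is a hypothetical helper; the real logic lives in gpus-backup.sh.

expired_backups() {
  local backup_root="$1" cutoff="$2"
  # Keep only entries that look like dates, then print those older than the cutoff.
  # ISO dates sort lexicographically, so a string comparison is sufficient.
  ls -1 "$backup_root" \
    | grep -E '^[0-9]{4}-[0-9]{2}-[0-9]{2}$' \
    | awk -v c="$cutoff" '($0 "") < (c "")'
}
```

With GNU date, `expired_backups /mnt/nas-backup/sky "$(date -d '30 days ago' +%Y-%m-%d)"` prints the directories that fall outside the retention window. Non-date entries such as a NAS recycle folder are ignored by the pattern filter.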
5.2 Per-Server Backup Contents
| Server | Script | NAS Path | GCS Path | Archives |
|---|---|---|---|---|
| SKY | /usr/local/bin/gpus-backup.sh | /mnt/nas-backup/sky/YYYY-MM-DD/ | gs://gpus-infra-backups-wdc/sky/YYYY-MM-DD/ | named, dhcp, etc, aide, logs, home |
| RAIN | /usr/local/bin/gpus-backup.sh | /mnt/nas-backup/rain/YYYY-MM-DD/ | gs://gpus-infra-backups-wdc/rain/YYYY-MM-DD/ | named, dhcp, etc, aide, logs, home |
| SUN | /usr/local/bin/gpus-backup.sh | /mnt/nas-backup/sun/YYYY-MM-DD/ | gs://gpus-infra-backups-wdc/sun/YYYY-MM-DD/ | prometheus, grafana, etc, aide, logs, home |
| WIND | /usr/local/bin/gpus-backup.sh | /mnt/nas-backup/wind/YYYY-MM-DD/ | gs://gpus-infra-backups-wdc/wind/YYYY-MM-DD/ | elasticsearch, logstash, kibana, etc, aide, logs, home |
5.3 ESXi Snapshot Schedule
| Frequency | Retention | Scope |
|---|---|---|
| Daily | 7 snapshots | All four VMs |
| Weekly | 4 snapshots | All four VMs |
| Monthly | 12 snapshots | All four VMs |
5.4 GCP / Cloud Backup
| Resource | Backup Method | Location | Retention |
|---|---|---|---|
| On-prem config archives | gpus-backup.sh over VPN | gs://gpus-infra-backups-wdc/<server>/ | 90 days |
| Terraform state | GCS versioning | gs://gpus-infra-tf-state/ | Versioned (indefinite) |
| MkDocs portal source | gpus-portal-backup.sh (02:30 daily, Mac) | gs://gpus-infra-tf-state/mkdocs-backup/ | 30 days |
| Container images | Artifact Registry | us-central1-docker.pkg.dev/gpus-infra/gpus-images/ | Tagged + latest |
5.5 Backup Verification
# Check last backup completed successfully on any server
tail -5 /var/log/gpus-backup.log
# List today's GCS backup for a server
gcloud storage ls gs://gpus-infra-backups-wdc/<server>/$(date +%Y-%m-%d)/
# List all available restore points for a server
gcloud storage ls gs://gpus-infra-backups-wdc/<server>/
# Check NAS backup
ls /mnt/nas-backup/<server>/
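Listing a backup directory shows that files exist, but not that they can be read back. The sketch below goes one step further by test-reading each archive. `verify_archives` is a hypothetical helper; the `<name>-<date>.tar.gz` file pattern is taken from the restore commands in this plan.

```shell
#!/usr/bin/env bash
# Sketch: confirm that each expected archive exists and is a readable gzip tarball.
# verify_archives is a hypothetical helper, not part of gpus-backup.sh.

verify_archives() {
  local dir="$1" date="$2" name file rc=0
  shift 2
  for name in "$@"; do
    file="$dir/$name-$date.tar.gz"
    if [ ! -s "$file" ]; then
      echo "MISSING $file"; rc=1
    elif ! tar -tzf "$file" >/dev/null 2>&1; then
      # tar -t reads the whole archive without extracting anything
      echo "CORRUPT $file"; rc=1
    else
      echo "OK      $file"
    fi
  done
  return "$rc"
}
```

On SKY, for instance, `verify_archives /mnt/nas-backup/sky/2026-03-15 2026-03-15 named dhcp etc aide logs home` would report one line per archive and exit non-zero if any archive is missing or unreadable.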
6. Server Recovery Procedures
Restore priority: ESXi snapshot (fastest) → NAS backup → GCS backup (last resort / site loss).
For detailed step-by-step restore procedures, see the Backup & Restore Runbook.
6.1 SKY — Primary DNS/DHCP Recovery
Pre-conditions: RAIN is up and auto-assuming primary DNS/DHCP. DNS/DHCP service is not interrupted.
# Option A: ESXi snapshot restore (preferred — < 5 min)
# vSphere Client → SKY VM → Snapshots → Revert to last daily snapshot
# Option B: Restore from NAS backup
RESTOREDATE=$(ls /mnt/nas-backup/sky/ | grep -v recycle | sort | tail -1)
mkdir -p /restore/sky
tar -xzf /mnt/nas-backup/sky/${RESTOREDATE}/named-${RESTOREDATE}.tar.gz -C /
tar -xzf /mnt/nas-backup/sky/${RESTOREDATE}/dhcp-${RESTOREDATE}.tar.gz -C /
tar -xzf /mnt/nas-backup/sky/${RESTOREDATE}/etc-${RESTOREDATE}.tar.gz -C /
# Option C: Restore from GCS (full site loss / NAS unavailable)
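mkdir -p /tmp/restore
# gcloud storage ls prints full gs:// URLs, so take only the last path component
RESTOREDATE=$(basename "$(gcloud storage ls gs://gpus-infra-backups-wdc/sky/ | sort | tail -1)")
gcloud storage cp "gs://gpus-infra-backups-wdc/sky/${RESTOREDATE}/*.tar.gz" /tmp/restore/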
tar -xzf /tmp/restore/named-${RESTOREDATE}.tar.gz -C /
tar -xzf /tmp/restore/dhcp-${RESTOREDATE}.tar.gz -C /
tar -xzf /tmp/restore/etc-${RESTOREDATE}.tar.gz -C /
# Restore services
systemctl enable --now named dhcpd fail2ban auditd firewalld
# Restore DNSSEC keys (included in named archive)
chown -R named:named /var/named/
chmod 600 /var/named/*.key /var/named/*.private 2>/dev/null || true
# Verify zone transfer from RAIN
rndc reload
dig @192.168.120.1 sky.wdc.us.gl3 A
# Re-sign DNSSEC
rndc sign wdc.us.gl3
rndc sign 120.168.192.in-addr.arpa
# Post-recovery checklist
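# aide --update exits 1-7 when it only reports added/removed/changed files, so do not chain with &&
sudo aide --update; [ $? -lt 8 ] && sudo mv /var/lib/aide/aide.db.new.gz /var/lib/aide/aide.db.gz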
echo "$(date) [DR] SKY recovery complete" >> /var/log/asset-inventory.log
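When the restore point is chosen from GCS, note that `gcloud storage ls` prints full object URLs such as `gs://gpus-infra-backups-wdc/sky/2026-03-15/`, so the date has to be taken from the last path component. A minimal sketch of that extraction follows; `latest_restore_date` is a hypothetical helper that reads the listing on standard input.

```shell
#!/usr/bin/env bash
# Sketch: derive the newest YYYY-MM-DD restore point from a `gcloud storage ls` listing.
# latest_restore_date is a hypothetical helper, not an existing GPUS-IT script.

latest_restore_date() {
  # Keep only a trailing date component, drop its trailing slash,
  # and rely on ISO dates sorting chronologically.
  grep -Eo '[0-9]{4}-[0-9]{2}-[0-9]{2}/?$' \
    | tr -d '/' \
    | sort \
    | tail -n 1
}
```

It could be used as `RESTOREDATE="$(gcloud storage ls gs://gpus-infra-backups-wdc/sky/ | latest_restore_date)"`. Lines that do not end in a date are ignored, which also guards against stray objects at the server prefix.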
6.2 RAIN — Secondary DNS/DHCP Recovery
# SKY continues serving DNS/DHCP — no service interruption
# Option A: ESXi snapshot restore (preferred)
# Option B: NAS restore
RESTOREDATE=$(ls /mnt/nas-backup/rain/ | grep -v recycle | sort | tail -1)
tar -xzf /mnt/nas-backup/rain/${RESTOREDATE}/named-${RESTOREDATE}.tar.gz -C /
tar -xzf /mnt/nas-backup/rain/${RESTOREDATE}/dhcp-${RESTOREDATE}.tar.gz -C /
tar -xzf /mnt/nas-backup/rain/${RESTOREDATE}/etc-${RESTOREDATE}.tar.gz -C /
# Option C: GCS restore
mkdir -p /tmp/restore
# gcloud storage ls prints full gs:// URLs, so take only the last path component
RESTOREDATE=$(basename "$(gcloud storage ls gs://gpus-infra-backups-wdc/rain/ | sort | tail -1)")
gcloud storage cp "gs://gpus-infra-backups-wdc/rain/${RESTOREDATE}/*.tar.gz" /tmp/restore/
for f in /tmp/restore/*.tar.gz; do tar -xzf "$f" -C /; done
# Restore services and verify zone transfer from SKY
systemctl enable --now named dhcpd fail2ban auditd firewalld
chown -R named:named /var/named/
rndc reload
dig @192.168.120.2 rain.wdc.us.gl3 A
# Verify DHCP failover peer
journalctl -u dhcpd | grep -i failover
# Post-recovery checklist
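# aide --update exits 1-7 when it only reports added/removed/changed files, so do not chain with &&
sudo aide --update; [ $? -lt 8 ] && sudo mv /var/lib/aide/aide.db.new.gz /var/lib/aide/aide.db.gz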
echo "$(date) [DR] RAIN recovery complete" >> /var/log/asset-inventory.log
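The AIDE baseline step deserves care after a restore: per the AIDE manual, `aide --update` returns a non-zero bitmask (1 = files added, 2 = removed, 4 = changed) whenever it detects differences, which is the expected case after recovery. The sketch below separates those informational codes from real errors; `aide_update_succeeded` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Sketch: interpret the exit status of `aide --update`.
# aide_update_succeeded is a hypothetical helper, not part of AIDE itself.

aide_update_succeeded() {
  local rc="$1"
  # 0-7 is a bitmask of findings (1 = added, 2 = removed, 4 = changed files);
  # 14 and above indicate a real error such as a bad config or an I/O failure.
  [ "$rc" -ge 0 ] && [ "$rc" -le 7 ]
}
```

A typical call sequence would be `sudo aide --update; rc=$?` followed by `aide_update_succeeded "$rc" && sudo mv /var/lib/aide/aide.db.new.gz /var/lib/aide/aide.db.gz`, so the new database is promoted even when changes were reported.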
6.3 SUN — Prometheus/Grafana Recovery
# DNS/DHCP unaffected. Loss of monitoring visibility only.
# Option A: ESXi snapshot restore (preferred)
# Option B: NAS restore
RESTOREDATE=$(ls /mnt/nas-backup/sun/ | grep -v recycle | sort | tail -1)
tar -xzf /mnt/nas-backup/sun/${RESTOREDATE}/prometheus-${RESTOREDATE}.tar.gz -C /
tar -xzf /mnt/nas-backup/sun/${RESTOREDATE}/grafana-${RESTOREDATE}.tar.gz -C /
tar -xzf /mnt/nas-backup/sun/${RESTOREDATE}/etc-${RESTOREDATE}.tar.gz -C /
# Option C: GCS restore
mkdir -p /tmp/restore
# gcloud storage ls prints full gs:// URLs, so take only the last path component
RESTOREDATE=$(basename "$(gcloud storage ls gs://gpus-infra-backups-wdc/sun/ | sort | tail -1)")
gcloud storage cp "gs://gpus-infra-backups-wdc/sun/${RESTOREDATE}/*.tar.gz" /tmp/restore/
for f in /tmp/restore/*.tar.gz; do tar -xzf "$f" -C /; done
# Restore services
systemctl enable --now prometheus node_exporter grafana-server fail2ban auditd firewalld
# Verify Prometheus targets
curl -s http://192.168.120.3:9090/api/v1/targets | python3 -m json.tool | grep health
# Verify Grafana
curl -s http://192.168.120.3:3000/api/health
# Post-recovery checklist
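# aide --update exits 1-7 when it only reports added/removed/changed files, so do not chain with &&
sudo aide --update; [ $? -lt 8 ] && sudo mv /var/lib/aide/aide.db.new.gz /var/lib/aide/aide.db.gz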
echo "$(date) [DR] SUN recovery complete" >> /var/log/asset-inventory.log
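Grepping the targets response for `health` shows every target, healthy or not. A narrower check, sketched here, prints only the targets that are not `up`. `unhealthy_targets` is a hypothetical helper; it assumes the standard Prometheus `/api/v1/targets` response shape with a `data.activeTargets` list, and it uses python3 because this runbook already relies on it.

```shell
#!/usr/bin/env bash
# Sketch: list scrape targets that are not "up" from /api/v1/targets JSON on stdin.
# unhealthy_targets is a hypothetical helper, not an existing GPUS-IT script.

unhealthy_targets() {
  python3 -c '
import json, sys

data = json.load(sys.stdin)
for target in data["data"]["activeTargets"]:
    if target.get("health") != "up":
        print(target.get("scrapeUrl", "unknown"), target.get("health", "unknown"))
'
}
```

For example, `curl -s http://192.168.120.3:9090/api/v1/targets | unhealthy_targets` prints nothing when every target is healthy, which makes it easy to use in a post-recovery check.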
6.4 WIND — ELK Stack Recovery
# DNS/DHCP unaffected. SKY/RAIN rsyslog queues logs locally during WIND outage.
# Option A: ESXi snapshot restore (preferred)
# Option B: NAS restore
RESTOREDATE=$(ls /mnt/nas-backup/wind/ | grep -v recycle | sort | tail -1)
tar -xzf /mnt/nas-backup/wind/${RESTOREDATE}/elasticsearch-${RESTOREDATE}.tar.gz -C /
tar -xzf /mnt/nas-backup/wind/${RESTOREDATE}/logstash-${RESTOREDATE}.tar.gz -C /
tar -xzf /mnt/nas-backup/wind/${RESTOREDATE}/kibana-${RESTOREDATE}.tar.gz -C /
tar -xzf /mnt/nas-backup/wind/${RESTOREDATE}/etc-${RESTOREDATE}.tar.gz -C /
# Option C: GCS restore
mkdir -p /tmp/restore
# gcloud storage ls prints full gs:// URLs, so take only the last path component
RESTOREDATE=$(basename "$(gcloud storage ls gs://gpus-infra-backups-wdc/wind/ | sort | tail -1)")
gcloud storage cp "gs://gpus-infra-backups-wdc/wind/${RESTOREDATE}/*.tar.gz" /tmp/restore/
for f in /tmp/restore/*.tar.gz; do tar -xzf "$f" -C /; done
# Fix ES ownership and restore services
chown -R elasticsearch:elasticsearch /var/lib/elasticsearch/
systemctl enable --now elasticsearch logstash kibana fail2ban auditd firewalld
# Verify ES cluster health
curl -s http://192.168.120.4:9200/_cluster/health | python3 -m json.tool
# Flush queued logs from SKY/RAIN (rsyslog replays automatically on reconnect)
# Verify Kibana
curl -s http://192.168.120.4:5601/api/status | python3 -m json.tool | grep overall
# Post-recovery checklist
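# aide --update exits 1-7 when it only reports added/removed/changed files, so do not chain with &&
sudo aide --update; [ $? -lt 8 ] && sudo mv /var/lib/aide/aide.db.new.gz /var/lib/aide/aide.db.gz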
echo "$(date) [DR] WIND recovery complete" >> /var/log/asset-inventory.log
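Elasticsearch usually needs some time after `systemctl enable --now` before the health endpoint answers, so a single curl can fail on a node that is merely still starting. The sketch below polls until the cluster reports green or yellow. `wait_for_es` is a hypothetical helper, the attempt count and delay are illustrative defaults, and accepting yellow assumes a single-node cluster, where unassigned replica shards commonly keep the status at yellow.

```shell
#!/usr/bin/env bash
# Sketch: poll Elasticsearch until the cluster reports green or yellow.
# wait_for_es is a hypothetical helper; attempts and delay are illustrative defaults.

wait_for_es() {
  local url="$1" attempts="${2:-30}" delay="${3:-10}" status="" i
  for i in $(seq 1 "$attempts"); do
    # Extract the "status" field; it stays empty while the node is not answering
    status="$(curl -s "$url/_cluster/health" \
      | python3 -c 'import json,sys; print(json.load(sys.stdin)["status"])' 2>/dev/null || true)"
    case "$status" in
      green|yellow) echo "cluster status: $status"; return 0 ;;
    esac
    sleep "$delay"
  done
  echo "cluster not healthy after $attempts attempts (last status: ${status:-none})"
  return 1
}
```

On WIND this could be called as `wait_for_es http://192.168.120.4:9200` before checking Kibana, since Kibana depends on Elasticsearch being reachable.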
6.5 GCP VPN Tunnel Recovery
# Step 1: Check tunnel status
gcloud compute vpn-tunnels describe gpus-vpn-tunnel-wdc \
--region=us-central1 \
--project=gpus-infra \
--format="value(status,detailedStatus)"
# Step 2: If tunnel is DOWN — verify Meraki MX100 IKEv2 config
# WDC peer: 38.140.146.68 | GCP gateway: 130.211.194.72
# IKEv2, AES-256, SHA-256, DH Group 14
# Step 3: Delete and recreate tunnel if needed
gcloud compute vpn-tunnels delete gpus-vpn-tunnel-wdc \
--region=us-central1 --project=gpus-infra --quiet
# Re-apply Terraform
cd ~/terraform/gpus-infra
terraform apply -target=google_compute_vpn_tunnel.wdc_tunnel
# Step 4: Verify connectivity
ping -c 3 192.168.120.1 # From GCP VPC instance
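After the tunnel is recreated, IKE negotiation takes a moment, so the status is worth polling before running the ping test. The following sketch waits for Cloud VPN to report `ESTABLISHED`. `wait_for_tunnel` is a hypothetical helper, and the attempt count and delay are illustrative; the region and project values are the ones used in the steps above.

```shell
#!/usr/bin/env bash
# Sketch: poll the tunnel until Cloud VPN reports ESTABLISHED.
# wait_for_tunnel is a hypothetical helper; attempts and delay are illustrative.

wait_for_tunnel() {
  local tunnel="$1" attempts="${2:-20}" delay="${3:-15}" status i
  for i in $(seq 1 "$attempts"); do
    status="$(gcloud compute vpn-tunnels describe "$tunnel" \
      --region=us-central1 --project=gpus-infra \
      --format='value(status)' 2>/dev/null || true)"
    if [ "$status" = "ESTABLISHED" ]; then
      echo "tunnel $tunnel is ESTABLISHED"
      return 0
    fi
    echo "attempt $i/$attempts: status=${status:-unknown}"
    sleep "$delay"
  done
  return 1
}
```

A call such as `wait_for_tunnel gpus-vpn-tunnel-wdc` returns non-zero if the tunnel never establishes, which is the point at which the Meraki MX100 side should be rechecked.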
6.6 Cloud Run Services Recovery
# Status site recovery
gcloud run deploy gpus-status-site \
--image=us-central1-docker.pkg.dev/gpus-infra/gpus-images/status-site:latest \
--region=us-central1 --project=gpus-infra
# MkDocs portal recovery
gcloud run deploy gpus-mkdocs-portal \
--image=us-central1-docker.pkg.dev/gpus-infra/gpus-images/mkdocs:latest \
--region=us-central1 --project=gpus-infra
# If image lost — rebuild from source
cd ~/terraform/gpus-infra/mkdocs
gcloud builds submit --config=cloudbuild.yaml .
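Whether to redeploy or rebuild depends on whether the image still exists in Artifact Registry. A minimal sketch of that decision follows; `recovery_action` is a hypothetical helper that relies on `gcloud artifacts docker images describe` exiting non-zero when the image cannot be found.

```shell
#!/usr/bin/env bash
# Sketch: choose between redeploying an existing image and rebuilding it.
# recovery_action is a hypothetical helper, not an existing GPUS-IT script.

recovery_action() {
  local image="$1"
  # describe exits non-zero when the image or tag cannot be found
  if gcloud artifacts docker images describe "$image" >/dev/null 2>&1; then
    echo "redeploy"
  else
    echo "rebuild"
  fi
}
```

For example, `recovery_action us-central1-docker.pkg.dev/gpus-infra/gpus-images/mkdocs:latest` prints `redeploy` when the tag resolves. The same command also fails on authentication or network errors, so confirm `gcloud auth list` shows an active account before trusting a `rebuild` result.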
7. Full-Site Loss Procedure (WDC)
In the event the entire WDC site is unavailable (power, physical destruction, extended ISP outage):
- GCP services remain operational — status.greenpeace.us and infra.greenpeace.us continue serving via Cloud Run.
- DNS failover — update public DNS to point internal resolution to a temporary resolver if required.
- Remote access — all configuration archives are in GCS (gpus-infra backup bucket). VPN can be re-established to a replacement on-prem site.
- Rebuild order — SKY → RAIN → SUN → WIND. DNS/DHCP takes priority.
- Terraform — all GCP infrastructure can be rebuilt from the gpus-infra-tf-state GCS bucket: terraform init && terraform apply.
- Timeline — full four-server rebuild from GCS archives: estimated 6–8 hours.
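The rebuild order can be enforced by a small driver that stops at the first failure, so that monitoring and logging are never restored on top of broken DNS/DHCP. This is a sketch under assumptions: `rebuild_in_order` is a hypothetical helper, and the per-server restore command it calls is supplied by the operator.

```shell
#!/usr/bin/env bash
# Sketch: run a per-server restore step in the documented order, stopping on failure.
# rebuild_in_order and the restore command passed to it are hypothetical.

rebuild_in_order() {
  local restore_cmd="$1" server
  for server in sky rain sun wind; do
    echo "restoring $server"
    if ! "$restore_cmd" "$server"; then
      echo "restore of $server failed; stopping before later servers"
      return 1
    fi
  done
  echo "all servers restored"
}
```

If the operator wraps the GCS restore steps for one server in a function named, say, `restore_from_gcs`, then `rebuild_in_order restore_from_gcs` runs it for SKY, RAIN, SUN, and WIND in that sequence.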
8. DR Testing Schedule
| Test | Servers | Frequency | Owner |
|---|---|---|---|
| SKY failover to RAIN | SKY, RAIN | Quarterly | DNS Admin |
| RAIN rebuild from backup | RAIN | Quarterly | DNS/Backup Admin |
| SUN snapshot restore | SUN | Monthly | Monitoring Admin |
| WIND snapshot restore | WIND | Monthly | Monitoring/Logging Admin |
| End-to-end pipeline test | All four | Monthly | Monitoring Admin |
| Full four-server DR drill | All four | Annually | Director of Cyber Security + Full Team |
| Backup archive verification | All four | Weekly | Backup Admin |
| GCS backup integrity check | GCP | Monthly | GCP Admin |
| Cloud Run redeploy drill | GCP | Quarterly | GCP Admin |
9. Post-Recovery Validation Checklist
After any recovery action on any server:
# 1. AIDE baseline update (mandatory)
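# aide --update exits 1-7 when it only reports added/removed/changed files, so do not chain with &&
sudo aide --update; [ $? -lt 8 ] && sudo mv /var/lib/aide/aide.db.new.gz /var/lib/aide/aide.db.gz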
# 2. Log recovery event
echo "$(date) [DR] Recovery of <SERVER> complete — <scenario>" >> /var/log/asset-inventory.log
# 3. Re-sign DNSSEC (SKY/RAIN only, if zone files restored)
rndc sign wdc.us.gl3
rndc sign 120.168.192.in-addr.arpa
# 4. Verify all services
systemctl status named dhcpd # SKY/RAIN
systemctl status prometheus grafana-server # SUN
systemctl status elasticsearch logstash kibana # WIND
# 5. Verify network connectivity
ping -c 2 192.168.120.1 # SKY
ping -c 2 192.168.120.2 # RAIN
ping -c 2 192.168.120.3 # SUN
ping -c 2 192.168.120.4 # WIND
# 6. Run CIS compliance check
# (Monthly full check; post-recovery spot check on restored server)
# 7. Notify Director of Cyber Security — recovery complete, services verified
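Step 4 of the checklist lists different services per server, and `systemctl status` output has to be read by eye. The sketch below maps each server to its services and uses `systemctl is-active` to produce a pass/fail result. `services_for` and `check_services` are hypothetical helpers; the service names are the ones listed in step 4.

```shell
#!/usr/bin/env bash
# Sketch: check the services listed in step 4 for one server role.
# services_for and check_services are hypothetical helpers.

services_for() {
  case "$1" in
    sky|rain) echo "named dhcpd" ;;
    sun)      echo "prometheus grafana-server" ;;
    wind)     echo "elasticsearch logstash kibana" ;;
    *)        return 1 ;;
  esac
}

check_services() {
  local server="$1" services svc rc=0
  services="$(services_for "$server")" || { echo "unknown server: $server"; return 2; }
  for svc in $services; do
    # is-active exits non-zero unless the unit is currently running
    if systemctl is-active --quiet "$svc"; then
      echo "OK   $svc"
    else
      echo "DOWN $svc"; rc=1
    fi
  done
  return "$rc"
}
```

Running `check_services wind` on WIND prints one line per service and exits non-zero if any of them is down, which makes the result easy to record in /var/log/asset-inventory.log.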
10. Plan Maintenance
This plan is reviewed and tested:
- Annually — full review by Director of Cyber Security + DNS Admin + Monitoring Admin
- After any DR event — updated within 5 business days of recovery completion
- After any infrastructure change — reviewed for impact
Document version: v1.1 · 2026-03-16 · GPUS-IT · Classification: CONFIDENTIAL — Internal Use Only
See also
- Forms Portal — DR Playbook — DR-FP-01 through DR-FP-05 (RTO 4h, RPO 15min): Cloud SQL restore, attachments bucket loss, KMS key compromise, region outage, full-stack restore