
Disaster Recovery Plan

Classification: CONFIDENTIAL (Internal Use Only)
Document: response-plans/drp.md · v1.1 · 2026-03-16 · GPUS-IT


1. Purpose & Scope

This Disaster Recovery Plan (DRP) defines the procedures, roles, and recovery objectives for restoring Greenpeace US IT (GPUS-IT) infrastructure following a disruptive event. It covers all four WDC on-premises servers (SKY, RAIN, SUN, WIND), GCP cloud infrastructure (gpus-infra), the Cloud VPN tunnel, and all dependent services.

Compliance alignment:

| Framework | Reference | Description |
|---|---|---|
| CIS Controls v8 | CIS 11.1 | Establish and maintain a data recovery process |
| CIS Controls v8 | CIS 11.2 | Perform automated backups |
| CIS Controls v8 | CIS 11.3 | Protect recovery data |
| CIS Controls v8 | CIS 11.4 | Establish and maintain an isolated instance of recovery data |
| CIS Controls v8 | CIS 17.7 | Conduct routine incident response exercises |
| PCI-DSS | Requirement 12.3 | Protect all system components from known security vulnerabilities |
| NIST SP 800-53 | CP-2 | Contingency Plan |
| NIST SP 800-53 | CP-9 | Information System Backup |
| NIST SP 800-53 | CP-10 | Information System Recovery and Reconstitution |

2. Recovery Objectives

| Server / Service | RTO (Snapshot Restore) | RTO (Config Restore) | RTO (Full Rebuild) | RPO |
|---|---|---|---|---|
| SKY (Primary DNS/DHCP) | < 5 min (RAIN auto-assumes) | < 30 min | < 4 hours | 24 hours |
| RAIN (Secondary DNS/DHCP) | < 30 min | < 1 hour | < 4 hours | 24 hours |
| SUN (Prometheus/Grafana) | < 30 min | < 1 hour | < 3 hours | 24 hours (config) / 0 (metrics) |
| WIND (ELK Stack) | < 30 min | < 1 hour | < 4 hours | 24 hours (config) / 0–15 min (logs) |
| GCP VPN Tunnel | < 15 min | < 1 hour | < 2 hours | N/A (stateless) |
| Cloud Run Services | < 5 min (auto-redeploy) | < 30 min | < 1 hour | N/A (container image) |

Service dependency chain:

Client DNS/DHCP ──► SKY or RAIN    (independent of SUN and WIND)
                    ↑ SKY/RAIN operate independently if SUN/WIND are down
                    ↑ SUN/WIND failure = loss of visibility only, NOT loss of DNS/DHCP
Monitoring/Logging ──► SUN → Prometheus/Grafana
                   └──► WIND → Elasticsearch/Logstash/Kibana
GCP Services ──► Cloud VPN ──► WDC network


3. Roles & Responsibilities

| Role | DR Responsibility | Contact |
|---|---|---|
| Director of Cyber Security | Plan activation authority, P1/P2 escalation, external comms | On-call phone |
| DNS/DHCP Admin | SKY and RAIN recovery, DNSSEC key management | On-call phone + SSH |
| Monitoring/Logging Admin | SUN and WIND recovery, log pipeline restoration | On-call phone + SSH |
| Security Operations | Threat analysis during incident, AIDE verification | SOC hotline |
| Network Operations | ESXi vSwitch, Meraki MX100, VPN tunnel | NOC hotline |
| Backup Admin | Snapshot management, archive restores, backup verification | On-call phone |
| GCP Admin | Cloud Run, VPN, Terraform state restoration | GCP Console + CLI |

4. Disaster Scenarios & Response Matrix

| Scenario | Severity | Primary Response | Secondary Response |
|---|---|---|---|
| Single server failure (SKY or RAIN) | P2 | Surviving peer (RAIN or SKY) auto-assumes; rebuild failed server | ESXi snapshot restore |
| Both DNS/DHCP servers down | P1 | Emergency /etc/hosts on critical clients; rebuild both | Bare-metal restore from NFS |
| SUN (monitoring) failure | P2 | DNS/DHCP unaffected; restore from snapshot | Rebuild from config backup |
| WIND (logging) failure | P2 | DNS/DHCP unaffected; rsyslog queues on SKY/RAIN | Restore from snapshot; verify log continuity |
| All four servers down | P1 | Activate full DR plan; ESXi host assessment first | Sequential rebuild: SKY → RAIN → SUN → WIND |
| ESXi host failure | P1 | Assess hardware; restore VMs from NFS backup store | Rebuild ESXi 6.7; deploy VMs; restore configs |
| GCP VPN tunnel down | P2 | Verify Meraki MX100 tunnel config; re-key if needed | Cloud VPN gateway failover |
| Cloud Run service failure | P2 | Redeploy from Artifact Registry with gcloud run deploy | Rebuild image via Cloud Build |
| GCS backup bucket loss | P1 | Restore from on-prem NAS (NFS) backup | Re-establish GCS pipeline; audit all backups |
| Full site loss (WDC) | P1 | GCP services remain operational; DNS failover planning | Activate remote rebuild from GCS archives |

5. Backup Architecture

5.1 Backup Pipeline Overview

All four servers run an identical backup pipeline, /usr/local/bin/gpus-backup.sh (v2.0), scheduled at 02:00 daily from the root crontab. Each server runs its own backup independently, so the four jobs execute in parallel. The pipeline writes to two destinations:

/usr/local/bin/gpus-backup.sh (02:00 daily, each server)
        ├─► NAS  → vmstorage.wdc.us.gl3:/volume1/backups/<server>/<YYYY-MM-DD>/
        │         Mounted at /mnt/nas-backup on each server
        │         Retention: 30 days (auto-pruned by script)
        └─► GCS  → gs://gpus-infra-backups-wdc/<server>/<YYYY-MM-DD>/
                  Retention: 90 days (GCS lifecycle policy)

Log: /var/log/gpus-backup.log on each server
Asset log: /var/log/asset-inventory.log on each server
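
The 30-day NAS retention can be illustrated with a small helper that lists which date-stamped directories fall outside the window. This is only a sketch of the idea, not the logic in gpus-backup.sh: the function name `prune_candidates` and its arguments are hypothetical, and it only prints candidates rather than deleting anything.

```shell
#!/usr/bin/env bash
# Sketch only: prune_candidates is a hypothetical helper, not part of gpus-backup.sh.
# Usage: prune_candidates <backup_root> <cutoff YYYY-MM-DD>
# Prints each YYYY-MM-DD directory under <backup_root> that is older than the cutoff.
prune_candidates() {
    root=$1
    cutoff=$(printf '%s' "$2" | tr -d '-')
    for dir in "$root"/*/; do
        [ -d "$dir" ] || continue
        name=$(basename "$dir")
        # Ignore anything that is not a date-stamped directory (for example a NAS recycle folder).
        case $name in
            [0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]) ;;
            *) continue ;;
        esac
        # With the dashes removed, YYYYMMDD compares correctly as a plain integer.
        if [ "$(printf '%s' "$name" | tr -d '-')" -lt "$cutoff" ]; then
            printf '%s\n' "$name"
        fi
    done
}
```

On a host with GNU date, a call such as `prune_candidates /mnt/nas-backup/sky "$(date -d '30 days ago' +%F)"` would list the directories a 30-day policy considers expired. Reviewing that list before any removal is the safer order of operations.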

5.2 Per-Server Backup Contents

| Server | Script | NAS Path | GCS Path | Archives |
|---|---|---|---|---|
| SKY | `/usr/local/bin/gpus-backup.sh` | `/mnt/nas-backup/sky/YYYY-MM-DD/` | `gs://gpus-infra-backups-wdc/sky/YYYY-MM-DD/` | named, dhcp, etc, aide, logs, home |
| RAIN | `/usr/local/bin/gpus-backup.sh` | `/mnt/nas-backup/rain/YYYY-MM-DD/` | `gs://gpus-infra-backups-wdc/rain/YYYY-MM-DD/` | named, dhcp, etc, aide, logs, home |
| SUN | `/usr/local/bin/gpus-backup.sh` | `/mnt/nas-backup/sun/YYYY-MM-DD/` | `gs://gpus-infra-backups-wdc/sun/YYYY-MM-DD/` | prometheus, grafana, etc, aide, logs, home |
| WIND | `/usr/local/bin/gpus-backup.sh` | `/mnt/nas-backup/wind/YYYY-MM-DD/` | `gs://gpus-infra-backups-wdc/wind/YYYY-MM-DD/` | elasticsearch, logstash, kibana, etc, aide, logs, home |

5.3 ESXi Snapshot Schedule

| Frequency | Retention | Scope |
|---|---|---|
| Daily | 7 snapshots | All four VMs |
| Weekly | 4 snapshots | All four VMs |
| Monthly | 12 snapshots | All four VMs |

5.4 GCP / Cloud Backup

| Resource | Backup Method | Location | Retention |
|---|---|---|---|
| On-prem config archives | gpus-backup.sh over VPN | `gs://gpus-infra-backups-wdc/<server>/` | 90 days |
| Terraform state | GCS versioning | `gs://gpus-infra-tf-state/` | Versioned (indefinite) |
| MkDocs portal source | gpus-portal-backup.sh (02:30 daily, Mac) | `gs://gpus-infra-tf-state/mkdocs-backup/` | 30 days |
| Container images | Artifact Registry | `us-central1-docker.pkg.dev/gpus-infra/gpus-images/` | Tagged + latest |

5.5 Backup Verification

# Check last backup completed successfully on any server
tail -5 /var/log/gpus-backup.log

# List today's GCS backup for a server
gcloud storage ls gs://gpus-infra-backups-wdc/<server>/$(date +%Y-%m-%d)/

# List all available restore points for a server
gcloud storage ls gs://gpus-infra-backups-wdc/<server>/

# Check NAS backup
ls /mnt/nas-backup/<server>/
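
Listing a directory confirms that a backup ran, but not that every archive from section 5.2 is present. The sketch below checks for missing or empty archives. It assumes every archive follows the `<component>-<YYYY-MM-DD>.tar.gz` naming seen in the restore commands (this document only shows that pattern for some components), and the function names `expected_archives` and `missing_archives` are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: helper names are hypothetical; the archive lists mirror section 5.2.

# expected_archives <server>: prints the archive components expected for that server.
expected_archives() {
    case $1 in
        sky|rain) echo "named dhcp etc aide logs home" ;;
        sun)      echo "prometheus grafana etc aide logs home" ;;
        wind)     echo "elasticsearch logstash kibana etc aide logs home" ;;
        *)        return 1 ;;
    esac
}

# missing_archives <server> <backup_dir> <YYYY-MM-DD>
# Prints the file name of every expected archive that is absent or zero bytes.
# Returns 2 if the server name is unknown.
missing_archives() {
    server=$1
    dir=$2
    day=$3
    names=$(expected_archives "$server") || return 2
    for name in $names; do
        # Assumed naming: <component>-<date>.tar.gz
        [ -s "$dir/$name-$day.tar.gz" ] || printf '%s\n' "$name-$day.tar.gz"
    done
}
```

For example, `missing_archives sky "/mnt/nas-backup/sky/$(date +%F)" "$(date +%F)"` prints nothing when today's set is complete. A size check like this does not prove an archive is restorable; `tar -tzf` on each file would be a stronger test.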

6. Server Recovery Procedures

Restore priority: ESXi snapshot (fastest) → NAS backup → GCS backup (last resort / site loss)

For detailed step-by-step restore procedures, see the Backup & Restore Runbook.

6.1 SKY — Primary DNS/DHCP Recovery

Pre-conditions: RAIN is up and auto-assuming primary DNS/DHCP. DNS/DHCP service is not interrupted.

# Option A: ESXi snapshot restore (preferred — < 5 min)
# vSphere Client → SKY VM → Snapshots → Revert to last daily snapshot

# Option B: Restore from NAS backup (archives extract directly over /, run as root)
RESTOREDATE=$(ls /mnt/nas-backup/sky/ | grep -v recycle | sort | tail -1)
tar -xzf /mnt/nas-backup/sky/${RESTOREDATE}/named-${RESTOREDATE}.tar.gz -C /
tar -xzf /mnt/nas-backup/sky/${RESTOREDATE}/dhcp-${RESTOREDATE}.tar.gz -C /
tar -xzf /mnt/nas-backup/sky/${RESTOREDATE}/etc-${RESTOREDATE}.tar.gz -C /

# Option C: Restore from GCS (full site loss / NAS unavailable)
# basename keeps only the date component of the newest gs:// listing entry
RESTOREDATE=$(basename "$(gcloud storage ls gs://gpus-infra-backups-wdc/sky/ | sort | tail -1)")
mkdir -p /tmp/restore
gcloud storage cp "gs://gpus-infra-backups-wdc/sky/${RESTOREDATE}/*.tar.gz" /tmp/restore/
tar -xzf /tmp/restore/named-${RESTOREDATE}.tar.gz -C /
tar -xzf /tmp/restore/dhcp-${RESTOREDATE}.tar.gz -C /
tar -xzf /tmp/restore/etc-${RESTOREDATE}.tar.gz -C /

# Fix ownership and permissions on restored zone data and DNSSEC keys (included in named archive)
chown -R named:named /var/named/
chmod 600 /var/named/*.key /var/named/*.private 2>/dev/null || true

# Restore services
systemctl enable --now named dhcpd fail2ban auditd firewalld

# Reload zones and verify that SKY answers queries
rndc reload
dig @192.168.120.1 sky.wdc.us.gl3 A

# Re-sign DNSSEC
rndc sign wdc.us.gl3
rndc sign 120.168.192.in-addr.arpa

# Post-recovery checklist
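# aide --update exits 1-7 when it detects changes (expected after a restore); only codes >= 8 are errors
sudo aide --update || [ "$?" -lt 8 ] && sudo mv /var/lib/aide/aide.db.new.gz /var/lib/aide/aide.db.gz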
echo "$(date) [DR] SKY recovery complete" >> /var/log/asset-inventory.log
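
Each line of a `gcloud storage ls` listing is a full `gs://` URL, so the restore date has to be taken from the last path component rather than from the whole line. The helper below is a minimal sketch of that extraction. The name `latest_restore_date` is hypothetical, and it assumes every restore point is a `YYYY-MM-DD/` prefix as described in section 5.1.

```shell
#!/usr/bin/env bash
# Sketch only: latest_restore_date is a hypothetical helper.
# Reads listing lines on stdin, for example
#   gs://gpus-infra-backups-wdc/sky/2026-03-15/
# and prints the newest YYYY-MM-DD found at the end of a line.
latest_restore_date() {
    grep -Eo '[0-9]{4}-[0-9]{2}-[0-9]{2}/?$' | tr -d '/' | sort | tail -n 1
}
```

It could be used as `RESTOREDATE=$(gcloud storage ls gs://gpus-infra-backups-wdc/sky/ | latest_restore_date)`. Because ISO dates sort lexicographically, a plain `sort` is enough to find the newest one. An empty result means no date-stamped prefix was found and the restore should stop there.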

6.2 RAIN — Secondary DNS/DHCP Recovery

# SKY continues serving DNS/DHCP — no service interruption

# Option A: ESXi snapshot restore (preferred)
# Option B: NAS restore
RESTOREDATE=$(ls /mnt/nas-backup/rain/ | grep -v recycle | sort | tail -1)
tar -xzf /mnt/nas-backup/rain/${RESTOREDATE}/named-${RESTOREDATE}.tar.gz -C /
tar -xzf /mnt/nas-backup/rain/${RESTOREDATE}/dhcp-${RESTOREDATE}.tar.gz -C /
tar -xzf /mnt/nas-backup/rain/${RESTOREDATE}/etc-${RESTOREDATE}.tar.gz -C /

# Option C: GCS restore
# basename keeps only the date component of the newest gs:// listing entry
RESTOREDATE=$(basename "$(gcloud storage ls gs://gpus-infra-backups-wdc/rain/ | sort | tail -1)")
mkdir -p /tmp/restore
gcloud storage cp "gs://gpus-infra-backups-wdc/rain/${RESTOREDATE}/*.tar.gz" /tmp/restore/
for f in /tmp/restore/*.tar.gz; do tar -xzf "$f" -C /; done

# Fix ownership, restore services, and verify zone transfer from SKY
chown -R named:named /var/named/
systemctl enable --now named dhcpd fail2ban auditd firewalld
rndc reload
dig @192.168.120.2 rain.wdc.us.gl3 A

# Verify DHCP failover peer
journalctl -u dhcpd | grep -i failover

# Post-recovery checklist
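# aide --update exits 1-7 when it detects changes (expected after a restore); only codes >= 8 are errors
sudo aide --update || [ "$?" -lt 8 ] && sudo mv /var/lib/aide/aide.db.new.gz /var/lib/aide/aide.db.gz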
echo "$(date) [DR] RAIN recovery complete" >> /var/log/asset-inventory.log

6.3 SUN — Prometheus/Grafana Recovery

# DNS/DHCP unaffected. Loss of monitoring visibility only.

# Option A: ESXi snapshot restore (preferred)
# Option B: NAS restore
RESTOREDATE=$(ls /mnt/nas-backup/sun/ | grep -v recycle | sort | tail -1)
tar -xzf /mnt/nas-backup/sun/${RESTOREDATE}/prometheus-${RESTOREDATE}.tar.gz -C /
tar -xzf /mnt/nas-backup/sun/${RESTOREDATE}/grafana-${RESTOREDATE}.tar.gz -C /
tar -xzf /mnt/nas-backup/sun/${RESTOREDATE}/etc-${RESTOREDATE}.tar.gz -C /

# Option C: GCS restore
# basename keeps only the date component of the newest gs:// listing entry
RESTOREDATE=$(basename "$(gcloud storage ls gs://gpus-infra-backups-wdc/sun/ | sort | tail -1)")
mkdir -p /tmp/restore
gcloud storage cp "gs://gpus-infra-backups-wdc/sun/${RESTOREDATE}/*.tar.gz" /tmp/restore/
for f in /tmp/restore/*.tar.gz; do tar -xzf "$f" -C /; done

# Restore services
systemctl enable --now prometheus node_exporter grafana-server fail2ban auditd firewalld

# Verify Prometheus targets
curl -s http://192.168.120.3:9090/api/v1/targets | python3 -m json.tool | grep health

# Verify Grafana
curl -s http://192.168.120.3:3000/api/health

# Post-recovery checklist
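# aide --update exits 1-7 when it detects changes (expected after a restore); only codes >= 8 are errors
sudo aide --update || [ "$?" -lt 8 ] && sudo mv /var/lib/aide/aide.db.new.gz /var/lib/aide/aide.db.gz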
echo "$(date) [DR] SUN recovery complete" >> /var/log/asset-inventory.log
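
Grepping the targets output for `health` still leaves the operator to read each line. A small filter can reduce it to a single number. This is a sketch that assumes the standard Prometheus `/api/v1/targets` response, where each active target carries a `"health"` field of `up`, `down`, or `unknown`; the function name `count_unhealthy_targets` is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: count_unhealthy_targets is a hypothetical helper.
# Reads the JSON body of /api/v1/targets on stdin and prints how many
# targets report a health value other than "up".
count_unhealthy_targets() {
    grep -Eo '"health" *: *"[a-z]+"' | grep -vc '"up"'
}
```

A possible invocation is `curl -s http://192.168.120.3:9090/api/v1/targets | count_unhealthy_targets`, where any result other than 0 means at least one scrape target needs attention. Text matching is used here to avoid extra dependencies; a JSON-aware tool would be more robust if one is installed on SUN.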

6.4 WIND — ELK Stack Recovery

# DNS/DHCP unaffected. SKY/RAIN rsyslog queues logs locally during WIND outage.

# Option A: ESXi snapshot restore (preferred)
# Option B: NAS restore
RESTOREDATE=$(ls /mnt/nas-backup/wind/ | grep -v recycle | sort | tail -1)
tar -xzf /mnt/nas-backup/wind/${RESTOREDATE}/elasticsearch-${RESTOREDATE}.tar.gz -C /
tar -xzf /mnt/nas-backup/wind/${RESTOREDATE}/logstash-${RESTOREDATE}.tar.gz -C /
tar -xzf /mnt/nas-backup/wind/${RESTOREDATE}/kibana-${RESTOREDATE}.tar.gz -C /
tar -xzf /mnt/nas-backup/wind/${RESTOREDATE}/etc-${RESTOREDATE}.tar.gz -C /

# Option C: GCS restore
# basename keeps only the date component of the newest gs:// listing entry
RESTOREDATE=$(basename "$(gcloud storage ls gs://gpus-infra-backups-wdc/wind/ | sort | tail -1)")
mkdir -p /tmp/restore
gcloud storage cp "gs://gpus-infra-backups-wdc/wind/${RESTOREDATE}/*.tar.gz" /tmp/restore/
for f in /tmp/restore/*.tar.gz; do tar -xzf "$f" -C /; done

# Fix ES ownership and restore services
chown -R elasticsearch:elasticsearch /var/lib/elasticsearch/
systemctl enable --now elasticsearch logstash kibana fail2ban auditd firewalld

# Verify ES cluster health
curl -s http://192.168.120.4:9200/_cluster/health | python3 -m json.tool

# Queued logs from SKY/RAIN need no manual flush (rsyslog replays automatically on reconnect)

# Verify Kibana
curl -s http://192.168.120.4:5601/api/status | python3 -m json.tool | grep overall

# Post-recovery checklist
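# aide --update exits 1-7 when it detects changes (expected after a restore); only codes >= 8 are errors
sudo aide --update || [ "$?" -lt 8 ] && sudo mv /var/lib/aide/aide.db.new.gz /var/lib/aide/aide.db.gz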
echo "$(date) [DR] WIND recovery complete" >> /var/log/asset-inventory.log
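
The cluster health check can be turned into a pass/fail gate before WIND is declared recovered. The sketch below reads the `_cluster/health` JSON and accepts `green` or `yellow`. Treating `yellow` as acceptable is an assumption for a single-node cluster, where replica shards cannot be assigned, and the function names `es_status` and `es_ready` are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: es_status and es_ready are hypothetical helpers.

# es_status: prints the value of the "status" field from _cluster/health JSON on stdin.
es_status() {
    sed -n 's/.*"status" *: *"\([a-z]*\)".*/\1/p' | head -n 1
}

# es_ready: succeeds when the status is green or yellow, fails on red or unparsable input.
# Assumption: yellow is acceptable because a single node cannot host replica shards.
es_ready() {
    case $(es_status) in
        green|yellow) return 0 ;;
        *) return 1 ;;
    esac
}
```

For instance, `curl -s http://192.168.120.4:9200/_cluster/health | es_ready || echo "Elasticsearch not ready"` prints a warning when the cluster is red or unreachable. If the cluster is expected to be fully green, the `yellow` branch should be removed.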

6.5 GCP VPN Tunnel Recovery

# Step 1: Check tunnel status
gcloud compute vpn-tunnels describe gpus-vpn-tunnel-wdc \
    --region=us-central1 \
    --project=gpus-infra \
    --format="value(status,detailedStatus)"

# Step 2: If tunnel is DOWN — verify Meraki MX100 IKEv2 config
# WDC peer: 38.140.146.68 | GCP gateway: 130.211.194.72
# IKEv2, AES-256, SHA-256, DH Group 14

# Step 3: Delete and recreate tunnel if needed
gcloud compute vpn-tunnels delete gpus-vpn-tunnel-wdc \
    --region=us-central1 --project=gpus-infra --quiet

# Re-apply Terraform
cd ~/terraform/gpus-infra
terraform apply -target=google_compute_vpn_tunnel.wdc_tunnel

# Step 4: Verify connectivity
ping -c 3 192.168.120.1  # From GCP VPC instance
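
Deleting the tunnel is disruptive, so it is worth confirming that it is not already established before moving on to Step 3. The check below is a sketch that parses the tab-separated output of the `describe` command in Step 1. The function name `tunnel_is_established` is hypothetical; `ESTABLISHED` is the status value Cloud VPN reports for a healthy tunnel.

```shell
#!/usr/bin/env bash
# Sketch only: tunnel_is_established is a hypothetical helper.
# Reads one line of `--format="value(status,detailedStatus)"` output on stdin
# and succeeds only when the first field is ESTABLISHED.
tunnel_is_established() {
    read -r status _rest
    [ "$status" = "ESTABLISHED" ]
}
```

Piping the Step 1 command into `tunnel_is_established` gives a simple guard: if it succeeds, the tunnel is up and the problem lies elsewhere (routing, firewall, or the Meraki side), so Step 3 should be skipped.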

6.6 Cloud Run Services Recovery

# Status site recovery
gcloud run deploy gpus-status-site \
    --image=us-central1-docker.pkg.dev/gpus-infra/gpus-images/status-site:latest \
    --region=us-central1 --project=gpus-infra

# MkDocs portal recovery
gcloud run deploy gpus-mkdocs-portal \
    --image=us-central1-docker.pkg.dev/gpus-infra/gpus-images/mkdocs:latest \
    --region=us-central1 --project=gpus-infra

# If image lost — rebuild from source
cd ~/terraform/gpus-infra/mkdocs
gcloud builds submit --config=cloudbuild.yaml .

7. Full-Site Loss Procedure (WDC)

In the event the entire WDC site is unavailable (power, physical destruction, extended ISP outage):

  1. GCP services remain operational — status.greenpeace.us and infra.greenpeace.us continue serving via Cloud Run.
  2. DNS failover — update public DNS to point internal resolution to a temporary resolver if required.
  3. Remote access — all configuration archives are in GCS (gpus-infra backup bucket). VPN can be re-established to a replacement on-prem site.
  4. Rebuild order — SKY → RAIN → SUN → WIND. DNS/DHCP takes priority.
  5. Terraform — all GCP infrastructure can be rebuilt from gpus-infra-tf-state GCS bucket: terraform init && terraform apply.
  6. Timeline — full four-server rebuild from GCS archives: estimated 6–8 hours.
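
The rebuild order in step 4 can be enforced by a small driver loop. This is a sketch under stated assumptions: `rebuild_in_order` and `REBUILD_ORDER` are hypothetical names, the per-server restore command is supplied by the operator, and halting at the first failure is a design choice made here so that DNS/DHCP problems are resolved before time is spent on monitoring and logging.

```shell
#!/usr/bin/env bash
# Sketch only: REBUILD_ORDER and rebuild_in_order are hypothetical names.
# The order follows step 4 of the full-site loss procedure.
REBUILD_ORDER="sky rain sun wind"

# rebuild_in_order <command> [args...]
# Runs the given command once per server, appending the server name as the
# last argument, and stops at the first server whose command fails.
rebuild_in_order() {
    for server in $REBUILD_ORDER; do
        "$@" "$server" || {
            printf 'rebuild halted at %s\n' "$server" >&2
            return 1
        }
    done
}
```

With a site-specific restore function, say a hypothetical `restore_from_gcs`, the call would be `rebuild_in_order restore_from_gcs`. If SUN and WIND should still be attempted after a DNS/DHCP failure, the `return 1` can be replaced with logic that records the failure and continues.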

8. DR Testing Schedule

| Test | Servers | Frequency | Owner |
|---|---|---|---|
| SKY failover to RAIN | SKY, RAIN | Quarterly | DNS Admin |
| RAIN rebuild from backup | RAIN | Quarterly | DNS/Backup Admin |
| SUN snapshot restore | SUN | Monthly | Monitoring Admin |
| WIND snapshot restore | WIND | Monthly | Monitoring/Logging Admin |
| End-to-end pipeline test | All four | Monthly | Monitoring Admin |
| Full four-server DR drill | All four | Annually | Director of Cyber Security + Full Team |
| Backup archive verification | All four | Weekly | Backup Admin |
| GCS backup integrity check | GCP | Monthly | GCP Admin |
| Cloud Run redeploy drill | GCP | Quarterly | GCP Admin |

9. Post-Recovery Validation Checklist

After any recovery action on any server:

# 1. AIDE baseline update (mandatory)
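# aide --update exits 1-7 when it detects changes (expected after a restore); only codes >= 8 are errors
sudo aide --update || [ "$?" -lt 8 ] && sudo mv /var/lib/aide/aide.db.new.gz /var/lib/aide/aide.db.gz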

# 2. Log recovery event
echo "$(date) [DR] Recovery of <SERVER> complete — <scenario>" >> /var/log/asset-inventory.log

# 3. Re-sign DNSSEC (SKY/RAIN only, if zone files restored)
rndc sign wdc.us.gl3
rndc sign 120.168.192.in-addr.arpa

# 4. Verify all services
systemctl status named dhcpd    # SKY/RAIN
systemctl status prometheus grafana-server  # SUN
systemctl status elasticsearch logstash kibana  # WIND

# 5. Verify network connectivity
ping -c 2 192.168.120.1  # SKY
ping -c 2 192.168.120.2  # RAIN
ping -c 2 192.168.120.3  # SUN
ping -c 2 192.168.120.4  # WIND

# 6. Run CIS compliance check
# (Monthly full check; post-recovery spot check on restored server)

# 7. Notify Director of Cyber Security — recovery complete, services verified
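
Step 4 of the checklist can be scripted so that only problem services are reported. The sketch below maps each server to the services listed in that step and prints any that systemd does not report as active. The function names `services_for` and `inactive_services` are hypothetical, and the security services (fail2ban, auditd, firewalld) are left out because step 4 does not list them.

```shell
#!/usr/bin/env bash
# Sketch only: services_for and inactive_services are hypothetical helpers.

# services_for <server>: prints the core services from checklist step 4.
services_for() {
    case $1 in
        sky|rain) echo "named dhcpd" ;;
        sun)      echo "prometheus grafana-server" ;;
        wind)     echo "elasticsearch logstash kibana" ;;
        *)        return 1 ;;
    esac
}

# inactive_services <server>: prints each expected unit that is not active.
# Returns 2 if the server name is unknown. Must run on the server being checked.
inactive_services() {
    units=$(services_for "$1") || return 2
    for unit in $units; do
        systemctl is-active --quiet "$unit" || printf '%s\n' "$unit"
    done
}
```

Running `inactive_services wind` on WIND prints nothing when all three ELK services are up, which makes the result easy to paste into the recovery notification in step 7.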

10. Plan Maintenance

This plan is reviewed and tested:

  • Annually — full review by Director of Cyber Security + DNS Admin + Monitoring Admin
  • After any DR event — updated within 5 business days of recovery completion
  • After any infrastructure change — reviewed for impact



See also

  • Forms Portal — DR Playbook — DR-FP-01 through DR-FP-05 (RTO 4h, RPO 15min): Cloud SQL restore, attachments bucket loss, KMS key compromise, region outage, full-stack restore