Disaster Recovery — WDC On-Prem

Classification: CONFIDENTIAL (Internal Use Only)

Document: response-plans/wdc-on-prem-dr.md · v1.0 · 2026-05-12 · GPUS-IT


Scope

This document covers disaster recovery (DR) for the WDC on-prem cluster and the tenant VMs hosted on it. The cluster consists of the host water.wdc.us.gl3 today, with fire.wdc.us.gl3 and flower.wdc.us.gl3 planned. The tenant VMs are ocean.wdc.us.gl3 today, with sky, rain, wind, and sun to follow on fire. This document is a section of the broader Greenpeace USA Disaster Recovery Plan.

1. Objectives

| Asset Tier | RTO | RPO |
|---|---|---|
| Tier 1 (e.g. ocean/KACE) | 4 h | 4 h |
| Tier 2 | 8 h | 12 h |
| Tier 3 | 24 h | 24 h |
| WDC site (full loss) | 24 h (Tier 1 only, in GCP cold standby) | 24 h |
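
As an illustration of how these targets might be checked after an incident or a drill, here is a minimal Python sketch. The hour values are copied from the table above. The dictionary layout, the key names (`tier1`, `site_loss`, and so on), and the function `meets_objectives` are assumptions made for this example and are not part of any existing tooling.

```python
from datetime import timedelta

# Targets copied from the objectives table. Key names are illustrative only.
OBJECTIVES = {
    "tier1": {"rto": timedelta(hours=4), "rpo": timedelta(hours=4)},
    "tier2": {"rto": timedelta(hours=8), "rpo": timedelta(hours=12)},
    "tier3": {"rto": timedelta(hours=24), "rpo": timedelta(hours=24)},
    # Full site loss: Tier 1 workloads only, restored in the GCP cold standby.
    "site_loss": {"rto": timedelta(hours=24), "rpo": timedelta(hours=24)},
}


def meets_objectives(tier, downtime, data_loss):
    """Return (rto_met, rpo_met) for an observed outage.

    downtime  -- time from failure until service was restored
    data_loss -- age of the restore point at the moment of failure
    """
    target = OBJECTIVES[tier]
    return downtime <= target["rto"], data_loss <= target["rpo"]
```

For example, a Tier 1 outage restored in 3 h from a 5-hour-old restore point gives `meets_objectives("tier1", timedelta(hours=3), timedelta(hours=5)) == (True, False)`: the RTO is met, the RPO is missed.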

2. Disaster Classifications

| Tier | Definition | Examples | Lead Decision Maker |
|---|---|---|---|
| 1 - Catastrophic | WDC office unreachable / destroyed | Fire, flood, prolonged power loss > 24 h | Director, Cyber Security & Cloud Architecture |
| 2 - Major | Cluster offline, hardware salvageable | Both Meraki + core switch failure, NAS failure | IT Operations Lead |
| 3 - Moderate | Single host failure, VMs need migration | water hardware failure | On-call engineer |
| 4 - Minor | Single VM failure, no host impact | ocean corruption | On-call engineer |

3. Recovery Strategy

3.1 Tier 4 (single VM)

  1. Open KACE ticket.
  2. Restore from last good Veeam restore point (RPO ≤ 4 h for Tier 1).
  3. Validate per Restore Testing checklist.
  4. Sign off and close ticket.
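
Step 2 depends on choosing the "last good" restore point and confirming it falls within the RPO. The sketch below shows one way to express that choice in Python. It does not call Veeam: the input is assumed to be a list of `(created_at, verified_ok)` tuples exported by hand or by a script, and the function name `pick_restore_point` is hypothetical.

```python
from datetime import datetime, timedelta


def pick_restore_point(points, incident_time, rpo):
    """Choose the newest verified restore point taken before the incident.

    points        -- iterable of (created_at, verified_ok) tuples (assumed format)
    incident_time -- when the VM failed or became corrupt
    rpo           -- allowed data-loss window, e.g. timedelta(hours=4) for Tier 1

    Returns (chosen, within_rpo). chosen is None if no usable point exists.
    """
    candidates = [ts for ts, ok in points if ok and ts <= incident_time]
    if not candidates:
        return None, False
    chosen = max(candidates)
    return chosen, (incident_time - chosen) <= rpo
```

With an incident at 12:00 and verified points at 06:00 and 09:00 plus an unverified point at 11:00, the function returns the 09:00 point, which is within a 4 h RPO. A result with `within_rpo` set to `False` is worth recording in the ticket as an RPO miss.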

3.2 Tier 3 (single host)

  1. Once the cluster has 2+ nodes: power on a standby spare, or fail over to a vMotion-eligible peer host.
  2. Until fire and flower exist: rebuild water per the Hypervisor Rebuild Runbook, then restore ocean from the latest backup.
  3. Communicate to staff: ETA, scope, expected impact.

3.3 Tier 2 (cluster offline, hardware OK)

  1. Isolate suspected cause (switching, storage, hypervisor).
  2. Engage vendor support (VMware, NAS vendor, Meraki) in parallel.
  3. If the estimated recovery time exceeds the RTO, declare a Tier 1 disaster and proceed to §3.4.
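
The escalation rule in step 3 is a simple comparison, sketched below in Python. It assumes the comparison is made on projected total outage (time already elapsed plus the current estimate of remaining work) and that the caller supplies the RTO of the most critical affected asset tier. Both are interpretations for this example, and the function name is hypothetical.

```python
from datetime import timedelta


def should_escalate_to_tier1(elapsed, estimated_remaining, rto):
    """Return True when the projected total outage exceeds the RTO.

    elapsed             -- time since the cluster went offline
    estimated_remaining -- current best estimate of time left to recover
    rto                 -- recovery time objective being defended (assumed input)
    """
    return elapsed + estimated_remaining > rto
```

For a 4 h RTO, one hour elapsed with two hours of work remaining does not trigger escalation, while two hours elapsed with three hours remaining does. Re-evaluating each time a vendor revises its estimate keeps the decision current.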

3.4 Tier 1 (site loss → failover to GCP cold standby)

The WDC cluster has a cold-standby footprint in GCP sufficient to bring Tier 1 workloads online from offsite GCS backups.

Failover steps:

  1. Declare disaster — Director (or delegate) authorizes failover.
  2. Activate the DR project in GCP (gpus-wdc-dr, region us-east1, kept dormant).
  3. Stand up Veeam restore proxy in the DR project via Terraform (gpus-wdc-dr/terraform).
  4. Restore Tier 1 VMs (ocean first) from gs://gpus-wdc-backups using Veeam Direct Restore to GCP.
  5. Re-point DNS for tenant services to the DR IPs (TTL is set to 300 s on Tier 1 records to support fast failover).
  6. Verify functionality: SSO, KACE inventory, agent check-in, SOC telemetry.
  7. Communicate restoration to staff via status page + email.

Failback follows the same steps in reverse. Perform it during a planned maintenance window, once WDC is rebuilt and Veeam shows successful test restores at the home site.
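
After the DNS re-point in step 5, it can help to confirm that tenant hostnames resolve to the DR addresses before starting functional checks. The Python sketch below uses only the standard library. The function names are hypothetical, and the DR IP in the usage note is a documentation-range placeholder, not a real address. Because Tier 1 records carry a 300 s TTL, cached answers may persist for up to five minutes, so a failed check shortly after the change is not necessarily a fault.

```python
import socket


def resolve_ipv4(hostname):
    """Return the set of IPv4 addresses the local resolver gives for hostname."""
    infos = socket.getaddrinfo(
        hostname, None, family=socket.AF_INET, type=socket.SOCK_STREAM
    )
    return {info[4][0] for info in infos}


def dns_cutover_complete(expected, resolver=resolve_ipv4):
    """Check each hostname against its expected DR address.

    expected -- dict mapping hostname to the DR IP it should resolve to
    resolver -- callable returning a set of IPs (injectable for testing)

    Returns a dict mapping hostname to True/False.
    """
    status = {}
    for host, dr_ip in expected.items():
        try:
            status[host] = resolver(host) == {dr_ip}
        except OSError:
            # Resolution failures (socket.gaierror is an OSError) count as not cut over.
            status[host] = False
    return status
```

A call such as `dns_cutover_complete({"ocean.wdc.us.gl3": "192.0.2.10"})` returns a per-host map that can be polled until every value is `True`. The same check, pointed at the home-site addresses, applies to failback.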

4. Roles & Responsibilities

| Role | Primary | Backup |
|---|---|---|
| DR Coordinator | Director, Cyber Security & Cloud Architecture | IT Operations Lead |
| Hypervisor Recovery | Senior SysAdmin | IT Ops Engineer |
| Network Recovery | Network Engineer | Director |
| Backup / Restore | IT Ops Engineer | Senior SysAdmin |
| Communications | Comms Lead | Director |
| SOC liaison | SOC Analyst on-call | Cyber Sec Engineer |

5. Communications Plan

  • Internal channel: Slack #wdc-ops + email it-ops@greenpeace.org.
  • Staff updates: status.greenpeace.us plus all-staff email. Send the initial update within 30 min, then every 60 min until resolved.
  • Vendors: VMware support, NAS vendor, Meraki support, GCP support.
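
The staff-update cadence above (initial within 30 min, then every 60 min until resolved) can be turned into a list of deadlines for the Communications role. The Python sketch below is one way to do that. The function name and the decision to stop generating deadlines at the resolution time are assumptions for this example.

```python
from datetime import datetime, timedelta


def staff_update_deadlines(
    declared_at,
    resolved_at,
    initial=timedelta(minutes=30),
    interval=timedelta(minutes=60),
):
    """Return the times by which staff updates are due.

    The first update is due `initial` after declaration, then one every
    `interval` until the incident is resolved. Deadlines at or after
    resolved_at are omitted (assumption: the restoration notice replaces them).
    """
    deadlines = []
    due = declared_at + initial
    while due < resolved_at:
        deadlines.append(due)
        due += interval
    return deadlines
```

For an incident declared at 09:00 and resolved at 12:00, the deadlines are 09:30, 10:30, and 11:30. During a live incident, passing a projected resolution time gives a provisional schedule that can be regenerated as estimates change.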

6. Dependencies

| Dependency | Owner | DR Consideration |
|---|---|---|
| Meraki cloud (config) | Meraki | Configs are cloud-hosted; replacement device pulls config on registration |
| GCS bucket gpus-wdc-backups | Cyber Sec | Multi-region; bucket lock prevents deletion |
| Veeam licensing | IT Ops | NFR license available; production keys in vault |
| DNS (*.wdc.us.gl3) | Cloud team | TTLs lowered on Tier 1 records to 300 s |

7. Test Register

| Date | Type | Scope | Result | Findings Closed |
|---|---|---|---|---|
| 2026-05-12 | Plan baselined | n/a | Initial draft | n/a |
| to be added | File-level restore | ocean config files | | |
| to be added | Full-VM restore | ocean to sandbox | | |
| to be added | Quarterly host rebuild | Lab ESXi | | |
| to be added | Annual cross-region drill | Tier 1 to GCP DR | | |

8. Compliance Mapping

| Framework | Requirement | Addressed by |
|---|---|---|
| NIST CSF 2.0 | RC.RP, RC.CO | §3, §5 |
| CIS Controls v8 | Control 11: Data Recovery | §3, §7 + backup schedule |
| PCI-DSS v4.0 | Req. 12.10 | §3.4 invocation criteria, §5 comms |

9. Change Log

| Date | Change | By |
|---|---|---|
| 2026-05-12 | Initial WDC on-prem DR section | R. Chhetry |