Disaster Recovery — WDC On-Prem
Classification: CONFIDENTIAL — Internal Use Only
Document: response-plans/wdc-on-prem-dr.md · v1.0 · 2026-05-12 · GPUS-IT
Scope
This document covers DR for the WDC on-prem cluster (water.wdc.us.gl3, future fire.wdc.us.gl3, future flower.wdc.us.gl3) and the tenant VMs hosted on it (ocean.wdc.us.gl3 today; sky, rain, wind, sun on Fire later). It is a section of the broader Greenpeace USA Disaster Recovery Plan.
1. Objectives
| Asset Tier | RTO | RPO |
|---|---|---|
| Tier 1 (e.g. ocean/KACE) | 4 h | 4 h |
| Tier 2 | 8 h | 12 h |
| Tier 3 | 24 h | 24 h |
| WDC site (full loss) | 24 h (Tier 1 only, in GCP cold standby) | 24 h |
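The targets above can be checked mechanically during an incident. The following is a minimal sketch, not an existing GPUS-IT tool: the names `TARGETS`, `rpo_met`, and `rto_remaining` are hypothetical, and the timestamps would in practice come from Veeam and the incident ticket.

```python
from datetime import datetime, timedelta, timezone

# Targets copied from the objectives table (hours). Keys are illustrative names.
TARGETS = {
    "tier1": {"rto_h": 4, "rpo_h": 4},
    "tier2": {"rto_h": 8, "rpo_h": 12},
    "tier3": {"rto_h": 24, "rpo_h": 24},
}


def rpo_met(tier, last_restore_point, now):
    """Return True if the newest restore point is no older than the tier's RPO."""
    age = now - last_restore_point
    return age <= timedelta(hours=TARGETS[tier]["rpo_h"])


def rto_remaining(tier, outage_start, now):
    """Return the time left before the tier's RTO is breached (negative once breached)."""
    deadline = outage_start + timedelta(hours=TARGETS[tier]["rto_h"])
    return deadline - now


# Example with made-up timestamps: outage at 09:00 UTC, last backup at 06:30 UTC.
now = datetime(2026, 5, 12, 10, 0, tzinfo=timezone.utc)
outage_start = datetime(2026, 5, 12, 9, 0, tzinfo=timezone.utc)
last_backup = datetime(2026, 5, 12, 6, 30, tzinfo=timezone.utc)

print(rpo_met("tier1", last_backup, now))
print(rto_remaining("tier1", outage_start, now))
```

Keeping the targets in one dictionary means a change to the table only has to be mirrored in one place.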
2. Disaster Classifications
| Tier | Definition | Examples | Lead Decision Maker |
|---|---|---|---|
| 1 - Catastrophic | WDC office unreachable / destroyed | Fire, flood, prolonged power loss > 24 h | Director, Cyber Security & Cloud Architecture |
| 2 - Major | Cluster offline, hardware salvageable | Both Meraki + core switch failure, NAS failure | IT Operations Lead |
| 3 - Moderate | Single host failure, VMs need migration | water hardware failure | On-call engineer |
| 4 - Minor | Single VM failure, no host impact | ocean corruption | On-call engineer |
3. Recovery Strategy
3.1 Tier 4 (single VM)
- Open KACE ticket.
- Restore from last good Veeam restore point (RPO ≤ 4 h for Tier 1).
- Validate per Restore Testing checklist.
- Sign off and close ticket.
3.2 Tier 3 (single host)
- Power on standby spares / vMotion-eligible peer host (once cluster has 2+ nodes).
- Until Fire/Flower exist: emergency rebuild of water from the Hypervisor Rebuild Runbook, then restore ocean from latest backup.
- Communicate to staff: ETA, scope, expected impact.
3.3 Tier 2 (cluster offline, hardware OK)
- Isolate suspected cause (switching, storage, hypervisor).
- Engage vendor support (VMware, NAS vendor, Meraki) in parallel.
- If the projected recovery time exceeds the RTO, declare Tier 1 and proceed to §3.4.
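The escalation rule in the last step can be written as a simple comparison. This sketch assumes that "RTO" means the strictest RTO among the affected asset tiers from §1, and that the projected recovery time is the elapsed outage plus the engineer's estimate of remaining work. Both are interpretations, and `should_declare_tier1` is a hypothetical helper name.

```python
# RTO per asset tier in hours, copied from section 1.
RTO_HOURS = {"tier1": 4, "tier2": 8, "tier3": 24}


def should_declare_tier1(affected_tiers, elapsed_h, estimated_remaining_h):
    """Return True when projected recovery would exceed the strictest affected RTO.

    affected_tiers: asset tiers with workloads currently offline.
    elapsed_h: hours since the outage began.
    estimated_remaining_h: best estimate of hours still needed on site.
    """
    strictest_rto = min(RTO_HOURS[tier] for tier in affected_tiers)
    return elapsed_h + estimated_remaining_h > strictest_rto


# 1.5 h into the outage with an estimated 3 h of work left.
print(should_declare_tier1(["tier1", "tier2"], 1.5, 3.0))  # True: 4.5 h > 4 h
print(should_declare_tier1(["tier2"], 1.5, 3.0))           # False: 4.5 h <= 8 h
```

The decision to declare still rests with the lead decision maker; the check only makes the threshold explicit.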
3.4 Tier 1 (site loss → failover to GCP cold standby)
The WDC cluster has a cold-standby footprint in GCP sufficient to bring Tier 1 workloads online from offsite GCS backups.
Failover steps:
- Declare disaster — Director (or delegate) authorizes failover.
- Activate the DR project in GCP (gpus-wdc-dr, region us-east1, kept dormant).
- Stand up Veeam restore proxy in the DR project via Terraform (gpus-wdc-dr/terraform).
- Restore Tier 1 VMs (ocean first) from gs://gpus-wdc-backups using Veeam Direct Restore to GCP.
- Re-point DNS for tenant services to the DR IPs (TTL is set to 300 s on Tier 1 records to support fast failover).
- Verify functionality: SSO, KACE inventory, agent check-in, SOC telemetry.
- Communicate restoration to staff via status page + email.
Failback reverses these steps. It is performed during a planned maintenance window once WDC is rebuilt and Veeam shows successful test restores at the home site.
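The scriptable part of the failover steps can be laid out as a dry-run plan. This is a sketch under assumptions: it treats gpus-wdc-dr/terraform as a local directory path, uses a hypothetical plan file name `dr.tfplan`, and leaves the Veeam restore and DNS change as manual steps because their tooling is not described here. `build_failover_plan` and `run_plan` are illustrative names, not existing scripts.

```python
import shlex
import subprocess


def build_failover_plan(tf_dir="gpus-wdc-dr/terraform", bucket="gs://gpus-wdc-backups"):
    """Return ordered (description, argv) steps; argv is None for manual steps."""
    tf = ["terraform", f"-chdir={tf_dir}"]
    return [
        ("Confirm offsite backups are reachable", ["gcloud", "storage", "ls", bucket]),
        ("Initialise Terraform", tf + ["init", "-input=false"]),
        ("Plan the Veeam restore proxy", tf + ["plan", "-input=false", "-out=dr.tfplan"]),
        ("Apply the saved plan", tf + ["apply", "-input=false", "dr.tfplan"]),
        ("Restore Tier 1 VMs with Veeam, ocean first", None),
        ("Re-point DNS for tenant services to the DR IPs", None),
    ]


def run_plan(plan, dry_run=True):
    """Describe each step; execute the commands only when dry_run is False."""
    log = []
    for number, (description, argv) in enumerate(plan, start=1):
        if argv is None:
            log.append(f"{number}. MANUAL: {description}")
            continue
        log.append(f"{number}. {description}: {shlex.join(argv)}")
        if not dry_run:
            # check=True stops the run at the first failing command.
            subprocess.run(argv, check=True)
    return log


for line in run_plan(build_failover_plan()):
    print(line)
```

Defaulting to a dry run lets the plan be reviewed during drills without touching the DR project.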
4. Roles & Responsibilities
| Role | Primary | Backup |
|---|---|---|
| DR Coordinator | Director, Cyber Security & Cloud Architecture | IT Operations Lead |
| Hypervisor Recovery | Senior SysAdmin | IT Ops Engineer |
| Network Recovery | Network Engineer | Director |
| Backup / Restore | IT Ops Engineer | Senior SysAdmin |
| Communications | Comms Lead | Director |
| SOC liaison | SOC Analyst on-call | Cyber Sec Engineer |
5. Communications Plan
- Internal channel: Slack #wdc-ops + email it-ops@greenpeace.org.
- Staff updates: status.greenpeace.us and all-staff email. Cadence: initial update within 30 min, then every 60 min until resolved.
- Vendors: VMware support, NAS vendor, Meraki support, GCP support.
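The staff update cadence above translates into a list of deadlines the Comms Lead can track. This sketch assumes the 60-minute interval is counted from the initial 30-minute deadline, which is one reading of the cadence, and `update_deadlines` is a hypothetical helper.

```python
from datetime import datetime, timedelta


def update_deadlines(declared_at, resolved_at):
    """Return staff-update deadlines between declaration and resolution.

    The first update is due 30 minutes after declaration, and further
    updates are due every 60 minutes after that until the incident is resolved.
    """
    deadlines = []
    due = declared_at + timedelta(minutes=30)
    while due < resolved_at:
        deadlines.append(due)
        due += timedelta(minutes=60)
    return deadlines


# Incident declared at 09:00 and resolved at 12:00 (made-up times).
for due in update_deadlines(datetime(2026, 5, 12, 9, 0), datetime(2026, 5, 12, 12, 0)):
    print(due.strftime("%H:%M"))
# Prints 09:30, 10:30, 11:30 on separate lines.
```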
6. Dependencies
| Dependency | Owner | DR Consideration |
|---|---|---|
| Meraki cloud (config) | Meraki | Configs are cloud-hosted; replacement device pulls config on registration |
| GCS bucket gpus-wdc-backups | Cyber Sec | Multi-region; bucket lock prevents deletion |
| Veeam licensing | IT Ops | NFR license available; production keys in vault |
| DNS (*.wdc.us.gl3) | Cloud team | TTLs lowered on Tier 1 records to 300 s |
7. Test Register
| Date | Type | Scope | Result | Findings | Closed |
|---|---|---|---|---|---|
| 2026-05-12 | Plan baselined | - | n/a | Initial draft | n/a |
| to be added | File-level restore | ocean config files | - | - | - |
| to be added | Full-VM restore | ocean to sandbox | - | - | - |
| to be added | Quarterly host rebuild | Lab ESXi | - | - | - |
| to be added | Annual cross-region drill | Tier 1 to GCP DR | - | - | - |
8. Compliance Mapping
| Framework | Requirement | Addressed by |
|---|---|---|
| NIST CSF 2.0 | RC.RP, RC.CO | §3, §5 |
| CIS Controls v8 | Control 11 (Data Recovery) | §3, §7 + backup schedule |
| PCI-DSS v4.0 | Req. 12.10 | §3.4 invocation criteria, §5 comms |
9. Change Log
| Date | Change | By |
|---|---|---|
| 2026-05-12 | Initial WDC on-prem DR section | R. Chhetry |