Snapshot & Backup Schedule - WDC On-Prem
Classification: CONFIDENTIAL - Internal Use Only
Document: architecture/wdc/backup-snapshots/schedule.md · v1.0 · 2026-05-12 · GPUS-IT
3-2-1 Design
- 3 copies of every protected workload: production VM, NAS snapshot, GCS offsite.
- 2 different media: block storage (NAS) and object storage (GCS).
- 1 copy offsite: Google Cloud Storage bucket in a separate region from the WDC office.
1. Protected Assets
| Asset | Host | Tier | RPO | RTO |
|---|---|---|---|---|
| `ocean.wdc.us.gl3` (KACE SMA) | water | Tier 1 - Critical | 4 h | 4 h |
| Future: `sky.wdc.us.gl3` | fire | Tier 1 | 4 h | 4 h |
| Future: `rain.wdc.us.gl3` | fire | Tier 2 | 12 h | 8 h |
| Future: `wind.wdc.us.gl3` | fire | Tier 2 | 12 h | 8 h |
| Future: `sun.wdc.us.gl3` | fire | Tier 3 | 24 h | 24 h |
| ESXi host config (water, future fire/flower) | n/a | Tier 1 | 24 h | 2 h (reapply config) |
| NAS volumes hosting VMDKs | NAS | Tier 1 | 4 h | 4 h |
Tiering rules
- Tier 1: revenue systems, security tooling, or anything whose loss blocks staff productivity.
- Tier 2: important but tolerates a workday of data loss.
- Tier 3: dev / utility / easy to rebuild from IaC.
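The tier targets can be kept as data and checked against a proposed schedule. The sketch below is illustrative only: `TierPolicy`, `TIER_POLICIES`, and `meets_rpo` are hypothetical names, and the values are the VM-level RPO/RTO figures from the Protected Assets table (the ESXi host config row carries its own 24 h / 2 h targets and is not modelled here).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TierPolicy:
    rpo_hours: int  # maximum tolerated data loss
    rto_hours: int  # maximum tolerated time to restore


# VM-level targets copied from the Protected Assets table.
TIER_POLICIES = {
    1: TierPolicy(rpo_hours=4, rto_hours=4),
    2: TierPolicy(rpo_hours=12, rto_hours=8),
    3: TierPolicy(rpo_hours=24, rto_hours=24),
}


def meets_rpo(tier: int, recovery_point_interval_hours: float) -> bool:
    """Return True if recovery points taken at this interval satisfy the tier's RPO."""
    return recovery_point_interval_hours <= TIER_POLICIES[tier].rpo_hours


if __name__ == "__main__":
    # A 4 h recovery-point interval satisfies Tier 1; a 24 h interval does not.
    print(meets_rpo(1, 4), meets_rpo(1, 24))
```

A check like this makes it easy to see that a daily job alone cannot satisfy a 4 h RPO, so a Tier 1 workload also depends on more frequent recovery points.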
2. Snapshot Schedule (VM-level, vSphere)
Snapshots are short-lived recovery points — they are not backups. Maximum age is enforced.
| Workload | Trigger | Frequency | Retention | Quiesced? | Memory? |
|---|---|---|---|---|---|
| Tier 1 VMs | Scheduled | Every 4 h | 24 h (6 snaps max) | Yes | No |
| Tier 1 VMs | Pre-change | Manual, before any patch / config change | 24 h | Yes | Yes |
| Tier 2 VMs | Scheduled | Every 12 h | 48 h | Yes | No |
| Tier 3 VMs | On demand only | — | 72 h max | Yes | No |
| Any VM | Snapshot age guard | — | Auto-delete at 72 h | — | — |
Snapshot hygiene
vSphere snapshots degrade performance the longer they live and the larger their delta grows. A monitor job alerts the SOC when any snapshot is older than 72 h or its delta exceeds 50 GB.
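The two hygiene thresholds can be expressed as a small check. This is a minimal sketch, not the actual monitor job: the `Snapshot` record and `hygiene_violations` function are hypothetical, and it assumes snapshot metadata (creation time, delta size) has already been collected from vSphere by some other means.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

MAX_AGE = timedelta(hours=72)  # age threshold from the schedule
MAX_DELTA_GB = 50.0            # delta-size threshold from the schedule


@dataclass
class Snapshot:
    vm: str            # VM name (hypothetical field)
    created: datetime  # timezone-aware creation time
    delta_gb: float    # current size of the snapshot delta


def hygiene_violations(snapshots, now):
    """Return (snapshot, reasons) pairs for snapshots that breach either threshold."""
    flagged = []
    for snap in snapshots:
        reasons = []
        if now - snap.created > MAX_AGE:
            reasons.append("age")
        if snap.delta_gb > MAX_DELTA_GB:
            reasons.append("delta")
        if reasons:
            flagged.append((snap, reasons))
    return flagged


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    stale = Snapshot("example-vm", now - timedelta(hours=80), 12.0)
    for snap, reasons in hygiene_violations([stale], now):
        print(snap.vm, reasons)
```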
3. Backup Schedule (Image-level, Veeam)
Backups are taken with Veeam Backup & Replication writing to the on-prem NAS repository, then copied offsite to GCS.
| Job | Source | Frequency | Local Retention (NAS) | Offsite Retention (GCS) | Encrypted |
|---|---|---|---|---|---|
| WDC-Tier1-Daily | All Tier 1 VMs | Daily 22:00 ET | 14 daily | 30 daily | AES-256 |
| WDC-Tier1-Weekly | All Tier 1 VMs | Sun 23:00 ET | 4 weekly | 12 weekly | AES-256 |
| WDC-Tier1-Monthly | All Tier 1 VMs | 1st of month | 3 monthly | 12 monthly | AES-256 |
| WDC-Tier1-Annual | All Tier 1 VMs | Jan 1 | n/a | 7 annual | AES-256 |
| WDC-Tier2-Daily | All Tier 2 VMs | Daily 23:00 ET | 7 daily | 14 daily | AES-256 |
| WDC-Tier2-Weekly | All Tier 2 VMs | Sun 23:30 ET | 4 weekly | 8 weekly | AES-256 |
| WDC-Tier3-Weekly | All Tier 3 VMs | Sun 00:30 ET | 4 weekly | 4 weekly | AES-256 |
| WDC-ESXi-Config | Host profiles + IaC export | Daily 21:00 ET | 14 daily | 30 daily | AES-256 |
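The retention columns translate into simple arithmetic. The sketch below assumes each job retains its restore points independently and models "n/a" as zero. `RETENTION`, `steady_state_points`, and `expired_points` are hypothetical helpers for illustration; they are not Veeam settings or API calls.

```python
from datetime import date

# (local NAS, offsite GCS) restore points kept per job, from the table above.
RETENTION = {
    "WDC-Tier1-Daily": (14, 30),
    "WDC-Tier1-Weekly": (4, 12),
    "WDC-Tier1-Monthly": (3, 12),
    "WDC-Tier1-Annual": (0, 7),  # "n/a" locally, modelled as 0
    "WDC-Tier2-Daily": (7, 14),
    "WDC-Tier2-Weekly": (4, 8),
    "WDC-Tier3-Weekly": (4, 4),
    "WDC-ESXi-Config": (14, 30),
}


def steady_state_points(job_prefix: str) -> tuple:
    """Total (NAS, GCS) restore points held once every matching job has filled its retention."""
    jobs = [counts for name, counts in RETENTION.items() if name.startswith(job_prefix)]
    return (sum(nas for nas, _ in jobs), sum(gcs for _, gcs in jobs))


def expired_points(points: list, keep: int) -> list:
    """Return the restore-point dates that fall outside the newest `keep`, oldest first."""
    newest_first = sorted(points, reverse=True)
    return sorted(newest_first[keep:])


if __name__ == "__main__":
    print(steady_state_points("WDC-Tier1"))
    print(expired_points([date(2026, 5, d) for d in range(1, 17)], keep=14))
```

Under these assumptions the Tier 1 jobs hold 14 + 4 + 3 = 21 restore points on the NAS and 30 + 12 + 12 + 7 = 61 in GCS at steady state.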
3.1 Storage Targets
| Target | Type | Path | Capacity | Notes |
|---|---|---|---|---|
| Primary (on-prem) | NAS NFS share | `nas-wdc-01:/backups/wdc` | 10 TB usable | Immutable / object-lock enabled |
| Offsite | GCS bucket | `gs://gpus-wdc-backups` (Coldline → Archive lifecycle) | unlimited | Bucket lock + retention policy (30 d minimum) |
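A Coldline → Archive transition is normally expressed as a GCS lifecycle rule. The sketch below builds such a rule as JSON in the shape Cloud Storage lifecycle configuration files use. The 90-day threshold is a placeholder assumption: this schedule does not state when objects should move to Archive, so treat `ARCHIVE_AFTER_DAYS` and the function name as hypothetical.

```python
import json

# ASSUMPTION: placeholder age; the schedule does not specify the transition point.
ARCHIVE_AFTER_DAYS = 90


def coldline_to_archive_lifecycle(age_days: int) -> dict:
    """Build a lifecycle config that moves Coldline objects to Archive after `age_days`."""
    return {
        "rule": [
            {
                "action": {"type": "SetStorageClass", "storageClass": "ARCHIVE"},
                "condition": {"age": age_days, "matchesStorageClass": ["COLDLINE"]},
            }
        ]
    }


if __name__ == "__main__":
    # Print the JSON so it can be reviewed before being applied to a bucket.
    print(json.dumps(coldline_to_archive_lifecycle(ARCHIVE_AFTER_DAYS), indent=2))
```

Whatever age is chosen, it has to be reconciled with the offsite retention counts in the backup schedule, since some restore points expire before others would ever reach Archive.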
3.2 Encryption & Key Management
- All backup jobs use AES-256 with a Veeam-managed encryption key.
- Master key escrowed in the IT password vault; two-person rule for retrieval.
- GCS bucket uses CMEK (customer-managed encryption keys) from Google Cloud KMS, with keys rotated every 90 days.
4. Restore Testing
Untested backups are not backups. Restore tests are mandatory and tracked.
| Test | Frequency | Method | Owner | Pass criteria |
|---|---|---|---|---|
| File-level restore (random VM) | Weekly | Veeam Instant Restore to sandbox | IT Ops | File matches, hash verified |
| Full-VM restore (Tier 1) | Monthly | Restore to isolated `dr-sandbox` network | IT Ops + SOC | VM boots, services up, no malware indicators |
| Bare-metal ESXi rebuild + config restore | Quarterly | Rebuild lab host from runbook | Cyber Sec | Host profile compliant, telemetry flowing |
| Full DR drill (cross-region) | Annual | Restore Tier 1 from GCS into GCP cold-standby project | All | RTO/RPO met for every Tier 1 asset |
Each test is logged in the DR test register.
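The "file matches, hash verified" criterion for the weekly file-level test can be checked by comparing digests of the source file and the restored copy. This is a minimal sketch that assumes SHA-256 is an acceptable hash and that both files are readable from the machine running the check. The function names are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of a file, read in chunks to bound memory use."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def restore_matches(source_path, restored_path):
    """Return True if the restored file is byte-for-byte identical to the source."""
    return sha256_of(source_path) == sha256_of(restored_path)
```

Recording both digests in the DR test register, rather than only a pass/fail flag, leaves evidence that can be re-checked later.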
5. Monitoring
- Veeam → Wazuh integration via syslog: every job emits start, success, failure, and warning events.
- Failure of any Tier 1 job → immediate page to on-call.
- Two consecutive failures of any tier → high-severity ticket auto-created in KACE.
- Daily summary email to `it-ops@greenpeace.org`.
- Dashboard tile on `soc.greenpeace.us` shows last-good-backup age for every protected VM.
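The two escalation rules above can be sketched as a pure function over a job's recent outcomes. This is illustrative only: the action names are hypothetical labels, not Wazuh or KACE identifiers, and the function assumes outcomes are supplied oldest to newest with the current run last.

```python
def escalation_actions(tier: int, recent_results: list) -> list:
    """Decide escalations for a job.

    recent_results holds job outcomes oldest to newest (True = success);
    the last entry is the run that just finished.
    """
    actions = []
    if not recent_results or recent_results[-1]:
        return actions  # nothing to escalate when the latest run succeeded

    # Rule 1: any Tier 1 failure pages on-call immediately.
    if tier == 1:
        actions.append("page_on_call")

    # Rule 2: two consecutive failures in any tier open a high-severity ticket.
    if len(recent_results) >= 2 and not recent_results[-2]:
        actions.append("open_high_severity_ticket")

    return actions


if __name__ == "__main__":
    print(escalation_actions(1, [True, False]))
    print(escalation_actions(2, [False, False]))
```

Keeping the decision separate from the alert delivery makes the rules easy to test without sending real pages or tickets.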
6. Compliance Mapping
| Control / Framework | Section | How addressed |
|---|---|---|
| CIS Controls v8 | Control 11 — Data Recovery | Tiered schedule, offsite copy, tested restores |
| NIST CSF 2.0 | PR.DS-11 (data backup), RC.RP (recovery planning) | Documented schedule + DR plan |
| PCI-DSS v4.0 | Req. 9.4.1, 12.10.1 | Backup media security, IR/DR alignment |
| Greenpeace IRP | §8 (Recovery) | Backups feed the IRP recovery phase |
7. Change Log
| Date | Change | By |
|---|---|---|
| 2026-05-12 | Initial schedule authored for WDC cluster | R. Chhetry |