2026-05-09 WDC HVAC/UPS Cascade — Post-Incident Review¶

Document version: 1.0 (2026-05-13) Author: Rajesh Chhetry — Director of Cyber Security / IT Infrastructure Lead, GPUS Classification: Internal — IT Operations Repository path: mkdocs-portal/docs/incidents/2026-05-09-wdc-hvac-ups-cascade.md

1. Executive summary¶

On Saturday 2026-05-09, the HVAC unit serving the WDC Eye Street server room failed. Ambient temperature rose to approximately 110°F (43°C) per the HVAC unit's onboard temperature reader. The APC UPS detected the over-temperature condition and cut power to IT equipment as a protective response. DNS, DHCP, the on-premise hypervisor stack (FIRE ESXi), and all hosted VMs (SKY, RAIN, SUN, WIND, plus the Synology NAS VM) went offline.

The outage was not noticed until Monday 2026-05-11 at approximately 07:00 EST, roughly 43 hours after initial failure. Once discovered, IT brought services back online in six hours through a deliberate bottom-up restoration sequence. Total outage duration: approximately 67 hours (Sat 12:00 EST → Mon 13:00 EST).

GCP cloud infrastructure was unaffected throughout. The Cloud Run portals (status, infra, soc, forms) remained available externally. The site-to-site VPN tunnel dropped when WAN1 went down (known P3 gap — Classic VPN single-peer architecture).

Two Elasticsearch indices were lost to unsafe-shutdown corruption on WIND (wdc-logs-2026.05.09 and dhcp-leases-2026.05.09). Both were recovered to disk preservation but removed from the live cluster. WIND was returned to service on 2026-05-13 after extensive diagnostic and repair work documented in Section 4.

The incident exposed a single environmental failure domain that cascaded through power → network → storage → compute → services. It also exposed a discovery-time gap: nobody was alerted to the building-level failure for 43 hours.

2. Timeline¶

Time (EST)	Event	Source
2026-05-09 ~12:00 Sat	HVAC unit fails	HVAC unit log
2026-05-09 (post-12:00)	Server room temperature rises to ~110°F	HVAC temperature reader
2026-05-09 (post-12:00)	APC UPS detects over-temperature, cuts power to IT equipment	APC event log (TBC)
2026-05-09 (immediately after)	DNS, DHCP, network switches, MX95 firewall pair, ESXi hypervisor, all WDC VMs unsafe shutdown	Inferred
2026-05-09 (immediately after)	GCP site-to-site VPN tunnel drops (WAN1 down, Classic VPN single-peer)	Memory entry #1
2026-05-09 to 2026-05-11	WDC offline. GCP services continue operating externally. No internal alerting fires.	—
2026-05-11 ~07:00 Mon	Outage discovered	Operations
2026-05-11 (HVAC restart)	HVAC unit reset, cooling begins	Operations
2026-05-11 (during cooling)	Power cables moved from tripped APC to direct power strips for restoration	Operations
2026-05-11 (sequence)	Switches and MX95 firewall pair brought up	Operations
2026-05-11 (sequence)	Synology NAS VM brought up	Operations
2026-05-11 (sequence)	FIRE ESXi hypervisor brought up	Operations
2026-05-11 (sequence)	SKY and RAIN brought up; DNS/DHCP restored	Operations
2026-05-11 (sequence)	SUN and WIND brought up	Operations
2026-05-11 (sequence)	Wireless APs received DHCP, greenlink and guest SSIDs restored	Operations
2026-05-11 ~13:00 Mon	Services restored. Staff email sent.	Operations
2026-05-12 Tue	APC reset, equipment migrated back from power strips to UPS	Operations
2026-05-11 to 2026-05-13	WIND remained in degraded state — VM reachable but Elasticsearch service failed	Diagnostic chat 2026-05-13
2026-05-13	WIND XFS filesystem repair completed; cluster restored to yellow	Recovery session 2026-05-13

Approximate total outage duration: 67 hours (Saturday 12:00 EST to Monday 13:00 EST). Discovery-to-restoration duration: 6 hours (Monday 07:00 to 13:00 EST). Detection latency: approximately 43 hours.

3. Cascade analysis¶

3.1 The failure chain¶

HVAC unit failure leads to server room ambient temperature rising to approximately 110°F. APC UPS detects over-temperature and trips as a protective response. All IT equipment loses power simultaneously: network switches, MX95 firewall pair, Synology NAS, and FIRE ESXi host all go down. All hosted VMs on FIRE (SKY, RAIN, SUN, WIND) suffer unsafe shutdown. DNS, DHCP, Prometheus, Elasticsearch, and Wazuh all become unavailable. The site-to-site VPN tunnel drops because Classic VPN uses a single peer (WAN1 VRRP VIP) and WAN1 went down with the rest of the equipment.

3.2 Why this was a single failure domain¶

A single environmental event (HVAC failure) propagated through every layer of the on-premise stack within minutes. The APC UPS did exactly what it was designed to do — protect equipment from thermal damage by cutting power. But the protection action and the failure action were the same action: total loss of power across all IT equipment in the rack.

There is no architectural redundancy in WDC that survives a server-room-level environmental event:

Both MX95 firewalls are in the same room
All four production VMs (SKY/RAIN/SUN/WIND) run on a single ESXi host (FIRE)
The Synology NAS that serves as ESXi datastore and NFS backup target is also in the same room
The APC UPS is single-unit, not paired
WAN1 (Cogent) drop took the VPN tunnel with it (Classic VPN single-peer architecture)

This is a structural property of the WDC site, not a misconfiguration. Multi-site redundancy was outside scope until the GCP cloud expansion (MAPLE, CEDAR, OAK) was built. That expansion is what kept the SOC operational during this incident.

3.3 What stayed up — GCP cloud and why¶

GCP infrastructure was entirely unaffected:

Cloud Run portals (status, infra, soc, forms) continued serving externally
MAPLE (Prometheus + Grafana + Wazuh Manager) continued operating
CEDAR (Elasticsearch + Logstash + Kibana + Wazuh Indexer) continued operating
OAK (OpenVAS scanner) continued operating
Cloud SQL forms database continued operating

What was affected on the cloud side:

The site-to-site VPN tunnel dropped when WAN1 went down. This is the known P3 gap: Classic VPN uses a single peer IP (WAN1 VRRP VIP); WAN1 failure drops the tunnel even though WAN2 (Zayo) was still up. HA VPN with BGP would have failed over.
Monitoring of on-premise hosts via Prometheus scrape stopped. MAPLE's view of the on-prem servers went blank.
SOC dashboard's on-premise data feeds went stale. CEDAR's view of on-prem Wazuh agents went silent.

The cloud SIEM remained the canonical view of security state throughout. Wazuh alerts from cloud VMs continued ingesting normally.

3.4 The discovery-time gap¶

This is the single biggest finding in this incident.

The outage began Saturday at approximately 12:00 EST. It was not discovered until Monday at approximately 07:00 EST. That is roughly 43 hours during which the entire on-premise infrastructure was offline and nobody knew.

Why nobody knew:

No automated monitoring at the building or environmental layer (no temperature alerting, no UPS-event alerting reaching an external destination)
The Prometheus alerting that does exist runs on MAPLE in GCP — and while MAPLE could see that the on-prem scrape targets were down, those alerts depend on alertmanager routing that has not been fully configured for off-hours pager-style escalation
Weekend staff were not on-site
External-facing portals (status, infra, soc, forms) were running on GCP and reachable, so end-users had no symptom to report
Internal staff at WDC who would have noticed were not in the building over the weekend

Detection through normal Monday-morning physical presence at the office is the path that worked. That is not a control. The next incident of this type, if it happens on a holiday weekend or with staff on vacation, could last longer.

4. WIND recovery — 2026-05-13¶

48 hours after the building came back online, WIND was still in a degraded state. The VM was up and reachable, but the Elasticsearch service was failed because the var-lib-elasticsearch.mount systemd unit had failed. The data disk could not be mounted at boot due to XFS corruption from the unsafe shutdown.

4.1 Diagnostic sequence¶

VM reachability confirmed — ping, SSH working, uptime ~46 hours (consistent with reboot after the thermal event)
Service state characterized — elasticsearch.service failed exit 78 because /var/lib/elasticsearch/tmp didn't exist (because the mount didn't mount). kibana and logstash were active but writing to a non-functional backend.
LVM state intact — data-elastic LV present, 167.64 GB, active but not opened (because fstab line was commented out and mount never attempted).
Kernel reported XFS corruption — dmesg showed: XFS (dm-7): Metadata CRC error detected at xfs_dir3_leaf_read_verify ... xfs_dir3_leaf1 block 0x13c79aa8 — Unmount and run xfs_repair
Read-only mount succeeded — mount -o ro,norecovery worked, filesystem readable
XFS metadata intact — xfs_info showed clean geometry, primary superblock valid, no torn metadata write
xfs_repair -n dry-run — Identified: log can't be replayed (needs -L to zero), metadata CRC error in dir3 leaf block of inode 403716877 (single corrupted directory), 3 orphan inodes, counter drift
Corrupted inode identified — xfs_db revealed inode 403716877 belonged to /indices/SfkGa5GWQfm5ENM_LbesLw/0/index/, which is the Lucene index directory for one shard. Index identified as wdc-logs-2026.05.09 — the daily rolling log index for the day the heat event started.
Backup posture verified — Most recent off-host (GCS) ES snapshot was 2026-05-08 03:46, predating the broken index's creation. Snapshot restore could not recover this index either way. Decision: proceed with repair rather than full restore, since repair preserves the other 308 indices' data through their last-write times.

4.2 What xfs_repair found vs predicted¶

xfs_repair -L ran cleanly in approximately 60 seconds. Exit code 0.

Predicted findings (matched): - Log zeroed (Phase 2) - 4 leftover CoW extents cleared - 3 orphan inodes moved to lost+found (134218835, 134267921, 134267955 — all 0-byte FIFOs from April, pre-existing) - Counter drift corrected - dir3 leaf block 0x13c79aa8 of inode 403716877 marked invalid, directory rebuilt with whatever children survived

Unexpected but explainable: - 257 reflink-flag clears across all 4 allocation groups. Routine post-unclean-shutdown cleanup — XFS stops tracking stale CoW relationships. File data untouched. - Three benign warnings at process exit (recursive-buffer-locking, cache_purge, cache_zero_check) — known xfs_repair tool internal artifacts.

Phase 6 partially recovered the broken directory rather than zeroing it. Post-repair inspection showed the index/ directory had 50 files with most Lucene segment triples (.cfe/.cfs/.si) intact. This was better than the dry-run predicted.

4.3 The two lost indices¶

After ES startup, the cluster came up RED with TWO unassigned primary shards, not one:

wdc-logs-2026.05.09 — Expected. Even though the data directory was moved aside (SfkGa5GWQfm5ENM_LbesLw.broken-2026-05-13), ES retained the index in cluster state metadata. ES tried to allocate, found no data, marked red. Resolved by DELETE /wdc-logs-2026.05.09 to remove from cluster state.

dhcp-leases-2026.05.09 — Unexpected. Translog truncated mid-record at byte 4769 (one byte short of a 4-byte payload header). XFS reported the file as intact at the filesystem level — the truncation is inside the file. ES couldn't replay the translog past the corruption point. Allocation explain showed: TranslogCorruptedException: translog from source ... is corrupted, translog truncated ... read past EOF.

This second corruption was invisible to xfs_repair because it's not an XFS metadata problem. The file existed, the byte count matched what XFS believed, the inode was clean. The damage was at the application data layer.

Both indices were preserved on disk and in /backup/logging/wind-incident-2026-05-11/ before deletion from cluster state. Forensic reattachment remains possible if the data is ever needed.

4.4 Multi-layer corruption from unsafe shutdown¶

An unsafe shutdown of an actively-writing Elasticsearch node produces corruption at multiple independent layers:

Layer	Type of damage	Detection tool
XFS metadata (directory entries, inodes, log)	Block CRC errors, log replay failures	`xfs_repair -n`
File contents (translog records, Lucene segments)	Truncated records mid-write, partial last commits	ES startup recovery, or `elasticsearch-shard` tool
Cluster state	References to indices whose data is now broken	`_cluster/allocation/explain`

Each layer needs its own diagnostic and repair tool. xfs_repair handles only the first. The second can be invisible until ES tries to read it. The third only surfaces during cluster startup.

Implication: future recovery work after unsafe shutdown must plan for multi-pass diagnosis. The "clean cluster on first startup attempt" outcome assumed at the start of this recovery was optimistic.

5. Lessons¶

5.1 The detection-time gap is the most important finding¶

43 hours of outage went unnoticed because no monitoring path reached a human. This is not a server-room problem or a UPS problem. It is an alerting-architecture problem.

The on-prem servers had monitoring (Prometheus on SUN/WIND scraping local exporters), but that monitoring lived on the same failure domain it was supposed to monitor. The cloud-side monitoring (MAPLE) could see that scrape targets were down, but its alert routing was not configured to escalate to off-hours notification reliably.

5.2 Single environmental failure domain¶

Every on-prem service depends on one HVAC unit, one UPS, one ESXi host, one network switch stack. Any one of those failing takes everything down.

This is not a "fix in one week" problem — it's a budget and capital expenditure conversation about server room redundancy. Worth raising with facilities and leadership as a structural risk, not as an IT-only concern.

5.3 fstab without nofail blocks recovery¶

The data-elastic fstab line was commented out before the incident — apparently a defensive measure from prior maintenance. Had it been uncommented without nofail, an unmountable disk would have blocked WIND's boot entirely, requiring single-user-mode recovery rather than what we did.

New standard: all data-disk fstab entries on WDC and GCP servers should include nofail,x-systemd.automount. Boot must always complete. Mount failures should be reported via monitoring, not by blocking boot.

This has been applied to WIND. SKY, RAIN, SUN, and the GCP VMs (OAK, MAPLE, CEDAR) should be reviewed.

5.4 Multi-layer corruption requires multi-tool recovery¶

Plan future unsafe-shutdown recovery around the assumption that XFS metadata corruption, file content corruption, and cluster state inconsistency are independent and each needs its own tool. Don't expect one repair pass to find everything.

5.5 Off-host backups validated under stress¶

The GCS-mirrored backup at gs://gpus-infra-backups-wdc/wind/ was intact and accessible during WIND's downtime. Local /backup/logging/ on WIND was unreachable while WIND was down. The off-host mirror pattern works as intended.

Gap: /backup/logging/ lives on the same VG (data) as the broken data-elastic LV — same physical disk (sdb). If sdb had failed, both would have gone together. The GCS mirror is the only protection against that case. Backup-source-isolation pattern should be reviewed for SUN and the GCP VMs.

5.6 CEDAR cloud SIEM kept SOC capability during outage¶

CEDAR (the cloud Elasticsearch + Wazuh Indexer) continued ingesting cloud VM alerts throughout the WDC outage. SOC dashboard remained externally reachable. The cloud expansion paid for itself in this incident.

Caveat: wazuh-indexer on CEDAR is currently inactive (discovered during diagnostic). Cloud-side SIEM has degraded redundancy. Filed as separate follow-up.

5.7 Read-first discipline prevented worse outcomes¶

The recovery used dry-run analysis (xfs_repair -n), read-only mount inspection, and metadata reads (xfs_db -r) before any destructive action. This took longer than "run xfs_repair -L and see what happens" but caught the second corrupted index (dhcp-leases translog) before it could create more damage.

The discipline of "diagnose before acting on production data" should be the standard for all future infrastructure recovery work.

6. What worked¶

Bottom-up restoration sequence on 2026-05-11 — HVAC, then network, then storage, then hypervisor, then VMs, then services, then APs, then wireless. Methodical, no skipped layers.
Temporary power-strip migration during APC recovery — let restoration proceed in parallel with UPS reset, didn't gate on the protective device being healthy.
Tuesday APC re-migration after the UPS itself was verified — equipment moved back to UPS only after the protective device was confirmed safe.
GCP cloud expansion (MAPLE, CEDAR, OAK) — kept SOC and external portals operational throughout.
GCS off-host backup mirror — preserved 2026-05-08 ES snapshot, available throughout the WIND outage.
Staff communication — incident notification email sent on 2026-05-11 once services restored, three pre-drafted versions allowed quick selection by audience tone.
ESXi snapshot policy on FIRE — daily and weekly snapshots of SKY/RAIN/SUN/WIND were running. While not needed for this recovery (the LV-level repair worked), the safety net existed.
Read-first diagnostic discipline during WIND recovery — found the second corrupted index before destructive action, preserved both on disk, complete forensic record in /backup/logging/wind-incident-2026-05-11/.

7. What didn't work — open gaps¶

#	Gap	Severity	Notes
G1	No automated alerting for building-level failure	Critical	43-hour detection gap. Highest-priority follow-up.
G2	Server-room temperature monitoring with alerting	Critical	HVAC has temperature reader but no external alerting. APC UPS event logging may also not be wired to external monitoring.
G3	Off-hours pager-style escalation	Critical	MAPLE/Prometheus alertmanager exists but routing for weekend/off-hours not configured.
G4	Classic VPN single-peer dropped tunnel as expected	High	Already filed as P3. This incident reinforces priority. HA VPN with BGP failover to WAN2 (Zayo) needed.
G5	NAS mount paths failed on WIND post-recovery	Medium	`mnt-backup-nas` and `mnt-nas-backup` still failed as of 2026-05-13. Synology reachability or path renaming. Separate investigation.
G6	wazuh-indexer inactive on CEDAR	Medium	Discovered during WIND recovery. Cloud SIEM redundancy degraded. Separate investigation.
G7	`/backup/logging/` on same disk as `data-elastic` (WIND sdb)	Medium	If sdb had failed, local backups would have been unrecoverable. GCS mirror is the only off-host copy. Backup-source-isolation needs review.
G8	ESXi snapshots didn't prevent VM data corruption	Medium	Snapshots existed but weren't used in recovery because LV-level repair was sufficient. Snapshots are crash-consistent, not application-consistent. For ES specifically, application-level snapshot is the authoritative restore artifact.
G9	Two ES indices lost (wdc-logs-2026.05.09, dhcp-leases-2026.05.09)	Low (operationally), Medium (forensically)	Both preserved on disk for potential reattachment. DHCP lease history specifically may be needed for any later forensic correlation.
G10	fstab without `nofail` would have blocked boot	Resolved	Now hardened on WIND with `nofail,x-systemd.automount`. Same fix needed on SKY, RAIN, SUN, GCP VMs.
G11	Single failure domain (one HVAC, one UPS, one ESXi host, one switch stack)	Structural	Long-term: requires facilities investment, leadership conversation.

8. Follow-up items¶

#	Action	Priority	Owner	Target
F1	Building/temperature alerting reaching off-hours notification	P1	IT + Facilities	2 weeks
F2	MAPLE alertmanager off-hours escalation configuration	P1	IT	1 week
F3	APC UPS event log to external SIEM (CEDAR or MAPLE)	P1	IT	2 weeks
F4	Review and harden fstab on SKY, RAIN, SUN with `nofail,x-systemd.automount`	P2	IT	1 week
F5	Review and harden fstab on GCP VMs (OAK, MAPLE, CEDAR) with same pattern	P2	IT	1 week
F6	NAS mount investigation on WIND (mnt-backup-nas, mnt-nas-backup)	P2	IT	1 week
F7	wazuh-indexer restoration on CEDAR	P2	IT	1 week
F8	HA VPN migration with BGP (already filed as P3)	P3 to P2	IT	1 month — promoted due to this incident
F9	Backup-source-isolation review (GCS off-host mirroring for all critical data, separate VG from data)	P3	IT	2 weeks
F10	Incident runbook `rb-009: Datacenter heat/power event recovery`	P3	IT	2 weeks
F11	Runbooks for Meraki, NAS, ESXi failure scenarios (rb-006/007/008)	P3	IT	1 month
F12	Facilities conversation: server room redundancy (HVAC pairing, UPS pairing)	P3	Facilities + Leadership	Budget cycle
F13	Tabletop exercise: same incident, but during a holiday weekend	P3	IT	Quarterly
F14	Controlled reattachment experiment for wdc-logs-2026.05.09 if data needed	P4	IT	As needed
F15	Controlled reattachment experiment for dhcp-leases-2026.05.09 if data needed	P4	IT	As needed

9. Forensic record — preserved data¶

All preservation in gs://gpus-infra-backups-wdc/wind/ and on WIND at /backup/logging/wind-incident-2026-05-11/:

Path	Contents	Size
`/backup/logging/wind-incident-2026-05-11/wdc-logs-2026.05.09-translog/`	Pre-repair preservation of broken index's translog + state files	~10 KB, 7 files
`/backup/logging/wind-incident-2026-05-11/xfs-repair-20260513-114321.log`	Full xfs_repair output	14,889 bytes, 363 lines
`/backup/logging/wind-incident-2026-05-11/broken-index-post-repair/`	wdc-logs-2026.05.09 full data dir, post-XFS-repair, pre-rename	1.4 MB, 50 files in 0/index/
`/backup/logging/wind-incident-2026-05-11/dhcp-leases-broken-post-repair/`	dhcp-leases-2026.05.09 full data dir before DELETE	736 KB
`/backup/logging/wind-incident-2026-05-11/dhcp-leases-broken-post-repair/translog-tail/`	Explicit translog file capture	13 files
`/var/lib/elasticsearch/indices/SfkGa5GWQfm5ENM_LbesLw.broken-2026-05-13/`	On-disk renamed wdc-logs data dir, ignored by ES	1.4 MB
`/etc/fstab.bak-20260513-120959`	fstab backup before the nofail+x-systemd.automount edit	—

10. Appendix: xfs_repair output summary¶

Full log preserved at /backup/logging/wind-incident-2026-05-11/xfs-repair-20260513-114321.log.

Key findings:

Phase 2 — zero log: Log unreplayable, zeroed. clearing needs_recovery flag.
Phase 3 — process known inodes: 3 orphan inodes identified (134218835, 134267921, 134267955), all pre-existing 0-byte FIFOs from 2026-04-08, unrelated to the heat event.
Phase 4 — reflink cleanup: 257 inodes had reflink flags cleared across all 4 allocation groups. Routine post-unclean-shutdown.
Phase 5 — AG headers rebuilt: Superblock counters reset.
Phase 6 — directory verification: Metadata CRC error detected ... xfs_dir3_leaf1 block 0x13c79aa8/0x1000 ... leaf block 8388608 for directory inode 403716877 bad CRC ... rebuilding directory inode 403716877 (partial recovery — most children reattached). Three orphan inodes moved to lost+found/.
Phase 7 — link count verification: Clean. Three benign warnings at process exit (recursive-buffer-locking, cache_purge, cache_zero_check) — known xfs_repair tool internal artifacts.
Exit code: 0

End of document.

Document maintained in mkdocs-portal/docs/incidents/. Version-controlled.