Skip to content

2026-05-09 WDC HVAC/UPS Cascade — Post-Incident Review

Document version: 1.0 (2026-05-13) Author: Rajesh Chhetry — Director of Cyber Security / IT Infrastructure Lead, GPUS Classification: Internal — IT Operations Repository path: mkdocs-portal/docs/incidents/2026-05-09-wdc-hvac-ups-cascade.md


1. Executive summary

On Saturday 2026-05-09, the HVAC unit serving the WDC Eye Street server room failed. Ambient temperature rose to approximately 110°F (43°C) per the HVAC unit's onboard temperature reader. The APC UPS detected the over-temperature condition and cut power to IT equipment as a protective response. DNS, DHCP, the on-premise hypervisor stack (FIRE ESXi), and all hosted VMs (SKY, RAIN, SUN, WIND, plus the Synology NAS VM) went offline.

The outage was not noticed until Monday 2026-05-11 at approximately 07:00 EST, roughly 43 hours after initial failure. Once discovered, IT brought services back online in six hours through a deliberate bottom-up restoration sequence. Total outage duration: approximately 67 hours (Sat 12:00 EST → Mon 13:00 EST).

GCP cloud infrastructure was unaffected throughout. The Cloud Run portals (status, infra, soc, forms) remained available externally. The site-to-site VPN tunnel dropped when WAN1 went down (known P3 gap — Classic VPN single-peer architecture).

Two Elasticsearch indices were lost to unsafe-shutdown corruption on WIND (wdc-logs-2026.05.09 and dhcp-leases-2026.05.09). Both were recovered to disk preservation but removed from the live cluster. WIND was returned to service on 2026-05-13 after extensive diagnostic and repair work documented in Section 4.

The incident exposed a single environmental failure domain that cascaded through power → network → storage → compute → services. It also exposed a discovery-time gap: nobody was alerted to the building-level failure for 43 hours.


2. Timeline

Time (EST) Event Source
2026-05-09 ~12:00 Sat HVAC unit fails HVAC unit log
2026-05-09 (post-12:00) Server room temperature rises to ~110°F HVAC temperature reader
2026-05-09 (post-12:00) APC UPS detects over-temperature, cuts power to IT equipment APC event log (TBC)
2026-05-09 (immediately after) DNS, DHCP, network switches, MX95 firewall pair, ESXi hypervisor, all WDC VMs unsafe shutdown Inferred
2026-05-09 (immediately after) GCP site-to-site VPN tunnel drops (WAN1 down, Classic VPN single-peer) Memory entry #1
2026-05-09 to 2026-05-11 WDC offline. GCP services continue operating externally. No internal alerting fires.
2026-05-11 ~07:00 Mon Outage discovered Operations
2026-05-11 (HVAC restart) HVAC unit reset, cooling begins Operations
2026-05-11 (during cooling) Power cables moved from tripped APC to direct power strips for restoration Operations
2026-05-11 (sequence) Switches and MX95 firewall pair brought up Operations
2026-05-11 (sequence) Synology NAS VM brought up Operations
2026-05-11 (sequence) FIRE ESXi hypervisor brought up Operations
2026-05-11 (sequence) SKY and RAIN brought up; DNS/DHCP restored Operations
2026-05-11 (sequence) SUN and WIND brought up Operations
2026-05-11 (sequence) Wireless APs received DHCP, greenlink and guest SSIDs restored Operations
2026-05-11 ~13:00 Mon Services restored. Staff email sent. Operations
2026-05-12 Tue APC reset, equipment migrated back from power strips to UPS Operations
2026-05-11 to 2026-05-13 WIND remained in degraded state — VM reachable but Elasticsearch service failed Diagnostic chat 2026-05-13
2026-05-13 WIND XFS filesystem repair completed; cluster restored to yellow Recovery session 2026-05-13

Approximate total outage duration: 67 hours (Saturday 12:00 EST to Monday 13:00 EST). Discovery-to-restoration duration: 6 hours (Monday 07:00 to 13:00 EST). Detection latency: approximately 43 hours.


3. Cascade analysis

3.1 The failure chain

HVAC unit failure leads to server room ambient temperature rising to approximately 110°F. APC UPS detects over-temperature and trips as a protective response. All IT equipment loses power simultaneously: network switches, MX95 firewall pair, Synology NAS, and FIRE ESXi host all go down. All hosted VMs on FIRE (SKY, RAIN, SUN, WIND) suffer unsafe shutdown. DNS, DHCP, Prometheus, Elasticsearch, and Wazuh all become unavailable. The site-to-site VPN tunnel drops because Classic VPN uses a single peer (WAN1 VRRP VIP) and WAN1 went down with the rest of the equipment.

3.2 Why this was a single failure domain

A single environmental event (HVAC failure) propagated through every layer of the on-premise stack within minutes. The APC UPS did exactly what it was designed to do — protect equipment from thermal damage by cutting power. But the protection action and the failure action were the same action: total loss of power across all IT equipment in the rack.

There is no architectural redundancy in WDC that survives a server-room-level environmental event:

  • Both MX95 firewalls are in the same room
  • All four production VMs (SKY/RAIN/SUN/WIND) run on a single ESXi host (FIRE)
  • The Synology NAS that serves as ESXi datastore and NFS backup target is also in the same room
  • The APC UPS is single-unit, not paired
  • WAN1 (Cogent) drop took the VPN tunnel with it (Classic VPN single-peer architecture)

This is a structural property of the WDC site, not a misconfiguration. Multi-site redundancy was outside scope until the GCP cloud expansion (MAPLE, CEDAR, OAK) was built. That expansion is what kept the SOC operational during this incident.

3.3 What stayed up — GCP cloud and why

GCP infrastructure was entirely unaffected:

  • Cloud Run portals (status, infra, soc, forms) continued serving externally
  • MAPLE (Prometheus + Grafana + Wazuh Manager) continued operating
  • CEDAR (Elasticsearch + Logstash + Kibana + Wazuh Indexer) continued operating
  • OAK (OpenVAS scanner) continued operating
  • Cloud SQL forms database continued operating

What was affected on the cloud side:

  • The site-to-site VPN tunnel dropped when WAN1 went down. This is the known P3 gap: Classic VPN uses a single peer IP (WAN1 VRRP VIP); WAN1 failure drops the tunnel even though WAN2 (Zayo) was still up. HA VPN with BGP would have failed over.
  • Monitoring of on-premise hosts via Prometheus scrape stopped. MAPLE's view of the on-prem servers went blank.
  • SOC dashboard's on-premise data feeds went stale. CEDAR's view of on-prem Wazuh agents went silent.

The cloud SIEM remained the canonical view of security state throughout. Wazuh alerts from cloud VMs continued ingesting normally.

3.4 The discovery-time gap

This is the single biggest finding in this incident.

The outage began Saturday at approximately 12:00 EST. It was not discovered until Monday at approximately 07:00 EST. That is roughly 43 hours during which the entire on-premise infrastructure was offline and nobody knew.

Why nobody knew:

  1. No automated monitoring at the building or environmental layer (no temperature alerting, no UPS-event alerting reaching an external destination)
  2. The Prometheus alerting that does exist runs on MAPLE in GCP — and while MAPLE could see that the on-prem scrape targets were down, those alerts depend on alertmanager routing that has not been fully configured for off-hours pager-style escalation
  3. Weekend staff were not on-site
  4. External-facing portals (status, infra, soc, forms) were running on GCP and reachable, so end-users had no symptom to report
  5. Internal staff at WDC who would have noticed were not in the building over the weekend

Detection through normal Monday-morning physical presence at the office is the path that worked. That is not a control. The next incident of this type, if it happens on a holiday weekend or with staff on vacation, could last longer.


4. WIND recovery — 2026-05-13

48 hours after the building came back online, WIND was still in a degraded state. The VM was up and reachable, but the Elasticsearch service was failed because the var-lib-elasticsearch.mount systemd unit had failed. The data disk could not be mounted at boot due to XFS corruption from the unsafe shutdown.

4.1 Diagnostic sequence

  1. VM reachability confirmed — ping, SSH working, uptime ~46 hours (consistent with reboot after the thermal event)
  2. Service state characterizedelasticsearch.service failed exit 78 because /var/lib/elasticsearch/tmp didn't exist (because the mount didn't mount). kibana and logstash were active but writing to a non-functional backend.
  3. LVM state intactdata-elastic LV present, 167.64 GB, active but not opened (because fstab line was commented out and mount never attempted).
  4. Kernel reported XFS corruptiondmesg showed: XFS (dm-7): Metadata CRC error detected at xfs_dir3_leaf_read_verify ... xfs_dir3_leaf1 block 0x13c79aa8 — Unmount and run xfs_repair
  5. Read-only mount succeededmount -o ro,norecovery worked, filesystem readable
  6. XFS metadata intactxfs_info showed clean geometry, primary superblock valid, no torn metadata write
  7. xfs_repair -n dry-run — Identified: log can't be replayed (needs -L to zero), metadata CRC error in dir3 leaf block of inode 403716877 (single corrupted directory), 3 orphan inodes, counter drift
  8. Corrupted inode identifiedxfs_db revealed inode 403716877 belonged to /indices/SfkGa5GWQfm5ENM_LbesLw/0/index/, which is the Lucene index directory for one shard. Index identified as wdc-logs-2026.05.09 — the daily rolling log index for the day the heat event started.
  9. Backup posture verified — Most recent off-host (GCS) ES snapshot was 2026-05-08 03:46, predating the broken index's creation. Snapshot restore could not recover this index either way. Decision: proceed with repair rather than full restore, since repair preserves the other 308 indices' data through their last-write times.

4.2 What xfs_repair found vs predicted

xfs_repair -L ran cleanly in approximately 60 seconds. Exit code 0.

Predicted findings (matched): - Log zeroed (Phase 2) - 4 leftover CoW extents cleared - 3 orphan inodes moved to lost+found (134218835, 134267921, 134267955 — all 0-byte FIFOs from April, pre-existing) - Counter drift corrected - dir3 leaf block 0x13c79aa8 of inode 403716877 marked invalid, directory rebuilt with whatever children survived

Unexpected but explainable: - 257 reflink-flag clears across all 4 allocation groups. Routine post-unclean-shutdown cleanup — XFS stops tracking stale CoW relationships. File data untouched. - Three benign warnings at process exit (recursive-buffer-locking, cache_purge, cache_zero_check) — known xfs_repair tool internal artifacts.

Phase 6 partially recovered the broken directory rather than zeroing it. Post-repair inspection showed the index/ directory had 50 files with most Lucene segment triples (.cfe/.cfs/.si) intact. This was better than the dry-run predicted.

4.3 The two lost indices

After ES startup, the cluster came up RED with TWO unassigned primary shards, not one:

wdc-logs-2026.05.09 — Expected. Even though the data directory was moved aside (SfkGa5GWQfm5ENM_LbesLw.broken-2026-05-13), ES retained the index in cluster state metadata. ES tried to allocate, found no data, marked red. Resolved by DELETE /wdc-logs-2026.05.09 to remove from cluster state.

dhcp-leases-2026.05.09 — Unexpected. Translog truncated mid-record at byte 4769 (one byte short of a 4-byte payload header). XFS reported the file as intact at the filesystem level — the truncation is inside the file. ES couldn't replay the translog past the corruption point. Allocation explain showed: TranslogCorruptedException: translog from source ... is corrupted, translog truncated ... read past EOF.

This second corruption was invisible to xfs_repair because it's not an XFS metadata problem. The file existed, the byte count matched what XFS believed, the inode was clean. The damage was at the application data layer.

Both indices were preserved on disk and in /backup/logging/wind-incident-2026-05-11/ before deletion from cluster state. Forensic reattachment remains possible if the data is ever needed.

4.4 Multi-layer corruption from unsafe shutdown

An unsafe shutdown of an actively-writing Elasticsearch node produces corruption at multiple independent layers:

Layer Type of damage Detection tool
XFS metadata (directory entries, inodes, log) Block CRC errors, log replay failures xfs_repair -n
File contents (translog records, Lucene segments) Truncated records mid-write, partial last commits ES startup recovery, or elasticsearch-shard tool
Cluster state References to indices whose data is now broken _cluster/allocation/explain

Each layer needs its own diagnostic and repair tool. xfs_repair handles only the first. The second can be invisible until ES tries to read it. The third only surfaces during cluster startup.

Implication: future recovery work after unsafe shutdown must plan for multi-pass diagnosis. The "clean cluster on first startup attempt" outcome assumed at the start of this recovery was optimistic.


5. Lessons

5.1 The detection-time gap is the most important finding

43 hours of outage went unnoticed because no monitoring path reached a human. This is not a server-room problem or a UPS problem. It is an alerting-architecture problem.

The on-prem servers had monitoring (Prometheus on SUN/WIND scraping local exporters), but that monitoring lived on the same failure domain it was supposed to monitor. The cloud-side monitoring (MAPLE) could see that scrape targets were down, but its alert routing was not configured to escalate to off-hours notification reliably.

5.2 Single environmental failure domain

Every on-prem service depends on one HVAC unit, one UPS, one ESXi host, one network switch stack. Any one of those failing takes everything down.

This is not a "fix in one week" problem — it's a budget and capital expenditure conversation about server room redundancy. Worth raising with facilities and leadership as a structural risk, not as an IT-only concern.

5.3 fstab without nofail blocks recovery

The data-elastic fstab line was commented out before the incident — apparently a defensive measure from prior maintenance. Had it been uncommented without nofail, an unmountable disk would have blocked WIND's boot entirely, requiring single-user-mode recovery rather than what we did.

New standard: all data-disk fstab entries on WDC and GCP servers should include nofail,x-systemd.automount. Boot must always complete. Mount failures should be reported via monitoring, not by blocking boot.

This has been applied to WIND. SKY, RAIN, SUN, and the GCP VMs (OAK, MAPLE, CEDAR) should be reviewed.

5.4 Multi-layer corruption requires multi-tool recovery

Plan future unsafe-shutdown recovery around the assumption that XFS metadata corruption, file content corruption, and cluster state inconsistency are independent and each needs its own tool. Don't expect one repair pass to find everything.

5.5 Off-host backups validated under stress

The GCS-mirrored backup at gs://gpus-infra-backups-wdc/wind/ was intact and accessible during WIND's downtime. Local /backup/logging/ on WIND was unreachable while WIND was down. The off-host mirror pattern works as intended.

Gap: /backup/logging/ lives on the same VG (data) as the broken data-elastic LV — same physical disk (sdb). If sdb had failed, both would have gone together. The GCS mirror is the only protection against that case. Backup-source-isolation pattern should be reviewed for SUN and the GCP VMs.

5.6 CEDAR cloud SIEM kept SOC capability during outage

CEDAR (the cloud Elasticsearch + Wazuh Indexer) continued ingesting cloud VM alerts throughout the WDC outage. SOC dashboard remained externally reachable. The cloud expansion paid for itself in this incident.

Caveat: wazuh-indexer on CEDAR is currently inactive (discovered during diagnostic). Cloud-side SIEM has degraded redundancy. Filed as separate follow-up.

5.7 Read-first discipline prevented worse outcomes

The recovery used dry-run analysis (xfs_repair -n), read-only mount inspection, and metadata reads (xfs_db -r) before any destructive action. This took longer than "run xfs_repair -L and see what happens" but caught the second corrupted index (dhcp-leases translog) before it could create more damage.

The discipline of "diagnose before acting on production data" should be the standard for all future infrastructure recovery work.


6. What worked

  • Bottom-up restoration sequence on 2026-05-11 — HVAC, then network, then storage, then hypervisor, then VMs, then services, then APs, then wireless. Methodical, no skipped layers.
  • Temporary power-strip migration during APC recovery — let restoration proceed in parallel with UPS reset, didn't gate on the protective device being healthy.
  • Tuesday APC re-migration after the UPS itself was verified — equipment moved back to UPS only after the protective device was confirmed safe.
  • GCP cloud expansion (MAPLE, CEDAR, OAK) — kept SOC and external portals operational throughout.
  • GCS off-host backup mirror — preserved 2026-05-08 ES snapshot, available throughout the WIND outage.
  • Staff communication — incident notification email sent on 2026-05-11 once services restored, three pre-drafted versions allowed quick selection by audience tone.
  • ESXi snapshot policy on FIRE — daily and weekly snapshots of SKY/RAIN/SUN/WIND were running. While not needed for this recovery (the LV-level repair worked), the safety net existed.
  • Read-first diagnostic discipline during WIND recovery — found the second corrupted index before destructive action, preserved both on disk, complete forensic record in /backup/logging/wind-incident-2026-05-11/.

7. What didn't work — open gaps

# Gap Severity Notes
G1 No automated alerting for building-level failure Critical 43-hour detection gap. Highest-priority follow-up.
G2 Server-room temperature monitoring with alerting Critical HVAC has temperature reader but no external alerting. APC UPS event logging may also not be wired to external monitoring.
G3 Off-hours pager-style escalation Critical MAPLE/Prometheus alertmanager exists but routing for weekend/off-hours not configured.
G4 Classic VPN single-peer dropped tunnel as expected High Already filed as P3. This incident reinforces priority. HA VPN with BGP failover to WAN2 (Zayo) needed.
G5 NAS mount paths failed on WIND post-recovery Medium mnt-backup-nas and mnt-nas-backup still failed as of 2026-05-13. Synology reachability or path renaming. Separate investigation.
G6 wazuh-indexer inactive on CEDAR Medium Discovered during WIND recovery. Cloud SIEM redundancy degraded. Separate investigation.
G7 /backup/logging/ on same disk as data-elastic (WIND sdb) Medium If sdb had failed, local backups would have been unrecoverable. GCS mirror is the only off-host copy. Backup-source-isolation needs review.
G8 ESXi snapshots didn't prevent VM data corruption Medium Snapshots existed but weren't used in recovery because LV-level repair was sufficient. Snapshots are crash-consistent, not application-consistent. For ES specifically, application-level snapshot is the authoritative restore artifact.
G9 Two ES indices lost (wdc-logs-2026.05.09, dhcp-leases-2026.05.09) Low (operationally), Medium (forensically) Both preserved on disk for potential reattachment. DHCP lease history specifically may be needed for any later forensic correlation.
G10 fstab without nofail would have blocked boot Resolved Now hardened on WIND with nofail,x-systemd.automount. Same fix needed on SKY, RAIN, SUN, GCP VMs.
G11 Single failure domain (one HVAC, one UPS, one ESXi host, one switch stack) Structural Long-term: requires facilities investment, leadership conversation.

8. Follow-up items

# Action Priority Owner Target
F1 Building/temperature alerting reaching off-hours notification P1 IT + Facilities 2 weeks
F2 MAPLE alertmanager off-hours escalation configuration P1 IT 1 week
F3 APC UPS event log to external SIEM (CEDAR or MAPLE) P1 IT 2 weeks
F4 Review and harden fstab on SKY, RAIN, SUN with nofail,x-systemd.automount P2 IT 1 week
F5 Review and harden fstab on GCP VMs (OAK, MAPLE, CEDAR) with same pattern P2 IT 1 week
F6 NAS mount investigation on WIND (mnt-backup-nas, mnt-nas-backup) P2 IT 1 week
F7 wazuh-indexer restoration on CEDAR P2 IT 1 week
F8 HA VPN migration with BGP (already filed as P3) P3 to P2 IT 1 month — promoted due to this incident
F9 Backup-source-isolation review (GCS off-host mirroring for all critical data, separate VG from data) P3 IT 2 weeks
F10 Incident runbook rb-009: Datacenter heat/power event recovery P3 IT 2 weeks
F11 Runbooks for Meraki, NAS, ESXi failure scenarios (rb-006/007/008) P3 IT 1 month
F12 Facilities conversation: server room redundancy (HVAC pairing, UPS pairing) P3 Facilities + Leadership Budget cycle
F13 Tabletop exercise: same incident, but during a holiday weekend P3 IT Quarterly
F14 Controlled reattachment experiment for wdc-logs-2026.05.09 if data needed P4 IT As needed
F15 Controlled reattachment experiment for dhcp-leases-2026.05.09 if data needed P4 IT As needed

9. Forensic record — preserved data

All preservation in gs://gpus-infra-backups-wdc/wind/ and on WIND at /backup/logging/wind-incident-2026-05-11/:

Path Contents Size
/backup/logging/wind-incident-2026-05-11/wdc-logs-2026.05.09-translog/ Pre-repair preservation of broken index's translog + state files ~10 KB, 7 files
/backup/logging/wind-incident-2026-05-11/xfs-repair-20260513-114321.log Full xfs_repair output 14,889 bytes, 363 lines
/backup/logging/wind-incident-2026-05-11/broken-index-post-repair/ wdc-logs-2026.05.09 full data dir, post-XFS-repair, pre-rename 1.4 MB, 50 files in 0/index/
/backup/logging/wind-incident-2026-05-11/dhcp-leases-broken-post-repair/ dhcp-leases-2026.05.09 full data dir before DELETE 736 KB
/backup/logging/wind-incident-2026-05-11/dhcp-leases-broken-post-repair/translog-tail/ Explicit translog file capture 13 files
/var/lib/elasticsearch/indices/SfkGa5GWQfm5ENM_LbesLw.broken-2026-05-13/ On-disk renamed wdc-logs data dir, ignored by ES 1.4 MB
/etc/fstab.bak-20260513-120959 fstab backup before the nofail+x-systemd.automount edit

10. Appendix: xfs_repair output summary

Full log preserved at /backup/logging/wind-incident-2026-05-11/xfs-repair-20260513-114321.log.

Key findings:

  • Phase 2 — zero log: Log unreplayable, zeroed. clearing needs_recovery flag.
  • Phase 3 — process known inodes: 3 orphan inodes identified (134218835, 134267921, 134267955), all pre-existing 0-byte FIFOs from 2026-04-08, unrelated to the heat event.
  • Phase 4 — reflink cleanup: 257 inodes had reflink flags cleared across all 4 allocation groups. Routine post-unclean-shutdown.
  • Phase 5 — AG headers rebuilt: Superblock counters reset.
  • Phase 6 — directory verification: Metadata CRC error detected ... xfs_dir3_leaf1 block 0x13c79aa8/0x1000 ... leaf block 8388608 for directory inode 403716877 bad CRC ... rebuilding directory inode 403716877 (partial recovery — most children reattached). Three orphan inodes moved to lost+found/.
  • Phase 7 — link count verification: Clean. Three benign warnings at process exit (recursive-buffer-locking, cache_purge, cache_zero_check) — known xfs_repair tool internal artifacts.
  • Exit code: 0

End of document.

Document maintained in mkdocs-portal/docs/incidents/. Version-controlled.