
Hypervisor Rebuild Runbook — water.wdc.us.gl3

Classification: CONFIDENTIAL - Internal Use Only

Document: architecture/wdc/hypervisors/water-rebuild-runbook.md · v1.0 · 2026-05-12 · GPUS-IT


Host Profile

  • Hostname: water.wdc.us.gl3
  • Role: First ESXi host of the WDC on-prem cluster.
  • Future cluster peers: fire.wdc.us.gl3, flower.wdc.us.gl3
  • Primary tenant VM: ocean.wdc.us.gl3 (Quest KACE SMA appliance)
  • Future VMs on Fire: sky, rain, wind, sun

1. Purpose & Scope

This runbook describes the end-to-end procedure to rebuild water.wdc.us.gl3 from bare metal to a hardened, monitored production ESXi host ready to receive the ocean KACE SMA workload.

It satisfies:

  • CIS Controls v8 — 1, 2, 3, 4, 5, 6, 8, 12, 13
  • NIST CSF 2.0 — PR.AA, PR.PS, PR.IR, DE.CM
  • PCI-DSS v4.0 — Req. 2.2 (secure config), Req. 6.3 (vulnerability mgmt), Req. 10 (logging)
  • VMware ESXi Security Configuration Guide (current applicable version)

2. Pre-Flight Checklist

| # | Check | Status |
| --- | --- | --- |
| 1 | Power path verified: Water plugs into Fennel PDU (legacy, 192.168.122.91), which is fed by Pickle UPS (192.168.122.90) via Purslane PDU. See APC UPS & PDU Inventory. | |
| 1a | Risk acknowledged: Fennel is a legacy PDU with no email alerts and is a single point of failure for Water, GL5 FW, Flower, and VMware Storage. Replacement tracked separately. | |
| 1b | Pickle UPS battery health ≥ 80% and runtime calibration current | |
| 2 | Core switch ports for Water identified and configured (mgmt, vMotion, VM, storage VLANs) | |
| 3 | Static IPs reserved in IPAM: mgmt, vMotion, storage (NFS), IPMI | |
| 4 | DNS A and PTR records created for water.wdc.us.gl3 | |
| 5 | ESXi installer ISO downloaded and SHA-256 verified | |
| 6 | KACE SMA OVA staged on NAS | |
| 7 | Wazuh agent install package available | |
| 8 | Change ticket approved in KACE | |
| 9 | Backup target (NAS + GCS bucket) reachable and write-tested | |
| 10 | Maintenance window communicated to staff | |
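Check 5 (SHA-256 verification of the installer ISO) can be scripted. The following is a minimal sketch, assuming the expected digest has been copied by hand from the vendor's download page. The function names are illustrative and not part of any existing GPUS tooling.

```python
import hashlib
import hmac
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1024 * 1024) -> str:
    """Return the hex SHA-256 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def iso_matches(path: Path, expected_hex: str) -> bool:
    """Compare the computed digest with the published one, ignoring case and whitespace."""
    return hmac.compare_digest(sha256_of(path), expected_hex.strip().lower())
```

Call `iso_matches(Path(...), "<published digest>")` on the downloaded file and mount the ISO only if it returns `True`.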

3. Network Plan

| VLAN | Purpose | Subnet | Water IP |
| --- | --- | --- | --- |
| 10 | Management (vmk0) | 192.168.10.0/24 | 192.168.10.20 |
| 20 | vMotion (vmk1) | 192.168.20.0/24 | 192.168.20.20 |
| 30 | Storage / NFS to NAS (vmk2) | 192.168.30.0/24 | 192.168.30.20 |
| 40 | VM production (port group VM-Prod) | 192.168.40.0/24 | n/a |
| 99 | IPMI / iDRAC OOB | 192.168.99.0/24 | 192.168.99.20 |
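A plan like this is easy to mistype, so it can help to check it mechanically before reserving addresses in IPAM. The sketch below uses the values from the table and Python's standard `ipaddress` module. The `PLAN` structure and the `plan_errors` helper are illustrative names, not existing tooling.

```python
import ipaddress

# Values copied from the network plan table; VLAN 40 carries VM traffic and has no host IP.
PLAN = {
    10: ("192.168.10.0/24", "192.168.10.20"),
    20: ("192.168.20.0/24", "192.168.20.20"),
    30: ("192.168.30.0/24", "192.168.30.20"),
    40: ("192.168.40.0/24", None),
    99: ("192.168.99.0/24", "192.168.99.20"),
}


def plan_errors(plan):
    """Return a list of human-readable problems found in the plan (empty if none)."""
    errors = []
    seen = []
    for vlan, (cidr, host_ip) in sorted(plan.items()):
        network = ipaddress.ip_network(cidr)  # raises ValueError if host bits are set
        for other_vlan, other in seen:
            if network.overlaps(other):
                errors.append(f"VLAN {vlan} overlaps VLAN {other_vlan}")
        seen.append((vlan, network))
        if host_ip is None:
            continue
        address = ipaddress.ip_address(host_ip)
        if address not in network:
            errors.append(f"VLAN {vlan}: {host_ip} is outside {cidr}")
        elif address in (network.network_address, network.broadcast_address):
            errors.append(f"VLAN {vlan}: {host_ip} is not a usable host address")
    return errors
```

An empty list from `plan_errors(PLAN)` means every Water IP sits inside its own subnet and no two subnets overlap.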

Management network isolation

The management VLAN (10) must be reachable only from the IT admin VLAN and via Client VPN — never from general workstation segments. Enforce via Meraki L3 firewall rules.
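The intent of the rule set can be expressed as a small first-match model, which is useful for reviewing the policy before it is entered in the firewall. This is only a sketch: the two allowed subnets below are placeholders, since this runbook does not state the IT admin VLAN or Client VPN ranges, and the code does not talk to any Meraki API.

```python
import ipaddress

# ASSUMPTION: placeholder subnets. Substitute the real IT admin VLAN and Client VPN pool.
RULES = [
    ("allow", ipaddress.ip_network("192.168.50.0/24")),   # hypothetical IT admin VLAN
    ("allow", ipaddress.ip_network("192.168.250.0/24")),  # hypothetical Client VPN pool
    ("deny", ipaddress.ip_network("0.0.0.0/0")),          # every other source
]


def may_reach_mgmt(source_ip: str) -> bool:
    """Return True if the first rule matching source_ip is an allow rule."""
    source = ipaddress.ip_address(source_ip)
    for action, network in RULES:
        if source in network:
            return action == "allow"
    return False  # no rule matched: fail closed
```

With these placeholder rules, a source on the VM production subnet (for example 192.168.40.15) is denied, which is the behaviour the paragraph above requires.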

4. Rebuild Procedure

4.1 BIOS / Firmware

  1. Boot to BIOS, confirm:
    • Latest stable BIOS / iLO / iDRAC firmware applied.
    • Secure Boot: Enabled.
    • TPM 2.0: Enabled and cleared.
    • Intel VT-x / VT-d (or AMD-V / IOMMU): Enabled.
    • Boot order: Internal SSD/M.2 only — disable USB and PXE for production.
  2. Configure IPMI / iDRAC:
    • Dedicated IPMI NIC on VLAN 99.
    • Strong local password stored in the IT password vault.
    • Disable IPMI over LAN unless required; if required, restrict source IPs.
    • SNMPv3 enabled for Wazuh polling.

4.2 Install ESXi

  1. Mount the verified ESXi ISO via iDRAC virtual media.
  2. Install to the internal M.2/SSD (not USB).
  3. During install, set the root password to a long passphrase from the vault (do not reuse a password from any other host).
  4. Configure management network (vmk0) on VLAN 10 with the IP from §3.
  5. First boot — confirm host reaches DNS, NTP, and the management gateway.
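Once the host is on the network, the A and PTR records from pre-flight check 4 can be confirmed to agree. The sketch below is meant to run from an admin workstation, not on the ESXi host itself. The resolver functions are injectable so the logic can be exercised without live DNS; `dns_round_trip` is an illustrative name.

```python
import socket


def dns_round_trip(hostname, forward=socket.gethostbyname, reverse=socket.gethostbyaddr):
    """Resolve hostname to an IPv4 address, then resolve that address back to a name.

    Returns (ip, ptr_name, ok), where ok is True when the PTR name equals the
    hostname, compared case-insensitively and ignoring a trailing dot.
    """
    ip = forward(hostname)
    ptr_name = reverse(ip)[0]  # gethostbyaddr returns (name, aliases, addresses)

    def normalise(name):
        return name.rstrip(".").lower()

    return ip, ptr_name, normalise(ptr_name) == normalise(hostname)
```

For Water, `dns_round_trip("water.wdc.us.gl3")` should return 192.168.10.20 as the address and `True` as the last element if the records match the network plan in §3.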

4.3 Initial Hardening — ESXi Host

Apply these immediately after first boot. All settings must be captured in a versioned hardening script committed to the IaC repo.

| Control | Setting | Value |
| --- | --- | --- |
| Lockdown Mode | Security Profile → Lockdown Mode | Strict (allow only vCenter & exception users) |
| SSH | Service policy | Start and stop manually (default off) |
| ESXi Shell | Service policy | Start and stop manually (default off) |
| Shell timeout | UserVars.ESXiShellTimeOut | 600 (seconds) |
| Shell interactive timeout | UserVars.ESXiShellInteractiveTimeOut | 600 (seconds) |
| DCUI timeout | UserVars.DcuiTimeOut | 600 (seconds) |
| Account lockout | Security.AccountLockFailures | 5 |
| Lockout duration | Security.AccountUnlockTime | 900 (seconds) |
| Password complexity | Security.PasswordQualityControl | retry=3 min=disabled,disabled,disabled,15,15 |
| NTP | Multiple sources | time.google.com, time.cloudflare.com |
| Syslog | Forward to Wazuh & Splunk | ssl://wazuh.gpus.internal:6514 (TLS; the tcp:// prefix would send logs unencrypted) |
| Firewall | Default deny inbound; allow only mgmt/vMotion/storage/syslog | Per zone |
| MOB | Config.HostAgent.plugins.solo.enableMob | false |
| SNMP | v3 only, encrypted, read-only community removed | v3 user from Wazuh |
| TLS | UserVars.ESXiVPsDisabledProtocols | sslv3,tlsv1,tlsv1.1 |
| Welcome / Issue banner | Legal banner present | Yes, set from template |
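The password complexity value is terse. ESXi enforces it through pam_passwdqc, where the five `min=` fields give the minimum length for, in order, passwords using one character class, two character classes, passphrases, three character classes, and four character classes, and `disabled` rejects that kind outright. The sketch below decodes the string under that reading. The `parse_quality_control` helper is illustrative only.

```python
FIELDS = (
    "one character class",
    "two character classes",
    "passphrase",
    "three character classes",
    "four character classes",
)


def parse_quality_control(value: str) -> dict:
    """Split a 'retry=N min=N0,N1,N2,N3,N4' string into retry count and minimum lengths.

    A minimum of None means that kind of password is rejected outright.
    """
    options = dict(item.split("=", 1) for item in value.split())
    minimums = options["min"].split(",")
    if len(minimums) != len(FIELDS):
        raise ValueError("min= expects exactly five comma-separated values")
    return {
        "retry": int(options["retry"]),
        "min_length": {
            label: None if raw == "disabled" else int(raw)
            for label, raw in zip(FIELDS, minimums)
        },
    }
```

Decoded this way, the value in the table allows three attempts and accepts only passwords of at least 15 characters drawn from three or four character classes.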

Hardening script

Use the project repo gpus-esxi-hardening (Ansible), playbook harden_water.yml. All values above are parameterized. The playbook is the source of truth; the table above is a human-readable reference.
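Because the reference table and the playbook can drift apart, a small comparison helper may be useful during validation. The sketch below assumes the host's current advanced settings have already been exported into a plain dictionary by some other means (how that export is produced is outside this sketch). The `EXPECTED` map mirrors the advanced-setting rows of the table, and `drift` is an illustrative name.

```python
# Advanced settings and values taken from the hardening table in this section.
EXPECTED = {
    "UserVars.ESXiShellTimeOut": "600",
    "UserVars.ESXiShellInteractiveTimeOut": "600",
    "UserVars.DcuiTimeOut": "600",
    "Security.AccountLockFailures": "5",
    "Security.AccountUnlockTime": "900",
    "Security.PasswordQualityControl": "retry=3 min=disabled,disabled,disabled,15,15",
    "Config.HostAgent.plugins.solo.enableMob": "false",
    "UserVars.ESXiVPsDisabledProtocols": "sslv3,tlsv1,tlsv1.1",
}


def drift(actual: dict, expected: dict = EXPECTED) -> dict:
    """Return {setting: (expected, actual)} for every setting that differs or is missing."""
    report = {}
    for key, want in expected.items():
        have = actual.get(key)
        # Compare as lowercase text so 600 == "600" and False == "false".
        have_text = None if have is None else str(have).strip().lower()
        if have_text != want.lower():
            report[key] = (want, have)
    return report
```

An empty result means the exported values match the table. Any entry in the result names the setting, the expected value, and what the host reported.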

4.3.1 PowerChute on Water

water.wdc.us.gl3 must be enrolled in PowerChute Network Shutdown listening to Pickle UPS (192.168.122.90) as its upstream UPS. Triggers and VM shutdown priority are defined in APC UPS & PDU Inventory §8. Note that Water's power path runs through Fennel PDU, but the event source for shutdown is Pickle (the UPS upstream of Purslane → Fennel).
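The distinction between the power path and the shutdown event source is easier to see when the chain is written down explicitly. The sketch below encodes the chain described above and walks it upstream. The last element is the UPS that PowerChute should listen to. The `FED_BY` map and `power_path` helper are illustrative, not an export from any APC tool.

```python
# Each device maps to the device that feeds it; None marks the top of the chain.
FED_BY = {
    "water.wdc.us.gl3": "Fennel PDU",
    "Fennel PDU": "Purslane PDU",
    "Purslane PDU": "Pickle UPS",
    "Pickle UPS": None,
}


def power_path(device: str, fed_by: dict = FED_BY) -> list:
    """Walk upstream from device and return the chain, ending at the top-level source."""
    path = [device]
    while fed_by.get(path[-1]) is not None:
        upstream = fed_by[path[-1]]
        if upstream in path:
            raise ValueError(f"loop in power map at {upstream}")
        path.append(upstream)
    return path
```

Every intermediate element of the returned path is a device whose failure cuts power to Water without Pickle reporting an on-battery event, which is why Fennel is tracked as a risk in the pre-flight checklist.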

4.4 Storage Configuration

  1. Datastore layout on Water:
    • datastore-water-local — local SSD, used only for the host scratch partition and ISO library.
    • datastore-wdc-nas-01 — NFS export from the WDC NAS (VLAN 30), mounted to all future cluster nodes.
  2. Mount the NFS export with hardMount, sync, and, where the NAS supports it, Kerberos authentication (which on ESXi requires NFS 4.1).
  3. Set up vSphere replication to GCS-backed Veeam repository (see Snapshot & Backup Schedule).

4.5 Add to vCenter

  1. Add water.wdc.us.gl3 to vCenter under the WDC-Cluster cluster object.
  2. Apply the WDC-Baseline host profile.
  3. Confirm host profile compliance shows green before deploying any VM.

4.6 Deploy ocean.wdc.us.gl3 (KACE SMA)

  1. Deploy the KACE SMA OVA to datastore-wdc-nas-01.
  2. Place on port group VM-Prod (VLAN 40), assigned static IP from IPAM.
  3. Configure the appliance per the KACE SMA install guide.
  4. Set the appliance VM's vm.shutdown_priority = high so PowerChute graceful shutdown halts it first.
  5. Confirm KACE can reach the appliance update server and the Wazuh manager.
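Step 4 implies an ordering rule: VMs with a higher shutdown priority are halted before the others. A minimal sketch of that rule follows. Only the value `high` appears in this runbook, so the `medium` and `low` labels, and the idea that future VMs would carry them, are assumptions made purely for illustration.

```python
# ASSUMPTION: only "high" is defined in the runbook; "medium" and "low" are illustrative.
RANK = {"high": 0, "medium": 1, "low": 2}


def shutdown_order(vms: dict) -> list:
    """Return VM names sorted so higher-priority VMs are shut down first.

    Ties are broken alphabetically so the order is deterministic.
    """
    return sorted(vms, key=lambda name: (RANK[vms[name]], name))
```

Given a mapping of VM name to priority label, `shutdown_order` places any VM marked `high`, such as ocean, at the front of the list.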

4.7 Telemetry & SOC Onboarding

| Source | Destination | Method | Verified |
| --- | --- | --- | --- |
| ESXi syslog | Wazuh manager | Syslog/TLS 6514 | |
| ESXi syslog | Splunk HEC | HTTPS 8088 | |
| vCenter events | Wazuh | vSphere API integration | |
| IPMI / iDRAC | Wazuh | SNMPv3 trap + poll | |
| ocean VM (KACE) | Wazuh (agent installed) | Agent | |

After onboarding, confirm the host and its tenant VM appear in:

  • Wazuh agents list with active status
  • Splunk index gpus_wdc receiving events
  • SOC dashboard at soc.greenpeace.us (WDC pane)
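When verifying the Syslog/TLS row, it is worth checking the configured log host string itself. ESXi log host URIs take the form protocol://host:port, where the ssl protocol selects TLS and tcp or udp do not encrypt. The sketch below parses such a string with the standard library. `parse_log_host` is an illustrative helper, and unlike ESXi it insists on an explicit port.

```python
from urllib.parse import urlsplit


def parse_log_host(uri: str) -> dict:
    """Split 'protocol://host:port' and flag whether the transport is encrypted."""
    parts = urlsplit(uri)
    if parts.scheme not in {"udp", "tcp", "ssl"}:
        raise ValueError(f"unsupported protocol: {parts.scheme!r}")
    if not parts.hostname or parts.port is None:
        raise ValueError("host and port are both required in this sketch")
    return {
        "protocol": parts.scheme,
        "host": parts.hostname,
        "port": parts.port,
        "encrypted": parts.scheme == "ssl",
    }
```

A log host that parses with `encrypted` set to `False` on port 6514 is a sign that the protocol prefix and the intended TLS transport do not match.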

4.8 Backup & Snapshot Enrollment

Enroll Water and ocean in the schedules defined in Snapshot & Backup Schedule. Capture a manual pre-production backup before declaring the host ready.

4.9 Validation & Sign-Off

| Test | Pass criteria | Status |
| --- | --- | --- |
| ESXi reaches NTP, DNS, and both syslog targets (Wazuh and Splunk) | All four green | |
| Host profile compliance | Compliant | |
| Wazuh agent + vCenter integration ingesting events | Events visible in last 15 min | |
| Lockdown Mode = Strict | Confirmed in vCenter | |
| PowerChute test (simulated outage) | VMs shut down in priority order, host powers off | |
| Backup test restore of ocean | Restore to isolated network succeeds | |
| Penetration scan from internal scanner | No critical / high findings | |

Sign-off requires both IT Operations and Cyber Security approval. Record in KACE change ticket.
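Two of the criteria above are simple enough to state as code: the 15-minute event freshness window and the dual-approval rule. The sketch below is illustrative only. The function names and the shape of the `results` mapping are assumptions, and the timestamps would come from whatever query is run against Wazuh.

```python
from datetime import datetime, timedelta


def events_fresh(last_event: datetime, now: datetime, window_minutes: int = 15) -> bool:
    """True if the newest event is no older than the window and not in the future."""
    age = now - last_event
    return timedelta(0) <= age <= timedelta(minutes=window_minutes)


def ready_for_signoff(results: dict, approvals: set) -> bool:
    """All validation tests passed and both required teams have approved."""
    required = {"IT Operations", "Cyber Security"}
    return bool(results) and all(results.values()) and required <= approvals
```

Pass both timestamps in the same timezone. `ready_for_signoff` returns `False` if any test failed, if no results were recorded, or if either approval is missing.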

5. Rollback

If the rebuild fails validation:

  1. Quarantine the host (move to WDC-Quarantine vCenter folder; disable mgmt port at switch except for admin VLAN).
  2. Restore previous configuration from IaC repo if applicable, or re-image and restart from §4.2.
  3. Open a high-severity ticket in KACE and notify #wdc-ops.

6. Future Cluster Additions

When fire.wdc.us.gl3 and flower.wdc.us.gl3 are built, follow this same runbook with these deltas:

  • Reuse the WDC-Baseline host profile for one-click compliance.
  • Assign next IPs in the management/vMotion/storage subnets.
  • Enable DRS and HA at the cluster level only after the third host is online.
  • Confirm ocean still resides on shared NAS storage (datastore-wdc-nas-01, per §4.6) so it can vMotion freely.
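If "next IPs" is read as the addresses immediately following Water's in each subnet, the candidates can be computed as below. That reading is an assumption, as is the /24 prefix default (taken from §3). IPAM remains the authority, and the output should be treated only as a proposal to reserve there. `peer_addresses` is an illustrative name.

```python
import ipaddress

# Water's addresses from the network plan in §3.
WATER = {
    "management": "192.168.10.20",
    "vMotion": "192.168.20.20",
    "storage": "192.168.30.20",
}


def peer_addresses(offset: int, base: dict = WATER, prefix: int = 24) -> dict:
    """Return addresses `offset` above Water's, refusing to leave the subnet's host range."""
    result = {}
    for role, ip in base.items():
        network = ipaddress.ip_interface(f"{ip}/{prefix}").network
        candidate = ipaddress.ip_address(ip) + offset
        if candidate not in network or candidate == network.broadcast_address:
            raise ValueError(f"{role}: {candidate} is outside the usable range of {network}")
        result[role] = str(candidate)
    return result
```

Under this assumption, `peer_addresses(1)` proposes the .21 addresses for the first peer and `peer_addresses(2)` the .22 addresses for the second.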

The Fire host will eventually run the sky, rain, wind, and sun VMs — each gets its own onboarding ticket and follows the VM onboarding checklist (to be authored when Fire is built).

7. Change Log

| Date | Change | By |
| --- | --- | --- |
| 2026-05-12 | Initial runbook authored for Water build | R. Chhetry |