Hypervisor Rebuild Runbook: water.wdc.us.gl3
Classification: CONFIDENTIAL (Internal Use Only)
Document: architecture/wdc/hypervisors/water-rebuild-runbook.md · v1.0 · 2026-05-12 · GPUS-IT
Host Profile
Hostname: water.wdc.us.gl3
Role: First ESXi host of the WDC on-prem cluster.
Future cluster peers: fire.wdc.us.gl3, flower.wdc.us.gl3
Primary tenant VM: ocean.wdc.us.gl3 (Quest KACE SMA appliance)
Future VMs on Fire: sky, rain, wind, sun
1. Purpose & Scope
This runbook describes the end-to-end procedure to rebuild water.wdc.us.gl3 from bare metal to a hardened, monitored production ESXi host ready to receive the ocean KACE SMA workload.
It satisfies:
- CIS Controls v8 — 1, 2, 3, 4, 5, 6, 8, 12, 13
- NIST CSF 2.0 — PR.AA, PR.PS, PR.IR, DE.CM
- PCI-DSS v4.0 — Req. 2.2 (secure config), Req. 6.3 (vulnerability mgmt), Req. 10 (logging)
- VMware ESXi Security Configuration Guide (current applicable version)
2. Pre-Flight Checklist
| # | Check | Status |
|---|---|---|
| 1 | Power path verified: Water plugs into Fennel PDU (legacy, 192.168.122.91), upstream is Pickle UPS (192.168.122.90) via Purslane PDU. See APC UPS & PDU Inventory. | ☐ |
| 1a | Risk acknowledged: Fennel is a legacy PDU with no email alerts and is a single point of failure for Water, GL5 FW, Flower, and VMware Storage. Replacement tracked separately. | ☐ |
| 1b | Pickle UPS battery health ≥ 80% and runtime calibration current | ☐ |
| 2 | Core switch ports for Water identified and configured (mgmt, vMotion, VM, storage VLANs) | ☐ |
| 3 | Static IPs reserved in IPAM: mgmt, vMotion, vSAN/iSCSI, IPMI | ☐ |
| 4 | DNS A and PTR records created for `water.wdc.us.gl3` | ☐ |
| 5 | ESXi installer ISO downloaded and SHA-256 verified | ☐ |
| 6 | KACE SMA OVA staged on NAS | ☐ |
| 7 | Wazuh agent install package available | ☐ |
| 8 | Change ticket approved in KACE | ☐ |
| 9 | Backup target (NAS + GCS bucket) reachable and write-tested | ☐ |
| 10 | Maintenance window communicated to staff | ☐ |
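
Check 5 lends itself to scripting. The following is a minimal Python sketch, not the team's tooling: it hashes the installer in chunks and compares the result with the vendor-published digest. The function names are illustrative, and both the ISO path and the published digest are supplied by the operator; neither value appears in this runbook.

```python
import hashlib
import hmac
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1024 * 1024) -> str:
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # iter() with a sentinel stops when read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def iso_matches(path: Path, published_hex: str) -> bool:
    """Compare the computed digest with the vendor-published value."""
    expected = published_hex.strip().lower()
    return hmac.compare_digest(sha256_of_file(path), expected)
```

Reading in chunks keeps memory use flat for a multi-hundred-megabyte ISO, and normalizing case and whitespace avoids false mismatches when the digest is pasted from a download page.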
3. Network Plan
| VLAN | Purpose | Subnet | Water IP |
|---|---|---|---|
| 10 | Management (vmk0) | 192.168.10.0/24 | 192.168.10.20 |
| 20 | vMotion (vmk1) | 192.168.20.0/24 | 192.168.20.20 |
| 30 | Storage / NFS to NAS (vmk2) | 192.168.30.0/24 | 192.168.30.20 |
| 40 | VM production (port group `VM-Prod`) | 192.168.40.0/24 | n/a |
| 99 | IPMI / iDRAC OOB | 192.168.99.0/24 | 192.168.99.20 |
Management network isolation
The management VLAN (10) must be reachable only from the IT admin VLAN and via Client VPN — never from general workstation segments. Enforce via Meraki L3 firewall rules.
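
Before the addresses are reserved in IPAM, the VLAN plan above can be sanity-checked mechanically. The sketch below copies the subnets and Water IPs from the table and uses Python's standard `ipaddress` module to confirm that each host IP sits inside its subnet and that no two subnets overlap. The names `NETWORK_PLAN` and `plan_errors` are illustrative and do not come from the IaC repo.

```python
import ipaddress

# VLAN plan copied from the table in this section: (subnet, Water IP or None).
NETWORK_PLAN = {
    10: ("192.168.10.0/24", "192.168.10.20"),  # Management (vmk0)
    20: ("192.168.20.0/24", "192.168.20.20"),  # vMotion (vmk1)
    30: ("192.168.30.0/24", "192.168.30.20"),  # Storage / NFS to NAS (vmk2)
    40: ("192.168.40.0/24", None),             # VM production, no host IP
    99: ("192.168.99.0/24", "192.168.99.20"),  # IPMI / iDRAC OOB
}


def plan_errors(plan):
    """Return a list of human-readable problems found in a VLAN plan."""
    errors = []
    checked = []  # (vlan, network) pairs already validated
    for vlan, (subnet, host_ip) in sorted(plan.items()):
        network = ipaddress.ip_network(subnet)
        for other_vlan, other_network in checked:
            if network.overlaps(other_network):
                errors.append(f"VLAN {vlan} overlaps VLAN {other_vlan}")
        checked.append((vlan, network))
        if host_ip is None:
            continue
        address = ipaddress.ip_address(host_ip)
        if address not in network:
            errors.append(f"VLAN {vlan}: {host_ip} is outside {subnet}")
        elif address in (network.network_address, network.broadcast_address):
            errors.append(f"VLAN {vlan}: {host_ip} is not a usable host address")
    return errors
```

An empty list from `plan_errors(NETWORK_PLAN)` means the plan is internally consistent; it says nothing about whether the addresses are actually free, which remains an IPAM check.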
4. Rebuild Procedure
4.1 BIOS / Firmware
- Boot to BIOS, confirm:
    - Latest stable BIOS / iLO / iDRAC firmware applied.
    - Secure Boot: Enabled.
    - TPM 2.0: Enabled and cleared.
    - Intel VT-x / VT-d (or AMD-V / IOMMU): Enabled.
    - Boot order: Internal SSD/M.2 only; disable USB and PXE for production.
- Configure IPMI / iDRAC:
    - Dedicated IPMI NIC on VLAN 99.
    - Strong local password stored in the IT password vault.
    - Disable IPMI over LAN unless required; if required, restrict source IPs.
    - SNMPv3 enabled for Wazuh polling.
4.2 Install ESXi
- Mount the verified ESXi ISO via iDRAC virtual media.
- Install to the internal M.2/SSD (not USB).
- During install, set the root password to a long passphrase from the vault (do not reuse from any other host).
- Configure management network (vmk0) on VLAN 10 with the IP from §3.
- First boot — confirm host reaches DNS, NTP, and the management gateway.
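
The DNS part of the first-boot check (and pre-flight check 4) can be scripted from an admin workstation. The sketch below confirms that the A record and the PTR record agree. It assumes the A record for `water.wdc.us.gl3` points at the management IP `192.168.10.20`, which this runbook implies but does not state outright. The function name and the injectable `forward` and `reverse` parameters are illustrative; they default to the standard-library resolvers.

```python
import socket


def dns_record_problems(fqdn, expected_ip,
                        forward=socket.gethostbyname,
                        reverse=socket.gethostbyaddr):
    """Return a list of mismatches between the A and PTR records."""
    problems = []
    try:
        resolved = forward(fqdn)
    except OSError as exc:  # socket.gaierror is a subclass of OSError
        problems.append(f"A lookup for {fqdn} failed: {exc}")
    else:
        if resolved != expected_ip:
            problems.append(f"A record is {resolved}, expected {expected_ip}")
    try:
        name, _aliases, _addresses = reverse(expected_ip)
    except OSError as exc:  # socket.herror is a subclass of OSError
        problems.append(f"PTR lookup for {expected_ip} failed: {exc}")
    else:
        if name.rstrip(".").lower() != fqdn.lower():
            problems.append(f"PTR record is {name}, expected {fqdn}")
    return problems
```

Calling `dns_record_problems("water.wdc.us.gl3", "192.168.10.20")` uses whatever resolver the workstation is configured with, so it should be run from a machine that queries the same DNS servers as the host.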
4.3 Initial Hardening: ESXi Host
Apply these immediately after first boot. All settings must be captured in a versioned hardening script committed to the IaC repo.
| Control | Setting | Value |
|---|---|---|
| Lockdown Mode | Security Profile → Lockdown Mode | Strict (allow only vCenter & exception users) |
| SSH | Service policy | Start and stop manually (default off) |
| ESXi Shell | Service policy | Start and stop manually (default off) |
| Shell timeout | `UserVars.ESXiShellTimeOut` | 600 |
| Shell interactive timeout | `UserVars.ESXiShellInteractiveTimeOut` | 600 |
| DCUI timeout | `UserVars.DcuiTimeOut` | 600 |
| Account lockout | `Security.AccountLockFailures` | 5 |
| Lockout duration | `Security.AccountUnlockTime` | 900 |
| Password complexity | `Security.PasswordQualityControl` | retry=3 min=disabled,disabled,disabled,15,15 |
| NTP | Multiple sources | time.google.com, time.cloudflare.com |
| Syslog | Forward to Wazuh & Splunk | ssl://wazuh.gpus.internal:6514 (TLS) |
| Firewall | Default deny inbound; allow only mgmt/vMotion/storage/syslog | Per zone |
| MOB | `Config.HostAgent.plugins.solo.enableMob` | false |
| SNMP | v3 only, encrypted, read-only community removed | v3 user from Wazuh |
| TLS | `UserVars.ESXiVPsDisabledProtocols` | sslv3,tlsv1,tlsv1.1 |
| Welcome / Issue banner | Legal banner present | Yes — set from template |
Hardening script
Use the project repo gpus-esxi-hardening (Ansible), playbook harden_water.yml. All values above are parameterized there; the playbook is the source of truth, and this table is a human-readable reference.
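
For a quick check outside Ansible, the advanced settings in the table can be compared against values read from the host. The Python sketch below is not part of the gpus-esxi-hardening playbook. It copies the expected values from the table and reports any drift. How the observed values are collected is deliberately left out; the `actual` dictionary is assumed to have been gathered by some other means, and `find_drift` is an illustrative name.

```python
# Expected values copied from the hardening table (advanced settings only).
EXPECTED_SETTINGS = {
    "UserVars.ESXiShellTimeOut": 600,
    "UserVars.ESXiShellInteractiveTimeOut": 600,
    "UserVars.DcuiTimeOut": 600,
    "Security.AccountLockFailures": 5,
    "Security.AccountUnlockTime": 900,
    "Security.PasswordQualityControl":
        "retry=3 min=disabled,disabled,disabled,15,15",
    "Config.HostAgent.plugins.solo.enableMob": False,
    "UserVars.ESXiVPsDisabledProtocols": "sslv3,tlsv1,tlsv1.1",
}


def find_drift(actual, expected=EXPECTED_SETTINGS):
    """Map each drifted setting to a (wanted, observed) pair.

    A setting missing from `actual` is reported with an observed value of None.
    """
    drift = {}
    for name, wanted in expected.items():
        observed = actual.get(name)
        if observed != wanted:
            drift[name] = (wanted, observed)
    return drift
```

An empty result means the observed values match the table. Service policies, Lockdown Mode, NTP, syslog, firewall, and SNMP are not covered here because they are not simple key/value advanced settings.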
4.3.1 PowerChute on Water
water.wdc.us.gl3 must be enrolled in PowerChute Network Shutdown listening to Pickle UPS (192.168.122.90) as its upstream UPS. Triggers and VM shutdown priority are defined in APC UPS & PDU Inventory §8. Note that Water's power path runs through Fennel PDU, but the event source for shutdown is Pickle (the UPS upstream of Purslane → Fennel).
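
The distinction between the power path and the shutdown event source can be made concrete with a small model. The Python sketch below encodes the topology described in this runbook (Water on Fennel PDU, fed by Purslane PDU, fed by Pickle UPS) and walks upstream to find the first UPS. The data structure and function names are illustrative only; the authoritative topology lives in the APC UPS & PDU Inventory.

```python
# Power topology as described in this runbook: each device maps to the device
# that feeds it. Pickle UPS has no entry because it is the top of the chain.
FED_BY = {
    "water.wdc.us.gl3": "Fennel PDU",
    "Fennel PDU": "Purslane PDU",
    "Purslane PDU": "Pickle UPS",
}
UPS_DEVICES = {"Pickle UPS"}


def power_path(device, fed_by=FED_BY):
    """Return the chain from a device up to the top of its power path."""
    path = [device]
    while path[-1] in fed_by:
        upstream = fed_by[path[-1]]
        if upstream in path:
            raise ValueError(f"loop in power topology at {upstream}")
        path.append(upstream)
    return path


def shutdown_event_source(device, fed_by=FED_BY, ups_devices=UPS_DEVICES):
    """Return the first UPS upstream of a device, or None if there is none."""
    for hop in power_path(device, fed_by)[1:]:
        if hop in ups_devices:
            return hop
    return None
```

For Water, the path passes through both PDUs, but the event source resolves to Pickle UPS, which is the device PowerChute should listen to.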
4.4 Storage Configuration
- Datastore layout on Water:
    - `datastore-water-local`: local SSD, used only for the host scratch partition and ISO library.
    - `datastore-wdc-nas-01`: NFS export from the WDC NAS (VLAN 30), mounted to all future cluster nodes.
- Mount NFS with `hardMount,sync`, and Kerberos auth where supported by the NAS.
- Set up vSphere replication to GCS-backed Veeam repository (see Snapshot & Backup Schedule).
4.5 Add to vCenter
- Add `water.wdc.us.gl3` to vCenter under the `WDC-Cluster` cluster object.
- Apply the WDC-Baseline host profile.
- Confirm host profile compliance shows green before deploying any VM.
4.6 Deploy ocean.wdc.us.gl3 (KACE SMA)
- Deploy the KACE SMA OVA to `datastore-wdc-nas-01`.
- Place on port group `VM-Prod` (VLAN 40), assigned static IP from IPAM.
- Configure the appliance per the KACE SMA install guide.
- Set the appliance VM's `vm.shutdown_priority = high` so PowerChute graceful shutdown halts it first.
- Confirm KACE can reach the appliance update server and the Wazuh manager.
4.7 Telemetry & SOC Onboarding
| Source | Destination | Method | Verified |
|---|---|---|---|
| ESXi syslog | Wazuh manager | Syslog/TLS 6514 | ☐ |
| ESXi syslog | Splunk HEC | HTTPS 8088 | ☐ |
| vCenter events | Wazuh | vSphere API integration | ☐ |
| IPMI / iDRAC | Wazuh | SNMPv3 trap + poll | ☐ |
| `ocean` VM (KACE) | Wazuh agent installed | Agent | ☐ |
After onboarding, confirm the host and its tenant VM appear in:
- Wazuh agents list with `active` status
- Splunk index `gpus_wdc` receiving events
- SOC dashboard at `soc.greenpeace.us` (WDC pane)
4.8 Backup & Snapshot Enrollment
Enroll Water and ocean in the schedules defined in Snapshot & Backup Schedule. Capture a manual pre-production backup before declaring the host ready.
4.9 Validation & Sign-Off
| Test | Pass criteria | Status |
|---|---|---|
| ESXi reaches NTP, DNS, syslog targets | All targets reachable (green) | ☐ |
| Host profile compliance | Compliant | ☐ |
| Wazuh agent + vCenter integration ingesting events | Events visible in last 15 min | ☐ |
| Lockdown Mode = Strict | Confirmed in vCenter | ☐ |
| PowerChute test (simulated outage) | VMs shut down in priority order, host powers off | ☐ |
| Backup test restore of `ocean` | Restore to isolated network succeeds | ☐ |
| Penetration scan from internal scanner | No critical / high findings | ☐ |
Sign-off requires both IT Operations and Cyber Security approval. Record in KACE change ticket.
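
The sign-off rule (every validation test passed and both teams approving) can be expressed as a simple gate. The Python sketch below is an illustration, not an existing tool. The test names mirror the table rows, the approver names come from the sentence above, and `signoff_blockers` is a hypothetical helper that returns everything still blocking sign-off.

```python
REQUIRED_APPROVERS = frozenset({"IT Operations", "Cyber Security"})

# Test names mirror the rows of the validation table in this section.
VALIDATION_TESTS = (
    "ESXi reaches NTP, DNS, syslog targets",
    "Host profile compliance",
    "Wazuh agent + vCenter integration ingesting events",
    "Lockdown Mode = Strict",
    "PowerChute test (simulated outage)",
    "Backup test restore of ocean",
    "Penetration scan from internal scanner",
)


def signoff_blockers(results, approvals):
    """List everything still blocking sign-off; an empty list means ready.

    `results` maps test name to True/False; a test absent from the mapping
    counts as not passed. `approvals` is an iterable of approving team names.
    """
    blockers = [
        f"test not passed: {name}"
        for name in VALIDATION_TESTS
        if not results.get(name, False)
    ]
    missing = REQUIRED_APPROVERS - set(approvals)
    blockers += [f"approval missing: {team}" for team in sorted(missing)]
    return blockers
```

Treating an unrecorded test as a failure keeps the gate conservative: a row that nobody ticked cannot slip through as a pass.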
5. Rollback
If the rebuild fails validation:
- Quarantine the host (move to `WDC-Quarantine` vCenter folder; disable mgmt port at switch except for admin VLAN).
- Restore previous configuration from IaC repo if applicable, or re-image and restart from §4.2.
- Open a high-severity ticket in KACE and notify `#wdc-ops`.
6. Future Cluster Additions
When fire.wdc.us.gl3 and flower.wdc.us.gl3 are built, follow this same runbook with these deltas:
- Reuse the WDC-Baseline host profile for one-click compliance.
- Assign next IPs in the management/vMotion/storage subnets.
- Enable DRS and HA at the cluster level only after the third host is online.
- Migrate `ocean` to shared NAS storage so it can vMotion freely.
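
The "next IPs" delta can be sketched as follows. This Python example starts from Water's addresses in §3 and proposes candidates for the peers. It assumes that "next" means numerically sequential (Fire at Water + 1, Flower at Water + 2); the runbook does not specify this, so treat the offsets as placeholders and confirm every address against IPAM before use. All names in the block are illustrative.

```python
import ipaddress

# Water's addresses from the network plan in §3: (subnet, Water IP).
WATER_ADDRESSES = {
    "management": ("192.168.10.0/24", "192.168.10.20"),
    "vmotion": ("192.168.20.0/24", "192.168.20.20"),
    "storage": ("192.168.30.0/24", "192.168.30.20"),
}

# ASSUMPTION: peers are numbered sequentially after Water. Confirm in IPAM.
PEER_OFFSETS = {"fire.wdc.us.gl3": 1, "flower.wdc.us.gl3": 2}


def proposed_addresses(hostname):
    """Propose one address per role for a future cluster peer."""
    offset = PEER_OFFSETS[hostname]
    proposal = {}
    for role, (subnet, water_ip) in WATER_ADDRESSES.items():
        network = ipaddress.ip_network(subnet)
        candidate = ipaddress.ip_address(water_ip) + offset
        # Reject candidates that leave the subnet or hit the broadcast address.
        if candidate not in network or candidate == network.broadcast_address:
            raise ValueError(f"{candidate} is not usable in {subnet}")
        proposal[role] = str(candidate)
    return proposal
```

The output is only a proposal: the function has no view of existing reservations, so IPAM remains the source of truth for what is actually free.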
The Fire host will eventually run the sky, rain, wind, and sun VMs — each gets its own onboarding ticket and follows the VM onboarding checklist (to be authored when Fire is built).
7. Change Log
| Date | Change | By |
|---|---|---|
| 2026-05-12 | Initial runbook authored for Water build | R. Chhetry |